The Twitter Election Integrity Dataset¶

(TEI)


The Background¶

In a nutshell...¶

Twitter intends to comply with transparency principles when it comes to social influence campaigns.
Hence, it is releasing archives of Tweets and media that it believes result from state-backed operations.

The first archive was launched in October 2018.
According to Twitter, this is part of a serious effort to build a technological framework to protect public conversation.

The first released accounts were linked to Russia and Iran.

The Release¶

What kind of information do these accounts include?¶

Political themes, in particular content aimed at influencing a specific event, such as an election.
Fake accounts: stolen avatar photos, copied or stolen profile bios, and intentionally misleading profile information
- e.g., trolls

If an account's activity is attributed to an entity known to violate the Twitter Rules, action will be taken against it.

Distribution of hacked material that may include private information or secrets. This includes accounts that claim responsibility for a hack.

Some of the reasons each set of tweets was collected:

🇧🇩 Bangladesh

Bengali tweets identified in 2019, focusing on regional political themes

🇮🇷 Iran

Reported malicious activity against a Twitter industry peer, via an influence campaign

🇷🇺 Russia

Accounts that may originate from the Internet Research Agency (IRA). Twitter CEO Jack Dorsey testified in 2018 about activity coming from this source.

🇻🇪 Venezuela

These accounts were characterized as a "foreign campaign of spammy content focused on divisive political themes".

🇨🇳 People’s Republic of China

Network of malicious actors, with 150,000 accounts designed to boost its content, e.g. the amplifiers. They were Tweeting predominantly in Chinese languages and spreading geopolitical narratives favorable to the Communist Party of China (CCP), while continuing to push deceptive narratives about the political dynamics in Hong Kong.

🇹🇷 Turkey

These accounts employed coordinated inauthentic activity to amplify political narratives favorable to the AK Parti, and demonstrated strong support for President Erdogan.

🇪🇸 Spain

These accounts were directly associated with the Catalan independence movement, specifically spreading content about the Catalan Referendum.

🇪🇬 Egypt

El Fagr network. The media group created inauthentic accounts to amplify messaging critical of Iran, Qatar and Turkey. Information Twitter gained externally indicates it was taking direction from the Egyptian government.

🇦🇲 Armenia

These accounts were created to advance narratives that targeted Azerbaijan and were geostrategically favorable to the Armenian government.

The Structure¶

Sample Data from Russia, Iran, and China¶

Place of Origin	Year of Release	Earliest Activity	Latest Activity	Number of accounts
IRA	2018	2009-05-12 09:37:00	2018-05-29 21:31:00	3608
Iran	2019	2009-12-08 05:44:00	2018-11-28 14:47:00	3081
Russia	2019	2011-05-29 14:46:00	2018-11-05 14:31:00	416
Iran	2019	2008-04-30 12:29:00	2019-04-21 21:42:00	4716
China	2019	2008-02-05 18:26:00	2019-08-28 01:31:00	5241
China	2020	2018-01-11 09:29:00	2020-04-17 07:13:00	23750
Russia	2020	2009-05-19 09:34:00	2019-12-12 13:26:00	1153
IRA	2020	2020-01-09 08:56:00	2020-08-21 18:37:00	5
Iran	2020	2020-01-08 18:30:00	2020-07-01 06:30:00	104
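
As a rough illustration of how the table above can be summarized, the sketch below hard-codes its rows and computes the number of accounts per place of origin, plus the length of an activity window in days. The tuple layout and the helper names (`accounts_per_origin`, `activity_span_days`) are assumptions made for this example and are not part of the dataset tooling.

```python
from collections import defaultdict
from datetime import datetime

# Rows copied from the table above:
# (origin, release year, earliest activity, latest activity, accounts)
ROWS = [
    ("IRA", 2018, "2009-05-12 09:37:00", "2018-05-29 21:31:00", 3608),
    ("Iran", 2019, "2009-12-08 05:44:00", "2018-11-28 14:47:00", 3081),
    ("Russia", 2019, "2011-05-29 14:46:00", "2018-11-05 14:31:00", 416),
    ("Iran", 2019, "2008-04-30 12:29:00", "2019-04-21 21:42:00", 4716),
    ("China", 2019, "2008-02-05 18:26:00", "2019-08-28 01:31:00", 5241),
    ("China", 2020, "2018-01-11 09:29:00", "2020-04-17 07:13:00", 23750),
    ("Russia", 2020, "2009-05-19 09:34:00", "2019-12-12 13:26:00", 1153),
    ("IRA", 2020, "2020-01-09 08:56:00", "2020-08-21 18:37:00", 5),
    ("Iran", 2020, "2020-01-08 18:30:00", "2020-07-01 06:30:00", 104),
]

FMT = "%Y-%m-%d %H:%M:%S"


def accounts_per_origin(rows):
    """Sum the account counts of all releases sharing a place of origin."""
    totals = defaultdict(int)
    for origin, _year, _first, _last, accounts in rows:
        totals[origin] += accounts
    return dict(totals)


def activity_span_days(first, last):
    """Whole days between the earliest and latest recorded activity."""
    return (datetime.strptime(last, FMT) - datetime.strptime(first, FMT)).days


totals = accounts_per_origin(ROWS)
for origin, count in sorted(totals.items()):
    print(origin, count)
```

Note that these sums treat every release as disjoint; whether an account can appear in more than one release is not something the table states.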

Available Features¶

tweetid, userid, user_display_name, user_screen_name,
user_profile_url,
follower_count, following_count,
account_creation_date,
account_language, tweet_language,
tweet_text, tweet_time, tweet_client_name,
in_reply_to_tweetid, in_reply_to_userid, quoted_tweet_tweetid,
is_retweet, retweet_userid, retweet_tweetid,
latitude, longitude,
quote_count, reply_count, like_count, retweet_count,
hashtags,
urls,
user_mentions,
poll_choices
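
To give a feel for working with these fields, here is a minimal sketch that reads a few of them with the standard `csv` module. It assumes the extracted files are CSV with a header row, that `is_retweet` is serialized as the strings "True"/"False", and that an empty count means zero; none of this is confirmed above. The three sample rows are invented placeholders, not real records.

```python
import csv
import io

# Invented sample using a subset of the columns listed above.
# The values are placeholders, not real records from the dataset.
SAMPLE = io.StringIO(
    "tweetid,userid,tweet_language,is_retweet,like_count\n"
    "1,u1,en,True,3\n"
    "2,u2,ru,False,0\n"
    "3,u1,en,False,5\n"
)


def summarize(handle):
    """Count tweets, retweets, and total likes in a CSV file-like object."""
    tweets = retweets = likes = 0
    for row in csv.DictReader(handle):
        tweets += 1
        # Assumption: booleans are serialized as "True"/"False" strings.
        if row["is_retweet"].strip().lower() == "true":
            retweets += 1
        # Assumption: an empty count field means zero.
        likes += int(row["like_count"] or 0)
    return {"tweets": tweets, "retweets": retweets, "likes": likes}


summary = summarize(SAMPLE)
print(summary)
```

For a real file, the same function could be called on `open(path, newline='', encoding='utf-8')` in place of the in-memory sample.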

The Code¶

We've got a programmatic solution (in Python 3) to download the complete dataset (please request access):

https://github.com/alorozco53/deep-trolls/blob/master/downloader.py


A code sample that helps us download all zip files from the dataset's Google Cloud Storage bucket:

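import os
from threading import Thread
from zipfile import ZipFile

BUCKET_PREFIX = 'gs://twitter-election-integrity/hashed'

class DownloadThread(Thread):
    """Download one zip archive from the bucket and extract it locally."""

    def __init__(self, path):
        Thread.__init__(self)
        self.path = path  # full gs:// URL of the archive

    def run(self):
        print(f'Downloading {self.path}')
        # Remove the scheme prefix (lstrip would strip characters, not a prefix)
        scheme = 'gs://'
        location = self.path[len(scheme):] if self.path.startswith(scheme) else self.path
        directory, _ = os.path.split(location)
        # Create the local directory tree if it does not exist yet
        os.makedirs(directory, exist_ok=True)
        command = f'gsutil cp {self.path} {location}'
        # NOTE: `query` is not defined in this excerpt; it is assumed to be a
        # helper from downloader.py that runs a shell command.
        query(command)
        print(f'Extracting {location}')
        with ZipFile(location, 'r') as zipObj:
            zipObj.extractall(path=directory)
        print(f'Downloaded {location}')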
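
When deriving a local path from a `gs://` URL, as the sample above does, it helps to know that `str.lstrip` treats its argument as a set of characters rather than as a prefix. The sketch below shows one way to strip the scheme explicitly. `local_path_for` and the object name `example.zip` are hypothetical, introduced only for illustration.

```python
import os


def local_path_for(gs_url, scheme="gs://"):
    """Return the relative local path for a bucket URL (hypothetical helper)."""
    if not gs_url.startswith(scheme):
        raise ValueError(f"not a {scheme} URL: {gs_url}")
    return gs_url[len(scheme):]


# Hypothetical object name, used only to exercise the helper.
url = "gs://twitter-election-integrity/hashed/example.zip"
location = local_path_for(url)
directory, filename = os.path.split(location)
print(directory, filename)

# lstrip removes any leading 'g', 's', ':' or '/' characters, so a bucket
# name starting with one of those letters loses part of its name.
stripped = "gs://static/file.zip".lstrip("gs://")
print(stripped)
```

With the bucket used here the two approaches happen to agree, because `twitter-election-integrity` starts with a `t`; the explicit slice simply avoids depending on that.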