The Twitter Election Integrity Dataset

(TEI)


The Background

In a nutshell...

  • Twitter states that it intends to follow transparency principles regarding social influence campaigns.
  • To that end, it is releasing archives of Tweets and media that it believes result from state-backed operations.
  • The first archive was released in October 2018.
  • According to Twitter, this is part of a serious effort to build a technological framework that protects public conversation.

The first released accounts were linked to Russia and Iran.

The Release

What kind of information do these accounts include?

  • Political themes, in particular content aimed at influencing a specific event, such as an election.
  • Fake accounts: stolen avatar photos, copied or stolen profile bios, and intentionally misleading profile information
    • i.e. they are trolls
  • Attribution: if an account's activity is attributed to an entity known to violate the Twitter Rules, action is taken against it.
  • Distribution of hacked material, which may include private information or secrets. This includes accounts that claim responsibility for a hack.

Some of the reasons each set of tweets was collected...

🇧🇩 Bangladesh

Bengali tweets identified in 2019, focusing on regional political themes

🇮🇷 Iran

Reported malicious activity against a Twitter industry peer, via an influence campaign

🇷🇺 Russia

Accounts may be identified as originating from the Internet Research Agency (IRA). Twitter CEO Jack Dorsey testified in 2018 about activity coming from this source.

🇻🇪 Venezuela

These accounts are described as a "foreign campaign of spammy content focused on divisive political themes".

🇨🇳 People's Republic of China

A network of malicious actors, with 150,000 accounts designed to boost its content, i.e. the amplifiers. They tweeted predominantly in Chinese languages and spread geopolitical narratives favorable to the Communist Party of China (CCP), while continuing to push deceptive narratives about the political dynamics in Hong Kong.

🇹🇷 Turkey

These accounts employed coordinated inauthentic activity to amplify political narratives favorable to the AK Parti and demonstrated strong support for President Erdogan.

🇪🇸 Spain

These accounts were directly associated with the Catalan independence movement, specifically spreading content about the Catalan Referendum.

🇪🇬 Egypt

The El Fagr network. This media group created inauthentic accounts to amplify messaging critical of Iran, Qatar and Turkey. Information gained externally (per Twitter) indicates it was taking direction from the Egyptian government.

🇦🇲 Armenia

These accounts were created to advance narratives that targeted Azerbaijan and were geostrategically favorable to the Armenian government.

The Structure

Sample Data from Russia, Iran, and China

| Place of Origin | Year of Release | Earliest Activity   | Latest Activity     | Number of Accounts |
|-----------------|-----------------|---------------------|---------------------|--------------------|
| IRA             | 2018            | 2009-05-12 09:37:00 | 2018-05-29 21:31:00 | 3608               |
| Iran            | 2019            | 2009-12-08 05:44:00 | 2018-11-28 14:47:00 | 3081               |
| Russia          | 2019            | 2011-05-29 14:46:00 | 2018-11-05 14:31:00 | 416                |
| Iran            | 2019            | 2008-04-30 12:29:00 | 2019-04-21 21:42:00 | 4716               |
| China           | 2019            | 2008-02-05 18:26:00 | 2019-08-28 01:31:00 | 5241               |
| China           | 2020            | 2018-01-11 09:29:00 | 2020-04-17 07:13:00 | 23750              |
| Russia          | 2020            | 2009-05-19 09:34:00 | 2019-12-12 13:26:00 | 1153               |
| IRA             | 2020            | 2020-01-09 08:56:00 | 2020-08-21 18:37:00 | 5                  |
| Iran            | 2020            | 2020-01-08 18:30:00 | 2020-07-01 06:30:00 | 104                |
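
A summary row like the ones in this sample could plausibly be derived from a release's tweet records. The sketch below is only an illustration under assumptions: it treats each record as a dict with `userid` and `tweet_time` fields, assumes a `YYYY-MM-DD HH:MM` timestamp format, and counts accounts as distinct `userid` values. The function name `summarize_release` and the toy records are hypothetical, not part of the dataset or its tooling.

```python
from datetime import datetime


def summarize_release(rows, time_format="%Y-%m-%d %H:%M"):
    """Return (earliest, latest, n_accounts) for an iterable of tweet dicts.

    Assumes each row has 'userid' and 'tweet_time' keys; the timestamp
    format is an assumption and may need adjusting for the real files.
    """
    times = []
    accounts = set()
    for row in rows:
        times.append(datetime.strptime(row["tweet_time"], time_format))
        accounts.add(row["userid"])
    if not times:
        return None, None, 0
    return min(times), max(times), len(accounts)


# Toy records, invented for illustration only.
sample = [
    {"userid": "a1", "tweet_time": "2009-05-12 09:37"},
    {"userid": "b2", "tweet_time": "2018-05-29 21:31"},
    {"userid": "a1", "tweet_time": "2015-01-01 00:00"},
]
earliest, latest, n_accounts = summarize_release(sample)
print(earliest, latest, n_accounts)
```

Collecting account IDs in a set keeps the count independent of how many tweets each account posted.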

Available Features

  • tweetid, userid, user_display_name, user_screen_name,
  • user_profile_url,
  • follower_count, following_count,
  • account_creation_date,
  • account_language, tweet_language,
  • tweet_text, tweet_time, tweet_client_name,
  • in_reply_to_tweetid, in_reply_to_userid, quoted_tweet_tweetid,
  • is_retweet, retweet_userid, retweet_tweetid,
  • latitude, longitude,
  • quote_count, reply_count, like_count, retweet_count,
  • hashtags,
  • urls,
  • user_mentions,
  • poll_choices
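
To give a feel for working with these features, here is a minimal sketch that reads tweet rows with the standard `csv` module and converts a few fields to native types. It assumes the CSV header uses the feature names listed above, that count fields may be empty, and that `is_retweet` is stored as the text `True`/`False`; all of these are assumptions about the file layout. The helper `parse_tweet` and the inline sample are hypothetical.

```python
import csv
import io

# Count-like features from the list above that we expect to be integers.
INT_FIELDS = (
    "follower_count", "following_count",
    "quote_count", "reply_count", "like_count", "retweet_count",
)


def parse_tweet(row):
    """Convert selected string fields of one CSV row to int / bool."""
    tweet = dict(row)
    for name in INT_FIELDS:
        value = tweet.get(name)
        # Missing or empty counts become None rather than 0.
        tweet[name] = int(value) if value not in (None, "") else None
    # Assumed encoding: the literal text "True" or "False".
    tweet["is_retweet"] = str(tweet.get("is_retweet", "")).strip().lower() == "true"
    return tweet


# Inline sample with a subset of the columns, invented for illustration.
sample_csv = io.StringIO(
    "tweetid,userid,tweet_text,is_retweet,like_count,retweet_count\n"
    "1,u1,hello world,False,3,\n"
    "2,u2,RT hello world,True,0,1\n"
)
tweets = [parse_tweet(row) for row in csv.DictReader(sample_csv)]
print(tweets[0]["like_count"], tweets[1]["is_retweet"])
```

Identifiers such as `tweetid` and `userid` are left as strings here, since the bucket path suggests hashed values and nothing in this sketch needs them as numbers.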

The Code

We've got a programmatic solution (in Python 3) to download the complete dataset (please request access)

https://github.com/alorozco53/deep-trolls/blob/master/downloader.py

A code sample that helps us download and extract all zip files from the dataset's Google Cloud Storage bucket:

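import os
import subprocess
from threading import Thread
from zipfile import ZipFile

BUCKET_PREFIX = 'gs://twitter-election-integrity/hashed'


class DownloadThread(Thread):
    """Download one zip archive from the bucket and extract it locally."""

    def __init__(self, path):
        super().__init__()
        self.path = path  # full gs:// URL of the archive

    def run(self):
        print(f'Downloading {self.path}')
        # Drop only the scheme; str.lstrip('gs://') strips a set of
        # characters ('g', 's', ':', '/'), not the literal prefix.
        location = self.path[len('gs://'):] if self.path.startswith('gs://') else self.path
        directory, _ = os.path.split(location)
        if directory:
            os.makedirs(directory, exist_ok=True)
        # Stands in for the query() helper, which is not shown in this excerpt.
        subprocess.run(['gsutil', 'cp', self.path, location], check=True)
        print(f'Extracting {location}')
        with ZipFile(location, 'r') as zip_obj:
            zip_obj.extractall(path=directory)
        print(f'Downloaded {location}')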
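
One detail worth care when turning a `gs://` URL into a local path: `str.lstrip` removes any leading characters from a set, not a literal prefix, so it can eat the start of a bucket name. The sketch below shows a prefix-safe alternative and one possible way to run several downloads concurrently. It uses `ThreadPoolExecutor` rather than a `Thread` subclass, which is a swapped-in technique and not the repository's approach. The names `local_path_for`, `run_all` and the bucket `sample-bucket` are hypothetical, and the worker passed in here only computes paths, so no network access or `gsutil` is involved.

```python
from concurrent.futures import ThreadPoolExecutor

GS_SCHEME = "gs://"


def local_path_for(gs_url):
    """Map a gs:// URL to a relative local path by removing only the scheme."""
    if gs_url.startswith(GS_SCHEME):
        return gs_url[len(GS_SCHEME):]
    return gs_url


def run_all(paths, worker, max_workers=4):
    """Apply worker to every path using a bounded pool of threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, paths))


# Hypothetical bucket name, chosen to expose the lstrip pitfall.
example = "gs://sample-bucket/hashed/archive.zip"
print(local_path_for(example))   # sample-bucket/hashed/archive.zip
print(example.lstrip("gs://"))   # ample-bucket/hashed/archive.zip (leading 's' lost)

# A real worker would download and extract; this one only maps paths.
results = run_all([example], local_path_for)
```

Bounding the pool size keeps the number of simultaneous transfers predictable when the bucket holds many archives.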