fivethirtyeight / russian-troll-tweets



russian-troll-tweets's Issues

Impossible amounts of followers

Did anyone else notice that a set of accounts has more followers than there are users on Twitter? Twitter has 336 million users, but some of the accounts in the data set have close to 1 billion followers listed. How have you been interpreting this?

Accounts with more than 336 million followers: CHICAGODAILYNEW, DAILYSANFRAN, JENN_ABRAMS, KADIROVRUSSIA, KANSASDAILYNEWS, MAXDEMENTIEV, NEWORLEANSON, NOVOSTIMSK, NOVOSTISPB, ROOMOFRUMOR, SCREAMYMONKEY, SEATTLE_POST, SPECIALAFFAIR, TEN_GOP, TODAYINSYRIA, TODAYNYCITY, TODAYPITTSBURGH, and WASHINGTONLINE.
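
A check like this one can be reproduced with a few lines of Python. The sketch below is only an illustration: it assumes the rows were read with csv.DictReader and carry the author and followers columns, and both the function name and the default limit (the 336 million figure quoted above) are hypothetical choices.

```python
from typing import Iterable, Mapping


def authors_over_limit(rows: Iterable[Mapping[str, str]],
                       limit: int = 336_000_000) -> list[str]:
    """Return the sorted authors that have at least one row with followers > limit."""
    flagged = set()
    for row in rows:
        try:
            author = row["author"]
            # Follower counts may be stored as "123" or "123.0"
            followers = int(float(row["followers"]))
        except (KeyError, TypeError, ValueError, OverflowError):
            continue  # skip rows with a missing or non-numeric follower count
        if followers > limit:
            flagged.add(author)
    return sorted(flagged)
```

Passing a csv.DictReader opened on one of the IRAhandle_tweets files should give a list comparable to the one above, provided the follower counts are read the same way.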

Desperately Seeking Schema

It would be nice if the table in the README could be updated with information about the type of each field. In particular, for those fields that are enumerated constants (such as post_type and account_type), list the set of valid values and for all fields indicate whether they are nullable. Since the data format is not raw Twitter data, maybe a link to https://help.salesforce.com/articleView?id=mc_ss_csv_report_headers.htm&type=5 would be helpful, too.

Non-ASCII Characters Issue

Hi all,

I have seen the posts re: double-encoded, however when I try:

for file in IRAhandle_tweets_1.csv; do
  echo -n "Converting $file... "
  iconv -f utf8 -t latin1 $file > $file.corrected &&
  mv -f $file.corrected $file
  echo "Done"
done

I get the following error :
iconv: IRAhandle_tweets_1.csv:6:84: cannot convert

This also generates a file, IRAhandle_tweets_1.csv.corrected, which cannot be opened on my MacBook (the file size is only 2 KB).

Ultimately, I would like to export all the English-language tweets into a txt file. Any suggestions would be appreciated.
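
The "cannot convert" error suggests that the file contains at least one character that does not fit in Latin-1, so a whole-file iconv pass stops at the first such character. A minimal Python sketch that repairs the file line by line and leaves unconvertible lines untouched is shown here. It assumes the double encoding went through Latin-1, and the function names are illustrative, not part of the repository.

```python
def undo_double_utf8(text: str) -> str:
    """Reverse one round of UTF-8-via-Latin-1 double encoding, if present."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not double-encoded (or only partly): keep the text as it is
        return text


def repair_file(src: str, dst: str) -> int:
    """Write a repaired copy of src to dst and return the number of changed lines."""
    changed = 0
    with open(src, encoding="utf-8", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fixed = undo_double_utf8(line)
            changed += fixed != line
            fout.write(fixed)
    return changed
```

The granularity here is one line: a line that mixes double-encoded text with characters outside Latin-1 is copied unchanged, so some rows may still need manual attention.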

Additional ID columns

It's been mentioned in some of the other issues here but I think it warrants its own.

First off, thanks for releasing this; it's a fantastic dataset. I'm interested in the network effects going on in these communities and want to load the tweets into a graph db, but without the tweet ID, retweet ID and account name (as opposed to screen name), I'm not able to connect retweets to tweets, or user mentions to users.

It's possible to get the account name and retweeted status id from the t.co/... url (most redirect to an 'account suspended' page, but you can still see the full url in the request history), but without the (re)tweet ID, you can't connect things up.
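
Once a full status URL has been recovered, splitting it into an account name and a status ID is mechanical. The sketch below assumes URLs of the two shapes that appear elsewhere in these issues (twitter.com/NAME/status/ID and twitter.com/NAME/statuses/ID); the regular expression and the function name are illustrative assumptions.

```python
import re
from typing import Optional, Tuple

# Matches http(s)://twitter.com/<name>/status/<id> and .../statuses/<id>
_STATUS_URL = re.compile(
    r"^https?://(?:www\.)?twitter\.com/([^/?#]+)/status(?:es)?/(\d+)"
)


def parse_status_url(url: str) -> Optional[Tuple[str, int]]:
    """Return (account_name, status_id), or None if url is not a status URL."""
    match = _STATUS_URL.match(url.strip())
    if match is None:
        return None
    # Keep the ID as an int so no digits are lost to floating-point rounding
    return match.group(1), int(match.group(2))
```

The account name comes back with its original capitalization, which the all-caps author column does not preserve.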

IRAhandle_tweets_5.csv contains 4 records with '-1' in the external_author_id column

IRAhandle_tweets_5.csv contains 4 records with -1 in the external_author_id column. They are shown below, with -1 displayed as NA:

external_author_id	author	content	region	language	publish_date	harvested_date	following	followers	updates	post_type	account_type	account_category	new_june_2018	alt_external_id	tweet_id	article_url	tco1_step1	tco2_step1	tco3_step1
NA	FINDDIET	muvva @_SlaYonce Marissa @marissafabrizi Tawnya @sdxNomadMom Happy @HelenFJohnson http://t.co/Z8DHogbMn8 https://t.co/9Fe1weThaC	United States	LANGUAGE UNDEFINED	8/12/2015 11:04	8/12/2015 11:12	3	308	29699	NA	Commercial	Commercial	1	NA	6.314209e+17	http://twitter.com/FindDiet/statuses/631420894506713088	https://twitter.com/marissafabrizi_/status/631419756613124096	https://twitter.com/safety/unsafe_link_warning?unsafe_link=http%3A%2F%2FWWW.LOSEFATTIPS.PW%2FTIPS%2FLITERALLY-SAME-WHEN-I-TRY-AND-RUN-TO-LOSE-WEIGHT.ASP	NA
NA	FINDDIET	http://t.co/MKzEwWBPwr Toe @hongtony22 ժɑɾíɑղ @woaahdaare Tara @intowalls Delaney @vandell66 Fattie @FattaySD https://t.co/cQrguxWQtT	United States	LANGUAGE UNDEFINED	8/12/2015 23:08	8/12/2015 23:08	3	331	30648	NA	Commercial	Commercial	1	NA	6.316031e+17	http://twitter.com/FindDiet/statuses/631603122641596417	https://twitter.com/ivannvega23/status/631601740173975552	https://twitter.com/safety/unsafe_link_warning?unsafe_link=http%3A%2F%2FWWW.LOSEFATTIPS.PW%2FTIPS%2FGIRLS-WHO-WORKOUT.ASP	NA
NA	FINDDIET	http://t.co/wFGUVzURSM Jay .@Jay_zo Nina .@xxooninaooxx Medeia .@sharifwrites K♛ .@Stcrbuks Jess .@jjeessss81 https://t.co/GOctHrGhiG	United States	LANGUAGE UNDEFINED	8/13/2015 0:18	8/13/2015 0:19	3	334	30756	NA	Commercial	Commercial	1	NA	6.316208e+17	http://twitter.com/FindDiet/statuses/631620795354058752	https://twitter.com/xxooninaooxx/status/631620042229215232	https://twitter.com/safety/unsafe_link_warning?unsafe_link=http%3A%2F%2FWWW.LOSEFATTIPS.PW%2FTIPS%2FMY-FAVE-WORKOUT.ASP	NA
NA	FINDDIET	http://t.co/U4DFGCaSEF Davi_hoops @comiclyfe_12 YELLA @ImYellaForeal perelondon @perelondon1 Sharday @whitegalshay https://t.co/7mh7C2GkKE	United States	LANGUAGE UNDEFINED	8/13/2015 6:27	8/13/2015 6:27	3	350	31267	NA	Commercial	Commercial	1	NA	6.317136e+17	http://twitter.com/FindDiet/statuses/631713577162805248	https://twitter.com/rainbowsonlou/status/631712761962233856	https://twitter.com/safety/unsafe_link_warning?unsafe_link=http%3A%2F%2FWWW.LOSEFATTIPS.PW%2FTIPS%2FGOOD-WORKOUT-AT-THE-GYM.ASP	NA

Compress files

I see you've split the file into multiple smaller files. It would also be good to compress the files to make downloading easier.

Many entries tagged with language=swedish are in fact in german

I went through the entries and found 1021 entries marked as language=Swedish. On closer inspection, many of these are actually German.
One example is entry no. 322020 in IRAhandle_tweets_1.csv:
7.25000000000e+17,BERLINBOTE,Bernd Krömer: Amri-Ausschuss will früheren Innenstaatssekretär vernehmen https://t.co/qKqGZyBEma,Unknown,Swedish,9/8/2017 5:51,9/8/2017 5:51,2230,1779,22274,,German,0,0,NonEnglish
The tweet is definitely German, not Swedish, as all tweets from BERLINBOTE seem to be.
Perhaps "ö" and "ä" were mistakenly treated as identifiers for Swedish?

Can We Get IP of Tweet?

Is there any way we can add an IP address or other piece of data that we could create a map from? It would be interesting to create a web map showing where these tweets are coming from.

If it's not possible, then just close this issue.

Missing regions and shifted columns

There are a handful of tweets that are missing a region and have the following columns shifted to the left:

external_author_id	author	content	region	language	publish_date	harvested_date	following	followers	updates	post_type	new_june_2018	account_category	date_mysql
1670762347	ADAMCHAPMANJR	Check out the video I made with @LAEducators to #ThankaLAEducator https://t.co/N7J70kSDsn,United States	English	10/6/2016 21:04	10/6/2016 21:07	582	934	1807	0	left	1	NULL	2016-10-06 21:07:00
1850866398	BRICEGELLER	Check out the video I made with @LAEducators to #ThankaLAEducator https://t.co/N7J70kSDsn,United States	English	10/6/2016 20:54	10/6/2016 20:55	851	852	1587	0	left	1	NULL	2016-10-06 20:55:00
1626302035	CLAYPAIGEBOO	Check out the video I made with @LAEducators to #ThankaLAEducator https://t.co/N7J70kSDsn,United States	English	10/6/2016 20:58	10/6/2016 20:58	776	916	1562	0	left	1	NULL	2016-10-06 20:58:00
1692501152	CORNELLBURCHET	Check out the video I made with @LAEducators to #ThankaLAEducator https://t.co/N7J70kSDsn,United States	English	10/6/2016 20:55	10/6/2016 20:55	725	774	1536	0	left	1	NULL	2016-10-06 20:55:00
2882037326	DANAGEEZUS	I need a dance break right meow! (•_•) <) )╯all the single ladies / \ (•_•) \( (> all the single ladies / \ (•_•) <) )╯oh oh oh / ,United States	English	7/6/2015 15:03	7/6/2015 15:03	3740	9351	1849	NULL	Hashtager	0	NULL	2015-07-06 15:03:00
2577152109	DENN_NIKITIN	курточка моя так хорошо сидит на ней \ты пропел эти слова в голове,Unknown	Russian	4/15/2017 13:19	4/15/2017 13:19	68	113	6277	0	Russian	1	NULL	2017-04-15 13:19:00

Irregularities matching authors in the dataset with the 2017/2018 Congressional lists

I’ve made a public Google Sheet that attempts to match up account information from this dataset with the November 2017 and June 2018 lists published by the House Intelligence Committee. Perhaps others will find it helpful for a number of reasons…

Every author has multiple tweets. But some info (should) remain constant for a particular author across all their tweets: external_author_id, account_type, account_category and the new_june_2018 flag. This spreadsheet provides a summary of all the authors and their associated properties.
The November 2017 list contains user_ids, i.e. external_author_id. This can be used to resolve some of the floating point external_author_ids raised in issue #4. (It can’t be used for accounts that were added in the 2018 list because that PDF doesn’t list user_ids.)
The dataset uses all-caps for each author; the lists from Congress retain account names’ original capitalization. It’s often trivial, but the distinction can be semantically meaningful. For example, we can see that the CURTISBIGMAN account from the dataset actually had a Twitter handle of “CurtisBigMan”, not CurtisBigman or CurtIsBigMan.

However, there are two problems that I ran across:

There are 17 authors in the dataset who each have two different external_author_ids. Example: in IRAhandle_tweets_1.csv, there are some rows with author 4MYSQUAD and external_author_id 4036537452. Other rows have the same author 4MYSQUAD, but a different external_author_id 3312143142. (FWIW, the Nov 2017 PDF shows 4MySquad having a user_id of 4036537452.)

For some authors, this appears to be tied to floating-point external_author_ids being rounded, such as KRISTINADRUCKER, who has some tweets with external_author_id 7.15893000000e+17 and others with 7.16000000000e+17. But for other authors, such as the 4MYSQUAD example above, the discrepancy doesn’t appear to be a rounding issue.

Is there a legit reason these accounts are associated with two different external_author_ids, or is this a mistake?
There are 5 authors in the dataset that are not listed in either of the PDFs from Congress:
- BABCHENKOVA_EVA: the 2018 PDF does list the “babchenkova” handle, which also appears in the dataset, but neither PDF lists an account name of BABCHENKOVA_EVA.
- JENNATRAVELLER: no similar account name is in either PDF.
- KARUCZ_00: the 2018 PDF does list the “Karucz” handle, which also appears in the dataset, but neither PDF lists an account name of KARUCZ_00.
- TERRAFORMA: the 2017 and 2018 PDFs do list the “taraformation” handle, which also appears in the dataset, but neither PDF lists an account name of TERRAFORMA.
- TOURETTESN: no similar account name is in either PDF.
It’s not clear why tweets from these five accounts are included in the dataset.
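
The first check above (authors associated with more than one external_author_id) can be sketched roughly as follows. This is an illustration, not the method used to build the Google Sheet: it assumes rows read with csv.DictReader, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def keys_with_multiple_values(rows: Iterable[Mapping[str, str]],
                              key: str, value: str) -> dict[str, list[str]]:
    """Map each distinct row[key] to its distinct row[value]s, keeping only conflicts."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row[value])
    # Only keys that map to two or more different values are reported
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}
```

Calling it with key="author" and value="external_author_id" lists authors with several IDs; swapping the two arguments lists IDs shared by several authors.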

I’ve attached a screenshot below which shows all tweet authors affected by either issue. For a copy-pastable version, go to the Google Sheet, then click the Data menu → Filter views… → author problems.

Should be available via BitTorrent and as a web database that can be queried

First, thanks for making the data available. I was asking about this recently. I would like to get a look at troll tweets, it might help us avoid arguing with them in the future.

However --

I wasn't able to download the file, and this is not a great way to distribute the data. Better would be:

BitTorrent distribution. It was made for data like this. GitHub, not so much.
And it would be wonderful to have this online as a database that can be queried with SQL commands.

I would be happy to help either or both projects, assuming they don't already exist.

Thanks again for uploading the data.

Dave

New repo: separated tweet data and author data

[sorry, please disregard]

Quoting -1

You're quoting -1, which means it won't work in int fields without removing the quotes.
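
For anyone loading the files themselves, one possible workaround is to strip the quotes before converting. The sketch below is an assumption-laden illustration: the function name is hypothetical, and treating -1, NA, NULL and the empty string as "missing" is a choice made here, not something the dataset documents.

```python
from typing import Optional


def parse_int_field(raw: str) -> Optional[int]:
    """Convert a possibly quoted integer field; return None for missing markers."""
    cleaned = raw.strip().strip('"').strip("'").strip()
    # Assumption: these values all mean "no value recorded"
    if cleaned in ("", "-1", "NA", "NULL"):
        return None
    return int(cleaned)
```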

Version 2.0, fixing several issues and adding new data

I'm new to GitHub, but I hope I did this right. I forked the original repository, and created a new version that fixes many of the problems pointed out, including the rounding problem, as well as adding several new variables. I put in a pull request to ask 538 to update the main branch with it, but ???

It's https://github.com/patrick-lee-warren/russian-troll-tweets/tree/Version_2 if you want to play around with it.

Emojis in MySQL/MariaDB

Has anyone gotten the emojis to work when the data is loaded into a MySQL or MariaDB database? I'm using utf8mb4 encoding and utf8mb4_unicode_ci collation, but only a small portion of the emojis are displaying properly for me.

retweet count?

Do you have the number of times each tweet was retweeted? If so, this would be a valuable addition to this dataset.

Resolving t.co links

It would be great to resolve the t.co links. I did some research, and it looks like you need Twitter Enterprise access:

Will there be a URL shortening or resolution API? There is no resolution API (although Twitter's enterprise products provide URL expansion enrichments). URLs are wrapped automatically at time of Tweet or Direct Message submission. Resolved URLs are only available in the context of Tweet or Direct Message content as part of the entities response.

I've requested access to the endpoint with 2 million calls. If they approve, I'll upload the resolved links to unshorten/deanonymize the t.co endpoints.

If not, this should stay open until someone does it because it seems possible.

external_author_id is rounded as a floating point

They are in the format of "9.06000000000e+17" which I assume is incorrect and instead should be a "large integer".
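
A rough way to spot affected values is to parse each ID exactly and look for a long run of trailing zeros. The sketch below uses Python's decimal module so that no further precision is lost while parsing; the function names and the nine-zero threshold are assumptions chosen for illustration, and a genuine ID could in principle end in that many zeros.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def id_as_int(raw: str) -> Optional[int]:
    """Parse '9.06000000000e+17' or '4036537452' exactly; None if not an integer."""
    try:
        value = Decimal(raw.strip())
    except InvalidOperation:
        return None
    if not value.is_finite() or value != value.to_integral_value():
        return None
    return int(value)


def looks_rounded(raw: str, zeros: int = 9) -> bool:
    """Heuristic: an ID ending in `zeros` or more zeros was probably rounded."""
    number = id_as_int(raw)
    return number is not None and number != 0 and number % 10**zeros == 0
```

Parsing recovers only the digits that survived in the file. The original ID has to come from another source, such as a published list of user IDs.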

@patrick-lee-warren - Full list of user IDs for suspended Twitter bot and troll accounts

Hi Patrick, please post the list with more of the troll User IDs than the shorter list with 454 users below. We'd like to use the longer list to remove bots from research data.
https://github.com/fivethirtyeight/russian-troll-tweets/files/2278803/external_author_id_rounding_fixes.zip

Also, can anyone recommend additional lists of known bot user IDs? We're only using User IDs in our research, not Twitter handles. Thanks for all the clean-up work on the rounded IDs! Death to Excel, long live Microsoft GitHub.

Repo appears to be over its data quota and is not allowing downloads

$ git-lfs smudge -- IRAhandle_tweets.csv
Error downloading object: IRAhandle_tweets.csv (80f883c3b641733035cbb7fc0f7068760710dfc647cd69c99f42706cd490b68b)

Smudge error: Error downloading 80f883c3b641733035cbb7fc0f7068760710dfc647cd69c99f42706cd490b68b: batch response: http: This repository is over its data quota. Purchase more data packs to restore access.
Docs: https://help.github.com/articles/purchasing-additional-storage-and-bandwidth-for-an-organization/: batch response: http: This repository is over its data quota. Purchase more data packs to restore access.
Docs: https://help.github.com/articles/purchasing-additional-storage-and-bandwidth-for-an-organization/

any plans for improvement?

Do you plan on improving this?
The tweets are missing the accounts that were retweeted; they should be prefixed with "RT @usernameRTed".
Having the URLs that were tweeted out would be nice, too!

this is an example of a very standard twitter data pull to csv:
trump rally search 07.31.2018.xlsx

thanks!

character encoding for content column

seems the character encoding is off for the "content" column. special chars are showing up as weird text:

convert to UTF-8 maybe?

Add a license to this data set

Without an open source license applied to this repository, the authors retain all copyright to the data, and restrict the options of anyone who wants to use this data. See here for a discussion of how this applies to datasets: https://academia.stackexchange.com/questions/63139/public-dataset-without-license-what-is-allowed

You can add a license to the repository by following the instructions here: https://help.github.com/articles/licensing-a-repository/

Save the community some typing

I know this is an abuse of "issues" but it doesn't warrant a full repo. Here is some Python code you can cut/paste

class Rutweet:
    """One row of the tweet CSV; attribute names mirror the column names."""

    def __init__(self, external_author_id, author, content,
                 region, language, publish_date, harvested_date,
                 following, followers, updates, post_type, account_type,
                 retweet, account_category, new_june_2018):
        self.external_author_id = external_author_id
        self.author = author
        self.content = content
        self.region = region
        self.language = language
        self.publish_date = publish_date
        self.harvested_date = harvested_date
        self.following = following
        self.followers = followers  # was missing from the original snippet
        self.updates = updates
        self.post_type = post_type
        self.account_type = account_type
        self.retweet = retweet
        self.account_category = account_category
        self.new_june_2018 = new_june_2018

And a quick loader.

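import csv

def load_tweets(fn):
    """Return a list of Rutweet objects, one per data row of the CSV file fn."""
    tweets = []
    with open(fn, 'r', encoding='utf-8', newline='') as f:
        # csv.reader honors quoted fields, so commas inside the tweet
        # text do not split a row the way line.split(',') would
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumes the file has one)
        for fields in reader:
            # The first 15 columns map onto the constructor arguments in order
            tweets.append(Rutweet(*fields[:15]))
    return tweets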

Time Zone(s)?

Is there a time zone associated with publish_date and harvested_date (e.g. U.S. Eastern Time)? Thanks for making these data available!
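
Until the time zone is documented, one cautious approach is to parse the timestamps as naive datetimes and attach a zone only as an explicit, clearly labeled step. The sketch below assumes the month/day/year hour:minute layout seen in sample rows such as "10/6/2016 21:04". The function names are hypothetical, and UTC is used purely as a placeholder assumption, not as a statement about the data.

```python
from datetime import datetime, timezone


def parse_timestamp(raw: str) -> datetime:
    """Parse 'M/D/YYYY H:MM' into a naive datetime (no time zone attached)."""
    return datetime.strptime(raw.strip(), "%m/%d/%Y %H:%M")


def assume_utc(naive: datetime) -> datetime:
    """Label a naive datetime as UTC. This is an assumption, not a known fact."""
    return naive.replace(tzinfo=timezone.utc)
```

Keeping the assumption in its own function makes it easy to swap in the correct zone once it is known.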

Duplicates tweets (by tweet_id) & Integrity.

Love the postgres loader, but we're going to try to keep this repo clean and simple, so I won't merge it into master. Glad to see some cool forks with great new features and tools for analysis though. Thank you for your work.

But this is the problem: your repo isn't clean. It's massive, and it's very hard to account for integrity errors. Take duplicate tweets, for instance. I cleaned them all up, and #29 has the commit fb59797.

GitHub won't render it, but try:

git diff fb5979762dca592109f919e4c805d0fb985aa9a9 fb5979762dca592109f919e4c805d0fb985aa9a9^

Of course, it's easy to tell you where the integrity violations are when you're self-hosting and have a simple system to ensure integrity. Now you still have to remove the duplicates again, and after you do that you'll have to push up a totally new copy of the data files.

If you need examples, look for these tweet IDs:

psql:load.psql:11: ERROR:  duplicate key value violates unique constraint "tweets_pkey"
DETAIL:  Key (tweet_id)=(612233279027064832) already exists.
CONTEXT:  COPY tweets, line 99245
COPY 233540
psql:load.psql:13: ERROR:  duplicate key value violates unique constraint "tweets_pkey"
DETAIL:  Key (tweet_id)=(626060785005953025) already exists.
CONTEXT:  COPY tweets, line 20505
psql:load.psql:14: ERROR:  duplicate key value violates unique constraint "tweets_pkey"
DETAIL:  Key (tweet_id)=(669263743784583168) already exists.
CONTEXT:  COPY tweets, line 200111
psql:load.psql:15: ERROR:  duplicate key value violates unique constraint "tweets_pkey"
DETAIL:  Key (tweet_id)=(614858663417802752) already exists.
CONTEXT:  COPY tweets, line 103929

The schema, to ensure this never happens, is tucked away in a little folder for anyone who wants to use it and is 40k.
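
For readers without a database at hand, the same duplicate check can be approximated in plain Python. This is a minimal sketch rather than the schema mentioned above: it assumes rows read with csv.DictReader from a version of the files that has a tweet_id column, and the function name is illustrative.

```python
from collections import Counter
from typing import Iterable, Mapping


def duplicate_ids(rows: Iterable[Mapping[str, str]],
                  column: str = "tweet_id") -> dict[str, int]:
    """Return {id: count} for every non-empty id that occurs more than once."""
    counts = Counter(row[column] for row in rows if row.get(column))
    return {value: n for value, n in counts.items() if n > 1}
```

The IDs should be read as strings (or exact integers) so that rounding does not create false duplicates.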

Data is double-encoded

The data is double-encoded to UTF8. For example, line 5 of the first file contains, in part (non-ASCII bytes represented by \xhh escape sequences to help readability):

#StandForOurAnthem\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8

Each of those two-byte UTF-8 sequences encodes a single byte of the intended UTF-8 text. Decoding them once gives us:

#StandForOurAnthem\xf0\x9f\x87\xba\xf0\x9f\x87\xb8

which in turn can be decoded as UTF-8 to the text #StandForOurAnthem🇺🇸.

This double-encoding makes the files needlessly bigger and harder to work with.
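
The two decoding steps described above can be reproduced in Python as a worked example. The variable names are illustrative, and using Latin-1 for the intermediate step is an inference from the bytes shown, since each decoded character maps back to exactly one byte.

```python
# Bytes as they appear in the file (taken from the example above)
raw = (b"#StandForOurAnthem"
       b"\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8")

# First decode: UTF-8 bytes -> characters U+00F0, U+009F, U+0087, ...
once = raw.decode("utf-8")

# Each of those characters stands for one byte of the real UTF-8 text
inner = once.encode("latin-1")

# Second decode: the intended text, ending in two regional-indicator symbols
text = inner.decode("utf-8")
```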

external_author_id and author relationship

This field is confusing to me: some external_author_id values have multiple authors.

SELECT distinct external_author_id , author
FROM rustweets.tweets WHERE external_author_id = 753000000000000000;
 external_author_id |     author      
--------------------+-----------------
 753000000000000000 | ANGELABACH991
 753000000000000000 | ANGELA_LATTKE
 753000000000000000 | BECKRALFBECK265
 753000000000000000 | CHRISTINAPOOL61
 753000000000000000 | DARRELL_H_HUNT
 753000000000000000 | DOMINIKKELLER22
 753000000000000000 | EHERMANN66
 753000000000000000 | ERIKADIXONLOVE
 753000000000000000 | JOACHIMBUCHWITZ
 753000000000000000 | LARSWOLFLARS
 753000000000000000 | LGBTUNI
 753000000000000000 | LUISSTOCKBERG
 753000000000000000 | MALTE_ROSS
 753000000000000000 | MANUELKROSSS
 753000000000000000 | MARGARETHKURZ
 753000000000000000 | MARMARSCH1
 753000000000000000 | PETERSCHULZ541
(17 rows)

Time Zone

Time zone of date time fields is not mentioned.
Is it UTC or adjusted?

Knowing this would help in correlating tweets with global geopolitical events and with the workday cycles of the US and Russia.

Thanks.

Is anyone using any analytic tools and if so which?

I have access to, well, all of the big dogs. I was curious if anyone else was using anything?