russian-troll-tweets's Issues
Impossible amounts of followers
Did anyone else notice that there is a set of accounts that have more followers than there are users on Twitter? Twitter has 336 million users, but some of the accounts in the dataset have close to 1 billion followers listed. How have you been interpreting this?
Accounts with more than 336 million followers: CHICAGODAILYNEW, DAILYSANFRAN, JENN_ABRAMS, KADIROVRUSSIA, KANSASDAILYNEWS, MAXDEMENTIEV, NEWORLEANSON, NOVOSTIMSK, NOVOSTISPB, ROOMOFRUMOR, SCREAMYMONKEY, SEATTLE_POST, SPECIALAFFAIR, TEN_GOP, TODAYINSYRIA, TODAYNYCITY, TODAYPITTSBURGH, and WASHINGTONLINE.
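A check like the one described above can be sketched in a few lines. This is only an illustration: the function name `accounts_over_limit` and the sample rows are made up, and it assumes the CSV has a header row with `author` and `followers` columns.

```python
import csv
import io

TWITTER_USERS = 336_000_000  # user count quoted in the issue above


def accounts_over_limit(csv_file, limit=TWITTER_USERS):
    """Return the sorted author names whose listed follower count exceeds `limit`."""
    flagged = set()
    for row in csv.DictReader(csv_file):
        try:
            followers = int(row["followers"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric count
        if followers > limit:
            flagged.add(row["author"])
    return sorted(flagged)


# Hypothetical rows, not taken from the dataset.
sample = io.StringIO(
    "author,followers\n"
    "EXAMPLE_A,1200\n"
    "EXAMPLE_B,950000000\n"
    "EXAMPLE_B,950000001\n"
    "EXAMPLE_C,\n"
)
over_limit = accounts_over_limit(sample)
print(over_limit)
```

Collecting names in a set means an account is reported once even if many of its tweets carry an implausible count.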
Desperately Seeking Schema
It would be nice if the table in the README could be updated with information about the type of each field. In particular, for those fields that are enumerated constants (such as post_type and account_type), list the set of valid values, and for all fields indicate whether they are nullable. Since the data format is not raw Twitter data, maybe a link to https://help.salesforce.com/articleView?id=mc_ss_csv_report_headers.htm&type=5 would be helpful, too.
Non-ASCII Characters Issue
Hi all,
I have seen the posts re: double-encoded, however when I try:
for file in IRAhandle_tweets_1.csv; do
    echo -n "Converting $file... "
    iconv -f utf8 -t latin1 $file > $file.corrected &&
    mv -f $file.corrected $file
    echo "Done"
done
I get the following error:
iconv: IRAhandle_tweets_1.csv:6:84: cannot convert
This also generates a file, IRAhandle_tweets_1.csv.corrected, which cannot be opened on my MacBook (the file size is only 2 KB!).
Ultimately, I would like to export all the English-language tweets into a txt file. Any suggestions would be kindly appreciated.
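One possible approach, sketched in Python rather than iconv: repair each tweet individually and leave it untouched when the conversion does not apply, instead of aborting the whole file. The names `repair` and `english_tweets` are made up for this sketch, and it assumes the CSV has a header row with `content` and `language` columns. The guess that iconv stops on a character that has no Latin-1 equivalent is an assumption, not something confirmed in this thread.

```python
import csv
import io


def repair(text):
    """Undo one round of UTF-8 double-encoding, or return the text unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text holds characters outside Latin-1 (presumably the
        # case that makes iconv stop) or it was never double-encoded.
        return text


def english_tweets(csv_file):
    """Yield the repaired content of every row whose language is English."""
    for row in csv.DictReader(csv_file):
        if row.get("language") == "English":
            yield repair(row.get("content") or "")


# Hypothetical three-row sample; the real files have more columns.
sample = io.StringIO(
    "author,content,language\n"
    'A,"caf\u00c3\u00a9, anyone?",English\n'
    "B,\u043f\u0440\u0438\u0432\u0435\u0442,Russian\n"
    "C,price 5 \u20ac,English\n"
)
tweets = list(english_tweets(sample))
```

To produce the txt file, open the CSV with `encoding="utf-8"` and `newline=""`, open the output file with `encoding="utf-8"`, and write one yielded string per line.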
Additional ID columns
It's been mentioned in some of the other issues here but I think it warrants its own.
First off, thanks for releasing this; it's a fantastic dataset. I'm interested in the network effects going on in these communities, and want to load them into a graph db, but without the tweet ID, retweet ID & account name (as opposed to screen name), I'm not able to connect retweets to tweets, or user mentions to users.
It's possible to get the account name and retweeted status id from the t.co/... url (most redirect to an 'account suspended' page, but you can still see the full url in the request history), but without the (re)tweet ID, you can't connect things up.
IRAhandle_tweets_5.csv contains 4 records with '-1' in the external_author_id column
IRAhandle_tweets_5.csv contains 4 records with -1 in the external_author_id column, as follows, -1 replaced with NA,
Compress files
I see you've split the file into multiple smaller files. It would be good to also compress the files to make downloading easier.
Many entries tagged with language=Swedish are in fact in German
I went through the entries and found 1021 entries marked as language=Swedish. But looking in more detail many of these are actually German.
Such as entry nr 322020 in IRAhandle_tweets_1.csv:
7.25000000000e+17,BERLINBOTE,Bernd Krömer: Amri-Ausschuss will früheren Innenstaatssekretär vernehmen https://t.co/qKqGZyBEma,Unknown,Swedish,9/8/2017 5:51,9/8/2017 5:51,2230,1779,22274,,German,0,0,NonEnglish
The tweet is definitely German, not Swedish, as all tweets from BERLINBOTE seem to be.
You may have made the mistake of treating "ö" and "ä" as identifiers for Swedish?
Can We Get IP of Tweet?
Is there any way we can add an IP address or other data piece that we could create a map from? It would be interesting to create a web map showing where these are coming from.
If it's not possible, then just close this issue.
Missing regions and shifted columns
There are a handful of tweets that are missing a region and have the following columns shifted to the left:
external_author_id | author | content | region | language | publish_date | harvested_date | following | followers | updates | post_type | account_type | new_june_2018 | retweet | account_category | date_mysql |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1670762347 | ADAMCHAPMANJR | Check out the video I made with @LAEducators to #ThankaLAEducator https://t.co/N7J70kSDsn,United States | English | 10/6/2016 21:04 | 10/6/2016 21:07 | 582 | 934 | 1807 | 0 | left | 0 | 1 | 0 | NULL | 2016-10-06 21:07:00 |
1850866398 | BRICEGELLER | Check out the video I made with @LAEducators to #ThankaLAEducator https://t.co/N7J70kSDsn,United States | English | 10/6/2016 20:54 | 10/6/2016 20:55 | 851 | 852 | 1587 | 0 | left | 0 | 1 | 0 | NULL | 2016-10-06 20:55:00 |
1626302035 | CLAYPAIGEBOO | Check out the video I made with @LAEducators to #ThankaLAEducator https://t.co/N7J70kSDsn,United States | English | 10/6/2016 20:58 | 10/6/2016 20:58 | 776 | 916 | 1562 | 0 | left | 0 | 1 | 0 | NULL | 2016-10-06 20:58:00 |
1692501152 | CORNELLBURCHET | Check out the video I made with @LAEducators to #ThankaLAEducator https://t.co/N7J70kSDsn,United States | English | 10/6/2016 20:55 | 10/6/2016 20:55 | 725 | 774 | 1536 | 0 | left | 0 | 1 | 0 | NULL | 2016-10-06 20:55:00 |
2882037326 | DANAGEEZUS | I need a dance break right meow! (•_•) <) )╯all the single ladies / \ (•_•) \( (> all the single ladies / \ (•_•) <) )╯oh oh oh / ,United States | English | 7/6/2015 15:03 | 7/6/2015 15:03 | 3740 | 9351 | 1849 | NULL | Hashtager | 0 | 0 | 0 | NULL | 2015-07-06 15:03:00 |
2577152109 | DENN_NIKITIN | курточка моя так хорошо сидит на ней \ты пропел эти слова в голове,Unknown | Russian | 4/15/2017 13:19 | 4/15/2017 13:19 | 68 | 113 | 6277 | 0 | Russian | 0 | 1 | 0 | NULL | 2017-04-15 13:19:00 |
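Rows like the ones above could be located with a small check. The sketch below is an assumption-laden illustration, not a tool from the repo: `find_shifted_rows` is a made-up name, the sample rows are shortened stand-ins, and the heuristic relies on dates looking like `10/6/2016 21:07`. It tests `harvested_date` rather than `publish_date` because, after a one-column shift, `publish_date` still receives a date (the harvested one) while `harvested_date` receives a number.

```python
import csv
import io
import re

# Matches dates such as "10/6/2016 21:07" or "4/15/2017 13:19".
DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}$")


def find_shifted_rows(csv_file, date_column="harvested_date"):
    """Return 1-based record numbers of rows that look shifted."""
    reader = csv.reader(csv_file)
    header = next(reader)
    date_idx = header.index(date_column)
    suspects = []
    for record_no, row in enumerate(reader, start=1):
        wrong_width = len(row) != len(header)
        bad_date = len(row) <= date_idx or not DATE_RE.match(row[date_idx])
        if wrong_width or bad_date:
            suspects.append(record_no)
    return suspects


# Shortened, hypothetical rows: record 1 is intact, records 2 and 3 have the
# region merged into the content field.
sample = io.StringIO(
    "author,content,region,language,publish_date,harvested_date,following\n"
    "A,hello,United States,English,10/6/2016 21:04,10/6/2016 21:07,582\n"
    'B,"hello,United States",English,10/6/2016 20:54,10/6/2016 20:55,851\n'
    'C,"hi,Unknown",Russian,4/15/2017 13:19,4/15/2017 13:19,68,113\n'
)
shifted = find_shifted_rows(sample)
```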
Irregularities matching authors in the dataset with the 2017/2018 Congressional lists
I’ve made a public Google Sheet that attempts to match up account information from this dataset with the November 2017 and June 2018 lists published by the House Intelligence Committee. Perhaps others will find it helpful for a number of reasons…
- Every author has multiple tweets. But some info (should) remain constant for a particular author across all their tweets: external_author_id, account_type, account_category and the new_june_2018 flag. This spreadsheet provides a summary of all the authors and their associated properties.
- The November 2017 list contains user_ids, i.e. external_author_id. This can be used to resolve some of the floating point external_author_ids raised in issue #4. (It can’t be used for accounts that were added in the 2018 list because that PDF doesn’t list user_ids.)
- The dataset uses all-caps for each author; the lists from Congress retain account names’ original capitalization. It’s often trivial, but the distinction can be semantically meaningful. For example, we can see that the CURTISBIGMAN account from the dataset actually had a Twitter handle of “CurtisBigMan”, not CurtisBigman or CurtIsBigMan.
However, there are two problems that I ran across:
- There are 17 authors in the dataset who each have two different external_author_ids. Example: in IRAhandle_tweets_1.csv, there are some rows with author 4MYSQUAD and external_author_id 4036537452. Other rows have the same author 4MYSQUAD, but a different external_author_id 3312143142. (FWIW, the Nov 2017 PDF shows 4MySquad having a user_id of 4036537452.)

  For some authors this appears to be tied to floating point external_author_ids being rounded, such as KRISTINADRUCKER, who has some tweets with external_author_id 7.15893000000e+17 and others 7.16000000000e+17. But for other authors, such as the 4MYSQUAD example above, the discrepancy doesn’t appear to be a rounding issue.

  Is there a legit reason these accounts are associated with two different external_author_ids, or is this a mistake?
- There are 5 authors in the dataset that are not listed in either of the PDFs from Congress:
  - BABCHENKOVA_EVA: the 2018 PDF does list the “babchenkova” handle, which also appears in the dataset, but neither PDF lists an account name of BABCHENKOVA_EVA.
  - JENNATRAVELLER: no similar account name is in either PDF.
  - KARUCZ_00: the 2018 PDF does list the “Karucz” handle, which also appears in the dataset, but neither PDF lists an account name of KARUCZ_00.
  - TERRAFORMA: the 2017 and 2018 PDFs do list the “taraformation” handle, which also appears in the dataset, but neither PDF lists an account name of TERRAFORMA.
  - TOURETTESN: no similar account name is in either PDF.

  It’s not clear why tweets from these five accounts are included in the dataset.
I’ve attached a screenshot below which shows all tweet authors affected by either issue. For a copy-pastable version, go to the Google Sheet, then click the Data menu → Filter views… → author problems.
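For anyone who wants to reproduce the first check without the spreadsheet, a minimal sketch follows. The function name `authors_with_multiple_ids` is made up here, and the pair for CURTISBIGMAN uses a placeholder ID since the real value is not quoted in this issue; the other pairs are the ones mentioned above.

```python
from collections import defaultdict


def authors_with_multiple_ids(pairs):
    """Map each author to its sorted IDs, keeping only authors with more than one."""
    ids_by_author = defaultdict(set)
    for author, author_id in pairs:
        ids_by_author[author].add(author_id)
    return {
        author: sorted(ids)
        for author, ids in ids_by_author.items()
        if len(ids) > 1
    }


# (author, external_author_id) pairs as they would be read from the CSV files.
pairs = [
    ("4MYSQUAD", "4036537452"),
    ("4MYSQUAD", "3312143142"),
    ("KRISTINADRUCKER", "7.15893000000e+17"),
    ("KRISTINADRUCKER", "7.16000000000e+17"),
    ("CURTISBIGMAN", "PLACEHOLDER_ID"),  # hypothetical value
]
conflicts = authors_with_multiple_ids(pairs)
```

The IDs are kept as strings on purpose, so that a rounded value and an exact value are never silently merged by numeric conversion.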
Should be available via BitTorrent and as a web database that can be queried
First, thanks for making the data available. I was asking about this recently. I would like to get a look at troll tweets, it might help us avoid arguing with them in the future.
However --
I wasn't able to download the file, and this is not a great way to distribute the info. Better would be:
- BitTorrent distribution. It was made for data like this. GitHub, not so much.
- And it would be wonderful to have this online as a database that can be queried with SQL commands.
I would be happy to help either or both projects, assuming they don't already exist.
Thanks again for uploading the data.
Dave
New repo: separated tweet data and author data
[sorry, please disregard]
Quoting -1
You're quoting -1, which means it won't work in int fields without removing the quotes.
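Until the files are changed, a loader can normalize the column itself. The sketch below is one way to do it under stated assumptions: `parse_author_id` is a made-up helper, and treating -1, NA and empty strings as "no ID" is this sketch's choice rather than documented behavior of the dataset.

```python
from decimal import Decimal, InvalidOperation


def parse_author_id(raw):
    """Return external_author_id as an int, or None when no usable ID is present."""
    text = raw.strip().strip('"').strip()
    if text in ("", "-1", "NA"):
        return None  # placeholder values, assumed to mean "unknown"
    try:
        value = Decimal(text)  # also accepts exponent notation
    except InvalidOperation:
        return None
    if not value.is_finite() or value != value.to_integral_value():
        return None
    return int(value)


quoted_minus_one = parse_author_id('"-1"')
plain_id = parse_author_id("4036537452")
exponent_id = parse_author_id("9.06000000000e+17")
```

Using `Decimal` instead of `float` keeps the conversion exact, although it cannot restore digits that were already lost when an ID was written in exponent notation.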
Version 2.0, fixing several issues and adding new data
I'm new to GitHub, but I hope I did this right. I forked the original repository, and created a new version that fixes many of the problems pointed out, including the rounding problem, as well as adding several new variables. I put in a pull request to ask 538 to update the main branch with it, but ???
It's https://github.com/patrick-lee-warren/russian-troll-tweets/tree/Version_2 if you want to play around with it.
Emojis in MySQL/MariaDB
Has anyone gotten the emojis to work when the data is loaded into a MySQL or MariaDB database? I'm using utf8mb4 encoding and utf8mb4_unicode_ci collation, but only a small portion of the emojis are displaying properly for me.
retweet count?
Do you have the number of times each tweet was retweeted? If so, this would be a valuable addition to this dataset.
Resolving t.co links
It would be great to resolve the t.co links. I did some research, and you need Twitter Enterprise access:
Will there be a URL shortening or resolution API? There is no resolution API (although Twitter's enterprise products provide URL expansion enrichments). URLs are wrapped automatically at time of Tweet or Direct Message submission. Resolved URLs are only available in the context of Tweet or Direct Message content as part of the entities response.
I've requested access to the endpoint with 2 million calls. If they approve, I'll upload the resolved links to unshorten/deanonymize the t.co endpoints.
If not, this should stay open until someone does it because it seems possible.
external_author_id is rounded as a floating point
They are in the format of "9.06000000000e+17" which I assume is incorrect and instead should be a "large integer".
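To illustrate why this matters, the sketch below rounds integer IDs to three significant digits, which is what a value like 9.06000000000e+17 suggests happened. That the export rounded this way is an assumption. The two IDs in the collision example are hypothetical, and the helper names are made up for this sketch.

```python
def round_to_sig_digits(n, digits=3):
    """Round a positive integer to `digits` significant digits, using integer arithmetic."""
    factor = 10 ** max(len(str(n)) - digits, 0)
    return (n + factor // 2) // factor * factor


def as_exponent(n):
    """Format an integer the way the IDs appear in the CSV files."""
    return f"{n:.11e}"


# 7.15893000000e+17 and 7.16000000000e+17 both appear in the dataset for one author.
rounded = round_to_sig_digits(715893000000000000)
rounded_text = as_exponent(rounded)

# Two hypothetical 18-digit IDs that become indistinguishable after rounding.
first = round_to_sig_digits(752600000000000001)
second = round_to_sig_digits(753400000000000002)
```

Once two accounts collapse onto the same rounded value, the original IDs cannot be recovered from the file alone.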
@patrick-lee-warren - Full list of user IDs for suspended Twitter bot and troll accounts
Hi Patrick, please post the list with more of the troll User IDs than the shorter list with 454 users below. We'd like to use the longer list to remove bots from research data.
https://github.com/fivethirtyeight/russian-troll-tweets/files/2278803/external_author_id_rounding_fixes.zip
Also, can anyone recommend additional lists of known bot user IDs? We're only using User IDs in our research, not Twitter handles. Thanks for all the clean-up work on the rounded IDs! Death to Excel, long live Microsoft GitHub.
Repo appears to be over its data quota and is not allowing downloads
$ git-lfs smudge -- IRAhandle_tweets.csv
Error downloading object: IRAhandle_tweets.csv (80f883c3b641733035cbb7fc0f7068760710dfc647cd69c99f42706cd490b68b)
Smudge error: Error downloading 80f883c3b641733035cbb7fc0f7068760710dfc647cd69c99f42706cd490b68b: batch response: http: This repository is over its data quota. Purchase more data packs to restore access.
Docs: https://help.github.com/articles/purchasing-additional-storage-and-bandwidth-for-an-organization/: batch response: http: This repository is over its data quota. Purchase more data packs to restore access.
Docs: https://help.github.com/articles/purchasing-additional-storage-and-bandwidth-for-an-organization/
any plans for improvement?
Do you plan on improving this?
The tweets are missing the accounts that were retweeted; they should be prefixed with "RT @usernameRTed".
And the URLs tweeted out would be nice!
this is an example of a very standard twitter data pull to csv:
trump rally search 07.31.2018.xlsx
thanks!
character encoding for content column
seems the character encoding is off for the "content" column. special chars are showing up as weird text:
Ð?Ñ?иÑ?ина #118 ЯÑ?оваÑ� пÑ?оÑ�иÑ? Ð?аÑ�илÑ?евÑ? Ñ�оздаÑ?Ñ? меÑ?од длÑ� Ñ?Ñ?иÑ?елей по вÑ?Ñ�влениÑ? инÑ?еÑ?неÑ?-завиÑ�имÑ?Ñ? деÑ?ей
convert to UTF-8 maybe?
Add a license to this data set
Without an open source license applied to this repository, the authors retain all copyright to the data, and restrict the options of anyone who wants to use this data. See here for a discussion of how this applies to datasets: https://academia.stackexchange.com/questions/63139/public-dataset-without-license-what-is-allowed
You can add a license to the repository by following the instructions here: https://help.github.com/articles/licensing-a-repository/
Save the community some typing
I know this is an abuse of "issues" but it doesn't warrant a full repo. Here is some Python code you can cut/paste:
class Rutweet:
    """One row of an IRAhandle_tweets_*.csv file."""

    def __init__(self, external_author_id, author, content,
                 region, language, publish_date, harvested_date,
                 following, followers, updates, post_type, account_type,
                 retweet, account_category, new_june_2018):
        self.external_author_id = external_author_id
        self.author = author
        self.content = content
        self.region = region
        self.language = language
        self.publish_date = publish_date
        self.harvested_date = harvested_date
        self.following = following
        self.followers = followers
        self.updates = updates
        self.post_type = post_type
        self.account_type = account_type
        self.retweet = retweet
        self.account_category = account_category
        self.new_june_2018 = new_june_2018
And a quick loader.
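import csv

def load_tweets(fn):
    """Yield one Rutweet per data row of the CSV file `fn`."""
    # The csv module handles quoted fields, so commas inside a tweet's text
    # do not split the row.
    with open(fn, 'r', encoding='utf-8', newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the first row, assumed to be the header
        for fields in reader:
            # Fields are passed positionally, so the column order in the file
            # must match the order of the Rutweet constructor's parameters.
            yield Rutweet(*fields[:15])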
Time Zone(s)?
Is there a time zone associated with publish_date and harvested_date (e.g. U.S. Eastern Time)? Thanks for making these data available!
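Until the time zone is documented, one cautious option is to parse the timestamps as naive datetimes and attach a zone only once it is known. The sketch below assumes the month-first format seen in sample rows elsewhere on this page (for example 4/15/2017 13:19); `parse_tweet_date` is a made-up helper name.

```python
from datetime import datetime

# Month/day/year with a 24-hour time, e.g. "9/8/2017 5:51".
DATE_FORMAT = "%m/%d/%Y %H:%M"


def parse_tweet_date(text):
    """Parse a publish_date or harvested_date value into a naive datetime."""
    return datetime.strptime(text, DATE_FORMAT)


afternoon = parse_tweet_date("4/15/2017 13:19")
morning = parse_tweet_date("9/8/2017 5:51")
```

The result carries no `tzinfo`, which makes it explicit that the zone is still an open question rather than silently assuming UTC.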
Duplicate tweets (by tweet_id) & Integrity.
Love the postgres loader, but we're going to try to keep this repo clean and simple, so I won't merge it into master. Glad to see some cool forks with great new features and tools for analysis though. Thank you for your work.
But this is the problem: your repo isn't clean. It's massive, and it is very hard to account for errors in integrity. Take, for instance, duplicate tweets. I cleaned them all up, and #29 has the commit fb59797
Github won't render it but try,
git diff fb5979762dca592109f919e4c805d0fb985aa9a9 fb5979762dca592109f919e4c805d0fb985aa9a9^
Of course, it's easy to tell you where the integrity violations are when you're self-hosting and you have a simple system to ensure integrity. Now you still have to remove the duplicates again, and after you do that you'll have to push up a totally new copy of the data files.
If you need examples look for tweet IDs,
psql:load.psql:11: ERROR: duplicate key value violates unique constraint "tweets_pkey"
DETAIL: Key (tweet_id)=(612233279027064832) already exists.
CONTEXT: COPY tweets, line 99245
COPY 233540
psql:load.psql:13: ERROR: duplicate key value violates unique constraint "tweets_pkey"
DETAIL: Key (tweet_id)=(626060785005953025) already exists.
CONTEXT: COPY tweets, line 20505
psql:load.psql:14: ERROR: duplicate key value violates unique constraint "tweets_pkey"
DETAIL: Key (tweet_id)=(669263743784583168) already exists.
CONTEXT: COPY tweets, line 200111
psql:load.psql:15: ERROR: duplicate key value violates unique constraint "tweets_pkey"
DETAIL: Key (tweet_id)=(614858663417802752) already exists.
CONTEXT: COPY tweets, line 103929
The schema, to ensure this never happens, is tucked away in a little folder for anyone who wants to use it and is 40k.
Data is double-encoded
The data is double-encoded to UTF-8. For example, line 5 of the first file contains, in part (non-ASCII bytes represented by \xhh escape sequences to help readability):
#StandForOurAnthem\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8
Each of those two-byte sequences is the UTF-8 encoding of a single byte of the original UTF-8 data. Decoding them gives us:
#StandForOurAnthem\xf0\x9f\x87\xba\xf0\x9f\x87\xb8
which in turn can be decoded as UTF-8 to the text #StandForOurAnthem🇺🇸.
This double-encoding makes the files needlessly bigger and harder to work with.
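The two decoding steps can be reproduced on the byte string quoted above. This is a minimal sketch; the function name `undo_double_encoding` is made up, and it assumes every intermediate character fits in Latin-1, which holds for this example.

```python
def undo_double_encoding(raw: bytes) -> str:
    """Reverse one extra round of UTF-8 encoding."""
    once = raw.decode("utf-8")         # each character now stands for one original byte
    original = once.encode("latin-1")  # map those characters back to single bytes
    return original.decode("utf-8")    # decode the original UTF-8 data


raw = (
    b"#StandForOurAnthem"
    b"\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba"
    b"\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8"
)
fixed = undo_double_encoding(raw)
```

Latin-1 is used for the middle step because it maps code points 0 to 255 directly onto the byte values 0 to 255.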
external_author_id and author relationship
This field is confusing to me: some external_author_id values have multiple authors.
SELECT distinct external_author_id , author
FROM rustweets.tweets WHERE external_author_id = 753000000000000000;
external_author_id | author
--------------------+-----------------
753000000000000000 | ANGELABACH991
753000000000000000 | ANGELA_LATTKE
753000000000000000 | BECKRALFBECK265
753000000000000000 | CHRISTINAPOOL61
753000000000000000 | DARRELL_H_HUNT
753000000000000000 | DOMINIKKELLER22
753000000000000000 | EHERMANN66
753000000000000000 | ERIKADIXONLOVE
753000000000000000 | JOACHIMBUCHWITZ
753000000000000000 | LARSWOLFLARS
753000000000000000 | LGBTUNI
753000000000000000 | LUISSTOCKBERG
753000000000000000 | MALTE_ROSS
753000000000000000 | MANUELKROSSS
753000000000000000 | MARGARETHKURZ
753000000000000000 | MARMARSCH1
753000000000000000 | PETERSCHULZ541
(17 rows)
Time Zone
Time zone of date time fields is not mentioned.
Is it UTC or adjusted?
Knowing this would help in correlating tweets with global geopolitical events and with the workday cycles of the US and Russia.
Thanks.
Is anyone using any analytic tools and if so which?
I have access to, well, all of the big dogs. I was curious if anyone else was using anything?