russian-troll-tweets's People

Contributors

dmil, gwezerek

russian-troll-tweets's Issues

Impossible amounts of followers

Did anyone else notice that a set of accounts have more followers than there are users on Twitter? Twitter has 336 million users, but some of the accounts in the dataset have close to 1 billion followers listed. How have you been interpreting this?

Accounts with more than 336 million followers: CHICAGODAILYNEW, DAILYSANFRAN, JENN_ABRAMS, KADIROVRUSSIA, KANSASDAILYNEWS, MAXDEMENTIEV, NEWORLEANSON, NOVOSTIMSK, NOVOSTISPB, ROOMOFRUMOR, SCREAMYMONKEY, SEATTLE_POST, SPECIALAFFAIR, TEN_GOP, TODAYINSYRIA, TODAYNYCITY, TODAYPITTSBURGH, and WASHINGTONLINE.
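A minimal sketch of how such accounts could be flagged, assuming the rows come from `csv.DictReader` and carry the dataset's `author` and `followers` columns. The function name, the ceiling constant, and the sample rows (including `EXAMPLE_ACCOUNT`) are illustrative and not part of the dataset.

```python
import csv
import io

# Ceiling taken from the user count quoted in this issue.
TWITTER_USER_CEILING = 336_000_000


def implausible_follower_accounts(rows, ceiling=TWITTER_USER_CEILING):
    """Return {author: highest follower count} for authors above the ceiling."""
    flagged = {}
    for row in rows:
        try:
            followers = int(row.get("followers") or "")
        except ValueError:
            continue  # skip rows whose follower count is missing or not an integer
        if followers > ceiling:
            author = row.get("author", "")
            flagged[author] = max(followers, flagged.get(author, 0))
    return flagged


# Synthetic sample rows, for illustration only.
sample = io.StringIO(
    "author,followers\n"
    "EXAMPLE_ACCOUNT,999999999\n"
    "FINDDIET,308\n"
    "EXAMPLE_ACCOUNT,not-a-number\n"
)
flagged = implausible_follower_accounts(csv.DictReader(sample))
print(flagged)
```

Keeping the highest value per author makes it easier to see whether the inflated counts appear on every row for an account or only on some of them.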

Non-ASCII Characters Issue

Hi all,

I have seen the posts about the data being double-encoded; however, when I try:

for file in IRAhandle_tweets_1.csv; do
  echo -n "Converting $file... "
  # Quote the variable so file names with spaces survive word splitting
  iconv -f utf8 -t latin1 "$file" > "$file.corrected" &&
  mv -f "$file.corrected" "$file"
  echo "Done"
done

I get the following error:
iconv: IRAhandle_tweets_1.csv:6:84: cannot convert

This also generates a file, IRAhandle_tweets_1.csv.corrected, which cannot be opened on my MacBook (the file size is only 2 KB!).

Ultimately, I would like to export all the English Language tweets into a txt file...any suggestions kindly appreciated.
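One possible way to do the export without iconv is sketched below. It assumes the CSV has a header row with `language` and `content` columns and that English rows are labelled `English`, as in the rows quoted elsewhere on this page. The function name `export_english` is hypothetical, and the sketch does not repair the double encoding.

```python
import csv
import io


def export_english(csv_file, out_file, language="English"):
    """Write the content of every row in the given language, one tweet per line."""
    count = 0
    for row in csv.DictReader(csv_file):
        if row.get("language") == language:
            content = row.get("content") or ""
            # Collapse embedded line breaks so each tweet stays on a single line.
            out_file.write(" ".join(content.split()) + "\n")
            count += 1
    return count


# Two rows taken from examples on this page, reduced to the columns used here.
sample = io.StringIO(
    "author,content,language\n"
    "ADAMCHAPMANJR,Check out the video I made with @LAEducators "
    "to #ThankaLAEducator https://t.co/N7J70kSDsn,English\n"
    "DENN_NIKITIN,курточка моя так хорошо сидит на ней,Russian\n"
)
output = io.StringIO()
written = export_english(sample, output)
print(written, output.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by `open(path, newline="", encoding="utf-8")` for the input and `open(out_path, "w", encoding="utf-8")` for the output.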

Additional ID columns

It's been mentioned in some of the other issues here but I think it warrants its own.

First off, thanks for releasing this; it's a fantastic dataset. I'm interested in the network effects going on in these communities and want to load them into a graph DB, but without the tweet ID, retweet ID, and account name (as opposed to screen name), I'm not able to connect retweets to tweets, or user mentions to users.

It's possible to get the account name and retweeted status id from the t.co/... url (most redirect to an 'account suspended' page, but you can still see the full url in the request history), but without the (re)tweet ID, you can't connect things up.

IRAhandle_tweets_5.csv contains 4 records with '-1' in the external_author_id column

IRAhandle_tweets_5.csv contains 4 records with -1 in the external_author_id column. They are shown below with -1 replaced by NA:

external_author_id author content region language publish_date harvested_date following followers updates post_type account_type retweet account_category new_june_2018 alt_external_id tweet_id article_url tco1_step1 tco2_step1 tco3_step1
NA FINDDIET muvva @_SlaYonce Marissa @marissafabrizi Tawnya @sdxNomadMom Happy @HelenFJohnson http://t.co/Z8DHogbMn8 https://t.co/9Fe1weThaC United States LANGUAGE UNDEFINED 8/12/2015 11:04 8/12/2015 11:12 3 308 29699 NA Commercial 0 Commercial 1 NA 6.314209e+17 http://twitter.com/FindDiet/statuses/631420894506713088 https://twitter.com/marissafabrizi_/status/631419756613124096 https://twitter.com/safety/unsafe_link_warning?unsafe_link=http%3A%2F%2FWWW.LOSEFATTIPS.PW%2FTIPS%2FLITERALLY-SAME-WHEN-I-TRY-AND-RUN-TO-LOSE-WEIGHT.ASP NA
NA FINDDIET http://t.co/MKzEwWBPwr Toe @hongtony22 ժɑɾíɑղ @woaahdaare Tara @intowalls Delaney @vandell66 Fattie @FattaySD https://t.co/cQrguxWQtT United States LANGUAGE UNDEFINED 8/12/2015 23:08 8/12/2015 23:08 3 331 30648 NA Commercial 0 Commercial 1 NA 6.316031e+17 http://twitter.com/FindDiet/statuses/631603122641596417 https://twitter.com/ivannvega23/status/631601740173975552 https://twitter.com/safety/unsafe_link_warning?unsafe_link=http%3A%2F%2FWWW.LOSEFATTIPS.PW%2FTIPS%2FGIRLS-WHO-WORKOUT.ASP NA
NA FINDDIET http://t.co/wFGUVzURSM Jay .@Jay_zo Nina .@xxooninaooxx Medeia .@sharifwrites K♛ .@Stcrbuks Jess .@jjeessss81 https://t.co/GOctHrGhiG United States LANGUAGE UNDEFINED 8/13/2015 0:18 8/13/2015 0:19 3 334 30756 NA Commercial 0 Commercial 1 NA 6.316208e+17 http://twitter.com/FindDiet/statuses/631620795354058752 https://twitter.com/xxooninaooxx/status/631620042229215232 https://twitter.com/safety/unsafe_link_warning?unsafe_link=http%3A%2F%2FWWW.LOSEFATTIPS.PW%2FTIPS%2FMY-FAVE-WORKOUT.ASP NA
NA FINDDIET http://t.co/U4DFGCaSEF Davi_hoops @comiclyfe_12 YELLA @ImYellaForeal perelondon @perelondon1 Sharday @whitegalshay https://t.co/7mh7C2GkKE United States LANGUAGE UNDEFINED 8/13/2015 6:27 8/13/2015 6:27 3 350 31267 NA Commercial 0 Commercial 1 NA 6.317136e+17 http://twitter.com/FindDiet/statuses/631713577162805248 https://twitter.com/rainbowsonlou/status/631712761962233856 https://twitter.com/safety/unsafe_link_warning?unsafe_link=http%3A%2F%2FWWW.LOSEFATTIPS.PW%2FTIPS%2FGOOD-WORKOUT-AT-THE-GYM.ASP NA
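In the rows above, tweet_id is displayed in floating-point notation (for example 6.314209e+17), while article_url still carries the full ID. A small sketch of recovering the handle and the exact ID from such a URL follows; the regular expression and the function name `parse_status_url` are illustrative assumptions, not part of the dataset's tooling.

```python
import re

# Matches both the /statuses/ and /status/ URL forms seen in the rows above.
STATUS_URL = re.compile(
    r"^https?://(?:www\.)?twitter\.com/(?P<handle>[^/]+)/status(?:es)?/(?P<tweet_id>\d+)"
)


def parse_status_url(url):
    """Return (handle, tweet_id) for a tweet status URL, or None if it is not one."""
    match = STATUS_URL.match(url.strip())
    if match is None:
        return None
    # Keep the ID as an int so no digits are lost to float rounding.
    return match.group("handle"), int(match.group("tweet_id"))


print(parse_status_url("http://twitter.com/FindDiet/statuses/631420894506713088"))
print(parse_status_url("https://twitter.com/marissafabrizi_/status/631419756613124096"))
```

URLs that are not status links, such as the unsafe_link_warning redirects in the tco2_step1 column, return None.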

Compress files

I see you've split the file into multiple smaller files. It would also be good to compress the files to make downloading easier.

Many entries tagged with language=Swedish are in fact in German

I went through the entries and found 1021 entries marked as language=Swedish. On closer inspection, many of these are actually German.
For example, entry number 322020 in IRAhandle_tweets_1.csv:
7.25000000000e+17,BERLINBOTE,Bernd Krömer: Amri-Ausschuss will früheren Innenstaatssekretär vernehmen https://t.co/qKqGZyBEma,Unknown,Swedish,9/8/2017 5:51,9/8/2017 5:51,2230,1779,22274,,German,0,0,NonEnglish
The tweet is definitely German, not Swedish, as all tweets from BERLINBOTE seem to be.
Did you perhaps mistakenly treat "ö" and "ä" as identifiers for Swedish?

Can We Get IP of Tweet?

Is there any way we can add an IP address or another piece of data that we could create a map from? It would be interesting to create a web map showing where these tweets are coming from.

If it's not possible, then just close this issue.

Missing regions and shifted columns

There are a handful of tweets that are missing a region and have the following columns shifted to the left:

external_author_id author content region language publish_date harvested_date following followers updates post_type account_type new_june_2018 retweet account_category date_mysql
1670762347 ADAMCHAPMANJR Check out the video I made with @LAEducators to #ThankaLAEducator https://t.co/N7J70kSDsn,United States English 10/6/2016 21:04 10/6/2016 21:07 582 934 1807 0 left 0 1 0 NULL 2016-10-06 21:07:00
1850866398 BRICEGELLER Check out the video I made with @LAEducators to #ThankaLAEducator https://t.co/N7J70kSDsn,United States English 10/6/2016 20:54 10/6/2016 20:55 851 852 1587 0 left 0 1 0 NULL 2016-10-06 20:55:00
1626302035 CLAYPAIGEBOO Check out the video I made with @LAEducators to #ThankaLAEducator https://t.co/N7J70kSDsn,United States English 10/6/2016 20:58 10/6/2016 20:58 776 916 1562 0 left 0 1 0 NULL 2016-10-06 20:58:00
1692501152 CORNELLBURCHET Check out the video I made with @LAEducators to #ThankaLAEducator https://t.co/N7J70kSDsn,United States English 10/6/2016 20:55 10/6/2016 20:55 725 774 1536 0 left 0 1 0 NULL 2016-10-06 20:55:00
2882037326 DANAGEEZUS I need a dance break right meow! (•_•) <) )╯all the single ladies / \ (•_•) \( (> all the single ladies / \ (•_•) <) )╯oh oh oh / ,United States English 7/6/2015 15:03 7/6/2015 15:03 3740 9351 1849 NULL Hashtager 0 0 0 NULL 2015-07-06 15:03:00
2577152109 DENN_NIKITIN курточка моя так хорошо сидит на ней \ты пропел эти слова в голове,Unknown Russian 4/15/2017 13:19 4/15/2017 13:19 68 113 6277 0 Russian 0 1 0 NULL 2017-04-15 13:19:00
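In each row above, the content value ends with a comma followed by what looks like the region. The sketch below shows one way such rows might be repaired. It assumes, without confirmation from the source, that an affected row parses to exactly one field fewer than the header and that content and region are the third and fourth columns. The function name `repair_shifted_row` is hypothetical.

```python
CONTENT = 2  # position of the content column in the header quoted above


def repair_shifted_row(fields, expected_len):
    """Split a fused 'content,region' field when a row is one field short."""
    fields = list(fields)
    if len(fields) != expected_len - 1:
        return fields  # row is not short by exactly one field; leave it alone
    content, sep, region = fields[CONTENT].rpartition(",")
    if not sep:
        return fields  # no comma to split on, so this is some other problem
    return fields[:CONTENT] + [content, region] + fields[CONTENT + 1:]


# First affected row from the issue, truncated to six columns for brevity.
broken = [
    "1670762347",
    "ADAMCHAPMANJR",
    "Check out the video I made with @LAEducators to #ThankaLAEducator "
    "https://t.co/N7J70kSDsn,United States",
    "English",
    "10/6/2016 21:04",
]
repaired = repair_shifted_row(broken, expected_len=6)
print(repaired)
```

Splitting on the last comma is a heuristic: a tweet whose own text ends with a comma-separated phrase would be split incorrectly, so the result is worth checking against a list of known region values.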

Irregularities matching authors in the dataset with the 2017/2018 Congressional lists

I’ve made a public Google Sheet that attempts to match up account information from this dataset with the November 2017 and June 2018 lists published by the House Intelligence Committee. Perhaps others will find it helpful for a number of reasons…

  • Every author has multiple tweets. But some info (should) remain constant for a particular author across all their tweets: external_author_id, account_type, account_category and the new_june_2018 flag. This spreadsheet provides a summary of all the authors and their associated properties.
  • The November 2017 list contains user_ids, i.e. external_author_id. This can be used to resolve some of the floating point external_author_ids raised in issue #4. (It can’t be used for accounts that were added in the 2018 list because that PDF doesn’t list user_ids.)
  • The dataset uses all-caps for each author; the lists from Congress retain account names’ original capitalization. It’s often trivial, but the distinction can be semantically meaningful. For example, we can see that the CURTISBIGMAN account from the dataset actually had a Twitter handle of “CurtisBigMan”, not CurtisBigman or CurtIsBigMan.

However, there are two problems that I ran across:

  1. There are 17 authors in the dataset who each have two different external_author_ids. Example: in IRAhandle_tweets_1.csv, there are some rows with author 4MYSQUAD and external_author_id 4036537452. Other rows have the same author 4MYSQUAD, but a different external_author_id 3312143142. (FWIW, the Nov 2017 PDF shows 4MySquad having a user_id of 4036537452.)

    For some authors this appears to be tied to floating point external_author_ids being rounded, such as KRISTINADRUCKER, who has some tweets with external_author_id 7.15893000000e+17 and others with 7.16000000000e+17. But for other authors, such as the 4MYSQUAD example above, the discrepancy doesn't appear to be a rounding issue.

    Is there a legit reason these accounts are associated with two different external_author_ids, or is this a mistake?

  2. There are 5 authors in the dataset that are not listed in either of the PDFs from Congress:

    • BABCHENKOVA_EVA: the 2018 PDF does list the “babchenkova” handle, which also appears in the dataset, but neither PDF lists an account name of BABCHENKOVA_EVA.
    • JENNATRAVELLER: no similar account name is in either PDF.
    • KARUCZ_00: the 2018 PDF does list the “Karucz” handle, which also appears in the dataset, but neither PDF lists an account name of KARUCZ_00.
    • TERRAFORMA: the 2017 and 2018 PDFs do list the “taraformation” handle, which also appears in the dataset, but neither PDF lists an account name of TERRAFORMA.
    • TOURETTESN: no similar account name is in either PDF.

    It’s not clear why tweets from these five accounts are included in the dataset.
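The first problem above can be checked with a short grouping pass. This is a sketch only, assuming rows are dictionaries with `author` and `external_author_id` keys (for example from `csv.DictReader`). The function name is hypothetical, and `EXAMPLE_AUTHOR` with ID `1` is a made-up control row.

```python
from collections import defaultdict


def authors_with_multiple_ids(rows):
    """Return {author: sorted list of IDs} for authors seen with more than one ID."""
    ids_by_author = defaultdict(set)
    for row in rows:
        ids_by_author[row["author"]].add(row["external_author_id"])
    return {
        author: sorted(ids)
        for author, ids in ids_by_author.items()
        if len(ids) > 1
    }


rows = [
    # The two 4MYSQUAD IDs are the ones reported in this issue.
    {"author": "4MYSQUAD", "external_author_id": "4036537452"},
    {"author": "4MYSQUAD", "external_author_id": "3312143142"},
    {"author": "4MYSQUAD", "external_author_id": "4036537452"},
    # Made-up control row with a single ID.
    {"author": "EXAMPLE_AUTHOR", "external_author_id": "1"},
]
conflicts = authors_with_multiple_ids(rows)
print(conflicts)
```

IDs are compared as strings here, so a rounded value such as 7.16000000000e+17 counts as a different ID from 7.15893000000e+17, which matches how the issue counts them.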

I’ve attached a screenshot below which shows all tweet authors affected by either issue. For a copy-pastable version, go to the Google Sheet, then click the Data menu → Filter views… → author problems.

[Screenshot, 2018-08-02 09:43:30: tweet authors affected by either issue]

Should be available via BitTorrent and as a web database that can be queried

First, thanks for making the data available. I was asking about this recently. I would like to get a look at troll tweets, it might help us avoid arguing with them in the future.

However --

I wasn't able to download the file, and this is not a great way to distribute the data. Better would be:

  1. BitTorrent distribution. It was made for data like this. GitHub, not so much.

  2. And it would be wonderful to have this online as a database that can be queried with SQL commands.
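Until a hosted database exists, a local one can serve the same purpose. The sketch below loads a few columns into SQLite using Python's standard `sqlite3` module. The table name `tweets`, the three-column schema, and the function name are assumptions made for illustration, not something the repository provides.

```python
import sqlite3


def load_into_sqlite(rows, conn):
    """Create a small tweets table and insert rows given as dictionaries."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets (author TEXT, language TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO tweets VALUES (:author, :language, :content)", rows
    )
    conn.commit()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
load_into_sqlite(
    [
        {
            "author": "BERLINBOTE",
            "language": "Swedish",
            "content": "Bernd Krömer: Amri-Ausschuss will früheren "
                       "Innenstaatssekretär vernehmen https://t.co/qKqGZyBEma",
        },
        {
            "author": "ADAMCHAPMANJR",
            "language": "English",
            "content": "Check out the video I made with @LAEducators",
        },
    ],
    conn,
)
counts = conn.execute(
    "SELECT language, COUNT(*) FROM tweets GROUP BY language ORDER BY language"
).fetchall()
print(counts)
```

Passing a file path to `sqlite3.connect` instead of `:memory:` would keep the database on disk between sessions.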

I would be happy to help either or both projects, assuming they don't already exist.

Thanks again for uploading the data.

Dave

Quoting -1

You're quoting -1, which means it won't work in int fields without removing the quotes.
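On the consumer side, the quotes can be stripped before conversion. A minimal sketch, assuming that -1 is a placeholder for a missing ID (the source does not confirm this); `parse_author_id` is a hypothetical helper name.

```python
def parse_author_id(raw):
    """Convert a possibly quoted ID string to int; return None for -1 or empty."""
    text = raw.strip().strip("\"'").strip()
    # Assumption: -1 marks a missing ID rather than a real account.
    if text in ("", "-1"):
        return None
    # int() raises ValueError for float-style values such as 7.16e+17,
    # which is deliberate: those IDs have already lost digits.
    return int(text)


print(parse_author_id('"-1"'))
print(parse_author_id('"4036537452"'))
```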

Emojis in MySQL/MariaDB

Has anyone gotten the emojis to work when the data is loaded into a MySQL or MariaDB database? I'm using utf8mb4 encoding and utf8mb4_unicode_ci collation, but only a small portion of the emojis are displaying properly for me.

retweet count?

Do you have the number of times each tweet was retweeted? If so, this would be a valuable addition to this dataset.

Resolving t.co links

It would be great to resolve the t.co links. I did some research, and it looks like you need Twitter Enterprise access:

Will there be a URL shortening or resolution API? There is no resolution API (although Twitter's enterprise products provide URL expansion enrichments). URLs are wrapped automatically at time of Tweet or Direct Message submission. Resolved URLs are only available in the context of Tweet or Direct Message content as part of the entities response.

I've requested access to the endpoint with 2 million calls. If they approve, I'll upload the resolved links to unshorten/deanonymize the t.co endpoints.

If not, this should stay open until someone does it because it seems possible.

@patrick-lee-warren - Full list of user IDs for suspended Twitter bot and troll accounts

Hi Patrick, please post the list with more of the troll User IDs than the shorter list with 454 users below. We'd like to use the longer list to remove bots from research data.
https://github.com/fivethirtyeight/russian-troll-tweets/files/2278803/external_author_id_rounding_fixes.zip

Also, can anyone recommend additional lists of known bot user IDs? We're only using User IDs in our research, not Twitter handles. Thanks for all the clean-up work on the rounded IDs! Death to Excel, long live Microsoft GitHub.

Repo appears to be over its data quota and is not allowing downloads

$ git-lfs smudge -- IRAhandle_tweets.csv
Error downloading object: IRAhandle_tweets.csv (80f883c3b641733035cbb7fc0f7068760710dfc647cd69c99f42706cd490b68b)

Smudge error: Error downloading 80f883c3b641733035cbb7fc0f7068760710dfc647cd69c99f42706cd490b68b: batch response: http: This repository is over its data quota. Purchase more data packs to restore access.
Docs: https://help.github.com/articles/purchasing-additional-storage-and-bandwidth-for-an-organization/: batch response: http: This repository is over its data quota. Purchase more data packs to restore access.
Docs: https://help.github.com/articles/purchasing-additional-storage-and-bandwidth-for-an-organization/

character encoding for content column

seems the character encoding is off for the "content" column. special chars are showing up as weird text:

Ð?Ñ?иÑ?ина #118 ЯÑ?оваÑ� пÑ?оÑ�иÑ? Ð?аÑ�илÑ?евÑ? Ñ�оздаÑ?Ñ? меÑ?од длÑ� Ñ?Ñ?иÑ?елей по вÑ?Ñ�влениÑ? инÑ?еÑ?неÑ?-завиÑ�имÑ?Ñ? деÑ?ей

convert to UTF-8 maybe?

Save the community some typing

I know this is an abuse of "issues" but it doesn't warrant a full repo. Here is some Python code you can cut/paste

class Rutweet:
    def __init__(self, external_author_id, author, content,
                region, language, publish_date, harvested_date,
                following, followers, updates, post_type, account_type,
                retweet, account_category, new_june_2018):
    
        self.external_author_id = external_author_id
        self.author = author
        self.content = content
        self.region = region
        self.language = language
        self.publish_date = publish_date
        self.harvested_date = harvested_date
        self.following = following
        self.followers = followers
        self.updates = updates
        self.post_type = post_type
        self.account_type = account_type
        self.retweet = retweet
        self.account_category = account_category
        self.new_june_2018 = new_june_2018

And a quick loader.

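import csv

def load_tweets(fn):
    """Load every row of one CSV file into a list of Rutweet objects."""
    tweets = []
    # newline='' lets the csv module handle quoted fields that contain
    # commas or line breaks, which a plain line.split(',') would break on.
    with open(fn, 'r', newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for fields in reader:
            if len(fields) < 15:
                continue  # skip malformed rows with missing columns
            # The first 15 columns match Rutweet's constructor arguments in order.
            tweets.append(Rutweet(*fields[:15]))
    return tweets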

Time Zone(s)?

Is there a time zone associated with publish_date and harvested_date (e.g. U.S. Eastern Time)? Thanks for making these data available!
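Since the time zone is not stated, one cautious approach is to parse the values as naive datetimes and attach a zone only once it is confirmed. The sketch below assumes the month/day/year layout seen in rows quoted on this page (for example 10/6/2016 21:04). The format string and function name are assumptions.

```python
from datetime import datetime

# Assumed layout: month/day/year hour:minute, without zero padding.
DATE_FORMAT = "%m/%d/%Y %H:%M"


def parse_timestamp(text):
    """Parse a publish_date or harvested_date string into a naive datetime."""
    return datetime.strptime(text.strip(), DATE_FORMAT)


published = parse_timestamp("10/6/2016 21:04")
harvested = parse_timestamp("10/6/2016 21:07")
print(published.isoformat(), published.tzinfo)
print(harvested - published)
```

Differences between two timestamps from the same row, such as the harvest delay above, are meaningful even without knowing the zone, provided both columns use the same one.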

Duplicate tweets (by tweet_id) & Integrity.

Love the postgres loader, but we're going to try to keep this repo clean and simple, so I won't merge it into master. Glad to see some cool forks with great new features and tools for analysis though. Thank you for your work.

But this is the problem: your repo isn't clean. It's massive, and it's very hard to account for integrity errors. Take duplicate tweets, for instance. I cleaned them all up, and #29 has the commit fb59797.

Github won't render it but try,

git diff fb5979762dca592109f919e4c805d0fb985aa9a9 fb5979762dca592109f919e4c805d0fb985aa9a9^

Of course, it's easy to tell you where the integrity violations are when you're self-hosting and have a simple system to ensure integrity. Now you still have to remove the duplicates again, and after you do that you'll have to push up a totally new copy of the data files.

If you need examples look for tweet IDs,

psql:load.psql:11: ERROR:  duplicate key value violates unique constraint "tweets_pkey"
DETAIL:  Key (tweet_id)=(612233279027064832) already exists.
CONTEXT:  COPY tweets, line 99245
COPY 233540
psql:load.psql:13: ERROR:  duplicate key value violates unique constraint "tweets_pkey"
DETAIL:  Key (tweet_id)=(626060785005953025) already exists.
CONTEXT:  COPY tweets, line 20505
psql:load.psql:14: ERROR:  duplicate key value violates unique constraint "tweets_pkey"
DETAIL:  Key (tweet_id)=(669263743784583168) already exists.
CONTEXT:  COPY tweets, line 200111
psql:load.psql:15: ERROR:  duplicate key value violates unique constraint "tweets_pkey"
DETAIL:  Key (tweet_id)=(614858663417802752) already exists.
CONTEXT:  COPY tweets, line 103929

The schema, to ensure this never happens, is tucked away in a little folder for anyone who wants to use it and is 40k.
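For readers not using the Postgres schema, the same uniqueness check can be approximated in memory. This is a sketch that keeps the first row seen for each tweet_id and reports the rest. The function name and the `note` field in the sample are hypothetical, and the sketch is not the loader referenced above.

```python
def drop_duplicate_tweets(rows, key="tweet_id"):
    """Return (unique_rows, duplicate_keys), keeping the first row per key."""
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        value = row[key]
        if value in seen:
            duplicates.append(value)
        else:
            seen.add(value)
            unique.append(row)
    return unique, duplicates


rows = [
    # IDs taken from the error log above; the repeated row is illustrative.
    {"tweet_id": "612233279027064832", "note": "first copy"},
    {"tweet_id": "626060785005953025", "note": "only copy"},
    {"tweet_id": "612233279027064832", "note": "second copy"},
]
unique, duplicates = drop_duplicate_tweets(rows)
print(len(unique), duplicates)
```

Keeping the IDs as strings avoids the float rounding discussed in other issues, which could otherwise make distinct tweets look like duplicates.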

Data is double-encoded

The data is double-encoded to UTF8. For example, line 5 of the first file contains, in part (non-ASCII bytes represented by \xhh escape sequences to help readability):

#StandForOurAnthem\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8

Each pair of those bytes is a UTF-8 sequence that encodes a single byte of the real UTF-8 data; decoding them gives us:

#StandForOurAnthem\xf0\x9f\x87\xba\xf0\x9f\x87\xb8

which in turn can be decoded as UTF-8 to the text #StandForOurAnthem🇺🇸.

This double-encoding makes the files needlessly bigger and harder to work with.
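The two decoding steps described above can be reproduced in Python. The sketch assumes the extra layer behaves like Latin-1 (each original byte was treated as one character before being re-encoded), which holds for the example bytes quoted here. Fields that do not fit that assumption are returned unchanged. `undo_double_encoding` is a hypothetical helper name.

```python
def undo_double_encoding(text):
    """Reverse one extra layer of UTF-8 encoding, or return text unchanged."""
    try:
        # Map each character back to the byte it stood for, then decode again.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not double-encoded in the assumed way; leave as-is


# Bytes quoted in this issue from line 5 of the first file.
raw = (
    b"#StandForOurAnthem"
    b"\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8"
)
decoded_once = raw.decode("utf-8")
fixed = undo_double_encoding(decoded_once)
print(fixed)
```

Applying the function per field rather than to a whole file means a single field that cannot be converted does not abort the run.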

external_author_id and author relationship

This field is confusing to me: some external_author_id values have multiple authors.

SELECT distinct external_author_id , author
FROM rustweets.tweets WHERE external_author_id = 753000000000000000;
 external_author_id |     author      
--------------------+-----------------
 753000000000000000 | ANGELABACH991
 753000000000000000 | ANGELA_LATTKE
 753000000000000000 | BECKRALFBECK265
 753000000000000000 | CHRISTINAPOOL61
 753000000000000000 | DARRELL_H_HUNT
 753000000000000000 | DOMINIKKELLER22
 753000000000000000 | EHERMANN66
 753000000000000000 | ERIKADIXONLOVE
 753000000000000000 | JOACHIMBUCHWITZ
 753000000000000000 | LARSWOLFLARS
 753000000000000000 | LGBTUNI
 753000000000000000 | LUISSTOCKBERG
 753000000000000000 | MALTE_ROSS
 753000000000000000 | MANUELKROSSS
 753000000000000000 | MARGARETHKURZ
 753000000000000000 | MARMARSCH1
 753000000000000000 | PETERSCHULZ541
(17 rows)
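An ID such as 753000000000000000 has a long run of trailing zeros, which suggests the original value was rounded before export and would explain why unrelated authors share it. The heuristic below flags such values. The threshold of six zeros and the function name are arbitrary assumptions, and a genuine ID could in principle end in zeros.

```python
def looks_rounded(external_author_id, min_trailing_zeros=6):
    """Heuristically flag IDs that were probably rounded before export."""
    text = str(external_author_id).strip()
    if "e" in text.lower():
        return True  # scientific notation: low-order digits are likely lost
    digits = text.lstrip("-")
    if not digits.isdigit():
        return False
    trailing_zeros = len(digits) - len(digits.rstrip("0"))
    return trailing_zeros >= min_trailing_zeros


print(looks_rounded("753000000000000000"))
print(looks_rounded("4036537452"))
print(looks_rounded("7.15893000000e+17"))
```

Rows flagged this way are better matched on the author column than on external_author_id.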

Time Zone

The time zone of the date-time fields is not mentioned.
Is it UTC or adjusted?

Knowing this would help in correlating tweets with global geopolitical events and with workday cycles in the US and Russia.

Thanks.
