alexlitel / congresstweets

Stars: 102 · Watchers: 7 · Forks: 39 · Size: 1.11 GB

Datasets of the daily Twitter output of Congress.

License: MIT License

Ruby 4.67% HTML 37.75% SCSS 57.58%
congress tweets twitter usa senate house house-of-representatives

congresstweets's Issues

Empty data for 2020-09-01 / 2020-09-02

Hi,

Thanks for building this! We have a daily Airflow job that grabs the raw JSON files from this repo and ingests them into the https://www.splitgraph.com/splitgraph/congress_tweets dataset (making it importable into Postgres and queryable over a REST API or SQL).

I noticed that https://raw.githubusercontent.com/alexlitel/congresstweets/master/data/2020-09-01.json and https://raw.githubusercontent.com/alexlitel/congresstweets/master/data/2020-09-02.json are empty -- is this supposed to happen?
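A check like this could run at the start of an ingestion job so that empty days are flagged before they reach a downstream table. This is a minimal sketch: the URL pattern is copied from the two links above, while `daily_url`, `is_empty_payload`, `find_empty_days`, and `fetch_text` are hypothetical helper names, not part of the repo.

```python
import json
from datetime import date, timedelta
from urllib.request import urlopen

# URL pattern taken from the two links quoted in the issue.
BASE_URL = "https://raw.githubusercontent.com/alexlitel/congresstweets/master/data"


def daily_url(day: date) -> str:
    """Build the raw-file URL for one day's JSON dump."""
    return f"{BASE_URL}/{day.isoformat()}.json"


def is_empty_payload(text: str) -> bool:
    """True if the payload is blank or parses to an empty JSON value."""
    stripped = text.strip()
    if not stripped:
        return True
    try:
        parsed = json.loads(stripped)
    except json.JSONDecodeError:
        # Malformed is not the same as empty; report it separately.
        return False
    if parsed is None:
        return True
    return isinstance(parsed, (list, dict)) and len(parsed) == 0


def fetch_text(url: str) -> str:
    """Download a URL and decode it as UTF-8 (performs a network request)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def find_empty_days(start: date, end: date, fetch=fetch_text) -> list:
    """Return every day in [start, end] whose file is empty."""
    empty = []
    day = start
    while day <= end:
        if is_empty_payload(fetch(daily_url(day))):
            empty.append(day)
        day += timedelta(days=1)
    return empty
```

Passing `fetch` as a parameter keeps the emptiness check testable without network access; calling `find_empty_days(date(2020, 9, 1), date(2020, 9, 2))` would download the two files mentioned above.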

ID for retweets

Hi,
Nice resource. I'm a bit puzzled about what the id is supposed to represent in the dataset. In many cases (I think this is the case for retweets), the id doesn't match the number in the link. That sort of makes sense, since the link is the link for the original tweet, not the retweet. Still, I found that a bit odd, since I expected the link to go to the legislator's timeline, not the timeline of the person who originally tweeted it.

The reason I ask is that I'm trying to hydrate the full tweets for the tweet IDs in your dataset. I just tried a batch, and I only got about half (30,000 out of 60,000). People do delete tweets, and deleted tweets can't be collected during hydration, but that's a much higher percentage of missing tweets than I've ever seen before. I'm not entirely sure yet which ones are missing: I started looking at the IDs and the links and got confused as to how the dataset was put together.
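One way to narrow down which records are affected is to split the dataset by whether the id agrees with the status number embedded in the link. The sketch below assumes each record is a dict with `id` and `link` keys, as described above; the helper names and the URL shape matched by the regular expression are assumptions, not something the repo documents.

```python
import re

# Assumed link shape: ".../status/<digits>" (also tolerates ".../statuses/<digits>").
STATUS_ID = re.compile(r"/status(?:es)?/(\d+)")


def link_status_id(link):
    """Return the numeric status ID found in a tweet link, or None."""
    match = STATUS_ID.search(link or "")
    return match.group(1) if match else None


def split_by_id_match(records):
    """Separate records whose id equals the link's status ID from the rest."""
    matching, mismatched = [], []
    for rec in records:
        linked = link_status_id(rec.get("link"))
        if linked is not None and linked == str(rec.get("id")):
            matching.append(rec)
        else:
            mismatched.append(rec)
    return matching, mismatched
```

Hydrating the two groups separately would show whether the missing half is concentrated in the mismatched (presumably retweet) records.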

why remove 2017 data?

Hi @alexlitel, I was wondering why you removed the 2017 data in this commit: 4b11913?

Also, have you considered mirroring to a public S3 bucket if there are any concerns about data size? The cost should be minimal.

metadata and tweets

Hello,
I am interested in analyzing only Congress members' tweets. I realized that not all the info in the metadata file (users-filtered.json) matches the Twitter data.

Can you suggest a reliable way to match Congress metadata info (e.g., party affiliation) to the tweets data? Screen names are not always the same in both data sources.

Thanks!
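As a starting point, a case-insensitive join on screen name at least separates the tweets that match from those that need manual review. This is only a sketch under loud assumptions: it treats the metadata as a flat list of dicts with hypothetical `screen_name` and `party` keys, and the tweets as dicts with a `screen_name` key. The real layout of users-filtered.json may differ, and this approach cannot recover accounts whose handle changed.

```python
def build_lookup(accounts):
    """Index account metadata by case-folded screen name (hypothetical keys)."""
    return {
        acct["screen_name"].casefold(): acct
        for acct in accounts
        if acct.get("screen_name")
    }


def attach_party(tweets, accounts):
    """Return (matched, unmatched); matched tweets gain a 'party' field."""
    lookup = build_lookup(accounts)
    matched, unmatched = [], []
    for tweet in tweets:
        key = str(tweet.get("screen_name", "")).casefold()
        account = lookup.get(key)
        if account is None:
            unmatched.append(tweet)  # keep for manual review
        else:
            matched.append({**tweet, "party": account.get("party")})
    return matched, unmatched
```

Inspecting the `unmatched` list shows how many tweets the name-based join misses; if both sources expose a stable numeric user ID, joining on that would likely be more reliable than screen names.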

Clarification on URL Formatting in Tweet Data Samples

I really appreciate your incredible work on this project. I have a question about the data. In the randomly drawn samples of tweets I reviewed, all the URLs in the posts were expanded (not shortened). To double-check: was this done systematically during data collection, or am I only seeing selective cases? I tried to find the answer in the README and the source code but couldn't. I apologize if I overlooked it, and I would be grateful if you could let me know.
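Rather than relying on random samples, the whole dataset could be scanned for links that still point at a shortener. The sketch below only inspects a tweet's text string; the helper name and the choice of `t.co` as the only shortener host are assumptions, and the set can be extended.

```python
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")
# Assumption: only Twitter's own t.co wrapper counts as "shortened" here.
SHORTENER_HOSTS = {"t.co"}


def shortened_urls(text):
    """Return the URLs in text whose host is a known shortener."""
    found = []
    for raw in URL_PATTERN.findall(text):
        host = (urlparse(raw).hostname or "").lower()
        if host in SHORTENER_HOSTS:
            found.append(raw)
    return found
```

Counting how many tweets return a non-empty list would show whether expansion was applied across the board or only in some cases.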

Error in 2018-01-08.json

Hi, the first entry in the 2018-01-08.json file is a single open bracket "[". It's hard to tell whether this means some data is missing or whether there was simply a parsing error. Do you know what might have caused this, and whether data might be missing in that spot?
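Until the cause is known, a loader can separate well-formed records from stray entries like this one instead of failing on them. This sketch assumes the file is a top-level JSON array whose valid entries are objects; `split_records` is a hypothetical helper name.

```python
import json


def split_records(raw_text):
    """Parse a daily file and separate dict records from stray entries.

    Returns (records, stray), where stray is a list of (index, value)
    pairs for every entry that is not a JSON object.
    """
    entries = json.loads(raw_text)
    if not isinstance(entries, list):
        raise ValueError("expected a top-level JSON array")
    records = [entry for entry in entries if isinstance(entry, dict)]
    stray = [
        (index, entry)
        for index, entry in enumerate(entries)
        if not isinstance(entry, dict)
    ]
    return records, stray
```

If the file is truncated rather than containing a stray string, `json.loads` raises `json.JSONDecodeError` instead, which distinguishes the two failure modes.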

data license

Thanks for the repo.

I understand that the code is under the MIT license; is the data in the public domain?
