alexlitel / congresstweets Goto Github PK


102.0 7.0 39.0 1.11 GB

Datasets of the daily Twitter output of Congress.

License: MIT License

Ruby 4.67% HTML 37.75% SCSS 57.58%

congress tweets twitter usa senate house house-of-representatives

congresstweets's People

jamesmonette mich4916 fjccoin kmjohnson fushu18 tkdgirlbb emckay dunovank datatroy ljanastas citrus03 cfs9805804 geol1729 amanpanjwani stiles ttruty gregbhutchins stenke lizhong-liu likelyt nmpatterson22 curtissalinger younggns medeiros-erika andrewrcharron wethepeopleonline davidm-mcgrath mattpaz jamison413 prashgarg dhanushapathakota chasen-jeffries abono2000 sun-yixiao surya-guiyu leungkp benjamin-m-gold tamirlibel aurelieschool

congresstweets's Issues

Empty data for 2020-09-01 / 2020-09-02

Hi,

Thanks for building this! We have a daily Airflow job that grabs the raw JSON files from this repo and ingests them into the https://www.splitgraph.com/splitgraph/congress_tweets dataset (making it importable into Postgres and queryable over a REST API or SQL).

I noticed that https://raw.githubusercontent.com/alexlitel/congresstweets/master/data/2020-09-01.json and https://raw.githubusercontent.com/alexlitel/congresstweets/master/data/2020-09-02.json are empty -- is this supposed to happen?
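For anyone running a similar ingestion job, a check along these lines could flag empty daily files before they reach the database. This is a minimal sketch: the URL pattern is copied from the two links above, while the helper names `daily_url` and `is_empty_payload` are made up for illustration, and the download step is left out so the check works on any string you have already fetched.

```python
import json
from datetime import date

# URL pattern copied from the two links quoted in this issue.
RAW_BASE = "https://raw.githubusercontent.com/alexlitel/congresstweets/master/data"


def daily_url(day: date) -> str:
    """Build the raw-file URL for one day (hypothetical helper)."""
    return f"{RAW_BASE}/{day.isoformat()}.json"


def is_empty_payload(body: str) -> bool:
    """Return True when a downloaded body holds no tweets.

    Treated as empty: a blank body, an empty JSON array, an empty
    JSON string, or JSON null. Malformed JSON is NOT reported as
    empty, so that it can be handled as a separate problem.
    """
    stripped = body.strip()
    if not stripped:
        return True
    try:
        parsed = json.loads(stripped)
    except json.JSONDecodeError:
        return False
    return parsed in ([], "", None)


if __name__ == "__main__":
    for day in (date(2020, 9, 1), date(2020, 9, 2)):
        print(daily_url(day))
```

Keeping the emptiness check separate from the download makes it easy to skip or retry a day in the scheduler instead of ingesting zero rows silently.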

ID for retweets

Hi,
Nice resource. I'm a bit puzzled about what the id is supposed to represent in the dataset. In many cases (I think this is the case for retweets), the id doesn't match the number in the link. That sort of makes sense, since the link is the link to the original tweet, not the retweet. Still, I found it a bit odd, since I expected the link to go to the legislator's timeline, not the timeline of the person who originally tweeted it.

The reason I ask is that I'm trying to hydrate the full tweets for the tweet IDs in your dataset. I just tried a batch and only got about half (30,000 out of 60,000). People do delete tweets, and deleted tweets can't be collected during hydration, but that's a much higher percentage of missing tweets than I've ever seen before. I'm not entirely sure yet which ones are missing: I started looking at the IDs and the links and got confused as to how the dataset was put together.
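One way to see which records are affected is to compare each record's id with the status number embedded in its link. The sketch below assumes each record is a dict with `id` and `link` keys, which is inferred from the wording of this issue rather than from a documented schema; the sample IDs and URLs are invented.

```python
import re
from typing import Optional

# Matches the numeric status ID in URLs such as
# https://twitter.com/<user>/status/<digits>
STATUS_ID = re.compile(r"/status(?:es)?/(\d+)")


def link_status_id(link: str) -> Optional[str]:
    """Return the status ID found in a tweet URL, or None."""
    match = STATUS_ID.search(link)
    return match.group(1) if match else None


def id_matches_link(record: dict) -> bool:
    """True when the record's id equals the ID in its link.

    Assumes 'id' and 'link' keys (hypothetical field names).
    """
    return str(record.get("id")) == link_status_id(record.get("link") or "")


def split_by_match(records):
    """Partition records into (id matches link, id differs from link)."""
    same, different = [], []
    for record in records:
        (same if id_matches_link(record) else different).append(record)
    return same, different


if __name__ == "__main__":
    sample = [  # invented values, for illustration only
        {"id": "111", "link": "https://www.twitter.com/member/statuses/111"},
        {"id": "222", "link": "https://www.twitter.com/someone/statuses/999"},
    ]
    same, different = split_by_match(sample)
    print(len(same), "matching;", len(different), "differing")
```

Hydrating the two groups separately would show whether the missing half is concentrated in the records whose id and link disagree.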

GitHub Action has not run for the past two days

Are the tweets no longer updated?

why remove 2017 data?

hi @alexlitel, I was wondering why you removed 2017 data in this commit? 4b11913

also, have you considered mirroring to a public s3 bucket if there are any concerns with data size? the cost should be minimal

Data Missing 2020-10-11 to current

The files for 2020-10-11 through 2020-10-17 contain only an empty string. It looks like the scraper is not pulling information successfully.

metadata and tweets

Hello,
I am interested in analyzing only Congress members' tweets. I realized that not all the info in the metadata file (users-filtered.json) matches the Twitter data.

Can you suggest a reliable way to match Congress metadata info (e.g., party affiliation) to the tweets data? Screen names are not always the same in both data sources.

Thanks!
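As a starting point, a case-insensitive join on screen name at least removes mismatches caused by capitalization, and it reports the tweets that still fail to match so they can be inspected by hand. This is only a sketch under loud assumptions: it pretends both sources are flat lists of dicts with `screen_name` keys and that the metadata carries a `party` key. The real layout of users-filtered.json may differ and would need flattening first. If both sources expose a numeric account ID, joining on that is likely more robust, since screen names can change while account IDs do not.

```python
def build_index(members, key="screen_name"):
    """Map lower-cased screen names to metadata records.

    'members' is assumed to be a flat list of dicts; 'key' is a
    hypothetical field name, not a documented one.
    """
    index = {}
    for member in members:
        name = member.get(key)
        if name:
            index[name.casefold()] = member
    return index


def attach_party(tweets, index):
    """Return (matched tweets with a 'party' field added, unmatched tweets)."""
    matched, unmatched = [], []
    for tweet in tweets:
        name = (tweet.get("screen_name") or "").casefold()
        member = index.get(name)
        if member is None:
            unmatched.append(tweet)
        else:
            matched.append({**tweet, "party": member.get("party")})
    return matched, unmatched


if __name__ == "__main__":
    members = [{"screen_name": "RepExample", "party": "I"}]  # invented
    tweets = [
        {"screen_name": "repexample", "text": "hello"},
        {"screen_name": "unknown_account", "text": "hi"},
    ]
    matched, unmatched = attach_party(tweets, build_index(members))
    print(len(matched), "matched;", len(unmatched), "unmatched")
```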

Clarification on URL Formatting in Tweet Data Samples

I really appreciate your incredible work on this project. I have a question about the data. In the randomly drawn samples of tweets I reviewed, all the URLs in the posts were expanded (not shortened). To double-check, could I ask whether URLs were expanded systematically during data collection, or whether I am only seeing selected cases? I tried to find the answer in the README and the source code but couldn't. I apologize if I overlooked it, and I would be grateful if you could let me know.
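Until the collection logic is confirmed, the question can be checked empirically over the whole dataset rather than a sample. The sketch below counts tweets whose text still contains a link on a known shortener host. It assumes the tweet body lives under a `text` key, and the host list is a small illustrative set, not an exhaustive one.

```python
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")

# Illustrative, non-exhaustive set of shortener hosts.
SHORTENER_HOSTS = {"t.co", "bit.ly", "ow.ly", "tinyurl.com"}


def shortened_urls(text: str) -> list:
    """Return the URLs in 'text' whose host is a known shortener."""
    found = []
    for url in URL_PATTERN.findall(text):
        try:
            host = (urlparse(url).hostname or "").lower()
        except ValueError:
            continue  # skip strings that only look like URLs
        if host in SHORTENER_HOSTS:
            found.append(url)
    return found


def count_with_short_links(tweets) -> int:
    """Count tweets containing at least one shortened URL.

    Assumes the body is under a 'text' key (hypothetical name).
    """
    return sum(1 for tweet in tweets if shortened_urls(tweet.get("text") or ""))


if __name__ == "__main__":
    sample = [  # invented examples
        {"text": "Read more: https://www.example.com/full/article"},
        {"text": "Watch live https://t.co/abc123"},
    ]
    print(count_with_short_links(sample), "of", len(sample), "contain short links")
```

A count of zero across all files would suggest systematic expansion; a small nonzero count would point to exceptions worth listing.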

Error in 2018-01-08.json

Hi, the first entry in the 2018-01-08.json file is a single open bracket "[". It's hard to tell whether this means some data is missing here or whether there was simply a parsing error. Do you know what might have caused this and whether data might be missing at that spot?
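A quick structural check can separate the two possibilities: either the file fails to parse at all, or it parses but contains entries that are not tweet objects (such as a stray "[" string). The sketch below assumes a well-formed daily file is a JSON array of objects; the function name `inspect_daily_file` is made up for illustration.

```python
import json


def inspect_daily_file(body: str) -> dict:
    """Report whether a daily file parses and which entries look wrong.

    Assumes a well-formed file is a JSON array of objects.
    Returns a dict with 'valid_json', 'error' and 'bad_entries',
    where 'bad_entries' lists (index, value) pairs for every
    array element that is not a JSON object.
    """
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError as err:
        location = f"line {err.lineno}, column {err.colno}"
        return {
            "valid_json": False,
            "error": f"{err.msg} at {location}",
            "bad_entries": [],
        }
    if not isinstance(parsed, list):
        return {
            "valid_json": True,
            "error": "top level is not an array",
            "bad_entries": [],
        }
    bad = [(i, entry) for i, entry in enumerate(parsed) if not isinstance(entry, dict)]
    return {"valid_json": True, "error": None, "bad_entries": bad}


if __name__ == "__main__":
    # Invented body that imitates the symptom described in this issue.
    report = inspect_daily_file('["[", {"id": "1", "text": "hello"}]')
    print(report["bad_entries"])
```

If only the first entry is flagged and the rest are objects, the bracket is more likely a write artifact than a sign of a truncated file, though comparing the tweet count with neighbouring days would be a useful second check.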

https://alexlitel.github.io/congresstweets/data/2017-09-14.json

Tweets from 2021-06-13 to 2021-06-22 are missing

data license

Problem with Page 2

json file error
