Comments (11)
I think I've noticed unshorten writing empty lines sometimes. Is that what your invalid lines look like?
from twarc.
Here is one (sorry it took a while, I couldn't get awk or sed to print out a specific line number from this file):
$ head -9766769 all-tweets-unshortened-urls-20150129.json | tail -1
{"contributors": null, "truncated": false, "text": "RT @RaphBotts: Sur la place d'arme il y a un monde c'est du jamais vu \ud83d\ude31 #JesuisCharlie", "in_reply_to_status_id": null, "id": 552955311666253825, "favorite_count": 0, "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [{"indices": [3, 13], "id_str": "798216505", "screen_name": "RaphBotts", "name": "Raph.", "id": 798216505}], "hashtags": [{"indices": [72, 86], "text": "JesuisCharlie"}], "urls": []}, "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "retweet_count": 9, "id_str": "552955311666253825", "favorited": false, "retweeted_status": {"contributors": null, "truncated": false, "text": "Sur la place d'arme il y a un monde c'est du jamais vu \ud83d\ude31 #JesuisCharlie", "in_reply_to_status_id": null, "id": 552914408524226560, "favorite_count": 5, "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [], "hashtags": [{"indices": [57, 71], "text": "JesuisCharlie"}], "urls": []}, "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "retweet_count": 9, "id_str": "552914408524226560", "favorited": false, "user": {"follow_request_sent": false, "profile_use_background_image": true, "profile_text_color": "333333", "id": 798216505, "verified": false, "profile_location": null, "profile_image_url_https": "https://pbs.twimg.com/profile_images/551383654770159616/9ljxadjS_normal.jpeg", "profile_sidebar_fill_color": "DDEEF6", "contributors_enabled": false, "entities": {"url": {"urls": [{"url": "http://t.co/YRVxCwwRdm", "indices": [0, 22], "expanded_url": "http://www.facebook.com/raphael.bottreau", "display_url": "facebook.com/raphael.bottre\u2026"}]}, "description": {"urls": []}}, "followers_count": 406, 
"profile_sidebar_border_color": "000000", "location": "Poitiers city / \u00cele de R\u00e9", "default_profile_image": false, "id_str": "798216505", "is_translation_enabled": false, "utc_offset": 3600, "statuses_count": 11858, "description": "Kim. La Clic. La Famille.", "friends_count": 308, "profile_link_color": "0084B4", "profile_image_url": "http://pbs.twimg.com/profile_images/551383654770159616/9ljxadjS_normal.jpeg", "notifications": false, "geo_enabled": false, "profile_background_color": "C0DEED", "profile_banner_url": "https://pbs.twimg.com/profile_banners/798216505/1419937479", "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/510775501359570944/tCPBY6MM.jpeg", "name": "Raph.", "lang": "fr", "following": false, "profile_background_tile": true, "favourites_count": 855, "screen_name": "RaphBotts", "url": "http://t.co/YRVxCwwRdm", "created_at": "Sun Sep 02 12:56:22 +0000 2012", "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/510775501359570944/tCPBY6MM.jpeg", "time_zone": "Paris", "protected": false, "default_profile": false, "is_translator": false, "listed_count": 1}, "geo": null, "in_reply_to_user_id_str": null, "lang": "fr", "created_at": "Wed Jan 07 19:47:22 +0000 2015", "in_reply_to_status_id_str": null, "place": null, "metadata": {"iso_language_code": "fr", "result_type": "recent"}}, "user": {"follow_request_sent": false, "profile_use_background_image": true, "profile_text_color": "618238", "id": 280417651, "verified": false, "profile_location": null, "profile_image_url_https": "https://pbs.twimg.com/profile_images/551495029530066944/MqxgVTtE_normal.jpeg", "profile_sidebar_fill_color": "060A00", "contributors_enabled": false, "entities": {"url": {"urls": [{"url": "http://t.co/3VPPJFFK5v", "indices": [0, 22], "expanded_url": "http://instagram.com/ludivinee_", "display_url": "instagram.com/ludivinee_"}]}, "description": {"urls": []}}, "followers_count": 222, "profile_sidebar_border_color": 
"FFFFFF", "location": "France", "default_profile_image": false, "id_str": "280417651", "is_translation_enabled": false, "utc_offset": 3600, "statuses_count": 6609, "description": ".", "friends_count": 212, "profile_link_color": "E6A84A", "profile_image_url": "http://pbs.twimg.com/profile_images/551495029530066944/MqxgVTtE_normal.jpeg", "notifications": false, "geo_enabled": false, "profile_background_color": "000000", "profile_banner_url": "https://pbs.twimg.com/profile_banners/280417651/1412534415", "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/378800000086764942/264e232fb90bd5297dfb2d9ff5b4db62.jpeg", "name": "Ukuthula \u270f", "lang": "fr", "following": false, "profile_background_tile": true, "favourites_count": 1146, "screen_name": "_Ludivinee", "url": "http://t.co/3VPPJFFK5v", "created_at": "Mon Apr 11 08:46:38 +0000 2011", "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/378800000086764942/264e232fb90bd5297dfb2d9ff5b4db62.jpeg", "time_zone": "Paris", "protected": false, "default_profile": false, "is_translator": false, "listed_count": 1}, "geo": null, "in_reply_to_user_id_str": null, "lang": "fr", "created_at": "Wed Jan 07 22:29:54 +0000 2015", "in_reply_to_status_id_str": null, "place": null, "metadata": {"iso_language_code": "fr", "result_type": "recent"}}
I can edit this comment and add more examples if need be.
from twarc.
I'm confused, that's valid JSON :-) Perhaps just update validate.py to output the JSON it failed to parse in the log message?
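A rough sketch of that idea, assuming one JSON document per line. This is not twarc's actual validate.py, and `find_invalid_lines` is a hypothetical name. It reports the 1-based line number together with the `repr()` of the offending line, so a blank line shows up as `'\n'` instead of as nothing at all:

```python
import json
from typing import Iterable, Iterator, Tuple


def find_invalid_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, raw_line) for every line that fails to parse as JSON."""
    for number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            yield number, line


# Hypothetical sample input: the second line is blank.
sample = ['{"id": 1}\n', '\n', '{"id": 2}\n']
for number, line in find_invalid_lines(sample):
    # repr() makes whitespace-only lines visible in the log message
    print(f"line {number}: invalid JSON: {line!r}")
```

The same function should work on an open file handle, since iterating over a text file yields one line at a time.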
from twarc.
...it is! I wonder if it printed the wrong line.
When I try to print a specific line from the sample validate.py output above with awk, sed, or grep, I just get a blank line back. I wonder if that has something to do with it?
...and it looks like head and tail just did the same.
[nruest@gorille:all]$ head -9413314 all-tweets-unshortened-urls-20150129.json | tail -1
[nruest@gorille:all]$
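One way to tell "the tool printed nothing" apart from "the line really is blank" is to print the line's `repr()`. A small sketch, where `get_line` is a hypothetical helper and the sample data is made up:

```python
from itertools import islice
from typing import Iterable, Optional


def get_line(lines: Iterable[str], number: int) -> Optional[str]:
    """Return line `number` (1-based, must be >= 1), or None if the input is shorter."""
    return next(islice(lines, number - 1, number), None)


# Hypothetical sample input: the second line is blank.
sample = ['{"id": 1}\n', '\n', '{"id": 2}\n']
print(repr(get_line(sample, 2)))  # a blank line prints as '\n'
```

Passing an open file handle instead of a list avoids loading the whole file, because `islice` only reads up to the requested line.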
from twarc.
Yes, I think that the invalid json line is simply a blank line. That's what I've seen in the past with unshorten.py's output. Thanks!
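If blank lines turn out to be the only problem, they could be dropped before validating. A minimal sketch, again assuming one JSON document per line; `drop_blank_lines` is a hypothetical helper, not part of twarc:

```python
from typing import Iterable, Iterator


def drop_blank_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that contain something other than whitespace."""
    for line in lines:
        if line.strip():
            yield line


# Hypothetical sample input with one empty and one whitespace-only line.
sample = ['{"id": 1}\n', '\n', '   \n', '{"id": 2}\n']
cleaned = list(drop_blank_lines(sample))
print(len(sample) - len(cleaned), "blank lines removed")
```

This only hides the symptom; it does not explain why the blank lines were written in the first place.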
from twarc.
Well, now I feel silly for thinking all these *nix tools weren't behaving correctly for over a day 😄
from twarc.
When you get a chance try running the data through unshorten.py again and see if it is improved?
from twarc.
I'll fire it up now!
from twarc.
Reopening; I'm not sure my fix worked. I will verify on a small dataset :-)
from twarc.
Cool. I'll wait until you verify.
from twarc.
Ok, now give it a try :-)
from twarc.