taspinar / twitterscraper
Scrape Twitter for Tweets
License: MIT License
Sir,
I am getting the data, but the number of retweets and likes is always shown as zero. Why is that? I would also like to know whether there is a way to extract the tweets of a specific person by username.
Is it possible to get more attributes like number of retweets, replies, and favorites? This is a feature request I guess
Would be nice to be able to use this in python 3.
(pythonclock.org :))
I get the following error when trying to use this.
Installed in a venv via pip
Traceback (most recent call last):
File "collector.py", line 1, in <module>
import twitterscraper
File "/home/m0hawk/Documents/dev/TUHH/testvenv/lib/python3.5/site-packages/twitterscraper/__init__.py", line 13, in <module>
from twitterscraper.query import query_tweets
File "/home/m0hawk/Documents/dev/TUHH/testvenv/lib/python3.5/site-packages/twitterscraper/query.py", line 14, in <module>
from tweet import Tweet
ImportError: No module named 'tweet'
I successfully installed twitterscraper on my notebook, using Linux. But, when I tried to run it, I got the following error message:
"Error occurred during getting browser"
What should I do?
Thanks.
We're currently extracting human-readable timestamps; however, there's a data-time-ms property on the span within the a element that contains the timestamp: <span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-time="1476057559" data-time-ms="1476057559000" data-long-form="true">Oct 9</span>
- parsing the other string into proper date objects is almost impossible, they sometimes contain AM/PM, sometimes not, sometimes dots here and there, sometimes not, occasionally I get localized months...
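A minimal sketch of reading data-time-ms instead of the display string, using only the standard library. The class and function names are hypothetical and not part of twitterscraper; the sample markup is the span quoted above.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser

# The span quoted in the issue above.
SAMPLE = (
    '<span class="_timestamp js-short-timestamp " data-aria-label-part="last" '
    'data-time="1476057559" data-time-ms="1476057559000" '
    'data-long-form="true">Oct 9</span>'
)


class TimestampParser(HTMLParser):
    """Collects a UTC datetime for every span carrying data-time-ms (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.timestamps = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "span" and "data-time-ms" in attributes:
            milliseconds = int(attributes["data-time-ms"])
            # Epoch milliseconds to an aware datetime; no locale-dependent parsing needed.
            self.timestamps.append(
                datetime.fromtimestamp(milliseconds / 1000, tz=timezone.utc)
            )


def extract_timestamps(html):
    parser = TimestampParser()
    parser.feed(html)
    return parser.timestamps


print(extract_timestamps(SAMPLE)[0].isoformat())
```

Because the value is an epoch offset, the AM/PM, punctuation, and localized month variations no longer matter.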
I'm trying to extract a specified user's tweets.
By using this command line: twitterscraper Trump --limit 100 --output=tweets.json
it just extracts all tweets that mention the person's name instead of the user's own tweets.
My question: how can I extract all of a specified user's tweets?
Thank you...
The documentation says:
"You can use any advanced query twitter supports. Simply compile your query at https://twitter.com/search-advanced."
Let's say I try to get all tweets from user 'username'.
I get the url https://twitter.com/search?f=tweets&q=from%3Ausername&src=typd
Which part (if not the whole url) is the query?
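A small sketch of pulling the query out of such a search URL, assuming the query is the value of the q parameter (the other parameters, f and src, look like UI settings). The function name is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def query_from_search_url(url):
    """Return the decoded value of the q parameter of a search URL (hypothetical helper)."""
    params = parse_qs(urlparse(url).query)
    return params["q"][0]


url = "https://twitter.com/search?f=tweets&q=from%3Ausername&src=typd"
print(query_from_search_url(url))
```

Under that assumption, the part to hand to the scraper would be from%3Ausername, which decodes to from:username.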
Hi Taspinar and Sils,
I was collecting last year's movie data today, and it seems the date issue is occurring again: I cannot get data from earlier than 12 days ago :( even though I have tried many times. It's as if some sort of notification let Twitter know I was trying to go back further than 12 days. How can I solve this problem?
Thank you so much!
I made a few logging modifications, and if you check out https://github.com/sils/twitterscraper/tree/sils/parallel and scrape for test or something like that, you'll sometimes get only 60ish tweets for some parts of months, which seems rather impossible (and doesn't check out if you put the advanced query into the search UI).
@taspinar if you have any idea that'd help a lot :/
So we see everything works on the versions we want to support, should be easy to do.
the namedtuple just jsonifies them as tuples, would be better to be more dict like and have the member names as keys in the outputted JSON
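To illustrate the difference, here is a sketch with a hypothetical three-field Tweet namedtuple (the real class may have different fields). The standard json module serializes a namedtuple as an array, while _asdict() keeps the member names as keys.

```python
import json
from collections import namedtuple

# Hypothetical, reduced field list for illustration only.
Tweet = namedtuple("Tweet", ["user", "id", "text"])

tweet = Tweet(user="alice", id="1", text="hello")

as_tuple = json.dumps(tweet)           # member names are lost
as_dict = json.dumps(tweet._asdict())  # member names become JSON keys

print(as_tuple)
print(as_dict)
```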
"scraping the Services without the prior consent of Twitter is expressly prohibited"
Hello again!
I've just tried my previous script in Python 3 and immediately got this error:
File "myscript.py", line 4, in
from twitterscraper import TwitterScraper
File "/usr/local/lib/python3.4/dist-packages/twitterscraper/TwitterScraper.py", line 109
except urllib2.HTTPError, e:
^
SyntaxError: invalid syntax
Maybe it's a naive alternative, but I recently discovered requests and found it more powerful than urllib. Here is a scraping example with requests!
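For context, the SyntaxError comes from the Python 2-only form except urllib2.HTTPError, e: Python 3 requires the as keyword, and urllib2 was split into urllib.request and urllib.error. A minimal sketch of the Python 3 form follows; the fetch function and its opener parameter are hypothetical and only there to make the example easy to try without network access.

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return the response body, or None on an HTTP error (hypothetical helper)."""
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError as e:  # Python 3 syntax: "as e", not ", e"
        print("HTTP error:", e.code)
        return None
```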
Hello @taspinar
I recently ran twitterscraper from my command line:
C:\Python27\Scripts\twitterscraper Telkomsel -o tweets.json
Unfortunately, it returns zero results. But if I add another keyword, like Telkomsel mengecewakan,
it returns tweets related to the keywords.
On the other hand, if I run
C:\Python27\Scripts\twitterscraper Trump -o tweets.json
it runs very well.
Why does this happen?
This is weird: when I check Telkomsel on Twitter, sometimes it reloads and sometimes it gets stuck entirely. Is this a Twitter bug?
And there's no way we can get all the tweets for that day I presume.
Apparently after 100000 tweets or so twitter stops serving new pages.
I am using twitterscraper to get the replies to some twitter accounts.
I am running the following queries as a test:
to%3Amatteorenzi%20since%3A2017-08-21%20until%3A2017-08-27
to%3Amatteosalvinimi%20since%3A2017-08-21%20until%3A2017-08-27
When performing multiple runs I get a different amount of results each time as below, with left number being the result of first query and right one for the second. Each line is a different run.
544, 4216
386, 4121
295, 4180
Why does this happen? Any way I can prevent it?
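This does not explain why the counts differ, but one possible workaround is to run the query several times and keep the union of the results. The sketch below assumes each tweet is a dict with an "id" key, as in the JSON output shown elsewhere on this page; merge_runs is a hypothetical helper.

```python
def merge_runs(*runs):
    """Combine several result lists, keeping the first copy of each tweet id."""
    merged = {}
    for run in runs:
        for tweet in run:
            merged.setdefault(tweet["id"], tweet)
    return list(merged.values())


run_a = [{"id": "1"}, {"id": "2"}]
run_b = [{"id": "2"}, {"id": "3"}]
print(len(merge_runs(run_a, run_b)))
```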
Dear author
Thanks very much for your kind work! I am a beginner at Python programming and hope this does not trouble you too much.
My problem is that I want to use the script to mine non-English text (from the Twitter advanced search page), such as "戦う", but non-English text in the output file is always displayed as escaped bytes like "'\xe7\x8e\xb2\xe9\ ... ..."
Even when using the command "print(tweet.text.encode('utf-8'))" (or another encoding), the output is still the same.
Is there a specific way to display the non-English text correctly?
Thanks!
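A short sketch of what is probably going on, under the assumption that the text is otherwise intact: calling .encode('utf-8') produces bytes, and printing bytes shows \x escapes, so the readable form is the decoded string. When writing JSON, ensure_ascii=False keeps the characters instead of \u escapes.

```python
import json

raw = "戦う".encode("utf-8")   # bytes: printing these shows \x.. escapes
text = raw.decode("utf-8")      # str: prints as readable characters

escaped = json.dumps({"text": text})                       # default: \uXXXX escapes
readable = json.dumps({"text": text}, ensure_ascii=False)  # keeps the characters

print(raw)
print(text)
print(readable)
```

When writing such output to a file, opening it with encoding="utf-8" avoids depending on the platform default.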
Casing only introduces problems and confusion, also the package name on pypi is lowercase.
I'm new to Python.
This script works fine, but I also want the user's location as given in their profile, and the geolocation.
How can I get this information?
twitterscraper "%24PEP"%20since%3A2017-10-05 -o pep.out
this works, but when running it
twitterscraper "%24PEP"%20since%3A2017-10-05%20until%3A2017-10-05 -o pep.out
it doesn't work.
I.e., I want to limit the results to a single day, but that won't work.
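One thing worth trying, assuming Twitter's until: operator excludes the given date (so since and until on the same day would select nothing), is to set until to the following day. A sketch with a hypothetical helper that builds the encoded query:

```python
from datetime import date, timedelta
from urllib.parse import quote


def single_day_query(term, day):
    """Build a percent-encoded query covering one calendar day (hypothetical helper).

    Assumes since: is inclusive and until: is exclusive.
    """
    next_day = day + timedelta(days=1)
    raw = f"{term} since:{day.isoformat()} until:{next_day.isoformat()}"
    return quote(raw, safe="")


print(single_day_query("$PEP", date(2017, 10, 5)))
```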
Hi,
I did a python 3 rewrite. It's a bit shorter, only takes about 90 LOC, and has a cleaner API IMO. It supports arbitrary queries and I basically got rid of the IO and a lot of other stateful stuff: https://github.com/sils/twitterscraper/blob/sils/auth/scrape_from_author.py
It installed correctly via pip and from source, but when I try to use the CLI or the Python shell I get this:
from twitterscraper import query_tweets
Traceback (most recent call last):
File "", line 1, in
File "twitterscraper/init.py", line 13, in
from twitterscraper.query import query_tweets
File "twitterscraper/query.py", line 10, in
from twitterscraper.tweet import Tweet
File "twitterscraper/tweet.py", line 3, in
from bs4 import BeautifulSoup
File "/usr/local/lib/python2.7/dist-packages/bs4/init.py", line 30, in
from .builder import builder_registry, ParserRejectedMarkup
File "/usr/local/lib/python2.7/dist-packages/bs4/builder/init.py", line 314, in
from . import _html5lib
File "/usr/local/lib/python2.7/dist-packages/bs4/builder/_html5lib.py", line 70, in
class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
AttributeError: 'module' object has no attribute '_base'
I also have the Twitter API installed.
Returns 503 error when trying to specify dates
Running twitterscraper, I ran into this error using the example given in the readme: twitterscraper Trump%20since%3A2017-01-03%20until%3A2017-01-04 -o tweets.json
I was running a version from March and then upgraded to the latest master.zip but I still got the same error... Any ideas on how to resolve this? I'm running Ubuntu 16.04...
Traceback (most recent call last):
File "/usr/local/bin/twitterscraper", line 9, in <module>
load_entry_point('twitterscraper==0.3.1', 'console_scripts', 'twitterscraper')()
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 542, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2569, in load_entry_point
return ep.load()
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2229, in load
return self.resolve()
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2235, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "build/bdist.linux-x86_64/egg/twitterscraper/__init__.py", line 13, in <module>
File "build/bdist.linux-x86_64/egg/twitterscraper/query.py", line 14, in <module>
File "/usr/local/lib/python2.7/dist-packages/fake_useragent/fake.py", line 139, in __getattr__
raise FakeUserAgentError('Error occurred during getting browser') # noqa
fake_useragent.errors.FakeUserAgentError: Error occurred during getting browser
Is there any way to scrape geo data without using the API? This isn't an issue, it's more of a question. I've been searching for a while and can't seem to find anything.
Reading the stdout of a command is much more efficient when handling a lot of requests, rather than taxing the server memory by creating many output json files. I believe that having an option to output to the console as stdout rather than outputting to a file would be a great feature that would expand the way that people can use this project.
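A rough sketch of what such output could look like: one JSON object per line written to a stream, so another process can consume it as it arrives. write_tweets is a hypothetical helper, not an existing twitterscraper option, and tweets are assumed to be plain dicts.

```python
import json
import sys


def write_tweets(tweets, stream=None):
    """Write each tweet as one JSON line to the stream (stdout by default)."""
    if stream is None:
        stream = sys.stdout
    for tweet in tweets:
        stream.write(json.dumps(tweet, ensure_ascii=False) + "\n")


write_tweets([{"id": "1", "text": "hello"}])
```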
When running twitterscraper from command line, the source parameter is not accurately passed to the script if used with apostrophe.
Example:
#news AND source:"Twitter for Android"
twitterscraper %23news%20AND%20source%3A"Twitter%20for%20Android" --output=tweets_new_Android.json
tweets_new_Android.json is empty, but https://twitter.com/search?q=%23news%20AND%20source%3A%22Twitter%20for%20Android%22&src=typd shows results.
it works for sources without apostrophe:
#news AND source:"Tweetdeck"
twitterscraper %23news%20AND%20source%3A"Tweetdeck" --output=tweets_new_Tweetdeck.json
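A possible explanation, offered as an assumption: many shells strip unescaped double quotes before the program sees its arguments, so the scraper would receive source:Twitter%20for%20Android without the quotes, whereas the working search URL above carries them as %22. The sketch below percent-encodes the whole query, quotes included; encode_query is a hypothetical helper.

```python
from urllib.parse import quote


def encode_query(raw):
    """Percent-encode a raw search query, including quotation marks."""
    return quote(raw, safe="")


encoded = encode_query('#news AND source:"Twitter for Android"')
print(encoded)
```

The result matches the q value in the URL that shows results, so passing that string on the command line may be worth a try.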
Successfully installed twitterscraper in Python36 but I get the above message from the CMD prompt indicating a problem with the filename.
I think that it is because there is no data to store in the file as there is nothing onscreen from the "print(tweet)" in line 6
Please help (python novice)
John
Hello @taspinar
I just found a bug while scraping tweets. When my connection is unstable, I get an error message like the following:
ERROR:root:An unknown error occurred! Returning tweets gathered so far.
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\twitterscraper\query.py", line 93, in query_tweets_once
pos is None
File "C:\Python27\lib\site-packages\twitterscraper\query.py", line 53, in query_single_page
except requests.exception.ConnectionError as e:
AttributeError: 'module' object has no attribute 'exception'
Solved, I just needed to upgrade.
So I ran the scraper for a tweeting period of around a year, with a limit of 40,000, so:
twitterscraper "%23bitcoin AND %23bubble since%3A2016-09-01 until%3A2017-10-10&src=typd" -l 40000 -o bitcoinbubble.json
While running it counted all the way up to 40 thousand:
INFO: Got 39953 tweets (18 new).
INFO: Got 39971 tweets (19 new).
INFO: Got 39990 tweets (17 new).
INFO: Got tweets ranging from 2017-09-08 to 2017-10-09
But when I load the JSON file, it only contains 1528 tweets. What explains this?
Hello @taspinar
I'm new to programming.
Could you please give an example of an advanced query, in particular scraping by location and a specific time?
Thank you
So whenever I am trying to run this command on my server, it's saying "Command not found". I have installed it in my home directory. Please help. Any help would be appreciated.
If I control C out of the command line execution the program does not seem to save its results anywhere. The program also continues its execution on a second iteration, which is not always desired. I ran a large search last night on separate machines and neither of them saved their search data when control-c was used
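A sketch of the behavior being asked for, written against a generic iterable of result pages rather than twitterscraper's internals (which are not shown here): catch KeyboardInterrupt, stop, and still write whatever has been gathered. collect_and_save is a hypothetical name.

```python
import json


def collect_and_save(pages, output_path):
    """Gather tweets page by page; on Ctrl-C, stop and save what was collected."""
    tweets = []
    try:
        for page in pages:
            tweets.extend(page)
    except KeyboardInterrupt:
        # Stop early instead of starting another iteration.
        pass
    finally:
        # Runs on normal completion and on interruption alike.
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(tweets, f)
    return tweets
```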
import twitterscraper as ts
usr = 'kingjames'
for tweet in ts.query_tweets(usr, 10)[:10]:
    print(tweet.user.encode('utf-8'))
# out:
b'Rypuur'
b'Powperezdiez'
b'joey_a_george'
b'mikey_rakkar'
b'yarapgv'
b'V_Nasty10'
b'downtownbrownxx'
b'DeclanJoyce'
b'atnissaa'
b'WestifiedMJ'
Given that this is just a parameter in the Twitter API, it should be easy to do, and it is frustrating that it isn't already available.
Seems to be nestled in data-permalink-path, should be an easy scrape
Tweet data for these fields is not being properly parsed if the values exceed 999.
I suspect that it relates to the fact that Twitter displays those values with letters in them. eg, "1.1k" instead of 1100.
In any case, Twitterscraper returns those values as 0.
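If the cause is indeed the abbreviated display form, a parser along these lines could convert it back to an integer. This is a sketch, not the library's code: parse_count is hypothetical, and the handling of an "m" suffix and of thousands separators is an assumption beyond the "1.1k" example given above.

```python
from decimal import Decimal

# Assumed suffixes; only "k" is confirmed by the report above.
_SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_count(text):
    """Convert a displayed count such as '1.1k' or '2,345' to an int (0 if empty)."""
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        return 0
    multiplier = 1
    if cleaned[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[cleaned[-1]]
        cleaned = cleaned[:-1]
    # Decimal avoids float rounding surprises with values like 1.1 * 1000.
    return int(Decimal(cleaned) * multiplier)


print(parse_count("1.1k"))
```

Note that the abbreviated form is rounded, so the exact original count cannot be recovered from it.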
This question might sound silly, but I am able to run TwitterScraper successfully (with the command twitterscraper “” --output=tweets.json), yet I am unable to find my JSON file (logging shows that data is being collected, for example:
INFO: Got 137 tweets (20 new).
INFO: Got 157 tweets (20 new).
INFO: Got 177 tweets (19 new).
INFO: Got 196 tweets (19 new).
INFO: Got 215 tweets (17 new).
INFO: Got 232 tweets (20 new).
INFO: Got 252 tweets (19 new).
)
Specifying the exact path /Users/blahblah/tweets.JSON did not make a difference.
What am I missing? Thanks for your help in advance,
I have been running the following command: twitterscraper trump -l 3 -o tweets.json, which I figured would limit the amount of tweets to 3, according to the documentation.
Why is it that -l is not limiting the tweet download to just 3? I'm assuming this is not intended behavior. I have also tested this with -l at a higher integer, and when set to -l 30, it always downloads 40 tweets.
I'm thinking that this behavior is caused by new tweets being tweeted as the scraper is running? Twitter briefly explains this in this article: https://developer.twitter.com/en/docs/tweets/timelines/guides/working-with-timelines
The output of tweets.json is the following when using --limit 3 (contains 20 tweets):
[{"timestamp": "2017-11-02T18:26:36", "text": "trump owns it now since he gutted the subsidies.", "user": "MoOkonski", "retweets": "0", "replies": "0", "fullname": "Maureenski", "id": "926153585397780480", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "Congress, impeach Trump or resign \u2026http://makeamericagreatagainreally.blogspot.com/2017/10/the-workings-of-donald-j-trumps-mind.html\u00a0\u2026 #Congress #impeachmentpic.twitter.com/lQz5q6ZW5Z", "user": "THIRDSTONE56", "retweets": "0", "replies": "0", "fullname": "THIRD STONE", "id": "926153585750085632", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "#trump ahora es un asesino tambi\u00e9n.", "user": "rikrdotc", "retweets": "0", "replies": "0", "fullname": "Ricardo C", "id": "926153585800482817", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "Donna Brazile: I found 'proof' the DNC rigged the nomination for Hillary Clinton #DrainTheSwamp #Trump POTUS http://www.foxnews.com/politics/2017/11/02/donna-brazile-found-proof-dnc-rigged-nomination-for-hillary-clinton.html\u00a0\u2026", "user": "DavidDoright", "retweets": "0", "replies": "0", "fullname": "D.W.Trump\u00a0\ud83c\uddfa\ud83c\uddf8", "id": "926153586098294785", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "Trump to press for end to North Korea nuclear program on Asia trip: White House http://ift.tt/2z9xKoh\u00a0", "user": "BreakingNewss3", "retweets": "0", "replies": "0", "fullname": "Breaking News", "id": "926153586958053376", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "Nixon used his China trip as distraction to investigations of him. Trump going to Asia; echoes of the same or misdirect to a deeper issue.", "user": "TalkinToU", "retweets": "0", "replies": "0", "fullname": "TalkinToU", "id": "926153587268263936", "likes": "0"}, {"timestamp": "2017-11-02T18:26:38", "text": "George Papadopoulos was much more than what Trump says he was. 
https://twitter.com/SethAbramson/status/925923595079045120\u00a0\u2026", "user": "Resistacat", "retweets": "0", "replies": "0", "fullname": "Dee Ramee", "id": "926153592427466753", "likes": "0"}, {"timestamp": "2017-11-02T18:26:38", "text": "Mysterious Trump backer Mercer stepping down at fund, selling Breitbart stake. #Trump #Breibarthttps://www.cnbc.com/2017/11/02/billionaire-trump-backer-robert-mercer-to-step-down-from-hedge-fund.html\u00a0\u2026", "user": "PSuiteNetwork", "retweets": "0", "replies": "0", "fullname": "John Cutler", "id": "926153593635459072", "likes": "0"}, {"timestamp": "2017-11-02T18:26:38", "text": "This is far from over. Wait for it. And the collusion won't be over the election it will be over Trump's shady business dealings in Russia", "user": "HarryJoachim", "retweets": "0", "replies": "0", "fullname": "Harry Joachim", "id": "926153594939871234", "likes": "0"}, {"timestamp": "2017-11-02T18:26:38", "text": "Donna Brazil confession: Trump & Bernie were right, the DNC rigged the nomination for Hillary, big league!!\n\nhttps://townhall.com/tipsheet/guybenson/2017/11/02/donna-brazile-trump-and-bernie-were-right-the-dnc-rigged-it-for-hillary-big-league-n2403847\u00a0\u2026", "user": "LovToRideMyTrek", "retweets": "0", "replies": "0", "fullname": "BOYCOTT HOLLYWOOD\u00a0\ud83c\udf83", "id": "926153595346616327", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Why Harry Belafonte's Warning About Trump Is Important Now More Than Ever. 
Read here: http://allthat.tv/posts/why-harry-belafonte-s-warning-about-trump-is-important-now-more-than-ever\u00a0\u2026", "user": "ArmChairPundt", "retweets": "0", "replies": "0", "fullname": "Lachelle", "id": "926153596340649984", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Thank Trump for that", "user": "DennisG_Shea", "retweets": "0", "replies": "0", "fullname": "Dennis Shea", "id": "926153596567203841", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "GREAT AGAIN: POTUS Trump Announces $100 Billion Company\u2019s Return To USA (VIDEO)\nhttps://goo.gl/SaF4Us\u00a0\n\nNovember 2, 2017\nby Joshua ...pic.twitter.com/dL0nG1oOT8", "user": "warfarenews", "retweets": "0", "replies": "0", "fullname": "Warfare Web", "id": "926153596629950464", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Time To Turn The Channel. I Can Only Handle So Much In One Day Of Trump & The Counterfeit Assholes Surrounding Him! LIES-LIES-LIES!!", "user": "Brokenknee1Jim", "retweets": "0", "replies": "0", "fullname": "James", "id": "926153597427113984", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Trump doesn\u2019t really want you to know Obamacare enrollment just started -- By @svdate https://www.huffingtonpost.com/entry/trump-obamacare-enrollment_us_59fa3adfe4b01b47404810d0?ncid=engmodushpmg00000004\u00a0\u2026 via @HuffPostPol", "user": "michaellamperd", "retweets": "0", "replies": "0", "fullname": "Mick", "id": "926153597489881088", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "The Trump-Russia dossier cost $168,000, not $12 million, like president claimed http://www.newsweek.com/trump-dossier-cost-millions-699816\u00a0\u2026", "user": "XtyMiller", "retweets": "0", "replies": "0", "fullname": "Kilikina", "id": "926153598140014592", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "MOMENTS AGO: Pres. 
Trump: \"Congress must end chain migration so that we can have a system that is security based, not the way it is now.\"...", "user": "The_News_Corner", "retweets": "0", "replies": "0", "fullname": "Ok", "id": "926153598202912769", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Trump Is Quietly Deregulating All the Things | Brittany Hunter https://fee.org/articles/trump-is-quietly-deregulating-all-the-things/\u00a0\u2026 via @feeonline", "user": "badcraigsnews", "retweets": "0", "replies": "0", "fullname": "Badcraigsnews", "id": "926153598433660928", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Trump to press for end to North Korea nuclear program on Asia trip: White House http://ift.tt/2z7GXgZ\u00a0", "user": "_politic_us_", "retweets": "0", "replies": "0", "fullname": "Audrey", "id": "926153598437818368", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "House Democrats file lawsuit over access to Trump hotel documents - Politico https://www.politico.com/story/2017/11/02/trump-hotel-documents-lawsuit-244455\u00a0\u2026", "user": "PS641600", "retweets": "0", "replies": "0", "fullname": "PeterS", "id": "926153598446301184", "likes": "0"}]
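The numbers reported above (20 tweets for a limit of 3, 40 for a limit of 30) would be consistent with results arriving in whole pages of about 20, though that is only a guess. Whatever the cause, the result can be trimmed to an exact count after the fact. A sketch with a hypothetical helper:

```python
from itertools import islice


def take(tweets, limit):
    """Return at most `limit` items from any iterable of tweets."""
    return list(islice(tweets, limit))


page = [{"id": str(n)} for n in range(20)]  # stand-in for one page of results
print(len(take(page, 3)))
```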
While running TwitterScraper "test" --output tweets.json --all for ~10 minutes:
ERROR: An unknown error occurred! Returning tweets gathered so far.
Traceback (most recent call last):
File "/home/lasse/prog/tie/twitterscraper/twitterscraper/query.py", line 96, in query_tweets_once
pos is None
File "/home/lasse/prog/tie/twitterscraper/twitterscraper/query.py", line 46, in query_single_page
tweets = list(Tweet.from_html(html))
File "/home/lasse/prog/tie/twitterscraper/twitterscraper/tweet.py", line 34, in from_html
yield cls.from_soup(tweet)
File "/home/lasse/prog/tie/twitterscraper/twitterscraper/tweet.py", line 19, in from_soup
user=tweet.find('span', 'username').text[1:],
AttributeError: 'NoneType' object has no attribute 'text'
This results in a bunch of empty files and incrementally increasing filenames, can be annoying for testing.
Hello !
I'm trying to scrape every tweet from an account. My script is quite simple:
#!/usr/bin/env python
# encoding: utf-8
from twitterscraper import TwitterScraper
topic = ""
cible = "username"
filename = 'username_tweets.csv'
scraper = TwitterScraper.Scraper(topic, 21000, authors=cible, filename=filename)
scraper.scrape()
It works for hundreds of tweets, but then I get this error:
Traceback (most recent call last):
File "myscript.py", line 10, in
scraper.scrape()
File "/usr/local/lib/python2.7/dist-packages/twitterscraper/TwitterScraper.py", line 148, in scrape
self.write(post)
File "/usr/local/lib/python2.7/dist-packages/twitterscraper/TwitterScraper.py", line 136, in write
self.writer.writerow(post)
(Yes, I'm using python 2.7, don't know if the problem came from here or not)
Thanks in advance
Taspinar and Sils, nice job!
A little issue in Readme.md usage example: "print(tweet.username)" should be changed to "print(tweet.user)"
Am I right in thinking that this scraper only looks at q= to return results and therefore you cannot pass a location?
Hi there,
It would be great, if there was a basic example that does the following in a Python script:
Probably this is just me being new to Python, but a general documentation with a brief description for each functionality would also be nice.
Thanks in advance!
After carrying out the CLI search, how do I access the JSON file in which the tweets are stored?
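A minimal sketch of reading such a file back in Python, assuming it was written to the path given to --output and contains a JSON list of objects with keys like "user" and "text", as in the sample output shown earlier on this page. load_tweets is a hypothetical helper.

```python
import json


def load_tweets(path):
    """Load the list of tweet dicts from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Example usage, assuming tweets.json is in the current working directory:
# for tweet in load_tweets("tweets.json"):
#     print(tweet["user"], tweet["text"])
```

A relative file name such as tweets.json is resolved against the directory the command was run from, which is usually where to look for the file.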