jonbakerfish / TweetScraper
TweetScraper is a simple crawler/spider for Twitter Search without using API
License: GNU General Public License v2.0
Hi, is it possible to write all the tweets into one file, preferably a JSON file, instead of creating separate files for every tweet?
Thanks!
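One way to get this, sketched below under the assumption that you add a custom pipeline to TweetScraper (the class name and file path here are illustrative, not from the repo), is to append every tweet as one JSON line to a single file:

```python
import json

class SaveToSingleFilePipeline(object):
    """Hypothetical pipeline: append every tweet as one JSON line to a single file."""

    def __init__(self, path='tweets.jsonl'):
        self.path = path

    def process_item(self, item, spider=None):
        # dict(item) works for Scrapy Items as well as plain dicts
        with open(self.path, 'a', encoding='utf-8') as f:
            f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```

Each line is then an independent JSON object, so the whole file can be read back with one json.loads per line.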
SQLite is a lightweight DB that doesn't require installation, so it could be a natural choice for many users.
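A minimal sketch of such a pipeline, assuming the item fields TweetScraper already saves (the table and column names here are illustrative):

```python
import sqlite3

class SaveToSQLitePipeline(object):
    """Illustrative pipeline: store tweets in a local SQLite file (no server needed)."""

    def __init__(self, db_path='tweets.db'):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS tweet (id TEXT PRIMARY KEY, text TEXT)'
        )

    def process_item(self, item, spider=None):
        # parameterized query: the driver handles quoting, so no escaping issues
        self.conn.execute(
            'INSERT OR REPLACE INTO tweet (id, text) VALUES (?, ?)',
            (item['ID'], item['text']),
        )
        self.conn.commit()
        return item
```

Using INSERT OR REPLACE also deduplicates on the tweet ID for free.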
The scraped tweet text has unwanted spaces:
Demonetisation instead of #Demonetisation
narendramodi instead of @narendramodi
And unwanted spaces in URLs in the tweet text:
www.business-standard.com/article/markets/demonetisation-is-another-proof-that-modi-is-reforming-india-chris-wood-116111400993_1.html
Fixing spaces in URLs with regex after scraping is a headache.
Excuse me, can you give me an example of the query and USER_AGENT settings?
If I want to get tweets containing the emoticon ":)" from April 2013 on Twitter, what should I set for the query and for USER_AGENT in the settings file?
When the SavetoMySQLPipeline class in pipelines.py tries to insert a record containing a single quote, an error is thrown and the record is rejected. This is because MySQL uses single quotes to begin and end a string value in its INSERT statement; to include a literal quote in a string you must write '' (two single quotes in a row). The bad INSERT statement is built on line 141 of pipelines.py. To fix this I updated the code on line 135:
text = item['text']
replacing it with the following, which doubles each single quote:
text = item['text'].replace("'", "''")
This could also be applied to the other variables there for completeness, but that wasn't necessary in my case.
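A more robust fix, sketched here under the assumption that the pipeline has a DB-API cursor available (the function and column names are illustrative): use a parameterized query so the driver escapes quotes itself and no manual replace is needed.

```python
def insert_tweet(cursor, item):
    """Insert a tweet using placeholders; the driver escapes quotes safely.

    For MySQLdb/pymysql the placeholder is %s; sqlite3 would use ? instead.
    """
    cursor.execute(
        "INSERT INTO tweet (id, text) VALUES (%s, %s)",
        (item['ID'], item['text']),
    )
```

This also protects against SQL injection, which string splicing does not.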
Thanks for publishing this!
I am conducting research on searching related tweets. I was wondering whether the code could be extended to return the corresponding coordinates and the mentioned users' IDs. Thank you very much!
I have noticed some odd behaviour with this otherwise great scraper.
query='monkey,coffee': runs with no problems.
query='monkey,#coffee': runs with no problems.
query='#monkey,#coffee': breaks.
query='#monkey,#coffee,#love,#gorilla': breaks.
query='monkey,#coffee,love,gorilla': breaks.
query='monkey,coffee,love,#gorilla': breaks.
query='#monkey,coffee,love,gorilla': breaks.
query='monkey,coffee,love,#gorilla': breaks.
query='monkey,coffee,love,gorilla': breaks.
query='monkey,coffee,love': runs with no problems.
query='monkey,coffee,#love': breaks.
query='#love': runs with no problems.
Tentative conclusions: there is probably some underlying thing that I am missing here...?
I run the command: scrapy crawl TweetScraper -a query="from: {UserId}"
It returns some tweets that don't even contain {UserId} at all, and not all of {UserId}'s tweets are returned.
I think this is a recent regression.
The expected behavior: the above command fetches all tweets with {"usernameTweet": "{UserId}", ...
I am having trouble crawling a user who I know has retweets: all the scraped tweets have is_retweet=False, which means retweets are not scraped. Is there a known fix for this?
I haven't checked whether it saves only 140 characters or 280, but I often get this error:
Data too long for column 'text' at row 1
and the tweet it refers to is not saved.
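Two possible workarounds, assuming a MySQL text column that is simply too narrow (the column name and size below are guesses): either widen the column, e.g. ALTER TABLE tweet MODIFY text VARCHAR(1024), or truncate in the pipeline before inserting so the row is kept rather than rejected. A sketch of the latter:

```python
MAX_TEXT_LEN = 1024  # assumed column width; match your actual schema

def clamp_text(text, limit=MAX_TEXT_LEN):
    """Truncate tweet text so it fits the DB column instead of failing the insert."""
    return text if len(text) <= limit else text[:limit]
```

Truncation loses data, so widening the column is usually the better fix.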
When I give the command scrapy crawl TweetScraper -a query=Trump, it throws an error.
ERROR: Error downloading <GET https://twitter.com/i/search/timeline?l=&f=tweets&q=Trump&src=typed&max_position=>
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
response = yield method(request=request, spider=spider)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/robotstxt.py", line 46, in process_request_2
if not rp.can_fetch(to_native_str(self._useragent), request.url):
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/python.py", line 129, in to_native_str
return to_bytes(text, encoding, errors)
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/python.py", line 119, in to_bytes
'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got list
2019-07-16 14:27:59 [scrapy.core.engine] INFO: Closing spider (finished)
Trying to use this but nothing is being saved - I even tried the simple
"scrapy crawl TweetScraper -a query=foo,#bar"
but that didn't save either
This program ran smoothly at first: I successfully crawled data from within Oct 2018.
Then I began crawling data from 2013-10-31 until 2018-10-31, storing it in a folder named "tweet". When I found that the folder had some problems, I exited the program.
After that I could not open my "tweet" folder, so I tried to move it to the trash. That took a long time, and in the meantime I tried to run the program again to get the Oct 2018 data, but no data files appeared in the new folder I created.
My log is below:
2018-11-16 19:15:49 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: TweetScraper)
2018-11-16 19:15:49 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.9.0, Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 6 2017, 12:04:38) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-18.0.0-x86_64-i386-64bit
2018-11-16 19:15:49 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'TweetScraper', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'TweetScraper.spiders', 'SPIDER_MODULES': ['TweetScraper.spiders'], 'USER_AGENT': '[email protected]'}
2018-11-16 19:15:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-11-16 19:15:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-16 19:15:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
Unhandled error in Deferred:
2018-11-16 19:15:49 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 171, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 175, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- ---
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 80, in crawl
self.engine = self._create_engine()
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 105, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/core/engine.py", line 70, in __init__
self.scraper = Scraper(crawler)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/core/scraper.py", line 71, in __init__
self.itemproc = itemproc_cls.from_crawler(crawler)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/Users/tangfuyu/anaconda3/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 674, in exec_module
File "<frozen importlib._bootstrap_external>", line 781, in get_code
File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
builtins.SyntaxError: Missing parentheses in call to 'print'. Did you mean print(print "Creating table...")? (pipelines.py, line 82)
2018-11-16 19:15:49 [twisted] CRITICAL:
Traceback (most recent call last):
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 80, in crawl
self.engine = self._create_engine()
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 105, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/core/engine.py", line 70, in __init__
self.scraper = Scraper(crawler)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/core/scraper.py", line 71, in __init__
self.itemproc = itemproc_cls.from_crawler(crawler)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/Users/tangfuyu/anaconda3/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 674, in exec_module
File "<frozen importlib._bootstrap_external>", line 781, in get_code
File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/Users/tangfuyu/Desktop/Project/TweetScraper/TweetScraper/pipelines.py", line 82
print "Creating table..."
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(print "Creating table...")?
tangfuyudeMacBook-Air:TweetScraper tangfuyu$ scrapy crawl TweetScraper -a query="DJIA since:2018-11-12 until:2018-11-13"
2018-11-16 19:16:49 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: TweetScraper)
2018-11-16 19:16:49 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.9.0, Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 6 2017, 12:04:38) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-18.0.0-x86_64-i386-64bit
2018-11-16 19:16:49 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'TweetScraper', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'TweetScraper.spiders', 'SPIDER_MODULES': ['TweetScraper.spiders'], 'USER_AGENT': '[email protected]'}
2018-11-16 19:16:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-11-16 19:16:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-16 19:16:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-16 19:16:50 [scrapy.middleware] INFO: Enabled item pipelines:
['TweetScraper.pipelines.SaveToFilePipeline']
2018-11-16 19:16:50 [scrapy.core.engine] INFO: Spider opened
2018-11-16 19:16:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-16 19:17:08 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-16 19:17:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 18452,
'downloader/request_count': 20,
'downloader/request_method_count/GET': 20,
'downloader/response_bytes': 335803,
'downloader/response_count': 20,
'downloader/response_status_count/200': 20,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 16, 11, 17, 8, 390682),
'item_scraped_count': 380,
'log_count/INFO': 7,
'memusage/max': 55123968,
'memusage/startup': 55123968,
'request_depth_max': 20,
'response_received_count': 20,
'scheduler/dequeued': 20,
'scheduler/dequeued/memory': 20,
'scheduler/enqueued': 20,
'scheduler/enqueued/memory': 20,
'start_time': datetime.datetime(2018, 11, 16, 11, 16, 50, 69474)}
2018-11-16 19:17:08 [scrapy.core.engine] INFO: Spider closed (finished)
Does anyone know what the problem is? I really need help because I am doing a project and this data is crucial to me. If anyone can help me crawl this data and send it to me, I would appreciate it a lot!!
Hi there,
This is only scraping retweet data; it's not scraping tweets.
What's the issue?
Thanks
I was wondering whether the code can be updated to retrieve the list of the users who like the tweet, the users who retweet the tweet, and the users who are mentioned in the tweet. If someone can do that, please share the approach with us. Thank you very much! @MathiasDesch
I checked Twitter and found "https://twitter.com/search?l=&q=from%3Aelonmusk"; this should be the base URL.
After executing the commands for clone and install, the following error appears after the command scrapy list:
root@host:~/scrape/TweetScraper# scrapy list
Traceback (most recent call last):
File "/usr/bin/scrapy", line 4, in <module>
execute()
File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 143, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 150, in _run_command
cmd.run(args, opts)
File "/usr/lib/python2.7/dist-packages/scrapy/commands/list.py", line 13, in run
crawler = self.crawler_process.create_crawler()
File "/usr/lib/python2.7/dist-packages/scrapy/crawler.py", line 87, in create_crawler
self.crawlers[name] = Crawler(self.settings)
File "/usr/lib/python2.7/dist-packages/scrapy/crawler.py", line 25, in __init__
self.spiders = spman_cls.from_crawler(self)
File "/usr/lib/python2.7/dist-packages/scrapy/spidermanager.py", line 35, in from_crawler
sm = cls.from_settings(crawler.settings)
File "/usr/lib/python2.7/dist-packages/scrapy/spidermanager.py", line 31, in from_settings
return cls(settings.getlist('SPIDER_MODULES'))
File "/usr/lib/python2.7/dist-packages/scrapy/spidermanager.py", line 22, in __init__
for module in walk_modules(name):
File "/usr/lib/python2.7/dist-packages/scrapy/utils/misc.py", line 68, in walk_modules
submod = import_module(fullpath)
File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/root/scrape/TweetScraper/TweetScraper/spiders/TweetCrawler.py", line 1, in <module>
from scrapy.linkextractors.sgml import SgmlLinkExtractor
ImportError: No module named linkextractors.sgml
Hi,
I would like to thank you first for your awesome job crawling Twitter without needing their API credentials. I would just like to tell you that, for Python 3 users of your code, you need to change:
print "Inserting..."
to:
print("Inserting...")
Otherwise it will raise an error at launch.
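If the code should keep working on both Python 2 and 3, a future import makes the parenthesized form valid on both (a small sketch, not code from the repo):

```python
from __future__ import print_function  # makes print a function on Python 2 too

print("Inserting...")
```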
Have a nice day,
Bastien.
Good day, this script has been working great. I haven't had time to troubleshoot it yet, but here is the output I am seeing currently (05/21/18 11:36 AM PST). Notice the scrapy error midway down.
Example: scrapy crawl TweetScraper -a query="Trump"
2018-05-21 11:35:09 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: TweetScraper)
2018-05-21 11:35:10 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.2 (v3.6.2:5fd33b5926, Jul 16 2017, 20:11:06) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g 2 Nov 2017), cryptography 2.1.4, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-05-21 11:35:10 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'TweetScraper', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'TweetScraper.spiders', 'SPIDER_MODULES': ['TweetScraper.spiders'], 'USER_AGENT': '[email protected]'}
2018-05-21 11:35:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-05-21 11:35:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-21 11:35:10 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-21 11:35:10 [scrapy.middleware] INFO: Enabled item pipelines:
['TweetScraper.pipelines.SaveToFilePipeline']
2018-05-21 11:35:10 [scrapy.core.engine] INFO: Spider opened
2018-05-21 11:35:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-21 11:35:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://mobile.twitter.com/i/search/timeline?l=&f=tweets&q=Trump&src=typed&max_position=>: HTTP status code is not handled or not allowed
2018-05-21 11:35:10 [scrapy.core.engine] INFO: Closing spider (finished)
2018-05-21 11:35:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 619,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 3286,
'downloader/response_count': 2,
'downloader/response_status_count/302': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 5, 21, 18, 35, 10, 574800),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/404': 1,
'log_count/INFO': 8,
'memusage/max': 68751360,
'memusage/startup': 68751360,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 5, 21, 18, 35, 10, 227440)}
2018-05-21 11:35:10 [scrapy.core.engine] INFO: Spider closed (finished)
Dear developer,
I found this in the setting.py: 'TweetScraper.pipelines.SaveToFilePipeline':100,
I understand that it means we can choose to save to files or to a database, but what does that "100" mean? Can you please give me a hint? Thanks.
Regards
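For context (this comes from Scrapy's general settings conventions, not from this repo's docs): the integer is the pipeline's order value. Scrapy runs all enabled item pipelines in ascending order of these values, conventionally chosen in the 0 to 1000 range, so the number only matters relative to other pipelines:

```python
# settings.py fragment; the commented second entry is illustrative
ITEM_PIPELINES = {
    'TweetScraper.pipelines.SaveToFilePipeline': 100,   # lower value runs first
    # 'TweetScraper.pipelines.SavetoMySQLPipeline': 200,  # would run after
}
```

With a single pipeline enabled, any value in range behaves the same.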
While I can see the requests TweetScraper processes, it isn't able to save the output in ./Data/tweet.
I get this error message when TweetScraper processes the query:
Traceback (most recent call last):
File "/Users/me/Library/Python/2.7/lib/python/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/me/my/ts/dir/TweetScraper/TweetScraper/pipelines.py", line 187, in process_item
self.save_to_file(item,savePath)
File "/Users/me/my/ts/dir/TweetScraper/TweetScraper/pipelines.py", line 210, in save_to_file
with open(fname,'w', encoding='utf-8') as f:
TypeError: 'encoding' is an invalid keyword argument for this function
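This TypeError is a Python 2 limitation: the built-in open() there does not accept an encoding argument. A sketch of a fix that works on both Python 2 and 3 (assuming pipelines.py is otherwise unchanged) is to use io.open instead:

```python
import io

def save_text(fname, text):
    # io.open accepts encoding= on Python 2 as well; on Python 3 it is
    # the same function as the built-in open()
    with io.open(fname, 'w', encoding='utf-8') as f:
        f.write(text)
```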
Hi,
When I start crawling, it continuously creates error logs starting with the words "Error tweet". As a result, it cannot scrape any tweet successfully. My log level is ERROR. What could be the problem? Has there been a major change in the general structure of tweets that causes errors during the parsing operation?
[TweetScraper.spiders.TweetCrawler] ERROR: Error tweet: <div class="tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable dismissible-content
original-tweet js-original-tweet
has-cards has-content
a-tweet-id="748650033210265600" data-item-id="748650033210265600" data-permalink-path="/rmc_lee/status/748650033210265600" data-conversation-id="748650033210265600"
tweet-nonce="748650033210265600-93347977-5360-4fdc-b571-8c945f28a106" data-tweet-stat-initialized="true" data-screen-name="rmc_lee" data-name="Jonathan" data-user-id
0161618" data-you-follow="false" data-follows-you="false" data-you-block="false" data-reply-to-users-json='[{"id_str":"1030161618","screen_name":"rmc_lee","name":"Jo
n","emojified_name":{"text":"Jonathan","emojified_text_as_html":"Jonathan"}}]' data-disclosure-type="" data-has-cards="true" data-component-context="tweet">
div class="context">
/div>
div class="content">
...
Is it possible to use user ID as parameter? e.g. 789944936036655108
<=> @cloudpie_
When the displayed URL is biorxiv.org/content and its tooltip is https://www.biorxiv.org/content, the scraped version has a space: https://www. biorxiv.org/content.
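A narrow post-processing workaround, assuming the stray space always appears right after the elided "www." prefix as in the example above (more general URL mangling would need more care):

```python
import re

def fix_url_spaces(text):
    """Remove whitespace that was injected right after 'www.' in scraped text."""
    return re.sub(r'(www\.)\s+', r'\1', text)
```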
Hi,
I tried to run the command scrapy crawl TweetScraper -a query=bitcoin since: 2017-01-01 until: 2017-01-01 to retrieve tweets, but I got the following error: crawl: error: running 'scrapy crawl' with more than one spider is no longer supported.
What's more, when I don't specify the date, it retrieves too many tweets in one day. Can I set a limit on the tweets for every single day, e.g. 1000 tweets per day?
Thanks!
Hi,
when I run scrapy crawl TweetScraper -a query=foo,#bar, I got this error
AttributeError: module 'OpenSSL.SSL' has no attribute 'OP_SINGLE_ECDH_USE'
Can you tell me how to solve it?
Thanks
Traceback (most recent call last):
File "//anaconda/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "//anaconda/lib/python3.5/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "//anaconda/lib/python3.5/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
result = f(*args, **kw)
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
return handler.download_request(request, spider)
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 63, in download_request
return agent.download_request(request)
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 300, in download_request
method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
File "//anaconda/lib/python3.5/site-packages/twisted/web/client.py", line 1649, in request
endpoint = self._getEndpoint(parsedURI)
File "//anaconda/lib/python3.5/site-packages/twisted/web/client.py", line 1633, in _getEndpoint
return self._endpointFactory.endpointForURI(uri)
File "//anaconda/lib/python3.5/site-packages/twisted/web/client.py", line 1510, in endpointForURI
uri.port)
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/contextfactory.py", line 59, in creatorForNetloc
return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext())
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/contextfactory.py", line 56, in getContext
return self.getCertificateOptions().getContext()
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/contextfactory.py", line 51, in getCertificateOptions
acceptableCiphers=DEFAULT_CIPHERS)
File "//anaconda/lib/python3.5/site-packages/twisted/python/deprecate.py", line 792, in wrapped
return wrappee(*args, **kwargs)
File "//anaconda/lib/python3.5/site-packages/twisted/internet/_sslverify.py", line 1583, in __init__
self._options |= SSL.OP_SINGLE_DH_USE | SSL.OP_SINGLE_ECDH_USE
AttributeError: module 'OpenSSL.SSL' has no attribute 'OP_SINGLE_ECDH_USE'
When my query contains two keywords, the error will occur.
I'm not able to understand the base URL. What's the idea behind it?
How can I use the lang part in the query?
I've tried every way I can think of, but it doesn't work. Most combinations don't do anything, and with this one it looks like it's working but it doesn't scrape anything:
scrapy crawl TweetScraper -a query="@foo -a lang['en']"
Thanks in advance
Hi!
There is no module scrapy.conf
scrapy.conf.settings is deprecated since 2014 per scrapy/scrapy@993b543#diff-5fa56a6cbc91c5af447e69c97083d78e
I was able to fix the bug by following this: https://stackoverflow.com/questions/54717251/scrapydeprecationwarning-module-scrapy-conf-is-deprecated-use-crawler-setti
from scrapy.utils.project import get_project_settings
SETTINGS = get_project_settings()
Thanks for sharing your scraper with a free software license.
So I've been hacking on this the past few days, this project has been very helpful. I've needed to collect large numbers of tweets and users, and this has been the only feasible tool.
However, when scraping large numbers of objects, having one file per tweet/user is a serious problem.
It wastes a huge amount of space. I had 230k files that averaged a bit under 390 bytes each, however, each file occupied the block size of my drive- 4kb. So the disk usage for me is >10x the actual size of the data.
It breaks many of the tools that someone might use to fix this: cat, ls, find, etc. all break. My graphical file manager hung. I eventually figured out a workaround, in case anyone ends up needing it, but this really shouldn't happen:
ls | xargs cat > output
It's very unexpected and unpleasant behavior. I've put together a solution that saves tweets and users to their respective configured paths as newline-delimited JSON. The files are opened with mode a+ so they can be closed properly and flushed between saves, and they won't truncate an existing file.
For the issue of duplication, I'm just holding a set of user/tweet IDs in memory. At 64 bits each, even 100 million IDs cost less than 1 GB of memory. The only issue I have left is finding a way to persist the scraped IDs across sessions without having to read back the full amount of data saved previously. I'll want to polish it and make sure all is well, but I'd be happy to make a PR and discuss.
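One way to persist just the IDs across sessions, sketched here as a standalone idea rather than code from the PR: append each new ID to a compact sidecar file and reload that file at startup, so the full JSON data never needs to be re-read.

```python
import os

class SeenIDs(object):
    """Dedup set persisted as one ID per line in a small sidecar file."""

    def __init__(self, path):
        self.path = path
        self.ids = set()
        if os.path.exists(path):
            with open(path) as f:
                self.ids = set(line.strip() for line in f if line.strip())
        # append mode: never truncates what earlier sessions recorded
        self._out = open(path, 'a')

    def add(self, item_id):
        """Return True if the ID is new (and record it), False if already seen."""
        if item_id in self.ids:
            return False
        self.ids.add(item_id)
        self._out.write(item_id + '\n')
        self._out.flush()
        return True
```

The sidecar file stays tiny relative to the data (around 20 bytes per tweet), so startup reload is cheap even for millions of IDs.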
Thanks!
Hello,
I was able to install TweetScraper, but when I prompted the command scrapy list I received a "command not found" message. I'm going to reinstall, but I am doing all of this on an Ubuntu VM. Frustrating.
From the items.py it seems that there are various items (or tweet attributes) available that are not written to json files:
has_image = Field() # True/False, whether a tweet contains images
images = Field() # a list of image urls, empty if none
has_video = Field() # True/False, whether a tweet contains videos
videos = Field() # a list of video urls
has_media = Field() # True/False, whether a tweet contains media (e.g. summary)
medias = Field() # a list of media
How can I also write these to the json files of the tweets?
Dear developer,
I found an issue yesterday:
All the crawled dates and times are the same, while this didn't occur a week ago.
Is this due to a limitation of the Twitter API, or has Twitter changed the page being crawled?
I am quite stressed, as I need to crawl this data soon for further analysis and my dissertation deadline is coming ><. Can you help me figure out this problem?
Many Thanks
Jason
This doesn't seem to scrape all of the tweets when using date filters...
Is it only scraping "Top" tweets? If so, is there any way I can change it to "All"?
Thank you!
Zach
I run TweetScraper on Python 2.7 and dump the tweets to the filesystem. I noticed that the tweet texts contain no emoji codes. It seems that TweetCrawler.py does not extract emojis from the tweet texts. What am I doing wrong?
Hello, this project has been a great help for me, but I have a problem with this command:
scrapy crawl TweetScraper -a query=from:detikcom,since:2017-10-01,until:2017-12-30 -a crawl_user=True -a top_tweet=True
It returns no scraped data, but when I search for from:detikcom since:2017-10-01 until:2017-12-30 in Twitter's advanced search, it returns some. I really need help for my study. Thanks!
Fix included for the above (#14)
It looks like TweetScraper, after installing all requirements in a Python 2.7 virtual environment, is still missing configparser.
Should we add it to the requirements file?
(TweetScraper) elkcloner@debian:~/git/DocTocToc/TweetScraper/TweetScraper$ scrapy list
Traceback (most recent call last):
File "/home/elkcloner/.virtualenvs/TweetScraper/bin/scrapy", line 6, in <module>
from scrapy.cmdline import execute
File "/home/elkcloner/.virtualenvs/TweetScraper/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 10, in <module>
from scrapy.crawler import CrawlerProcess
File "/home/elkcloner/.virtualenvs/TweetScraper/local/lib/python2.7/site-packages/scrapy/crawler.py", line 11, in <module>
from scrapy.core.engine import ExecutionEngine
File "/home/elkcloner/.virtualenvs/TweetScraper/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 14, in <module>
from scrapy.core.scraper import Scraper
File "/home/elkcloner/.virtualenvs/TweetScraper/local/lib/python2.7/site-packages/scrapy/core/scraper.py", line 18, in <module>
from scrapy.core.spidermw import SpiderMiddlewareManager
File "/home/elkcloner/.virtualenvs/TweetScraper/local/lib/python2.7/site-packages/scrapy/core/spidermw.py", line 13, in <module>
from scrapy.utils.conf import build_component_list
File "/home/elkcloner/.virtualenvs/TweetScraper/local/lib/python2.7/site-packages/scrapy/utils/conf.py", line 4, in <module>
import configparser
ImportError: No module named configparser
(TweetScraper) elkcloner@debian:~/git/DocTocToc/TweetScraper/TweetScraper$ pip install configparser
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Collecting configparser
Using cached https://files.pythonhosted.org/packages/ba/05/6c96328e92e625fc31445d24d75a2c92ef9ba34fc5b037fe69693c362a0d/configparser-3.7.4-py2.py3-none-any.whl
Installing collected packages: configparser
Successfully installed configparser-3.7.4
(TweetScraper) elkcloner@debian:~/git/DocTocToc/TweetScraper/TweetScraper$ scrapy list
TweetScraper
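Adding `configparser` to the requirements file would fix it; alternatively, a standard Python 2/3 import shim (a generic workaround, not TweetScraper-specific) avoids the hard dependency:

```python
# Prefer the Python 3 module name; fall back to the Python 2 stdlib spelling.
try:
    import configparser  # Python 3 stdlib, or the PyPI backport on Python 2
except ImportError:
    import ConfigParser as configparser  # Python 2 stdlib name

# Either way, the same API is now available under one name:
parser = configparser.ConfigParser()
```

Scrapy itself dropped this shim once it required the backport, which is why a bare Python 2.7 virtualenv hits the ImportError.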
Great utility. Is it possible to collect the identity of the user each reply is replying to?
Hi, before I post the issue I would like to say that your work on this project is superb!
As for the issue itself, I have used your scraper for the last 2-3 weeks and everything was working like a charm. But when I started it yesterday, it began throwing this exception. I tried reinstalling the dependencies and cloning the repo again, but nothing works. If you have any idea how to solve this, thanks in advance.
Here is the full output of my run:
scrapy crawl TweetScraper -a query='near:Chicago within:1mi since:2016-01-01
until:2016-12-31' -a lang='en'
2019-01-23 16:13:43 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: TweetScraper)
2019-01-23 16:13:43 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.17134-SP0
2019-01-23 16:13:43 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'TweetScraper', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'TweetScraper.spiders', 'SPIDER_MODULES': ['TweetScraper.spiders'], 'USER_AGENT': 'XXXXXX'}
2019-01-23 16:13:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-01-23 16:13:43 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-01-23 16:13:43 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
Unhandled error in Deferred:
2019-01-23 16:14:15 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\crawler.py", line 171, in crawl
return self._crawl(crawler, *args, **kwargs)
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\crawler.py", line 175, in _crawl
d = crawler.crawl(*args, **kwargs)
File "C:\Users\Tima\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "C:\Users\Tima\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- ---
File "C:\Users\Tima\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\crawler.py", line 80, in crawl
self.engine = self._create_engine()
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\crawler.py", line 105, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\core\engine.py", line 70, in init
self.scraper = Scraper(crawler)
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\core\scraper.py", line 71, in init
self.itemproc = itemproc_cls.from_crawler(crawler)
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\middleware.py", line 40, in from_settings
mw = mwcls()
File "C:\Users\Tima\Desktop\SIAP\TweetScraper\TweetScraper\pipelines.py", line 27, in init
self.tweetCollection.ensure_index([('ID', pymongo.ASCENDING)], unique=True, dropDups=True)
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\collection.py", line 1991, in ensure_index
self.__create_index(keys, kwargs, session=None)
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\collection.py", line 1847, in __create_index
with self._socket_for_writes() as sock_info:
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\collection.py", line 196, in _socket_for_writes
return self.__database.client._socket_for_writes()
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\mongo_client.py", line 1085, in _socket_for_writes
server = self._get_topology().select_server(writable_server_selector)
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\topology.py", line 224, in select_server
address))
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\topology.py", line 183, in select_servers
selector, server_timeout, address)
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\topology.py", line 199, in _select_servers_loop
self._error_message(selector))
pymongo.errors.ServerSelectionTimeoutError: connection closed,connection closed,connection closed
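`ServerSelectionTimeoutError: connection closed` usually means pymongo cannot reach a running `mongod` at the configured address, rather than anything in the scraper itself. Before digging into TweetScraper, a quick TCP probe can confirm whether anything is listening (localhost:27017 is pymongo's default and an assumption here):

```python
import socket

def mongo_reachable(host="localhost", port=27017, timeout=2.0):
    """Quick TCP probe: can we even open a socket to mongod?
    host/port are pymongo's defaults; adjust to match settings.py."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not mongo_reachable():
    print("mongod is not reachable - start MongoDB or fix the host/port")
```

If the probe fails, starting the MongoDB service (or correcting the connection string in the pipeline settings) is the likely fix.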
I ran the following command:
scrapy crawl TweetScraper -a query="foo, #bar"
The tweets are saved to file, but after commenting out the SaveToMongoPipeline method I get the following error:
[TweetScraper.pipelines] INFO: Item type is not recognized! type = <class 'NoneType'>
What could be the cause?
Hello, I am trying to scrape ALL the tweets of a certain account, but the spider closes as "finished" before completing. Any idea what is going on? The user is: REDACTED. I've tried both the from: operator and just searching the username.
Some accounts will have a "Caution: This profile may include potentially sensitive content" warning.
After executing the command scrapy crawl TweetScraper -a query=example, it shows an error that no module named sgmllib is found. I have installed the latest version of Scrapy, and they say that sgmllib is deprecated. Please help!!
{"nbr_retweet": 0, "user_id": "226709485", "url": "/BenRinghofer/status/979429412826435594", "text": "I love it when I see a recently posted photo on @Instagram and open the app a few minutes later to try to see it again only to find the instalgorithm has banished it to the abyss in favor for a sports score that's two days old", "usernameTweet": "BenRinghofer", "datetime": "2018-03-30 02:45:43", "is_reply": false, "is_retweet": false, "ID": "979429412826435594", "nbr_reply": 0, "nbr_favorite": 5}
When I go to twitter.com at that URL, the time shown is 11:45 PM on 2018-03-29,
but in the local JSON it is 2018-03-30 02:45:43.
Could this be some cache conflict?
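One likely explanation (an assumption; the thread does not confirm it): the JSON stores the timestamp in UTC, while twitter.com displays it in the viewer's local timezone, so this is a timezone offset rather than a cache conflict. For example, in a UTC-3 timezone the two readings coincide:

```python
from datetime import datetime, timezone, timedelta

# The JSON value, interpreted as UTC (an assumption about the scraper's output)
utc_time = datetime(2018, 3, 30, 2, 45, 43, tzinfo=timezone.utc)

# Rendered in a UTC-3 timezone, as the website might show it to that viewer
local = utc_time.astimezone(timezone(timedelta(hours=-3)))
print(local.strftime("%Y-%m-%d %H:%M:%S"))  # 2018-03-29 23:45:43
```

23:45 on 03-29 is exactly the "afternoon 11:45" the website showed, which matches a 3-hour offset from the stored value.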
I'm trying to run the following query: scrapy crawl TweetScraper -a query = superhero since:2019-02-20
but I get this error: crawl: error: Invalid -a value, use -a NAME=VALUE
Any idea what is happening?
Thank you in advance. :)
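The error most likely comes from the spaces around `=`: the shell passes `query`, `=`, and `superhero` to Scrapy as separate arguments, so no single `NAME=VALUE` token ever arrives. Quoting the whole pair with no surrounding spaces, e.g. `-a query="superhero since:2019-02-20"`, should work. A simplified sketch of the validation (not Scrapy's actual code):

```python
def parse_spider_arg(token):
    """Mimic Scrapy's -a NAME=VALUE validation (a simplified sketch)."""
    name, sep, value = token.partition("=")
    if not sep or not name:
        raise ValueError("Invalid -a value, use -a NAME=VALUE")
    return name, value

# With spaces, the shell hands Scrapy the bare token "query":
#   parse_spider_arg("query")  -> raises ValueError
# Quoted, the whole pair arrives as one token:
#   parse_spider_arg('query=superhero since:2019-02-20')
#   -> ("query", "superhero since:2019-02-20")
```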
Hello, I'm trying to get tweets about bitcoin between certain dates, as in this example: scrapy crawl TweetScraper -a query="bitcoin since:2016-09-01 until:2018-09-01" -a top_tweet=True lang="en". Previously this returned some 76,000 files; now when I run it again only 17 files come back. Could you help me?