jonbakerfish / TweetScraper
TweetScraper is a simple crawler/spider for Twitter Search without using API
License: GNU General Public License v2.0
Hi, is it possible to write all the tweets into one file, preferably a JSON file, instead of creating separate files for every tweet?
Thanks!
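One way to get this, sketched below under the assumption that you add a custom pipeline to TweetScraper (the class name and file path here are illustrative, not from the repo), is to append every tweet as one JSON line to a single file:

```python
import json

class SaveToSingleFilePipeline(object):
    """Hypothetical pipeline: append every tweet as one JSON line to a single file."""

    def __init__(self, path='tweets.jsonl'):
        self.path = path

    def process_item(self, item, spider=None):
        # dict(item) works for Scrapy Items as well as plain dicts
        with open(self.path, 'a', encoding='utf-8') as f:
            f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```

Each line is then an independent JSON object, so the whole file can be read back with one json.loads per line.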
SQLite is a lightweight DB that doesn't require installation, so it could be a natural choice for many users.
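A minimal sketch of such a pipeline, assuming the item fields TweetScraper already saves (the table and column names here are illustrative):

```python
import sqlite3

class SaveToSQLitePipeline(object):
    """Illustrative pipeline: store tweets in a local SQLite file (no server needed)."""

    def __init__(self, db_path='tweets.db'):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS tweet (id TEXT PRIMARY KEY, text TEXT)'
        )

    def process_item(self, item, spider=None):
        # parameterized query: the driver handles quoting, so no escaping issues
        self.conn.execute(
            'INSERT OR REPLACE INTO tweet (id, text) VALUES (?, ?)',
            (item['ID'], item['text']),
        )
        self.conn.commit()
        return item
```

Using INSERT OR REPLACE also deduplicates on the tweet ID for free.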
The scraped tweet text has unwanted spaces:
Demonetisation instead of #Demonetisation
narendramodi instead of @narendramodi
And unwanted spaces in URLs in the tweet text:
www.business-standard.com/article/markets/demonetisation-is-another-proof-that-modi-is-reforming-india-chris-wood-116111400993_1.html
Fixing spaces in URLs with regex after scraping is a headache.
Excuse me, can you give me an example of the query and USER_AGENT settings?
If I want to get tweets containing the emoticon ":)" from April 2013 on Twitter, what should I set for the query and for USER_AGENT in the settings file?
When the SavetoMySQLPipeline class in pipelines.py tries to insert a record containing a single quote, an error is thrown and the record is rejected. This is because MySQL uses single quotes to begin and end a string value in its INSERT statement; to include a literal quote in a string you must write '' (two single quotes in a row). The bad INSERT statement is built on line 141 of pipelines.py. To fix this I updated the code on line 135:
text = item['text']
replacing it with the following, which doubles each single quote:
text = item['text'].replace("'", "''")
This could also be applied to the other variables there for completeness, but that wasn't necessary in my case.
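A more robust fix, sketched here under the assumption that the pipeline has a DB-API cursor available (the function and column names are illustrative): use a parameterized query so the driver escapes quotes itself and no manual replace is needed.

```python
def insert_tweet(cursor, item):
    """Insert a tweet using placeholders; the driver escapes quotes safely.

    For MySQLdb/pymysql the placeholder is %s; sqlite3 would use ? instead.
    """
    cursor.execute(
        "INSERT INTO tweet (id, text) VALUES (%s, %s)",
        (item['ID'], item['text']),
    )
```

This also protects against SQL injection, which string splicing does not.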
Thanks for publishing this!
I am conducting research on searching related tweets. I was wondering whether the code could be extended to return the corresponding coordinates and the mentioned users' IDs. Thank you very much!
I have noticed some odd behaviour with this otherwise great scraper.
query='monkey,coffee': runs with no problems.
query='monkey,#coffee': runs with no problems.
query='#monkey,#coffee': breaks.
query='#monkey,#coffee,#love,#gorilla': breaks.
query='monkey,#coffee,love,gorilla': breaks.
query='monkey,coffee,love,#gorilla': breaks.
query='#monkey,coffee,love,gorilla': breaks.
query='monkey,coffee,love,#gorilla': breaks.
query='monkey,coffee,love,gorilla': breaks.
query='monkey,coffee,love': runs with no problems.
query='monkey,coffee,#love': breaks.
query='#love': runs with no problems.
Tentative conclusions: there is probably some underlying thing that I am missing here...?
I run the command: scrapy crawl TweetScraper -a query="from: {UserId}"
It returns some tweets that don't even contain {UserId} at all, and not all of {UserId}'s tweets are returned.
I think this is a recent regression.
The expected behavior: the above command fetches all tweets with {"usernameTweet": "{UserId}", ...
I am having trouble crawling a user who I know has retweets: all the scraped tweets have is_retweet=False, which means retweets are not scraped. Is there a known fix for this?
I haven't checked whether it saves only 140 characters or 280, but I often get this error:
Data too long for column 'text' at row 1
and the tweet it refers to is not saved.
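Two possible workarounds, assuming a MySQL text column that is simply too narrow (the column name and size below are guesses): either widen the column, e.g. ALTER TABLE tweet MODIFY text VARCHAR(1024), or truncate in the pipeline before inserting so the row is kept rather than rejected. A sketch of the latter:

```python
MAX_TEXT_LEN = 1024  # assumed column width; match your actual schema

def clamp_text(text, limit=MAX_TEXT_LEN):
    """Truncate tweet text so it fits the DB column instead of failing the insert."""
    return text if len(text) <= limit else text[:limit]
```

Truncation loses data, so widening the column is usually the better fix.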
When I give the command scrapy crawl TweetScraper -a query=Trump, it throws an error.
ERROR: Error downloading <GET https://twitter.com/i/search/timeline?l=&f=tweets&q=Trump&src=typed&max_position=>
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
response = yield method(request=request, spider=spider)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/robotstxt.py", line 46, in process_request_2
if not rp.can_fetch(to_native_str(self._useragent), request.url):
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/python.py", line 129, in to_native_str
return to_bytes(text, encoding, errors)
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/python.py", line 119, in to_bytes
'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got list
2019-07-16 14:27:59 [scrapy.core.engine] INFO: Closing spider (finished)
Trying to use this but nothing is being saved - I even tried the simple
"scrapy crawl TweetScraper -a query=foo,#bar"
but that didn't save either
This program ran smoothly at first: I successfully crawled data from within Oct 2018.
Then I began crawling data from 2013-10-31 until 2018-10-31, storing it in a folder named "tweet". When I found that the folder had some problems, I exited the program.
After that I could not open my "tweet" folder, so I tried to move it to the trash. That took a long time, and in the meantime I tried to run the program again to get the Oct 2018 data, but no data files appeared in the new folder I created.
My log is below:
2018-11-16 19:15:49 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: TweetScraper)
2018-11-16 19:15:49 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.9.0, Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 6 2017, 12:04:38) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-18.0.0-x86_64-i386-64bit
2018-11-16 19:15:49 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'TweetScraper', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'TweetScraper.spiders', 'SPIDER_MODULES': ['TweetScraper.spiders'], 'USER_AGENT': '[email protected]'}
2018-11-16 19:15:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-11-16 19:15:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-16 19:15:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
Unhandled error in Deferred:
2018-11-16 19:15:49 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 171, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 175, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- ---
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 80, in crawl
self.engine = self._create_engine()
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 105, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/core/engine.py", line 70, in __init__
self.scraper = Scraper(crawler)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/core/scraper.py", line 71, in __init__
self.itemproc = itemproc_cls.from_crawler(crawler)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/Users/tangfuyu/anaconda3/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 674, in exec_module
File "<frozen importlib._bootstrap_external>", line 781, in get_code
File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
builtins.SyntaxError: Missing parentheses in call to 'print'. Did you mean print(print "Creating table...")? (pipelines.py, line 82)
2018-11-16 19:15:49 [twisted] CRITICAL:
Traceback (most recent call last):
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 80, in crawl
self.engine = self._create_engine()
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 105, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/core/engine.py", line 70, in __init__
self.scraper = Scraper(crawler)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/core/scraper.py", line 71, in __init__
self.itemproc = itemproc_cls.from_crawler(crawler)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/Users/tangfuyu/.local/lib/python3.6/site-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/Users/tangfuyu/anaconda3/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 674, in exec_module
File "<frozen importlib._bootstrap_external>", line 781, in get_code
File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/Users/tangfuyu/Desktop/Project/TweetScraper/TweetScraper/pipelines.py", line 82
print "Creating table..."
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(print "Creating table...")?
tangfuyudeMacBook-Air:TweetScraper tangfuyu$ scrapy crawl TweetScraper -a query="DJIA since:2018-11-12 until:2018-11-13"
2018-11-16 19:16:49 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: TweetScraper)
2018-11-16 19:16:49 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.9.0, Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 6 2017, 12:04:38) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-18.0.0-x86_64-i386-64bit
2018-11-16 19:16:49 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'TweetScraper', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'TweetScraper.spiders', 'SPIDER_MODULES': ['TweetScraper.spiders'], 'USER_AGENT': '[email protected]'}
2018-11-16 19:16:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-11-16 19:16:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-16 19:16:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-16 19:16:50 [scrapy.middleware] INFO: Enabled item pipelines:
['TweetScraper.pipelines.SaveToFilePipeline']
2018-11-16 19:16:50 [scrapy.core.engine] INFO: Spider opened
2018-11-16 19:16:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-16 19:17:08 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-16 19:17:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 18452,
'downloader/request_count': 20,
'downloader/request_method_count/GET': 20,
'downloader/response_bytes': 335803,
'downloader/response_count': 20,
'downloader/response_status_count/200': 20,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 16, 11, 17, 8, 390682),
'item_scraped_count': 380,
'log_count/INFO': 7,
'memusage/max': 55123968,
'memusage/startup': 55123968,
'request_depth_max': 20,
'response_received_count': 20,
'scheduler/dequeued': 20,
'scheduler/dequeued/memory': 20,
'scheduler/enqueued': 20,
'scheduler/enqueued/memory': 20,
'start_time': datetime.datetime(2018, 11, 16, 11, 16, 50, 69474)}
2018-11-16 19:17:08 [scrapy.core.engine] INFO: Spider closed (finished)
Does anyone know what the problem is? I really need help because I am doing a project and this data is crucial to me. If anyone can help me crawl this data and send it to me, I would appreciate it a lot!!
Hi there,
This is only scraping retweet data; it's not scraping tweets.
What's the issue?
Thanks
I was wondering whether the code can be updated to retrieve the list of the users who like the tweet, the users who retweet the tweet, and the users who are mentioned in the tweet. If someone can do that, please share the approach with us. Thank you very much! @MathiasDesch
I checked Twitter and found "https://twitter.com/search?l=&q=from%3Aelonmusk"; this should be the base URL.
After executing the commands for clone and install, the following error appears after the command scrapy list:
root@host:~/scrape/TweetScraper# scrapy list
Traceback (most recent call last):
File "/usr/bin/scrapy", line 4, in <module>
execute()
File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 143, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 150, in _run_command
cmd.run(args, opts)
File "/usr/lib/python2.7/dist-packages/scrapy/commands/list.py", line 13, in run
crawler = self.crawler_process.create_crawler()
File "/usr/lib/python2.7/dist-packages/scrapy/crawler.py", line 87, in create_crawler
self.crawlers[name] = Crawler(self.settings)
File "/usr/lib/python2.7/dist-packages/scrapy/crawler.py", line 25, in __init__
self.spiders = spman_cls.from_crawler(self)
File "/usr/lib/python2.7/dist-packages/scrapy/spidermanager.py", line 35, in from_crawler
sm = cls.from_settings(crawler.settings)
File "/usr/lib/python2.7/dist-packages/scrapy/spidermanager.py", line 31, in from_settings
return cls(settings.getlist('SPIDER_MODULES'))
File "/usr/lib/python2.7/dist-packages/scrapy/spidermanager.py", line 22, in __init__
for module in walk_modules(name):
File "/usr/lib/python2.7/dist-packages/scrapy/utils/misc.py", line 68, in walk_modules
submod = import_module(fullpath)
File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/root/scrape/TweetScraper/TweetScraper/spiders/TweetCrawler.py", line 1, in <module>
from scrapy.linkextractors.sgml import SgmlLinkExtractor
ImportError: No module named linkextractors.sgml
Hi,
I would like to thank you first for your awesome job crawling Twitter without needing their API credentials. I would just like to tell you that, for Python 3 users of your code, you need to change:
print "Inserting..."
to:
print("Inserting...")
Otherwise it will raise an error at launch.
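If the code should keep working on both Python 2 and 3, a future import makes the parenthesized form valid on both (a small sketch, not code from the repo):

```python
from __future__ import print_function  # makes print a function on Python 2 too

print("Inserting...")
```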
Have a nice day,
Bastien.
Good day, this script has been working great. I haven't had time to troubleshoot it yet, but here is the output I am seeing currently (05/21/18 11:36 AM PST). Notice the scrapy error midway down.
Example: scrapy crawl TweetScraper -a query="Trump"
2018-05-21 11:35:09 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: TweetScraper)
2018-05-21 11:35:10 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.2 (v3.6.2:5fd33b5926, Jul 16 2017, 20:11:06) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g 2 Nov 2017), cryptography 2.1.4, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-05-21 11:35:10 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'TweetScraper', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'TweetScraper.spiders', 'SPIDER_MODULES': ['TweetScraper.spiders'], 'USER_AGENT': '[email protected]'}
2018-05-21 11:35:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-05-21 11:35:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-21 11:35:10 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-21 11:35:10 [scrapy.middleware] INFO: Enabled item pipelines:
['TweetScraper.pipelines.SaveToFilePipeline']
2018-05-21 11:35:10 [scrapy.core.engine] INFO: Spider opened
2018-05-21 11:35:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-21 11:35:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://mobile.twitter.com/i/search/timeline?l=&f=tweets&q=Trump&src=typed&max_position=>: HTTP status code is not handled or not allowed
2018-05-21 11:35:10 [scrapy.core.engine] INFO: Closing spider (finished)
2018-05-21 11:35:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 619,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 3286,
'downloader/response_count': 2,
'downloader/response_status_count/302': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 5, 21, 18, 35, 10, 574800),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/404': 1,
'log_count/INFO': 8,
'memusage/max': 68751360,
'memusage/startup': 68751360,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 5, 21, 18, 35, 10, 227440)}
2018-05-21 11:35:10 [scrapy.core.engine] INFO: Spider closed (finished)
Dear developer,
I found this in the setting.py: 'TweetScraper.pipelines.SaveToFilePipeline':100,
I understand that it means we can choose to save to files or to a database, but what does that "100" mean? Can you please give me a hint? Thanks.
Regards
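For context (this comes from Scrapy's general settings conventions, not from this repo's docs): the integer is the pipeline's order value. Scrapy runs all enabled item pipelines in ascending order of these values, conventionally chosen in the 0 to 1000 range, so the number only matters relative to other pipelines:

```python
# settings.py fragment; the commented second entry is illustrative
ITEM_PIPELINES = {
    'TweetScraper.pipelines.SaveToFilePipeline': 100,   # lower value runs first
    # 'TweetScraper.pipelines.SavetoMySQLPipeline': 200,  # would run after
}
```

With a single pipeline enabled, any value in range behaves the same.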
While I can see the requests TweetScraper processes, it isn't able to save the output in ./Data/tweet.
I get this error message when TweetScraper processes the query:
Traceback (most recent call last):
File "/Users/me/Library/Python/2.7/lib/python/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/me/my/ts/dir/TweetScraper/TweetScraper/pipelines.py", line 187, in process_item
self.save_to_file(item,savePath)
File "/Users/me/my/ts/dir/TweetScraper/TweetScraper/pipelines.py", line 210, in save_to_file
with open(fname,'w', encoding='utf-8') as f:
TypeError: 'encoding' is an invalid keyword argument for this function
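This TypeError is a Python 2 limitation: the built-in open() there does not accept an encoding argument. A sketch of a fix that works on both Python 2 and 3 (assuming pipelines.py is otherwise unchanged) is to use io.open instead:

```python
import io

def save_text(fname, text):
    # io.open accepts encoding= on Python 2 as well; on Python 3 it is
    # the same function as the built-in open()
    with io.open(fname, 'w', encoding='utf-8') as f:
        f.write(text)
```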
Hi,
When I start crawling, it continuously creates error logs starting with the words "Error tweet". As a result, it cannot scrape any tweet successfully. My log level is ERROR. What could be the problem? Has there been a major change in the general structure of tweets that causes errors during the parsing operation?
[TweetScraper.spiders.TweetCrawler] ERROR: Error tweet: <div class="tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable dismissible-content
original-tweet js-original-tweet
has-cards has-content
a-tweet-id="748650033210265600" data-item-id="748650033210265600" data-permalink-path="/rmc_lee/status/748650033210265600" data-conversation-id="748650033210265600"
tweet-nonce="748650033210265600-93347977-5360-4fdc-b571-8c945f28a106" data-tweet-stat-initialized="true" data-screen-name="rmc_lee" data-name="Jonathan" data-user-id
0161618" data-you-follow="false" data-follows-you="false" data-you-block="false" data-reply-to-users-json='[{"id_str":"1030161618","screen_name":"rmc_lee","name":"Jo
n","emojified_name":{"text":"Jonathan","emojified_text_as_html":"Jonathan"}}]' data-disclosure-type="" data-has-cards="true" data-component-context="tweet">
div class="context">
/div>
div class="content">
...
Is it possible to use user ID as parameter? e.g. 789944936036655108
<=> @cloudpie_
When the displayed URL is biorxiv.org/content and its tooltip is https://www.biorxiv.org/content, the scraped version has a space: https://www. biorxiv.org/content.
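A narrow post-processing workaround, assuming the stray space always appears right after the elided "www." prefix as in the example above (more general URL mangling would need more care):

```python
import re

def fix_url_spaces(text):
    """Remove whitespace that was injected right after 'www.' in scraped text."""
    return re.sub(r'(www\.)\s+', r'\1', text)
```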
Hi,
I tried to run the command scrapy crawl TweetScraper -a query=bitcoin since: 2017-01-01 until: 2017-01-01 to retrieve tweets, but I got the following error: crawl: error: running 'scrapy crawl' with more than one spider is no longer supported.
What's more, when I don't specify the date, it retrieves too many tweets in one day. Can I set a limit on the tweets for every single day, e.g. 1000 tweets per day?
Thanks!
Hi,
when I run scrapy crawl TweetScraper -a query=foo,#bar, I got this error
AttributeError: module 'OpenSSL.SSL' has no attribute 'OP_SINGLE_ECDH_USE'
Can you tell me how to solve it?
Thanks
Traceback (most recent call last):
File "//anaconda/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "//anaconda/lib/python3.5/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "//anaconda/lib/python3.5/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
result = f(*args, **kw)
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
return handler.download_request(request, spider)
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 63, in download_request
return agent.download_request(request)
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 300, in download_request
method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
File "//anaconda/lib/python3.5/site-packages/twisted/web/client.py", line 1649, in request
endpoint = self._getEndpoint(parsedURI)
File "//anaconda/lib/python3.5/site-packages/twisted/web/client.py", line 1633, in _getEndpoint
return self._endpointFactory.endpointForURI(uri)
File "//anaconda/lib/python3.5/site-packages/twisted/web/client.py", line 1510, in endpointForURI
uri.port)
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/contextfactory.py", line 59, in creatorForNetloc
return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext())
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/contextfactory.py", line 56, in getContext
return self.getCertificateOptions().getContext()
File "//anaconda/lib/python3.5/site-packages/scrapy/core/downloader/contextfactory.py", line 51, in getCertificateOptions
acceptableCiphers=DEFAULT_CIPHERS)
File "//anaconda/lib/python3.5/site-packages/twisted/python/deprecate.py", line 792, in wrapped
return wrappee(*args, **kwargs)
File "//anaconda/lib/python3.5/site-packages/twisted/internet/_sslverify.py", line 1583, in __init__
self._options |= SSL.OP_SINGLE_DH_USE | SSL.OP_SINGLE_ECDH_USE
AttributeError: module 'OpenSSL.SSL' has no attribute 'OP_SINGLE_ECDH_USE'
When my query contains two keywords, the error will occur.
I'm not able to understand the base URL. What's the idea behind it?
How can I use the lang part in the query?
I've tried every way I can think of, but it doesn't work. Most combinations don't do anything, and with this one it looks like it's working but it doesn't scrape anything:
scrapy crawl TweetScraper -a query="@foo -a lang['en']"
Thanks in advance
Hi!
There is no module scrapy.conf
scrapy.conf.settings is deprecated since 2014 per scrapy/scrapy@993b543#diff-5fa56a6cbc91c5af447e69c97083d78e
I was able to fix the bug by following this: https://stackoverflow.com/questions/54717251/scrapydeprecationwarning-module-scrapy-conf-is-deprecated-use-crawler-setti
from scrapy.utils.project import get_project_settings
SETTINGS = get_project_settings()
Thanks for sharing your scraper with a free software license.
So I've been hacking on this the past few days, this project has been very helpful. I've needed to collect large numbers of tweets and users, and this has been the only feasible tool.
However, when scraping large numbers of objects, having one file per tweet/user is a serious problem.
It wastes a huge amount of space. I had 230k files that averaged a bit under 390 bytes each, however, each file occupied the block size of my drive- 4kb. So the disk usage for me is >10x the actual size of the data.
It breaks many of the tools that someone might use to fix this: cat, ls, find, etc. all break. My graphical file manager hung. I eventually figured out a workaround, in case anyone ends up needing it, but this really shouldn't happen:
ls | xargs cat > output
It's very unexpected and unpleasant behavior. I've put together a solution that saves tweets and users to their respective configured paths as newline-delimited JSON. The files are opened with mode a+ so they can be closed properly and flushed between saves, and they won't truncate an existing file.
For the issue of duplication, I'm just holding a set of user/tweet IDs in memory. At 64 bits each, even 100 million IDs cost less than 1 GB of memory. The only issue I have left is finding a way to persist the scraped IDs across sessions without having to read back the full amount of data saved previously. I'll want to polish it and make sure all is well, but I'd be happy to make a PR and discuss.
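One way to persist just the IDs across sessions, sketched here as a standalone idea rather than code from the PR: append each new ID to a compact sidecar file and reload that file at startup, so the full JSON data never needs to be re-read.

```python
import os

class SeenIDs(object):
    """Dedup set persisted as one ID per line in a small sidecar file."""

    def __init__(self, path):
        self.path = path
        self.ids = set()
        if os.path.exists(path):
            with open(path) as f:
                self.ids = set(line.strip() for line in f if line.strip())
        # append mode: never truncates what earlier sessions recorded
        self._out = open(path, 'a')

    def add(self, item_id):
        """Return True if the ID is new (and record it), False if already seen."""
        if item_id in self.ids:
            return False
        self.ids.add(item_id)
        self._out.write(item_id + '\n')
        self._out.flush()
        return True
```

The sidecar file stays tiny relative to the data (around 20 bytes per tweet), so startup reload is cheap even for millions of IDs.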
Thanks!
Hello,
I was able to install TweetScraper, but when I prompted the command scrapy list I received a "command not found" message. I'm going to reinstall, but I am doing all of this on an Ubuntu VM. Frustrating.
From the items.py it seems that there are various items (or tweet attributes) available that are not written to json files:
has_image = Field() # True/False, whether a tweet contains images
images = Field() # a list of image urls, empty if none
has_video = Field() # True/False, whether a tweet contains videos
videos = Field() # a list of video urls
has_media = Field() # True/False, whether a tweet contains media (e.g. summary)
medias = Field() # a list of media
How can I also write these to the json files of the tweets?
Dear developer,
I found an issue yesterday:
All the crawled dates and times are the same, while this didn't occur a week ago.
Is this due to a limitation of the Twitter API, or has Twitter changed the page being crawled?
I am quite stressed, as I need to crawl this data soon for further analysis and my dissertation deadline is coming ><. Can you help me figure out this problem?
Many Thanks
Jason
This doesn't seem to scrape all of the tweets when using date filters...
Is it only scraping "Top" tweets? If so, is there any way I can change it to "All"?
Thank you!
Zach
I run TweetScraper on Python 2.7 and dump the tweets to the filesystem. I noticed that the tweet texts contain no emoji codes. It seems that TweetCrawler.py does not extract emojis from the tweet texts. What am I doing wrong?
Hello, this project has been a great help for me, but I have a problem with this command:
scrapy crawl TweetScraper -a query=from:detikcom,since:2017-10-01,until:2017-12-30 -a crawl_user=True -a top_tweet=True
It returns no scraped data, but when I search for from:detikcom since:2017-10-01 until:2017-12-30 in Twitter's advanced search, it returns some. I really need help for my study. Thanks!
Fix included for the above (#14)
It looks like TweetScraper, after installing all requirements in a Python 2.7 virtual environment, is still missing configparser.
Should we add it to the requirements file?
(TweetScraper) elkcloner@debian:~/git/DocTocToc/TweetScraper/TweetScraper$ scrapy list
Traceback (most recent call last):
File "/home/elkcloner/.virtualenvs/TweetScraper/bin/scrapy", line 6, in <module>
from scrapy.cmdline import execute
File "/home/elkcloner/.virtualenvs/TweetScraper/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 10, in <module>
from scrapy.crawler import CrawlerProcess
File "/home/elkcloner/.virtualenvs/TweetScraper/local/lib/python2.7/site-packages/scrapy/crawler.py", line 11, in <module>
from scrapy.core.engine import ExecutionEngine
File "/home/elkcloner/.virtualenvs/TweetScraper/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 14, in <module>
from scrapy.core.scraper import Scraper
File "/home/elkcloner/.virtualenvs/TweetScraper/local/lib/python2.7/site-packages/scrapy/core/scraper.py", line 18, in <module>
from scrapy.core.spidermw import SpiderMiddlewareManager
File "/home/elkcloner/.virtualenvs/TweetScraper/local/lib/python2.7/site-packages/scrapy/core/spidermw.py", line 13, in <module>
from scrapy.utils.conf import build_component_list
File "/home/elkcloner/.virtualenvs/TweetScraper/local/lib/python2.7/site-packages/scrapy/utils/conf.py", line 4, in <module>
import configparser
ImportError: No module named configparser
(TweetScraper) elkcloner@debian:~/git/DocTocToc/TweetScraper/TweetScraper$ pip install configparser
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Collecting configparser
Using cached https://files.pythonhosted.org/packages/ba/05/6c96328e92e625fc31445d24d75a2c92ef9ba34fc5b037fe69693c362a0d/configparser-3.7.4-py2.py3-none-any.whl
Installing collected packages: configparser
Successfully installed configparser-3.7.4
(TweetScraper) elkcloner@debian:~/git/DocTocToc/TweetScraper/TweetScraper$ scrapy list
TweetScraper
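Adding `configparser` to the requirements file would fix it; alternatively, a standard Python 2/3 import shim (a generic workaround, not TweetScraper-specific) avoids the hard dependency:

```python
# Prefer the Python 3 module name; fall back to the Python 2 stdlib spelling.
try:
    import configparser  # Python 3 stdlib, or the PyPI backport on Python 2
except ImportError:
    import ConfigParser as configparser  # Python 2 stdlib name

# Either way, the same API is now available under one name:
parser = configparser.ConfigParser()
```

Scrapy itself dropped this shim once it required the backport, which is why a bare Python 2.7 virtualenv hits the ImportError.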
Great utility. Is it possible to collect the identity of the user each reply is replying to?
Hi, before I post the issue I would like to say that your work on this project is superb!
As for the issue itself, I have used your scraper for the last 2-3 weeks and everything was working like a charm. But when I started it yesterday, it began throwing this exception. I tried reinstalling the dependencies and cloning the repo again, but nothing works. If you have any idea how to solve this, thanks in advance.
Here is the full output of my run:
scrapy crawl TweetScraper -a query='near:Chicago within:1mi since:2016-01-01
until:2016-12-31' -a lang='en'
2019-01-23 16:13:43 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: TweetScraper)
2019-01-23 16:13:43 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.17134-SP0
2019-01-23 16:13:43 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'TweetScraper', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'TweetScraper.spiders', 'SPIDER_MODULES': ['TweetScraper.spiders'], 'USER_AGENT': 'XXXXXX'}
2019-01-23 16:13:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-01-23 16:13:43 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-01-23 16:13:43 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
Unhandled error in Deferred:
2019-01-23 16:14:15 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\crawler.py", line 171, in crawl
return self._crawl(crawler, *args, **kwargs)
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\crawler.py", line 175, in _crawl
d = crawler.crawl(*args, **kwargs)
File "C:\Users\Tima\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "C:\Users\Tima\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- ---
File "C:\Users\Tima\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\crawler.py", line 80, in crawl
self.engine = self._create_engine()
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\crawler.py", line 105, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\core\engine.py", line 70, in init
self.scraper = Scraper(crawler)
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\core\scraper.py", line 71, in init
self.itemproc = itemproc_cls.from_crawler(crawler)
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "C:\Users\Tima\Anaconda3\lib\site-packages\scrapy\middleware.py", line 40, in from_settings
mw = mwcls()
File "C:\Users\Tima\Desktop\SIAP\TweetScraper\TweetScraper\pipelines.py", line 27, in init
self.tweetCollection.ensure_index([('ID', pymongo.ASCENDING)], unique=True, dropDups=True)
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\collection.py", line 1991, in ensure_index
self.__create_index(keys, kwargs, session=None)
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\collection.py", line 1847, in __create_index
with self._socket_for_writes() as sock_info:
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\collection.py", line 196, in _socket_for_writes
return self.__database.client._socket_for_writes()
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\mongo_client.py", line 1085, in _socket_for_writes
server = self._get_topology().select_server(writable_server_selector)
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\topology.py", line 224, in select_server
address))
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\topology.py", line 183, in select_servers
selector, server_timeout, address)
File "C:\Users\Tima\Anaconda3\lib\site-packages\pymongo\topology.py", line 199, in _select_servers_loop
self._error_message(selector))
pymongo.errors.ServerSelectionTimeoutError: connection closed,connection closed,connection closed
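`ServerSelectionTimeoutError: connection closed` usually means pymongo cannot reach a running `mongod` at the configured address, rather than anything in the scraper itself. Before digging into TweetScraper, a quick TCP probe can confirm whether anything is listening (localhost:27017 is pymongo's default and an assumption here):

```python
import socket

def mongo_reachable(host="localhost", port=27017, timeout=2.0):
    """Quick TCP probe: can we even open a socket to mongod?
    host/port are pymongo's defaults; adjust to match settings.py."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not mongo_reachable():
    print("mongod is not reachable - start MongoDB or fix the host/port")
```

If the probe fails, starting the MongoDB service (or correcting the connection string in the pipeline settings) is the likely fix.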
I ran the following command:
scrapy crawl TweetScraper -a query="foo, #bar"
The tweets are saved to file, but after commenting out the SaveToMongoPipeline method I get the following error:
[TweetScraper.pipelines] INFO: Item type is not recognized! type = <class 'NoneType'>
What could be the cause?
Hello, I am trying to scrape ALL the tweets of a certain account, but the spider closes as "finished" before completing. Any idea what is going on? The user is: REDACTED. I've tried both the from: operator and just searching the username.
Some accounts will have a "Caution: This profile may include potentially sensitive content" warning.
After executing the command scrapy crawl TweetScraper -a query=example, it shows an error that no module named sgmllib is found. I have installed the latest version of Scrapy, and they say that sgmllib is deprecated. Please help!!
{"nbr_retweet": 0, "user_id": "226709485", "url": "/BenRinghofer/status/979429412826435594", "text": "I love it when I see a recently posted photo on @Instagram and open the app a few minutes later to try to see it again only to find the instalgorithm has banished it to the abyss in favor for a sports score that's two days old", "usernameTweet": "BenRinghofer", "datetime": "2018-03-30 02:45:43", "is_reply": false, "is_retweet": false, "ID": "979429412826435594", "nbr_reply": 0, "nbr_favorite": 5}
When I go to twitter.com at that URL, the time shown is 11:45 PM on 2018-03-29,
but in the local JSON it is 2018-03-30 02:45:43.
Could this be some cache conflict?
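One likely explanation (an assumption; the thread does not confirm it): the JSON stores the timestamp in UTC, while twitter.com displays it in the viewer's local timezone, so this is a timezone offset rather than a cache conflict. For example, in a UTC-3 timezone the two readings coincide:

```python
from datetime import datetime, timezone, timedelta

# The JSON value, interpreted as UTC (an assumption about the scraper's output)
utc_time = datetime(2018, 3, 30, 2, 45, 43, tzinfo=timezone.utc)

# Rendered in a UTC-3 timezone, as the website might show it to that viewer
local = utc_time.astimezone(timezone(timedelta(hours=-3)))
print(local.strftime("%Y-%m-%d %H:%M:%S"))  # 2018-03-29 23:45:43
```

23:45 on 03-29 is exactly the "afternoon 11:45" the website showed, which matches a 3-hour offset from the stored value.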
I'm trying to run the following query: scrapy crawl TweetScraper -a query = superhero since:2019-02-20
but I get this error: crawl: error: Invalid -a value, use -a NAME=VALUE
Any idea what is happening?
Thank you in advance. :)
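The error most likely comes from the spaces around `=`: the shell passes `query`, `=`, and `superhero` to Scrapy as separate arguments, so no single `NAME=VALUE` token ever arrives. Quoting the whole pair with no surrounding spaces, e.g. `-a query="superhero since:2019-02-20"`, should work. A simplified sketch of the validation (not Scrapy's actual code):

```python
def parse_spider_arg(token):
    """Mimic Scrapy's -a NAME=VALUE validation (a simplified sketch)."""
    name, sep, value = token.partition("=")
    if not sep or not name:
        raise ValueError("Invalid -a value, use -a NAME=VALUE")
    return name, value

# With spaces, the shell hands Scrapy the bare token "query":
#   parse_spider_arg("query")  -> raises ValueError
# Quoted, the whole pair arrives as one token:
#   parse_spider_arg('query=superhero since:2019-02-20')
#   -> ("query", "superhero since:2019-02-20")
```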
Hello, I'm trying to get tweets about bitcoin between certain dates, as in this example: scrapy crawl TweetScraper -a query="bitcoin since:2016-09-01 until:2018-09-01" -a top_tweet=True lang="en". Previously this returned some 76,000 files; now when I run it again only 17 files come back. Could you help me?