Comments (7)
Did you have a chance to check if these reviews definitely have the missing data in the HTML?
from steam-scraper.
Not sure, but I saw that it was giving incomplete info in the DEBUG output in the console.
Where can I see the HTML code I get from the responses?
I did check these reviews in the browser and in individual Scrapy runs.
After 15 hours I had approx. 430 thousand reviews, but after 23 hours only 512 thousand, so it seems like it slowed down the longer it ran.
I closed the spider after 23 hours of scraping.
2018-04-23 15:35 [scrapy.extensions.feedexport] INFO: Stored jl feed (512038 items) in: testreviews.jl
2018-04-23 15:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 77937165,
'downloader/request_count': 59933,
'downloader/request_method_count/GET': 59933,
'downloader/response_bytes': 246292536,
'downloader/response_count': 59933,
'downloader/response_status_count/200': 59454,
'downloader/response_status_count/302': 473,
'downloader/response_status_count/504': 6,
'dupefilter/filtered': 101,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2018, 4, 23, 13, 35, 20, 146443),
'httpcache/firsthand': 59927,
'httpcache/hit': 6,
'httpcache/miss': 59927,
'httpcache/store': 59454,
'httpcache/uncacheable': 473,
'httperror/response_ignored_count': 2,
'httperror/response_ignored_status_count/504': 2,
'item_scraped_count': 512038,
'log_count/DEBUG': 574891,
'log_count/ERROR': 2,
'log_count/INFO': 1265,
'request_depth_max': 2749,
'response_received_count': 59456,
'retry/count': 4,
'retry/max_reached': 2,
'retry/reason_count/504 Gateway Time-out': 4,
'scheduler/dequeued': 59930,
'scheduler/dequeued/disk': 59930,
'scheduler/enqueued': 59948,
'scheduler/enqueued/disk': 59948,
'spider_exceptions/IndexError': 2,
'start_time': datetime.datetime(2018, 4, 22, 16, 40, 42, 767437)}
2018-04-23 15:35:20 [scrapy.core.engine] INFO: Spider closed (shutdown)
257513 successful; the rest failed, missing "recommended". So approx. half of the rows failed. There were fewer failed rows in the beginning and more later on in the file.
Tried changing USER_AGENT in settings to 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'
after seeing it here https://stackoverflow.com/questions/33851754/scrapy-misses-some-html-elements
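For reference, the change goes in the project's settings.py; this just restates the string above as a settings fragment (the experiment was later reverted, see below):

```python
# settings.py -- experimental USER_AGENT change, later reverted.
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/46.0.2490.80 Safari/537.36')
```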
Started getting these error messages instead, and then ONLY reviews with missing info, so only failed responses.
2018-04-23 16:09:01 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
self.crawler_process.start()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\crawler.py", line 291, in start
reactor.run(installSignalHandlers=False) # blocking call
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\twisted\internet\base.py", line 1243, in run
self.mainLoop()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\engine.py", line 122, in _next_request
if not self._next_request_from_scheduler(spider):
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\engine.py", line 149, in _next_request_from_scheduler
request = slot.scheduler.next_request()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\scheduler.py", line 71, in next_request
request = self._dqpop()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\scheduler.py", line 106, in _dqpop
d = self.dqs.pop()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\queuelib\pqueue.py", line 43, in pop
m = q.pop()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\squeues.py", line 19, in pop
s = super(SerializableQueue, self).pop()
File "C:\Users\Rami\steam-scraper\env\lib\site-packages\queuelib\queue.py", line 162, in pop
self.f.seek(-size-self.SIZE_SIZE, os.SEEK_END)
builtins.OSError: [Errno 22] Invalid argument
Changed it back to 'Steam Scraper' and tested with test_urls.txt, and now I'm getting the same error above over and over again until the scraping aborts. Still testing around; I might reinstall the env to test again.
Seems like some file has been corrupted. Found a similar problem in scrapy/scrapy#845. Going to try restoring/resetting the whole project.
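The traceback above fails inside queuelib's on-disk request queue, which Scrapy only uses when the crawl is run with JOBDIR persistence (the scheduler/dequeued/disk stats suggest it was). One way to reset without reinstalling the whole project is to delete the stale job directory; the path crawls/reviews below is a hypothetical example, adjust it to whatever JOBDIR was actually used (on Windows, deleting the folder in Explorer does the same):

```shell
# Discard the corrupted on-disk request queue (requests.queue lives inside
# the JOBDIR); the next crawl then starts with a fresh scheduler state.
rm -rf crawls/reviews
```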
OK, so it seems there's no problem with the actual scraper; it's more a problem with the HTML pages.
I managed to get an HTML page from the missing/damaged (let's just call them BAD) reviews during the console/debug run.
So the reason I'm getting multiple BAD reviews in a row is that they all originate from the same HTML page, except they get a different page order.
Example:
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
'page': 141,
'page_order': 5,
'product_id': '304050',
'user_id': 123377508,
'username': 'Chaoz Designz'}
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
'page': 141,
'page_order': 6,
'product_id': '304050',
'user_id': 123377508,
'username': 'Aynat | twitch.tv/aynatanya <3'}
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
'page': 141,
'page_order': 7,
'product_id': '304050',
'user_id': 123377508,
'username': 'Evilagician'}
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
'page': 141,
'page_order': 8,
'product_id': '304050',
'user_id': 123377508,
'username': 'Evilagician'}
And it goes on to page order 25, where it stops. Other examples vary in how many page orders they go on for. I haven't found out why yet.
Now I checked the HTML code, and sure enough it's missing all the important review info, because it's not a review page at all. It's some news/announcement page or something similar without any reviews, but it contains (maybe) users that own the game or have commented on something, so the scraper picks up names, user IDs, and early-access status from the HTML, plus all the other info it gets without HTML scraping.
So the problem is that the scraper is scraping non-review pages.
I don't know why yet and have no idea how to solve it. Any tips/thoughts?
EDIT: Noticed that it's always links starting with
https://steamcommunity.com/app/304050/homecontent/?announcementsoffset
when it's supposed to be
https://steamcommunity.com/app/304050/homecontent/?userreviewsoffset
EDIT2 PS: Made the scraper faster by changing all the URLs in urls.txt from HTTP to HTTPS, since it was redirecting for every link. I will look into implementing this in the code later.
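A minimal sketch of that EDIT2 workaround as a stand-alone helper (not part of the repo): upgrade plain-HTTP start URLs to HTTPS so Scrapy doesn't spend a 302 redirect on every entry in urls.txt.

```python
def force_https(url: str) -> str:
    """Rewrite an http:// URL to https://, leaving other URLs untouched."""
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return url

# Applied line by line to urls.txt before starting the crawl:
print(force_https("http://steamcommunity.com/app/304050"))
# https://steamcommunity.com/app/304050
```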
EDIT3: I'm guessing the bug/problem lies in review_spider.py, somewhere between lines 104 and 131; specifically line 114: form = response.xpath('//form[contains(@id, "MoreContentForm")]')
EDIT4:
Line 114, form = response.xpath('//form[contains(@id, "MoreContentForm")]'), selects the following:
<form method="GET" id="MoreContentForm1" name="MoreContentForm1" action="https://steamcommunity.com/app/9010/homecontent/">
<input type="hidden" name="userreviewsoffset" value="10"><input type="hidden" name="p" value="2"><input type="hidden" name="workshopitemspage" value="2"><input type="hidden" name="readytouseitemspage" value="2"><input type="hidden" name="mtxitemspage" value="2"><input type="hidden" name="itemspage" value="2"><input type="hidden" name="screenshotspage" value="2"><input type="hidden" name="videospage" value="2"><input type="hidden" name="artpage" value="2"><input type="hidden" name="allguidepage" value="2"><input type="hidden" name="webguidepage" value="2"><input type="hidden" name="integratedguidepage" value="2"><input type="hidden" name="discussionspage" value="2"><input type="hidden" name="numperpage" value="10"><input type="hidden" name="browsefilter" value="mostrecent">
Where name="userreviewsoffset" is sometimes "announcementsoffset" instead.
Does that mean there are no more reviews? Is there a way to skip announcement/news pages and make sure it's a "userreviewsoffset" page? Is there a way to ignore/not scrape these pages and just move on to the next page if there are still reviews to be scraped?
Questions! I will keep looking to see if I can fix this.
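One possible answer to the "skip announcement pages" question, sketched as a stand-alone helper rather than actual repo code: the BAD continuation URLs in the debug log above carry an announcementsoffset query parameter that real review pages lack, so a spider could check the next-page URL before yielding the request.

```python
from urllib.parse import urlparse, parse_qs

def is_review_continuation(url: str) -> bool:
    """Return True if a homecontent continuation URL looks like a review page.

    Per the observation in the EDITs above, the BAD pages carry an
    'announcementsoffset' query parameter that the good pages lack.
    """
    params = parse_qs(urlparse(url).query)
    return "announcementsoffset" not in params

# The two URL shapes seen in the debug log:
good = "https://steamcommunity.com/app/304050/homecontent/?userreviewsoffset=61&p=141&numperpage=5"
bad = ("https://steamcommunity.com/app/304050/homecontent/"
       "?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141")
print(is_review_continuation(good), is_review_continuation(bad))  # True False
```

In the spider, gating the next-page request on this check would drop the announcement pages while still following genuine review pagination.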
@RamiJabor I just successfully scraped all of the reviews for the game you are getting bad data from with the following command:
scrapy crawl reviews -a steam_id=304050 -o 304050.jl --loglevel INFO
I was getting about 60-65 pages per minute (~600-700 reviews per minute), which is roughly in line with what you reported for your first 15 hours.
It looks like you've either hit Steam's servers too much, encountered a glitch of some kind, or the behavior you're reporting is not present in the US store.
Yes, I did the same. I tested the games I was getting bad reviews from individually and got good results.
I tested increasing the number of URLs in the urls.txt file, and at 9 or 10 URLs it starts to give bad data from announcement pages. I've tried setting CONCURRENT_REQUESTS=1; still the same...
I don't think it has to do with the Steam server, but frankly I have no idea what it could be anymore.
I would appreciate it if anyone could try scraping from a text file with over 15 URLs so I know whether I'm the only one having this problem.
@RamiJabor Since this issue isn't related to the scraper code, I'm going to close it. In the meantime, consider slowing down the scraper from the IP you're using.
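A hedged sketch of the "slow down" suggestion: Scrapy's built-in AutoThrottle extension and DOWNLOAD_DELAY can both be set in settings.py; the values below are illustrative, not taken from the repo.

```python
# settings.py -- throttle the crawl; numbers are illustrative starting points.
AUTOTHROTTLE_ENABLED = True      # adapt delay to server latency
AUTOTHROTTLE_START_DELAY = 1.0   # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0    # back off up to this delay
DOWNLOAD_DELAY = 0.5             # minimum delay between requests
```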