
Comments (7)

prncc commented on July 29, 2024

Did you have a chance to check if these reviews definitely have the missing data in the HTML?

from steam-scraper.

RamiJabor commented on July 29, 2024

Not sure, but I saw that it was giving incomplete info in the DEBUG output in the console.
Where can I see the HTML I get from the responses?
I did check these reviews in the browser and in individual scrapy runs.

After 15 hours I had approximately 430 thousand reviews, but after 23 hours only 512 thousand, so it seems like it slowed down the longer it ran.

Closed the spider after 23 hours of scraping

2018-04-23 15:35 [scrapy.extensions.feedexport] INFO: Stored jl feed (512038 items) in: testreviews.jl
2018-04-23 15:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 77937165,
 'downloader/request_count': 59933,
 'downloader/request_method_count/GET': 59933,
 'downloader/response_bytes': 246292536,
 'downloader/response_count': 59933,
 'downloader/response_status_count/200': 59454,
 'downloader/response_status_count/302': 473,
 'downloader/response_status_count/504': 6,
 'dupefilter/filtered': 101,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2018, 4, 23, 13, 35, 20, 146443),
 'httpcache/firsthand': 59927,
 'httpcache/hit': 6,
 'httpcache/miss': 59927,
 'httpcache/store': 59454,
 'httpcache/uncacheable': 473,
 'httperror/response_ignored_count': 2,
 'httperror/response_ignored_status_count/504': 2,
 'item_scraped_count': 512038,
 'log_count/DEBUG': 574891,
 'log_count/ERROR': 2,
 'log_count/INFO': 1265,
 'request_depth_max': 2749,
 'response_received_count': 59456,
 'retry/count': 4,
 'retry/max_reached': 2,
 'retry/reason_count/504 Gateway Time-out': 4,
 'scheduler/dequeued': 59930,
 'scheduler/dequeued/disk': 59930,
 'scheduler/enqueued': 59948,
 'scheduler/enqueued/disk': 59948,
 'spider_exceptions/IndexError': 2,
 'start_time': datetime.datetime(2018, 4, 22, 16, 40, 42, 767437)}
2018-04-23 15:35:20 [scrapy.core.engine] INFO: Spider closed (shutdown)

257,513 rows were successful; the rest failed, missing "recommended". So approximately half of the rows failed, with fewer failed rows at the beginning and more later on in the file.

Tried changing the 'USER_AGENT' in settings to 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'
after seeing it suggested here: https://stackoverflow.com/questions/33851754/scrapy-misses-some-html-elements

Started getting the error message below instead, and then ONLY reviews with missing info, so only failed responses.

2018-04-23 16:09:01 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
    self.crawler_process.start()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\crawler.py", line 291, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\twisted\internet\base.py", line 1243, in run
    self.mainLoop()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\engine.py", line 122, in _next_request
    if not self._next_request_from_scheduler(spider):
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\engine.py", line 149, in _next_request_from_scheduler
    request = slot.scheduler.next_request()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\scheduler.py", line 71, in next_request
    request = self._dqpop()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\core\scheduler.py", line 106, in _dqpop
    d = self.dqs.pop()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\queuelib\pqueue.py", line 43, in pop
    m = q.pop()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\scrapy\squeues.py", line 19, in pop
    s = super(SerializableQueue, self).pop()
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\queuelib\queue.py", line 162, in pop
    self.f.seek(-size-self.SIZE_SIZE, os.SEEK_END)
builtins.OSError: [Errno 22] Invalid argument

Changed it back to 'Steam Scraper' and tested with test_urls.txt, and now I'm getting the same error above over and over until I abort the scrape. Still testing, and I might reinstall the env to test again.


RamiJabor commented on July 29, 2024

Seems like some file has been corrupted. Found a similar problem in scrapy/scrapy#845. Going to try restoring/resetting the whole project.
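For reference, the stats above show disk queues in use (`scheduler/dequeued/disk`), and the traceback dies inside queuelib's on-disk queue, so one possible recovery short of resetting the whole project is to delete the crawl-state directory. A minimal sketch, assuming the crawl was started with a JOBDIR (the directory name here is hypothetical):

```python
import shutil
from pathlib import Path

def reset_crawl_state(job_dir):
    """Delete the on-disk request queue so Scrapy rebuilds it.

    A truncated requests.queue file makes queuelib's seek() raise
    OSError: [Errno 22] Invalid argument, exactly as in the traceback
    above; removing the crawl-state directory forces a fresh queue
    (at the cost of losing the resumable crawl position).
    """
    path = Path(job_dir)
    if path.exists():
        shutil.rmtree(path)

# The JOBDIR name below is illustrative; pass whatever directory the
# crawl was started with (e.g. scrapy crawl reviews -s JOBDIR=crawls/reviews).
reset_crawl_state("crawls/reviews")
```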


RamiJabor commented on July 29, 2024

OK, so it seems that there's no problem with the actual scraper; it's more a problem with the HTML pages.
Managed to capture the HTML of one of the missing/damaged (let's just call them BAD) reviews during the console/debug run.
So the reason I'm getting multiple BAD reviews in a row is that they all originate from the same HTML page, just with different page_order values.
Example:

2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
 'page': 141,
 'page_order': 5,
 'product_id': '304050',
 'user_id': 123377508,
 'username': 'Chaoz Designz'}
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
 'page': 141,
 'page_order': 6,
 'product_id': '304050',
 'user_id': 123377508,
 'username': 'Aynat | twitch.tv/aynatanya <3'}
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
 'page': 141,
 'page_order': 7,
 'product_id': '304050',
 'user_id': 123377508,
 'username': 'Evilagician'}
2018-04-23 22:39:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/app/304050/homecontent/?announcementsoffset=413&lastNewsTime=1441544426&userreviewsoffset=61&p=141&workshopitemspage=141&readytouseitemspage=141&mtxitemspage=141&itemspage=141&screenshotspage=141&videospage=141&artpage=141&allguidepage=141&webguidepage=141&integratedguidepage=141&discussionspage=141&numperpage=5&browsefilter=trend&appid=304050&appHubSubSection=1&l=english&filterLanguage=default&searchText=&forceanon=1>
{'early_access': False,
 'page': 141,
 'page_order': 8,
 'product_id': '304050',
 'user_id': 123377508,
 'username': 'Evilagician'}

And it goes on to page_order 25, where it stops. Other examples vary in how many page_orders they go on for. Haven't found out why yet.
Now I checked the HTML, and sure enough it's missing all the important review info, because it's not a review page at all. It's some news/announcement page or something similar with no reviews, but it contains users that maybe own the game or have commented something, so the scraper picks up names, user ids, and early_access from the HTML, plus all the other info it gets without scraping the HTML.

So the problem is that the scraper is scraping non-review pages.

Don't know why yet and have no idea how to solve it. Any tips/thoughts?

EDIT: Noticed that it's always links starting with
https://steamcommunity.com/app/304050/homecontent/?announcementsoffset
when it's supposed to be
https://steamcommunity.com/app/304050/homecontent/?userreviewsoffset

EDIT2 PS: Made the scraper faster by changing all the URLs in urls.txt from HTTP to HTTPS, since it was redirecting for every link. Will look into implementing it in the code later.
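That urls.txt rewrite can be scripted in a few lines; a minimal sketch (the helper name is my own, and the file is assumed to hold one URL per line):

```python
from pathlib import Path

def upgrade_urls(path="urls.txt"):
    """Rewrite http:// URLs to https:// in place, so every request
    skips the 302 redirect Steam issues for plain-HTTP links."""
    src = Path(path)
    urls = src.read_text().splitlines()
    # Replace only the scheme; URLs that are already https:// are untouched.
    src.write_text(
        "\n".join(u.replace("http://", "https://", 1) for u in urls) + "\n"
    )
```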

EDIT3: I'm guessing the bug/problem lies in review_spider.py, somewhere between lines 104 and 131. Specifically line 114: form = response.xpath('//form[contains(@id, "MoreContentForm")]')

EDIT4:
Line 114, form = response.xpath('//form[contains(@id, "MoreContentForm")]'),
extracts the following:

<form method="GET" id="MoreContentForm1" name="MoreContentForm1" action="https://steamcommunity.com/app/9010/homecontent/">
<input type="hidden" name="userreviewsoffset" value="10"><input type="hidden" name="p" value="2"><input type="hidden" name="workshopitemspage" value="2"><input type="hidden" name="readytouseitemspage" value="2"><input type="hidden" name="mtxitemspage" value="2"><input type="hidden" name="itemspage" value="2"><input type="hidden" name="screenshotspage" value="2"><input type="hidden" name="videospage" value="2"><input type="hidden" name="artpage" value="2"><input type="hidden" name="allguidepage" value="2"><input type="hidden" name="webguidepage" value="2"><input type="hidden" name="integratedguidepage" value="2"><input type="hidden" name="discussionspage" value="2"><input type="hidden" name="numperpage" value="10"><input type="hidden" name="browsefilter" value="mostrecent">

Where name="userreviewsoffset" is sometimes "announcementsoffset".
Does that mean there are no more reviews? Is there a way to skip announcement/news pages and make sure it's a "userreviewsoffset" page? Is there a way to ignore these pages and just move on to the next page if there are still reviews to be scraped?
Questions! I will keep looking to see if I can fix this.
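One possible guard, sketched below: judging from the logs above, the BAD continuation URLs carry an announcementsoffset query parameter while proper review pages start with userreviewsoffset, so the spider could test the URL (or the form's hidden inputs) before scraping. The helper name and the decision rule are my own, inferred from the pasted URLs, not taken from the spider's actual code:

```python
from urllib.parse import urlparse, parse_qs

def is_review_continuation(url):
    """Heuristic: return True only for review-pagination pages.

    The BAD pages logged above include an announcementsoffset
    parameter; legitimate review pages carry userreviewsoffset
    and no announcementsoffset.
    """
    params = parse_qs(urlparse(url).query)
    return "announcementsoffset" not in params and "userreviewsoffset" in params
```

Inside the spider's parse callback, a response failing this check could simply be dropped (return early) instead of yielding items with missing review fields.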


prncc commented on July 29, 2024

@RamiJabor I just successfully scraped all of the reviews for the game you are getting bad data from with the following command:

scrapy crawl reviews -a steam_id=304050 -o 304050.jl --loglevel INFO

I was getting about 60-65 pages per minute (~600-700 reviews per minute), which is roughly in line with what you reported in your first 15 hours.

It looks like you've either hit Steam's servers too hard, encountered a glitch of some kind, or the behavior you're reporting is not present in the US store.


RamiJabor commented on July 29, 2024

Yes, I did the same. Tested the games I was getting bad reviews from individually and got good results.
I tested increasing the number of URLs in the urls.txt file, and at 9 or 10 URLs it starts to give bad data from announcement pages. I've tried setting CONCURRENT_REQUESTS=1, still the same...
I don't think it has to do with the Steam server, but frankly I have no idea what it could be anymore.
I would appreciate it if anyone could try scraping from a text file with over 15 URLs so I know whether I'm the only one having this problem.


prncc commented on July 29, 2024

@RamiJabor Since this issue isn't related to the scraper code, I'm going to close it. In the meantime, consider slowing down the scraper from the IP you're using.
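"Slowing down the scraper" can be done in the project's settings.py; the values below are illustrative starting points, not recommendations from the repo:

```python
# Possible settings.py additions to reduce load on Steam's servers.
DOWNLOAD_DELAY = 0.5                    # minimum seconds between requests
AUTOTHROTTLE_ENABLED = True             # adapt the delay to server latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for ~1 request in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 2      # hard cap per domain
```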

