Comments (14)

jonnybazookatone avatar jonnybazookatone commented on July 29, 2024 2
response.xpath('//meta[@itemprop="reviewCount"]/@content').extract()[0]

from steam-scraper.

RamiJabor avatar RamiJabor commented on July 29, 2024 1

I looked into the problem with missing Steam user IDs, and it seems to come from Steam's profile URLs not being consistent. Some profile URLs contain the SteamID64, like this:
https://steamcommunity.com/profiles/76561198384621512/
Others contain only a custom name, so the review spider can't pick up the SteamID64 from the profile URL:
https://steamcommunity.com/id/RollyPollyDwarfHeads/

The SteamID64 can't be scraped in these cases, but the SteamID3 is given in
<div class="apphub_friend_block" data-miniprofile="424355784">
so I'm thinking SteamID3 might be the better ID to scrape.
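
For what it's worth, the two IDs are related by a fixed offset, so the data-miniprofile value can be turned back into a SteamID64 afterwards. A minimal sketch, assuming individual accounts in the public universe; the function names are made up for illustration and are not part of steam-scraper:

```python
# Offset between the 32-bit account ID (the data-miniprofile value)
# and the SteamID64 of an individual account in the public universe.
STEAMID64_BASE = 76561197960265728


def account_id_to_steamid64(account_id):
    """Convert an account ID (e.g. from data-miniprofile) to a SteamID64."""
    return STEAMID64_BASE + int(account_id)


def steamid64_to_account_id(steamid64):
    """Convert a SteamID64 back to the 32-bit account ID."""
    return int(steamid64) - STEAMID64_BASE


# The two values quoted above belong to the same account.
print(account_id_to_steamid64("424355784"))  # 76561198384621512
```

So scraping data-miniprofile loses nothing: the SteamID64 from the first URL above can be recovered from it.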

I replaced line 28 in review_spider.py with
review.xpath('//div[@class="apphub_friend_block"]/@data-miniprofile').extract()[0]
And it works fine!
Thanks for all the help, jonny, and once again thanks for a great steam-scraper!

One last extra thing: the urls = shuffle(urls) line in split_review_urls makes urls = None for some reason... I removed it. I don't see why it needs to be there.
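
A likely explanation, assuming shuffle here is random.shuffle (numpy.random.shuffle behaves the same way): it shuffles the list in place and returns None, so assigning its result throws the list away. A small sketch with placeholder URLs:

```python
import random

urls = ["url_a", "url_b", "url_c"]  # placeholder values

# random.shuffle reorders the list in place and returns None,
# so `urls = shuffle(urls)` would rebind urls to None.
result = random.shuffle(urls)
print(result)  # None

# If a shuffled copy is wanted instead, random.sample returns a new list.
shuffled_copy = random.sample(urls, k=len(urls))
print(len(shuffled_copy))  # 3
```

Calling shuffle(urls) on its own line, without the assignment, would keep the shuffling if it is wanted.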

RamiJabor avatar RamiJabor commented on July 29, 2024

I think there's another issue with my products_all.jl: it's missing n_reviews too, which is preventing me from moving on to review scraping. The example output doesn't match what I get in my products_all.jl:

{"url": "http://store.steampowered.com/app/800200/Witching_Tower_VR/", "reviews_url": "http://steamcommunity.com/app/800200/reviews/?browsefilter=mostrecent&p=1", "id": "800200", "title": "Witching Tower VR", "genres": ["Action", "Adventure", "Indie"], "developer": "Daily Magic Productions", "publisher": "Daily Magic Productions", "release_date": "Summer 2018", "app_name": "Witching Tower VR", "specs": ["Single-player"], "tags": ["Action", "Adventure", "Indie", "VR", "Violent", "Puzzle", "Atmospheric"], "early_access": false}

RamiJabor avatar RamiJabor commented on July 29, 2024

I'm getting this error when trying to run split_review_urls:

(env) C:\Users\Rami\steam-scraper\scripts>py split_review_urls.py --scraped-products C:\Users\Rami\steam-scraper\output/products_all.jl --output-dir C:\Users\Rami\steam-scraper\output
Traceback (most recent call last):
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\indexes\base.py", line 2525, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'n_reviews'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "split_review_urls.py", line 71, in <module>
    main()
  File "split_review_urls.py", line 48, in main
    blx_has_reviews = df['n_reviews'] > 0
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\frame.py", line 2139, in __getitem__
    return self._getitem_column(key)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\frame.py", line 2146, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\generic.py", line 1842, in _get_item_cache
    values = self._data.get(item)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\internals.py", line 3843, in get
    loc = self.items.get_loc(item)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\indexes\base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\_libs\index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'n_reviews'

I think something might be going wrong at line 63 of product_spider.py (ProductSpider) and line 100 of items.py, since n_reviews isn't being created at all.

jonnybazookatone avatar jonnybazookatone commented on July 29, 2024

I haven't tested yet, but it's probably because 'reviews' does not appear in the text of the product's page. So you can change this line:

https://github.com/prncc/steam-scraper/blob/master/steam/spiders/product_spider.py#L63

so it reads

loader.add_css('n_reviews', '.responsive_hidden', re=r'\(([\d,]+)\)')

RamiJabor avatar RamiJabor commented on July 29, 2024

I did that after I saw that Steam's HTML has removed "reviews" from the number-of-reviews field. It still didn't work. I tested it in the shell and it works there, so I don't think the problem is there.
I even tried it with this:

    # Collect every "(1,234)"-style number, strip the commas, keep the largest.
    n_reviews = response.css('.responsive_hidden').re(r'\(([\d,]+)\)')
    n_reviews = [int(r.replace(',', '')) for r in n_reviews]
    n_reviews = max(n_reviews)
    loader.add_value('n_reviews', n_reviews)

And changing line 100 of items.py to n_reviews = scrapy.Field().
Still didn't work... I'm really stumped and don't know what to try next. Somehow n_reviews isn't showing up at all. I would suspect a formatting/parsing problem if it showed up with empty/NaN values, but the field isn't even created.

Thank you for the response. I'll try poking around some more

jonnybazookatone avatar jonnybazookatone commented on July 29, 2024

So, a quick look at this one, for example: http://store.steampowered.com/app/848270/Sky_Conqueror/

results in this being scraped:

{
  "url": "http://store.steampowered.com/app/848270/Sky_Conqueror/", 
  "reviews_url": "http://steamcommunity.com/app/848270/reviews/?browsefilter=mostrecent&p=1", 
  "id": "848270", 
  "title": "Sky Conqueror", 
  "genres": ["Action", "Adventure", "Casual", "Indie"], 
  "developer": "Poseidon's kiss", 
  "publisher": "Poseidon's kiss", 
  "release_date": "2018-05-03", 
  "app_name": "Sky Conqueror", 
  "specs": ["Single-player", "Steam Achievements"], 
  "tags": ["Casual", "Action", "Adventure", "Indie"], 
  "early_access": false
}

which doesn't have an n_reviews field because the page shows "No user reviews". You can either add a catch-all that stores 0 when no entry is found, or modify the split script so it doesn't require n_reviews, using get('n_reviews', 0) so it doesn't raise the KeyError.
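
The second option can be sketched without pandas, reading the .jl records as plain dicts. This is only an illustration of the default-to-zero idea, not the actual split script; the second record and the function name are made up:

```python
import json

# Two JSON-lines records: the first mirrors the scraped item above
# (no n_reviews key), the second is a made-up item that has one.
lines = [
    '{"id": "848270", "title": "Sky Conqueror"}',
    '{"id": "1", "title": "Hypothetical Game", "n_reviews": "12"}',
]


def products_with_reviews(jl_lines):
    """Keep only products whose n_reviews is above zero, treating a missing key as 0."""
    products = (json.loads(line) for line in jl_lines if line.strip())
    # dict.get returns the default instead of raising KeyError.
    return [p for p in products if int(p.get("n_reviews", 0)) > 0]


reviewed = products_with_reviews(lines)
print([p["id"] for p in reviewed])  # ['1']
```

The int() call is there because the count may have been stored as a string.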

jonnybazookatone avatar jonnybazookatone commented on July 29, 2024

Interesting. Also found this one where the style is also different:

http://store.steampowered.com/app/831810/Bane_of_Asphodel/

8 user reviews

jonnybazookatone avatar jonnybazookatone commented on July 29, 2024

To be honest, the best thing to do seems to be to modify the code to read the microdata in the HTML:

<meta itemprop="reviewCount" content="3">

But that's beyond my scrapy/XPath skills. You could modify the regex to something like:

\(([\d,]+)\)|([\d,]+)\suser\sreviews
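
A quick way to check that pattern against both page styles with plain `re`. Only "8 user reviews" comes from a real page; the other sample strings and the helper name are assumptions for illustration:

```python
import re

# Either "(1,234)" after a sentiment label, or "8 user reviews".
PATTERN = re.compile(r"\(([\d,]+)\)|([\d,]+)\suser\sreviews")


def parse_review_count(text):
    """Return the review count found in text, or 0 if neither style matches."""
    match = PATTERN.search(text)
    if match is None:
        return 0
    # Exactly one of the two groups is filled, depending on the branch.
    digits = match.group(1) or match.group(2)
    return int(digits.replace(",", ""))


print(parse_review_count("Very Positive (1,234)"))  # 1234
print(parse_review_count("8 user reviews"))         # 8
print(parse_review_count("No user reviews"))        # 0
```

Because the alternation has two capture groups, whatever consumes the match has to pick the non-empty one, which is the kind of thing that is easy to miss when the regex is passed straight to a loader.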

RamiJabor avatar RamiJabor commented on July 29, 2024

Yeah, when there are too few user reviews to compute a sentiment, Steam just shows the total number of user reviews, so the scraper gets "# user reviews" in sentiment.

jonnybazookatone avatar jonnybazookatone commented on July 29, 2024

I went with this:

# Read the review count from the page's microdata; fall back to '0' when the tag is absent.
n_reviews = response.xpath('//meta[@itemprop="reviewCount"]/@content').extract()
n_reviews = n_reviews[0] if n_reviews else '0'
loader.add_value('n_reviews', n_reviews)

Not the most elegant, but seems to be working.

RamiJabor avatar RamiJabor commented on July 29, 2024

I'm a bloody fool... I had no idea I have to restart/make a new virtualenv every time I make a change. None of the changes I was making had any effect because I was still in the same virtualenv. I even deleted the Python files and the scraper still behaved the same...

Sorry, I've been on a wild goose chase the whole day.
It works now!
Thanks for the help!

prncc avatar prncc commented on July 29, 2024

@RamiJabor and @jonnybazookatone If one of you wants to make a PR to integrate some of these fixes, that'd be great. Let me know either way?

RamiJabor avatar RamiJabor commented on July 29, 2024

I might do it in 1-2 weeks. I'm really busy with thesis work at the moment and hoping I can cram the Steam data into the report.

Also, I have a question about split_review_urls.py. I tried it with "--pieces 1" and it didn't allow it with
step = int(math.ceil(float(n)/args.pieces))
Is there a particular reason you made it so?
I wanted all the URLs in one file so I could run one long continuous scrape of all the reviews, so I just copy-pasted all of them into one txt file.
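
On its own, the step arithmetic quoted above seems to cope with a single piece, so the rejection may come from somewhere else in the script (argument validation, for example; I haven't checked). A sketch with a hypothetical helper, which is not the script's actual code:

```python
import math


def split_into_pieces(urls, pieces):
    """Split urls into at most `pieces` chunks using the ceil-based step."""
    n = len(urls)
    if n == 0:
        return []
    # Same arithmetic as the quoted line.
    step = int(math.ceil(float(n) / pieces))
    return [urls[i:i + step] for i in range(0, n, step)]


urls = ["url_%d" % i for i in range(7)]  # placeholder URLs

print(len(split_into_pieces(urls, 1)))               # 1
print([len(c) for c in split_into_pieces(urls, 3)])  # [3, 3, 1]
```

With pieces set to 1 the step equals the number of URLs, so everything lands in one chunk.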
