Hello! Thanks for this great scraper! I tried it with the test_urls and approximat

Missing User_ids and n_reviews about steam-scraper HOT 14 OPEN

prncc commented on July 29, 2024

Missing User_ids and n_reviews

from steam-scraper.

Comments (14)

jonnybazookatone commented on July 29, 2024

response.xpath('//meta[@itemprop="reviewCount"]/@content').extract()[0]

RamiJabor commented on July 29, 2024

I looked into the problem with missing Steam user IDs, and it seems to come from Steam's profile URLs not being consistent. Some profile URLs contain just the SteamID64, like this:
https://steamcommunity.com/profiles/76561198384621512/
And some contain only a custom name, so the review spider can't pick up the SteamID64 from the profile URL:
https://steamcommunity.com/id/RollyPollyDwarfHeads/

The SteamID64 can't be scraped in these cases, but the SteamID3 is stated in
<div class="apphub_friend_block" data-miniprofile="424355784">
so I'm thinking SteamID3 might be the better ID to scrape.
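
For individual accounts, the two IDs mentioned above appear to be related by a fixed offset: the `data-miniprofile` value is the 32-bit account number, and adding 76561197960265728 to it gives the SteamID64. A minimal sketch, assuming that offset (it is not stated in this thread) and a hypothetical helper name `account_id_to_steamid64`:

```python
# Offset between a 32-bit account ID and the SteamID64 of an individual
# account. Assumption: this constant is not taken from the thread.
STEAMID64_BASE = 76561197960265728


def account_id_to_steamid64(account_id):
    """Convert a data-miniprofile value (str or int) to a SteamID64."""
    return STEAMID64_BASE + int(account_id)


print(account_id_to_steamid64("424355784"))
```

Applied to the `data-miniprofile` value quoted above, this yields 76561198384621512, so scraping the account number should not lose the SteamID64 if the assumed offset holds.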

I replaced line 28 in review_spider.py with
review.xpath('//div[@class="apphub_friend_block"]/@data-miniprofile').extract()[0]
And it works fine!
Thanks for all the help, jonny, and once again thanks for a great Steam-Scraper!

One last thing: the urls = shuffle(urls) line in split_review_urls sets urls to None for some reason, so I removed it. I don't see why it needs to be there.
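
The `None` is likely explained by `random.shuffle` shuffling its argument in place and returning `None`, so assigning the result back discards the list. A short sketch, assuming `shuffle` here is `random.shuffle` (the import is not shown in this thread) and using made-up URLs:

```python
import random

urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

# random.shuffle reorders the list in place and has no return value.
result = random.shuffle(urls)
print(result)       # None
print(len(urls))    # the list itself still holds every URL

# If a shuffled copy is wanted instead, random.sample returns a new list.
shuffled_copy = random.sample(urls, k=len(urls))
```

Calling `shuffle(urls)` on its own line, without the assignment, would keep the shuffling and avoid the `None`.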

RamiJabor commented on July 29, 2024

I think there's another issue with my products_all.jl: it's missing n_reviews too, which is preventing me from moving on to review scraping. The example output doesn't match what I get in my products_all.jl:

{"url": "http://store.steampowered.com/app/800200/Witching_Tower_VR/", "reviews_url": "http://steamcommunity.com/app/800200/reviews/?browsefilter=mostrecent&p=1", "id": "800200", "title": "Witching Tower VR", "genres": ["Action", "Adventure", "Indie"], "developer": "Daily Magic Productions", "publisher": "Daily Magic Productions", "release_date": "Summer 2018", "app_name": "Witching Tower VR", "specs": ["Single-player"], "tags": ["Action", "Adventure", "Indie", "VR", "Violent", "Puzzle", "Atmospheric"], "early_access": false}

RamiJabor commented on July 29, 2024

I'm getting this error when trying to run split_review_urls:

(env) C:\Users\Rami\steam-scraper\scripts>py split_review_urls.py --scraped-products C:\Users\Rami\steam-scraper\output/products_all.jl --output-dir C:\Users\Rami\steam-scraper\output
Traceback (most recent call last):
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\indexes\base.py", line 2525, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'n_reviews'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "split_review_urls.py", line 71, in <module>
    main()
  File "split_review_urls.py", line 48, in main
    blx_has_reviews = df['n_reviews'] > 0
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\frame.py", line 2139, in __getitem__
    return self._getitem_column(key)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\frame.py", line 2146, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\generic.py", line 1842, in _get_item_cache
    values = self._data.get(item)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\internals.py", line 3843, in get
    loc = self.items.get_loc(item)
  File "C:\Users\Rami\steam-scraper\env\lib\site-packages\pandas\core\indexes\base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\_libs\index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'n_reviews'

I think something might be going wrong at line 63 of ProductSpider in product_spider.py
and at line 100 of items.py, since n_reviews isn't being created at all.

jonnybazookatone commented on July 29, 2024

I haven't tested yet, but it's probably because 'reviews' does not appear in the text of the product's page. So you can change this line:

https://github.com/prncc/steam-scraper/blob/master/steam/spiders/product_spider.py#L63

so it reads

loader.add_css('n_reviews', '.responsive_hidden', re=r'\(([\d,]+)\)')

RamiJabor commented on July 29, 2024

I did that after I saw that Steam's HTML has removed "reviews" from the number-of-reviews field. It still didn't work. I tested it in the shell and it works there, so I don't think the problem is there.
I even tried it with this:

    n_reviews = response.css('.responsive_hidden').re(r'\(([\d,]+)\)')
    n_reviews = [int(r.replace(',', '')) for r in n_reviews]
    n_reviews = max(n_reviews)
    loader.add_value('n_reviews', n_reviews)

I also changed line 100 of items.py to n_reviews = scrapy.Field().
It still didn't work... I'm really stumped and don't know what to do next. Somehow n_reviews isn't showing up at all. I would suspect a formatting/parsing problem if it showed up with empty/NaN values, but it's not even created.

Thank you for the response. I'll try poking around some more

jonnybazookatone commented on July 29, 2024

So a quick look at for example this one: http://store.steampowered.com/app/848270/Sky_Conqueror/

results in this being scraped:

{
  "url": "http://store.steampowered.com/app/848270/Sky_Conqueror/", 
  "reviews_url": "http://steamcommunity.com/app/848270/reviews/?browsefilter=mostrecent&p=1", 
  "id": "848270", 
  "title": "Sky Conqueror", 
  "genres": ["Action", "Adventure", "Casual", "Indie"], 
  "developer": "Poseidon's kiss", 
  "publisher": "Poseidon's kiss", 
  "release_date": "2018-05-03", 
  "app_name": "Sky Conqueror", 
  "specs": ["Single-player", "Steam Achievements"], 
  "tags": ["Casual", "Action", "Adventure", "Indie"], 
  "early_access": false
}

which doesn't have n_reviews because the page shows "No user reviews". You can either add a catch-all that stores 0 when no entry is found, or modify the split script so it doesn't require n_reviews, using get('n_reviews', 0), so that it doesn't raise the KeyError.
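
The second option can be sketched without pandas, assuming the split script only needs the products that have at least one review. The second sample record and the helper name `products_with_reviews` are made up for illustration:

```python
import json


def products_with_reviews(lines):
    """Yield parsed products whose n_reviews is present and positive."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        product = json.loads(line)
        # Treat a missing n_reviews as zero instead of raising KeyError.
        if int(product.get("n_reviews", 0)) > 0:
            yield product


sample = [
    '{"id": "848270", "title": "Sky Conqueror"}',
    '{"id": "1", "title": "Hypothetical Game", "n_reviews": 12}',
]
kept = list(products_with_reviews(sample))
print([p["id"] for p in kept])
```

Only the record that carries a positive `n_reviews` survives the filter; the Sky Conqueror record is skipped rather than causing an error.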

jonnybazookatone commented on July 29, 2024

Interesting. Also found this one where the style is also different:

http://store.steampowered.com/app/831810/Bane_of_Asphodel/

8 user reviews

jonnybazookatone commented on July 29, 2024

To be honest, the best thing to do seems to be modifying the code to read the microdata in the HTML:

<meta itemprop="reviewCount" content="3">

But that's beyond my Scrapy/XPath skills. You could modify the regex to something like:

\(([\d,]+)\)|([\d,]+)\suser\sreviews
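
A quick way to see how the combined pattern behaves on both page styles is to run it with Python's `re` module. This is only a sketch: the helper name `parse_review_count` and the "Very Positive (1,234)" sample are made up, while "8 user reviews" is the text quoted earlier in the thread.

```python
import re

# Matches "(1,234)" or "8 user reviews"; only one of the two groups is filled.
PATTERN = re.compile(r"\(([\d,]+)\)|([\d,]+)\suser\sreviews")


def parse_review_count(text):
    """Return the review count found in text, or 0 if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return 0
    digits = match.group(1) or match.group(2)
    return int(digits.replace(",", ""))


print(parse_review_count("Very Positive (1,234)"))
print(parse_review_count("8 user reviews"))
print(parse_review_count("No user reviews"))
```

Because the pattern has two capture groups, any code consuming the match has to pick whichever group is non-empty, which is one reason the microdata approach may be simpler.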

RamiJabor commented on July 29, 2024

Yeah, when there are too few user reviews to set a sentiment, Steam just shows the total number of user reviews, so the scraper gets "# user reviews" in sentiment.

jonnybazookatone commented on July 29, 2024

I went with this:

n_reviews = response.xpath('//meta[@itemprop="reviewCount"]/@content').extract()
n_reviews = '0' if len(n_reviews) == 0 else n_reviews[0]
loader.add_value('n_reviews', n_reviews)

Not the most elegant, but seems to be working.
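
The same fall-back-to-zero logic can be checked outside Scrapy. A sketch using only the standard library's `html.parser`; the class and function names are hypothetical, and this is not the code that went into the spider:

```python
from html.parser import HTMLParser


class ReviewCountParser(HTMLParser):
    """Record the content of the first <meta itemprop="reviewCount"> tag."""

    def __init__(self):
        super().__init__()
        self.review_count = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("itemprop") == "reviewCount":
            if self.review_count is None:
                self.review_count = attrs.get("content")


def review_count(html):
    """Return the microdata review count, or 0 when the tag is absent."""
    parser = ReviewCountParser()
    parser.feed(html)
    return int(parser.review_count or 0)


print(review_count('<meta itemprop="reviewCount" content="3">'))
print(review_count("<div>No user reviews</div>"))
```

Pages without the `reviewCount` tag fall through to 0, which mirrors the `'0' if len(n_reviews) == 0` branch above.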

RamiJabor commented on July 29, 2024

I'm a bloody fool... I had no idea I have to restart/make a new virtualenv every time I make a change. None of the changes I was making had any effect because I was still in the same virtualenv. I even deleted the Python files and the scraper was still working the same...

Sorry. I've been on a wild goose chase the whole day.
It works now!
Thanks for the help!

prncc commented on July 29, 2024

@RamiJabor and @jonnybazookatone If one of you wants to make a PR to integrate some of these fixes, that'd be great. Let me know either way?

RamiJabor commented on July 29, 2024

I might do it in 1-2 weeks. I'm really busy with thesis work at the moment and hoping I can fit the Steam data into the report.

Also, I have a question about split_review_urls.py. I tried it with "--pieces 1" and it didn't allow it, with
step = int(math.ceil(float(n)/args.pieces)). Is there a particular reason you made it so?
I wanted all the URLs in one file so I could run one long continuous scrape of all the reviews, so I just copy-pasted all of them into one txt file.
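
Taken on its own, the quoted step computation does handle a single piece: with `pieces` equal to 1 the step is simply the number of URLs. A sketch, assuming the script slices the URL list by `step` (the slicing and the function name `split_into_pieces` are guesses; whatever rejects a single piece is not shown in this thread):

```python
import math


def split_into_pieces(urls, pieces):
    """Split urls into at most `pieces` chunks of near-equal size."""
    n = len(urls)
    if n == 0:
        return []
    # Same arithmetic as the line quoted above.
    step = int(math.ceil(float(n) / pieces))
    return [urls[i:i + step] for i in range(0, n, step)]


urls = ["u1", "u2", "u3", "u4", "u5"]
print(split_into_pieces(urls, 1))
print(split_into_pieces(urls, 2))
```

With one piece the whole list comes back as a single chunk, so concatenating the output files by hand should be equivalent.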
