
steam-scraper's People

Contributors

prncc


steam-scraper's Issues

Question about stopping condition for reviews

Thank you so much for sharing this.

I am confused about when this scraper should stop scraping. Some products have millions of reviews, and sometimes I need to run the code several times to get them all because it stops for some reason before it has scraped every review. I am not sure why this happens; maybe a network issue? Currently I only get the full set of reviews every third or fourth run.

Another question: is it possible to scrape additional reviews on top of the current reviews.jl? Reviews arrive as a real-time stream, so otherwise I have to re-scrape the whole thing from the beginning every time I want to update the database.

Thanks.
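On the second question, one workaround is to skip reviews that are already on disk. A minimal sketch, assuming the product_id/page/page_order fields shown in this project's output identify a row (file paths are placeholders):

```python
import json

def load_seen_keys(path):
    """Collect (product_id, page, page_order) keys already in a reviews.jl file."""
    seen = set()
    try:
        with open(path, encoding='utf-8') as f:
            for line in f:
                if line.strip():
                    row = json.loads(line)
                    seen.add((row.get('product_id'), row.get('page'), row.get('page_order')))
    except FileNotFoundError:
        pass  # no existing file yet: nothing has been seen
    return seen

def append_new_reviews(existing_path, new_reviews):
    """Append only reviews whose key is not already in the existing file."""
    seen = load_seen_keys(existing_path)
    added = 0
    with open(existing_path, 'a', encoding='utf-8') as f:
        for row in new_reviews:
            key = (row.get('product_id'), row.get('page'), row.get('page_order'))
            if key not in seen:
                f.write(json.dumps(row, ensure_ascii=False) + '\n')
                seen.add(key)
                added += 1
    return added
```

Note that Steam's paging shifts as new reviews arrive, so (product_id, page, page_order) is only a rough key; a stable review id would be a better choice if one is available in the data.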

Spider for user purchase behavior

Hi,
Thanks for sharing the Steam product spider.
I am wondering if you also have a spider for users' purchase behavior on products.

I am not experienced with scraping.

Thank you very much.

Recommended column is not being filled

Thank you for the scraper!

One note: for me it didn't properly extract the recommended value; it always came out as TRUE.
So in items.py I changed the following:

    recommended = scrapy.Field(
        input_processor=simplify_recommended,
        output_processor=TakeFirst(),
    )

To the following, and now it works again:

    recommended = scrapy.Field(
        output_processor=Compose(TakeFirst(), str_to_float),
    )
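For anyone following along: the replacement works because Compose(TakeFirst(), str_to_float) first picks the first non-empty scraped value and then converts it. Simplified stand-ins for the itemloaders processors (illustration only, not the real library code; plain float stands in for the project's str_to_float helper):

```python
def take_first(values):
    """Return the first non-null, non-empty value, like itemloaders' TakeFirst."""
    for value in values:
        if value is not None and value != '':
            return value
    return None

def compose(*functions):
    """Chain processors left to right, like itemloaders' Compose."""
    def processor(value):
        for function in functions:
            value = function(value)
        return value
    return processor

# The whole scraped list goes through the chain in one pass:
recommended_processor = compose(take_first, float)
```

With an output processor like this, the raw list of extracted strings is reduced to a single value in one step, instead of relying on an input processor that may already have coerced everything to a truthy value.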

Decoding unicode escaped characters

Hi! Thank you for this useful Scrapy spider!
I should say up front that I'm a newbie with Python.
I have an issue with non-English reviews (Russian, Chinese, Japanese, etc.). In my output file (reviews.jl) these reviews appear as \u0441\u043b\u0438\u0448\u043a\u043e\u043c and so on (after decoding, this reads "слишком").
Is there any workaround for this issue, i.e. any chance of changing the script so that the review text is exported without unicode-escaped characters?

Right now I'm using a Notepad++ plugin called HTMLPad to decode the escapes. It works, but it can't decode a large amount of text at once (26,000 reviews, for example), so I have to select 100-200 lines and decode them manually, which is a real pain for 26,000 reviews...
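One workaround that avoids Notepad++ entirely: each line of reviews.jl is JSON, and json.loads already turns the \uXXXX escapes back into real characters; re-serializing with ensure_ascii=False then writes plain UTF-8. A minimal sketch (file names are placeholders):

```python
import json

def unescape_jl(src_path, dst_path):
    r"""Rewrite a .jl file so \uXXXX escapes become real UTF-8 characters."""
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            if line.strip():
                review = json.loads(line)  # decodes the escapes
                dst.write(json.dumps(review, ensure_ascii=False) + '\n')
```

Scrapy also has a FEED_EXPORT_ENCODING setting; setting it to 'utf-8' in settings.py should make the exporter write UTF-8 directly instead of escaping in the first place.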

Missing User_ids and n_reviews

Hello! Thanks for this great scraper!
I tried it with the test_urls, and approximately half of the reviews were missing "user_id" completely.
Does this have something to do with Steam, the scraper, or my settings?

Incomplete data in Reviews.jl and occasional ding sound

While checking on the .jl file during review scraping, I'm seeing runs of 10-50 rows of incomplete data at seemingly random places. The incomplete rows are consistent within a single run of rows and a single appid. I also keep getting a Windows notification "ding" sound every 3-5 minutes, but I'm not sure whether that's connected to the error.

I don't get the problem when review-scraping only a couple of games, but crawling many games triggers it, so I might be getting incomplete responses due to too much fetching. I really don't know.

The missing fields are usually "recommended", "text", and "date".

Examples with added "BAD", "GOOD":

GOOD{"product_id": "24010", "page": 114, "page_order": 1, "text": "All TS2014 users will receive a small automatic update via Steam Tonight following our change of name from RailSimulator.com to Dovetail Games. The update will replace the previous RailSimulator.com logos and copyright notices on the user interface (UI) and elsewhere within the software with Dovetail Games logos and copyright notices. The End User License Agreement (EULA) wording will also be updated to reflect our change of name from RailSimulator.com to Dovetail Games. We are also taking this opportunity to provide a small fix to prevent the track diagram flickering experienced at certain points on the Holiday Express add-on. This update will not change the functionality or operation of Train Simulator routes, locos, scenarios or any add-on content, and we are not making any other changes to the EULA in this update. We would like to assure all Train Simulator users that this is a very small update which is not intended to change the operation or content of Train Simulator in any way other than the changes described above.", "user_id": 170181655, "early_access": false}
GOOD{"product_id": "24010", "page": 114, "page_order": 2, "text": "The commuter BR111 electric locomotive seen across Germany since the 1970s is now available for Train Simulator, with accompanying DBbzf Control Car and double decker passenger coaches.\nIn the early 1970s, Deutsche Bahn’s demand for electric locomotives for passenger trains saw the development of the BR111, a successor to the Class 110 and built to accommodate faster speeds on passenger services.  A total of 227 models were built in the class between 1970 and 1982, with the first locomotive delivered in December 1974.\nMany of the class were put into service on S-Bahn services, although with ageing locomotive stock serving Intercity routes, some were fitted for Intercity services and operated across Germany. Each of the locomotive’s four axles were fitted with engines, supplying 4,990hp (3,720 kW) and a top speed of 160km/h (99mph).\nThe BR111 often operated in conjunction with a rear control car, giving push-pull capabilities on commuter services due to its ZWS remote control, operated from the locomotive. A third generation DBbzf control car is included with the BR111 for Train Simulator, alongside double-decker DBz and DAbz coaches.\nThe BR111 for Train Simulator is available in two Deutsche Bahn liveries – Orient Red and Traffic Red - and features a DBbzf control car in mint turquoise and traffic red liveries, realistic wheel slip and sanding effects, SIFA driver vigilance device, PZB train protection system and double-decker coaches.\nThe locomotive is also Quick Drive compatible, giving you the freedom to drive the DB BR111 on any Quick Drive enabled route for Train Simulator, such as those available through Steam.\nAlso included are six scenarios for the Hamburg-Hanover route:\nMore scenarios are available on Steam Workshop online and in-game.\nTrain Simulator’s Steam Workshop scenarios are free and easy to download, adding many more hours of exciting gameplay\nKey Features\n•\tBR111 in Deutsche Bahn Traffic Red and Orient Red liveries\n•\tDBbzf Control Car in mint turquoise and traffic red liveries\n•\tPZB and SIFA systems\n•\tDouble-decker passenger coaches\n•\tQuick Drive compatible\n•\tScenarios for the Hamburg-Hanover route\n•\tDownload size: 423mb\nGet it now on Steam - http://store.steampowered.com/app/222598/", "user_id": 170181655, "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 3, "user_id": 170181655, "username": "InterCity560", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 4, "user_id": 170181655, "username": "raphaël-2903", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 5, "user_id": 170181655, "username": "Dash 7 Studios", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 6, "user_id": 170181655, "username": "raphaël-2903", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 7, "user_id": 170181655, "username": "AddictiveBiscuit", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 8, "user_id": 170181655, "username": "Dash 7 Studios", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 9, "user_id": 170181655, "username": "ljbreci", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 10, "user_id": 170181655, "username": "Whitemead", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 11, "user_id": 170181655, "username": "mwaldham131", "early_access": false}
GOOD{"product_id": "1930", "page": 50, "page_order": 1, "recommended": false, "date": "2016-04-23", "text": "My Grandfather smoked his whole life. I was about 10 years old when my mother said to him, 'If you ever want to see your grandchildren graduate, you have to stop immediately.'. Tears welled up in his eyes when he realized what exactly was at stake. He gave it up immediately. Three years later he died of lung cancer. It was really sad and destroyed me. My mother said to me- 'Don't ever smoke. Please don't put your family through what your Grandfather put us through.\" I agreed. At 28, I have never touched a cigarette. I must say, I feel a very slight sense of regret for never having done it, because this game gave me cancer anyway.", "hours": 6.7, "user_id": 92964214, "username": "Monkey D. Luffy 💦", "products": 305, "found_funny": 4, "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 12, "user_id": 170181655, "username": "Banter420", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 13, "user_id": 170181655, "username": "Neal08Ni/NealLIVE", "early_access": false}
BAD{"product_id": "24010", "page": 114, "page_order": 14, "user_id": 170181655, "username": "StevenJam", "early_access": false}
GOOD{"product_id": "1930", "page": 50, "page_order": 3, "recommended": true, "date": "2016-04-17", "text": "one word: epic and tricky", "hours": 0.7, "user_id": 92964214, "username": "babyboi", "products": 32, "early_access": false}
GOOD{"product_id": "1930", "page": 50, "page_order": 4, "recommended": true, "date": "2016-04-15", "text": "If you do not like this game..,,you suck. What a hidden gem, I got it on sale and played it a bit like I do most of the games I buy and then go back to them later and play a bit more. But this game is quite addictive, and I really dont see anything wrong with the graphics. Nice enviroments, somewhat ugly animals but I can live with that. The human characters are a bit goofy but there are so many good things about this game that any not up to snuff eye candy is easily overlooked. Very large map and a good bit of content makes this a sleep killer.", "hours": 61.2, "user_id": 92964214, "username": "BrookenG", "products": 177, "early_access": false}

I don't know what the problem could be. I haven't found any similar problems/issues while looking for solutions.
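To at least recover without re-crawling everything, one option is to scan the finished file for rows missing the fields listed above and re-run the reviews spider for just those product ids. A minimal sketch (field names taken from the BAD rows above; the path is a placeholder):

```python
import json

REQUIRED_FIELDS = ('recommended', 'text', 'date')  # the fields reported missing

def incomplete_product_ids(path):
    """Return product_ids that have at least one row missing a required field."""
    bad = set()
    with open(path, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            if any(field not in row for field in REQUIRED_FIELDS):
                bad.add(row.get('product_id'))
    return bad
```

Feeding those ids back into the review URL list and appending the new rows (plus a de-duplication pass) should complete the data set far faster than a full re-crawl.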

Price field incorrect if only DLC options listed

In instances where a game doesn't have a base price but has lots of DLC, the sum of the DLC prices is recorded as the base price for the title.

The correct approach would be to change

    price = response.css('.game_purchase_price ::text').extract_first()

to either verify that its parent element isn't "dlc_purchase_action",
or to verify that the hierarchy contains the class "game_area_purchase_game".

Running more than one spider

Hey,
I'm still somewhat new to Python and thought I'd give this project a go so I can learn the language along the way. This may be a simple issue, but it is really frustrating me.

Basically, I put in the command:
scrapy crawl products -o output/products_all.jl --logfile=output/products_all.log \ --loglevel=INFO -s JOBDIR=output/products_all_job -s HTTPCACHE_ENABLED=False

However, I keep getting the error:
crawl: error: running 'scrapy crawl' with more than one spider is no longer supported

Any help would be much appreciated,
Thank you!

Exporting reviews by keywords

Hi! Thank you for fixing the encoding of unicode-escaped characters. You freed me from some hassle 💃
I've got another question about exporting reviews. Let's say I need to search reviews for certain keywords, maybe related to the combat system or to translation issues. I have a file with the keywords I'd like to use, so that my output file contains only the reviews whose text includes those keywords.
Is there any way to create a rule in the source code so it exports only reviews with these keywords?

Some keywords for example: translation, 翻訳, traducción, Übersetzung.
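A low-effort alternative to changing the spider is post-filtering the finished reviews.jl. A minimal sketch that keeps only reviews whose text contains at least one keyword (file names are placeholders; keywords are read one per line):

```python
import json

def load_keywords(path):
    """Read keywords from a file, one per line, ignoring blank lines."""
    with open(path, encoding='utf-8') as f:
        return [line.strip().lower() for line in f if line.strip()]

def filter_reviews(src_path, dst_path, keywords):
    """Keep only reviews whose text contains at least one keyword."""
    keywords = [keyword.lower() for keyword in keywords]
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            if not line.strip():
                continue
            review = json.loads(line)
            text = review.get('text', '').lower()
            if any(keyword in text for keyword in keywords):
                dst.write(json.dumps(review, ensure_ascii=False) + '\n')
```

Lower-casing is a no-op for scripts like 翻訳, so mixed keyword lists work unchanged. If you would rather filter during the crawl, the same check can live in a Scrapy item pipeline that raises DropItem for non-matching reviews.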

Mature content + Age check

Hello,

Great scraper! Made my life a lot easier. :)
Could you guys give me a hand with bypassing the mature content and the age check pages?

Thanks heaps
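Not the author's method, just the workaround commonly reported for Steam's age and mature-content interstitials: send the cookies the gate would otherwise set. The cookie names and values below are assumptions drawn from community reports, not an official API, so verify them against the live site:

```python
# Commonly reported cookies for skipping Steam's age/mature-content gates.
# These names/values are assumptions from community reports and may change.
AGE_GATE_COOKIES = {
    'birthtime': '568022401',            # a birth date far enough in the past
    'lastagecheckage': '1-January-1988',
    'mature_content': '1',
}

# In a Scrapy spider they would be attached to each request, e.g.:
#   yield scrapy.Request(url, cookies=AGE_GATE_COOKIES, callback=self.parse_product)
```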
