Giter Site home page Giter Site logo

chaoannricardo / steam-crawler Goto Github PK

View Code? Open in Web Editor NEW

This project forked from aesuli/steam-crawler

0.0 1.0 0.0 95.67 MB

A set of scripts that crawls STEAM website to download game reviews.

License: GNU General Public License v3.0

Python 100.00%

steam-crawler's Introduction

STEAM crawler

Introduction

The repository is forked from the original branch by aesuli/steam-crawler

Within this edition, several editions and add-ons are made, which include revisions of, 0_view_data.py, 4_steam-review-extractor.py, 7_steam-review-check-duplicate.py, 8_steam-selenium-extractor.py, 9_steam-check-website-condition.py.

Further details of the repository are written below.


Further Details of the Code

This set of scripts crawls STEAM website to download game reviews.

These scripts are aimed at students that want to experiment with text mining on review data.

The script have an order of execution.

  • 1_steam-game-crawler.py download pages that lists games into ./data/games/

  • 2_steam-game-extractor.py extracts games ids from the downloaded pages, saving them into ./data/games.csv

  • 3_steam-review-crawler.py uses the above list to download game reviews pages into ./data/reviews This process can take a long time (it's a lot of data and the script sleeps between requests to be fair with the server). When the script is stopped and restarted it will skip games for which all reviews have been downloaded on the previous run (it does not downloads new reviews for such games).

  • 4_steam-review-extractor.py extracts reviews and other info from the downloaded pages, saving them into ./data/reviews.csv

  • 5_steam-reviews-stats.py is a sample script that processes the review.csv file and outputs some basic info and stats in json files.

  • 6_steam-review-check-duplicate.py checks whether there are duplicated rows in the crawled data, and removing all while keeping the first record of them.

  • 7_steam-selenium-extractor.py extracts traditional chinese review data of given steam website, the process is facilitated by selenium package.

  • 8_steam-check-website-condition.py check whether the game-id exists and would export a csv file which includes the games-id of those are valid.

Column in the reviews.csv file:

  • game id
  • number of people that found the review to be useful
  • number of people that found the review to be funny
  • username of the reviewer
  • number of games owned by the reviewer
  • number of reviews written by the reviewer
  • 1=recommended, -1=not recommended
  • hours played by the reviewer on the game
  • date of creation of the review
  • text of the review

The last script steam-reviews-stats.py is a sample script that processes the review.csv file and outputs some basic info and stats in json files:

  • ./data/games.json number of reviews and played hours for every game.

  • ./data/users.json number of game owned (as reported by user's badge on STEAM) and number of played hours.

  • ./data/summary.json number of reviews, number of played hours, number of users, number of games.

On March 15, 2018 those last statistics are:

reviews        6614765
played hours 554702535
users          2720777
games            26677

steam-crawler's People

Contributors

chaoannricardo avatar aesuli avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.