steam-reviews's Introduction

Context

I was inspired by one of Internet Historian's videos, The Engoodening of No Man's Sky, particularly the following segment:

"Then he starts breaking that (all the feedback of the game from various sources) down into datasets: people who haven't bought the game, people who have bought it and played it for a hundred hours, people who have returned it, etc.

Then he starts compiling those complaints into usable data, focusing on the people with the most sincere experience of the game.

Then he starts making a big list of all the things that need adding and prioritize them."

This workflow interested me because it applies to many cases, especially in industry. So here it is: my first attempt at exploring the field of NLP. Enjoy!

Files

This repo contains this README; a Python notebook named steam_reviews_code.ipynb; slides in PDF format named 'Steam Reviews - Slides.pdf', which highlight the most important aspects of this work; and the dataset, steam_reviews.csv, stored in the data directory.

Data

I use a subset of the Steam Reviews Dataset from Kaggle, provided by the user Luthfi Mahendra. The original dataset contains ~400000 rows, which is too many for my poor PC to handle, so I randomly selected 20000 reviews from it and provide them in the data directory. This dataset contains the following columns:

  • date_posted: The date the review was posted.
  • funny: How many other players found the review funny.
  • helpful: How many other players found the review helpful.
  • hour_played: How many hours the reviewer played the game before writing the review.
  • recommendation: Whether the reviewer recommended the game or not.
  • review: The text of the user's review.
  • title: The title of the game that is being reviewed.

The important columns are review (feature) and recommendation (target), but I also use the other columns for data exploration.
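The subsampling described above could look roughly like the sketch below. It is not the notebook's actual code: the helper name `subsample_reviews`, the seed, the label strings, and the tiny stand-in DataFrame are all assumptions made for illustration. With the real data one would load the full Kaggle CSV with `pd.read_csv` and pass `n=20000`.

```python
import pandas as pd


def subsample_reviews(df: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Return n randomly chosen rows (without replacement), reproducibly."""
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Tiny stand-in for the full dataset (hypothetical rows and label strings,
# keeping only the two key columns named in the list above).
full = pd.DataFrame({
    "review": [f"review text {i}" for i in range(100)],
    "recommendation": [
        "Recommended" if i % 4 else "Not Recommended" for i in range(100)
    ],
})

subset = subsample_reviews(full, n=20)
X = subset["review"]          # feature: raw review text
y = subset["recommendation"]  # target: recommendation label
print(subset.shape)
```

Fixing the seed means the same subset can be regenerated later, which keeps the exploration and modelling steps comparable across runs.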

Goals

The exact goals of this project are:

  • to predict whether a reviewer recommended the game or not based on the review; and
  • to analyze what words are associated with good reviews (i.e. reviewer recommended the game).

I also want to explore text processing and Bayesian optimization algorithms.

Methods

I apply basic text processing to the reviews: removing non-alphabetic characters, removing stop words, stemming, and removing irrelevant words. I then vectorize the documents with two vectorizers, the bag-of-words model and TF-IDF, to compare their performance. Next, I run a Bayesian search with hyperopt to tune the hyperparameters and find the best model for each vectorizer+classifier combination. Model performance is evaluated with cross-validation to obtain the distribution of the performance metric. The best model is used to make predictions on the held-out test data. Finally, I analyze the features of the best model by looking at the words with the largest weights to obtain relevant feedback.
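The cleaning and vectorizer comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the stop-word set is a tiny hypothetical list, `naive_stem` is a crude suffix stripper standing in for a real stemmer, the reviews are made-up examples, and both classifiers use default hyperparameters.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, deliberately tiny stop-word list; a real run would use a full one.
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "i", "to", "of"}


def naive_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_review(text: str) -> str:
    """Keep letters only, lowercase, drop stop words, then stem each word."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", text).lower()
    words = [w for w in letters_only.split() if w not in STOP_WORDS]
    return " ".join(naive_stem(w) for w in words)


# Made-up reviews: the first six are "recommended", the last six are not.
reviews = [
    "Great game, loved the crafting!", "Amazing story and fun combat",
    "Fun with friends, great updates", "Loved it, amazing graphics",
    "Great fun, runs smoothly", "Amazing game, loved exploring",
    "Crashes constantly, terrible port", "Cheaters everywhere, boring matches",
    "Terrible performance, keeps crashing", "Boring grind, full of cheaters",
    "Game crashed again, terrible", "Boring and broken, cheaters ruin it",
]
labels = [1] * 6 + [0] * 6  # 1 = recommended, 0 = not recommended

cleaned = [clean_review(r) for r in reviews]

# One pipeline per vectorizer+classifier combination to compare.
candidates = {
    "BOW + Naive Bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "TF-IDF + Logistic Regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression()
    ),
}

# Cross-validation yields one score per fold, i.e. a small distribution.
cv_scores = {
    name: cross_val_score(model, cleaned, labels, cv=3, scoring="accuracy")
    for name, model in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.2f} over {len(scores)} folds")
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so no vocabulary or IDF statistics leak from the validation fold.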

This project requires the standard numpy, pandas, matplotlib, seaborn, and sklearn packages.

In addition, it uses some non-standard packages: hyperopt for hyperparameter tuning, lightgbm for the LGBM classifier, and eli5 to highlight the most important words in the model.

Results

I find that the tuned TF-IDF+Logistic Regression model gives the best performance based on the cross-validation results. Applying this model to the test data gives an accuracy of 94.2% and a ROC AUC score of 0.916, much better than the baseline model's (BOW+Naive Bayes) accuracy of 83.6% and ROC AUC score of 0.784. This suggests that the model should be able to classify the sentiment of larger, unseen data, which we can use to gain more comprehensive feedback on current games. The most common problems in the games include modding (players want it back), cheaters/hackers, and game crashes. These are valuable inputs for game developers and companies that want to improve players' gaming experience.
