
yelp-reviews's Introduction

Predicting Customer Satisfaction from Yelp Reviews using NLP

Yelp provides a crowd-sourced review forum for businesses and services. Users write reviews and give star ratings, and other users can in turn rate those reviews as cool, useful or funny.

The goal of this project is to build a machine learning model that can predict whether customers are satisfied or not based on the reviews they have written.

My personal learning objective from this project is to explore the basics of Natural Language Processing (NLP), apply feature extraction using Count Vectorizers and understand the theory behind Naive Bayes classifiers.

Exploratory Data Analysis

Summary Statistics

         stars    cool   useful   funny
count    10000   10000    10000   10000
mean      3.78    0.88     1.41    0.70
std       1.21    2.08     2.34    1.91
min          1       0        0       0
25%          3       0        0       0
50%          4       0        1       0
75%          5       1        2       1
max          5      77       76      57

Observations

  • The dataset contains 10000 reviews and 10 features.
  • The mean star rating in this dataset is 3.78, with a standard deviation of about 1.21.

Thoughts

  • It would be interesting to know how long each review is: how much detail, time and effort do people put into their reviews?

Histogram of Review Length

[Figure: histogram of review length]

Observations

  • According to the histogram, most reviews contain around 400 to 800 words. The histogram begins to tail off from about 900 words.
  • The mean number of words per review is around 710.74, with a standard deviation of around 619.4. This suggests high variability in review length.
  • The minimum is 1 and the maximum is 4997.
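The length statistics above can be reproduced along these lines. This is a minimal sketch using only the standard library; the `reviews` list is a hypothetical stand-in for the dataset's review text, and splitting on whitespace is an assumption about how length was measured (if length was measured in characters, `len(review)` would replace the split).

```python
import statistics

# Hypothetical sample reviews standing in for the dataset's review text.
reviews = [
    "Great coffee and fast service",
    "Never again",
    "The staff were friendly but the food took far too long to arrive",
]

# Word count per review, splitting on whitespace (an assumption).
lengths = [len(review.split()) for review in reviews]

summary = {
    "mean": statistics.mean(lengths),
    "std": statistics.stdev(lengths),  # sample standard deviation
    "min": min(lengths),
    "max": max(lengths),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A standard deviation close to the mean, as reported above, is what signals the high variability in review length.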

Count plot of number of stars

[Figure: count plot of the number of stars]

Histogram of stars as a function of length

The aim here is to see how review length (the number of words people write) is distributed for each star rating. This gives an idea of how much people write when giving good or bad reviews.

[Figure: histograms of review length for each star rating]

Observations

  • As expected from the count plot, the frequencies for 4 and 5 star reviews are much higher.
  • In general, however, all the histograms follow the same shape.
  • We have an unbalanced dataset.
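One way to quantify the imbalance and compare lengths across ratings is to count reviews per star and average the length within each group. The sketch below uses only the standard library; the `rows` values are hypothetical and do not come from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical (stars, review length) pairs, for illustration only.
rows = [(5, 120), (4, 300), (5, 80), (1, 450), (4, 210), (3, 150), (5, 60)]

# Number of reviews per star rating and each rating's share of the total.
star_counts = Counter(stars for stars, _ in rows)
shares = {stars: count / len(rows) for stars, count in sorted(star_counts.items())}

# Mean review length within each star rating.
lengths_by_star = defaultdict(list)
for stars, length in rows:
    lengths_by_star[stars].append(length)
mean_length = {
    stars: sum(values) / len(values)
    for stars, values in sorted(lengths_by_star.items())
}

for stars in shares:
    print(f"{stars} stars: share={shares[stars]:.2f}, mean length={mean_length[stars]:.1f}")
```

Shares that differ widely between ratings indicate an unbalanced dataset, which matters later when reading accuracy figures.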

Model Results

A Naive Bayes classifier was fit. The results below cover two classes, 1-star and 5-star reviews, evaluated on 818 test reviews.
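To illustrate the idea behind the model, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier over bag-of-words counts with Laplace smoothing. It is not the code used in this project: the function names `tokenize`, `fit` and `predict`, the smoothing parameter `alpha`, and the toy reviews are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def fit(texts, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}

    model = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_lik": {
                word: math.log((word_counts[label][word] + alpha) / denom)
                for word in vocab
            },
        }
    return model


def predict(model, text):
    """Return the label with the highest log score; unseen words are skipped."""
    tokens = tokenize(text)
    scores = {}
    for label, params in model.items():
        scores[label] = params["log_prior"] + sum(
            params["log_lik"][word] for word in tokens if word in params["log_lik"]
        )
    return max(scores, key=scores.get)


# Hypothetical toy reviews, not taken from the Yelp dataset.
train_texts = [
    "great food and friendly service",
    "amazing place great pizza",
    "terrible service and cold food",
    "awful experience rude staff",
]
train_labels = [5, 5, 1, 1]

model = fit(train_texts, train_labels)
prediction = predict(model, "great friendly staff")
print(prediction)
```

The word counts play the role of the count-vectorized features, and the smoothing keeps a word that never appeared in a class from forcing that class's probability to zero.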

Confusion Matrix

[Figure: confusion matrix]

Classification Report

              precision    recall  f1-score   support
           1       0.89      0.71      0.79       143
           5       0.94      0.98      0.96       675
    accuracy                           0.93       818
   macro avg       0.91      0.85      0.88       818
weighted avg       0.93      0.93      0.93       818

Observations

For class 1:

  • The precision is 0.89, indicating that 89% of the instances predicted as class 1 actually belong to class 1.
  • The recall is 0.71, suggesting that 71% of the actual class 1 instances were correctly identified by the model.
  • The F1-score is 0.79, which is the harmonic mean of precision and recall, providing an overall measure of the model's performance for class 1.
  • The support is 143, indicating the number of instances in the testing data that belong to class 1.

For class 5:

  • The precision is 0.94, indicating that 94% of the instances predicted as class 5 actually belong to class 5.
  • The recall is 0.98, suggesting that 98% of the actual class 5 instances were correctly identified by the model.
  • The F1-score is 0.96, which provides an overall measure of the model's performance for class 5.
  • The support is 675, indicating the number of instances in the testing data that belong to class 5.

The accuracy of the model on the testing data is reported as 0.93, indicating that the model correctly predicted the class labels for 93% of the instances in the testing set.

The macro average F1-score is 0.88, which is the average of the F1-scores for each class, giving equal weight to both classes.

The weighted average F1-score is also 0.93, which takes into account the class imbalance by considering the support of each class.
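The relationships between these figures can be checked with a short calculation. The sketch below takes the per-class values from the classification report above and recomputes the F1-scores and both averages; the variable and function names are my own, and because the inputs are already rounded, the results should only be expected to agree with the report to two decimal places.

```python
# Per-class values copied from the classification report above.
report = {
    1: {"precision": 0.89, "recall": 0.71, "f1": 0.79, "support": 143},
    5: {"precision": 0.94, "recall": 0.98, "f1": 0.96, "support": 675},
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# F1 recomputed from the rounded precision and recall of each class.
recomputed_f1 = {
    label: f1_score(row["precision"], row["recall"]) for label, row in report.items()
}

total_support = sum(row["support"] for row in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total_support

print(f"macro F1: {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

The weighted average sits close to the class 5 score because class 5 accounts for 675 of the 818 test instances, while the macro average is pulled down by the weaker class 1 score.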

Overall, the model demonstrates strong performance, with high precision, recall, and F1-scores for both classes, although recall for class 1 (0.71) is noticeably lower than for class 5 (0.98). Further improvements could include giving more weight to words that may be more important and adding more features.
