review_classifier

The program classifies hotel reviews as real or fake using a naive Bayes classifier. For each class it computes P(class|words) ∝ P(class) * P(word_1|class) * P(word_2|class) * ... * P(word_n|class) and assigns the review to the class with the highest score.

Word Frequency Dictionary and Probabilities

The program starts by going through each review in the training data and counting, for each class (deceptive and truthful), the number of documents in which each word appears. A word is counted at most once per document: if the word "the" appeared 12 times in Document A, 6 times in Document B, and 23 times in Document C, its count is 3. This way, P(word|class) is the probability that the word appears at least once in a document of that class. A second dictionary then stores, for each word, the probability it appears at least once in a document of each class, along with the probability it appears at least once in any document; an entry might look like {word: [0.3, 0.24, 0.6]}. A separate list holds P(class). In this dataset both classes have probability 0.5, but I wrote the program so that it could scale if needed (that is, more classes or samples can be added).
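The counting step can be sketched as follows. This is a minimal illustration, not the program's actual code; the function and variable names are made up, and probabilities are returned per class in a nested dict rather than the list layout described above.

```python
from collections import defaultdict

def train(samples):
    """Build document frequencies and P(word appears at least once | class).

    samples: list of (label, text) pairs, e.g. ("deceptive", "the room was ...").
    Returns ({word: {label: P(word|label)}}, {label: P(label)}).
    """
    doc_counts = defaultdict(lambda: defaultdict(int))  # word -> label -> #docs containing it
    class_counts = defaultdict(int)                     # label -> #docs
    for label, text in samples:
        class_counts[label] += 1
        for word in set(text.lower().split()):          # set(): count each word once per document
            doc_counts[word][label] += 1
    total = sum(class_counts.values())
    word_probs = {
        word: {label: counts[label] / class_counts[label] for label in class_counts}
        for word, counts in doc_counts.items()
    }
    class_probs = {label: n / total for label, n in class_counts.items()}
    return word_probs, class_probs
```

Note how `set()` enforces the at-most-once-per-document rule from the paragraph above.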

Prediction

Next, the classify method calls the prediction method for each sample in the test data. The prediction method computes P(class|words) for each class and returns the argmax over the classes. The probabilities are computed in log space so that longer reviews don't cause underflow. It also discards any word whose probability is below a threshold sigma, in this case 1/1200, which filters out words that appeared in zero or only one training document.
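A sketch of that prediction step, assuming a dict mapping each word to its per-class probability of appearing in a document (the names and data layout here are illustrative, not the program's actual identifiers):

```python
import math

SIGMA = 1 / 1200  # threshold: drop words seen in at most one training document

def predict(text, word_probs, class_probs):
    """Return the class maximizing log P(class) + sum of log P(word|class)."""
    words = set(text.lower().split())
    best_label, best_score = None, float("-inf")
    for label, prior in class_probs.items():
        score = math.log(prior)
        for word in words:
            p = word_probs.get(word, {}).get(label, 0.0)
            if p > SIGMA:                 # skip rare/unseen words instead of taking log(0)
                score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because log is monotonic, the argmax in log space is the same as the argmax over the raw products.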

Challenges

I tried removing punctuation, removing numbers, switching everything to lowercase, and counting repeated words. Using the probability that a word exists in a document (appears at least once) gave the best results, and removing punctuation led to slightly better results. Every other change actually lowered accuracy.
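The punctuation-removal step that helped accuracy might look like this (a guess at the preprocessing, not the program's exact code):

```python
import string

def strip_punct(text):
    # Remove all ASCII punctuation before tokenizing; whitespace and
    # letter case are left untouched.
    return text.translate(str.maketrans("", "", string.punctuation))
```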

Using log probabilities didn't actually lead to any increase in accuracy on this dataset, but longer reviews would mean multiplying even more probabilities together, potentially causing underflow in the future. So that my model could scale to any problem of this type, I used log probabilities.
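The underflow risk is easy to demonstrate: multiplying a few hundred small probabilities collapses to 0.0 in double precision, while the equivalent sum of logs stays finite.

```python
import math

probs = [0.01] * 200        # 200 words, each with P(word|class) = 0.01
product = 1.0
for p in probs:
    product *= p            # 0.01**200 = 1e-400, below the smallest double: underflows to 0.0
log_sum = sum(math.log(p) for p in probs)  # stays a perfectly ordinary finite number
```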

Results

The program classified the testing data with an accuracy of 95.58%.

Contributors

marcusskinner
