Giter Site home page Giter Site logo

pombredanne / samr Goto Github PK

View Code? Open in Web Editor NEW

This project forked from rafacarrascosa/samr

0.0 1.0 0.0 244 KB

An entry to kaggle's 'Sentiment Analysis on Movie Reviews' competition

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

samr's Introduction

Sentiment Analysis on Movie Reviews

This is an entry to Kaggle's Sentiment Analysis on Movie Reviews (SAMR) competition.

It's written for Python 3.3 and it's based on scikit-learn and nltk.

Problem description

Quoting from Kaggle's description page:

This competition presents a chance to benchmark your sentiment-analysis ideas on the Rotten Tomatoes dataset. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive.

Some examples:

  • 4 (positive): "They works spectacularly well... A shiver-inducing, nerve-rattling ride."
  • 3 (somewhat positive): "rooted in a sincere performance by the title character undergoing midlife crisis"
  • 2 (neutral): "Its everything you would expect -- but nothing more."
  • 1 (somewhat negative): "But it does not leave you with much."
  • 0 (negative): "The movies progression into rambling incoherence gives new meaning to the phrase fatal script error."

So the goal of the competition is to produce an algorithm to classify phrases into these categories. And that's what samr does.

How to use it

After installing just run:

generate_kaggle_submission.py samr/data/model2.json > submission.csv

And that will generate a Kaggle submission file that scores near 0.65844 on the leaderboard (should take 3 minutes, and as of 2014-07-22 that score is the 2nd place).

The model2.json argument above is a configuration file for samr that determines how the scikit-learn pipeline is going to be built and other hyperparameters, here is how it looks:

{
 "classifier":"randomforest",
 "classifier_args":{"n_estimators": 100, "min_samples_leaf":10, "n_jobs":-1},
 "lowercase":"true",
 "map_to_synsets":"true",
 "map_to_lex":"true",
 "duplicates":"true"
}

You can try samr with different configuration files you make (as long as the options are implemented), yielding different scores and perhaps even better scores.

Just tell me how it works

In particular model2.json feeds a random forest classifier with a concatenation of 3 kinds of features:

  • The decision functions of set of vanilla SGDClassifiers trained in a one-versus-others scheme using bag-of-words as features. It's classifier inside a classifier, yo dawg!
  • The decision functions of set of vanilla SGDClassifiers trained in a one-versus-others scheme using bag-of-words on the wordnet synsets of the words in a phrase.
  • The amount of "positive" and "negative" words in a phrase as dictated by the Harvard Inquirer sentiment lexicon

During prediction, it also checks for duplicates between the training set and the train set (there are quite a few).

And that's it! Want more details? see the code! it's only 350 lines.

Installation

If you know the drill, this should be enough:

git clone https://github.com/rafacarrascosa/samr.git
pip install -e samr -r samr/docs/setup/requirements-dev.txt
download_3rdparty_data.py

Then you will need to manually download train.tsv and test.tsv from the competition's data folder and unzip them into the samr/data folder. You may be asked to join Kaggle and/or accept the competition rules before downloading the data.

Even though samr is writen for Python 3.3 it may also work with Python 2.7 (and the last time I checked it was), but this is not supported and it may break in the future.

If the short instructions are not enough, read on.

Full instructions for Ubuntu

These instructions will install the development version of samr inside a Python 3.3 virtualenv and were thought for a blank, vanilla Ubuntu 14.04 and tested using Docker (awesome tool btw). They should work more or less unchanged with other Ubuntu versions and Debian-based OSs.

Open a console and 'cd' into an empty folder of your choice. Now, execute the following commands:

Install python 3.3 and compilation requirements for numpy and scipy:

sudo apt-get update
sudo apt-get install -y software-properties-common
sudo add-apt-repository -y ppa:fkrull/deadsnakes
sudo apt-get update
sudo apt-get install -y python3.3 python3.3-dev python-scipy gfortran libopenblas-dev liblapack-dev git wget

Create virtualenv and bootstrap pip:

python3.3 -m venv venv
source venv/bin/activate
wget https://bootstrap.pypa.io/get-pip.py
python3.3 get-pip.py
echo 'PATH="$VIRTUAL_ENV/local/bin:$PATH"; export PATH' >> venv/bin/activate
source venv/bin/activate

Clone and install samr:

git clone https://github.com/rafacarrascosa/samr.git
pip install -e samr -r samr/docs/setup/requirements-dev.txt
download_3rdparty_data.py

Optionally run the tests:

nosetests samr/tests

Lastly, you will need to manually download train.tsv and test.tsv from the competition's data folder and unzip them into the samr/data folder. You may be asked to join Kaggle and/or accept the competition rules before downloading the data.

The installation is self-contained (within the folder you chose at the start) with two exceptions:

  • Lines starting with sudo apt-get made system-wide changes, to uninstall those you will to use sudo apt-get remove.
  • nltk downloads data to ~/nltk_data, once you don't use nltk it's safe to erase that folder.

Licensing

This project is open-source and BSD licensed, see the LICENSE file for details.

This license basically allows you to do anything, but in case you're wondering: I'm ok if you use samr to beat my score at the competition, just share back what you've learned!

Credits

This project was developed by Rafael Carrascosa, you can contact me at [email protected].

samr's People

Contributors

rafacarrascosa avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.