Text_Classification_Using_EM_And_Semisupervied_Learning

Acquiring large amounts of labeled text data for text classification is tedious and expensive, while unlabeled text is abundant on the Web and cheap to collect. In this project, we investigate how effectively semi-supervised learning with the expectation-maximization (EM) algorithm can exploit large amounts of unlabeled data to obtain accurate text classification. We built a simple multinomial Naive Bayes (NB) classifier and trained it with an EM procedure on both labeled and unlabeled text data. We studied the relation between multiclass classification accuracy and the fraction of unlabeled data in the training set, and we explored methods to reduce the computational cost of the EM procedure to speed up training. The results show that our semi-supervised EM NB classifier achieves above 50% accuracy on average when only 2% of the training data is labeled, and above 70% accuracy when one third of the training data is labeled.
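To make the training procedure concrete, here is a minimal sketch of semi-supervised EM for a multinomial Naive Bayes model, written with NumPy only. It is not the project's own implementation: the function names (`fit_em_nb`, `predict`, and the two helpers), the Laplace smoothing parameter `alpha`, the fixed iteration count `n_iter`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def _m_step(X, resp, alpha=1.0):
    """Estimate log priors and log word probabilities from (soft) class assignments."""
    n_classes = resp.shape[1]
    # Expected number of documents per class, smoothed so no prior is zero.
    class_weight = resp.sum(axis=0)
    log_prior = np.log((class_weight + alpha) / (class_weight.sum() + alpha * n_classes))
    # Expected word counts per class, shape (n_classes, n_words), with Laplace smoothing.
    word_counts = resp.T @ X + alpha
    log_word = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_word


def _e_step(X, log_prior, log_word):
    """Return posterior class probabilities for each document (row of X)."""
    joint = X @ log_word.T + log_prior          # unnormalized log posterior
    joint -= joint.max(axis=1, keepdims=True)   # stabilize before exponentiating
    post = np.exp(joint)
    return post / post.sum(axis=1, keepdims=True)


def fit_em_nb(X_lab, y_lab, X_unl, n_classes, n_iter=20, alpha=1.0):
    """Train on labeled word counts X_lab/y_lab plus unlabeled word counts X_unl."""
    one_hot = np.eye(n_classes)[y_lab]
    # Initialize the model from the labeled documents only.
    log_prior, log_word = _m_step(X_lab, one_hot, alpha)
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        # E-step: soft labels for unlabeled documents; labeled ones keep their labels.
        resp_unl = _e_step(X_unl, log_prior, log_word)
        resp_all = np.vstack([one_hot, resp_unl])
        # M-step: re-estimate the parameters from all documents.
        log_prior, log_word = _m_step(X_all, resp_all, alpha)
    return log_prior, log_word


def predict(X, log_prior, log_word):
    """Return the most probable class index for each document."""
    return _e_step(X, log_prior, log_word).argmax(axis=1)


# Toy corpus (hypothetical): 4-word vocabulary; class 0 favors words 0-1, class 1 words 2-3.
X_lab = np.array([[3, 2, 0, 0], [0, 0, 2, 3]], dtype=float)
y_lab = np.array([0, 1])
X_unl = np.array([[4, 1, 0, 0], [1, 3, 0, 1], [0, 1, 3, 2], [0, 0, 1, 4]], dtype=float)

log_prior, log_word = fit_em_nb(X_lab, y_lab, X_unl, n_classes=2)
print(predict(X_unl, log_prior, log_word))
```

The sketch runs a fixed number of iterations for simplicity; stopping once the parameters or the log-likelihood stop changing is a common alternative and one plausible way to cut training cost.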

Getting Started

The models and algorithms of our project are implemented in Python, with IPython Notebook used for data and result visualization. All experiments were run on a local machine.

Prerequisites

The following Python packages must be installed before running the project code:

numpy
scipy
sklearn
nltk==3.1
wordcloud
matplotlib
seaborn
  • Note: nltk v3.2 may have issues with its stemming functions.

Installing

IPython Notebook can be installed separately using pip:

pip install ipython

Or with the Anaconda bundle:

conda update conda
conda update ipython

Running the tests

The IPython notebook can be viewed in a web browser by running the following command in a terminal from the code directory:

ipython notebook

The semi-supervised EM Naive Bayes class, defined in a Python script, is called from the experiment code. Most of our code is recorded in IPython notebook cells. The notebook can be executed cell by cell in sequential order, or all at once using the Kernel starter. The results are visualized as images shown below the corresponding cells.

Expected Results

The expected result is improved multi-class text classification accuracy from the semi-supervised EM Naive Bayes classifier when it is given both labeled and unlabeled documents.
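Experiments of this kind need a training set divided into a small labeled part and a large unlabeled part. The sketch below shows one possible way to do that split per class, so that every class keeps at least one labeled document. The helper name `split_labeled_unlabeled`, the per-class rounding rule, and the toy label array are assumptions for illustration, not the project's actual code.

```python
import numpy as np


def split_labeled_unlabeled(y, labeled_fraction, seed=0):
    """Return (labeled_idx, unlabeled_idx), keeping at least one labeled doc per class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    labeled = []
    for c in np.unique(y):
        # Shuffle the indices of class c and keep the requested fraction as labeled.
        idx = rng.permutation(np.flatnonzero(y == c))
        n_keep = max(1, int(round(labeled_fraction * len(idx))))
        labeled.extend(idx[:n_keep].tolist())
    labeled = np.array(sorted(labeled))
    # Every remaining document is treated as unlabeled during training.
    unlabeled = np.setdiff1d(np.arange(len(y)), labeled)
    return labeled, unlabeled


# Hypothetical label array: 150 documents in 3 balanced classes.
y = np.repeat([0, 1, 2], 50)
lab, unl = split_labeled_unlabeled(y, labeled_fraction=0.02)
print(len(lab), len(unl))
```

Repeating the split for several values of `labeled_fraction` and several seeds, then averaging the test accuracy, would give an accuracy-versus-labeled-fraction curve.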

Classification Accuracy Improvement

Word Cloud of Most Probable Keywords in Each Class

For more details and intermediate results, please check the IPython notebooks in the code folder.

Team Members

Acknowledgments

  • Thanks to Prof. Min Chi for the support on this project.
  • Thanks to all TAs of the CSC591 course for the evaluation and feedback on this project.
