
opennlp / large-scale-text-classification

Large Scale benchmarking of state of the art text vectorizers

License: Apache License 2.0

machine-learning natural-language-processing text-classification python elmo word2vec glove flair-embeddings fasttext tf-idf-vectorizer


Large-Scale-Text-Classification

Sparse Victory - A Large Scale Systematic Comparison of Count-Based and Prediction-Based Vectorizers for Text Classification. Rupak Chakraborty, Ashima Elhence, Kapil Arora. Proceedings of Recent Advances in Natural Language Processing (RANLP), Varna, Bulgaria, 2019. [paper link]

Overview

In this paper we study the performance of several text vectorization algorithms on a diverse collection of 73 publicly available datasets. Traditional sparse vectorizers like Tf-Idf and Feature Hashing are systematically compared with state-of-the-art neural word embeddings like Word2Vec, GloVe, and FastText, and character-based embeddings like ELMo and Flair. We carry out an extensive analysis of the performance of these vectorizers across several dimensions: classification metrics (i.e., precision, recall, accuracy), dataset size, and class imbalance (in terms of the distribution of class labels). Our experiments reveal that the sparse vectorizers beat the neural word and character embedding models on 61 of the 73 datasets by an average margin of 3-5% (in terms of macro F1-score), and this advantage is consistent across the different dimensions of comparison.
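The sparse vectorizers discussed above are available off-the-shelf in scikit-learn. A minimal sketch of a Tf-Idf baseline evaluated with macro F1 (this is an illustration on an invented toy corpus, not the paper's actual pipeline or one of the 73 datasets):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy corpus and binary sentiment labels (illustrative only)
docs = [
    "great movie loved it",
    "terrible plot and acting",
    "wonderful film truly great",
    "awful boring terrible",
]
labels = [1, 0, 1, 0]

# Count-based (sparse) vectorization: Tf-Idf
X = TfidfVectorizer().fit_transform(docs)

# Any scikit-learn classifier can consume the sparse matrix directly
clf = LogisticRegression().fit(X, labels)
pred = clf.predict(X)

# Macro F1 is the score the paper averages over when comparing vectorizers
macro_f1 = f1_score(labels, pred, average="macro")
print(round(macro_f1, 2))
```

Feature Hashing works the same way via `HashingVectorizer`; the neural embedding baselines instead map each document to a dense vector before classification.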

Resources

  • Datasets used in the experiment can be downloaded from the following link
  • Pre-trained embedding models can be downloaded from here
  • All result files can be viewed here
  • Detailed visualization of the feature vectors can be seen here

Steps to execute the code

  1. git clone the repository to your local system
  2. Install all dependencies: pip install -r requirements.txt
  3. Download the pre-trained models, create a folder named models in the root directory of the project, and put the pre-trained models there
  4. Download the datasets from the URL provided, then add their path to the file commonconstants.py under the constants package; also modify the other file locations as per your local system
  5. Keep a local MongoDB instance running to store all the result JSON files
  6. Run the file benchmark_pipeline.py under the pipeline package to see the results on the screen
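Step 5 above assumes a local MongoDB instance holding the result JSON files. A minimal sketch of what one such result document might look like (the field names, database, and collection names here are assumptions for illustration, not the project's actual schema; the metric values are taken from the Tf-Idf Sentiment row below):

```python
import json

# Hypothetical shape of one benchmark result document
result = {
    "dataset": "example-dataset",
    "vectorizer": "tfidf",
    "classifier": "logistic-regression",
    "metrics": {"precision": 47.0, "recall": 42.2, "accuracy": 63.3},
}

# The pipeline would store such documents in MongoDB, e.g. via pymongo:
#   MongoClient("localhost", 27017)["benchmarks"]["results"].insert_one(result)
# Here we only round-trip it through JSON to show the document is serializable.
serialized = json.dumps(result)
restored = json.loads(serialized)
print(restored["metrics"]["accuracy"])
```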

Experimental Results

| Category (no. of datasets) | GloVe | FastText | Word2Vec | ELMo | Tf-Idf | FeatureHash | Flair |
|---|---|---|---|---|---|---|---|
| Sentiment (10) | 41.6/38.1/59.5 | 42.9/38.9/59.9 | 42.9/38.2/59.4 | 36.1/35.1/57.1 | 47.0/42.2/63.3 | 45.0/41.3/61.8 | 43.3/38.9/60.0 |
| Emotion (1) | 14.3/10.3/21.2 | 12.5/9.1/20.4 | 11.7/9.6/20.8 | 7.9/7.0/19.0 | 14.2/10.2/19.1 | 15.0/10.6/18.3 | 8.6/8.2/18.6 |
| General Classification (8) | 56.8/49.5/64.8 | 55.9/49.2/64.6 | 54.3/48.6/64.0 | 46.8/44.9/61.5 | 60.7/55.3/68.3 | 58.2/51.8/65.1 | 56.5/52.2/65.0 |
| Other (5) | 59.7/56.8/67.8 | 59.7/56.4/67.4 | 59.1/56.6/67.6 | 52.9/52.1/65.5 | 61.5/55.6/69.8 | 57.1/53.3/68.6 | 59.1/52.8/67.0 |
| Reviews (2) | 52.1/37.6/83.4 | 44.2/37.5/83.2 | 52.1/37.6/83.2 | 45.6/37.7/83.1 | 57.4/43.9/85.4 | 50.0/43.6/84.1 | 55.8/42.2/84.0 |
| Spam-Fake-Ironic-Hate (5) | 75.9/71.0/82.6 | 78.0/72.4/83.7 | 77.8/72.4/83.6 | 70.7/64.8/81.0 | 84.3/79.3/87.6 | 80.0/74.9/84.5 | 79.9/76.3/85.4 |
| Medical (4) | 45.2/40.2/70.3 | 42.9/40.3/70.1 | 45.6/40.8/70.3 | 40.6/36.9/68.7 | 53.8/45.9/73.8 | 47.3/42.2/70.6 | 49.3/42.2/71.3 |
| News (4) | 50.6/49.4/66.6 | 48.6/48.3/66.2 | 48.9/48.7/66.1 | 35.9/36.6/54.3 | 63.0/60.0/77.6 | 58.1/55.8/73.2 | 63.2/60.9/78.4 |

Each cell in the table reports Precision/Recall/Accuracy, averaged across all the classifiers used in the study. Only datasets with at most 10K samples are included here; please refer to our paper for detailed results over the entire dataset collection.
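As a quick sanity check on the table, a short script comparing the accuracy column (the third value in each Precision/Recall/Accuracy cell) of Tf-Idf against GloVe, with the numbers copied from the table above:

```python
# Accuracy values copied from the table: {category: (tfidf, glove)}
accuracy = {
    "Sentiment": (63.3, 59.5),
    "Emotion": (19.1, 21.2),
    "General Classification": (68.3, 64.8),
    "Other": (69.8, 67.8),
    "Reviews": (85.4, 83.4),
    "Spam-Fake-Ironic-Hate": (87.6, 82.6),
    "Medical": (73.8, 70.3),
    "News": (77.6, 66.6),
}

# Count the categories where the sparse Tf-Idf vectorizer comes out ahead
wins = sum(tfidf > glove for tfidf, glove in accuracy.values())
print(f"Tf-Idf beats GloVe on accuracy in {wins} of {len(accuracy)} categories")
# Tf-Idf wins every category except Emotion
```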

The figures in the repository show the following metrics (from left to right): 1. a violin plot of the accuracy of all vectorizers used in the study, across all datasets; 2. a violin plot of the accuracy of the classifiers, under the same conditions as 1; 3. the macro F1-score of the classifiers; 4. the macro F1-score of the vectorizers.

Support or Contact

We are always happy to receive feedback on ways to improve the framework. Feel free to raise a PR if you find a bug or would like to improve a feature. For any queries, please feel free to reach out to Rupak or Ashima.

Contributors

dependabot[bot], rupakc
