jcodingstuff / nlpreddit

Multinomial classification tasks in Reddit

Home Page: https://jcodingstuff.github.io/NLPReddit/

License: GNU General Public License v3.0

Jupyter Notebook 18.07% HTML 80.94% TeX 0.99%
machine-learning natural-language-processing classification multinomial-naive-bayes multinomial support-vector-machines support-vector-machine reddit reddit-api praw

nlpreddit's Introduction

1. Introduction

2. Tasks

  1. r/AmItheAsshole: YTA, NTA, ESH, NAH, SHP.
  2. r/LifeProTips, r/ShittyLifeProTips, r/UnethicalLifeProTips and r/IllegalLifeProTips.

3. Reddit content retrieval

We retrieved the data with Pushshift.io, which made large-scale retrieval simpler and faster than Reddit's own API. We accessed Pushshift through the psaw library.
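
As a rough illustration of this kind of retrieval, the sketch below collects submission titles through a psaw-style client. It is not the code used in this repository: the helper name `fetch_titles`, the choice of fields and the `limit` value are assumptions made for the example. The function only relies on the client exposing `search_submissions`, as psaw's `PushshiftAPI` does.

```python
def fetch_titles(api, subreddit, limit=1000):
    """Collect (title, flair) pairs for one subreddit.

    `api` is expected to behave like psaw's PushshiftAPI, i.e. calling
    api.search_submissions(...) yields submission objects.
    `fetch_titles` itself is a hypothetical helper, not part of psaw.
    """
    results = api.search_submissions(
        subreddit=subreddit,
        filter=["title", "link_flair_text"],  # request only the fields we read
        limit=limit,
    )
    rows = []
    for post in results:
        title = getattr(post, "title", None)
        flair = getattr(post, "link_flair_text", None)  # absent on unflaired posts
        if title:
            rows.append((title, flair))
    return rows


# Usage (requires `pip install psaw` and network access to Pushshift):
#
#   from psaw import PushshiftAPI
#   posts = fetch_titles(PushshiftAPI(), "AmItheAsshole", limit=100)
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object when Pushshift is not reachable.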

4. Am I The Asshole?

Label distribution
r/amitheasshole: 38k

  • NTA: 14406
  • YTA: 6361
  • NAH: 3662
  • SHP: 2270
  • ESH: 2149
  • Unlabelled: 9442
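
To put these counts in context, the short sketch below derives the class shares, the accuracy of an always-predict-the-majority baseline and the ratio between the largest and smallest class. It assumes that unlabelled submissions are left out of the classification task, which the list above does not state explicitly.

```python
# Label counts copied from the list above.
counts = {
    "NTA": 14406,
    "YTA": 6361,
    "NAH": 3662,
    "SHP": 2270,
    "ESH": 2149,
}
unlabelled = 9442

labelled_total = sum(counts.values())
total = labelled_total + unlabelled

# Share of each label among the labelled submissions.
shares = {label: n / labelled_total for label, n in counts.items()}

# Accuracy of a classifier that always predicts the most frequent label.
majority_label = max(counts, key=counts.get)
majority_baseline = shares[majority_label]

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(total)           # 38290
print(labelled_total)  # 28848
print(f"majority baseline ({majority_label}): {majority_baseline:.1%}")
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```

Under that assumption, always predicting NTA is already right on roughly half of the labelled submissions, so accuracy alone says little here and per-class recall is the more informative measure.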

Train / Validation / Test split

  • Train: 9/16
  • Validation: 3/16
  • Test: 4/16
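
These fractions are what a two-stage split produces when a quarter of the data is held out for testing and a quarter of the remainder for validation. The sketch below shows that arithmetic and a simple shuffled split. How the split was actually performed in the notebooks is not stated here, so the function names, the seed and the lack of stratification are assumptions.

```python
import random
from fractions import Fraction


def two_stage_split(items, holdout=Fraction(1, 4), seed=0):
    """Shuffle, hold out `holdout` for test, then the same share of the
    remainder for validation. Returns (train, validation, test)."""
    items = list(items)
    random.Random(seed).shuffle(items)

    n_test = round(len(items) * holdout)
    test, rest = items[:n_test], items[n_test:]

    n_val = round(len(rest) * holdout)
    validation, train = rest[:n_val], rest[n_val:]
    return train, validation, test


def split_fractions(holdout):
    """Exact (train, validation, test) fractions implied by the two stages."""
    test = holdout
    validation = (1 - holdout) * holdout
    train = (1 - holdout) * (1 - holdout)
    return train, validation, test


train, validation, test = two_stage_split(range(1600))
print(len(train), len(validation), len(test))  # 900 300 400
print(split_fractions(Fraction(1, 4)))         # 9/16, 3/16 and 1/4 (= 4/16)
```

With `holdout=Fraction(1, 3)` the same arithmetic gives 4/9, 2/9 and 1/3, the split listed for the ProTips task. For an unbalanced label set, a stratified split would keep the class proportions equal across the three parts.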

4.1. Validation and performance of ML models

Figure: Performance on test set (random forest, n=10)

Figure: Performance on test set (random forest, n=100)

Figure: Performance on test set (undersampling)

Figure: Performance on test set (oversampling)

5. ProTips

5.1. Data Retrieval and split

Post retrieval

  • r/lifeprotips (8 years old): 30k
  • r/shittylifeprotips (7 years old): 30k
  • r/unethicallifeprotips (3 years old): 30k
  • r/illegallifeprotips (2 years old): 10k

Train / Validation / Test split

  • Train: 4/9
  • Validation: 2/9
  • Test: 1/3

5.2. Validation and performance of ML models

Figure: Validation / optimization for Naive Bayes (BOW)

Figure: Validation / optimization for SVM (Word2Vec)

Figure: Validation / optimization for SVM (Pre-trained Word2Vec)

Figure: Performance on test set

Figure: Confusion matrix for Naive Bayes (BOW)

Figure: Confusion matrix for SVM (Word2Vec)

Figure: Confusion matrix for SVM (Pre-trained Word2Vec)

6. Conclusions and further work

ProTips

  1. The best performance is achieved by the SVM with pre-trained word vectors.
  2. Further work could explore how to improve BOW + Naive Bayes, or look for better language representations so that more complex classifiers such as NNs or SVMs reach a performance that justifies their long training time.
  3. More data and processing power would be needed.
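
For readers unfamiliar with the BOW + Naive Bayes baseline mentioned above, here is a minimal from-scratch sketch of multinomial Naive Bayes over bag-of-words counts with add-one smoothing. It is not the implementation used in the notebooks: the class name `BowNaiveBayes`, the whitespace tokenizer and the four training titles are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real preprocessing."""
    return text.lower().split()


class BowNaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (add-one smoothing)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy titles, only to show the interface.
train_texts = [
    "lpt keep a spare key with a trusted neighbour",
    "lpt back up your photos before updating your phone",
    "slpt save money on heating by setting your house on fire",
    "slpt never be late again by never agreeing to a time",
]
train_labels = [
    "LifeProTips", "LifeProTips",
    "ShittyLifeProTips", "ShittyLifeProTips",
]

model = BowNaiveBayes().fit(train_texts, train_labels)
print(model.predict("lpt back up your key"))  # LifeProTips
```

Training amounts to counting words per class, which is why this baseline is cheap compared with the SVM variants.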

AmITheAsshole

  1. An unbalanced dataset heavily reduces recall and F-score, so undersampling or oversampling (ideally both) should be applied when facing one. However, resampling makes training slow unless substantial processing power is available.
  2. Ethical judgement based on a title or a single sentence is a very complex task, so it would be interesting to establish a human baseline for it. This would make expectations clearer and put a model's results in context.
  3. Performance might improve by using the full body of the submission rather than only the title. However, this should be done with large, well-balanced datasets, since submissions can be very lengthy, which could lead to overfitting.
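
The under- and oversampling mentioned in the first point can be sketched as plain random resampling, shown below. This is one simple variant under stated assumptions, not necessarily the procedure used for the results above: the function name `resample`, the common `target` size and the toy label counts are all illustrative.

```python
import random
from collections import Counter, defaultdict


def resample(samples, labels, target, seed=0):
    """Randomly resample every class to exactly `target` examples.

    Classes with at least `target` examples are undersampled without
    replacement; smaller classes keep all their examples and are topped
    up with copies drawn with replacement (oversampling).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        if len(group) >= target:
            chosen = rng.sample(group, target)
        else:
            chosen = group + rng.choices(group, k=target - len(group))
        out_samples.extend(chosen)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy data with a skew loosely resembling the label distribution.
labels = ["NTA"] * 14 + ["YTA"] * 6 + ["ESH"] * 2
samples = list(range(len(labels)))

balanced_samples, balanced_labels = resample(samples, labels, target=6)
print(Counter(balanced_labels))  # every class now has 6 examples
```

Resampling should be applied to the training split only. Duplicating or dropping examples in the validation or test split would distort the reported scores.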

nlpreddit's People

Contributors

jcodingstuff, lucas-ubm
