
ga-dsi-proj-3's Introduction

Project 3: Web APIs & Classification


Executive Summary

This project uses Reddit's web API to collect posts from two subreddits, "Developing Android Apps" and "iOS Programming".

The posts are pre-processed, vectorized, and transformed into tf-idf scores, which are then used to train two machine learning models: a Logistic Regression classifier and a Naive Bayes classifier.

The accuracies of the two models are then compared to identify the better one.

In the first instance, where only the selftext of the posts was used, Logistic Regression outperformed the Naive Bayes classifier in terms of accuracy.

In the second instance, where both the selftext and the title of the posts were factored into each model, the Naive Bayes classifier outperformed Logistic Regression in terms of accuracy, and each model had a better classification accuracy than in the first instance.

The results of this project show that both features, selftext and title, ought to be used when training the Logistic Regression and Naive Bayes classifiers, and that Naive Bayes is the better model for classifying the posts accurately once both features are used.


Introduction

Reddit posts were fetched from the endpoints of two subreddits, "iOS Programming" and "Developing Android Apps". The posts were pre-processed and then used to train a Logistic Regression classifier and a Multinomial Naive Bayes classifier separately, to identify the model with the better classification accuracy. Feature engineering was then performed, and the two models were compared again to identify the one with the better classification accuracy.
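
The train-and-compare step described above could be sketched as follows. This is a minimal illustration, not the project's actual notebook code: it assumes scikit-learn is installed, uses TfidfVectorizer as a one-step stand-in for separate vectorizing and tf-idf transformation, and the function name, toy posts, split size, and seed are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_models(texts, labels, test_size=0.25, seed=42):
    """Fit both classifiers on tf-idf features and return their test accuracies."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "naive_bayes": MultinomialNB(),
    }
    scores = {}
    for name, model in models.items():
        # TfidfVectorizer tokenizes, counts terms, and applies tf-idf weighting.
        pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), model)
        pipeline.fit(X_train, y_train)
        # For classifiers, score() returns the accuracy on the held-out posts.
        scores[name] = pipeline.score(X_test, y_test)
    return scores


# Hypothetical toy posts (1 = Android, 0 = iOS), only to make the sketch runnable.
toy_texts = [
    "kotlin coroutine crash in android studio emulator",
    "gradle build fails after android studio update",
    "jetpack compose recyclerview migration on android",
    "android manifest permission for kotlin service",
    "swiftui view not updating in xcode preview",
    "xcode archive fails with swift package error",
    "uikit table view cell layout in swift",
    "testflight build rejected for ios swift app",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

scores = compare_models(toy_texts, toy_labels)
print(scores)
```

Wrapping the vectorizer and the model in one pipeline means the tf-idf vocabulary is learned from the training split only, so the test accuracy is not inflated by information from the held-out posts.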


Problem Statement

To identify the better model between Logistic Regression and Naive Bayes, based on their accuracy in classifying Reddit posts into two categories: the "iOS Programming" subreddit and the "Developing Android Apps" subreddit.


Overview

The pipeline:

  • Import libraries
  • Import data
  • Exploratory Data Analysis
  • Modeling
  • Further Modeling
  • Conclusions and Recommendations

Datasets

Provided Data

Two datasets are used in this project, one for each subreddit. The posts were fetched through Reddit's web API from the "Developing Android Apps" subreddit and the "iOS Programming" subreddit.
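
As a rough sketch of how one page of posts might be collected, the helpers below build a listing URL and pull the title and selftext out of a listing response. Several things here are assumptions rather than details from the project: the subreddit slug `androiddev` (and, by analogy, `iOSProgramming`), the use of the public `/new.json` listing, the helper names, and the sample page, which is a hand-written stand-in for a real response.

```python
from urllib.parse import urlencode


def listing_url(subreddit, after=None, limit=100):
    """Build the URL for one page of a subreddit's newest posts."""
    params = {"limit": limit}
    if after:
        # 'after' is the cursor returned by the previous page.
        params["after"] = after
    return f"https://www.reddit.com/r/{subreddit}/new.json?{urlencode(params)}"


def extract_posts(listing, label):
    """Return (rows, next_cursor) for one listing page, tagging rows with a label."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        rows.append({
            "name": post["name"],
            "title": post.get("title", ""),
            "selftext": post.get("selftext", ""),
            "subreddit": label,
        })
    return rows, listing["data"].get("after")


# Hypothetical page in the shape of a Reddit listing, used instead of a live request.
sample_page = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc",
        "children": [
            {"kind": "t3", "data": {"name": "t3_abc", "title": "Gradle sync fails",
                                    "selftext": "Sync breaks after the update."}},
        ],
    },
}

rows, cursor = extract_posts(sample_page, label="androiddev")
next_url = listing_url("androiddev", after=cursor)
print(next_url)
```

A real collection loop would request each URL with a descriptive User-Agent header, pause between requests, and stop once the returned cursor is empty; those details are omitted so the sketch runs without network access.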


Conclusion and Recommendations

The classification metrics of the models used are as follows:

                    Before Feature Engineering           After Feature Engineering
Metric              Naive Bayes   Logistic Regression    Naive Bayes   Logistic Regression
Accuracy            0.841         0.850                  0.888         0.869
Misclassification   0.159         0.150                  0.112         0.131
Sensitivity         0.851         0.845                  0.841         0.820
Specificity         0.831         0.854                  0.935         0.918
Precision           0.837         0.855                  0.928         0.910
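
For reference, the five metrics in the table follow the standard confusion-matrix definitions, sketched below. The counts in the example call are made up for illustration and are not the project's results, and the source does not state which subreddit was treated as the positive class.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the table's five metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,           # share of posts classified correctly
        "misclassification": (fp + fn) / total,  # share classified incorrectly
        "sensitivity": tp / (tp + fn),           # true positive rate (recall)
        "specificity": tn / (tn + fp),           # true negative rate
        "precision": tp / (tp + fp),             # share of positive predictions that are right
    }


# Hypothetical counts, chosen only to exercise the formulas.
example = classification_metrics(tp=42, fp=6, tn=44, fn=8)
for metric, value in example.items():
    print(f"{metric}: {value:.3f}")
```

Accuracy and misclassification always sum to 1, which matches every column of the table.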

Before feature engineering was performed, the accuracy of the Logistic Regression classifier (0.850) was higher than that of the Naive Bayes classifier (0.841).

After feature engineering was performed, in which the title and selftext were combined into a single text feature, the Naive Bayes classifier outperformed the Logistic Regression classifier in terms of accuracy (0.888 versus 0.869).
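
The feature engineering step could look something like the sketch below, which joins each post's title and selftext into one string before vectorizing. The helper name, the single-space separator, and the handling of missing values are assumptions; the project may have combined the two fields differently.

```python
def combine_title_and_selftext(post):
    """Join a post's title and selftext into one text feature."""
    # Treat a missing or None field as empty so link-only posts still yield a string.
    title = (post.get("title") or "").strip()
    selftext = (post.get("selftext") or "").strip()
    return " ".join(part for part in (title, selftext) if part)


# Hypothetical posts, only to show the behavior.
with_body = combine_title_and_selftext(
    {"title": "Crash on launch", "selftext": "App dies in viewDidLoad"}
)
title_only = combine_title_and_selftext({"title": "Crash on launch", "selftext": None})
print(with_body)
print(title_only)
```

Combining the fields this way gives every post at least its title as input, which is one plausible reason both models improved once the title was included.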

Based on these findings, it is recommended that both the title and the selftext be used as input to the Naive Bayes classifier for the highest classification accuracy when classifying Reddit posts into one of the two subreddits, "Developing Android Apps" and "iOS Programming".
