Project 3: Web APIs & Classification

Executive Summary
Introduction
Problem Statement
Overview
Datasets
Conclusion and Recommendations

Executive Summary

This project uses Reddit's web API to collect posts from two subreddits, "Developing Android Apps" and "iOS Programming".
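As a rough illustration of the collection step, the sketch below pages through a subreddit's public JSON listing and keeps only the fields used later. The subreddit slugs (`androiddev`, `iOSProgramming`), the user agent string, and the helper names `fetch_listing` and `extract_posts` are assumptions made for this example; the project's own collection code may differ.

```python
import json
import urllib.request

# Assumed subreddit slugs; the text above only gives the display names.
SUBREDDITS = ["androiddev", "iOSProgramming"]


def fetch_listing(subreddit, after=None, limit=25):
    """Fetch one page of a subreddit's public JSON listing."""
    url = f"https://www.reddit.com/r/{subreddit}.json?limit={limit}"
    if after:
        # The "after" token returned by the previous page selects the next page.
        url += f"&after={after}"
    # A custom user agent is set because default client agents are often throttled.
    request = urllib.request.Request(
        url, headers={"User-Agent": "project3-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def extract_posts(listing):
    """Return (posts, after_token) from a listing payload."""
    posts = []
    for child in listing["data"]["children"]:
        data = child["data"]
        posts.append({
            "title": data.get("title", ""),
            "selftext": data.get("selftext", ""),
            "subreddit": data.get("subreddit", ""),
        })
    return posts, listing["data"].get("after")
```

Calling `fetch_listing` repeatedly with the returned `after` token, and pausing between requests, would accumulate posts page by page for each subreddit.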

These posts are pre-processed, vectorized, and transformed into tf-idf scores before being used to train two machine learning models: a Logistic Regression classifier and a Naive Bayes classifier.
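To make the vectorization step concrete, here is a minimal from-scratch sketch of tf-idf scoring. It assumes a very simple tokenizer and one common smoothed idf variant, ln((1 + n) / (1 + df)) + 1; the project itself most likely relied on a library vectorizer, whose tokenization and normalization may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a stand-in for pre-processing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(documents):
    """Return one {term: score} dict per document using a smoothed idf."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return scores
```

Terms that are frequent within a post but rare across posts receive the highest scores, which is what lets platform-specific vocabulary stand out for the classifiers.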

The accuracies of the two models are then compared to identify the better one.

In the first instance, where only the selftext of the posts was used, Logistic Regression outperformed the Naive Bayes classifier in terms of accuracy.

In the second instance, where both the selftext and the title of the posts were used, the Naive Bayes classifier outperformed Logistic Regression in terms of accuracy, and each model achieved a higher accuracy than it did in the first instance.

The results show that both the selftext and the title ought to be used when training the Logistic Regression and Naive Bayes classifiers, and that Naive Bayes is the more accurate of the two models once both features are used.

Introduction

Reddit posts were fetched from the endpoints of two subreddits, "iOS Programming" and "Developing Android Apps". The posts were pre-processed and then used to train a Logistic Regression classifier and a Multinomial Naive Bayes classifier separately, to identify the model with the better classification accuracy. Feature engineering was then performed, and the two models were compared again on the same criterion.
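For readers unfamiliar with the Multinomial Naive Bayes classifier, the sketch below shows the idea on raw token counts with add-one (Laplace) smoothing. The class name `TinyMultinomialNB` and the toy training data in the usage note are hypothetical; this is an illustration of the technique, not the implementation used in the project.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token counts with add-one smoothing."""

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(token_lists, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the log likelihood of every known token.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                if token in self.vocab:
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

For example, after fitting on `[["swift", "xcode"], ["xcode", "storyboard"], ["kotlin", "gradle"], ["gradle", "android"]]` with labels `["ios", "ios", "android", "android"]`, calling `predict(["gradle"])` returns `"android"`.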

Problem Statement

To identify the better model between Logistic Regression and Naive Bayes, based on their accuracy in classifying Reddit posts into one of two categories: the "iOS Programming" subreddit or the "Developing Android Apps" subreddit.

Overview

The pipeline:

Import libraries
Import data
Exploratory Data Analysis
Modeling
Further Modeling
Conclusions and Recommendations

Datasets

Provided Data

For this project, two datasets are provided, one for each subreddit.

The posts in both datasets were fetched from Reddit's web API for the "Developing Android Apps" and "iOS Programming" subreddits.

Conclusion and Recommendations

The classification metrics of the models used are as follows:

| Metric | Before feature engineering: Naive Bayes | Before feature engineering: Logistic Regression | After feature engineering: Naive Bayes | After feature engineering: Logistic Regression |
|---|---|---|---|---|
| Accuracy | 0.841 | 0.850 | 0.888 | 0.869 |
| Misclassification | 0.159 | 0.150 | 0.112 | 0.131 |
| Sensitivity | 0.851 | 0.845 | 0.841 | 0.820 |
| Specificity | 0.831 | 0.854 | 0.935 | 0.918 |
| Precision | 0.837 | 0.855 | 0.928 | 0.910 |
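The five metrics reported above all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name `classification_metrics` and the counts in the usage note are hypothetical and are not the project's actual confusion matrices.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the reported metrics from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        # Misclassification rate is the complement of accuracy.
        "misclassification": 1 - accuracy,
        # Sensitivity (recall): share of actual positives that were caught.
        "sensitivity": tp / (tp + fn),
        # Specificity: share of actual negatives that were correctly rejected.
        "specificity": tn / (tn + fp),
        # Precision: share of predicted positives that were correct.
        "precision": tp / (tp + fp),
    }
```

With illustrative counts such as `classification_metrics(tp=40, fp=10, tn=45, fn=5)`, accuracy and misclassification sum to 1, which matches the pattern in every column of the results.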

It can be observed that the accuracy of the Logistic Regression classifier was higher than the accuracy of the Naive Bayes classifier before feature engineering was performed.

After feature engineering, in which the title and selftext of each post were combined, the Naive Bayes classifier outperformed the Logistic Regression classifier in terms of accuracy.
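One plausible way to combine the two fields is simple string concatenation, sketched below. The helper name `combine_text` and the handling of empty or placeholder bodies (`[removed]`, `[deleted]`) are assumptions for illustration; the project may have treated such posts differently.

```python
def combine_text(post):
    """Join title and selftext into one string, tolerating missing bodies."""
    title = (post.get("title") or "").strip()
    selftext = (post.get("selftext") or "").strip()
    # Assumed convention: placeholder bodies of removed posts carry no signal.
    if selftext in {"[removed]", "[deleted]"}:
        selftext = ""
    return " ".join(part for part in (title, selftext) if part)
```

The combined string can then be vectorized exactly as the selftext alone was, so the only change between the two modeling rounds is the input text.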

Based on these findings, it is recommended that both the title and the selftext be fed into the Naive Bayes classifier to obtain the highest accuracy when classifying Reddit posts as belonging to either the "Developing Android Apps" or the "iOS Programming" subreddit.

ngweiern / ga-dsi-proj-3