This project uses Reddit's web API to collect posts from two subreddits, "Developing Android Apps" and "iOS Programming".
The posts are pre-processed, vectorized, and converted to tf-idf scores, then used to train two machine learning models: a Logistic Regression classifier and a Naive Bayes classifier.
The accuracies of the two models are then compared to identify the better one.
In the first run, where only the selftext (body) of each post was used, Logistic Regression outperformed Naive Bayes in accuracy.
In the second run, where both the title and the selftext were used, Naive Bayes outperformed Logistic Regression in accuracy, and each model was more accurate than in the first run.
These results suggest that both the title and the selftext should be used as features when training either model, and that Naive Bayes is the better model for classifying the posts once both features are included.
Reddit posts were fetched from the endpoints of two subreddits, "iOS Programming" and "Developing Android Apps". The posts were pre-processed and then used to train a Logistic Regression classifier and a Multinomial Naive Bayes classifier separately, to identify the model with the better classification accuracy. Feature engineering was then performed, and the two models were retrained and compared again on the same criterion.
The goal is to identify which of Logistic Regression and Naive Bayes classifies Reddit posts more accurately into two categories: the "iOS Programming" subreddit and the "Developing Android Apps" subreddit.
The pipeline:
- Import libraries
- Import data
- Exploratory Data Analysis
- Modeling
- Further Modeling
- Conclusions and Recommendations
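The modeling step in the outline above could be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the project's actual notebook code: the function name `compare_models`, the use of `TfidfVectorizer` (rather than a separate count vectorizer and tf-idf transformer), the English stop-word list, the 75/25 split, and the random seed are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_models(texts, labels, test_size=0.25, seed=42):
    """Fit tf-idf + each classifier and return held-out accuracy per model.

    Hypothetical helper: the split size and seed are illustrative choices.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    models = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, clf in models.items():
        # Fitting the vectorizer inside the pipeline means the vocabulary
        # and idf weights are learned from the training split only.
        pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
        pipe.fit(X_train, y_train)
        scores[name] = pipe.score(X_test, y_test)  # mean accuracy
    return scores
```

Using the same split for both models keeps the accuracy comparison like-for-like.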
This project uses two datasets, one for each subreddit.
The posts were fetched through Reddit's web API from the "Developing Android Apps" and "iOS Programming" subreddits.
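As a rough sketch of how such posts might be collected, the snippet below pages through a subreddit's public JSON listing and keeps only the fields used for classification. It is an assumption-laden illustration, not the project's own collection code: the helper names, the `User-Agent` string, the page count, and the pause between requests are hypothetical, and the subreddit slugs `androiddev` and `iOSProgramming` are assumed to correspond to the two subreddits named above.

```python
import json
import time
import urllib.parse
import urllib.request


def extract_posts(listing):
    """Pull subreddit, title and selftext out of one Reddit listing page."""
    children = listing.get("data", {}).get("children", [])
    return [
        {
            "subreddit": child["data"].get("subreddit", ""),
            "title": child["data"].get("title", ""),
            "selftext": child["data"].get("selftext", ""),
        }
        for child in children
    ]


def fetch_posts(subreddit, pages=4, pause=2.0):
    """Page through https://www.reddit.com/r/<subreddit>.json (needs network)."""
    posts, after = [], None
    for _ in range(pages):
        query = {"limit": 100}
        if after:
            # "after" is the cursor returned by the previous page.
            query["after"] = after
        url = (
            f"https://www.reddit.com/r/{subreddit}.json?"
            + urllib.parse.urlencode(query)
        )
        request = urllib.request.Request(
            url, headers={"User-Agent": "subreddit-classifier-demo/0.1"}
        )
        with urllib.request.urlopen(request) as response:
            listing = json.load(response)
        posts.extend(extract_posts(listing))
        after = listing["data"].get("after")
        if after is None:  # no further pages
            break
        time.sleep(pause)  # be polite to the API between requests
    return posts


# Example call (assumed slug; requires network access):
# android_posts = fetch_posts("androiddev")
```

Keeping the parsing in `extract_posts` separate from the network call makes it possible to check the field extraction on a saved response without hitting the API.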
The classification metrics of the models used are as follows:
| Metrics | Naive Bayes (before feature engineering) | Logistic Regression (before feature engineering) | Naive Bayes (after feature engineering) | Logistic Regression (after feature engineering) |
|---|---|---|---|---|
| Accuracy | 0.841 | 0.850 | 0.888 | 0.869 |
| Misclassification | 0.159 | 0.150 | 0.112 | 0.131 |
| Sensitivity | 0.851 | 0.845 | 0.841 | 0.820 |
| Specificity | 0.831 | 0.854 | 0.935 | 0.918 |
| Precision | 0.837 | 0.855 | 0.928 | 0.910 |
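
For reference, the five metrics in the table can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name is hypothetical, which subreddit was treated as the positive class is not stated in this document, and the counts in the example call are made up for illustration rather than taken from the project.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the table's metrics from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. both classes occur
    and at least one positive prediction was made.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "misclassification": 1 - accuracy,  # complement of accuracy
        "sensitivity": tp / (tp + fn),      # true positive rate (recall)
        "specificity": tn / (tn + fp),      # true negative rate
        "precision": tp / (tp + fp),        # correct share of positive predictions
    }


# Illustrative counts only, not the project's confusion matrix.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Because misclassification is simply one minus accuracy, the two rows in the table always sum to 1 for each model.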
Before feature engineering, Logistic Regression was slightly more accurate than Naive Bayes (0.850 vs. 0.841).
After feature engineering, in which the title and selftext were combined, Naive Bayes outperformed Logistic Regression in accuracy (0.888 vs. 0.869).
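One simple way to combine the two fields is to join them into a single string before vectorizing. The helper below is a minimal sketch of that idea under stated assumptions: the function name is hypothetical, posts are assumed to be dictionaries with `title` and `selftext` keys, and the project's actual feature engineering may have differed in detail.

```python
def combine_title_and_selftext(post):
    """Join title and selftext into one string, tolerating missing values."""
    # "or ''" turns a missing key or a None value into an empty string.
    parts = [post.get("title") or "", post.get("selftext") or ""]
    # Skip empty parts so link-only posts do not get a trailing space.
    return " ".join(part.strip() for part in parts if part.strip())


# Illustrative post, not taken from the datasets.
example = combine_title_and_selftext(
    {"title": "Crash on launch", "selftext": "My app crashes."}
)
```

Posts with an empty selftext still contribute their title, so no rows need to be dropped for missing body text.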
Based on these findings, it is recommended that both the title and the selftext be used as input to the Naive Bayes classifier for the highest accuracy when classifying Reddit posts into the "Developing Android Apps" and "iOS Programming" subreddits.