by Harshini Konduri, Pallavi Damle, Shantanu Saha, Harsh Vardhanshukla, Koba Khitalishvili
In this project we classify articles using features generated by applying probabilistic topic modelling algorithms, such as Latent Dirichlet Allocation (LDA), to the corpus.
We wrote a New York Times scraper that uses BeautifulSoup to scrape the article pages. The script `scripts/webscrapers/nytscraper.py` works with Python 2.7.
We also wrote an alternative scraper, `nytsnippetgetter.py`, that does not parse the actual HTML pages to get the information. The difference is that instead of the full article body we get just the `snippet`, `lead_paragraph` and `abstract`. The advantage is that we obtain data even for articles that are available to subscribers only. Additionally, it is much quicker: we were able to download 20 thousand files in around 15 minutes.
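To illustrate what working with those three fields might look like, here is a minimal Python 3 sketch. It is not the code from `nytsnippetgetter.py`: the helper names `extract_text_fields` and `combine_text`, the dict-shaped record, and the sample values are all assumptions made for illustration. Only the field names come from the description above.

```python
# Hypothetical helpers; only the three field names come from the text above.
TEXT_FIELDS = ("snippet", "lead_paragraph", "abstract")


def extract_text_fields(doc):
    """Return the non-empty text fields of one article record as a dict."""
    fields = {}
    for name in TEXT_FIELDS:
        value = doc.get(name)
        # Skip missing, null, or whitespace-only fields.
        if isinstance(value, str) and value.strip():
            fields[name] = value.strip()
    return fields


def combine_text(doc):
    """Join the available fields into one string, skipping exact duplicates."""
    seen = []
    for value in extract_text_fields(doc).values():
        if value not in seen:
            seen.append(value)
    return " ".join(seen)


# Made-up record; the layout of the real downloaded files is an assumption.
record = {
    "snippet": "City council approves new budget.",
    "lead_paragraph": "City council approves new budget.",
    "abstract": None,
    "web_url": "https://example.com/article",
}
text = combine_text(record)
```

Joining the fields into one string gives each article a single short document that can be fed to the vectorizer, even when some fields are empty or repeat each other.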
To see the code and how to do the nifty visualization of the LDA model topics, check out `simple-intro.ipynb`.
A real-world example, in 7 steps, of where this would be useful
Imagine we have a library of books without titles in a digital form. How can we classify them by topic without skimming through each book?
- Compute the TFxIDF matrix for the corpus
- Obtain topic term distribution estimates using LDA or NMF
- Choose topics you consider meaningful
- Train a classifier on the topic-term distribution estimates using a subset of the corpus
- Classify the rest of the corpus
- ???
- Profit
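The first five steps above can be sketched with scikit-learn roughly as follows. This is a minimal sketch under stated assumptions, not the project's code: the toy documents and labels are made up, NMF is picked arbitrarily over LDA, and the features passed to the classifier are the per-document topic weights returned by `fit_transform` / `transform`.

```python
from sklearn.decomposition import NMF
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the untitled books (made up for illustration).
train_docs = [
    "the spaceship landed on a distant planet",
    "astronauts explored the alien planet surface",
    "the detective questioned the murder suspect",
    "police found a clue at the murder scene",
]
train_labels = ["scifi", "scifi", "crime", "crime"]
rest_docs = [
    "the alien spaceship left the planet",
    "the detective found the suspect",
]

# Step 1: TFxIDF matrix for the labelled subset of the corpus.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(train_docs)

# Step 2: topic estimates (2 topics only because the toy corpus is tiny).
nmf = NMF(n_components=2, random_state=0)
X_train_topics = nmf.fit_transform(X_train_tfidf)

# Step 3: list the top terms per topic so a human can judge which are meaningful.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
top_terms = [
    [terms[i] for i in topic.argsort()[::-1][:3]] for topic in nmf.components_
]

# Step 4: train a classifier on the topic features.
clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
clf.fit(X_train_topics, train_labels)

# Step 5: classify the rest of the corpus with the already fitted models.
X_rest_topics = nmf.transform(vectorizer.transform(rest_docs))
predicted = clf.predict(X_rest_topics)
```

Printing `top_terms` is one simple way to support the manual "choose topics you consider meaningful" step.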
In this section we explore the possibility of classifying documents using the topic-term distribution estimates generated by LDA and NMF.
We fit the LDA model on the original TFxIDF matrix and get the topic-term distribution. Then, when new documents come in, we compute the TFxIDF matrix for the new corpus and get its topic-term distribution using the LDA model trained on the existing corpus. We do the same for NMF.
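A minimal scikit-learn sketch of this fit-then-transform flow is shown below. It assumes that the vectorizer fitted on the existing corpus is reused for the new documents, so that the columns of both TFxIDF matrices refer to the same vocabulary; the toy documents are made up. In scikit-learn terms, the features obtained for each document are the per-document topic weights returned by `transform`.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for the existing and the incoming corpus.
existing_docs = [
    "the senate passed the budget bill",
    "the president signed the new bill",
    "the team won the championship game",
    "the striker scored in the final game",
]
new_docs = [
    "the senate debated the bill",
    "the team lost the game",
]

# Fit the vectorizer and the LDA model on the existing corpus only.
vectorizer = TfidfVectorizer()
X_existing = vectorizer.fit_transform(existing_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
existing_features = lda.fit_transform(X_existing)

# New documents: reuse the fitted vectorizer and model, with no refitting.
X_new = vectorizer.transform(new_docs)
new_features = lda.transform(X_new)
```

The same pattern applies to NMF by swapping in `sklearn.decomposition.NMF`, which also exposes `fit_transform` and `transform`.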
For classification we tried Multinomial Naive Bayes and ExtraTreesClassifier; the results are in the table below. In general, the more topics the LDA model has, the better the classification accuracy. We went with 20. For the NMF model, 10 topics were enough to produce features that, with ExtraTreesClassifier, give accuracy almost as good as using the TFxIDF matrix directly (0.8723 vs. 0.8885).
| Classifier / Features | LDA | NMF | TFxIDF |
|---|---|---|---|
| Multinomial Naive Bayes | 0.559 | 0.3039 | 0.7817 |
| ExtraTreesClassifier | 0.7458 | 0.8723 | 0.8885 |
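A comparison like the one in the table could be produced along these lines. This is only a sketch under assumptions: the two-class toy corpus is made up, accuracy is measured on a single hold-out split (the evaluation protocol actually used is not stated here), and both topic models use 2 topics instead of the 20 (LDA) and 10 (NMF) mentioned above because the toy corpus is tiny. The scores it prints are not comparable to the table.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up two-class corpus; the real project used New York Times articles.
sports = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "fans cheered as the team lifted the trophy",
    "the goalkeeper saved a penalty in the match",
    "the league season ends with a final game",
]
finance = [
    "the bank raised interest rates again",
    "stocks fell as investors sold shares",
    "the market rallied after the earnings report",
    "investors worry about inflation and rates",
    "the central bank cut its growth forecast",
    "shares of the lender dropped on weak earnings",
]
docs = sports + finance
labels = ["sports"] * len(sports) + ["finance"] * len(finance)

# Hold-out split (an assumption about the evaluation protocol).
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=4, stratify=labels, random_state=0
)

# Fit the vectorizer and topic models on the training documents only.
vectorizer = TfidfVectorizer()
tfidf_train = vectorizer.fit_transform(docs_train)
tfidf_test = vectorizer.transform(docs_test)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
nmf = NMF(n_components=2, random_state=0, max_iter=500)

feature_sets = {
    "LDA": (lda.fit_transform(tfidf_train), lda.transform(tfidf_test)),
    "NMF": (nmf.fit_transform(tfidf_train), nmf.transform(tfidf_test)),
    "TFxIDF": (tfidf_train, tfidf_test),
}
classifiers = {
    "Multinomial Naive Bayes": lambda: MultinomialNB(),
    "ExtraTreesClassifier": lambda: ExtraTreesClassifier(
        n_estimators=100, random_state=0
    ),
}

# Train a fresh classifier for every (classifier, feature set) pair.
scores = {}
for clf_name, make_clf in classifiers.items():
    for feat_name, (X_train, X_test) in feature_sets.items():
        clf = make_clf()
        clf.fit(X_train, y_train)
        scores[(clf_name, feat_name)] = accuracy_score(y_test, clf.predict(X_test))

for (clf_name, feat_name), acc in scores.items():
    print(f"{clf_name:25s} {feat_name:7s} {acc:.3f}")
```

All three feature sets are non-negative, which is what allows Multinomial Naive Bayes to be fitted on each of them.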
- Use features from LDA and NMF in one classifier
- Do visuals
- Do a test example