Giter Site home page Giter Site logo

data-mining-using-nlp's Introduction

Data-mining-using-NLP

by Harshini Konduri, Pallavi Damle, Shantanu Saha, Harsh Vardhanshukla, Koba Khitalishvili

In this project we are classifying articles based on the features generated by applying the probabilistic topic modelling algorithms like Latent Dirichlet Allocation to the corpus.

We wrote New York Times scraper which uses BeautifulSoup to scrape the article pages. Script scripts/webscrapers/nytscraper.py works with python 2.7.

We also wrote an alternative scraper nytsnippetgetter.py that does not parse the actual html pages to get the information. The difference is that instead of full article body we get just the snippet, lead_paragraph and asbtract. The advantage is that we obtain data even about articles that are available for subscribers only. Additionally, it is much quicker. We were able to download 20 thousand files in around 15 minutes.

Code and Results

To see the code and see how to do the nitty visualization of the LDA model topics check out simple-intro.ipynb

Feature Extraction Using Topic Modelling

A real world example where such a thing would be useful in 7 steps

Imagine we have a library of books without titles in a digital form. How can we classify them by topic without skimming through each book?

  1. Compute the TFxIDF matrix for the corpus
  2. Obtain topic term distribution estimates using LDA or NMF
  3. Choose topics you consider meaningful
  4. Train a classifier on the topic-term distribution estimates using a subset of the corpus
  5. Classify the rest of the corpus
  6. ???
  7. Profit

In this section we explore the possibility of classifying documents using the topic-term distribution estimates generated by the LDA and NMF.

We fit the LDA model on the original TFxIDF and get the topic-term distribution. Then, when we have new documents coming in we obtain the TFxIDF matrix for the new corpus and get the topic-term distribution using the LDA model trained on the existing corpus. We do the same for NMF.

For classification we tried Multinomial Naive Bayes and ExtraTreesClassifier. Below you can see results. In general the bigger the number of topics for the LDA model the better the classification accuracy. We went with 20. For NMF model 10 topics was enough to produce features that give us accuracy almost as good as when using the TFxIDF matrix.

Features / Classifier LDA NMF TFxIDF
Multinomial Naive Bayes 0.559 0.3039 0.7817
ExtraTreesClassifier 0.7458 0.8723 0.8885

LDA

To do's

  • [] - Use features from LDA and NMF in a one classifier
  • - Do visuals
  • - Do a test example.

data-mining-using-nlp's People

Contributors

kobakhit avatar pallavidamle avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar  avatar

Forkers

hachik

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.