This project forked from piskvorky/topic_modeling_tutorial

Instructions & code for the EuroPython 2014 training session "Topic Modeling for Fun and Profit"

Home Page: https://ep2014.europython.eu/en/schedule/sessions/90/

topic_modeling_tutorial's Introduction

This repository contains code and instructions for my EuroPython 2014 tutorial, "Topic Modeling for Fun and Profit".

Tutorial setup

Install the following packages before the training starts:

$ pip install six cython numpy scipy ipython[notebook]
$ pip install nltk gensim pattern requests textblob
$ python -m textblob.download_corpora lite

If you run into problems, follow the installation instructions of the specific package (e.g. the scipy instructions), ask on its mailing list (don't forget to report your operating system and the actual error message), or contact me in advance. There won't be much time for troubleshooting dependencies during the training itself!
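A quick way to see which packages are still missing before the training is to ask Python whether each one can be found. The snippet below is only a sketch, assuming Python 3.4 or newer (for importlib.util.find_spec); the REQUIRED list and the missing_packages helper are illustrative names, not part of this repository. Note that some import names differ in case from their pip names.

```python
import importlib.util

# Import names for the packages installed above. The import name can differ
# in case from the pip name (ipython -> IPython, cython -> Cython).
REQUIRED = ["six", "Cython", "numpy", "scipy", "IPython",
            "nltk", "gensim", "pattern", "requests", "textblob"]


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks that each module can be located; it does not import it, so a broken build may still fail later.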

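For Windows users, it may be easier to use conda to manage the dependencies. Download Miniconda, install it, then run the commands below (in the plain Windows command prompt, type activate topic_modeling without the leading source):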

$ conda create -n topic_modeling six cython numpy scipy ipython-notebook nltk requests pip
$ source activate topic_modeling
$ pip install nltk pattern gensim textblob
$ python -m textblob.download_corpora lite

Then download the corpora we'll use for topic modeling and indexing:

$ python download_data.py ./data

(Alternatively, download these two files [14MB, 95MB] yourself. There is no need to unzip them; just copy them into the ./data/ directory of this repository.)

You will need about 700MB of free disk space to run all the tutorial examples fully.
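To check the disk space requirement from Python, something like the following sketch may help. It assumes Python 3.3 or newer (for shutil.disk_usage); the has_enough_space helper and the REQUIRED_BYTES constant are hypothetical names introduced here for illustration.

```python
import shutil

REQUIRED_BYTES = 700 * 1024 ** 2  # roughly 700MB, the figure quoted above


def has_enough_space(path=".", required=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has `required` bytes free."""
    return shutil.disk_usage(path).free >= required


if __name__ == "__main__":
    free_mb = shutil.disk_usage(".").free // 1024 ** 2
    print("Free space: %d MB, enough: %s" % (free_mb, has_enough_space(".")))
```

Run it from the folder of this repository so that the check applies to the disk that will hold ./data.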

Check that everything works correctly by opening and running the first tutorial notebook:

$ ipython notebook '0 - Intro & Setup.ipynb'

Congratulations!

Objectives

The tutorial shows how to

  • process very large corpora efficiently, using practical NLP techniques,
  • automatically extract themes (topics) from them, using unsupervised topic modeling,
  • index documents for retrieval and
  • run semantic similarity queries ("Give me ten documents that are thematically the most similar to this one.").

The focus is on building practical applications and engineering, rather than on the theory behind topic modeling and the math itself.
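To make the idea of a similarity query concrete, here is a minimal standard-library sketch. It is not the tutorial's implementation and does not use gensim: it compares raw word counts with cosine similarity, so it measures word overlap rather than thematic similarity. The bow, cosine and most_similar helpers and the three sample documents are made up for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Turn a text into a bag-of-words: token -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, documents, topn=10):
    """Return (index, score) pairs for the `topn` documents closest to `query`."""
    query_vec = bow(query)
    scores = [(i, cosine(query_vec, bow(doc))) for i, doc in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy corpus, only for demonstration.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
]
print(most_similar("the cat and the mat", docs, topn=2))
```

In a topic-modeling setting, the bag-of-words vector would be replaced by a vector in topic space, while the ranking step would stay much the same.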

Target audience

This training assumes you are a reasonably experienced developer who knows at least the Python basics (dicts, lists, tuples, comprehensions). Knowing NumPy arrays and Python generators/iterators is a plus, but we'll go over what we need.
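As a small refresher on generators, the sketch below streams documents one at a time instead of loading them all into memory. It is an illustration only: iter_documents is a hypothetical helper, and the in-memory io.StringIO object stands in for a large file on disk.

```python
import io


def iter_documents(stream):
    """Yield one tokenized document per non-empty line, lazily."""
    for line in stream:
        tokens = line.lower().split()
        if tokens:
            yield tokens


# A small in-memory stand-in for a large file on disk (hypothetical content).
fake_file = io.StringIO("Human machine interface\n\nGraph of trees\n")

# Only one document is held in memory at a time.
for tokens in iter_documents(fake_file):
    print(tokens)
```

Because the function yields instead of returning a list, the same loop works unchanged whether the input has three lines or many millions.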

The same goes for the relevant NLP (natural language processing) and IR (information retrieval) concepts, such as lemmatization, collocations and unsupervised machine learning (clustering): I'll cover what we need during the training.

How it works

Get this repository either with a standard git clone https://github.com/piskvorky/topic_modeling_tutorial.git, or by downloading and unzipping this ZIP file.

The training materials are a set of IPython notebooks.

To run the notebooks interactively, type in your shell:

$ ipython notebook

while in the folder of this repository.

This will open a new browser window, listing all the notebooks. Start from the first one, "0 - Intro & Setup", executing each cell in turn by pressing SHIFT+ENTER.

You can also view the notebooks non-interactively (read-only mode), as HTML in your browser (no Python needed):

  • 0 - Intro & Setup
  • 1 - Streamed Corpora
  • 2 - Topic Modeling
  • 3 - Indexing and Retrieval

These static HTML notebooks also contain rendered cell output, so you can compare your results to mine.


(c) 2014 Radim Řehůřek
