topic-ensemble

This repository contains a Python reference implementation of methods for ensemble topic modeling with Non-negative Matrix Factorization (NMF).

Details of these methods are described in the following paper (currently under review):

Stability of Topic Modeling via Matrix Factorization
Mark Belford, Brian MacNamee, Derek Greene (2017)

Draft pre-print: https://arxiv.org/abs/1702.07186

Additional pre-processed datasets for use with this package can be downloaded here (179MB).

Dependencies

Tested with Python 3.5. The required packages are available via pip.

Basic Usage

Step 1.

Before applying topic modeling to a corpus, the first step is to pre-process the corpus and store it in a suitable format. The script 'parse-directory.py' can be used to parse a directory of plain text documents. Here, we parse all .txt files in the directory 'data/sample-text' and its sub-directories.

python parse-directory.py data/sample-text/ -o sample --tfidf --norm

The output will be sample.pkl, stored as a Joblib binary file. The identifiers of the documents in the dataset correspond to the original text input filenames.

Alternatively, if all of your documents are stored in a text file, with one document per line, the script 'parse-file.py' can be used:

python parse-file.py data/sample.txt -o sample --tfidf --norm
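
The --tfidf and --norm flags suggest TF-IDF term weighting followed by document length normalization. The exact weighting used by the scripts is not described here, so the following is only a minimal sketch, assuming a plain log(N/df) inverse document frequency and unit L2 length. The function name tfidf_matrix and the whitespace tokenizer are hypothetical and are not part of this package.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    index = {term: j for j, term in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        row = [0.0] * len(vocab)
        for term, count in Counter(tokens).items():
            # Assumed weighting: raw term count times log(N / df).
            row[index[term]] = count * math.log(n_docs / df[term])
        # Scale the document vector to unit Euclidean length.
        length = math.sqrt(sum(v * v for v in row))
        if length > 0:
            row = [v / length for v in row]
        rows.append(row)
    return vocab, rows

docs = [
    "stocks fall as markets open",
    "team wins the cup final",
    "markets rally as stocks rise",
]
vocab, rows = tfidf_matrix(docs)
```

Each entry of rows is one document, and each column corresponds to one term in vocab.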

Step 2.

Next, we generate a set of "base" topic models, which represent the members of the ensemble. We provide two different ways to do this.

Firstly, we can generate a specified number of base topic models using NMF and random initialization (the "Basic Ensemble" approach). For instance, we can generate 20 models, each containing k=4 topics, where each NMF run will execute for a maximum of 100 iterations. The models will be written to the directory 'models/base' as separate Joblib files.

python generate-nmf.py sample.pkl -k 4 -r 20 --maxiters 100 -o models/base
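
The idea behind the "Basic Ensemble" approach can be sketched as repeated NMF runs that differ only in their random starting point. The sketch below is not the code in generate-nmf.py: it assumes NumPy is installed and uses a simple multiplicative-update NMF as a stand-in, with a random toy matrix in place of sample.pkl. The names nmf and base_models are hypothetical.

```python
import numpy as np

def nmf(X, k, max_iters=100, seed=0):
    """Approximate X (docs x terms) as W @ H with non-negative W and H."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))   # document-topic weights
    H = rng.random((k, X.shape[1]))   # topic-term weights
    eps = 1e-9                        # guards against division by zero
    for _ in range(max_iters):
        # Multiplicative updates for the squared-error objective.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy non-negative document-term matrix (30 documents, 12 terms).
X = np.random.default_rng(42).random((30, 12))

# 20 runs with k=4 topics, each started from a different random seed.
base_models = [nmf(X, k=4, max_iters=100, seed=run) for run in range(20)]
```

Each element of base_models is one ensemble member, holding a document-topic factor and a topic-term factor.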

Alternatively, we can use the "K-Fold" ensemble approach. For instance, to execute 5 repetitions of 10 folds, we run:

python generate-kfold.py sample.pkl -k 4 -r 5 -f 10 --maxiters 100 -o models/base
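
To illustrate how repetitions and folds determine the number of base models, here is a minimal sketch of the document subsets involved. It assumes that each base model is trained on the documents outside one fold, and that the corpus is reshuffled at the start of each repetition; both points are assumptions about the approach rather than a description of generate-kfold.py. The function kfold_training_sets is hypothetical.

```python
import random

def kfold_training_sets(n_docs, n_folds, n_reps, seed=0):
    """Yield (repetition, fold, training_indices) for every base model."""
    rng = random.Random(seed)
    for rep in range(n_reps):
        order = list(range(n_docs))
        rng.shuffle(order)  # new random split for each repetition
        folds = [order[i::n_folds] for i in range(n_folds)]
        for fold_number, held_out in enumerate(folds):
            excluded = set(held_out)
            # Assumed: train on every document outside the held-out fold.
            training = sorted(d for d in order if d not in excluded)
            yield rep, fold_number, training

subsets = list(kfold_training_sets(n_docs=100, n_folds=10, n_reps=5))
```

With 5 repetitions of 10 folds, this yields 50 training subsets, and so 50 base models under the stated assumption.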

Step 3.

The next step is to combine the base topic models using an ensemble approach, to produce a final ensemble model. Note that we specify all of the factor files from the base topic models to combine, along with the number of overall ensemble topics (here again we specify k=4). The model will be written as a number of files to the directory 'models/ensemble'.

python combine-nmf.py sample.pkl models/base/*factors*.pkl -k 4 -o models/ensemble
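
One way to picture the combination step is to stack the topic-term factors of all base models into a single matrix, factorize that matrix again to obtain k ensemble topics, and then relate the documents to those topics. The sketch below follows that reading. It is an assumption about the method, not the code in combine-nmf.py: it needs NumPy, uses randomly initialized multiplicative-update NMF as a stand-in, and replaces the real base factors with random toy matrices. All names are hypothetical.

```python
import numpy as np

def nmf(X, k, max_iters=200, seed=0):
    """Approximate X as W @ H with non-negative factors."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))
    H = rng.random((k, X.shape[1]))
    eps = 1e-9
    for _ in range(max_iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
X = rng.random((30, 12))  # toy document-term matrix

# Stand-ins for the topic-term factors of 20 base models with k=4 topics.
base_factors = [rng.random((4, 12)) for _ in range(20)]

# Stack all base topics as rows: (20 * 4) topics x 12 terms.
M = np.vstack(base_factors)

# Factorize the stacked topics into k=4 ensemble topics.
_, H_ensemble = nmf(M, k=4)

# Assumed fold-in step: score each document against each ensemble topic.
W_ensemble = X @ H_ensemble.T

# A document partition assigns each document to its strongest topic.
partition = W_ensemble.argmax(axis=1)
```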

Browsing Results

We can display the top 10 terms in the topic descriptors for the final ensemble results in tabular format:

python display-top-terms.py models/ensemble/ranks_ensemble_k04.pkl 

Or using a line-by-line format:

python display-top-terms.py -l models/ensemble/ranks_ensemble_k04.pkl 

Similarly, we can display the identifiers of the top-ranked documents for each topic:

python display-top-documents.py models/ensemble/factors_ensemble_k04.pkl 
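
Both displays amount to ranking the entries of a weight vector and keeping the highest-scoring labels, whether those labels are terms or document identifiers. A minimal sketch with toy data follows; top_ranked and the example weights are hypothetical and do not reflect the layout of the .pkl files.

```python
def top_ranked(weights, labels, top=10):
    """Return the `top` labels with the largest weights, highest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [labels[i] for i in order[:top]]

terms = ["market", "goal", "stock", "match", "bank"]
# One row of term weights per topic (toy values).
topic_term_weights = [
    [0.9, 0.0, 0.7, 0.1, 0.5],
    [0.0, 0.8, 0.1, 0.6, 0.0],
]
descriptors = [top_ranked(row, terms, top=3) for row in topic_term_weights]
```

The same function ranks documents for a topic when given that topic's document weights and the list of document identifiers.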

Evaluation Measures

To evaluate the Normalized Mutual Information (NMI) accuracy of the document partitions associated with one or more topic models, relative to a ground truth dataset, run:

python eval-partition-accuracy.py sample.pkl models/base/partition*.pkl 
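
For reference, NMI between a predicted partition and ground truth labels can be computed from label counts alone. The sketch below normalizes the mutual information by the geometric mean of the two entropies. That choice of normalization is an assumption, since other variants exist and the one used by the script is not stated here. The function name nmi is hypothetical.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from the joint and marginal label frequencies.
    mi = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    # Assumed normalization: geometric mean of the entropies.
    return mi / math.sqrt(h_a * h_b)

ground_truth = [0, 0, 1, 1]
predicted = [1, 1, 0, 0]   # same grouping, different label names
score = nmi(ground_truth, predicted)
```

The score depends only on how documents are grouped, not on the label names, so the example above scores as a perfect match.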

To evaluate the stability of a collection of document partitions using Pairwise Normalized Mutual Information (PNMI), run:

python eval-partition-stability.py models/base/partition*.pkl 
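
PNMI can be read as the mean NMI over all unordered pairs of partitions in the collection. Here is a minimal sketch under that reading, with the NMI normalized by the geometric mean of the entropies (an assumption). The names nmi and pairwise_nmi are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def nmi(labels_a, labels_b):
    """NMI with geometric-mean normalization (assumed variant)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:
        return 1.0 if h_a == h_b else 0.0
    return mi / math.sqrt(h_a * h_b)

def pairwise_nmi(partitions):
    """Mean NMI over every unordered pair of partitions."""
    pairs = list(combinations(partitions, 2))
    return sum(nmi(a, b) for a, b in pairs) / len(pairs)

partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],   # same grouping as the first partition
    [0, 1, 0, 1],   # independent of the first two
]
stability = pairwise_nmi(partitions)
```

A value near 1 indicates that the base models group the documents almost identically; lower values indicate less stable partitions.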

To evaluate the stability of a collection of term rankings from topic models using Average Term Stability (ATS), run:

python eval-term-stability.py models/base/ranks*.pkl 
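
As a rough illustration of term stability, the sketch below compares two models by matching their topics one-to-one and averaging the Jaccard similarity of the matched descriptors, then averages that score over all pairs of models. This is an assumed reading of ATS, not the code in eval-term-stability.py. It finds the best matching by brute force over permutations, which is only practical for small k, and it assumes all models have the same number of topics. All names are hypothetical.

```python
from itertools import combinations, permutations

def jaccard(a, b):
    """Jaccard similarity of two term lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def term_stability(ranks_a, ranks_b):
    """Best mean Jaccard similarity over all one-to-one topic matchings."""
    k = len(ranks_a)
    return max(
        sum(jaccard(ranks_a[i], ranks_b[p[i]]) for i in range(k)) / k
        for p in permutations(range(k))
    )

def average_term_stability(models):
    """Mean term stability over every unordered pair of models."""
    pairs = list(combinations(models, 2))
    return sum(term_stability(a, b) for a, b in pairs) / len(pairs)

# Each model is a list of topics; each topic is its list of top terms.
model_1 = [["a", "b"], ["c", "d"]]
model_2 = [["c", "d"], ["a", "b"]]   # same topics in a different order
model_3 = [["a", "x"], ["c", "d"]]   # one term differs
ats = average_term_stability([model_1, model_2, model_3])
```

Because topics are matched before comparison, reordering the topics within a model does not lower the score.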

To evaluate the stability of a collection of term rankings from topic models using Average Descriptor Set Difference (ADSD), run:

python eval-term-difference.py models/base/ranks*.pkl
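
The descriptor set difference between two models can be sketched as follows: take the union of the top terms across all topics of each model, count the terms that appear in one union but not the other, and normalize. The normalization by (number of top terms × number of topics) is an assumption, as are the function names; this is not the code in eval-term-difference.py.

```python
from itertools import combinations

def descriptor_set(ranks, top):
    """Union of the top terms over all topics in one model."""
    return {term for topic in ranks for term in topic[:top]}

def descriptor_set_difference(ranks_a, ranks_b, top):
    """Size of the symmetric difference, normalized by top * k (assumed)."""
    k = len(ranks_a)
    diff = descriptor_set(ranks_a, top) ^ descriptor_set(ranks_b, top)
    return len(diff) / (top * k)

def average_dsd(models, top):
    """Mean descriptor set difference over every unordered pair of models."""
    pairs = list(combinations(models, 2))
    return sum(descriptor_set_difference(a, b, top) for a, b in pairs) / len(pairs)

model_1 = [["a", "b"], ["c", "d"]]
model_2 = [["c", "d"], ["a", "b"]]   # same terms overall
model_3 = [["a", "x"], ["c", "d"]]   # "x" replaces "b"
adsd = average_dsd([model_1, model_2, model_3], top=2)
```

Unlike the stability measures above, lower values are better here: 0 means every pair of models uses exactly the same set of descriptor terms.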
