news

topic modelling for news items

installation

prerequisites:

gensim
nltk

to install this module and prerequisites should just need:

python setup.py install

running

download some (large) data files, known as the RCV1 dataset, which consists of just over 800,000 Reuters news stories from 1996-7, mainly about topics of business interest:
```
 ./scripts/get_data.sh
```
This will create and populate an rcv1/ directory.
start up a server which trains a topic model on the RCV1 news items:
```
 python scripts/server.py rcv1
```
as soon as training has started, start up a client which queries the server for learned topics matching any text that you enter:
```
 python scripts/client.py
```
The top few terms for each matching topic will be displayed. Note that you'll actually see stemmed text tokens, i.e. partial words stripped of their endings, for each topic, as that's what the dataset contains. You should see sensible topics as long as the text you enter has some similarity with the items in the dataset. Obviously if you enter just a few words, or a sentence where none of the words ever appeared in the business papers in 1996-7, then all bets are off.

Although you can query the model right away, you should notice that the topics change (hopefully for the better) if you re-enter the same news item after a few more learning iterations.

background

As this project demonstrates, topic modelling, i.e. learning underlying topics automatically from text documents, is now a fairly mature field, and fast implementations of a number of standard models are easily available, including parallel implementations to deal with huge numbers of documents.

The implementation used here is "online" in the sense that documents can be streamed for learning and don't all have to be held in memory, hence it can process a large number of documents in a short time. The model learned is not itself truly "online" i.e. it doesn't adapt fully to new data once it has already seen a reasonable number of documents. However the quick training time means that simple strategies can be used to refresh or replace models in order to keep them well fitted to news stories as the underlying topics change over time.

gamboviol / news Goto Github PK

news's Introduction

news

installation

running

background

news's People

Stargazers

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent