
dynamic-nmf's People

Contributors

derekgreene


dynamic-nmf's Issues

document contribution to window topic

Hi:
In the paper, it says: "3) A ranking of every MEPs contributions relative to all window and dynamic topics in the corpus"

Is there a function that outputs this result, or do we need to calculate it ourselves?
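The repository does not appear to expose such a ranking directly, but in NMF the document-topic factor W already encodes each document's contribution to every topic, so a ranking can be computed from it. A minimal sketch with toy data (random values standing in for a real document-term matrix, not the paper's MEP corpus):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy stand-in for a document-term matrix
# (rows = documents/speakers, columns = terms)
rng = np.random.default_rng(0)
X = rng.random((6, 12))

model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(X)  # W[d, t] = weight of document d on topic t

# Rank documents by their contribution to topic 0, strongest first
ranking = np.argsort(W[:, 0])[::-1]
print(ranking)
```

The same idea applies per window (rank against each window model's W) or against the dynamic topic model.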

results of track-dynamic-topics.py

In the results of track-dynamic-topics.py, as shown for Dynamic Topic D01, there are 3 topics in window 3.
I am confused. Does this result mean we got 3 topics in a single time window?
Also, how do we get the topic evolution? I mean, how do we know which topic in window 2 changes into which topic in window 3, if there is more than one topic in window 2?
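As I understand the two-layer approach, the topic-term vectors from all windows are stacked into one matrix and factorised again, so each window topic is assigned to whichever dynamic topic it loads on most strongly. That is why a single dynamic topic can contain several topics from the same window. A toy sketch of that assignment step (random data, hypothetical sizes):

```python
import numpy as np
from sklearn.decomposition import NMF

# 9 window topics (e.g. 3 windows x 3 topics), each a vector over 50 terms
rng = np.random.default_rng(1)
B = rng.random((9, 50))

# Second-layer factorisation into 3 dynamic topics
W = NMF(n_components=3, init="nndsvd", random_state=0,
        max_iter=500).fit_transform(B)

# Each window topic joins the dynamic topic with its largest weight;
# nothing forces topics from the same window into different dynamic topics
assignments = W.argmax(axis=1)
print(assignments)
```

So the "evolution" is not a one-to-one mapping between consecutive windows; it is the grouping of window topics under a shared dynamic topic.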

track-dynamic-topics KeyError

Hello everyone!
Thank you for providing such a wonderful tool! I am studying topic models now.
I followed the README instructions exactly, but when I run 'track-dynamic-topics.py out....' I get a KeyError.

Traceback (most recent call last):
  File "track-dynamic-topics.py", line 101, in <module>
    main()
  File "track-dynamic-topics.py", line 57, in main
    dynamic_topic_idx = assigned_window_map[window_topic_label]
KeyError: 'month1_06'

Could you please help me with it?
Thank you very much.

Regards,
Tong

Formula used to compute "model coherence"

Hello everyone! Thanks for the attention!

I'm using this library for a university project whose scope is to analyze topics in Twitter data. I also used the LDA algorithm to discover topics in tweets. Now I'm interested in using this dynamic topic modeling approach. So, my question is:

  1. When I execute steps 2 and 3 in "Advanced Usage", this library returns a model coherence value that I can't interpret, because I don't know which formula is used to compute it (e.g. model coherence = 0.5923).
    EXAMPLE:
    When the library prints strings like the one below, it also returns a model coherence value.
    e.g. "Top recommendations for number of topics for 'month1': 6,5,9"
    ==> So, what is the formula used for this purpose?
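I believe the value reported is a TC-W2V-style coherence: for each topic, the mean pairwise word2vec similarity of its top terms, averaged over all topics (higher means more semantically coherent). A minimal sketch with toy embeddings standing in for a trained word2vec model:

```python
import itertools
import numpy as np

def topic_coherence(topics, vectors):
    """Mean pairwise cosine similarity of each topic's top terms,
    averaged over all topics (a TC-W2V style measure)."""
    scores = []
    for terms in topics:
        pair_sims = []
        for t1, t2 in itertools.combinations(terms, 2):
            v1, v2 = vectors[t1], vectors[t2]
            sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
            pair_sims.append(sim)
        scores.append(np.mean(pair_sims))
    return float(np.mean(scores))

# Toy 2-d embeddings standing in for real word2vec vectors
vecs = {"tax": np.array([1.0, 0.1]), "budget": np.array([0.9, 0.2]),
        "fish": np.array([0.1, 1.0]), "quota": np.array([0.2, 0.9])}
print(topic_coherence([["tax", "budget"], ["fish", "quota"]], vecs))
```

The "top recommendations" for the number of topics would then simply be the k values whose models scored highest on this measure.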

P.S. If I have not been clear enough, I can rewrite this question more precisely.

Again, thanks a lot for the attention!

Edoardo, an Italian Computer Science student!

Run on streams?

Hi --

From the example in the docs, it doesn't look like this is really designed to be run on streaming data, but is it possible to do so? If not, are you aware of similar dynamic topic modeling packages that could work on a data stream?

Thanks
Ben

n-gram

I want to use n-grams to build my window topic model:
python prep-text.py data/sample/month1 data/sample/month2 data/sample/month3 -o data --tfidf --norm --ngram 3
python find-window-topics.py data/*.pkl -k 5 -o out
python display-topics.py out/month1_windowtopics_k05.pkl out/month2_windowtopics_k05.pkl out/month3_windowtopics_k05.pkl

When I display the window topics, why are the terms still unigrams?

topic relevance in document?

Please forgive me if this is a stupid question, as I'm new to topic modeling, coding, and DTM. I've found your tutorial extremely helpful and user friendly. I just have one question: I'm running multiple DTMs on thousands of documents, and I was wondering if there is a way to see which documents are the most relevant to a particular topic, instead of just an arbitrary top 50 (or 100, 1000, etc.)?

When I've run TM using MALLET I'm able to see the percentage of how relevant a topic is in any given document and can weed out the documents that have the highest relevance using that data. I'm wondering if there's a way to do that with your method.
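There is no built-in equivalent of MALLET's per-document percentages as far as I can tell, but the NMF document-topic matrix W can be row-normalised to obtain the same kind of proportions, and then thresholded. A sketch with hypothetical weights:

```python
import numpy as np

# Hypothetical W from an NMF fit: rows = documents, columns = topics
W = np.array([[0.8, 0.1],
              [0.2, 0.6],
              [0.0, 0.9]])

# Normalise each row so the weights read like MALLET's per-document
# topic proportions, then keep documents where topic 1 exceeds 50%
props = W / W.sum(axis=1, keepdims=True)
relevant = np.where(props[:, 1] > 0.5)[0]
print(props.round(2))
print(relevant)
```

This lets you filter by a relevance threshold rather than taking a fixed top-N of documents.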

Publishing to PyPI?

Hi,

first of all, many thanks for providing the code, and also for your excellent homepage documenting all of your analysis of the European Parliament. Are there any plans to make this model pip-installable?
In my opinion your approach is the only solid Python option for including time in topic models. Meanwhile, the R package stm has become an enormous contribution for social scientists working with textual data.
Structuring dynamic-nmf so that it is as easy to use as possible, e.g. like gensim, would really be awesome for workflows, teaching purposes, etc.

Best,
Carsten

Execution Problem

I ran into a problem when trying to follow your commands.

python prep-text.py data/sample/month1 data/sample/month2 data/sample/month3 -o data --tfidf --norm
/home/jjc/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=FutureWarning)
Loaded 347 stopwords

  • Processing 'month1' from data/sample/month1 ...
    Found 438 documents to parse
    Pre-processing documents (347 stopwords, tfidf=True, normalize=True, min_df=10, max_ngram=1) ...
    Traceback (most recent call last):
      File "prep-text.py", line 91, in <module>
        main()
      File "prep-text.py", line 81, in main
        apply_norm = options.apply_norm, ngram_range = (1,options.max_ngram), lemmatizer=lemmatizer )
      File "/home/jjc/桌面/dynamic-nmf-master/text/util.py", line 40, in preprocess
        X = tfidf.fit_transform(docs)
      File "/home/jjc/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1859, in fit_transform
        X = super().fit_transform(raw_documents)
      File "/home/jjc/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1220, in fit_transform
        self.fixed_vocabulary_)
      File "/home/jjc/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1131, in _count_vocab
        for feature in analyze(doc):
      File "/home/jjc/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 108, in _analyze
        doc = ngrams(doc, stop_words)
      File "/home/jjc/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 227, in _word_ngrams
        tokens = [w for w in tokens if w not in stop_words]
    TypeError: 'NoneType' object is not iterable

I don't know why... could you please help me to solve the problem?

Execution time

I was wondering what kinds of runtimes you have encountered in practical applications of this topic model (leaving aside the question of choosing K). In my limited experience, the scikit-learn NMF decomposition has been extremely fast for small corpora (a matter of seconds), but it slows down drastically at higher K and larger matrices. I have a model currently running with K=20 on a sparse matrix with 4.3 million cells, and it has been going for hours. Compared to standard LDA, this is significantly slower.

The scikit learn documentation mentions polynomial time complexity, which would explain the huge changes in execution time I experienced, and I would like to understand whether this is an issue for others as well.
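For what it's worth, a quick way to see the scaling yourself is to time scikit-learn's NMF at different k on a sparse matrix of roughly your shape (toy sizes here, and max_iter capped so the runs terminate quickly):

```python
import time
from scipy.sparse import random as sparse_random
from sklearn.decomposition import NMF

# Toy sparse "document-term" matrix: 500 docs x 2000 terms, 1% density
X = sparse_random(500, 2000, density=0.01, random_state=0, format="csr")

for k in (5, 20):
    start = time.perf_counter()
    NMF(n_components=k, init="nndsvd", random_state=0, max_iter=100).fit(X)
    print(f"k={k}: {time.perf_counter() - start:.2f}s")
```

Capping max_iter (and loosening tol) is also a practical lever when the default settings run for hours on a large corpus, at some cost in reconstruction quality.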

How is this code dynamic?

Hi I'm Semiha Makinist,
I'm a computer engineer working on topic detection, and I want to ask you one question about this. You talk about finding dynamic topics, but you supply a k value. How is this code dynamic? I don't understand this situation.

"python find-dynamic-topics.py out/month1_windowtopics_k05.pkl out/month2_windowtopics_k05.pkl out/month3_windowtopics_k05.pkl -k 5 -o out"

Thank you in advance for your help. Have a nice day.

Best regards,
Semiha Makinist
