
Comments (34)

TommyJones commented on August 18, 2024

Hi, guys. @sweetmals, to answer your question: probabilistic coherence is now documented in one of the vignettes. GitHub version here: https://github.com/TommyJones/textmineR/blob/master/vignettes/c_topic_modeling.Rmd CRAN version here: https://cran.r-project.org/web/packages/textmineR/vignettes/c_topic_modeling.html

However, I messed up the probabilities in the description. (Instead of "P(a|b) - P(b)", it should read "P(b|a) - P(b)", for example.) I've opened the issue here: #38

@manuelbickel just implemented a bunch of coherence measures in text2vec. He also cited a paper comparing various topic coherence measures that's worth checking out: https://pdfs.semanticscholar.org/03a0/62fdcd13c9287a2d4e1d6d057fd2e083281c.pdf

In their comparison, the UMass measure performs poorly. They also use a measure that seems to be identical to probabilistic coherence. (Looks like their citation was 2007. I independently derived probabilistic coherence in 2013. So, I guess it's theirs. :-/)

Anyway, I hope this is helpful. I'll get to #38 this summer.

TommyJones commented on August 18, 2024

Hi Manuel. CalcProbCoherence does not implement the measure proposed by Mimno et al. (http://dirichlet.net/pdf/mimno11optimizing.pdf) Instead, it is a measure that I developed. (I haven't yet written it up, but it will be part of my PhD dissertation. That'll be sometime in the next two years.)

Mimno's measure suffers from promoting topics full of words that are very frequent but statistically independent of each other. For example, suppose you have a corpus of articles from the sports section of a newspaper. A topic with the words {sport, sports, ball, fan, athlete} would look great under Mimno's measure. But we actually know that it's a terrible topic because the words are so frequent in this corpus as to be meaningless. In other words, they co-occur frequently (which is what the Mimno measure captures), but they are statistically independent of each other.

Probabilistic coherence corrects for this. For each pair of words {a, b} in the top M words in a topic, probabilistic coherence calculates P(b|a) - P(b), where {a} is more probable than {b} in the topic.

Here's the logic: if we restrict our search to only documents that contain the word {a}, then the word {b} should be more probable in those documents than if chosen at random from the corpus. P(b|a) measures how probable {b} is in only the documents containing {a}. P(b) measures how probable {b} is in the corpus as a whole. If {b} is not more probable in documents containing {a}, then the difference P(b|a) - P(b) should be close to zero.

The lines of code you highlighted are doing this calculation across the top M words. For example, suppose the top 4 words in a topic are {a, b, c, d}. Then, we calculate

  • P(b|a) - P(b), P(c|a) - P(c), P(d|a) - P(d)
  • P(c|b) - P(c), P(d|b) - P(d)
  • P(d|c) - P(d)

And all 6 differences are averaged together, giving the probabilistic coherence measure.
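If it helps, here's a minimal base-R sketch of that calculation (illustrative only, not the actual CalcProbCoherence implementation; `dtm` and `top_terms` are assumed inputs):

```r
# Probabilistic coherence for one topic, written out naively.
# dtm: a plain documents-by-terms matrix with column names
# top_terms: the topic's top M words, ordered from most to least probable
prob_coherence <- function(dtm, top_terms) {
  present <- dtm[, top_terms, drop = FALSE] > 0  # TRUE where a doc contains the term
  m <- length(top_terms)
  diffs <- c()
  for (i in 1:(m - 1)) {      # {a}: the more probable word of the pair
    for (j in (i + 1):m) {    # {b}: the less probable word
      p_b_given_a <- sum(present[, i] & present[, j]) / sum(present[, i])
      p_b <- sum(present[, j]) / nrow(present)
      diffs <- c(diffs, p_b_given_a - p_b)
    }
  }
  mean(diffs)  # average over all choose(M, 2) differences
}
```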

TommyJones commented on August 18, 2024

Sounds good. I'll close the issue for now. If anything else comes up, please just comment here and we can re-open. It'll all still be online for citation.

TommyJones commented on August 18, 2024

Hope that helps. Let me know if you need me to explain anything more. Fun fact: I've run simulations and found that probabilistic coherence is great at selecting the optimal number of topics in a corpus. And, yes, I need to write all this up in a research paper.

manuelbickel commented on August 18, 2024

Hi Tommy,

thank you for your detailed answer and explanations about your new approach to topic coherence. I think I got the main point about measuring statistical dependence instead of only correlation. Please correct me if I am wrong, but it follows a line of thinking similar to the pointwise mutual information (PMI) measure for collocations, of course, with a wider context and more complexity.

Maybe I forgot to mention that my project is also part of a PhD thesis. Hence, I am planning to write a research paper with a focus on the content of papers in the field of "energy" (not on the methodology of text mining). Therefore, I wanted to know if your approach is documented anywhere other than in your package (and now in this thread) that might serve as a potential source for citation. I know you already said "yes, I need to write all this up in a paper" but I wanted to ask anyway. Otherwise reviewers will simply have to check your code ;-).

manuelbickel commented on August 18, 2024

After reading some more, I just realized that my comment concerning PMI was very uninformed, since most coherence measures make use of the basic PMI concept but differ in how they use it. Sorry for asking before reading.

TommyJones commented on August 18, 2024

Don't be so hard on yourself. But, yes, PMI is getting at the same thing. It uses the same probabilities, i.e. PMI = log(P(b|a) / P(b)). If {a} and {b} are statistically independent, then PMI will be zero. I prefer my measure as it is bounded by -1 and 1, whereas PMI goes from negative infinity to infinity. So, IMHO, probabilistic coherence is easier to interpret and compare across contexts.
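To make that concrete, here are some toy numbers (made up purely for illustration):

```r
n_docs <- 1000               # documents in the corpus
n_a    <- 100                # documents containing {a}
n_b    <- 50                 # documents containing {b}
n_ab   <- 40                 # documents containing both
p_b_given_a <- n_ab / n_a    # P(b|a) = 0.4
p_b         <- n_b / n_docs  # P(b)   = 0.05
log(p_b_given_a / p_b)       # PMI: log(8) ~ 2.08, unbounded in general
p_b_given_a - p_b            # probabilistic coherence term: 0.35, always in (-1, 1)
```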

And, unfortunately, probabilistic coherence isn't documented anywhere. You can cite the textmineR package itself, however. (Hell, be an academic rebel and cite this thread!)

I sat down to write up probabilistic coherence a few years ago. I quickly realized that, lacking a statistical theory, there was no global and objective way to say one coherence measure was better than another. So, I set about solving that problem and the coherence paper never got written. (And the statistical theory will be the bulk of my dissertation.)

manuelbickel commented on August 18, 2024

Thanks for the quick and nice answer. Since the largest amount of time is consumed by fitting models for different numbers of topics, I think it might be worth calculating all of the simple intrinsic coherence measures I can implement along the way (x-fold perplexity cross-validation, UMass, and your probabilistic coherence) and also saving some of the topic word lists, to check which of the measures points to the most reasonable topics - I guess yours will. On this basis, I can validate the measure (not statistically, but on the basis of expertise in my field) and cite textmineR and this thread. I will let you know about the results. Thank you again for your time and explanations.
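(A hedged sketch of that plan, using textmineR's FitLdaModel and CalcProbCoherence as I understand their interfaces - please double-check the arguments; `dtm` stands for my document-term matrix and the grid of k values is arbitrary.)

```r
library(textmineR)

ks <- seq(25, 300, by = 25)  # candidate numbers of topics
mean_coherence <- sapply(ks, function(k) {
  model <- FitLdaModel(dtm = dtm, k = k, iterations = 200)
  mean(CalcProbCoherence(phi = model$phi, dtm = dtm, M = 5))
})
ks[which.max(mean_coherence)]  # candidate optimum, to inspect manually
```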

sweetmals commented on August 18, 2024

Hi Tommy & Manuel,
Interesting discussion. I am also a PhD student, in the Information Systems (IS) stream, and looking for an intrinsic topic coherence measure for part of my data analysis. Therefore, I am looking for an R package (or implementation) of UMass. I came across probabilistic coherence via the textmineR package.
Have either of you documented the probabilistic coherence measure by now? If so, I'd appreciate it if you could share the reference.
Thanks!

manuelbickel commented on August 18, 2024

sweetmals commented on August 18, 2024

Thanks Manuel. I already had a look at the stm package 'topic_coherence' function. I will have a look at the text2vec package. Thanks again for the very prompt reply.

sweetmals commented on August 18, 2024

Hi Tommy,
That's great, it is documented. Thanks a lot for the information. Indeed it is helpful as I am new to this area.

manuelbickel commented on August 18, 2024

As an addition to the information Tommy has provided, here is the link to a version of the paper by Röder et al. that includes some more detailed information on the metrics used: https://doi.org/10.1145/2684822.2685324. Furthermore, I have learned that the idea of coherence and several metrics were already discussed much earlier at a more abstract level, not in direct connection to text mining, e.g., by Eells and Fitelson in "Symmetries and Asymmetries in Evidential Support" and by several other authors before them (the Röder paper also refers to some of these authors; check the references).

I agree with Tommy that UMass performs poorly. I have only tested it with one dataset so far, but it does not seem to be a useful measure, at least for finding the optimum number of topics (maybe an adapted version might do better)...just as a side note...

Furthermore, I want to highlight that without the efforts of @TommyJones - thanks! - to implement the probabilistic difference metric, I would not have been able to implement the other measures. His implementation really helped me to understand how such metrics can be programmed in R - I hope there won't turn out to be too many mistakes ;-) Also, his explanations are very straightforward and not cryptic, so that especially beginners in the field can understand the general idea quite quickly.

Concerning text2vec, the implemented difference and UMass metrics produce the same results as the textmineR and stm packages, respectively. I think there is no other R implementation of the other metrics yet for cross-checking.

sweetmals commented on August 18, 2024

Thank you so much for the support. You guys are so nice :-)
Have either of you tested topic coherence measures on short texts (or tweets), and do you have any interesting insights or findings to share? I am working on a Twitter data set, so I thought of asking as you two seem to be experts in the area.
I read two papers on that "Topics in Tweets: A User Study of Topic Coherence Metrics for Twitter Data" and "Examining the Coherence of the Top Ranked Tweet Topics". But these papers mention different measures.
Again thanks a lot for the information and help. I very much appreciate it.

TommyJones commented on August 18, 2024

sweetmals commented on August 18, 2024

Thanks Tommy.

sweetmals commented on August 18, 2024

Hi Tommy (@TommyJones) & Manuel (@manuelbickel)

Sorry for troubling you guys a lot :-(

I just had a look at text2vec coherence implementation (https://github.com/dselivanov/text2vec/blob/master/R/coherence.R).

This might be a stupid question given my lack of knowledge in this area: which of the metrics implemented there is based on UMass? I plan to use both Tommy's metric in textmineR and UMass as an experiment.

Also, what is the difference between 'distance' metrics (Cosine, Jaccard, Euclidean, etc.) and 'coherence' metrics? Although coherence is used to measure the quality of topic models, isn't it the same thing (i.e., measuring how similar or how distant topics/terms are)?

Highly appreciate if you guys could share your thoughts on this.
Thanks in advance.

sweetmals commented on August 18, 2024

Hi Guys,

I found that 'coherence_mean_logratio' is the one that implements UMass, so my stupid question is answered :-) A sketch of my understanding of that score follows below.
Would be great if you guys could share your thoughts on my second question above, apart from the different implementation approaches.
Thanks!
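Here is a minimal base-R sketch of the mean log-ratio (UMass-style) score as I understand it - illustrative only, so please check text2vec's code for the exact formula. `present` is an assumed logical documents-by-terms matrix for a topic's top terms, ordered from most to least probable:

```r
umass_logratio <- function(present, smooth = 1) {
  n <- ncol(present)
  scores <- c()
  for (i in 2:n) {          # later (less probable) term
    for (j in 1:(i - 1)) {  # earlier (more probable) term
      d_ij <- sum(present[, i] & present[, j])  # docs containing both terms
      d_j  <- sum(present[, j])                 # docs containing the earlier term
      scores <- c(scores, log((d_ij + smooth) / d_j))
    }
  }
  mean(scores)
}
```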

manuelbickel commented on August 18, 2024

sweetmals commented on August 18, 2024

Thanks a lot for the detailed explanation Manuel. It is indeed helpful. I'll let you know if I run into any issues.
Cheers!

sweetmals commented on August 18, 2024

Hi Guys,
FYI, not sure whether you both have seen this, but there is another package called 'SpeedReader' with an implementation of 'topic_coherence'.
(https://www.rdocumentation.org/packages/SpeedReader/versions/0.9.1/topics/topic_coherence)
(https://github.com/matthewjdenny/SpeedReader/blob/master/R/topic_coherence.R)

manuelbickel commented on August 18, 2024

sweetmals commented on August 18, 2024

As far as I understood from the code, it considers word frequencies in the document term matrix, not the values of beta in the topic model (the logarithmized parameters of the word distribution for each topic).

manuelbickel commented on August 18, 2024

TommyJones commented on August 18, 2024

sweetmals commented on August 18, 2024

Okay, now I get the approach. I started from stm's 'semanticCoherence', which uses beta.
Thanks, I will hopefully get a better understanding by the time I go through textmineR and text2vec.

sweetmals commented on August 18, 2024

Ha ha Tommy, please go ahead, do it soon and publish your dissertation, so that I can easily refer to it :P

manuelbickel commented on August 18, 2024

sweetmals commented on August 18, 2024

Hey Guys,
Just a short question. Sorry to bother again =(
Are the TCM-creation functions in 'textmineR' and 'text2vec' equivalent to the TermDocumentMatrix function in the 'tm' package?

Also, @TommyJones, can I use the 'Dtm2Tcm' function to convert a document term matrix created via the 'tm' package (are there any consequences or inaccuracies in doing it that way)?

Thanks in advance.

TommyJones commented on August 18, 2024

@sweetmals, the TCM functions are not the same as TermDocumentMatrix in tm. A term document matrix is just the transpose of a document term matrix. (In fact, this distinction is kind of pointless in my mind. It's one of many conventions the NLP community has that make things unnecessarily confusing.) A TCM is a term co-occurrence matrix. It's a square matrix that counts the number of times word i and word j occur together in some window. For example, if you set "skipgram_window" to 5, it will count the number of times that words i and j occur within 5 words of each other.
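Something like this, for example (a quick sketch; double-check the current textmineR docs for exact argument names):

```r
library(textmineR)

# Build a term co-occurrence matrix with a 5-word skipgram window
tcm <- CreateTcm(doc_vec = c("the quick brown fox jumps over the lazy dog",
                             "never jump over the lazy dog quickly"),
                 skipgram_window = 5)

dim(tcm)  # square: terms x terms
```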

The Dtm2Tcm function counts the number of documents in which words i and j co-occur. It's just calculating t(dtm > 0) %*% (dtm > 0).
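Here's a toy check of that identity with a made-up 3-document, 3-term DTM:

```r
dtm <- matrix(c(1, 0, 2,
                0, 1, 1,
                3, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("doc", 1:3), c("a", "b", "c")))

t(dtm > 0) %*% (dtm > 0)  # e.g. entry ["a", "c"] is 1: only doc1 contains both
```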

@manuelbickel I look forward to seeing where you go with your research. Anecdotally, I had a dataset, the DOJ's Office of Justice Programs' grant database, with ~50k documents, and prob. coherence found that I had about 50 topics. When I looked at the abstracts, many of them had very similar wording as they were grants from the same programs. Meanwhile, prob. coherence indicated that 300 - 500 topics would work for a smaller corpus of ~10k abstracts from NIH grants. The range of language is much wider there.

TommyJones commented on August 18, 2024

Also, @sweetmals, if you want to convert a tm document term matrix to the type of DTM used by textmineR and text2vec, I have a deprecated (and removed) function for this. It no longer ships with textmineR, but you can find it here

TommyJones commented on August 18, 2024

Also, also, tidytext has several "cast" functions that may do the same.

sweetmals commented on August 18, 2024

Hi @TommyJones, thank you very much. Yeah, I went through the code [t(dtm > 0) %*% (dtm > 0)] and it took me a while to understand the difference as I am refreshing my math and stats now.

@manuelbickel & @TommyJones: I look forward to you guys documenting your outcomes (methods and comparisons), either in a joint research paper or individually. This is indeed an area where there is a gap and one could make a significant research contribution. It may be helpful for scholars (scholars from IS, business, etc. who are interested in the application side) who want to apply the techniques/methods directly without putting in too much effort. Topic modeling is gaining popularity in the IS field now as it is useful for discourse analysis, in particular for theory building (which I am doing as part of my PhD thesis).
