We want to gain further insights regarding the topics and how they change using topic

Step 3: Apply topic modelling about cs-insights-crawler HOT 6 CLOSED

trannel commented on June 17, 2024

Step 3: Apply topic modelling

Comments (6)

trannel commented on June 17, 2024

Oh, I also meant that we derive the topics once and then just check, for each year, which topic each document belongs to most strongly. I would postpone this a bit, though, because of the presentation.

jpwahle commented on June 17, 2024

It would be cool if you could plot the importance of the same topics over the years, if possible. See here under "Topics over time".

trannel commented on June 17, 2024

I'm not entirely sure which behaviour you expect once this is implemented. Currently you can add the date ranges that you want to plot, but this gives you only one plot for all years combined. So you want an additional option where every year is evaluated separately and the results are then combined in a plot like the one you linked? And you only want the importance of the topics, which I assume is their numbering, not the pyLDAvis plots over time?

truas commented on June 17, 2024

Once we have the topics generated for a collection, LDA allows us to identify which documents "belong" to each topic (and which words belong to that topic). That being said, we can plot all of the (same) topics over a time window (let's say 2011-2020), using the number of documents per topic to show the importance of these topics in each year. Probably some topics are more popular in specific years. For example, take the years 2011, 2012, 2013 and the topics A, B, C in a corpus of 10 documents. We would have:

2011
- A: 1
- B: 1
- C: 0

2012
- A: 1
- B: 3
- C: 1

2013
- A: 0
- B: 1
- C: 2

Using a raw count, we can see that B was a hyped topic that peaked in 2012 and then faded away, while C was a growing theme.
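
The tally above can be reproduced with a few lines of Python. This is only a minimal sketch: the `assignments` list is made-up data that mirrors the example, and it assumes each document has already been assigned to a single dominant topic.

```python
from collections import Counter, defaultdict

# Hypothetical (year, dominant_topic) pairs for the 10 documents in the example.
assignments = [
    (2011, "A"), (2011, "B"),
    (2012, "A"), (2012, "B"), (2012, "B"), (2012, "B"), (2012, "C"),
    (2013, "B"), (2013, "C"), (2013, "C"),
]


def count_topics_per_year(pairs):
    """Return {year: Counter({topic: number_of_documents})}."""
    counts = defaultdict(Counter)
    for year, topic in pairs:
        counts[year][topic] += 1
    return counts


counts = count_topics_per_year(assignments)
topics = sorted({topic for _, topic in assignments})
for year in sorted(counts):
    # A Counter returns 0 for topics that have no documents in that year.
    print(year, {topic: counts[year][topic] for topic in topics})
```

Each per-year row can then be passed to any plotting library to draw one line per topic.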

Of course, how we count the documents over the years can vary. If I'm not wrong, LDA gives an actual percentage for how important a given word is to a topic. And for each document, you can see what percentage of it "belongs" to each topic. Again, here we can try a couple of things. We can use the weights/raw output and look at the number of documents per topic (probably 2-3 topics will dominate all the others for a given document), we can obtain the top-K most similar documents with respect to the topics at hand, or we could use the topic vectors as the centroids in a K-Means algorithm and see how the documents are distributed. Probably there will be a lot of overlap between the clusters.
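
To make the first two options concrete, here is a rough sketch that compares counting each document once (for its strongest topic) with summing the raw weights, and that ranks the top-K documents for a topic. The `doc_topic` weights are invented for illustration and all function names are hypothetical; the K-Means idea is left out.

```python
# Hypothetical document-topic weights; each row sums to 1, as LDA output would.
doc_topic = {
    "doc1": [0.70, 0.20, 0.10],
    "doc2": [0.05, 0.90, 0.05],
    "doc3": [0.40, 0.35, 0.25],
    "doc4": [0.10, 0.30, 0.60],
}


def hard_counts(weights):
    """Count each document once, for its highest-weighted topic."""
    num_topics = len(next(iter(weights.values())))
    counts = [0] * num_topics
    for row in weights.values():
        counts[row.index(max(row))] += 1
    return counts


def soft_counts(weights):
    """Sum the weights per topic, so mixed documents contribute to several topics."""
    num_topics = len(next(iter(weights.values())))
    totals = [0.0] * num_topics
    for row in weights.values():
        for topic, weight in enumerate(row):
            totals[topic] += weight
    return totals


def top_k_documents(weights, topic, k):
    """Return the k documents with the largest weight for one topic."""
    ranked = sorted(weights, key=lambda doc: weights[doc][topic], reverse=True)
    return ranked[:k]


print(hard_counts(doc_topic))
print(soft_counts(doc_topic))
print(top_k_documents(doc_topic, topic=1, k=2))
```

With this toy data the two counting schemes disagree about which topic is largest, which is why the choice between raw weights and dominant-topic counts matters for the plot.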

trannel commented on June 17, 2024

Thank you for the explanation. So, if I understand correctly, we train a model on the data from 2010 to 2020. Then we have 10 topics (or however many we defined previously). Next we check, e.g., which topic each document published in 2010 belongs to and count how many documents there were per topic. Then we do this for every year and plot it like Jan showed.

If this is correct, I can try to do this, but I currently do not know how to get this information from the model. I'll have to look into that first. If I run into problems, I'll let you know.
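
On getting the information out of the model: which LDA library is used is not stated in this thread. As far as I know, gensim's `LdaModel.get_document_topics(bow)` returns a list of `(topic_id, probability)` pairs for one document, and scikit-learn's `LatentDirichletAllocation.transform` returns a documents-by-topics array. The sketch below assumes the first shape; `fake_infer_topics` is a hypothetical stand-in for the trained model so that the example runs on its own.

```python
from collections import Counter, defaultdict


def dominant_topic(topic_probs):
    """Pick the topic id with the highest probability from (topic_id, prob) pairs."""
    return max(topic_probs, key=lambda pair: pair[1])[0]


def topic_counts_by_year(documents, infer_topics):
    """documents: iterable of (year, text); infer_topics: text -> [(topic_id, prob), ...]."""
    counts = defaultdict(Counter)
    for year, text in documents:
        counts[year][dominant_topic(infer_topics(text))] += 1
    return counts


# Hypothetical stand-in for the trained model, NOT a real library call.
def fake_infer_topics(text):
    if "parsing" in text:
        return [(0, 0.8), (1, 0.2)]
    return [(0, 0.3), (1, 0.7)]


documents = [
    (2010, "dependency parsing"),
    (2010, "neural translation"),
    (2011, "parsing with grammars"),
]
year_counts = topic_counts_by_year(documents, fake_infer_topics)
for year in sorted(year_counts):
    print(year, dict(year_counts[year]))
```

The model is trained once on the whole time span; only the counting step is repeated per year.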

truas commented on June 17, 2024

I thought about deriving the topics from the entire corpus at once and doing the other checks from there. However, your idea also seems interesting. If we have the topics per year, we will be able to compare them against the overall topics. My guess is that some topics will always be there, like AI and Machine Learning. Others will be more seasonal.
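
One possible way to compare per-year topics against the overall ones is to match them by the overlap of their top words. This is an assumption on my side, not something the model provides directly, and the word lists and names below are invented for illustration.

```python
def jaccard(words_a, words_b):
    """Share of words common to both lists, relative to all distinct words."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(year_topic_words, overall_topics):
    """Return (overall_topic_name, similarity) for the closest overall topic."""
    scored = (
        (name, jaccard(year_topic_words, words))
        for name, words in overall_topics.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Hypothetical top words per topic.
overall_topics = {
    "machine_learning": ["learning", "model", "training", "neural"],
    "parsing": ["parsing", "grammar", "syntax", "tree"],
}
topic_2012 = ["neural", "model", "learning", "embedding"]

print(best_match(topic_2012, overall_topics))
```

A per-year topic that finds a close match in every year would be one of the "always there" topics, while one that only matches in a few years would be seasonal.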
