
Comments (6)

trannel avatar trannel commented on June 17, 2024

Oh, I also meant that we derive the topics once and then just check, for each year, which topic each document belongs to the most. I would postpone this a bit, though, because of the presentation.

from cs-insights-crawler.

jpwahle avatar jpwahle commented on June 17, 2024

It would be cool if you could just plot the importance of the same topics over the years, if possible. See here under "Topics over time".

trannel avatar trannel commented on June 17, 2024

I'm not entirely sure what behaviour you expect once this is implemented. Currently you can add the date ranges you want to plot, but this gives you only one plot for all years combined. So you want an additional option that treats each year separately and then plots the results like in the one you linked? And you only want the importance of the topics over time, which I assume is their numbering, not the pyLDAvis plots over time?

truas avatar truas commented on June 17, 2024

Once we have the topics generated for a collection, LDA allows us to identify which documents "belong" to each topic (and which words within that topic). That being said, we can plot all (same) topics in a time window (let's say 2011-2020), using the count of the documents to show the importance of these topics in each year. Some topics are probably more popular in specific years. For example, say we have the years 2011, 2012, and 2013 and the topics A, B, and C in a corpus of 10 documents. We would have:

  • 2011
    • A: 1
    • B: 1
    • C: 0
  • 2012
    • A: 1
    • B: 3
    • C: 1
  • 2013
    • A: 0
    • B: 1
    • C: 2

Using a raw count, we can see that B was a hyped topic that then faded away, while C was a growing theme.
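
The counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `docs` list is hypothetical and simply mirrors the 10-document example, and each document is assumed to have already been assigned one dominant topic by the trained model. The helper name `count_topics_per_year` is made up for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical input: (publication year, dominant topic) for each of the
# 10 documents in the example above. In practice the dominant topic would
# come from the trained LDA model.
docs = [
    (2011, "A"), (2011, "B"),
    (2012, "A"), (2012, "B"), (2012, "B"), (2012, "B"), (2012, "C"),
    (2013, "B"), (2013, "C"), (2013, "C"),
]
topics = ["A", "B", "C"]


def count_topics_per_year(docs, topics):
    """Return {year: {topic: number of documents}} with zeros filled in."""
    counts = defaultdict(Counter)
    for year, topic in docs:
        counts[year][topic] += 1
    # Counter returns 0 for topics that never occurred in a given year.
    return {
        year: {topic: counts[year][topic] for topic in topics}
        for year in sorted(counts)
    }


table = count_topics_per_year(docs, topics)
for year, row in table.items():
    print(year, row)
```

Each row of `table` is one point on the x-axis of a "topics over time" plot, with one line per topic.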

Of course, how we consider the documents over the years can be different. If I'm not wrong, LDA gives an actual percentage for how important a given word is to a topic, and for each document you can see what percentage of it "belongs" to each topic. Again, here we can try a couple of things. We can use the weights/raw output and look at the number of documents per topic (there will probably be 2-3 topics dominating all the others for a document), we can obtain the top-K most similar documents with respect to the topics at hand, or we could use the topic vectors as the centroids in a K-Means algorithm and see how the documents are plotted. There will probably be a lot of overlap between the clusters.
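
To make the first option concrete, here is a rough sketch of using the per-document topic weights instead of raw counts: the weights of all documents in a year are averaged, so a document that is 60% topic B and 20% topic C contributes to both. Everything here is an assumption for illustration: the `docs` values are invented, and the helper names are hypothetical. If the model were trained with gensim (not confirmed in this thread), `LdaModel.get_document_topics` would be one way to obtain such per-document weights.

```python
from collections import defaultdict

# Hypothetical per-document topic distributions: (year, {topic: weight}).
# The weights of each document sum to 1.
docs = [
    (2011, {"A": 0.7, "B": 0.2, "C": 0.1}),
    (2011, {"A": 0.1, "B": 0.8, "C": 0.1}),
    (2012, {"A": 0.2, "B": 0.6, "C": 0.2}),
    (2012, {"A": 0.1, "B": 0.3, "C": 0.6}),
]


def topic_share_per_year(docs):
    """Average the topic weights of all documents published in each year."""
    totals = defaultdict(lambda: defaultdict(float))
    n_docs = defaultdict(int)
    for year, dist in docs:
        n_docs[year] += 1
        for topic, weight in dist.items():
            totals[year][topic] += weight
    return {
        year: {t: w / n_docs[year] for t, w in sorted(totals[year].items())}
        for year in sorted(totals)
    }


def dominant_topic(dist):
    """Hard assignment: the topic with the largest weight in one document."""
    return max(dist, key=dist.get)


shares = topic_share_per_year(docs)
for year, row in shares.items():
    print(year, {t: round(w, 2) for t, w in row.items()})
```

`dominant_topic` gives the hard assignment needed for raw counts, while `topic_share_per_year` keeps the information from the secondary topics of each document.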

trannel avatar trannel commented on June 17, 2024

Thank you for the explanation. So, if I understand correctly, we train a model on the data from 2010 to 2020. Then we have 10 topics (or however many we defined previously). Next we check, e.g., which documents published in 2010 belong to which topic and count how many documents there were per topic. Then we do this for every year and plot it like Jan showed.

If this is correct, I can try to do this, but I currently do not know how to get this information from the model. I'll have to look into this first. If I run into problems, I'll let you know.

truas avatar truas commented on June 17, 2024

I thought about deriving the topics from the entire corpus at once and doing the other verifications from there. However, your idea also seems interesting. If we have the topics per year, we will be able to compare them against the overall topics. My guess is that some topics will always be there, like AI and Machine Learning. Others will be more seasonal.
