
comp-755-project's Introduction

Temporal Topic Modeling Using LDA2Vec

Download these files before running

Link to cleaned NIPS papers (cleaned.txt):

https://drive.google.com/file/d/19EoVcRdTdQ5Lr6262Qtnb5k1PaRXCuoK/view?usp=sharing

Link to pretrained word embeddings (glove.6B.300d.txt):

https://drive.google.com/file/d/19_B-ip57uedacDN7SZN9DDVif29YD7vA/view?usp=sharing

Link to original NIPS papers (papers.csv):

https://drive.google.com/file/d/1ZvwH8whuG8pd0asa1LM7eeMEX6Tey4wi/view?usp=sharing
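Once downloaded, the embeddings file can be read into memory. The sketch below assumes the standard GloVe text layout (one token per line followed by space-separated floats); the function names `parse_glove_lines` and `load_glove` are hypothetical and not taken from this repository.

```python
def parse_glove_lines(lines):
    """Parse GloVe-style text lines into a {word: vector} dictionary."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


def load_glove(path):
    """Read a GloVe text file from disk (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return parse_glove_lines(f)


# Tiny in-memory stand-in for the real file, for illustration only.
sample = ["the 0.1 -0.2 0.3", "of 0.4 0.5 -0.6"]
vectors = parse_glove_lines(sample)
print(len(vectors), len(vectors["the"]))
```

With the real file, `load_glove("glove.6B.300d.txt")` should give one 300-dimensional vector per token, assuming the file sits in the working directory.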

Preprocessing

Due to memory restrictions, we sampled batches of documents by year. Each preprocessing batch contains 300 non-overlapping documents from 2008-2017, and every 3 documents are merged into a single aggregate document. Each aggregate folder therefore holds 100 aggregate documents.
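The batching described above can be sketched as follows. This is a minimal illustration, not the project's preprocessing script: the function name `make_aggregate_batches`, the fixed seed, and the plain shuffle (which ignores the per-year sampling) are all assumptions.

```python
import random


def make_aggregate_batches(docs, n_batches=4, batch_size=300, group_size=3, seed=0):
    """Split docs into non-overlapping batches and merge every
    `group_size` documents of a batch into one aggregate document."""
    if len(docs) < n_batches * batch_size:
        raise ValueError("not enough documents for non-overlapping batches")
    rng = random.Random(seed)
    shuffled = docs[:]
    rng.shuffle(shuffled)  # assumption: uniform shuffle, no per-year stratification
    batches = []
    for b in range(n_batches):
        # Consecutive slices of the shuffled list never share a document.
        batch = shuffled[b * batch_size:(b + 1) * batch_size]
        merged = [
            " ".join(batch[i:i + group_size])
            for i in range(0, len(batch), group_size)
        ]
        batches.append(merged)
    return batches


# Hypothetical stand-in corpus; the real input would be the cleaned NIPS papers.
docs = [f"paper{i} text" for i in range(1500)]
batches = make_aggregate_batches(docs)
print(len(batches), len(batches[0]))
```

Each of the four batches ends up with 300 / 3 = 100 aggregate documents, matching the folder sizes described above.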

lda2vec was run on each of these groups individually, producing a separate set of embeddings per group that can be used for hyperparameter search or other experiments. The embeddings hold the lda2vec output for words, documents, and topics. The number of topics was chosen by coherence score, a measure of the average cosine similarity between words within a single topic. Based on these scores, the embeddings are clustered into 40 topics.
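To make the coherence measure concrete, here is a minimal sketch under stated assumptions: each topic is represented by the embedding vectors of its top words, per-topic coherence is the mean cosine similarity over all word pairs, and the overall score is the mean over topics. The function names and toy vectors are hypothetical; the project's own scoring code may differ in detail.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def topic_coherence(word_vectors):
    """Mean pairwise cosine similarity among one topic's word vectors."""
    pairs = list(combinations(word_vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def mean_coherence(topics):
    """Average the per-topic coherence over all topics."""
    return sum(topic_coherence(t) for t in topics) / len(topics)


# Toy 2-dimensional vectors standing in for real word embeddings.
topics = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],  # two identical words, one orthogonal
    [[1.0, 1.0], [2.0, 2.0]],              # two words pointing the same way
]
score = mean_coherence(topics)
print(round(score, 3))
```

A higher score means the words assigned to a topic point in more similar directions in embedding space.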

LDA2Vec

The coherence scores for each aggregate are listed below:

Aggregate 1

Coherence scores for each number of topic clusters

20 topics: 0.3107490686037474

30 topics: 0.34992583735742505

40 topics: 0.3965498286873723

50 topics: 0.3827791294533138

Aggregate 2

Coherence scores for each number of topic clusters

40 topics: 0.3817637460859906

Aggregate 3

Coherence scores for each number of topic clusters

40 topics: 0.38678243150003255

Aggregate 4

Coherence scores for each number of topic clusters

40 topics: 0.3831858639488928
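The choice of 40 topics corresponds to the highest coherence among the Aggregate 1 scores listed above. A short sketch of that selection, using only those reported values (the variable names are illustrative):

```python
# Coherence scores reported for Aggregate 1, keyed by number of topics.
aggregate_1_scores = {
    20: 0.3107490686037474,
    30: 0.34992583735742505,
    40: 0.3965498286873723,
    50: 0.3827791294533138,
}

# Pick the topic count whose coherence score is largest.
best_num_topics = max(aggregate_1_scores, key=aggregate_1_scores.get)
print(best_num_topics)
```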

Contributors

mattbeze, patelvap
