ttm's Introduction

1. TwitterLDA

1.1 Reference

Comparing twitter and traditional media using topic models, Zhao et al., ECIR 2011

1.2 Input format

  • Each user's tweets are stored in one file, named by the user's user_id. Each line in the file has the format: ... where wj is the index of a word in the tweet vocabulary. A per-tweet field in each line is used for dividing tweets into batches for cross-validation, or for labeling (part of) the tweets with pre-defined topics in semi-supervised learning (see STwitterLDA below)
  • All users' tweet files are put in the "users" folder
  • The tweet vocabulary is put in the "vocabulary.txt" file
  • The "users" folder and the "vocabulary.txt" file are put in the "tweet" folder
  • The path to the "tweet" folder is the input to the program
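
The folder layout described above can be checked before running the program. The following is a minimal sketch in Python, not part of the original tool; the function name `check_tweet_folder` and the wording of the messages are assumptions made for illustration.

```python
from pathlib import Path


def check_tweet_folder(tweet_dir):
    """Return a list of problems found in a TwitterLDA-style input folder.

    Hypothetical helper: it only checks the layout described in this README
    (a "users" folder and a "vocabulary.txt" file inside the given folder).
    """
    root = Path(tweet_dir)
    problems = []
    users_dir = root / "users"
    if not (root / "vocabulary.txt").is_file():
        problems.append("missing vocabulary.txt")
    if not users_dir.is_dir():
        problems.append("missing users folder")
    elif not any(p.is_file() for p in users_dir.iterdir()):
        problems.append("users folder contains no user files")
    return problems
```

An empty list means the layout matches the description; the content of the individual user files is not inspected.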

1.3 Output format

  • All output files are put in a given output folder
  • File "coinBias.csv" contains (global) bias of users toward choosing words from background topic. Format:
  • File "likelihood-perplexity.csv" contains likelihood and perplexity of the model
  • File "tweetTopics.csv" contains topics' word distribution. Each line in the file is a topic. Format of a line: <topic_number>,<p_0>,...,<p_N> where p_j is probability of the j-th word in the tweet vocabulary
  • File "tweetTopicTopTweets.csv" contains the top 20 tweets of each topic. Format: <topic_0> ,<tweet_id>, ... ,<tweet_id>, <topic_1> ,<tweet_id>, ... ,<tweet_id>, ...
  • File "tweetTopicTopWords.csv" contains the top 20 words of each topic. Format: <topic_0> ,, ... ,, <topic_1> ,, ... ,, ...
  • File "userTopics.csv" contains users' topic distribution. Each line in the file is a user. Format of a line: <user_id>,<p_0>,...,<p_N> where p_k is probability of the k-th topic
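
Files such as "tweetTopics.csv" and "userTopics.csv" share the row layout <id>,<p_0>,...,<p_N>, so one reader can serve both. The sketch below is an illustration in Python under that assumption; `read_distributions` and `top_indices` are hypothetical names, and the sample rows are made up rather than real model output.

```python
import csv
import io


def read_distributions(lines):
    """Parse rows of the form <id>,<p_0>,...,<p_N> into {id: [p_0, ..., p_N]}."""
    rows = {}
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        rows[fields[0].strip()] = [float(x) for x in fields[1:]]
    return rows


def top_indices(probs, k):
    """Return the indices of the k largest probabilities, largest first."""
    return sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)[:k]


# Made-up rows in the documented tweetTopics.csv layout (3-word vocabulary).
sample = io.StringIO("0,0.1,0.6,0.3\n1,0.5,0.2,0.3\n")
topics = read_distributions(sample)
for topic_id, probs in topics.items():
    print(topic_id, top_indices(probs, 2))
```

The returned indices are positions in the tweet vocabulary, so they can be mapped back to words by reading "vocabulary.txt" in order. To read a real file, pass an open file object instead of the `io.StringIO` sample.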

1.4 Variants

  • STwitterLDA: TwitterLDA with semi-supervised learning, where part of the tweets are labeled with pre-defined topics. The corresponding field in the tweet input is the topic_id, or -1 if the tweet is not labeled.
  • NBTwitterLDA: TwitterLDA without background topic.

2. BehaviorLDA

2.1 Reference

It is not just what we say, but how we say them: LDA-based behavior-topic model, Qiu et al., SDM 2013

2.2 Input format

  • Each user's tweets are stored in one file, named by the user's user_id. Each line in the file has the format: ... where wj is the index of a word in the tweet vocabulary. A per-tweet field in each line is used for dividing tweets into batches for cross-validation

  • Each user's retweets are stored in a file in the same format as the tweet files

  • All users' tweet files are put in the "tweets" folder, and all users' retweet files are put in the "retweets" folder

  • The tweet vocabulary is put in the "vocabulary.txt" file

  • The "tweets" and "retweets" folders and the "vocabulary.txt" file are put in the "data" folder

  • The path to the "data" folder is the input to the program

2.3 Output format

  • All output files are put in a given output folder
  • File "coinBias.csv" contains (global) bias of users toward choosing words from background topic. Format:
  • File "likelihood-perplexity.csv" contains likelihood and perplexity of the model
  • File "behaviorBias.csv" contains (global) bias of users toward tweeting or retweeting specific to topics. Each line is a topic. Format: <topic_id>
  • File "tweetTopics.csv" contains topics' word distribution. Each line in the file is a topic. Format of a line: <topic_number>,<p_0>,...,<p_N> where p_j is probability of the j-th word in the tweet vocabulary
  • File "tweetTopicTopTweets.csv" contains the top 20 tweets of each topic. Format: <topic_0> ,<tweet_id>, ... ,<tweet_id>, <topic_1> ,<tweet_id>, ... ,<tweet_id>, ...
  • File "tweetTopicTopWords.csv" contains the top 20 words of each topic. Format: <topic_0> ,, ... ,, <topic_1> ,, ... ,, ...
  • File "userTopics.csv" contains users' topic distribution. Each line in the file is a user. Format of a line: <user_id>,<p_0>,...,<p_N> where p_k is probability of the k-th topic
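
As an illustration of how a "userTopics.csv" line can be consumed, the Python sketch below picks each user's most probable topic. It assumes only the documented layout <user_id>,<p_0>,...,<p_N>; the function name `dominant_topic` and the sample lines are hypothetical.

```python
def dominant_topic(line):
    """Parse one <user_id>,<p_0>,...,<p_N> line.

    Returns (user_id, k, p_k), where k is the index of the largest p_k.
    """
    fields = line.strip().split(",")
    user_id = fields[0]
    probs = [float(x) for x in fields[1:]]
    k = max(range(len(probs)), key=probs.__getitem__)
    return user_id, k, probs[k]


# Made-up example lines, not real model output.
for line in ["1001,0.7,0.2,0.1", "1002,0.25,0.25,0.5"]:
    print(dominant_topic(line))
```

When several topics tie for the largest probability, `max` returns the first of them, so the lowest topic index wins.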

3. Generalized Topic Model

3.1 Reference

Modeling Topics and Behaviors of Microbloggers: An Integrated Approach, Hoang et al., Preprint

3.2 Input format

  • Each user's tweets are stored in one file, named by the user's user_id. Each line in the file has the format: ... where wj is the index of a word in the tweet vocabulary. A per-tweet field in each line is used for dividing tweets into batches for cross-validation
  • All users' tweet files are put in the "users" folder
  • The tweet vocabulary is put in the "vocabulary.txt" file
  • The "users" folder and the "vocabulary.txt" file are put in the "tweets" folder
  • For each type of behavior (e.g., hashtag, mention), the users' behavior adoptions are put in a "user.txt" file. Each line is a user. Format of a line: <user_id><b1:number of adoptions of b1>...<bM:number of adoptions of bM> where bj is the index of the behavior in the behavior vocabulary. The behavior vocabulary is put in a "vocabulary.txt" file. The "user.txt" and "vocabulary.txt" files are put in a folder named after the behavior type
  • The "tweets" folder and all behavior folders are put in the "data" folder
  • The path to the "data" folder is the input to the program
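
Because each behavior type lives in its own folder next to "tweets", the behavior types available in a "data" folder can be discovered from the layout alone. The Python sketch below shows one way to do that; `find_behavior_types` is a hypothetical helper, not part of the original program, and it does not read the contents of the files.

```python
from pathlib import Path


def find_behavior_types(data_dir):
    """Return names of sub-folders that hold both user.txt and vocabulary.txt.

    The "tweets" folder is skipped because it holds tweet data rather than
    a behavior type.
    """
    root = Path(data_dir)
    found = []
    for sub in sorted(root.iterdir()):
        if not sub.is_dir() or sub.name == "tweets":
            continue
        if (sub / "user.txt").is_file() and (sub / "vocabulary.txt").is_file():
            found.append(sub.name)
    return found
```

For a "data" folder containing "tweets", "hashtag" and "mention", where both behavior folders are complete, the function returns the two behavior folder names in sorted order.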

3.3 Output format

  • All output files are put in a given output folder
  • File "likelihood-perplexity.csv" contains likelihood and perplexity of the model
  • File "tweetTopics.csv" contains topics' word distribution. Each line in the file is a topic. Format of a line: <topic_number>,<p_0>,...,<p_N> where p_j is probability of the j-th word in the tweet vocabulary
  • File "tweetTopicTopTweets.csv" contains the top 20 tweets of each topic. Format: <topic_0> ,<tweet_id>, ... ,<tweet_id>, <topic_1> ,<tweet_id>, ... ,<tweet_id>, ...
  • File "tweetTopicTopWords.csv" contains the top 20 words of each topic. Format: <topic_0> ,, ... ,, <topic_1> ,, ... ,, ...
  • For each type of behavior, the topics' behavior distribution and top behaviors are put in "<behavior_name>Topics.csv" and "<behavior_name>TopicTopWords.csv", in the same layout as the topics' word distribution and top words files.
  • File "userRealms.csv" contains users' bias toward realms and their realm distribution. Each line in the file is a user. Format of a line: <user_id>,,,<p_0>,...,<p_N> where p_r is probability of the r-th realm
  • File "userTopics.csv" contains users' topic distribution. Each line in the file is a user. Format of a line: <user_id>,<p_0>,...,<p_N> where p_k is probability of the k-th topic
