
ai-se / pits_lda

IST journal 2017: Tuning LDA

Home Page: https://github.com/amritbhanu/LDADE-package

Python 98.57% Shell 0.26% Scala 1.17%
hyperparameter-optimization hyperparameter-tuning tuning optimization clustering classification genetic-algorithm topic-modeling lda differential-evolution

pits_lda's Introduction

LDADE

  • The cleaner code is available in the LDADE folder, with a test script
  • It requires Python 2
  • Required packages: numpy, scipy, sklearn

pits_lda's People

Contributors

  • amritbhanu

pits_lda's Issues

alpha beta lda

A low alpha value places more weight on having each document composed of only a few dominant topics (whereas a high value returns many more relatively dominant topics). Similarly, a low beta value places more weight on having each topic composed of only a few dominant words.
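This effect of alpha can be illustrated with a small numpy sketch (not from the repo; symmetric Dirichlet draws stand in for LDA's document-topic prior, and the function name is hypothetical):

```python
import numpy as np

def dominant_mass(alpha, n_topics=10, n_docs=2000, seed=1):
    """Average probability mass of each document's single largest topic
    when document-topic mixtures are drawn from Dirichlet(alpha)."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet([alpha] * n_topics, size=n_docs)
    return theta.max(axis=1).mean()

low = dominant_mass(0.1)    # low alpha: a few dominant topics per document
high = dominant_mass(10.0)  # high alpha: many comparably weighted topics
```

With low alpha the largest topic carries most of each document's mass; with high alpha the mass is spread out, matching the claim above.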

LDA topics on NASA PITS

Experiment 1

  • Kept only words with length greater than 3
  • Extracted the logs whose severity is 2 and discarded the others

Top 30 Topics

30 topics:
TOPIC 0
capabl  0.0072857702548080085
detect  0.006889243638298703
thruster    0.006700064944549625
int 0.006607782123785966
verifi  0.0065153889947990404
monitor 0.00650296321145424
provid  0.006375887077561054
version 0.006342073856340687
mode    0.006232402158806654
illeg   0.006220333047750552

TOPIC 1
sequenc 0.033129759453011234
flight  0.03252374735588161
address 0.022959018219828313
configur    0.019362618270488675
execut  0.018645697987183907
spacecraft  0.018536089237481138
vml 0.016712577126068412
contain 0.014304561669005917
dump    0.012876027216572137
capabl  0.012353371621085145

TOPIC 2
text    0.014176183805144018
indic   0.01402861490766197
disabl  0.011578996197979974
initi   0.01114483984031251
power   0.010708437769477594
process 0.010701123372393252
issu    0.010398400658366539
number  0.008637492236074887
configur    0.007928624859391881
manag   0.007903798812742504

TOPIC 3
mode    0.07966584673217642
state   0.03718070354145717
initi   0.03572169251594875
address 0.031429716725763605
fault   0.030452571566128406
spacecraft  0.027279151643025204
submod  0.022740745115632146
event   0.021906758483994815
array   0.01824853606074547
safe    0.017724767764738542

TOPIC 4
paramet 0.0071188055404684805
sram    0.006827464128244202
mode    0.00673273831926996
srup    0.006591190141582033
trace   0.006586444155181463
lead    0.006553998061153735
verifi  0.00645559544261332
packet  0.00642680558984484
refer   0.0064026813981321005
statu   0.006329258144168653

TOPIC 5
step    0.07957132256961552
procedur    0.0645619658601564
valv    0.041675091536043866
execut  0.036320039850185164
engcntrl    0.03142032361285736
rvm 0.031151429519339312
ppu 0.028139784762838016
latch   0.026210595939019456
control 0.01665968261541891
safe    0.01489455854502568

TOPIC 6
detect  0.006901424936164037
illeg   0.006830546315072616
ssp 0.006780407223867901
calcul  0.00675266664234023
chang   0.006670584494538672
baselin 0.006639908707605641
size    0.006587158292765058
base    0.006470646527962334
byte    0.006417814822230279
attitud 0.006399922089239512

TOPIC 7
engcntrl    0.12864125725078046
rvm 0.08106913144631238
miss    0.06797058286501821
paramet 0.04712804740169639
oper    0.03959955807878258
question    0.03548013555542337
set 0.035106402441915874
lead    0.035063628473402754
baselin 0.0331610948442407
valid   0.025475517399741195

TOPIC 8
text    0.013554263569267398
issu    0.011318615133417836
monitor 0.010701013969762457
function    0.01063652335226498
data    0.010621648784653277
eeprom  0.008556587731938457
indic   0.008555600925061647
manag   0.008527574670884068
flight  0.0083612822637015
number  0.008008184417918564

TOPIC 9
refer   0.007059825821001276
lead    0.006984447280031191
scr 0.006838087440533121
receiv  0.006717873863744479
uplink  0.006674271056485936
process 0.0066346156141782455
iru 0.006563385386916043
index   0.006443017673025953
safe    0.006432388815486642
pressur 0.006411195054815109

TOPIC 10
fals    0.03580471202584637
initi   0.010931205194610211
file    0.008206711770178607
num 0.007587271599197137
word    0.007044442564413284
code    0.007024508831246105
verif   0.006556217191035221
fail    0.0064630813488873295
number  0.006456010424001303
baselin 0.006374013212012404

TOPIC 11
projecta    0.06746218996739925
point   0.04184767002882261
tabl    0.028350651612809887
specif  0.026752048078420344
control 0.02492652800690521
inertia 0.024851322524132656
design  0.019213967081341876
perform 0.01911143149108216
analysi 0.01904415564239937
note    0.018525507516305707

TOPIC 12
scr 0.007000582800495213
paramet 0.006666637905827172
oper    0.006660309291516552
process 0.006600259070505539
state   0.006473211101004202
procedur    0.0063467868940232505
latch   0.006293818254668043
task    0.00625789441880334
monitor 0.006250431817756035
fals    0.006204256309834193

TOPIC 13
text    0.05542830868272589
issu    0.03725262264123818
function    0.035717358930808504
use 0.03449499026069172
number  0.024030362307005032
configur    0.023222596144389325
file    0.02161114366150162
differ  0.02041661569850046
rate    0.019600705898817104
data    0.019428473501135756

TOPIC 14
plenum  0.06249745243079222
engcntrl    0.0409187547507851
rvm 0.036013477398814595
pressur 0.03230371555369164
paramet 0.02545298747765356
second  0.022874174938327285
check   0.02236197131682671
valu    0.022107342696258078
data    0.018644210188598787
miss    0.017962971753087812

TOPIC 15
integr  0.006770258735374216
paramet 0.0065478122693753025
state   0.006507774808691103
event   0.006457570853626896
step    0.006382456198675023
result  0.006294514557105005
sequenc 0.006211944228406801
contain 0.006182191108707061
pse 0.006174884134512087
file    0.006140997936215436

TOPIC 16
perform 0.006759823859313174
gener   0.006753369449746122
idl 0.006579616748933607
discuss 0.006444489487713062
verif   0.00644447573029011
illeg   0.006349126276971217
calcul  0.006300919187424522
respons 0.006298347915265842
fault   0.0062068278503890325
projecta    0.006161297742674911

TOPIC 17
float   0.06443236109201486
equal   0.047199710697273065
variabl 0.02934617624497164
constant    0.023256584077755587
accept  0.017902067085217875
point   0.014308891713200092
number  0.01066318416415156
sun 0.01047228634419052
line    0.010271237303141185
safe    0.010187038486312423

TOPIC 18
constant    0.0067272032380626
event   0.006509176463676518
accept  0.0064645037686356
invalid 0.006408657163372205
statu   0.006308655882355975
follow  0.006282119776234993
updat   0.006242530188278966
case    0.0062374334951456134
valid   0.006232355334315194
gener   0.006214309165593506

TOPIC 19
code    0.06089576666196665
document    0.023529369371275612
calcul  0.019601636551500184
line    0.017346617131557968
bit 0.01578553107263362
float   0.014946003104453131
equal   0.0116140446933427
valu    0.01101304871095636
variabl 0.010571777572977425
flight  0.010009773380919672

TOPIC 20
text    0.01319767734485295
engcntrl    0.012989456529975841
issu    0.011008447357302964
miss    0.010487501094100416
valv    0.010455669829729019
rvm 0.010301555088397404
state   0.010288119759840264
initi   0.010162147401351915
use 0.010023395745515468
manag   0.00805930979917511

TOPIC 21
file    0.03711533387181546
line    0.036662071751758965
prioriti    0.03252689221436745
ace 0.024601020266133027
defin   0.023781163143673706
counter 0.02375448623140276
buffer  0.023409858058972895
valu    0.02262365120226543
size    0.02100273544220952
error   0.018441500948256993

TOPIC 22
rvm 0.058557052788672445
engcntrl    0.05147203358485272
bootload    0.04588724664333789
checksum    0.045278735080949054
calcul  0.041654115486330225
memori  0.03298984546384696
text    0.03269440107902858
idl 0.026195712265557315
fsw 0.023523170638217756
address 0.021366050392282596

TOPIC 23
num 0.006609283590812713
subaddress  0.0066003824882191
fals    0.006592974808638571
switch  0.006572564017413236
rate    0.006493912398838559
fsw 0.006470888042171489
control 0.006409973661028382
sun 0.006394022472625757
set 0.006350634655874299
launch  0.0063247526201570085

TOPIC 24
suncrosscalib   0.00683729202567074
init    0.006589179818997157
address 0.006555629737091954
hop 0.006453259698936264
valu    0.006421198805018765
dump    0.006304229634976303
scrub   0.006231210980935616
checksum    0.006222658102646131
type    0.0062151285407071565
limit   0.0062148949738669084

TOPIC 25
access  0.007015605770948273
packet  0.006878738003382888
sub 0.00681656724573433
obc 0.006680487905656781
prioriti    0.006644301963427974
write   0.006550542311169962
wait    0.006507482714691407
correct 0.006478934758681904
initi   0.006461548505262755
hlp 0.0064226804419323874

TOPIC 26
transit 0.007274267735111653
use 0.0067174487488747
verif   0.0067061381279485445
ppu 0.00658514885657192
receiv  0.0065420117405110695
scrub   0.006522674447596902
execut  0.006514190194472578
defin   0.006414629726145246
main    0.006374296721232405
capabl  0.0063654238087001505

TOPIC 27
statu   0.03141061634696384
messag  0.02692935272088277
bound   0.02655592019775904
access  0.026544893112142195
flexelint   0.020997808517881016
line    0.020127632879032676
file    0.01948517834033966
attitud 0.016821070971116785
int 0.01674476456958524
valid   0.015860426415481667

TOPIC 28
request 0.006491484194327534
limit   0.006442496939425952
bootload    0.006403388811680738
fdc 0.006399974023558598
progress    0.00639587365713958
index   0.006275722391109265
control 0.006249616915266672
includ  0.006200056435239194
vcdu    0.006174630782103055
calcul  0.006159157887654544

TOPIC 29
ang 0.007032597978994753
pcontrol    0.006737330705273606
rate    0.00672458742169172
rvm 0.006693384392581234
address 0.006614323139184355
flight  0.00656254355888187
valv    0.006546939310696933
switch  0.00648290719226052
sub 0.006415598605858243
spacecraft  0.006386103243420132

T1 - T2

Experiment

  • T1 = reports with severity 1 and 2
  • T2 = reports with severity 3, 4, and 5
  • Each line represents the features of the top 10 topics

Results

  • Results show there is no overlap of topics between T1 and T2. Other than a couple of words, the two sets of topic clusters are different.

For T1 - Project A

projecta control inertia design specif perform attitud tabl spacecraft note 
interrupt uplink srup fsw verif error follow specif eeprom scr 
tabl initi use fals address ppu obc function event dump 
checksum calcul enabl progress process task idl text oper discuss 
text fault memori error plenum initi second number pressur indic 
mode flight issu execut sequenc current indic point set vml 
switch messag case file mode type code flexelint function use 
wait int variabl vml read dump write verif task verifi 
miss oper set text paramet state check valid indic number 
subaddress address telemetri packet word fsw data buffer request bootload 

For T2 - Project A

obc safe fault projecta power address flight mode state spacecraft 
code data function line valu access variabl use messag record 
srobc rate spacecraft flight memori prd alloc provid link point 
non load int unsign bit eeprom obc comput control data 
control mode point attitud error plenum sroac target main high 
grand word tlm type packet count cmd byte header command 
file defin line tlm data statu macro array ambi len 
command softwar flight trace link srup task uplink time spacecraft 
variabl initi messag line code entri valu use extern mode 
test script verifi mode engcntrl link indic issu procedur data 

Improving the Usability of Topic Models

Improving the Usability of Topic Models

[bibtex](@PhDThesis{yang2015improving,
title={Improving the Usability of Topic Models},
author={Yang, Yi},
year={2015},
school={NORTHWESTERN UNIVERSITY}
})

Problems:

  • The Gibbs sampling inference method for LDA runs too slowly for large datasets with many topics.
  • The topics learned by LDA are sometimes difficult for end users to interpret.
  • LDA suffers from an instability problem.

Motivation:

  • We would like to efficiently train a big topic model with prior knowledge.

Terminologies:

  • Markov random field
  • First-Order Logic

Stability Measures:

  • try running the algorithm many times and choose the model with the highest likelihood
  • document-level stability and token-level stability.
  • the number of topics was set to 20, the number of iterations was 1000. We use a uniform α with a value of 1.0, a uniform β with a value of 0.01

General:

  • LDA can be viewed as a dimension-reduction tool for document modeling, reducing the dataset dimension from the vocabulary size V to the number of topics T.
  • Users have external knowledge regarding word correlation, which can be taken into account to improve the semantic coherence of topic modeling.

Methods:

  • SC-LDA can handle different kinds of knowledge, such as word correlation, document correlation, document labels, and so on. One advantage of SC-LDA over existing methods is that it converges very quickly.

Datasets:

Analogy when topics are unstable:

  • the mental map Jane has built for the paper collection is disrupted, resulting in confusion and frustration. The tool has become less useful to Jane unless she puts in some effort to update her mental map, which significantly increases her cognitive load

References:

  • Online LDA [23]
  • [41] presents an algorithm for distributed Gibbs sampling
  • [67] proposes a MapReduce parallelization framework that uses variational inference as the underlying algorithm
  • [19] presents the Gibbs sampling method for LDA inference
  • Labeled LDA: [48] presents a generative model for modeling document collections where the documents are associated with labels
  • Dirichlet Forest LDA [3]
  • Logic LDA: [4]
  • Quad-LDA: [42] In order to improve the coherence of the keywords per topic learned by LDA.
  • NMF-LDA: Similar to Quad-LDA, [64]
  • Markov Random Topic Fields (MRTF): [27]
  • Interactive Topic Modeling (ITM): [26] proposes the first interactive framework for allowing users to iteratively refine the topics discovered by LDA by adding constraints that enforce that sets of words must appear together in the same topic
  • [47] proposes Fast-LDA by constructing an adaptive upper bound on the sampling distribution and achieves a faster inference

Summary:

  • Labeled LDA can only handle document-label knowledge. Dirichlet Forest LDA, Quad-LDA, NMF-LDA, and ITM can only handle word-correlation knowledge. MRTF can only handle document-correlation knowledge. Logic LDA can handle word correlation, document labels, and other kinds of knowledge; however, each piece of knowledge has to be encoded as First-Order Logic.

LDA-GA

How to Effectively Use Topic Models for Software Engineering Tasks? An Approach Based on Genetic Algorithms

[bibtex](@inproceedings{panichella2013effectively,
title={How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms},
author={Panichella, Annibale and Dit, Bogdan and Oliveto, Rocco and Di Penta, Massimiliano and Poshyvanyk, Denys and De Lucia, Andrea},
booktitle={Proceedings of the 2013 International Conference on Software Engineering},
pages={522--531},
year={2013},
organization={IEEE Press}
})

Approaches:

  • Posterior Distribution over the assignments of words to topics
  • Computing the harmonic mean of posterior distribution

Parameters up for tuning:

  • k, n, a, b (n comes from the Gibbs sampling generative model)

Definitions:

  • Dominant topic: Let θ be the topic-by-document matrix generated by a particular LDA configuration P = [k, n, α, β]. A generic document dj has a dominant topic ti, if and only if θ(i,j) = max{ (θ(h,j)), h = 1 . . . k}.
  • High inter-cluster distance and low intra-cluster distance
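The dominant-topic definition above reduces to an argmax over each column of θ; a minimal numpy sketch (function name is mine):

```python
import numpy as np

def dominant_topics(theta):
    """theta: topic-by-document matrix (k x m) from one LDA configuration
    P = [k, n, alpha, beta]. Returns, for each document j, the index i of
    its dominant topic, i.e. argmax over theta[:, j]."""
    return np.argmax(theta, axis=0)

# Each column sums to 1: a document's distribution over k = 3 topics.
theta = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.5, 0.1],
                  [0.1, 0.3, 0.8]])
dominant_topics(theta)  # doc 0 -> topic 0, doc 1 -> topic 1, doc 2 -> topic 2
```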

Evaluation criteria:

  • Internal - cohesion (intra) and separation (inter); Silhouette Coefficient (-1 to 1)
  • External - external information needed

Needs clarity - how to convert the text into data points in order to do the cluster-goodness evaluation.
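The internal criterion can be made concrete with a self-contained silhouette sketch (pure numpy, no sklearn; assumes documents have already been converted to points):

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]: cohesion (small
    intra-cluster distance) and separation (large inter-cluster
    distance) push the score toward 1."""
    points, labels = np.asarray(points, float), np.asarray(labels)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li) & (np.arange(len(labels)) != i)
        a = dist[i, same].mean()                    # mean intra-cluster distance
        b = min(dist[i, labels == lj].mean()        # nearest other cluster
                for lj in set(labels) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated clusters score close to 1.
silhouette([[0, 0], [0, 1], [10, 10], [10, 11]], [0, 0, 1, 1])
```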

Actual LDA-GA

  • A stochastic search technique based on the mechanisms of natural selection and natural genetics.
  • Stochastic search is the method of choice for solving many hard combinatorial problems.
  • Multiple solutions (individuals) evolve in parallel to explore different parts of the search space.
  • An individual (or chromosome) is a particular LDA configuration, and the population is a set of different LDA configurations.
  • The fitness function that drives the GA evolution is the Silhouette coefficient.
  • α and β varied from 0 to 1; α and β can also be set to the defaults of 50/k and 0.1.
  • LDA-GA has been implemented in R [37] using the topicmodels and GA libraries.
  • STOPPING CRITERIA - For the GA, the following settings were used: a crossover probability of 0.6, a mutation probability of 0.01, a population of 100 individuals, and an elitism of 2 individuals. As a stopping criterion, the evolution was terminated if the best result did not improve for 10 generations; otherwise it stopped after 100 generations.
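The GA settings above can be sketched as a toy loop. The fitness here is a hypothetical stand-in for the real Silhouette-based fitness (running actual LDA per individual is expensive), and all helper names are mine:

```python
import random

def fitness(cfg):
    """Toy fitness: distance of a configuration [k, n, a, b] from a
    hypothetical optimum (the paper would run LDA and score the
    resulting clustering with the Silhouette coefficient instead)."""
    k, n, a, b = cfg
    return -((k - 20) ** 2 / 400 + (n - 500) ** 2 / 250000
             + (a - 0.5) ** 2 + (b - 0.1) ** 2)

def random_cfg(rng):
    return [rng.randint(10, 30), rng.randint(100, 1000),
            rng.random(), rng.random()]

def ga(generations=50, pop_size=100, p_cross=0.6, p_mut=0.01,
       elitism=2, seed=0):
    rng = random.Random(seed)
    pop = [random_cfg(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        nxt = [cfg[:] for cfg in pop[:elitism]]          # elitism: keep the best 2
        while len(nxt) < pop_size:
            p1, p2 = rng.sample(pop[:pop_size // 2], 2)  # select from the better half
            child = p1[:]
            if rng.random() < p_cross:                   # single-point crossover
                cut = rng.randrange(1, 4)
                child = p1[:cut] + p2[cut:]
            if rng.random() < p_mut:                     # point mutation of one gene
                i = rng.randrange(4)
                child[i] = random_cfg(rng)[i]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = ga()
```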

Assumptions:

  • The top 10 words belonging to the topic with the highest probability in the obtained topic distribution were used to label the class.

To Dos

  • Talk to Wei
  • Do the literature review (50-100 papers)
  • Standard practice regarding toolkits, parameters, and validation
  • What has anyone else concluded in the SE domain?
  • Can you get the data in order to replicate? From a recent, highly cited study by a senior author.
  • ICSE, FSE, RE, MSR, TSE, TOSEM, ESEM, JSS (Journal of Systems and Software), ISTR, IST
  • Can you find stability in their results? Can you fix them using DE?


To DOs

  • Does it hold for PITS B-E?
  • Can you use DE to make it more stable and less stable?
  • Gibbs Sampling?
  • Predict severity using topics? Perplexity

How Many Topics? Stability Analysis for Topic Models

How Many Topics? Stability Analysis for Topic Models

[bibtex](@incollection{greene2014many,
title={How many topics? stability analysis for topic models},
author={Greene, Derek and O’Callaghan, Derek and Cunningham, P{'a}draig},
booktitle={Machine Learning and Knowledge Discovery in Databases},
pages={498--513},
year={2014},
publisher={Springer}
})

Idea:

  • A term-centric stability approach for selecting the number of topics in a corpus, based on the agreement between term rankings generated over multiple runs of the same algorithm. Employs a "top-weighted" ranking measure, where higher-ranked terms have a greater degree of influence when calculating agreement scores.
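A minimal sketch of such a top-weighted agreement score (the Average-Jaccard formulation and the function name are my reading of the paper's description, not code from it):

```python
def average_jaccard(rank1, rank2):
    """Top-weighted agreement between two term rankings: average the
    Jaccard similarity of the top-d prefixes for d = 1..len(ranking),
    so disagreement near the top of the ranking costs more than
    disagreement lower down."""
    scores = []
    for d in range(1, len(rank1) + 1):
        s1, s2 = set(rank1[:d]), set(rank2[:d])
        scores.append(len(s1 & s2) / len(s1 | s2))
    return sum(scores) / len(scores)

# Same terms, but the top-2 prefixes disagree, so the score drops below 1.
average_jaccard(["mode", "state", "fault"], ["mode", "fault", "state"])
```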

Weekly Report - 10/11/2016

DOING:

  • Judging the credibility of LDA in terms of how many words should be reported for each topic.
  • Experimenting with tuned 7 words against default 7 words and 10 words.
  • Reviewing the LDA papers.

DONE:

  • Check #34
  • Expected: tuned 7 words should perform the same as untuned 7 words and worse than untuned 10 words.
  • But tuned 7 words performed better than both. (WIN SITUATION, i.e., TUNING HELPED)

TODO:

  • Redo anything after getting feedback on the above results.
  • Does stability help classification? With about the same k (default and untuned), do better alpha and beta help?
  • Is LDA+word-vector featurization better than LDA features alone?

roadblocks:

  • Wanted to reproduce this paper
  • But the link provided is not working. Have emailed the author, but no response.

admin:

  • can above be solved?

Machine Reading Tea Leaves

Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality

[bibtex](@inproceedings{lau2014machine,
title={Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality.},
author={Lau, Jey Han and Newman, David and Baldwin, Timothy},
booktitle={EACL},
pages={530--539},
year={2014}
})

General:

  • A good paper which gives a rationale for topic instability.

Measures:

  • The notion of topic "coherence", and an automatic method for estimating topic coherence based on pairwise pointwise mutual information (PMI) between the topic words.
  • Direct approach: asking people about the topics. Indirect approach: evaluating PMI, CP.
  • To create gold-standard coherence judgements, they used Amazon Mechanical Turk.
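The pairwise-PMI idea can be sketched as follows (a toy stand-in: the paper estimates word probabilities from a large external reference corpus, whereas here any small list of documents serves as the reference; the function name is mine):

```python
import math
from itertools import combinations

def pmi_coherence(topic_words, documents):
    """Mean pairwise PMI of a topic's top words, with word and pair
    probabilities estimated from document-level co-occurrence in a
    reference corpus. Pairs that never co-occur are skipped."""
    docs = [set(d) for d in documents]
    n = len(docs)
    def p(*ws):
        return sum(all(w in d for w in ws) for d in docs) / n
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        joint = p(w1, w2)
        if joint > 0:
            scores.append(math.log(joint / (p(w1) * p(w2))))
    return sum(scores) / len(scores) if scores else 0.0

docs = [["mode", "state"], ["mode", "state"],
        ["fault", "valve"], ["fault", "valve"]]
pmi_coherence(["mode", "state"], docs)  # words that co-occur: positive PMI
```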

Problems:

  • perplexity correlates negatively with topic interpretability

Research Question:

  • Word intrusion measures topic interpretability differently from observed coherence.

Terminologies:

  • topic coherence, the semantic interpretability of the top terms usually used to describe discovered topics
  • “intruder word”, which has low probability in the topic of interest, but high probability in other topics

LDA topics as feature selector

Experiment rig:

  • The initial run was with the default parameter k=10. Documents had 10 features. The score was null.
  • The neg/pos ratio was very high - an unbalanced dataset. E.g.:
    • SE0: 'no': 6008, 'yes': 309 - F-score of about 0.5
    • SE1: 'no': 47201, 'yes': 1441 - F-score of about 0.8
    • SE3: 'no': 83583, 'yes': 654 - 0 F-score
    • SE6: 'no': 15865, 'yes': 439 - 0 F-score
    • SE8: 'no': 58076, 'yes': 195 - 0 F-score
  • Tuning experiments are still running.

Change in experiment

  • Will try SMOTE.
  • Also, the default parameter k can be varied in steps of 20, 40, 80.

Results

How to read graphs?

  • These graphs show (tuned - untuned) results.
  • If y = 0, the tuned and untuned results are the same.
  • If y > 0, tuning improved results by that y margin.
  • If y < 0, tuning didn't help and made things worse.

Results

F = 0.3, CR = 0.7, Pop = 10

file

F = 0.7, CR = 0.3, Pop = 10

file

F = 0.3, CR = 0.7, Pop = 30

file

F = 0.7, CR = 0.3, Pop = 30

file

Conclusion

  • Tuning helped for sure in most cases.
  • Termination is based on the number of iterations; they can be increased to get better results at a lower number of overlapping terms, or reduced to make the run faster.
  • On the HPC, it took about 4 hours to generate each graph.

Classification using LDA

Experiment Setup

  • Datasets - Manney generator of Stack Exchange sites; 25 datasets.
  • Running the tuning experiment with 5-term overlap. Select the parameters with the maximum stability score.
  • Find the clusters; each topic is assigned a sequential label (1, 2, 3, ...).
  • Each document is then labelled 1, 2, 3, ... rather than with tags.
  • Run SVM. Binary classification.

We have baseline results for SVM without SMOTE and SVM with SMOTE.
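The setup above can be sketched as an sklearn pipeline (sklearn's `LatentDirichletAllocation` exposes alpha and beta as `doc_topic_prior` and `topic_word_prior`; the tiny corpus and the particular k/alpha/beta values are hypothetical stand-ins for what LDADE would select):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical tuned values standing in for LDADE's output.
k, alpha, beta = 3, 0.85, 0.76

clf = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=k, doc_topic_prior=alpha,
                              topic_word_prior=beta, random_state=1),
    LinearSVC(),  # binary classification on the k doc-topic features
)

docs = ["fault mode state spacecraft", "mode state fault safe",
        "file line buffer error", "buffer file counter error"]
labels = [1, 1, 0, 0]
clf.fit(docs, labels)
```

SMOTE would be applied to the doc-topic feature matrix before the SVM step, as in the baseline comparison.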

Mail with Prof. Mika Mäntylä

  • tf-idf at 5%; they did 50%.
  • More coherent data gives better results.
  • Might need more data preprocessing to include them as stopwords, plus Porter stemming.
  • An R library for LDADE.

Results 06-23

  • Paper review
  • Experiment: Gibbs vs. Online VEM

Results:

  • Good news: tuning helped with Gibbs as well as with VEM.
    file

Coherence of Descriptors

An Analysis of the Coherence of Descriptors in Topic Modeling

[bibtex](@Article{o2015analysis,
title={An analysis of the coherence of descriptors in topic modeling},
author={O’Callaghan, Derek and Greene, Derek and Carthy, Joe and Cunningham, P{'a}draig},
journal={Expert Systems with Applications},
volume={42},
number={13},
pages={5645--5657},
year={2015},
publisher={Elsevier}
})

General:

  • Nothing new beyond #10 - just a couple of variations using PMI.

Measures:

  • Perplexity or held-out likelihood.
  • A distributional semantics measure based on the increasingly popular word2vec tool: each term is represented as a vector in a semantic space, with topic coherence calculated as mean pairwise vector similarity. Cosine similarity, Jaccard similarity, and the Dice coefficient were used.
  • Pointwise Mutual Information (PMI).
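The cosine variant of that measure can be sketched in a few lines (the toy vectors are hypothetical stand-ins for word2vec embeddings; the function name is mine):

```python
import numpy as np
from itertools import combinations

def coherence_cosine(words, vectors):
    """Topic coherence as the mean pairwise cosine similarity of the
    topic's top terms in a semantic vector space."""
    sims = []
    for w1, w2 in combinations(words, 2):
        v1, v2 = vectors[w1], vectors[w2]
        sims.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return float(np.mean(sims))

# Toy 2-d "embeddings": mode and state point the same way, banana doesn't.
vecs = {"mode": np.array([1.0, 0.0]),
        "state": np.array([0.9, 0.1]),
        "banana": np.array([0.0, 1.0])}
coherence_cosine(["mode", "state"], vecs)
```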

Stability of topics

Latent Dirichlet Allocation: Stability and Applications to Studies of User-Generated Content

[bibtex](@inproceedings{koltcov2014latent,
title={Latent dirichlet allocation: stability and applications to studies of user-generated content},
author={Koltcov, Sergei and Koltsova, Olessia and Nikolenko, Sergey},
booktitle={Proceedings of the 2014 ACM conference on Web science},
pages={161--165},
year={2014},
organization={ACM}
})

General:

  • A good paper which shows issues with topic instability.
  • Word-topic and topic-document matrices (probabilities of words appearing in topics and topics appearing in documents).
  • Variational approximations and Gibbs sampling: these algorithms find a local maximum of the joint likelihood function of the dataset.
  • The LDA approach has been further developed with more complex model extensions that use additional parameters and additional information [2, 9, 18, 4].

Problem:

  • In the case of LDA, there are plenty of local maxima, which may lead to instability in the output.
  • The problem of finding the optimal number of clusters.
  • Since these distributions result from the same dataset with the same vocabulary and model parameters, any differences between them are entirely due to the randomness in Gibbs sampling. This randomness affects perplexity variations, word and document ratios, and the reproducibility of the qualitative topical solution.

Old Solutions for stability:

  • A new metric of similarity between topics and a criterion of vocabulary reduction to evaluate stability.
  • The standard numerical evaluation of topic modeling results is to measure perplexity. Perplexity shows how well the topic-word and word-document distributions predict new test samples. The smaller the perplexity, the better (less uniform) the LDA model.
    • Problems with perplexity:
      • the value of perplexity drops as the number of topics grows
      • perplexity depends on the dictionary size
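For concreteness, the standard perplexity definition (not code from the paper) is just the exponentiated negative log-likelihood per held-out word:

```python
import math

def perplexity(held_out_log_likelihood, n_words):
    """Perplexity = exp(-log-likelihood per held-out word): how
    'surprised' the model is by unseen text. Smaller is better."""
    return math.exp(-held_out_log_likelihood / n_words)

# If the model assigns every held-out word probability 1/2,
# perplexity is exactly 2.
perplexity(100 * math.log(0.5), 100)
```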

Evaluation Metric:

  • Symmetric Kullback-Leibler divergence.
  • Compute the correlation between documents from two topic modeling experiments; this correlation does not depend on dictionary size. The method consists of the following steps:
    • construct a bipartite graph based on the two topical solutions;
    • compute the minimal distance between topics in this bipartite graph;
    • compare topics between the two cluster solutions based on the minimal distance.
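The symmetric KL divergence between two topic-word distributions is straightforward to write down (a generic sketch, not the paper's code; the smoothing epsilon is my addition to guard against zero probabilities):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence KL(p||q) + KL(q||p)
    between two word distributions (e.g. the 'same' topic recovered in
    two runs); 0 means identical, larger means the runs disagree."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

sym_kl([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])  # very different topics: large value
```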

Preprocessing step:

  • document and word ratios that show the fraction of words and documents that are actually relevant to specific topics

Review - 04/27/2016

Literature Survey

  • All the literature survey is from the SE domain.
  • Searched Google Scholar for "lda topics stable OR unstable OR coherence" and took the top-cited papers.
  • 7 out of 9 papers stated that topics are unstable, so we went for manual validation of topics before further experiments.
  • 1 paper gave a very strong statement that LDA is stable; the data is not available to verify their results.
  • Another LDA toolkit mentioned is GibbsLDA++.
  • Some have talked about playing with different configurations. The common parameters are k, a, b, i. (BASICALLY TALKING ABOUT TUNING)
  • LDA is non-deterministic, so not many people have bothered about how their results might vary.
  • Refer this for more details

DE Experiment

  • Implemented DE with CR = 0.3 and mutation factor F = 0.7.
  • Termination criterion: number of iterations.
  • Tuning parameters: k = (10, 30), a = (0, 1), b = (0, 1), based on the literature review.
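The DE loop with those settings can be sketched as DE/rand/1/bin (generic sketch with a toy objective, not the LDADE code; the real fitness would run LDA and score term-ranking stability):

```python
import random

def de(fitness, bounds, f=0.7, cr=0.3, pop_size=10, iters=50, seed=0):
    """Minimal DE/rand/1/bin: for each target vector, build a mutant
    a + F*(b - c) from three distinct others, binomially cross it with
    the target at rate CR, and keep the better (lower-fitness) of the two."""
    rng = random.Random(seed)
    dim = len(bounds)
    clip = lambda v, i: min(max(v, bounds[i][0]), bounds[i][1])
    pop = [[rng.uniform(*bounds[i]) for i in range(dim)]
           for _ in range(pop_size)]
    for _ in range(iters):
        for t in range(pop_size):
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != t], 3)
            trial = list(pop[t])
            jrand = rng.randrange(dim)           # guarantee one mutant gene
            for i in range(dim):
                if rng.random() < cr or i == jrand:
                    trial[i] = clip(a[i] + f * (b[i] - c[i]), i)
            if fitness(trial) < fitness(pop[t]):
                pop[t] = trial
    return min(pop, key=fitness)

# Tuning ranges from the literature review: k in (10, 30), a and b in (0, 1),
# minimizing a toy objective in place of the real stability score.
best = de(lambda x: (x[0] - 20) ** 2 + x[1] ** 2 + x[2] ** 2,
          bounds=[(10, 30), (0, 1), (0, 1)])
```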

Results

  • Tuning helped in various datasets for a higher number of overlapping terms, or at least stayed the same as the default parameters.
  • No change from tuning a and b alone.
  • A higher k value improved stability.
  • A lower b improved stability.
    file

Meeting - 06/08

Stack Results.

file

New Citemap Results

#20

Paper Related graphs

Evaluation criteria

  • The score remains constant up to 200 evaluations, goes up at 300, and then stays constant or goes down.
    file

Frequency of parameters per max score

  • PITS A dataset. For each iteration i, with each label, the maximum score is selected; these are the numbers of parameter sets which produced that stability score. Just for 2 labels.

file

Boxplot graph showing how the parameter ranges vary.

Citemap Results

Configuration:

  • Untuned: 10 repeats; the median stability score is taken. Default parameters.
  • Tuned: on Spark, 300 evaluations.

Results:

  • A positive delta means tuning is better.
  • Different combinations of F, CR, and Pop:
    file

Topics with the maximum overlap.

File structure - K_22_a_0.847433736937_b_0.763774618977.txt

K=22, a=0.847433736937, b = 0.763774618977

Run: 0
Topic 0: optim method solut factor advantag softwar guarante produc known tool 
Topic 1: configur templat variabl time kernel linux patch schedul spreadsheet compil 
Topic 2: workshop intern confer program messag member ics list summari review 
Topic 3: softwar idf invers analysi engin objectori abstract star program autom 
Topic 4: softwar analysi redocument objectori engin program tool autom experi abstract 
Topic 5: panel law equat length softwar debat demet storyboard tell qualifi 
Topic 6: softwar objectori analysi abstract engin tool autom design program use 
Topic 7: objectori trait softwar lisp analysi tool engin program design experi 
Topic 8: softwar analysi tool abstract engin design autom experi program use 
Topic 9: test program use techniqu gener analysi execut approach case algorithm 
Topic 10: code sourc detect clone open file similar base type techniqu 
Topic 11: subclass umpl substitut superclass analysi abstract softwar umplif objectori autom 
Topic 12: softwar analysi tool program abstract autom objectori engin design use 
Topic 13: softwar analysi objectori autom engin abstract experi design use tool 
Topic 14: ide eclips plug abstract plugin netbean softwar array framework chart 
Topic 15: model languag specif formal use aspect design concern implement base 
Topic 16: comput context network mobil resourc awar applic devic distribut platform 
Topic 17: softwar analysi autom abstract engin tool objectori use evolut design 
Topic 18: applic compon architectur web servic user transform integr engin framework 
Topic 19: softwar use chang tool approach program design inform code sourc 
Topic 20: softwar develop project engin process research studi qualiti bug use 
Topic 21: objectori softwar design analysi abstract engin program autom tool realtim 
Run: 1
Topic 0: code detect sourc clone refactor pattern tool chang base studi 
Topic 1: mainten softwar matter correct massiv tell taxonomi experiment comprehens cost 
Topic 2: mobil devic comment data net game load analyst tune network 
Topic 3: busi inform compani technolog workflow corpor lead divis comprehens execut 
Topic 4: robot race challeng softwar later intellig insight pipelin win artifici 
Topic 5: softwar experi analysi autom visual framework use model tool design 
Topic 6: scenario chart sequenc stereotyp msc reactiv messag impli visual lsc 
Topic 7: objectori model metamodel ontolog framework evolut mda omg eventu softwar 
Topic 8: slice static dynam rang wide case rel larg method propos 
Topic 9: softwar develop chang studi use bug project sourc report result 
Topic 10: softwar model develop design process architectur use tool requir support 
Topic 11: softwar analysi experi autom model design visual realtim framework abstract 
Topic 12: softwar remodular analysi visual autom music multidimension experi assess abstract 
Topic 13: languag specif model formal transform semant program verif gener use 
Topic 14: safeti certif proof critic complianc iso siemen certifi softwar ambigu 
Topic 15: program test use techniqu analysi gener approach execut base present 
Topic 16: secur protocol vulner access network control polici analysi schedul attack 
Topic 17: peer componentbas node cach softwar volatil har analysi netbean framework 
Topic 18: aspect point concern orient modular crosscut aop messag join aspectj 
Topic 19: layout softwar visualis visual autom analysi experi framework design distribut 
Topic 20: applic web data revers legaci extract engin queri databas tool 
Topic 21: softwar engin research workshop commun comput discuss intern industri challeng 
Run: 2
Topic 0: optim method solut deviat advantag guarante factor produc known pool 
Topic 1: workshop research intern engin comput track tutori topic session confer 
Topic 2: signatur alert match massiv defens notif worm softwar jone smoke 
Topic 3: safeti critic certif complianc hardwar healthcar ambigu certifi nasa mission 
Topic 4: inspect review commit kernel linux patch author driver port peer 
Topic 5: softwar experi autom componentbas engin increment assess use largescal case 
Topic 6: design object compon class orient pattern aspect servic featur method 
Topic 7: test code use techniqu approach chang sourc detect bug result 
Topic 8: configur word assert identifi scheme macro artefact preprocessor split expand 
Topic 9: softwar autom componentbas experi engin tool increment use largescal environ 
Topic 10: inform sourc extract open repositori data busi list visual retriev 
Topic 11: vulner schedul optim array real time buffer overflow alloc trade 
Topic 12: model languag specif use gener tool approach base requir implement 
Topic 13: robot win grand softwar autom intellig componentbas race autonom home 
Topic 14: softwar autom recoveri largescal componentbas experi engin tool case realtim 
Topic 15: softwar assess experi autom tool componentbas use largescal engin visual 
Topic 16: program analysi dynam static execut use slice algorithm graph techniqu 
Topic 17: applic web user databas interact client interfac migrat data approach 
Topic 18: softwar autom assess componentbas experi engin use environ largescal studi 
Topic 19: softwar develop engin process architectur project use mainten studi product 
Topic 20: smell spreadsheet end bad templat subject speed tabl cell formula 
Topic 21: context awar inconsist conflict merg resolv ide resolut revis chang 
Run: 3
Topic 0: model check databas constraint data logic queri tempor satisfi schema 
Topic 1: revers grammar fact word pars extract reengin engin parser cobol 
Topic 2: requir design method product softwar engin goal optim process support 
Topic 3: platform mobil devic android app permiss micro portabl bytecod phone 
Topic 4: objectori omnipres model softwar framework largescal tool visual use object 
Topic 5: graph scenario concurr interact behavior sequenc specif event behaviour monitor 
Topic 6: code chang sourc softwar studi evolut clone develop detect open 
Topic 7: inform legaci busi migrat reengin process recov workflow technolog compani 
Topic 8: architectur compon softwar view adapt decis configur distribut support environ 
Topic 9: softwar model framework objectori use tool analysi aspectori panel abstract 
Topic 10: visual metaphor music boundari analyt hill climb largescal softwar overlap 
Topic 11: applic analysi web use secur flow access function user detect 
Topic 12: formal properti specif verif verifi composit infer reason refin state 
Topic 13: softwar develop engin process research project servic manag mainten paper 
Topic 14: bug report perform time use predict measur defect develop data 
Topic 15: class refactor programm maintain parallel smell conflict improv ide merg 
Topic 16: test techniqu gener case execut use fault input suit autom 
Topic 17: softwar matrix model objectori sla tool agreement framework aspectori largescal 
Topic 18: program dynam static slice analysi depend algorithm comput condit comprehens 
Topic 19: model use approach tool languag base paper implement present gener 
Topic 20: templat confer member compil chair metaprogram calculu welcom committe debugg 
Topic 21: remodular eye idf layout invers model hyper softwar movement framework 
Run: 4
Topic 0: workshop intern review research track program messag confer list session 
Topic 1: model use program languag specif gener design approach tool base 
Topic 2: context parallel conflict inconsist awar comput merg resolv resolut middlewar 
Topic 3: procedur method softwar cobol ownership node layout solut instruct encapsul 
Topic 4: visualis debugg dimension breakpoint emul comprehens workbench softwar multidimension objectori 
Topic 5: compon product featur applic line reus secur approach configur softwar 
Topic 6: softwar realtim autom framework analysi objectori use tool experi architectur 
Topic 7: macro artefact preprocessor stream hidden preprocess expand expans scheme actor 
Topic 8: factori renov constructor softwar jstar autom objectori analysi experi model 
Topic 9: kernel devic schedul linux driver buffer window overflow array interrupt 
Topic 10: spreadsheet end smell decomposit dataflow hierarch formula stabil speed modeldriven 
Topic 11: pair propag compat renam evolut micro prioriti late makefil programm 
Topic 12: visual extract databas data inform sourc tool fact xml schema 
Topic 13: program analysi dynam static slice algorithm precis comput flow execut 
Topic 14: objectori softwar analysi use autom framework tool design experi model 
Topic 15: objectori softwar autom analysi framework legaci use largescal tool evolut 
Topic 16: web page browser string constant html php javascript ajax server 
Topic 17: anti antipattern linguist scc certif pattern imped occurr greater softwar 
Topic 18: softwar develop engin process architectur project tool use research mainten 
Topic 19: test code use sourc techniqu chang approach softwar result studi 
Topic 20: softwar analysi tool use autom objectori framework comprehens evolut experi 
Topic 21: optim search insight yield near soft solut sbse softwar engin 
Run: 5
Topic 0: smell word macro bad identifi cognit split expand taxonomi renam 
Topic 1: architectur compon applic framework base approach web interfac user use 
Topic 2: tool softwar mainten assess use studi approach experi evolut realtim 
Topic 3: ownership restructur organiz domin spi measur surpris incom owner encapsul 
Topic 4: factori constructor softwar mainten experi cell tool molecular inspector use 
Topic 5: test techniqu case gener execut fault use input effect suit 
Topic 6: period month churn seri forecast firefox foundat chrome softwar latenc 
Topic 7: program analysi dynam static slice algorithm use graph depend flow 
Topic 8: odc softwar mainten studi experi use tool recoveri support process 
Topic 9: softwar engin research revers develop industri practic workshop experi discuss 
Topic 10: languag design object model tool orient class aspect transform use 
Topic 11: fuzzi imperfect toss softwar mainten instabl reverseengin altogeth sdk etp 
Topic 12: calibr spc softwar process spreadsheet use mainten experi variat tool 
Topic 13: develop process project servic manag applic requir softwar product technolog 
Topic 14: port peer breakpoint item adjust remedi estim bia softwar inspect 
Topic 15: pair evolut empir ecosystem growth law softwar contributor studi regular 
Topic 16: reengin method open sourc list xml tool patch mail oss 
Topic 17: inform lead corpor compani divis comprehens analyst iso vice consolid 
Topic 18: tool softwar mainten realtim use objectori assess experi recoveri studi 
Topic 19: code softwar use sourc chang develop approach studi result tool 
Topic 20: anti antipattern remodular linguist cluster occurr softwar mainten scc finegrain 
Topic 21: model specif use check formal properti state approach gener verif 
Run: 6
Topic 0: softwar architectur develop process use compon approach mainten support paper 
Topic 1: objectori softwar analysi framework autom tool largescal increment use abstract 
Topic 2: applic secur web profil vulner synthesi server polici time real 
Topic 3: analysi objectori softwar autom abstract framework aspectori componentbas realtim largescal 
Topic 4: inform busi compani legaci workflow lead corpor technolog divis analyst 
Topic 5: softwar analysi objectori autom framework use abstract tool experi approach 
Topic 6: code sourc bug use chang detect studi develop softwar identifi 
Topic 7: softwar engin develop research project servic web applic revers technolog 
Topic 8: reconstruct decompil obfusc readabl ast reverseengin polymorph birthmark standalon bytecod 
Topic 9: parallel concurr platform perform sequenti net hardwar distribut thread multi 
Topic 10: program test use gener techniqu analysi execut approach present base 
Topic 11: visualis softwar objectori dimension analysi shrimp tool comprehens autom visual 
Topic 12: regular law equat softwar length demet autom smalltalk analysi friend 
Topic 13: optim solut method factor search comparison advantag softwar produc known 
Topic 14: malwar anti breakpoint defens analysi emul worm mitig card unpack 
Topic 15: remot factori renov softwar notif constructor metalanguag sent usabl analysi 
Topic 16: analysi softwar autom objectori tool framework experi use largescal program 
Topic 17: softwar autom analysi objectori tool largescal framework experi use approach 
Topic 18: machin binari translat spreadsheet virtual packag compil window instruct assembl 
Topic 19: adapt dynam runtim run time self assur failur fault reconfigur 
Topic 20: softwar analysi objectori autom use tool experi abstract program framework 
Topic 21: model design languag specif use class object formal pattern transform 
Run: 7
Topic 0: chang softwar evolut manag mainten evolv configur version support impact 
Topic 1: develop bug sourc project report code softwar studi open use 
Topic 2: program code use tool sourc approach analysi techniqu java pattern 
Topic 3: optim solut method profil schedul guarante produc speed advantag factor 
Topic 4: cognit anti antipattern occurr linguist taxonomi softwar scc visual imped 
Topic 5: trait actor gpu scala autom analysi basset softwar tool use 
Topic 6: alert signatur immut defens mitig jone autom analysi mutabl worm 
Topic 7: intens inter incorpor anoth intric softwar novel stream green depend 
Topic 8: applic web user secur interfac servic databas migrat client interact 
Topic 9: softwar analysi visual autom experi use largescal design objectori reengin 
Topic 10: pair agil review confer program member track accept panel regular 
Topic 11: model specif use design gener approach tool languag base requir 
Topic 12: featur aspect composit modular line product orient concern increment compos 
Topic 13: visual softwar star analysi gxl tool experi shrimp autom exchang 
Topic 14: malwar visualis behaviour obfusc harm growth certifi emul malici viru 
Topic 15: dynam static slice analysi condit precis rang execut case larg 
Topic 16: busi legaci inform compani process ibm corpor reengin cobol technolog 
Topic 17: test softwar use techniqu studi case result approach qualiti effect 
Topic 18: objectori redocument analysi softwar visual tool mainten experi design use 
Topic 19: softwar autom analysi visual experi objectori reengin largescal recoveri use 
Topic 20: analysi realtim softwar visual autom use data mainten tool experi 
Topic 21: softwar engin develop architectur research compon distribut design process challeng 
Run: 8
Topic 0: visualis softwar autom analysi visual abstract largescal realtim componentbas framework 
Topic 1: binari secur licens attack complianc malwar permiss free protect enforc 
Topic 2: privaci threat anonym regul mitig softwar dca analysi autom law 
Topic 3: cot shelf softwar analysi autom componentbas stateflow notif framework stand 
Topic 4: applic web distribut servic class environ user perform deploy develop 
Topic 5: model specif use compon languag architectur tool base approach requir 
Topic 6: bug report predict defect fix develop project use repositori file 
Topic 7: microblog softwar dissemin twitter million realtim analysi autom visual observatori 
Topic 8: optim solut method advantag factor produc guarante support known tool 
Topic 9: analysi softwar realtim visual architectur componentbas autom experi objectori use 
Topic 10: code sourc pattern design detect clone program refactor tool use 
Topic 11: negoti softwar win analysi autom visual abstract architectur experi componentbas 
Topic 12: edit script systemat scm interrupt ident induc umpl autom session 
Topic 13: artefact actor softwar hidden scala messag shrimp mcc basset analysi 
Topic 14: softwar develop chang use studi process mainten project paper evolut 
Topic 15: softwar engin research revers commun workshop challeng discuss industri practic 
Topic 16: test program use techniqu analysi gener execut algorithm approach case 
Topic 17: orient aspect object concern program modular point modul mechan separ 
Topic 18: schema format exchang fact extractor organis standard softwar testabl phase 
Topic 19: legaci busi reengin inform compani migrat databas technolog process workflow 
Topic 20: flaw objectori mathemat softwar analysi prey predat popul ssa use 
Topic 21: softwar analysi use tool autom abstract visual reengin experi largescal 
Run: 9
Topic 0: bug report code sourc api fix detect use predict approach 
Topic 1: design architectur compon softwar requir pattern framework approach base model 
Topic 2: agreement softwar sla analysi visual realtim tool experi largescal abstract 
Topic 3: optim solut method advantag visual layout softwar produc guarante known 
Topic 4: anchor adjust softwar matter cognit tool analysi visual wikipedia use 
Topic 5: test techniqu gener case execut use fault approach input autom 
Topic 6: factori renov constructor softwar scaffold proprietari experi analysi mode largescal 
Topic 7: process servic legaci technolog busi reengin migrat comprehens inform environ 
Topic 8: regular wrapper length law lexic sourcecod equat softwar extrapol intension 
Topic 9: softwar analysi experi realtim tool abstract largescal visual distribut autom 
Topic 10: engin softwar research revers commun workshop challeng comput discuss industri 
Topic 11: realtim softwar tool analysi largescal abstract experi visual model use 
Topic 12: model applic use languag specif tool gener approach web base 
Topic 13: object class metric orient method measur use concept coupl code 
Topic 14: softwar develop chang studi code use project sourc mainten process 
Topic 15: platform mobil devic deploy applic network resourc driver hardwar android 
Topic 16: softwar eve abstract interact tool autom analysi model experi largescal 
Topic 17: softwar use analysi realtim abstract experi tool autom largescal support 
Topic 18: program code analysi use java dynam refactor type static slice 
Topic 19: spreadsheet end formula microblog bidirect templat modeldriven excel dataflow cell 
Topic 20: classif classifi categori taxonomi csp capac item analyst notif orthogon 
Topic 21: configur wide rang conflict static merg rel larg case applic 

Runtime: --- 425.590034962 seconds ---

Score: 0.9

Credibility Of LDA

IDEA:

ACTUAL — document-topic distribution (dominant topic per document):

|      | T1 | T2 | T3 | ... |
|------|----|----|----|-----|
| Doc1 |    |    |    |     |
| Doc2 |    |    |    |     |
| Doc3 |    |    |    |     |

PREDICTED — selected from the dominant topic of the document-topic distribution:

|      | W1 | W2 | W3 | ... |
|------|----|----|----|-----|
| Doc1 |    |    |    |     |
| Doc2 |    |    |    |     |
| Doc3 |    |    |    |     |

**According to the literature, if a document is hard-assigned to one dominant 
topic, the top words from that dominant topic should appear in the 
actual document. If they do not, then either:
 - the probability of the dominant topic is very low, and another topic might 
be a better candidate for dominance,
- or the top words were poorly selected; better word weights could identify 
the same dominant topic.**

Experiment:

  • Once the top n words are selected from each topic, each topic is represented by those n words.
  • The dominant topic (from the document-topic distribution) is selected to represent each document; we call that the actual label.
  • For each topic, now represented by its n words, we count how many ('m') of those n words occur in the document. The topic whose words occur most often becomes the predicted label for that document.

We now have x documents. For example, with x=4 documents [D1,D2,D3,D4] and k=3 topics:
Actual    = [1,1,2,0]
Predicted = [1,0,2,0]
The score is the fraction of agreements: 3/4 = 0.75.
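The prediction and scoring steps above can be sketched as follows. This is a minimal illustration with hypothetical helper names (`predict_topic`, `credibility_score`) and toy data, not the repo's actual implementation; it assumes each topic is represented by its top-n words and that ties are broken by the first topic with the maximum overlap.

```python
def predict_topic(doc_words, topic_top_words):
    """Return the index of the topic sharing the most words with the document."""
    overlaps = [len(set(doc_words) & set(words)) for words in topic_top_words]
    return overlaps.index(max(overlaps))  # first topic wins ties

def credibility_score(actual, predicted):
    """Fraction of documents whose predicted dominant topic matches the actual one."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / float(len(actual))

if __name__ == "__main__":
    # Toy topics, each represented by its top (stemmed) words
    topics = [["test", "fault", "case"],
              ["model", "languag", "specif"],
              ["code", "sourc", "clone"]]
    docs = [["model", "specif", "check"],
            ["test", "model", "languag"],
            ["code", "clone", "detect"],
            ["test", "case", "gener"]]
    actual = [1, 1, 2, 0]
    predicted = [predict_topic(d, topics) for d in docs]
    print(predicted)
    print(credibility_score(actual, predicted))
```

The explicit `float()` division keeps the score correct under both Python 2 (which the repo targets) and Python 3.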

Results:

  • Higher scores are better.

Conclusion:

  • Tuned LDA with the top 7 words performs much better than untuned (default, k=10) with the top 7 words.
  • Tuned LDA with the top 7 words performs as well as or better than untuned (default, k=10) with the top 10 words.
  • With tuning, the top 7 words better define each topic.

To Dos

  • T1 = {1,2} T2 = {3,4,5}. Need to see T1 - T2 clusters. For all 6 different projects.
  • Stability of topics.

on Wikipedia sets

  • Two graphs: we call topics stable when n% of their terms match, across m% of topic matches, over l runs.
  • Define the stability of topics in LDA.
  • Increasing the number of topics made results more stable. With too few topics there can be multiple hills (local optima), and different runs find different hills across the dataset.
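One way the "n% matching terms across runs" notion of stability could be computed is sketched below. This is an assumption-laden illustration (hypothetical helpers `topic_matched` and `stability`), not the repo's actual metric: a topic from one run counts as matched if some topic in another run shares at least a `threshold` fraction of its top words, and stability is the fraction of matched topics over all run pairs.

```python
from itertools import combinations

def topic_matched(words_a, words_b, threshold=0.5):
    """True if the overlap covers at least `threshold` of topic A's top words."""
    overlap = len(set(words_a) & set(words_b))
    return overlap / float(len(words_a)) >= threshold

def stability(runs, threshold=0.5):
    """Fraction of topics (over all ordered run pairs) that find a match."""
    matched = total = 0
    for run_a, run_b in combinations(runs, 2):
        for topic in run_a:
            total += 1
            if any(topic_matched(topic, other, threshold) for other in run_b):
                matched += 1
    return matched / float(total)

if __name__ == "__main__":
    # Two toy runs, each with two topics of four top words
    runs = [
        [["test", "fault", "case", "gener"], ["model", "languag", "specif", "formal"]],
        [["test", "case", "suit", "input"], ["model", "specif", "check", "verif"]],
    ]
    print(stability(runs, threshold=0.5))
```

Raising `threshold` (requiring a larger term overlap) makes the criterion stricter, which is one lever for the "n% matching terms" definition above.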

Meeting - 06/16

5 terms overlap


| Dataset \ Evaluations | 100  | 200  | 300  | 400  |
|-----------------------|------|------|------|------|
| PitsA                 | 0.9  | 0.9  | 1.0  | 1.0  |
| PitsB                 | 0.9  | 0.9  | 0.9  | 1.0  |
| PitsC                 | 1.0  | 1.0  | 1.0  | 1.0  |
| PitsD                 | 1.0  | 1.0  | 1.0  | 1.0  |
| PitsE                 | 0.9  | 0.9  | 1.0  | 1.0  |
| PitsF                 | 0.9  | 0.9  | 0.9  | 0.9  |
| Citemap               | 0.67 | 0.67 | 0.77 | 0.77 |

Read this
