
ai-se / pits_lda

IST journal 2017: Tuning LDA

Home Page: https://github.com/amritbhanu/LDADE-package

Python 98.57% Shell 0.26% Scala 1.17%
hyperparameter-optimization hyperparameter-tuning tuning optimization clustering classification genetic-algorithm topic-modeling lda differential-evolution

pits_lda's Introduction

LDADE

  • The cleaner code is available in the LDADE folder, with a test script
  • It requires Python 2
  • Required packages: numpy, scipy, sklearn

pits_lda's People

Contributors

  • amritbhanu

pits_lda's Issues

alpha beta lda

A low alpha value places more weight on having each document composed of only a few dominant topics (whereas a high value returns many more relatively dominant topics). Similarly, a low beta value places more weight on having each topic composed of only a few dominant words.
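This effect of alpha can be illustrated with a small numpy sketch (not from the repo; symmetric Dirichlet draws stand in for LDA's document-topic prior, and the function name is hypothetical):

```python
import numpy as np

def dominant_mass(alpha, n_topics=10, n_docs=2000, seed=1):
    """Average probability mass of each document's single largest topic
    when document-topic mixtures are drawn from Dirichlet(alpha)."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet([alpha] * n_topics, size=n_docs)
    return theta.max(axis=1).mean()

low = dominant_mass(0.1)    # low alpha: a few dominant topics per document
high = dominant_mass(10.0)  # high alpha: many comparably weighted topics
```

With low alpha the largest topic carries most of each document's mass; with high alpha the mass is spread out, matching the claim above.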

LDA topics on NASA PITS

Experiment 1

  • Kept only words with length greater than 3
  • Extracted the logs whose severity is 2 and discarded the others

Top 30 Topics

30 topics:
TOPIC 0
capabl  0.0072857702548080085
detect  0.006889243638298703
thruster    0.006700064944549625
int 0.006607782123785966
verifi  0.0065153889947990404
monitor 0.00650296321145424
provid  0.006375887077561054
version 0.006342073856340687
mode    0.006232402158806654
illeg   0.006220333047750552

TOPIC 1
sequenc 0.033129759453011234
flight  0.03252374735588161
address 0.022959018219828313
configur    0.019362618270488675
execut  0.018645697987183907
spacecraft  0.018536089237481138
vml 0.016712577126068412
contain 0.014304561669005917
dump    0.012876027216572137
capabl  0.012353371621085145

TOPIC 2
text    0.014176183805144018
indic   0.01402861490766197
disabl  0.011578996197979974
initi   0.01114483984031251
power   0.010708437769477594
process 0.010701123372393252
issu    0.010398400658366539
number  0.008637492236074887
configur    0.007928624859391881
manag   0.007903798812742504

TOPIC 3
mode    0.07966584673217642
state   0.03718070354145717
initi   0.03572169251594875
address 0.031429716725763605
fault   0.030452571566128406
spacecraft  0.027279151643025204
submod  0.022740745115632146
event   0.021906758483994815
array   0.01824853606074547
safe    0.017724767764738542

TOPIC 4
paramet 0.0071188055404684805
sram    0.006827464128244202
mode    0.00673273831926996
srup    0.006591190141582033
trace   0.006586444155181463
lead    0.006553998061153735
verifi  0.00645559544261332
packet  0.00642680558984484
refer   0.0064026813981321005
statu   0.006329258144168653

TOPIC 5
step    0.07957132256961552
procedur    0.0645619658601564
valv    0.041675091536043866
execut  0.036320039850185164
engcntrl    0.03142032361285736
rvm 0.031151429519339312
ppu 0.028139784762838016
latch   0.026210595939019456
control 0.01665968261541891
safe    0.01489455854502568

TOPIC 6
detect  0.006901424936164037
illeg   0.006830546315072616
ssp 0.006780407223867901
calcul  0.00675266664234023
chang   0.006670584494538672
baselin 0.006639908707605641
size    0.006587158292765058
base    0.006470646527962334
byte    0.006417814822230279
attitud 0.006399922089239512

TOPIC 7
engcntrl    0.12864125725078046
rvm 0.08106913144631238
miss    0.06797058286501821
paramet 0.04712804740169639
oper    0.03959955807878258
question    0.03548013555542337
set 0.035106402441915874
lead    0.035063628473402754
baselin 0.0331610948442407
valid   0.025475517399741195

TOPIC 8
text    0.013554263569267398
issu    0.011318615133417836
monitor 0.010701013969762457
function    0.01063652335226498
data    0.010621648784653277
eeprom  0.008556587731938457
indic   0.008555600925061647
manag   0.008527574670884068
flight  0.0083612822637015
number  0.008008184417918564

TOPIC 9
refer   0.007059825821001276
lead    0.006984447280031191
scr 0.006838087440533121
receiv  0.006717873863744479
uplink  0.006674271056485936
process 0.0066346156141782455
iru 0.006563385386916043
index   0.006443017673025953
safe    0.006432388815486642
pressur 0.006411195054815109

TOPIC 10
fals    0.03580471202584637
initi   0.010931205194610211
file    0.008206711770178607
num 0.007587271599197137
word    0.007044442564413284
code    0.007024508831246105
verif   0.006556217191035221
fail    0.0064630813488873295
number  0.006456010424001303
baselin 0.006374013212012404

TOPIC 11
projecta    0.06746218996739925
point   0.04184767002882261
tabl    0.028350651612809887
specif  0.026752048078420344
control 0.02492652800690521
inertia 0.024851322524132656
design  0.019213967081341876
perform 0.01911143149108216
analysi 0.01904415564239937
note    0.018525507516305707

TOPIC 12
scr 0.007000582800495213
paramet 0.006666637905827172
oper    0.006660309291516552
process 0.006600259070505539
state   0.006473211101004202
procedur    0.0063467868940232505
latch   0.006293818254668043
task    0.00625789441880334
monitor 0.006250431817756035
fals    0.006204256309834193

TOPIC 13
text    0.05542830868272589
issu    0.03725262264123818
function    0.035717358930808504
use 0.03449499026069172
number  0.024030362307005032
configur    0.023222596144389325
file    0.02161114366150162
differ  0.02041661569850046
rate    0.019600705898817104
data    0.019428473501135756

TOPIC 14
plenum  0.06249745243079222
engcntrl    0.0409187547507851
rvm 0.036013477398814595
pressur 0.03230371555369164
paramet 0.02545298747765356
second  0.022874174938327285
check   0.02236197131682671
valu    0.022107342696258078
data    0.018644210188598787
miss    0.017962971753087812

TOPIC 15
integr  0.006770258735374216
paramet 0.0065478122693753025
state   0.006507774808691103
event   0.006457570853626896
step    0.006382456198675023
result  0.006294514557105005
sequenc 0.006211944228406801
contain 0.006182191108707061
pse 0.006174884134512087
file    0.006140997936215436

TOPIC 16
perform 0.006759823859313174
gener   0.006753369449746122
idl 0.006579616748933607
discuss 0.006444489487713062
verif   0.00644447573029011
illeg   0.006349126276971217
calcul  0.006300919187424522
respons 0.006298347915265842
fault   0.0062068278503890325
projecta    0.006161297742674911

TOPIC 17
float   0.06443236109201486
equal   0.047199710697273065
variabl 0.02934617624497164
constant    0.023256584077755587
accept  0.017902067085217875
point   0.014308891713200092
number  0.01066318416415156
sun 0.01047228634419052
line    0.010271237303141185
safe    0.010187038486312423

TOPIC 18
constant    0.0067272032380626
event   0.006509176463676518
accept  0.0064645037686356
invalid 0.006408657163372205
statu   0.006308655882355975
follow  0.006282119776234993
updat   0.006242530188278966
case    0.0062374334951456134
valid   0.006232355334315194
gener   0.006214309165593506

TOPIC 19
code    0.06089576666196665
document    0.023529369371275612
calcul  0.019601636551500184
line    0.017346617131557968
bit 0.01578553107263362
float   0.014946003104453131
equal   0.0116140446933427
valu    0.01101304871095636
variabl 0.010571777572977425
flight  0.010009773380919672

TOPIC 20
text    0.01319767734485295
engcntrl    0.012989456529975841
issu    0.011008447357302964
miss    0.010487501094100416
valv    0.010455669829729019
rvm 0.010301555088397404
state   0.010288119759840264
initi   0.010162147401351915
use 0.010023395745515468
manag   0.00805930979917511

TOPIC 21
file    0.03711533387181546
line    0.036662071751758965
prioriti    0.03252689221436745
ace 0.024601020266133027
defin   0.023781163143673706
counter 0.02375448623140276
buffer  0.023409858058972895
valu    0.02262365120226543
size    0.02100273544220952
error   0.018441500948256993

TOPIC 22
rvm 0.058557052788672445
engcntrl    0.05147203358485272
bootload    0.04588724664333789
checksum    0.045278735080949054
calcul  0.041654115486330225
memori  0.03298984546384696
text    0.03269440107902858
idl 0.026195712265557315
fsw 0.023523170638217756
address 0.021366050392282596

TOPIC 23
num 0.006609283590812713
subaddress  0.0066003824882191
fals    0.006592974808638571
switch  0.006572564017413236
rate    0.006493912398838559
fsw 0.006470888042171489
control 0.006409973661028382
sun 0.006394022472625757
set 0.006350634655874299
launch  0.0063247526201570085

TOPIC 24
suncrosscalib   0.00683729202567074
init    0.006589179818997157
address 0.006555629737091954
hop 0.006453259698936264
valu    0.006421198805018765
dump    0.006304229634976303
scrub   0.006231210980935616
checksum    0.006222658102646131
type    0.0062151285407071565
limit   0.0062148949738669084

TOPIC 25
access  0.007015605770948273
packet  0.006878738003382888
sub 0.00681656724573433
obc 0.006680487905656781
prioriti    0.006644301963427974
write   0.006550542311169962
wait    0.006507482714691407
correct 0.006478934758681904
initi   0.006461548505262755
hlp 0.0064226804419323874

TOPIC 26
transit 0.007274267735111653
use 0.0067174487488747
verif   0.0067061381279485445
ppu 0.00658514885657192
receiv  0.0065420117405110695
scrub   0.006522674447596902
execut  0.006514190194472578
defin   0.006414629726145246
main    0.006374296721232405
capabl  0.0063654238087001505

TOPIC 27
statu   0.03141061634696384
messag  0.02692935272088277
bound   0.02655592019775904
access  0.026544893112142195
flexelint   0.020997808517881016
line    0.020127632879032676
file    0.01948517834033966
attitud 0.016821070971116785
int 0.01674476456958524
valid   0.015860426415481667

TOPIC 28
request 0.006491484194327534
limit   0.006442496939425952
bootload    0.006403388811680738
fdc 0.006399974023558598
progress    0.00639587365713958
index   0.006275722391109265
control 0.006249616915266672
includ  0.006200056435239194
vcdu    0.006174630782103055
calcul  0.006159157887654544

TOPIC 29
ang 0.007032597978994753
pcontrol    0.006737330705273606
rate    0.00672458742169172
rvm 0.006693384392581234
address 0.006614323139184355
flight  0.00656254355888187
valv    0.006546939310696933
switch  0.00648290719226052
sub 0.006415598605858243
spacecraft  0.006386103243420132

T1 - T2

Experiment

  • T1 = reports with severity 1 and 2
  • T2 = reports with severity 3, 4, and 5
  • Each line represents the features of the top 10 topics

Results

  • Results show there is no overlap of topics between T1 and T2. Other than a couple of words, the two sets of topic clusters are different.

For T1 - Project A

projecta control inertia design specif perform attitud tabl spacecraft note 
interrupt uplink srup fsw verif error follow specif eeprom scr 
tabl initi use fals address ppu obc function event dump 
checksum calcul enabl progress process task idl text oper discuss 
text fault memori error plenum initi second number pressur indic 
mode flight issu execut sequenc current indic point set vml 
switch messag case file mode type code flexelint function use 
wait int variabl vml read dump write verif task verifi 
miss oper set text paramet state check valid indic number 
subaddress address telemetri packet word fsw data buffer request bootload 

For T2 - Project A

obc safe fault projecta power address flight mode state spacecraft 
code data function line valu access variabl use messag record 
srobc rate spacecraft flight memori prd alloc provid link point 
non load int unsign bit eeprom obc comput control data 
control mode point attitud error plenum sroac target main high 
grand word tlm type packet count cmd byte header command 
file defin line tlm data statu macro array ambi len 
command softwar flight trace link srup task uplink time spacecraft 
variabl initi messag line code entri valu use extern mode 
test script verifi mode engcntrl link indic issu procedur data 

Improving the Usability of Topic Models

Improving the Usability of Topic Models

[bibtex](@PhDThesis{yang2015improving,
title={Improving the Usability of Topic Models},
author={Yang, Yi},
year={2015},
school={NORTHWESTERN UNIVERSITY}
})

Problems:

  • The Gibbs sampling inference method for LDA runs too slowly for large datasets with many topics.
  • The topics learned by LDA are sometimes difficult for end users to interpret.
  • LDA suffers from an instability problem.

Motivation:

  • We would like to efficiently train a big topic model with prior knowledge.

Terminologies:

  • Markov random field
  • First-Order Logic

Stability Measures:

  • try running the algorithm many times and choose the model with the highest likelihood
  • document-level stability and token-level stability.
  • the number of topics was set to 20, the number of iterations was 1000. We use a uniform α with a value of 1.0, a uniform β with a value of 0.01

General:

  • LDA can be viewed as a dimension-reduction tool for document modeling, reducing the dataset dimension from the vocabulary size V to the number of topics T.
  • Users have external knowledge regarding word correlation, which can be taken into account to improve the semantic coherence of topic modeling.

Methods:

  • SC-LDA can handle different kinds of knowledge, such as word correlation, document correlation, document labels, and so on. One advantage of SC-LDA over existing methods is that it converges very quickly.

Datasets:

Analogy when topics are unstable:

  • the mental map Jane has built for the paper collection is disrupted, resulting in confusion and frustration. The tool has become less useful to Jane unless she puts in some effort to update her mental map, which significantly increases her cognitive load

References:

  • Online LDA [23]
  • [41] presents an algorithm for distributed Gibbs sampling
  • [67] proposes a MapReduce parallelization framework that uses variational inference as the underlying algorithm
  • [19] presents the Gibbs sampling method for LDA inference
  • Labeled LDA: [48] presents a generative model for modeling document collections where the documents are associated with labels
  • Dirichlet Forest LDA [3]
  • Logic LDA: [4]
  • Quad-LDA: [42] In order to improve the coherence of the keywords per topic learned by LDA.
  • NMF-LDA: Similar to Quad-LDA, [64]
  • Markov Random Topic Fields (MRTF): [27]
  • Interactive Topic Modeling (ITM): [26] proposes the first interactive framework for allowing users to iteratively refine the topics discovered by LDA by adding constraints that enforce that sets of words must appear together in the same topic
  • [47] proposes Fast-LDA by constructing an adaptive upper bound on the sampling distribution and achieves a faster inference

Summary:

  • Labeled LDA can only handle document-label knowledge. Dirichlet Forest LDA, Quad-LDA, NMF-LDA, and ITM can only handle word-correlation knowledge. MRTF can only handle document-correlation knowledge. Logic LDA can handle word correlation, document labels, and other kinds of knowledge; however, each piece of knowledge has to be encoded as First-Order Logic.

LDA-GA

How to Effectively Use Topic Models for Software Engineering Tasks? An Approach Based on Genetic Algorithms

[bibtex](@inproceedings{panichella2013effectively,
title={How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms},
author={Panichella, Annibale and Dit, Bogdan and Oliveto, Rocco and Di Penta, Massimiliano and Poshyvanyk, Denys and De Lucia, Andrea},
booktitle={Proceedings of the 2013 International Conference on Software Engineering},
pages={522--531},
year={2013},
organization={IEEE Press}
})

Approaches:

  • Posterior Distribution over the assignments of words to topics
  • Computing the harmonic mean of posterior distribution

Parameters up for tuning:

  • k, n, a, b (n comes from the Gibbs sampling generative model)

Definitions:

  • Dominant topic: Let θ be the topic-by-document matrix generated by a particular LDA configuration P = [k, n, α, β]. A generic document dj has a dominant topic ti, if and only if θ(i,j) = max{ (θ(h,j)), h = 1 . . . k}.
  • High inter-cluster distance and low intra-cluster distance
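The dominant-topic definition above reduces to an argmax over each column of θ; a minimal numpy sketch (function name is mine):

```python
import numpy as np

def dominant_topics(theta):
    """theta: topic-by-document matrix (k x m) from one LDA configuration
    P = [k, n, alpha, beta]. Returns, for each document j, the index i of
    its dominant topic, i.e. argmax over theta[:, j]."""
    return np.argmax(theta, axis=0)

# Each column sums to 1: a document's distribution over k = 3 topics.
theta = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.5, 0.1],
                  [0.1, 0.3, 0.8]])
dominant_topics(theta)  # doc 0 -> topic 0, doc 1 -> topic 1, doc 2 -> topic 2
```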

Evaluation criteria:

  • Internal - cohesion (intra) and separation (inter); Silhouette Coefficient (-1 to 1)
  • External - external information needed

Needs clarity - how to convert the text into data points in order to do the cluster-goodness evaluation.
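The internal criterion can be made concrete with a self-contained silhouette sketch (pure numpy, no sklearn; assumes documents have already been converted to points):

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]: cohesion (small
    intra-cluster distance) and separation (large inter-cluster
    distance) push the score toward 1."""
    points, labels = np.asarray(points, float), np.asarray(labels)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li) & (np.arange(len(labels)) != i)
        a = dist[i, same].mean()                    # mean intra-cluster distance
        b = min(dist[i, labels == lj].mean()        # nearest other cluster
                for lj in set(labels) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated clusters score close to 1.
silhouette([[0, 0], [0, 1], [10, 10], [10, 11]], [0, 0, 1, 1])
```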

Actual LDA-GA

  • A stochastic search technique based on the mechanisms of natural selection and natural genetics.
  • Stochastic search is the method of choice for solving many hard combinatorial problems.
  • Multiple solutions (individuals) evolve in parallel to explore different parts of the search space.
  • An individual (or chromosome) is a particular LDA configuration, and the population is a set of different LDA configurations.
  • The fitness function that drives the GA evolution is the Silhouette coefficient.
  • α and β varied from 0 to 1; α and β can also be set to the defaults of 50/k and 0.1.
  • LDA-GA has been implemented in R [37] using the topicmodels and GA libraries.
  • STOPPING CRITERIA - For the GA, the following settings were used: a crossover probability of 0.6, a mutation probability of 0.01, a population of 100 individuals, and an elitism of 2 individuals. As a stopping criterion, the evolution was terminated if the best result did not improve for 10 generations; otherwise it stopped after 100 generations.
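The GA settings above can be sketched as a toy loop. The fitness here is a hypothetical stand-in for the real Silhouette-based fitness (running actual LDA per individual is expensive), and all helper names are mine:

```python
import random

def fitness(cfg):
    """Toy fitness: distance of a configuration [k, n, a, b] from a
    hypothetical optimum (the paper would run LDA and score the
    resulting clustering with the Silhouette coefficient instead)."""
    k, n, a, b = cfg
    return -((k - 20) ** 2 / 400 + (n - 500) ** 2 / 250000
             + (a - 0.5) ** 2 + (b - 0.1) ** 2)

def random_cfg(rng):
    return [rng.randint(10, 30), rng.randint(100, 1000),
            rng.random(), rng.random()]

def ga(generations=50, pop_size=100, p_cross=0.6, p_mut=0.01,
       elitism=2, seed=0):
    rng = random.Random(seed)
    pop = [random_cfg(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        nxt = [cfg[:] for cfg in pop[:elitism]]          # elitism: keep the best 2
        while len(nxt) < pop_size:
            p1, p2 = rng.sample(pop[:pop_size // 2], 2)  # select from the better half
            child = p1[:]
            if rng.random() < p_cross:                   # single-point crossover
                cut = rng.randrange(1, 4)
                child = p1[:cut] + p2[cut:]
            if rng.random() < p_mut:                     # point mutation of one gene
                i = rng.randrange(4)
                child[i] = random_cfg(rng)[i]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = ga()
```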

Assumptions:

  • The top 10 words belonging to the topic with the highest probability in the obtained topic distribution were used to label the class.

To Dos

  • Talk to Wei
  • Do the literature review (50-100 papers)
  • Standard practice regarding toolkits, parameters, and validation
  • What has anyone else concluded in the SE domain?
  • Can you get the data in order to replicate? From a recent, highly cited study by a senior author.
  • ICSE, FSE, RE, MSR, TSE, TOSEM, ESEM, JSS (Journal of Systems and Software), ISTR, IST
  • Can you find stability in their results? Can you fix them using DE?


To DOs

  • Does it hold for PITS B-E?
  • Can you use DE to make it more stable and less stable?
  • Gibbs Sampling?
  • Predict severity using topics? Perplexity

How Many Topics? Stability Analysis for Topic Models

How Many Topics? Stability Analysis for Topic Models

[bibtex](@incollection{greene2014many,
title={How many topics? stability analysis for topic models},
author={Greene, Derek and O’Callaghan, Derek and Cunningham, P{'a}draig},
booktitle={Machine Learning and Knowledge Discovery in Databases},
pages={498--513},
year={2014},
publisher={Springer}
})

Idea:

  • A term-centric stability approach for selecting the number of topics in a corpus, based on the agreement between term rankings generated over multiple runs of the same algorithm. Employs a "top-weighted" ranking measure, where higher-ranked terms have a greater degree of influence when calculating agreement scores.
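A minimal sketch of such a top-weighted agreement score (the Average-Jaccard formulation and the function name are my reading of the paper's description, not code from it):

```python
def average_jaccard(rank1, rank2):
    """Top-weighted agreement between two term rankings: average the
    Jaccard similarity of the top-d prefixes for d = 1..len(ranking),
    so disagreement near the top of the ranking costs more than
    disagreement lower down."""
    scores = []
    for d in range(1, len(rank1) + 1):
        s1, s2 = set(rank1[:d]), set(rank2[:d])
        scores.append(len(s1 & s2) / len(s1 | s2))
    return sum(scores) / len(scores)

# Same terms, but the top-2 prefixes disagree, so the score drops below 1.
average_jaccard(["mode", "state", "fault"], ["mode", "fault", "state"])
```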

Weekly Report - 10/11/2016

DOING:

  • Judging the credibility of LDA in terms of how many words should be reported for each topic.
  • Experimenting with tuned 7 words against default 7 words and 10 words.
  • Reviewing the LDA papers.

DONE:

  • Check #34
  • Expected: tuned 7 words should perform the same as untuned 7 words and worse than untuned 10 words.
  • But tuned 7 words performed better than both. (WIN SITUATION, i.e., TUNING HELPED)

TODO:

  • Redo anything after getting feedback on the above results.
  • Does stability help classification? With about the same k (default and untuned), do better alpha and beta help?
  • Is LDA+word-vector featurization better than LDA features alone?

roadblocks:

  • Wanted to reproduce this paper
  • But the link provided is not working. Have emailed the author, but no response.

admin:

  • can above be solved?

Machine Reading Tea Leaves

Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality

[bibtex](@inproceedings{lau2014machine,
title={Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality.},
author={Lau, Jey Han and Newman, David and Baldwin, Timothy},
booktitle={EACL},
pages={530--539},
year={2014}
})

General:

  • A good paper which gives a rationale for topic instability.

Measures:

  • The notion of topic "coherence", and an automatic method for estimating topic coherence based on pairwise pointwise mutual information (PMI) between the topic words.
  • Direct approach: asking people about the topics. Indirect approach: evaluating PMI, CP.
  • To create gold-standard coherence judgements, they used Amazon Mechanical Turk.
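The pairwise-PMI idea can be sketched as follows (a toy stand-in: the paper estimates word probabilities from a large external reference corpus, whereas here any small list of documents serves as the reference; the function name is mine):

```python
import math
from itertools import combinations

def pmi_coherence(topic_words, documents):
    """Mean pairwise PMI of a topic's top words, with word and pair
    probabilities estimated from document-level co-occurrence in a
    reference corpus. Pairs that never co-occur are skipped."""
    docs = [set(d) for d in documents]
    n = len(docs)
    def p(*ws):
        return sum(all(w in d for w in ws) for d in docs) / n
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        joint = p(w1, w2)
        if joint > 0:
            scores.append(math.log(joint / (p(w1) * p(w2))))
    return sum(scores) / len(scores) if scores else 0.0

docs = [["mode", "state"], ["mode", "state"],
        ["fault", "valve"], ["fault", "valve"]]
pmi_coherence(["mode", "state"], docs)  # words that co-occur: positive PMI
```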

Problems:

  • perplexity correlates negatively with topic interpretability

Research Question:

  • Word intrusion measures topic interpretability differently from observed coherence.

Terminologies:

  • topic coherence, the semantic interpretability of the top terms usually used to describe discovered topics
  • “intruder word”, which has low probability in the topic of interest, but high probability in other topics

LDA topics as feature selector

Experiment rig:

  • The initial run was with the default parameter k=10. Documents had 10 features. The score was null.
  • The neg/pos ratio was very high - an unbalanced dataset. E.g.:
    • SE0: 'no': 6008, 'yes': 309 - F-score of about 0.5
    • SE1: 'no': 47201, 'yes': 1441 - F-score of about 0.8
    • SE3: 'no': 83583, 'yes': 654 - 0 F-score
    • SE6: 'no': 15865, 'yes': 439 - 0 F-score
    • SE8: 'no': 58076, 'yes': 195 - 0 F-score
  • Tuning experiments are still running.

Change in experiment

  • Will try SMOTE.
  • Also, the default parameter k can be varied in steps of 20, 40, 80.

Results

How to read graphs?

  • These graphs show (tuned - untuned) results.
  • If y = 0, the tuned and untuned results are the same.
  • If y > 0, tuning improved results by that y margin.
  • If y < 0, tuning didn't help and made things worse.

Results

F = 0.3, CR = 0.7, Pop = 10

file

F = 0.7, CR = 0.3, Pop = 10

file

F = 0.3, CR = 0.7, Pop = 30

file

F = 0.7, CR = 0.3, Pop = 30

file

Conclusion

  • Tuning helped for sure in most cases.
  • Termination is based on the number of iterations; they can be increased to get better results at a lower number of overlapping terms, or reduced to make the run faster.
  • On the HPC, it took about 4 hours to generate each graph.

Classification using LDA

Experiment Setup

  • Datasets - Manney generator of Stack Exchange sites; 25 datasets.
  • Running the tuning experiment with 5-term overlap. Select the parameters with the maximum stability score.
  • Find the clusters; each topic is assigned a sequential label (1, 2, 3, ...).
  • Each document is then labelled 1, 2, 3, ... rather than with tags.
  • Run SVM. Binary classification.

We have baseline results for SVM without SMOTE and SVM with SMOTE.
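The setup above can be sketched as an sklearn pipeline (sklearn's `LatentDirichletAllocation` exposes alpha and beta as `doc_topic_prior` and `topic_word_prior`; the tiny corpus and the particular k/alpha/beta values are hypothetical stand-ins for what LDADE would select):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical tuned values standing in for LDADE's output.
k, alpha, beta = 3, 0.85, 0.76

clf = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=k, doc_topic_prior=alpha,
                              topic_word_prior=beta, random_state=1),
    LinearSVC(),  # binary classification on the k doc-topic features
)

docs = ["fault mode state spacecraft", "mode state fault safe",
        "file line buffer error", "buffer file counter error"]
labels = [1, 1, 0, 0]
clf.fit(docs, labels)
```

SMOTE would be applied to the doc-topic feature matrix before the SVM step, as in the baseline comparison.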

Mail with Prof. Mika Mäntylä

  • tf-idf at 5%; they did 50%.
  • More coherent data gives better results.
  • Might need more data preprocessing to include them as stopwords, plus Porter stemming.
  • An R library for LDADE.

Results 06-23

  • Paper review
  • Experiment: Gibbs vs. Online VEM

Results:

  • Good news: tuning helped with Gibbs as well as with VEM.
    file

Coherence of Descriptors

An Analysis of the Coherence of Descriptors in Topic Modeling

[bibtex](@Article{o2015analysis,
title={An analysis of the coherence of descriptors in topic modeling},
author={O’Callaghan, Derek and Greene, Derek and Carthy, Joe and Cunningham, P{'a}draig},
journal={Expert Systems with Applications},
volume={42},
number={13},
pages={5645--5657},
year={2015},
publisher={Elsevier}
})

General:

  • Nothing new beyond #10 - just a couple of variations using PMI.

Measures:

  • Perplexity or held-out likelihood.
  • A distributional semantics measure based on the increasingly popular word2vec tool: each term is represented as a vector in a semantic space, with topic coherence calculated as mean pairwise vector similarity. Cosine similarity, Jaccard similarity, and the Dice coefficient were used.
  • Pointwise Mutual Information (PMI).
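The cosine variant of that measure can be sketched in a few lines (the toy vectors are hypothetical stand-ins for word2vec embeddings; the function name is mine):

```python
import numpy as np
from itertools import combinations

def coherence_cosine(words, vectors):
    """Topic coherence as the mean pairwise cosine similarity of the
    topic's top terms in a semantic vector space."""
    sims = []
    for w1, w2 in combinations(words, 2):
        v1, v2 = vectors[w1], vectors[w2]
        sims.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return float(np.mean(sims))

# Toy 2-d "embeddings": mode and state point the same way, banana doesn't.
vecs = {"mode": np.array([1.0, 0.0]),
        "state": np.array([0.9, 0.1]),
        "banana": np.array([0.0, 1.0])}
coherence_cosine(["mode", "state"], vecs)
```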

Stability of topics

Latent Dirichlet Allocation: Stability and Applications to Studies of User-Generated Content

[bibtex](@inproceedings{koltcov2014latent,
title={Latent dirichlet allocation: stability and applications to studies of user-generated content},
author={Koltcov, Sergei and Koltsova, Olessia and Nikolenko, Sergey},
booktitle={Proceedings of the 2014 ACM conference on Web science},
pages={161--165},
year={2014},
organization={ACM}
})

General:

  • A good paper which shows issues with topic instability.
  • Word-topic and topic-document matrices (probabilities of words appearing in topics and topics appearing in documents).
  • Variational approximations and Gibbs sampling: these algorithms find a local maximum of the joint likelihood function of the dataset.
  • The LDA approach has been further developed with more complex model extensions that use additional parameters and additional information [2, 9, 18, 4].

Problem:

  • In the case of LDA, there are plenty of local maxima, which may lead to instability in the output.
  • The problem of finding the optimal number of clusters.
  • Since these distributions result from the same dataset with the same vocabulary and model parameters, any differences between them are entirely due to the randomness in Gibbs sampling. This randomness affects perplexity variations, word and document ratios, and the reproducibility of the qualitative topical solution.

Old Solutions for stability:

  • A new metric of similarity between topics and a criterion of vocabulary reduction to evaluate stability.
  • The standard numerical evaluation of topic modeling results is to measure perplexity. Perplexity shows how well the topic-word and word-document distributions predict new test samples. The smaller the perplexity, the better (less uniform) the LDA model.
    • Problems with perplexity:
      • the value of perplexity drops as the number of topics grows
      • perplexity depends on the dictionary size
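For concreteness, the standard perplexity definition (not code from the paper) is just the exponentiated negative log-likelihood per held-out word:

```python
import math

def perplexity(held_out_log_likelihood, n_words):
    """Perplexity = exp(-log-likelihood per held-out word): how
    'surprised' the model is by unseen text. Smaller is better."""
    return math.exp(-held_out_log_likelihood / n_words)

# If the model assigns every held-out word probability 1/2,
# perplexity is exactly 2.
perplexity(100 * math.log(0.5), 100)
```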

Evaluation Metric:

  • Symmetric Kullback-Leibler divergence.
  • Compute the correlation between documents from two topic modeling experiments; this correlation does not depend on dictionary size. The method consists of the following steps:
    • construct a bipartite graph based on the two topical solutions;
    • compute the minimal distance between topics in this bipartite graph;
    • compare topics between the two cluster solutions based on the minimal distance.
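The symmetric KL divergence between two topic-word distributions is straightforward to write down (a generic sketch, not the paper's code; the smoothing epsilon is my addition to guard against zero probabilities):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence KL(p||q) + KL(q||p)
    between two word distributions (e.g. the 'same' topic recovered in
    two runs); 0 means identical, larger means the runs disagree."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

sym_kl([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])  # very different topics: large value
```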

Preprocessing step:

  • document and word ratios that show the fraction of words and documents that are actually relevant to specific topics

Review - 04/27/2016

Literature Survey

  • All the literature survey is from the SE domain.
  • Searched Google Scholar for "lda topics stable OR unstable OR coherence" and took the top-cited papers.
  • 7 out of 9 papers stated that topics are unstable, so we went for manual validation of topics before further experiments.
  • 1 paper gave a very strong statement that LDA is stable; the data is not available to verify their results.
  • Another LDA toolkit mentioned is GibbsLDA++.
  • Some have talked about playing with different configurations. The common parameters are k, a, b, i. (BASICALLY TALKING ABOUT TUNING)
  • LDA is non-deterministic, so not many people have bothered about how their results might vary.
  • Refer this for more details

DE Experiment

  • Implemented DE with CR = 0.3 and mutation factor F = 0.7.
  • Termination criterion: number of iterations.
  • Tuning parameters: k = (10, 30), a = (0, 1), b = (0, 1), based on the literature review.
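The DE loop with those settings can be sketched as DE/rand/1/bin (generic sketch with a toy objective, not the LDADE code; the real fitness would run LDA and score term-ranking stability):

```python
import random

def de(fitness, bounds, f=0.7, cr=0.3, pop_size=10, iters=50, seed=0):
    """Minimal DE/rand/1/bin: for each target vector, build a mutant
    a + F*(b - c) from three distinct others, binomially cross it with
    the target at rate CR, and keep the better (lower-fitness) of the two."""
    rng = random.Random(seed)
    dim = len(bounds)
    clip = lambda v, i: min(max(v, bounds[i][0]), bounds[i][1])
    pop = [[rng.uniform(*bounds[i]) for i in range(dim)]
           for _ in range(pop_size)]
    for _ in range(iters):
        for t in range(pop_size):
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != t], 3)
            trial = list(pop[t])
            jrand = rng.randrange(dim)           # guarantee one mutant gene
            for i in range(dim):
                if rng.random() < cr or i == jrand:
                    trial[i] = clip(a[i] + f * (b[i] - c[i]), i)
            if fitness(trial) < fitness(pop[t]):
                pop[t] = trial
    return min(pop, key=fitness)

# Tuning ranges from the literature review: k in (10, 30), a and b in (0, 1),
# minimizing a toy objective in place of the real stability score.
best = de(lambda x: (x[0] - 20) ** 2 + x[1] ** 2 + x[2] ** 2,
          bounds=[(10, 30), (0, 1), (0, 1)])
```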

Results

  • Tuning helped in various datasets for a higher number of overlapping terms, or at least stayed the same as the default parameters.
  • No change from tuning a and b alone.
  • A higher k value improved stability.
  • A lower b improved stability.
    file

Meeting - 06/08

Stack Results.

file

New Citemap Results

#20

Paper Related graphs

Evaluation criteria

  • The score remains constant up to 200 evaluations, goes up at 300, and then stays constant or goes down.
    file

Frequency of parameters per max score

  • PITS A dataset. For each iteration i, with each label, the maximum score is selected; these are the numbers of parameter sets which produced that stability score. Just for 2 labels.

file

Boxplot graph showing how the parameter ranges vary.

Citemap Results

Configuration:

  • Untuned: 10 repeats; the median stability score is taken. Default parameters.
  • Tuned: on Spark, 300 evaluations.

Results:

  • A positive delta means tuning is better.
  • Different combinations of F, CR, and Pop:
    file

Topics with the maximum overlap.

File structure - K_22_a_0.847433736937_b_0.763774618977.txt

K=22, a=0.847433736937, b = 0.763774618977

Run: 0
Topic 0: optim method solut factor advantag softwar guarante produc known tool 
Topic 1: configur templat variabl time kernel linux patch schedul spreadsheet compil 
Topic 2: workshop intern confer program messag member ics list summari review 
Topic 3: softwar idf invers analysi engin objectori abstract star program autom 
Topic 4: softwar analysi redocument objectori engin program tool autom experi abstract 
Topic 5: panel law equat length softwar debat demet storyboard tell qualifi 
Topic 6: softwar objectori analysi abstract engin tool autom design program use 
Topic 7: objectori trait softwar lisp analysi tool engin program design experi 
Topic 8: softwar analysi tool abstract engin design autom experi program use 
Topic 9: test program use techniqu gener analysi execut approach case algorithm 
Topic 10: code sourc detect clone open file similar base type techniqu 
Topic 11: subclass umpl substitut superclass analysi abstract softwar umplif objectori autom 
Topic 12: softwar analysi tool program abstract autom objectori engin design use 
Topic 13: softwar analysi objectori autom engin abstract experi design use tool 
Topic 14: ide eclips plug abstract plugin netbean softwar array framework chart 
Topic 15: model languag specif formal use aspect design concern implement base 
Topic 16: comput context network mobil resourc awar applic devic distribut platform 
Topic 17: softwar analysi autom abstract engin tool objectori use evolut design 
Topic 18: applic compon architectur web servic user transform integr engin framework 
Topic 19: softwar use chang tool approach program design inform code sourc 
Topic 20: softwar develop project engin process research studi qualiti bug use 
Topic 21: objectori softwar design analysi abstract engin program autom tool realtim 
Run: 1
Topic 0: code detect sourc clone refactor pattern tool chang base studi 
Topic 1: mainten softwar matter correct massiv tell taxonomi experiment comprehens cost 
Topic 2: mobil devic comment data net game load analyst tune network 
Topic 3: busi inform compani technolog workflow corpor lead divis comprehens execut 
Topic 4: robot race challeng softwar later intellig insight pipelin win artifici 
Topic 5: softwar experi analysi autom visual framework use model tool design 
Topic 6: scenario chart sequenc stereotyp msc reactiv messag impli visual lsc 
Topic 7: objectori model metamodel ontolog framework evolut mda omg eventu softwar 
Topic 8: slice static dynam rang wide case rel larg method propos 
Topic 9: softwar develop chang studi use bug project sourc report result 
Topic 10: softwar model develop design process architectur use tool requir support 
Topic 11: softwar analysi experi autom model design visual realtim framework abstract 
Topic 12: softwar remodular analysi visual autom music multidimension experi assess abstract 
Topic 13: languag specif model formal transform semant program verif gener use 
Topic 14: safeti certif proof critic complianc iso siemen certifi softwar ambigu 
Topic 15: program test use techniqu analysi gener approach execut base present 
Topic 16: secur protocol vulner access network control polici analysi schedul attack 
Topic 17: peer componentbas node cach softwar volatil har analysi netbean framework 
Topic 18: aspect point concern orient modular crosscut aop messag join aspectj 
Topic 19: layout softwar visualis visual autom analysi experi framework design distribut 
Topic 20: applic web data revers legaci extract engin queri databas tool 
Topic 21: softwar engin research workshop commun comput discuss intern industri challeng 
Run: 2
Topic 0: optim method solut deviat advantag guarante factor produc known pool 
Topic 1: workshop research intern engin comput track tutori topic session confer 
Topic 2: signatur alert match massiv defens notif worm softwar jone smoke 
Topic 3: safeti critic certif complianc hardwar healthcar ambigu certifi nasa mission 
Topic 4: inspect review commit kernel linux patch author driver port peer 
Topic 5: softwar experi autom componentbas engin increment assess use largescal case 
Topic 6: design object compon class orient pattern aspect servic featur method 
Topic 7: test code use techniqu approach chang sourc detect bug result 
Topic 8: configur word assert identifi scheme macro artefact preprocessor split expand 
Topic 9: softwar autom componentbas experi engin tool increment use largescal environ 
Topic 10: inform sourc extract open repositori data busi list visual retriev 
Topic 11: vulner schedul optim array real time buffer overflow alloc trade 
Topic 12: model languag specif use gener tool approach base requir implement 
Topic 13: robot win grand softwar autom intellig componentbas race autonom home 
Topic 14: softwar autom recoveri largescal componentbas experi engin tool case realtim 
Topic 15: softwar assess experi autom tool componentbas use largescal engin visual 
Topic 16: program analysi dynam static execut use slice algorithm graph techniqu 
Topic 17: applic web user databas interact client interfac migrat data approach 
Topic 18: softwar autom assess componentbas experi engin use environ largescal studi 
Topic 19: softwar develop engin process architectur project use mainten studi product 
Topic 20: smell spreadsheet end bad templat subject speed tabl cell formula 
Topic 21: context awar inconsist conflict merg resolv ide resolut revis chang 
Run: 3
Topic 0: model check databas constraint data logic queri tempor satisfi schema 
Topic 1: revers grammar fact word pars extract reengin engin parser cobol 
Topic 2: requir design method product softwar engin goal optim process support 
Topic 3: platform mobil devic android app permiss micro portabl bytecod phone 
Topic 4: objectori omnipres model softwar framework largescal tool visual use object 
Topic 5: graph scenario concurr interact behavior sequenc specif event behaviour monitor 
Topic 6: code chang sourc softwar studi evolut clone develop detect open 
Topic 7: inform legaci busi migrat reengin process recov workflow technolog compani 
Topic 8: architectur compon softwar view adapt decis configur distribut support environ 
Topic 9: softwar model framework objectori use tool analysi aspectori panel abstract 
Topic 10: visual metaphor music boundari analyt hill climb largescal softwar overlap 
Topic 11: applic analysi web use secur flow access function user detect 
Topic 12: formal properti specif verif verifi composit infer reason refin state 
Topic 13: softwar develop engin process research project servic manag mainten paper 
Topic 14: bug report perform time use predict measur defect develop data 
Topic 15: class refactor programm maintain parallel smell conflict improv ide merg 
Topic 16: test techniqu gener case execut use fault input suit autom 
Topic 17: softwar matrix model objectori sla tool agreement framework aspectori largescal 
Topic 18: program dynam static slice analysi depend algorithm comput condit comprehens 
Topic 19: model use approach tool languag base paper implement present gener 
Topic 20: templat confer member compil chair metaprogram calculu welcom committe debugg 
Topic 21: remodular eye idf layout invers model hyper softwar movement framework 
Run: 4
Topic 0: workshop intern review research track program messag confer list session 
Topic 1: model use program languag specif gener design approach tool base 
Topic 2: context parallel conflict inconsist awar comput merg resolv resolut middlewar 
Topic 3: procedur method softwar cobol ownership node layout solut instruct encapsul 
Topic 4: visualis debugg dimension breakpoint emul comprehens workbench softwar multidimension objectori 
Topic 5: compon product featur applic line reus secur approach configur softwar 
Topic 6: softwar realtim autom framework analysi objectori use tool experi architectur 
Topic 7: macro artefact preprocessor stream hidden preprocess expand expans scheme actor 
Topic 8: factori renov constructor softwar jstar autom objectori analysi experi model 
Topic 9: kernel devic schedul linux driver buffer window overflow array interrupt 
Topic 10: spreadsheet end smell decomposit dataflow hierarch formula stabil speed modeldriven 
Topic 11: pair propag compat renam evolut micro prioriti late makefil programm 
Topic 12: visual extract databas data inform sourc tool fact xml schema 
Topic 13: program analysi dynam static slice algorithm precis comput flow execut 
Topic 14: objectori softwar analysi use autom framework tool design experi model 
Topic 15: objectori softwar autom analysi framework legaci use largescal tool evolut 
Topic 16: web page browser string constant html php javascript ajax server 
Topic 17: anti antipattern linguist scc certif pattern imped occurr greater softwar 
Topic 18: softwar develop engin process architectur project tool use research mainten 
Topic 19: test code use sourc techniqu chang approach softwar result studi 
Topic 20: softwar analysi tool use autom objectori framework comprehens evolut experi 
Topic 21: optim search insight yield near soft solut sbse softwar engin 
Run: 5
Topic 0: smell word macro bad identifi cognit split expand taxonomi renam 
Topic 1: architectur compon applic framework base approach web interfac user use 
Topic 2: tool softwar mainten assess use studi approach experi evolut realtim 
Topic 3: ownership restructur organiz domin spi measur surpris incom owner encapsul 
Topic 4: factori constructor softwar mainten experi cell tool molecular inspector use 
Topic 5: test techniqu case gener execut fault use input effect suit 
Topic 6: period month churn seri forecast firefox foundat chrome softwar latenc 
Topic 7: program analysi dynam static slice algorithm use graph depend flow 
Topic 8: odc softwar mainten studi experi use tool recoveri support process 
Topic 9: softwar engin research revers develop industri practic workshop experi discuss 
Topic 10: languag design object model tool orient class aspect transform use 
Topic 11: fuzzi imperfect toss softwar mainten instabl reverseengin altogeth sdk etp 
Topic 12: calibr spc softwar process spreadsheet use mainten experi variat tool 
Topic 13: develop process project servic manag applic requir softwar product technolog 
Topic 14: port peer breakpoint item adjust remedi estim bia softwar inspect 
Topic 15: pair evolut empir ecosystem growth law softwar contributor studi regular 
Topic 16: reengin method open sourc list xml tool patch mail oss 
Topic 17: inform lead corpor compani divis comprehens analyst iso vice consolid 
Topic 18: tool softwar mainten realtim use objectori assess experi recoveri studi 
Topic 19: code softwar use sourc chang develop approach studi result tool 
Topic 20: anti antipattern remodular linguist cluster occurr softwar mainten scc finegrain 
Topic 21: model specif use check formal properti state approach gener verif 
Run: 6
Topic 0: softwar architectur develop process use compon approach mainten support paper 
Topic 1: objectori softwar analysi framework autom tool largescal increment use abstract 
Topic 2: applic secur web profil vulner synthesi server polici time real 
Topic 3: analysi objectori softwar autom abstract framework aspectori componentbas realtim largescal 
Topic 4: inform busi compani legaci workflow lead corpor technolog divis analyst 
Topic 5: softwar analysi objectori autom framework use abstract tool experi approach 
Topic 6: code sourc bug use chang detect studi develop softwar identifi 
Topic 7: softwar engin develop research project servic web applic revers technolog 
Topic 8: reconstruct decompil obfusc readabl ast reverseengin polymorph birthmark standalon bytecod 
Topic 9: parallel concurr platform perform sequenti net hardwar distribut thread multi 
Topic 10: program test use gener techniqu analysi execut approach present base 
Topic 11: visualis softwar objectori dimension analysi shrimp tool comprehens autom visual 
Topic 12: regular law equat softwar length demet autom smalltalk analysi friend 
Topic 13: optim solut method factor search comparison advantag softwar produc known 
Topic 14: malwar anti breakpoint defens analysi emul worm mitig card unpack 
Topic 15: remot factori renov softwar notif constructor metalanguag sent usabl analysi 
Topic 16: analysi softwar autom objectori tool framework experi use largescal program 
Topic 17: softwar autom analysi objectori tool largescal framework experi use approach 
Topic 18: machin binari translat spreadsheet virtual packag compil window instruct assembl 
Topic 19: adapt dynam runtim run time self assur failur fault reconfigur 
Topic 20: softwar analysi objectori autom use tool experi abstract program framework 
Topic 21: model design languag specif use class object formal pattern transform 
Run: 7
Topic 0: chang softwar evolut manag mainten evolv configur version support impact 
Topic 1: develop bug sourc project report code softwar studi open use 
Topic 2: program code use tool sourc approach analysi techniqu java pattern 
Topic 3: optim solut method profil schedul guarante produc speed advantag factor 
Topic 4: cognit anti antipattern occurr linguist taxonomi softwar scc visual imped 
Topic 5: trait actor gpu scala autom analysi basset softwar tool use 
Topic 6: alert signatur immut defens mitig jone autom analysi mutabl worm 
Topic 7: intens inter incorpor anoth intric softwar novel stream green depend 
Topic 8: applic web user secur interfac servic databas migrat client interact 
Topic 9: softwar analysi visual autom experi use largescal design objectori reengin 
Topic 10: pair agil review confer program member track accept panel regular 
Topic 11: model specif use design gener approach tool languag base requir 
Topic 12: featur aspect composit modular line product orient concern increment compos 
Topic 13: visual softwar star analysi gxl tool experi shrimp autom exchang 
Topic 14: malwar visualis behaviour obfusc harm growth certifi emul malici viru 
Topic 15: dynam static slice analysi condit precis rang execut case larg 
Topic 16: busi legaci inform compani process ibm corpor reengin cobol technolog 
Topic 17: test softwar use techniqu studi case result approach qualiti effect 
Topic 18: objectori redocument analysi softwar visual tool mainten experi design use 
Topic 19: softwar autom analysi visual experi objectori reengin largescal recoveri use 
Topic 20: analysi realtim softwar visual autom use data mainten tool experi 
Topic 21: softwar engin develop architectur research compon distribut design process challeng 
Run: 8
Topic 0: visualis softwar autom analysi visual abstract largescal realtim componentbas framework 
Topic 1: binari secur licens attack complianc malwar permiss free protect enforc 
Topic 2: privaci threat anonym regul mitig softwar dca analysi autom law 
Topic 3: cot shelf softwar analysi autom componentbas stateflow notif framework stand 
Topic 4: applic web distribut servic class environ user perform deploy develop 
Topic 5: model specif use compon languag architectur tool base approach requir 
Topic 6: bug report predict defect fix develop project use repositori file 
Topic 7: microblog softwar dissemin twitter million realtim analysi autom visual observatori 
Topic 8: optim solut method advantag factor produc guarante support known tool 
Topic 9: analysi softwar realtim visual architectur componentbas autom experi objectori use 
Topic 10: code sourc pattern design detect clone program refactor tool use 
Topic 11: negoti softwar win analysi autom visual abstract architectur experi componentbas 
Topic 12: edit script systemat scm interrupt ident induc umpl autom session 
Topic 13: artefact actor softwar hidden scala messag shrimp mcc basset analysi 
Topic 14: softwar develop chang use studi process mainten project paper evolut 
Topic 15: softwar engin research revers commun workshop challeng discuss industri practic 
Topic 16: test program use techniqu analysi gener execut algorithm approach case 
Topic 17: orient aspect object concern program modular point modul mechan separ 
Topic 18: schema format exchang fact extractor organis standard softwar testabl phase 
Topic 19: legaci busi reengin inform compani migrat databas technolog process workflow 
Topic 20: flaw objectori mathemat softwar analysi prey predat popul ssa use 
Topic 21: softwar analysi use tool autom abstract visual reengin experi largescal 
Run: 9
Topic 0: bug report code sourc api fix detect use predict approach 
Topic 1: design architectur compon softwar requir pattern framework approach base model 
Topic 2: agreement softwar sla analysi visual realtim tool experi largescal abstract 
Topic 3: optim solut method advantag visual layout softwar produc guarante known 
Topic 4: anchor adjust softwar matter cognit tool analysi visual wikipedia use 
Topic 5: test techniqu gener case execut use fault approach input autom 
Topic 6: factori renov constructor softwar scaffold proprietari experi analysi mode largescal 
Topic 7: process servic legaci technolog busi reengin migrat comprehens inform environ 
Topic 8: regular wrapper length law lexic sourcecod equat softwar extrapol intension 
Topic 9: softwar analysi experi realtim tool abstract largescal visual distribut autom 
Topic 10: engin softwar research revers commun workshop challeng comput discuss industri 
Topic 11: realtim softwar tool analysi largescal abstract experi visual model use 
Topic 12: model applic use languag specif tool gener approach web base 
Topic 13: object class metric orient method measur use concept coupl code 
Topic 14: softwar develop chang studi code use project sourc mainten process 
Topic 15: platform mobil devic deploy applic network resourc driver hardwar android 
Topic 16: softwar eve abstract interact tool autom analysi model experi largescal 
Topic 17: softwar use analysi realtim abstract experi tool autom largescal support 
Topic 18: program code analysi use java dynam refactor type static slice 
Topic 19: spreadsheet end formula microblog bidirect templat modeldriven excel dataflow cell 
Topic 20: classif classifi categori taxonomi csp capac item analyst notif orthogon 
Topic 21: configur wide rang conflict static merg rel larg case applic 

Runtime: --- 425.590034962 seconds ---

Score: 0.9

Credibility Of LDA

IDEA:

ACTUAL — document-topic distribution (dominant topic per document):

|      | T1 | T2 | T3 | ... |
|------|----|----|----|-----|
| Doc1 |    |    |    |     |
| Doc2 |    |    |    |     |
| Doc3 |    |    |    |     |

PREDICTED — selected from the dominant topic of the document-topic distribution:

|      | W1 | W2 | W3 | ... |
|------|----|----|----|-----|
| Doc1 |    |    |    |     |
| Doc2 |    |    |    |     |
| Doc3 |    |    |    |     |

**According to the literature, if a document is hard-assigned to one dominant 
topic, the top words from that dominant topic should appear in the 
actual document. If they do not, then either:
 - the probability of the dominant topic is very low, and another topic might 
be a better candidate for dominance,
- or the top words were poorly selected; better word weights could identify 
the same dominant topic.**

Experiment:

  • Once the top n words are selected from each topic, each topic is represented by those n words.
  • The dominant topic (from the document-topic distribution) is selected to represent each document; we call that the actual label.
  • For each topic, now represented by its n words, we count how many ('m') of those n words occur in the document. The topic whose words occur most often becomes the predicted label for that document.

We now have x documents. For example, with x=4 documents [D1,D2,D3,D4] and k=3 topics:
Actual    = [1,1,2,0]
Predicted = [1,0,2,0]
The score is the fraction of agreements: 3/4 = 0.75.
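The prediction and scoring steps above can be sketched as follows. This is a minimal illustration with hypothetical helper names (`predict_topic`, `credibility_score`) and toy data, not the repo's actual implementation; it assumes each topic is represented by its top-n words and that ties are broken by the first topic with the maximum overlap.

```python
def predict_topic(doc_words, topic_top_words):
    """Return the index of the topic sharing the most words with the document."""
    overlaps = [len(set(doc_words) & set(words)) for words in topic_top_words]
    return overlaps.index(max(overlaps))  # first topic wins ties

def credibility_score(actual, predicted):
    """Fraction of documents whose predicted dominant topic matches the actual one."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / float(len(actual))

if __name__ == "__main__":
    # Toy topics, each represented by its top (stemmed) words
    topics = [["test", "fault", "case"],
              ["model", "languag", "specif"],
              ["code", "sourc", "clone"]]
    docs = [["model", "specif", "check"],
            ["test", "model", "languag"],
            ["code", "clone", "detect"],
            ["test", "case", "gener"]]
    actual = [1, 1, 2, 0]
    predicted = [predict_topic(d, topics) for d in docs]
    print(predicted)
    print(credibility_score(actual, predicted))
```

The explicit `float()` division keeps the score correct under both Python 2 (which the repo targets) and Python 3.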

Results:

  • Higher scores are better.

Conclusion:

  • Tuned LDA with the top 7 words performs much better than untuned (default, k=10) with the top 7 words.
  • Tuned LDA with the top 7 words performs as well as or better than untuned (default, k=10) with the top 10 words.
  • With tuning, the top 7 words better define each topic.

To Dos

  • T1 = {1,2} T2 = {3,4,5}. Need to see T1 - T2 clusters. For all 6 different projects.
  • Stability of topics.

on Wikipedia sets

  • Two graphs: we call topics stable when n% of their terms match, across m% of topic matches, over l runs.
  • Define the stability of topics in LDA.
  • Increasing the number of topics made results more stable. With too few topics there can be multiple hills (local optima), and different runs find different hills across the dataset.
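One way the "n% matching terms across runs" notion of stability could be computed is sketched below. This is an assumption-laden illustration (hypothetical helpers `topic_matched` and `stability`), not the repo's actual metric: a topic from one run counts as matched if some topic in another run shares at least a `threshold` fraction of its top words, and stability is the fraction of matched topics over all run pairs.

```python
from itertools import combinations

def topic_matched(words_a, words_b, threshold=0.5):
    """True if the overlap covers at least `threshold` of topic A's top words."""
    overlap = len(set(words_a) & set(words_b))
    return overlap / float(len(words_a)) >= threshold

def stability(runs, threshold=0.5):
    """Fraction of topics (over all ordered run pairs) that find a match."""
    matched = total = 0
    for run_a, run_b in combinations(runs, 2):
        for topic in run_a:
            total += 1
            if any(topic_matched(topic, other, threshold) for other in run_b):
                matched += 1
    return matched / float(total)

if __name__ == "__main__":
    # Two toy runs, each with two topics of four top words
    runs = [
        [["test", "fault", "case", "gener"], ["model", "languag", "specif", "formal"]],
        [["test", "case", "suit", "input"], ["model", "specif", "check", "verif"]],
    ]
    print(stability(runs, threshold=0.5))
```

Raising `threshold` (requiring a larger term overlap) makes the criterion stricter, which is one lever for the "n% matching terms" definition above.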

Meeting - 06/16

5 terms overlap


| Dataset \ Evaluations | 100  | 200  | 300  | 400  |
|-----------------------|------|------|------|------|
| PitsA                 | 0.9  | 0.9  | 1.0  | 1.0  |
| PitsB                 | 0.9  | 0.9  | 0.9  | 1.0  |
| PitsC                 | 1.0  | 1.0  | 1.0  | 1.0  |
| PitsD                 | 1.0  | 1.0  | 1.0  | 1.0  |
| PitsE                 | 0.9  | 0.9  | 1.0  | 1.0  |
| PitsF                 | 0.9  | 0.9  | 0.9  | 0.9  |
| Citemap               | 0.67 | 0.67 | 0.77 | 0.77 |

Read this
