hlda
Hierarchical Latent Dirichlet Allocation
This project is forked from mkneesh/hlda.
GAM 0.2 0.2
DEPTH 3
ETA 3.2 0.025 0.0005
GEM 0.1 100
SCALING 0.2 50
SAMPLE_ETA 0
SAMPLE_GEM 0
Path 34/11/9/6/5/5/5/4/4 54
Word allocation 2116/437/45
Previously we assumed that, as long as we enabled sampling for GEM and ETA, we would get an ideal word allocation and tree structure, and that the corpus model would naturally reach an optimum. In practice the results differ considerably; the main cause seems to be the very limited number of iterations.
ETA 5.2 0.025 0.005
GEM 0.4 100
SAMPLE_ETA 1
SAMPLE_GEM 1
word allocation: 1399/526/673
Path 1 1 1 1 1 1 1 1 1
Final score and sampling results:
Score 58359
ETA 0.67 1.458 1.459
GEM_MEAN 0.57
GEM_SCALE 7.17
The pi parameter (GEM_SCALE) controls how tightly the word allocation follows the m parameter (GEM_MEAN).
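A minimal sketch of the GEM(m, pi) stick-breaking prior over levels (illustrative only, not the project's C code) shows this relationship: m sets the expected fraction of words per level, while pi controls how tightly individual draws concentrate around that mean.

```python
import numpy as np

# GEM(m, pi) stick-breaking over a fixed depth (illustrative sketch).
# m = GEM_MEAN, pi = GEM_SCALE; each stick V_j ~ Beta(m*pi, (1-m)*pi).
def gem_level_proportions(m, pi, depth, rng):
    v = rng.beta(m * pi, (1.0 - m) * pi, size=depth)
    v[-1] = 1.0  # the last stick takes whatever remains
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining  # proportions sum to 1

rng = np.random.default_rng(0)
for pi in (1.0, 100.0):
    draws = np.array([gem_level_proportions(0.4, pi, 3, rng)
                      for _ in range(5000)])
    print(f"pi={pi}: mean={np.round(draws.mean(0), 2)}, "
          f"std={np.round(draws.std(0), 2)}")
```

Both pi values give the same mean allocation (about m, m(1-m), (1-m)^2 for the three levels, fixed by m alone); a larger pi only shrinks the spread of draws around that mean, i.e. how strictly m is enforced.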
ETA 5.2 0.025 0.005
GEM 0.4 [300]
word allocation 1327/774/497
Path 13 10 8 4 4 4 4
ETA 5.2 0.025 0.005
GEM 0.4 [500]
word allocation 1369/740/489
Path 14 9 6 5 4 4 4
ETA 5.2 0.025 0.005
GEM 0.4 [10]
word allocation 1230/838/530
Path 12 10 6 5 4 4 4
ETA 5.2 0.025 0.005
GEM 0.4 [2000]
word allocation 1338/762/498
Path 14 7 7 6 5 4 4
1. Experiments with GEM_MEAN and GEM_SCALE
GEM_MEAN 0.5 GEM_SCALE 100.
Level   Iter 10000   Iter 30000   Iter 50000   Iter 80000
0       1217         1152         1131         1175
1       666          757          733          718
2       715          689          734          705
The table above shows the per-level word allocation at different iteration counts.
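As a quick sanity check on the table above, the per-level fractions can be computed directly (numbers copied from the table; a minimal NumPy sketch):

```python
import numpy as np

# Rows: levels 0-2; columns: 10k/30k/50k/80k iterations (from the table above).
alloc = np.array([[1217, 1152, 1131, 1175],
                  [ 666,  757,  733,  718],
                  [ 715,  689,  734,  705]])
print(alloc.sum(axis=0))   # every column totals 2598 words
frac = alloc / alloc.sum(axis=0)
print(np.round(frac, 3))
```

The per-level fractions move by at most about three percentage points across iteration counts, consistent with the observation below that iteration count has little effect on the allocation.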
2. Experiments with ETA
Not only do GEM_MEAN and GEM_SCALE influence the word allocation; the ETA setting also has a great impact on the allocation across levels. The number of iterations, by contrast, seems to have little influence on the allocation.
GEM_MEAN 0.4 GEM_SCALE 100
ETA Allocation
3.2 0.025 0.005 1357/485/756
1.2 0.025 0.005 1403/722/473
5.2 0.025 0.005 1344/780/474
The table above shows the allocation under different ETA settings.
3. Possible Reason for Missing mode.levels Files
ETA 3.2 0.025 0.0005
GAM 1.0 1.0
SCALING_SHAPE 1.0
SCALING_SCALE 50
Path 15/12/8/7/5/5/5/4/4/3 65
Word Allocation 2117/447/34
Path 12/11/10/8/8/7/7/6/6/6/6/5/4 50
Word Allocation 1423/757/418
Path 14/12/12/10/6/6/6/6/6/6/5/4 39
Word Allocation 1146/703/749
As Issue #2 mentioned, for a fixed ETA setting, making the GEM_MEAN value smaller causes the mode.levels file to go missing; when we increase the ETA value, mode.levels appears again. There therefore seems to be a joint effect of these parameters on the final word allocation and tree structure.
ETA 5.2 0.25 0.005
GEM_MEAN   mode.levels   Word Allocation
0.15       YES           2271/291/36
0.25       YES           1640/687/271
0.75       YES           786/650/1162
GEM_MEAN 0.4 GEM_SCALE 100
ETA 5.2 0.025 0.05
Word Allocation 1267/749/582
level topics doc_no/word_no
0 1 147/1267
1 9 135/728 5/9 1/2 1/2 1/1
2 44 48/222 23/110 13/43 12/65 6/19
ETA 5.2 0.025 0.005
Word Allocation 1288/746/564
level topics doc_no/word_no
0 1 147/1288
1 12 136/726 1/4 1/0 1/1 1/2
2 49 30/110 16/75 13/47 8/42 7/28 7/20
ETA 5.2 0.025 0.5
Word Allocation 1335/901/362
level topics doc_no/word_no
0 1 147/1335
1 13 134/877 2/12 1/0 1/4 1/2
2 93 6/23 6/17 6/17 4/10 4/13 4/11
ETA 5.2 0.025 0.5
Word Allocation 1342/914/342
level topics doc_no/word_no
0 1 147/1342
1 11 136/882 2/6 1/5 1/4 1/3 1/0
2 103 6/9 4/10 4/7 3/15 3/11 3/14
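A back-of-the-envelope calculation shows why a large third ETA component empties level 2. This mimics the word-likelihood term of a collapsed Gibbs sampler, not the project's exact code, and all counts are hypothetical:

```python
# Word term for assigning word w to level l, roughly
#   (n_lw + eta_l) / (n_l + V * eta_l),
# where n_lw is w's count in the level-l topic, n_l is that topic's
# total word count, and V is the vocabulary size (assumed here).
V = 5000

def word_term(n_lw, n_l, eta):
    return (n_lw + eta) / (n_l + V * eta)

# A word with 30 occurrences in a level-2 topic of 200 words:
for eta2 in (0.005, 0.5):
    print(eta2, round(word_term(30, 200, eta2), 4))
```

With eta = 0.005 the term is about 0.133; at eta = 0.5 it falls to about 0.011, so a word that fits a deep topic loses most of its advantage there. This is consistent with the runs above, where raising the third ETA component to 0.5 pushed words out of level 2 (564 → 362) and fragmented it into roughly 100 topics.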
For a long time I assumed that the ETA setting was the reason the mode.levels file disappears. However, the following experiments refuted that assumption.
ETA                mode.levels
1.2 0.025 0.0005   YES
1.2 0.025 0.005    YES
0.2 0.025 0.005    YES
0.2 0.25 0.005     YES
2.2 0.25 0.005     YES
All of the above were run with GEM_MEAN 0.5 and GEM_SCALE 100.
With ETA fixed at 2.2 0.25 0.005, we varied GEM_MEAN:
GEM_MEAN   mode.levels
0.15       NO
0.25       NO
0.35       YES
0.5        YES
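To make these presence checks less error-prone, a small script can scan the output directories and report which runs never wrote mode.levels. This assumes a layout of one output directory per run under a common root, which is not dictated by the original notes:

```python
import os

def runs_missing_mode_levels(root):
    """Return run directories under `root` that lack a mode.levels file."""
    missing = []
    for run in sorted(os.listdir(root)):
        run_dir = os.path.join(root, run)
        if os.path.isdir(run_dir) and not os.path.exists(
                os.path.join(run_dir, "mode.levels")):
            missing.append(run)
    return missing

# Example: print(runs_missing_mode_levels("runs"))
```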
Previously we always tried sampling GEM, ETA, and GAM while keeping the other parameters unchanged, and we always found that the top score and mode were almost the same. So we tried changing the SCALING parameters:
ETA 0.2 2.5 0.5
GAM 1.0 1.0
GEM_MEAN 0.1
GEM_SCALE 100
SCALING_SHAPE 1.0
SCALING_SCALE 0.5
SAMPLE_ETA 1
SAMPLE_GEM 1
word allocation 0/0/2598
Path 114 30 2 1
Score -33.9
ETA 1.517 1.517 1.097
GEM_MEAN 1.0
GEM_SCALE 6.967
Iter 126
ETA 0.2 2.5 0.5
SAMPLE_ETA 1
SAMPLE_GEM 1
SCALING_SHAPE 1.0
SCALING_SCALE 100
word allocation 1406/526/666
Path 1 1 1 1 1 1 1
Final Score and Sampling results:
Score 58710
ETA 0.7443 1.459 1.456
GEM 0.573 8.099