hlda
Hierarchical Latent Dirichlet Allocation
This project is forked from mkneesh/hlda.
GAM 0.2 0.2
DEPTH 3
ETA 3.2 0.025 0.0005
GEM 0.1 100
SCALING 0.2 50
SAMPLE_ETA 0
SAMPLE_GEM 0
Path 34/11/9/6/5/5/5/4/4 54
Word allocation 2116/437/45
Previously we assumed that, as long as we enabled sampling for GEM and ETA, we would get an ideal word allocation and tree structure, and that the corpus model would naturally reach an optimum. In practice the results differ considerably; the main cause seems to be the very limited number of iterations.
ETA 5.2 0.025 0.005
GEM 0.4 100
SAMPLE_ETA 1
SAMPLE_GEM 1
word allocation: 1399/526/673
Path 1 1 1 1 1 1 1 1 1
Final score and sampling results:
Score 58359
ETA 0.67 1.458 1.459
GEM_MEAN 0.57
GEM_SCALE 7.17
The pi parameter (GEM_SCALE) controls how tightly the word allocation follows the m parameter (GEM_MEAN).
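A minimal sketch of the GEM(m, pi) stick-breaking prior over levels (illustrative only, not the project's C code) shows this relationship: m sets the expected fraction of words per level, while pi controls how tightly individual draws concentrate around that mean.

```python
import numpy as np

# GEM(m, pi) stick-breaking over a fixed depth (illustrative sketch).
# m = GEM_MEAN, pi = GEM_SCALE; each stick V_j ~ Beta(m*pi, (1-m)*pi).
def gem_level_proportions(m, pi, depth, rng):
    v = rng.beta(m * pi, (1.0 - m) * pi, size=depth)
    v[-1] = 1.0  # the last stick takes whatever remains
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining  # proportions sum to 1

rng = np.random.default_rng(0)
for pi in (1.0, 100.0):
    draws = np.array([gem_level_proportions(0.4, pi, 3, rng)
                      for _ in range(5000)])
    print(f"pi={pi}: mean={np.round(draws.mean(0), 2)}, "
          f"std={np.round(draws.std(0), 2)}")
```

Both pi values give the same mean allocation (about m, m(1-m), (1-m)^2 for the three levels, fixed by m alone); a larger pi only shrinks the spread of draws around that mean, i.e. how strictly m is enforced.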
ETA 5.2 0.025 0.005
GEM 0.4 [300]
word allocation 1327/774/497
Path 13 10 8 4 4 4 4
ETA 5.2 0.025 0.005
GEM 0.4 [500]
word allocation 1369/740/489
Path 14 9 6 5 4 4 4
ETA 5.2 0.025 0.005
GEM 0.4 [10]
word allocation 1230/838/530
Path 12 10 6 5 4 4 4
ETA 5.2 0.025 0.005
GEM 0.4 [2000]
word allocation 1338/762/498
Path 14 7 7 6 5 4 4
1. Experiments with GEM_MEAN and GEM_SCALE
GEM_MEAN 0.5 GEM_SCALE 100.
Level   Iter 10000   Iter 30000   Iter 50000   Iter 80000
0       1217         1152         1131         1175
1       666          757          733          718
2       715          689          734          705
The table above shows the per-level word allocation at different iteration counts.
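As a quick sanity check on the table above, the per-level fractions can be computed directly (numbers copied from the table; a minimal NumPy sketch):

```python
import numpy as np

# Rows: levels 0-2; columns: 10k/30k/50k/80k iterations (from the table above).
alloc = np.array([[1217, 1152, 1131, 1175],
                  [ 666,  757,  733,  718],
                  [ 715,  689,  734,  705]])
print(alloc.sum(axis=0))   # every column totals 2598 words
frac = alloc / alloc.sum(axis=0)
print(np.round(frac, 3))
```

The per-level fractions move by at most about three percentage points across iteration counts, consistent with the observation below that iteration count has little effect on the allocation.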
2. Experiments with ETA
Not only do GEM_MEAN and GEM_SCALE influence the word allocation; the ETA setting also has a great impact on the allocation across levels. The number of iterations, by contrast, seems to have little influence on the allocation.
GEM_MEAN 0.4 GEM_SCALE 100
ETA Allocation
3.2 0.025 0.005 1357/485/756
1.2 0.025 0.005 1403/722/473
5.2 0.025 0.005 1344/780/474
The table above shows the allocation under different ETA settings.
3. Possible Reason for Missing mode.levels Files
ETA 3.2 0.025 0.0005
GAM 1.0 1.0
SCALING_SHAPE 1.0
SCALING_SCALE 50
Path 15/12/8/7/5/5/5/4/4/3 65
Word Allocation 2117/447/34
Path 12/11/10/8/8/7/7/6/6/6/6/5/4 50
Word Allocation 1423/757/418
Path 14/12/12/10/6/6/6/6/6/6/5/4 39
Word Allocation 1146/703/749
As Issue #2 mentioned, for a fixed ETA setting, making the GEM_MEAN value smaller causes the mode.levels file to go missing; when we increase the ETA value, mode.levels appears again. There therefore seems to be a joint effect of these parameters on the final word allocation and tree structure.
ETA 5.2 0.25 0.005
GEM_MEAN   mode.levels   Word Allocation
0.15       YES           2271/291/36
0.25       YES           1640/687/271
0.75       YES           786/650/1162
GEM_MEAN 0.4 GEM_SCALE 100
ETA 5.2 0.025 0.05
Word Allocation 1267/749/582
level topics doc_no/word_no
0 1 147/1267
1 9 135/728 5/9 1/2 1/2 1/1
2 44 48/222 23/110 13/43 12/65 6/19
ETA 5.2 0.025 0.005
Word Allocation 1288/746/564
level topics doc_no/word_no
0 1 147/1288
1 12 136/726 1/4 1/0 1/1 1/2
2 49 30/110 16/75 13/47 8/42 7/28 7/20
ETA 5.2 0.025 0.5
Word Allocation 1335/901/362
level topics doc_no/word_no
0 1 147/1335
1 13 134/877 2/12 1/0 1/4 1/2
2 93 6/23 6/17 6/17 4/10 4/13 4/11
ETA 5.2 0.025 0.5
Word Allocation 1342/914/342
level topics doc_no/word_no
0 1 147/1342
1 11 136/882 2/6 1/5 1/4 1/3 1/0
2 103 6/9 4/10 4/7 3/15 3/11 3/14
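A back-of-the-envelope calculation shows why a large third ETA component empties level 2. This mimics the word-likelihood term of a collapsed Gibbs sampler, not the project's exact code, and all counts are hypothetical:

```python
# Word term for assigning word w to level l, roughly
#   (n_lw + eta_l) / (n_l + V * eta_l),
# where n_lw is w's count in the level-l topic, n_l is that topic's
# total word count, and V is the vocabulary size (assumed here).
V = 5000

def word_term(n_lw, n_l, eta):
    return (n_lw + eta) / (n_l + V * eta)

# A word with 30 occurrences in a level-2 topic of 200 words:
for eta2 in (0.005, 0.5):
    print(eta2, round(word_term(30, 200, eta2), 4))
```

With eta = 0.005 the term is about 0.133; at eta = 0.5 it falls to about 0.011, so a word that fits a deep topic loses most of its advantage there. This is consistent with the runs above, where raising the third ETA component to 0.5 pushed words out of level 2 (564 → 362) and fragmented it into roughly 100 topics.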
For a long time I assumed that the ETA setting was the reason the mode.levels file disappears. However, the following experiments refuted that assumption.
ETA                mode.levels
1.2 0.025 0.0005   YES
1.2 0.025 0.005    YES
0.2 0.025 0.005    YES
0.2 0.25 0.005     YES
2.2 0.25 0.005     YES
All of the above were run with GEM_MEAN 0.5 and GEM_SCALE 100.
With ETA fixed at 2.2 0.25 0.005, we varied GEM_MEAN:
GEM_MEAN   mode.levels
0.15       NO
0.25       NO
0.35       YES
0.5        YES
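To make these presence checks less error-prone, a small script can scan the output directories and report which runs never wrote mode.levels. This assumes a layout of one output directory per run under a common root, which is not dictated by the original notes:

```python
import os

def runs_missing_mode_levels(root):
    """Return run directories under `root` that lack a mode.levels file."""
    missing = []
    for run in sorted(os.listdir(root)):
        run_dir = os.path.join(root, run)
        if os.path.isdir(run_dir) and not os.path.exists(
                os.path.join(run_dir, "mode.levels")):
            missing.append(run)
    return missing

# Example: print(runs_missing_mode_levels("runs"))
```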
Previously we always tried sampling GEM, ETA, and GAM while keeping the other parameters unchanged, and we always found that the top score and mode were almost the same. So we tried changing the SCALING parameters:
ETA 0.2 2.5 0.5
GAM 1.0 1.0
GEM_MEAN 0.1
GEM_SCALE 100
SCALING_SHAPE 1.0
SCALING_SCALE 0.5
SAMPLE_ETA 1
SAMPLE_GEM 1
word allocation 0/0/2598
Path 114 30 2 1
Score -33.9
ETA 1.517 1.517 1.097
GEM_MEAN 1.0
GEM_SCALE 6.967
Iter 126
ETA 0.2 2.5 0.5
SAMPLE_ETA 1
SAMPLE_GEM 1
SCALING_SHAPE 1.0
SCALING_SCALE 100
word allocation 1406/526/666
Path 1 1 1 1 1 1 1
Final Score and Sampling results:
Score 58710
ETA 0.7443 1.459 1.456
GEM 0.573 8.099