bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool

Home Page: https://bab2min.github.io/tomotopy

License: MIT License

Python 7.70% C++ 88.71% C 2.15% HTML 1.38% Shell 0.05%
topic-modeling latent-dirichlet-allocation hierarchical-dirichlet-processes nlp pachinko-allocation dirichlet-multinomial-regression python-library correlated-topic-model supervised-lda topic-models

tomotopy's Introduction

tomotopy

🎌 English, 한국어.

What is tomotopy?

tomotopy is a Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling based topic model library written in C++. It utilizes vectorization on modern CPUs to maximize speed. The current version of tomoto supports several major topic models, including:

  • Latent Dirichlet Allocation (tomotopy.LDAModel)
  • Labeled LDA (tomotopy.LLDAModel)
  • Partially Labeled LDA (tomotopy.PLDAModel)
  • Supervised LDA (tomotopy.SLDAModel)
  • Dirichlet Multinomial Regression (tomotopy.DMRModel)
  • Generalized Dirichlet Multinomial Regression (tomotopy.GDMRModel)
  • Hierarchical Dirichlet Process (tomotopy.HDPModel)
  • Hierarchical LDA (tomotopy.HLDAModel)
  • Multi Grain LDA (tomotopy.MGLDAModel)
  • Pachinko Allocation (tomotopy.PAModel)
  • Hierarchical PA (tomotopy.HPAModel)
  • Correlated Topic Model (tomotopy.CTModel)
  • Dynamic Topic Model (tomotopy.DTModel)
  • Pseudo-document based Topic Model (tomotopy.PTModel).

Please visit https://bab2min.github.io/tomotopy for more information.

Getting Started

You can install tomotopy easily using pip (https://pypi.org/project/tomotopy/):

$ pip install --upgrade pip
$ pip install tomotopy

The supported OS and Python versions are:

  • Linux (x86-64) with Python >= 3.6
  • macOS >= 10.13 with Python >= 3.6
  • Windows 7 or later (x86, x86-64) with Python >= 3.6
  • Other OS with Python >= 3.6: compilation from source code required (with a C++14-compatible compiler)

After installing, you can start using tomotopy by just importing it:

import tomotopy as tp
print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'

Currently, tomotopy can exploit the AVX2, AVX, or SSE2 SIMD instruction sets to maximize performance. When the package is imported, it checks the available instruction sets and selects the best option. If tp.isa reports none, training iterations may take a long time; but since most modern Intel and AMD CPUs provide SIMD instruction sets, SIMD acceleration usually brings a big improvement.

Here is sample code for simple LDA training on texts from the 'sample.txt' file:

import tomotopy as tp
mdl = tp.LDAModel(k=20)
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

mdl.summary()

Performance of tomotopy

tomotopy uses Collapsed Gibbs Sampling (CGS) to infer the distributions of topics and words. Generally, CGS converges more slowly than the Variational Bayes (VB) that gensim's LdaModel uses, but each of its iterations can be computed much faster. In addition, tomotopy can take advantage of multicore CPUs with SIMD instruction sets, which can result in faster iterations.

The following chart compares the running time of LDA models in tomotopy and gensim. The input data consists of 1,000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB). tomotopy trains for 200 iterations and gensim for 10 iterations.

https://bab2min.github.io/tomotopy/images/tmt_i5.png

Performance in Intel i5-6600, x86-64 (4 cores)

https://bab2min.github.io/tomotopy/images/tmt_xeon.png

Performance in Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)

Although tomotopy ran 20 times more iterations, its overall running time was 5 to 10 times faster than gensim's, and it yields a stable result.

It is difficult to compare CGS and VB directly because they are totally different techniques, but from a practical point of view we can compare their speed and results. The following chart shows the per-word log-likelihood of the two models' results.

https://bab2min.github.io/tomotopy/images/LLComp.png

The SIMD instruction set has a great effect on performance. The following is a comparison between SIMD instruction sets.

https://bab2min.github.io/tomotopy/images/SIMDComp.png

Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.

Model Save and Load

tomotopy provides save and load methods for each topic model class, so you can save a model to a file whenever you want and re-load it from the file later.

import tomotopy as tp

mdl = tp.HDPModel()
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# save into file
mdl.save('sample_hdp_model.bin')

# load from file
mdl = tp.HDPModel.load('sample_hdp_model.bin')
for k in range(mdl.k):
    if not mdl.is_live_topic(k): continue
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# the saved model is an HDP model,
# so loading it with the LDA model class raises an exception
mdl = tp.LDAModel.load('sample_hdp_model.bin')

When you load a model from a file, the model type stored in the file must match the class whose load method you call.

See more at tomotopy.LDAModel.save and tomotopy.LDAModel.load methods.
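
For robustness you can guard the load call; a minimal sketch (the exact exception type raised is an assumption):

import tomotopy as tp

# try loading as LDA first; fall back to HDP if the file was saved
# by an HDPModel (the caught exception type is assumed here)
try:
    mdl = tp.LDAModel.load('sample_hdp_model.bin')
except Exception:
    mdl = tp.HDPModel.load('sample_hdp_model.bin')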

Documents in the Model and out of the Model

We can use topic models for two major purposes. The basic one is to discover topics in a set of documents as the result of model training, and the more advanced one is to infer topic distributions for unseen documents using a trained model.

We call a document of the former kind (used for model training) a document in the model, and a document of the latter kind (unseen during training) a document out of the model.

In tomotopy, these two kinds of documents are created differently. A document in the model can be created by the tomotopy.LDAModel.add_doc method. add_doc can only be called before tomotopy.LDAModel.train starts; in other words, once train has been called, add_doc can no longer add documents to the model, because the set of documents used for training has become fixed.

To acquire the instance of a created document, use tomotopy.LDAModel.docs like this:

mdl = tp.LDAModel(k=20)
idx = mdl.add_doc(words)
if idx < 0: raise RuntimeError("Failed to add doc")
doc_inst = mdl.docs[idx]
# doc_inst is an instance of the added document

A document out of the model is generated by the tomotopy.LDAModel.make_doc method. make_doc can be called only after train starts. If you use make_doc before the set of documents used for training has become fixed, you may get wrong results. Since make_doc returns the instance directly, you can use its return value for other manipulations.

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_doc) # doc_inst is an instance of the unseen document

Inference for Unseen Documents

If a new document is created by tomotopy.LDAModel.make_doc, its topic distribution can be inferred by the model. Inference for unseen documents is performed using the tomotopy.LDAModel.infer method.

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_doc)
topic_dist, ll = mdl.infer(doc_inst)
print("Topic Distribution for Unseen Docs: ", topic_dist)
print("Log-likelihood of inference: ", ll)

The infer method can process either a single instance of tomotopy.Document or a list of instances, as in the sketch below. See more at tomotopy.LDAModel.infer.
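
For example, here is a minimal sketch of batch inference, assuming `mdl` is a trained model and `unseen_docs` is a list of token lists; the return value for a list input is assumed to be a pair of lists, mirroring the single-document case:

doc_insts = [mdl.make_doc(words) for words in unseen_docs]
topic_dists, lls = mdl.infer(doc_insts)
for dist, ll in zip(topic_dists, lls):
    print(dist, ll)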

Corpus and transform

Every topic model in tomotopy has its own internal document type, and documents of the suitable type can be created and added through each model's add_doc method. However, adding the same list of documents to several different models is quite inconvenient, because add_doc must be called separately for each model. Thus, tomotopy provides the tomotopy.utils.Corpus class, which holds a list of documents. A tomotopy.utils.Corpus can be inserted into any model by passing it as the corpus argument of __init__ or of each model's add_corpus method; inserting a corpus has the same effect as inserting every document it holds, as the sketch below shows.
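
For instance, here is a minimal sketch (with toy documents) of reusing one corpus across two different models:

import tomotopy as tp

corpus = tp.utils.Corpus()
for text in ['a b c d', 'c d e f', 'e f g h']:
    corpus.add_doc(text.split())

# the corpus can be inserted at construction time...
lda = tp.LDAModel(k=5, corpus=corpus)
# ...or afterwards via add_corpus
hdp = tp.HDPModel()
hdp.add_corpus(corpus)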

Some topic models require additional data for their documents. For example, tomotopy.DMRModel requires a metadata argument of type str, while tomotopy.PLDAModel requires a labels argument of type List[str]. Since tomotopy.utils.Corpus holds an independent set of documents rather than being tied to a specific topic model, the data attached to corpus documents may not match what a given topic model requires. In this case, the miscellaneous data can be converted to fit the target topic model using the transform argument. See the following code for details:

from tomotopy import DMRModel
from tomotopy.utils import Corpus

corpus = Corpus()
corpus.add_doc("a b c d e".split(), a_data=1)
corpus.add_doc("e f g h i".split(), a_data=2)
corpus.add_doc("i j k l m".split(), a_data=3)

model = DMRModel(k=10)
model.add_corpus(corpus)
# The `a_data` field in `corpus` is lost here,
# and the `metadata` field that `DMRModel` requires is filled with its default value, an empty str.

assert model.docs[0].metadata == ''
assert model.docs[1].metadata == ''
assert model.docs[2].metadata == ''

def transform_a_data_to_metadata(misc: dict):
    return {'metadata': str(misc['a_data'])}
# this function transforms `a_data` to `metadata`

model = DMRModel(k=10)
model.add_corpus(corpus, transform=transform_a_data_to_metadata)
# Now the docs in `model` have non-default `metadata`, generated from the `a_data` field.

assert model.docs[0].metadata == '1'
assert model.docs[1].metadata == '2'
assert model.docs[2].metadata == '3'

Parallel Sampling Algorithms

Since version 0.5.0, tomotopy allows you to choose a parallelism algorithm. The algorithm provided in versions prior to 0.4.2 is COPY_MERGE, which is available for all topic models. The new algorithm PARTITION, available since 0.5.0, makes training generally faster and more memory-efficient, but it is not available for all topic models. It can be selected as shown in the sketch below.
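
A minimal sketch of selecting the algorithm explicitly, via the parallel parameter of train described in the 0.5.0 notes below:

import tomotopy as tp

mdl = tp.LDAModel(k=20)
# ... add documents with mdl.add_doc() ...
mdl.train(200, parallel=tp.ParallelScheme.PARTITION)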

The following chart shows the speed difference between the two algorithms based on the number of topics and the number of workers.

https://bab2min.github.io/tomotopy/images/algo_comp.png

https://bab2min.github.io/tomotopy/images/algo_comp2.png

Performance by Version

Performance changes by version are shown in the following graphs. The time taken to train an LDA model for 1,000 iterations was measured (docs: 11,314; vocab: 60,382; words: 2,364,724; Intel Xeon Gold 5120 @ 2.2 GHz).

https://bab2min.github.io/tomotopy/images/lda-perf-t1.png

https://bab2min.github.io/tomotopy/images/lda-perf-t4.png

https://bab2min.github.io/tomotopy/images/lda-perf-t8.png

Pinning Topics Using Word Priors

Since version 0.6.0, a new method tomotopy.LDAModel.set_word_prior has been added. It allows you to control the per-topic prior of each word. For example, the following code sets the weight of the word 'church' to 1.0 in topic 0 and to 0.1 in the rest of the topics. This means that the probability that the word 'church' is assigned to topic 0 is 10 times higher than the probability of it being assigned to any other topic; therefore most occurrences of 'church' end up in topic 0, and topic 0 gathers words related to 'church'. This lets you pin certain topics to specific topic numbers.

import tomotopy as tp
mdl = tp.LDAModel(k=20)

# add documents into `mdl`

# setting word prior
mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])

See word_prior_example in example.py for more details.

Examples

You can find example Python code for tomotopy at https://github.com/bab2min/tomotopy/blob/main/examples/ .

You can also get the data file used in the example code at https://drive.google.com/file/d/18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J/view .

License

tomotopy is licensed under the terms of the MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

History

  • 0.12.7 (2023-12-19)
    • New features
      • Added Topic Model Viewer tomotopy.viewer.open_viewer()
      • Optimized the performance of tomotopy.utils.Corpus.process()
    • Bug fixes
      • Document.span now returns ranges in character units, not in byte units.
  • 0.12.6 (2023-12-11)
    • New features
      • Added some convenience features to tomotopy.LDAModel.train and tomotopy.LDAModel.set_word_prior.
      • LDAModel.train now has new arguments callback, callback_interval and show_progress to monitor the training progress.
      • LDAModel.set_word_prior can now accept a Dict[int, float] as its argument prior.
  • 0.12.5 (2023-08-03)
    • New features
      • Added support for Linux ARM64 architecture.
  • 0.12.4 (2023-01-22)
    • New features
      • Added support for macOS ARM64 architecture.
    • Bug fixes
      • Fixed an issue where tomotopy.Document.get_sub_topic_dist() raises a bad argument exception.
      • Fixed an issue where exception raising sometimes causes crashes.
  • 0.12.3 (2022-07-19)
    • New features
      • Now, inserting an empty document using tomotopy.LDAModel.add_doc() just ignores it instead of raising an exception. If the newly added argument ignore_empty_words is set to False, an exception is raised as before.
      • The tomotopy.HDPModel.purge_dead_topics() method was added to remove non-live topics from the model.
    • Bug fixes
      • Fixed an issue that prevented setting user-defined values for nuSq in tomotopy.SLDAModel (by @jucendrero).
      • Fixed an issue where tomotopy.utils.Coherence did not work for tomotopy.DTModel.
      • Fixed an issue that often crashed when calling make_dic() before calling train().
      • Resolved the problem that the results of tomotopy.DMRModel and tomotopy.GDMRModel are different even when the seed is fixed.
      • The parameter optimization process of tomotopy.DMRModel and tomotopy.GDMRModel has been improved.
      • Fixed an issue that sometimes crashed when calling tomotopy.PTModel.copy().
  • 0.12.2 (2021-09-06)
    • An issue where calling convert_to_lda of tomotopy.HDPModel with min_cf > 0, min_df > 0 or rm_top > 0 causes a crash has been fixed.
    • A new argument from_pseudo_doc was added to tomotopy.Document.get_topics and tomotopy.Document.get_topic_dist. This argument is only valid for documents of PTModel; it enables controlling the source used for computing the topic distribution.
    • A default value for argument p of tomotopy.PTModel has been changed. The new default value is k * 10.
    • Using documents generated by make_doc without calling infer no longer causes a crash; it just prints warning messages.
    • An issue where the internal C++ code wasn't compiled in a clang C++17 environment has been fixed.
  • 0.12.1 (2021-06-20)
    • An issue where tomotopy.LDAModel.set_word_prior() causes a crash has been fixed.
    • Now tomotopy.LDAModel.perplexity and tomotopy.LDAModel.ll_per_word return the accurate value when TermWeight is not ONE.
    • tomotopy.LDAModel.used_vocab_weighted_freq was added, which returns term-weighted frequencies of words.
    • Now tomotopy.LDAModel.summary() shows not only the entropy of words, but also the entropy of term-weighted words.
  • 0.12.0 (2021-04-26)
    • Now tomotopy.DMRModel and tomotopy.GDMRModel support multiple values of metadata (see https://github.com/bab2min/tomotopy/blob/main/examples/dmr_multi_label.py )
    • The performance of tomotopy.GDMRModel was improved.
    • A copy() method has been added for all topic models to do a deep copy.
    • An issue was fixed where words that are excluded from training (by min_cf, min_df) have incorrect topic id. Now all excluded words have -1 as topic id.
    • Now all exceptions and warnings generated by tomotopy follow standard Python types.
    • Compiler requirements have been raised to C++14.
  • 0.11.1 (2021-03-28)
    • A critical bug of asymmetric alphas was fixed. Due to this bug, version 0.11.0 has been removed from releases.
  • 0.11.0 (2021-03-26) (removed)
    • A new topic model tomotopy.PTModel for short texts was added into the package.
    • An issue was fixed where tomotopy.HDPModel.infer causes a segmentation fault sometimes.
    • A mismatch of numpy API version was fixed.
    • Now asymmetric document-topic priors are supported.
    • Serializing topic models to bytes in memory is supported.
    • An argument normalize was added to get_topic_dist(), get_topic_word_dist() and get_sub_topic_dist() for controlling normalization of results.
    • Now tomotopy.DMRModel.lambdas and tomotopy.DMRModel.alpha give correct values.
    • Categorical metadata supports for tomotopy.GDMRModel were added (see https://github.com/bab2min/tomotopy/blob/main/examples/gdmr_both_categorical_and_numerical.py ).
    • Python3.5 support was dropped.
  • 0.10.2 (2021-02-16)
    • An issue was fixed where tomotopy.CTModel.train fails with large K.
    • An issue was fixed where tomotopy.utils.Corpus lost its uid values.
  • 0.10.1 (2021-02-14)
    • An issue was fixed where tomotopy.utils.Corpus.extract_ngrams crashed with empty input.
    • An issue was fixed where tomotopy.LDAModel.infer raises exception with valid input.
    • An issue was fixed where tomotopy.HLDAModel.infer generates wrong tomotopy.Document.path.
    • A new parameter freeze_topics was added to tomotopy.HLDAModel.train, so you can control whether new topics are created during training.
  • 0.10.0 (2020-12-19)
    • The interfaces of tomotopy.utils.Corpus and of tomotopy.LDAModel.docs were unified. Now you can access the documents in a corpus in the same manner.
    • __getitem__ of tomotopy.utils.Corpus was improved. Indexing by int, by Iterable[int], and by slice are supported, as well as indexing by uid.
    • New methods tomotopy.utils.Corpus.extract_ngrams and tomotopy.utils.Corpus.concat_ngrams were added. They extract n-gram collocations using PMI and concatenate them into single words.
    • A new method tomotopy.LDAModel.add_corpus was added, and tomotopy.LDAModel.infer can receive corpus as input.
    • A new module tomotopy.coherence was added. It provides the way to calculate coherence of the model.
    • A parameter window_size was added to tomotopy.label.FoRelevance.
    • An issue was fixed where NaN often occurs when training tomotopy.HDPModel.
    • Now Python3.9 is supported.
    • The dependency on py-cpuinfo was removed and the initialization of the module was improved.
  • 0.9.1 (2020-08-08)
    • Memory leaks in version 0.9.0 were fixed.
    • tomotopy.CTModel.summary() was fixed.
  • 0.9.0 (2020-08-04)
    • The tomotopy.LDAModel.summary() method, which prints human-readable summary of the model, has been added.
    • The random number generator of the package has been replaced with EigenRand. It speeds up random number generation and resolves the result differences between platforms.
    • Due to the above, even if the seed is the same, model training results may differ from versions before 0.9.0.
    • Fixed a training error in tomotopy.HDPModel.
    • tomotopy.DMRModel.alpha now shows Dirichlet prior of per-document topic distribution by metadata.
    • tomotopy.DTModel.get_count_by_topics() has been modified to return a 2-dimensional ndarray.
    • tomotopy.DTModel.alpha has been modified to return the same value as tomotopy.DTModel.get_alpha().
    • Fixed an issue where the metadata value could not be obtained for the document of tomotopy.GDMRModel.
    • tomotopy.HLDAModel.alpha now shows Dirichlet prior of per-document depth distribution.
    • tomotopy.LDAModel.global_step has been added.
    • tomotopy.MGLDAModel.get_count_by_topics() now returns the word count for both global and local topics.
    • tomotopy.PAModel.alpha, tomotopy.PAModel.subalpha, and tomotopy.PAModel.get_count_by_super_topic() have been added.
  • 0.8.2 (2020-07-14)
    • New properties tomotopy.DTModel.num_timepoints and tomotopy.DTModel.num_docs_by_timepoint have been added.
    • A bug which caused different results on different platforms even with the same seed was partially fixed. As a result of this fix, tomotopy on 32-bit now yields different training results from earlier versions.
  • 0.8.1 (2020-06-08)
    • A bug where tomotopy.LDAModel.used_vocabs returned an incorrect value was fixed.
    • Now tomotopy.CTModel.prior_cov returns a covariance matrix with shape [k, k].
    • Now tomotopy.CTModel.get_correlations with empty arguments returns a correlation matrix with shape [k, k].
  • 0.8.0 (2020-06-06)
    • Since NumPy was introduced into tomotopy, many methods and properties of tomotopy now return numpy.ndarray instead of list.
    • Tomotopy has a new dependency NumPy >= 1.10.0.
    • A wrong estimation of tomotopy.HDPModel.infer was fixed.
    • A new method for converting an HDPModel to an LDAModel was added.
    • New properties including tomotopy.LDAModel.used_vocabs, tomotopy.LDAModel.used_vocab_freq and tomotopy.LDAModel.used_vocab_df were added into topic models.
    • A new g-DMR topic model (tomotopy.GDMRModel) was added.
    • An error at initializing tomotopy.label.FoRelevance in macOS was fixed.
    • An error that occurred when using a tomotopy.utils.Corpus created without the raw parameter was fixed.
  • 0.7.1 (2020-05-08)
    • tomotopy.Document.path was added for tomotopy.HLDAModel.
    • A memory corruption bug in tomotopy.label.PMIExtractor was fixed.
    • A compile error in gcc 7 was fixed.
  • 0.7.0 (2020-04-18)
    • tomotopy.DTModel was added into the package.
    • A bug in tomotopy.utils.Corpus.save was fixed.
    • A new method tomotopy.Document.get_count_vector was added into Document class.
    • Now linux distributions use manylinux2010 and an additional optimization is applied.
  • 0.6.2 (2020-03-28)
    • A critical bug related to save and load was fixed. Version 0.6.0 and 0.6.1 have been removed from releases.
  • 0.6.1 (2020-03-22) (removed)
    • A bug related to module loading was fixed.
  • 0.6.0 (2020-03-22) (removed)
    • tomotopy.utils.Corpus class that manages multiple documents easily was added.
    • tomotopy.LDAModel.set_word_prior method that controls word-topic priors of topic models was added.
    • A new argument min_df that filters words based on document frequency was added into every topic model's __init__.
    • tomotopy.label, the submodule about topic labeling was added. Currently, only tomotopy.label.FoRelevance is provided.
  • 0.5.2 (2020-03-01)
    • A segmentation fault problem was fixed in tomotopy.LLDAModel.add_doc.
    • A bug was fixed where infer of tomotopy.HDPModel sometimes crashed the program.
    • A crash issue of tomotopy.LDAModel.infer with ps=tomotopy.ParallelScheme.PARTITION, together=True was fixed.
  • 0.5.1 (2020-01-11)
    • A bug was fixed where tomotopy.SLDAModel.make_doc didn't support missing values for y.
    • Now tomotopy.SLDAModel fully supports missing values for response variables y. Documents with missing values (NaN) are included in topic modeling, but excluded from the regression of response variables.
  • 0.5.0 (2019-12-30)
    • Now tomotopy.PAModel.infer returns both the topic distribution and the sub-topic distribution.
    • New methods get_sub_topics and get_sub_topic_dist were added into tomotopy.Document. (for PAModel)
    • New parameter parallel was added for tomotopy.LDAModel.train and tomotopy.LDAModel.infer method. You can select parallelism algorithm by changing this parameter.
    • tomotopy.ParallelScheme.PARTITION, a new algorithm, was added. It works efficiently when the number of workers is large and the number of topics or the size of the vocabulary is big.
    • A bug where rm_top didn't work at min_cf < 2 was fixed.
  • 0.4.2 (2019-11-30)
    • Wrong topic assignments of tomotopy.LLDAModel and tomotopy.PLDAModel were fixed.
    • Readable __repr__ of tomotopy.Document and tomotopy.Dictionary was implemented.
  • 0.4.1 (2019-11-27)
    • A bug at init function of tomotopy.PLDAModel was fixed.
  • 0.4.0 (2019-11-18)
    • New models including tomotopy.PLDAModel and tomotopy.HLDAModel were added into the package.
  • 0.3.1 (2019-11-05)
    • An issue where get_topic_dist() returns incorrect value when min_cf or rm_top is set was fixed.
    • The return value of get_topic_dist() of tomotopy.MGLDAModel document was fixed to include local topics.
    • The estimation speed with tw=ONE was improved.
  • 0.3.0 (2019-10-06)
    • A new model, tomotopy.LLDAModel was added into the package.
    • A crashing issue of HDPModel was fixed.
    • Since hyperparameter estimation for HDPModel was implemented, the result of HDPModel may differ from previous versions.
      If you want to turn off hyperparameter estimation of HDPModel, set optim_interval to zero.
  • 0.2.0 (2019-08-18)
    • New models including tomotopy.CTModel and tomotopy.SLDAModel were added into the package.
    • A new parameter option rm_top was added for all topic models.
    • The problems in save and load method for PAModel and HPAModel were fixed.
    • An occasional crash in loading HDPModel was fixed.
    • The problem that ll_per_word was calculated incorrectly when min_cf > 0 was fixed.
  • 0.1.6 (2019-08-09)
    • Compiling errors at clang with macOS environment were fixed.
  • 0.1.4 (2019-08-05)
    • The issue when add_doc receives an empty list as input was fixed.
    • The issue that tomotopy.PAModel.get_topic_words didn't extract the word distribution of subtopics was fixed.
  • 0.1.3 (2019-05-19)
    • The parameter min_cf and its stopword-removing function were added for all topic models.
  • 0.1.0 (2019-05-12)
    • First version of tomotopy

Bindings for Other Languages

Bundled Libraries and Their License

Citation

@software{minchul_lee_2022_6868418,
  author       = {Minchul Lee},
  title        = {bab2min/tomotopy: 0.12.3},
  month        = jul,
  year         = 2022,
  publisher    = {Zenodo},
  version      = {v0.12.3},
  doi          = {10.5281/zenodo.6868418},
  url          = {https://doi.org/10.5281/zenodo.6868418}
}

tomotopy's People

Contributors

bab2min, claudinoac, dpfens, jonaschn, jucendrero


tomotopy's Issues

Computing per-timepoint topic distributions in DTModel

Hello, I'm a graduate student who uses tomotopy and kiwipiepy heavily. (bab2min, you're the best.)
I asked a question about kiwipiepy before as well; I'm not sure whether you remember.

Since DTModel was added a while ago, I've been using it heavily: the DTModel that ran so slowly in gensim trains dramatically faster in tomotopy. Thank you very much.

My problem is visualization. The change of the word distribution within each topic over time seems to be covered by get_topic_word_dist, but I'd like to know the change of the distribution over topics. For example, like this code (https://github.com/GSukr/dtmvisual), I want to plot the change of topic proportions over time; in gensim I fetched gamma_, the per-document topic distribution parameter, and computed the topic proportions from it.

In tomotopy, can I just use get_alpha for this? I'm asking because DTModel has no get_topic_dist function.

Thank you for making such a good package. I'm also reviewing the paper you recently wrote with Professor Min Song.

Malloc error when training LLDA model

System: OSX (10.14.6)
Version: Tomotopy 0.5.2

Example code (mostly from lda example and using the same text file) :

import tomotopy as tp
from random import seed
from random import randint

seed(1)

def llda_example(input_file, save_path):
    topics = ['technology', 'art', 'economics', 'politics', 'religion', 'sport']
    mdl = tp.LLDAModel(tw=tp.TermWeight.ONE, min_cf=3, rm_top=5, k=20)
    for n, line in enumerate(open(input_file, encoding='utf-8')):
        ch = line.strip().split()
        labels = []
        for i in range(randint(1, 6)):
            labels.append(topics[randint(0, 5)])
        mdl.add_doc(ch,labels)
    mdl.burn_in = 100
    mdl.train(0)
            
print('Running LLDA')
llda_example('enwiki-stemmed-1000.txt', 'test.lda.bin')

Result: kernel dies, with the following message in the console:
python(28393,0x111d1b5c0) malloc: *** error for object 0x7fc6f98cdfe0: pointer being freed was not allocated
python(28393,0x111d1b5c0) malloc: *** set a breakpoint in malloc_error_break to debug

Partially Labelled LDA (PLDA): 'k' is an invalid keyword argument for this function

I received the following error when instantiating a tomotopy.PLDA instance:

python3.5 -i process_pllda.py 
Traceback (most recent call last):
  File "process_pllda.py", line 27, in <module>
    mdl = tp.PLDAModel(k=50)
TypeError: 'k' is an invalid keyword argument for this function

The documentation for tomotopy.PLDA indicates that k is a valid keyword argument for the constructor method, but based on the source code it is not. k is also not listed as a valid parameter later in the documentation, but latent_topics is (source). I am going to assume that k should be latent_topics.

Can tomotopy be integrated with pyLDAvis?

I'm using the tomotopy HDP model.

pyLDAvis requires model, corpus and dictionary arguments, but when I pass a model trained with tomotopy, I get the error
'tomotopy.HDPModel' object has no attribute 'num_topics'

The tomotopy HDP model gives better results than anything else I've tried, so I'd really like to use it. Is the integration possible?

Additionally, is it possible to get back the strings of a corpus that was preprocessed and saved as in the following steps?

corpus = tp.utils.Corpus(
    tokenizer=tokenizer
)
# the input file contains one document per line
corpus.process((line, kiwi.async_analyze(line)) for line in open('input_text_file.txt', encoding='utf-8'))
# save the preprocessed corpus
corpus.save('k.cps')

[new feature] converting HDP to LDA

Gensim provides hdp_to_lda method for training or inference with topics fixed in a specific state. It is good for tomotopy to have this feature too.
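
(This was later added in version 0.8.0 as a conversion method on HDPModel, named convert_to_lda elsewhere on this page. A minimal sketch of its use, assuming `hdp_mdl` is a trained tomotopy.HDPModel; the topic_threshold argument name and the returned topic mapping are assumptions:)

lda_mdl, topic_mapping = hdp_mdl.convert_to_lda(topic_threshold=0.0)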

PLDA documentation, order of topics

I understand that after training the PLDAModel, k is the number of latent topics plus the topics resulting from the labels (topics_per_label * unique labels).

What is the order of the topics now? For example using get_topic_word_dist(0), would that give me the first latent topic or the first topic of the first label?

Unrelated to this, it would be great if tomotopy.Dictionary got a nicer string representation.

>>> plda_model.topic_label_dict
<tomotopy.Dictionary object at 0x0000012034EADD90>

It could just be the same output as list(m.topic_label_dict) for example.

Thanks again for your great work. I don't think there is any other topic modeling library with that many models and such a great performance.

Segfaults with alpha

Hi @bab2min, just wanted to report some segfaults I came across when building the Ruby library.

import tomotopy as tp

model = tp.GDMRModel()
print(model.alpha)

model = tp.DTModel()
print(model.alpha)

tomotopy version: 0.9.1

PLDA performance/memory issue

Hi,
thanks so much for implementing the Partially Labeled LDA model :)

But there seems to be an issue with the add_doc method. Firstly, it's much slower than for any of the other models in this library, and there is also a memory issue: after adding only about 300 (not overly long) documents, I run out of memory.

I am using PLDAModel(latent_topics=20, topics_per_label=1) and the documents added to this point only contain 3 different labels.

add num_timepoints & num_docs_by_timepoint for DTModel

The current tomotopy.DTModel is missing some features regarding the number of timepoints and documents. I suggest adding the following properties to DTModel.

  • num_timepoints (or t): the value that was passed as t to __init__
  • num_docs_by_timepoint: the number of documents belonging to each timepoint

[new feature] Convenient model description

It would be good for topic model instances to have a simple method showing their status and description, like:

mdl = tp.LDAModel(k=10, alpha=0.1, eta=0.1)

# do some works

mdl.model_summary() # will print like:

# LDAModel
# - hyperparameters
# -- term weight scheme: one
# -- k, number of topics: 10
# -- initial alpha, concentrate parameters for doc-topic dist: 0.1
# -- eta, concentrate parameters for topic-word dist: 0.1
# -- number of docs: 1000
# -- size of vocabs: 23456
# -- number of total words: 100000
# - parameters
# -- alpha: [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
# ...

Alpha and gamma values not what are being fed in as arguments

It's entirely possible that this issue stems from my lack of understanding of this model or of HDP/LDA in general:

I've written a method that cycles through different hyperparameter values and trains the model so that you can see how the output changes with different values.

I just ran the method in PyCharm, and I saw some strange behavior: I input alpha as 10 ** -4 (0.0001), eta as 10 ** -1 (0.1), and gamma as 10 ** 0 (1). However, once the model was trained, I got the following values:

hdp.alpha = 7.38756571081467e-05
hdp.eta = 0.10000000149011612
hdp.gamma = 3.130246162414551

Is this normal? Should those values be changing once the model has trained?
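
(For reference: tomotopy estimates hyperparameters during training, which is why the values drift from their initial settings. Per the 0.3.0 notes above, this estimation can be turned off by setting optim_interval to zero; a minimal sketch:)

import tomotopy as tp

hdp = tp.HDPModel(alpha=1e-4, eta=0.1, gamma=1.0)
hdp.optim_interval = 0  # keep the initial hyperparameters fixed during training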

Reproducibility issues even after setting model seed

Hi, thank you for all your work on this amazing library!

I'm running into a strange issue with reproducibility: Even after setting the model seed, I'm still sometimes getting different LDA results with the same documents (a processed subset of the BBC news dataset).

My code is very simple -- It reads from a text file, where each line represents a single document with space-separated tokens, and trains an LDAModel over the data. I've turned off parallel processing to prevent any randomness from coming in there as well.

import tomotopy as tp

with open("docs.txt", "r", encoding="utf8") as fp:
    model = tp.LDAModel(k=5, seed=123456789)
    for line in fp:
        model.add_doc(line.split())

for i in range(0, 1000, 100):
    model.train(100, workers=1, parallel=tp.ParallelScheme.NONE)
    print(f"Iteration: {i + 100} LL: {model.ll_per_word:.5f}")

When I run the code, I usually get the following output:

Iteration: 100 LL: -7.94113
Iteration: 200 LL: -7.90128
Iteration: 300 LL: -7.88406
Iteration: 400 LL: -7.86940
Iteration: 500 LL: -7.85939
Iteration: 600 LL: -7.84511
Iteration: 700 LL: -7.84116
Iteration: 800 LL: -7.83339
Iteration: 900 LL: -7.83029
Iteration: 1000 LL: -7.82927

But about 30% of the time I get the following output instead, where the stats seem to diverge at iteration 300:

Iteration: 100 LL: -7.94113
Iteration: 200 LL: -7.90128
Iteration: 300 LL: -7.88715
Iteration: 400 LL: -7.87158
Iteration: 500 LL: -7.86242
Iteration: 600 LL: -7.84669
Iteration: 700 LL: -7.84028
Iteration: 800 LL: -7.82794
Iteration: 900 LL: -7.82512
Iteration: 1000 LL: -7.82317

The results seem to switch randomly between these two possibilities (I haven't seen any other variations turn up), but I just can't seem to figure out where the indeterminacy is coming from. Would appreciate any advice or help you could provide!

Attached:
docs.txt

How do Tomotopy models handle bigrams/trigrams?

Hello.

One of the key aspects of gensim's LDA models is that they can handle bigrams/trigrams via their dictionaries. I have searched throughout your documentation, but I couldn't find any notes or tutorials that cover using n-grams. I would be grateful if you could help me with this.

Thanks!
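
(For reference: version 0.10.0 later added n-gram utilities to the corpus class, tomotopy.utils.Corpus.extract_ngrams and concat_ngrams; a minimal sketch, with argument names partly assumed:)

from tomotopy.utils import Corpus

corpus = Corpus()
# ... corpus.add_doc(words) for each tokenized document ...
cands = corpus.extract_ngrams(min_cf=10, min_df=5, max_len=3)  # PMI-based collocations
corpus.concat_ngrams(cands)  # merge the collocations into single tokens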

Typo on the website

Your HDP model area has a typo: it looks like the documentation for the convert_to_lda and is_live_topic methods is exactly the same, although I'm assuming that they're not.

Segmentation fault (core dumped) when using labels in LLDA

Hello, I am not an expert and I may be very wrong here, but when I try to build a labelled LDA model I get 'Segmentation fault (core dumped)', and it may be a bug?

After declaring:
model = tp.LLDAModel(tw=tp.TermWeight.ONE, min_cf=0, rm_top=0, k=20, alpha=0.1, eta=0.01)

If I do not specify labels, as in:

model.add_doc(myDocument)

instead of

model.add_doc(myDocument,labels=myLabel)

I can add all documents and create a working model just fine, but if I try to put the labels, the program gives me the segmentation error while adding the very first document.

I have also been able to create sLDA and LDA models without any single problem.

`infer` method topic distribution of doc mostly zeros

Hi -

I fitted an HDP model and tried to obtain the topic distribution for an unseen document. I do get a list, however most of the entries are zeros, so I'm thinking there might be a rounding issue in the code.

Here's an example of what it looks like:

token_list = ['strong', 'organization', 'rusnews', 'line',  'misery', 'write', 'faq', 'ever', 'get', 
'modify', 'define', 'strong', 'atheist', 'believe', 'word']

doc_inst = hdp_model.make_doc(token_list)
topic_dist, ll = hdp_model.infer(doc_inst)

topic_dist
[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0, ## <--- Here's the only non-zero element which is correct, but I'd like to get %'s
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0]

Here's some other info on my OS

Darwin-18.5.0-x86_64-i386-64bit
Python 3.7.6 (default, Dec 30 2019, 19:38:28) 
[Clang 11.0.0 (clang-1100.0.33.16)]
NumPy 1.18.1
SciPy 1.4.1
tomotopy 0.7.1

Trying to use PAM model giving by input processed text

Hi. I'm trying to use the PAModel, giving as input a processed text (a list of lemmatized words; I removed punctuation and stop words).

def topicExtraction(corpus):
    model = tomotopy.PAModel(k=20)
    model.add_doc(corpus)
    model.train()

    print(model.get_topic_words(k, top_n = 20))

But it returns an error.
Just one thing: this project looks great.

Possible to expose the corpus class?

Thanks for tomotopy! I have one small issue though, and that's rebuilding the corpus every time I want to change the model. Is there a way to do something like this:

import tomotopy as tp

corpus = tp.Corpus()
for document in documents:
    corpus.add_doc(document)

model = tp.LDAModel(k=5, corpus=corpus)
model.train(100)

Thanks again for a great package!

Error when calling train(0) with LDAModel

Hi, I'm following the example to train and use an LDA model. After adding my documents with the add_doc method of LDAModel, when I call mdl.train(0) the code exits with an error ("Process finished with exit code -1073741819 (0xC0000005)"). I'm using tomotopy==0.7.1 with a Python 3.7 virtual environment on Windows. Thanks.

different results even if seed is fixed

Depending on the environment (32-bit or 64-bit / SSE2, AVX or AVX2) in which tomotopy is installed, different results are produced with the same seed.
It is possibly related to #60.

Document-topic matrix for hierarchical LDA

Is there a way to get the topic mixture of each document back out of a hierarchical model? I am training an HLDAModel:

h_mdl = tp.HLDAModel(depth=4,corpus=corpus,seed=1)
    
for i in range(0, 100, 10): #Train the model using Gibbs-sampling
    h_mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, h_mdl.ll_per_word))

I am using the Document class to access instances of documents with the get_topic_dist() method. Is there a way to access the topic mixture of a document at a given depth of the model?
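
(For reference: tomotopy.Document.path, added in 0.7.1 per the history above, gives the topic id assigned at each depth; a minimal sketch, where pairing it with get_topic_dist() per level is an assumption:)

doc = h_mdl.docs[0]  # assuming `h_mdl` is the trained HLDAModel from above
print(doc.path)  # topic id at each level, from the root down
print(doc.get_topic_dist())  # assumed to hold the document's weight per level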

define _mm256_set_m128i for gcc compilers < 8.0

During installation with Python 3.6.9 and gcc 7.5.0 on Ubuntu 18.04:

sudo pip3 install tomotopy

I received the following error:

src/python/../Labeling/../Utils/EigenAddonOps.hpp:79:8: error: โ€˜_mm256_set_m128iโ€™ was not declared in this scope
        u = _mm256_set_m128i(
            ^~~~~~~~~~~~~~~~
    src/python/../Labeling/../Utils/EigenAddonOps.hpp:79:8: note: suggested alternative: โ€˜_mm256_set_epi8โ€™
        u = _mm256_set_m128i(
            ^~~~~~~~~~~~~~~~
            _mm256_set_epi8

The computer has the following CPU information:

doug@doug-desktop:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          21
Model:               1
Model name:          AMD FX(tm)-8120 Eight-Core Processor
Stepping:            2
CPU MHz:             1419.705
CPU max MHz:         3100.0000
CPU min MHz:         1400.0000
BogoMIPS:            6242.06
Virtualization:      AMD-V
L1d cache:           16K
L1i cache:           64K
L2 cache:            2048K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt fma4 nodeid_msr topoext perfctr_core perfctr_nb cpb hw_pstate ssbd ibpb vmmcall arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold

I believe that this can be resolved by defining _mm256_set_m128i.

Is there a way to get the topic weight matrices?

Hello, I found this great library called tomotopy while looking for an HDP implementation!

Like LatentDirichletAllocation in sklearn, I'd like to obtain the topic weight matrices for each document and each word after training. I'm wondering whether such a feature exists.

(https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)
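
(A minimal sketch of assembling the two matrices from a trained model `mdl`; the NumPy stacking is an assumption, not a built-in API:)

import numpy as np

# topic-word weight matrix, shape [k, vocabulary size]
topic_word = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
# document-topic weight matrix, shape [number of docs, k]
doc_topic = np.stack([doc.get_topic_dist() for doc in mdl.docs])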

Issue with model loading (0.6.1)

Hi @bab2min

Something seems wrong with the newest version 0.6.1.

Models created with 0.6.1 cannot be loaded with 0.6.1; it throws the exception "Exception: 'lda.model.1000.bin' is not valid model file".

I saw that you changed the model file format: "Since version 0.6.0, the model file format has been changed. Thus model files saved in version 0.6.0 or later are not compatible with versions prior to 0.5.2."

training (this works)

mdl = tp.LDAModel(tw=tp.TermWeight.IDF, k=1000)
for description in descriptions:
    ch = description.split()
    mdl.add_doc(ch)
mdl.burn_in = 100
mdl.train(0)

print('Training...', file=sys.stderr, flush=True)
for i in range(0, 1000, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

print('Saving...', file=sys.stderr, flush=True)
mdl.save("lda.model.1000.bin", True)

load model and infer documents (this doesn't work)

mdl = tp.LDAModel.load("lda.model.1000.bin")

docs = []
for record in records:
    docs.append(mdl.make_doc(record["description_cleaned"].split()))

infered_docs = mdl.infer(docs, together=True, parallel=3)

Is term weighting implemented correctly?

Term weighting is implemented here by simply multiplying the counts by the weights before and after sampling.

updateCnt<DEC>(doc.numByTopic[tid], INC * weight);
updateCnt<DEC>(ld.numByTopic[tid], INC * weight);
updateCnt<DEC>(ld.numByTopicWord(tid, vid), INC * weight);

However, I don't think you can do that. It's OK for doc.numByTopic, because there the document dependency is still kept. However, for both ld counts the results are different from the implementation in the paper "Term Weighting Schemes for Latent Dirichlet Allocation" (eq. 6).

In the paper, the original counts are multiplied by the weights during sampling. In code this should look like the following (using numpy syntax and your variable names), where termweights is a NumberOfDocuments x VocabSize array and ld.numByTopicWord a NumberOfTopics x VocabSize array (with counts, not weights):

np.sum(termweights[docid,:][None,:] * ld.numByTopicWord[tid, vid], axis=-1)

I did some testing with my pure Python implementation, and there this expression yields a different result from ld.numByTopic[tid] using the weight update in addWordTo.

Note that this should only matter for weighting schemes where the same token can have different weights in different documents (like PMI).

Issue in HDPModel

When I train an HDP model using tomotopy, it crashes while computing the log-likelihood.

'model.vocabs' gives full list of tokens even if it was filtered

'model.vocabs' gives the full list of tokens even if the vocabulary was filtered, while 'num_vocabs' gives the number of tokens after filtering. This is confusing, because the model gives the topic-word distribution as a list of numbers of length 'num_vocabs' and does not provide an id2token dictionary. Some kind of 'used_vocabs' is needed to do any downstream analysis on the produced topics.

Error when installing tomotopy

I am installing the package:

$pip3 install tomotopy

The installation freezes; it takes hours, and the wheel spinner turns very slowly (/).

Building wheels for collected packages: tomotopy
Building wheel for tomotopy (setup.py) ... error
ERROR: Failed building wheel for tomotopy
Running setup.py clean for tomotopy
Failed to build tomotopy
Installing collected packages: tomotopy
Running setup.py install for tomotopy ... error

Any help is appreciated. Thanks.

segmentation fault from the `extract` method with trained `HDPModel` and `CTModel` :

I don't know if it is relevant, but I sometimes get a segmentation fault from the extract method with a trained HDPModel or CTModel:

    extractor = tp.label.PMIExtractor(min_cf=10, min_df=5, max_len=5, max_cand=10000)
    cands = extractor.extract(mdl)
Fatal Python error: Segmentation fault
Current thread 0x00007fbd1009b740 (most recent call first)
...
Segmentation fault (core dumped)

I tried to debug more with the faulthandler module of Python 3, but I cannot get a more detailed output.

EDIT:
Here is the stacktrace using gdb. I hope it helps:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffc74cec6d in tomoto::TrieEx<unsigned int, unsigned long, tomoto::ConstAccess<std::map<unsigned int, int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, int> > > > >* tomoto::TrieEx<unsigned int, unsigned long, tomoto::ConstAccess<std::map<unsigned int, int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, int> > > > >::makeNext<tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const::{lambda()#1}&>(unsigned int const&, tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const::{lambda()#1}&) () from /home/henry/anaconda3/envs/gdpr/lib/python3.6/site-packages/_tomotopy.cpython-36m-x86_64-linux-gnu.so
(gdb) backtrace
#0  0x00007fffc74cec6d in tomoto::TrieEx<unsigned int, unsigned long, tomoto::ConstAccess<std::map<unsigned int, int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, int> > > > >* tomoto::TrieEx<unsigned int, unsigned long, tomoto::ConstAccess<std::map<unsigned int, int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, int> > > > >::makeNext<tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const::{lambda()#1}&>(unsigned int const&, tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const::{lambda()#1}&) () from /home/henry/anaconda3/envs/gdpr/lib/python3.6/site-packages/_tomotopy.cpython-36m-x86_64-linux-gnu.so
#1  0x00007fffc74cec8b in tomoto::TrieEx<unsigned int, unsigned long, tomoto::ConstAccess<std::map<unsigned int, int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, int> > > > >* tomoto::TrieEx<unsigned int, unsigned long, tomoto::ConstAccess<std::map<unsigned int, int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, int> > > > >::makeNext<tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const::{lambda()#1}&>(unsigned int const&, tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const::{lambda()#1}&) () from /home/henry/anaconda3/envs/gdpr/lib/python3.6/site-packages/_tomotopy.cpython-36m-x86_64-linux-gnu.so
#2  0x00007fffc74cfbcf in tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const () from /home/henry/anaconda3/envs/gdpr/lib/python3.6/site-packages/_tomotopy.cpython-36m-x86_64-linux-gnu.so
#3  0x00007fffc7079dd8 in ExtractorObject::extract(ExtractorObject*, _object*, _object*) () from /home/henry/anaconda3/envs/gdpr/lib/python3.6/site-packages/_tomotopy.cpython-36m-x86_64-linux-gnu.so
#4  0x00005555556654f4 in _PyCFunction_FastCallDict () at /tmp/build/80754af9/python_1578429706181/work/Objects/methodobject.c:231
#5  0x00005555556ecdac in call_function () at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:4851
#6  0x000055555570f66a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:3335
#7  0x00005555556e6ebb in _PyFunction_FastCall (globals=<optimized out>, nargs=2, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:4933
#8  fast_function () at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:4968
#9  0x00005555556ece85 in call_function () at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:4872
#10 0x000055555570f66a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:3335
#11 0x00005555556e7c09 in _PyEval_EvalCodeWithName (qualname=0x0, name=<optimized out>, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2, kwcount=<optimized out>, kwargs=0x0, kwnames=0x0, argcount=0, args=0x0,
    locals=0x7ffff7f55120, globals=0x7ffff7f55120, _co=0x7ffff6aaba50) at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:4166
#12 PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:4187
#13 0x00005555556e89ac in PyEval_EvalCode (co=co@entry=0x7ffff6aaba50, globals=globals@entry=0x7ffff7f55120, locals=locals@entry=0x7ffff7f55120) at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:731
#14 0x0000555555768c64 in run_mod () at /tmp/build/80754af9/python_1578429706181/work/Python/pythonrun.c:1025
#15 0x0000555555769061 in PyRun_FileExFlags () at /tmp/build/80754af9/python_1578429706181/work/Python/pythonrun.c:978
#16 0x0000555555769263 in PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python_1578429706181/work/Python/pythonrun.c:419
#17 0x000055555576936d in PyRun_AnyFileExFlags () at /tmp/build/80754af9/python_1578429706181/work/Python/pythonrun.c:81
#18 0x000055555576cd53 in run_file (p_cf=0x7fffffffdddc, filename=0x5555558a76c0 L"gdpr_topic_modelling.py", fp=0x5555558f5110) at /tmp/build/80754af9/python_1578429706181/work/Modules/main.c:340
#19 Py_Main () at /tmp/build/80754af9/python_1578429706181/work/Modules/main.c:811
#20 0x00005555556373be in main () at /tmp/build/80754af9/python_1578429706181/work/Programs/python.c:69
#21 0x00007ffff77e6b97 in __libc_start_main (main=0x5555556372d0 <main>, argc=2, argv=0x7fffffffdfe8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdfd8) at ../csu/libc-start.c:310
#22 0x0000555555716084 in _start () at ../sysdeps/x86_64/elf/start.S:103

Originally posted by @g3rfx in #40 (comment)

Ruby Library

Hey @bab2min, thanks for this awesome library! Just wanted to let you know there are now Ruby bindings for it. The code and docs were incredibly easy to follow.

If you have any feedback, feel free to let me know. Thanks!

How can I get the K2 distribution of topics out of PAModel

Currently, PAModel.infer(doc) returns only the K1 numbers plus the likelihood. I am trying to build an index, and I am having difficulty getting the K2 numbers out of the infer() method. I looked at the documentation, and the infer method is directly inherited from LDA. Is there a way to get the document x topic matrix for both K1 and K2 topics?
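
(For reference: the 0.5.0 history above notes that PAModel.infer returns both distributions and that Document gained get_sub_topics and get_sub_topic_dist; a minimal sketch, assuming `pa_mdl` is a trained tomotopy.PAModel and `words` is a token list:)

doc = pa_mdl.make_doc(words)
pa_mdl.infer(doc)
k2_dist = doc.get_sub_topic_dist()  # distribution over the K2 sub-topics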

Return value of document.get_topic_dist does not sum to 1.0

I expected the return value of Document.get_topic_dist() to sum to 1.0 (within some epsilon). Yet that's not what I see after training an LDA model.

>>> sum(lda.docs[52904].get_topic_dist())
0.05711954347498249

There are many documents in my model where the sum is not close to 1.0.

Could you please explain what the values returned by Document.get_topic_dist() represent?

Apologies if this is my misunderstanding.
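
(For reference: the 0.11.0 history above notes that a normalize argument was later added to get_topic_dist() to control normalization of the result; a minimal sketch:)

dist = lda.docs[52904].get_topic_dist(normalize=True)
print(sum(dist))  # assumed to be close to 1.0 when normalized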

Issue in tomotopy

I have installed tomotopy, but when I try to use it for LDA modeling it shows the error
AttributeError: module 'tomotopy' has no attribute 'LDAmodel'
