bigartm / bigartm

Fast topic modeling platform

Home Page: http://bigartm.org/

License: Other

CMake 2.64% C++ 63.52% Python 30.17% Batchfile 0.15% Shell 0.04% C# 0.09% XSLT 2.91% C 0.48%
topic-modeling c-plus-plus python bigartm regularizer python-api text-mining machine-learning bigdata

bigartm's Introduction


The state-of-the-art platform for topic modeling.


What is BigARTM?

BigARTM is a powerful tool for topic modeling based on a novel technique called Additive Regularization of Topic Models. This technique effectively builds multi-objective models by adding weighted sums of regularizers to the optimization criterion. BigARTM is known to combine very different objectives well, including sparsing, smoothing, topic decorrelation and many others. Such a combination of regularizers significantly improves several quality measures at once, almost without any loss of perplexity.
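
For reference, the general form of the ARTM optimization criterion (as described in the additive regularization papers) is the collection log-likelihood plus a weighted sum of regularizers R_i:

\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \varphi_{wt} \theta_{td} \;+\; \sum_{i} \tau_i R_i(\Phi, \Theta) \;\to\; \max_{\Phi, \Theta}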

References

Related Software Packages

  • TopicNet is a high-level interface for BigARTM which is helpful for rapid solution prototyping and for exploring the topics of finished ARTM models.
  • David Blei's List of Open Source topic modeling software
  • MALLET: Java-based toolkit for language processing with topic modeling package
  • Gensim: Python topic modeling library
  • Vowpal Wabbit has an implementation of Online-LDA algorithm

Installation

Installing with pip (Linux only)

We have a PyPI release for Linux:

$ pip install bigartm

or

$ pip install bigartm10
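
A quick way to verify the installation from Python (a minimal sanity check; the artm.version() helper is assumed to be available in your build):

import artm
print(artm.version())   # prints the BigARTM library version if the native library loads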

Installing on Windows

We suggest using pre-built binaries.

It is also possible to compile the C++ code on Windows if you want the latest development version.

Installing on Linux / MacOS

Download a binary release or build from source using cmake:

$ mkdir build && cd build
$ cmake ..
$ make install

See here for detailed instructions.

How to Use

Command-line interface

Check out the documentation for bigartm.

Examples:

  • Basic model (20 topics, written to a CSV file, inferred in 10 passes)
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --write-model-readable model.txt
--passes 10 --batch-size 50 --topics 20
  • Basic model with fewer tokens (extreme values filtered out based on token frequency)
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt
  • Simple regularized model (increases sparsity up to 60-70%)
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20  --write-model-readable model.txt 
--regularizer "0.05 SparsePhi" "0.05 SparseTheta"
  • More advanced regularized model, with 10 sparse objective topics and 2 smooth background topics
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics obj:10;background:2 --write-model-readable model.txt
--regularizer "0.05 SparsePhi #obj"
--regularizer "0.05 SparseTheta #obj"
--regularizer "0.25 SmoothPhi #background"
--regularizer "0.25 SmoothTheta #background" 

Interactive Python interface

BigARTM provides a full-featured and clear Python API (see Installation to configure the Python API for your OS).

Example:

import artm

# Prepare data
# Case 1: data in CountVectorizer format
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from numpy import array

cv = CountVectorizer(max_features=1000, stop_words='english')
n_wd = array(cv.fit_transform(fetch_20newsgroups().data).todense()).T
vocabulary = cv.get_feature_names()

bv = artm.BatchVectorizer(data_format='bow_n_wd',
                          n_wd=n_wd,
                          vocabulary=vocabulary)

# Case 2: data in UCI format (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words)
bv = artm.BatchVectorizer(data_format='bow_uci',
                          collection_name='kos',
                          target_folder='kos_batches')

# Learn simple LDA model (or you can use advanced artm.ARTM)
model = artm.LDA(num_topics=15, dictionary=bv.dictionary)
model.fit_offline(bv, num_collection_passes=20)

# Print results
model.get_top_tokens()
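
For finer control than artm.LDA, the artm.ARTM class exposes regularizers and scores directly. A minimal sketch of a regularized model mirroring the command-line examples above (regularizer and score names follow the Python API; the tau values are purely illustrative):

# Sketch: regularized ARTM model with sparsing regularizers and a perplexity score
model = artm.ARTM(num_topics=20,
                  dictionary=bv.dictionary,
                  regularizers=[
                      artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.05),
                      artm.SmoothSparseThetaRegularizer(name='sparse_theta', tau=-0.05)],
                  scores=[artm.PerplexityScore(name='perplexity',
                                               dictionary=bv.dictionary)])
model.fit_offline(bv, num_collection_passes=20)
print(model.score_tracker['perplexity'].last_value)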

Refer to the tutorials for details on how to start using BigARTM from Python; the user's guide provides information about more advanced features and cases.

Low-level API

Contributing

Refer to the Developer's Guide and follow the Code Style.

To report a bug, use the issue tracker. To ask a question, use our mailing list. Feel free to make a pull request.

License

BigARTM is released under the New BSD License, which allows unlimited redistribution for any purpose (even commercial use) as long as its copyright notices and the license's disclaimers of warranty are maintained.


bigartm's Issues

Topic-document distribution nulling

@marinadudarenko - When using large tau coefficients in the sparsing DirichletPhi or DecorrelatorPhi regularizers, the final document-topic distributions are equal to zero. A possible cause may be the nulling of word-topic rows in the phi matrix.
@sashafrey - Let's also have a Score to count how many situations like this occurred during an iteration.

Improve RAM usage in processor.cc

Running BigARTM with 32 threads on the pubmed task for T=400 topics requires ~10 GB of RAM. My theory is that this RAM is consumed in Processor, where each thread creates its own copy of the phi_matrix (check processor.cc::InitializePhi()).

An alternative solution would be to create one global phi_matrix in merger, and use it in processor.

Note that tickets #46 and #43 are related to this one.

Note that it should be OK to only fix this for sparse_bow mode. In dense_bow mode it is OK to simply create a local phi matrix (as it works today).
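
A rough back-of-the-envelope check of this theory (the vocabulary size is an assumption based on the UCI Pubmed collection; the other numbers come from the description above):

# Rough estimate of memory consumed by per-thread phi copies
vocab_size = 141000       # approx. unique tokens in the UCI Pubmed collection (assumed)
topics = 400
threads = 32
bytes_per_float = 4

per_copy_gb = vocab_size * topics * bytes_per_float / 2**30
print('%.2f GB per copy, %.1f GB for 32 threads' % (per_copy_gb, per_copy_gb * threads))
# ~0.21 GB per copy, ~6.7 GB total -- the same order of magnitude as the observed ~10 GB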

Check version of google.protobuf when importing artm.messages_pb2

artm.messages_pb2 could import the wrong version of the protobuf Python module if one is installed on the user's machine. Importing a bad version of the protobuf module (e.g. 2.6.1 instead of 2.5.1) can result in strange behavior: zero perplexity, segmentation faults without any errors, etc.

The proposal is to check the version of google.protobuf when importing it in artm.messages_pb2 and raise an exception if it is incompatible.
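
A minimal sketch of such a guard (the required version string is an assumption; a real check might accept a range of compatible versions):

# Hypothetical guard at the top of artm/messages_pb2.py (sketch only)
import google.protobuf

REQUIRED_PROTOBUF_VERSION = '2.5.1'   # assumed required version

if google.protobuf.__version__ != REQUIRED_PROTOBUF_VERSION:
    raise ImportError(
        'artm.messages_pb2 was generated for google.protobuf %s, but %s is installed'
        % (REQUIRED_PROTOBUF_VERSION, google.protobuf.__version__))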

Submit the project to MLOSS

MLOSS is a catalog of open source software for machine learning. It would be useful to submit a short description of BigARTM with links to GitHub and readthedocs.

Also, there is an Open Source Software Track in JMLR, which is the right place to publish about BigARTM as software (not as an approach to topic modeling). Publication in JMLR OSS is a proof of maturity for the project, so we need to fill all the gaps in documentation and usability first.

Useful links to start:

Could not create logging file: No such process

Copied from #111:

sometimes I have an error on my screen with logging:
'Could not create logging file: No such process
COULD NOT CREATE A LOGGINGFILE 20150205-205823.2054!'
But this file is created.

I've also seen this issue. It happens inside glog (the C++ logging library from Google). I'm sure there are ways to solve this, but I would have to debug their code to find out. Let me know if this is critical; otherwise I'll just keep this ticket as a reminder and eventually it will be fixed (but not next month).

Typo in autogenerate tokens (CollectionParser)

if (token_map->empty()) {
  // Autogenerate some tokens
  for (int i = 0; i < num_tokens; ++i) {
    std::string token_keyword = boost::lexical_cast<std::string>(i);
    token_map->insert(std::make_pair(i, CollectionParserTokenInfo(token_keyword)));
  }
}

The for loop seems to be wrong; the upper limit should be num_unique_tokens instead of num_tokens.

Pass disk_path to InvokeIteration()

disk_path specifies a folder with batches to process during an iteration.
Currently MasterComponent accepts disk_path in the constructor, and disk_path can't be changed.
A much more flexible solution would be to pass disk_path to InvokeIteration(), allowing different collections to be used on different iterations.

Add python example for online_batch_processing mode

Online batch processing allows batches to be passed explicitly to BigARTM through protobuf messages in memory (bypassing saving and loading batches to disk). This mode should be documented with Python examples.
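
As a starting point, a rough sketch of constructing such a batch in memory (the Batch and Item field names are assumed from messages.proto and may differ between versions; how to submit the resulting message for processing is exactly what this ticket asks to document):

# Sketch: building a Batch protobuf message in memory (field names assumed)
import artm

batch = artm.messages.Batch()
batch.token.extend(['topic', 'model'])    # batch-level vocabulary

item = batch.item.add()                   # one document
item.id = 0
item.token_id.extend([0, 1])              # indices into batch.token
item.token_weight.extend([3.0, 1.0])      # term counts for this document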

Change DecorrelatorPhiConfig and SmoothSparsePhiConfig

Change DecorrelatorPhiConfig and SmoothSparsePhiConfig as follows:

DecorrelatorPhiConfig {
repeated string topic_name;
repeated string class_name;
}

SmoothSparsePhiConfig {
repeated string topic_name;
repeated string class_name;
optional string dictionary_name;
}
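
For context, at the Python API level this change corresponds to restricting a regularizer to particular topics and modalities, roughly as in the following sketch (parameter names follow the Python API; the dictionary name and topic names are hypothetical):

# Sketch: limiting regularizers to specific topics / modalities
import artm

decorrelator = artm.DecorrelatorPhiRegularizer(name='decorrelator', tau=1e5,
                                               class_ids=['@default_class'],
                                               topic_names=['topic_0', 'topic_1'])
smooth_phi = artm.SmoothSparsePhiRegularizer(name='smooth_phi', tau=0.1,
                                             class_ids=['@default_class'],
                                             dictionary='my_dictionary')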

Theta scores accumulation

Currently Theta scores are accumulated over the whole collection scan. This behavior is incorrect in online mode; accumulation should restart after every model.Synchronize().

Cover processor.cc with tests

The unit test should cover the following functionality of processor.cc:

  1. There are two computation modes in processor.cc: use_dense_bow=true and use_dense_bow=false. We need a unit test to ensure that they give exactly the same results.
  2. One important but tricky case for processor.cc is when not all tokens from the batch appear in the topic model. This case should be protected by a unit test. All such tokens should still be added to the ModelIncrement::token and ModelIncrement::class_id fields, but the corresponding ModelIncrement::operation_type should be set to ModelIncrement_OperationType_CreateIfNotExist. For all other tokens that already exist in the topic model, the processor should collect n_wt increments.

In order to test this in a simple way, Processor.cc should be refactored as follows:

void Processor::ThreadFunction() {
  ...
  std::for_each(model_names.begin(), model_names.end(), [&](ModelName model_name) {
    // This should turn into a separate function which takes the batch, topic model, etc.,
    // and produces a ModelIncrement. Then this function can be tested from a unit test.
  });
}

Running distributed BigARTM on Hadoop2 (YARN) cluster

Apache Hadoop is an infrastructure solution for processing big data. Hadoop is a de facto standard in the industry, and it is common to have a Hadoop-driven cluster in data-oriented companies (such as Yandex, Mail.ru, Spotify, and many others).

Starting from version 2, Hadoop became not just a MapReduce engine but a complete resource-management platform that can allocate resources (RAM, machines, CPUs) and schedule users' applications.

I'm not sure about security issues, but it is possible to run native code on Hadoop: people have run LDA parallelized with MPI on a Hadoop cluster. I think the one restriction is that the binary must be standalone with its own libraries nearby, and they should be built on a developer machine compatible with the cluster (for binary compatibility).

There is a series of articles about Hadoop YARN and a good book: Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2.

Roadmap for implementing this task

  • Implement input/output using HDFS (distributed filesystem)
  • Implement ApplicationManager (Java) that allocates machines, uploads necessary binaries with shared objects and runs workers

Tutorial for linux users

Currently there is no tutorial for Linux users. The following commands are essential but can be found only in the travis-ci script:

  • cp build/3rdparty/protobuf-cmake/protoc/protoc 3rdparty/protobuf/src/
  • cd 3rdparty/protobuf/python && python setup.py build && sudo python setup.py install && cd ../../..
  • export PYTHONPATH=`pwd`/src/python:$PYTHONPATH
  • export ARTM_SHARED_LIBRARY=`pwd`/build/src/artm/libartm.so

The option without 'sudo python setup.py install' should also be documented.

Docword parsing problem.

About 4.8 million documents were found while parsing the Pubmed collection, though the info from UCI says there are 8.2 million.

Fix usage of model_config.class_weight in processor.cc

Currently model_config.class_weight is used as a factor to multiply the row of the phi_matrix. This means that class_weight actually has no effect, because it goes into both the numerator AND the denominator of this blas->saxpy() call:

float p_wd_val = blas->sdot(topics_count, &phi_matrix(w, 0), 1, &(*theta_matrix)(0, d), 1);
blas->saxpy(topics_count, sparse_nwd.val()[i] / p_wd_val, &(*theta_matrix)(0, d), 1, &(*n_wt)(w, 0), 1);

To verify this theory we should cover processor.cc with tests, and verify that changing class_weight actually has an effect on the resulting n_wt increments.

This task depends on #43 (processor should be covered with tests before any further refactoring)

Ensure repeatable results from run to run

Currently there are several places in BigARTM that cause non-deterministic behaviour. This can be a big issue for some users of BigARTM, and this also makes it more difficult for us to debug BigARTM and fix issues.

It should be possible to ensure deterministic behavior in the following scenario:

  1. The user sets a deterministic initial approximation of the Phi matrix.
  2. The user does not use the online algorithm, i.e. no ModelSynchronize() happens during an ongoing iteration.
    In this case the concurrency of BigARTM should not cause any non-deterministic behavior. The only remaining source of non-determinism is the random initialization of theta vectors in processor.cc. A simple alternative would be to use the uniform distribution p(t|d) = 1/T, where T is the number of topics (see the sketch below). Let's verify whether this converges as well as random initialization, and permanently change the processor to use this option.
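
A minimal illustration of the two initialization strategies (numpy is used only for illustration; the real code lives in processor.cc):

import numpy as np

T = 10                                   # number of topics
rng = np.random.default_rng()
theta_random = rng.random(T)
theta_random /= theta_random.sum()       # random p(t|d): differs from run to run
theta_uniform = np.full(T, 1.0 / T)      # proposed deterministic p(t|d) = 1/T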

Problem with theta_snippet_score and logging.

Hello!

I am trying to use BigARTM on Mac OS.
I have a problem with theta_snippet_score: it doesn't work.

In example02_parse_collection.py it writes 'Snippet of theta matrix:' at the end and nothing more.

Also, sometimes I have an error on my screen with logging:
'Could not create logging file: No such process
COULD NOT CREATE A LOGGINGFILE 20150205-205823.2054!'
But this file is created.

Hang in rpcz::application::create_rpc_channel()

I've seen that in some cases RPCZ hangs in rpcz::application::create_rpc_channel (call stack below).
We should enhance this method to accept a connection timeout.

[External Code] 
artm.dll!zmq::signaler_t::wait(int timeout_) Line 209   C++
artm.dll!zmq::mailbox_t::recv(zmq::command_t * cmd_, int timeout_) Line 72  C++
artm.dll!zmq::socket_base_t::process_commands(int timeout_, bool throttle_) Line 915    C++
artm.dll!zmq::socket_base_t::send(zmq::msg_t * msg_, int flags_) Line 736   C++
artm.dll!s_sendmsg(zmq::socket_base_t * s_, zmq_msg_t * msg_, int flags_) Line 354  C++
artm.dll!rpcz::send_empty_message(zmq::socket_t * socket, int flags) Line 108   C++
artm.dll!rpcz::connection_manager::connect(const std::basic_string<char,std::char_traits<char>,std::allocator<char> > & endpoint) Line 463  C++
artm.dll!rpcz::application::create_rpc_channel(const std::basic_string<char,std::char_traits<char>,std::allocator<char> > & endpoint) Line 60   C++
artm.dll!artm::core::Instance::Reconfigure(const artm::MasterComponentConfig & master_config) Line 335  C++
artm.dll!artm::core::Instance::Instance(const artm::MasterComponentConfig & config, artm::core::InstanceType instance_type) Line 80 C++
artm.dll!artm::core::NodeControllerServiceImpl::CreateOrReconfigureInstance(const artm::MasterComponentConfig & request, rpcz::reply<artm::core::Void> response) Line 34    C++
artm.dll!artm::core::NodeControllerService::call_method(const google::protobuf::MethodDescriptor * method, const google::protobuf::Message & request, rpcz::server_channel * channel) Line 491  C++
artm.dll!rpcz::proto_rpc_service::dispatch_request(const std::basic_string<char,std::char_traits<char>,std::allocator<char> > & method, const void * payload, unsigned __int64 payload_len, rpcz::server_channel * channel_) Line 136   C++
artm.dll!rpcz::server::handle_request(const rpcz::client_connection & connection, rpcz::message_iterator & iter) Line 209   C++
artm.dll!rpcz::worker_thread(rpcz::connection_manager * connection_manager, zmq::context_t * context, std::basic_string<char,std::char_traits<char>,std::allocator<char> > endpoint) Line 170   C++
artm.dll!boost::detail::thread_data<boost::_bi::bind_t<void,void (__cdecl*)(rpcz::connection_manager * __ptr64,zmq::context_t * __ptr64,std::basic_string<char,std::char_traits<char>,std::allocator<char> >),boost::_bi::list3<boost::_bi::value<rpcz::connection_manager * __ptr64>,boost::_bi::value<zmq::context_t * __ptr64>,boost::_bi::value<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > > > >::run() Line 118    C++
[External Code] 

Improve performance in processor.cc

Experiments with the py_artm library indicate that the performance of BigARTM's Processor component can be improved by 10 to 100 times using vectorized operations and Intel MKL.

The Processor component takes as input one Batch (n_dw) and a Phi matrix (p_wt). The goal of the Processor component is to infer the Theta matrix (p_dt) and to calculate the n_wt increments. The same operation is implemented in the py_artm library (https://pypi.python.org/packages/source/p/py-artm/py-artm-0.1.1.tar.gz) by two different approaches:

  1. By a dense representation of the n_dw, p_wt and p_dt matrices
  2. By a sparse representation of the n_dw matrix and dense representations of the p_wt and p_dt matrices

Ideally BigARTM should have both algorithms (for dense and for sparse n_dw). Sparsity of the other matrices can improve performance even further, but I suggest keeping this for future enhancements.
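
For reference, the dense variant of this operation can be sketched in a few lines of numpy (an illustration only; the variable names follow the notation above):

# Dense, vectorized sketch of the Processor operation: infer theta for one batch
# and accumulate n_wt increments, given n_dw (D x W counts) and phi (W x T).
import numpy as np

def process_batch(n_dw, phi, num_inner_iters=10):
    D, W = n_dw.shape
    T = phi.shape[1]
    theta = np.full((D, T), 1.0 / T)                      # p(t|d), uniform start
    for _ in range(num_inner_iters):
        ratio = n_dw / np.maximum(theta @ phi.T, 1e-37)   # n_dw / p(w|d)
        theta *= ratio @ phi                              # E-step + theta update
        theta /= theta.sum(axis=1, keepdims=True)
    ratio = n_dw / np.maximum(theta @ phi.T, 1e-37)
    n_wt = (ratio.T @ theta) * phi                        # n_wt increments
    return theta, n_wt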

Documentation: rewrite all python examples in C++

Currently we have 8 nice examples for Python, and none for C++ (except for cpp_client, which is our 'internal' tool used for testing). It would be nice to rewrite all 8 examples from Python into C++ and create a simple walkthrough page in the documentation (similar to the Python tutorial).

Entire topic model is nulling when any of ModelConfig.class_weight is set to zero

Use multiple_classes_test.cc and replace class_weight for __custom_class:

model_config3.add_class_id("@default_class"); model_config3.add_class_weight(0.5f);
model_config3.add_class_id("__custom_class"); model_config3.add_class_weight(0.0f); // set to zero

This results in many warnings like this:
W0102 01:08:03.712239 5864 merger.cc:506] 10 of 10 topics have zero probability mass.

And also a degenerate topic model:
token0(@default_class): -1.#IO -1.#IO -1.#IO -1.#IO -1.#IO -1.#IO -1.#IO -1.#IO -1.#IO -1.#IO
token1(__custom_class): -1.#IO -1.#IO -1.#IO -1.#IO -1.#IO -1.#IO -1.#IO -1.#IO -1.#IO -1.#IO

Note that the entire topic model is degenerate (not just __custom_class).
Also note that very similar behavior happens when __custom_class is not listed in the ModelConfig.class_id field. This means that as soon as a class_id is included in the Batch it has to be reflected in the topic model! This is also a bug.

WORKAROUND: for now, use a very small number as the class weight (for example 0.001).
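
The same workaround expressed through the Python API (a sketch; class_ids is the standard way to set per-modality weights there):

# Workaround sketch: give the extra modality a tiny non-zero weight instead of 0.0
import artm

model = artm.ARTM(num_topics=10,
                  class_ids={'@default_class': 0.5, '__custom_class': 0.001})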

rpcz::connection_manager doesn't release sockets allocated in the MAIN thread

I found a tricky problem in the rpcz library. Notice that rpcz::connection_manager uses boost::thread_specific_ptr to allocate and deallocate ZeroMQ sockets. This is dangerous, because when a new socket is allocated on the MAIN thread it won't be deleted before RPCZ attempts to release the zmq::context_t. Releasing the context hangs if any socket hasn't been closed.

class connection_manager {
  boost::thread_specific_ptr<zmq::socket_t> socket_;   // <-- thread-specific storage
};

To provoke this, it is enough to call any rpcz stub service from the MAIN thread.

Example of the callstack when socket leaked:
artm_tests.exe!rpcz::connection_manager::get_frontend_socket
artm_tests.exe!rpcz::connection::send_request
artm_tests.exe!rpcz::rpc_channel_impl::call_method_full
artm_tests.exe!rpcz::rpc_channel_impl::call_method
artm_tests.exe!artm::core::NodeControllerService_Stub::CreateOrReconfigureDataLoader
artm_tests.exe!::operator()
artm_tests.exe!std::_Callable_obj<,0>::_ApplyX
artm_tests.exe!std::_Func_implstd::_Callable_obj<<lambda_fb14e41c2c43c258d62f1cd70a17e93e,0>,std::allocator >,void,artm::core::NodeControllerService_Stub &,std::_Nil,std::_Nil,std::_Nil,std::_Nil,std::_Nil,std::_Nil,std::_Nil,std::_Nil,std::_Nil,std::_Nil,std::_Nil>::_Do_call
artm_tests.exe!std::_Func_class::operator()
artm_tests.exe!artm::core::NetworkClientCollection::for_each_client
artm_tests.exe!artm::core::NetworkClientCollection::Reconfigure
artm_tests.exe!artm::core::MasterComponent::Reconfigure

For now the workaround is not to close the zmq_context (see file \src\artm\zmq_context.h).

Change SmoothSparseThetaConfig regularizer

Change SmoothSparseThetaConfig as follows:
SmoothSparseThetaConfig {
repeated string topic_name;
repeated float alpha_iter;
}

and remove DirichletThetaConfig.

alpha_iter should have the same length as the number of inner iterations in the model.

Remove MasterComponentConfig.online_batch_processing

As soon as #117 is fixed there will be no reason to keep online_batch_processing, and this option should therefore be removed.

The goal is to use AddBatch() for one single purpose - it must always invoke processing of the batch it adds. At the same time there is nothing wrong with calling InvokeIteration() on the same MasterComponent.

Improve performance of RegularizerInterface::RegularizeTheta

virtual bool RegularizeTheta(const Item& item, std::vector<float>* n_dt, int topic_size, int inner_iter, double tau)

Currently RegularizerInterface::RegularizeTheta accepts one item from the batch (as a std::vector<float> n_dt). For better performance it should accept the entire Theta matrix (as a DenseMatrix).

Gensim interface

Implement algorithms based on BigARTM as a gensim model.
See experiments on the English Wikipedia.

A BigARTM interface could look like this:

model = gensim.models.bigartm.RobustPLSA(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
model = gensim.models.bigartm.ARTM(corpus=mm, id2word=id2word, regularizers=[...], num_topics=100, passes=1)
...

Fix build errors on OS X

See the error message below.
Use the OS X environment on Travis CI to reproduce this issue.

5:~/workspace/bigartm/build$ make
[ 1%] Built target gflags-static
[ 1%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/symbolize.cc.o
/Users/romovpa/workspace/bigartm/3rdparty/glog/src/symbolize.cc:669:10: fatal error: 'config.h' file not found

include "config.h"

     ^

1 error generated.
make[2]: *** [3rdparty/glog/CMakeFiles/google-glog.dir/src/symbolize.cc.o] Error 1
make[1]: *** [3rdparty/glog/CMakeFiles/google-glog.dir/all] Error 2
make: *** [all] Error 2

Fix Make install for Linux

Currently the CMakeLists.txt files are not configured to produce meaningful make install scripts for Linux. This should be fixed. If I understand correctly, make install should put all relevant BigARTM binaries and shared libraries into the /usr/local/lib and /usr/local/bin folders. However, I'm not an expert on Linux. If you have any suggestions on the behavior of make install, please share your thoughts...
