Interface for easier topic modelling.

Home Page: https://machine-intelligence-laboratory.github.io/TopicNet

License: MIT License


topicnet's Introduction

TopicNet

A high-level interface developed by the Machine Intelligence Laboratory for the BigARTM library.

What is TopicNet

The TopicNet library was created to assist in the task of building topic models. It aims to automate the model training routine, freeing more time for the artistic process of constructing a target functional for the task at hand.

Consider using TopicNet if:

  • you want to explore BigARTM functionality without writing extra boilerplate;
  • you need help with rapid prototyping of solutions;
  • you want to build a good topic model quickly (out of the box, with default parameters);
  • you have an ARTM model at hand and you want to explore its topics.

TopicNet provides an infrastructure for your prototyping with the help of the Experiment class, and helps you observe the results of your actions via the viewers module.

Example of a two-stage experiment scheme. At the first stage, a regularizer with a parameter taking values in some range is applied. The best models after the first stage are Model 1 and Model 2, so Model 3 no longer takes part in the training process. The second stage applies another regularizer, with its parameter again taking values in some range. As a result of this stage, two descendant models of Model 1 and two descendant models of Model 2 are obtained.

And here is sample code for a TopicNet baseline experiment:

from topicnet.cooking_machine.config_parser import build_experiment_environment_from_yaml_config
from topicnet.cooking_machine.recipes import ARTM_baseline as config_string


config_string = config_string.format(
    dataset_path      = '/data/datasets/NIPS/dataset.csv',
    modality_list     = ['@word'],
    main_modality     = '@word',
    specific_topics   = [f'spc_topic_{i}' for i in range(19)],
    background_topics = [f'bcg_topic_{i}' for i in range(1)],
)
experiment, dataset = (
    build_experiment_environment_from_yaml_config(
        yaml_string   = config_string,
        experiment_id = 'sample_config',
        save_path     = 'sample_save_folder_path',
    )
)
experiment.run(dataset)
best_model = experiment.select('PerplexityScore@all -> min')[0]

How to Start

Define a TopicModel from an ARTM model at hand, or with the help of the model_constructor module, where you can set the model's main parameters. Then create an Experiment, assigning the root position to this model and a path to store the experiment. Further, you can define a set of training stages using the functionality provided by the cooking_machine.cubes module.

Further documentation is available here.

Installation

The core library functionality is based on the BigARTM library, so BigARTM should also be installed on the machine. Fortunately, the installation process is not too difficult nowadays. Detailed explanations are given below.

Via Pip

The easiest way to install everything is via pip (though currently this works smoothly only for Linux users!):

pip install topicnet

The command installs the BigARTM library as well, not only TopicNet. However, the BigARTM Command Line Utility will not be assembled: pip installation makes it possible to use BigARTM only through the Python interface.

If working on Windows or Mac, you should install BigARTM yourself first; then pip install topicnet will work just fine. We hope to bring all-in-pip installation support to these systems as well. For now, you may find the following guide useful.

BigARTM for Non-Linux Users

To avoid installing BigARTM yourself, you can use Docker images with different versions of the BigARTM library preinstalled:

docker pull xtonev/bigartm:v0.10.0
docker run -t -i xtonev/bigartm:v0.10.0

To check that everything was installed successfully:

$ python

>>> import artm
>>> artm.version()

Alternatively, you can follow the BigARTM installation manual. There are also a couple of tips which may provide additional help for Windows users:

  1. Go to the installation page for Windows and download the 7z archive from the Downloads section.
  2. Use Anaconda's conda install to download all the Python packages that BigARTM requires.
  3. Path variables must be set through the GUI window of system variables; if the PYTHONPATH variable is missing, add it to the system-wide variables. Then close the GUI window.

After setting up the environment you can fork this repository or use pip install topicnet to install the library.

From Source

One can also install the library from GitHub, which may give more flexibility for development (for example, for making one's own viewers or regularizers part of the module as .py files):

git clone https://github.com/machine-intelligence-laboratory/TopicNet.git
cd TopicNet
pip install .

Google Colab & Kaggle Notebooks

Since the Linux installation can be done solely with pip, TopicNet can be used in online services such as Google Colab and Kaggle Notebooks. All you need is to run the following command in a notebook cell:

! pip install topicnet

There is also a notebook for Google Colab made by Nikolay Gerasimenko, where BigARTM is built from source. This may be useful, for example, if you want to use the BigARTM Command Line Utility.

Usage

Let's say you have a handful of raw texts mined from some source and you want to perform some topic modelling on them. Where should you start?

Data Preparation

Every ML problem starts with a data preprocessing step. TopicNet does not perform data preprocessing itself; instead, it expects the data to be prepared by the user and loaded via the Dataset class. Here is a basic example of how one can achieve that: rtl_wiki_preprocessing.
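If your texts are not yet in this form, the snippet below is a minimal sketch of preparing a loadable file (the column names id, raw_text, and vw_text follow the library's examples; treat the exact format as something to verify against the documentation):

import pandas as pd

from topicnet.cooking_machine.dataset import Dataset

# A tiny hypothetical collection: "id" identifies a document,
# "raw_text" keeps the original text, and "vw_text" holds
# the Vowpal Wabbit representation with @-prefixed modalities.
documents = pd.DataFrame({
    'id': ['doc_1', 'doc_2'],
    'raw_text': ['topic models are great', 'bigartm is fast'],
    'vw_text': [
        'doc_1 |@lemmatized topic:1 model:1 great:1',
        'doc_2 |@lemmatized bigartm:1 fast:1',
    ],
})
documents.to_csv('my_dataset.csv', index=False)

dataset = Dataset('my_dataset.csv')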

Training a Topic Model

Here we can finally get to the main part: making your own, best-of-them-all, manually crafted topic model.

Get Your Data

We need to load our data prepared previously with Dataset:

from topicnet.cooking_machine.dataset import Dataset

DATASET_PATH = '/Wiki_raw_set/wiki_data.csv'
dataset = Dataset(DATASET_PATH)

Make an Initial Model

In case you want to start from a fresh model, we suggest you use this code:

from topicnet.cooking_machine.model_constructor import init_simple_default_model

artm_model = init_simple_default_model(
    dataset=dataset,
    modalities_to_use={'@lemmatized': 1.0, '@bigram': 0.5},
    main_modality='@lemmatized',
    specific_topics=14,
    background_topics=1,
)

Note that here we have a model with two modalities: '@lemmatized' and '@bigram'. Further, if needed, one can define a custom score to be calculated during model training.

import numpy as np

from topicnet.cooking_machine.models.base_score import BaseScore


class CustomScore(BaseScore):
    def __init__(self):
        super().__init__()

    def call(self,
             model,
             eps=1e-5,
             n_specific_topics=14):

        phi = model.get_phi().values[:,:n_specific_topics]
        specific_sparsity = np.sum(phi < eps) / np.sum(phi < 1)

        return specific_sparsity

Now, a TopicModel with the custom score can be defined:

from topicnet.cooking_machine.models.topic_model import TopicModel


custom_scores = {'SpecificSparsity': CustomScore()}
topic_model = TopicModel(artm_model, model_id='Groot', custom_scores=custom_scores)

Define an Experiment

For further model training and tuning, an Experiment is necessary:

from topicnet.cooking_machine.experiment import Experiment


experiment = Experiment(experiment_id="simple_experiment", save_path="experiments", topic_model=topic_model)

Toy with the Cubes

Define the next stage of model training, selecting the decorrelator parameter:

import artm

from topicnet.cooking_machine.cubes import RegularizersModifierCube


my_first_cube = RegularizersModifierCube(
    num_iter=5,
    tracked_score_function='PerplexityScore@lemmatized',
    regularizer_parameters={
        'regularizer': artm.DecorrelatorPhiRegularizer(name='decorrelation_phi', tau=1),
        'tau_grid': [0, 1, 2, 3, 4, 5],
    },
    reg_search='grid',
    verbose=True,
)

my_first_cube(topic_model, dataset)

Selecting the model with the best perplexity score:

perplexity_criterion = 'PerplexityScore@lemmatized -> min COLLECT 1'
best_model = experiment.select(perplexity_criterion)[0]
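To sanity-check the selection, it may also be useful to look at the score history tracked during training. A minimal sketch, assuming scores on a TopicModel are exposed as a name-to-history mapping, as in the library's examples:

# Score values are tracked per training iteration,
# so the last value corresponds to the final state of the model.
perplexity_history = best_model.scores['PerplexityScore@lemmatized']
print(perplexity_history[-1])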

Alternatively: Use Recipes

If you need a topic model now, you can use one of the code snippets we call recipes.

from topicnet.cooking_machine.recipes import BaselineRecipe


training_pipeline = BaselineRecipe()
EXPERIMENT_PATH = '/home/user/experiment/'

training_pipeline.format_recipe(dataset_path=DATASET_PATH)
experiment, dataset = training_pipeline.build_experiment_environment(save_path=EXPERIMENT_PATH)

After that, the experiment environment is built, and the training pipeline is ready to run.
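The built environment can then be used in the same way as in the baseline example at the beginning of this README:

# Run the pre-configured pipeline and pick the best model by perplexity.
experiment.run(dataset)
best_model = experiment.select('PerplexityScore@all -> min')[0]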

View the Results

Browsing the model is easy: create a viewer and call its view() method (or view_from_jupyter(), which is advised if working in Jupyter Notebook):

from topicnet.viewers import TopTokensViewer


toptok_viewer = TopTokensViewer(best_model, num_top_tokens=10, method='phi')
toptok_viewer.view_from_jupyter()

More info about different viewers is available here: viewers.

FAQ

In the examples, modalities are written like @modality. Is this the Vowpal Wabbit format?

It is a convention for designating modalities in the data with the @ sign, which TopicNet took from BigARTM.
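For illustration, a single document line in this format might look as follows (the tokens here are hypothetical; each |@modality block starts a new modality, and the numbers after the colons are token counts):

doc_42 |@lemmatized cat:2 dog:1 |@bigram cat_dog:1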

CubeCreator helps to perform a grid search over initial model parameters. How can I do it with modalities?

The modality search space can be defined using standard library logic, like:

from topicnet.cooking_machine.cubes import CubeCreator

class_ids_cube = CubeCreator(
    num_iter=5,
    parameters=[
        {
            'name': 'class_ids',
            'values': {
                '@text':   [1, 2, 3],
                '@ngrams': [4, 5, 6],
            },
        },
    ],
    reg_search='grid',
    verbose=True,
)

However, for the case of modalities, a couple of slightly more convenient forms are available:

parameters = [
    {
        'name':   'class_ids@text',
        'values': [1, 2, 3],
    },
    {
        'name':   'class_ids@ngrams',
        'values': [4, 5, 6],
    },
]

or, equivalently:

parameters = [
    {
        'class_ids@text':   [1, 2, 3],
        'class_ids@ngrams': [4, 5, 6],
    },
]

Contribution

If you find a bug, or would like the library to have a new feature, you are welcome to contact us or to create an issue or a pull request!

It is also worth noting that the TopicNet library is always open to improvements in several areas:

  • New custom regularizers.
  • New topic model scores.
  • New topic models or recipes to train topic models for a particular task/with some special properties.
  • New datasets (so as to make them available for everyone to download and conduct experiments with topic models).

Citing TopicNet

When citing topicnet in academic papers and theses, please use this BibTeX entry:

@InProceedings{bulatov-EtAl:2020:LREC,
  author    = {Bulatov, Victor  and  Alekseev, Vasiliy  and  Vorontsov, Konstantin  and  Polyudova, Darya  and  Veselova, Eugenia  and  Goncharov, Alexey  and  Egorov, Evgeny},
  title     = {TopicNet: Making Additive Regularisation for Topic Modelling Accessible},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month     = {May},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {6747--6754},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.833}
}

topicnet's People

Contributors

alvant, bt2901, dmitriyvaletov, eugeniaveselova, evgeny-egorov-projects, oksanadanilova, tatyanagreenkina, xtonev


topicnet's Issues

Get document-topic representation

Hi, thanks for your implementation.

I'm wondering whether we can get document-topic and word-topic representations.
For each document or each word, the representation should be a vector with length equal to the number of topics.

Thanks!
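For what it's worth, BigARTM models expose exactly these matrices, so with a trained TopicModel something along these lines should work (a sketch; the document id 'doc_1' is hypothetical, and theta availability may depend on theta caching being enabled):

# Phi: token-topic probabilities; Theta: topic-document probabilities.
phi = topic_model.get_phi()      # rows are tokens, columns are topics
theta = topic_model.get_theta()  # rows are topics, columns are documents

# The topic vector of one document (length equals the number of topics):
doc_vector = theta['doc_1']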

No module named 'dill'

Install TopicNet, run Python

conda create -n test python=3.6
conda activate test
pip install topicnet
python

Try importing TopicModel:

from topicnet.cooking_machine.models import TopicModel

You get

ModuleNotFoundError: No module named 'dill'
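Until the missing dependency is fixed in the package itself, installing dill manually is a likely workaround:

pip install dill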

TopDocumentViewer

This issue was created to discuss the possibility of a viewer that would show how documents and a topic's key words are linked to each other.
At the moment there are two separate viewers. TopDocumentViewer outputs the technical name of a topic (which in 99.(9)% of this viewer's use cases is just its ID) and example documents for that topic, so in fact the texts are linked to an ID that says very little.
And TopTokensViewer outputs, again, this ID and its key words.

In practice, when I analyze the adequacy of a model "by hand", I naturally print example texts for each topic and read through them. Topic_id is not a characteristic of a topic, so I cannot assess the adequacy of the model with this viewer. What characterizes a topic is its set of key words, and that is what I would like to see next to the list of printed documents.

The lack of this possibility has kept me from finally switching from my own hand-written frameworks to TopicNet.

Demo/Guide for custom regularizers

@Guince suggested writing a more or less detailed guide about how one can create a regularizer:

  • what one should do, given the M-step
  • how to test that everything is correct and workable
  • (*) how to wrap some Bayesian logic into a regularizer (maybe a bit hardcore, but seems a good thing to do)

Cached dict in Dataset breaks things


The Dictionary filter() function changes the dictionary in-place.
So currently it is not possible to do:

dictionary = dataset.get_dictionary()
dictionary.filter(min_df_rate=1.0)
...
dictionary = dataset.get_dictionary()  # this guy is already spoiled 
dictionary.filter(min_df_rate=0.1)
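Until the caching is fixed, a possible workaround is to re-create the Dataset so that a fresh, unfiltered dictionary is gathered again (a sketch, under the assumption that a new instance does not share the cached dictionary; DATASET_PATH is a placeholder):

from topicnet.cooking_machine.dataset import Dataset

dictionary = Dataset(DATASET_PATH).get_dictionary()
dictionary.filter(min_df_rate=1.0)

# A new Dataset instance is not affected by the in-place filter above:
another_dictionary = Dataset(DATASET_PATH).get_dictionary()
another_dictionary.filter(min_df_rate=0.1)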

Fix multiprocessing in tests

Currently, multiprocessing may cause tests to fail from time to time.
It means that builds may be red even when everything is OK with the code.
The cause is not yet clear, but there are several hypotheses about what may be behind this instability:

  • multiprocessing + pytest
  • too many experiments with the same name are conducted on too little text collection

Apparently, this problem arises at the intersection of multiprocessing and pytest, so it is unlikely that anyone besides us ever sees it.
Another possible cause is training a large number of experiments with the same name on a very small text collection, which is also not a typical use case for the library.

E. Egorov

error in 20NG-GenSim vs TopicNet

[Errno 21] Is a directory: '20_News_dataset/train_preprocessed.csv'

As far as I understand, a folder for logging is created here. Just in case, I set my own path

EXPERIMENT_PATH = 'test//'

but the error remained.

Invalid root folder name for Windows

When I init the experiment, it tries to create a folder with illegal symbols in its name:
experiment = Experiment(experiment_id="simple_experiment", save_path="experiments", topic_model=topic_model )

and I get
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'experiments\\simple_experiment\\<<<<<<<<<<<root>>>>>>>>>>>'

I tried to change model_id manually, topic_model.model_id = 'root', but it changed nothing.

Dataset_manager - Connection refused

For some reason, dataset_manager cannot reach your server from Colab:
URLError: <urlopen error [Errno 111] Connection refused>

Code:
from topicnet import dataset_manager
from IPython.display import display, Markdown
display(Markdown(dataset_manager.get_info()))

Error:
URLError: <urlopen error [Errno 111] Connection refused>

And with dataset_manager.load_dataset('RTL_Wiki'), there is also an error about an undefined variable in the finally block:
UnboundLocalError: local variable 'save_path' referenced before assignment

Making-Decorrelation-and-Topic-Selection-Friends.ipynb doesn't work in Google Colab from scratch

!pip install topicnet
TopicNet in /usr/local/lib/python3.7/dist-packages (0.8.0)

%%time

enable_hide_warnings()

models = decorrelating_cube(topic_model, dataset)  # TODO: nice cube output?

This cell produces the following warnings:

/usr/local/lib/python3.7/dist-packages/dill/_dill.py:1845: PicklingWarning: Cannot locate reference to <class '_ctypes.PyCFuncPtrType'>.
  warnings.warn('Cannot locate reference to %r.' % (obj,), PicklingWarning)
/usr/local/lib/python3.7/dist-packages/dill/_dill.py:1847: PicklingWarning: Cannot pickle <class '_ctypes.PyCFuncPtrType'>: _ctypes.PyCFuncPtrType has recursive self-references that trigger a RecursionError.
  warnings.warn('Cannot pickle %r: %s.%s has recursive self-references that trigger a RecursionError.' % (obj, obj.__module__, obj_name), PicklingWarning)
/usr/local/lib/python3.7/dist-packages/topicnet/cooking_machine/models/topic_model.py:427: UserWarning: Failed to save custom score "ActiveTopicNumberScore" correctly! Freezing score (saving only its value)
  f'Failed to save custom score "{score_object}" correctly! '

And the following cell fails:

best_model, DECORRELATION_TAU = select_best_model(
    topic_model.experiment, CRITERION, DECORRELATE_SPECIFIC
)

TypeError                                 Traceback (most recent call last)

(... 6 frames ...)

/usr/local/lib/python3.7/dist-packages/topicnet/cooking_machine/models/scores_wrapper.py in add(self, score)
     54     def add(self, score: Union[BaseScore, artm.scores.BaseScore]):
     55         if isinstance(score, FrozenScore):
---> 56             raise TypeError('FrozenScore is not supposed to be added to model')
     57
     58         elif isinstance(score, BaseScore):

TypeError: FrozenScore is not supposed to be added to model

save_experiment parameter seems not working (at least in some cases)

import topicnet

from topicnet.cooking_machine.cubes import CubeCreator
from topicnet.cooking_machine.experiment import Experiment

dataset = topicnet.cooking_machine.Dataset(...)
model = topicnet.cooking_machine.models.TopicModel(...)

cube = CubeCreator(
    num_iter=10,
    parameters={
        "seed": [11221963],
        "num_topics": [5, 10]
    }
)
exp = Experiment(
    model,
    experiment_id='exp1',
    save_path='should_not_be_created',
    save_experiment=False
)

cube(model, dataset)

Result: the folder specified in the experiment init (should_not_be_created) exists
Expected: there shouldn't be any such folder, because save_experiment=False

Dataset's dictionary not updated if one changes the collection dynamically

  • Create a dataset
  • Call dataset.get_dictionary()
  • Change the dataset's _data by renaming one of the modalities (e.g. lemmatized -> new_lemmatized)
  • Try to build a topic model using the dataset

Result: old modality in model's Phi
Expected: new modality in Phi

P.S.
One should also check that dataset._modalities is up-to-date

Change config format

There are three points here.

  1. The usage of CommaSeparated(Int()) / CommaSeparated(Str()) instead of Seq() should make files more concise and better readable. More info: crdoconnor/strictyaml#70

  2. Build bigARTM-related schemas automatically, probably by parsing numpydoc-style docstrings.

  3. Allow defining regularizers inline (should be trivial after some slight refactoring).

Make viewers interface more uniform

The exact details aren't clear yet, but each viewer needs three distinct methods:

  • method returning JSON-like structure, used internally in other methods
  • method saving something fancy (html-like) to a hard drive
  • method to visualize result inside Jupyter Notebook (could just raise NotImplementedError for now)
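A minimal sketch of what such a uniform base interface might look like (all names here are placeholders, not the library's actual API):

class BaseViewer:
    def to_dict(self):
        # JSON-like structure; used internally by the other methods.
        raise NotImplementedError

    def save_html(self, path):
        # Save something fancy (HTML-like) to the hard drive.
        raise NotImplementedError

    def view_from_jupyter(self):
        # Visualize the result inside Jupyter Notebook.
        raise NotImplementedError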

Demo/Guide for CubeCreator

  • what is it for?
  • which parameters one can (or is encouraged to) vary using this cube (e.g. the number of topics may be better found using the OptimalNumberOfTopics repository 🙂)?
  • how to find the best modality weights (rel2abs?, ...)
  • what is the second level? how to use it?

Code in Making-Decorrelation-and-Topic-Selection-Friends does not run

Maybe this question is not really for you, but you have surely dealt with it.

I cannot figure out how to run the script that computes co-occurrences.

Here is the code:

cd <working_directory path> && <path to the folder containing bigartm>/bigartm/build/bin/bigartm
-c train_vw_natural_order.txt
-v vocab.txt
--cooc-window 10
--cooc-min-tf 1
--write-cooc-tf cooc_tf_
--cooc-min-df 1
--write-cooc-df cooc_df_
--write-ppmi-tf ppmi_tf_
--write-ppmi-df ppmi_df_

On Linux I cannot find this bigartm executable at all: there is nothing like /bigartm/build/bin/bigartm anywhere near the library folder, but then, I installed the library simply with pip install bigartm10.
I also tried on Windows; your library does not install there, but I copied all the files. The line with -c train_vw_natural_order.txt executed and even created something, but the rest would not run at all. I tried running them line by line and all together as one line, as described here: https://bigartm.readthedocs.io/en/stable/tutorials/bigartm_cli.html

Still, none of this works. You must have run this somehow; can you tell me what I am doing wrong?
Maybe the problem is that bigartm should not have been installed via pip...

Refactor of `reg_search` and `tracked_score_function`

tracked_score_function should have a default value of "PerplexityScore@all". Also, it should possibly be an attribute of Strategy rather than of Cube.

reg_search should be renamed to something more informative. Also, it should possibly be an attribute of Strategy rather than of Cube.

Takes a lifetime to initialize a really big Dataset

  1. Create some big .csv file (several hundred thousand rows should be enough; maybe fewer would also do)
  2. Try dataset = Dataset(csv_table_path, keep_in_memory=False)


Result: the dataset is not ready after a reasonable amount of time.

P.S.
After this one is fixed, one should check that dataset.get_batch_vectorizer() is also not too slow.

Relative modality weights in CubeCreator

Currently, CubeCreator supports only absolute weights (am I right 🙂?). It seems that relative weights are more useful (especially since init_simple_default_model requires relative weights as input). Also, the README should be updated (the part at the end about modality weights): one should emphasize that currently the weights have to be absolute.

Related: #62

Note in README about relative weights (modality & regularizer)

It is worth noting that relative weights are handy, and the library provides simple ways to use them:

  • A more intuitive way to choose modality weights (not just some random values virtually out of nowhere).
  • Cubes use relative weights by default, so the recommended range for regularizer weights is between 0 and 1, not ~10000 (though other values are also possible).

Related: #62

Dataset "ruwiki_good" does not want to be downloaded

Well, the dataset is currently unavailable. This should be fixed: load_dataset('ruwiki_good'). Or it should at least download and tell where the .txt file lies (so that it would be possible to do something with the file manually).

If you try this:

>>> d = load_dataset('ruwiki_good')

you get something like this:

Checking if dataset "ruwiki_good" was already downloaded before
Dataset "ruwiki_good" not found on the machine
Downloading the "ruwiki_good" dataset...
100%|█████████████████████████████████████████| 51.2M/51.2M [00:46<00:00, 1.10MiB/s]
Dataset downloaded! Save path is: "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/ruwiki_good.txt"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 132, in load_dataset
    raise exception
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 126, in load_dataset
    return Dataset(save_path, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 220, in __init__
    self._data = self._read_data(data_path)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 355, in _read_data
    data = data_handle.read_csv(
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 935, in read_csv
    kwds_defaults = _refine_defaults_read(
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 2063, in _refine_defaults_read
    raise ValueError(
ValueError: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.

OS is:

Linux mx 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux

Expected Result

The dataset is 1) downloaded and 2) ready to use for topic modeling.

Current "Workaround"

If you set sep='###' in this code:

data = data_handle.read_csv(
    data_path,
    engine='python',
    error_bad_lines=False,
    sep='\n',
    header=None,
    names=[VW_TEXT_COL]
)

then everything seems to work fine.

Clean up the optional dependencies

We have a number of dependencies which setuptools calls "extras": packages that are only needed to unlock additional functionality and are not used otherwise. Ideally, we should not require the user to install them.

It appears that we need to add something like the following to setup.py:

    extras_require={
        "custom_regularizers": ["numba"],
        "large_datasets": ["dask[dataset]"],
    }

and then check whether the optional package is installed (I'm not clear how exactly this works).
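The usual pattern for such a check is an import guard; a sketch:

# Degrade gracefully when an optional dependency is absent.
try:
    import numba
    HAS_NUMBA = True
except ImportError:
    HAS_NUMBA = False

# Code that needs numba can then test HAS_NUMBA and raise
# an informative error pointing at the corresponding extra.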

Then the user could install our library with the following syntax: pip install topicnet[custom_regularizers] (which will pull in numba).

PS: Naming suggestions are welcome.

A basic set of score trackers

In my practice, it is very useful to separate score trackers into background and specific (subject) ones.

There is a parameter such as sparsity, and I want it to grow for specific topics and to drop for background ones. At the same time, it is important that the specific topics do not get zeroed out entirely. Complete zeroing-out can be detected with 100% certainty if we have a score tracker for the specific topics only: if I see that it equals 1, I immediately understand that the model has gone off the rails.

This fits well into the model selector concept, if one specifies there a condition like "sparsity of specific topics is less than 1".
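Such a tracker fits naturally into the custom-score mechanism shown in the README above; a sketch (the 'spc_topic_' name prefix is an assumption borrowed from the baseline recipe):

import numpy as np

from topicnet.cooking_machine.models.base_score import BaseScore


class SpecificSparsityScore(BaseScore):
    # Sparsity computed over specific (non-background) topics only.
    def call(self, model, eps=1e-5):
        phi = model.get_phi()
        specific_topics = [t for t in phi.columns if t.startswith('spc_topic_')]
        specific_phi = phi[specific_topics].values

        # Fraction of near-zero entries; a value of 1 would mean
        # that all specific topics have degenerated to zeros.
        return np.sum(specific_phi < eps) / specific_phi.size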

Demo/Guide for ModifierCube

  • what is a strategy? is it important, or is the default value always OK? how to pick the right one?
  • what is tracked_score_function? is it important, or is the default value always OK? how to pick the right one?
  • ModifierCube vs ControllerCube: what are the differences and the use cases?

TopicNet's BaseRegularizer can't be used in cube settings?!

[screenshot of the error]

P.S. The regularizer in the picture is actually

class TopicPriorRegularizer(BaseRegularizer, artm.regularizers.BaseRegularizer):
    pass

not just

class TopicPriorRegularizer(BaseRegularizer):
    pass

I tried to fix things and make it work, but it was not so easy.

WinError: Failed to load artm shared library from artm.dll

I have been struggling with the installation of the original library, BigARTM, for quite a while now, on Windows 10, but with no success. I saw that my very issue had been raised over there more than a year ago without a response, and, seeing the vast number of open tickets and the lack of development, I sadly concluded that the BigARTM project seems to be failing and dead by now.

So I was enthusiastic to discover TopicNet and was hoping that installation on Windows had finally been made available and tested. But I see that the TopicNet installation relies exclusively on the underlying BigARTM installation, which still fails on Windows with the following error:

OSError: [WinError 126] The specified module could not be found
Failed to load artm shared library from `['C:\\BigARTM\\python\\artm\\wrapper\\..\\artm.dll', 'artm.dll']`.
Try to add the location of `artm.dll` file into your PATH system variable, or to set ARTM_SHARED_LIBRARY -
the specific system variable which may point to `artm.dll` file, including the full path.

Is there any way to discover this dll file for all of us, Windows users?
