machine-intelligence-laboratory / TopicNet
Interface for easier topic modelling.
Home Page: https://machine-intelligence-laboratory.github.io/TopicNet
License: MIT License
Perhaps this question is not for you, but you must have dealt with this.
I cannot figure out how to run the script for computing co-occurrences.
Here is the code:
cd <working_directory path> && <path to the folder containing bigartm>/bigartm/build/bin/bigartm
-c train_vw_natural_order.txt
-v vocab.txt
--cooc-window 10
--cooc-min-tf 1
--write-cooc-tf cooc_tf_
--cooc-min-df 1
--write-cooc-df cooc_df_
--write-ppmi-tf ppmi_tf_
--write-ppmi-df ppmi_df_
On Linux I cannot find this bigartm executable at all: there is nothing like /bigartm/build/bin/bigartm anywhere inside the library folder. But then, I installed the library simply with pip install bigartm10.
I tried to run it on Windows; your library does not install there, but I copied over all the files.
The -c train_vw_natural_order.txt part executed and even created something, but the remaining options would not run at all. I tried running them both line by line and all together on one line, as shown here: https://bigartm.readthedocs.io/en/stable/tutorials/bigartm_cli.html
None of that works either. You must have run this somehow; could you tell me what I am doing wrong?
Maybe the problem is that bigartm should not have been installed through pip...
I have been struggling with the installation of the original library, BigARTM, on Windows 10 for quite a while now, with no success. I saw the same issue raised there over a year ago without a response, and given the vast number of open tickets and the lack of development, I sadly concluded that BigARTM seems to be a failing, dead project by now.
So I was enthusiastic to discover TopicNet and hoped that installation on Windows was finally available and tested. But I see that TopicNet relies exclusively on the underlying BigARTM installation, which still fails on Windows with the following error:
OSError: [WinError 126] The specified module could not be found
Failed to load artm shared library from `['C:\\BigARTM\\python\\artm\\wrapper\\..\\artm.dll', 'artm.dll']`.
Try to add the location of `artm.dll` file into your PATH system variable, or to set ARTM_SHARED_LIBRARY -
the specific system variable which may point to `artm.dll` file, including the full path.
Is there any way to sort out this dll file issue for all of us Windows users?
Install TopicNet, run Python
conda create -n test python=3.6
conda activate test
pip install topicnet
python
Try to import TopicModel
from topicnet.cooking_machine.models import TopicModel
You get
ModuleNotFoundError: No module named 'dill'
The link to the RTL-Wiki dataset provided in RTL-Wiki-Preprocessing.ipynb (http://139.18.2.164/mroeder/palmetto/datasets/rtl-wiki.tar.gz) points to a non-existent file (a 404 error is returned).
After some search, I've found this one: https://hobbitdata.informatik.uni-leipzig.de/homes/mroeder/palmetto/datasets/rtl-wiki.tar.gz
So the dataset is currently unavailable via load_dataset('ruwiki_good'). This should be fixed. Or the function should at least download the data and report where the .txt file lies, so that one could do something with the file manually.
If you try this:
>>> d = load_dataset('ruwiki_good')
you get something like this:
Checking if dataset "ruwiki_good" was already downloaded before
Dataset "ruwiki_good" not found on the machine
Downloading the "ruwiki_good" dataset...
100%|█████████████████████████████████████████| 51.2M/51.2M [00:46<00:00, 1.10MiB/s]
Dataset downloaded! Save path is: "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/ruwiki_good.txt"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 132, in load_dataset
raise exception
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 126, in load_dataset
return Dataset(save_path, **kwargs)
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 220, in __init__
self._data = self._read_data(data_path)
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 355, in _read_data
data = data_handle.read_csv(
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 935, in read_csv
kwds_defaults = _refine_defaults_read(
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 2063, in _refine_defaults_read
raise ValueError(
ValueError: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.
OS is:
Linux mx 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
The dataset is 1) downloaded and 2) ready to use for topic modeling.
If you set sep='###'
in this code:
data = data_handle.read_csv(
    data_path,
    engine='python',
    error_bad_lines=False,
    sep='\n',
    header=None,
    names=[VW_TEXT_COL]
)
then everything seems to work fine.
Something along the lines of "convert between Counter and vowpal_wabbit" would be very helpful.
Also, maybe we need to store more metadata (such as the main modality and co-occurrences).
Related code: https://github.com/machine-intelligence-laboratory/OptimalNumberOfTopics/blob/master/topnum/scores/arun.py (which is especially relevant now since we distribute the descriptions of corpora that are obtained using this code)
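A minimal sketch of such a conversion (the exact VW dialect BigARTM expects may differ; the modality prefix and both helper names here are assumptions, not TopicNet API):

```python
from collections import Counter

def counter_to_vw(doc_id: str, counter: Counter, modality: str = "@default_class") -> str:
    # Serialize token counts into one Vowpal-Wabbit-style line,
    # e.g. "doc1 |@default_class cat:2 dog" (single-count tokens omit ":1").
    tokens = " ".join(
        tok if cnt == 1 else f"{tok}:{cnt}"
        for tok, cnt in sorted(counter.items())
    )
    return f"{doc_id} |{modality} {tokens}"

def vw_to_counter(line: str) -> "tuple[str, Counter]":
    # Parse a single-modality VW line back into (doc_id, Counter).
    doc_id, _, rest = line.partition(" |")
    _, _, tokens = rest.partition(" ")  # drop the modality name
    counter: Counter = Counter()
    for item in tokens.split():
        tok, _, cnt = item.partition(":")
        counter[tok] += int(cnt) if cnt else 1
    return doc_id, counter
```

A round trip (Counter -> line -> Counter) would then be lossless, which is the property such a utility should guarantee.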
Instead of retrieving all documents sorted by ptd, retrieve all documents containing any of the given words with length between minlen and maxlen, sorted by ptd.
That's a simple but very useful feature.
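A rough sketch of what such a query could look like (the data structures below are made up for illustration; in the library the p(t|d) values would come from the model's theta matrix):

```python
def search_documents(ptd, doc_words, query_words, minlen, maxlen):
    # ptd: {doc_id: p(t|d) for the topic of interest};
    # doc_words: {doc_id: set of words in the document}.
    # Keep documents containing any query word of allowed length,
    # sorted by p(t|d) in descending order.
    allowed = {w for w in query_words if minlen <= len(w) <= maxlen}
    hits = [doc for doc, words in doc_words.items() if words & allowed]
    return sorted(hits, key=lambda doc: ptd[doc], reverse=True)
```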
Where can one download the datasets that the notebooks reference?
wiki_data/wiki_data.csv
topicnet/PScience.csv
data/lenta_1000_100.csv
Take a dataset, call dataset.get_dictionary(), then modify its _data by renaming one of the modalities (e.g. lemmatized -> new_lemmatized).
Result: the old modality is in the model's Phi.
Expected: the new modality in Phi.
P.S.
One should also check that dataset._modalities is up-to-date.
strategy: is it important, or is the default value always OK? How to pick the right one?
tracked_score_function: is it important, or is the default value always OK? How to pick the right one?
The method is too slow! Do we really need dask.dataframe? Maybe it is better to store documents on disk as separate files (and not as one big .csv)?
References:
import topicnet
dataset = topicnet.cooking_machine.Dataset(...)
model = topicnet.cooking_machine.models.TopicModel(...)
cube = CubeCreator(
    num_iter=10,
    parameters={
        "seed": [11221963],
        "num_topics": [5, 10]
    }
)
exp = Experiment(
    model,
    experiment_id='exp1',
    save_path='should_not_be_created',
    save_experiment=False
)
cube(model, dataset)
Result: the folder specified in the experiment init, should_not_be_created, exists.
Expected: there shouldn't be any such folder, because save_experiment=False.
There are three points here.
Using CommaSeparated(Int()) / CommaSeparated(Str()) instead of Seq() should make the files more concise and more readable. More info: crdoconnor/strictyaml#70
Build bigARTM-related schemas automatically, probably by parsing numpydoc-style docstrings.
Allow defining regularizers inline (should be trivial after some slight refactoring).
For some reason, dataset_manager cannot reach your server from Colab:
URLError: <urlopen error [Errno 111] Connection refused>
The code:
from topicnet import dataset_manager
from IPython.display import display, Markdown
display(Markdown(dataset_manager.get_info()))
The error:
URLError: <urlopen error [Errno 111] Connection refused>
And with dataset_manager.load_dataset('RTL_Wiki'), an additional error is raised for an undefined variable in the finally block:
UnboundLocalError: local variable 'save_path' referenced before assignment
Hi, thanks for your implementation.
I'm wondering whether we can get document-topic and word-topic representations.
For each document or word, the representation should be a vector whose length equals the number of topics.
Thanks!
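If I read the library correctly, a trained TopicModel exposes exactly these matrices via model.get_phi() (words x topics) and model.get_theta() (topics x documents) as pandas DataFrames; the method names are an assumption here, please correct me if they are off. The vectors you want are then just rows/columns, as this toy example with made-up numbers shows:

```python
import pandas as pd

# Toy stand-ins for the matrices, only to illustrate the shapes:
# phi is words x topics, theta is topics x documents.
phi = pd.DataFrame(
    [[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]],
    index=["cat", "dog", "fish"],
    columns=["topic_0", "topic_1"],
)
theta = pd.DataFrame(
    [[0.9, 0.2], [0.1, 0.8]],
    index=["topic_0", "topic_1"],
    columns=["doc_1", "doc_2"],
)

word_vector = phi.loc["dog"]  # one row: vector of length n_topics
doc_vector = theta["doc_1"]   # one column: vector of length n_topics
```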
We have a number of dependencies which setuptools calls "extras": packages that are only needed to unlock additional functionality and are not used otherwise. Ideally, we should not require the user to install them.
It appears that we need to add something like the following to setup.py:
extras_require = {
    "custom_regularizers": ["numba"],
    "large_datasets": ["dask[dataset]"],
}
and then check whether the optional package is installed (I'm not clear how exactly that works).
The user could then install our library with the following syntax: pip install topicnet[custom_regularizers] (which will pull in numba).
PS: Naming suggestions are welcome.
I've tried to access the postnauka dataset using demo_data = load_dataset('postnauka').
Unfortunately, topicnet produced an error:
HTTPError: HTTP Error 502: Bad Gateway
In my practice it is very useful to split score trackers into background and subject-specific ones.
There is a parameter such as sparsity, and I need it to grow for specific topics and drop for background ones. At the same time, it is important that specific topics are not zeroed out completely. The fact of 100% zeroing can only be detected if we have a score tracker just for the specific topics: if I see that it equals 1, I immediately know the model has gone off the rails.
This fits well with the model-selector concept, if one can specify the condition "sparsity for specific topics is less than 1".
tracked_score_function should have a default value of "PerplexityScore@all". Also, possibly it should be an attribute of Strategy instead of Cube.
reg_search should be renamed into something more informative. Also, possibly it should be an attribute of Strategy instead of Cube.
How, and whom, should one ask a question? Maybe add info about the library's Slack channel.
Currently CubeCreator supports only absolute weights (am I right 🙂?). It seems that relative weights are more useful (especially since init_simple_default_model requires relative weights as input). Also, the README should be updated (the part at the end about modality weights): one should emphasize that currently the weights have to be absolute.
Related: #62
Ideally, each dataset should be available in both forms.
Make it possible for Dataset instances to clean everything up once they are no longer needed (the dataset_batches folder).
Currently, multiprocessing may cause tests to fail from time to time.
This means that builds may be red even when the code is perfectly fine.
The case is not yet clear, but there are several hypotheses about what can cause this instability:
This problem apparently arises at the intersection of multiprocessing and pytest, i.e. it is unlikely that anyone besides us ever sees it.
Another possible cause is training a large number of experiments with the same name on a very small collection, which is not a typical use case of the library either. — E. Egorov
The exact details aren't clear, but each of them needs three distinct methods (raising NotImplementedError for now).
dataset = Dataset(csv_table_path, keep_in_memory=False)
Result: the dataset is not ready after a reasonable amount of time.
P.S.
Once this one is fixed, one should also check that dataset.get_batch_vectorizer() is not too slow either.
!pip install topicnet
TopicNet in /usr/local/lib/python3.7/dist-packages (0.8.0)
%%time
enable_hide_warnings()
models = decorrelating_cube(topic_model, dataset) # TODO: nice cube output?
/usr/local/lib/python3.7/dist-packages/dill/_dill.py:1845: PicklingWarning: Cannot locate reference to <class '_ctypes.PyCFuncPtrType'>.
warnings.warn('Cannot locate reference to %r.' % (obj,), PicklingWarning)
/usr/local/lib/python3.7/dist-packages/dill/_dill.py:1847: PicklingWarning: Cannot pickle <class '_ctypes.PyCFuncPtrType'>: _ctypes.PyCFuncPtrType has recursive self-references that trigger a RecursionError.
warnings.warn('Cannot pickle %r: %s.%s has recursive self-references that trigger a RecursionError.' % (obj, obj.module, obj_name), PicklingWarning)
/usr/local/lib/python3.7/dist-packages/topicnet/cooking_machine/models/topic_model.py:427: UserWarning: Failed to save custom score "ActiveTopicNumberScore" correctly! Freezing score (saving only its value)
f'Failed to save custom score "{score_object}" correctly! '
enable_hide_warnings()
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      2
      3 best_model, DECORRELATION_TAU = select_best_model(
----> 4     topic_model.experiment, CRITERION, DECORRELATE_SPECIFIC
      5 )

6 frames
/usr/local/lib/python3.7/dist-packages/topicnet/cooking_machine/models/scores_wrapper.py in add(self, score)
54 def add(self, score: Union[BaseScore, artm.scores.BaseScore]):
55 if isinstance(score, FrozenScore):
---> 56 raise TypeError('FrozenScore is not supposed to be added to model')
57
58 elif isinstance(score, BaseScore):
TypeError: FrozenScore is not supposed to be added to model
I installed topicnet through pip and got 'Empty: Failed to retrive number of trained models' when trying to run my_first_cube(tm, data) from the guide.
Maybe csv isn't the best format for Dataset? Or perhaps there's some simple fix for this (increase the field size limit).
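If the failure is indeed Python's csv field size limit (an assumption about the root cause), a quick workaround could be:

```python
import csv
import sys

# Possible quick fix (assumption): raise the csv module's field size limit
# before the Dataset parses a .csv file with very large fields.
csv.field_size_limit(sys.maxsize)
```

This would have to run before the Dataset is constructed; a proper fix would belong inside the library itself.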
When I init the experiment, it tries to create a folder with illegal symbols in the name:
experiment = Experiment(experiment_id="simple_experiment", save_path="experiments", topic_model=topic_model)
and I get
OSError: [WinError 123] Синтаксическая ошибка в имени файла, имени папки или метке тома: 'experiments\\simple_experiment\\<<<<<<<<<<<root>>>>>>>>>>>'
(the Russian message translates to "The filename, directory name, or volume label syntax is incorrect").
I tried to change "model_id" manually, topic_model.model_id = 'root', but it changed nothing.
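One conceivable direction for a fix (the helper below is hypothetical and not part of TopicNet) is to strip the characters Windows forbids in file names before the model id reaches the file system:

```python
import re

def sanitize_model_id(model_id: str) -> str:
    # Hypothetical helper (not TopicNet API): replace characters that
    # Windows forbids in file and folder names with underscores.
    return re.sub(r'[<>:"/\\|?*]', "_", model_id)
```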
Worth noting that relative weights are handy, and the library provides simple ways to use them.
Related: #62
@Guince suggested writing a more or less detailed guide on how one can create a regularizer:
This issue was created to discuss the possibility of a viewer that shows how documents and a topic's keywords link to each other.
At the moment there are two separate viewers. TopDocumrntViewer outputs the topic's technical name (which in 99.(9)% of this viewer's use cases is just its id) and example documents for that topic, so in effect the texts are linked to an id that says very little.
And TopTokensViewer again outputs this id, together with its keywords.
In practice, when I analyze a model's adequacy "by hand", I naturally print example texts for each topic and go through them by eye. Topic_id is not a characteristic of a topic, so I cannot assess the adequacy of the built model with this viewer. The characteristic of a topic is its set of keywords; that is what I would like to see next to the list of printed documents.
The absence of this capability has kept me from finally switching from my hand-written frameworks to TopicNet.