machine-intelligence-laboratory / topicnet


Interface for easier topic modelling.

Home Page: https://machine-intelligence-laboratory.github.io/TopicNet

License: MIT License

Python 99.91% Shell 0.09%
bigartm-library topic-modelling custom-score modalities pypi topic-modeling multimodal-learning multimodal-data document-representation

topicnet's Issues

TopicNet's BaseRegularizer can't be used in cube settings?!

[screenshot "not_implt": a NotImplementedError traceback]

P.S. The score in the picture is actually declared as

class TopicPriorRegularizer(BaseRegularizer, artm.regularizers.BaseRegularizer):
    pass

not just

class TopicPriorRegularizer(BaseRegularizer):
    pass

I was trying to fix stuff and make things work, but it was not so easy.
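A minimal sketch of the double-inheritance workaround, assuming the cubes run isinstance() checks against artm.regularizers.BaseRegularizer (the import path of TopicNet's BaseRegularizer and the grad() signature are my guesses here):

import artm
from topicnet.cooking_machine.models.base_regularizer import BaseRegularizer

class TopicPriorRegularizer(BaseRegularizer, artm.regularizers.BaseRegularizer):
    # Inheriting from both bases keeps TopicNet's custom-regularizer
    # interface while passing the isinstance() checks in cube settings.
    def grad(self, pwt, nwt):
        # dummy gradient, just to make the sketch complete
        return 0 * pwt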

The code in Making-Decorrelation-and-Topic-Selection-Friends does not run

Maybe this question is not really for you, but you have surely dealt with it.

I cannot figure out how to run the script that computes co-occurrences.

Here is the code:

cd <working_directory path> && <path to the folder containing bigartm>/bigartm/build/bin/bigartm \
    -c train_vw_natural_order.txt \
    -v vocab.txt \
    --cooc-window 10 \
    --cooc-min-tf 1 \
    --write-cooc-tf cooc_tf_ \
    --cooc-min-df 1 \
    --write-cooc-df cooc_df_ \
    --write-ppmi-tf ppmi_tf_ \
    --write-ppmi-df ppmi_df_

On Linux I cannot find this bigartm executable at all: there is nothing like /bigartm/build/bin/bigartm anywhere near the folder with the library, but then I installed it simply with pip install bigartm10.
I also tried running it on Windows; your library does not install there, but I copied all the files over. The -c train_vw_natural_order.txt part executed and even created something, but the other options would not work at all. I tried running them both line by line and all together on one line, as shown at https://bigartm.readthedocs.io/en/stable/tutorials/bigartm_cli.html

Still, none of this works. You must have run it somehow; can you tell me what I am doing wrong?
Maybe the problem is that bigartm should not have been installed through pip...
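For reference: the pip package bigartm10 appears to ship only the Python wrapper and shared library, not the standalone bigartm CLI, which would explain why there is no build/bin/bigartm anywhere near the installed package. A sketch of the usual source build, following the BigARTM installation docs (toolchain requirements such as cmake and boost may differ per system):

git clone https://github.com/bigartm/bigartm.git
cd bigartm
mkdir build && cd build
cmake ..
make
# the CLI binary should then appear at bigartm/build/bin/bigartm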

WinError: Failed to load artm shared library from artm.dll

I have been struggling with the installation of the original library, BigARTM, on Windows 10 for quite a while now, with no success. I saw my same issue raised there over a year ago without a response, and seeing the vast number of open tickets and the lack of development, I sadly concluded that BigARTM seems to be a failing, dead project by now.

So I was enthusiastic to discover TopicNet and hoped that installation on Windows had finally been made available and tested. But I see that TopicNet's installation relies exclusively on the underlying BigARTM installation, which still fails on Windows with the following error:

OSError: [WinError 126] The specified module could not be found
Failed to load artm shared library from `['C:\\BigARTM\\python\\artm\\wrapper\\..\\artm.dll', 'artm.dll']`.
Try to add the location of `artm.dll` file into your PATH system variable, or to set ARTM_SHARED_LIBRARY -
the specific system variable which may point to `artm.dll` file, including the full path.

Is there any way to make this dll file discoverable for all of us Windows users?
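In the meantime, the error message itself suggests two workarounds once a usable artm.dll is obtained from somewhere (for example, a successful source build): add its folder to PATH, or point ARTM_SHARED_LIBRARY at the file. A sketch of the latter; the dll path below is hypothetical:

import os

# Set the variable *before* importing artm; use the real location of artm.dll.
os.environ['ARTM_SHARED_LIBRARY'] = r'C:\BigARTM\bin\artm.dll'

import artm  # should now be able to load the shared library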

No module named 'dill'

Install TopicNet, run Python

conda create -n test python=3.6
conda activate test
pip install topicnet
python

Try to import TopicModel

from topicnet.cooking_machine.models import TopicModel

You get

ModuleNotFoundError: No module named 'dill'
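Until dill is declared as a TopicNet dependency, the immediate workaround (assuming dill only needs to be importable) is to install it by hand:

pip install dill

After that, the import above should succeed.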

Dataset "ruwiki_good" does not want to be downloaded

Well, the dataset is currently unavailable: load_dataset('ruwiki_good') fails. This should be fixed, or it should at least download the file and report where the .txt lies (so that one could do something with it manually).

If you try this:

>>> d = load_dataset('ruwiki_good')

you get something like this:

Checking if dataset "ruwiki_good" was already downloaded before
Dataset "ruwiki_good" not found on the machine
Downloading the "ruwiki_good" dataset...
100%|█████████████████████████████████████████| 51.2M/51.2M [00:46<00:00, 1.10MiB/s]
Dataset downloaded! Save path is: "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/ruwiki_good.txt"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 132, in load_dataset
    raise exception
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 126, in load_dataset
    return Dataset(save_path, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 220, in __init__
    self._data = self._read_data(data_path)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 355, in _read_data
    data = data_handle.read_csv(
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 935, in read_csv
    kwds_defaults = _refine_defaults_read(
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 2063, in _refine_defaults_read
    raise ValueError(
ValueError: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.

OS is:

Linux mx 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux

Expected Result

The dataset is 1) downloaded and 2) ready to use for topic modeling.

Current "Workaround"

If you set sep='###' in this code:

data = data_handle.read_csv(
    data_path,
    engine='python',
    error_bad_lines=False,
    sep='\n',
    header=None,
    names=[VW_TEXT_COL]
)

then everything seems to work fine.
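That is, assuming nothing else in the call changes, the patched call would read:

data = data_handle.read_csv(
    data_path,
    engine='python',
    error_bad_lines=False,
    sep='###',  # any token that never occurs in the data, instead of '\n'
    header=None,
    names=[VW_TEXT_COL]
)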

Demo/Guide for CubeCreator

  • what is it for?
  • which parameters one can (or is encouraged to) vary with this cube (e.g. the number of topics may be better found with the OptimalNumberOfTopics repository 🙂)?
  • how to find the best modality weights (rel2abs?, ...)?
  • what is the second level? how to use it?

Dataset's dictionary not updated if one changes the collection dynamically

  • Create a dataset
  • Call dataset.get_dictionary()
  • Change the dataset's _data by renaming one of the modalities (e.g. lemmatized -> new_lemmatized)
  • Try to build a topic model using the dataset

Result: old modality in model's Phi
Expected: new modality in Phi

P.S.
One should also check that dataset._modalities is up-to-date
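A rough repro sketch of the steps above (the file name and modality names are made up, and _data is assumed to hold the vw_text column that Dataset reads):

from topicnet.cooking_machine import Dataset

dataset = Dataset('texts.csv')  # hypothetical collection
dictionary = dataset.get_dictionary()  # the dictionary is apparently cached here

# Rename a modality directly in the underlying data
dataset._data['vw_text'] = dataset._data['vw_text'].str.replace(
    '|lemmatized', '|new_lemmatized', regex=False,
)

# A model built from this dataset still gets the old 'lemmatized' modality
# in its Phi, because get_dictionary() returns the stale cached dictionary.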

Demo/Guide for ModifierCube

  • what is strategy? is it important, or is the default value always OK? how to pick the right one?
  • what is tracked_score_function? is it important, or is the default value always OK? how to pick the right one?
  • ModifierCube vs ControllerCube: what are the differences and the use cases?

save_experiment parameter seems not to work (at least in some cases)

import topicnet
from topicnet.cooking_machine.cubes import CubeCreator
from topicnet.cooking_machine.experiment import Experiment

dataset = topicnet.cooking_machine.Dataset(...)
model = topicnet.cooking_machine.models.TopicModel(...)

cube = CubeCreator(
    num_iter=10,
    parameters={
        "seed": [11221963],
        "num_topics": [5, 10]
    }
)
exp = Experiment(
    model,
    experiment_id='exp1',
    save_path='should_not_be_created',
    save_experiment=False
)

cube(model, dataset)

Result: the folder should_not_be_created specified in the experiment init exists
Expected: there shouldn't be any such folder, because save_experiment=False

Change config format

There are three points here.

  1. Using CommaSeparated(Int()) / CommaSeparated(Str()) instead of Seq() should make the files more concise and more readable.

     More info: crdoconnor/strictyaml#70

  2. Build the bigARTM-related schemas automatically, probably by parsing numpydoc-style docstrings.

  3. Allow defining regularizers inline (should be trivial after some slight refactoring).
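On the first point, a small strictyaml illustration of the difference (schema contents are arbitrary):

from strictyaml import load, Map, Int, Seq, CommaSeparated

# Seq(Int()) requires a multi-line YAML list:
seq_schema = Map({'num_topics': Seq(Int())})
load('num_topics:\n- 10\n- 20\n- 30', seq_schema)

# CommaSeparated(Int()) parses the same data from one line:
csv_schema = Map({'num_topics': CommaSeparated(Int())})
load('num_topics: 10, 20, 30', csv_schema)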

error in 20NG-GenSim vs TopicNet

[Errno 21] Is a directory: '20_News_dataset/train_preprocessed.csv'

As far as I understand, a folder for logging is created here. Just in case, I set my own path

EXPERIMENT_PATH = 'test//'

but the error remained.


Dataset_manager - Connection refused

For some reason dataset_manager cannot reach you from Colab:
URLError: <urlopen error [Errno 111] Connection refused>

The code:

from topicnet import dataset_manager
from IPython.display import display, Markdown
display(Markdown(dataset_manager.get_info()))

The error:

URLError: <urlopen error [Errno 111] Connection refused>

And with dataset_manager.load_dataset('RTL_Wiki') one more error falls out, an undefined variable in the finally block:

UnboundLocalError: local variable 'save_path' referenced before assignment

Get document-topic representation

Hi, thanks for your implementation.

I'm wondering whether we can get document-topic and word-topic representations.
For each document or word, the representation should be a vector whose length equals the number of topics.

Thanks!
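For reference, both matrices are available from a trained model. A minimal sketch, assuming a fitted TopicModel called topic_model, the dataset it was trained on, and the get_phi/get_theta accessors:

phi = topic_model.get_phi()  # word-topic matrix: rows are words, columns are topics
theta = topic_model.get_theta(dataset=dataset)  # topic-document matrix

# A document's representation is a column of theta,
# a vector whose length equals the number of topics:
doc_vector = theta[theta.columns[0]]

# A word's representation is a row of phi:
word_vector = phi.loc[phi.index[0]]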

Clean up the optional dependencies

We have a number of dependencies which setuptools calls "extras": packages that are only needed to unlock additional functionality and are not used otherwise. Ideally, we should not require the user to install them.

It appears that we need to add something like the following to setup.py:

    extras_require = {
        "custom_regularizers": ["numba"],
        "large_datasets": ["dask[dataframe]"],
    }

and then check whether the optional package is installed (I'm not sure exactly how that works; a common pattern is sketched below).

The user could then install our library with the syntax pip install topicnet[custom_regularizers] (which will pull in numba).

PS: Naming suggestions are welcome.
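The "check if the optional package is installed" part is usually just an import guard; an illustrative pattern (the function and message are made up):

try:
    import numba
except ImportError:
    numba = None

def accelerated_regularizer_step(phi_matrix):
    # Fail with an actionable message when the extra was not installed
    if numba is None:
        raise ImportError(
            'numba is required here; install it with '
            '"pip install topicnet[custom_regularizers]"'
        )
    return phi_matrix  # placeholder for the numba-accelerated computation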

A basic set of score trackers

In my practice it is very useful to split score trackers into background and subject-specific ones.

There is a parameter such as sparsity, and I want it to grow for subject topics and to fall for background ones. At the same time it is important that subject topics do not get zeroed out completely. The fact of complete zeroing can be reliably seen only if there is a score tracker for subject topics alone: if I see that it equals 1, I immediately know the model has gone off the rails.

This fits well into the model selector concept, if one specifies there a condition like "sparsity over subject topics is less than 1".

Cached dict in Dataset breaks things


The Dictionary filter() function changes the dictionary in place, so currently it is not possible to do:

dictionary = dataset.get_dictionary()
dictionary.filter(min_df_rate=1.0)
...
dictionary = dataset.get_dictionary()  # this guy is already spoiled 
dictionary.filter(min_df_rate=0.1)

Refactor of `reg_search` and `tracked_score_function`

tracked_score_function should have a default value of "PerplexityScore@all". Also, it should possibly be an attribute of Strategy rather than Cube.

reg_search should be renamed to something more informative. It, too, should possibly be an attribute of Strategy rather than Cube.

Relative modality weights in CubeCreator

Currently CubeCreator supports only absolute weights (am I right 🙂?). Relative weights seem more useful, especially given that init_simple_default_model requires relative weights as input. The README should also be updated (the part at the end about modality weights): one should emphasize that currently the weights must be absolute.

Related: #62

Fix multiprocessing in tests

Currently multiprocessing may cause tests to fail from time to time.
It means that builds may be red even when the code is perfectly fine.
The cause is not yet clear, but there are several hypotheses about this instability:

  • multiprocessing + pytest
  • too many experiments with the same name are conducted on too small a text collection

The problem apparently arises at the intersection of multiprocessing and pytest, i.e. it is unlikely that anyone besides us even sees it.
The other possible cause is training a large number of experiments with the same name on a very small collection, which is not a typical use case of the library either.

E. Egorov

Make viewers interface more uniform

The exact details aren't clear, but each of them needs three distinct methods:

  • method returning JSON-like structure, used internally in other methods
  • method saving something fancy (html-like) to a hard drive
  • method to visualize result inside Jupyter Notebook (could just raise NotImplementedError for now)
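A sketch of what such a uniform base interface might look like (all names are hypothetical):

class BaseViewer:
    def view(self) -> dict:
        # Return a JSON-like structure; used internally by the other methods
        raise NotImplementedError

    def to_html(self, save_path: str) -> None:
        # Save something fancy (HTML-like) to the hard drive
        raise NotImplementedError

    def show(self) -> None:
        # Visualize the result inside a Jupyter Notebook
        raise NotImplementedError  # acceptable as a stub for now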

Takes a lifetime to initialize a really big Dataset

  1. Create some big .csv (several hundred thousand rows should be enough; maybe fewer would also do); see the sketch below
  2. Try to do dataset = Dataset(csv_table_path, keep_in_memory=False)


Result: the dataset is not ready after a reasonable amount of time

P.S.
After this one is fixed, one should also check that dataset.get_batch_vectorizer() is not too slow
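A sketch for generating a table big enough to reproduce this; the id/raw_text/vw_text columns follow, to my understanding, the Dataset CSV convention, and the row count is a guess:

import pandas as pd
from topicnet.cooking_machine import Dataset

n_rows = 300_000  # "several 100k rows"
table = pd.DataFrame({
    'id': [f'doc_{i}' for i in range(n_rows)],
    'raw_text': ['some raw text'] * n_rows,
    'vw_text': [f'doc_{i} some text tokens' for i in range(n_rows)],
})
table.to_csv('big_table.csv', index=False)

dataset = Dataset('big_table.csv', keep_in_memory=False)  # takes "a lifetime"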

Making-Decorrelation-and-Topic-Selection-Friends.ipynb doesn't work in Google Colab from scratch

!pip install topicnet
TopicNet in /usr/local/lib/python3.7/dist-packages (0.8.0)
%%time

enable_hide_warnings()

models = decorrelating_cube(topic_model, dataset)  # TODO: nice cube output?

The cell emits these warnings:

/usr/local/lib/python3.7/dist-packages/dill/_dill.py:1845: PicklingWarning: Cannot locate reference to <class '_ctypes.PyCFuncPtrType'>.
  warnings.warn('Cannot locate reference to %r.' % (obj,), PicklingWarning)
/usr/local/lib/python3.7/dist-packages/dill/_dill.py:1847: PicklingWarning: Cannot pickle <class '_ctypes.PyCFuncPtrType'>: _ctypes.PyCFuncPtrType has recursive self-references that trigger a RecursionError.
  warnings.warn('Cannot pickle %r: %s.%s has recursive self-references that trigger a RecursionError.' % (obj, obj.__module__, obj_name), PicklingWarning)
/usr/local/lib/python3.7/dist-packages/topicnet/cooking_machine/models/topic_model.py:427: UserWarning: Failed to save custom score "ActiveTopicNumberScore" correctly! Freezing score (saving only its value)
  f'Failed to save custom score "{score_object}" correctly! '

Then the model selection cell:

enable_hide_warnings()

best_model, DECORRELATION_TAU = select_best_model(
    topic_model.experiment, CRITERION, DECORRELATE_SPECIFIC
)

fails with the following traceback:

TypeError                                 Traceback (most recent call last)

<ipython-input> in <module>
      2
      3 best_model, DECORRELATION_TAU = select_best_model(
----> 4     topic_model.experiment, CRITERION, DECORRELATE_SPECIFIC
      5 )

6 frames

/usr/local/lib/python3.7/dist-packages/topicnet/cooking_machine/models/scores_wrapper.py in add(self, score)
     54     def add(self, score: Union[BaseScore, artm.scores.BaseScore]):
     55         if isinstance(score, FrozenScore):
---> 56             raise TypeError('FrozenScore is not supposed to be added to model')
     57
     58         elif isinstance(score, BaseScore):

TypeError: FrozenScore is not supposed to be added to model

Invalid root folder name for Windows

When I init the experiment, it tries to create a folder with illegal symbols in the name:

experiment = Experiment(experiment_id="simple_experiment", save_path="experiments", topic_model=topic_model)

and I get

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'experiments\\simple_experiment\\<<<<<<<<<<<root>>>>>>>>>>>'

I tried to change "model_id" manually, topic_model.model_id = 'root', but it changed nothing.

Note in README about relative weights (modality & regularizer)

It is worth noting that relative weights are convenient and that the library provides simple ways to use them.

  • A more intuitive way to choose modality weights (not just some random values virtually out of nowhere)
  • Cubes use relative weights by default, so the recommended range for regularizer weights is between 0 and 1, not ~10000 (though other values are also possible)

Related: #62

Demo/Guide for custom regularizers

@Guince suggested making a more or less detailed guide on how one can create a regularizer:

  • what one should do, given the M-step
  • how to test that everything is correct and working
  • (*) how to wrap some Bayesian logic into a regularizer (maybe a bit hardcore, but it seems a good thing to do)

TopDocumentViewer

This issue was created to discuss a viewer that would show how documents and a topic's key words are linked to each other.
At the moment there are two separate viewers. TopDocumentViewer outputs the technical name of a topic (which in 99.(9)% of this viewer's use cases is its id) and example documents for that topic, so in effect the texts are linked to an id that says very little. And TopTokensViewer outputs, again, this id and its key words.

In practice, when I assess the adequacy of a model "by hand", I naturally print example texts for each topic and go over them with my eyes. Topic_id is not a characteristic of a topic, so I cannot judge the adequacy of the trained model with this viewer. The characteristic of a topic is its set of key words, and that is what I want to see next to the list of printed documents.

The lack of this capability has kept me from finally moving from my own hand-written frameworks to TopicNet.
