machine-intelligence-laboratory / TopicNet
Interface for easier topic modelling.
Home Page: https://machine-intelligence-laboratory.github.io/TopicNet
License: MIT License
Perhaps this question is not for you, but you must have dealt with this.
I cannot figure out how to run the script for computing co-occurrences.
Here is the code:
cd <working_directory path> && <path to the folder containing bigartm>/bigartm/build/bin/bigartm
-c train_vw_natural_order.txt
-v vocab.txt
--cooc-window 10
--cooc-min-tf 1
--write-cooc-tf cooc_tf_
--cooc-min-df 1
--write-cooc-df cooc_df_
--write-ppmi-tf ppmi_tf_
--write-ppmi-df ppmi_df_
On Linux I cannot find this bigartm executable at all: there is nothing like /bigartm/build/bin/bigartm anywhere inside the library folder. But then, I installed the library simply with pip install bigartm10.
I tried to run it on Windows; your library does not install there, but I copied over all the files.
The -c train_vw_natural_order.txt part executed and even created something, but the remaining options would not run at all. I tried running them both line by line and all together on one line, as shown here: https://bigartm.readthedocs.io/en/stable/tutorials/bigartm_cli.html
None of that works either. You must have run this somehow; could you tell me what I am doing wrong?
Maybe the problem is that bigartm should not have been installed through pip...
I have been struggling with the installation of the original library, BigARTM, on Windows 10 for quite a while now, with no success. I saw the same issue raised there over a year ago without a response, and given the vast number of open tickets and the lack of development, I sadly concluded that BigARTM seems to be a failing, dead project by now.
So I was enthusiastic to discover TopicNet and hoped that installation on Windows was finally available and tested. But I see that TopicNet relies exclusively on the underlying BigARTM installation, which still fails on Windows with the following error:
OSError: [WinError 126] The specified module could not be found
Failed to load artm shared library from `['C:\\BigARTM\\python\\artm\\wrapper\\..\\artm.dll', 'artm.dll']`.
Try to add the location of `artm.dll` file into your PATH system variable, or to set ARTM_SHARED_LIBRARY -
the specific system variable which may point to `artm.dll` file, including the full path.
Is there any way to sort out this dll file issue for all of us Windows users?
Install TopicNet, run Python
conda create -n test python=3.6
conda activate test
pip install topicnet
python
Try to import TopicModel
from topicnet.cooking_machine.models import TopicModel
You get
ModuleNotFoundError: No module named 'dill'
The link to the RTL-Wiki dataset provided in RTL-Wiki-Preprocessing.ipynb (http://139.18.2.164/mroeder/palmetto/datasets/rtl-wiki.tar.gz) points to a non-existent file (a 404 error is returned).
After some search, I've found this one: https://hobbitdata.informatik.uni-leipzig.de/homes/mroeder/palmetto/datasets/rtl-wiki.tar.gz
So the dataset is currently unavailable via load_dataset('ruwiki_good'). This should be fixed. Or the function should at least download the data and report where the .txt file lies, so that one could do something with the file manually.
If you try this:
>>> d = load_dataset('ruwiki_good')
you get something like this:
Checking if dataset "ruwiki_good" was already downloaded before
Dataset "ruwiki_good" not found on the machine
Downloading the "ruwiki_good" dataset...
100%|█████████████████████████████████████████| 51.2M/51.2M [00:46<00:00, 1.10MiB/s]
Dataset downloaded! Save path is: "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/ruwiki_good.txt"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 132, in load_dataset
raise exception
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 126, in load_dataset
return Dataset(save_path, **kwargs)
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 220, in __init__
self._data = self._read_data(data_path)
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 355, in _read_data
data = data_handle.read_csv(
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 935, in read_csv
kwds_defaults = _refine_defaults_read(
File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 2063, in _refine_defaults_read
raise ValueError(
ValueError: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.
OS is:
Linux mx 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
The dataset is 1) downloaded and 2) ready to use for topic modeling.
If you set sep='###'
in this code:
data = data_handle.read_csv(
    data_path,
    engine='python',
    error_bad_lines=False,
    sep='\n',
    header=None,
    names=[VW_TEXT_COL]
)
then everything seems to work fine.
Something along the lines of "convert between Counter and vowpal_wabbit" would be very helpful.
Also, maybe we need to store more metadata (such as the main modality and co-occurrences).
Related code: https://github.com/machine-intelligence-laboratory/OptimalNumberOfTopics/blob/master/topnum/scores/arun.py (which is especially relevant now since we distribute the descriptions of corpora that are obtained using this code)
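A minimal sketch of such a conversion (the exact VW dialect BigARTM expects may differ; the modality prefix and both helper names here are assumptions, not TopicNet API):

```python
from collections import Counter

def counter_to_vw(doc_id: str, counter: Counter, modality: str = "@default_class") -> str:
    # Serialize token counts into one Vowpal-Wabbit-style line,
    # e.g. "doc1 |@default_class cat:2 dog" (single-count tokens omit ":1").
    tokens = " ".join(
        tok if cnt == 1 else f"{tok}:{cnt}"
        for tok, cnt in sorted(counter.items())
    )
    return f"{doc_id} |{modality} {tokens}"

def vw_to_counter(line: str) -> "tuple[str, Counter]":
    # Parse a single-modality VW line back into (doc_id, Counter).
    doc_id, _, rest = line.partition(" |")
    _, _, tokens = rest.partition(" ")  # drop the modality name
    counter: Counter = Counter()
    for item in tokens.split():
        tok, _, cnt = item.partition(":")
        counter[tok] += int(cnt) if cnt else 1
    return doc_id, counter
```

A round trip (Counter -> line -> Counter) would then be lossless, which is the property such a utility should guarantee.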
Instead of retrieving all documents sorted by ptd, retrieve all documents containing any of the given words with length between minlen and maxlen, sorted by ptd.
That's a simple but very useful feature.
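A rough sketch of what such a query could look like (the data structures below are made up for illustration; in the library the p(t|d) values would come from the model's theta matrix):

```python
def search_documents(ptd, doc_words, query_words, minlen, maxlen):
    # ptd: {doc_id: p(t|d) for the topic of interest};
    # doc_words: {doc_id: set of words in the document}.
    # Keep documents containing any query word of allowed length,
    # sorted by p(t|d) in descending order.
    allowed = {w for w in query_words if minlen <= len(w) <= maxlen}
    hits = [doc for doc, words in doc_words.items() if words & allowed]
    return sorted(hits, key=lambda doc: ptd[doc], reverse=True)
```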
Where can one download the datasets that the notebooks reference?
wiki_data/wiki_data.csv
topicnet/PScience.csv
data/lenta_1000_100.csv
Take a dataset, call dataset.get_dictionary(), then modify its _data by renaming one of the modalities (e.g. lemmatized -> new_lemmatized).
Result: the old modality is in the model's Phi.
Expected: the new modality in Phi.
P.S.
One should also check that dataset._modalities is up-to-date.
strategy: is it important, or is the default value always OK? How to pick the right one?
tracked_score_function: is it important, or is the default value always OK? How to pick the right one?
The method is too slow! Do we really need dask.dataframe? Maybe it is better to store documents on disk as separate files (and not as one big .csv)?
References:
import topicnet
dataset = topicnet.cooking_machine.Dataset(...)
model = topicnet.cooking_machine.models.TopicModel(...)
cube = CubeCreator(
    num_iter=10,
    parameters={
        "seed": [11221963],
        "num_topics": [5, 10]
    }
)
exp = Experiment(
    model,
    experiment_id='exp1',
    save_path='should_not_be_created',
    save_experiment=False
)
cube(model, dataset)
Result: the folder specified in the experiment init, should_not_be_created, exists.
Expected: there shouldn't be any such folder, because save_experiment=False.
There are three points here.
Using CommaSeparated(Int()) / CommaSeparated(Str()) instead of Seq() should make the files more concise and more readable. More info: crdoconnor/strictyaml#70
Build bigARTM-related schemas automatically, probably by parsing numpydoc-style docstrings.
Allow defining regularizers inline (should be trivial after some slight refactoring).
For some reason, dataset_manager cannot reach your server from Colab:
URLError: <urlopen error [Errno 111] Connection refused>
The code:
from topicnet import dataset_manager
from IPython.display import display, Markdown
display(Markdown(dataset_manager.get_info()))
The error:
URLError: <urlopen error [Errno 111] Connection refused>
And with dataset_manager.load_dataset('RTL_Wiki'), an additional error is raised for an undefined variable in the finally block:
UnboundLocalError: local variable 'save_path' referenced before assignment
Hi, thanks for your implementation.
I'm wondering whether we can get document-topic and word-topic representations.
For each document or word, the representation should be a vector whose length equals the number of topics.
Thanks!
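If I read the library correctly, a trained TopicModel exposes exactly these matrices via model.get_phi() (words x topics) and model.get_theta() (topics x documents) as pandas DataFrames; the method names are an assumption here, please correct me if they are off. The vectors you want are then just rows/columns, as this toy example with made-up numbers shows:

```python
import pandas as pd

# Toy stand-ins for the matrices, only to illustrate the shapes:
# phi is words x topics, theta is topics x documents.
phi = pd.DataFrame(
    [[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]],
    index=["cat", "dog", "fish"],
    columns=["topic_0", "topic_1"],
)
theta = pd.DataFrame(
    [[0.9, 0.2], [0.1, 0.8]],
    index=["topic_0", "topic_1"],
    columns=["doc_1", "doc_2"],
)

word_vector = phi.loc["dog"]  # one row: vector of length n_topics
doc_vector = theta["doc_1"]   # one column: vector of length n_topics
```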
We have a number of dependencies which setuptools calls "extras": packages that are only needed to unlock additional functionality and are not used otherwise. Ideally, we should not require the user to install them.
It appears that we need to add something like the following to setup.py:
extras_require = {
    "custom_regularizers": ["numba"],
    "large_datasets": ["dask[dataset]"],
}
and then check whether the optional package is installed (I'm not clear how exactly that works).
The user could then install our library with the following syntax: pip install topicnet[custom_regularizers] (which will pull in numba).
PS: Naming suggestions are welcome.
I've tried to access the postnauka dataset using demo_data = load_dataset('postnauka').
Unfortunately, topicnet produced an error:
HTTPError: HTTP Error 502: Bad Gateway
In my practice it is very useful to split score trackers into background and subject-specific ones.
There is a parameter such as sparsity, and I need it to grow for specific topics and drop for background ones. At the same time, it is important that specific topics are not zeroed out completely. The fact of 100% zeroing can only be detected if we have a score tracker just for the specific topics: if I see that it equals 1, I immediately know the model has gone off the rails.
This fits well with the model-selector concept, if one can specify the condition "sparsity for specific topics is less than 1".
tracked_score_function should have a default value of "PerplexityScore@all". Also, possibly it should be an attribute of Strategy instead of Cube.
reg_search should be renamed into something more informative. Also, possibly it should be an attribute of Strategy instead of Cube.
How, and whom, should one ask a question? Maybe add info about the library's Slack channel.
Currently CubeCreator supports only absolute weights (am I right 🙂?). It seems that relative weights are more useful (especially since init_simple_default_model requires relative weights as input). Also, the README should be updated (the part at the end about modality weights): one should emphasize that currently the weights have to be absolute.
Related: #62
Ideally, each dataset should be available in both forms.
Make it possible for Dataset instances to clean everything up once they are no longer needed (the dataset_batches folder).
Currently, multiprocessing may cause tests to fail from time to time.
This means that builds may be red even when the code is perfectly fine.
The case is not yet clear, but there are several hypotheses about what can cause this instability:
This problem apparently arises at the intersection of multiprocessing and pytest, i.e. it is unlikely that anyone besides us ever sees it.
Another possible cause is training a large number of experiments with the same name on a very small collection, which is not a typical use case of the library either. — E. Egorov
The exact details aren't clear, but each of them needs three distinct methods (raising NotImplementedError for now).
dataset = Dataset(csv_table_path, keep_in_memory=False)
Result: the dataset is not ready after a reasonable amount of time.
P.S.
Once this one is fixed, one should also check that dataset.get_batch_vectorizer() is not too slow either.
!pip install topicnet
TopicNet in /usr/local/lib/python3.7/dist-packages (0.8.0)
%%time
enable_hide_warnings()
models = decorrelating_cube(topic_model, dataset) # TODO: nice cube output?
/usr/local/lib/python3.7/dist-packages/dill/_dill.py:1845: PicklingWarning: Cannot locate reference to <class '_ctypes.PyCFuncPtrType'>.
warnings.warn('Cannot locate reference to %r.' % (obj,), PicklingWarning)
/usr/local/lib/python3.7/dist-packages/dill/_dill.py:1847: PicklingWarning: Cannot pickle <class '_ctypes.PyCFuncPtrType'>: _ctypes.PyCFuncPtrType has recursive self-references that trigger a RecursionError.
warnings.warn('Cannot pickle %r: %s.%s has recursive self-references that trigger a RecursionError.' % (obj, obj.module, obj_name), PicklingWarning)
/usr/local/lib/python3.7/dist-packages/topicnet/cooking_machine/models/topic_model.py:427: UserWarning: Failed to save custom score "ActiveTopicNumberScore" correctly! Freezing score (saving only its value)
f'Failed to save custom score "{score_object}" correctly! '
enable_hide_warnings()
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      2
      3 best_model, DECORRELATION_TAU = select_best_model(
----> 4     topic_model.experiment, CRITERION, DECORRELATE_SPECIFIC
      5 )

6 frames
/usr/local/lib/python3.7/dist-packages/topicnet/cooking_machine/models/scores_wrapper.py in add(self, score)
54 def add(self, score: Union[BaseScore, artm.scores.BaseScore]):
55 if isinstance(score, FrozenScore):
---> 56 raise TypeError('FrozenScore is not supposed to be added to model')
57
58 elif isinstance(score, BaseScore):
TypeError: FrozenScore is not supposed to be added to model
I installed topicnet through pip and got 'Empty: Failed to retrive number of trained models' when trying to run my_first_cube(tm, data) from the guide.
Maybe csv isn't the best format for Dataset? Or perhaps there's some simple fix for this (increase the field size limit).
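If the failure is indeed Python's csv field size limit (an assumption about the root cause), a quick workaround could be:

```python
import csv
import sys

# Possible quick fix (assumption): raise the csv module's field size limit
# before the Dataset parses a .csv file with very large fields.
csv.field_size_limit(sys.maxsize)
```

This would have to run before the Dataset is constructed; a proper fix would belong inside the library itself.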
When I init the experiment, it tries to create a folder with illegal symbols in the name:
experiment = Experiment(experiment_id="simple_experiment", save_path="experiments", topic_model=topic_model)
and I get
OSError: [WinError 123] Синтаксическая ошибка в имени файла, имени папки или метке тома: 'experiments\\simple_experiment\\<<<<<<<<<<<root>>>>>>>>>>>'
(the Russian message translates to "The filename, directory name, or volume label syntax is incorrect").
I tried to change "model_id" manually, topic_model.model_id = 'root', but it changed nothing.
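One conceivable direction for a fix (the helper below is hypothetical and not part of TopicNet) is to strip the characters Windows forbids in file names before the model id reaches the file system:

```python
import re

def sanitize_model_id(model_id: str) -> str:
    # Hypothetical helper (not TopicNet API): replace characters that
    # Windows forbids in file and folder names with underscores.
    return re.sub(r'[<>:"/\\|?*]', "_", model_id)
```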
Worth noting that relative weights are handy, and the library provides simple ways to use them.
Related: #62
@Guince suggested writing a more or less detailed guide on how one can create a regularizer:
This issue was created to discuss the possibility of a viewer that shows how documents and a topic's keywords link to each other.
At the moment there are two separate viewers. TopDocumrntViewer outputs the topic's technical name (which in 99.(9)% of this viewer's use cases is just its id) and example documents for that topic, so in effect the texts are linked to an id that says very little.
And TopTokensViewer again outputs this id, together with its keywords.
In practice, when I analyze a model's adequacy "by hand", I naturally print example texts for each topic and go through them by eye. Topic_id is not a characteristic of a topic, so I cannot assess the adequacy of the built model with this viewer. The characteristic of a topic is its set of keywords; that is what I would like to see next to the list of printed documents.
The absence of this capability has kept me from finally switching from my hand-written frameworks to TopicNet.