
milanlproc / contextualized-topic-models

A Python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021.

License: MIT License

Languages: Python 97.79%, Makefile 2.21%
topic-modeling bert transformer embeddings text-as-data topic-coherence multilingual-topic-models multilingual-models neural-topic-models nlp

contextualized-topic-models's People

Contributors

cerqueiramatheus, damienlancry, dependabot[bot], dnozza, e-tornike, erip, huzeyfeayaz, nreimers, silviatti, supermaxman, vinid


contextualized-topic-models's Issues

How to use the model trained by this algo to predict topics

  • Contextualized Topic Models version:
  • Python version:
  • Operating System:

Description

Hi, appreciate this module very much.

Just two simple questions:

  1. I have used the algorithm to train my own model and saved it to a directory. May I ask how to use the trained model to predict topics for documents outside the training corpus?

  2. Could the trained model facilitate textual similarity? E.g., given a topic word or a paragraph, could the model tell us the probability of each topic for the query item?

Thanks a lot.
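A minimal sketch of how this is usually done (names are illustrative: it assumes tp is the TopicModelDataPreparation object fitted at training time and ctm is the trained model, using the create_test_set and get_doc_topic_distribution calls that appear in other issues on this page):

import numpy as np

# Reuse the data-preparation object fitted at training time so the BoW
# vocabulary matches the trained model's input size.
testing_dataset = tp.create_test_set(
    text_for_contextual=new_unpreprocessed_docs,
    text_for_bow=new_preprocessed_docs,
)

# Average several samples of the document-topic distribution.
predictions = ctm.get_doc_topic_distribution(testing_dataset, n_samples=20)
best_topic_per_doc = np.argmax(predictions, axis=1)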

Using the CoherenceNPMI class of Contextualized Topic Models to evaluate a BERTopic model for Portuguese

  • Contextualized Topic Models version: 1.8.2
  • Python version:
  • Operating System: IOS

I am trying to use the CoherenceNPMI class of contextualized Topic Model to evaluate a Bertopic model for Portuguese Language.

from contextualized_topic_models.evaluation.measures import CoherenceNPMI
import re
delimiters = '*****', ' '

texts = [doc.split() for doc in comentariosList] # load text for NPMI

regexPattern = '|'.join('(?<={})'.format(re.escape(delim)) for delim in delimiters)
topics_ = [re.split(regexPattern, text.strip()) for text in topics_df['Keywords'].values.tolist()]

npmi = CoherenceNPMI(texts=texts, topics=topics_)
npmi.score(topk=3)

I had this error:


KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/gensim/models/coherencemodel.py in _ensure_elements_are_ids(self, topic)
444 try:
--> 445 return np.array([self.dictionary.token2id[token] for token in topic])
446 except KeyError: # might be a list of token ids already, but let's verify all in dict

7 frames
KeyError: 'de '

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/gensim/models/coherencemodel.py in (.0)
445 return np.array([self.dictionary.token2id[token] for token in topic])
446 except KeyError: # might be a list of token ids already, but let's verify all in dict
--> 447 topic = (self.dictionary.id2token[_id] for _id in topic)
448 return np.array([self.dictionary.token2id[token] for token in topic])
449

KeyError: 'de '

Can you help me?
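The KeyError on 'de ' (note the trailing space) suggests the regex split is leaving whitespace attached to the topic words, so they no longer match the gensim dictionary built from texts. A minimal sketch of a fix, stripping each token and dropping empties (untested against this exact data):

topics_ = [
    [w.strip() for w in re.split(regexPattern, text.strip()) if w.strip()]
    for text in topics_df['Keywords'].values.tolist()
]
npmi = CoherenceNPMI(texts=texts, topics=topics_)
npmi.score(topk=3)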

Using GPU instead of CPU

  • Contextualized Topic Models version: 1.4.1
  • Python version: 3.6.9
  • Operating System: Ubuntu

I was trying to run the basic example on a couple of texts. However, it takes a long time when running on CPU only, and the model does not accept a device argument to make it use the GPU.

Is there an easy way to do that without reimplementing everything?
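For what it's worth, the model should pick up CUDA automatically when PyTorch can see it; a quick sanity check (standard PyTorch calls, nothing package-specific):

import torch

print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # should name your GPU

If this prints True and training still runs on the CPU, it is worth checking whether a CPU-only torch wheel was installed, which is a common culprit.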

Could I do inference when only having bert text?

  • Contextualized Topic Models version: 1.7.1
  • Python version:
  • Operating System:

Description

Could I do inference when only having bert text? For example, I have a trained CombinedTM model, and I would like to know the topic distribution of the input "I like apples."

What I Did

I wonder whether we could achieve this with the code shown below.

bert_texts = ["I like apples."]

qt = QuickText("distiluse-base-multilingual-cased",
                text_for_bert=bert_texts,
                text_for_bow=bert_texts)

testing_dataset = qt.load_dataset()

# n_sample how many times to sample the distribution (see the doc)
ctm.get_thetas(testing_dataset, n_samples=20)

However, QuickText will construct self.bow on the fly, and the vocabulary will be different from the one used for the training data. I think the mismatch will make the model produce wrong predictions. Do I understand correctly? Also, how could I achieve this with the current code?
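The diagnosis above looks right: rebuilding the BoW from one sentence yields a different vocabulary than the one the model was trained on. A hedged sketch, assuming a release where TopicModelDataPreparation (the successor of QuickText) is available with create_test_set, which reuses the vectorizer fitted at training time instead of rebuilding the vocabulary:

# tp is the TopicModelDataPreparation object fitted on the training corpus
testing_dataset = tp.create_test_set(
    text_for_contextual=["I like apples."],
    text_for_bow=["I like apples."],  # needed for a CombinedTM; optional for ZeroShot
)
ctm.get_thetas(testing_dataset, n_samples=20)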

Different coherence measures

I am experimenting with your code to draw some comparisons to classic LDA and other topic modeling algorithms.

Although not the most reliable measure, I wanted to look at the various coherence measures.

In your measures.py file you are using the CoherenceModel from gensim, but you are also hard-coding 'c_npmi' as the measure.

  1. I couldn't figure out how to adjust it to 'c_v' & 'u_mass'. Is it even possible? If not, could you please explain why?

  2. I always get the following error message when trying to compute the 'c_npmi':

KeyError                                  Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/gensim/models/coherencemodel.py in _ensure_elements_are_ids(self, topic)
    444         try:
--> 445             return np.array([self.dictionary.token2id[token] for token in topic])
    446         except KeyError:  # might be a list of token ids already, but let's verify all in dict

/opt/conda/lib/python3.7/site-packages/gensim/models/coherencemodel.py in <listcomp>(.0)
    444         try:
--> 445             return np.array([self.dictionary.token2id[token] for token in topic])
    446         except KeyError:  # might be a list of token ids already, but let's verify all in dict

KeyError: 'lan'

Thanks in advance.

All the best
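Since measures.py is a thin wrapper around gensim's CoherenceModel with 'c_npmi' hard-coded, nothing stops you from calling gensim directly with another measure. A sketch using the plain gensim API (it assumes topics is a list of top-word lists and that every topic word actually occurs in texts; a word missing from the dictionary produces exactly the KeyError shown above, e.g. for a truncated token like 'lan'):

from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

dictionary = Dictionary(texts)
for measure in ('c_v', 'u_mass', 'c_npmi'):
    cm = CoherenceModel(topics=topics, texts=texts,
                        dictionary=dictionary, coherence=measure, topn=10)
    print(measure, cm.get_coherence())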

Contextualized-topic-models weird GPU usage (PyTorch and TensorFlow working fine)

  • Contextualized Topic Models version: 1.8.2
  • Python version: 3.7.6
  • Operating System: Windows 10 Pro

Description

I'm trying to get Contextualized Topic Models tutorial to run on my gpu (rtx 3090).

What I Did

At the very start of .fit() I see a spike in GPU usage and GPU memory gets allocated, then GPU usage sits at just 2-4% with the model taking several minutes to train. I tried the same code on Google Colab and it ran in seconds. Additionally, I see my CPU usage at nearly 100% during the entire run. The only explanation I have for this behavior is that the code is trying to run in float64 mode, which would cripple the speed of a 3090. I am fairly certain my GPU drivers are installed correctly, as I see normal GPU usage with TensorFlow, and PyTorch is detecting my devices. Any help would be much appreciated; please let me know if there is any more information I could provide.

Multilingual BERT models

  • Contextualized Topic Models version:
  • Python version: 3.8
  • Operating System: linux

Description

I would like to train a topic model over a multilingual corpus (most of the corpus is English, but ~20% is French, German, etc.).
Can I use the CombinedTM model for that?
I guess I should use a multilingual BERT model to represent my sentences, and the rest of your model pipeline should work fine - right?

thanks a lot!
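For reference, the FYI reply quoted further down on this page shows exactly this setup; in short (using the package's own example calls, with a multilingual sentence-transformer and the "contextual" inference type):

training_bert = bert_embeddings_from_file("documents.txt", "distiluse-base-multilingual-cased")
ctm = CTM(input_size=len(handler.vocab), bert_input_size=512,
          inference_type="contextual", n_components=50)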

pip install and runtime errors

  • Contextualized Topic Models version:
  • Python version: 3.8.6, pip: 20.2.1
  • Operating System: Windows

Description

pip install results in torch/torchvision dependency errors. After "fixing" dependency errors, I get some errors related to missing names (e.g., ZeroShotTM) when trying to run the code.

What I Did

pip install -U contextualized_topic_models

Collecting contextualized_topic_models
  Using cached contextualized_topic_models-1.8.1-py2.py3-none-any.whl (29 kB)
Collecting wordcloud==1.8.1
  Using cached wordcloud-1.8.1-cp38-cp38-win_amd64.whl (155 kB)
Collecting scipy==1.4.1
  Using cached scipy-1.4.1-cp38-cp38-win_amd64.whl (31.0 MB)
  Using cached numpy-1.19.1-cp38-cp38-win_amd64.whl (13.0 MB)
ERROR: Could not find a version that satisfies the requirement torchvision==0.7.0 (from contextualized_topic_models) (from versions: 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.1, 0.2.2, 0.2.2.post2, 0.2.2.post3, 0.5.0)
ERROR: No matching distribution found for torchvision==0.7.0 (from contextualized_topic_models)

To fix this, I pip installed the latest torch and torchvision and reran 'pip install -U contextualized_topic_models', which seemed to complete successfully. However, when I run the example code, I get the following errors:

No name 'ZeroShotTM' in module 'contextualized_topic_models.models.ctm'

When I view the /venv/contextualized_topic_models/models/ctm.py file in my virtual environment, I see that it does not include class ZeroShotTM as it does on Github.

I am also getting similar errors as shown below:

No name 'TopicModelDataPreparation' in module
Undefined variable 'load_from_text'
Undefined variable 'MmCorpus'
Undefined variable 'LdaModel'
Undefined variable 'doc2bow'

BoW and Contextual Embeddings have different sizes

Whenever I try to run the CTMDataset, I get the following error message.

Exception: Wait! BoW and Contextual Embeddings have different sizes! You might want to check if the BoW preparation method has removed some documents.

I am not sure how to solve this problem. When I convert the preprocessed and not_preprocessed columns in my dataframe to a .txt file, they have the same length, without any NaN or empty strings. But after handler.prepare() and bert_embeddings_from_file, they have different lengths.

I hope that I formulated my problem relatively clearly.

Looking forward to some feedback. If you need more information, please let me know!

Many thanks

Is a dense-representation BoW necessary?

I understand why the BoW matrix is necessary for the combined model. Is it also necessary for the contextual model? The BoW matrix requires num_documents * vocabulary_size memory but is mostly zeroes. Is it possible to modify the contextual model pipeline so that the BoW matrix does not have to be created? Alternatively, can the BoW be stored in a more efficient sparse representation?

A very large instance with ~600GB of memory can only process ~5 million documents with a vocabulary of just 15,000 tokens. That may sound like a lot, but that limit is easily reached if large documents are broken up into smaller sections in order to fit BERT's input length limitations. And 15,000 tokens is a pretty small vocabulary, especially for a cased model.

For example, we would like to try this method with a vocabulary of 100,000 tokens and ~100M documents, representing about 5GB of text. The BoW matrix for that would require something like 10TB of memory. A sparse representation would require much less memory.
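A quick back-of-the-envelope check of those numbers (assuming float64 entries for the dense case and roughly 100 non-zero counts per document for the sparse case; the 100 is an assumption, not a figure from the issue):

n_docs, vocab_size = 5_000_000, 15_000
dense_gb = n_docs * vocab_size * 8 / 1e9
print(dense_gb)   # 600.0 -> ~600 GB, matching the instance size above

# A scipy.sparse CSR matrix stores only the non-zeros
# (roughly one 8-byte value plus one 4-byte column index per entry).
nnz_per_doc = 100  # assumed average
sparse_gb = n_docs * nnz_per_doc * (8 + 4) / 1e9
print(sparse_gb)  # 6.0 -> ~6 GB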

Problematic matching between BOW and LM embeddings in CTMDataset

  • Contextualized Topic Models version: 1.3.1
  • Python version: 3.6.8
  • Operating System: Ubuntu 18.04 LTS

Description

Not properly a bug, but a possibly problematic behavior worth highlighting:

When initializing the CTMDataset object, there are no checks to ensure that both the BOW and contextual embeddings have the same length. This wouldn't normally be a problem, except that documents can get filtered out from the BOW list during TextHandler.prepare() if they don't contain any word in the vocabulary due to the condition if np.sum(x[x != np.array(None)]) != 0 in get_bag_of_words. The result is that there is a mismatch between BOW and contextual embedding, and CTMDataset __getitem__ accesses wrong elements.

What I Did

I took the raw first 50'000 sentences of the English Europarl corpus (next referred to as "unpreprocessed documents") and did some standard preprocessing (lemmatization, POS-based filtering) obtaining a set of sentences in a format similar to the one used in your GoogleNews examples ("preprocessed documents").

Following the provided examples, I use the prepare method of the TextHandler class to obtain the BOW representation of each preprocessed document. After that, I generate the contextual embeddings from unpreprocessed documents using the bert_embeddings_from_file or bert_embeddings_from_list methods and use both to initialize the CTMDataset.

Additional precisions and possible solutions

Despite the issue, the fitting process for the CTM with mismatched sizes runs without crashing. I didn't investigate this specifically, but I assume it is somehow related to some default batching behavior in torch.

A quick possible fix for the reported issue would be to simply raise an exception if # of BOW and contextual embeddings do not match in the CTMDataset initializer, making users aware of the problem. A more complex albeit possibly better alternative could be to avoid filtering empty BOWs in the get_bag_of_words method and do it in the CTMDataset initializer, removing the corresponding contextual embedding to preserve the correct matching.
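A minimal sketch of the quick fix proposed above (not the actual package code, just the shape of the check). Interestingly, the error message quoted in the "BoW and Contextual Embeddings have different sizes" issue above suggests a check along these lines was eventually added:

from torch.utils.data import Dataset

class CTMDataset(Dataset):
    def __init__(self, X_bow, X_contextual, idx2token):
        # Fail loudly instead of silently mismatching documents.
        if X_bow.shape[0] != len(X_contextual):
            raise Exception(
                "Wait! BoW and Contextual Embeddings have different sizes! "
                "You might want to check if the BoW preparation method "
                "has removed some documents.")
        self.X_bow = X_bow
        self.X_contextual = X_contextual
        self.idx2token = idx2token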

Error after running training_dataset = qt.load_dataset() in the colab zero shot example

  • Contextualized Topic Models version:
  • Python version:
  • Operating System:

Description

I am getting the following error

AttributeError Traceback (most recent call last)
in ()
----> 1 training_dataset = qt.load_dataset()

AttributeError: 'TopicModelDataPreparation' object has no attribute 'load_dataset'

After running training_dataset = qt.load_dataset() in the colab notebook "Contextualized Topic Modeling with your Documents (with preprocessing)".
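load_dataset belongs to the older QuickText helper; with TopicModelDataPreparation (the class the notebook actually instantiates) the equivalent call appears to be create_training_set, as used elsewhere on this page:

qt = TopicModelDataPreparation("bert-base-nli-mean-tokens")  # model name is illustrative
training_dataset = qt.create_training_set(
    text_for_contextual=unpreprocessed_corpus,
    text_for_bow=preprocessed_documents,
)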


Support on Amazon Sagemaker

Attempting to test this package in Amazon SageMaker on Python 3.

Description

pip fails to install the module; importing it then raises ModuleNotFoundError.

What I Did

!pip install contextualized_topic_models
import contextualized_topic_models

output:
Collecting contextualized_topic_models
Using cached contextualized_topic_models-1.8.2-py2.py3-none-any.whl (29 kB)
Collecting gensim==3.8.3
Using cached gensim-3.8.3-cp37-cp37m-manylinux1_x86_64.whl (24.2 MB)
Collecting scipy==1.4.1
Using cached scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1 MB)
Collecting tqdm==4.56.0
Using cached tqdm-4.56.0-py2.py3-none-any.whl (72 kB)
Requirement already satisfied: wordcloud==1.8.1 in /opt/conda/lib/python3.7/site-packages (from contextualized_topic_models) (1.8.1)
Collecting numpy==1.19.1
Using cached numpy-1.19.1-cp37-cp37m-manylinux2010_x86_64.whl (14.5 MB)
Collecting sentence-transformers==0.4.1
Using cached sentence-transformers-0.4.1.tar.gz (64 kB)
Collecting torchvision>=0.7.0
Using cached torchvision-0.9.1-cp37-cp37m-manylinux1_x86_64.whl (17.4 MB)
Collecting torch>=1.6.0
Downloading torch-1.8.1-cp37-cp37m-manylinux1_x86_64.whl (804.1 MB)
|████████████████████████████████| 804.1 MB 77.5 MB/s eta 0:00:01 ... Killed

ModuleNotFoundError Traceback (most recent call last)
in
1 get_ipython().system('pip install contextualized_topic_models')
----> 2 import contextualized_topic_models
3 from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
4 from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file
5 from contextualized_topic_models.datasets.dataset import CTMDataset

ModuleNotFoundError: No module named 'contextualized_topic_models'
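The "Killed" in the middle of the torch download usually means the instance ran out of memory while pip cached the 800 MB wheel; installing without the cache (a standard pip flag) often gets around it. A hedged suggestion, not a confirmed fix for this environment:

pip install --no-cache-dir torch
pip install --no-cache-dir contextualized_topic_models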

Error when getting document distribution matrix

Hi!
I ran the following code on Google Colab:

from contextualized_topic_models.models.ctm import CTM
from contextualized_topic_models.utils.data_preparation import TextHandler
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file
from contextualized_topic_models.datasets.dataset import CTMDataset

handler = TextHandler("documents.txt")
handler.prepare() # create vocabulary and training data

# generate BERT data
training_bert = bert_embeddings_from_file("documents.txt", "bert-base-nli-mean-tokens")

training_dataset = CTMDataset(handler.bow, training_bert, handler.idx2token)

ctm = CTM(input_size=len(handler.vocab), bert_input_size=768, inference_type="combined", n_components=512, num_epochs=100)

ctm.fit(training_dataset) # run the model

After training, I use the following command to get the topic distribution of document:

distribution = ctm.get_thetas(training_dataset) 
print(distribution)

And got an error (attached as a screenshot in the original issue).

I don't know why this happens, because several days ago I tried the same code and got the desired result.

How to Fix the Seed in CTMs

  • Contextualized Topic Models version: version = '1.8.2'
  • Python version: 3.7.2
  • Operating System: iOS locally and VM Linux Centos 7

First of all, thanks a lot for the awesome, nicely structured package!

Description

I'm trying to run the ZeroShot model over my personal train and test datasets, using the 'sbert.net_models_bert-base-nli-mean-tokens' BERT model. I succeed in training and evaluating the model; however, when I train two separate models with the same parameter settings, I get very different results (that is, very different topics).

I would like to get the same results for each combination of hyperparameter settings, so that I can actually see what the effect is of using different hyperparameters.

What I Did

def set_seed():
    """ Fixes the seed for every iteration.
        Should be placed in all torch loops. """
    torch.manual_seed(SETUP.SEED)
    torch.cuda.manual_seed(SETUP.SEED)
    np.random.seed(SETUP.SEED)
    random.seed(SETUP.SEED)
    torch.backends.cudnn.enabled = False
    torch.backends.cudnn.deterministic = True

I tried to fix the seed by calling the function above in each for loop related to PyTorch, however, this doesn't seem to solve the problem.

If anyone can help me create reproducible results, please let me know!
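One thing worth checking: the seed has to be fixed before the model (and therefore its weight initialization and its DataLoader) is created, not only inside the training loops. A sketch, with set_seed as defined above and the constructor call copied from the examples on this page:

set_seed()
ctm = CTM(input_size=len(handler.vocab), bert_input_size=768,
          inference_type="contextual", n_components=50)
ctm.fit(training_dataset)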

Expected more than 1 value per channel when training

I am following the instructions to fit the model. However, I am getting this error:
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 50])

How to resolve this? What does this mean?

from contextualized_topic_models.models.ctm import CTM
from contextualized_topic_models.utils.data_preparation import TextHandler
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file
from contextualized_topic_models.datasets.dataset import CTMDataset

handler = TextHandler("topic_tst.txt")
handler.prepare() # create vocabulary and training data

training_bert = bert_embeddings_from_file("topic_tst.txt", "./roberta-large-nli-mean-tokens")


training_dataset = CTMDataset(handler.bow, training_bert, handler.idx2token)

ctm = CTM(input_size=len(handler.vocab), bert_input_size=1024, inference_type="combined", n_components=50)

ctm.fit(training_dataset) # run the model
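For what it's worth, this ValueError comes from a BatchNorm layer receiving a mini-batch containing a single document (the torch.Size([1, 50]) is one document by n_components=50), which typically happens when len(dataset) % batch_size == 1. A hedged workaround, assuming the constructor exposes batch_size as in the current source: pick a batch size (or drop one document) so the last batch has more than one element.

ctm = CTM(input_size=len(handler.vocab), bert_input_size=1024,
          inference_type="combined", n_components=50,
          batch_size=63)  # illustrative value; choose so len(dataset) % batch_size != 1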

Multiprocessing module issue

from contextualized_topic_models.models.ctm import CTM
from contextualized_topic_models.utils.data_preparation import TextHandler
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file
from contextualized_topic_models.datasets.dataset import CTMDataset

handler = TextHandler("documents.txt")
handler.prepare() # create vocabulary and training data

# generate BERT data

training_bert = bert_embeddings_from_file("documents.txt", "distiluse-base-multilingual-cased")

training_dataset = CTMDataset(handler.bow, training_bert, handler.idx2token)

ctm = CTM(input_size=len(handler.vocab), bert_input_size=512, inference_type="combined", n_components=50)

ctm.fit(training_dataset) # run the model

Output:
Settings:
N Components: 50
Topic Prior Mean: 0.0
Topic Prior Variance: 0.98
Model Type: prodLDA
Hidden Sizes: (100, 100)
Activation: softplus
Dropout: 0.2
Learn Priors: True
Learning Rate: 0.002
Momentum: 0.99
Reduce On Plateau: False
Save Dir: None
Main process traceback:

Traceback (most recent call last):
  File "test3.py", line 22, in <module>
    ctm.fit(training_dataset)
  File "C:\Users\Jay\py36\neo\lib\site-packages\contextualized_topic_models\models\ctm.py", line 225, in fit
    sp, train_loss = self._train_epoch(train_loader)
  File "C:\Users\Jay\py36\neo\lib\site-packages\contextualized_topic_models\models\ctm.py", line 151, in _train_epoch
    for batch_samples in loader:
  File "C:\Users\Jay\py36\neo\lib\site-packages\torch\utils\data\dataloader.py", line 291, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\Jay\py36\neo\lib\site-packages\torch\utils\data\dataloader.py", line 737, in __init__
    w.start()
  File "C:\Users\Jay\py36\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\Jay\py36\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\Jay\py36\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\Jay\py36\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\Jay\py36\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe

Spawned child process traceback (interleaved with the above in the original output):

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Jay\py36\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\Jay\py36\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\Jay\py36\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\Jay\py36\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="mp_main")
  File "C:\Users\Jay\py36\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\Jay\py36\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\Jay\py36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Jay\Desktop\try_backend\test3.py", line 22, in <module>
    ctm.fit(training_dataset)
  File "C:\Users\Jay\py36\neo\lib\site-packages\contextualized_topic_models\models\ctm.py", line 225, in fit
    sp, train_loss = self._train_epoch(train_loader)
  File "C:\Users\Jay\py36\neo\lib\site-packages\contextualized_topic_models\models\ctm.py", line 151, in _train_epoch
    for batch_samples in loader:
  File "C:\Users\Jay\py36\neo\lib\site-packages\torch\utils\data\dataloader.py", line 291, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\Jay\py36\neo\lib\site-packages\torch\utils\data\dataloader.py", line 737, in __init__
    w.start()
  File "C:\Users\Jay\py36\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\Jay\py36\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\Jay\py36\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\Jay\py36\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\Jay\py36\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\Jay\py36\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

The example mentioned in the GitHub README itself is not running.
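The RuntimeError at the end of the traceback contains the fix: on Windows, multiprocessing uses spawn, so the script re-imports itself and the training code must be guarded. A sketch of the same script with the required idiom:

from contextualized_topic_models.models.ctm import CTM
from contextualized_topic_models.utils.data_preparation import TextHandler
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file
from contextualized_topic_models.datasets.dataset import CTMDataset

if __name__ == '__main__':
    handler = TextHandler("documents.txt")
    handler.prepare()  # create vocabulary and training data

    # generate BERT data
    training_bert = bert_embeddings_from_file("documents.txt", "distiluse-base-multilingual-cased")
    training_dataset = CTMDataset(handler.bow, training_bert, handler.idx2token)

    ctm = CTM(input_size=len(handler.vocab), bert_input_size=512,
              inference_type="combined", n_components=50)
    ctm.fit(training_dataset)  # run the model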

failing to predict topics in unseen documents

  • Contextualized Topic Models version: 1.5.3
  • Python version: 3.6.9
  • Operating System: google colab

Description

Following the colab tutorial I trained the model using the wikipedia dataset.
Now I am trying to follow the documentation to apply the model to new documents, as described here and, in a slightly different way, in this notebook. Both ways I get the same error. What am I missing?

What I Did

test_handler = TextHandler("my_dataset_english.txt")
test_handler.prepare() # create vocabulary and training data

# generate BERT data
testing_bert = bert_embeddings_from_file("my_dataset_english.txt", "distilbert-base-nli-mean-tokens")

testing_dataset = CTMDataset(test_handler.bow, testing_bert, test_handler.idx2token)
# n_sample how many times to sample the distribution (see the doc)
ctm.get_thetas(testing_dataset, n_samples=20)

(error screenshot attached in the original issue)

Predict topics for unseen documents

I have a question about predicting the topics for documents.

Can I predict the topics of documents that are different from the training documents? For example, I will train on technology-related documents, but at prediction time I will try management-related documents. Will it still give reasonable topics, or should I retrain the model on management-related documents?

I also saw the discussion in #4 and tried the following code

ctm.get_topic_lists()[predicted_topics[0]]

but this get_topic_lists() comes from the trained technology documents, whose topics are unrelated to management documents. So, according to this, there is clearly no chance of getting management topics, because we are mapping to unrelated topic lists.

Is my question understandable? I'm quite new to this topic modelling field. So, this is quite confusing to me.

Please help me.

Loading pretrained models

  • Contextualized Topic Models version: 1.7.0
  • Python version: 3.7.3
  • Operating System: Windows

Description

I have trained a Zero Shot Cross Lingual topic model and saved it with the .save() method. Now, I have problems with loading this model.

What I Did

CTM().load(model_dir = 'path to the epoch_9.pth file', epoch=9)

This resulted in the following error:

TypeError: __init__() missing 2 required positional arguments: 'input_size' and 'bert_input_size'

Could you please tell me, what would be the right way to load a pretrained Zero Shot model ?
Thank you!
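The traceback suggests CTM cannot be constructed empty. A hedged sketch of loading, assuming the constructor must receive the same arguments used at training time and that load takes the model directory and epoch as in the call above:

ctm = CTM(input_size=len(handler.vocab), bert_input_size=512,
          inference_type="contextual", n_components=50)  # same values as at training time
ctm.load("path/to/models_dir", epoch=9)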

dimension mismatch during inference

Thanks for sharing the code.
I am trying to do contextualized topic modeling on a custom dataset as below.

tp = TopicModelDataPreparation("bert-base-nli-mean-tokens")
training_dataset = tp.create_training_set(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)
ctm_1 = CombinedTM(input_size=len(tp.vocab), bert_input_size=768, n_components=5, num_epochs=15)
ctm_1.fit(training_dataset)

test_documents = ['sentence 1', 'sentence 2']
sp = WhiteSpacePreprocessing(test_documents, stopwords_language='english')
test_preprocessed_documents, test_unpreprocessed_corpus, vocab = sp.preprocess()
testing_dataset = tp.create_training_set(text_for_contextual=test_unpreprocessed_corpus, text_for_bow=test_preprocessed_documents)
test_topics_predictions_1 = ctm_1.get_thetas(testing_dataset)

It gives me below dimension mismatch error.

    RuntimeError: mat1 dim 1 must match mat2 dim 0 
    contextualized_topic_models/networks/inference_network.py in forward(self, x, x_bert)
     128         x = torch.cat((x, x_bert), 1)
--> 130         x = self.input_layer(x)

I tried another way to create the test dataset; however, it also gives the same error.

test_set = tp.create_test_set(test_documents)
test_topics_predictions_1 = ctm_1.get_doc_topic_distribution(test_set)
and
test_topics_predictions_1  = ctm_1.get_thetas(test_set)

Am I missing anything here while doing the inference? Please share your input on how to run inference with a trained CTM model in real time.
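The likely cause is that create_training_set fits a fresh CountVectorizer on the two test sentences, so the test BoW has a different vocabulary size than the trained model expects. A sketch of the usual remedy: reuse the tp object fitted at training time and build the test set with create_test_set, which (as the data_preparation.py excerpt quoted further down shows) transforms with the existing vectorizer instead of refitting:

testing_dataset = tp.create_test_set(
    text_for_contextual=test_unpreprocessed_corpus,
    text_for_bow=test_preprocessed_documents,  # required for a CombinedTM
)
test_topics_predictions_1 = ctm_1.get_doc_topic_distribution(testing_dataset, n_samples=10)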

NameError: name 'tp' is not defined

Description

qt = TopicModelDataPreparation("bert-base-nli-mean-tokens")
training_dataset = tp.create_training_set(unpreprocessed_corpus_for_contextual, preprocessed_documents_for_bow)

What I Did

NameError                                 Traceback (most recent call last)

<ipython-input-11-de13bb34ab74> in <module>()
      1 qt = TopicModelDataPreparation("bert-base-nli-mean-tokens")
----> 2 training_dataset = tp.create_training_set(unpreprocessed_corpus_for_contextual, preprocessed_documents_for_bow)

NameError: name 'tp' is not defined

@vinid @silviatti @dnozza
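The traceback points at a simple naming mismatch: the object is created as qt but called as tp. Corrected:

qt = TopicModelDataPreparation("bert-base-nli-mean-tokens")
training_dataset = qt.create_training_set(unpreprocessed_corpus_for_contextual, preprocessed_documents_for_bow)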

Hugging Face Model for Embedding

  • Contextualized Topic Models version: 1.8.2
  • Python version: 3.6
  • Operating System: Windows 10

Description

Hey guys... I'm trying to use CTMs for topic modeling of survey answers. These texts are in Spanish, so I want to use a Spanish pre-trained Hugging Face model, as the repository says: "In general, our package should be able to support all the models described in the sentence transformer package and in HuggingFace."

Could you give an example of how to load and use for embeddings a Hugging Face model like, for example:

https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased

It'd be incredible if I could use this model, since it works very well on other NLP tasks.

Thanks!
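A hedged sketch of wrapping a Hugging Face checkpoint as a sentence-transformers model (standard sentence-transformers API; mean pooling is an assumption, pick whatever pooling suits the task) and then embedding documents with it:

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('dccuchile/bert-base-spanish-wwm-uncased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

embeddings = model.encode(["primera respuesta", "segunda respuesta"])
print(embeddings.shape)  # (2, 768) for this checkpoint

These vectors can then be fed to CTMDataset in place of the ones from bert_embeddings_from_file, with bert_input_size=768.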

Applying the code to my own dataset

Hi, thank you for this great job. I'm a beginner with BERT and I want to use it for topic modeling (extracting topics from Arabic text). Do you have an idea how I can do this using your code? Thank you so much.

BR

Parallel GPU training

Is parallel GPU training support possible? We would like to try this with a fairly large (multi-GB) dataset, but to make training time reasonable it would need to be done in parallel. Single node parallelism with DataParallel() would probably work for our use case, although the PyTorch documentation suggests that DistributedDataParallel() is preferred even for a single node.

Part of the motivation for this is that a large dataset needs a lot of memory, which in a cloud environment means a large, multi-GPU instance. It is very expensive to run such a large instance for weeks with all but one of the GPUs idle.

Explanation of how ZeroShotTM works

Can you explain the internal working of ZeroShotTM, since it doesn't use the BoW approach? The paper didn't highlight ZeroShotTM. How do the topics get derived? And how does it predict for unseen documents and calculate the probability of the topics?

Loading Model for Inference without train everytime

  • Contextualized Topic Models version:
  • Python version:
  • Operating System:

Description

Hello!

I have a model that is predicting the topics very well, so I want to put it in production. While setting this up, I ran into the following doubt. I'm loading my pretrained model and that's OK, but when I create my test_data I have an issue related to the data processed for the BoW:

def create_test_set(self, text_for_contextual, text_for_bow=None):

        if self.contextualized_model is None:
            raise Exception("You should define a contextualized model if you want to create the embeddings")

        if text_for_bow is not None:
            test_bow_embeddings = self.vectorizer.transform(text_for_bow)
        else:
            # dummy matrix
            test_bow_embeddings = scipy.sparse.csr_matrix(np.zeros((len(text_for_contextual), 1)))
        test_contextualized_embeddings = bert_embeddings_from_list(text_for_contextual, self.contextualized_model)

        return CTMDataset(test_bow_embeddings, test_contextualized_embeddings, self.id2token)

    def create_validation_set(self, text_for_contextual, text_for_bow=None):
        return self.create_test_set(text_for_contextual=text_for_contextual, text_for_bow=text_for_bow)

Here in your data_preparation.py script, the create_test_set function's text_for_bow argument is optional, which makes sense, as I don't want to retrain my model every time (that is, recreate the training_dataset, where the CountVectorizer is fitted). Anyway, if I only pass my text_for_contextual and then do this:

predictions = ctm.get_doc_topic_distribution(testing_dataset, n_samples=10)

I get a dimension error... (After that, I created a training set with create_training_set and then my test set with text_for_bow=processed_test_set, and it went well.)

Any advice? I would appreciate it, because my main objective here is to predict without training every time.

Thanks.

RuntimeError                              Traceback (most recent call last)
<ipython-input-210-61941325c3e1> in <module>
      1 testing_dataset_prueba = qt.create_test_set(unpreprocessed_documents_test[:1000]) # create dataset for the testset
----> 2 predictions_prueba = ctma.get_doc_topic_distribution(testing_dataset_prueba, n_samples=10)

~\AppData\Roaming\Python\Python38\site-packages\contextualized_topic_models\models\ctm.py in get_doc_topic_distribution(self, dataset, n_samples)
    289                     # forward pass
    290                     self.model.zero_grad()
--> 291                     collect_theta.extend(self.model.get_theta(X, X_bert).cpu().numpy().tolist())
    292 
    293                 pbar.update(1)

~\AppData\Roaming\Python\Python38\site-packages\contextualized_topic_models\networks\decoding_network.py in get_theta(self, x, x_bert)
    126         with torch.no_grad():
    127             # batch_size x n_components
--> 128             posterior_mu, posterior_log_sigma = self.inf_net(x, x_bert)
    129             posterior_sigma = torch.exp(posterior_log_sigma)
    130 

~\Miniconda3\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

~\AppData\Roaming\Python\Python38\site-packages\contextualized_topic_models\networks\inference_network.py in forward(self, x, x_bert)
    126         x_bert = self.adapt_bert(x_bert)
    127         x = torch.cat((x, x_bert), 1)
--> 128         x = self.input_layer(x)
    129 
    130         x = self.activation(x)

~\Miniconda3\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

~\Miniconda3\lib\site-packages\torch\nn\modules\linear.py in forward(self, input)
     91 
     92     def forward(self, input: Tensor) -> Tensor:
---> 93         return F.linear(input, self.weight, self.bias)
     94 
     95     def extra_repr(self) -> str:

~\Miniconda3\lib\site-packages\torch\nn\functional.py in linear(input, weight, bias)
   1688     if input.dim() == 2 and bias is not None:
   1689         # fused op is marginally faster
-> 1690         ret = torch.addmm(bias, input, weight.t())
   1691     else:
   1692         output = input.matmul(weight.t())

RuntimeError: mat1 and mat2 shapes cannot be multiplied (64x2001 and 4000x100)
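A common pattern for the "predict without retraining" goal is to persist the fitted TopicModelDataPreparation object (it carries the CountVectorizer) next to the model, then build test sets with text_for_bow so a CombinedTM gets a BoW of the right width. A sketch, assuming tp was fitted at training time and preprocessed_documents_test is the preprocessed test text (names illustrative):

import pickle

# at training time, alongside the saved model
with open("tp.pkl", "wb") as f:
    pickle.dump(tp, f)

# at inference time
with open("tp.pkl", "rb") as f:
    tp = pickle.load(f)

testing_dataset = tp.create_test_set(
    text_for_contextual=unpreprocessed_documents_test[:1000],
    text_for_bow=preprocessed_documents_test[:1000],
)
predictions = ctm.get_doc_topic_distribution(testing_dataset, n_samples=10)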

Error While Getting BERT Embeddings From File

Hey,

Hope you are well.

Description

I was trying to take some text from some JSON files and then create topic models for those files. In particular, I was trying to calculate the topic distribution for each of the documents.

I mostly followed what was mentioned on this issue ticket regarding the Pandas DataFrame and got it to work yesterday. However, today it is continuously giving me a weird error: "ValueError: Wrong shape for input_ids (shape torch.Size([1040])) or attention_mask (shape torch.Size([1040]))"

The error is when I try to get BERT embeddings from the file, in particular, when I run the following command:

training_bert = bert_embeddings_from_file("pre_documents.txt", "bert-base-nli-mean-tokens")

Below are the exact crash details.

ValueError                                Traceback (most recent call last)
<ipython-input-12-7cd62b310f6c> in <module>()
      6 handler.prepare()
      7 
----> 8 training_bert = bert_embeddings_from_file("pre_documents.txt", "bert-base-nli-mean-tokens") # I'm assuming the tweets are in english here.
      9 
     10 training_dataset = CTMDataset(handler.bow, training_bert, handler.idx2token)

7 frames
/usr/local/lib/python3.6/dist-packages/transformers/modeling_utils.py in get_extended_attention_mask(self, attention_mask, input_shape, device)
    260             raise ValueError(
    261                 "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
--> 262                     input_shape, attention_mask.shape
    263                 )
    264             )

ValueError: Wrong shape for input_ids (shape torch.Size([1040])) or attention_mask (shape torch.Size([1040]))

For reference, I'm using Google Colab.

I'm pretty confused and any help would be greatly appreciated!

Predicted topic distribution for documents varies

  • Contextualized Topic Models version: 1.4.2
  • Python version: 3.7.6
  • Operating System: Ubuntu 18.04

Description

I want to extract the topic distribution for my documents (after having fitted the model).
For example, I want to use this to show the "top 10" documents for every topic.

It's not clear how to do this in the library. Using ctm.get_thetas() doesn't seem to work for me here.

What I Did

Following your example notebook I tried:

distribution = ctm.get_thetas(training_dataset)[8] # topic distribution for the eighth document
print(distribution)

However, if I run the example notebook and execute the cell with the above code multiple times, the output of distribution changes. Is this expected? From my understanding, the model should be in pure inference mode and multiple calls with the same docs should result in the same topic probabilities. Why is that the case?
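For context: get_thetas draws samples from the model's variational posterior, so single calls are stochastic; averaging over many samples stabilizes the estimate. A sketch using the n_samples argument that appears elsewhere on this page:

distribution = ctm.get_thetas(training_dataset, n_samples=100)[8]
print(distribution)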

Code for topic prediction in the example

Thanks for sharing the code! I'm a little bit confused about the code for topic prediction in https://github.com/MilaNLProc/contextualized-topic-models/blob/master/examples/topic-modeling.ipynb. So if I understand it right (I am new to this field and hopefully I got it), is it correct that one line in GoogleNews.txt refers to one document? In the example, the topics the model generates for the document "nokia lumia launch" are ['moto', 'xbox', 'camera', 'surface', 'review']. These words didn't appear in the eighth document.

topic distribution vector of word

Hi,
In the example notebook for Contextual Topic Modeling, we can get topic distribution for a document via
distribution = ctm.get_thetas(training_dataset)[8] # topic distribution for the eighth document
I'm wondering whether can we get topic distribution for a single word? More specifically, for a input text corpus can we obtain the vocabulary file which contains words and the corresponding topic representations?

Function ctm.get_thetas takes a very long time to evaluate on a 100K set

Hello,

I have used the method below to work on text documents and evaluate the topics; the code works well on 100 lines of documents. But I am facing issues when I run on 100K lines: the training completes on time, but during topic extraction ctm.get_thetas takes way too much time (80+ hours). Is there anything I am missing here?
Thank you in advance.

#FYI:

Yes, it will predict English topics for the unseen Spanish documents (see the paper for more details). However, if the context of the training documents is completely different from the ones used at test time (e.g., you train on sport-related tweets in English and test on politics-related tweets in Spanish), the topic prediction will be less accurate.

A few more notes:

You will need to use multilingual embeddings in this case

training_bert = bert_embeddings_from_file("documents.txt", "distiluse-base-multilingual-cased")

and you must use the "contextual" model in place of the "combined" one:

ctm = CTM(input_size=len(handler.vocab), bert_input_size=512, inference_type="contextual", n_components=50)

The multilingual notebook in the example folder should contain all that is needed to run this process.

At test time, you can simply create the new vector embeddings for a different language and predict the topics. You'll find that to predict topic you can use the very same code I've shared with you in the first message.

testing_bert_italian = bert_embeddings_from_file('unpreprocessed_docs_italian.txt', "distiluse-base-multilingual-cased")
testing_dataset_italian = CTMDataset(testing_bert_italian, testing_bert_italian, []) # not the best definition, but it's an effective workaround to get the topics for italian documents.

and you get the topic distributions:

predicted_topics = [] 
thetas = np.zeros((len(testing_dataset_italian), num_topics))
for a in range(0, 100):
    thetas = thetas + np.array(ctm.get_thetas(testing_dataset_italian))
    
for idd in range(0, len(testing_dataset_italian)):
    thetas[idd] = thetas[idd]/np.sum(thetas[idd])
    predicted_topic = np.argmax(thetas[idd]) 
    predicted_topics.append(predicted_topic)

thetas # document-topic distribution 
predicted_topics # list of the topic predicted for each testing document

Hope this helps :)

Originally posted by @vinid in #4 (comment)

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 2869: character maps to <undefined>

  • Contextualized Topic Models version: 1.5.2
  • Python version: 3.7
  • Operating System: Windows 10

Description

I'm just copying and pasting the notebook found here, with the difference that instead of using wget on the dataset, I just downloaded it from the colab notebook:

https://colab.research.google.com/drive/1V0tkpJL1yhiHZUJ_vwQRu6I7_svjw1wb?usp=sharing#scrollTo=WXZ8fOdYwdWO

What I Did

I then get the following error.

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-8-5453e403b509> in <module>
      3         training_bert = pickle.load(filino)
      4 else:
----> 5     training_bert = bert_embeddings_from_file("dbpedia_sample_abstract_20k_unprep.txt", "distilbert-base-nli-mean-tokens", batch_size=50)
      6     with open("saved_embeddings", "wb") as filino:
      7         pickle.dump(training_bert, filino)

~\AppData\Roaming\Python\Python38\site-packages\contextualized_topic_models\utils\data_preparation.py in bert_embeddings_from_file(text_file, sbert_model_to_load, batch_size)
     21     model = SentenceTransformer(sbert_model_to_load)
     22     with open(text_file) as filino:
---> 23         train_text = list(map(lambda x: x, filino.readlines()))
     24 
     25     return np.array(model.encode(train_text, show_progress_bar=True, batch_size=batch_size))

c:\program files\python38\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 2869: character maps to <undefined>
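The traceback shows data_preparation.py opening the file without an explicit encoding, so Windows falls back to cp1252. A hedged workaround that sidesteps the library's file handling by reading the file yourself as UTF-8 and using bert_embeddings_from_list, which this page shows exists in the same module:

from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list

with open("dbpedia_sample_abstract_20k_unprep.txt", encoding="utf-8") as filino:
    docs = [line.strip() for line in filino]

training_bert = bert_embeddings_from_list(docs, "distilbert-base-nli-mean-tokens")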

Add Example

Add an example of preprocessed and unpreprocessed text to the documentation.

TextHandler Continually Overflows RAM with ~100K Lines of Text

  • Contextualized Topic Models version: contextualized-topic-models 1.3.1
  • Python version: Python 3.6.10
  • Operating System: Ubuntu 18.04

Description


I have a .txt file with ~100,000 lines of text, all of which are between 30 and 128 words long. When I try to initialize TextHandler with this dataset and prepare the resulting handler object, the TextHandler pipeline quickly overflows my 128 GB of RAM.

I'm not sure if I'm running something incorrectly or if this method can only work with very small datasets. Hopefully it's not the latter, given that standard topic modeling and other BERT-based sentence encoding methods work fine on my server with far more data that are even longer (e.g., 512 words maximum), and with much less RAM.

What I Did

Here is how I initialized the TextHandler and prepared it:

handler = TextHandler("/media/seagate0/amazon/data/lang_pol_data_cult_bks_sampled_text.txt")
handler.prepare()

Halfway through the preparation, Python threw this warning, after which it overflowed the RAM and crashed:

/home/amruch/anaconda3/envs/nlp_polarization/lib/python3.6/site-packages/contextualized_topic_models/utils/data_preparation.py:62: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  self.vocab_dict[x], y.split()))), data)))

Many thanks for any suggestions! Very much looking forward to applying this method!

Saving the zero shot model

  • Contextualized Topic Models version:1.6.0
  • Python version:
  • Operating System:

Can I use the pickle module in Python to save the ZeroShot model?
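Pickling the wrapper object generally works for PyTorch-backed models, though the package's own .save()/.load() pair (see the "Loading pretrained models" issue above) is the supported route. A sketch:

import pickle

with open("zeroshot_ctm.pkl", "wb") as f:
    pickle.dump(ctm, f)

with open("zeroshot_ctm.pkl", "rb") as f:
    ctm = pickle.load(f)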

some little error

  • Contextualized Topic Models version:
  • Python version:3.6
  • Operating System: win10

Description

When I run the "Contextualized Topic Modeling with your Documents (with preprocessing)" notebook, I get an error on line 29 (screenshot attached in the original issue).

What I Did

(screenshot of the code and error attached in the original issue)

thank you very much

Numpy module errors when importing package.

Hi,

Trying to run this project when installed from PyPi, throws various errors based on the numpy installation / version:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-10-295696d3f617> in <module>
      3 
      4 from contextualized_topic_models.models.ctm import CTM
----> 5 from contextualized_topic_models.utils.data_preparation import TextHandler
      6 from contextualized_topic_models.datasets.dataset import CTMDataset

/opt/conda/lib/python3.7/site-packages/contextualized_topic_models/utils/data_preparation.py in <module>
      1 import numpy as np
----> 2 from sentence_transformers import SentenceTransformer
      3 import scipy.sparse
      4 
      5 

/opt/conda/lib/python3.7/site-packages/sentence_transformers/__init__.py in <module>
      1 __version__ = "0.3.5.1"
      2 __DOWNLOAD_SERVER__ = 'https://sbert.net/models/'
----> 3 from .datasets import SentencesDataset, SentenceLabelDataset, ParallelSentencesDataset
      4 from .LoggingHandler import LoggingHandler
      5 from .SentenceTransformer import SentenceTransformer

/opt/conda/lib/python3.7/site-packages/sentence_transformers/datasets/__init__.py in <module>
----> 1 from .sampler import *
      2 from .ParallelSentencesDataset import ParallelSentencesDataset
      3 from .SentenceLabelDataset import SentenceLabelDataset
      4 from .SentencesDataset import SentencesDataset
      5 from .EncodeDataset import EncodeDataset

/opt/conda/lib/python3.7/site-packages/sentence_transformers/datasets/sampler/__init__.py in <module>
----> 1 from .LabelSampler import *

/opt/conda/lib/python3.7/site-packages/sentence_transformers/datasets/sampler/LabelSampler.py in <module>
      4 from torch.utils.data import Sampler
      5 import numpy as np
----> 6 from ...datasets import SentenceLabelDataset
      7 
      8 

/opt/conda/lib/python3.7/site-packages/sentence_transformers/datasets/SentenceLabelDataset.py in <module>
      6 import numpy as np
      7 from tqdm import tqdm
----> 8 from .. import SentenceTransformer
      9 from ..readers.InputExample import InputExample
     10 from multiprocessing import Pool, cpu_count

/opt/conda/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py in <module>
     20 
     21 from . import __DOWNLOAD_SERVER__
---> 22 from .evaluation import SentenceEvaluator
     23 from .util import import_from_string, batch_to_device, http_get
     24 from .datasets.EncodeDataset import EncodeDataset

/opt/conda/lib/python3.7/site-packages/sentence_transformers/evaluation/__init__.py in <module>
      2 from .SimilarityFunction import SimilarityFunction
      3 
----> 4 from .BinaryClassificationEvaluator import BinaryClassificationEvaluator
      5 from .EmbeddingSimilarityEvaluator import EmbeddingSimilarityEvaluator
      6 

/opt/conda/lib/python3.7/site-packages/sentence_transformers/evaluation/BinaryClassificationEvaluator.py in <module>
      7 import os
      8 import csv
----> 9 from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
     10 import numpy as np
     11 from typing import List

/opt/conda/lib/python3.7/site-packages/sklearn/__init__.py in <module>
     78     from . import _distributor_init  # noqa: F401
     79     from . import __check_build  # noqa: F401
---> 80     from .base import clone
     81     from .utils._show_versions import show_versions
     82 

/opt/conda/lib/python3.7/site-packages/sklearn/base.py in <module>
     19 from . import __version__
     20 from ._config import get_config
---> 21 from .utils import _IS_32BIT
     22 from .utils.validation import check_X_y
     23 from .utils.validation import check_array

/opt/conda/lib/python3.7/site-packages/sklearn/utils/__init__.py in <module>
     21 
     22 from .murmurhash import murmurhash3_32
---> 23 from .class_weight import compute_class_weight, compute_sample_weight
     24 from . import _joblib
     25 from ..exceptions import DataConversionWarning

/opt/conda/lib/python3.7/site-packages/sklearn/utils/class_weight.py in <module>
      5 import numpy as np
      6 
----> 7 from .validation import _deprecate_positional_args
      8 
      9 

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in <module>
     23 from contextlib import suppress
     24 
---> 25 from .fixes import _object_dtype_isnan, parse_version
     26 from .. import get_config as _get_config
     27 from ..exceptions import NonBLASDotWarning, PositiveSpectrumWarning

/opt/conda/lib/python3.7/site-packages/sklearn/utils/fixes.py in <module>
     16 import scipy.sparse as sp
     17 import scipy
---> 18 import scipy.stats
     19 from scipy.sparse.linalg import lsqr as sparse_lsqr  # noqa
     20 from numpy.ma import MaskedArray as _MaskedArray  # TODO: remove in 0.25

/opt/conda/lib/python3.7/site-packages/scipy/stats/__init__.py in <module>
    386 
    387 """
--> 388 from .stats import *
    389 from .distributions import *
    390 from .morestats import *

/opt/conda/lib/python3.7/site-packages/scipy/stats/stats.py in <module>
    178 import scipy.special as special
    179 from scipy import linalg
--> 180 from . import distributions
    181 from . import mstats_basic
    182 from ._stats_mstats_common import (_find_repeats, linregress, theilslopes,

/opt/conda/lib/python3.7/site-packages/scipy/stats/distributions.py in <module>
      6 #       instead of `git blame -Lxxx,+x`.
      7 #
----> 8 from ._distn_infrastructure import (entropy, rv_discrete, rv_continuous,
      9                                     rv_frozen)
     10 

/opt/conda/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py in <module>
     21 
     22 # for root finding for continuous distribution ppf, and max likelihood estimation
---> 23 from scipy import optimize
     24 
     25 # for functions of continuous distributions (e.g. moments, entropy, cdf)

/opt/conda/lib/python3.7/site-packages/scipy/optimize/__init__.py in <module>
    386 
    387 from .optimize import *
--> 388 from ._minimize import *
    389 from ._root import *
    390 from ._root_scalar import *

/opt/conda/lib/python3.7/site-packages/scipy/optimize/_minimize.py in <module>
     25 from ._trustregion_krylov import _minimize_trust_krylov
     26 from ._trustregion_exact import _minimize_trustregion_exact
---> 27 from ._trustregion_constr import _minimize_trustregion_constr
     28 
     29 # constrained minimization

/opt/conda/lib/python3.7/site-packages/scipy/optimize/_trustregion_constr/__init__.py in <module>
      2 
      3 
----> 4 from .minimize_trustregion_constr import _minimize_trustregion_constr
      5 
      6 __all__ = ['_minimize_trustregion_constr']

/opt/conda/lib/python3.7/site-packages/scipy/optimize/_trustregion_constr/minimize_trustregion_constr.py in <module>
      3 from scipy.sparse.linalg import LinearOperator
      4 from .._differentiable_functions import VectorFunction
----> 5 from .._constraints import (
      6     NonlinearConstraint, LinearConstraint, PreparedConstraint, strict_bounds)
      7 from .._hessian_update_strategy import BFGS

/opt/conda/lib/python3.7/site-packages/scipy/optimize/_constraints.py in <module>
      6 from .optimize import OptimizeWarning
      7 from warnings import warn
----> 8 from numpy.testing import suppress_warnings
      9 from scipy.sparse import issparse
     10 

ModuleNotFoundError: No module named 'numpy.testing'
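"No module named 'numpy.testing'" usually indicates a broken or half-upgraded numpy rather than a problem in this package; forcing a clean reinstall typically resolves it:

pip install --upgrade --force-reinstall numpy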

Import Error: Cannot import ZeroShotTM error

  • Contextualized Topic Models version: 1.5.3
  • Python version: 3.6
  • Operating System: Linux

Description

Hello, I got an error "ImportError: cannot import name 'ZeroShotTM'" when running the command:

from contextualized_topic_models.models.ctm import ZeroShotTM

My packages version:

transformers: 3.1.0

sentence-transformer: 0.3.6

torch: 1.6.0
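ZeroShotTM does not appear to exist in release 1.5.3 (the same symptom is described in the "pip install and runtime errors" issue above, where the installed ctm.py lacked the class); upgrading the package should make the import work:

pip install -U contextualized_topic_models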

ZeroShot Model: Predict topics of the documents in unseen languages

  • Contextualized Topic Models version: 1.7.1
  • Python version: 3.7.3
  • Operating System: Windows

I trained a ZeroShot model in one language and now I am trying to predict topics in another language using get_doc_topic_distribution(n_samples=10). However, it takes much longer than training did. Both datasets are about the same size, namely 800,000 texts. It has now been running for almost 24 h.
Could you tell me, from your experience, whether this is normal? Or should I set a lower n_samples number?

Thank you!

Using ELMO instead of BERT

Hi,
Thank you for your great and well-explained work. Do you have an idea of how I can use ELMo instead of BERT?

Code :

handler = TextHandler(sentences=preprocessed_documents)
handler.prepare() # create vocabulary and training data
docELMO = [i.split() for i in unpreprocessed_documents]

from elmoformanylangs import Embedder
e = Embedder('/home/nassera/136',batch_size = 64)

training_elmo = e.sents2elmo(docELMO, output_layer=0)

print("training ELMO : ", training_elmo[0])
training_dataset = CTMDataset(handler.bow, training_elmo, handler.idx2token)

ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=768, n_components=50)

ctm.fit(training_dataset) # run the model
print('topics : ',ctm.get_topics())

When i run this code i get this error :

2021-01-16 22:12:51,392 INFO: char embedding size: 3773
2021-01-16 22:12:52,371 INFO: word embedding size: 221272
2021-01-16 22:12:58,469 INFO: Model(...) (full ELMo module printout omitted)
2021-01-16 22:13:11,365 INFO: 2 batches, avg len: 20.9
training ELMO :  [[ 0.06318592 -0.04212857 -0.40941882 ... -0.393932    0.65597    -0.19988859] ... ]
Settings:
N Components: 50
Topic Prior Mean: 0.0
Topic Prior Variance: 0.98
Model Type: prodLDA
Hidden Sizes: (100, 100)
Activation: softplus
Dropout: 0.2
Learn Priors: True
Learning Rate: 0.002
Momentum: 0.99
Reduce On Plateau: False
Save Dir: None
Traceback (most recent call last):
  File "/home/nassera/PycharmProjects/MyProject/TM_FB/Test_CTM_ELMO.py", line 76, in <module>
    ctm.fit(training_dataset) # run the model
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/contextualized_topic_models/models/ctm.py", line 227, in fit
    sp, train_loss = self._train_epoch(train_loader)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/contextualized_topic_models/models/ctm.py", line 154, in _train_epoch
    for batch_samples in loader:
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [30, 1024] at entry 0 and [21, 1024] at entry 1

ELMo gives me a list of numpy arrays, but the code crashes in
ctm.fit(training_dataset)  # run the model
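A likely cause and one possible fix: the final RuntimeError shows tensors of shape [30, 1024] and [21, 1024], i.e. elmoformanylangs returns one (num_tokens, 1024) matrix per document, while CTMDataset expects a single fixed-size vector per document. Mean-pooling over tokens, and matching bert_input_size to ELMo's 1024 dimensions instead of BERT's 768, is a minimal sketch of a workaround (untested, not an official ELMo integration):

import numpy as np

# pool the variable-length token embeddings into one 1024-d vector per document
token_embeddings = e.sents2elmo(docELMO, output_layer=0)
doc_embeddings = np.array([emb.mean(axis=0) for emb in token_embeddings])

training_dataset = CTMDataset(handler.bow, doc_embeddings, handler.idx2token)

# the contextual input size must match ELMo's 1024 dimensions, not BERT's 768
ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=1024, n_components=50)
ctm.fit(training_dataset)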

How to use with pandas dataframe?

I have a pandas data frame in which one column contains tweets. How do I process that column with this library and write the output to another column of the data frame?
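One possible workflow, sketched below with the v1-style API used elsewhere in these issues; the DataFrame name df, the column names, and the SBERT model name are illustrative assumptions, not part of the original question:

import numpy as np
from contextualized_topic_models.utils.data_preparation import TextHandler, bert_embeddings_from_list
from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.models.ctm import CombinedTM

tweets = df["tweet"].astype(str).tolist()

handler = TextHandler(sentences=tweets)
handler.prepare()  # build the vocabulary and bag-of-words

# contextual embeddings for each tweet (SBERT model name is an example)
embeddings = bert_embeddings_from_list(tweets, "bert-base-nli-mean-tokens")

dataset = CTMDataset(handler.bow, embeddings, handler.idx2token)
ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=768, n_components=20)
ctm.fit(dataset)

# write each tweet's dominant topic back into the DataFrame
thetas = np.array(ctm.get_thetas(dataset))  # (n_docs, n_topics)
df["topic"] = thetas.argmax(axis=1)

In practice the tweets would also need the usual preprocessing on the bag-of-words side, but the shape of the pipeline stays the same.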

use case / classify feeds and articles

Hi,

Hope you are all well!

I work on an RSS news feed aggregator, and I'd like to classify feeds and articles with a model.
url: https://github.com/ncarlier/feedpushr

Being quite new to NLP-related fields, I was wondering whether it is possible to train a model with your package in order to get contextualized topics for articles or feeds.

Thanks in advance for your input and insights on this question.

Cheers,
X

What is the source of randomness in get_thetas?

I am experimenting with your code and trying to understand the source of randomness in get_thetas: why do multiple samples need to be averaged?

I have the model set in evaluation mode and have fixed my seeds, yet if I change the seeds I get different topic predictions. Sorry if this is a rudimentary question, but I can't figure out the source of this randomness.

Thank you,

Martin
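For context, a plausible reading of this behavior: in this family of models the document-topic vector theta is not a deterministic function of the input; it is sampled from the learned Gaussian variational posterior via the reparameterization trick, so each call draws fresh noise even in eval mode, and averaging several samples reduces the variance of the estimate. A self-contained toy illustration of that effect (all numbers here are made up):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, -0.1, 0.5])      # posterior mean for one document
sigma = np.array([0.3, 0.3, 0.3])    # posterior standard deviation

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# each draw is one Monte Carlo sample of theta -> different on every call
samples = [softmax(mu + sigma * rng.standard_normal(3)) for _ in range(100)]
theta_estimate = np.mean(samples, axis=0)  # averaging stabilizes the estimate
print(theta_estimate)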
