mim-solutions / bert_for_longer_texts

BERT classification model for processing texts longer than 512 tokens. Text is first divided into smaller chunks and after feeding them to BERT, intermediate results are pooled. The implementation allows fine-tuning.

License: Other

Python 60.32% Shell 0.16% Jupyter Notebook 36.76% Makefile 1.22% Batchfile 1.54%

bert deep-learning machine-learning natural-language-processing nlp pytorch roberta text-classification transfer-learning transformers

bert_for_longer_texts's Issues

QnA system using BERT

I'm trying to build a Q&A system with BERT that takes a PDF document as input. Because of the 512-token limit, BERT cannot process texts as long as a PDF.
I want to know which part of this repository I need to use so that BERT works well on PDF-length text.

MaskedLM for longer texts

Is it possible to apply the same logic to RoBERTa for the MaskedLM model? I need it for pretraining on a custom dataset that has long texts. Thanks.

RuntimeError with the following message: "mat1 and mat2 shapes cannot be multiplied (2x512 and 768x1)"

I'm encountering a RuntimeError with the following message: "mat1 and mat2 shapes cannot be multiplied (2x512 and 768x1)" when testing the fit and predict methods for a model with pooling using a pretrained model. Has anyone encountered this issue before, and if so, do you have any suggestions on how to resolve it?
Full error log:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[22], line 1
----> 1 model.fit(X_train, y_train, epochs=1)

File /opt/conda/lib/python3.10/site-packages/belt_nlp/bert.py:80, in BertClassifier.fit(self, x_train, y_train, epochs)
     76 dataloader = DataLoader(
     77     dataset, sampler=RandomSampler(dataset), batch_size=self.batch_size, collate_fn=self.collate_fn
     78 )
     79 for epoch in range(epochs):
---> 80     self._train_single_epoch(dataloader, optimizer)

File /opt/conda/lib/python3.10/site-packages/belt_nlp/bert.py:126, in BertClassifier._train_single_epoch(self, dataloader, optimizer)
    123 for step, batch in enumerate(dataloader):
    125     labels = batch[-1].float().cpu()
--> 126     predictions = self._evaluate_single_batch(batch)
    127     loss = cross_entropy(predictions, labels) / self.accumulation_steps
    128     loss.backward()

File /opt/conda/lib/python3.10/site-packages/belt_nlp/bert_with_pooling.py:124, in BertClassifierWithPooling._evaluate_single_batch(self, batch)
    119 attention_mask_combined_tensors = torch.stack(
    120     [torch.tensor(x).to(self.device) for x in attention_mask_combined]
    121 )
    123 # get model predictions for the combined batch
--> 124 preds = self.neural_network(input_ids_combined_tensors, attention_mask_combined_tensors)
    126 preds = preds.flatten().cpu()
    128 # split result preds into chunks

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /opt/conda/lib/python3.10/site-packages/belt_nlp/bert.py:180, in BertClassifierNN.forward(self, input_ids, attention_mask)
    177 x = x[0][:, 0, :]  # take <s> token (equiv. to [CLS])
    179 # classification head
--> 180 x = self.linear(x)
    181 x = self.sigmoid(x)
    182 return x

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x512 and 768x1)

I appreciate any insights or assistance provided!
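
Reading the message as a plain matrix product may help: mat1 is the batch passed to the linear layer (2 rows of 512 features), while mat2 is the layer's weight (768 input features, 1 output). The inner dimensions differ, which would happen if the pretrained encoder has a hidden size of 512 while the classification head was built for 768. That cause is an assumption, since the issue does not name the pretrained model. The sketch below only illustrates the shape rule in plain Python; `matmul_output_shape` is a hypothetical helper, not part of belt_nlp or PyTorch.

```python
def matmul_output_shape(mat1_shape, mat2_shape):
    """Return the shape of mat1 @ mat2, or raise if the inner sizes differ.

    Hypothetical helper for illustration only.
    """
    rows, inner_left = mat1_shape
    inner_right, cols = mat2_shape
    if inner_left != inner_right:
        raise ValueError(
            f"shapes cannot be multiplied ({rows}x{inner_left} and {inner_right}x{cols})"
        )
    return (rows, cols)


# A head built for 768 features works with a 768-wide encoder output.
ok_shape = matmul_output_shape((2, 768), (768, 1))

# The reported case: a 512-wide encoder output fed to the same head.
try:
    matmul_output_shape((2, 512), (768, 1))
    mismatch_message = None
except ValueError as error:
    mismatch_message = str(error)
```

If the hidden size really is the cause, sizing the head from the loaded model's configuration rather than a fixed 768 would be one way to avoid the mismatch.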

Any example colab scripts to fine tune BERT variations for text multi-class classification tasks?

My texts are longer than 512 tokens and I need to train a BERT model (ALBERT base v2) for a multiclass text classification task. I can't find any example Colab scripts in the repo. Could you please provide some links or articles?

Loss function and optimizer as parameters

The best option would be to have the loss function as a parameter with the default value.
Similarly, there should be a possibility to choose the optimizer and its parameters.

Originally posted by @mwachnicki in #29 (comment)

text length warning

"Token indices sequence length is longer than the specified maximum sequence length for this model (2268 > 512). Running this sequence through the model will result in indexing errors"
Can I ignore the warning above? Why does it appear?

Running the fit function doesn't give me any verbose output

Describe what the package does in the README

The purpose of the package should be explained at the top.

Outputting Attentions

Is it possible to output the attentions of each chunk using output_attentions=True?

Obtain embedding vectors

Hello and thank you for sharing your work!
I would like to know if there is a way to obtain embedding vectors of one (or more) sentences fed into the model.
Hope you could help me.
Thank you in any case

Split Sizes Throws Error

@MichalBrzozowski91, thanks for this project! Really great stuff.

I get the following error in main.py on a four GPU setup when attempting to fine-tune a BERT model:

With batch size 1:
RuntimeError: split_with_sizes expects split_sizes to sum exactly to 800 (input tensor's size at dimension 0), but got split_sizes=[16]

With batch size 4:
RuntimeError: split_with_sizes expects split_sizes to sum exactly to 2750 (input tensor's size at dimension 0), but got split_sizes=[25, 6, 8, 16]

I would expect number_of_chunks to be of variable size for each record in the batch, but no matter my batch size, I seem to get an error at preds_split = preds.split(number_of_chunks) in main.py.

Any idea what I might be missing?
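
In both reports the flattened prediction tensor is exactly 50 times larger than the number of chunks: 800 = 16 × 50 and 2750 = (25 + 6 + 8 + 16) × 50. That pattern would be consistent with the network returning 50 values per chunk instead of one, although the thread does not confirm the cause. A minimal plain-Python sketch of the rule that the split call enforces (`split_by_sizes` is a hypothetical stand-in working on lists, not the tensor method itself):

```python
def split_by_sizes(flat_values, sizes):
    """Split a flat list into consecutive pieces of the given sizes.

    Like the tensor split in the traceback, this requires the sizes
    to sum exactly to the number of values.
    """
    if sum(sizes) != len(flat_values):
        raise ValueError(
            f"sizes sum to {sum(sizes)}, but there are {len(flat_values)} values"
        )
    pieces = []
    start = 0
    for size in sizes:
        pieces.append(flat_values[start:start + size])
        start += size
    return pieces


number_of_chunks = [25, 6, 8, 16]

# One prediction per chunk: the split succeeds.
one_per_chunk = list(range(sum(number_of_chunks)))
pieces = split_by_sizes(one_per_chunk, number_of_chunks)

# Ratio between the reported tensor sizes and the chunk counts.
ratio_batch_1 = 800 // 16
ratio_batch_4 = 2750 // sum(number_of_chunks)
```

Both ratios come out the same, so checking the shape of the predictions before they are flattened may be a useful first debugging step.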

Managing GPU memory for token length more than 4000

Your code helped a lot in understanding the chunking process. When I try to fine-tune with texts of 4000+ tokens, the model fails with an out-of-memory exception. I have tried a batch size of 2, and a larger 48GB GPU as well. It looks like tensors are continuously pushed to the GPU, which exhausts the memory. Is there a way to better manage memory for samples that are represented by 4000+ tokens?
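
One possible mitigation, offered here only as a sketch and not as something the maintainers have confirmed, is to cap how many chunks are sent through the network at once instead of stacking every chunk of every document into one combined batch. The helper below shows the grouping step in plain Python; `iter_minibatches` and the cap of 4 are hypothetical. Note that during fine-tuning the activations kept for the backward pass also use memory, so grouping alone may not be sufficient.

```python
def iter_minibatches(items, max_size):
    """Yield consecutive groups of at most max_size items."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    for start in range(0, len(items), max_size):
        yield items[start:start + max_size]


# Example: 10 chunks (placeholders here) processed at most 4 at a time.
chunk_ids = list(range(10))
groups = list(iter_minibatches(chunk_ids, max_size=4))
```

Each group would then be moved to the GPU, evaluated, and its result moved back to the CPU before the next group is loaded.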

use it for multiclass classification

This works for binary classification. How can it be extended to support multiclass classification?

plz help me

Is there a method for multi-class classification?
Are you currently conducting research on multi-class classification?
I get this warning message when training the model: "Token indices sequence length is longer than the specified maximum sequence length for this model (23716 > 512). Running this sequence through the model will result in indexing errors." Does this warning not affect the model training?

A few general questions

Hello there! Thank you for this nice project ✨ @mwachnicki @MichalBrzozowski91
I'm really enjoying working through the details! I've just got a few general questions I hope you can help me with.

Let's consider Devlin's example and say we have a 3x6 mini-batch as a result of splitting our input sequence into 3 chunks:

the man went to the store
to the store and bought a
and bought a gallon of milk

BELT allows us to process the mini-batch in one go and returns a single pooled probability value as a result.
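
The chunk-then-pool idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not BELT's actual code: `split_into_chunks` and `pool` are hypothetical names, the per-chunk probabilities are made-up numbers standing in for model outputs, and mean/max are assumed as the pooling options.

```python
def split_into_chunks(tokens, chunk_size, stride):
    """Split tokens into windows of chunk_size, moving by stride each step."""
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window reached the end of the text
        start += stride
    return chunks


def pool(probabilities, strategy="mean"):
    """Combine per-chunk probabilities into one value for the whole text."""
    if strategy == "mean":
        return sum(probabilities) / len(probabilities)
    if strategy == "max":
        return max(probabilities)
    raise ValueError(f"unknown pooling strategy: {strategy}")


tokens = "the man went to the store and bought a gallon of milk".split()

# Chunk size 6 with stride 3 reproduces the three overlapping chunks above.
overlapping = [" ".join(c) for c in split_into_chunks(tokens, 6, 3)]

# Made-up per-chunk probabilities, pooled into a single value.
pooled_mean = pool([0.25, 0.5, 0.75], "mean")
pooled_max = pool([0.25, 0.5, 0.75], "max")
```

Setting the stride equal to the chunk size gives non-overlapping chunks, so the same helper covers both splitting schemes.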

Question 1
As far as the attention mechanism goes, am I right to understand that this is applied separately to each chunk? In other words, the tokens in the first chunk do not attend to the tokens in the second and third one, correct?

Question 2
Devlin suggests applying an attention mask to ensure boundary words are not considered twice; in our example, these are "to the store" and "and bought a" in the second and third chunks, respectively. Why don't we simply split the original sentence in a way that the chunks do not overlap? For example:

the man went to the store
and bought a gallon of milk

What is the purpose of keeping these overlapping bits if we have to mask them anyway?

Question 3
If my considerations in Question 1 are correct and attention is applied separately on each individual chunk, wouldn't it be beneficial to not mask the overlapping boundary words? Intuitively, I'd say this increases the context of each chunk, making them more similar to each other "in the eyes of the model".

Thanks again for the great work!

Would it be okay to use the code below instead of bert?

Original code

pretrained_model_name_or_path: Optional[str] = "bert-base-uncased",

new code

pretrained_model_name_or_path: Optional[str] = "monologg/kobert",
