
longformer's Introduction

Longformer

Longformer and LongformerEncoderDecoder (LED) are pretrained transformer models for long documents.

***** New December 1st, 2020: LongformerEncoderDecoder *****

A LongformerEncoderDecoder (LED) model is now available. It supports seq2seq tasks with long input. With gradient checkpointing, fp16, and a 48GB GPU, the input can be up to 16K tokens long. Check the updated paper for the model details and evaluation.

  • Pretrained models: 1) led-base-16384, 2) led-large-16384

  • Requirements: Make sure to use the huggingface/transformers fork specified in requirements.txt. It adds support for gradient checkpointing and allows different maximum sequence lengths for the input and output. You can also run pip install git+https://github.com/allenai/longformer.git

  • Check the script scripts/summarization.py for an example of how to use the model; a rough sketch of loading the checkpoint follows below.
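For orientation, here is a rough sketch of what loading the LED checkpoint could look like. It is not an official snippet: the module path longformer.longformer_encoder_decoder, the class names, and the assumption that the checkpoint directory contains tokenizer files are inferred from the repo layout, so treat scripts/summarization.py as the authoritative example.

from longformer.longformer_encoder_decoder import (
    LongformerEncoderDecoderConfig,
    LongformerEncoderDecoderForConditionalGeneration,
)
from transformers import AutoTokenizer

# assumes the led-base-16384 checkpoint was downloaded to ./led-base-16384/
config = LongformerEncoderDecoderConfig.from_pretrained('led-base-16384/')
model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained(
    'led-base-16384/', config=config)
tokenizer = AutoTokenizer.from_pretrained('led-base-16384/')  # assumes tokenizer files ship with the checkpoint

document = ' '.join(['A very long input document.'] * 1000)
input_ids = tokenizer.encode(document, return_tensors='pt',
                             truncation=True, max_length=4096)  # LED itself supports up to 16K

# standard huggingface seq2seq decoding
summary_ids = model.generate(input_ids, num_beams=4, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))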

***** New July 23rd, 2020: Speed degradation *****

A significant speed degradation in huggingface/transformers was recently discovered and fixed (check this PR for details). To avoid the problem, either use the older release v2.11.0 (which doesn't support gradient checkpointing) or use the master branch. The issue should be fixed in the next huggingface/transformers release.

***** New June 29th, 2020: Easier to use Gradient checkpointing *****

Gradient checkpointing has been released with huggingface/transformers release v3.0.0. Gradient checkpointing reduces memory by 5x which makes it possible to process longer sequences on smaller GPUs. To use, try something like the following:

from transformers import LongformerModel
model = LongformerModel.from_pretrained('allenai/longformer-base-4096', gradient_checkpointing=True)

***** New June 2nd, 2020: Integrating with Huggingface + Train your own long model + Gradient checkpointing *****

  1. Longformer is now integrated into the huggingface/transformers release v2.11.0. Now you can do
from transformers import LongformerModel
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

The release also includes LongformerForQA and other LongformerForTaskName classes with automatic setting of global attention (see the example sketch after this list).

  2. We added a notebook to show how to convert an existing pretrained model into its "long" version.

  3. Gradient checkpointing has been merged into HF master (check PR). Gradient checkpointing can reduce memory usage significantly (5x for longformer-base-4096), allowing longer sequences on smaller GPUs.
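As a hedged illustration of the task-specific classes mentioned above (assuming the LongformerTokenizer and LongformerForQuestionAnswering names in transformers; exact argument names vary slightly across versions), something like the following runs end to end:

from transformers import LongformerTokenizer, LongformerForQuestionAnswering

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerForQuestionAnswering.from_pretrained('allenai/longformer-base-4096')

question = "Who wrote the report?"
context = "The report was written by Jane Doe in 2019. " * 100  # long document

encoding = tokenizer.encode_plus(question, context, return_tensors='pt')

# the QA class places global attention on the question tokens automatically
outputs = model(encoding['input_ids'], attention_mask=encoding['attention_mask'])
start_logits, end_logits = outputs[0], outputs[1]  # untrained QA head here, so this only illustrates shapes
print(start_logits.shape, end_logits.shape)        # (1, seq_len) each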

***** New April 27th, 2020: A PyTorch implementation of the sliding window attention *****

We added a PyTorch implementation of the sliding window attention that doesn't require the custom CUDA kernel. It is limited in functionality but more convenient to use for finetuning on downstream tasks.

Advantage: supports CPU, TPU, and fp16, none of which are supported by the custom CUDA kernel.

Limitations: uses 2x more memory (though fp16 offsets that) and doesn't support dilation or autoregressive attention (neither is needed for finetuning).

Therefore, it is suitable for finetuning on downstream tasks but not a good choice for language modeling. The code snippet below and the TriviaQA scripts were updated to use this new implementation.

***** End new information *****

How to use

  1. Download pretrained model
  2. Install environment and code

    conda create --name longformer python=3.7
    conda activate longformer
    conda install cudatoolkit=10.0
    pip install git+https://github.com/allenai/longformer.git
  3. Run the model

    import torch
    from longformer.longformer import Longformer, LongformerConfig
    from longformer.sliding_chunks import pad_to_window_size
    from transformers import RobertaTokenizer
    
    config = LongformerConfig.from_pretrained('longformer-base-4096/') 
    # choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
    # 'n2': for regular n2 attention
    # 'tvm': a custom CUDA kernel implementation of our sliding window attention
    # 'sliding_chunks': a PyTorch implementation of our sliding window attention
    config.attention_mode = 'sliding_chunks'
    
    model = Longformer.from_pretrained('longformer-base-4096/', config=config)
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    tokenizer.model_max_length = model.config.max_position_embeddings
    
    SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
    
    input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
    
    # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
    # model = model.cuda(); input_ids = input_ids.cuda()
    
    # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
    attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                         # classification: the <s> token
                                         # QA: question tokens
    
    # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
    input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)
    
    output = model(input_ids, attention_mask=attention_mask)[0]
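The resulting output holds the final-layer hidden states for the (padded) input. As an illustrative follow-up (not part of the original snippet), a classification-style setup would typically take the <s> representation at position 0:

    # `output` has shape (1, padded_seq_len, 768) for longformer-base-4096
    cls_embedding = output[:, 0, :]  # (1, 768); position 0 is the <s> token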

Model pretraining

This notebook demonstrates our procedure for training Longformer starting from the RoBERTa checkpoint. The same procedure can be followed to get a long-version of other existing pretrained models.
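For readers who want the gist without opening the notebook, the core step is extending RoBERTa's learned position embeddings by copying them before continuing MLM pretraining. The sketch below is a condensed, hedged illustration of that idea; it omits the notebook's replacement of each self-attention layer with LongformerSelfAttention and the subsequent pretraining, and the variable names are illustrative:

from transformers import RobertaForMaskedLM

# condensed sketch: extend roberta-base's 512-position embedding table to 4096
model = RobertaForMaskedLM.from_pretrained('roberta-base')

max_pos = 4096 + 2                    # RoBERTa reserves positions 0 and 1
current = model.roberta.embeddings.position_embeddings.weight.data  # (514, 768)
new_embed = current.new_empty(max_pos, current.size(1))

new_embed[:2] = current[:2]
k, step = 2, current.size(0) - 2      # copy the learned 512 positions repeatedly
while k < max_pos:
    new_embed[k:k + step] = current[2:]
    k += step

model.roberta.embeddings.position_embeddings.weight.data = new_embed
model.config.max_position_embeddings = max_pos
# the notebook additionally replaces each layer's self-attention with
# LongformerSelfAttention and continues masked-language-model pretraining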

TriviaQA

  • Training scripts: scripts/triviaqa.py
  • Pretrained large model: here (replicates leaderboard results)
  • Instructions: scripts/cheatsheet.txt

CUDA kernel

Our custom CUDA kernel is implemented in TVM. For now, the kernel only works on GPUs and Linux. We tested it on Ubuntu, Python 3.7, CUDA10, PyTorch >= 1.2.0. If it doesn't work for your environment, please create a new issue.

Compiling the kernel: We already include the compiled binaries of the CUDA kernel, so most users won't need to compile it, but if you are interested, check scripts/cheatsheet.txt for instructions.

Known issues

Please check the repo issues for a list of known issues that we are planning to address soon. If your issue is not discussed, please create a new one.

Citing

If you use Longformer in your research, please cite Longformer: The Long-Document Transformer.

@article{Beltagy2020Longformer,
  title={Longformer: The Long-Document Transformer},
  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
  journal={arXiv:2004.05150},
  year={2020},
}

Longformer is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

longformer's People

Contributors

ajkl, armancohan, ibeltagy, riklopfer, schmmd, separius


longformer's Issues

TriviaQA LR scheduler code issue

Hi,

For single-GPU training using the TriviaQA script, the learning rate goes to 0 within the first epoch.

Possible reason: for a batch size of 1, the global_step in pytorch_lightning increases with each batch returned by the data_loader, so it doesn't correspond to the number of optimizer steps. The LR scheduler, however, was written in terms of the accumulated-gradient batch size, and thus the learning rate reaches 0 within the first epoch.
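To illustrate the mismatch (the numbers are made up):

# Illustrative numbers only: with gradient accumulation, the LR schedule should
# be sized in optimizer steps, not in data-loader batches.
batches_per_epoch = 10000
accumulate_grad_batches = 32
num_epochs = 5

optimizer_steps = batches_per_epoch // accumulate_grad_batches * num_epochs  # 1560
dataloader_batches = batches_per_epoch * num_epochs                          # 50000

# A warmup/decay schedule sized for `optimizer_steps` but stepped once per
# data-loader batch (i.e. per global_step here) is consumed ~32x too fast,
# so the learning rate hits 0 early in the first epoch.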

Thanks.
Apoorv

Attention mask values

Hi,

In the "How to use" tutorial, the attention mask values are described as follows:

# Attention mask values -- 0: no attention, 1: local attention, 2: global attention

But the comments about the attention mask in the code are:

def forward(self, hidden_states, attention_mask=None, head_mask=None):
    '''
    The `attention_mask` is changed in BertModel.forward from 0, 1, 2 to
    -ve: no attention
      0: local attention
    +ve: global attention
    '''
    if attention_mask is not None:
        attention_mask = attention_mask.squeeze(dim=2).squeeze(dim=1)
        key_padding_mask = attention_mask < 0
        extra_attention_mask = attention_mask > 0
        remove_from_windowed_attention_mask = attention_mask != 0

The implementation seems to use the signs of attention mask values to control the masks, which is inconsistent with the tutorial.
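My current reading of the conversion, as a rough sketch (using the standard BERT-style constants, which may not match this repo exactly):

import torch

# user-facing values from the tutorial: 0 = no attention (padding), 1 = local, 2 = global
attention_mask = torch.tensor([[0, 1, 1, 2]])

# BertModel.forward applies the usual (1 - mask) * -10000 transform, after which
# the self-attention module sees sign-coded values instead of 0/1/2:
extended = (1.0 - attention_mask.float()) * -10000.0

print(extended < 0)  # padding positions  -> tensor([[ True, False, False, False]])
print(extended > 0)  # global positions   -> tensor([[False, False, False,  True]])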

Hence, I would like to ask for some clarifications and check the correct usage of attention masks.

Thanks very much, and congratulations on your impressive results!

Best,
Jyun-Yu

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Hi,
I can't run the main script with huggingface model:

longformer-base-4096/


JSONDecodeError Traceback (most recent call last)
in
4 from transformers import RobertaTokenizer
5
----> 6 config = LongformerConfig.from_pretrained('longformer-base-4096/')
7 # choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
8 # 'n2': for regular n2 attantion

D:\Anaconda3\envs\tf-gpu\lib\site-packages\transformers\configuration_utils.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
161
162 # Load config
--> 163 config = cls.from_json_file(resolved_config_file)
164
165 if hasattr(config, 'pruned_heads'):

D:\Anaconda3\envs\tf-gpu\lib\site-packages\transformers\configuration_utils.py in from_json_file(cls, json_file)
194 with open(json_file, "r", encoding='utf-8') as reader:
195 text = reader.read()
--> 196 return cls.from_dict(json.loads(text))
197
198 def __eq__(self, other):

D:\Anaconda3\envs\tf-gpu\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
346 parse_int is None and parse_float is None and
347 parse_constant is None and object_pairs_hook is None and not kw):
--> 348 return _default_decoder.decode(s)
349 if cls is None:
350 cls = JSONDecoder

D:\Anaconda3\envs\tf-gpu\lib\json\decoder.py in decode(self, s, _w)
335
336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
338 end = _w(s, end).end()
339 if end != len(s):

D:\Anaconda3\envs\tf-gpu\lib\json\decoder.py in raw_decode(self, s, idx)
353 obj, end = self.scan_once(s, idx)
354 except StopIteration as err:
--> 355 raise JSONDecodeError("Expecting value", s, err.value) from None
356 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

What could be the possible issue?

sliding_chunks vs tvm in terms of speed

Hi there,

Thanks for the great work and making code available.

I ran Longformer on the TriviaQA dev set as per the instructions in the cheatsheet (to reproduce the leaderboard numbers with the pretrained model), first in "tvm" mode and then in "sliding_chunks" mode, to compare their speed. I found that sliding_chunks mode is faster than "tvm" + fp32 and on par with "tvm" + fp16.

BTW, I was able to reproduce the TriviaQA metric numbers.

Could you please answer the following questions:

  1. I thought "tvm" was supposed to be faster since it is a custom CUDA kernel. Am I missing something?

    • Follow-up: if they are both equally fast, why would one need the CUDA kernel for window attention at all (assuming one doesn't need dilation or autoregressive attention)?
  2. I thought the "tvm" CUDA kernel didn't support fp16, so how does the code run fine with fp16 in "tvm" mode?

Configuration for RoBERTa

Great paper, and really clean and explainable repo, Thanks!

In the paper, you run the sequence length through the model, collect the output activations and repeat until the context is exhausted.

What is the max_doc_len and doc_stride for the roberta-base on Wikihop and TriviaQA?

Best,
Deming Ye

Question generation

Given a text corpus, would it be possible to generate questions using longformer?

Segmentation fault for 5 (or more) gpus training

When I test the model pretraining demo with 5 or more GPUs in parallel, I hit a segmentation fault. It works properly with 4 or fewer GPUs.

Here is the demo code:

import torch
from longformer.longformer import Longformer, LongformerConfig, LongformerForMaskedLM2
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer
import utils
import numpy as np
from pytorch_optimization import get_optimization
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3,4'
config = LongformerConfig.from_pretrained('./longformer-large-4096/')
config.attention_mode = 'tvm'

longformer = Longformer(config=config)
model = LongformerForMaskedLM2(config, longformer)
utils.torch_init_model(model, 'longformer-large-4096/pytorch_model.bin')
tokenizer = RobertaTokenizer(vocab_file='roberta_large/vocab.json',
                             merges_file='roberta_large/merges.txt')
tokenizer.model_max_length = config.max_position_embeddings

SAMPLE_TEXT = ' '.join(['Hello world'] * 750)  # long input document

input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
print(input_ids.shape)
model.half()
# TVM code doesn't work on CPU.
# Uncomment this if `config.attention_mode = 'tvm'`
model = model.cuda()
optimizer = get_optimization(model=model,
                             float16=True,
                             learning_rate=3e-5,
                             total_steps=10000,
                             schedule='warmup_linear',
                             warmup_rate=0.1,
                             max_grad_norm=1.0,
                             weight_decay_rate=0.01)
model = torch.nn.DataParallel(model)
input_ids = input_ids.cuda()

# Attention mask values -- 0: no attention, 1: local attention, 2: global attention
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)  # initialize to local attention
# attention_mask[:, [1, 1023, ]] = 2  # Set global attention based on the task. For example,
# classification: the <s> token
# QA: question tokens

# padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
input_ids, attention_mask = pad_to_window_size(input_ids, attention_mask, config.attention_window[0],
                                               tokenizer.pad_token_id)
print(input_ids.shape, attention_mask.shape)
masked_positions = np.random.choice(np.arange(0, input_ids.shape[1]), 300, replace=False)
masked_positions = torch.tensor(masked_positions).unsqueeze(0).cuda()
masked_lm_labels = torch.tensor(np.random.randint(0, 50000, masked_positions.shape)).cuda()

for i in range(10000):
    loss = model(input_ids=input_ids.repeat(5, 1),
                 attention_mask=attention_mask.repeat(5, 1),
                 masked_positions=masked_positions.repeat(5, 1),
                 masked_lm_labels=masked_lm_labels.repeat(5, 1))
    if loss.shape[0] > 1:
        loss = loss.mean()
    loss_value = loss.item()
    print('Step:{}/10000, Loss:{}'.format(i, loss_value))
    optimizer.backward(loss)
    optimizer.step()
    model.zero_grad()

Here is the error: [screenshot]

It works on 4 GPUs successfully: [screenshot]

ImportError: cannot import name 'nvcc'

from tvm.contrib import nvcc
ImportError: cannot import name 'nvcc'

I get this when trying to compile the kernel from scratch. Did I miss something in the cmake config? I can import a lot of TVM modules but not nvcc.

My cuda version is: Cuda compilation tools, release 10.0, V10.0.130

tvm doesn't work with cuda version 10.1

Currently, tvm works only with cuda 10.0. Can we have support for cuda 10.1 and 10.2 as well?

OSError: libcudart.so.10.0: cannot open shared object file: No such file or directory

Out of memory issue

I have a list of 2200 long documents. In a for loop I apply Longformer to each document and append output[1] to an output list. With each iteration the memory grows significantly, and after only a few iterations I run out of memory (256 GB). I can't figure out what is consuming all the memory. A snippet of my code:

model = Longformer.from_pretrained(longformer_base_dir, config=config)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.max_len = model.config.max_position_embeddings

output_vec = []
for pg in pages:
    pg = f'{tokenizer.cls_token}{pg}{tokenizer.eos_token}'
    input_ids = torch.tensor(tokenizer.encode(pg)).unsqueeze(0)  # batch of size 1
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)

    input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)

    output = model(input_ids, attention_mask=attention_mask)
    output_vec.append(output[1])

Longformer using BERT

Hi folks! First of all, thank you for sharing this code with the community, and for your hard work and contributions to the NLP area.

Based on convert_model_to_long.ipynb, I'm trying to use the pt-br BERT (from neuralmind) to generate a Portuguese longformer model. After some adjustments to the original code, I'm stuck on an error, caught in the 'pretrain_and_evaluate' method when calling 'trainer.evaluate()':

forward() got an unexpected keyword argument 'labels'

Did you try using any BERT model as a base? Do you know how I can get past this error?

Thanks!

Does LongFormer work with bidirectional context?

Hi,

The paper mentions bidirectional context here: "While such models have been successful in autoregressive language modeling, they are unsuitable for transfer learning approaches with tasks that benefit from bidirectional context." I just wanted to confirm this.

Thanks!

Correct way of using fp16 for training

Hi,
Firstly, thank you for making the code and trained model public!

I have a question regarding the use of fp16 during training:
I'm using the longformer for sequence classification and I've created a pytorch module that adds a RobertaClassificationHead on top of the longformer like so:

from torch.nn import CrossEntropyLoss
from transformers.modeling_roberta import RobertaClassificationHead
from longformer.longformer import Longformer

class LongformerForSequenceClassification(Longformer):
    def __init__(self, pretrained_model, config):
        super(LongformerForSequenceClassification, self).__init__(config)
        self.num_labels = config.num_labels
        
        self.longformer = pretrained_model.half() # For fp16
        self.classifier = RobertaClassificationHead(config)
        
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, input_embeds=None, labels=None,):
        outputs = self.longformer(input_ids, 
                                  attention_mask=attention_mask, 
                                  token_type_ids=token_type_ids, 
                                  position_ids=position_ids,
                                  head_mask=head_mask,)
        sequence_output = outputs[0]
        logits = self.classifier(sequence_output.float())
        
        outputs = (logits,) + outputs[2:]
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

I'm halving the longformer model when I load it (self.longformer = pretrained_model.half() # For fp16). This gets the code to work, but my loss becomes nan. I'm guessing this is because of numerical instabilities due to mixed precision training.

I've tried converting the entire model as well as the inputs to fp16, but all of these attempts fail at one step or another because some parts of the model (embeddings, linear layers in the classifier) don't permit it.

How can I correctly use fp16 in this context? From looking at the test for sliding_chunks, should I just convert these 3 variables to fp16?

q = self.query(hidden_states)
k = self.key(hidden_states)
v = self.value(hidden_states)

I don't have Apex's AMP setup on my machine. I'm just using the functionality that pytorch offers.

Please let me know your thoughts. Thank you for your time!

Has anyone reproduced TriviaQA result with pytorch-lightning checkpoint?

Hi, I'm trying to reproduce the TriviaQA result following the instructions in the cheatsheet.
I used the following instructions from cheatsheet.txt:

// To run our pretrained TriviaQA large model (replicates the leaderboard results),
// first download the pytorch-lightning checkpoint:
// https://ai2-s2-research.s3-us-west-2.amazonaws.com/longformer/triviaqa-longformer-large.tar.gz
// then run:
python -m scripts.triviaqa
--train_dataset squad-wikipedia-train-4096.json \ # loaded but not used
--dev_dataset squad-wikipedia-dev-4096.json
--gpus 0 --num_workers 4
--max_seq_len 4096 --doc_stride -1
--save_prefix triviaqa-longformer-large \ # pretrained pytorch-lighting checkpoint
--model_path path/to/pretrained/longformer-large-4096 \ # loaded but not used
--test # predictions will be saved into predictions.json

//then run the official evaluation scripts
python -m scripts.triviaqa_utils.evaluation_utils
--dataset_file path/to/qa/wikipedia-dev.json
--prediction_file predictions.json

//Output should be:
{'exact_match': 73.07644188665083, 'f1': 77.78523804802242, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}

But I keep getting the result {'exact_match': 0.025021894157387713, 'f1': 4.579085300341775, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}, which is very weird.

I downloaded the dataset, converted both the train and dev sets into SQuAD format with the provided script, and only changed the data and model paths to match my server's setup.

Has anyone reproduced the result f1:77.78 with given pytorch-lightning checkpoint?

help with understanding the intuition behind the "4096" attention window.

Hi,

I understand that, in order to overcome the limitations of BERT, Transformer-XL, etc. with a longer attention window, Longformer pools all the local attentions (512) together into global attention (512 x 8 = 4096). The choice of "4096" as the extended attention window isn't clear to me from the paper.

My understanding is that the limit comes from GPU memory constraints during training and evaluation, because all 4096 (or perhaps just 512) tokens are processed simultaneously. So does that mean this model can work beyond the 4096 attention window as well, say for very long legal documents with 100,000 tokens? In such cases, is it advisable to break the document down into, say, 100,000/512 = 196 (approx.) local blocks?

Also, the output is just a number associated with each long document (regression) or a class label (classification).

Any suggestions for this challenge will be helpful.

Thank you!

Is it possible to finetune the pretrained model on causal language modeling or text summarization?

Hi,
Thanks for providing and presenting this nice work.

As mentioned in your paper, your attention pattern for modeling long sequences can be plugged into any pretrained transformer model.
I wonder whether this repo covers code to finetune a pretrained LM (e.g. gpt-2) or your own released pretrained model on a new dataset for a language modeling task?

If so, is it possible with the PyTorch implementation or the CUDA kernel?
I would appreciate it if you could guide me in this respect.

Text Classifier using longformer

Could you add a short example of using Longformer for long text/review classification?
The current TriviaQA example is good, but more examples would encourage further use of Longformer.

Thanks.
Patrick

ImportError: cannot import name 'nvcc' from 'tvm.contrib' (unknown location)

Hi there, I am trying to run the sample model code (step 3 under "How to use"), but I ran into the following issue.
ImportError: cannot import name 'nvcc' from 'tvm.contrib' (unknown location)

I thought the "Install environment and code" step would cover all the dependencies. Please let me know if I am missing something.

Thanks in advance.

Not able to use the embedding for calculating similarity.

First of all let me thank you for contributing this knowledge to us. It makes a lot of difference for beginners like me. :)
Now the issue: I was trying to use Longformer to calculate the similarity between a query and a list of paragraphs retrieved from my index search. The idea is to re-rank these paragraphs based on the cosine similarity between the embedding of the question and that of each paragraph.

However, once I have calculated the embeddings of both the query and a paragraph using this code:

SAMPLE_TEXT = f'{tokenizer.cls_token}{SAMPLE_TEXT}{tokenizer.eos_token}'
...................................
......................
output = model(input_ids, attention_mask=attention_mask)[0]

I get an embedding of dimension torch.Size([1, 512, 768]), and when I try to calculate the cosine similarity on these embeddings I get this error:

RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

I do see that the error recommends using var.detach().numpy() instead of numpy(). https://stackoverflow.com/questions/55466298/pytorch-cant-call-numpy-on-variable-that-requires-grad-use-var-detach-num

However, I am unsure where I should add this line of code.
I am a beginner, so please pardon me if I have raised an issue unrelated to longformer.

Thanks for help :)

Pre-Training Model and Clarification on QA Dataset

Hi - I'm pretty excited about Longformer and the implications it has for long form NLP!

In the paper, it's outlined that the Pre-Training was conducted in 5 total phases with starting sequence length of 2,048 and ending sequence length of 23,040. For additional LM Pre-Training (English), what would be the best way to continue Pre-Training with additional datasets like C5?

Does the Pre-Training method care about whiteline block text (such as the Shakespeare txt) vs one complete document per line?

In the case of multi-lingual and translation tasks, would it be similar to T5, where you would be able to do translation tasks by fine-tuning, or would it be more effective to have the languages visible during the pre-training process for better predictions downstream? (If so, would that essentially require retraining from Phase 1?)

In the cheatsheet for Finetuning QA, there's two additional parameters which are:

--wikipedia_dir path/to/evidence/wikipedia/
--web_dir path/to/evidence/web/

Would wikipedia_dir be enwiki8 and web_dir be text8?

Last question: since Longformer uses a custom CUDA implementation that is compiled at runtime, would that mean TPU accelerators cannot be used with this implementation?

Thanks!

Prediction/Inferencing Base & Large Models

Thank you for your release of Longformer, which looks spot-on to solve a pressing transformer need. I ran your pretrained TriviaQA large model per cheatsheet.txt and replicated the leaderboard results with the evaluation script. Worked great!


Instructions for Hyperpartisan preparation?

Great paper, and really clean and explainable repo, Thanks!
Any plans to release the Hyperpartisan dataset and benchmark utils?
It could really help future researchers go through your pipeline of cleaning and evaluating the data.

Thanks!

GLIBCXX_3.4.20 not found

When I run the model I get "`GLIBCXX_3.4.20' not found" error! Do you have any ideas?

OSError: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/myuser/anaconda3/envs/longformer/lib/python3.7/site-packages/tvm/libtvm_runtime.so)

triviaqa.py code issue

I encounter the issue below when calling the main function from the triviaqa.py script explicitly (through another script).

Error log:
NameError Traceback (most recent call last)
in
----> 1 triviaqa.main(args)

/notebooks/files/longformer/triviaqa.py in main(args)
658 torch.cuda.manual_seed_all(args.seed)
659
--> 660 model = TriviaQA(args)
661
662 logger = TestTubeLogger(

/notebooks/files/longformer/triviaqa.py in __init__(self, args)
260 self.tokenizer.max_len_single_sentence = self.args.max_seq_len
261 self.tokenizer.max_len_sentences_pair = self.args.max_seq_len
--> 262 self.model = self.load_model()
263 self.num_labels = 2
264 self.qa_outputs = torch.nn.Linear(self.model.config.hidden_size, self.num_labels)

/notebooks/files/longformer/triviaqa.py in load_model(self)
266
267 def load_model(self):
--> 268 model = Longformer.from_pretrained(args.model_path)
269 for layer in model.encoder.layer:
270 layer.attention.self.attention_mode = self.args.attention_mode

NameError: name 'args' is not defined

Changes needed here:
From: model = Longformer.from_pretrained(args.model_path)
To: model = Longformer.from_pretrained(self.args.model_path)

[WIP] Running longformer on TPU using pytorch/xla

We tried running a wrapped longformer model on a Colab TPU and got the following errors:

Tvm binary not found. Compiling ...
Exception in device=TPU:0: cannot import name 'nvcc'
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
fn(gindex, *args)
File "", line 66, in _mp_fn
fitter.fit(train_loader, validation_loader)
File "", line 47, in fit
losses, final_scores = self.train_one_epoch(para_loader.per_device_loader(self.device))
File "", line 120, in train_one_epoch
outputs = self.model(inputs, attention_masks)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 558, in __call__
result = self.forward(*input, **kwargs)
File "", line 26, in forward
seq_x, _ = self.backbone(input_ids=input_ids, attention_mask=attention_masks)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 558, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 790, in forward
....

Any way to work around this error would be appreciated.
Thanks.

How can I pretrain the model on a Chinese corpus?

I want to pretrain a model on a Chinese corpus, but the details are not clear to me: for example, how to make the minimal changes necessary to support Longformer's attention mechanism, and how to plug the attention pattern into a pretrained transformer model.

XLM-R support

Hey,

Congratulations on the impressive results and thank you for open-sourcing the work! 🤗

I have a question: do you also plan to implement Longformer for XLM-R? Cross-lingual NLP with long text would be extremely useful.

Thanks & stay healthy,
Johannes
