Hi, I am trying to reproduce the pre-training experiment with the codebase here an

Issue in using given code for pretraining the model about primer HOT 7 OPEN

jaineshdoshi commented on September 24, 2024

Issue in using given code for pretraining the model

from primer.

Comments (7)

Wendy-Xiao commented on September 24, 2024

Hi there,

Thanks for your interest in our work. Your description is clear, so I fully understand your current situation.

First of all, thanks for pointing out the problem with the validation output of the PretrainDataset. I revised the training and testing code after I pre-trained the model, so there may be some incompatibilities with the pre-training process. Your understanding is correct: I did not compute ROUGE scores in the pre-training phase, so although tgt is required as part of the input, it is not used at all. I will revise the code to support it.

As for the problem when running pre-training: I have not pre-trained the model from the HF version of LED, so I have not faced exactly the same problem. Based on the errors, it looks like an index-out-of-range problem. Could you double-check the input and output lengths of the batch causing this error, as well as the size of the embedding layer of the LED-large model? It might be related to the settings of max_input_len and max_output_len in this function.

Let me know if you still cannot find the problem.
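
A minimal sketch of the check suggested above, written in plain Python so that it runs without the repository. It scans one batch for token ids that fall outside the embedding table and for sequences longer than the configured limits. The function name find_batch_problems and the toy numbers are assumptions for illustration, not part of the PRIMERA code.

```python
from typing import Dict, List


def find_batch_problems(
    input_ids: List[List[int]],
    output_ids: List[List[int]],
    num_embeddings: int,
    max_input_len: int,
    max_output_len: int,
) -> Dict[str, list]:
    """Report out-of-range token ids and over-length sequences in one batch."""
    problems = {"bad_token_ids": [], "too_long_inputs": [], "too_long_outputs": []}
    for row, (src, tgt) in enumerate(zip(input_ids, output_ids)):
        # Valid embedding indices are 0 .. num_embeddings - 1.
        for token_id in src + tgt:
            if not 0 <= token_id < num_embeddings:
                problems["bad_token_ids"].append((row, token_id))
        if len(src) > max_input_len:
            problems["too_long_inputs"].append((row, len(src)))
        if len(tgt) > max_output_len:
            problems["too_long_outputs"].append((row, len(tgt)))
    return problems


# Toy batch: hypothetical ids, with limits shrunk for readability.
report = find_batch_problems(
    input_ids=[[0, 5, 9, 2], [0, 10, 2]],
    output_ids=[[0, 7, 2], [0, 3, 4, 5, 2]],
    num_embeddings=10,
    max_input_len=4,
    max_output_len=4,
)
print(report)
```

With the real model, num_embeddings would be the first dimension of the embedding weight, and the limits would be 4096 and 1024 as discussed in this thread.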

haisonle001 commented on September 24, 2024

I managed to run the pre-training code by changing only the validation_step function. My code is down below. I don't know whether my change is good or bad. @Wendy-Xiao, @jaineshdoshi, can you comment on it? Thank you so much!

@jaineshdoshi Furthermore, the paper says the config was changed, and I'm quite confused about how to change it. Did you change anything in the config file of the led-large model, or did you just initialize the model with the HF config file? Is there any problem if I just use led-large and don't change anything? Thank you so much, I look forward to your reply.

jaineshdoshi commented on September 24, 2024

Hi,
@Wendy-Xiao
Thanks for your prompt reply and appreciate your help here!
I checked the batch max input/output lengths: they were set to 4096/1024 respectively, and the input samples were truncated correctly by the code. I tried different tokenizers (PRIMERA and led-large-16384), but both resulted in the same error. Would you have any other checks that I might be missing to handle this issue?
I double-checked the token embeddings and they don't seem to be the issue; I think it might be a positional embedding indexing issue. Would you know how to sanity check that?

Edit: I ran the code on a CPU to get the exact error due to the embeddings. Please see below for my update:

Maybe I can try the Longformer/LED implementation from GitHub instead of the Hugging Face version, but I recall facing some other errors with that initially too.

@haisonle001
I had also tried that approach to make the code here work, but without luck: I ended up with the same error.
If you don't mind, could you point me to what you did differently from the steps I mentioned above to get the code working for pre-training?

I used the LED-large config exactly as it comes with the Hugging Face model. I believe the authors can help you better on this one.

Cheers!

Edit:

By running the pre-training code on the CPU, I managed to pin the issue down to the token embeddings.
This is the tokenizer that we want to have, with a reported vocab size of 50265 plus the <doc-sep> token, which is not present in the LED tokenizer:

model.tokenizer
PreTrainedTokenizerFast(name_or_path='allenai/PRIMERA', vocab_size=50265, model_max_len=4096, is_fast=True, padding_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False), 'additional_special_tokens': ['<doc-sep>']})

In pre-training mode, the tokenizer was updated with the required token, but I observed that torch.nn.functional.embedding was not receiving an embedding table with the correct dimensions: the weight tensor at its input had weight.shape = torch.Size([50265, 1024]), where weight is documented as:

weight (Tensor): The embedding matrix with number of rows equal to the maximum possible index + 1,
            and number of columns equal to the embedding size

So, in this case it should be (50266,1024) after addition of the <doc-sep> token.
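
The off-by-one can be reproduced without PyTorch. The sketch below assumes, as the tokenizer output above suggests, that the 50265 base tokens occupy ids 0 to 50264 and that <doc-sep> is appended as id 50265. The lookup helper is hypothetical and only mimics the bounds check of an embedding lookup.

```python
NUM_ROWS = 50265    # rows in the embedding matrix reported in this thread
DOC_SEP_ID = 50265  # assumed id of <doc-sep>, appended after the base vocabulary


def lookup(num_rows: int, token_id: int) -> int:
    """Mimic the bounds check of an embedding lookup; return the row index."""
    if not 0 <= token_id < num_rows:
        raise IndexError(f"index {token_id} out of range for {num_rows} rows")
    return token_id


try:
    lookup(NUM_ROWS, DOC_SEP_ID)
    fits_before_resize = True
except IndexError:
    fits_before_resize = False

# One extra row (maximum index + 1 = 50266) makes the same id valid.
fits_after_resize = lookup(NUM_ROWS + 1, DOC_SEP_ID) == DOC_SEP_ID

print(fits_before_resize, fits_after_resize)
```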

(I am still figuring out how to solve this, but all help is appreciated!)

Thanks!

haisonle001 commented on September 24, 2024

@jaineshdoshi I use primer_main.py instead of primer_hf_main.py, and I just changed the code as above. I don't see much difference between the two files except for how they load the model. You could try primer_main.py and do what I did; maybe it works.

Wendy-Xiao commented on September 24, 2024

Oh yeah, that makes sense, @jaineshdoshi! In the original code, I resized the embedding when initializing the model, and I may have removed those lines when I cleaned up the code.

You can try adding these lines to the init function:

# The special token is added after each document in the pre-processing step.
self.tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<doc-sep>"]}
)
self.docsep_token_id = self.tokenizer.additional_special_tokens_ids[0]
self.model.resize_token_embeddings(len(self.tokenizer))
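
For intuition, resizing amounts to keeping the existing rows and appending freshly initialised rows for the new tokens. The sketch below shows that idea on a plain list-of-lists table. resize_table is a hypothetical stand-in and not the Hugging Face implementation, whose initialisation of the new rows may differ.

```python
import random
from typing import List


def resize_table(table: List[List[float]], new_num_rows: int, seed: int = 0) -> List[List[float]]:
    """Return a copy of `table` with `new_num_rows` rows.

    Existing rows are kept unchanged; extra rows get small random values.
    Shrinking simply drops trailing rows.
    """
    rng = random.Random(seed)
    dim = len(table[0])
    resized = [row[:] for row in table[:new_num_rows]]
    while len(resized) < new_num_rows:
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized


# Toy table: 4 "tokens" with 3-dimensional embeddings.
old_table = [[float(i)] * 3 for i in range(4)]
vocab_with_doc_sep = len(old_table) + 1  # one added special token

new_table = resize_table(old_table, vocab_with_doc_sep)
print(len(old_table), "->", len(new_table))
```

In the actual fix, len(self.tokenizer) plays the role of vocab_with_doc_sep, since it counts the added special tokens as well as the base vocabulary.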

jaineshdoshi commented on September 24, 2024

Hi,
@Wendy-Xiao
Thank you for the code updates!
I got it working yesterday by adding the lines you mentioned above, as soon as I realized that the embedding dimensions hadn't been updated for the new token.

I do have a couple of questions on the model configuration.

Looking at the PRIMERA config.json file, I see a few parameters that differ from LED. For example, the attention window parameter in the PRIMERA config file is:

"attention_window": [
  512,
  512,
  512,
  512,
  512,
  512,
  512,
  512,
  512,
  512,
  512,
  512
  ],

while the same parameter in the LED large model config file (that the paper says is the starting point for pretraining PRIMERA) on HF is

"attention_window": [
  1024,
  1024,
  1024,
  1024,
  1024,
  1024,
  1024,
  1024,
  1024,
  1024,
  1024,
  1024
  ],

I am a bit confused about whether this was the LED checkpoint used as the starting point for pre-training PRIMERA.
Can you shed some light on how I can get the correct configuration for the PRIMERA model?

@haisonle001
Thanks for helping out!
I am hoping that pre-training with the HF-based code/models will still reproduce the same experiment. I will update you folks if I find any disparities!

Wendy-Xiao commented on September 24, 2024

Hi @jaineshdoshi

Yes, as indicated in the paper (sec 4.1), we used a different window size (512) and different length limits for the input (4096) and output (1024) for PRIMERA. All the settings can be found in the paper (sec 4.1 and appendix A); where nothing is specified, we used the same setting as LED. You can refer to the PRIMERA config.
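
To see exactly which settings differ from the LED defaults, the two config.json files can be compared key by key. A minimal sketch, assuming both files have already been parsed into dictionaries: only the attention_window values come from this thread, the d_model entry is included purely as an example of an unchanged key, and diff_configs is a hypothetical helper.

```python
from typing import Any, Dict, Tuple


def diff_configs(base: Dict[str, Any], other: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each key whose value differs to a (base_value, other_value) pair."""
    missing = object()  # sentinel for keys present in only one config
    changed = {}
    for key in sorted(set(base) | set(other)):
        left, right = base.get(key, missing), other.get(key, missing)
        if left != right:
            changed[key] = (
                None if left is missing else left,
                None if right is missing else right,
            )
    return changed


# Stand-ins for the two parsed config.json files.
led_large = {"attention_window": [1024] * 12, "d_model": 1024}
primera = {"attention_window": [512] * 12, "d_model": 1024}

differences = diff_configs(led_large, primera)
for key, (led_value, primera_value) in differences.items():
    print(f"{key}: LED={led_value} PRIMERA={primera_value}")
```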
