Comments (18)

shayne-longpre avatar shayne-longpre commented on September 18, 2024 4

@kelvin-inspire To clarify, INPUT_SEQ_LEN = 2056 and TARGET_SEQ_LEN = 512 in the run_example.py script correspond to the input sequence length, and the maximum number of tokens the model can generate, respectively.

There is no chunking or sliding window. So if you have a training example with an input length of 5000 and an output length of 1000, but you are using INPUT_SEQ_LEN = 2056 and TARGET_SEQ_LEN = 512, then the model will only see 2056 tokens of the input and only learn to generate 512 tokens of the output, as both would be truncated.

You could of course just increase these maximum sequence lengths at inference, or train a model with longer sequences as well. Our script allows you to generate longer sequences if you need.
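
For anyone following along, here is roughly what that truncation looks like in practice. This is only a minimal sketch using the Hugging Face tokenizer rather than the repo's seqio pipeline, so the checkpoint name and example text are assumptions for illustration; the length mirrors INPUT_SEQ_LEN.

```python
# Minimal illustration of the truncation described above, using the Hugging
# Face tokenizer rather than the repo's seqio pipeline (checkpoint name and
# text are illustrative assumptions).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

long_input = "summarize: " + "a very long document " * 1000   # far more than 2056 tokens

# Tokens beyond max_length are simply cut off -- no chunking, no sliding window.
encoded = tokenizer(long_input, max_length=2056, truncation=True)
print(len(encoded["input_ids"]))   # -> 2056
```

Targets are handled the same way during training: anything past TARGET_SEQ_LEN is dropped, and at inference TARGET_SEQ_LEN caps how many tokens are generated.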

shayne-longpre avatar shayne-longpre commented on September 18, 2024 2

@jianguoz Our input sequence length for training was 2048, and our output sequence length was 512.

I think these lengths fit almost all training examples in our corpus.

shayne-longpre avatar shayne-longpre commented on September 18, 2024 2

@kelvin-inspire I am not an expert in this, but here is my understanding.

The only reason it is not recommended to give BERT more than 512 tokens is that its positional embeddings were only trained up to 512 (see this post), but like that poster, you can still expand to more tokens if your hardware can handle the quadratic memory requirement with respect to the input sequence length.

The T5 model uses a different kind of "relative" positional embedding (see page 5) that works for sequences of any length. So the only obstacles to feeding T5 or Flan-T5 longer sequences are that they may not fit into your memory, or that the model never saw sequences that long during training, so it may perform poorly on them. But there is no chunking or sliding window; the model simply truncates whatever exceeds the pre-defined input and output lengths.
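
As a rough illustration of that point about relative position embeddings, the sketch below feeds a sequence well beyond 512 tokens into a Hugging Face Flan-T5 checkpoint. The checkpoint name and text are assumptions for the example; the model runs (memory permitting), it just hasn't been trained on many inputs this long, so quality may suffer.

```python
# Rough illustration of why longer inputs "just work" mechanically: T5 uses a
# relative position bias, so there is no hard 512-token positional limit.
# Checkpoint name and text are assumptions; quality on very long inputs may
# still drop, since few such examples were seen in training.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

text = "summarize: " + "the experiment was repeated under new conditions. " * 200
inputs = tokenizer(text, return_tensors="pt")       # no truncation: keep everything
print(inputs["input_ids"].shape[1])                 # well over 512 tokens

outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```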

shayne-longpre avatar shayne-longpre commented on September 18, 2024

@kelvin-inspire Thank you for the question.

You can set the inference time sequence length in flan/v2/run_example.py to whatever you like. Input sequences longer than that are simply truncated.

jianguoz avatar jianguoz commented on September 18, 2024

Hi @shayne-longpre Thank you for the information. You mentioned that input sequences longer than that are simply truncated. Does this mean Flan-T5 2022 was instruction-tuned on training inputs with a maximum limit of 512 tokens? If not, could you share the maximum length used in training? Thanks :)

jianguoz avatar jianguoz commented on September 18, 2024

Hi @shayne-longpre, thanks very much for the clarification! A side question: Flan-T5 2022 uses a learning rate of 0.001 and 100k steps, while other papers, such as OPT-IML, use a much smaller learning rate of 5e-5 and only 10k steps, and FLAN 2021 uses another small learning rate of 3e-5 and 30k steps.

There are big gaps between these learning rates and step counts, and we are not sure whether a large learning rate (0.001) with many training steps (100k) will cause overfitting, or whether the smaller learning rates with fewer steps will cause under-fitting. Do you have any comments or suggestions here? Appreciate it :)

shayne-longpre avatar shayne-longpre commented on September 18, 2024

@jianguoz Yes, I would suggest following the settings for which you are seeing better convergence, rather than following our parameters exactly. There are many quirks of the internal infrastructure, including specific seqio and caching settings and the TPUs, so I'm not sure the exact hyperparameters will transfer well to other training infrastructure.

We found a range of learning rates can work well. It's more important that you make sure you don't under-train. But you can monitor this by validating every thousand steps or so on some held-out tasks.
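
A schematic of that monitoring suggestion, in case it helps. All function names below (train_step, evaluate_held_out, save_checkpoint) are hypothetical placeholders, not part of the flan repo or any specific framework.

```python
def train_step(model, batch):
    """One optimizer update on a single batch (placeholder)."""

def evaluate_held_out(model):
    """Average metric over a set of held-out tasks (placeholder)."""
    return 0.0

def save_checkpoint(model, step):
    """Persist the current weights (placeholder)."""

def finetune(model, train_batches, eval_every=1000, patience=5):
    best, bad_evals = float("-inf"), 0
    for step, batch in enumerate(train_batches, start=1):
        train_step(model, batch)
        if step % eval_every == 0:            # validate every ~1000 steps
            score = evaluate_held_out(model)
            if score > best:                  # still improving: keep training
                best, bad_evals = score, 0
                save_checkpoint(model, step)
            else:                             # plateauing: likely trained enough
                bad_evals += 1
                if bad_evals >= patience:
                    break
```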

jianguoz avatar jianguoz commented on September 18, 2024

@shayne-longpre Thanks for your helpful comments and valuable suggestions!

kelvin-inspire avatar kelvin-inspire commented on September 18, 2024

@kelvin-inspire Thank you for the question.

You can set the inference time sequence length in flan/v2/run_example.py to whatever you like. Input sequences longer than that are simply truncated.

Thanks for the reply.

For instance, if the input_seq_len is set to 2048 and the max_seq_len is 512, how would the model process this sequence? Do these models use 'chunking' or a 'sliding window' to process longer sequences, and if so, how are the outputs aggregated at the end?

kelvin-inspire avatar kelvin-inspire commented on September 18, 2024

@kelvin-inspire To clarify, INPUT_SEQ_LEN = 2056 and TARGET_SEQ_LEN = 512 in the run_example.py script correspond to the input sequence length, and the maximum number of tokens the model can generate, respectively.

There is no chunking or sliding window. So if you have a training example with an input length of 5000 and an output length of 1000, but you are using INPUT_SEQ_LEN = 2056 and TARGET_SEQ_LEN = 512, then the model will only see 2056 tokens of the input and only learn to generate 512 tokens of the output, as both would be truncated.

You could of course just increase these maximum sequence lengths at inference, or train a model with longer sequences as well. Our script allows you to generate longer sequences if you need.

I have another stupid question to ask; I think I need more clarity on the terminology. When you say INPUT_SEQ_LEN = 2056, does this mean that the model can only accept 2056 tokens? For example, if I use the flan-large model, its input_seq_len is 512, i.e. the maximum number of tokens the model can accept. Am I interpreting this correctly?

kelvin-inspire avatar kelvin-inspire commented on September 18, 2024

@shayne-longpre If I'm thinking in a wrong way, can you just provide an extra example? That would be appreciated much.

shayne-longpre avatar shayne-longpre commented on September 18, 2024

@kelvin-inspire Yes your thinking is correct. The model can accept however many tokens you specify (and can fit into memory). It just might not be so good at super long sequences because it didn't see many training examples that were super long in the Flan Collection.

kelvin-inspire avatar kelvin-inspire commented on September 18, 2024

@kelvin-inspire Yes your thinking is correct. The model can accept however many tokens you specify (and can fit into memory). It just might not be so good at super long sequences because it didn't see many training examples that were super long in the Flan Collection.

That's exactly what I want to understand in depth. How is it even happening behind the scenes? I remember you saying the model uses no sliding window or chunking. If the input window is of size, say, 128 (the model can accept 128 tokens at a time) and my sequence has a token length of, say, 300, what technique does the model use to accept the entire sequence?

We definitely can't change the window size from 128 to anything else, because it's fixed. Say BERT-base can accept 512 tokens and nothing more; to accept more tokens we would need a different model, like BERT-large or something similar that can take in more tokens at once. In the same way, I assumed Flan models would also have a fixed window for inputs, say 512 tokens, so saying I can change INPUT_SEQ_LEN to any number does not make much sense to me.

I was thinking the model uses something like a sliding window, where it divides the input into smaller chunks of the maximum window size (for example, if the window size is 512 and the sentence has 1024 tokens, the input is divided into two chunks of 512 each), then somehow processes them as a batch and aggregates the results at the end.

i-am-neo avatar i-am-neo commented on September 18, 2024

@jianguoz Our input sequence length for training was 2048, and our output sequence length was 512.

I think these lengths fit almost all training examples in our corpus.

Hi @shayne-longpre Does the input sequence length of 2048 hold for the pre-trained models published on huggingface (flan-large, -xl, -xxl)?

shayne-longpre avatar shayne-longpre commented on September 18, 2024

@jianguoz Our input sequence length for training was 2048, and our output sequence length was 512.
I think these lengths fit almost all training examples in our corpus.

Hi @shayne-longpre Does the input sequence length of 2048 hold for the pre-trained models published on huggingface (flan-large, -xl, -xxl)?

Yes, that is correct.

kelvin-inspire avatar kelvin-inspire commented on September 18, 2024

@kelvin-inspire I am not an expert in this, but here is my understanding.

The only reason it is not recommended to give BERT more than 512 tokens is that its positional embeddings were only trained up to 512 (see this post), but like that poster, you can still expand to more tokens if your hardware can handle the quadratic memory requirement with respect to the input sequence length.

The T5 model uses a different kind of "relative" positional embedding (see page 5) that works for sequences of any length. So the only obstacles to feeding T5 or Flan-T5 longer sequences are that they may not fit into your memory, or that the model never saw sequences that long during training, so it may perform poorly on them. But there is no chunking or sliding window; the model simply truncates whatever exceeds the pre-defined input and output lengths.

Thank you for reading my questions and answering them so promptly.

gahdritz avatar gahdritz commented on September 18, 2024

@shayne-longpre were the Flan Collection models trained with gradient accumulation? Without it, using the provided hyperparameters, I'm running out of memory. Running the small model with a batch size of 64 (as reported in Chung et al.) and an input sequence length of 2048, each set of attention logits (of shape [64, 6, 2048, 2048]) requires 6 GB of memory during preallocation. Overall, you'd need 116 GB. Am I missing something?
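
For reference, a quick back-of-the-envelope check of the 6 GB figure quoted here, assuming fp32 (4-byte) values and the tensor shape given above:

```python
# Back-of-the-envelope check of the 6 GB figure, assuming the attention logits
# are stored in fp32 (4 bytes per value) with shape [batch, heads, seq, seq].
batch, heads, seq_len, bytes_per_val = 64, 6, 2048, 4

one_tensor_gb = batch * heads * seq_len * seq_len * bytes_per_val / 1024**3
print(f"{one_tensor_gb:.1f} GB per [64, 6, 2048, 2048] logit tensor")  # -> 6.0 GB

# Multiplied across the encoder/decoder self- and cross-attention layers, the
# activation memory quickly exceeds a single accelerator, which is why gradient
# accumulation (or a shorter sequence length) becomes relevant.
```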

shayne-longpre avatar shayne-longpre commented on September 18, 2024

@gahdritz Training ran on internal TPU infrastructure, which may well have had enough memory to handle this, though I can't confirm for sure after my internship. From what I remember, 2048 was a very generous sequence length and finetuning was pretty robust to an array of hyperparameters.

I would expect to see similar results with a smaller sequence length or slightly different hyperparameters. I would not decrease the batch size too much more, so gradient accumulation would also be an excellent option.
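
As a concrete (and purely hypothetical) illustration of the gradient-accumulation suggestion, here is a minimal PyTorch/Hugging Face-style sketch rather than the internal T5X/TPU setup; `model` is assumed to be a seq2seq model that returns a `.loss`, and the dataloader yielding `(input_ids, labels)` pairs is a placeholder.

```python
# Minimal gradient-accumulation sketch (PyTorch/Hugging Face style, not the
# internal T5X/TPU setup): 8 micro-batches of 8 examples approximate one
# effective batch of 64 without holding the full batch in memory at once.
def train_epoch(model, optimizer, dataloader, accum_steps=8):
    optimizer.zero_grad()
    for i, (input_ids, labels) in enumerate(dataloader):
        # `model` is assumed to return an object with a `.loss` attribute
        # (e.g. T5ForConditionalGeneration).
        loss = model(input_ids=input_ids, labels=labels).loss
        (loss / accum_steps).backward()    # scale so accumulated grads average
        if (i + 1) % accum_steps == 0:
            optimizer.step()               # one update per accumulated batch
            optimizer.zero_grad()
```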
