
Comments (12)

lovodkin93 commented on July 17, 2024

@demelin
Great, now I understand.
Thank you very much!


rsennrich commented on July 17, 2024

Hi @lovodkin93,

I think there are two issues to keep in mind:

  • both functions you point to are active at training time. During training, all time steps are processed in parallel (the fact that this is possible with Transformers is a big advantage over RNNs, where you need to process time steps sequentially).
  • even during inference, the Transformer needs access to all decoder states from previous time steps for self-attention (unlike RNNs, where access to the hidden state(s) from the last time step is enough). You can find the inference code here. Yes, this function has a variable `current_time_step` that indicates which word is currently being processed.
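
As a rough illustration (a minimal NumPy sketch, not the actual Nematus code), this is how a triangular mask lets decoder self-attention be computed for all time steps in one batched operation during training:

```python
import numpy as np

def causal_self_attention_weights(queries, keys):
    """queries, keys: [time_steps, num_features] for a single sentence and head."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])   # [time_steps, time_steps]
    future = np.triu(np.ones_like(scores), k=1)           # 1s strictly above the diagonal
    scores = np.where(future == 1, -1e9, scores)          # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax

states = np.random.randn(5, 8)                 # 5 time steps handled at once
attn = causal_self_attention_weights(states, states)
print(np.triu(attn, k=1).max())                # ~0: no weight on future positions
```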

best,
Rico


lovodkin93 commented on July 17, 2024

amazing, thank you!
I do have a follow-up question: I searched for the current_time_step variable, but it appears only in transformer_inference.py. I actually need it in the training process rather than the inference process, so I was wondering if there is something similar in transformer.py. Namely, is there a way to know which word is being processed by the decoder at a given training step (of course, everything is done simultaneously, but I mean for a given thread at a given time step)?


lovodkin93 commented on July 17, 2024

In fact, I will explain my purpose, since the current time step might turn out to be redundant:
I am trying to propagate the attn_weights of the decoder layers' self-attention from transformer_attention_modules.py line 168 all the way to transformer.py line 75 (by adding the appropriate outputs to all the relevant functions along the way). My goal is to incorporate a certain manipulation of the attn_weights into the loss function. The thing is, I don't want redundancy in the loss, so for each word in the target sentence I want to use just the row of attn_weights relevant to it (namely, for the word generated at time step i, just the row holding that word's attention weights over the first i-1 words). So my question is: is the triangular attn_weights matrix computed once per sentence, or as many times as there are words in the sentence? (A small sketch of the row selection I have in mind follows the example below.)
For example, for the generated target sentence "I eat pizza", will attn_weights be generated once (namely, each word's row is computed once, and since everything is done simultaneously the whole matrix is generated once):
1,0,0
x, 1-x,0
x, y, 1-x-y

or rather would it be generated thrice:
1,0,0
nonsense*3
nonsense*3

1,0,0
x, 1-x,0
nonsense*3

1,0,0
x, 1-x,0
x, y, 1-x-y
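
To make the row selection I have in mind concrete, here is a hypothetical NumPy sketch (not Nematus code; simplified to one sentence and one head):

```python
import numpy as np

# a lower-triangular attn_weights matrix for a 3-word target sentence,
# one row per generated word, rows normalised to sum to 1
attn_weights = np.tril(np.random.rand(3, 3))
attn_weights /= attn_weights.sum(axis=-1, keepdims=True)

# if the matrix is computed once per sentence, then taking row i for word i
# already gives exactly one row per word, with no redundancy
for i in range(attn_weights.shape[0]):
    print(i, attn_weights[i])
```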

Thanks!


rsennrich commented on July 17, 2024

attn_weights will only be generated once per attention head and timestep. If you want to propagate this to the loss, you'll need to pack this information properly (a base Transformer has 48 attention heads). You can look at the layer_memories for inspiration; they're currently only used for caching at inference time. @demelin is the original author of that code and might help with more detailed questions.
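
For example, one conceivable way to pack the per-layer, per-head weights so they can reach the loss is to stack them into a single tensor (a hedged NumPy sketch with made-up names, not the Nematus API):

```python
import numpy as np

num_layers, batch_size, num_heads, time_steps = 6, 2, 8, 16

# one [batch, heads, time, time] tensor collected from each decoder layer
per_layer_weights = [np.random.rand(batch_size, num_heads, time_steps, time_steps)
                     for _ in range(num_layers)]

packed = np.stack(per_layer_weights, axis=1)   # [batch, layers, heads, time, time]
print(packed.shape)                            # (2, 6, 8, 16, 16); 6 x 8 = 48 heads
```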


lovodkin93 commented on July 17, 2024

@rsennrich So let me make sure I understand clearly:
given the sentence "I eat pizza", all the words of the sentence ("I", "eat", "pizza") will be processed in the same timestep (as they are processed in parallel), so attn_weights will be computed once for the whole sentence for this timestep, right?


lovodkin93 commented on July 17, 2024

@rsennrich In other words: are all the words of a sentence processed in the same single timestep, or is each processed in a separate timestep?


rsennrich commented on July 17, 2024

I use timestep to refer to the position in the sentence (it could be a word, subword, or character). So I'm saying that attn_weights will only be generated once per attention head and subword, and that all subwords are processed in parallel as a single tensor (for example, the key tensor has the shape [batch_size, time_steps, num_heads, num_features]).
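
To illustrate the shapes (a NumPy sketch, not the Nematus code): contracting query and key tensors of that shape over the feature dimension yields one weight matrix per sentence and head, computed once for all subwords:

```python
import numpy as np

batch_size, time_steps, num_heads, num_features = 2, 16, 8, 64
q = np.random.randn(batch_size, time_steps, num_heads, num_features)
k = np.random.randn(batch_size, time_steps, num_heads, num_features)

# contract over the feature dimension -> [batch_size, num_heads, time_steps, time_steps]
scores = np.einsum('bqhf,bkhf->bhqk', q, k) / np.sqrt(num_features)
print(scores.shape)   # (2, 8, 16, 16): one score matrix per sentence and head
```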


lovodkin93 commented on July 17, 2024

@rsennrich @demelin
From what I know, there are two attention layers in the decoder: the cross-attention layer, which receives as input the output of the encoder block (and whose implementation in the code is quite clear), and the self-attention layer, which receives all the words generated up to the current timestep. So what I don't seem to understand is: how can a word at timestep t be computed without knowledge of the first t-1 words (which are being computed in parallel with it and are not necessarily known at timestep t)?

In fact, I have also printed attn_weights and its size and looked at the output. The size of attn_weights was [2 8 16 16] (namely a batch of 2 sentences, the longer of which has length 16, so in fact there was a 16x16 matrix for each of the sentences and for each of the 8 heads). I also printed the matrix of the first sentence for one of those heads and got the following output (each row contains 16 elements):
[attached image: the 16x16 attention weight matrix for one head of the first sentence]

So, to ask a more practical question: is this whole matrix generated for every timestep, so that for a sentence of length 16, 16 matrices of size 16x16 are generated over 16 timesteps per head (one matrix per timestep)? Or, alternatively, is just one row of the matrix generated per timestep and the rows concatenated at the end (so for a sentence of length 16, 16 rows of size 1x16 are generated per head, one row per timestep, rather than 16 matrices of size 16x16)?


demelin commented on July 17, 2024

Hey there,

During training, all attention weights are computed in parallel for all time-steps (i.e. encoder self-attention, decoder self-attention, and decoder-to-encoder attention). During inference, encoder self-attention is computed in parallel for all time-steps, as we have access to the full source sentence. Decoder self-attention is re-computed for all target-side tokens each time we expand the generated sequence, so as to update the latent representations at positions 0, ..., t-1 with the information contained in the token t. Decoder-to-encoder attention, on the other hand, is only computed between the target token at the current position, i.e. t, and all hidden states in the final layer of the encoder. The result is concatenated with previous time-step-specific attention weights which are cached in layer_memories to avoid redundant computation.

To answer your question regarding the attention matrix in your example: during inference at t == 1, the attention matrix has size 1x1, at t == 2 it has size 2x2, ..., at t == 16 it has size 16x16. During training, it's 16x16 right away and we apply a mask to prevent the model from attending to future positions, as the decoder is given the full (shifted) target sequence as input. This enables the efficient, non-recurrent training process. Hope that helps.
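
A rough NumPy sketch of the inference-time behaviour described above (not the actual Nematus decoding loop; the cache here just stands in for layer_memories):

```python
import numpy as np

def self_attention_matrix(states):
    """states: [t, num_features] for the t tokens generated so far."""
    scores = states @ states.T / np.sqrt(states.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

num_features = 8
cached_states = np.empty((0, num_features))        # stands in for layer_memories
for t in range(1, 5):
    new_state = np.random.randn(1, num_features)   # hidden state of the newly generated token
    cached_states = np.concatenate([cached_states, new_state], axis=0)
    print(t, self_attention_matrix(cached_states).shape)   # (1, 1), (2, 2), (3, 3), (4, 4)
```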

All the best,
Denis


lovodkin93 commented on July 17, 2024

@demelin
ok, I think I understand.
So just to make sure I understand:
For the example I included, what is done in the inference phase in 16 sequential steps, yielding 16 matrices (1x1, 2x2, ...), is equivalent to a single step in the training phase, yielding a single matrix of size 16x16 (and not 16 matrices of size 16x16)?
My main concern is the number of matrices yielded during a single propagation of all the words in a sentence in the training phase.


demelin commented on July 17, 2024

Yes, the training step produces a single attention matrix (or, more accurately, a 4D attention tensor) per mini-batch per attention type, i.e. one for encoder self-attention, one for decoder self-attention, and one for decoder-to-encoder attention.

what is done in the inference phase in 16 sequential steps, yielding 16 matrices (1x1, 2x2, ...), is equivalent to a single step in the training phase, yielding a single matrix of size 16x16 (and not 16 matrices of size 16x16)

Yep :). 16 matrices would be way too many. Also, the training is not sequential (as in RNNs), but parallel (as in FFNs).
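
For concreteness, a small shape sketch of the three attention tensors per mini-batch (illustrative NumPy shapes, not Nematus variable names):

```python
import numpy as np

batch_size, num_heads, src_len, tgt_len = 2, 8, 20, 16
enc_self_attn = np.zeros((batch_size, num_heads, src_len, src_len))  # encoder self-attention
dec_self_attn = np.zeros((batch_size, num_heads, tgt_len, tgt_len))  # decoder self-attention (masked)
dec_enc_attn  = np.zeros((batch_size, num_heads, tgt_len, src_len))  # decoder-to-encoder attention

for name, tensor in [("enc_self", enc_self_attn),
                     ("dec_self", dec_self_attn),
                     ("dec_enc", dec_enc_attn)]:
    print(name, tensor.shape)
```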

