Comments (12)
@demelin
Great, now I understand.
Thank you very much!
from nematus.
Hi @lovodkin93,
I think there are two issues to keep in mind:
- both functions you point to are active at training time. During training, all time steps are processed in parallel (the fact that this is possible with Transformers is a big advantage over RNNs, where you need to process time steps sequentially).
- even during inference, the Transformer needs access to all decoder states from previous time steps for self-attention (unlike RNNs, where access to the hidden state(s) from the last time step is enough). You can find the inference code here; this function has a variable `current_time_step` that indicates which word is currently being processed.
best,
Rico
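The two points above can be illustrated with a toy causal self-attention sketch in plain numpy (a hedged illustration, not Nematus code; all names and sizes are made up): during training, scores for all positions come out of a single matrix multiply, and a mask hides attention to future positions.

```python
import numpy as np

# Toy sketch (not Nematus code): in training, self-attention scores for
# ALL target positions come from one matrix multiply, and a causal mask
# hides future positions, so no sequential loop is needed.
T, d = 4, 8                          # 4 timesteps, 8 features (made up)
rng = np.random.default_rng(0)
q = rng.normal(size=(T, d))          # queries for every position at once
k = rng.normal(size=(T, d))          # keys for every position at once

scores = q @ k.T / np.sqrt(d)                 # [T, T], all pairs in parallel
future = np.triu(np.ones((T, T)), k=1)        # 1s above the diagonal
scores = np.where(future == 1, -1e9, scores)  # mask out future positions
attn_weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
# Row i now attends only to positions 0..i, and each row sums to 1.
```

During inference, by contrast, only the query for the current position exists, so the same computation has to be repeated once per generated token, attending over all cached decoder states.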
amazing, thank you!
I do have a follow-up question: I have searched for the current_time_step variable, but I see it appears only in the transformer_inference.py code. I actually need it in the training process rather than the inference process, so I was wondering whether there is something similar for the transformer.py code? Namely, is there a way to know which word is being processed by the decoder at a given train step (of course, it is all done simultaneously, but I am talking about a given thread at a given time step).
In fact, I will explain what my purpose is, as the current time step might turn out to be redundant:
I am trying to propagate the attn_weights of the decoder layers' self-attention from transformer_attention_modules.py line 168 all the way to transformer.py line 75 (by adding the appropriate outputs to all the relevant functions along the way). My goal is to incorporate a certain manipulation of the attn_weights into the loss function. The thing is, I don't want redundancy in the loss, so I want each word in the target sentence to contribute just the row of attn_weights relevant to it (namely, for the word generated at time step i, I want just the row that holds that word's attn_weights over the first i-1 words). So my question is: is the triangular matrix attn_weights computed once per sentence, or as many times as there are words in the sentence?
For example, for the generated target sentence "I eat pizza", will the attn_weights be generated once (namely, for each word the corresponding row is computed once, and since it is all done simultaneously the matrix is generated once):

1, 0, 0
x, 1-x, 0
x, y, 1-x-y

or rather would it be generated three times, once per timestep (with rows for not-yet-generated words containing nonsense):

1, 0, 0
nonsense * 3
nonsense * 3

1, 0, 0
x, 1-x, 0
nonsense * 3

1, 0, 0
x, 1-x, 0
x, y, 1-x-y
Thanks!
attn_weights will only be generated once per attention head and timestep. If you want to propagate this to the loss, you'll need to properly pack this information (a base Transformer has 48 attention heads), and you can look at the layer_memories for inspiration (they're currently only used for caching at inference time). @demelin is the original author of that code and might help with more detailed questions.
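One possible way to pack the per-layer attention weights into a single tensor before handing them to the loss, as a minimal numpy sketch with made-up names and sizes (`num_layers` and `attn_per_layer` are illustrative, not Nematus identifiers):

```python
import numpy as np

# Hypothetical packing sketch: suppose each decoder layer yields an
# attention tensor of shape [batch, heads, T, T]; stacking them gives a
# single object that can be threaded through to the loss in one pass.
batch, heads, T = 2, 8, 5
num_layers = 6                                  # 6 layers x 8 heads = 48 heads
rng = np.random.default_rng(0)
attn_per_layer = [rng.random((batch, heads, T, T)) for _ in range(num_layers)]

packed = np.stack(attn_per_layer, axis=1)       # [batch, layers, heads, T, T]
```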
@rsennrich So let me make sure I understand clearly:
given the sentence "I eat pizza", all the words of the sentence ("I", "eat", "pizza") will be processed in the same timestep (as they are processed in parallel), so attn_weights will be computed once for the whole sentence for this timestep, right?
@rsennrich In other words - are all the words of a sentence processed in the same single timestep or each is processed in a separate timestep?
I use timestep to refer to the position in the sentence (it could be a word, subword, or character). So I'm saying that attn_weights will only be generated once per attention head and subword, and that all subwords are processed in parallel as a single tensor (for example, the key tensor has the shape [batch_size, time_steps, num_heads, num_features]).
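A quick shape walk-through of the tensors mentioned above, in plain numpy with toy sizes (a hedged sketch; Nematus' actual implementation differs in details):

```python
import numpy as np

# Query/key tensors with the layout [batch_size, time_steps, num_heads,
# num_features]; a single einsum scores every query position against every
# key position, for all heads and all sentences in the batch at once.
batch_size, time_steps, num_heads, num_features = 2, 16, 8, 64
rng = np.random.default_rng(0)
q = rng.random((batch_size, time_steps, num_heads, num_features))
k = rng.random((batch_size, time_steps, num_heads, num_features))

scores = np.einsum('bqhf,bkhf->bhqk', q, k) / np.sqrt(num_features)
print(scores.shape)  # (2, 8, 16, 16): one 16x16 matrix per head per sentence
```

Note that this matches the [2 8 16 16] shape observed further down in the thread.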
@rsennrich @demelin
From what I know, there are two attention layers in the decoder: the cross-attention layer, which receives as input the output of the encoder block (it is quite clear how this is implemented in the code), and the self-attention layer, which receives all the words generated up to the current timestep. So what I don't seem to understand is: how can a word at timestep t be computed without knowledge of the first t-1 words (which are being computed in parallel with it and are not necessarily known at timestep t)?
In fact, I have also printed attn_weights and its size, and looked at the output. The size of attn_weights was [2 8 16 16] (namely a batch of 2 sentences, the longer of which has length 16, so in fact there was a 16X16 matrix for each of the sentences and for each of the 8 heads). I also printed the matrix of the first sentence for one of those heads, and got the following output (each row contains 16 elements):
So, to ask a more practical question: for every timestep, is this whole matrix generated, so that for a sentence of length 16, 16 matrices of size 16X16 are generated over 16 timesteps per head (one matrix per timestep)? Or, alternatively, is just one row of the matrix generated per timestep, with all rows concatenated at the end (so for a sentence of length 16, 16 rows of size 1X16, one row per timestep, rather than 16 matrices of size 16X16)?
Hey there,
During training, all attention weights are computed in parallel for all time-steps (i.e. encoder self-attention, decoder self-attention, and decoder-to-encoder attention). During inference, encoder self-attention is computed in parallel for all time-steps, as we have access to the full source sentence. Decoder self-attention is re-computed for all target-side tokens each time we expand the generated sequence, so as to update the latent representations at positions 0, ..., t-1 with the information contained in the token t. Decoder-to-encoder attention, on the other hand, is only computed between the target token at the current position, i.e. t, and all hidden states in the final layer of the encoder. The result is concatenated with previous time-step-specific attention weights which are cached in layer_memories to avoid redundant computation.
To answer your question regarding the attention matrix in your example: during inference at t == 1, the attention matrix has size 1x1, at t == 2 it has size 2x2, ..., at t == 16 it has size 16x16. During training, it's 16x16 right away, and we apply a mask to prevent the model from attending to future positions, as the decoder is given the full (shifted) target sequence as input. This enables the efficient, non-recurrent training process. Hope that helps.
All the best,
Denis
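The equivalence described above (the masked 16x16 training matrix versus the growing per-step matrices of inference) can be checked numerically with a small hedged numpy sketch (toy sizes, not Nematus code):

```python
import numpy as np

def causal_attn(q, k):
    """Softmax attention with future positions masked out (toy version)."""
    T, d = q.shape
    s = q @ k.T / np.sqrt(d)
    s = np.where(np.triu(np.ones((T, T)), k=1) == 1, -1e9, s)
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(1)
T, d = 16, 8
q, k = rng.normal(size=(T, d)), rng.normal(size=(T, d))

full = causal_attn(q, k)            # training: one masked 16x16 matrix
for t in range(1, T + 1):           # inference: 1x1, 2x2, ..., 16x16
    step = causal_attn(q[:t], k[:t])
    # The newest row of the t x t inference matrix equals row t-1 of the
    # full masked training matrix (whose masked tail is all zeros).
    assert np.allclose(step[-1], full[t - 1, :t])
```

The loop plays the role of incremental decoding; in real inference the earlier rows would come from a cache (layer_memories) rather than being recomputed from scratch.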
@demelin
ok, I think I understand.
So just to make sure I understand:
For the example I included, what is done in the inference phase in 16 sequential steps, yielding 16 matrices (1X1, 2X2, ..., 16X16), is equivalent to a single sequential step in the training, yielding a single matrix of size 16X16 (and not 16 matrices of size 16X16)?
My main concern is the number of matrices yielded during a single propagation of all the words of a sentence in the training phase.
Yes, the training step produces a single attention matrix (or, more accurately, a 4D attention tensor) per mini-batch per attention type, i.e. one for encoder self-attention, one for decoder self-attention, and one for decoder-to-encoder attention.
"what is done in the inference phase in 16 sequential steps yielding 16 matrices (1X1, 2X2, ...), is equivalent to a single sequential step in the training yielding a single matrix of size 16X16 (and not 16 matrices of size 16X16)"
Yep :). 16 matrices would be way too many. Also, the training is not sequential (e.g. as in RNNs), but parallel (e.g. as in FFNs).