Requirements:
- dynet >= 2.0
- Chainer (for data batching)
- progressbar 2.0
- nltk (for computing BLEU score)
Implementation of the "Attention Is All You Need" paper (Transformer model).
Features added:
- Multi-Head Attention
- Positional Encoding
- Positional Embedding
- Label Smoothing
- Warm-up learning-rate schedule for the Adam optimizer (see the sketch after this list)
- Shared weights between the target embedding and the decoder softmax layer
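
For reference, here is a minimal NumPy sketch of the positional encoding and the warm-up schedule, following the formulas in the paper rather than this repo's actual code (the function names are illustrative):

    import numpy as np

    def positional_encoding(max_len, d_model):
        # Sinusoidal encodings from the paper (assumes d_model is even):
        #   PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
        #   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
        pe = np.zeros((max_len, d_model))
        position = np.arange(max_len)[:, None]                    # (max_len, 1)
        div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
        pe[:, 0::2] = np.sin(position / div_term)                 # even dimensions
        pe[:, 1::2] = np.cos(position / div_term)                 # odd dimensions
        return pe

    def warmup_lr(step, d_model=512, warmup_steps=4000):
        # Adam learning rate: rises linearly for the first warmup_steps
        # updates, then decays as the inverse square root of the step:
        #   lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)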
Run the model:

    python train.py -s train-big.ja -t train-big.en --dynet-gpu --epoch 30 -b 128 --head 1
It reaches a maximum BLEU score of around 25.2. The current training speed is around 0.87 seconds per optimization step with a batch size of 128; overall, one epoch takes ~10 minutes on a TITAN X (Pascal) GPU.
Issues / Need for improvement:
- Layer normalization is currently not working properly, so that part of the code is commented out.
- If we keep the number of heads at 8, training slows down by a factor of about 3. I am guessing this is due to the "dynet.pick_batch_elems()" function; if it could be converted to something like "dynet.pick_batch_range()", as is currently done for row selection, the code would speed up (see the sketch below).
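
For context, a rough sketch of the two selection patterns involved. "dynet.pick_range()" and "dynet.pick_batch_elems()" exist in the current Python API; "pick_batch_range()" does not exist yet and appears here only as the proposed analogue:

    import dynet as dy
    import numpy as np

    # A batched expression: a 4-dimensional vector with batch size 6
    # (with batched=True, the last axis of the array is the batch axis).
    x = dy.inputTensor(np.random.rand(4, 6), batched=True)

    # Row selection already goes through a contiguous-range primitive:
    rows = dy.pick_range(x, 0, 2)              # rows 0..1 of every batch element

    # Batch-element selection takes an explicit index list, which is the
    # suspected bottleneck once the number of heads (and hence picks) grows:
    elems = dy.pick_batch_elems(x, [0, 1, 2])  # batch elements 0, 1, 2

    # A hypothetical dy.pick_batch_range(x, 0, 3) would be the batch-axis
    # analogue of pick_range, avoiding the per-element indexing cost.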