Requirements:
- dynet >= 2.0
- Chainer (for data batching)
- progressbar 2.0
- nltk (for computing BLEU score)
Implementation of the "Attention Is All You Need" paper (Transformer model).
Features added:
- Multi-Head Attention
- Positional Encoding
- Positional Embedding
- Label Smoothing
- Warm-up learning-rate schedule for the Adam optimizer (see the sketch after this list)
- Shared weights between the target embedding and the decoder softmax layer
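
For reference, here is a minimal NumPy sketch of the positional encoding and the warm-up schedule, following the formulas in the paper rather than this repo's actual code (the function names are illustrative):

    import numpy as np

    def positional_encoding(max_len, d_model):
        # Sinusoidal encodings from the paper (assumes d_model is even):
        #   PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
        #   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
        pe = np.zeros((max_len, d_model))
        position = np.arange(max_len)[:, None]                    # (max_len, 1)
        div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
        pe[:, 0::2] = np.sin(position / div_term)                 # even dimensions
        pe[:, 1::2] = np.cos(position / div_term)                 # odd dimensions
        return pe

    def warmup_lr(step, d_model=512, warmup_steps=4000):
        # Adam learning rate: rises linearly for the first warmup_steps
        # updates, then decays as the inverse square root of the step:
        #   lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)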
Run the model:

    python train.py -s train-big.ja -t train-big.en --dynet-gpu --epoch 30 -b 128 --head 1
It reaches a maximum BLEU score of around 25.2. The current training speed is around 0.87 seconds per optimization step with a batch size of 128; overall, one epoch takes ~10 minutes on a TITAN X (Pascal) GPU.
Issues / Need for improvement:
- Layer normalization is currently not working properly, so that part of the code is commented out.
- If we keep the number of heads at 8, training slows down by a factor of about 3. I am guessing this is due to the "dynet.pick_batch_elems()" function; if it could be converted to something like "dynet.pick_batch_range()", as is currently done for row selection, the code would speed up (see the sketch below).
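
For context, a rough sketch of the two selection patterns involved. "dynet.pick_range()" and "dynet.pick_batch_elems()" exist in the current Python API; "pick_batch_range()" does not exist yet and appears here only as the proposed analogue:

    import dynet as dy
    import numpy as np

    # A batched expression: a 4-dimensional vector with batch size 6
    # (with batched=True, the last axis of the array is the batch axis).
    x = dy.inputTensor(np.random.rand(4, 6), batched=True)

    # Row selection already goes through a contiguous-range primitive:
    rows = dy.pick_range(x, 0, 2)              # rows 0..1 of every batch element

    # Batch-element selection takes an explicit index list, which is the
    # suspected bottleneck once the number of heads (and hence picks) grows:
    elems = dy.pick_batch_elems(x, [0, 1, 2])  # batch elements 0, 1, 2

    # A hypothetical dy.pick_batch_range(x, 0, 3) would be the batch-axis
    # analogue of pick_range, avoiding the per-element indexing cost.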