
Comments (9)

jungokasai commented on August 28, 2024

Have you tried using the hyperparameters for Transformer big? https://github.com/pytorch/fairseq/blob/master/examples/scaling_nmt/README.md#3-train-a-model. I think I ran into a similar problem at some point, and switching to the hyperparameters from Ott et al. (2018) for the big model got it to work.
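For reference, the big-model recipe in that README looks roughly like the following (values from Ott et al. 2018; the data path is a placeholder, and for Mask-Predict you would keep this repo's `--arch bert_transformer_seq2seq`, `--task translation_self`, and length criterion instead of the autoregressive ones):

```bash
# Transformer-big hyperparameters from the fairseq scaling_nmt README
# (Ott et al. 2018). Data path and batching choices are placeholders.
python train.py data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --min-lr 1e-09 --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 --fp16
```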

from mask-predict.

Marjan-GH commented on August 28, 2024

Also, don't forget to preprocess your data with the code from this branch.
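For concreteness, a minimal preprocessing sketch using fairseq's standard flags; the paths, language pair, and the joined-dictionary choice are assumptions, so check them against this branch's README:

```bash
# Binarize the raw text with this branch's preprocess.py.
# Placeholder paths; a joined dictionary matches --share-all-embeddings later.
TEXT=wmt14_en_de
python preprocess.py --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt14.en-de --joined-dictionary --workers 32
```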

from mask-predict.

alphadl commented on August 28, 2024

@jungokasai @Marjan-GH Thanks! I adjusted the original command to:

python train.py data-bin/xlm_pretained-wmt14.en-de \
    --arch bert_transformer_seq2seq --share-all-embeddings \
    --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 \
    --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --task translation_self --max-tokens 11000 --weight-decay 0.01 --dropout 0.3 \
    --encoder-layers 6 --encoder-embed-dim 1024 \
    --decoder-layers 6 --decoder-embed-dim 1024 \
    --encoder-attention-heads 8 --decoder-attention-heads 8 \
    --max-source-positions 10000 --max-target-positions 10000 \
    --max-update 500000 --seed 0 --save-dir ${model_dir} \
    --update-freq 3 --ddp-backend=no_c10d --fp16 --keep-last-epochs 10

This seems to work for me. Here is one line picked from the training log; the ppl is converging along a reasonable trend:

| epoch 006:  86%|▊| 944/1099 [28:55<04:45,  1.84s/it, loss=3.986, nll_loss=1.915, ppl=3.77, wps=32894, ups=1, wpb=60193.316, bsz=4095.215, num_updates=6429, lr=0.000394392, gnorm=0.705, clip=0.000, oom=0.000, loss_scale=0.125, wall=11957, train_wall=10246, length_loss=3.72187]

from mask-predict.

alphadl commented on August 28, 2024

@yinhanliu Hoping for your advice 😉

from mask-predict.

alphadl commented on August 28, 2024

@yinhanliu @omerlevy @Marjan-GH

Hi, dear authors. I have trained the large-scale Mask-Predict model (hidden size: 1024/4096, vocab size: 60k+), which has ~270M parameters. With the increased parameter count, the translation quality was expected to be better than the base-scale Mask-Predict (512/2048). However, the BLEU score of the large-scale model on EN-DE newstest14 is only ~26.

I'm pretty sure the model has converged; some indicators are shown below.

The loss of the latest large-scale model that I used for evaluation is as follows:

loss=2.915; nll_loss=0.833;  ppl=1.78; length_loss=2.88968; lr=0.000119598

which looks clearly better than the base-scale Mask-Predict, with which I reproduced your result on the same EN-DE dataset (BLEU reaches ~27). The loss of that base model is:

loss=3.136; nll_loss=1.146; ppl=2.21; length_loss=3.04; lr=0.000117369

So I am wondering: does the Mask-Predict model only work at the base (512/2048) scale, and fail under the large-scale setting?

Looking forward to your reply.

Best

from mask-predict.

jungokasai commented on August 28, 2024

That seems a bit strange. The perplexity and length loss in validation are smaller, so I would expect the large transformer to be at least as good as the base one in BLEU as well. It does not look like a training issue. Just as a sanity check, could you check the BLEU and loss on the validation data with both the base and large models? Roughly speaking, there should be a correlation between BLEU and the loss; if there isn't, there might be something wrong with the inference. Otherwise, it might be overfitting? I wouldn't have expected that with dropout 0.3, though. Also, please make sure you are distilling from the same large autoregressive transformer.
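A sketch of how that check might look, assuming the repo's generate_cmlm.py entry point and standard fairseq flags (the checkpoint path and decoding settings here are placeholders, not values from this thread):

```bash
# Score the *validation* set instead of the default test set.
# --gen-subset is standard fairseq; the mask-predict decoding flags
# are assumed from the repo's README.
python generate_cmlm.py data-bin/wmt14.en-de \
    --path ${model_dir}/checkpoint_best.pt --task translation_self \
    --gen-subset valid --remove-bpe --max-sentences 20 \
    --decoding-iterations 10 --decoding-strategy mask_predict
```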

from mask-predict.

alphadl commented on August 28, 2024

> That seems a bit strange. The perplexity and length loss in validation are smaller, so I would expect the large transformer to be at least as good as the base one in BLEU as well. It does not look like a training issue. Just as a sanity check, could you check the BLEU and loss on the validation data with both the base and large models? Roughly speaking, there should be a correlation between BLEU and the loss; if there isn't, there might be something wrong with the inference. Otherwise, it might be overfitting? I wouldn't have expected that with dropout 0.3, though.

Hi @jungokasai ,

The BLEU scores on the validation set using the best single checkpoint are:

Large-scale model:

BLEU = 15.74, 38.1/19.3/11.3/7.4 (BP=1.000, ratio=1.946, syslen=62394, reflen=32064)

Base-scale model:

BLEU = 15.19, 37.3/18.7/11.0/7.0 (BP=1.000, ratio=1.935, syslen=62578, reflen=32332)

There does exist a positive correlation between validation BLEU and the loss.

However, the BLEU scores on the test set with the best single checkpoint are:

Large-scale model:

BLEU = 25.77, 57.7/31.5/19.5/12.5 (BP=1.000, ratio=1.008, syslen=65745, reflen=65214)

Base-scale model:

BLEU = 26.81, 58.9/32.7/20.4/13.2 (BP=1.000, ratio=1.005, syslen=64820, reflen=64496)

This phenomenon is strange. Could it be that the Mask-Predict architecture is simply not suitable for the large scale?

By the way, all my distilled data is derived from a powerful pretrained big AT (autoregressive) model; see that issue response.
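For anyone reproducing this: sequence-level distillation here just means decoding the training source with the autoregressive teacher and using its outputs as the new training targets. A rough sketch with placeholder paths and checkpoint names (the H-line extraction follows fairseq's generate.py output format):

```bash
# Decode the training set with the AT teacher (beam 5).
python generate.py data-bin/wmt14.en-de --path teacher_big.pt \
    --gen-subset train --beam 5 --max-sentences 128 --remove-bpe > distill.out
# Hypothesis lines look like "H-<id>\t<score>\t<text>": sort by sentence id,
# keep the text, then re-binarize it as the target side for Mask-Predict.
grep ^H distill.out | sort -t- -k2 -n | cut -f3 > train.distill.de
```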

from mask-predict.

omerlevy commented on August 28, 2024

Hi Liam,
When increasing the model size, you usually need to retune the optimization hyperparameters (e.g., learning rate, dropout). I would start with the recommended values for Transformer-large and tweak them from there.
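Concretely, relative to the training command earlier in this thread, a possible starting point would be the Transformer-big defaults from Ott et al. (2018); these are untuned assumptions, not values confirmed by the authors:

```bash
# Big-model defaults differ from the command above in a few places:
# 16 attention heads, a 4096-dim FFN, and weight decay 0.0.
--encoder-ffn-embed-dim 4096 --decoder-ffn-embed-dim 4096 \
--encoder-attention-heads 16 --decoder-attention-heads 16 \
--lr 5e-4 --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0
```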
Hope this helps!

from mask-predict.

alphadl commented on August 28, 2024

@omerlevy
Hi Levy, thanks for your prompt reply!

Because of my relatively limited computing resources, the experiment took a long time. Looking forward to your results! They will be helpful for researchers who follow this paper. Thank you!

from mask-predict.
