
Comments (9)

jungokasai commented on August 28, 2024

Have you tried using the hyperparameters for Transformer big? https://github.com/pytorch/fairseq/blob/master/examples/scaling_nmt/README.md#3-train-a-model. I think I ran into a similar problem at some point, and switching to the hyperparameters from Ott et al. (2018) for the big model got it to work.
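For reference, the big-model recipe in that README looks roughly like the following (values from Ott et al. 2018; the data path is a placeholder, and for Mask-Predict you would keep this repo's `--arch bert_transformer_seq2seq`, `--task translation_self`, and length criterion instead of the autoregressive ones):

```bash
# Transformer-big hyperparameters from the fairseq scaling_nmt README
# (Ott et al. 2018). Data path and batching choices are placeholders.
python train.py data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --min-lr 1e-09 --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 --fp16
```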

from mask-predict.

Marjan-GH commented on August 28, 2024

Also, don't forget to preprocess your data with the code from this branch.
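For concreteness, a minimal preprocessing sketch using fairseq's standard flags; the paths, language pair, and the joined-dictionary choice are assumptions, so check them against this branch's README:

```bash
# Binarize the raw text with this branch's preprocess.py.
# Placeholder paths; a joined dictionary matches --share-all-embeddings later.
TEXT=wmt14_en_de
python preprocess.py --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt14.en-de --joined-dictionary --workers 32
```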

from mask-predict.

alphadl commented on August 28, 2024

@jungokasai @Marjan-GH Thanks! I adjusted the original command to:

python train.py data-bin/xlm_pretained-wmt14.en-de \
    --arch bert_transformer_seq2seq --share-all-embeddings \
    --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 \
    --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --task translation_self --max-tokens 11000 --weight-decay 0.01 --dropout 0.3 \
    --encoder-layers 6 --encoder-embed-dim 1024 \
    --decoder-layers 6 --decoder-embed-dim 1024 \
    --encoder-attention-heads 8 --decoder-attention-heads 8 \
    --max-source-positions 10000 --max-target-positions 10000 \
    --max-update 500000 --seed 0 --save-dir ${model_dir} \
    --update-freq 3 --ddp-backend=no_c10d --fp16 --keep-last-epochs 10

This seems to work for me. Here is one line picked from the training log; the ppl is converging along a reasonable trend:

| epoch 006:  86%|▊| 944/1099 [28:55<04:45,  1.84s/it, loss=3.986, nll_loss=1.915, ppl=3.77, wps=32894, ups=1, wpb=60193.316, bsz=4095.215, num_updates=6429, lr=0.000394392, gnorm=0.705, clip=0.000, oom=0.000, loss_scale=0.125, wall=11957, train_wall=10246, length_loss=3.72187]

from mask-predict.

alphadl commented on August 28, 2024

@yinhanliu Hoping for your advice 😉

from mask-predict.

alphadl commented on August 28, 2024

@yinhanliu @omerlevy @Marjan-GH

Hi, dear authors. I have trained the large-scale Mask-Predict model (hidden size: 1024/4096, vocab size: 60k+), which has ~270M parameters. With the increased parameter count, the translation quality was expected to be better than the base-scale Mask-Predict (512/2048). However, the BLEU score of the large-scale model on EN-DE newstest14 is only ~26.

I'm pretty sure the model has converged; some indicators are shown below.

The loss of the latest large-scale model that I used for evaluation is as follows:

loss=2.915; nll_loss=0.833;  ppl=1.78; length_loss=2.88968; lr=0.000119598

which looks clearly better than the base-scale Mask-Predict, with which I reproduced your result on the same EN-DE dataset (BLEU reaches ~27). The loss of that base model is:

loss=3.136; nll_loss=1.146; ppl=2.21; length_loss=3.04; lr=0.000117369

So I am wondering: does the Mask-Predict model only work at the base (512/2048) scale, and fail under the large-scale setting?

Looking forward to your reply.

Best

from mask-predict.

jungokasai commented on August 28, 2024

That seems a bit strange. The perplexity and length loss in validation are smaller, so I would expect the large transformer to be at least as good as the base one in BLEU as well. It does not look like a training issue. Just as a sanity check, could you check the BLEU and loss on the validation data with both the base and large models? Roughly speaking, there should be a correlation between BLEU and the loss; if there isn't, there might be something wrong with the inference. Otherwise, it might be overfitting? I wouldn't have expected that with dropout 0.3, though. Also, please make sure you are distilling from the same large autoregressive transformer.
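A sketch of how that check might look, assuming the repo's generate_cmlm.py entry point and standard fairseq flags (the checkpoint path and decoding settings here are placeholders, not values from this thread):

```bash
# Score the *validation* set instead of the default test set.
# --gen-subset is standard fairseq; the mask-predict decoding flags
# are assumed from the repo's README.
python generate_cmlm.py data-bin/wmt14.en-de \
    --path ${model_dir}/checkpoint_best.pt --task translation_self \
    --gen-subset valid --remove-bpe --max-sentences 20 \
    --decoding-iterations 10 --decoding-strategy mask_predict
```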

from mask-predict.

alphadl commented on August 28, 2024

> That seems a bit strange. The perplexity and length loss in validation are smaller, so I would expect the large transformer to be at least as good as the base one in BLEU as well. It does not look like a training issue. Just as a sanity check, could you check the BLEU and loss on the validation data with both the base and large models? Roughly speaking, there should be a correlation between BLEU and the loss; if there isn't, there might be something wrong with the inference. Otherwise, it might be overfitting? I wouldn't have expected that with dropout 0.3, though.

Hi @jungokasai ,

The BLEU scores on the validation set using the best single checkpoint are:

Large-scale model:

BLEU = 15.74, 38.1/19.3/11.3/7.4 (BP=1.000, ratio=1.946, syslen=62394, reflen=32064)

Base-scale model:

BLEU = 15.19, 37.3/18.7/11.0/7.0 (BP=1.000, ratio=1.935, syslen=62578, reflen=32332)

There does exist a positive correlation between validation BLEU and the loss.

However, the BLEU scores on the test set with the best single checkpoint are:

Large-scale model:

BLEU = 25.77, 57.7/31.5/19.5/12.5 (BP=1.000, ratio=1.008, syslen=65745, reflen=65214)

Base-scale model:

BLEU = 26.81, 58.9/32.7/20.4/13.2 (BP=1.000, ratio=1.005, syslen=64820, reflen=64496)

This phenomenon is strange. Could it be that the Mask-Predict architecture is simply not suitable for the large scale?

By the way, all my distilled data is derived from a powerful pretrained big AT (autoregressive) model; see that issue response.
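For anyone reproducing this: sequence-level distillation here just means decoding the training source with the autoregressive teacher and using its outputs as the new training targets. A rough sketch with placeholder paths and checkpoint names (the H-line extraction follows fairseq's generate.py output format):

```bash
# Decode the training set with the AT teacher (beam 5).
python generate.py data-bin/wmt14.en-de --path teacher_big.pt \
    --gen-subset train --beam 5 --max-sentences 128 --remove-bpe > distill.out
# Hypothesis lines look like "H-<id>\t<score>\t<text>": sort by sentence id,
# keep the text, then re-binarize it as the target side for Mask-Predict.
grep ^H distill.out | sort -t- -k2 -n | cut -f3 > train.distill.de
```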

from mask-predict.

omerlevy commented on August 28, 2024

Hi Liam,
When increasing the model size, you usually need to retune the optimization hyperparameters (e.g., learning rate, dropout). I would start with the recommended values for Transformer-large and tweak them from there.
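Concretely, relative to the training command earlier in this thread, a possible starting point would be the Transformer-big defaults from Ott et al. (2018); these are untuned assumptions, not values confirmed by the authors:

```bash
# Big-model defaults differ from the command above in a few places:
# 16 attention heads, a 4096-dim FFN, and weight decay 0.0.
--encoder-ffn-embed-dim 4096 --decoder-ffn-embed-dim 4096 \
--encoder-attention-heads 16 --decoder-attention-heads 16 \
--lr 5e-4 --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0
```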
Hope this helps!

from mask-predict.

alphadl commented on August 28, 2024

@omerlevy
Hi Levy, thanks for your prompt reply!

Because of my relatively limited computing resources, the experiment took a long time. Looking forward to your results! They will be helpful for researchers who follow this paper. Thank you!

from mask-predict.
