facebookresearch / mask-predict Goto Github PK

A masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation.

License: Other

Python 98.59% C++ 0.49% Lua 0.60% Shell 0.32%

mask-predict's Introduction

Mask-Predict

Download model

Description	Dataset	Model
MASK-PREDICT	[WMT14 English-German]	download (.tar.bz2)
MASK-PREDICT	[WMT14 German-English]	download (.tar.bz2)
MASK-PREDICT	[WMT16 English-Romanian]	download (.tar.bz2)
MASK-PREDICT	[WMT16 Romanian-English]	download (.tar.bz2)
MASK-PREDICT	[WMT17 English-Chinese]	download (.tar.bz2)
MASK-PREDICT	[WMT17 Chinese-English]	download (.tar.bz2)

Preprocess

text=PATH_YOUR_DATA

output_dir=PATH_YOUR_OUTPUT

src=source_language

tgt=target_language

model_path=PATH_TO_MASKPREDICT_MODEL_DIR

python preprocess.py --source-lang ${src} --target-lang ${tgt} --trainpref $text/train --validpref $text/valid --testpref $text/test --destdir ${output_dir}/data-bin --workers 60 --srcdict ${model_path}/maskPredict_${src}${tgt}/dict.${src}.txt --tgtdict ${model_path}/maskPredict${src}_${tgt}/dict.${tgt}.txt

Train

model_dir=PLACE_TO_SAVE_YOUR_MODEL

python train.py ${output_dir}/data-bin --arch bert_transformer_seq2seq --share-all-embeddings --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --task translation_self --max-tokens 8192 --weight-decay 0.01 --dropout 0.3 --encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 --fp16 --max-source-positions 10000 --max-target-positions 10000 --max-update 300000 --seed 0 --save-dir ${model_dir}

Evaluation

python generate_cmlm.py ${output_dir}/data-bin --path ${model_dir}/checkpoint_best_average.pt --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 --decoding-strategy mask_predict

License

MASK-PREDICT is CC-BY-NC 4.0. The license applies to the pre-trained models as well.

Citation

Please cite as:

@inproceedings{ghazvininejad2019MaskPredict,
  title = {Mask-Predict: Parallel Decoding of Conditional Masked Language Models},
  author = {Marjan Ghazvininejad, Omer Levy, Yinhan Liu, Luke Zettlemoyer},
  booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  year = {2019},
}

mask-predict's People

Stargazers

Watchers

mask-predict's Issues

Provide wmt14_en_de model can only achieve 20.90 bleu

Hi, when I used the checkpoint_best.pt provided in readme and the inference script "python generate_cmlm.py ${output_dir}/data-bin --path ${model_dir}/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 --decoding-strategy mask_predict", I can only got the bleu of 20.90. What is the problem? Are there any other hyperparameters I need to modify in the inference script?

I see "average the 5 best checkpoints to create the final model" in the paper. So is the checkpoint_best.pt provided in the link the final model? If not, I wonder how to average the best checkpoints? Do we forward 5 models and average the prediction distribution?

Thank you!

Why is the BLEU obtained from the training model provided much higher than the value on paper?

I download the provided trained model, and test on test dataset, but get much higher BLEU than the values in paper.

I use the scripts provided, and don't change anything:

python preprocess.py \
  --source-lang de \
  --target-lang en \
  --trainpref data/wmt14.en-de/train \
  --validpref data/wmt14.en-de/valid \
  --testpref data/wmt14.en-de/test \
  --destdir output/data-bin/wmt14.de-en \
  --srcdict output/maskPredict_de_en/dict.de.txt \
  --tgtdict output/maskPredict_de_en/dict.en.txt

python generate_cmlm.py output/data-bin/wmt14.${src}-${tgt}  \
    --path ${model_dir}/checkpoint_best.pt \
    --task translation_self \
    --remove-bpe True \
    --max-sentences 20 \
    --decoding-iterations ${iteration} \
    --decoding-strategy mask_predict

I get 34.42 on WMT14 DE->EN, 35.20 on WMT16 EN->RO, 35.62 on WMT RO->EN. These values are much higher than that in origin paper. This is strange, and what happened?

Could you make a ReadMe more specifically about preprocess ?

Could you make a ReadMe more specifically about preprocess ? Or release the datasets you used in the paper. Thx.

can not download the en-de distill dataset

The previous link to download the en-de distill dataset is Not found, can you reprovide it.

Does the En-De experiment use the compound-split-bleu.sh?

I've trained a Transformer-base on WMT14 En-De following your settings. It would only be around ~27.74 when using compound-split-bleu.sh. I am not sure about this since the paper only said "evaluated with BLEU".

Release the bpe codes?

Hi, I find that the released model doesn't contain the bpe codes, which makes itself unusable. Can the authors release the bpe code?
Thx.

can not reproduce the result on wmt14 en-de

Hi,

Thanks so much for releasing your models and data. However, after running the following command,
I could only get 9.75 for BLEU 4 on wmt14 ende.
python generate_cmlm.py data-bin/wmt14.en-de/ --path models/wmt14-ende/maskPredict_en_de/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 --decoding-strategy mask_predict

Any idea where I am wrong? Thanks!

path to the datasets

Hi
Could you please provide me with the link to the datasets to be able to run the codes? I am not sure about the format of the datasets model needs.

thanks.

Not able to reproduce the BLEU score with the saved model for en-de or de-en

I tried reproducing the results with the provided saved model on the newstest corpora found here: https://nlp.stanford.edu/projects/nmt/. I ran the following commands:

python preprocess.py --source-lang ${src} --target-lang ${tgt} --testpref ${raw_text}/newstest2014 --destdir ${data_dir}/data-bin --workers 60 --srcdict ${model_path}/maskPredict_${src}${tgt}/dict.${src}.txt --tgtdict ${model_path}/maskPredict${src}_${tgt}/dict.${tgt}.txt

python generate_cmlm.py ${data_dir}/data-bin --path ./maskPredict_${src}_${tgt}/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 2 --decoding-iterations 10 --dehyphenate --decoding-strategy mask_predict

at first I got a BLEU score of only 9, then I noticed that the dictionary in data-bin was different than the dictionary in the saved model so I manually removed the "finalize" call when saving the dictionary from preproess.py (it seemed like an optimization only anyway). This made the dictionaries identical, but the BLEU score was still only 16. Looking manually at the results, they seem reasonable in the sense that there doesn't seem to be a discrepancy between vocabulary of trained model and vocab used in evaluation. The one issue that is salient is the fact that there are around 10% UNK tokens in both source and target. 10% UNKs seems rather high.

Does anyone know what the possible issue is? Is it possible that the uploaded model is not the one that got the BLEU scores reported in the paper?

bpe code file

Could you provide the bpe code file for the wmt17-en-zh model?

Cannot fine-tune your model with default arguments

Hello. I download your en-zh trained model and try to fine-tune it with my own data (also part of WMT17 en-zh corpus). But when I run the train.py script, I encountered the following error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/root/work/Mask-Predict/train.py", line 265, in distributed_main
    main(args, init_distributed=True)
  File "/root/work/Mask-Predict/train.py", line 80, in main
    train(args, trainer, task, epoch_itr)
  File "/root/work/Mask-Predict/train.py", line 121, in train
    log_output = trainer.train_step(samples)
  File "/root/work/Mask-Predict/fairseq/trainer.py", line 347, in train_step
    self.optimizer.step()
  File "/root/work/Mask-Predict/fairseq/optim/fp16_optimizer.py", line 197, in step
    self.fp32_optimizer.step(closure)
  File "/root/work/Mask-Predict/fairseq/optim/fairseq_optimizer.py", line 94, in step
    self.optimizer.step(closure)
  File "/root/work/Mask-Predict/fairseq/optim/adam.py", line 114, in step
    amsgrad = group['amsgrad']
KeyError: 'amsgrad'

My arguments for train.py are:

python train.py {lang_prefix}/data-bin --arch bert_transformer_seq2seq --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --task translation_self --max-tokens 1024 --weight-decay 0.01 --dropout 0.3 --encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 --max-source-positions 10000 --max-target-positions 10000 --fp16 --max-update 300000 --seed 0 --save-dir {lang_prefix}/model

You can see that it is almost identical to the default ones. Are there any subtle differences between your default arguments and released pre-trained model?

path to the actual method implementation

Hi
Since the codebase is mixed with fairseq this is hard for me to realize which part is actual method implementation. Do you mind providing me with the path of the actual method implemenation?
thanks a lot for your help

Will you release the dataset of wmt17-enzh ?

Hi，

I want reproduce your result of wmt17-enzh, but your released model doesn't contain the bpe codes.
Will you release wmt17-enzh dataset ?

tks

why a lot of @@ in the data

Hi,

I notice that there a lot of @@ in the data. For example, "Gut@@ ach : Incre@@ ased safety for pedestri@@ ans". It seems like that "Incre@@ ased" means "Increased". Should we revise the file such that deleting the @@ and combine two tokens to one token? I think for the preprocess.py ignore such a problem. And create the dictionary that contains a lot of words that have "@@".

Best,
Ye

data url 404

The url of wmt14 en-de dataset in https://github.com/facebookresearch/Mask-Predict/blob/main/get_data.sh is unavailable now. Could you release the processed data?

Whether there exists the [length] token?

I have a question about the length token. I cannot find the length token in the input source sentences

Failed to train the Mask-predict with larger model/hidden dimension

Elegant work! In addition to training a transformer_base-scale model, I am still trying to train a large model, (e.g., 1024 model dim. & 4096 hidden dim), such that I can fine-tune Mask-predict with XLM.

However, when I simply change the dimension and fix other arguments, the training is failed, that is, the ppl is even becoming bigger. Can you give me some advices?

Below is my training command:

python train.py data-bin/xlm_pretained-wmt14.en-de --arch bert_transformer_seq2seq --share-all-embeddings --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 --optimizer adam --adam-betas '(0.9,0.999)' --adam-eps 1e-6 --task translation_self --max-tokens 11000 --weight-decay 0.01 --dropout 0.3 --encoder-layers 6 --encoder-embed-dim 1024 --decoder-layers 6 --decoder-embed-dim 1024 --encoder-attention-heads 8 --decoder-attention-heads 8 --max-source-positions 10000 --max-target-positions 10000 --max-update 300000 --seed 0 --save-dir ${model_dir} --update-freq 3 --ddp-backend=no_c10d --fp16 --keep-last-epochs 10

and the following is the log of one training step:

| epoch 012:  74%|▋| 814/1099 [24:26<08:28,  1.79s/it, loss=12.243, nll_loss=11.121, ppl=2226.58, wps=33332, ups=1, wpb=60068.756, bsz=4060.299, num_updates=12894, lr=0.000440328, gnorm=0.341, clip=0.000, oom=0.000, loss_scale=0.250, wall=23856, train_wall=20393, length_loss=6.6472]

BTW, because I reused the XLM vocabulary list, the vocab size of larger Mask-predict is more than 60k+.

Namespace(adam_betas='(0.9,0.999)', adam_eps=1e-06, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='bert_transformer_seq2seq', attention_dropout=0.0, best_checkpoint_metric='loss', bilm_add_bos=False, bilm_attention_dropout=0.0, bilm_mask_last_state=False, bilm_model_dropout=0.1, bilm_relu_dropout=0.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_length_cross_entropy', curriculum=0, data=['data-bin/xlm_pretained-wmt14.en-de'], dataset_impl=None, ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=1024, decoder_embed_path=None, decoder_embed_scale=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:10859', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=4, dropout=0.3, dynamic_length=False, embedding_only=False, encoder_attention_heads=8, encoder_embed_dim=1024, encoder_embed_path=None, encoder_embed_scale=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, find_unused_parameters=False, fix_batches_to_gpus=False, fp16=True, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=10, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_range=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=10000, max_target_positions=10000, max_tokens=11000, max_tokens_valid=11000, max_update=500000, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_dec_token_positional_embeddings=False, no_enc_token_positional_embeddings=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.0, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./distill_model_from_scratch_1024_xlm', save_interval=1, save_interval_updates=0, seed=0, self_target=False, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation_self', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[3], upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01)
| [en] dictionary: 60192 types
| [de] dictionary: 60192 types
| data-bin/xlm_pretained-wmt14.en-de valid 3000 examples
Transformer_nonautoregressive(
  (encoder): TransformerEncoder(
    (embed_tokens): Embedding(60192, 1024, padding_idx=1)
    (embed_positions): LearnedPositionalEmbedding(10002, 1024, padding_idx=1)
    (embed_lengths): Embedding(10000, 1024)
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (layer_norms): ModuleList(
          (0): BertLayerNorm()
          (1): BertLayerNorm()
        )
      )(1)(2)...(5)
        )
      )
    )
  )
  (decoder): SelfTransformerDecoder(
    (embed_tokens): Embedding(60192, 1024, padding_idx=1)
    (embed_positions): LearnedPositionalEmbedding(10002, 1024, padding_idx=1)
    (layers): ModuleList(
      (0): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): BertLayerNorm()
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): BertLayerNorm()
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): BertLayerNorm()
      )(1)(2)...(5)
   )
)

pytorch version

could you provide the details about the environment requirements to run this code? Which pytorch version do you use?

the length loss should devide nsentences?

Mask-Predict/fairseq/criterions/label_smoothed_length_cross_entropy.py

Line 63 in 11c7535

loss = (1. - self.eps) * nll_loss + eps_i * smooth_loss + length_loss

Mask-Predict/fairseq/criterions/label_smoothed_length_cross_entropy.py

Line 73 in 11c7535

    
           'loss': sum(log.get('loss', 0) for log in logging_outputs) / sample_size / math.log(2),

the total loss is sumed by nll_loss, smooth_loss and length loss. When compute the mean loss, nll_loss, smooth_loss should devide ntokens, and length loss should devide nsentences. However, in the source code, both of them are devided by ntokens?

Doubt in mask-predict

Hi,

I have a small doubt. Plz clear.
In mask-predict decoding algorithm, whatever tokens are predicted in 1st round, are retained in all rounds of iterative decoding or they can also be masked?

Thanks

Will you release the distillation dataset of wmt-en-de?

Hi,

I have successfully reproduced the 27.03 BLEU score (N=10, l=5) and 1.2 times speedup (N=10, l=2) using your pre-trained wmt-en-de model.

I wanna train the model from scratch but the performance heavily relies on the distillation dataset you used (With raw data, I can only gain ~24 BLEU score), so it would be much better if you can provide this dataset.