facebookresearch / mask-predict

A masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation.

License: Other


mask-predict's Introduction

Mask-Predict

Download model

Description    Dataset                  Model
MASK-PREDICT   WMT14 English-German     download (.tar.bz2)
MASK-PREDICT   WMT14 German-English     download (.tar.bz2)
MASK-PREDICT   WMT16 English-Romanian   download (.tar.bz2)
MASK-PREDICT   WMT16 Romanian-English   download (.tar.bz2)
MASK-PREDICT   WMT17 English-Chinese    download (.tar.bz2)
MASK-PREDICT   WMT17 Chinese-English    download (.tar.bz2)

Preprocess

text=PATH_YOUR_DATA

output_dir=PATH_YOUR_OUTPUT

src=source_language

tgt=target_language

model_path=PATH_TO_MASKPREDICT_MODEL_DIR

python preprocess.py --source-lang ${src} --target-lang ${tgt} --trainpref $text/train --validpref $text/valid --testpref $text/test --destdir ${output_dir}/data-bin --workers 60 --srcdict ${model_path}/maskPredict_${src}_${tgt}/dict.${src}.txt --tgtdict ${model_path}/maskPredict_${src}_${tgt}/dict.${tgt}.txt

Train

model_dir=PLACE_TO_SAVE_YOUR_MODEL

python train.py ${output_dir}/data-bin --arch bert_transformer_seq2seq --share-all-embeddings --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --task translation_self --max-tokens 8192 --weight-decay 0.01 --dropout 0.3 --encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 --fp16 --max-source-positions 10000 --max-target-positions 10000 --max-update 300000 --seed 0 --save-dir ${model_dir}
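The criterion above trains the decoder as a conditional masked language model: for each training example, a randomly sized subset of target tokens is replaced by a mask symbol and the model predicts the masked positions in parallel, conditioned on the source and the unmasked target tokens. A minimal sketch of that masking step, assuming a dedicated mask index and uniform sampling of the number of masked tokens as described in the paper (an illustration only, not this repository's data-loading code):

import torch

def mask_target(target, mask_idx, pad_idx):
    # Illustrative CMLM masking: sample how many tokens to mask uniformly
    # from 1..T, then mask that many randomly chosen non-pad positions.
    target = target.clone()
    positions = (target != pad_idx).nonzero(as_tuple=True)[0]
    n_mask = torch.randint(1, positions.numel() + 1, (1,)).item()
    masked = positions[torch.randperm(positions.numel())[:n_mask]]
    labels = target.new_full(target.size(), pad_idx)  # loss is computed on masked slots only
    labels[masked] = target[masked]
    target[masked] = mask_idx
    return target, labels  # decoder input, training labels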

Evaluation

python generate_cmlm.py ${output_dir}/data-bin --path ${model_dir}/checkpoint_best_average.pt --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 --decoding-strategy mask_predict
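Note that the evaluation command points at checkpoint_best_average.pt, which train.py does not produce directly; the paper describes averaging the 5 best checkpoints into the final model. Upstream fairseq ships a scripts/average_checkpoints.py utility for this; if it is not available in this fork, a minimal weight-averaging sketch looks like the following (the 'model' key follows fairseq's checkpoint layout and is an assumption here):

import torch

def average_checkpoints(paths, out_path):
    # Arithmetic mean of the model parameters across several checkpoints.
    avg, last_state = None, None
    for path in paths:
        last_state = torch.load(path, map_location='cpu')
        params = last_state['model']  # fairseq-style checkpoint dict (assumption)
        if avg is None:
            avg = {k: v.clone().float() for k, v in params.items()}
        else:
            for k, v in params.items():
                avg[k] += v.float()
    averaged = {k: (v / len(paths)).to(last_state['model'][k].dtype) for k, v in avg.items()}
    last_state['model'] = averaged
    torch.save(last_state, out_path)

# e.g. average_checkpoints([...paths of the 5 best checkpoints...], 'checkpoint_best_average.pt')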

License

MASK-PREDICT is CC-BY-NC 4.0. The license applies to the pre-trained models as well.

Citation

Please cite as:

@inproceedings{ghazvininejad2019MaskPredict,
  title = {Mask-Predict: Parallel Decoding of Conditional Masked Language Models},
  author = {Marjan Ghazvininejad and Omer Levy and Yinhan Liu and Luke Zettlemoyer},
  booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  year = {2019},
}


mask-predict's Issues

Not able to reproduce the BLEU score with the saved model for en-de or de-en

I tried reproducing the results with the provided saved model on the newstest corpora found here: https://nlp.stanford.edu/projects/nmt/. I ran the following commands:

python preprocess.py --source-lang ${src} --target-lang ${tgt} --testpref ${raw_text}/newstest2014 --destdir ${data_dir}/data-bin --workers 60 --srcdict ${model_path}/maskPredict_${src}${tgt}/dict.${src}.txt --tgtdict ${model_path}/maskPredict${src}_${tgt}/dict.${tgt}.txt

python generate_cmlm.py ${data_dir}/data-bin --path ./maskPredict_${src}_${tgt}/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 2 --decoding-iterations 10 --dehyphenate --decoding-strategy mask_predict

At first I got a BLEU score of only 9. Then I noticed that the dictionary in data-bin was different from the dictionary in the saved model, so I manually removed the "finalize" call when saving the dictionary in preprocess.py (it seemed to be only an optimization anyway). This made the dictionaries identical, but the BLEU score was still only 16. Looking at the outputs manually, they seem reasonable in the sense that there is no obvious mismatch between the trained model's vocabulary and the vocabulary used in evaluation. The one salient issue is that around 10% of the tokens in both source and target are UNK, which seems rather high.

Does anyone know what the possible issue is? Is it possible that the uploaded model is not the one that got the BLEU scores reported in the paper?

Cannot fine-tune your model with default arguments

Hello. I downloaded your en-zh trained model and tried to fine-tune it with my own data (also part of the WMT17 en-zh corpus), but when I ran the train.py script I encountered the following error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/root/work/Mask-Predict/train.py", line 265, in distributed_main
    main(args, init_distributed=True)
  File "/root/work/Mask-Predict/train.py", line 80, in main
    train(args, trainer, task, epoch_itr)
  File "/root/work/Mask-Predict/train.py", line 121, in train
    log_output = trainer.train_step(samples)
  File "/root/work/Mask-Predict/fairseq/trainer.py", line 347, in train_step
    self.optimizer.step()
  File "/root/work/Mask-Predict/fairseq/optim/fp16_optimizer.py", line 197, in step
    self.fp32_optimizer.step(closure)
  File "/root/work/Mask-Predict/fairseq/optim/fairseq_optimizer.py", line 94, in step
    self.optimizer.step(closure)
  File "/root/work/Mask-Predict/fairseq/optim/adam.py", line 114, in step
    amsgrad = group['amsgrad']
KeyError: 'amsgrad'

My arguments for train.py are:

python train.py {lang_prefix}/data-bin --arch bert_transformer_seq2seq --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --task translation_self --max-tokens 1024 --weight-decay 0.01 --dropout 0.3 --encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 --max-source-positions 10000 --max-target-positions 10000 --fp16 --max-update 300000 --seed 0 --save-dir {lang_prefix}/model

You can see that they are almost identical to the default ones. Are there any subtle differences between your default arguments and the ones used for the released pre-trained model?
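A possible workaround, offered tentatively: the KeyError suggests the optimizer state stored in the downloaded checkpoint predates the 'amsgrad' field that newer torch.optim.Adam versions expect. The argument dump in a later issue shows this codebase exposes a --reset-optimizer flag, which (as in upstream fairseq) discards the saved optimizer state when restoring a checkpoint; alternatively, the stale state can be stripped from the checkpoint file before fine-tuning. A sketch of the latter (the 'last_optimizer_state' key name is assumed from fairseq's checkpoint format):

import torch

ckpt = torch.load('checkpoint_best.pt', map_location='cpu')
# Drop the saved Adam state so a fresh optimizer is created when training resumes
# (key name assumed from fairseq's checkpoint format).
ckpt.pop('last_optimizer_state', None)
torch.save(ckpt, 'checkpoint_for_finetune.pt')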

Cannot reproduce the result on WMT14 en-de

Hi,

Thanks so much for releasing your models and data. However, after running the following command,
I could only get 9.75 BLEU-4 on WMT14 en-de.
python generate_cmlm.py data-bin/wmt14.en-de/ --path models/wmt14-ende/maskPredict_en_de/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 --decoding-strategy mask_predict

Any idea where I went wrong? Thanks!

Should the length loss be divided by nsentences?

loss = (1. - self.eps) * nll_loss + eps_i * smooth_loss + length_loss

'loss': sum(log.get('loss', 0) for log in logging_outputs) / sample_size / math.log(2),

The total loss is the sum of nll_loss, smooth_loss, and length_loss. When computing the mean loss, nll_loss and smooth_loss should be divided by ntokens, while length_loss should be divided by nsentences. However, in the source code, both are divided by ntokens?
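To make the distinction concrete, the normalization being proposed could be written as follows (an illustration of the question, not the repository's criterion code):

import math

def mean_losses(nll_loss, smooth_loss, length_loss, eps, eps_i, ntokens, nsentences):
    # Token-level terms averaged over target tokens; length term averaged over sentences.
    # The / math.log(2) mirrors the base-2 conversion in the logged 'loss' above.
    token_loss = ((1.0 - eps) * nll_loss + eps_i * smooth_loss) / ntokens / math.log(2)
    sentence_loss = length_loss / nsentences / math.log(2)
    return token_loss + sentence_loss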

Will you release the distillation dataset of wmt-en-de?

Hi,

I have successfully reproduced the 27.03 BLEU score (N=10, l=5) and 1.2 times speedup (N=10, l=2) using your pre-trained wmt-en-de model.

I want to train the model from scratch, but the performance heavily relies on the distillation dataset you used (with raw data, I can only reach ~24 BLEU), so it would be much better if you could provide this dataset.

Thank you!

Why are there a lot of @@ in the data?

Hi,

I notice that there are a lot of @@ markers in the data, for example "Gut@@ ach : Incre@@ ased safety for pedestri@@ ans". It seems that "Incre@@ ased" means "Increased". Should we revise the file by deleting the @@ and merging the two tokens into one? It looks like preprocess.py ignores this and builds a dictionary containing many words with "@@".

Best,
Ye
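The @@ markers are subword-nmt style BPE continuation symbols: the dictionary is intentionally built over these subword units, and they are only joined back into words at evaluation time (this is what the --remove-bpe flag in generate_cmlm.py does). A minimal sketch of that join, assuming the usual '@@ ' convention:

def remove_bpe(line: str) -> str:
    # Join subword pieces: "Incre@@ ased" -> "Increased".
    return line.replace('@@ ', '').replace('@@', '')

print(remove_bpe("Gut@@ ach : Incre@@ ased safety for pedestri@@ ans"))
# Gutach : Increased safety for pedestrians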

bpe code file

Could you provide the bpe code file for the wmt17-en-zh model?

path to the actual method implementation

Hi
Since the codebase is mixed in with fairseq, it is hard for me to tell which part is the actual method implementation. Would you mind pointing me to the path of the actual method implementation?
Thanks a lot for your help.

Why is the BLEU obtained from the provided trained model much higher than the value in the paper?

I downloaded the provided trained model and tested it on the test dataset, but got a much higher BLEU than the values in the paper.

I used the scripts provided and didn't change anything:

python preprocess.py \
  --source-lang de \
  --target-lang en \
  --trainpref data/wmt14.en-de/train \
  --validpref data/wmt14.en-de/valid \
  --testpref data/wmt14.en-de/test \
  --destdir output/data-bin/wmt14.de-en \
  --srcdict output/maskPredict_de_en/dict.de.txt \
  --tgtdict output/maskPredict_de_en/dict.en.txt

python generate_cmlm.py output/data-bin/wmt14.${src}-${tgt}  \
    --path ${model_dir}/checkpoint_best.pt \
    --task translation_self \
    --remove-bpe True \
    --max-sentences 20 \
    --decoding-iterations ${iteration} \
    --decoding-strategy mask_predict 

I get 34.42 on WMT14 DE->EN, 35.20 on WMT16 EN->RO, and 35.62 on WMT16 RO->EN. These values are much higher than those in the original paper. This is strange; what happened?

Release the bpe codes?

Hi, I find that the released model doesn't contain the BPE codes, which makes it unusable on new text. Could the authors release the BPE codes?
Thanks.

Provided wmt14_en_de model can only achieve 20.90 BLEU

Hi, when I used the checkpoint_best.pt provided in the readme and the inference script "python generate_cmlm.py ${output_dir}/data-bin --path ${model_dir}/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 --decoding-strategy mask_predict", I could only get a BLEU of 20.90. What is the problem? Are there any other hyperparameters I need to modify in the inference script?

I see "average the 5 best checkpoints to create the final model" in the paper. So is the checkpoint_best.pt provided in the link the final model? If not, I wonder how to average the best checkpoints? Do we forward 5 models and average the prediction distribution?

Thank you!
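For what it's worth, checkpoint averaging in this context normally refers to averaging the saved model weights into a single model (as in the weight-averaging sketch after the Evaluation command above), not to running several models and averaging their output distributions.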

Failed to train Mask-Predict with a larger model/hidden dimension

Elegant work! In addition to training a transformer_base-scale model, I am also trying to train a large model (e.g., 1024 model dim. and 4096 hidden dim.) so that I can fine-tune Mask-Predict with XLM.

However, when I simply change the dimensions and keep the other arguments fixed, training fails; the perplexity keeps increasing. Can you give me some advice?

Below is my training command:

python train.py data-bin/xlm_pretained-wmt14.en-de --arch bert_transformer_seq2seq --share-all-embeddings --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 --optimizer adam --adam-betas '(0.9,0.999)' --adam-eps 1e-6 --task translation_self --max-tokens 11000 --weight-decay 0.01 --dropout 0.3 --encoder-layers 6 --encoder-embed-dim 1024 --decoder-layers 6 --decoder-embed-dim 1024 --encoder-attention-heads 8 --decoder-attention-heads 8 --max-source-positions 10000 --max-target-positions 10000 --max-update 300000 --seed 0 --save-dir ${model_dir} --update-freq 3 --ddp-backend=no_c10d --fp16 --keep-last-epochs 10

and the following is the log of one training step:

| epoch 012:  74%|▋| 814/1099 [24:26<08:28,  1.79s/it, loss=12.243, nll_loss=11.121, ppl=2226.58, wps=33332, ups=1, wpb=60068.756, bsz=4060.299, num_updates=12894, lr=0.000440328, gnorm=0.341, clip=0.000, oom=0.000, loss_scale=0.250, wall=23856, train_wall=20393, length_loss=6.6472] 

BTW, because I reused the XLM vocabulary, the vocabulary size of the larger Mask-Predict model is over 60k.

Namespace(adam_betas='(0.9,0.999)', adam_eps=1e-06, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='bert_transformer_seq2seq', attention_dropout=0.0, best_checkpoint_metric='loss', bilm_add_bos=False, bilm_attention_dropout=0.0, bilm_mask_last_state=False, bilm_model_dropout=0.1, bilm_relu_dropout=0.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_length_cross_entropy', curriculum=0, data=['data-bin/xlm_pretained-wmt14.en-de'], dataset_impl=None, ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=1024, decoder_embed_path=None, decoder_embed_scale=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:10859', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=4, dropout=0.3, dynamic_length=False, embedding_only=False, encoder_attention_heads=8, encoder_embed_dim=1024, encoder_embed_path=None, encoder_embed_scale=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, find_unused_parameters=False, fix_batches_to_gpus=False, fp16=True, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=10, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_range=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=10000, max_target_positions=10000, max_tokens=11000, max_tokens_valid=11000, max_update=500000, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_dec_token_positional_embeddings=False, no_enc_token_positional_embeddings=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.0, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./distill_model_from_scratch_1024_xlm', save_interval=1, save_interval_updates=0, seed=0, self_target=False, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation_self', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[3], upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01)
| [en] dictionary: 60192 types
| [de] dictionary: 60192 types
| data-bin/xlm_pretained-wmt14.en-de valid 3000 examples
Transformer_nonautoregressive(
  (encoder): TransformerEncoder(
    (embed_tokens): Embedding(60192, 1024, padding_idx=1)
    (embed_positions): LearnedPositionalEmbedding(10002, 1024, padding_idx=1)
    (embed_lengths): Embedding(10000, 1024)
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (layer_norms): ModuleList(
          (0): BertLayerNorm()
          (1): BertLayerNorm()
        )
      )(1)(2)...(5)
        )
      )
    )
  )
  (decoder): SelfTransformerDecoder(
    (embed_tokens): Embedding(60192, 1024, padding_idx=1)
    (embed_positions): LearnedPositionalEmbedding(10002, 1024, padding_idx=1)
    (layers): ModuleList(
      (0): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): BertLayerNorm()
        (encoder_attn): MultiheadAttention(
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): BertLayerNorm()
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): BertLayerNorm()
      )(1)(2)...(5)
   )
)

format of datasets

Hi
Could you provide a sample dataset and document the format of the datasets that the code expects? That would help a lot. Thanks.

path to the datasets

Hi
Could you please provide a link to the datasets so I can run the code? I am not sure about the format of the datasets the model needs.

thanks.

pytorch version

Could you provide details about the environment requirements for running this code? Which PyTorch version do you use?

Doubt in mask-predict

Hi,

I have a small question; please clarify.
In the mask-predict decoding algorithm, are the tokens predicted in the first round retained in all subsequent rounds of iterative decoding, or can they also be re-masked?

Thanks
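For context, in the mask-predict procedure described in the paper no token is permanently kept: at every iteration the lowest-confidence tokens of the entire current hypothesis are re-masked and re-predicted, so a token produced in the first round can be masked again later if its probability is low. A schematic sketch of the loop (illustrative only; model.predict and the linear mask schedule are stand-ins, not this repository's API):

import torch

def mask_predict(model, src_tokens, tgt_len, mask_idx, iterations=10):
    # model.predict(src, tgt) is assumed to return (token_ids, token_probs)
    # for every target position, filling in the masked slots.
    tokens = torch.full((tgt_len,), mask_idx, dtype=torch.long)  # start fully masked
    tokens, probs = model.predict(src_tokens, tokens)
    for t in range(1, iterations):
        n_mask = int(tgt_len * (iterations - t) / iterations)    # linearly fewer masks each round
        if n_mask == 0:
            break
        remask = probs.topk(n_mask, largest=False).indices       # lowest-confidence positions
        tokens[remask] = mask_idx
        new_tokens, new_probs = model.predict(src_tokens, tokens)
        tokens[remask] = new_tokens[remask]                       # only masked slots are updated
        probs[remask] = new_probs[remask]
    return tokens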
