Hi there, happy New Year, and thanks for releasing the code for your nice work! I have some questions about the training configs that I hope you could clarify for me.
Specifically, I would like to replicate the released m2m model, and I am using the provided trainer.sh. Since I use two A100s, I set PER_DEVICE_TRAIN_BATCH_SIZE=32 so as to keep the effective batch size at 256, and I keep the rest of the configs intact.
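For reference, here is the arithmetic I am assuming for the effective batch size (the accumulation-steps value is my guess at what makes the numbers work out; the actual variable names in trainer.sh may differ):

```python
# Effective batch size = GPUs x per-device batch x gradient accumulation steps.
# With 2 GPUs and a per-device batch of 32, an accumulation of 4 (assumed)
# yields the target effective batch of 256.
NUM_GPUS = 2
PER_DEVICE_TRAIN_BATCH_SIZE = 32
GRADIENT_ACCUMULATION_STEPS = 4  # assumption to reach 256 on two A100s

effective_batch_size = (
    NUM_GPUS * PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
)
print(effective_batch_size)  # 256
```

Please correct me if the released model was trained with a different GPU count or accumulation setting.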
Then I run inference on the Chinese Simplified-English test set with the checkpoint at 25K steps. It gets 21.65 ROUGE-L, while the released m2m model gets 26.75.
After inspecting the model outputs, I found that my replicated model sometimes generates summaries in non-target languages. For example, on the Chinese Simplified-English test set, around 10% of the generated summaries are in Chinese, whereas the released model generates only English summaries. This may explain the performance gap above.
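In case it is useful, this is roughly how I counted the off-target outputs — a simple CJK-character heuristic rather than a proper language identifier, and the one-summary-per-string layout is just how I happened to store the outputs:

```python
def contains_cjk(text: str) -> bool:
    """True if the string contains any CJK Unified Ideograph (U+4E00-U+9FFF)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

def off_target_ratio(summaries: list[str]) -> float:
    """Fraction of summaries containing CJK characters (should be 0 for English)."""
    if not summaries:
        return 0.0
    return sum(contains_cjk(s) for s in summaries) / len(summaries)

# Toy example with one English and one Chinese summary:
outputs = ["The president met the delegation.", "总统会见了代表团。"]
print(off_target_ratio(outputs))  # 0.5
```

A real language-ID model would be more robust, but this heuristic was enough to surface the ~10% figure above.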
Another observation is that this problem is more severe with the checkpoint at 20K steps, so I wonder whether it is due to underfitting and might vanish with more training steps (e.g., 30K). I have not validated this assumption yet, as I would like to adopt your original training configs if possible.
I would appreciate it if you could shed some light on how to correctly replicate your m2m model. Are there any particular training configs I should adopt? It would also help if you could share how many training steps the released m2m checkpoint was trained for.
Many thanks!