Hi, thanks for creating and sharing this codebase, it has been really helpful to me.

Question about multilingual experiments in kNN-BOX paper

njunlp commented on June 11, 2024



Comments (3)

Maxwell-Lyu commented on June 11, 2024

Hi, thanks for using kNN-BOX. Here's a short reply to your questions:

1. Yes.
2. Yes.

Detailed reply with more information:

1. Yes. We used the smallest version because of resource limitations when M2M100 is combined with Robust kNN-MT. For your reference, more than 25 GB of GPU memory is needed to train the Robust Combiner for M2M100-418M.
2. Yes. A separate datastore is created for each language pair, and so is the Robust Combiner. For example, the En->Cs BLEU score of M2M100 alone is 20.7; adding an En->Cs datastore built from the TED En->Cs training data improves it to 22.3.

The scripts used to produce our results are attached. Some additional information that might be helpful if you want to reproduce them:

Dataset:
- Tokenization: The TED dataset from Qi et al., 2018 is preprocessed with the Moses tokenizer. Before using it with M2M100, you must first detokenize it, then apply the SPM model provided by M2M100 to get a correct SPM encoding.
- Binarization: Note that the dict file used to run fairseq-preprocess is named data_dict.128k.txt.
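The tokenization step above could be sketched roughly as follows. This is a minimal sketch, not the attached script: it assumes the `sacremoses` and `sentencepiece` Python packages are installed, and the function names and the model path in the usage note are hypothetical.

```python
def preprocess_line(line, detokenize, encode):
    """Undo Moses tokenization on one line, then re-encode it into subword pieces.

    `detokenize` maps a list of tokens to a string; `encode` maps a string
    to a list of subword pieces. Both are passed in so they can be swapped.
    """
    tokens = line.strip().split()
    text = detokenize(tokens)
    return " ".join(encode(text))


def build_m2m100_preprocessor(spm_model_path, lang):
    """Return a function that turns a Moses-tokenized line into an SPM-encoded line.

    Assumption: sacremoses and sentencepiece are available. They are imported
    here, not at module level, so the helper above works without them.
    """
    from sacremoses import MosesDetokenizer
    import sentencepiece as spm

    detok = MosesDetokenizer(lang=lang)
    sp = spm.SentencePieceProcessor(model_file=spm_model_path)
    return lambda line: preprocess_line(
        line,
        detok.detokenize,                            # list of tokens -> plain text
        lambda text: sp.encode(text, out_type=str),  # plain text -> subword pieces
    )
```

A call might look like `encode_en = build_m2m100_preprocessor("path/to/spm.model", "en")`, where the path is a placeholder for the SPM model shipped with M2M100. The encoded lines would then be written to a file and passed to fairseq-preprocess.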
Datastore: The datastore sizes are given below; "1.3M" means 1.3 million key-value pairs. The dimension of each key is 1024.

	cs	da	de	es	fr	it	nl	pl	pt	sv
En-X	2.9M	1.2M	4.6M	5.1M	5.8M	5.6M	4.7M	4.7M	1.2M	1.4M
X-En	2.6M	1.1M	4.3M	5.0M	4.9M	5.3M	4.6M	4.5M	1.2M	1.3M
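For a rough idea of how much storage these datastores take, the sizes above can be turned into a back-of-the-envelope estimate. The sketch below assumes keys are stored as float16 (2 bytes per dimension) and values as 8-byte integers. Neither is stated in this thread, so treat both as placeholders and adjust them to your setup.

```python
KEY_DIM = 1024  # key dimension stated above

# En-X datastore sizes from the table above, in millions of key-value pairs.
EN_X_MILLIONS = {
    "cs": 2.9, "da": 1.2, "de": 4.6, "es": 5.1, "fr": 5.8,
    "it": 5.6, "nl": 4.7, "pl": 4.7, "pt": 1.2, "sv": 1.4,
}


def datastore_bytes(num_pairs, key_dim=KEY_DIM, key_bytes=2, value_bytes=8):
    """Approximate raw size of the keys and values in bytes.

    key_bytes=2 (float16) and value_bytes=8 (int64) are assumptions,
    not values confirmed in this thread.
    """
    return num_pairs * (key_dim * key_bytes + value_bytes)


for lang, millions in EN_X_MILLIONS.items():
    num_pairs = round(millions * 1_000_000)
    size_gib = datastore_bytes(num_pairs) / 2**30
    print(f"En-{lang}: {size_gib:.1f} GiB")
```

This counts only the raw key and value arrays; any search index built on top of them adds to the total.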

Model: M2M100 418M is a transformer_wmt_en_de_big model using the task translation_multi_simple_epoch, with some model configuration changes. These are the arguments that make M2M100 special:

--task translation_multi_simple_epoch \
--source-lang $SRC --target-lang $TGT \
--lang-pairs $PROJECT_PATH/model/language_pairs_small_models.txt \
--fixed-dictionary $PROJECT_PATH/model/model_dict.128k.txt \
--encoder-normalize-before \
--decoder-normalize-before \
--dropout 0.3 \
--attention-dropout 0.1 \
--encoder-layers 12 \
--decoder-layers 12 \
--encoder-layerdrop 0.05 \
--decoder-layerdrop 0.05 \
--share-decoder-input-output-embed \
--share-all-embeddings \
--encoder-langtok src \
--decoder-langtok
