Comments (3)

Maxwell-Lyu commented on June 11, 2024

Hi, thanks for using kNN-BOX. Here's a short reply to your questions:

  • Yes.
  • Yes.

Detailed reply with more information:

  • Yes. We used the smallest version because of resource limitations when M2M100 is combined with Robust kNN-MT. For your reference, 25GB+ of GPU memory is needed to train the Robust Combiner for M2M100-418M.
  • Yes. A separate datastore is created for each language pair, and so is the RobustCombiner. On En->Cs, M2M100 alone scores 20.7 BLEU; adding an En-Cs datastore built from the TED En->Cs training data improves this to 22.3.

The scripts used to produce our results are attached. Some additional information that might be helpful if you want to reproduce our results:

  • Dataset:
    • Tokenization: The TED dataset from Qi et al., 2018 is preprocessed with the Moses tokenizer. Before using it with M2M100, you must first detokenize it and then apply the SPM model provided by M2M100 to get a correct SentencePiece encoding.
    • Binarization: Note that the dict file used to run fairseq-preprocess is named data_dict.128k.txt.
  • Datastore: The datastore sizes are listed below; "1.3M" means 1.3 million key-value pairs. The key dimension is 1024.
    Pair  cs    da    de    es    fr    it    nl    pl    pt    sv
    En-X  2.9M  1.2M  4.6M  5.1M  5.8M  5.6M  4.7M  4.7M  1.2M  1.4M
    X-En  2.6M  1.1M  4.3M  5.0M  4.9M  5.3M  4.6M  4.5M  1.2M  1.3M
  • Model: M2M100 418M is a transformer_wmt_en_de_big model used with the task translation_multi_simple_epoch and some model configuration changes. These are the arguments that make M2M100 special:
--task translation_multi_simple_epoch \
--source-lang $SRC --target-lang $TGT \
--lang-pairs $PROJECT_PATH/model/language_pairs_small_models.txt \
--fixed-dictionary $PROJECT_PATH/model/model_dict.128k.txt \
--encoder-normalize-before \
--decoder-normalize-before \
--dropout 0.3 \
--attention-dropout 0.1 \
--encoder-layers 12 \
--decoder-layers 12 \
--encoder-layerdrop 0.05 \
--decoder-layerdrop 0.05 \
--share-decoder-input-output-embed \
--share-all-embeddings \
--encoder-langtok src \
--decoder-langtok \
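
For readers following the Tokenization note above, here is a minimal sketch of the detokenize-then-SPM step. It is not taken from the attached scripts. It assumes the sacremoses and sentencepiece Python packages are installed; the function names (retokenize, load_tools) and any model path you pass in are illustrative only.

```python
from typing import Callable, Iterable, Iterator, List


def retokenize(
    lines: Iterable[str],
    detokenize: Callable[[List[str]], str],
    encode: Callable[[str], List[str]],
) -> Iterator[str]:
    """Turn Moses-tokenized lines into space-joined SentencePiece pieces."""
    for line in lines:
        raw = detokenize(line.split())  # undo the Moses tokenization
        yield " ".join(encode(raw))     # re-encode with the model's SPM


def load_tools(spm_model_path: str, lang: str):
    """Build the real detokenizer and encoder (hypothetical helper).

    Imports are done lazily so retokenize() can be used or tested
    without sacremoses / sentencepiece being installed.
    """
    from sacremoses import MosesDetokenizer
    import sentencepiece as spm

    detok = MosesDetokenizer(lang=lang)
    sp = spm.SentencePieceProcessor(model_file=spm_model_path)
    return detok.detokenize, lambda text: sp.encode(text, out_type=str)
```

Typical use would be to call load_tools with the path to the SPM model shipped with M2M100 and the language code of the file, then write each string yielded by retokenize to a new file before running fairseq-preprocess. Keeping the detokenizer and encoder as plain callables makes it easy to swap in a different tool for either step.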

from knn-box.

davidstap commented on June 11, 2024

I was able to replicate your results, thanks again!

davidstap commented on June 11, 2024

Thanks for your quick and detailed reply! I'll try again with this information :)
