mozilla / firefox-translations-training Goto Github PK

View Code? Open in Web Editor NEW

144.0 15.0 31.0 11.01 MB

Training pipelines for Firefox Translations neural machine translation models

Home Page: https://mozilla.github.io/firefox-translations-training/

License: Mozilla Public License 2.0

Shell 9.28% Python 88.30% Perl 0.53% Makefile 0.45% Dockerfile 1.44%

machine-translation neural-machine-translation machine-learning ml

firefox-translations-training's Introduction

Firefox Translations training

Training pipelines for Firefox Translations machine translation models.

The trained models are hosted in firefox-translations-models repository, compatible with bergamot-translator and power the Firefox web page translation starting with version 118.

The pipeline was originally developed as a part of Bergamot project that focuses on improving client-side machine translation in a web browser.

Documentation

Pipeline

The pipeline is capable of training a translation model for a language pair end to end. Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters. Some settings, especially low resource languages might require extra tuning.

We use fast translation engine Marian.

You can find more details about the pipeline steps in the documentation.

Orchestrators

An orchestrator is responsible for workflow management and parallelization.

Taskcluster - Mozilla task execution framework. It is also used for Firefox CI. It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability. Usage instructions.
Snakemake - a file based orchestrator that allows to run the pipeline locally or on a Slurm cluster. Usage instructions. (The integration is not maintained since Mozilla has switched to Taskcluster. Contributions are welcome.)

Experiment tracking

Public training dashboard in Weights & Biases

Marian training metrics are parsed from logs and published using a custom module within the tracking directory. More information is available here.

Learning resources

High level overview post on Mozilla Hacks
Model training guide - practical advice on how to use the pipeline
Reference papers

Acknowledgements

This project uses materials developed by:

Bergamot project (github, website) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303
HPLT project (github, website) that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]
OPUS-MT project (github, website)
Many other open source projects and research papers (see References)

firefox-translations-training's People

Contributors

Stargazers

Watchers

firefox-translations-training's Issues

Bicleaner path is hardcoded even if bicleaner is not used, resulting in crashes

firefox-translations-training/Snakefile

Line 387 in 3b3f33b

params: prefix_train=f"{biclean}/corpus",prefix_test=f"{original}/devset",

If we try to use a language pair without bicleaner, the training will crash because the corpus location is invalid.

spm vocabulary training could fail on high res languages

We don't necessarily want to train the vocabulary on all corpora, because for high res language the vocabulary step would fail eg:

tcmalloc: large alloc 1342177280 bytes == 0x559ec7632000 @ 
trainer_interface.cc(137) LOG(INFO) Loaded 17000000 lines
trainer_interface.cc(137) LOG(INFO) Loaded 18000000 lines
trainer_interface.cc(137) LOG(INFO) Loaded 19000000 lines
trainer_interface.cc(137) LOG(INFO) Loaded 20000000 lines
trainer_interface.cc(112) LOG(WARNING) Too many sentences are loaded! (20000000), which may slow down training.
trainer_interface.cc(114) LOG(WARNING) Consider using --input_sentence_size=<size> and --shuffle_input_sentence=true.
trainer_interface.cc(117) LOG(WARNING) They allow to randomly sample <size> sentences from the entire corpus.
trainer_interface.cc(376) LOG(INFO) Loaded all 20000000 sentences
trainer_interface.cc(391) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(391) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(396) LOG(INFO) Normalizing sentences...
trainer_interface.cc(457) LOG(INFO) all chars count=3847586460
trainer_interface.cc(468) LOG(INFO) Done: 99.954% characters are covered.
trainer_interface.cc(478) LOG(INFO) Alphabet size=87
trainer_interface.cc(479) LOG(INFO) Final character coverage=0.99954
trainer_interface.cc(511) LOG(INFO) Done! preprocessed 20000000 sentences.
tcmalloc: large alloc 2147483648 bytes == 0x55a075720000 @ 
tcmalloc: large alloc 4294967296 bytes == 0x55a0f5720000 @ 
tcmalloc: large alloc 8589934592 bytes == 0x55a1f5faa000 @ 
tcmalloc: large alloc 17179869184 bytes == 0x55a3f67aa000 @ 
unigram_model_trainer.cc(124) [(array.size()) <= (static_cast<size_t>(std::numeric_limits<node_int_type>::max()))] Input corpus too large, try with train_extremely_large_corpus=true
Program terminated with an unrecoverable error.

Furthermore, we don't usually want to train on large dirty corpora (small outlier inputs would be filtered out by noise anyways, but it slows down training unnecessary).

Add bicleaner step

https://github.com/bitextor/bicleaner-ai
https://github.com/bitextor/bicleaner

User bicleaner and bicleaner-ai for additional cleaner of a parallel corpus

deescape-special-chars.perl should be part of the clean step

Some of the corpora on the internet are there from the moses era where we couldn't have any special characters in the phrase tables and exist in a preprocessed manner on Opus and other sources. To fix that we should use https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/deescape-special-chars.perl in our pipeline.

Crash on attempt to get datasets from Opus

Hi, I'm testing the latest version with nl-en/en-nl models and I run into several crashes like this one:

[Thu Jan  6 20:40:34 2022]
Job 140: Splitting monolingual trg dataset
Reason: Missing output files: /mnt/mimir0/nbogoych/data/data/en-nl/snakemake-en-nl/translated/mono_trg
Downstream jobs will be updated after completion.

Activating conda environment: /mnt/mimir0/firefox-translations-training/.snakemake/conda/316f8d637776ef821bee8959d04cdacd
InputFunctionException in line 566 of /mnt/mimir0/firefox-translations-training/Snakefile:
Error:
  AttributeError: 'NoneType' object has no attribute 'get'
Wildcards:
  dataset=opus_DGT/v2019
Traceback:
  File "/mnt/mimir0/firefox-translations-training/Snakefile", line 597, in <lambda>

The dataset could be any, I got the same error one with CCMatrix or ELRA.

Ideas?

Here's my config:

$ cat configs/config.nlen.yml 
####
# Example of a production config
# Change language pair, experiment name, datasets and other settings if needed
# Training low resource languages might require more tuning of pipeline/training/configs
###


# These settings depend on execution environment
# They are set in the Makefile
root: ""
cuda: ""
deps: false
gpus: ""
numgpus: ""
workspace: ""
mariancmake: ""


experiment:
  name: snakemake-nl-en
  src: nl
  trg: en

  teacher-ensemble: 2
  # path to a pretrained backward model (optional)
  backward-model: ""

  # limits per downloaded dataset
  mono-max-sentences-src: 100000000
  mono-max-sentences-trg: 20000000
  # split corpus to parallelize translation
  split-length: 2000000
  # vocab training sample
  spm-sample-size: 10000000

  best-model: chrf

  bicleaner:
    default-threshold: 0
    dataset-thresholds:
      # 0 = skip filtering
      #opus_ParaCrawl/v8: 0


marian-args:
# these configs override pipeline/train/configs
  training-backward:
    # change based on available training data
    after: 10e
  training-teacher-all:
    # remove for low resource languages or if training without augmentation
    after: 5e
# these configs override pipeline/translate/decoder.yml
  decoding-backward:
    # 12 Gb GPU, s2s model
    mini-batch-words: 2000
    # 2080ti or newer
    precision: float16
  decoding-teacher:
    # 12 Gb GPU, ensemble of 2 teachers
    mini-batch-words: 1000
    # 2080ti or newer
    precision: float16


datasets:
  # parallel training corpus
  train:
    - opus_bible-uedin/v1
    - opus_Wikipedia/v1.0
    - opus_OpenSubtitles/v2018
    - opus_ELRC_2922/v1
    - opus_QED/v2.0a
    - opus_WikiMatrix/v1
    - opus_CCMatrix/v1
    - opus_ELRC_2923/v1
    - opus_JRC-Acquis/v3.0
    - opus_TED2013/v1.1
    - opus_EUconst/v1
    - opus_GlobalVoices/v2018q4
    - opus_ELITR-ECA/v1
    - opus_CCAligned/v1
    - opus_GNOME/v1
    - opus_wikimedia/v20210402
    - opus_ECB/v1
    - opus_ELRA-W0301/v1
    - opus_ParaCrawl/v8
    - opus_Tanzil/v1
    - opus_KDE4/v2
    - opus_EUbookshop/v2
    - opus_News-Commentary/v16
    - opus_XLEnt/v1.1
    - opus_Books/v1
    - opus_ELRC_3382/v1
    - opus_TED2020/v1
    - opus_Tatoeba/v2021-07-22
    - opus_EMEA/v3
    - opus_DGT/v2019
    - opus_Europarl/v8
    - opus_Ubuntu/v14.10
    - opus_TildeMODEL/v2018
    - opus_PHP/v1
  # datasets to merge for validation while training
  devtest:
    - custom-corpus_/mnt/elli0/nbogoych/eu-dcep.tail3k
  # datasets for evaluation
  test:
    - custom-corpus_/mnt/elli0/nbogoych/eu-dcep.tail3k
  # monolingual datasets (ex. paracrawl-mono_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
  # to be translated by the teacher model
  mono-src:
    - news-crawl_news.2020
    - news-crawl_news.2019
    - news-crawl_news.2018
  # to be translated by the backward model to augment teacher corpus with back-translations
  # leave empty to skip augmentation step (high resource languages)
  mono-trg:
    - news-crawl_news.2020
    - news-crawl_news.2019
    - news-crawl_news.2018

Add dataset specific bicleaner thresholds

There should be a default value for bicleaner threshold and the ability to override it for specific datasets.

Workflow manager integration

A workflow manager should allow us to:

Run the pipeline on a cluster
Represent pipeline steps as a DAG
Parallelize some steps across multiple machines (for example translation)
Utilize resources better - schedule CPU only tasks on CPU machines
Improve observability and monitoring
Schedule training jobs

It should provide extensibility to be integrated with different platforms:

Snakepit - the main platform for the machine translation model training at Mozilla at this point.
GCP - the cloud platform that is used by Mozilla, possible integration in future. Also most likely the workflow scheduler will live there.
Slurm - cluster management solution that is used by universities and installed on HPCs.

https://snakemake.readthedocs.io/en/stable/ is one of the candidates for integration

Support any importer for evaluation

Right now only SacreBLEU data is supported. It's not enough for language pairs that are not official WMT ones.

Remove punctuation normalizaiton

Ulrich:

https://github.com/mozilla/firefox-translations-training/blob/main/pipeline/clean/clean-corpus.sh uses the normalize-punctuation perl script from Moses. I would advise against using that, because using it forces us to do the same hacky preprocessing for actual translation (otherwise the translator may encounter characters it has never seen). On the target side I'm fine with standardization of quotation marks as part of data preparation (but these should match language-specific standards, not use Ascii quotes for everything). I'm not sure though, if it is worth the effort in practice. Post-translation punctuation normalization is one of the usual tricks for improving BLEU scores in shared tasks / competitions, while most humans probably aren't bothered too much if quotation marks aren't quite right.

Kenenth:

Just remove
perl ${CLEAN_TOOLS}/normalize-punctuation.perl -l ${lng}" |

Allow expert user to configure mini-batch size when decoding

The default mini batch size setting:
https://github.com/eu9ene/bergamot-training/blob/quality/pipeline/translate/decoder.yml

normalize: 1.0
word-penalty: 0
mini-batch: 16
mini-batch-words: 2000
maxi-batch: 1000
maxi-batch-sort: src
max-length: 200
max-length-crop: true
beam-size: 8
quiet-translation: True

Is not adequate for all models involved. The s2s model used for backtranslation could easily have 512 mini-batch size on a 3090ti and much larger maxi-batch and mini-batch words. That would yield up to 4X improvement in translation speed, as shown on table 4: http://statmt.org/wmt21/pdf/2021.wmt-1.74.pdf

It is good to have safe default values for non-expert users, but for getting the most of our limited hardware budget, those need to be easily adjustable configuration options. Also the settings for the teacher models and for the backtranslation models would have to be different.

Add dataset specific fixes

Check https://github.com/ZJaume/clean

There should be a possibility to add scripts that fix a dataset based on its name. It should be easily extensible to new datasets.

Teacher does not continue training if training on augmented data was early stopped

If ce-mean-words was stalled 10 times on teacher_all step , continue_teacher step fails with error, it stops without training. It's fine to early stop the pertaining, however, continuation should load only weights of the pre-trained model but not optimizer parameters.

install-deps should be optional as it requires root

https://github.com/mozilla/firefox-translations-training/blob/main/pipeline/setup/install-deps.sh crashes if the user doesn't have root access /apt is configured to be run as a user.

I can see that

firefox-translations-training/Makefile

Line 77 in 174ccea

--config $(CONFIG_OPTIONS) deps=true \

This step is hardcoded. It should be made optional at the top of the makefile?

Custom training/test dataset

Add an importer that can use any file on disk. Prefix - custom_

Handle soft hyphens with custom normalization tables

Ulrich:

The SentencePiece tokenizer should probably be trained with a custom normalization table (see the SentencePiece documentation) that removes soft hyphens in addition to the existing normalization steps.

It requires more clarification whether we need this or not.

Cache datasets across experiments and language pairs

Right now datasets are redownloaded for every new experiment.

The data should be cached in a common folder to be reused across multiple experiments and reversed language pairs.

Fix Chinese compatibility

Check that Chinese language (zh-CN) training works and fix encountered issues.

Add flores dataset importer

It is very useful for the evaluation of low-resource languages.

https://github.com/facebookresearch/flores

Add a script to find datasets

It should be able to produce datasets based on a language pair and importer type to be used in the config.

Add more languages to corpus cleaning script

we can use a list from wiki to add valid characters for other languages: https://en.wikipedia.org/wiki/Wikipedia:Language_recognition_chart

Update mtdata version

It allows using paracrawl v9

Custom corpus works fine as a dev set but not as a test set.

I was trying to train an en-bg model and I got a failure in the eval backward model step.

Since Bulgarian doesn't have an official dataset, I used a custom one. (Here's a snipped of my config):

...
datasets:
  # parallel corpus
  train:
    - opus_Ubuntu/v14.10
    ...
    - opus_DGT/v2019
    - opus_ELRC_2682/v1
  devtest:
    - custom-corpus_/mnt/hrist0/nbogoych/government.bg
  test:
    - custom-corpus_/mnt/hrist0/nbogoych/government.bg
  # monolingual datasets (ex. paracrawl-mono_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
  # to be translated by the teacher model
  mono-src:
    - news-crawl_news.2020
    - news-crawl_news.2019
    ...

Snakemake fails at:

[Wed Dec  1 20:41:15 2021]                                                                                                                                                                                                                                                                           
Error in rule eval_backward:                                                                                                                                                                                                                                                                         
    jobid: 169                                                                                                                                                                                                                                                                                       
    output: /mnt/hrist0/nbogoych/data/models/en-bg/snakemake-en-bg/evaluation/s2s                                                                                                                                                                                                                    
    log: /mnt/hrist0/nbogoych/data/logs/en-bg/snakemake-en-bg/eval_backward.log (check log file(s) for error message)                                                                                                                                                                                
    conda-env: /mnt/hrist0/nbogoych/firefox-translations-training/.snakemake/conda/ba5ec19af3d16034e3cbf600bb7b34dd                                                                                                                                                                                  
    shell:                                                                                                                                                                                                                                                                                           
        bash pipeline/train/eval.sh "/mnt/hrist0/nbogoych/data/models/en-bg/snakemake-en-bg/evaluation/s2s" "/mnt/hrist0/nbogoych/data/data/en-bg/snakemake-en-bg/original/eval" bg en /mnt/hrist0/nbogoych/data/models/en-bg/snakemake-en-bg/s2s/model.npz.best-chrf.npz >> /mnt/hrist0/nbogoych/data/logs/en-bg/snakemake-en-bg/eval_backward.log 2>&1                                                                                                                                                                                                                                                  
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                     
Removing output files of failed job eval_backward since they might be corrupted:

The log contains:

+ set -euo pipefail
+ echo '###### Evaluation of a model'
###### Evaluation of a model
+ test -v GPUS
+ test -v MARIAN
+ test -v WORKSPACE
+ eval_dir=/mnt/hrist0/nbogoych/data/models/en-bg/snakemake-en-bg/evaluation/s2s
+ datasets_dir=/mnt/hrist0/nbogoych/data/data/en-bg/snakemake-en-bg/original/eval
+ src=bg
+ trg=en
+ models=("${@:5}")
+ mkdir -p /mnt/hrist0/nbogoych/data/models/en-bg/snakemake-en-bg/evaluation/s2s
+ echo '### Evaluating the model'
### Evaluating the model
+ for src_path in "${datasets_dir}"/*."${src}.gz"
++ basename '/mnt/hrist0/nbogoych/data/data/en-bg/snakemake-en-bg/original/eval/*.bg.gz' .bg.gz
+ prefix='*'
+ echo '### Evaluating * bg-en'
### Evaluating * bg-en
+ pigz -dc '/mnt/hrist0/nbogoych/data/data/en-bg/snakemake-en-bg/original/eval/*.bg.gz'
pigz: skipping: /mnt/hrist0/nbogoych/data/data/en-bg/snakemake-en-bg/original/eval/*.bg.gz does not exist

The directory structure:

/mnt/hrist0/nbogoych/data/data/en-bg/snakemake-en-bg/original/eval$ tree
.
`-- custom-corpus_
    `-- mnt
        `-- hrist0
            `-- nbogoych
                |-- government.bg.bg.gz
                `-- government.bg.en.gz

4 directories, 2 files

The devset on the other hand looks differently:

/mnt/hrist0/nbogoych/data/data/en-bg/snakemake-en-bg/original/devset$ tree
.
|-- custom-corpus_
|   `-- mnt
|       `-- hrist0
|           `-- nbogoych
|               |-- government.bg.bg.gz
|               `-- government.bg.en.gz
`-- merge.enbg.gz

4 directories, 3 files

Since i use the same entries for dev and test sets, I can't imagine it's a config file issue?

Forward (and back) translation should be done in fp16 mode if the GPU supports it.

When producing the necessary data for training the student and if decoding on a recent GPU (2080ti or newer, NOT a CPU) we should use float16 mode. Float16 mode gives about 30% improvement in translation speed at basically no loss of BLEU and also allows for much larger mini-batches. This will save significant amount of time and money.

Out of memory on shuffling huge datasets

300M dataset, 128 GB RAM

the workaround is to shuffle dataset after the merge step, disable --shuffle-in-ram and use --shuffle batches

find-corpus.py is outdated?

find-corpus.py is now located at ./utils/ as opposed to ./pipeline/utils/, the readme should be updated.

It also can't find a module named mtdata:

$ python ./utils/find-corpus.py en ru mtdata
Traceback (most recent call last):
  File "/home/nikolay/firefox-translations-training/./utils/find-corpus.py", line 36, in <module>
    from mtdata.main import LangPair
ModuleNotFoundError: No module named 'mtdata'

Works fine with OPUS though.

Add ability to use custom backward model

It is possible to reuse any backward model for back-translations and ce-filter to not train s2s from scratch for every experiment.

TRG lang should also be checked whether it obeys the ratio metric.

firefox-translations-training/pipeline/clean/tools/clean_parallel.py

Line 103 in 3b3f33b

if src_lang in CHARS:

Missing TRG_LANG case here.

Checkpointing training

A good high-resource model can't be trained in 24 (or even 36) hours on one node of CSD3. We need to checkpoint and allow longer training. Pulling issue out of #52. Normally we submit #SBATCH --array=1-7%1 to allow the job to have up to 7 runs interrupted by training. Not sure how to fit this (or an equivalent) into snakemake. Maybe a kludge is producing numbered files as "output" to make a sequence of steps, though producing this file should happen late in the job.

Add support of ensemble of teachers for translation

It's questionable how much improvement of student quality this approach can give. At the same time translation with even one teacher is very slow, so translation using, say 4 teachers, would be quite impractical, unless better scalability tools are implemented. However, it is still possible to implement in the current architecture.

Improve tensorboard

Better integrate with the pipeline settings
Automatically discover models in MODELS_DIR
Remove intermediate file
Do not require to restart the script when a new model was added

Version of marian used for training

We should be training with marian-nmt/marian-dev HEAD (current head is safe), not with browsermt/marian-dev. browsermt/marian-dev should only be used for binarizing the intgemm models that run into the browser.

Track experiments metadata

Just copy config.sh on start of every experiment to check later what datasets and parameters were used.

We might also add any useful experiment metadata to metadata.json. For example, commit hash.

Proposed directory structure:

├ experiments
│   ├ ru-en
│   │   └ experiment-name
│   │        └ metadata.json
│   │        └ config.sh

bicleaner should be optional for some corpora

Currently the bicleaner step in the pipeline is mandatory, but it's not necessarily appropriate for all cases

Some corpora such as paracrawl have already been cleaned by bicleaner, so there is no point of going through them again.
Some corpora are already quite clean and do not need the full bicleaner treatment.

bicleaner should be an optional step for every corpus.

Add a check for to verify datasets aren't already tokenized (similar to JW300 issue)

Me and @kpu both noticed the issue that sometimes tokenized texts appear in the wild resulting in our models learning to place space around punctuation marks which just looks bad to the users. Example: What 's your name ?

I proposed running blanked de-tokenization on the English side, but @kpu pointed out that this could mess up legitimately formatted text.

One idea would be to apply a subset of detokenization rules, such as removing spaces between words and punctuation marks.

We could also have an option in the config file to run a de-tokenization for some corpora, similar to bicleaner. We could also have the preprocessing script look for cases of suspected tokenization and warn the user.

Bicleaner Ai multilingual model

I've seen that you integrate Bicleaner in your pipeline and you choose the tool based on available languages. FYI there's the full en-xx model that is trained with the concatenation of all other languages and performs quite decent on unseen languages. I've been using it for Turkish, Albanian and Macedonian, and it seemed nice when eyeballing. Unfortunately I don't have a comparison between using and not using it.

Change sharding mode in training to improve speed

We should marian-nmt/marian-dev for our training (current HEAD), as it has a bunch of fixes and performance improvements. One of those is a change in the sharding mode: https://github.com/marian-nmt/marian-dev/blob/master/src/common/config_parser.cpp#L828 achieved by setting: sharding: local, as it noticeably improves training speed.

Downside of using this is slightly increased memory consumption. If we have the memory available in the GPU, we should use it.

Add chrF evaluation metric

Snakemake shouldn't invalidate future steps in case some files were changed (devsets, marian binaries, etc..)

Currently, if Snakemake detects a change in a file that future steps are dependent on, it will invalidate all previous steps. However, it is not necessarely to do so in all cases. Two examples:

marian binary is rebuilt due to system upgrade. The previous trained models and vocabulary files are still valid and shouldn't need invalidation
Devset/test set changed. The models built are still valid and shouldn't need to be rebuilt from scratch, just because a different devset was given. If the user feels that the models should be rebuilt from scratch, they can just delete the relevant files by themselves.

In the case of 2), it is also possible that we would want to resume training of existing models, rather than start training them from scratch. This behaviour would also be preferred to the current one.

bicleaner-ai crashes with out of GPU memory, when running with more than one jobs per node

bicleaner-ai uses tensorflow which tends to reserve all the memory on a single GPU (even if it doesn't need it). As a result, cleaning several different parallel corpora will result in a crash due to running out of GPU memory (or memory corruption).

Recommendation: use CUDA_VISIBLE_DEVICES to mask the GPUs and run only number of GPUs jobs, with each job taking a designated GPU.

Bicleaner applies default threshold instead of skipping dataset

in case 0 threshold is specified and cleaning is supposed to be skipped

Requirements for the python venv

Hi everyone, thanks for this nice repo :).

I wanted to know if it's possible to share the requirements needed for the loval python (conda) venv that is sourced here ? I would help for reproducibility. Fasttext is being used in this script, which mean at least one non built-in python module is required.

Thanks :)

Support Chinese discussion

Chinese poses several unique challenges not present in other language pairs. I will start this mega-issue and update the individual points that need to happen for those languages to be fully supported

Language detection: Some Chinese corpora are not tagged with zh, but with zh_{tw,zh,hk...} etc. it would be helpful if find_corpus.py checks for those when checking for zh..
Chinese script comes in traditional and simplified variety. Most big translation vendors support both. Converting traditional to simplified (and vice versa) can be easily achieved through hanzi-conv https://pypi.org/project/hanziconv/0.3/ . There might be a very small information loss when converting simplified to traditional, but it should be fine in 99.9% of the cases. Some datasets such as ted talks come in traditional, so they should be converted before using.
Preprocessing filters:

firefox-translations-training/pipeline/clean/tools/clean_parallel.py

Line 51 in 3b3f33b

'sv': r'[a-zÅåÄäÖö]',

Chinese alphabet should be added. In general we can use a unicode ranges to do so, but they are somewhat complicated: https://stackoverflow.com/questions/43418812/check-whether-a-string-contains-japanese-chinese-characters In the past i have used something like u'[\u4e00-\u9fff]', but this may be improved.
Segmentation. Chinese text is typically inputted unsegmented, however some of the datasets online contain segmentation. We should use a de-segmentation script like this one (this script also tries to fix some datasets in Chinese finishing in a comma as opposed to a fulstop, but this can be extracted away from the script):

#!/usr/bin/env python

import re
import sys


re_space = re.compile(r"(?<![a-zA-Z])\s(?![a-zA-Z])", flags=re.UNICODE)
re_final_comma = re.compile("\.$")


for line in sys.stdin:
    line = line[:-1] #EoL
    line = line.strip()
    line.replace(' ', "")
    if line[-1] == '，':
        line = line[:-1] + u"\u3002"
    if line[-1] == ',':
        line = line[:-1] + '.'
    if line[-1] == ' ':
        line = line[:-1]
    line = re_space.sub("", line)
    line = line.replace(",", u"\uFF0C")
    line = re_final_comma.sub(u"\u3002", line)
    print(line)

This script essentially tries to identify Chinese characters and remove in spaces between them. It can probably be improved as currently the space between English words is lost, whereas we should write something more complicated that detects a continues substring of English words and leaves them alone.

Length filtering. As Chinese sentences come normally as one continuous string of characters, traditional length filtering doesn't work. Furthermore, as one word can be made of 1-4 Chinese characters, we can't have some hard-and-fast conversion rule. What people normally do is they use a Chinese tokenizer (like jieba https://github.com/fxsjy/jieba#jieba-1 ) to split the Chinese text to words. We can then safely apply the filtering here:

firefox-translations-training/pipeline/clean/tools/clean_parallel.py

Line 93 in 3b3f33b

ratio_len = src_len / float(trg_len)

Most papers recommend to discard lines where the ratio of English to Chinese or Chinese to English words is more than 1.3

Afterwards the text should be de-segmented again and prepared for training

Corpus specific fixes. The UN corpus doesn't contain fulstops (for example) and we use something like this to fix it:

import sys

for line in sys.stdin:
    line = line[:-1] #EoL
    if line[-1] == ',':
        line = line[:-1] + '.'
    if line[-1] == ' ':
        line = line[:-1]
    print(line)

(This script is integrated in the previous copy/paste of script).

All of these steps except 2) Apply to Japanese as well. Japanese tokenizer should be used in place of jieba for japanese.

Group jobs request too many cores on slurm

This issue is important only for HPC training where we don't want jobs to be too small, so we have to group them. It is even beneficial to have smaller jobs on a regular cluster to increase utilization, overhead on creating a new job is minimal.

Replace absolute paths with relative ones

Feedback:

Given the name WORKDIR , I'd expect it means the directory where the scripts are doing work like training or whatever. But it appears to mean I have to tell the scripts where they are. Why not the best practice of scripts finding their dependencies relative to themselves? If you want it rooted at the top of the scripts repo, then just tell each script at the top how deep it is.

It seems it is not the best practice to expect all scripts to know their location in terms of an absolute path, even though it is easier to debug, so we can replace this approach with relative paths.
Also WORKDIR name is confusing and can be renamed to something else.

Train both directions at once

Currently, it's difficult to reuse data between two translation directions as majority of the files are placed in different directories

firefox-translations-training/configs/config.prod.yml

Line 18 in 3b3f33b

name: prod

eg: exp-name/src-trg, meaning that all datasets will be redownloaded.

Furthermore data cleaning is done via concating src and target files and is asymetrical at places: #41 (comment)

In practise preprocessing can be symmetrical and once a good student model is trained in one direction, it may even be used for producing backtranslations in the other automatically (prior to quantising). By training src-trg and trg-src at the same time, we can avoid data duplication, lengthy and space consuming preprocessing, training vocabulary and training one separate model

Fine tune teacher on parallel data only

When using backtranslations + parallel data for teacher training, continue training on parallel data only.

Evaluate ensemble of teachers

Make it possible to assign non-consecutive GPUs depending on what we have available on the machine

Currently GPUs are assigned by this:

firefox-translations-training/Snakefile

Line 118 in a09b0ac

gpus = ' '.join([str(n) for n in range(int(gpus_num))])

On some non-cluster shared machines, we may want to use a subset of the GPUs on the system for training, which is not possible with the current system. Instead we should be able to pass something like GPUS={2, 3, 4, 5} if this is what we have free on our system.

Wrong tcol when cleaning with Bicleaner

Shouldn't this line say --scol 1 --tcol 2?

Try other teacher hyperparameters

An example of teacher hyperparameters from Kenneth (requires the latest Marian):

devices:
 - 0
 - 1
 - 2
 - 3

workspace: 12000
log: train.log
model: model.npz

train-sets:
  - data/train.en-de.en.gz
  - data/train.en-de.de.gz
seed: 1131
vocabs:
  - data/vocab.ende.spm
  - data/vocab.ende.spm
task: transformer-big
shuffle-in-ram: true
# Validation set options
valid-sets:
  - data/valid.en-de.en
  - data/valid.en-de.de
valid-freq: 3000
disp-freq: 100
valid-metrics:
  - ce-mean-words
  - bleu-detok
valid-mini-batch: 8
valid-max-length: 300
valid-translation-output: wmt19.valid.output
keep-best: true
valid-log: valid.log
overwrite: true
learn-rate: 0.0003 # Turn this down if you get a diverged model, maybe 0.0001
optimizer-delay: 2 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer

If they work better, we should update the teacher config