browsermt / students

Efficient teacher-student models and scripts to make them

License: Other

students's People

Contributors: afaji, felipesantosk, godnoken, jelmervdl, kaleidoescape, kpu, pacien, pdehaan, pinzhenchen, qianqianzhu, snukky, sukantasen, ugermann, xapajiamnu, zjaume

students's Issues

Reduce number of Marian branches

Users of this repo have to compile three different branches of Marian:

  1. master:
    https://github.com/browsermt/students/blob/master/install.sh#L8-L9
    (I think we can remove master)

  2. Alham's quantization training branch:
    https://github.com/afaji/Marian/tree/fixed-quant

  3. Nick's intgemm_reintegrated_computestats branch

So a couple of questions:

  1. How far is @afaji's fixed-quant branch from something that can be in master?
  2. Can we merge fixed-quant and intgemm_reintegrated_computestats?

tcmalloc causes extract_stats.py to crash

"In very rare cases, the extract_stats.py script might crash. If this happens, the most likely culprit is tcmalloc. If the model is too large (or the workspace too small), tcmalloc will spit tcmalloc: large alloc... to stderr which our script doesn't know how to handle. Just delete that line inside the quantmuls file and you're good to go."

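A minimal sketch of that workaround, assuming the stderr output ended up in the quantmuls file that extract_stats.py reads (the file name comes from the quote above; the script name and everything else here is illustrative):

# strip_tcmalloc.py -- drop tcmalloc warning lines from a quantmuls dump so
# that extract_stats.py does not trip over them.
# Usage: python strip_tcmalloc.py quantmuls > quantmuls.clean
import sys

with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
    for line in f:
        # tcmalloc prefixes its warnings, e.g. "tcmalloc: large alloc ..."
        if line.startswith("tcmalloc:"):
            continue
        sys.stdout.write(line)
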
Citation of models

Is there a list of works to cite when using the trained models in academic work? Specifically, the cs-en and de-en models.

New de-en models

On valhalla in /fs/surtr0/heafield/{base,tiny11}-de-en there are new German->English student models based on the WMT21 submissions that include guided alignment. We should give them the 8-bit and lexical shortlist treatment, then release them here.

Model license clarification

Hello,

'Models produced by the Bergamot project are released under the CC-BY-SA 4.0 license.'

If I use the MIT-licensed software in this repo to create a model, is that model automatically under this license? Or are we strictly talking about the models released in this repo?

And if I may be so cheeky as to ask, without taking it as legal advice: is it possible to avoid releasing my own software under the CC-BY-SA 4.0 license (or another compatible license) if I include any of these Bergamot-released models (unmodified) in my software?

Thank you!

Update en-de models

The en-de models submitted to WMT21 were trained without guided alignment. They are in /fs/alvis0/heafield/student (base) and /fs/surtr0/heafield/student-tiny11 (tiny11).

  1. Retrain with guided alignment (find alignments or make them on the parallel data). See train.sh
  2. Make 8-bit and lexical shortlists
  3. Post here

Start en-pl and en-fr training

These can get a leg up by using the existing pl-en (#61) and fr-en (#60) models that have already built their teacher systems, which should then be used for back translation. Note that forward translation for making a pl-en student is the same thing as backtranslation for making an en-pl model.

See mozilla/firefox-translations-training#46 on how to short-circuit this and talk to @eu9ene as necessary.

Re-train fr-en

Trained models for fr-en suffered from broken back-translation.

Steps taken to resurrect training:

  • rsync-ed /fs/nanna0/nbogoych/firefox-translations-training into /fs/nanna0/germann/bergamot/train/firefox-translations-training
  • edited CONDA_PATH in Makefile to ${HOME}/mambaforge
  • ran make conda; make snakemake
  • changed precision to always float32 in configs/config.fr-en.yml
  • rsynced data/fr-en/fre-en-prod/clean/mono.* and data/fr-en/fr-en-prod/biclean/corpus.* into respective directory in /fs/nanna0/germann/bergamot/train/fr-en
  • rsynced data/fr-en/fre-en-prod/original/{devset.*,eval}
  • rsynced models/fr-en/fre-en-prod/{backward|vocab} for backtranslation
  • hacked the Snakefile in a truly awful way to declare all the dependencies that I didn't want to recreate as ancient (see the Snakefile sketch after this list). This is necessary because my run of snakemake recreated various tools upon which stuff depends; I kept re-running make dry-run until things started to make sense.

[ will be edited as I progress]
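For illustration, a hypothetical Snakefile fragment showing the idea behind the "declare ancient" hack (Snakemake rules are written in a Python-based DSL). Wrapping an input in ancient() tells Snakemake to ignore its timestamp, so a rebuilt tool or a re-synced model does not invalidate already-finished downstream work. The rule, file names, and command below are made up, not taken from the actual pipeline:

# Hypothetical Snakemake rule: ancient() means "trust this input even if its
# mtime is newer than the outputs", so re-synced models and rebuilt binaries
# do not force the forward/back-translation steps to run again.
rule translate_mono:
    input:
        model=ancient("models/backward/model.npz"),   # re-synced, do not rebuild
        vocab=ancient("models/vocab/vocab.spm"),
        mono="data/clean/mono.fr.gz",
    output:
        "data/translated/mono.en.gz",
    shell:
        "zcat {input.mono} | marian-decoder -m {input.model} "
        "-v {input.vocab} {input.vocab} | gzip > {output}"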

Broken back-translations in model training

Looks like we're back to square 2 (data is cleaned) for nl-en, en-nl, pl-en, fr-en ...

All trained models (i.e. everything after the backtranslation model):

translation direction | snakemake data dir | status
nl-en | /fs/elli/0nbogoych/data/data/nl-en/snakemake-nl-en/ | broken
en-nl | /fs/mimir0/nbogoych/data/data/en-nl/snakemake-en-nl/ | broken
pl-en | /fs/surtr0/nbogoych/data/data/pl-en/pl-en-prod/ | broken
fr-en | /fs/nanna0/nbogoych/data/data/fr-en/fr-en-prod/ | broken

Collapse translatelocally deployment and this repository

Currently we have two sets of tarballs: those downloaded by this students repo (which often has multiple models at once) and those downloaded by translatelocally. We should consolidate, probably by throwing out the more ad-hoc format used by the students repo.

csen

Hey,
Could we have the Czech models and scripts uploaded here in a similar fashion to the others? @ugermann @radidd

Cheers,

Nick

Is tiny student training config correct?

Hi,

I'm exploring the train-student directory to see if I can train student models for ParaCrawl. I have taken a look at the training config file for tiny models at https://github.com/browsermt/students/blob/master/train-student/models/student.tiny11/student.tiny11.yml and found some options a bit confusing. The paper says tiny students use a transformer encoder and an RNN decoder, but the config file is using:

enc-cell: gru
enc-cell-depth: 1
enc-type: bidirectional

Maybe these options are ignored when type: transformer is used, but I wanted to be sure this is the correct configuration before using it.

Segmentation fault on finetuning

Hi @afaji,

I'm getting a segmentation fault when trying to finetune with https://github.com/afaji/Marian/tree/fixed-quant.

Compiled today with: cmake .. -DUSE_SENTENCEPIECE=on -DCOMPILE_CPU=on -DUSE_FBGEMM=on

[2020-11-06 14:00:27] [marian] Marian v1.9.37 741cd865 2020-11-02 09:00:03 +0000
[2020-11-06 14:00:27] [marian] Running on host as process 2003 with command line:
[2020-11-06 14:00:27] [marian] ../../../../marian-fixedquant/build/marian --model model.finetune.npz --train-sets corpus.is.gz corpus.en.gz -T /tmp --shuffle-in-ram --guided-alignment corpus.aln.gz --vocabs vocab.spm vocab.spm --dim-vocabs 32000 32000 --max-length 200 --mini-batch-fit -w 8000 --mini-batch 1000 --maxi-batch 1000 --devices 0 1 --sync-sgd --optimizer-delay 4 --learn-rate 0.0003 --lr-report --lr-warmup 16000 --lr-decay-inv-sqrt 32000 --cost-type ce-mean-words --optimizer-params 0.9 0.98 1e-09 --clip-norm 0 --valid-freq 500 --save-freq 500 --disp-freq 100 --disp-first 10 --valid-metrics bleu-detok ce-mean-words --valid-sets devset.is devset.en --valid-translation-output devset.out --quiet-translation --valid-mini-batch 16 --beam-size 1 --normalize 1 --early-stopping 20 --overwrite --keep-best --log train.finetune.log --valid-log valid.finetune.log --quantize-bits 8
[2020-11-06 14:00:27] [config] after-batches: 0
[2020-11-06 14:00:27] [config] after-epochs: 0
[2020-11-06 14:00:27] [config] all-caps-every: 0
[2020-11-06 14:00:27] [config] allow-unk: false
[2020-11-06 14:00:27] [config] authors: false
[2020-11-06 14:00:27] [config] beam-size: 1
[2020-11-06 14:00:27] [config] bert-class-symbol: "[CLS]"
[2020-11-06 14:00:27] [config] bert-mask-symbol: "[MASK]"
[2020-11-06 14:00:27] [config] bert-masking-fraction: 0.15
[2020-11-06 14:00:27] [config] bert-sep-symbol: "[SEP]"
[2020-11-06 14:00:27] [config] bert-train-type-embeddings: true
[2020-11-06 14:00:27] [config] bert-type-vocab-size: 2
[2020-11-06 14:00:27] [config] build-info: ""
[2020-11-06 14:00:27] [config] cite: false
[2020-11-06 14:00:27] [config] clip-gemm: 0
[2020-11-06 14:00:27] [config] clip-norm: 0
[2020-11-06 14:00:27] [config] cost-scaling:
[2020-11-06 14:00:27] [config]   []
[2020-11-06 14:00:27] [config] cost-type: ce-mean-words
[2020-11-06 14:00:27] [config] cpu-threads: 0
[2020-11-06 14:00:27] [config] data-weighting: ""
[2020-11-06 14:00:27] [config] data-weighting-type: sentence
[2020-11-06 14:00:27] [config] dec-cell: ssru
[2020-11-06 14:00:27] [config] dec-cell-base-depth: 2
[2020-11-06 14:00:27] [config] dec-cell-high-depth: 1
[2020-11-06 14:00:27] [config] dec-depth: 2
[2020-11-06 14:00:27] [config] devices:
[2020-11-06 14:00:27] [config]   - 0
[2020-11-06 14:00:27] [config]   - 1
[2020-11-06 14:00:27] [config] dim-emb: 256
[2020-11-06 14:00:27] [config] dim-rnn: 1024
[2020-11-06 14:00:27] [config] dim-vocabs:
[2020-11-06 14:00:27] [config]   - 32000
[2020-11-06 14:00:27] [config]   - 32000
[2020-11-06 14:00:27] [config] disp-first: 10
[2020-11-06 14:00:27] [config] disp-freq: 100
[2020-11-06 14:00:27] [config] disp-label-counts: false
[2020-11-06 14:00:27] [config] dropout-rnn: 0
[2020-11-06 14:00:27] [config] dropout-src: 0
[2020-11-06 14:00:27] [config] dropout-trg: 0
[2020-11-06 14:00:27] [config] dump-config: ""
[2020-11-06 14:00:27] [config] early-stopping: 20
[2020-11-06 14:00:27] [config] embedding-fix-src: false
[2020-11-06 14:00:27] [config] embedding-fix-trg: false
[2020-11-06 14:00:27] [config] embedding-normalization: false
[2020-11-06 14:00:27] [config] embedding-vectors:
[2020-11-06 14:00:27] [config]   []
[2020-11-06 14:00:27] [config] enc-cell: gru
[2020-11-06 14:00:27] [config] enc-cell-depth: 1
[2020-11-06 14:00:27] [config] enc-depth: 6
[2020-11-06 14:00:27] [config] enc-type: bidirectional
[2020-11-06 14:00:27] [config] english-title-case-every: 0
[2020-11-06 14:00:27] [config] exponential-smoothing: 0
[2020-11-06 14:00:27] [config] factor-weight: 1
[2020-11-06 14:00:27] [config] grad-dropping-momentum: 0
[2020-11-06 14:00:27] [config] grad-dropping-rate: 0
[2020-11-06 14:00:27] [config] grad-dropping-warmup: 100
[2020-11-06 14:00:27] [config] gradient-checkpointing: false
[2020-11-06 14:00:27] [config] guided-alignment: corpus.aln.gz
[2020-11-06 14:00:27] [config] guided-alignment-cost: mse
[2020-11-06 14:00:27] [config] guided-alignment-weight: 0.1
[2020-11-06 14:00:27] [config] ignore-model-config: false
[2020-11-06 14:00:27] [config] input-types:
[2020-11-06 14:00:27] [config]   []
[2020-11-06 14:00:27] [config] interpolate-env-vars: false
[2020-11-06 14:00:27] [config] keep-best: true
[2020-11-06 14:00:27] [config] label-smoothing: 0
[2020-11-06 14:00:27] [config] layer-normalization: false
[2020-11-06 14:00:27] [config] learn-rate: 0.0003
[2020-11-06 14:00:27] [config] lemma-dim-emb: 0
[2020-11-06 14:00:27] [config] log: train.finetune.log
[2020-11-06 14:00:27] [config] log-level: info
[2020-11-06 14:00:27] [config] log-time-zone: ""
[2020-11-06 14:00:27] [config] lr-decay: 0
[2020-11-06 14:00:27] [config] lr-decay-freq: 50000
[2020-11-06 14:00:27] [config] lr-decay-inv-sqrt:
[2020-11-06 14:00:27] [config]   - 32000
[2020-11-06 14:00:27] [config] lr-decay-repeat-warmup: false
[2020-11-06 14:00:27] [config] lr-decay-reset-optimizer: false
[2020-11-06 14:00:27] [config] lr-decay-start:
[2020-11-06 14:00:27] [config]   - 10
[2020-11-06 14:00:27] [config]   - 1
[2020-11-06 14:00:27] [config] lr-decay-strategy: epoch+stalled
[2020-11-06 14:00:27] [config] lr-report: true
[2020-11-06 14:00:27] [config] lr-warmup: 16000
[2020-11-06 14:00:27] [config] lr-warmup-at-reload: false
[2020-11-06 14:00:27] [config] lr-warmup-cycle: false
[2020-11-06 14:00:27] [config] lr-warmup-start-rate: 0
[2020-11-06 14:00:27] [config] max-length: 200
[2020-11-06 14:00:27] [config] max-length-crop: false
[2020-11-06 14:00:27] [config] max-length-factor: 3
[2020-11-06 14:00:27] [config] maxi-batch: 1000
[2020-11-06 14:00:27] [config] maxi-batch-sort: trg
[2020-11-06 14:00:27] [config] mini-batch: 1000
[2020-11-06 14:00:27] [config] mini-batch-fit: true
[2020-11-06 14:00:27] [config] mini-batch-fit-step: 10
[2020-11-06 14:00:27] [config] mini-batch-track-lr: false
[2020-11-06 14:00:27] [config] mini-batch-warmup: 0
[2020-11-06 14:00:27] [config] mini-batch-words: 0
[2020-11-06 14:00:27] [config] mini-batch-words-ref: 0
[2020-11-06 14:00:27] [config] model: model.finetune.npz
[2020-11-06 14:00:27] [config] multi-loss-type: sum
[2020-11-06 14:00:27] [config] multi-node: false
[2020-11-06 14:00:27] [config] multi-node-overlap: true
[2020-11-06 14:00:27] [config] n-best: false
[2020-11-06 14:00:27] [config] no-nccl: false
[2020-11-06 14:00:27] [config] no-reload: false
[2020-11-06 14:00:27] [config] no-restore-corpus: false
[2020-11-06 14:00:27] [config] normalize: 1
[2020-11-06 14:00:27] [config] normalize-gradient: false
[2020-11-06 14:00:27] [config] num-devices: 0
[2020-11-06 14:00:27] [config] optimizer: adam
[2020-11-06 14:00:27] [config] optimizer-delay: 4
[2020-11-06 14:00:27] [config] optimizer-params:
[2020-11-06 14:00:27] [config]   - 0.9
[2020-11-06 14:00:27] [config]   - 0.98
[2020-11-06 14:00:27] [config]   - 1e-09
[2020-11-06 14:00:27] [config] output-omit-bias: false
[2020-11-06 14:00:27] [config] overwrite: true
[2020-11-06 14:00:27] [config] precision:
[2020-11-06 14:00:27] [config]   - float32
[2020-11-06 14:00:27] [config]   - float32
[2020-11-06 14:00:27] [config]   - float32
[2020-11-06 14:00:27] [config] pretrained-model: ""
[2020-11-06 14:00:27] [config] quantize-biases: false
[2020-11-06 14:00:27] [config] quantize-bits: 8
[2020-11-06 14:00:27] [config] quantize-log-based: false
[2020-11-06 14:00:27] [config] quantize-optimization-steps: 0
[2020-11-06 14:00:27] [config] quiet: false
[2020-11-06 14:00:27] [config] quiet-translation: true
[2020-11-06 14:00:27] [config] relative-paths: false
[2020-11-06 14:00:27] [config] right-left: false
[2020-11-06 14:00:27] [config] save-freq: 500
[2020-11-06 14:00:27] [config] seed: 0
[2020-11-06 14:00:27] [config] sentencepiece-alphas:
[2020-11-06 14:00:27] [config]   []
[2020-11-06 14:00:27] [config] sentencepiece-max-lines: 10000000
[2020-11-06 14:00:27] [config] sentencepiece-options: ""
[2020-11-06 14:00:27] [config] shuffle: data
[2020-11-06 14:00:27] [config] shuffle-in-ram: true
[2020-11-06 14:00:27] [config] sigterm: save-and-exit
[2020-11-06 14:00:27] [config] skip: false
[2020-11-06 14:00:27] [config] sqlite: ""
[2020-11-06 14:00:27] [config] sqlite-drop: false
[2020-11-06 14:00:27] [config] sync-sgd: true
[2020-11-06 14:00:27] [config] tempdir: /tmp
[2020-11-06 14:00:27] [config] tied-embeddings: false
[2020-11-06 14:00:27] [config] tied-embeddings-all: true
[2020-11-06 14:00:27] [config] tied-embeddings-src: false
[2020-11-06 14:00:27] [config] train-sets:
[2020-11-06 14:00:27] [config]   - corpus.is.gz
[2020-11-06 14:00:27] [config]   - corpus.en.gz
[2020-11-06 14:00:27] [config] transformer-aan-activation: swish
[2020-11-06 14:00:27] [config] transformer-aan-depth: 2
[2020-11-06 14:00:27] [config] transformer-aan-nogate: false
[2020-11-06 14:00:27] [config] transformer-decoder-autoreg: rnn
[2020-11-06 14:00:27] [config] transformer-depth-scaling: false
[2020-11-06 14:00:27] [config] transformer-dim-aan: 2048
[2020-11-06 14:00:27] [config] transformer-dim-ffn: 1536
[2020-11-06 14:00:27] [config] transformer-dropout: 0
[2020-11-06 14:00:27] [config] transformer-dropout-attention: 0
[2020-11-06 14:00:27] [config] transformer-dropout-ffn: 0
[2020-11-06 14:00:27] [config] transformer-ffn-activation: relu
[2020-11-06 14:00:27] [config] transformer-ffn-depth: 2
[2020-11-06 14:00:27] [config] transformer-guided-alignment-layer: last
[2020-11-06 14:00:27] [config] transformer-heads: 8
[2020-11-06 14:00:27] [config] transformer-no-projection: false
[2020-11-06 14:00:27] [config] transformer-pool: false
[2020-11-06 14:00:27] [config] transformer-postprocess: dan
[2020-11-06 14:00:27] [config] transformer-postprocess-emb: d
[2020-11-06 14:00:27] [config] transformer-postprocess-top: ""
[2020-11-06 14:00:27] [config] transformer-preprocess: ""
[2020-11-06 14:00:27] [config] transformer-tied-layers:
[2020-11-06 14:00:27] [config]   []
[2020-11-06 14:00:27] [config] transformer-train-position-embeddings: false
[2020-11-06 14:00:27] [config] tsv: false
[2020-11-06 14:00:27] [config] tsv-fields: 0
[2020-11-06 14:00:27] [config] type: transformer
[2020-11-06 14:00:27] [config] ulr: false
[2020-11-06 14:00:27] [config] ulr-dim-emb: 0
[2020-11-06 14:00:27] [config] ulr-dropout: 0
[2020-11-06 14:00:27] [config] ulr-keys-vectors: ""
[2020-11-06 14:00:27] [config] ulr-query-vectors: ""
[2020-11-06 14:00:27] [config] ulr-softmax-temperature: 1
[2020-11-06 14:00:27] [config] ulr-trainable-transformation: false
[2020-11-06 14:00:27] [config] unlikelihood-loss: false
[2020-11-06 14:00:27] [config] valid-freq: 500
[2020-11-06 14:00:27] [config] valid-log: valid.finetune.log
[2020-11-06 14:00:27] [config] valid-max-length: 1000
[2020-11-06 14:00:27] [config] valid-metrics:
[2020-11-06 14:00:27] [config]   - bleu-detok
[2020-11-06 14:00:27] [config]   - ce-mean-words
[2020-11-06 14:00:27] [config] valid-mini-batch: 16
[2020-11-06 14:00:27] [config] valid-reset-stalled: false
[2020-11-06 14:00:27] [config] valid-script-args:
[2020-11-06 14:00:27] [config]   []
[2020-11-06 14:00:27] [config] valid-script-path: ""
[2020-11-06 14:00:27] [config] valid-sets:
[2020-11-06 14:00:27] [config]   - devset.is
[2020-11-06 14:00:27] [config]   - devset.en
[2020-11-06 14:00:27] [config] valid-translation-output: devset.out
[2020-11-06 14:00:27] [config] version: v1.9.36 c9a2dcce 2020-09-03 13:56:51 +0100
[2020-11-06 14:00:27] [config] vocabs:
[2020-11-06 14:00:27] [config]   - vocab.spm
[2020-11-06 14:00:27] [config]   - vocab.spm
[2020-11-06 14:00:27] [config] word-penalty: 0
[2020-11-06 14:00:27] [config] word-scores: false
[2020-11-06 14:00:27] [config] workspace: 8000
[2020-11-06 14:00:27] [config] Loaded model has been created with Marian v1.9.36 c9a2dcce 2020-09-03 13:56:51 +0100, will be overwritten with current version v1.9.37 741cd865 2020-11-02 09:00:03 +0000 at saving
[2020-11-06 14:00:27] Using synchronous SGD
[2020-11-06 14:00:27] [data] Loading SentencePiece vocabulary from file vocab.spm
[2020-11-06 14:00:27] [data] Setting vocabulary size for input 0 to 32,000
[2020-11-06 14:00:27] [data] Loading SentencePiece vocabulary from file vocab.spm
[2020-11-06 14:00:27] [data] Setting vocabulary size for input 1 to 32,000
[2020-11-06 14:00:27] [data] Using word alignments from file corpus.aln.gz
[2020-11-06 14:00:27] [comm] Compiled without MPI support. Running as a single process on mari
[2020-11-06 14:00:27] [batching] Collecting statistics for batch fitting with step size 10
[2020-11-06 14:00:29] [memory] Extending reserved space to 8064 MB (device gpu0)
[2020-11-06 14:00:29] [memory] Extending reserved space to 8064 MB (device gpu1)
[2020-11-06 14:00:29] [comm] Using NCCL 2.3.7 for GPU communication
[2020-11-06 14:00:29] [comm] NCCLCommunicator constructed successfully
[2020-11-06 14:00:29] [training] Using 2 GPUs
[2020-11-06 14:00:29] [logits] Applying loss function for 1 factor(s)
[2020-11-06 14:00:29] [memory] Reserving 64 MB, device gpu0
[2020-11-06 14:00:29] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2020-11-06 14:00:29] [memory] Reserving 64 MB, device gpu0
[2020-11-06 14:00:45] [batching] Done. Typical MB size is 109,844 target words
[2020-11-06 14:00:45] [memory] Extending reserved space to 8064 MB (device gpu0)
[2020-11-06 14:00:45] [memory] Extending reserved space to 8064 MB (device gpu1)
[2020-11-06 14:00:45] [comm] Using NCCL 2.3.7 for GPU communication
[2020-11-06 14:00:45] [comm] NCCLCommunicator constructed successfully
[2020-11-06 14:00:45] [training] Using 2 GPUs
[2020-11-06 14:00:45] Loading model from model.finetune.npz
[2020-11-06 14:00:45] Loading model from model.finetune.npz
[2020-11-06 14:00:45] [training] Model reloaded from model.finetune.npz
[2020-11-06 14:00:45] Training started
[2020-11-06 14:00:45] [data] Shuffling data
[2020-11-06 14:00:49] [data] Done reading 2,284,376 sentences
[2020-11-06 14:00:49] [data] Done shuffling 2,284,376 sentences (cached in RAM)
[2020-11-06 14:02:11] [training] Batches are processed as 1 process(es) x 2 devices/process
[2020-11-06 14:02:11] [memory] Reserving 64 MB, device gpu0
[2020-11-06 14:02:11] [memory] Reserving 64 MB, device gpu1
[2020-11-06 14:02:11] [memory] Reserving 64 MB, device gpu0
[2020-11-06 14:02:11] [memory] Reserving 64 MB, device gpu1
[2020-11-06 14:02:11] Quantizing the model to 8-bits
[2020-11-06 14:02:11] Quantizing the model to 8-bits
[2020-11-06 14:02:11] [memory] Reserving 64 MB, device gpu1
[2020-11-06 14:02:11] [memory] Reserving 64 MB, device gpu0
[2020-11-06 14:02:11] [memory] Reserving 4 B, device gpu1
[2020-11-06 14:02:11] Error: Segmentation fault
[2020-11-06 14:02:11] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /work/user/bleualign/marian-fixedquant/src/common/logging.cpp:130
[memory] Reserving 4 B, device gpu0
[2020-11-06 14:02:11] Error: Segmentation fault
[2020-11-06 14:02:11] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /work/user/bleualign/marian-fixedquant/src/common/logging.cpp:130

[CALL STACK]
[0x55ffd6a32459]                                                       + 0x3ea459
[0x55ffd6a32809]                                                       + 0x3ea809
[0x7fcb57f178a0]                                                       + 0x128a0
[0x55ffd6ce2d8f]    marian::TensorBase::  subtensor  (unsigned long,  unsigned long) + 0x1f
[0x55ffd704b068]    marian::ModelQuantizer::  quantizeImpl  (IntrusivePtr<marian::TensorBase>) + 0x3d8
[0x55ffd704c25b]    marian::ModelQuantizer::  quantize  (std::shared_ptr<marian::ExpressionGraph>) + 0x5cb
[0x55ffd6cf6f7c]                                                       + 0x6aef7c
[0x55ffd6d8d7c6]    marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1}::  operator()  () const + 0x56
[0x55ffd6d8e330]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1},std::allocator<int>,void ()>::_M_run()::{lambda()#1},void>>::  _M_invoke  (std::_Any_data const&) + 0x30
[0x55ffd693d389]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x29
[0x7fcb57f14827]                                                       + 0xf827
[0x55ffd6d6e28d]    std::_Function_handler<void (),marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#3}>::  _M_invoke  (std::_Any_data const&) + 0x16d
[0x55ffd693f41b]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x1ab
[0x7fcb4ac696df]                                                       + 0xbd6df
[0x7fcb57f0c6db]                                                       + 0x76db
[0x7fcb4a326a3f]    clone                                              + 0x3f

The train script is:

#!/bin/bash -v

# Set GPUs.
GPUS="0 1"
MARIAN=../../../../marian-fixedquant/build

SRC=is
TRG=en

# Add symbolic links to the training files.
test -e corpus.$SRC.gz || exit 1    # e.g. ../../data/train.en.gz
test -e corpus.$TRG.gz || exit 1    # e.g. ../../data/train.es.translated.gz
test -e corpus.aln.gz  || exit 1    # e.g. ../../alignment/corpus.aln.gz
test -e lex.s2t.gz     || exit 1    # e.g. ../../alignment/lex.s2t.pruned.gz
test -e vocab.spm      || exit 1    # e.g. ../../data/vocab.spm

# Validation set with original source and target sentences (not distilled).
test -e devset.$SRC || exit 1
test -e devset.$TRG || exit 1

$MARIAN/marian \
    --model model.finetune.npz \
    --train-sets corpus.{$SRC,$TRG}.gz -T /tmp --shuffle-in-ram \
    --guided-alignment corpus.aln.gz \
    --vocabs vocab.spm vocab.spm \
    --dim-vocabs 32000 32000 \
    --max-length 200 \
    --mini-batch-fit -w 8000 --mini-batch 1000 --maxi-batch 1000 --devices $GPUS --sync-sgd --optimizer-delay 4 \
    --learn-rate 0.0003 --lr-report --lr-warmup 16000 --lr-decay-inv-sqrt 32000 \
    --cost-type ce-mean-words \
    --optimizer-params 0.9 0.98 1e-09 --clip-norm 0 \
    --valid-freq 500 --save-freq 500 --disp-freq 100 --disp-first 10 \
    --valid-metrics bleu-detok ce-mean-words \
    --valid-sets devset.{$SRC,$TRG} --valid-translation-output devset.out --quiet-translation \
    --valid-mini-batch 16 --beam-size 1 --normalize 1 \
    --early-stopping 20 \
    --overwrite --keep-best \
    --log train.finetune.log --valid-log valid.finetune.log --quantize-bits 8

Monitor pl-en training

Running in screen 86852 on the second half of alvis. It's currently in student training.

CsEn models moved

The CsEn models have been moved, but the corresponding download scripts still point to the old location.

Remaining QE Models

Location of trained and ready-to-use quality estimation models for Czech, Spanish, and German.

From the meeting it appears that a JSON is provided by email and not the binary. The QE team is expected to check models in to this repository, as they did with the English-Estonian model (#44):

  1. Binary models converted for the missing languages.
  2. Proper documentation somewhere on converting from JSON to binary. Training information would be a good addition. Refer to https://github.com/browsermt/students/blob/master/train-student/README.md, which covers packaging models for machine translation, and repurpose that structure for QE.

Provision of new model(s) for testing multiple model workflows

Need a few more distinct models to test applications enabled by multiple model capability (browsermt/bergamot-translator#209, browsermt/bergamot-translator#210).

It may be desirable to have something compatible with the forward model. Currently there's an archive from here (ende.student.tiny.for.regression.tests) that is kept with separate specs from the usual pool (see #37 (comment)). Wouldn't mind having the same spec as the rest of the models here at the cost of discarding file based shortlists etc.

  • One kind of model required is a backward model (de -> en for the existing en -> de) so that an outbound translation example can be pointed to in code somewhere and run on CI every time (for browsermt/bergamot-translator#78)
  • Another kind of model we will soon require is an (xx->en, en->yy) pair if and when pivoting (browsermt/bergamot-translator#212) comes into the picture.

Monitor fr-en training

It's slowly crunching through forward translation on nanna

for i in /mnt/nanna0/nbogoych/data/data/fr-en/fr-en-prod/translated/corpus/file.*.ref; do f=$(dirname $i)/$(basename $i .ref).nbest; if [ ! -f $f ]; then echo $i; fi; done |wc

Current BLEU implementation prefers empty hypothesis

Hi,

I think there is something wrong with the BLEU implementation in bestbleu.py because it prefers empty hypotheses. For example, having the sentence:

 (" Gjør det fordi jeg sier det. ")

as source, and the reference:

 (" Do it because I say so. ")

and the nbest list as:

0 ||| (“Do it because I say so.”) ||| F0= -5.48577 ||| -0.457148
0 ||| (“Do it because I say it.”) ||| F0= -5.69615 ||| -0.474679
0 ||| (“Do it because I say it.") ||| F0= -6.20291 ||| -0.5639
0 ||| (“Do so because I say so.”) ||| F0= -7.10361 ||| -0.591967
0 ||| (“Do this because I say so.”) ||| F0= -7.25004 ||| -0.60417
0 ||| ("Do it because I say it.") ||| F0= -6.30103 ||| -0.630103
0 ||| (“Do it because I say so.” ||| F0= -7.08298 ||| -0.643907
0 |||  ||| F0= -1.93607 ||| -1.93607

it always chooses the empty sentence. Maybe my teacher model is not that good and is producing a lot of empty hypotheses, so after bestbleu filtering I'm getting a lot of empty lines.

I tried using sacrebleu instead and the problem seems to disappear.

See also some sacrebleu-related fixes in #9.
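For reference, a minimal sketch (not the actual bestbleu.py) of picking the best n-best entry with smoothed sentence-level BLEU via the sacrebleu Python package; smoothing and effective n-gram order keep an empty hypothesis from being preferred when no higher-order n-grams match. The sentences are taken from the example above, and the helper name is illustrative:

import sacrebleu

reference = '(" Do it because I say so. ")'
nbest = [
    '(“Do it because I say so.”)',
    '(“Do it because I say it.”)',
    '',  # the empty hypothesis from the n-best list above
]

def pick_best(reference, hypotheses):
    # sacrebleu.sentence_bleu applies smoothing by default, so the empty
    # hypothesis scores 0 instead of winning by accident.
    return max(hypotheses,
               key=lambda hyp: sacrebleu.sentence_bleu(hyp, [reference]).score)

print(pick_best(reference, nbest))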

Re-train pl-en

Backtranslations were broken. Currently re-running on alvis. Follow progress here:

/fs/surtr0/germann/firefox-translations-training-new/.snakemake/log/2022-03-21T144013.644988.snakemake.log

Training speed Teacher to Student

Is there any data you can share on how long it took to train the student models with the recommended setup of 4 GPUs with 12 GB of memory? (What GPU series and model are we talking about here?)

Months, weeks, days?

I'm interested in potentially contributing in the future but I'd need to know what to expect before getting hardware to do so.

Cheers :)

Negotiate testing requirements

@jerinphilip let's see what you need for the test setup before we redeploy the latest-and-greatest models and break your tests.

Could you give me a list of the files you use, and let me know if you want anything changed?

Post en-nl model

It's in /fs/mimir0/firefox-translations-training and was trained on alvis. Check the quality first.

Binarize QE models and make tool to do that

@mfomicheva e-mailed some JSON files which I have now checked in: 962d419

As @mfomicheva wrote in a follow-up, these need to be binarized. The documentation refers to this:

https://github.com/abarbosa94/bergamot-translator/blob/quality-estimator-update/src/tests/units/quality_estimator_tests.cpp#L43

Can we get a proper tool for that? It doesn't have to be a full-on JSON parsing thing, but the values shouldn't be hard-coded in the C++. Binarized models are more important though.
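As a rough starting point, a hypothetical sketch of such a tool, assuming the JSON holds a flat array of logistic-regression weights plus an intercept and the consumer expects contiguous little-endian float32 values; the actual field names and binary layout would need to match whatever the QE JSON and quality_estimator_tests.cpp really use:

# Hypothetical JSON -> binary converter. The field names ("weights",
# "intercept") and the flat little-endian float32 layout are assumptions
# made for illustration only.
# Usage: python binarize_qe.py model.json model.bin
import json
import struct
import sys

def binarize(json_path, bin_path):
    with open(json_path, encoding="utf-8") as f:
        model = json.load(f)
    values = list(model["weights"]) + [model["intercept"]]
    with open(bin_path, "wb") as out:
        out.write(struct.pack(f"<{len(values)}f", *values))

if __name__ == "__main__":
    binarize(sys.argv[1], sys.argv[2])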

Inconsistent handling of Markdown syntax

Originally opened in the Firefox Translations repo as issue 499

Describe the bug
Markdown formatting will in most cases not survive the translation process, either being mangled, closed improperly, mapped to other characters or outright omitted in the final translation.

This is mostly a nice-to-have; some parts of the syntax will never translate properly (e.g. code blocks and quotes). I also fully realize that this is a rather niche use case and might degrade overall translation performance.

Related issues
mozilla/firefox-translations#486

Potential solution
Include Markdown syntax as part of the training pipeline, similar to what was mentioned in the issue above.


Example

Test environment

Translations were run with a simplified BergamotWorker script and WASM, without postprocessing.
Firefox Translation Models (version 0.3.3) were used as the translation models.

Example

Markdown snippet used for testing, covering most aspects of the syntax

--- 
# Markdown test
This is a *test* to see how **well** Bergamot _handles_ the [Markdown](https://www.markdownguide.org/) syntax. 

1. The **bergamot orange**, is a fragrant citrus fruit the size of an orange
2. Has a *yellow* or *green* color similar to a lime, depending on ripeness

- The word bergamot is derived from the Italian word _bergamotto_
	- It is a small tree that blossoms during the winter

```js
variable = 10
if (variable == "10")
	variable = "10" + 1
```

> “Beware of bugs in the above code; I have only proved it correct, not tried it.”
> — Donald E. Knuth.
--- 

Some failing examples

French:

  • # turns into -
  • ** disappears or gets changed into a single quote

Dutch:

  • --- turns into -- ---
  • # turns into
  • Numbering gets repeated

German:

  • # turns into -
  • _ turns into "

Clarify licensing terms for models

catalog-entry.yml has licensing info, at least for some models; model-info.json appears not to have it.

  1. Is there any particular reason that some models are CC-BY-SA-NC (e.g., bg<->en)? I'd prefer CC-BY-SA in most cases, unless there is a particular reason to make them *-NC. If there is a reason (e.g., corpora used), this should be documented.
  2. If catalog-entry.yml is removed, licensing terms should be documented elsewhere (e.g. in model-info.json).

On a marginal note: as a rule of thumb I prefer YAML over JSON, because the former allows comments whereas the latter, to the best of my knowledge, doesn't.
