browsermt / students

Efficient teacher-student models and scripts to make them

License: Other

students's People

Contributors: afaji, felipesantosk, godnoken, jelmervdl, kaleidoescape, kpu, pacien, pdehaan, pinzhenchen, qianqianzhu, snukky, sukantasen, ugermann, xapajiamnu, zjaume

students's Issues

Reduce number of Marian branches

Users of this repo have to compile three different branches of Marian:

  1. master:
    https://github.com/browsermt/students/blob/master/install.sh#L8-L9
    (I think we can remove master)

  2. Alham's quantization training branch:
    https://github.com/afaji/Marian/tree/fixed-quant

  3. Nick's intgemm_reintegrated_computestats branch

So a couple of questions:

  1. How far is @afaji's fixed-quant branch from something that can be in master?
  2. Can we merge fixed-quant and intgemm_reintegrated_computestats?

tcmalloc causes extract_stats.py to crash

"In very rare cases, the extract_stats.py script might crash. If this happens, the most likely culprit is tcmalloc. If the model is too large (or the workspace too small), tcmalloc will spit tcmalloc: large alloc... to stderr which our script doesn't know how to handle. Just delete that line inside the quantmuls file and you're good to go."

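A minimal sketch of that workaround, assuming the stderr output ended up in the quantmuls file that extract_stats.py reads (the file name comes from the quote above; the script name and everything else here is illustrative):

# strip_tcmalloc.py -- drop tcmalloc warning lines from a quantmuls dump so
# that extract_stats.py does not trip over them.
# Usage: python strip_tcmalloc.py quantmuls > quantmuls.clean
import sys

with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
    for line in f:
        # tcmalloc prefixes its warnings, e.g. "tcmalloc: large alloc ..."
        if line.startswith("tcmalloc:"):
            continue
        sys.stdout.write(line)
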
Citation of models

Is there a list of works to cite when using the trained models in academic work? Specifically, the cs-en and de-en models.

New de-en models

On valhalla in /fs/surtr0/heafield/{base,tiny11}-de-en there are new German->English student models based on the WMT21 submissions that include guided alignment. We should give them the 8-bit and lexical shortlist treatment, then release them here.

Model license clarification

Hello,

'Models produced by the Bergamot project are released under the CC-BY-SA 4.0 license.'

If I use the MIT-licensed software in this repo to create a model, is that model automatically under this license? Or are we strictly talking about the models released in this repo?

And if I may be so cheeky as to ask, without taking it as legal advice: is it possible to avoid releasing my own software under the CC-BY-SA 4.0 license (or another compatible license) if I include any of these Bergamot-released models (unmodified) in my software?

Thank you!

Update en-de models

The en-de models submitted to WMT21 were trained without guided alignment. They are in /fs/alvis0/heafield/student (base) and /fs/surtr0/heafield/student-tiny11 (tiny11).

  1. Retrain with guided alignment (find alignments or make them on the parallel data). See train.sh
  2. Make 8-bit and lexical shortlists
  3. Post here

Start en-pl and en-fr training

These can get a leg up by using the existing pl-en (#61) and fr-en (#60) models that have already built their teacher systems, which should then be used for back translation. Note that forward translation for making a pl-en student is the same thing as backtranslation for making an en-pl model.

See mozilla/firefox-translations-training#46 on how to short-circuit this and talk to @eu9ene as necessary.

Re-train fr-en

Trained models for fr-en suffered from broken back-translation.

Steps taken to resurrect training:

  • rsync-ed /fs/nanna0/nbogoych/firefox-translations-training into /fs/nanna0/germann/bergamot/train/firefox-translations-training
  • edited CONDA_PATH in Makefile to ${HOME}/mambaforge
  • ran make conda; make snakemake
  • changed precision to always float32 in configs/config.fr-en.yml
  • rsynced data/fr-en/fre-en-prod/clean/mono.* and data/fr-en/fr-en-prod/biclean/corpus.* into respective directory in /fs/nanna0/germann/bergamot/train/fr-en
  • rsynced data/fr-en/fre-en-prod/original/{devset.*,eval}
  • rsynced models/fr-en/fre-en-prod/{backward|vocab} for backtranslation
  • hacked the Snakefile in a truly awful way to declare all the dependencies that I didn't want to recreate as ancient (see the Snakefile sketch after this list). This is necessary because my run of snakemake recreated various tools upon which stuff depends; I kept re-running make dry-run until things started to make sense.

[ will be edited as I progress]
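For illustration, a hypothetical Snakefile fragment showing the idea behind the "declare ancient" hack (Snakemake rules are written in a Python-based DSL). Wrapping an input in ancient() tells Snakemake to ignore its timestamp, so a rebuilt tool or a re-synced model does not invalidate already-finished downstream work. The rule, file names, and command below are made up, not taken from the actual pipeline:

# Hypothetical Snakemake rule: ancient() means "trust this input even if its
# mtime is newer than the outputs", so re-synced models and rebuilt binaries
# do not force the forward/back-translation steps to run again.
rule translate_mono:
    input:
        model=ancient("models/backward/model.npz"),   # re-synced, do not rebuild
        vocab=ancient("models/vocab/vocab.spm"),
        mono="data/clean/mono.fr.gz",
    output:
        "data/translated/mono.en.gz",
    shell:
        "zcat {input.mono} | marian-decoder -m {input.model} "
        "-v {input.vocab} {input.vocab} | gzip > {output}"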

Broken back-translations in model training

Looks like we're back to square 2 (data is cleaned) for nl-en, en-nl, pl-en, fr-en ...

All trained models (i.e. everything after the backtranslation model):

translation direction | snakemake data dir | status
nl-en | /fs/elli/0nbogoych/data/data/nl-en/snakemake-nl-en/ | broken
en-nl | /fs/mimir0/nbogoych/data/data/en-nl/snakemake-en-nl/ | broken
pl-en | /fs/surtr0/nbogoych/data/data/pl-en/pl-en-prod/ | broken
fr-en | /fs/nanna0/nbogoych/data/data/fr-en/fr-en-prod/ | broken

Collapse translatelocally deployment and this repository

Currently we have two sets of tarballs: those downloaded by this students repo (which often has multiple models at once) and those downloaded by translatelocally. We should consolidate, probably by throwing out the more ad-hoc format used by the students repo.

csen

Hey,
Could we have the Czech models and scripts uploaded here in a similar fashion to the others? @ugermann @radidd

Cheers,

Nick

Is tiny student training config correct?

Hi,

I'm exploring the train-student directory to see if I can train student models for ParaCrawl. I have taken a look at the training config file for tiny models at https://github.com/browsermt/students/blob/master/train-student/models/student.tiny11/student.tiny11.yml and found some options a bit confusing. The paper says tiny students use a transformer encoder and an RNN decoder, but the config file is using:

enc-cell: gru
enc-cell-depth: 1
enc-type: bidirectional

Maybe these options are ignored when type: transformer is used, but I wanted to be sure this is the correct configuration before using it.

Segmentation fault on finetuning

Hi @afaji,

I'm getting a segmentation fault when trying to finetune with https://github.com/afaji/Marian/tree/fixed-quant.

Compiled today with: cmake .. -DUSE_SENTENCEPIECE=on -DCOMPILE_CPU=on -DUSE_FBGEMM=on

[2020-11-06 14:00:27] [marian] Marian v1.9.37 741cd865 2020-11-02 09:00:03 +0000
[2020-11-06 14:00:27] [marian] Running on host as process 2003 with command line:
[2020-11-06 14:00:27] [marian] ../../../../marian-fixedquant/build/marian --model model.finetune.npz --train-sets corpus.is.gz corpus.en.gz -T /tmp --shuffle-in-ram --guided-alignment corpus.aln.gz --vocabs vocab.spm vocab.spm --dim-vocabs 32000 32000 --max-length 200 --mini-batch-fit -w 8000 --mini-batch 1000 --maxi-batch 1000 --devices 0 1 --sync-sgd --optimizer-delay 4 --learn-rate 0.0003 --lr-report --lr-warmup 16000 --lr-decay-inv-sqrt 32000 --cost-type ce-mean-words --optimizer-params 0.9 0.98 1e-09 --clip-norm 0 --valid-freq 500 --save-freq 500 --disp-freq 100 --disp-first 10 --valid-metrics bleu-detok ce-mean-words --valid-sets devset.is devset.en --valid-translation-output devset.out --quiet-translation --valid-mini-batch 16 --beam-size 1 --normalize 1 --early-stopping 20 --overwrite --keep-best --log train.finetune.log --valid-log valid.finetune.log --quantize-bits 8
[2020-11-06 14:00:27] [config] after-batches: 0
[2020-11-06 14:00:27] [config] after-epochs: 0
[2020-11-06 14:00:27] [config] all-caps-every: 0
[2020-11-06 14:00:27] [config] allow-unk: false
[2020-11-06 14:00:27] [config] authors: false
[2020-11-06 14:00:27] [config] beam-size: 1
[2020-11-06 14:00:27] [config] bert-class-symbol: "[CLS]"
[2020-11-06 14:00:27] [config] bert-mask-symbol: "[MASK]"
[2020-11-06 14:00:27] [config] bert-masking-fraction: 0.15
[2020-11-06 14:00:27] [config] bert-sep-symbol: "[SEP]"
[2020-11-06 14:00:27] [config] bert-train-type-embeddings: true
[2020-11-06 14:00:27] [config] bert-type-vocab-size: 2
[2020-11-06 14:00:27] [config] build-info: ""
[2020-11-06 14:00:27] [config] cite: false
[2020-11-06 14:00:27] [config] clip-gemm: 0
[2020-11-06 14:00:27] [config] clip-norm: 0
[2020-11-06 14:00:27] [config] cost-scaling:
[2020-11-06 14:00:27] [config]   []
[2020-11-06 14:00:27] [config] cost-type: ce-mean-words
[2020-11-06 14:00:27] [config] cpu-threads: 0
[2020-11-06 14:00:27] [config] data-weighting: ""
[2020-11-06 14:00:27] [config] data-weighting-type: sentence
[2020-11-06 14:00:27] [config] dec-cell: ssru
[2020-11-06 14:00:27] [config] dec-cell-base-depth: 2
[2020-11-06 14:00:27] [config] dec-cell-high-depth: 1
[2020-11-06 14:00:27] [config] dec-depth: 2
[2020-11-06 14:00:27] [config] devices:
[2020-11-06 14:00:27] [config]   - 0
[2020-11-06 14:00:27] [config]   - 1
[2020-11-06 14:00:27] [config] dim-emb: 256
[2020-11-06 14:00:27] [config] dim-rnn: 1024
[2020-11-06 14:00:27] [config] dim-vocabs:
[2020-11-06 14:00:27] [config]   - 32000
[2020-11-06 14:00:27] [config]   - 32000
[2020-11-06 14:00:27] [config] disp-first: 10
[2020-11-06 14:00:27] [config] disp-freq: 100
[2020-11-06 14:00:27] [config] disp-label-counts: false
[2020-11-06 14:00:27] [config] dropout-rnn: 0
[2020-11-06 14:00:27] [config] dropout-src: 0
[2020-11-06 14:00:27] [config] dropout-trg: 0
[2020-11-06 14:00:27] [config] dump-config: ""
[2020-11-06 14:00:27] [config] early-stopping: 20
[2020-11-06 14:00:27] [config] embedding-fix-src: false
[2020-11-06 14:00:27] [config] embedding-fix-trg: false
[2020-11-06 14:00:27] [config] embedding-normalization: false
[2020-11-06 14:00:27] [config] embedding-vectors:
[2020-11-06 14:00:27] [config]   []
[2020-11-06 14:00:27] [config] enc-cell: gru
[2020-11-06 14:00:27] [config] enc-cell-depth: 1
[2020-11-06 14:00:27] [config] enc-depth: 6
[2020-11-06 14:00:27] [config] enc-type: bidirectional
[2020-11-06 14:00:27] [config] english-title-case-every: 0
[2020-11-06 14:00:27] [config] exponential-smoothing: 0
[2020-11-06 14:00:27] [config] factor-weight: 1
[2020-11-06 14:00:27] [config] grad-dropping-momentum: 0
[2020-11-06 14:00:27] [config] grad-dropping-rate: 0
[2020-11-06 14:00:27] [config] grad-dropping-warmup: 100
[2020-11-06 14:00:27] [config] gradient-checkpointing: false
[2020-11-06 14:00:27] [config] guided-alignment: corpus.aln.gz
[2020-11-06 14:00:27] [config] guided-alignment-cost: mse
[2020-11-06 14:00:27] [config] guided-alignment-weight: 0.1
[2020-11-06 14:00:27] [config] ignore-model-config: false
[2020-11-06 14:00:27] [config] input-types:
[2020-11-06 14:00:27] [config]   []
[2020-11-06 14:00:27] [config] interpolate-env-vars: false
[2020-11-06 14:00:27] [config] keep-best: true
[2020-11-06 14:00:27] [config] label-smoothing: 0
[2020-11-06 14:00:27] [config] layer-normalization: false
[2020-11-06 14:00:27] [config] learn-rate: 0.0003
[2020-11-06 14:00:27] [config] lemma-dim-emb: 0
[2020-11-06 14:00:27] [config] log: train.finetune.log
[2020-11-06 14:00:27] [config] log-level: info
[2020-11-06 14:00:27] [config] log-time-zone: ""
[2020-11-06 14:00:27] [config] lr-decay: 0
[2020-11-06 14:00:27] [config] lr-decay-freq: 50000
[2020-11-06 14:00:27] [config] lr-decay-inv-sqrt:
[2020-11-06 14:00:27] [config]   - 32000
[2020-11-06 14:00:27] [config] lr-decay-repeat-warmup: false
[2020-11-06 14:00:27] [config] lr-decay-reset-optimizer: false
[2020-11-06 14:00:27] [config] lr-decay-start:
[2020-11-06 14:00:27] [config]   - 10
[2020-11-06 14:00:27] [config]   - 1
[2020-11-06 14:00:27] [config] lr-decay-strategy: epoch+stalled
[2020-11-06 14:00:27] [config] lr-report: true
[2020-11-06 14:00:27] [config] lr-warmup: 16000
[2020-11-06 14:00:27] [config] lr-warmup-at-reload: false
[2020-11-06 14:00:27] [config] lr-warmup-cycle: false
[2020-11-06 14:00:27] [config] lr-warmup-start-rate: 0
[2020-11-06 14:00:27] [config] max-length: 200
[2020-11-06 14:00:27] [config] max-length-crop: false
[2020-11-06 14:00:27] [config] max-length-factor: 3
[2020-11-06 14:00:27] [config] maxi-batch: 1000
[2020-11-06 14:00:27] [config] maxi-batch-sort: trg
[2020-11-06 14:00:27] [config] mini-batch: 1000
[2020-11-06 14:00:27] [config] mini-batch-fit: true
[2020-11-06 14:00:27] [config] mini-batch-fit-step: 10
[2020-11-06 14:00:27] [config] mini-batch-track-lr: false
[2020-11-06 14:00:27] [config] mini-batch-warmup: 0
[2020-11-06 14:00:27] [config] mini-batch-words: 0
[2020-11-06 14:00:27] [config] mini-batch-words-ref: 0
[2020-11-06 14:00:27] [config] model: model.finetune.npz
[2020-11-06 14:00:27] [config] multi-loss-type: sum
[2020-11-06 14:00:27] [config] multi-node: false
[2020-11-06 14:00:27] [config] multi-node-overlap: true
[2020-11-06 14:00:27] [config] n-best: false
[2020-11-06 14:00:27] [config] no-nccl: false
[2020-11-06 14:00:27] [config] no-reload: false
[2020-11-06 14:00:27] [config] no-restore-corpus: false
[2020-11-06 14:00:27] [config] normalize: 1
[2020-11-06 14:00:27] [config] normalize-gradient: false
[2020-11-06 14:00:27] [config] num-devices: 0
[2020-11-06 14:00:27] [config] optimizer: adam
[2020-11-06 14:00:27] [config] optimizer-delay: 4
[2020-11-06 14:00:27] [config] optimizer-params:
[2020-11-06 14:00:27] [config]   - 0.9
[2020-11-06 14:00:27] [config]   - 0.98
[2020-11-06 14:00:27] [config]   - 1e-09
[2020-11-06 14:00:27] [config] output-omit-bias: false
[2020-11-06 14:00:27] [config] overwrite: true
[2020-11-06 14:00:27] [config] precision:
[2020-11-06 14:00:27] [config]   - float32
[2020-11-06 14:00:27] [config]   - float32
[2020-11-06 14:00:27] [config]   - float32
[2020-11-06 14:00:27] [config] pretrained-model: ""
[2020-11-06 14:00:27] [config] quantize-biases: false
[2020-11-06 14:00:27] [config] quantize-bits: 8
[2020-11-06 14:00:27] [config] quantize-log-based: false
[2020-11-06 14:00:27] [config] quantize-optimization-steps: 0
[2020-11-06 14:00:27] [config] quiet: false
[2020-11-06 14:00:27] [config] quiet-translation: true
[2020-11-06 14:00:27] [config] relative-paths: false
[2020-11-06 14:00:27] [config] right-left: false
[2020-11-06 14:00:27] [config] save-freq: 500
[2020-11-06 14:00:27] [config] seed: 0
[2020-11-06 14:00:27] [config] sentencepiece-alphas:
[2020-11-06 14:00:27] [config]   []
[2020-11-06 14:00:27] [config] sentencepiece-max-lines: 10000000
[2020-11-06 14:00:27] [config] sentencepiece-options: ""
[2020-11-06 14:00:27] [config] shuffle: data
[2020-11-06 14:00:27] [config] shuffle-in-ram: true
[2020-11-06 14:00:27] [config] sigterm: save-and-exit
[2020-11-06 14:00:27] [config] skip: false
[2020-11-06 14:00:27] [config] sqlite: ""
[2020-11-06 14:00:27] [config] sqlite-drop: false
[2020-11-06 14:00:27] [config] sync-sgd: true
[2020-11-06 14:00:27] [config] tempdir: /tmp
[2020-11-06 14:00:27] [config] tied-embeddings: false
[2020-11-06 14:00:27] [config] tied-embeddings-all: true
[2020-11-06 14:00:27] [config] tied-embeddings-src: false
[2020-11-06 14:00:27] [config] train-sets:
[2020-11-06 14:00:27] [config]   - corpus.is.gz
[2020-11-06 14:00:27] [config]   - corpus.en.gz
[2020-11-06 14:00:27] [config] transformer-aan-activation: swish
[2020-11-06 14:00:27] [config] transformer-aan-depth: 2
[2020-11-06 14:00:27] [config] transformer-aan-nogate: false
[2020-11-06 14:00:27] [config] transformer-decoder-autoreg: rnn
[2020-11-06 14:00:27] [config] transformer-depth-scaling: false
[2020-11-06 14:00:27] [config] transformer-dim-aan: 2048
[2020-11-06 14:00:27] [config] transformer-dim-ffn: 1536
[2020-11-06 14:00:27] [config] transformer-dropout: 0
[2020-11-06 14:00:27] [config] transformer-dropout-attention: 0
[2020-11-06 14:00:27] [config] transformer-dropout-ffn: 0
[2020-11-06 14:00:27] [config] transformer-ffn-activation: relu
[2020-11-06 14:00:27] [config] transformer-ffn-depth: 2
[2020-11-06 14:00:27] [config] transformer-guided-alignment-layer: last
[2020-11-06 14:00:27] [config] transformer-heads: 8
[2020-11-06 14:00:27] [config] transformer-no-projection: false
[2020-11-06 14:00:27] [config] transformer-pool: false
[2020-11-06 14:00:27] [config] transformer-postprocess: dan
[2020-11-06 14:00:27] [config] transformer-postprocess-emb: d
[2020-11-06 14:00:27] [config] transformer-postprocess-top: ""
[2020-11-06 14:00:27] [config] transformer-preprocess: ""
[2020-11-06 14:00:27] [config] transformer-tied-layers:
[2020-11-06 14:00:27] [config]   []
[2020-11-06 14:00:27] [config] transformer-train-position-embeddings: false
[2020-11-06 14:00:27] [config] tsv: false
[2020-11-06 14:00:27] [config] tsv-fields: 0
[2020-11-06 14:00:27] [config] type: transformer
[2020-11-06 14:00:27] [config] ulr: false
[2020-11-06 14:00:27] [config] ulr-dim-emb: 0
[2020-11-06 14:00:27] [config] ulr-dropout: 0
[2020-11-06 14:00:27] [config] ulr-keys-vectors: ""
[2020-11-06 14:00:27] [config] ulr-query-vectors: ""
[2020-11-06 14:00:27] [config] ulr-softmax-temperature: 1
[2020-11-06 14:00:27] [config] ulr-trainable-transformation: false
[2020-11-06 14:00:27] [config] unlikelihood-loss: false
[2020-11-06 14:00:27] [config] valid-freq: 500
[2020-11-06 14:00:27] [config] valid-log: valid.finetune.log
[2020-11-06 14:00:27] [config] valid-max-length: 1000
[2020-11-06 14:00:27] [config] valid-metrics:
[2020-11-06 14:00:27] [config]   - bleu-detok
[2020-11-06 14:00:27] [config]   - ce-mean-words
[2020-11-06 14:00:27] [config] valid-mini-batch: 16
[2020-11-06 14:00:27] [config] valid-reset-stalled: false
[2020-11-06 14:00:27] [config] valid-script-args:
[2020-11-06 14:00:27] [config]   []
[2020-11-06 14:00:27] [config] valid-script-path: ""
[2020-11-06 14:00:27] [config] valid-sets:
[2020-11-06 14:00:27] [config]   - devset.is
[2020-11-06 14:00:27] [config]   - devset.en
[2020-11-06 14:00:27] [config] valid-translation-output: devset.out
[2020-11-06 14:00:27] [config] version: v1.9.36 c9a2dcce 2020-09-03 13:56:51 +0100
[2020-11-06 14:00:27] [config] vocabs:
[2020-11-06 14:00:27] [config]   - vocab.spm
[2020-11-06 14:00:27] [config]   - vocab.spm
[2020-11-06 14:00:27] [config] word-penalty: 0
[2020-11-06 14:00:27] [config] word-scores: false
[2020-11-06 14:00:27] [config] workspace: 8000
[2020-11-06 14:00:27] [config] Loaded model has been created with Marian v1.9.36 c9a2dcce 2020-09-03 13:56:51 +0100, will be overwritten with current version v1.9.37 741cd865 2020-11-02 09:00:03 +0000 at saving
[2020-11-06 14:00:27] Using synchronous SGD
[2020-11-06 14:00:27] [data] Loading SentencePiece vocabulary from file vocab.spm
[2020-11-06 14:00:27] [data] Setting vocabulary size for input 0 to 32,000
[2020-11-06 14:00:27] [data] Loading SentencePiece vocabulary from file vocab.spm
[2020-11-06 14:00:27] [data] Setting vocabulary size for input 1 to 32,000
[2020-11-06 14:00:27] [data] Using word alignments from file corpus.aln.gz
[2020-11-06 14:00:27] [comm] Compiled without MPI support. Running as a single process on mari
[2020-11-06 14:00:27] [batching] Collecting statistics for batch fitting with step size 10
[2020-11-06 14:00:29] [memory] Extending reserved space to 8064 MB (device gpu0)
[2020-11-06 14:00:29] [memory] Extending reserved space to 8064 MB (device gpu1)
[2020-11-06 14:00:29] [comm] Using NCCL 2.3.7 for GPU communication
[2020-11-06 14:00:29] [comm] NCCLCommunicator constructed successfully
[2020-11-06 14:00:29] [training] Using 2 GPUs
[2020-11-06 14:00:29] [logits] Applying loss function for 1 factor(s)
[2020-11-06 14:00:29] [memory] Reserving 64 MB, device gpu0
[2020-11-06 14:00:29] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2020-11-06 14:00:29] [memory] Reserving 64 MB, device gpu0
[2020-11-06 14:00:45] [batching] Done. Typical MB size is 109,844 target words
[2020-11-06 14:00:45] [memory] Extending reserved space to 8064 MB (device gpu0)
[2020-11-06 14:00:45] [memory] Extending reserved space to 8064 MB (device gpu1)
[2020-11-06 14:00:45] [comm] Using NCCL 2.3.7 for GPU communication
[2020-11-06 14:00:45] [comm] NCCLCommunicator constructed successfully
[2020-11-06 14:00:45] [training] Using 2 GPUs
[2020-11-06 14:00:45] Loading model from model.finetune.npz
[2020-11-06 14:00:45] Loading model from model.finetune.npz
[2020-11-06 14:00:45] [training] Model reloaded from model.finetune.npz
[2020-11-06 14:00:45] Training started
[2020-11-06 14:00:45] [data] Shuffling data
[2020-11-06 14:00:49] [data] Done reading 2,284,376 sentences
[2020-11-06 14:00:49] [data] Done shuffling 2,284,376 sentences (cached in RAM)
[2020-11-06 14:02:11] [training] Batches are processed as 1 process(es) x 2 devices/process
[2020-11-06 14:02:11] [memory] Reserving 64 MB, device gpu0
[2020-11-06 14:02:11] [memory] Reserving 64 MB, device gpu1
[2020-11-06 14:02:11] [memory] Reserving 64 MB, device gpu0
[2020-11-06 14:02:11] [memory] Reserving 64 MB, device gpu1
[2020-11-06 14:02:11] Quantizing the model to 8-bits
[2020-11-06 14:02:11] Quantizing the model to 8-bits
[2020-11-06 14:02:11] [memory] Reserving 64 MB, device gpu1
[2020-11-06 14:02:11] [memory] Reserving 64 MB, device gpu0
[2020-11-06 14:02:11] [memory] Reserving 4 B, device gpu1
[2020-11-06 14:02:11] Error: Segmentation fault
[2020-11-06 14:02:11] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /work/user/bleualign/marian-fixedquant/src/common/logging.cpp:130
[memory] Reserving 4 B, device gpu0
[2020-11-06 14:02:11] Error: Segmentation fault
[2020-11-06 14:02:11] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /work/user/bleualign/marian-fixedquant/src/common/logging.cpp:130

[CALL STACK]
[0x55ffd6a32459]                                                       + 0x3ea459
[0x55ffd6a32809]                                                       + 0x3ea809
[0x7fcb57f178a0]                                                       + 0x128a0
[0x55ffd6ce2d8f]    marian::TensorBase::  subtensor  (unsigned long,  unsigned long) + 0x1f
[0x55ffd704b068]    marian::ModelQuantizer::  quantizeImpl  (IntrusivePtr<marian::TensorBase>) + 0x3d8
[0x55ffd704c25b]    marian::ModelQuantizer::  quantize  (std::shared_ptr<marian::ExpressionGraph>) + 0x5cb
[0x55ffd6cf6f7c]                                                       + 0x6aef7c
[0x55ffd6d8d7c6]    marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1}::  operator()  () const + 0x56
[0x55ffd6d8e330]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1},std::allocator<int>,void ()>::_M_run()::{lambda()#1},void>>::  _M_invoke  (std::_Any_data const&) + 0x30
[0x55ffd693d389]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x29
[0x7fcb57f14827]                                                       + 0xf827
[0x55ffd6d6e28d]    std::_Function_handler<void (),marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#3}>::  _M_invoke  (std::_Any_data const&) + 0x16d
[0x55ffd693f41b]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x1ab
[0x7fcb4ac696df]                                                       + 0xbd6df
[0x7fcb57f0c6db]                                                       + 0x76db
[0x7fcb4a326a3f]    clone                                              + 0x3f

The train script is:

#!/bin/bash -v

# Set GPUs.
GPUS="0 1"
MARIAN=../../../../marian-fixedquant/build

SRC=is
TRG=en

# Add symbolic links to the training files.
test -e corpus.$SRC.gz || exit 1    # e.g. ../../data/train.en.gz
test -e corpus.$TRG.gz || exit 1    # e.g. ../../data/train.es.translated.gz
test -e corpus.aln.gz  || exit 1    # e.g. ../../alignment/corpus.aln.gz
test -e lex.s2t.gz     || exit 1    # e.g. ../../alignment/lex.s2t.pruned.gz
test -e vocab.spm      || exit 1    # e.g. ../../data/vocab.spm

# Validation set with original source and target sentences (not distilled).
test -e devset.$SRC || exit 1
test -e devset.$TRG || exit 1

$MARIAN/marian \
    --model model.finetune.npz \
    --train-sets corpus.{$SRC,$TRG}.gz -T /tmp --shuffle-in-ram \
    --guided-alignment corpus.aln.gz \
    --vocabs vocab.spm vocab.spm \
    --dim-vocabs 32000 32000 \
    --max-length 200 \
    --mini-batch-fit -w 8000 --mini-batch 1000 --maxi-batch 1000 --devices $GPUS --sync-sgd --optimizer-delay 4 \
    --learn-rate 0.0003 --lr-report --lr-warmup 16000 --lr-decay-inv-sqrt 32000 \
    --cost-type ce-mean-words \
    --optimizer-params 0.9 0.98 1e-09 --clip-norm 0 \
    --valid-freq 500 --save-freq 500 --disp-freq 100 --disp-first 10 \
    --valid-metrics bleu-detok ce-mean-words \
    --valid-sets devset.{$SRC,$TRG} --valid-translation-output devset.out --quiet-translation \
    --valid-mini-batch 16 --beam-size 1 --normalize 1 \
    --early-stopping 20 \
    --overwrite --keep-best \
    --log train.finetune.log --valid-log valid.finetune.log --quantize-bits 8

Monitor pl-en training

Running in screen 86852 on the second half of alvis. It's currently in student training.

CsEn models moved

The CsEn models have been moved, but the corresponding download scripts still point to the old location.

Remaining QE Models

Location of trained and ready-to-use quality estimation models for Czech, Spanish, and German.

From the meeting it appears that a JSON is provided by email and not the binary. The QE team is expected to check models in to this repository, as they did with the English-Estonian model (#44):

  1. Binary models converted for the missing languages.
  2. Proper documentation somewhere on converting from JSON to binary. Training information would be a good addition. Refer to https://github.com/browsermt/students/blob/master/train-student/README.md, which covers packaging models for machine translation, and repurpose that structure for QE.

Provision of new model(s) for testing multiple model workflows

Need a few more distinct models to test applications enabled by multiple model capability (browsermt/bergamot-translator#209, browsermt/bergamot-translator#210).

It may be desirable to have something compatible with the forward model. Currently there's an archive from here (ende.student.tiny.for.regression.tests) that is kept with separate specs from the usual pool (see #37 (comment)). Wouldn't mind having the same spec as the rest of the models here at the cost of discarding file based shortlists etc.

  • One kind of model required is a backward model (de -> en for the existing en -> de) so that an outbound translation example can be pointed to in code somewhere and run on CI every time (for browsermt/bergamot-translator#78)
  • Another kind of model we will soon require is an (xx->en, en->yy) pair if and when pivoting (browsermt/bergamot-translator#212) comes into the picture.

Monitor fr-en training

It's slowly crunching through forward translation on nanna

for i in /mnt/nanna0/nbogoych/data/data/fr-en/fr-en-prod/translated/corpus/file.*.ref; do f=$(dirname $i)/$(basename $i .ref).nbest; if [ ! -f $f ]; then echo $i; fi; done |wc

Current BLEU implementation prefers empty hypothesis

Hi,

I think there is something wrong with the BLEU implementation in bestbleu.py because it prefers empty hypotheses. For example, having the sentence:

 (" Gjør det fordi jeg sier det. ")

as source, and the reference:

 (" Do it because I say so. ")

and the nbest list as:

0 ||| (“Do it because I say so.”) ||| F0= -5.48577 ||| -0.457148
0 ||| (“Do it because I say it.”) ||| F0= -5.69615 ||| -0.474679
0 ||| (“Do it because I say it.") ||| F0= -6.20291 ||| -0.5639
0 ||| (“Do so because I say so.”) ||| F0= -7.10361 ||| -0.591967
0 ||| (“Do this because I say so.”) ||| F0= -7.25004 ||| -0.60417
0 ||| ("Do it because I say it.") ||| F0= -6.30103 ||| -0.630103
0 ||| (“Do it because I say so.” ||| F0= -7.08298 ||| -0.643907
0 |||  ||| F0= -1.93607 ||| -1.93607

it always chooses the empty sentence. Maybe my teacher model is not that good and is producing a lot of empty hypotheses, so after bestbleu filtering I'm getting a lot of empty lines.

I tried using sacrebleu instead and the problem seems to disappear.

See also some sacrebleu-related fixes in #9.
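For reference, a minimal sketch (not the actual bestbleu.py) of picking the best n-best entry with smoothed sentence-level BLEU via the sacrebleu Python package; smoothing and effective n-gram order keep an empty hypothesis from being preferred when no higher-order n-grams match. The sentences are taken from the example above, and the helper name is illustrative:

import sacrebleu

reference = '(" Do it because I say so. ")'
nbest = [
    '(“Do it because I say so.”)',
    '(“Do it because I say it.”)',
    '',  # the empty hypothesis from the n-best list above
]

def pick_best(reference, hypotheses):
    # sacrebleu.sentence_bleu applies smoothing by default, so the empty
    # hypothesis scores 0 instead of winning by accident.
    return max(hypotheses,
               key=lambda hyp: sacrebleu.sentence_bleu(hyp, [reference]).score)

print(pick_best(reference, nbest))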

Re-train pl-en

Backtranslations were broken. Currently re-running on alvis. Follow progress here:

/fs/surtr0/germann/firefox-translations-training-new/.snakemake/log/2022-03-21T144013.644988.snakemake.log

Training speed Teacher to Student

Is there any data you can share on how long it took to train the student models with the recommended setup of 4 GPUs with 12 GB of memory? (What GPU series and model are we talking about here?)

Months, weeks, days?

I'm interested in potentially contributing in the future but I'd need to know what to expect before getting hardware to do so.

Cheers :)

Negotiate testing requirements

@jerinphilip let's see what you need for the test setup before we redeploy the latest-and-greatest models and break your tests.

Could you give me a list of the files you use, and let me know if you want anything changed?

Post en-nl model

It's in /fs/mimir0/firefox-translations-training and was trained on alvis. Check the quality first.

Binarize QE models and make tool to do that

@mfomicheva e-mailed some JSON files which I have now checked in: 962d419

As @mfomicheva wrote in a follow-up, these need to be binarized. The documentation refers to this:

https://github.com/abarbosa94/bergamot-translator/blob/quality-estimator-update/src/tests/units/quality_estimator_tests.cpp#L43

Can we get a proper tool for that? It doesn't have to be a full-on JSON parsing thing, but the values shouldn't be hard-coded in the C++. Binarized models are more important though.
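As a rough starting point, a hypothetical sketch of such a tool, assuming the JSON holds a flat array of logistic-regression weights plus an intercept and the consumer expects contiguous little-endian float32 values; the actual field names and binary layout would need to match whatever the QE JSON and quality_estimator_tests.cpp really use:

# Hypothetical JSON -> binary converter. The field names ("weights",
# "intercept") and the flat little-endian float32 layout are assumptions
# made for illustration only.
# Usage: python binarize_qe.py model.json model.bin
import json
import struct
import sys

def binarize(json_path, bin_path):
    with open(json_path, encoding="utf-8") as f:
        model = json.load(f)
    values = list(model["weights"]) + [model["intercept"]]
    with open(bin_path, "wb") as out:
        out.write(struct.pack(f"<{len(values)}f", *values))

if __name__ == "__main__":
    binarize(sys.argv[1], sys.argv[2])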

Inconsistent handling of Markdown syntax

Originally opened in the Firefox Translations repo as issue 499

Describe the bug
Markdown formatting will in most cases not survive the translation process, either being mangled, closed improperly, mapped to other characters or outright omitted in the final translation.

This is mostly a nice-to-have; some parts of the syntax will never translate properly (e.g. code blocks and quotes). I also fully realize that this is a rather niche use case and might degrade overall translation performance.

Related issues
mozilla/firefox-translations#486

Potential solution
Include Markdown syntax as part of the training pipeline, similar to what was mentioned in the issue above.


Example

Test environment

Translations were run with a simplified BergamotWorker script and WASM, without postprocessing.
Firefox Translation Models (version 0.3.3) were used as the translation models.

Example

Markdown snippet used for testing, covering most aspects of the syntax

--- 
# Markdown test
This is a *test* to see how **well** Bergamot _handles_ the [Markdown](https://www.markdownguide.org/) syntax. 

1. The **bergamot orange**, is a fragrant citrus fruit the size of an orange
2. Has a *yellow* or *green* color similar to a lime, depending on ripeness

- The word bergamot is derived from the Italian word _bergamotto_
	- It is a small tree that blossoms during the winter

```js
variable = 10
if (variable == "10")
	variable = "10" + 1
```

> “Beware of bugs in the above code; I have only proved it correct, not tried it.”
> — Donald E. Knuth.
--- 

Some failing examples

French:

  • # turns into -
  • ** disappears or gets changed into a single quote

Dutch:

  • --- turns into -- ---
  • # turns into
  • Numbering gets repeated

German:

  • # turns into -
  • _ turns into "

Clarify licensing terms for models

catalog-entry.yml has licensing info, at least for some models; model-info.json appears not to have it.

  1. Is there any particular reason that some models are CC-BY-SA-NC (e.g., bg<->en)? I'd prefer CC-BY-SA in most cases, unless there is a particular reason to make them *-NC. If there is a reason (e.g., corpora used), this should be documented.
  2. If catalog-entry.yml is removed, licensing terms should be documented elsewhere (e.g. in model-info.json).

On a marginal note: as a rule of thumb I prefer YAML over JSON, because the former allows comments whereas the latter, to the best of my knowledge, doesn't.
