Improving Simultaneous Machine Translation with Monolingual Data

Setup

Install fairseq. Stick to the specified checkout version to avoid compatibility issues.

git clone https://github.com/pytorch/fairseq.git
cd fairseq
git checkout 8b861be
python setup.py build_ext --inplace
pip install .
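
Because later steps depend on the pinned checkout, it can help to confirm that the clone really sits on commit 8b861be before building anything else. The following is a minimal sketch, not part of this repository: the helper names `commit_matches` and `current_commit` are hypothetical, and it assumes `git` is on the PATH.

```python
import subprocess

EXPECTED_COMMIT = "8b861be"  # abbreviated hash from the setup steps above


def commit_matches(head_sha: str, expected_prefix: str = EXPECTED_COMMIT) -> bool:
    """Return True if a full commit hash starts with the expected abbreviated hash."""
    return head_sha.strip().lower().startswith(expected_prefix.lower())


def current_commit(repo_dir: str = ".") -> str:
    """Ask git for the commit hash currently checked out in repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Run from inside the fairseq folder, `commit_matches(current_commit())` should return `True` if the checkout step succeeded.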

(Optional) Install apex for faster mixed precision (fp16) training.
Install dependencies (clone them into the utility folder if possible).

pip install -r requirements.txt

For the installation guide, see extra_installation.

Data Preparation

All the corresponding bash scripts are in the data folder.

To get the datasets, download the cleaned dataset from Google Drive, or run the bash scripts whose names begin with 0.

cd data
bash 0-get_data_cwmt.sh
bash 0-get_en_mono.sh

After distilling, run 1-preprocess-distill.py to preprocess the distilled data, and then run the bash and Python scripts whose names begin with 2 to calculate the corresponding scores.

cd data
python 1-preprocess-distill.py
bash 2-train_align.sh
bash 2-train_kenlm.sh
bash 2-fast-align.sh
bash 2-k-anticipation.sh
python 2-get_uncertainty.py
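
The scoring scripts themselves are not reproduced here. As a rough illustration of what a k-anticipation score can measure, the sketch below assumes word alignments in the Pharaoh "i-j" format that fast_align emits (0-based source-target index pairs) and counts a target token as k-anticipated when it is aligned to a source token at index i >= j + k, that is, a source token a wait-k model would not have read yet. Both this definition and the function name `k_anticipation_rate` are assumptions made for illustration, not taken from 2-k-anticipation.sh.

```python
def k_anticipation_rate(alignment: str, target_len: int, k: int) -> float:
    """Fraction of target tokens aligned to a source token not yet read under wait-k.

    alignment: Pharaoh-format string such as "0-0 1-2 2-1" (source-target, 0-based).
    target_len: number of tokens in the target sentence.
    """
    if target_len <= 0:
        return 0.0
    anticipated = set()
    for pair in alignment.split():
        src, tgt = (int(x) for x in pair.split("-"))
        # When target token tgt is written, wait-k has read source tokens 0..tgt+k-1,
        # so any aligned source index >= tgt + k lies in the unread future.
        if src >= tgt + k:
            anticipated.add(tgt)
    return len(anticipated) / target_len
```

For the alignment "0-0 1-1 3-2 2-3" with four target tokens, only target token 2 is anticipated at k = 1, giving a rate of 0.25; at k = 2 nothing is anticipated and the rate is 0.0.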

Finally, run 3-scoring_preprocessing.py to calculate the score of the distilled data and extract the data according to the metrics we propose.

cd data
python 3-scoring_preprocessing.py

Note that you need to change the data path manually.

Training

We need a full-sentence model as the teacher for sequence-level knowledge distillation (sequence-KD).

The following command will train the teacher model.

cd train/cwmt-enzh
bash 0-teacher.sh

To distill the training set, run

cd train/cwmt-enzh
bash 0-distill_enzh_mono.sh

We provide our dataset, including the distilled set and the pseudo-reference set, for easier reproducibility.

We can now train a vanilla wait-k model. To do this, run

bash 1b-distill_all_wait_k.sh generate/teacher_cwmt_mono/data-bin 3_anticipation_rate_low_chunking_LM_filter

3_anticipation_rate_low_chunking_LM_filter is the default name of our best strategy. Change this field to run wait-k on any other dataset (use raw for the original bilingual datasets).
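
For readers unfamiliar with the policy being trained: a wait-k model first reads k source tokens, then alternates between writing one target token and reading one source token, and once the source is exhausted it writes the remaining target tokens. The sketch below only illustrates that read/write schedule. The function name `wait_k_schedule` is hypothetical and is not part of the training scripts.

```python
from typing import List


def wait_k_schedule(source_len: int, target_len: int, k: int) -> List[str]:
    """Return the READ/WRITE action sequence of a wait-k policy."""
    actions: List[str] = []
    read = written = 0
    while written < target_len:
        # Before writing target token number `written` (0-based), the policy
        # wants min(written + k, source_len) source tokens to have been read.
        if read < min(written + k, source_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions
```

With four source tokens, four target tokens and k = 2, `wait_k_schedule(4, 4, 2)` yields READ, READ, WRITE, READ, WRITE, READ, WRITE, WRITE.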

Our models are released on Google Drive.

Evaluation (SimulEval)

Install SimulEval.

full-sentence model

cd train/cwmt-enzh
bash 2-test_model_full.sh

wait-k models

cd train/cwmt-enzh
bash 2-test_model.sh 3_anticipation_rate_low_chunking_LM_filter

Change 3_anticipation_rate_low_chunking_LM_filter to run the evaluation on any other dataset (use raw for the original bilingual datasets).

Alternatively, to evaluate all subsets at once, simply run:

cd train
python get_score.py
