alphadl / mono4simt

This project forked from hexuandeng/mono4simt

Implementation of the paper “Improving Simultaneous Machine Translation with Monolingual Data”.

License: Apache License 2.0

Shell 17.01% Python 71.61% Jupyter Notebook 11.38%

mono4simt's Introduction

Improving Simultaneous Machine Translation with Monolingual Data

Setup

  1. Install fairseq. Stick to the specified checkout version to avoid compatibility issues.
git clone https://github.com/pytorch/fairseq.git
cd fairseq
git checkout 8b861be
python setup.py build_ext --inplace
pip install .
  2. (Optional) Install apex for faster mixed precision (fp16) training.

  3. Install dependencies (clone them into the utility folder if possible).

pip install -r requirements.txt

For the installation guide, see extra_installation.
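
Before moving on, it can help to confirm that the packages above are visible to the Python interpreter you will train with. The following is a minimal sketch and not part of the repository: the helper name `installed_versions` is hypothetical, it uses only the standard library, and it assumes Python 3.8 or newer.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # apex is optional, so a missing entry for it is not an error.
    for name, version in installed_versions(["fairseq", "apex"]).items():
        print(f"{name}: {version or 'not installed'}")
```

This reports only what is installed; it does not verify that fairseq was built from the 8b861be checkout.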

Data Preparation

All the corresponding shell scripts are in the data folder.

  1. To download the datasets, get the cleaned dataset from Google Drive, or run the scripts whose names begin with 0.
cd data
bash 0-get_data_cwmt.sh
bash 0-get_en_mono.sh
  2. After distilling, run 1-preprocess-distill.py to preprocess the distilled data, and then run the scripts whose names begin with 2 to calculate the corresponding scores.
cd data
python 1-preprocess-distill.py
bash 2-train_align.sh
bash 2-train_kenlm.sh
bash 2-fast-align.sh
bash 2-k-anticipation.sh
python 2-get_uncertainty.py
  3. Finally, run 3-scoring_preprocessing.py to calculate the score of the distilled data and extract the data according to the metrics we propose.
cd data
python 3-scoring_preprocessing.py

Note that you need to change the data paths manually.
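
The scoring scripts above rely on word alignments to measure how much a target sentence forces a simultaneous model to anticipate source words. As a rough illustration only, the sketch below computes a k-anticipation rate from one alignment line in the `i-j` format written by fast_align (source-target positions, 0-indexed). The definition used here, that a target word counts as anticipated when it is aligned to a source word a wait-k reader would not have read yet, and the function name `k_anticipation_rate` are assumptions for illustration; 2-k-anticipation.sh may compute the score differently.

```python
def k_anticipation_rate(alignment_line, target_length, k):
    """Fraction of target words aligned to a source word not yet read under wait-k.

    alignment_line: space-separated "i-j" pairs, 0-indexed source-target positions.
    Assumed definition: target position j is k-anticipated if some aligned
    source position i satisfies i >= j + k.
    """
    if target_length <= 0:
        return 0.0
    anticipated = set()
    for pair in alignment_line.split():
        src, tgt = pair.split("-")
        i, j = int(src), int(tgt)
        if i >= j + k:
            anticipated.add(j)
    return len(anticipated) / target_length


# Target word 0 is aligned to source word 3, which wait-1 has not read yet.
print(k_anticipation_rate("3-0 0-1 1-2 2-3", target_length=4, k=1))  # 0.25
```

Under this definition, a monotone alignment scores 0, and heavier reordering raises the rate.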

Training

We need a full-sentence model as the teacher for sequence-level knowledge distillation (sequence-KD).

The following command will train the teacher model.

cd train/cwmt-enzh
bash 0-teacher.sh

To distill the training set, run

cd train/cwmt-enzh
bash 0-distill_enzh_mono.sh

We provide our dataset, including the distilled set and the pseudo-reference set, for easier reproducibility.
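
Sequence-KD trains the student on the teacher's translations instead of the original references. As a rough illustration of that idea (not the contents of 0-distill_enzh_mono.sh), the sketch below pairs each source sentence with the teacher hypothesis from output in the style of `fairseq-generate`, where `S-<id>` lines carry the source and `H-<id>` lines carry a score and the hypothesis, separated by tabs. The function name `distilled_pairs` and the sample lines are hypothetical.

```python
def distilled_pairs(lines):
    """Return (source, teacher_hypothesis) pairs ordered by sentence id."""
    sources, hypotheses = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        tag, _, sent_id = fields[0].partition("-")
        if not sent_id.isdigit():
            continue  # skip log lines and anything that is not "<tag>-<id>"
        if tag == "S" and len(fields) >= 2:
            sources[int(sent_id)] = fields[1]
        elif tag == "H" and len(fields) >= 3:
            # Keep only the first hypothesis listed for each sentence.
            hypotheses.setdefault(int(sent_id), fields[2])
    return [(sources[i], hypotheses[i]) for i in sorted(sources) if i in hypotheses]


# Made-up sample lines, for illustration only.
sample = [
    "S-1\tgood morning\n",
    "H-1\t-0.3\t早上 好\n",
    "S-0\thello\n",
    "H-0\t-0.1\t你好\n",
]
for source, hypothesis in distilled_pairs(sample):
    print(source, "->", hypothesis)
```

The resulting pairs would then go through the same preprocessing as ordinary parallel data.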

We can now train the vanilla wait-k model. To do this, run

bash 1b-distill_all_wait_k.sh generate/teacher_cwmt_mono/data-bin 3_anticipation_rate_low_chunking_LM_filter

3_anticipation_rate_low_chunking_LM_filter is the default name of our best strategy. Change this field to train wait-k on any other dataset (use raw for the original bilingual datasets).
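
For readers unfamiliar with the policy, wait-k first reads k source tokens and then alternates between writing one target token and reading one source token until the source is exhausted. The sketch below only illustrates that read/write schedule; it is not the model code run by 1b-distill_all_wait_k.sh, and the function name `wait_k_schedule` is hypothetical.

```python
def wait_k_schedule(source_length, target_length, k):
    """Number of source tokens read before each target token is written.

    Implements g(t) = min(k + t - 1, source_length) for t = 1..target_length.
    """
    return [min(k + t - 1, source_length) for t in range(1, target_length + 1)]


print(wait_k_schedule(source_length=5, target_length=6, k=3))  # [3, 4, 5, 5, 5, 5]
```

A smaller k lowers latency, but the model sees less source context before each target token.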

Our models are released on Google Drive.

Evaluation (SimulEval)

Install SimulEval.

full-sentence model

cd train/cwmt-enzh
bash 2-test_model_full.sh

wait-k models

cd train/cwmt-enzh
bash 2-test_model.sh 3_anticipation_rate_low_chunking_LM_filter

Change 3_anticipation_rate_low_chunking_LM_filter to run the evaluation on any other dataset (use raw for the original bilingual datasets).

Alternatively, to evaluate all subsets at once, simply run:

cd train
python get_score.py

mono4simt's People

Contributors

hexuandeng
