dpe's Introduction

Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation

Descriptions

This repo contains the source code and pre-processed corpora for "Dynamic Programming Encoding (DPE) for Subword Segmentation in Neural Machine Translation", accepted to ACL 2020 (paper).

Dependencies

Usage

git clone https://github.com/pytorch/fairseq.git
git clone https://github.com/xlhex/dpe.git

# check out revision 58b912f
cd fairseq
git checkout 58b912f

# copy files from dpe to fairseq
cp -r ../dpe/fairseq ./ # overwrite any conflicting files
cp ../dpe/*py ./
cp ../dpe/*sh ./

Data Preparation

  • Tokenize your corpus with any tokenizer (we use the MOSES toolkit)
  • Segment the tokenized corpus with sentencepiece in bpe mode (see seg_data.py)
  • Use fairseq to build the bpe dictionaries: dict.{src}.txt and dict.{tgt}.txt
  • Build the char dictionary: dict.{tgt}.in.txt (see build_dict.py)
  • Keep the dataset in plain text format: {train/valid/test}.src-tgt.{src/tgt}, where src and tgt are your source and target languages respectively (for example, train.en-de.en and train.en-de.de for an English-German pair)
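
The char dictionary step can be sketched as follows. This is a minimal illustration, not the repo's build_dict.py: it assumes the dictionary uses the plain-text fairseq layout of one `symbol count` pair per line, and the function names `build_char_dict` and `format_dict` are hypothetical. Any special markers or end-of-word symbols that build_dict.py may add are not reproduced here.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def build_char_dict(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count the characters of whitespace-separated tokens (spaces are skipped)."""
    counts: Counter = Counter()
    for line in lines:
        for token in line.split():
            counts.update(token)  # adds one count per character in the token
    # Most frequent first; ties are broken by character for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def format_dict(entries: List[Tuple[str, int]]) -> str:
    """Render entries as 'symbol count' lines (assumed dictionary layout)."""
    return "\n".join(f"{sym} {cnt}" for sym, cnt in entries)


if __name__ == "__main__":
    sample = ["ab ba", "a c"]  # stand-in for the lines of a target-side text file
    print(format_dict(build_char_dict(sample)))
```

To use it on real data, read the target-side training file line by line, pass the lines to `build_char_dict`, and write the formatted result to dict.{tgt}.in.txt.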

Training

Before segmenting your corpus, you need to train a DPE segmenter:

# SRC: source language
# TGT: target language
# SEED: a seed for reproducibility
sh run_batch SRC TGT SEED

MAP Inference

To segment a corpus with the trained segmenter:

# SRC: source language
# TGT: target language
# SEED: a seed for reproducibility
sh seg_batch.sh SRC TGT SEED
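
For intuition only, the idea of picking the single highest-scoring segmentation with dynamic programming can be sketched on a toy example. This is not the DPE model or what seg_batch.sh runs: the scores below come from a made-up, context-free log-probability table, and `map_segment`, `toy_logp`, and `max_len` are hypothetical names introduced for illustration.

```python
import math
from typing import Dict, List, Tuple


def map_segment(word: str, logp: Dict[str, float], max_len: int = 8) -> Tuple[List[str], float]:
    """Return the highest-scoring split of `word` into pieces found in `logp`."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any split of word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError(f"no segmentation of {word!r} with the given vocabulary")
    # Follow the back-pointers from the end of the word to recover the pieces.
    pieces: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    pieces.reverse()
    return pieces, best[n]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    toy_logp = {"un": -1.0, "related": -2.0, "rel": -2.5, "ated": -2.5}
    print(map_segment("unrelated", toy_logp))
```

The table `best` stores the best score for each prefix, so every prefix is scored once instead of enumerating all possible splits.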

Machine Translation

Once your corpus is segmented, you can use your favourite MT toolkit to train an MT system. We use fairseq for our experiments.

  • source sentences can be segmented by one of the following segmentation algorithms:
    • bpe
    • unigram
    • bpe-dropout
    • dpe
  • target sentences are dpe segmented

Segmented Corpora

Citation

Please cite as:

@inproceedings{he2020-dynamic,
    title = "Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation",
    author = "He, Xuanli  and
      Haffari, Gholamreza  and
      Norouzi, Mohammad",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.275",
    doi = "10.18653/v1/2020.acl-main.275",
    pages = "3042--3051",
}

dpe's Issues

Hungarian and Romanian data

Thank you for an interesting paper and linking the datasets you used for En-{Fi,Et,De}. I was wondering if the data for En-Hu and En-Ro are available at all? Thanks!
