This project forked from nusnlp/c2e-mt-benchmark

Chinese-to-English Machine Translation Benchmark

License: BSD 3-Clause "New" or "Revised" License

Code and pre-trained models for the Chinese-to-English machine translation benchmark.

Setup

First, clone this repository and its submodules:

git clone https://github.com/nusnlp/c2e-mt-benchmark.git
cd c2e-mt-benchmark
git submodule update --init --recursive

Second, go to each subdirectory under tools/ and follow its setup/installation instructions.

Finally, download and unpack the pre-trained models to the models/ subdirectory:

cd models/
wget http://sterling8.d2.comp.nus.edu.sg/~christian/c2e-mt-benchmark/pretrained.tar.gz
tar -xvzf pretrained.tar.gz
cd ..

Translating Text

The input is a plain text file containing Chinese sentences, one sentence per line. The input file is passed through the following pipeline:

  1. Chinese word segmentation, by running scripts/segment.sh < input > input.seg
  2. Translation (ensure that the Theano flags are set as environment variables; replace nist with unpc to use the models trained on the UN Parallel Corpus)
    • without re-ranking: scripts/translate-norerank.sh nist input.seg output [device(s)], where the optional device(s) argument is, for example, "gpu0" or "gpu0 gpu1", and defaults to "cpu"
    • with re-ranking: scripts/translate-rerank.sh nist input.seg output [device(s)]
  3. Recasing, by running scripts/recase.sh < output > output.rc
  4. Detokenization, by running perl scripts/detokenizer.perl -l en < output.rc > output.detok
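
The four steps above can be chained in a small wrapper. The following is only a sketch under stated assumptions, not part of the repository: the function name translate_file, the DRY_RUN switch, and the intermediate file names derived from the input and output paths are hypothetical. It also assumes that the Theano flags are supplied through the THEANO_FLAGS environment variable, that multiple devices are passed as separate arguments, and that it is run from the repository root.

```shell
#!/usr/bin/env bash
# Sketch of the segmentation -> translation -> recasing -> detokenization pipeline.
# Hypothetical helper: translate_file and DRY_RUN are not provided by the repository.

translate_file() {
  if [ "$#" -lt 3 ]; then
    echo "usage: translate_file <nist|unpc> <input> <output> [device ...]" >&2
    return 2
  fi
  local model="$1" input="$2" output="$3"
  shift 3  # any remaining arguments are treated as devices, e.g. gpu0 gpu1

  # Use scripts/translate-norerank.sh here to skip re-ranking.
  local translate="scripts/translate-rerank.sh"

  # Assumption: Theano reads its settings from THEANO_FLAGS.
  if [ -z "${THEANO_FLAGS:-}" ]; then
    echo "warning: THEANO_FLAGS is not set" >&2
  fi

  # With DRY_RUN=1, only print the commands that would be run.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "scripts/segment.sh < $input > $input.seg"
    echo "$translate $model $input.seg $output $*"
    echo "scripts/recase.sh < $output > $output.rc"
    echo "perl scripts/detokenizer.perl -l en < $output.rc > $output.detok"
    return 0
  fi

  # Each step runs only if the previous one succeeded.
  scripts/segment.sh < "$input" > "$input.seg" &&
  "$translate" "$model" "$input.seg" "$output" "$@" &&
  scripts/recase.sh < "$output" > "$output.rc" &&
  perl scripts/detokenizer.perl -l en < "$output.rc" > "$output.detok"
}

# Example invocation (commented out so that sourcing this file runs nothing):
# translate_file nist input output gpu0
```

Running it first with DRY_RUN=1 prints the four commands without executing them, which makes it easy to check the file names before starting a long translation job. If no device is given, none is passed on, so the translation script falls back to its own default.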

Test Set Translation Outputs

The outputs/ subdirectory contains the translation outputs produced by our models.

Scoreboard

A comparison of the BLEU scores on the NIST test sets achieved by our model with those achieved by prior published work is available here.

Publication

If you use the pre-trained models and settings from this repository, please cite the following paper:

Hadiwinoto, Christian and Ng, Hwee Tou (2018). Upping the ante: Towards a better benchmark for Chinese-to-English machine translation. To appear in Proceedings of the 11th edition of the Language Resources and Evaluation Conference. Miyazaki, Japan.
