bitext-tokaligner

This script is a simple algorithm that aligns, at word level, a source sentence and its human-generated reference, using a raw translation as pivot. It was designed to extract new phrase pairs in a real-time post-editing context. Its ease of use and low running time make 'bitext-tokaligner' well suited to this purpose.

Note that 'bitext-tokaligner' expects a single sentence on its standard input. Hence, if you want to use it to align a whole "bitext", for instance when the reviewing process is simulated, use one line per sentence.

-- How does it work?

It is a three-step process:

1. The bitext-tokaligner needs to know, for each word of the raw translation, which word of the source sentence it is associated with.

To do so, when using the Moses toolkit for the translation step, ask the decoder to print out the word-level alignments with the option -alignment-output-file [src-to-trans.align] (see the Moses documentation for more details).

Note that this step adds virtually no extra time: the phrase table (binary or textual) already includes word-to-word alignments between source and target phrases, so Moses can simply report them in its output.
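As an illustration of what step 1 provides, here is a minimal Python sketch that reads one line of word alignments and indexes it by translation word. It assumes the file holds one line per sentence, made of space-separated "i-j" pairs where i is a source index and j a translation index; check this against the output of your Moses version. The function name parse_alignment_line is hypothetical and is not part of the bitext-tokaligner.

```python
from collections import defaultdict


def parse_alignment_line(line):
    """Map each translation word index to the sorted source indices linked to it.

    Assumed input format (to be verified against your decoder output):
    space-separated "src-trans" index pairs, e.g. "0-0 1-2 2-1".
    """
    trans_to_src = defaultdict(list)
    for pair in line.split():
        src, trans = pair.split("-")
        trans_to_src[int(trans)].append(int(src))
    return {t: sorted(s) for t, s in trans_to_src.items()}


if __name__ == "__main__":
    # Source words 1 and 3 are both linked to translation word 2.
    print(parse_alignment_line("0-0 1-2 2-1 3-2"))
```

Translation words that never appear in a pair are simply absent from the returned dictionary, which makes unaligned words easy to detect.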

2. Compute the edit distance (Levenshtein) between the raw translation and the reference, using the TER algorithm (Snover et al.).

To do so, use TERcpp, a C++ implementation of TER written by Christophe Servan:

$PATH/tercpp-bin --noTxtIds --printAlignments -r [ref] -h [trans]

A file named [trans].output.alignments containing the TER information will be created. Note that the bitext-tokaligner has been tested with TERcpp v0.6.2.

This step is the most time-consuming part of the process. Nevertheless, TERcpp requires less than 5 seconds to compute the edit path over a test set of 1.5k sentence pairs (about 35k words).
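To make the idea of an edit path concrete, the following Python sketch computes a plain word-level Levenshtein distance and backtraces it into translation-to-reference links. This is a simplification and not TERcpp itself: TER also allows block shifts, which are ignored here. The function name edit_alignment and the choice to link substituted words are assumptions made for this example.

```python
def edit_alignment(hyp, ref):
    """Return (distance, links) between two token lists.

    links holds (hyp_index, ref_index) pairs for matched or substituted words;
    inserted and deleted words stay unaligned. No TER block shifts are handled.
    """
    n, m = len(hyp), len(ref)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # hyp word deleted
                             dist[i][j - 1] + 1)         # ref word inserted
    # Backtrace from the bottom-right cell to recover one optimal edit path.
    links = []
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if hyp[i - 1] == ref[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            links.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    links.reverse()
    return dist[n][m], links


if __name__ == "__main__":
    hyp = "the cat sat on mat".split()
    ref = "the cat sat on the mat".split()
    print(edit_alignment(hyp, ref))
```

When several edit paths have the same cost, this backtrace prefers the diagonal move, so the links it returns are one optimal alignment among possibly many.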

3. Finally, run the script 'bitext-tokalign.pl' on the alignment information produced in the previous steps to deduce a word-level source-to-reference alignment, as follows:

$PATH/bitext-tokalign.pl [src-to-trans.align] [trans-to-ref.align] > [src-to-ref.align]

Applied to the same test set as in step 2, 'bitext-tokalign.pl' was measured to require less than 1 second to compute the src-to-ref alignment from the src-to-trans and trans-to-ref alignments.
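The pivoting idea behind step 3 can be sketched in a few lines of Python: a source word is linked to a reference word whenever both are linked to the same translation word. This is only an illustration under simplifying assumptions (alignments given as lists of index pairs, no special handling of shifts or unaligned words); the actual logic of 'bitext-tokalign.pl' may differ. The function name compose_alignments is hypothetical.

```python
def compose_alignments(src_to_trans, trans_to_ref):
    """Compose two word alignments through the translation (pivot) side.

    src_to_trans: iterable of (src_index, trans_index) pairs
    trans_to_ref: iterable of (trans_index, ref_index) pairs
    Returns the sorted list of (src_index, ref_index) pairs.
    """
    # Index the second alignment by translation word for fast lookup.
    refs_by_trans = {}
    for t, r in trans_to_ref:
        refs_by_trans.setdefault(t, []).append(r)

    src_to_ref = set()
    for s, t in src_to_trans:
        # A translation word with no reference link contributes nothing.
        for r in refs_by_trans.get(t, []):
            src_to_ref.add((s, r))
    return sorted(src_to_ref)


if __name__ == "__main__":
    src_to_trans = [(0, 0), (1, 2), (2, 1)]
    trans_to_ref = [(0, 0), (1, 1), (2, 3)]
    print(compose_alignments(src_to_trans, trans_to_ref))
```

Using a set removes duplicate links that arise when a source word reaches the same reference word through several translation words.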

-- Reference

If you use the bitext-tokaligner for your work, please cite this article in your publications:

Frederic Blain, Holger Schwenk, Jean Senellart, "Incremental Adaptation Using Translation Information and Post-Editing Analysis", Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Hong Kong, China, December 2012.

-- Acknowledgment

This research was partially financed by the DGA and the ANRT under CIFRE-Defense 7/2009, the French ANR project COSMAT under ANR-09-CORD-004, and the European Commission under the project MATECAT, ICT-2011.4.2–287688.
