
Implementation of our paper "Self-training Sampling with Monolingual Data Uncertainty for Neural Machine Translation" to appear in ACL-2021.


UncSamp: Self-training Sampling with Monolingual Data Uncertainty for Neural Machine Translation

Implementation of our paper "Self-training Sampling with Monolingual Data Uncertainty for Neural Machine Translation" to appear in ACL 2021. [paper]

Brief Introduction

Self-training has proven effective for improving NMT performance by augmenting model training with synthetic parallel data. The common practice is to construct synthetic data based on a randomly sampled subset of large-scale monolingual data, which we empirically show is sub-optimal. In this work, we propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data. To this end, we compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data. Intuitively, monolingual sentences with lower uncertainty generally correspond to easy-to-translate patterns which may not provide additional gains. Accordingly, we design an uncertainty-based sampling (UncSamp) strategy to efficiently exploit the monolingual data for self-training, in which monolingual sentences with higher uncertainty would be sampled with higher probability. Experimental results on large-scale WMT English⇒German and English⇒Chinese datasets demonstrate the effectiveness of the proposed method. Extensive analyses provide a deeper understanding of how the proposed method improves the translation performance.

Figure 1: The framework of self-training with uncertainty-based sampling.
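The sampling idea described above can be illustrated with a minimal sketch. It is not the repository's implementation. It assumes a hypothetical dictionary format (source word mapped to a distribution over target words), scores a word by the entropy of that distribution, scores a sentence by the average word entropy, and then draws sentences without replacement with probability increasing in that score. The function names, the handling of unknown words, and the exact weighting scheme are all assumptions made for illustration.

```python
import math
import random
from typing import Dict, List, Sequence

# Hypothetical format: source word -> {target word: translation probability}.
# In practice such a dictionary would be extracted from the parallel data.
TransDict = Dict[str, Dict[str, float]]


def word_entropy(word: str, trans_dict: TransDict) -> float:
    """Entropy of the translation distribution of a single source word."""
    dist = trans_dict.get(word)
    if not dist:
        # Assumption: words missing from the dictionary contribute zero.
        return 0.0
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def sentence_uncertainty(tokens: Sequence[str], trans_dict: TransDict) -> float:
    """Average word-level entropy over the tokens of a sentence."""
    if not tokens:
        return 0.0
    return sum(word_entropy(t, trans_dict) for t in tokens) / len(tokens)


def sample_by_uncertainty(
    sentences: Sequence[str], trans_dict: TransDict, k: int, seed: int = 0
) -> List[str]:
    """Draw k sentences without replacement, favoring higher uncertainty."""
    rng = random.Random(seed)
    scores = [sentence_uncertainty(s.split(), trans_dict) for s in sentences]
    # Small floor so that zero-uncertainty sentences keep a nonzero weight.
    weights = [score + 1e-6 for score in scores]
    # Weighted sampling without replacement: give each item the key
    # u ** (1 / w) with u uniform in [0, 1), then keep the k largest keys.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    chosen = sorted(keys, reverse=True)[:k]
    return [sentences[i] for _, i in chosen]


if __name__ == "__main__":
    toy_dict = {
        "the": {"die": 1.0},
        "bank": {"Bank": 0.5, "Ufer": 0.5},
    }
    mono = ["the bank", "the the", "bank bank"]
    for sent in mono:
        print(sent, round(sentence_uncertainty(sent.split(), toy_dict), 4))
    print(sample_by_uncertainty(mono, toy_dict, k=2))
```

In this toy setup, "bank" has two equally likely translations and therefore nonzero entropy, while "the" has a single translation and zero entropy, so sentences containing "bank" are more likely to be selected for self-training.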

Reference Performance

We evaluate the proposed UncSamp approach on two high-resource translation tasks. As shown in Table 1, our Transformer-Big models trained on the authentic parallel data achieve performance competitive with, or even better than, the submissions to WMT competitions. On top of these strong baselines, self-training with random sampling (RandSamp) improves performance by +2.0 and +0.9 BLEU points on the En⇒De and En⇒Zh tasks, respectively, demonstrating the effectiveness of large-scale self-training for NMT models.

With our UncSamp approach, self-training achieves a further significant improvement of +1.1 and +0.6 BLEU points over the random sampling strategy, which demonstrates the effectiveness of exploiting uncertain monolingual sentences.

Table 1: Evaluation of translation performance.

Further analyses suggest that our UncSamp approach improves the translation quality of high-uncertainty sentences and also benefits the prediction of low-frequency words on the target side.

Table 2: Analysis for uncertain sentences.

Table 3: Analysis for low-frequency words.

Public Impact

Citation

Please kindly cite our paper if you find it helpful:

@inproceedings{jiao2021self,
  title     = {Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation},
  author    = {Wenxiang Jiao and Xing Wang and Zhaopeng Tu and Shuming Shi and Michael R. Lyu and Irwin King},
  booktitle = {ACL},
  year      = {2021}
}
