This project forked from neulab/word-embeddings-for-nmt

Supplementary material for "When and Why Are Pre-trained Word Embeddings Useful for Neural Machine Translation?" at NAACL 2018

When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?

This page describes the code and the TED talks dataset used for the experiments in the above paper.

The content can also be found at https://github.com/neulab/word-embeddings-for-nmt.

Contents

Software:

We used XNMT at commit 38044b3 for all experiments.

Experiments:
Data Processing:

To run the experiments, we collected (in early 2017) a common corpus of TED talks that have been translated into many low-resource languages. Under the Open Translation Project, TED talk transcripts are available for more than 2,400 talks in 109 languages. The figure below shows a histogram of the total number of talks per language (identified by its ISO code) in the original dataset.

Figure: TED Talks statistics

To obtain a parallel corpus for the experiments, we preprocessed the dataset with the Moses tokenizer and used hard punctuation symbols to identify valid sentence boundaries for English. To create the train, dev and test sets, we applied a greedy selection algorithm based on the popularity of the talks and selected disjoint talks for each split. We selected talks that had translations in more than 50 languages. Finally, we selected 60 languages that had sufficient data for meaningful experiments. The train, dev and test splits for the most common talks are also shown in the table alongside the above figure.
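As a rough illustration of the sentence-boundary step, the sketch below groups already-tokenized English text into sentences that end at a hard punctuation token. The set of symbols and the function name `split_on_hard_punctuation` are assumptions made for this example; the exact symbols and procedure used for the dataset are not specified here.

```python
# Assumption: the source does not list the "hard" punctuation symbols;
# ".", "!" and "?" are used here purely for illustration.
HARD_PUNCTUATION = {".", "!", "?"}


def split_on_hard_punctuation(tokens):
    """Group a list of tokens into sentences that end at hard punctuation."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in HARD_PUNCTUATION:
            # A hard punctuation token closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that have no closing punctuation.
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    # Whitespace-separated tokens, as a tokenizer would produce them.
    tokens = "Thank you . Any questions ?".split()
    for sentence in split_on_hard_punctuation(tokens):
        print(" ".join(sentence))
```

The function expects tokenized input, so punctuation must already be separated from words, which is what a tokenizer such as Moses would provide.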

  • The train, dev and test splits for the above TED talks: ted_talks.tar.gz.
  • ted_reader.py is a sample Python script for reading the TED talks data. A usage example is shown in the "main" section of the script.
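The greedy creation of disjoint splits described above could look roughly like the following sketch. It is not the script used to build the released splits. It assumes that "popularity" means the number of languages a talk has been translated into, and the function name `greedy_split`, the split sizes, and the toy data are all hypothetical.

```python
def greedy_split(talk_languages, min_languages=50, dev_size=1, test_size=1):
    """Assign talks to disjoint train/dev/test splits, most popular first.

    talk_languages maps a talk id to the set of languages it is available in.
    Assumption: popularity is measured by the number of languages per talk.
    """
    # Keep only talks translated into more than `min_languages` languages.
    eligible = [
        talk for talk, langs in talk_languages.items()
        if len(langs) > min_languages
    ]
    # Rank by popularity (descending); break ties by talk id for determinism.
    ranked = sorted(eligible, key=lambda talk: (-len(talk_languages[talk]), talk))

    # Greedily fill test, then dev; everything left goes to train.
    test = ranked[:test_size]
    dev = ranked[test_size:test_size + dev_size]
    train = ranked[test_size + dev_size:]
    return {"train": train, "dev": dev, "test": test}


if __name__ == "__main__":
    # Toy data: language sets are stand-ins built from integer ranges.
    talks = {
        "talk_a": set(range(60)),
        "talk_b": set(range(55)),
        "talk_c": set(range(52)),
        "talk_d": set(range(40)),  # dropped: not more than 50 languages
    }
    print(greedy_split(talks))
```

Because each talk is placed in exactly one slice of the ranked list, no talk can appear in more than one split.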

If you use the dataset or code, please consider citing the paper using the following BibTeX entry:

BibTeX

@inproceedings{Ye2018WordEmbeddings,
  author    = {Qi, Ye and Sachan, Devendra and Felix, Matthieu and Padmanabhan, Sarguna and Neubig, Graham},
  title     = {When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?},
  booktitle = {HLT-NAACL},
  year      = {2018},
}
