wmt2020_biomedical's Introduction

WMT2020_BioMedical_MT

http://www.statmt.org/wmt20/biomedical-translation-task.html

Tencent AI Lab participated in the WMT20 Shared Task on Biomedical Translation in four language directions: German<->English and Chinese<->English. Our German->English and English->German systems were ranked 1st and 3rd, respectively, according to the automatic evaluation.

Data and Pre-trained models

Data provided by organizers

| No. | Corpus | File | En-Zh | En-De | En |
|---|---|---|---|---|---|
| 1 | Biomedical Translation | wmt18/Medline/training/es-en | n/a | n/a | 287811 |
| | | wmt18/Medline/training/fr-en | n/a | n/a | 627576 |
| | | wmt18/Medline/training/pt-en | n/a | n/a | 74645 |
| | | wmt18/Medline/test | - | - | - |
| | | wmt19/Medline/training/de-en | n/a | 40398 | 40398 |
| | | wmt19/Medline/training/fr-en | n/a | n/a | 75049 |
| | | wmt19/Medline/training/es-en | n/a | n/a | 100257 |
| | | wmt19/Medline/training/pt-en | n/a | n/a | 49918 |
| | | wmt19/Medline/test/Medline | - | - | - |
| | | wmt20/Medline/training/it-en | n/a | n/a | 14756 |
| | | wmt20/Medline/training/ru-en | n/a | n/a | 46782 |
| 2 | UFAL Medical Corpus | shuffled.de-en | n/a | 37814533 | n/a |
| 3 | HimL test sets | khresmoi-summary-dev | n/a | 500 | n/a |
| | | khresmoi-summary-test | n/a | 1000 | n/a |
| 4 | Khresmoi development data | himl-test-2015/cochrane | n/a | 1433 | n/a |
| | | himl-test-2015/himl | n/a | 3892 | n/a |
| | | himl-test-2015/nhs24 | n/a | 2459 | n/a |
| | | himl-test-2017/cochrane | n/a | 467 | n/a |

Model hyperparameters

| | Deep Transformer | Hybrid Transformer | Big Transformer | Large Transformer |
|---|---|---|---|---|
| Encoder Layer | 40 | 40 | 6 | 20 |
| Decoder Layer | 6 | 6 | 6 | 6 |
| Attention Heads | 8 | 8 | 16 | 16 |
| Embedding Size | 512 | 512 | 1024 | 1024 |
| FFN Size | 2048 | 2048 | 4096 | 4096 |

Pre-trained models

| | Deep Transformer | Hybrid Transformer | Big Transformer | Large Transformer |
|---|---|---|---|---|
| De->En | download | download | download | download |
| En->De | download | download | download | download |

BPE Model: De-En BPE models

Synthetic Chinese-English bilingual data

Chinese-English Biomedical bilingual data

Training Details

Data Preprocessing

Moses scripts

lang=en  # or lang=de; run the steps below once per language

#Step1. normalize-punctuation

./mosesdecoder-master/scripts/tokenizer/normalize-punctuation.perl -l $lang < data.$lang > data.$lang.norm

#Step2. remove-non-printing-char

./mosesdecoder-master/scripts/tokenizer/remove-non-printing-char.perl < data.$lang.norm > data.$lang.norm.remv

#Step3. tokenize

./mosesdecoder-master/scripts/tokenizer/tokenizer.perl -l $lang -threads 10 < data.$lang.norm.remv > data.$lang.norm.remv.tok

Model: Transformer

Toolkit: Fairseq
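
As an illustration of how the hyperparameter table above might map onto Fairseq's command line, here is a hedged sketch. The helper `arch_flags` is hypothetical and the authors' actual training commands are not given in this README; only the layer counts, head counts, and dimensions come from the table. The Hybrid Transformer is left out because the table does not say what distinguishes it from the Deep Transformer.

```shell
# Print fairseq-train architecture flags for one column of the
# hyperparameter table. Usage: arch_flags <deep|big|large>
arch_flags() {
  local enc dec=6 heads dim ffn
  case "$1" in
    deep)  enc=40; heads=8;  dim=512;  ffn=2048 ;;
    big)   enc=6;  heads=16; dim=1024; ffn=4096 ;;
    large) enc=20; heads=16; dim=1024; ffn=4096 ;;
    *) echo "unknown model: $1" >&2; return 1 ;;
  esac
  echo "--arch transformer" \
       "--encoder-layers $enc --decoder-layers $dec" \
       "--encoder-attention-heads $heads --decoder-attention-heads $heads" \
       "--encoder-embed-dim $dim --decoder-embed-dim $dim" \
       "--encoder-ffn-embed-dim $ffn --decoder-ffn-embed-dim $ffn"
}

# Example (assumption: data-bin/ is a hypothetical fairseq-preprocess output
# directory; optimizer and batch settings are omitted):
#   fairseq-train data-bin $(arch_flags deep) ...
```

Keeping the flags in one function makes it easy to switch between the table's configurations without editing the rest of the training command.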

Citation

Please kindly cite our paper if you find it helpful:

@inproceedings{wang2020tencent,
  title={Tencent AI Lab machine translation systems for the WMT20 biomedical translation task},
  author={Wang, Xing and Tu, Zhaopeng and Wang, Longyue and Shi, Shuming},
  booktitle={Proceedings of the Fifth Conference on Machine Translation},
  pages={881--886},
  year={2020}
}
