RoCS-MT (Robust Challenge Set for Machine Translation)

RoCS-MT, a Robust Challenge Set for Machine Translation (MT), is designed to test MT systems’ ability to translate user-generated content (UGC) that displays non-standard characteristics, such as spelling errors, devowelling, acronymisation, etc. RoCS-MT is composed of English comments from Reddit, selected for their non-standard nature, which have been manually normalised and professionally translated into five languages: French, German, Czech, Ukrainian and Russian. The challenge set was included as a test suite at WMT 2023. This repository therefore also includes automatic translations from the submissions to the general MT task.

News

  • 06/10/2023: Version 1 release: this initial release corresponds to the first version of the challenge set at WMT 2023. Scripts for analysis and reproducing results will be made available shortly after the conference!

Citation

Please cite the following article:

Rachel Bawden and Benoît Sagot. 2023. RoCS-MT: Robustness Challenge Set for Machine Translation. In Proceedings of the Eighth Conference on Machine Translation, pages 198–216, Singapore. Association for Computational Linguistics.

@inproceedings{bawden-sagot-2023-rocs,
    title = "{R}o{CS}-{MT}: Robustness Challenge Set for Machine Translation",
    author = "Bawden, Rachel  and
      Sagot, Beno{\^\i}t",
    editor = "Koehn, Philipp  and
      Haddow, Barry  and
      Kocmi, Tom  and
      Monz, Christof",
    booktitle = "Proceedings of the Eighth Conference on Machine Translation",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.wmt-1.21",
    pages = "198--216"
}

Description of files

Each version of the challenge set (only v1 published for now) contains the following files:

Source files (raw + normalised versions)

The source texts are available in several versions, depending on the sentence segmentation (manual or with spaCy) and on whether the sentences are in their raw or normalised version:

  • manual sentence segmentation, raw text: src/RoCS-MT.src.raw-manseg.en
  • manual sentence segmentation, normalised text: src/RoCS-MT.src.norm-manseg.en
  • spaCy sentence segmentation, raw text: src/RoCS-MT.src.raw-spacyseg.en
  • spaCy sentence segmentation, normalised text: src/RoCS-MT.src.norm-spacyseg.en

.tsv versions of the files are available in the same directory; they indicate the document id of each sentence.
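As a rough illustration of how the four source versions relate, the following Python sketch builds the file paths from the naming pattern above and pairs raw sentences with their normalised counterparts. The function names are hypothetical, the `norm-manseg` file name is inferred from the pattern of the other three files, and the assumption that the raw and normalised files of one segmentation are line-aligned should be checked against the data.

```python
from pathlib import Path

SEGMENTATIONS = ("manseg", "spacyseg")  # manual vs. spaCy sentence segmentation
FORMS = ("raw", "norm")                 # raw vs. normalised text


def source_path(root, form, seg):
    """Build the path of one source version, e.g. src/RoCS-MT.src.raw-manseg.en."""
    if form not in FORMS or seg not in SEGMENTATIONS:
        raise ValueError(f"unknown version: {form}-{seg}")
    return Path(root) / "src" / f"RoCS-MT.src.{form}-{seg}.en"


def read_sentences(path):
    """Read one sentence per line, keeping empty lines so indices stay aligned."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_pairs(root, seg="manseg"):
    """Return (raw, normalised) sentence pairs for one segmentation.

    Assumption: both files have one sentence per line and are line-aligned.
    A length mismatch raises an error instead of silently truncating.
    """
    raw = read_sentences(source_path(root, "raw", seg))
    norm = read_sentences(source_path(root, "norm", seg))
    if len(raw) != len(norm):
        raise ValueError(f"{len(raw)} raw vs. {len(norm)} normalised lines")
    return list(zip(raw, norm))
```

Calling `load_pairs(".", seg="spacyseg")` from the repository root would then give the pairs for the spaCy-segmented version, provided the alignment assumption holds.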

Reference files

The manually segmented, normalised texts were translated by professional translators into French, German, Czech, Ukrainian and Russian. They can be found in ref/en-{fr,de,cs,uk,ru}. We additionally include manual annotations of normalisation phenomena in ref/RoCS-annotated.tsv.

N.B. Post-editing is still needed for the output of some translators, which currently includes variants (for gender variation).

MT system translations

The sys/ folder contains system outputs from the 2023 general MT task. They correspond to the translations of the concatenation of the four versions of the source texts.
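Because each system output covers the concatenation of the four source versions, it has to be split back into four parts before it can be scored against one version. A minimal sketch of that step is given below; `split_concatenated` is a hypothetical helper, and the order in which the versions were concatenated is not stated here, so the order of the `sizes` argument is an assumption to verify against the source files.

```python
def split_concatenated(lines, sizes):
    """Split a concatenated list of lines into named, consecutive parts.

    `sizes` maps a part name to its number of lines, in concatenation order
    (dicts keep insertion order). The order itself is an assumption that the
    caller must supply; it is not derived from the data.
    """
    total = sum(sizes.values())
    if total != len(lines):
        raise ValueError(f"sizes add up to {total}, but got {len(lines)} lines")
    parts = {}
    start = 0
    for name, size in sizes.items():
        parts[name] = lines[start:start + size]
        start += size
    return parts


# Toy usage with made-up sizes; real sizes come from the four source files.
toy_output = ["a", "b", "c", "d", "e", "f"]
toy_sizes = {"raw-manseg": 2, "norm-manseg": 2, "raw-spacyseg": 1, "norm-spacyseg": 1}
toy_parts = split_concatenated(toy_output, toy_sizes)
```

Checking the total length first keeps a wrong ordering or a truncated output from going unnoticed when the sizes do not add up, although it cannot detect two versions of equal length being swapped.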

Guidelines

We include the first version of the guidelines for both normalisation and translation in guidelines. Both guides, in particular the one for normalisation, are likely to evolve in future releases.
