JaEnCOCO

This is the Japanese translation for the Ambiguous COCO dataset.

Our dataset is based on the original Ambiguous COCO dataset. We do not own the copyrights of images, English captions, and annotations in the dataset. Please visit the authors' web pages and follow their instructions to download them.

We are going to have an Ambiguous MS COCO Japanese-English Multimodal Task at WAT 2021!

License

Our dataset is released under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.

Dataset Description

There are two main folders, one for the validation data and one for the test data. Each folder contains the following files, where "*" stands for either "validation" or "test," depending on the folder:

  1. *Data.ja: This file contains the unsegmented Japanese sentences.
  2. *Data.tok.ja: This file contains the Japanese sentences segmented by MeCab.
  3. *FileNames.txt: This file contains the corresponding image file names.
  4. *Index.txt: This file contains the corresponding sentence indices for the original Ambiguous COCO sentences.
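
As a rough illustration of the layout above, the following sketch loads the four files of one folder and checks that they are line-aligned. It assumes that "*" expands literally to the prefix (for example "validationData.ja"), that the files are UTF-8 encoded, and that each file holds one entry per line; none of this is confirmed by the dataset itself. The helper names `read_lines` and `load_split` are hypothetical.

```python
from pathlib import Path

# Assumed file name suffixes; the prefix ("validation" or "test") is prepended.
SUFFIXES = {
    "ja": "Data.ja",
    "tok_ja": "Data.tok.ja",
    "file_names": "FileNames.txt",
    "index": "Index.txt",
}


def read_lines(path):
    """Read a UTF-8 text file and return its lines without line endings."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def load_split(folder, split):
    """Load the four files of one split and check they have equal length.

    `folder` is the validation or test folder; `split` is the assumed
    file name prefix, "validation" or "test".
    """
    folder = Path(folder)
    columns = {
        key: read_lines(folder / f"{split}{suffix}")
        for key, suffix in SUFFIXES.items()
    }
    lengths = {key: len(lines) for key, lines in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"files are not line-aligned: {lengths}")
    return columns
```

Calling `load_split("validation", "validation")` would then return a dictionary with the keys `ja`, `tok_ja`, `file_names`, and `index`, provided the folder and file names match the assumptions above.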

To construct a full English-Japanese validation and test set, you need to combine this data with the English Ambiguous COCO sentences as follows:

  1. Download the Ambiguous COCO captions and images. Links to the dataset can be found on the WMT 2017 Multimodal MT task page, under the "Datasets" section.

  2. By matching the provided sentence IDs for the validation and test data with the sentence IDs for the English captions, you can associate the Japanese sentences with their corresponding English translations. From this you should be able to construct a validation file with 230 English sentences and a test file with 231 English sentences.

  3. The provided image file name list in each folder can be used to link each sentence to the corresponding image. The file names are in the same order as the sentences, i.e., the first file name in the validation set corresponds to the first validation sentence, and so on.
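
The matching described in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes each entry in *Index.txt is an integer position into a list of English captions that has already been loaded in its original order, and it leaves the index base (0 or 1) as a parameter because the source does not state it. The function name `pair_sentences` and the `index_base` parameter are hypothetical.

```python
def pair_sentences(ja_sentences, indices, file_names, en_captions, index_base=0):
    """Pair each Japanese sentence with its English caption and image name.

    ja_sentences, indices, and file_names are the line-aligned contents of
    *Data.ja, *Index.txt, and *FileNames.txt. en_captions is the list of
    English Ambiguous COCO captions. index_base is an assumption: set it to
    1 if the indices turn out to be 1-based.
    """
    if not (len(ja_sentences) == len(indices) == len(file_names)):
        raise ValueError("Japanese sentences, indices and file names must align")

    pairs = []
    for ja, raw_index, image in zip(ja_sentences, indices, file_names):
        position = int(raw_index) - index_base
        # Reject out-of-range positions explicitly; a negative value would
        # otherwise silently wrap around to the end of the list.
        if not 0 <= position < len(en_captions):
            raise IndexError(f"sentence index {raw_index!r} is out of range")
        pairs.append({"image": image, "en": en_captions[position], "ja": ja})
    return pairs
```

If the assumptions hold, the length of the returned list should equal the sentence count given above for the corresponding split, which is a simple sanity check on the chosen `index_base`.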

Training Data

For standard training data, please use the Flickr30kEntities Japanese (F30kEnt-Jp) dataset.

As additional training data, you can use the MS COCO English image captions and STAIR Japanese image captions.

Reference

If you use this dataset, please cite the following paper:

Andrew Merritt, Chenhui Chu, Yuki Arase. A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences. arXiv:2010.08725.

@misc{merritt2020corpus,
      title={A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences}, 
      author={Andrew Merritt and Chenhui Chu and Yuki Arase},
      year={2020},
      eprint={2010.08725},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgment

This work was supported by a Microsoft Research Asia Collaborative Research Grant and Grant-in-Aid for Young Scientists #19K20343, Japan.
