xnli_extension's Introduction

An Extension of XNLI

Abstract

Multilingual transfer learning can benefit both high- and low-resource languages, but the source of these improvements is not well understood. Canonical Correlation Analysis (CCA) of the internal representations of a pretrained, multilingual BERT model reveals that the model partitions representations for each language rather than using a common, shared, interlingual space. This effect is magnified at deeper layers, suggesting that the model does not progressively abstract semantic content while disregarding languages. Hierarchical clustering based on the CCA similarity scores between languages reveals a tree structure that mirrors the phylogenetic trees hand-designed by linguists. The subword tokenization employed by BERT provides a stronger bias towards such structure than character- and word-level tokenizations. We release a subset of the MultiNLI/XNLI dataset translated into an additional 14 languages in this repository to assist further research into multilingual representations.
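
As a rough illustration of the hierarchical clustering step mentioned in the abstract, the sketch below builds a merge tree from a pairwise similarity matrix. It is not the authors' code: the labels `A` to `D` and the similarity values are made-up placeholders rather than CCA scores from the paper, and average linkage is an assumed criterion, since the abstract does not say which linkage was used.

```python
from itertools import combinations

def average_linkage(names, sim):
    """Agglomerative clustering over a symmetric similarity matrix.

    Returns the merges in order as (cluster_a, cluster_b, similarity),
    where each cluster is a tuple of names.
    """
    index = {name: i for i, name in enumerate(names)}

    def avg_sim(a, b):
        # Mean pairwise similarity between the members of two clusters.
        total = sum(sim[index[x]][index[y]] for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters that are most similar on average.
        a, b = max(combinations(clusters, 2), key=lambda pair: avg_sim(*pair))
        merges.append((a, b, avg_sim(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical toy input: placeholder labels and made-up similarity values.
labels = ["A", "B", "C", "D"]
similarity = [
    [1.0, 0.9, 0.3, 0.2],
    [0.9, 1.0, 0.4, 0.1],
    [0.3, 0.4, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]

merges = average_linkage(labels, similarity)
for left, right, score in merges:
    print(left, right, round(score, 3))
```

With real data, each row and column would correspond to one language, and the order of merges would give the tree that can be compared against a phylogenetic tree.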

Citation

@article{singhMultiLingTokBias2019,
  title={{BERT is Not an Interlingua and the Bias of Tokenization}},
  author={Singh, Jasdeep and McCann, Bryan and Xiong, Caiming and Socher, Richard},
  journal={The Workshop on Deep Learning for Low-Resource NLP at EMNLP 2019},
  year={2019}
}

xnli_extension's Issues

Low quality translation

First of all, thank you for this work. It's great to see more effort towards multilingual datasets.

However, as a native speaker of Hungarian, I found that the Hungarian dataset has quality issues. About 10-20% of the sentences are ungrammatical, and many are simply incorrect translations.

This one is grammatically correct but the translation is wrong:

Let me stay in your arms
Hadd maradjak a karjaimban
Let stay in my-arms.
Meaning: Let me stay in my own arms.

This one is both ungrammatical and incorrect:

There are apple and banana nut ones , but my favorite is the poppy seed .
Vannak alma és banán dió, de a kedvencem a mák.
There-are apple and banana walnut, but the my-favorite is-the poppyseed.

It would be correct with a few changes, assuming that the sentence is talking about flavors, not the actual fruits:

Van almás és banános diós, de a kedvencem a mákos.

Overall, I think the dataset would make a valuable contribution to the research community, and certainly to the Hungarian NLP community, if the issues were fixed by a native translator.

Native speakers of Bulgarian, Dutch and Russian observed similar or more serious quality issues in other datasets.
