Giter Site home page Giter Site logo

xnli's Introduction

XNLI: The Cross-Lingual NLI Corpus

XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages.

New: XLM and Multilingual BERT use XNLI to evaluate the quality of the cross-lingual representations.

Introduction

Many NLP systems (e.g. sentiment analysis, topic classification, feed ranking) rely on training data in one high-resource language, but cannot be directly used to make predictions for other languages at test time. This problem happens in almost any industrial application that involves cross-lingual data.

Machine translation can be used to translate any sample into the high-resource language to alleviate this issue. However, having an MT system in every direction is costly and may not the best solution for cross-lingual classification. Cross-lingual encoders are a cheaper and more elegant alternative.

To evaluate such cross-lingual sentence understanding methods, we built XNLI, an extension of the SNLI/MultiNLI corpus in 15 languages. XNLI raises the following research question: how can we make predictions in any language at test time, when we only have English training data for that task?

While industrial applications may not include NLI in their routine tasks, we believe that NLI is a good testbed for evaluating cross-lingual sentence representations, and that better approaches for XNLI will result in better XLU methods.

The XNLI corpus

The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. This results in 112.5k annotated pairs. Each premise can be associated with the corresponding hypothesis in the 15 languages, summing up to more than 1.5M combinations.

Examples

Download

XNLI is distributed in a single ZIP file containing the corpus as both JSON lines (jsonl) and tab-separated text (txt). The English training data can be found on the MultiNLI website.

Download: XNLI 1.0 (17MB, ZIP)

XNLI can also be used as a 15way parallel corpus of 10,000 sentences, for building or evaluating Machine Translation systems. XNLI provides additional open parallel data for low-resource languages such as Swahili or Urdu.

Download XNLI-15way (12MB, ZIP)

Useful XLU resources

Parallel data for aligning sentence encoders: OPUS: the open parallel corpus

Monolingual word embeddings in 157 languages: fastText CommonCrawl vectors

Aligning word embeddings: MUSE

Data description paper and citation

A description of the data can be found here or in the corpus package zip. If you use the corpus in an academic paper, please consider citing:

@InProceedings{conneau2018xnli,
  author = "Conneau, Alexis
        and Rinott, Ruty
        and Lample, Guillaume
        and Williams, Adina
        and Bowman, Samuel R.
        and Schwenk, Holger
        and Stoyanov, Veselin",
  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods
               in Natural Language Processing",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  location = "Brussels, Belgium",
}

Baselines and MT data

The XNLI paper presents several baselines for language adaptation.

We also release the machine translated data for reproducing the TRANSLATE-TRAIN and TRANSLATE-TEST:

Download: XNLI-MT 1.0 (445MB, ZIP)

License

XNLI is licensed under the license found in the LICENSE file. See details in the XNLI paper.

xnli's People

Contributors

facebook-github-bot avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

xnli's Issues

Clarification of license of XNLI

Hi, currently I see at https://github.com/facebookresearch/XNLI/blob/main/LICENSE it XNLI is licensed under the "Attribution-NonCommercial 4.0 International" Creative Commons license.

However, at https://www.tensorflow.org/datasets/catalog/xnli it says at the bottom of the page "Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License", and in the paper itself at https://arxiv.org/pdf/1809.05053.pdf I see there is the following text:

The current version of the corpus is freely avail-
able23 for typical machine learning uses, and may
be modified and redistributed. The majority of the
corpus sentences are released under the OANC’s
license which allows all content to be freely used,
modified, and shared under permissive terms. The
data in the Fiction genre from Captain Blood are
in the public domain in the United States (but may
be licensed differently elsewhere).

Could I please clarify which license the XNLI and XNLI 15-Way datasets are under, and whether commercial usage is allowed?

Label change 'contradictory' to 'contradiction' issue

in the original MNLI and XNLI paper, labels when two sentences contradict are named as 'contradiction',

but in the XNLI 1.0.zip, the labels are 'contradictory' instead of 'contradiction'. (for all languages)

it can be a problem when training multiple languages at once.

i think changing label 'contradictory' to 'contradiction' would be proper.

XNLI 2.0

Another group made an 'XNLI 2.0' dataset without talking with us. It's unclear whether we should recommend it, but it's probably worth linking to it and making it clear that it's a separate project we don't actively endorse.

https://arxiv.org/abs/2301.06527

XNLI Arabic data read problem

Following this and this script,
I tried to load Arabic data by the following script.

# data paths
MAIN_PATH=$PWD
OUTPATH=$PWD/data/xnli
XNLI_PATH=$PWD/data/xnli/XNLI-1.0

# tools paths
TOOLS_PATH=$PWD/tools
TOKENIZE=$TOOLS_PATH/tokenize.sh
LOWER_REMOVE_ACCENT=$TOOLS_PATH/lowercase_and_remove_accent.py

# install tools
./scripts/install-tools.sh

# create directories
mkdir -p $OUTPATH

# download data
if [ ! -d $OUTPATH/XNLI-MT-1.0 ]; then
  if [ ! -f $OUTPATH/XNLI-MT-1.0.zip ]; then
    wget -c https://dl.fbaipublicfiles.com/XNLI/XNLI-MT-1.0.zip -P $OUTPATH
  fi
  unzip $OUTPATH/XNLI-MT-1.0.zip -d $OUTPATH
fi
if [ ! -d $OUTPATH/XNLI-1.0 ]; then
  if [ ! -f $OUTPATH/XNLI-1.0.zip ]; then
    wget -c https://dl.fbaipublicfiles.com/XNLI/XNLI-1.0.zip -P $OUTPATH
  fi
  unzip $OUTPATH/XNLI-1.0.zip -d $OUTPATH
fi


# English train set
for lg in ar; do
  echo "*** Preparing $lg train set ****"
  echo -e "premise\thypo\tlabel" > $XNLI_PATH/$lg.train
  sed '1d'  $OUTPATH/XNLI-MT-1.0/multinli/multinli.train.$lg.tsv | cut -f1 | python $LOWER_REMOVE_ACCENT > $XNLI_PATH/train.f1
  sed '1d'  $OUTPATH/XNLI-MT-1.0/multinli/multinli.train.$lg.tsv | cut -f2 | python $LOWER_REMOVE_ACCENT > $XNLI_PATH/train.f2
  sed '1d'  $OUTPATH/XNLI-MT-1.0/multinli/multinli.train.$lg.tsv | cut -f3 | sed 's/contradictory/contradiction/g' > $XNLI_PATH/train.f3
  paste $XNLI_PATH/train.f1 $XNLI_PATH/train.f2 $XNLI_PATH/train.f3 >> $XNLI_PATH/$lg.train
done

Now the lines from 390702-392701 in $XNLI_PATH/train.f2 are empty. So from header premise\thypo\tlabel, hypo will always be empty from 390702-392701 in ar.train file.

Is this correct behavior?

@aconneau

can not download XNLI-15way.zip

when i open the XNLI-15way.zip link, https://s3.amazonaws.com/xnli/XNLI-15way.zip

return this below, help me, pls, many thanks.

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>NoSuchBucket</Code>
<Message>The specified bucket does not exist</Message>
<BucketName>xnli</BucketName>
<RequestId>98B2A67BB673D19F</RequestId>
<HostId>
h6joAzTWqJSaOLywynelTZTxW8WYTVLsK79Ezdt/sd6mLQeqwQ4Nrni647VHIeZmliBbnXgfKo4=
</HostId>
</Error>

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.