Home Page: https://www.aclweb.org/anthology/2020.emnlp-main.438/

License: Creative Commons Attribution Share Alike 4.0 International

EXAMS: A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering

EXAMS is a new benchmark dataset for cross-lingual and multilingual question answering for high school examinations. It contains more than 24,000 high-quality high school exam questions in 26 languages, covering 8 language families and 24 school subjects from Natural Sciences and Social Sciences, among others. EXAMS offers a fine-grained evaluation framework across multiple languages and subjects, which allows precise analysis and comparison of various models.

This repository contains links to the data and the models, together with a set of scripts for preparing the dataset and evaluating new models.

For more details on how the dataset was created, as well as baseline models testing multilingual and cross-lingual transfer, please refer to our paper: EXAMS: A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering.

Dataset

The data can be downloaded from here: (1) Multilingual, (2) Cross-lingual

The two testbeds are described in the paper (also on arXiv). The files are in jsonl format and follow the ARC Dataset's structure. Each file is named using the following pattern: data/exams/{testbed}/{subset}.jsonl

We also provide the questions with the resolved contexts from Wikipedia articles. These files are in the with_paragraphs folder and are named {subset}_with_para.jsonl.
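Each split is a jsonl file with one question per line. A minimal Python sketch for loading one (the field names shown follow the ARC-style layout mentioned above and are illustrative, not a guarantee of the exact schema):

```python
import json

def read_jsonl(path):
    """Load one question per line from a jsonl split file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# An ARC-style record (field names are illustrative):
example = json.loads(
    '{"id": "q1", '
    '"question": {"stem": "2 + 2 = ?", '
    '"choices": [{"text": "3", "label": "A"}, {"text": "4", "label": "B"}]}, '
    '"answerKey": "B"}'
)
print(example["answerKey"])  # prints: B
```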

Multilingual

In this setup, we want to train and evaluate a given model on multiple languages, and thus we need multilingual training, validation, and test sets. To include as many languages as possible, we first split the questions independently for each language L into TrainL, DevL, and TestL, with 37.5%, 12.5%, and 50% of the examples, respectively.

*For languages with fewer than 900 examples, we only have TestL.

| Language   | Train | Dev   | Test   |
|------------|------:|------:|-------:|
| Albanian   | 565   | 185   | 755    |
| Arabic     | -     | -     | 562    |
| Bulgarian  | 1,100 | 365   | 1,472  |
| Croatian   | 1,003 | 335   | 1,541  |
| French     | -     | -     | 318    |
| German     | -     | -     | 577    |
| Hungarian  | 707   | 263   | 1,297  |
| Italian    | 464   | 156   | 636    |
| Lithuanian | -     | -     | 593    |
| Macedonian | 778   | 265   | 1,032  |
| Polish     | 739   | 246   | 986    |
| Portuguese | 346   | 115   | 463    |
| Serbian    | 596   | 197   | 844    |
| Spanish    | -     | -     | 235    |
| Turkish    | 747   | 240   | 977    |
| Vietnamese | 916   | 305   | 1,222  |
| Combined   | 7,961 | 2,672 | 13,510 |
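The per-language splitting procedure described above can be sketched as follows. This is a minimal illustration of the 37.5% / 12.5% / 50% ratios and the under-900 rule, not the authors' original splitting code; the shuffle and seed are assumptions:

```python
import random

def split_language(questions, seed=0):
    """Split one language's questions into train/dev/test using the
    37.5% / 12.5% / 50% ratios of the multilingual testbed.
    Languages with fewer than 900 examples go entirely to test."""
    qs = list(questions)
    if len(qs) < 900:
        return [], [], qs
    random.Random(seed).shuffle(qs)
    n_train = int(len(qs) * 0.375)
    n_dev = int(len(qs) * 0.125)
    return qs[:n_train], qs[n_train:n_train + n_dev], qs[n_train + n_dev:]
```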

Cross-lingual

In this setting, we want to explore the capability of a model to transfer its knowledge from a single source language Lsrc to a new, unseen target language Ltgt. To ensure a larger training set, we train the model on 80% of Lsrc, validate on the remaining 20% of the same language, and test on a subset of Ltgt.

For this setup, we offer per-language subsets for both the train and dev sets. The file naming pattern is {subset}_{lang}.jsonl, e.g., train_ar.jsonl, train_ar_with_para.jsonl, dev_bg.jsonl, etc.

Finally, in this setup the test.jsonl file is the same as in the Multilingual setup.
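The per-language file names can be built mechanically from the pattern above. A small helper (base_dir is whichever directory holds the downloaded files; the helper itself is illustrative, not part of the repository):

```python
def subset_path(base_dir, subset, lang, with_para=False):
    """Build a per-language file name following the README's pattern,
    e.g. train_ar.jsonl or train_ar_with_para.jsonl."""
    suffix = "_with_para" if with_para else ""
    return f"{base_dir}/{subset}_{lang}{suffix}.jsonl"

print(subset_path("data", "train", "ar"))  # prints: data/train_ar.jsonl
```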

| Language   | Train | Dev |
|------------|------:|----:|
| Albanian   | 1,194 | 311 |
| Arabic     | -     | -   |
| Bulgarian  | 2,344 | 593 |
| Croatian   | 2,341 | 538 |
| French     | -     | -   |
| German     | -     | -   |
| Hungarian  | 1,731 | 536 |
| Italian    | 1,010 | 246 |
| Lithuanian | -     | -   |
| Macedonian | 1,665 | 410 |
| Polish     | 1,577 | 394 |
| Portuguese | 740   | 184 |
| Serbian    | 1,323 | 314 |
| Spanish    | -     | -   |
| Turkish    | 1,571 | 393 |
| Vietnamese | 1,955 | 488 |

Parallel Questions

The EXAMS dataset contains 10,000 parallel questions, so we also provide the mappings between questions in jsonl format. Each row in the file maps a question id to a list of its parallel questions in other languages.
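A minimal sketch for collecting these rows into a single lookup dict. Each line is assumed here to be a one-key object {question_id: [ids, ...]}; the released file's exact field layout may differ:

```python
import json

def load_parallel_map(path):
    """Collect the id -> [parallel ids] rows into one dict."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                mapping.update(json.loads(line))
    return mapping
```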

Resolved Hits

We also release the resolved hits from Elasticsearch, including links to the Wikipedia pages, their titles, and the relevance scores returned by the engine. The hits are available as a tar.gz archive containing a jsonl file with the aforementioned fields.
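The archived jsonl can be streamed directly out of the tar.gz without unpacking it first. A sketch (the member names inside the archive, and the exact record fields, are assumptions):

```python
import json
import tarfile

def iter_hits(archive_path):
    """Stream resolved-hit records out of a tar.gz archive of jsonl files."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".jsonl"):
                for line in tar.extractfile(member):
                    if line.strip():
                        yield json.loads(line)
```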

Training and Evaluation

For both scripts, the supported values for the (multilingual) model type ($MODEL_TYPE) are: "bert", "xlm-roberta", "bert-kb", "xlm-roberta-kb".

The paragraph type ($PARA_TYPE) modes are: 'per_choice', 'concat_choices', 'ignore'

When using EXAMS with run_multiple_choice, one should pass --task_name exams; otherwise, use the task name suitable for the dataset, e.g., arc or race.

Training

We use HuggingFace's scripts for training the models, with slight modifications to allow for 3- to 5-way multiple-choice questions. The python scripts are available under the scripts/experiments folder.

Here is an example:

python ./scripts/experiments/run_multiple_choice.py \
    --model_type $MODEL_TYPE \
    --task_name $TASK_NAME \
    --tb_log_dir runs/${TRAIN_OUTPUT_SUBDIR}/$RUN_SETTING_NAME \
    --model_name_or_path $TRAINED_MODEL_DIR \
    --do_train \
    --do_eval \
    --warmup_proportion ${WARM_UP} \
    --evaluate_during_training \
    --logging_steps ${LOGGING_STEPS} \
    --save_steps ${LOGGING_STEPS} \
    --data_dir $TRAIN_DATA_DIR \
    --learning_rate $LEARNING_RATE \
    --num_train_epochs $MAX_EPOCHS \
    --max_seq_length $MAX_SEQ_LENGTH \
    --output_dir $TRAIN_OUTPUT \
    --weight_decay $WEIGHT_DECAY \
    --overwrite_cache \
    --per_gpu_eval_batch_size=$EVAL_BATCH_SIZE \
    --per_gpu_train_batch_size=$BATCH_SIZE \
    --gradient_accumulation_steps $GRADIENT_ACC_STEPS \
    --overwrite_output 

Evaluation

We provide an evaluation script that allows fine-grained evaluation at both the subject and the language level. The script is available at scripts/evaluation/evaluate_exams.py.

Example usage:

python evaluate_exams.py \
    --predictions_path predictions.json \
    --dataset_path dev.jsonl \
    --granularity all \
    --output_path results.json

The granularities that the script supports are: language, subject, subject_and_language, and all (which includes all other options).

A sample predictions file can be found here: sample_predictions.jsonl.
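The kind of per-group accuracy the script reports can be sketched as follows. This is an illustration only, not the code of evaluate_exams.py; the info field layout and the shape of the predictions dict are assumptions:

```python
from collections import defaultdict

def accuracy_by(examples, predictions, key):
    """Fine-grained accuracy, grouped by an info field such as
    "language" or "subject"."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        group = ex["info"][key]
        total[group] += 1
        if predictions.get(ex["id"]) == ex["answerKey"]:
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}
```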

Predictions

The following script can be used to obtain predictions from pre-trained models.

python ./scripts/experiments/run_multiple_choice.py \
    --model_type $MODEL_TYPE \
    --task_name exams \
    --do_test \
    --para_type per_choice \
    --model_name_or_path $TRAINED_MODEL_DIR \
    --data_dir $INPUT_DATA_DIR \
    --max_seq_length $MAX_SEQ_LENGTH \
    --output_dir $OUTPUT_DIR \
    --per_gpu_eval_batch_size=$EVAL_BATCH_SIZE \
    --overwrite_cache \
    --overwrite_output

Contexts

The scripts used for downloading the Wikipedia articles and for context resolution can be found in the scripts/dataset folder.

Baselines

The EXAMS paper presents several baselines for zero-shot and few-shot training using publicly available multiple-choice datasets: RACE, ARC, OpenBookQA, and Regents.

Multilingual

  • The (Full) models are trained on all aforementioned datasets, including EXAMS.
| Lang/Set | ar | bg | de | es | fr | hr | hu | it | lt | mk | pl | pt | sq | sr | tr | vi | All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random Guess | 25.0 | 25.0 | 29.4 | 32.0 | 29.4 | 26.7 | 27.7 | 26.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 26.2 | 23.1 | 25.0 | 25.9 |
| IR (Wikipedia) | 31.0 | 29.6 | 29.3 | 27.2 | 32.1 | 31.9 | 29.7 | 27.6 | 29.8 | 32.2 | 29.2 | 27.5 | 25.3 | 31.8 | 28.5 | 27.5 | 29.5 |
| XLM-R on RACE | 39.1 | 43.9 | 37.2 | 40.0 | 37.4 | 38.8 | 39.9 | 36.9 | 40.5 | 45.9 | 33.9 | 37.4 | 42.3 | 35.6 | 37.1 | 35.9 | 39.1 |
| w/ SciENs | 39.1 | 44.2 | 35.5 | 37.9 | 37.1 | 38.5 | 37.9 | 39.5 | 41.3 | 49.8 | 36.1 | 39.3 | 42.5 | 37.4 | 37.4 | 35.9 | 39.6 |
| then on EXAMS (Full) | 40.7 | 47.2 | 39.7 | 42.1 | 39.6 | 41.6 | 40.2 | 40.6 | 40.6 | 53.1 | 38.3 | 38.9 | 44.6 | 39.6 | 40.3 | 37.5 | 42.0 |
| XLM-R Base (Full) | 34.5 | 35.7 | 36.7 | 38.3 | 36.5 | 35.6 | 33.3 | 33.3 | 33.2 | 41.4 | 30.8 | 29.8 | 33.5 | 32.3 | 30.4 | 32.1 | 34.1 |
| mBERT (Full) | 34.5 | 39.5 | 35.3 | 40.9 | 34.9 | 35.3 | 32.7 | 36.0 | 34.4 | 42.1 | 30.0 | 29.8 | 30.9 | 34.3 | 31.8 | 31.7 | 34.6 |

References

Please cite as [1]. There is also an arXiv version.

[1] M. Hardalov, T. Mihaylov, D. Zlatkova, Y. Dinkov, I. Koychev, P. Nakov "EXAMS: A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering", Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.

@inproceedings{hardalov-etal-2020-exams,
    title = "{EXAMS}: A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering",
    author = "Hardalov, Momchil  and
      Mihaylov, Todor  and
      Zlatkova, Dimitrina  and
      Dinkov, Yoan  and
      Koychev, Ivan  and
      Nakov, Preslav",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.438",
    pages = "5427--5444",
    series = "EMNLP~'20"
}

License

The dataset, which contains paragraphs from Wikipedia, is licensed under CC-BY-SA 4.0. The code in this repository is licensed under the Apache 2.0 License.
