What I have changed

Added support for downloading the existing datasets directly from their websites (natural_question, quac, and TriviaQA - 3/20/2022)
Removed support for NewsQA because it was more complicated (it involves licenses from CNN, which I do not want to get into)

QA Dataset Converter

In this repository, we release code from the paper What do Models Learn from Question Answering Datasets? by Priyanka Sen and Amir Saffari.

These scripts convert three popular question answering datasets (TriviaQA, Natural Questions, and QuAC) into a common format based on SQuAD 2.0 to allow for easier probing and experimentation. An example of a question in the SQuAD 2.0 format is shown below:

{
  "context": "The Normans were the people who in the 10th and 11th centuries...",
  "qas": [
    {
      "question": "In what country is Normandy located?",
      "id": "56ddde6b9a695914005b9628",
      "answers": [
        {
          "text": "France",
          "answer_start": 159
        }
      ],
      "is_impossible": false
    }
...
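
To make the relationship between the fields concrete, here is a minimal sketch of how one such entry could be assembled. The helper name make_qa_entry and the sample context are hypothetical and are not part of this repository. Taking the first occurrence of the answer text as answer_start is a simplifying assumption, not necessarily what the conversion scripts do.

```python
import json


def make_qa_entry(context, question, answer_text, qid):
    """Build one SQuAD 2.0 style question entry for a given context.

    Hypothetical helper for illustration only.
    """
    # str.find returns -1 when the answer text does not occur in the context.
    start = context.find(answer_text) if answer_text else -1
    if start == -1:
        # Unanswerable questions carry an empty answer list and is_impossible = true.
        return {"question": question, "id": qid, "answers": [], "is_impossible": True}
    return {
        "question": question,
        "id": qid,
        # answer_start is a character offset into the context string.
        "answers": [{"text": answer_text, "answer_start": start}],
        "is_impossible": False,
    }


context = "Normandy is a region in France."
paragraph = {
    "context": context,
    "qas": [
        make_qa_entry(context, "In what country is Normandy located?", "France", "example-0")
    ],
}
print(json.dumps(paragraph, indent=2))
```

The key invariant is that slicing the context at answer_start for the length of the answer text gives back the answer text itself.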

In the following sections, we guide you through converting TriviaQA, Natural Questions, and QuAC into the SQuAD 2.0 format.

TriviaQA

Step 1

Clone this repo and go into the TriviaQA directory.

cd qa-dataset-converter/triviaqa

Step 2

Download the TriviaQA dataset from https://nlp.cs.washington.edu/triviaqa/. This will include a qa directory with question-answer files and an evidence directory containing the documents for context.

Step 3

Clone the TriviaQA repo.

git clone https://github.com/mandarjoshi90/triviaqa

Step 4

Move our triviaqa_to_squad.py script into the TriviaQA repo.

mv triviaqa_to_squad.py triviaqa/

Step 5

Set --triviaqa_file to a file in your qa directory and --data_dir to the Wikipedia path in your evidence directory. Run:

python triviaqa_to_squad.py --triviaqa_file qa/wikipedia-train.json --data_dir evidence/wikipedia/ --output_file triviaqa_train.json

python triviaqa_to_squad.py --triviaqa_file qa/wikipedia-dev.json --data_dir evidence/wikipedia/ --output_file triviaqa_dev.json

This will produce two files, triviaqa_train.json and triviaqa_dev.json, in the SQuAD 2.0 format.
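
After a conversion, it can be useful to check that every answer_start offset actually points at its answer text. The following is a minimal sketch of such a check. It assumes the output file uses the standard SQuAD 2.0 top-level layout (a "data" list of articles, each with a "paragraphs" list); the function names are hypothetical and not part of this repository.

```python
import json


def find_misaligned_answers(squad):
    """Return (question id, answer text) pairs whose offset does not match the context."""
    problems = []
    # Assumed layout: {"data": [{"paragraphs": [{"context": ..., "qas": [...]}]}]}
    for article in squad.get("data", []):
        for paragraph in article.get("paragraphs", []):
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa.get("answers", []):
                    start = answer["answer_start"]
                    text = answer["text"]
                    # The slice at answer_start must reproduce the answer text.
                    if context[start:start + len(text)] != text:
                        problems.append((qa["id"], text))
    return problems


def check_file(path):
    """Load a converted JSON file and report misaligned answers."""
    with open(path, encoding="utf-8") as f:
        return find_misaligned_answers(json.load(f))
```

For example, check_file("triviaqa_dev.json") would return an empty list if every answer span lines up with its context.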

Natural Questions

Step 1

Clone this repo and go into the Natural Questions directory.

cd qa-dataset-converter/nq

Step 2

Download the Natural Questions dataset from https://ai.google.com/research/NaturalQuestions/download. This will download train and dev directories of jsonl.gz files.

Step 3

Set --nq_dir to your Natural Questions train or dev directory. Run:

python nq_to_squad.py --nq_dir train/ --output_file nq_train.json

python nq_to_squad.py --nq_dir dev/ --output_file nq_dev.json

This will produce two files, nq_train.json and nq_dev.json, in the SQuAD 2.0 format.
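
If you want to look at the raw Natural Questions records before converting them, the jsonl.gz files can be read with the Python standard library alone. This is a minimal sketch and not the logic of nq_to_squad.py; the function name iter_jsonl_gz is hypothetical, and the sketch makes no assumptions about the fields inside each record.

```python
import gzip
import json
from pathlib import Path


def iter_jsonl_gz(directory):
    """Yield one parsed JSON object per line from every *.jsonl.gz file in a directory."""
    for path in sorted(Path(directory).glob("*.jsonl.gz")):
        # "rt" opens the gzip stream in text mode so it can be read line by line.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():  # skip blank lines
                    yield json.loads(line)
```

Because this is a generator, the files are streamed one record at a time instead of being loaded into memory, which matters for a training set of this size. For instance, next(iter_jsonl_gz("dev/")) returns only the first record.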

QuAC

Step 1

Clone this repo and go into the QuAC directory.

cd qa-dataset-converter/quac

Step 2

Download the QuAC dataset from https://quac.ai/

Step 3

Set --quac_file to the path of your QuAC train or dev file. Run:

python quac_to_squad.py --quac_file train_v0.2.json --output_file quac_train.json

python quac_to_squad.py --quac_file val_v0.2.json --output_file quac_dev.json

This will produce two files, quac_train.json and quac_dev.json, in the SQuAD 2.0 format.
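
As a quick sanity check on a converted file, you can count how many questions are answerable and how many are marked is_impossible. The sketch below assumes the standard SQuAD 2.0 top-level layout ("data", then "paragraphs", then "qas"); the function name count_questions is hypothetical and not part of this repository.

```python
import json
from collections import Counter


def count_questions(squad):
    """Count answerable and impossible questions in a SQuAD 2.0 style dict."""
    counts = Counter()
    # Assumed layout: {"data": [{"paragraphs": [{"context": ..., "qas": [...]}]}]}
    for article in squad.get("data", []):
        for paragraph in article.get("paragraphs", []):
            for qa in paragraph.get("qas", []):
                # Questions without an is_impossible flag are treated as answerable.
                key = "impossible" if qa.get("is_impossible", False) else "answerable"
                counts[key] += 1
    return counts


def count_questions_in_file(path):
    """Load a converted JSON file and count its questions."""
    with open(path, encoding="utf-8") as f:
        return count_questions(json.load(f))
```

Calling count_questions_in_file("quac_dev.json") would give a Counter with "answerable" and "impossible" totals that you can compare against the statistics published for the dataset.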

Acknowledgements

Our TriviaQA script modifies code released in the TriviaQA repo. In particular, we take inspiration from convert_to_squad_format.py for all our scripts.

We also use modified code from the Natural Questions browser script to process Natural Questions examples.

We are thankful to the authors for making this code available.

License

This code is licensed under the Apache License, Version 2.0.

Citation

If you use our code, please cite us!

@inproceedings{sen-saffari-2020-models,
    title = "What do Models Learn from Question Answering Datasets?",
    author = "Sen, Priyanka  and
      Saffari, Amir",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.190",
    doi = "10.18653/v1/2020.emnlp-main.190",
    pages = "2429--2438",
}
