
qa-dataset-to-squad's Introduction

What I have changed

  1. Added support for downloading the existing datasets directly from their websites (natural_question, quac, and TriviaQA; 3/20/2022)
  2. Removed support for NewsQA because it is more complicated (it requires licenses from CNN, which I do not want to get into)

QA Dataset Converter

In this repository, we release code from the paper "What do Models Learn from Question Answering Datasets?" by Priyanka Sen and Amir Saffari.

These scripts convert four popular question answering datasets into a common format based on SQuAD 2.0 to allow for easier probing and experimentation. An example of a question in the SQuAD 2.0 format is shown below:

{
  "context": "The Normans were the people who in the 10th and 11th centuries...",
  "qas": [
    {
      "question": "In what country is Normandy located?",
      "id": "56ddde6b9a695914005b9628",
      "answers": [
        {
          "text": "France",
          "answer_start": 159
        }
      ],
      "is_impossible": false
    }
...
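
As a rough illustration of how the fields above fit together, the sketch below checks that each `answer_start` offset actually points at its answer `text` inside `context`. It assumes the file follows the standard SQuAD 2.0 layout, where entries like the one above sit under a top-level `data` list of articles, each with a `paragraphs` list. The function name `check_answer_offsets` and the tiny example record are made up for illustration and are not part of the released scripts.

```python
def check_answer_offsets(squad):
    """Return the ids of questions whose answer text is not found at answer_start."""
    bad_ids = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Unanswerable questions have an empty answers list, so nothing to check.
                for answer in qa.get("answers", []):
                    start = answer["answer_start"]
                    text = answer["text"]
                    if context[start:start + len(text)] != text:
                        bad_ids.append(qa["id"])
                        break
    return bad_ids


# Hand-made example (hypothetical data, assumed SQuAD 2.0 layout).
context = "Normandy is a region in France."
example = {
    "version": "v2.0",
    "data": [{
        "title": "Normandy",
        "paragraphs": [{
            "context": context,
            "qas": [
                {"question": "In what country is Normandy located?", "id": "q1",
                 "answers": [{"text": "France", "answer_start": context.index("France")}],
                 "is_impossible": False},
                {"question": "In what country is Normandy located?", "id": "q2",
                 "answers": [{"text": "France", "answer_start": 0}],  # deliberately wrong offset
                 "is_impossible": False},
                {"question": "Who founded Normandy?", "id": "q3",
                 "answers": [], "is_impossible": True},
            ],
        }],
    }],
}

print(check_answer_offsets(example))
```

To run the same check on a converted file, load it with `json.load` and pass the resulting dictionary to the function.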

In the following sections, we guide you through converting TriviaQA, Natural Questions, and QuAC into the SQuAD 2.0 format. NewsQA support has been removed from this fork.


TriviaQA

Step 1

Clone this repo and go into the TriviaQA directory.

cd qa-dataset-converter/triviaqa

Step 2

Download the TriviaQA dataset from https://nlp.cs.washington.edu/triviaqa/. This will include a qa directory with question-answer files and an evidence directory containing the documents for context.

Step 3

Clone the TriviaQA repo.

git clone https://github.com/mandarjoshi90/triviaqa

Step 4

Move our triviaqa_to_squad.py script into the TriviaQA repo.

mv triviaqa_to_squad.py triviaqa/

Step 5

Set --triviaqa_file to a file in your qa directory and --data_dir to the Wikipedia path in your evidence directory. Run:

python triviaqa_to_squad.py --triviaqa_file qa/wikipedia-train.json --data_dir evidence/wikipedia/ --output_file triviaqa_train.json

python triviaqa_to_squad.py --triviaqa_file qa/wikipedia-dev.json --data_dir evidence/wikipedia/ --output_file triviaqa_dev.json

This will produce two files, triviaqa_train.json and triviaqa_dev.json, in the SQuAD 2.0 format.


Natural Questions

Step 1

Clone this repo and go into the Natural Questions directory.

cd qa-dataset-converter/nq

Step 2

Download the Natural Questions dataset from https://ai.google.com/research/NaturalQuestions/download. This will download train and dev directories of jsonl.gz files.
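
If you want to peek at the download before converting it, a minimal sketch like the one below can read the jsonl.gz files using only the Python standard library. It assumes each file is gzip-compressed text with one JSON object per line. The helper name `iter_jsonl_gz` is hypothetical and is not part of nq_to_squad.py.

```python
import glob
import gzip
import json
import os


def iter_jsonl_gz(directory):
    """Yield one parsed JSON record per non-empty line of every *.jsonl.gz file in directory."""
    for path in sorted(glob.glob(os.path.join(directory, "*.jsonl.gz"))):
        # "rt" opens the gzip stream in text mode so lines come back as str.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)
```

For example, `sum(1 for _ in iter_jsonl_gz("train/"))` counts the records in the train directory without holding them all in memory.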

Step 3

Set --nq_dir to your Natural Questions train or dev directory. Run:

python nq_to_squad.py --nq_dir train/ --output_file nq_train.json

python nq_to_squad.py --nq_dir dev/ --output_file nq_dev.json

This will produce two files, nq_train.json and nq_dev.json, in the SQuAD 2.0 format.


QuAC

Step 1

Clone this repo and go into the QuAC directory.

cd qa-dataset-converter/quac

Step 2

Download the QuAC dataset from https://quac.ai/

Step 3

Set --quac_file to the path of your QuAC train or dev file. Run:

python quac_to_squad.py --quac_file train_v0.2.json --output_file quac_train.json

python quac_to_squad.py --quac_file val_v0.2.json --output_file quac_dev.json

This will produce two files, quac_train.json and quac_dev.json, in the SQuAD 2.0 format.
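
Once a conversion has finished, a quick sanity check is to count what ended up in the output. The sketch below is one way to do that. It assumes the converted file uses the standard SQuAD 2.0 top-level layout (`data`, then `paragraphs`, then `qas`), and the function name `summarize_squad` is hypothetical rather than something the scripts provide.

```python
import json


def summarize_squad(squad):
    """Count paragraphs, questions, and unanswerable questions in a SQuAD 2.0 style dict."""
    counts = {"paragraphs": 0, "questions": 0, "unanswerable": 0}
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            counts["paragraphs"] += 1
            for qa in paragraph["qas"]:
                counts["questions"] += 1
                # Treat a missing is_impossible flag as answerable.
                if qa.get("is_impossible", False):
                    counts["unanswerable"] += 1
    return counts


def summarize_file(path):
    """Load a converted JSON file (for example quac_dev.json) and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize_squad(json.load(handle))
```

Calling `summarize_file("quac_dev.json")` returns a small dictionary of counts that you can compare against the dataset's published statistics.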


Acknowledgements

Our TriviaQA script modifies code released in the TriviaQA repo. In particular, we take inspiration from convert_to_squad_format.py for all our scripts.

We also use modified code from the Natural Questions browser script to process Natural Questions examples.

We are thankful to the authors for making this code available.


License

This code is licensed under the Apache License, Version 2.0.


Citation

If you use our code, please cite us!

@inproceedings{sen-saffari-2020-models,
    title = "What do Models Learn from Question Answering Datasets?",
    author = "Sen, Priyanka  and
      Saffari, Amir",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.190",
    doi = "10.18653/v1/2020.emnlp-main.190",
    pages = "2429--2438",
}
