Maluuba NewsQA

Tools for using Maluuba's news question and answer data.

You can find more information about the dataset here.

Data Description

The combined dataset has several columns that hold the story text and the answers derived from multiple crowdsourcers.

| Column Name | Description |
| --- | --- |
| story_id | The identifier for the story. Comes from the member name in the CNN stories package. |
| story_text | The text for the story. |
| question | A question about the story. |
| answer_char_ranges | Character based indices to answers in story_text. E.g. 196:228 |
| is_answer_absent | Proportion of crowdsourcers that thought there was no answer to the question in the story. |
| is_question_bad | Proportion of crowdsourcers that thought the question does not make sense. |
| validated_answers | After crowdsourcing, we validated some answers when consensus was required. This shows how crowdsourcers voted during validation. E.g. {"none": 1, "294:297": 2} means that 1 crowdsourcer thought that none of the answers were good and 2 crowdsourcers thought that 294:297 was the best answer. |
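
As a rough illustration of how the answer_char_ranges and validated_answers values described above might be consumed, here is a minimal sketch. The function names are hypothetical and not part of this repository. It handles only a single start:end range, and it assumes the start index is inclusive and the end index is exclusive (like a Python slice), which the description above does not state.

```python
import json


def parse_char_range(char_range):
    """Split a 'start:end' string such as '196:228' into two ints."""
    start, end = char_range.split(":")
    return int(start), int(end)


def extract_answer(story_text, char_range):
    """Slice the answer text out of story_text.

    Assumption: start is inclusive and end is exclusive, like a Python slice.
    """
    start, end = parse_char_range(char_range)
    return story_text[start:end]


def top_validated_answer(validated_answers):
    """Return the (key, votes) pair that received the most votes.

    validated_answers is a JSON object string such as
    '{"none": 1, "294:297": 2}'. The key is either a character range
    or "none". On a tie, whichever entry max() meets first is returned.
    """
    votes = json.loads(validated_answers)
    return max(votes.items(), key=lambda item: item[1])
```

For the example in the table, `top_validated_answer('{"none": 1, "294:297": 2}')` picks the `294:297` entry, which can then be passed to `extract_answer` together with the matching story_text.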

PEP8

The code in this repository complies with PEP8 standards with a maximum line length of 99 characters.

Requirements

  • Download the CNN stories from here to the maluuba/newsqa folder (for legal reasons, we can't automatically download these for you)
  • Download the questions and answers from here to the maluuba/newsqa folder
  • Extract the downloaded tar.gz contents into the maluuba/newsqa folder (we'll automate this step in the future)
  • Use Python 2
  • Run pip install --requirement requirements.txt
  • Run python maluuba/newsqa/example.py --help to see instructions

Package the Dataset

Run

python maluuba/newsqa/example.py

Split the Dataset

To split the dataset into train, dev, and test, run

python maluuba/newsqa/split_dataset.py

The file to check will be printed.

Legal

Notice: CNN articles are used here by permission from The Cable News Network (CNN). CNN does not waive any rights of ownership in its articles and materials. CNN is not a partner of, nor does it endorse, Maluuba or its activities.

Terms: See LICENSE.pdf.
