Giter Site home page Giter Site logo

eduquest's Introduction

EduQuest

This repository contains the EduQuest dataset, composed of 68248 questions developed by professional instructors for higher education learning. The dataset focuses on advanced and complex questions, to acess and train both students and machine learning algorithms.

Dataset Statistics

Core statistics of the FairytaleQA dataset Breakdown statistics of QAs based on the 7 narrative elements' schema in FairytaleQA dataset

The table on the left contains the core statistics of the dataset. On the right, a breakdown of the EduQuest dataset by source text. Books refers to one full course, with the exact definition varying across source texts. For Openstax, Opentext and CK12 each book represents a graduate, undergraduate or high-school level textbook. For MIT OCW, each book corresponds to one lecture series given at graduate or undergraduate level. For Khan Academy, each book represents one URL in the source website.

Core statistics of the FairytaleQA dataset by train/test/val splits

Above is the statistics of the dataset broken down according to their respective cognitive process and knowledge dimensions (Factual - Fact, Procedural - Pro, MetaCognitive - MC and Conceptual - Concept)

Dataset Structure

The main structure is the same for all subgroups:

Book Chapter text is_question truefalse fillblank mc
Origin Chapter Text of Lecture or Question 1/0 1/0 1/0 1/0
  • is_question indicates if text contains a lecture or a question

  • truefalse is 1 if the question is of a true false type

  • fillblank is 1 if the question is of a fillblank type (can also be mc simultanously)

  • mc is 1 if the question is a multiple choice type

  • length indicates the number of characters in text

  • num_words indicates the number of words in text

  • num_sentences indicates the number of sentences in text

  • dimension The dimension in the revised Bloom's Taxonomy.

  • cognitive_process_dimension_level cognitive dimension level in the revised Bloom's Taxonomy.

  • knowledge_dimension_level knowledge dimension level in the revised Bloom's Taxonomy.

  • Regarding Blooms Taxonomy, the following one hot encoded columns are here for convenience, a 1 indicates the question belongs to that level, 0 that it does not as Blooms Taxonomy does not require exclusivity.

    'Analyzing', 'Understanding', 'Evaluating', 'Applying', 'Remembering', 'Creating', 'conceptual_knowledge','factual_knowledge', 'procedural_knowledge', 'metacognitive_knowledge'
    

Because not all sources did indicate whether a question was of Multiple-Choice, True or False, Fill in the Blank or other types, it had to be inferred. For openstax, the above flags should be mostly exclusive. I.e. if truefalse == 0 then the question should not be of truefalse type, because the type of questions were largely indicated in the source materials. For the other dataset sub-groups ck12, KhanAacademy, OCW, Opentext this does not have to be the case. I.e. some questions in CK12 with truefalse == 0 could still be of truefalse type. Similar with the other flags.

For more information please refer to the paper.

By Group

detailed info on the flags and properties of this subgroup, certain datasets have a few indicators that we were able to extract to use for more specialized purposes, these columns are explained below.

Openstax

  • key_terms if the text is a summary of key terms from the section or chapter
  • references if the text is a list of references used in the previous lecture
  • summary if the text is a summary of the chapter.
  • question_answers if the text is an answer to a question
  • open_ended if the question is an open ended question (as indicated by openstax)
  • concept if the question is a concept check type from openstax

CK12

no special columns

Khan Academy

  • paragraph_order each many of the lessons on Khan are split into multiple paragraphs. This is the intended reading order from the website. I.e. Article X starts with X ,paragraph_order == 1, then X, paragraph_order == 2 and so on.
  • description and translated_descr sometimes contain the original text before it was split by paragraphs
  • images and widgets to our knowledge indicate if the scraped page were images or widgets, altough we did not find any usefulness in it.

MIT OCW

  • is_exam If the question is part of an exam (these comprise content spanning several lectures)
  • is_answer If the text is an answer to a question, if yes, it should have an q_id associated with that row that indicates which question/exercise this is the answer for.
  • q_id

Opentext / Libretexts

  • is_learning_objective many textbooks from opentext included learning objectives at the beginning of each chapter. These are indicated here.

Using this Dataset

From this repo

To start with this dataset, either clone the repo or download it as a zip file.

quick_start.py contains some code to train a model on the dataset, as an example to get started with NLP on EduQuest.

Future Work

eduquest's People

Contributors

aegonwolf avatar filipe-m-cunha avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.