synthetic-mc-qa's Introduction

Synthetic QA Generation for Multiple Choice

This repo contains the code developed for my Master's thesis in the Language Technologies Master's programme at UNED. It is designed to run in a Colab environment with a GPU, using the English RACE and EntranceExams datasets.

It has all been condensed into two notebooks for ease of use: the first contains the code needed to generate new synthetic data, and the second contains the code to carry out the experiments.

You can download the necessary data files from the RACE and EE (EntranceExams) sites.

The first notebook, QA_MC_GEN, contains all the code needed to generate Question-Answer-Distractor tuples from a given dataset, which is expected in a specific format. The preprocessing pipelines needed to produce that format are also provided. It uses the T5 model to generate Question-Answer pairs, from which distractors are then generated with several methods; you can easily implement your own strategy by modifying them.
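
As a rough illustration of what a custom distractor strategy could look like, here is a minimal sketch. It is not the notebook's code: `MCItem`, `sample_distractors` and `build_items` are hypothetical names, and the strategy shown (reusing the answers to other questions from the same document as distractors) is just one simple option, assumed here for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MCItem:
    """One multiple-choice item (hypothetical container, not from the notebooks)."""
    question: str
    answer: str
    distractors: List[str]


def sample_distractors(answer: str, candidate_pool: Sequence[str],
                       k: int = 3, seed: int = 0) -> List[str]:
    """Pick up to k distractors from candidate_pool.

    Candidates equal to the answer (ignoring case and surrounding spaces)
    and duplicate candidates are skipped.
    """
    rng = random.Random(seed)
    seen = {answer.strip().lower()}
    unique = []
    for cand in candidate_pool:
        key = cand.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cand.strip())
    rng.shuffle(unique)
    return unique[:k]


def build_items(qa_pairs: Sequence[Tuple[str, str]],
                k: int = 3, seed: int = 0) -> List[MCItem]:
    """Build MCItem objects from the (question, answer) pairs of one document.

    Assumed strategy: the answers to the *other* questions of the same
    document form the candidate pool for each item.
    """
    items = []
    for i, (question, answer) in enumerate(qa_pairs):
        pool = [a for j, (_, a) in enumerate(qa_pairs) if j != i]
        items.append(MCItem(question, answer,
                            sample_distractors(answer, pool, k, seed)))
    return items
```

In this sketch, swapping in another strategy only means replacing how `pool` is built or how `sample_distractors` ranks its candidates.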

The second notebook, QA_MC_EVAL, contains all the code needed to evaluate the generated tuples: whether they help improve model performance, and whether the synthetic data is useful enough on its own and shows predictive potential. It does so by fine-tuning a pretrained BERT model on different combinations of real and synthetic data. The last sections carry out expanded tests that explore the limits of this approach.
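
To make "different combinations of real and synthetic data" concrete, here is a minimal sketch of how such training sets could be assembled before fine-tuning. It is an assumption for illustration only: `mix_training_sets` and the `synthetic_ratio` parameter are hypothetical names, and the combinations actually used in the notebook may differ. The fine-tuning step itself is not shown.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_training_sets(real: Sequence[T], synthetic: Sequence[T],
                      synthetic_ratio: float, seed: int = 0) -> List[T]:
    """Combine real and synthetic examples into one shuffled training set.

    All real examples are kept. The number of synthetic examples added is
    int(synthetic_ratio * len(real)), capped at len(synthetic).
    synthetic_ratio=0 gives a real-only set; passing an empty `real`
    sequence gives an empty set, so a synthetic-only set has to be built
    by passing the synthetic examples as `real`.
    """
    if synthetic_ratio < 0:
        raise ValueError("synthetic_ratio must be non-negative")
    rng = random.Random(seed)
    n_synth = min(len(synthetic), int(synthetic_ratio * len(real)))
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Toy stand-ins for real and synthetic multiple-choice examples.
    real = [f"real-{i}" for i in range(10)]
    synthetic = [f"synth-{i}" for i in range(20)]
    for ratio in (0.0, 0.5, 1.0, 2.0):
        train = mix_training_sets(real, synthetic, ratio)
        print(ratio, len(train))
```

Fixing the seed keeps each combination reproducible, so differences between runs can be attributed to the data mix rather than to sampling.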

Most code has been developed making use of 🤗 Pipelines and hosted models.

Experiment Scripts

Under src/generate_data.py we've added a script to generate the synthetic data; you need to provide it with a destination folder. The experiments are run with run_experiment.py.

To run them (from src/):

$ python -m pip install -r requirements.txt
$ python generate_data.py save_folder n_samples questions_per_doc strategy
$ python run_experiment.py all save_folder

For more information, run python generate_data.py --help or python run_experiment.py --help.

Objectives

The main objectives of this work are to evaluate methods for automatically generating Multiple Choice collections and to assess how much they contribute to improving current systems. To this end we will work towards the following objectives:

In QA_MC_GEN

  1. Generate Question-Answer pairs from a given text.
  2. Propose different methods of distractor generation for said pairs.

In QA_MC_EVAL

  1. Evaluate how the quality and quantity of these tuples affect current systems.
  2. Evaluate how the different types of distractors impact the results.

synthetic-mc-qa's People

Contributors

jorses

synthetic-mc-qa's Issues

DATASET EE (Clef- Entrance Exams) is not available

Hello, I found your work interesting. However, it looks difficult to replicate it with the dataset you suggested (Clef- Entrance Exams). On their website, they ask for a registration and a special mail-post request for accessing the data.

Would you mind sharing the EE dataset as part of your repository, please?

We would appreciate it. Thanks!
