gapfill's Introduction

Fill in the Blank

A simple web app that presents the user with a series of fill-in-the-blank questions. The code is divided into two parts: Python scripts that automatically generate fill-in-the-blank questions from Wikipedia articles, and a Node.js web app that presents those questions.

Installation

  • Clone the repository
  • Make sure you have Python and Node.js installed
  • In the parser directory, run pip install -r requirements.txt
  • To re-fetch the articles from Wikipedia, run python wiki_fetcher.py
  • To regenerate the questions from the fetched articles, run python wiki_parser.py
  • To use the new list of questions in the web app, copy outputs/gapfill_questions.json to the app/data directory
  • In the app directory, run npm install
  • Run node app.js
  • Go to localhost:3000 in your browser to try it out!

The Parser

The Wikipedia fetching and parsing code is in the parser directory. wiki_fetcher.py fetches the article content and stores it in an output JSON file. wiki_parser.py parses that content and generates the questions.

The high-level approach for question generation involves:

  1. Downloading the content (just the summary section at the top) of each Wikipedia article listed in wiki_fetcher.py
  2. Breaking up each article into sentences
  3. Choosing the sentences that are best suited for fill-in-the-blank questions
  4. For each of those sentences, choosing the best word to blank out
  5. Writing the questions to a JSON file that can be used by the web app

The approach for choosing sentences and keywords was mostly inspired by this paper: http://www.anthology.aclweb.org/W/W11/W11-1407.pdf. The best sentences are chosen based on the following criteria:

  • Whether the sentence is the first sentence in the article
  • The number of words that the sentence shares with the article title
  • Whether the sentence contains at least one superlative
  • The length of the sentence
  • The number of nouns in the sentence
  • The number of pronouns in the sentence (this is a negative signal)
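As a rough illustration of how the criteria above could be combined, here is a minimal sketch of a weighted sentence score. It is not the code in wiki_parser.py: the function name, the weights, and the input format (tokens already tagged with Penn Treebank part-of-speech tags) are all assumptions made for the example.

```python
from typing import List, Tuple

# A sentence as (word, POS tag) pairs, using Penn Treebank tags.
# Assumption: tagging is done elsewhere (for example by an NLP library).
TaggedSentence = List[Tuple[str, str]]

# Hypothetical weights; the real values would need tuning.
WEIGHTS = {
    "is_first": 2.0,
    "title_overlap": 1.0,
    "has_superlative": 1.0,
    "length": 0.05,
    "nouns": 0.5,
    "pronouns": -1.0,  # pronouns are a negative signal
}


def score_sentence(tagged: TaggedSentence, title: str, index: int) -> float:
    """Score one sentence; `index` is its position in the article."""
    words = [word.lower() for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    title_words = set(title.lower().split())

    features = {
        "is_first": 1.0 if index == 0 else 0.0,
        "title_overlap": float(len(title_words & set(words))),
        # JJS / RBS are the superlative adjective / adverb tags.
        "has_superlative": 1.0 if any(t in ("JJS", "RBS") for t in tags) else 0.0,
        "length": float(len(words)),
        # NN, NNS, NNP, NNPS all start with "NN".
        "nouns": float(sum(t.startswith("NN") for t in tags)),
        "pronouns": float(sum(t in ("PRP", "PRP$") for t in tags)),
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())


first = [("Everest", "NNP"), ("is", "VBZ"), ("the", "DT"),
         ("highest", "JJS"), ("mountain", "NN")]
second = [("It", "PRP"), ("is", "VBZ"), ("tall", "JJ")]

ranked = sorted(
    [(score_sentence(first, "Mount Everest", 0), "first"),
     (score_sentence(second, "Mount Everest", 1), "second")],
    reverse=True,
)
print(ranked[0][1])
```

A linear combination like this keeps each criterion easy to reweight independently, which is convenient when tuning by inspecting the generated questions.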

Each chosen sentence is then broken down into noun phrases, and one candidate word to blank out is chosen from each noun phrase. The best word is then picked from the candidate list based on how often it appears in the article and whether it appears in the article title.
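The keyword step could look something like the sketch below. Several details are assumptions rather than the project's actual logic: the noun phrases are passed in already extracted, the candidate from each phrase is taken to be its last word, article frequency and title membership are both treated as positive signals, and TITLE_BONUS is an arbitrary value.

```python
import re
from collections import Counter
from typing import List, Optional

TITLE_BONUS = 2.0  # hypothetical bonus for words that appear in the title


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def choose_blank(noun_phrases: List[List[str]], sentence: str,
                 article: str, title: str) -> Optional[str]:
    """Pick the word to blank out, or None if no candidate qualifies."""
    article_counts = Counter(tokenize(article))
    sentence_counts = Counter(tokenize(sentence))
    title_words = set(tokenize(title))

    best_word, best_score = None, float("-inf")
    for phrase in noun_phrases:
        if not phrase:
            continue
        candidate = phrase[-1]  # assumption: last word of the phrase
        key = candidate.lower()
        # Skip words that do not occur exactly once in the sentence.
        if sentence_counts[key] != 1:
            continue
        score = article_counts[key] + (TITLE_BONUS if key in title_words else 0.0)
        if score > best_score:
            best_word, best_score = candidate, score
    return best_word


article = ("Paris is the capital of France. Paris is on the Seine. "
           "The Seine flows through Paris.")
sentence = "Paris is the capital of France."
phrases = [["Paris"], ["the", "capital"], ["France"]]
print(choose_blank(phrases, sentence, article, "Paris"))
```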

The App

The web app is a simple Node.js app. All of the server code is in app.js, which parses the questions from the input JSON file and exposes endpoints for getting questions and checking answers. The client code is in the static directory, which includes JavaScript that handles user input and fetches data from the server.

Potential Improvements

There are some fairly obvious problems in the current approach that I would fix if I had more time:

  • Right now only the sentence and word strings are saved to the JSON file, without indicating the position of the word in the sentence. Words that appear twice in a sentence are never chosen, but if the chosen word is a substring of another word in the sentence, some of the JavaScript code will produce bad outputs.
  • The keyword-choosing code heavily favors cardinal numbers. That is good in that numbers are usually well-constrained fill-in-the-blank answers, but from eyeballing the outputs, too many of the questions have numbers blanked out instead of words.
  • Besides the number problem, some other sentences aren't great fill-in-the-blank questions. The weighting of the sentence and word scores can be tweaked further, and more scoring functions can be added to improve this.
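For the first point, one possible fix is to record the character offset of the chosen word when the question is generated, so the client never has to search for it. The sketch below shows the idea on the Python side. The field names ("sentence", "word", "start") and both function names are hypothetical and do not reflect the current JSON format.

```python
import re
from typing import Any, Dict, Optional


def make_question(sentence: str, word: str) -> Optional[Dict[str, Any]]:
    """Build a question record that stores where the word starts.

    Returns None unless the word occurs exactly once as a whole word.
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    matches = list(re.finditer(pattern, sentence))
    if len(matches) != 1:
        return None
    return {"sentence": sentence, "word": word, "start": matches[0].start()}


def render_blank(question: Dict[str, Any], blank: str = "_____") -> str:
    """Replace the answer with a blank using the stored offset."""
    start = question["start"]
    end = start + len(question["word"])
    text = question["sentence"]
    return text[:start] + blank + text[end:]


question = make_question("The cat sat near the category sign.", "cat")
if question is not None:
    print(render_blank(question))
```

Because the blank is cut out by offset, the "cat" inside "category" is left alone, which a plain substring replacement would not guarantee.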
