TOEFL-Spell

A dataset of Spelling Annotations for English language learner essays written for TOEFL exams.

This repository contains the TOEFL-Spell annotated data set. The data is described in the paper A Benchmark Corpus of English Misspellings and a Minimally-supervised Model for Spelling Correction (Flor, Fried & Rozovskaya, 2019).

The TOEFL-Spell data set contains annotations of 6000+ spelling errors from essays written by non-native speakers of English taking the TOEFL iBT test.

We based our data set on the publicly available ETS Corpus of Non-Native Written English, a.k.a. TOEFL11, which contains 12,100 essays from 11 first language backgrounds. We sampled 883 essays from that corpus and manually annotated them for spelling errors.

We provide two files, FilesCounts.tsv and Annotations.tsv. Both contain tab-separated values, and the first line of each is a header. The same data is also available in CSV format as FilesCounts.csv and Annotations.csv.

FilesCounts.tsv contains the names of the annotated files (essays) and the count of spelling errors for each. Note that 35 essays had no spelling errors.

The file Annotations.tsv contains tab-separated annotations for all the data. Each annotation appears on a separate line, like this:

Filename    OffsetSpan    Misspelling    Type    Correction
1004135     1186-1193     beacuse        M       because

The value of the Filename field matches the corresponding text file in the full TOEFL11 corpus. The value of the OffsetSpan field gives the offsets of the misspelling in the original text file.
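As a minimal sketch of reading this format, the Python snippet below parses the header and the single example row shown above and splits OffsetSpan into two integers. The function name `read_annotations` and the output keys are illustrative choices, not part of the data set. It assumes every OffsetSpan value has the form `start-end`. In the example row, the difference between the two numbers equals the length of the misspelling, which is consistent with an end-exclusive span, but that reading is an assumption.

```python
import csv
import io

# Header line plus the one example row quoted in this README, tab-separated.
SAMPLE_TSV = (
    "Filename\tOffsetSpan\tMisspelling\tType\tCorrection\n"
    "1004135\t1186-1193\tbeacuse\tM\tbecause\n"
)


def read_annotations(handle):
    """Yield one dict per annotation, with OffsetSpan split into integers."""
    # QUOTE_NONE keeps any quote characters inside a misspelling as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: OffsetSpan always looks like "start-end".
        start, end = (int(part) for part in row["OffsetSpan"].split("-"))
        yield {
            "filename": row["Filename"],
            "start": start,
            "end": end,
            "misspelling": row["Misspelling"],
            "type": row["Type"],
            "correction": row["Correction"],
        }


annotations = list(read_annotations(io.StringIO(SAMPLE_TSV)))
print(annotations[0])
```

To read the real file, pass an open file handle instead of the `io.StringIO` sample, for example `open("Annotations.tsv", encoding="utf-8", newline="")`. The UTF-8 encoding is an assumption.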

To view the annotations in the full context of the original essays (or to run your own experiments), you will need to obtain the essays from the Linguistic Data Consortium (LDC Catalog Number: LDC2014T06) and link them to the annotations via the filenames and offset values.
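Once the essays are available, a quick sanity check is to slice each essay at the annotated offsets and compare the result with the recorded misspelling. The Python sketch below shows that check on a synthetic string, since the essays themselves cannot be reproduced here. The helper name `span_matches` is hypothetical. Treating the offsets as zero-based character positions with an exclusive end is an assumption to verify against the real corpus.

```python
def span_matches(essay_text, start, end, misspelling):
    """Return True if the text at [start, end) equals the annotated misspelling."""
    # Assumption: offsets are zero-based character positions, end exclusive.
    return essay_text[start:end] == misspelling


# Synthetic stand-in for an essay: filler characters up to offset 1186,
# then the misspelling from the example annotation. This is NOT corpus text.
demo_text = ("x" * 1186) + "beacuse" + " of the filler text."

print(span_matches(demo_text, 1186, 1193, "beacuse"))
```

If the check fails on real essays, line-ending translation is one thing to rule out: opening the files with `newline=""` keeps the original line endings, so character positions are not shifted.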

A note on 'types' of misspellings

The article mentions 6121 misspellings, each of which is a single-token nonword. Those are marked as type M in the annotation. The annotation file has 112 additional misspellings with other type names (M2, MWM, MWM2), which were marked for additional research. Only type M misspellings were used in the system evaluations reported in the paper.
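To reproduce the evaluation subset, the annotations can be filtered on the Type field. The Python sketch below counts annotations per type and keeps only type M. Apart from the first row, which reuses the example from this README, the rows are made-up placeholders that only illustrate the type labels. The dict keys are illustrative as well.

```python
from collections import Counter

# Placeholder rows: only the first one comes from the README example.
# The others are synthetic and exist only to show the filtering.
rows = [
    {"misspelling": "beacuse", "type": "M"},
    {"misspelling": "placeholder_a", "type": "M2"},
    {"misspelling": "placeholder_b", "type": "MWM"},
    {"misspelling": "placeholder_c", "type": "M"},
]

# Count annotations per type label.
type_counts = Counter(row["type"] for row in rows)

# Keep only the single-token nonword misspellings (type M).
type_m_only = [row for row in rows if row["type"] == "M"]

print(type_counts)
print(len(type_m_only))
```

Run on the full annotation file, and assuming it matches the counts given above, the type M count should be 6121 and the other types together should account for the remaining 112, for 6233 annotations in total.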

Questions? Send email to [email protected]
