bwaldon / cgel

This project forked from nert-nlp/cgel

CGEL trees.

License: Creative Commons Attribution 4.0 International

Shell 0.04% Python 8.30% CSS 0.01% TeX 16.97% HTML 40.94% Common Lisp 33.74%

cgel

This repo contains CGELBank, a human-annotated treebank of English using the syntactic formalism of the Cambridge Grammar of the English Language (CGEL). The treebank is described in Reynolds et al. (2023), published at the Linguistic Annotation Workshop (LAW).

This work is licensed under a Creative Commons Attribution 4.0 International License.

Datasets

We annotated data from Twitter and the English Web Treebank (EWT).

To load the CGEL trees for scripting, use the cgel.py library.

Summary information is available in:

  • STATS.md (statistics extracted from the trees)
  • INDEX.md (list of sentences and notable properties)

Gold Data

  • datasets/twitter.cgel: CGEL gold trees from Twitter
  • datasets/ewt.cgel: CGEL trees from a sample of EWT train sentences (manually annotated by Brett Reynolds)
  • datasets/{ewt-test_iaa50.cgel, ewt-test_pilot5.cgel}: Adjudicated and up-to-date trees from the IAA experiment; sentences drawn from the EWT test split
  • datasets/trial/{ewt-trial.cgel, twitter-etc-trial.cgel}: Miscellaneous trees annotated but not adjudicated by both annotators
  • datasets/oneoff/*.cgel: Various CGEL trees for ad hoc sentences

Corresponding .conllu files are also available alongside the datasets/*.cgel and datasets/trial/*.cgel files. EWT .conllu files are gold trees; other .conllu files are manual corrections of Stanza output.
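Since each .cgel file in datasets/ and datasets/trial/ has a .conllu counterpart next to it, the two can be paired by file name. The following is a minimal sketch, assuming the counterpart shares the same stem (e.g. twitter.cgel and twitter.conllu). The helper name `paired_files` is hypothetical and not part of cgel.py.

```python
from pathlib import Path


def paired_files(root):
    """Yield (cgel_path, conllu_path) pairs found under root.

    Assumption: a .conllu file corresponds to a .cgel file when both sit in
    the same directory and share the same stem.
    """
    root = Path(root)
    # Only these two directories are documented as having .conllu counterparts.
    for directory in (root / "datasets", root / "datasets" / "trial"):
        if not directory.is_dir():
            continue
        for cgel_path in sorted(directory.glob("*.cgel")):
            conllu_path = cgel_path.with_suffix(".conllu")
            # Skip .cgel files that have no counterpart on disk.
            if conllu_path.exists():
                yield cgel_path, conllu_path
```

Calling `list(paired_files("."))` from the repo root would then give the pairs to process together.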

All data was revised with the aid of consistency-checking scripts.

Other subdirectories contain older/silver versions of the trees.

Interannotator Data

Under datasets/iaa/:

  • ewt-test_pilot5.{nschneid, brettrey, adjudicated}.cgel: Pilot interannotator study (5 sentences from EWT).
  • ewt-test_iaa50.{...}.cgel: Main interannotator study (50 sentences from EWT).
    • {nschneid, brettrey}.novalidator: Initial annotation.
    • {nschneid, brettrey}.validator: Corrected individual annotation after running automatic validation script to catch common errors.
    • adjudicated: Final adjudicated version combining both annotations.
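The naming scheme above can be enumerated programmatically. This is a sketch only: it assumes that the annotator and stage components are joined with dots (e.g. ewt-test_iaa50.nschneid.novalidator.cgel), which should be checked against the actual contents of datasets/iaa/. The function name `iaa50_filenames` is hypothetical.

```python
from itertools import product

ANNOTATORS = ("nschneid", "brettrey")
STAGES = ("novalidator", "validator")


def iaa50_filenames(prefix="ewt-test_iaa50"):
    """List the file names expected for the main interannotator study.

    Assumption: components are dot-separated, in the order
    prefix.annotator.stage.cgel, plus one adjudicated file.
    """
    names = [f"{prefix}.{who}.{stage}.cgel"
             for who, stage in product(ANNOTATORS, STAGES)]
    names.append(f"{prefix}.adjudicated.cgel")
    return names
```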

Structure

  • cgel.py: library that implements classes for CGEL trees and the nodes within them, incl. helpful functions for printing and processing trees in PENMAN notation
  • cgel2ptb.py: prints CGEL trees in PTB bracketed style
  • constituent.py: information about how constituents join in a tree, for use by other scripts
  • eval.py: script for comparing two sets of CGEL annotations with tree edit distance (and derived metrics)
  • iaa.sh: script that runs eval.py on all files involved in our interannotator study (comparing pre- and post-validation trees as well as final adjudicated version)
  • tree2tex.py: prints CGEL trees as nicely formatted LaTeX
  • ud2cgel.py: converts UD trees (from the English EWT treebank) to CGEL format using a rule-based system
  • validate_trees.py: script to check the well-formedness of trees
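For readers unfamiliar with the PTB bracketed style that cgel2ptb.py targets, here is a minimal sketch of that output format. It does not use cgel.py: the nested-tuple tree representation and the function name `to_ptb` are hypothetical, and the category labels in the example are purely illustrative.

```python
def to_ptb(node):
    """Render a nested (label, children...) tuple in PTB bracketed style.

    A leaf is a (label, word) pair whose second element is a string.
    """
    label, *rest = node
    if len(rest) == 1 and isinstance(rest[0], str):
        return f"({label} {rest[0]})"
    return f"({label} " + " ".join(to_ptb(child) for child in rest) + ")"


# Illustrative tree only; not taken from CGELBank.
tree = ("Clause",
        ("NP", ("N", "dogs")),
        ("VP", ("V", "bark")))
print(to_ptb(tree))  # (Clause (NP (N dogs)) (VP (V bark)))
```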

Folders

  • analysis/: scripts for analysing the datasets, incl. edit distance
  • convertor/: conversion rules (as a DepEdit script) and the outputs of conversion, plus a simple Flask web interface for local testing in the browser (English text > automatic UD parse with Stanza > CGEL)
  • datasets/: all the final datasets
  • figures/: figures for papers/posters and code for generating them
  • scripts/: one-off scripts that were used to clean/restructure data
  • test/: validation tests

Tests

To run tests locally:

$ python -m pytest

This validates the trees and tests the distance metrics (Levenshtein distance and tree edit distance, TED).
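As a reminder of what the sequence-level metric computes, here is a minimal, self-contained sketch of Levenshtein distance. It is not the implementation used in this repo's tests or in eval.py; the function name `levenshtein` is illustrative.

```python
def levenshtein(a, b):
    """Return the minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # delete x
                            curr[j - 1] + 1,       # insert y
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]
```

Tree edit distance applies the same idea to trees, counting node insertions, deletions, and relabelings instead of edits to sequence elements.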

History

  • CGELBank 1.0: 2023-07-04.
    • Initial release of 257 trees.

Resources

Overview of the project:

Brett Reynolds, Aryaman Arora, and Nathan Schneider (2023). Unified Syntactic Annotation of English in the CGEL Framework. Proc. of the 17th Linguistic Annotation Workshop (LAW-XVII), Toronto, Canada.

@inproceedings{cgelbank-law,
    address = {Toronto, Canada},
    title = {Unified Syntactic Annotation of {E}nglish in the {CGEL} Framework},
    author = {Reynolds, Brett and Arora, Aryaman and Schneider, Nathan},
    year = {2023},
    month = jul,
    url = {https://people.cs.georgetown.edu/nschneid/p/cgeltrees.pdf},
    booktitle = {Proc. of the 17th Linguistic Annotation Workshop (LAW-XVII)}
}

Annotation manual:

Brett Reynolds, Nathan Schneider, and Aryaman Arora (2023). CGELBank Annotation Manual v1.0. arXiv.

Further analysis:

Brett Reynolds, Aryaman Arora, and Nathan Schneider (2022). CGELBank: CGEL as a Framework for English Syntax Annotation. arXiv.

Aryaman Arora, Nathan Schneider, and Brett Reynolds (2022). A CGEL-formalism English treebank. MASC-SLL (poster), Philadelphia, USA.

Source data:

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, Chris Manning (2014). A Gold Standard Dependency Corpus for English. Proc. of the Ninth International Conference on Language Resources and Evaluation (LREC '14).

Ann Bies, Justin Mott, Colin Warner, Seth Kulick (2012). English Web Treebank. LDC.

Contributors

nschneid, aryamanarora, tomlup, brettrey
