slanglab / cgedit

Models and training sets to accompany: Masis, Neal, Green, and O'Connor. "Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in Low-Resource English Varieties." Field Matters Workshop at COLING, 2022.

CGEdit/

Code, training data, and models to accompany the paper: Masis, Neal, Green, and O'Connor. "Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in Low-Resource English Varieties." Field Matters Workshop at COLING, 2022.

Contact: Tessa Masis ([email protected]), Brendan O'Connor ([email protected])

  • data/

    • documentation.md: dataset documentation
    • CGEdit/
      • AAE.tsv, IndE.tsv: training sets for AAE and IndE generated via CGEdit method
    • CGEdit-ManualGen/
      • AAE.tsv, IndE.tsv: training sets for AAE and IndE generated via both ManualGen and CGEdit
  • code/

    • train.py: code to fine-tune BERT-variant model
    • eval.py: code to evaluate fine-tuned model
    • preprocessCORAAL.py: code used to preprocess CORAAL transcript files for extrinsic evaluation in the paper (see Section 6); note that only interviewee speech files were used for our evaluation, not interviewer speech files
    • Note that the above scripts may require modifications in order to run on your computer
    • tutorial.ipynb: copy of the tutorial walking through how to use our fine-tuned models (see below, section "Using our models")

Training models

Run the train script with the contrast set generation method ('CGEdit' or 'CGEdit-ManualGen') as the first argument and the language ('AAE' or 'IndE') as the second argument. For example:

python train.py CGEdit-ManualGen AAE
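The two arguments line up with the directory layout under data/ listed above. A minimal sketch of that mapping, assuming the training set lives at data/<method>/<language>.tsv; the helper name training_set_path and the data_root parameter are hypothetical and are not taken from train.py:

```python
from pathlib import Path

# Values accepted by the train and eval scripts, as documented in this README.
METHODS = ("CGEdit", "CGEdit-ManualGen")
LANGUAGES = ("AAE", "IndE")


def training_set_path(method: str, language: str, data_root: str = "data") -> Path:
    """Return the TSV path for a (method, language) pair.

    Hypothetical helper: it assumes the data/<method>/<language>.tsv layout
    shown in the directory listing, not the actual logic inside train.py.
    """
    if method not in METHODS:
        raise ValueError(f"method must be one of {METHODS}, got {method!r}")
    if language not in LANGUAGES:
        raise ValueError(f"language must be one of {LANGUAGES}, got {language!r}")
    return Path(data_root) / method / f"{language}.tsv"


if __name__ == "__main__":
    print(training_set_path("CGEdit-ManualGen", "AAE").as_posix())
```

Checking the arguments before any training starts means a typo such as 'IndianE' fails immediately instead of surfacing later as a missing-file error.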

Evaluation

For each test example, the eval script prints a prediction in [0, 1] for each linguistic feature.

Run the eval script with the contrast set generation method used for training ('CGEdit' or 'CGEdit-ManualGen') as the first argument, the language ('AAE' or 'IndE') as the second argument, and the filename of the test set (not included in this repo) as the third argument. For example:

python eval.py CGEdit-ManualGen AAE testFileName
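Because each prediction is a score in [0, 1] rather than a yes/no label, a downstream step usually has to decide which features count as detected. A minimal sketch of one way to do that, assuming a cutoff of 0.5; the cutoff, the function name detected_features, and the placeholder feature names are illustrative assumptions, not part of eval.py:

```python
from typing import List, Sequence


def detected_features(
    scores: Sequence[float],
    feature_names: Sequence[str],
    threshold: float = 0.5,  # assumed cutoff; tune it on held-out data
) -> List[str]:
    """Return the names of features whose score meets the threshold.

    Hypothetical post-processing step for one test example: `scores` holds
    one prediction in [0, 1] per linguistic feature, in the same order as
    `feature_names`.
    """
    if len(scores) != len(feature_names):
        raise ValueError("scores and feature_names must have the same length")
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {score}")
    return [name for name, score in zip(feature_names, scores) if score >= threshold]


if __name__ == "__main__":
    # Placeholder names; the real feature inventory is described in the paper.
    names = ["feature_a", "feature_b", "feature_c"]
    print(detected_features([0.91, 0.08, 0.50], names))
```

A single global threshold is the simplest choice; per-feature thresholds may work better when some features are much rarer than others.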

Using our models

To access our fine-tuned model trained on the data in CGEdit-ManualGen/AAE.tsv for 17 African American English features, please see the Google Colab notebook (a copy is in this repo at code/tutorial.ipynb). The tutorial walks through how to access and use the model for linguistic feature detection.

Please contact us if you would like to access our model fine-tuned on the data in CGEdit-ManualGen/IndE.tsv for 10 Indian English features.
