royshan / pytorch-human-performance-gec

This project forked from rgcottrell/pytorch-human-performance-gec

A PyTorch implementation of "Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study"

License: Apache License 2.0

Batchfile 11.50% Python 88.50%

pytorch-human-performance-gec

Initialize Submodules

After checking out the repository, be sure to initialize the included git submodules:

git submodule update --init --recursive

Install Required Dependencies

This project requires PyTorch, which can be installed by following the directions on its project page.

This project also uses the fairseq NLP library, which is included as a submodule in this repository. To prepare the library for use, make sure that it is installed along with its dependencies.

cd fairseq
pip install -r requirements.txt
python setup.py build develop

OpenNMT Scripts (Legacy)

All OpenNMT scripts are grouped under the opennmt-scripts folder.

Preparing Data

The first step is to prepare the source and target pairs of training and validation data. Extract the original lang-8-en-1.0.zip under the corpus folder. Then create another folder, lang-8-opennmt, under the corpus folder to store the re-formatted corpus.

To split the Lang-8 learner data training set, use the following command:

python transform-lang8.py -src_dir <dataset-src> -out_dir <corpus-dir>

e.g.

python transform-lang8.py -src_dir ../corpus/lang-8-en-1.0 -out_dir ../corpus/lang-8-opennmt
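
The sketch below illustrates the kind of work a split step like this involves; it is not the contents of transform-lang8.py. It assumes each corpus line is tab-separated, with the learner sentence in the fifth column and zero or more corrected sentences after it; check this against the actual files before relying on it. The function names, the validation fraction, and the choice to pair uncorrected sentences with themselves are all hypothetical.

```python
import random


def extract_pairs(lines):
    """Return (source, target) sentence pairs from tab-separated corpus lines.

    Assumed layout (verify against the real corpus): columns 0-3 hold
    metadata, column 4 is the learner sentence, columns 5+ are corrections.
    """
    pairs = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # skip malformed lines
        source = cols[4].strip()
        if not source:
            continue
        corrections = [c.strip() for c in cols[5:] if c.strip()]
        if not corrections:
            # Assumption: an uncorrected sentence is treated as already correct.
            pairs.append((source, source))
        for corrected in corrections:
            pairs.append((source, corrected))
    return pairs


def split_pairs(pairs, valid_fraction=0.1, seed=0):
    """Shuffle deterministically and split into (train, valid) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]
```

Writing the source side and the target side of each split to parallel files, one sentence per line, yields the aligned layout that sequence-to-sequence preprocessing tools generally expect.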

Once the data has been extracted from the dataset, use OpenNMT to prepare the training and validation data and create the vocabulary:

preprocess-lang8.bat

Train the Model

To train the error-correcting model, run the following command:

train.bat

Note that this script may need to be adjusted based on the GPU and memory resources available for training.

Testing the Model

To test the model, run the following command to correct a list of test sentences:

translate.bat

After the sentences have been translated, the source and target sentences can be compared side by side using the following command:

python compare.py
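
As a rough illustration of a side-by-side comparison, the sketch below pairs each source sentence with its corrected output and flags the lines that changed. It is not the contents of compare.py; the function name, the column width, and the "*" marker are hypothetical choices.

```python
from itertools import zip_longest


def side_by_side(sources, targets, width=40):
    """Return one formatted row per sentence pair.

    Rows whose source and target differ are prefixed with '*'.
    """
    rows = []
    for src, tgt in zip_longest(sources, targets, fillvalue=""):
        src, tgt = src.strip(), tgt.strip()
        marker = " " if src == tgt else "*"
        rows.append(f"{marker} {src:<{width}} | {tgt}")
    return rows


if __name__ == "__main__":
    for row in side_by_side(["I has a pen .", "Hello ."],
                            ["I have a pen .", "Hello ."]):
        print(row)
```

In practice the two lists would come from the source file and the translated output file, read line by line.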

Patching OpenNMT-py Environment

If preprocess.py fails with a TypeError, then you may need to patch OpenNMT-py.

In OpenNMT-py\onmt\inputters\dataset_base.py, update the __reduce_ex__ method of the DatasetBase class to the following code:

def __reduce_ex__(self, proto):
    "This is a hack. Something is broken with torch pickle."
    return super(DatasetBase, self).__reduce_ex__(proto)
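
The toy classes below sketch why forwarding proto matters. They assume that the unpatched method called the parent's __reduce_ex__ without an argument; that is a guess at the cause, not something this README confirms. Base, Unpatched, Patched, and can_copy are hypothetical stand-ins, and copy.copy is used because it goes through the same __reduce_ex__ hook as pickle.

```python
import copy


class Base:
    """Stand-in for a dataset base class; not the real OpenNMT-py class."""

    def __init__(self, examples):
        self.examples = examples


class Unpatched(Base):
    def __reduce_ex__(self, proto):
        # Assumed pre-patch form: the protocol argument is dropped,
        # so object.__reduce_ex__ is called without its required argument.
        return super(Unpatched, self).__reduce_ex__()


class Patched(Base):
    def __reduce_ex__(self, proto):
        # Same shape as the patch above: forward the protocol number.
        return super(Patched, self).__reduce_ex__(proto)


def can_copy(obj):
    """Return True if copy.copy succeeds, False if it raises TypeError."""
    try:
        copy.copy(obj)
        return True
    except TypeError:
        return False
```

On recent Python 3 versions, can_copy(Patched([1, 2])) should succeed while can_copy(Unpatched([1, 2])) should fail with a TypeError, which matches the symptom described above.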

If TypeError: __init__() got an unexpected keyword argument 'dtype' occurs, the pytorch/text package installed by pip may be out of date. Update it using pip install git+https://github.com/pytorch/text

If RuntimeError: CuDNN error: CUDNN_STATUS_SUCCESS occurs during training, try installing PyTorch with CUDA 9.2 using conda instead of the default CUDA 9.0.

fairseq Scripts

All fairseq scripts are grouped under the fairseq-scripts folder.

Preparing Data

The first step is to prepare the source and target pairs of training and validation data. Extract the original lang-8-en-1.0.zip under the corpus folder. Then create another folder, lang-8-fairseq, under the corpus folder to store the re-formatted corpus.

To split the Lang-8 learner data training set, use the following command:

python transform-lang8.py -src_dir <dataset-src> -out_dir <corpus-dir>

e.g.

python transform-lang8.py -src_dir ../corpus/lang-8-en-1.0 -out_dir ../corpus/lang-8-fairseq

Once the data has been extracted from the dataset, use fairseq to prepare the training and validation data and create the vocabulary:

preprocess-lang8.bat

Train the Model

To train the error-correcting model, run the following command:

train-lang8-cnn.bat

Note that this script may need to be adjusted based on the GPU and memory resources available for training.

Testing the Model

To test the model, run the following command to correct a list of test sentences:

generate-lang8-cnn.bat

Patching fairseq Environment

If AttributeError: function 'bleu_zero_init' not found occurs on Windows, mark the affected functions with __declspec(dllexport), then build again. See Issue 292.

If UnicodeDecodeError: 'charmap' codec can't decode byte occurs, modify fairseq/tokenizer.py to pass encoding='utf8' in every open() call.
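
A minimal sketch of the encoding fix, independent of fairseq: without an explicit encoding, open() falls back to the platform default, which on Windows is often a legacy code page ("charmap"). The helper name read_lines and the sample text are hypothetical.

```python
import os
import tempfile


def read_lines(path):
    """Read a text file as UTF-8 regardless of the platform default encoding."""
    with open(path, "r", encoding="utf8") as f:
        return [line.rstrip("\n") for line in f]


if __name__ == "__main__":
    # Write a small UTF-8 file containing non-ASCII characters, then read it back.
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        f.write("caf\u00e9 na\u00efve\n".encode("utf8"))
    print(read_lines(path))
    os.remove(path)
```

The same encoding='utf8' argument applies when opening files for writing.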

When trying the built-in examples in fairseq/examples/translation/prepare-[dataset].sh, the script paths may need to be changed from $BPEROOT/[script].py to $BPEROOT/subword_nmt/[script].py.
