The official code for the "System Combination via Quality Estimation for Grammatical Error Correction" paper, published in EMNLP 2023.

Home Page: https://aclanthology.org/2023.emnlp-main.785

License: GNU General Public License v3.0

Topics: deep-learning, ensemble-model, gec, grammatical-error-correction, pytorch, quality-estimation, re-ranking

greco's Introduction

System Combination via Quality Estimation for Grammatical Error Correction

This repository provides the code to easily score, re-rank, and combine corrections from Grammatical Error Correction (GEC) models, as reported in this paper:

System Combination via Quality Estimation for Grammatical Error Correction
Muhammad Reza Qorib and Hwee Tou Ng
The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) (PDF)

Installation

Please install the necessary libraries by running the following commands:

pip install -r requirements.txt
wget -P models https://sterling8.d2.comp.nus.edu.sg/~reza/GRECO/checkpoint.bin
wget https://www.comp.nus.edu.sg/~nlp/sw/m2scorer.tar.gz
tar -xf m2scorer.tar.gz

Please check whether the installed PyTorch matches your hardware CUDA version.
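
For example, you can verify the installed build with a quick check (a minimal sketch using PyTorch's own version attributes):

import torch

# Show the installed PyTorch version and the CUDA version it was compiled against,
# then check whether a CUDA device is actually visible to PyTorch.
print(torch.__version__, torch.version.cuda)
print('CUDA available:', torch.cuda.is_available())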

To also run other quality estimation models, please run the following commands:

git clone https://github.com/nusnlp/neuqe
git clone https://github.com/thunlp/VERNet
git clone https://github.com/kokeman/SOME

Then download the corresponding model checkpoints by following the instructions in each repository.

Quality Estimation

Scoring hypotheses in your code

You can import the GRECO class from models.py, instantiate the class, and pass the source sentences and hypotheses (as Python lists of strings) to the .score() function.

import torch
from models import GRECO

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = GRECO('microsoft/deberta-v3-large').to(device)
model.load_state_dict(torch.load('models/checkpoint.bin', map_location=device))
# source: list of source sentences; hypotheses: list of corrections to score
model.score(source, hypotheses)

Correlation coefficient

Get the scores for all texts by running the commands below. In this example, we also score the texts with SOME.

python score_all.py --auto --data_dir data/conll-official/texts --output_path outputs/greco_scores.json --model greco --lm_model microsoft/deberta-v3-large --checkpoint models/checkpoint.bin --source_file data/conll-source.txt --batch_size 16
python score_all.py --auto --data_dir data/conll-official/texts --output_path outputs/some_scores.json --model some --source_file data/conll-source.txt --batch_size 16

Get the gold F0.5 score for each sentence by running this command.

python m2_for_corr.py --data_dir data/conll-official/reports --scorer m2scorer --output_path outputs/target.json

Calculate the correlation by running this command

python correlation.py --system_A outputs/greco_scores.json --system_B outputs/some_scores.json --target outputs/target.json --metric spearman
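
If you want to inspect the correlation outside correlation.py, the sketch below shows the idea using scipy, assuming each JSON file holds one score per sentence in the same order (the exact layout of the score files may differ, so adjust the loading code accordingly):

import json
from scipy.stats import spearmanr

# Assumption: both files are flat lists of per-sentence scores with matching order.
with open('outputs/greco_scores.json') as f:
    system_scores = json.load(f)
with open('outputs/target.json') as f:
    gold_scores = json.load(f)

rho, p_value = spearmanr(system_scores, gold_scores)
print(f'Spearman rho: {rho:.4f} (p = {p_value:.4g})')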

Re-ranking

Reproducing re-ranking F0.5 score

Run the following to re-rank the corrections

python rerank.py --data_dir data/conll-official/texts --source_file data/conll-source.txt --auto --output_path outputs/greco_rerank.out --model greco --lm_model microsoft/deberta-v3-large --checkpoint models/checkpoint.bin --batch_size 16

Run the following to get the F0.5 score

python2 m2scorer/scripts/m2scorer.py outputs/greco_rerank.out data/conll-2014.m2

Re-ranking your top-k model outputs

You can run the same command as above but change the data path in the --data_dir argument. For each k, write the k-th best correction for every source sentence into its own file inside a folder, and pass that folder path to the --data_dir argument; the code reads all files inside that folder. You can check data/conll-official/texts as an example, or see the sketch below.
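
For illustration, the sketch below writes top-k beam outputs into one file per rank. The variable beams, the example sentences, and the folder and file names are hypothetical placeholders; the only requirement is one correction per line, in the same order as the source file.

import os

# beams[i] is assumed to hold the ranked corrections for source sentence i,
# with beams[i][k] being the (k+1)-th best correction (hypothetical example data).
beams = [
    ['He has a dog .', 'He have a dog .'],
    ['She is happy .', 'She are happy .'],
]

out_dir = 'data/my-topk/texts'   # hypothetical folder to pass to --data_dir
os.makedirs(out_dir, exist_ok=True)

k = len(beams[0])
for rank in range(k):
    # One file per rank, one correction per line, same order as the source file.
    with open(os.path.join(out_dir, f'hyp_{rank}.txt'), 'w') as f:
        for sentence_beams in beams:
            f.write(sentence_beams[rank] + '\n')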

System Combination

Reproducing system combination F0.5 score

Run the following command to reproduce the BEA-2019 test result

python run_combination.py --model greco --lm_model microsoft/deberta-v3-large --output_path outputs/bea-test.out --beam_size 16 --batch_size 16 --checkpoint models/checkpoint.bin --data data/test-m2/Riken-Tohoku.m2 data/test-m2/Kakao-Brain.m2 data/test-m2/UEDIN-MS.m2 data/test-m2/T5-Large.m2 data/test-m2/GECToR-XLNet.m2 data/test-m2/GECToR-Roberta.m2 --vote_coef 0.4 --edit_scores edit_scores/bea-test_score.json --score_ratio 0.7

Then, compress outputs/bea-test.out into a zip file and upload it to https://codalab.lisn.upsaclay.fr/competitions/4057#participate
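
For example, a minimal way to create the zip with Python (a sketch; the archive and member names are assumptions, so double-check the competition's submission instructions):

import zipfile

# Package the combined output for upload; 'bea-test.zip' is a hypothetical archive name.
with zipfile.ZipFile('outputs/bea-test.zip', 'w') as zf:
    zf.write('outputs/bea-test.out', arcname='bea-test.out')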

Run the following command to reproduce the CoNLL-2014 test result

python run_combination.py --model greco --lm_model microsoft/deberta-v3-large --output_path outputs/conll-2014.out --beam_size 16 --batch_size 16 --checkpoint models/checkpoint.bin --data data/conll-m2/Riken-Tohoku.m2 data/conll-m2/UEDIN-MS.m2 data/conll-m2/T5-Large.m2 data/conll-m2/GECToR-XLNet.m2 data/conll-m2/GECToR-Roberta.m2 --vote_coef 0.4

Run the following to get the F0.5 score

python2 m2scorer/scripts/m2scorer.py outputs/conll-2014.out data/conll-2014.m2

Retraining the model

Run the following command to train a new model

python train.py --do_train --model_name_or_path microsoft/deberta-v3-large --output_dir models/new_model --learning_rate 2e-5 --word_dropout 0.25 --save_strategy epoch --per_device_train_batch_size 32 --gradient_accumulation_steps 4 --num_train_epochs 15 --alpha 1 --data data/train.json --data_mode hierarchical --edit_weight 2.0 --rank_multiplier 5

License

The source code and models in this repository are licensed under the GNU General Public License Version 3 (see License). Separate commercial licensing is also available for commercial use of this code and these models. Please contact Hwee Tou Ng ([email protected]).

greco's People

Contributors

mrqorib


greco's Issues

how to get the `models/checkpoint.bin` file

Thanks for your work on GEC. I have a question about how to get the models/checkpoint.bin file. Do I need to retrain the model to obtain it, or is an already-trained checkpoint provided for testing? The relevant code is as follows.

import torch
from models import GRECO

model = GRECO('microsoft/deberta-v3-large').to(device)
model.load_state_dict(torch.load('models/checkpoint.bin'))
model.score(source, hypotheses)

Missing checkpoint?

Hi, I'm new to this topic, and I found something wrong when running the scripts you provided:

humor@Charon:~/greco$ python score_all.py --auto --data_dir data/conll-official/texts --output_path outputs/some_scores.json --model some --source_file data/conll-source.txt --batch_size 16
Traceback (most recent call last):
  File "/home/humor/.local/lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File "/home/humor/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/humor/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'checkpoints/some/grammer'. Use `repo_type` argument if needed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/humor/greco/score_all.py", line 98, in <module>
    main(args)
  File "/home/humor/greco/score_all.py", line 37, in main
    model = get_model(args)
  File "/home/humor/greco/models.py", line 875, in get_model
    model = SOME(model_args)
  File "/home/humor/greco/models.py", line 596, in __init__
    self.model_g = BertForSequenceClassification.from_pretrained(self.args.g_dir)
  File "/home/humor/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2899, in from_pretrained
    resolved_config_file = cached_file(
  File "/home/humor/.local/lib/python3.10/site-packages/transformers/utils/hub.py", line 462, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'checkpoints/some/grammer'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

I think this should be a local folder with checkpoints, but after a quick search I didn't find it.

Could you please provide more info on how to run this?
