hillzhang1999 / mucgec

Open-source release of the MuCGEC Chinese grammatical error correction dataset and SOTA text correction models; Code & Data for our NAACL 2022 Paper "MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction"

Home Page: https://aclanthology.org/2022.naacl-main.227/

License: Apache License 2.0

Python 94.17% Shell 3.56% Macaulay2 2.27%
dataset gec generation naacl grammatical-error-correction

mucgec's Introduction

MuCGEC: A Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction & SOTA Models

English | 简体中文

Latest News

  • 2023.5.26 Our latest work, NaSGEC, has been accepted to ACL 2023. In this paper we present a multi-domain Chinese native-speaker error correction dataset, together with customized correction models targeting complex ill-formed sentences in social media and academic writing. You are welcome to try it out! Link: [Link]

  • 2023.1.12 We released two new SOTA correction models (based on BART) on Alibaba's ModelScope, one for the general domain and one for the legal domain, with one-click inference and an interactive demo. You are welcome to try them: General domain / Legal domain

  • 2022.10.18 Our latest work, SynGEC, has been accepted to EMNLP 2022. In this paper we propose SynGEC, which incorporates GEC-adapted syntax and achieves F0.5 scores of 45.32/46.51 on NLPCC-18 and MuCGEC-Test. You are welcome to try it out! Link: [Link]

  • 2022.8.29 We uploaded the MuCGEC dataset (including development-set answers and test-set inputs) [Link] and opened a long-term test-set leaderboard on the Tianchi platform [Link]. Submissions are welcome; see [Link] for how to submit.

  • 2022.6.23 We open-sourced the semantic error correction templates collected during the CTC-2021 evaluation; see: [GitHub], [paper]

  • 2022.6.5 The MuCGEC dataset is available as Track 4 of the CCL2022-CLTC shared task on the Aliyun Tianchi platform. Everyone is welcome to use it and compete on the leaderboard!

Citation

If you find our work helpful, please cite our paper:

MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction (accepted to the NAACL 2022 main conference) [PDF]

@inproceedings{zhang-etal-2022-mucgec,
    title = "{M}u{CGEC}: a Multi-Reference Multi-Source Evaluation Dataset for {C}hinese Grammatical Error Correction",
    author = "Zhang, Yue  and
      Li, Zhenghua  and
      Bao, Zuyi  and
      Li, Jiacheng  and
      Zhang, Bo  and
      Li, Chen  and
      Huang, Fei  and
      Zhang, Min",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.227",
    pages = "3118--3130",
    }

Introduction

Given a piece of Chinese text, Chinese Grammatical Error Correction (CGEC) aims to automatically correct all kinds of errors in it, including spelling, lexical, grammatical, and semantic errors. The technology has broad applications in education, news, communication, and even search.

Existing CGEC evaluation sets suffer from small size, few references, and a single domain. To enable more reliable model evaluation, this repository provides MuCGEC, a high-quality multi-reference CGEC evaluation dataset. In addition, to advance the CGEC field, we also provide the following resources:

  • Annotation guidelines for Chinese GEC data ./guidelines: we define a detailed taxonomy of common Chinese grammatical error types and, for each type, provide correction strategies and rich examples, which we hope will facilitate research on CGEC data annotation.
  • Evaluation tools for Chinese GEC ./scorers
    • ChERRANT: we adapted ERRANT, the widely used English evaluation tool supporting fine-grained error types, to Chinese and named it ChERRANT (Chinese ERRANT). ChERRANT supports both character-level and word-level evaluation. Character-level ChERRANT metrics are the primary metrics for the MuCGEC dataset, alleviating the inaccurate evaluation caused by Chinese word-segmentation errors. Word-level evaluation supports finer error types (e.g., spelling errors, noun errors, verb errors), helping researchers analyze models in more detail.
  • Baseline models for Chinese GEC ./models
    • Seq2Edit model ./models/seq2edit-based-CGEC: designs edit labels (e.g., substitute, delete, insert, reorder) and treats GEC as a sequence labeling task.
      • We modified GECToR, the SOTA English Seq2Edit model, to support Chinese.
    • Seq2Seq model ./models/seq2seq-based-CGEC: treats GEC as translation from erroneous sentences to correct ones, solved with strong neural machine translation models.
      • We fine-tune the large-scale pretrained Seq2Seq language model Chinese BART for the GEC task.
    • Model ensemble ./scorers/ChERRANT/ensemble.sh: we provide a simple edit-based ensemble method that supports combining heterogeneous models (e.g., Seq2Seq and Seq2Edit).
  • Common tools for Chinese GEC ./tools
    • Word segmentation tools
    • Data augmentation (Todo)
    • Data cleaning (Todo)

The MuCGEC Dataset

Our data mainly comes from Chinese-as-a-second-language learners and is sampled from the following sources: the NLPCC18 test set (from the NLPCC18 Shared Task 2), the CGED test set (from the CGED18&20 shared tasks), and the Chinese Lang8 training set (from the NLPCC18 Shared Task 2). We sampled 2,000-3,000 sentences from each of the three sources; each sentence was annotated by three randomly assigned annotators and then reviewed by an expert to build the test set. Overall statistics are shown in the table below.

Dataset #Sentences #Erroneous (ratio) Avg. chars Avg. edits Avg. references
MuCGEC-NLPCC18 1996 1904 (95.4%) 29.7 2.5 2.5
MuCGEC-CGED 3125 2988 (95.6%) 44.8 4.0 2.3
MuCGEC-Lang8 1942 1652 (85.1%) 37.5 2.8 2.1
MuCGEC-ALL 7063 6544 (92.7%) 38.5 3.2 2.3

Compared with previous CGEC evaluation sets (e.g., NLPCC18 and CGED), MuCGEC offers richer references and more diverse data sources. In addition, during annotation we found 74 sentences that could not be annotated due to unclear meaning or similar problems.

For more details about the MuCGEC dataset, please refer to our paper.

Data Download

The MuCGEC development set is publicly available; the test set is evaluated through an online leaderboard. Please see https://tianchi.aliyun.com/dataset/dataDetail?dataId=131328 for access.

CGEC Benchmark Models

Environment Setup

We use Python 3.8 for our experiments. The necessary dependencies can be installed with the commands below; since the Seq2Edit and Seq2Seq environments have some conflicting requirements, install them as two separate environments:

# Seq2Edit model
pip install -r requirements_seq2edit.txt

# Seq2Seq model
pip install -r requirements_seq2seq.txt

Training Data

Our training data consists of the erroneous sentences from the Lang8 dataset (from the language-learning website Lang-8) and the HSK dataset (a Chinese proficiency exam corpus developed by Beijing Language and Culture University). We upsample the HSK data 5 times and filter out sentences that overlap with our test sets, resulting in about 1.5 million sentence pairs.

Download: Google Drive
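As a rough sketch of the data preparation described above (assuming parallel files lang8.src/lang8.tgt and hsk.src/hsk.tgt after filtering out test-set overlaps; the actual file names in the download may differ), upsampling HSK 5x and concatenating could look like:

# hypothetical file names; adapt to the downloaded archive
for i in 1 2 3 4 5; do cat hsk.src; done > hsk_x5.src
for i in 1 2 3 4 5; do cat hsk.tgt; done > hsk_x5.tgt
cat lang8.src hsk_x5.src > train.src   # ~1.5M sentence pairs in total
cat lang8.tgt hsk_x5.tgt > train.tgt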

Model Usage

We provide pipeline scripts covering preprocessing, training, and inference; see ./models/seq2edit-based-CGEC/pipeline.sh and ./models/seq2seq-based-CGEC/pipeline.sh. A minimal invocation sketch is shown below.
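A minimal invocation sketch (assuming the dependencies above are installed and the training data has been placed where the scripts expect it; paths, GPU ids, and hyperparameters are configured inside each pipeline.sh):

# Seq2Edit: preprocessing -> two-stage training -> inference
(cd models/seq2edit-based-CGEC && sh pipeline.sh)

# Seq2Seq: preprocessing -> training -> inference
(cd models/seq2seq-based-CGEC && sh pipeline.sh)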

We also provide trained checkpoints for testing (all metrics below are Precision/Recall/F0.5):

Model NLPCC18-Official (M2Scorer) MuCGEC (ChERRANT)
seq2seq_lang8 [Link] 37.78/29.91/35.89 40.44/26.71/36.67
seq2seq_lang8+hsk [Link] 41.50/32.87/39.43 44.02/28.51/39.70
seq2edit_lang8 [Link] 37.43/26.29/34.50 38.08/22.90/33.62
seq2edit_lang8+hsk [Link] 43.12/30.18/39.72 44.65/27.32/39.62

After downloading, unzip the seq2seq checkpoints into ./models/seq2seq-based-CGEC/exps and the seq2edit checkpoints into ./models/seq2edit-based-CGEC/exps. The seq2seq models are based on the Chinese-BART-Large pretrained language model, and the seq2edit models on StructBERT-Large.
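For example (a sketch; the actual archive names depend on the download links in the table above):

# hypothetical archive names; replace with the files you downloaded
unzip seq2seq_lang8+hsk.zip  -d ./models/seq2seq-based-CGEC/exps/
unzip seq2edit_lang8+hsk.zip -d ./models/seq2edit-based-CGEC/exps/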

For the model ensemble strategy used in our paper, see ./scorers/ChERRANT/ensemble.sh.
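Judging from the positional arguments the script reads (three Seq2Edit outputs followed by three Seq2Seq outputs, each from a different random seed), an invocation might look like the following; the result file names are placeholders:

cd scorers/ChERRANT
sh ensemble.sh seq2edit_seed1.out seq2edit_seed2.out seq2edit_seed3.out \
               seq2seq_seed1.out seq2seq_seed2.out seq2seq_seed3.out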

Tips

  • We find that several tricks that help in English also work for Chinese, such as GECToR's additional-confidence trick and Seq2Seq R2L reranking. If you need higher performance, these tricks are worth trying.
  • We find that two-stage training (first Lang8+HSK, then HSK alone) further improves over single-stage training; if interested, you can retrain the models with this two-stage strategy.
  • We find that the Chinese-BART-based Seq2Seq model still has room for improvement: 1) the original Chinese BART vocabulary lacks some common Chinese punctuation marks/characters; 2) training and inference with the transformers library are relatively slow and memory-hungry. We recently re-implemented the BART-based Seq2Seq model on fairseq with some additional training tricks, which greatly improves results (by 4-5 F0.5 points) and makes training/inference 3-4 times faster. This work will be cleaned up and open-sourced later.
  • The baselines we currently provide use only public training data. For data augmentation techniques, see our earlier solution for the CTC2021 competition [Link]; well-constructed synthetic data can boost model performance substantially.

Model Evaluation

For the official NLPCC18 dataset, run our benchmark models to produce predictions and then compute metrics with the official NLPCC18 tool M2Scorer. Note that the predictions must be word-segmented with the PKUNLP tool.

For the MuCGEC metrics, use our ChERRANT tool; see ./scorers/ChERRANT/demo.sh for usage examples. The character-level metrics partially follow the ERRANT_zh repository, while the word-level metrics and error-type taxonomy follow the original ERRANT.
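A character-level evaluation sketch with ChERRANT (demo.sh is the authoritative reference; source.txt, hyp.txt, and gold.m2 below are placeholder file names, and the comparison script name should be checked against the repository):

cd scorers/ChERRANT
# 1) pair each source sentence with its hypothesis and number the lines
paste source.txt hyp.txt | awk '{print NR"\t"$p}' > hyp.para
# 2) extract character-level edits from the hypothesis
python parallel_to_m2.py -f hyp.para -o hyp.m2 -g char
# 3) compare hypothesis edits against the multi-reference gold edits to get P/R/F0.5
python compare_m2_for_evaluation.py -hyp hyp.m2 -ref gold.m2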

Error Types

  • Operation level (character/word granularity):

    • M (missing): a missing character/word must be inserted
    • R (redundant): a redundant character/word must be deleted
    • S (substitute): an incorrect character/word must be replaced
    • W (word-order): words must be reordered
  • Linguistic level (word granularity only):

    • We define 14 main linguistic error types (largely based on part of speech). Apart from spelling errors (SPELL) and word-order errors (W), they can be further combined with the operation-level labels (substitution, missing, redundant); e.g., a redundant adjective error is labeled R:ADJ (an illustrative edit is sketched after the figure placeholder below).

[Figure: error types]
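To make the combined labels concrete, a hand-constructed (hypothetical) word-level edit in M2-style notation might look as follows, marking the second "好" as a redundant adjective (R:ADJ) to be deleted:

S 他 是 一 个 很 好 好 的 人
A 6 7|||R:ADJ|||-NONE-|||REQUIRED|||-NONE-|||0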

Related Work

  • We used some of the techniques in this repository in the CTC2021 shared task and ranked Top 1; the technical report is available here: CTC-report
  • An online demo of our baseline models is available: GEC demo (access from outside the campus network may be slow; please be patient).
  • YACLC, a Chinese learner corpus: YACLC
  • NLPCC18 Chinese GEC dataset: NLPCC18

Contact

If you run into any problems while using our dataset or code, feel free to contact [email protected]

mucgec's People

Contributors

haochenjiang2000, hillzhang1999, ymliucs


mucgec's Issues

Parameter settings for the seq2edit model

Hi, when I use your model for correction it turns "三四季度" (Q3/Q4) into "三四月份" (March/April).
Can I fix this by adjusting parameters?
parser.add_argument('--min_probability',
                    type=float,
                    default=0)  # minimum token-level correction threshold
Would increasing this help?
Also,
parser.add_argument('--additional_confidence',
                    type=float,
                    help='How many probability to add to $KEEP token',
                    default=0.0)  # extra confidence added to the KEEP label
What does this parameter do?

Which dataset were the official baseline metrics computed on?

Regarding "We also provide trained checkpoints for testing (all metrics are Precision/Recall/F0.5)": which dataset were those numbers computed on?

data
├── MuCGEC
│   ├── example_pred_dev.txt
│   ├── filter_sentences.txt
│   ├── MuCGEC_dev.txt
│   ├── MuCGEC_test.txt
│   ├── README.en.md
│   └── README.md
├── mucgec_A
│   ├── example_pred_dev.txt
│   ├── filter_sentences.txt
│   ├── mucgec_A.zip
│   ├── MuCGEC_dev.txt
│   ├── MuCGEC_test.txt
│   └── README.md
├── mucgec_B
│   ├── example_pred_dev.txt
│   ├── filter_sentences.txt
│   ├── MuCGEC_dev.txt
│   ├── MuCGEC_test.txt
│   └── README.md
├── README.md
└── utils.py

Vocab file for the seq2edit model

Hi, I want to run inference directly with your seq2edit model.
predict.py in the seq2edit pipeline uses VOCAB_PATH=./data/output_vocabulary_chinese_char_hsk+lang8_5,
but I cannot find this file. Could you point me to it?

Cannot open Tianchi files

Hi. :)

I just downloaded the MuCGEC data files from Tianchi but cannot open them because they are password protected.

Can you help?

labels_accuracy_except_keep drops to 0 during training

Hi, during training the metric labels_accuracy_except_keep: 0.0000 gradually drops to 0 as training proceeds. What could be causing this?
Stage 1 is fine; it only happens gradually in stage 2.

Question about re-implementing the BART Seq2Seq model on fairseq

Hi, I am curious about the statement in the README that "we recently re-implemented the BART-based Seq2Seq model on fairseq".
Could you briefly describe how fairseq significantly speeds up inference, or give a few keywords as pointers?
Looking forward to this part being open-sourced 🎉

can't load tokenizer

Hi, I got this error when training with the pipeline. How do I obtain this file, or is my path wrong? Could you please give a concrete answer? Thanks.
Traceback (most recent call last):
  File "predict.py", line 19, in <module>
    tokenizer = BertTokenizer.from_pretrained(args.model_path)
  File ".../.local/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1693, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for './exps/seq2seq_lang8'. Make sure that:

  • './exps/seq2seq_lang8' is a correct model identifier listed on 'https://huggingface.co/models'
  • or './exps/seq2seq_lang8' is the correct path to a directory containing relevant tokenizer files

Does ChERRANT strictly require OpenCC==1.1.2?

Environment: Mac-arm64, new conda virtual env.

pip install OpenCC==1.1.2
ERROR: Could not find a version that satisfies the requirement OpenCC==1.1.2 (from versions: 0.1, 0.2, 1.1.0, 1.1.0.post1, 1.1.1, 1.1.3, 1.1.4)
ERROR: No matching distribution found for OpenCC==1.1.2

Tried both the Tuna mirror and the pypi.org/simple channel; installation failed either way.

After installing OpenCC 1.1.3 and running python parallel_to_m2.py, importing the LTP package fails:

  File "/Users/peter42/Documents/github/MuCGEC/scorers/ChERRANT/modules/tokenizer.py", line 1, in <module>
    from ltp import LTP
  File "/Users/peter42/opt/miniconda3/envs/pdenv/lib/python3.9/site-packages/ltp/__init__.py", line 8, in <module>
    from . import nn, utils
  File "/Users/peter42/opt/miniconda3/envs/pdenv/lib/python3.9/site-packages/ltp/utils/__init__.py", line 10, in <module>
    from .convertor import map2device, convert2npy
  File "/Users/peter42/opt/miniconda3/envs/pdenv/lib/python3.9/site-packages/ltp/utils/convertor.py", line 6, in <module>
    from torch._six import container_abcs
ImportError: cannot import name 'container_abcs' from 'torch._six' (/Users/peter42/opt/miniconda3/envs/pdenv/lib/python3.9/site-packages/torch/_six.py)

Do ltp==4.1.3.post1 and OpenCC==1.1.2 have to match exactly?

MuCGEC_CGED_Dev.label file not found

A quick question:
when running sh pipeline.sh, the following error occurs. Where can I obtain the MuCGEC_CGED_Dev.label file?

FileNotFoundError: file ../../data/valid_data/MuCGEC_CGED_Dev.label not found

FileNotFoundError

While running pipeline.sh I hit the following problem:
FileNotFoundError: [Errno 2] No such file or directory: './exps/seq2edit_lang8/Temp_Model.th'

How can I fix this?

My own Seq2Edit re-implementation does not perform as well

Hi, I re-implemented a Seq2Edit model with a Transformer architecture (currently only single-round correction). During training I noticed that the accuracy on non-KEEP labels is quite low, even though I checked that my inputs, labels, and d_tags are identical to those in the original code. I suspect the issue lies in the training procedure; the original training loop is heavily wrapped by AllenNLP, and I worry that differences in training settings hurt performance. (On my own dataset the original code reaches 80%+, but my re-implementation only reaches around 30%.)

About the capability of the seq2edit model

Hi, thank you for open-sourcing such excellent work. While studying it, I ran into the following issue:

the seq2edit model does not seem to handle sentences missing a long object well (could this be because the tokenizer works at the character level?). A few examples:

source: 我们要尽一切力量使我们农业走上机械化集体化
target: 我们要尽一切力量使我们农业走上机械化集体化轨轨

source: 近两年来他们在全县推广了马河大队坚持的科学种田
target: 近两年来他们在全县推广了马河大队坚持实的科学种田方

source: 开展走访慰问工会系统部分老党员老干部
target: 开展走访慰问工会系统部分老党员老干部活

From these results the model seems to roughly know where the problem is, but its corrections are poor. I tried adjusting iteration_count, min_error_probability, and additional_confidence, but none of them helped.

Is this a limitation of the model itself, or does it simply need a large amount of similar training data?

Looking forward to your reply.

Baseline model behind the online demo

Dear Zhang Yue,
when I test with your released seq2edit_lang8+hsk model, some inputs are not corrected properly, but the online demo corrects them. Which baseline model does the online demo use?

A question about the training process

Hi, I have been reproducing your code recently and have a question.
The GECToR paper says that stage 1 and stage 2 use different training data: stage 1 uses a large amount of synthetic data, while stage 2 uses a small amount of non-synthetic data.
In your code, stage 1 and stage 2 appear to use the same data. Have you tried training in the way described in the GECToR paper, and would it give better results?

What is the format of the result files for ensembling?

# Three Seq2Edit models, trained with 3 different random seeds
RESULT_FILE1=$1
RESULT_FILE2=$2
RESULT_FILE3=$3

# Three Seq2Seq models, trained with 3 different random seeds
RESULT_FILE4=$4
RESULT_FILE5=$5
RESULT_FILE6=$6

Is each result file one sentence per line, with tokens separated by spaces?

seq2seq environment issue

Hi~ For some reason I run into an environment configuration problem when running the seq2seq pipeline.sh:
Traceback (most recent call last):
  File "train.py", line 11, in <module>
    import transformers
  File "/data/jydong/anaconda3/envs/SQ/lib/python3.8/site-packages/transformers/__init__.py", line 43, in <module>
    from . import dependency_versions_check
  File "/data/jydong/anaconda3/envs/SQ/lib/python3.8/site-packages/transformers/dependency_versions_check.py", line 41, in <module>
    require_version_core(deps[pkg])
  File "/data/jydong/anaconda3/envs/SQ/lib/python3.8/site-packages/transformers/utils/versions.py", line 94, in require_version_core
    return require_version(requirement, hint)
  File "/data/jydong/anaconda3/envs/SQ/lib/python3.8/site-packages/transformers/utils/versions.py", line 85, in require_version
    if want_ver is not None and not ops[op](version.parse(got_ver), version.parse(want_ver)):
  File "/data/jydong/anaconda3/envs/SQ/lib/python3.8/site-packages/packaging/version.py", line 52, in parse
    return Version(version)
  File "/data/jydong/anaconda3/envs/SQ/lib/python3.8/site-packages/packaging/version.py", line 197, in __init__
    raise InvalidVersion(f"Invalid version: '{version}'")
packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'

I am not sure which package's version is causing this...

seq2edit

In the seq2edit prediction stage, how should min_probability, min_error_probability, and confidence be set to avoid false corrections?

seq2edit predict issue

Hi, following the prediction script in pipeline.sh, I tested with an input file containing just two lines of text, in the following format:
今天气真不戳啊。
我下午要去**人民很行取钱。

With the seq2edit_lang8 model, no corrections are made no matter what I input; with the seq2edit_lang8+hsk model, the correction result is always just a single Chinese closing double quotation mark ".
What might the problem be?

FileNotFoundError: [Errno 2] No such file or directory: './ensemble_results/3seq2edit_3seq2seq_threshold_4/MuCGEC_test.m2_temp'

The files at these paths cannot be found; how should I fix this?
sh .\ensemble.sh
usage: rule_ensemble.py [-h] --result_path RESULT_PATH [RESULT_PATH ...] --output_path OUTPUT_PATH [-T THRESHOLD]
rule_ensemble.py: error: argument --result_path: expected at least one argument
Traceback (most recent call last):
  File "F:\dasixia\MuCGEC\scorers\ChERRANT\m2convertor.py", line 103, in <module>
    main()
  File "F:\dasixia\MuCGEC\scorers\ChERRANT\m2convertor.py", line 89, in main
    for src_sent, edit_lines in read_file():
  File "F:\dasixia\MuCGEC\scorers\ChERRANT\m2convertor.py", line 73, in read_file
    with open(args.f, "r", encoding="utf8") as fr:
FileNotFoundError: [Errno 2] No such file or directory: './ensemble_results/3seq2edit_3seq2seq_threshold_4/MuCGEC_test.m2_temp'

FileNotFoundError: [Errno 2] No such file or directory: '../../data/train_data/lang8+hsk/train.src_only_erroneous'

sh .\pipeline.sh
Traceback (most recent call last):
  File "F:\dasixia\MuCGEC\models\seq2seq-based-CGEC\utils.py", line 211, in <module>
    convert_parallel_data_to_json_file(args[0], args[1], args[2])
  File "F:\dasixia\MuCGEC\models\seq2seq-based-CGEC\utils.py", line 27, in convert_parallel_data_to_json_file
    with open(source_data_file, "r", encoding='utf-8') as f1:
FileNotFoundError: [Errno 2] No such file or directory: '../../data/train_data/lang8+hsk/train.src_only_erroneous'
Traceback (most recent call last):
  File "F:\dasixia\MuCGEC\models\seq2seq-based-CGEC\utils.py", line 211, in <module>
    convert_parallel_data_to_json_file(args[0], args[1], args[2])
  File "F:\dasixia\MuCGEC\models\seq2seq-based-CGEC\utils.py", line 27, in convert_parallel_data_to_json_file
    with open(source_data_file, "r", encoding='utf-8') as f1:
FileNotFoundError: [Errno 2] No such file or directory: '../../data/valid_data/MuCGEC_CGED_Dev.src'
Traceback (most recent call last):
  File "F:\dasixia\MuCGEC\models\seq2seq-based-CGEC\train.py", line 11, in <module>
    import transformers
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\__init__.py", line 43, in <module>
    from . import dependency_versions_check
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\dependency_versions_check.py", line 41, in <module>
    require_version_core(deps[pkg])
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\utils\versions.py", line 94, in require_version_core
    return require_version(requirement, hint)
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\utils\versions.py", line 85, in require_version
    if want_ver is not None and not ops[op](version.parse(got_ver), version.parse(want_ver)):
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\packaging\version.py", line 52, in parse
    return Version(version)
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\packaging\version.py", line 197, in __init__
    raise InvalidVersion(f"Invalid version: '{version}'")
packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'
Generating...
Traceback (most recent call last):
  File "F:\dasixia\MuCGEC\models\seq2seq-based-CGEC\predict.py", line 7, in <module>
    from transformers import BartForConditionalGeneration, BertTokenizer
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\__init__.py", line 43, in <module>
    from . import dependency_versions_check
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\dependency_versions_check.py", line 41, in <module>
    require_version_core(deps[pkg])
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\utils\versions.py", line 94, in require_version_core
    return require_version(requirement, hint)
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\utils\versions.py", line 85, in require_version
    if want_ver is not None and not ops[op](version.parse(got_ver), version.parse(want_ver)):
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\packaging\version.py", line 52, in parse
    return Version(version)
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\packaging\version.py", line 197, in __init__
    raise InvalidVersion(f"Invalid version: '{version}'")
packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'
Generating Finish!
0 minutes and 0 seconds elapsed.

Question about the two checkpoints in the Seq2Edit pipeline

Hi, I am currently reproducing the two MuCGEC models. While running the Seq2Edit pipeline, I noticed that a single training run produces two checkpoints, Temp_Model.th and Best_Model_Stage_1.th (or Best_Model_Stage_2.th). What is the difference between the Temp Model and the Best Model checkpoints? And why does the second training stage in pipeline.sh load the former rather than the latter?

Are there any reference articles on how to improve inference speed?

Are there any reference articles on improving inference speed? I am using the model at https://modelscope.cn/models/damo/nlp_bart_text-error-correction_chinese/summary on four Tesla V100S GPUs; correcting one sentence takes 33.6 seconds and five sentences take 35.5 seconds. The data are as follows:
'''
沙尘暴也是一类空气污染之一。
所以一些人说,:读书一点用处都没有。
从学习写日常书信和专用书信的课中,我毕到了如何在不同场合使用不同语言的方式。
我学到了如和更简练,更书面地表达自己的意见。
我学到了为自己的过去进行歹思并鼓起勇气的方法。
'''
The test code is as follows:
'''
import time
from modelscope.pipelines import pipeline

def correctSentense(sentense, batch_size=1):
    t = time.time()
    p = pipeline('text-error-correction', 'damo/nlp_bart_text-error-correction_chinese')
    print(p(sentense, batch_size))
    print(f'cost:{time.time() - t:.8f}s')
'''
Thank you very much.

Issue on reproducing the Seq2edit model result on NLPCC-2018 test set

Hi authors,

Thank you very much for the great work:) I have tried to follow your code to reproduce the result on the NLPCC-2018 test set by fine-tuning the seq2edit model on the NLPCC_2018 train set only.

I first filtered duplicate sentences that appear in the NLPCC-2018 dataset and discarded correct sentences, following your description in "Training data" under Section 5. After filtering, I obtained 1,091,542 sentences.

After this, I used your pipeline.sh (https://github.com/HillZhang1999/MuCGEC/blob/main/models/seq2edit-based-CGEC/pipeline.sh) to train the seq2edit model.

After training and inference, I only obtained an F0.5 score of 33.57 (P: 37.78; R: 23.21).

Below are my training logs for cold start and non-cold start respectively:

Cold Start:

2022-04-29 13:17:09,787 - train.py - INFO - Data is loaded
2022-04-29 13:17:19,663 - train.py - INFO - Model is set
2022-04-29 13:17:19,671 - train.py - INFO - Start training
2022-04-29 13:17:19,672 - train.py - INFO - epoch: 0
2022-04-29 13:47:51,824 - train.py - INFO - The accuracy of predicting for edit labels is: 0.8213796019554138
2022-04-29 13:47:51,825 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.018478067591786385
2022-04-29 13:48:01,068 - train.py - INFO - Model is dumped
2022-04-29 13:48:01,068 - train.py - INFO - (best)Saving Model...
2022-04-29 13:48:11,674 - train.py - INFO - Model is dumped
2022-04-29 13:48:11,674 - train.py - INFO - best labels_accuracy till now:0.8213796019554138
2022-04-29 13:48:11,675 - train.py - INFO - epoch: 1
2022-04-29 13:48:22,458 - train.py - INFO - Model is dumped
2022-04-29 13:48:22,459 - train.py - INFO - epoch: 1 checkpoint saved to ./exps/seq2edit_lang8/epoch1.th
2022-04-29 14:15:59,206 - train.py - INFO - The accuracy of predicting for edit labels is: 0.8210737109184265
2022-04-29 14:15:59,861 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.018300961703062057
2022-04-29 14:16:11,182 - train.py - INFO - Model is dumped
2022-04-29 14:16:11,182 - train.py - INFO - best labels_accuracy till now:0.8213796019554138
2022-04-29 14:16:11,182 - train.py - INFO - epoch: 2
2022-04-29 14:16:23,102 - train.py - INFO - Model is dumped
2022-04-29 14:16:23,102 - train.py - INFO - epoch: 2 checkpoint saved to ./exps/seq2edit_lang8/epoch2.th

Non-Cold Start:

2022-04-29 14:16:31,549 - train.py - INFO - Data is loaded
2022-04-29 14:16:46,257 - train.py - INFO - load pretrained model
2022-04-29 14:16:46,587 - train.py - INFO - Model is set
2022-04-29 14:16:46,598 - train.py - INFO - Start training
2022-04-29 14:16:46,598 - train.py - INFO - epoch: 0
2022-04-29 15:24:24,749 - train.py - INFO - The accuracy of predicting for edit labels is: 0.8269697427749634
2022-04-29 15:24:24,749 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.13035008311271667
2022-04-29 15:24:34,925 - train.py - INFO - Model is dumped
2022-04-29 15:24:34,926 - train.py - INFO - (best)Saving Model...
2022-04-29 15:24:46,774 - train.py - INFO - Model is dumped
2022-04-29 15:24:46,775 - train.py - INFO - best labels_accuracy till now:0.8269697427749634
2022-04-29 15:24:46,775 - train.py - INFO - epoch: 1
2022-04-29 15:24:57,297 - train.py - INFO - Model is dumped
2022-04-29 15:24:57,297 - train.py - INFO - epoch: 1 checkpoint saved to ./exps/seq2edit_lang8/epoch1.th
2022-04-29 16:32:52,444 - train.py - INFO - The accuracy of predicting for edit labels is: 0.8282670378684998
2022-04-29 16:32:52,446 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.15213413536548615
2022-04-29 16:33:03,145 - train.py - INFO - Model is dumped
2022-04-29 16:33:03,145 - train.py - INFO - (best)Saving Model...
2022-04-29 16:33:17,291 - train.py - INFO - Model is dumped
2022-04-29 16:33:17,291 - train.py - INFO - best labels_accuracy till now:0.8282670378684998
2022-04-29 16:33:17,292 - train.py - INFO - epoch: 2
2022-04-29 16:33:28,897 - train.py - INFO - Model is dumped
2022-04-29 16:33:28,898 - train.py - INFO - epoch: 2 checkpoint saved to ./exps/seq2edit_lang8/epoch2.th
2022-04-29 17:41:54,118 - train.py - INFO - The accuracy of predicting for edit labels is: 0.8282248973846436
2022-04-29 17:41:54,119 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.16317373514175415
2022-04-29 17:42:04,435 - train.py - INFO - Model is dumped
2022-04-29 17:42:04,436 - train.py - INFO - best labels_accuracy till now:0.8282670378684998
2022-04-29 17:42:04,436 - train.py - INFO - epoch: 3
2022-04-29 17:42:14,426 - train.py - INFO - Model is dumped
2022-04-29 17:42:14,426 - train.py - INFO - epoch: 3 checkpoint saved to ./exps/seq2edit_lang8/epoch3.th
2022-04-29 19:22:24,987 - train.py - INFO - The accuracy of predicting for edit labels is: 0.8274760246276855
2022-04-29 19:22:24,990 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.17834582924842834
2022-04-29 19:22:38,266 - train.py - INFO - Model is dumped
2022-04-29 19:22:38,266 - train.py - INFO - best labels_accuracy till now:0.8282670378684998
2022-04-29 19:22:38,266 - train.py - INFO - epoch: 4
2022-04-29 19:22:51,440 - train.py - INFO - Model is dumped
2022-04-29 19:22:51,440 - train.py - INFO - epoch: 4 checkpoint saved to ./exps/seq2edit_lang8/epoch4.th
2022-04-29 22:17:08,472 - train.py - INFO - The accuracy of predicting for edit labels is: 0.827201783657074
2022-04-29 22:17:08,689 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.18666981160640717
2022-04-29 22:17:21,019 - train.py - INFO - Model is dumped
2022-04-29 22:17:21,019 - train.py - INFO - best labels_accuracy till now:0.8282670378684998
2022-04-29 22:17:21,019 - train.py - INFO - epoch: 5
2022-04-29 22:17:33,877 - train.py - INFO - Model is dumped
2022-04-29 22:17:33,877 - train.py - INFO - epoch: 5 checkpoint saved to ./exps/seq2edit_lang8/epoch5.th

Can you suggest why the reproduced result is about one point lower?

Thank you very much for your time and effort:)

preprocess data

On a machine with 128 GB of RAM, using my own constructed data (train.src.char and train.tgt.char are each about 21 GB), running preprocess_data.py fails with "Cannot allocate memory".

What script generates a gold.m2 with multiple reference answers? The file I generate myself keeps having problems, with large blocks of blank lines.

INPUT_FILE=./data/input_grammar.txt
REF_FILE=./data/ref_grammar.txt
REF_PARA_FILE=./data/gram.ref.para
REF_M2_FILE=./data/gram.ref.m2.char

Step1. extract edits from hypothesis file.

paste $INPUT_FILE $REF_FILE | awk '{print NR"\t"$p}' > $REF_PARA_FILE # only for single hypothesis situation

python parallel_to_m2.py -f $REF_PARA_FILE -o $REF_M2_FILE -g char # char-level evaluation

Question about the prediction length limit in seq2seq predict

In seq2seq predict.py, max_len and padding are already set when tokenizing; why is there an additional check below for whether the length exceeds 100? Is there a special case?

A question about preprocess_data

Hello,
src_sent : 而且世界有很多病的话
tgt_sent : 而且世界有很多病的话

The train_label produced by preprocess_data.py is:
$STARTSEPL|||SEPR$KEEP 而SEPL|||SEPR$KEEP 且SEPL|||SEPR$KEEP 在SEPL|||SEPR$DELETE 世SEPL|||SEPR$KEEP 界SEPL|||SEPR$APPEND_上 有SEPL|||SEPR$KEEP 很SEPL|||SEPR$KEEP 多SEPL|||SEPR$KEEP 病SEPL|||SEPR$KEEP 的SEPL|||SEPR$KEEP 话SEPL|||SEPR$REPLACE_?

My understanding is:
在-DELETE
上-APPEND

But the label says:
世-DELETE
上-APPEND

Am I misunderstanding something?

RuntimeError: CUDA out of memory.

When I run pipeline.sh and it reaches CUDA_VISIBLE_DEVICES=$CUDA_DEVICE python train.py --tune_bert 1 \ ..., the following error occurs just as training starts:

Data is loaded
load pretrained model
Model is set
Start training

epoch: 0
Traceback (most recent call last):
File "train.py", line 315, in
main(args)
File "train.py", line 205, in main
trainer.train()
File "/data/jydong/anaconda3/envs/SE/lib/python3.8/site-packages/allennlp/training/trainer.py", line 966, in train
return self._try_train()
File "/data/jydong/anaconda3/envs/SE/lib/python3.8/site-packages/allennlp/training/trainer.py", line 1001, in _try_train
train_metrics = self._train_epoch(epoch)
File "/data/jydong/anaconda3/envs/SE/lib/python3.8/site-packages/allennlp/training/trainer.py", line 716, in _train_epoch
batch_outputs = self.batch_outputs(batch, for_training=True)
File "/data/jydong/anaconda3/envs/SE/lib/python3.8/site-packages/allennlp/training/trainer.py", line 604, in batch_outputs
output_dict = self._pytorch_model(**batch)
File "/data/jydong/anaconda3/envs/SE/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/jydong/MuCGEC-main/models/seq2edit-based-CGEC/gector/seq2labels_model.py", line 129, in forward
ret_train = self.decode(encoded_text, batch_size, sequence_length, mask, labels, d_tags, metadata)
File "/data/jydong/MuCGEC-main/models/seq2edit-based-CGEC/gector/seq2labels_model.py", line 175, in decode
class_probabilities_labels = F.softmax(logits_labels, dim=-1).view(
File "/data/jydong/anaconda3/envs/SE/lib/python3.8/site-packages/torch/nn/functional.py", line 1512, in softmax
ret = input.softmax(dim)
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 10.76 GiB total capacity; 9.52 GiB already allocated; 51.44 MiB free; 9.71 GiB reserved in total by PyTorch)

I saw in earlier issues that multi-GPU training is not supported. Besides switching to a GPU with more memory, is there any other way to get this model to run normally?

Demo API

Which of the following strategies does the "grammatical error correction (native speaker)" endpoint of the demo at http://139.224.234.18:5002/ use?
1) calling the seq2edit model directly; 2) calling the seq2seq model directly; 3) merging seq2edit + seq2seq; 4) seq2edit + seq2seq + rules

Can training be continued?

Hello! Is it possible to continue training your released models on my own domain-specific data?
