hillzhang1999 / mucgec

Open-source release of the MuCGEC Chinese grammatical error correction dataset and SOTA text correction models; Code & Data for our NAACL 2022 Paper "MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction"

Home Page: https://aclanthology.org/2022.naacl-main.227/

License: Apache License 2.0

Python 94.17% Shell 3.56% Macaulay2 2.27%
dataset gec generation naacl grammatical-error-correction

mucgec's Introduction

MuCGEC: A Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction & SOTA Models

English | 简体中文

Latest News

  • 2023.5.26 Our latest work, NaSGEC, has been accepted to ACL 2023. In this paper we present a multi-domain Chinese native-speaker error correction dataset, together with customized correction models targeting complex ill-formed sentences in social media and academic writing. You are welcome to try it out! Link: [Link]

  • 2023.1.12 We released two new SOTA correction models (based on BART) on Alibaba's ModelScope, one for the general domain and one for the legal domain, with one-click inference and an interactive demo. You are welcome to try them: General domain / Legal domain

  • 2022.10.18 Our latest work, SynGEC, has been accepted to EMNLP 2022. In this paper we propose SynGEC, which incorporates GEC-adapted syntax and achieves F0.5 scores of 45.32/46.51 on NLPCC-18 and MuCGEC-Test. You are welcome to try it out! Link: [Link]

  • 2022.8.29 We uploaded the MuCGEC dataset (including development-set answers and test-set inputs) [Link] and opened a long-term test-set leaderboard on the Tianchi platform [Link]. Submissions are welcome; see [Link] for how to submit.

  • 2022.6.23 We open-sourced the semantic error correction templates collected during the CTC-2021 evaluation; see: [GitHub], [paper]

  • 2022.6.5 The MuCGEC dataset is available as Track 4 of the CCL2022-CLTC shared task on the Aliyun Tianchi platform. Everyone is welcome to use it and compete on the leaderboard!

Citation

If you find our work helpful, please cite our paper:

MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction (accepted to the NAACL 2022 main conference) [PDF]

@inproceedings{zhang-etal-2022-mucgec,
    title = "{M}u{CGEC}: a Multi-Reference Multi-Source Evaluation Dataset for {C}hinese Grammatical Error Correction",
    author = "Zhang, Yue  and
      Li, Zhenghua  and
      Bao, Zuyi  and
      Li, Jiacheng  and
      Zhang, Bo  and
      Li, Chen  and
      Huang, Fei  and
      Zhang, Min",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.227",
    pages = "3118--3130",
    }

Introduction

Given a piece of Chinese text, Chinese Grammatical Error Correction (CGEC) aims to automatically correct all kinds of errors in it, including spelling, lexical, grammatical, and semantic errors. The technology has broad applications in education, news, communication, and even search.

Existing CGEC evaluation sets suffer from small size, few references, and a single domain. To enable more reliable model evaluation, this repository provides MuCGEC, a high-quality multi-reference CGEC evaluation dataset. In addition, to advance the CGEC field, we also provide the following resources:

  • Annotation guidelines for Chinese GEC data ./guidelines: we define a detailed taxonomy of common Chinese grammatical error types and, for each type, provide correction strategies and rich examples, which we hope will facilitate research on CGEC data annotation.
  • Evaluation tools for Chinese GEC ./scorers
    • ChERRANT: we adapted ERRANT, the widely used English evaluation tool supporting fine-grained error types, to Chinese and named it ChERRANT (Chinese ERRANT). ChERRANT supports both character-level and word-level evaluation. Character-level ChERRANT metrics are the primary metrics for the MuCGEC dataset, alleviating the inaccurate evaluation caused by Chinese word-segmentation errors. Word-level evaluation supports finer error types (e.g., spelling errors, noun errors, verb errors), helping researchers analyze models in more detail.
  • Baseline models for Chinese GEC ./models
    • Seq2Edit model ./models/seq2edit-based-CGEC: designs edit labels (e.g., substitute, delete, insert, reorder) and treats GEC as a sequence labeling task.
      • We modified GECToR, the SOTA English Seq2Edit model, to support Chinese.
    • Seq2Seq model ./models/seq2seq-based-CGEC: treats GEC as translation from erroneous sentences to correct ones, solved with strong neural machine translation models.
      • We fine-tune the large-scale pretrained Seq2Seq language model Chinese BART for the GEC task.
    • Model ensemble ./scorers/ChERRANT/ensemble.sh: we provide a simple edit-based ensemble method that supports combining heterogeneous models (e.g., Seq2Seq and Seq2Edit).
  • Common tools for Chinese GEC ./tools
    • Word segmentation tools
    • Data augmentation (Todo)
    • Data cleaning (Todo)

The MuCGEC Dataset

Our data mainly comes from Chinese-as-a-second-language learners and is sampled from the following sources: the NLPCC18 test set (from the NLPCC18 Shared Task 2), the CGED test set (from the CGED18&20 shared tasks), and the Chinese Lang8 training set (from the NLPCC18 Shared Task 2). We sampled 2,000-3,000 sentences from each of the three sources; each sentence was annotated by three randomly assigned annotators and then reviewed by an expert to build the test set. Overall statistics are shown in the table below.

Dataset #Sentences #Erroneous (ratio) Avg. chars Avg. edits Avg. references
MuCGEC-NLPCC18 1996 1904 (95.4%) 29.7 2.5 2.5
MuCGEC-CGED 3125 2988 (95.6%) 44.8 4.0 2.3
MuCGEC-Lang8 1942 1652 (85.1%) 37.5 2.8 2.1
MuCGEC-ALL 7063 6544 (92.7%) 38.5 3.2 2.3

Compared with previous CGEC evaluation sets (e.g., NLPCC18 and CGED), MuCGEC offers richer references and more diverse data sources. In addition, during annotation we found 74 sentences that could not be annotated due to unclear meaning or similar problems.

For more details about the MuCGEC dataset, please refer to our paper.

Data Download

The MuCGEC development set is publicly available; the test set is evaluated through an online leaderboard. Please see https://tianchi.aliyun.com/dataset/dataDetail?dataId=131328 for access.

CGEC Benchmark Models

Environment Setup

We use Python 3.8 for our experiments. The necessary dependencies can be installed with the commands below; since the Seq2Edit and Seq2Seq environments have some conflicting requirements, install them as two separate environments:

# Seq2Edit model
pip install -r requirements_seq2edit.txt

# Seq2Seq model
pip install -r requirements_seq2seq.txt

Training Data

Our training data consists of the erroneous sentences from the Lang8 dataset (from the language-learning website Lang-8) and the HSK dataset (a Chinese proficiency exam corpus developed by Beijing Language and Culture University). We upsample the HSK data 5 times and filter out sentences that overlap with our test sets, resulting in about 1.5 million sentence pairs.

Download: Google Drive
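As a rough sketch of the data preparation described above (assuming parallel files lang8.src/lang8.tgt and hsk.src/hsk.tgt after filtering out test-set overlaps; the actual file names in the download may differ), upsampling HSK 5x and concatenating could look like:

# hypothetical file names; adapt to the downloaded archive
for i in 1 2 3 4 5; do cat hsk.src; done > hsk_x5.src
for i in 1 2 3 4 5; do cat hsk.tgt; done > hsk_x5.tgt
cat lang8.src hsk_x5.src > train.src   # ~1.5M sentence pairs in total
cat lang8.tgt hsk_x5.tgt > train.tgt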

Model Usage

We provide pipeline scripts covering preprocessing, training, and inference; see ./models/seq2edit-based-CGEC/pipeline.sh and ./models/seq2seq-based-CGEC/pipeline.sh. A minimal invocation sketch is shown below.
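A minimal invocation sketch (assuming the dependencies above are installed and the training data has been placed where the scripts expect it; paths, GPU ids, and hyperparameters are configured inside each pipeline.sh):

# Seq2Edit: preprocessing -> two-stage training -> inference
(cd models/seq2edit-based-CGEC && sh pipeline.sh)

# Seq2Seq: preprocessing -> training -> inference
(cd models/seq2seq-based-CGEC && sh pipeline.sh)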

We also provide trained checkpoints for testing (all metrics below are Precision/Recall/F0.5):

Model NLPCC18-Official (M2Scorer) MuCGEC (ChERRANT)
seq2seq_lang8 [Link] 37.78/29.91/35.89 40.44/26.71/36.67
seq2seq_lang8+hsk [Link] 41.50/32.87/39.43 44.02/28.51/39.70
seq2edit_lang8 [Link] 37.43/26.29/34.50 38.08/22.90/33.62
seq2edit_lang8+hsk [Link] 43.12/30.18/39.72 44.65/27.32/39.62

After downloading, unzip the seq2seq checkpoints into ./models/seq2seq-based-CGEC/exps and the seq2edit checkpoints into ./models/seq2edit-based-CGEC/exps. The seq2seq models are based on the Chinese-BART-Large pretrained language model, and the seq2edit models on StructBERT-Large.
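For example (a sketch; the actual archive names depend on the download links in the table above):

# hypothetical archive names; replace with the files you downloaded
unzip seq2seq_lang8+hsk.zip  -d ./models/seq2seq-based-CGEC/exps/
unzip seq2edit_lang8+hsk.zip -d ./models/seq2edit-based-CGEC/exps/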

For the model ensemble strategy used in our paper, see ./scorers/ChERRANT/ensemble.sh.
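Judging from the positional arguments the script reads (three Seq2Edit outputs followed by three Seq2Seq outputs, each from a different random seed), an invocation might look like the following; the result file names are placeholders:

cd scorers/ChERRANT
sh ensemble.sh seq2edit_seed1.out seq2edit_seed2.out seq2edit_seed3.out \
               seq2seq_seed1.out seq2seq_seed2.out seq2seq_seed3.out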

Tips

  • We find that several tricks that help in English also work for Chinese, such as GECToR's additional-confidence trick and Seq2Seq R2L reranking. If you need higher performance, these tricks are worth trying.
  • We find that two-stage training (first Lang8+HSK, then HSK alone) further improves over single-stage training; if interested, you can retrain the models with this two-stage strategy.
  • We find that the Chinese-BART-based Seq2Seq model still has room for improvement: 1) the original Chinese BART vocabulary lacks some common Chinese punctuation marks/characters; 2) training and inference with the transformers library are relatively slow and memory-hungry. We recently re-implemented the BART-based Seq2Seq model on fairseq with some additional training tricks, which greatly improves results (by 4-5 F0.5 points) and makes training/inference 3-4 times faster. This work will be cleaned up and open-sourced later.
  • The baselines we currently provide use only public training data. For data augmentation techniques, see our earlier solution for the CTC2021 competition [Link]; well-constructed synthetic data can boost model performance substantially.

Model Evaluation

For the official NLPCC18 dataset, run our benchmark models to produce predictions and then compute metrics with the official NLPCC18 tool M2Scorer. Note that the predictions must be word-segmented with the PKUNLP tool.

For the MuCGEC metrics, use our ChERRANT tool; see ./scorers/ChERRANT/demo.sh for usage examples. The character-level metrics partially follow the ERRANT_zh repository, while the word-level metrics and error-type taxonomy follow the original ERRANT.
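A character-level evaluation sketch with ChERRANT (demo.sh is the authoritative reference; source.txt, hyp.txt, and gold.m2 below are placeholder file names, and the comparison script name should be checked against the repository):

cd scorers/ChERRANT
# 1) pair each source sentence with its hypothesis and number the lines
paste source.txt hyp.txt | awk '{print NR"\t"$p}' > hyp.para
# 2) extract character-level edits from the hypothesis
python parallel_to_m2.py -f hyp.para -o hyp.m2 -g char
# 3) compare hypothesis edits against the multi-reference gold edits to get P/R/F0.5
python compare_m2_for_evaluation.py -hyp hyp.m2 -ref gold.m2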

Error Types

  • Operation level (character/word granularity):

    • M (missing): a missing character/word must be inserted
    • R (redundant): a redundant character/word must be deleted
    • S (substitute): an incorrect character/word must be replaced
    • W (word-order): words must be reordered
  • Linguistic level (word granularity only):

    • We define 14 main linguistic error types (largely based on part of speech). Apart from spelling errors (SPELL) and word-order errors (W), they can be further combined with the operation-level labels (substitution, missing, redundant); e.g., a redundant adjective error is labeled R:ADJ (an illustrative edit is sketched after the figure placeholder below).

[Figure: error types]
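To make the combined labels concrete, a hand-constructed (hypothetical) word-level edit in M2-style notation might look as follows, marking the second "好" as a redundant adjective (R:ADJ) to be deleted:

S 他 是 一 个 很 好 好 的 人
A 6 7|||R:ADJ|||-NONE-|||REQUIRED|||-NONE-|||0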

Related Work

  • We used some of the techniques in this repository in the CTC2021 shared task and ranked Top 1; the technical report is available here: CTC-report
  • An online demo of our baseline models is available: GEC demo (access from outside the campus network may be slow; please be patient).
  • YACLC, a Chinese learner corpus: YACLC
  • NLPCC18 Chinese GEC dataset: NLPCC18

Contact

If you run into any problems while using our dataset or code, feel free to contact [email protected]

mucgec's People

Contributors

haochenjiang2000, hillzhang1999, ymliucs


mucgec's Issues

Parameter settings for the seq2edit model

Hi, when I use your model for correction it turns "三四季度" (Q3/Q4) into "三四月份" (March/April).
Can I fix this by adjusting parameters?
parser.add_argument('--min_probability',
                    type=float,
                    default=0)  # minimum token-level correction threshold
Would increasing this help?
Also,
parser.add_argument('--additional_confidence',
                    type=float,
                    help='How many probability to add to $KEEP token',
                    default=0.0)  # extra confidence added to the KEEP label
What does this parameter do?

Which dataset were the official baseline metrics computed on?

Regarding "We also provide trained checkpoints for testing (all metrics are Precision/Recall/F0.5)": which dataset were those numbers computed on?

data
├── MuCGEC
│   ├── example_pred_dev.txt
│   ├── filter_sentences.txt
│   ├── MuCGEC_dev.txt
│   ├── MuCGEC_test.txt
│   ├── README.en.md
│   └── README.md
├── mucgec_A
│   ├── example_pred_dev.txt
│   ├── filter_sentences.txt
│   ├── mucgec_A.zip
│   ├── MuCGEC_dev.txt
│   ├── MuCGEC_test.txt
│   └── README.md
├── mucgec_B
│   ├── example_pred_dev.txt
│   ├── filter_sentences.txt
│   ├── MuCGEC_dev.txt
│   ├── MuCGEC_test.txt
│   └── README.md
├── README.md
└── utils.py

Vocab file for the seq2edit model

Hi, I want to run inference directly with your seq2edit model.
predict.py in the seq2edit pipeline uses VOCAB_PATH=./data/output_vocabulary_chinese_char_hsk+lang8_5,
but I cannot find this file. Could you point me to it?

Cannot open Tianchi files

Hi. :)

I just downloaded the MuCGEC data files from Tianchi but cannot open them because they are password protected.

Can you help?

labels_accuracy_except_keep drops to 0 during training

Hi, during training the metric labels_accuracy_except_keep: 0.0000 gradually drops to 0 as training proceeds. What could be causing this?
Stage 1 is fine; it only happens gradually in stage 2.

Question about re-implementing the BART Seq2Seq model on fairseq

Hi, I am curious about the statement in the README that "we recently re-implemented the BART-based Seq2Seq model on fairseq".
Could you briefly describe how fairseq significantly speeds up inference, or give a few keywords as pointers?
Looking forward to this part being open-sourced 🎉

can't load tokenizer

Hi, I got this error when training with the pipeline. How do I obtain this file, or is my path wrong? Could you please give a concrete answer? Thanks.
Traceback (most recent call last):
  File "predict.py", line 19, in <module>
    tokenizer = BertTokenizer.from_pretrained(args.model_path)
  File ".../.local/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1693, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for './exps/seq2seq_lang8'. Make sure that:

  • './exps/seq2seq_lang8' is a correct model identifier listed on 'https://huggingface.co/models'
  • or './exps/seq2seq_lang8' is the correct path to a directory containing relevant tokenizer files

Does ChERRANT strictly require OpenCC==1.1.2?

Environment: Mac-arm64, new conda virtual env.

pip install OpenCC==1.1.2
ERROR: Could not find a version that satisfies the requirement OpenCC==1.1.2 (from versions: 0.1, 0.2, 1.1.0, 1.1.0.post1, 1.1.1, 1.1.3, 1.1.4)
ERROR: No matching distribution found for OpenCC==1.1.2

Tried both the Tuna mirror and the pypi.org/simple channel; installation failed either way.

After installing OpenCC 1.1.3 and running python parallel_to_m2.py, importing the LTP package fails:

  File "/Users/peter42/Documents/github/MuCGEC/scorers/ChERRANT/modules/tokenizer.py", line 1, in <module>
    from ltp import LTP
  File "/Users/peter42/opt/miniconda3/envs/pdenv/lib/python3.9/site-packages/ltp/__init__.py", line 8, in <module>
    from . import nn, utils
  File "/Users/peter42/opt/miniconda3/envs/pdenv/lib/python3.9/site-packages/ltp/utils/__init__.py", line 10, in <module>
    from .convertor import map2device, convert2npy
  File "/Users/peter42/opt/miniconda3/envs/pdenv/lib/python3.9/site-packages/ltp/utils/convertor.py", line 6, in <module>
    from torch._six import container_abcs
ImportError: cannot import name 'container_abcs' from 'torch._six' (/Users/peter42/opt/miniconda3/envs/pdenv/lib/python3.9/site-packages/torch/_six.py)

Do ltp==4.1.3.post1 and OpenCC==1.1.2 have to match exactly?

MuCGEC_CGED_Dev.label file not found

A quick question:
when running sh pipeline.sh, the following error occurs. Where can I obtain the MuCGEC_CGED_Dev.label file?

FileNotFoundError: file ../../data/valid_data/MuCGEC_CGED_Dev.label not found

FileNotFoundError

While running pipeline.sh I hit the following problem:
FileNotFoundError: [Errno 2] No such file or directory: './exps/seq2edit_lang8/Temp_Model.th'

How can I fix this?

My own Seq2Edit re-implementation does not perform as well

Hi, I re-implemented a Seq2Edit model with a Transformer architecture (currently only single-round correction). During training I noticed that the accuracy on non-KEEP labels is quite low, even though I checked that my inputs, labels, and d_tags are identical to those in the original code. I suspect the issue lies in the training procedure; the original training loop is heavily wrapped by AllenNLP, and I worry that differences in training settings hurt performance. (On my own dataset the original code reaches 80%+, but my re-implementation only reaches around 30%.)

About the capability of the seq2edit model

Hi, thank you for open-sourcing such excellent work. While studying it, I ran into the following issue:

the seq2edit model does not seem to handle sentences missing a long object well (could this be because the tokenizer works at the character level?). A few examples:

source: 我们要尽一切力量使我们农业走上机械化集体化
target: 我们要尽一切力量使我们农业走上机械化集体化轨轨

source: 近两年来他们在全县推广了马河大队坚持的科学种田
target: 近两年来他们在全县推广了马河大队坚持实的科学种田方

source: 开展走访慰问工会系统部分老党员老干部
target: 开展走访慰问工会系统部分老党员老干部活

From these results the model seems to roughly know where the problem is, but its corrections are poor. I tried adjusting iteration_count, min_error_probability, and additional_confidence, but none of them helped.

Is this a limitation of the model itself, or does it simply need a large amount of similar training data?

Looking forward to your reply.

Baseline model behind the online demo

Dear Zhang Yue,
when I test with your released seq2edit_lang8+hsk model, some inputs are not corrected properly, but the online demo corrects them. Which baseline model does the online demo use?

A question about the training process

Hi, I have been reproducing your code recently and have a question.
The GECToR paper says that stage 1 and stage 2 use different training data: stage 1 uses a large amount of synthetic data, while stage 2 uses a small amount of non-synthetic data.
In your code, stage 1 and stage 2 appear to use the same data. Have you tried training in the way described in the GECToR paper, and would it give better results?

What is the format of the result files for ensembling?

# Three Seq2Edit models, trained with 3 different random seeds
RESULT_FILE1=$1
RESULT_FILE2=$2
RESULT_FILE3=$3

# Three Seq2Seq models, trained with 3 different random seeds
RESULT_FILE4=$4
RESULT_FILE5=$5
RESULT_FILE6=$6

Is each result file one sentence per line, with tokens separated by spaces?

seq2seq environment issue

Hi~ For some reason I run into an environment configuration problem when running the seq2seq pipeline.sh:
Traceback (most recent call last):
  File "train.py", line 11, in <module>
    import transformers
  File "/data/jydong/anaconda3/envs/SQ/lib/python3.8/site-packages/transformers/__init__.py", line 43, in <module>
    from . import dependency_versions_check
  File "/data/jydong/anaconda3/envs/SQ/lib/python3.8/site-packages/transformers/dependency_versions_check.py", line 41, in <module>
    require_version_core(deps[pkg])
  File "/data/jydong/anaconda3/envs/SQ/lib/python3.8/site-packages/transformers/utils/versions.py", line 94, in require_version_core
    return require_version(requirement, hint)
  File "/data/jydong/anaconda3/envs/SQ/lib/python3.8/site-packages/transformers/utils/versions.py", line 85, in require_version
    if want_ver is not None and not ops[op](version.parse(got_ver), version.parse(want_ver)):
  File "/data/jydong/anaconda3/envs/SQ/lib/python3.8/site-packages/packaging/version.py", line 52, in parse
    return Version(version)
  File "/data/jydong/anaconda3/envs/SQ/lib/python3.8/site-packages/packaging/version.py", line 197, in __init__
    raise InvalidVersion(f"Invalid version: '{version}'")
packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'

I am not sure which package's version is causing this...

seq2edit

In the seq2edit prediction stage, how should min_probability, min_error_probability, and confidence be set to avoid false corrections?

seq2edit predict issue

Hi, following the prediction script in pipeline.sh, I tested with an input file containing just two lines of text, in the following format:
今天气真不戳啊。
我下午要去**人民很行取钱。

With the seq2edit_lang8 model, no corrections are made no matter what I input; with the seq2edit_lang8+hsk model, the correction result is always just a single Chinese closing double quotation mark ".
What might the problem be?

FileNotFoundError: [Errno 2] No such file or directory: './ensemble_results/3seq2edit_3seq2seq_threshold_4/MuCGEC_test.m2_temp'

The files at these paths cannot be found; how should I fix this?
sh .\ensemble.sh
usage: rule_ensemble.py [-h] --result_path RESULT_PATH [RESULT_PATH ...] --output_path OUTPUT_PATH [-T THRESHOLD]
rule_ensemble.py: error: argument --result_path: expected at least one argument
Traceback (most recent call last):
  File "F:\dasixia\MuCGEC\scorers\ChERRANT\m2convertor.py", line 103, in <module>
    main()
  File "F:\dasixia\MuCGEC\scorers\ChERRANT\m2convertor.py", line 89, in main
    for src_sent, edit_lines in read_file():
  File "F:\dasixia\MuCGEC\scorers\ChERRANT\m2convertor.py", line 73, in read_file
    with open(args.f, "r", encoding="utf8") as fr:
FileNotFoundError: [Errno 2] No such file or directory: './ensemble_results/3seq2edit_3seq2seq_threshold_4/MuCGEC_test.m2_temp'

FileNotFoundError: [Errno 2] No such file or directory: '../../data/train_data/lang8+hsk/train.src_only_erroneous'

sh .\pipeline.sh
Traceback (most recent call last):
  File "F:\dasixia\MuCGEC\models\seq2seq-based-CGEC\utils.py", line 211, in <module>
    convert_parallel_data_to_json_file(args[0], args[1], args[2])
  File "F:\dasixia\MuCGEC\models\seq2seq-based-CGEC\utils.py", line 27, in convert_parallel_data_to_json_file
    with open(source_data_file, "r", encoding='utf-8') as f1:
FileNotFoundError: [Errno 2] No such file or directory: '../../data/train_data/lang8+hsk/train.src_only_erroneous'
Traceback (most recent call last):
  File "F:\dasixia\MuCGEC\models\seq2seq-based-CGEC\utils.py", line 211, in <module>
    convert_parallel_data_to_json_file(args[0], args[1], args[2])
  File "F:\dasixia\MuCGEC\models\seq2seq-based-CGEC\utils.py", line 27, in convert_parallel_data_to_json_file
    with open(source_data_file, "r", encoding='utf-8') as f1:
FileNotFoundError: [Errno 2] No such file or directory: '../../data/valid_data/MuCGEC_CGED_Dev.src'
Traceback (most recent call last):
  File "F:\dasixia\MuCGEC\models\seq2seq-based-CGEC\train.py", line 11, in <module>
    import transformers
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\__init__.py", line 43, in <module>
    from . import dependency_versions_check
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\dependency_versions_check.py", line 41, in <module>
    require_version_core(deps[pkg])
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\utils\versions.py", line 94, in require_version_core
    return require_version(requirement, hint)
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\utils\versions.py", line 85, in require_version
    if want_ver is not None and not ops[op](version.parse(got_ver), version.parse(want_ver)):
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\packaging\version.py", line 52, in parse
    return Version(version)
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\packaging\version.py", line 197, in __init__
    raise InvalidVersion(f"Invalid version: '{version}'")
packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'
Generating...
Traceback (most recent call last):
  File "F:\dasixia\MuCGEC\models\seq2seq-based-CGEC\predict.py", line 7, in <module>
    from transformers import BartForConditionalGeneration, BertTokenizer
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\__init__.py", line 43, in <module>
    from . import dependency_versions_check
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\dependency_versions_check.py", line 41, in <module>
    require_version_core(deps[pkg])
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\utils\versions.py", line 94, in require_version_core
    return require_version(requirement, hint)
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\transformers\utils\versions.py", line 85, in require_version
    if want_ver is not None and not ops[op](version.parse(got_ver), version.parse(want_ver)):
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\packaging\version.py", line 52, in parse
    return Version(version)
  File "F:\dasixia\MuCGEC\venvSeq\lib\site-packages\packaging\version.py", line 197, in __init__
    raise InvalidVersion(f"Invalid version: '{version}'")
packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'
Generating Finish!
0 minutes and 0 seconds elapsed.

Question about the two checkpoints in the Seq2Edit pipeline

Hi, I am currently reproducing the two MuCGEC models. While running the Seq2Edit pipeline, I noticed that a single training run produces two checkpoints, Temp_Model.th and Best_Model_Stage_1.th (or Best_Model_Stage_2.th). What is the difference between the Temp Model and the Best Model checkpoints? And why does the second training stage in pipeline.sh load the former rather than the latter?

Are there any reference articles on how to improve inference speed?

Are there any reference articles on improving inference speed? I am using the model at https://modelscope.cn/models/damo/nlp_bart_text-error-correction_chinese/summary on four Tesla V100S GPUs; correcting one sentence takes 33.6 seconds and five sentences take 35.5 seconds. The data are as follows:
'''
沙尘暴也是一类空气污染之一。
所以一些人说,:读书一点用处都没有。
从学习写日常书信和专用书信的课中,我毕到了如何在不同场合使用不同语言的方式。
我学到了如和更简练,更书面地表达自己的意见。
我学到了为自己的过去进行歹思并鼓起勇气的方法。
'''
The test code is as follows:
'''
import time
from modelscope.pipelines import pipeline

def correctSentense(sentense, batch_size=1):
    t = time.time()
    p = pipeline('text-error-correction', 'damo/nlp_bart_text-error-correction_chinese')
    print(p(sentense, batch_size))
    print(f'cost:{time.time() - t:.8f}s')
'''
Thank you very much.

Issue on reproducing the Seq2edit model result on NLPCC-2018 test set

Hi authors,

Thank you very much for the great work:) I have tried to follow your code to reproduce the result on the NLPCC-2018 test set by fine-tuning the seq2edit model on the NLPCC_2018 train set only.

I first filtered duplicate sentences that appear in the NLPCC-2018 dataset and discarded correct sentences, following your description in "Training data" under Section 5. After filtering, I obtained 1,091,542 sentences.

After this, I used your pipeline.sh (https://github.com/HillZhang1999/MuCGEC/blob/main/models/seq2edit-based-CGEC/pipeline.sh) to train the seq2edit model.

After training and inference, I only obtained an F0.5 score of 33.57 (P: 37.78; R: 23.21).

Below are my training logs for cold start and non-cold start respectively:

Cold Start:

2022-04-29 13:17:09,787 - train.py - INFO - Data is loaded
2022-04-29 13:17:19,663 - train.py - INFO - Model is set
2022-04-29 13:17:19,671 - train.py - INFO - Start training
2022-04-29 13:17:19,672 - train.py - INFO - epoch: 0
2022-04-29 13:47:51,824 - train.py - INFO - The accuracy of predicting for edit labels is: 0.8213796019554138
2022-04-29 13:47:51,825 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.018478067591786385
2022-04-29 13:48:01,068 - train.py - INFO - Model is dumped
2022-04-29 13:48:01,068 - train.py - INFO - (best)Saving Model...
2022-04-29 13:48:11,674 - train.py - INFO - Model is dumped
2022-04-29 13:48:11,674 - train.py - INFO - best labels_accuracy till now:0.8213796019554138
2022-04-29 13:48:11,675 - train.py - INFO - epoch: 1
2022-04-29 13:48:22,458 - train.py - INFO - Model is dumped
2022-04-29 13:48:22,459 - train.py - INFO - epoch: 1 checkpoint saved to ./exps/seq2edit_lang8/epoch1.th
2022-04-29 14:15:59,206 - train.py - INFO - The accuracy of predicting for edit labels is: 0.8210737109184265
2022-04-29 14:15:59,861 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.018300961703062057
2022-04-29 14:16:11,182 - train.py - INFO - Model is dumped
2022-04-29 14:16:11,182 - train.py - INFO - best labels_accuracy till now:0.8213796019554138
2022-04-29 14:16:11,182 - train.py - INFO - epoch: 2
2022-04-29 14:16:23,102 - train.py - INFO - Model is dumped
2022-04-29 14:16:23,102 - train.py - INFO - epoch: 2 checkpoint saved to ./exps/seq2edit_lang8/epoch2.th

Non-Cold Start:

2022-04-29 14:16:31,549 - train.py - INFO - Data is loaded
2022-04-29 14:16:46,257 - train.py - INFO - load pretrained model
2022-04-29 14:16:46,587 - train.py - INFO - Model is set
2022-04-29 14:16:46,598 - train.py - INFO - Start training
2022-04-29 14:16:46,598 - train.py - INFO - epoch: 0
2022-04-29 15:24:24,749 - train.py - INFO - The accuracy of predicting for edit labels is: 0.8269697427749634
2022-04-29 15:24:24,749 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.13035008311271667
2022-04-29 15:24:34,925 - train.py - INFO - Model is dumped
2022-04-29 15:24:34,926 - train.py - INFO - (best)Saving Model...
2022-04-29 15:24:46,774 - train.py - INFO - Model is dumped
2022-04-29 15:24:46,775 - train.py - INFO - best labels_accuracy till now:0.8269697427749634
2022-04-29 15:24:46,775 - train.py - INFO - epoch: 1
2022-04-29 15:24:57,297 - train.py - INFO - Model is dumped
2022-04-29 15:24:57,297 - train.py - INFO - epoch: 1 checkpoint saved to ./exps/seq2edit_lang8/epoch1.th
2022-04-29 16:32:52,444 - train.py - INFO - The accuracy of predicting for edit labels is: 0.8282670378684998
2022-04-29 16:32:52,446 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.15213413536548615
2022-04-29 16:33:03,145 - train.py - INFO - Model is dumped
2022-04-29 16:33:03,145 - train.py - INFO - (best)Saving Model...
2022-04-29 16:33:17,291 - train.py - INFO - Model is dumped
2022-04-29 16:33:17,291 - train.py - INFO - best labels_accuracy till now:0.8282670378684998
2022-04-29 16:33:17,292 - train.py - INFO - epoch: 2
2022-04-29 16:33:28,897 - train.py - INFO - Model is dumped
2022-04-29 16:33:28,898 - train.py - INFO - epoch: 2 checkpoint saved to ./exps/seq2edit_lang8/epoch2.th
2022-04-29 17:41:54,118 - train.py - INFO - The accuracy of predicting for edit labels is: 0.8282248973846436
2022-04-29 17:41:54,119 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.16317373514175415
2022-04-29 17:42:04,435 - train.py - INFO - Model is dumped
2022-04-29 17:42:04,436 - train.py - INFO - best labels_accuracy till now:0.8282670378684998
2022-04-29 17:42:04,436 - train.py - INFO - epoch: 3
2022-04-29 17:42:14,426 - train.py - INFO - Model is dumped
2022-04-29 17:42:14,426 - train.py - INFO - epoch: 3 checkpoint saved to ./exps/seq2edit_lang8/epoch3.th
2022-04-29 19:22:24,987 - train.py - INFO - The accuracy of predicting for edit labels is: 0.8274760246276855
2022-04-29 19:22:24,990 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.17834582924842834
2022-04-29 19:22:38,266 - train.py - INFO - Model is dumped
2022-04-29 19:22:38,266 - train.py - INFO - best labels_accuracy till now:0.8282670378684998
2022-04-29 19:22:38,266 - train.py - INFO - epoch: 4
2022-04-29 19:22:51,440 - train.py - INFO - Model is dumped
2022-04-29 19:22:51,440 - train.py - INFO - epoch: 4 checkpoint saved to ./exps/seq2edit_lang8/epoch4.th
2022-04-29 22:17:08,472 - train.py - INFO - The accuracy of predicting for edit labels is: 0.827201783657074
2022-04-29 22:17:08,689 - train.py - INFO - The accuracy of predicting for edit labels except keep label is: 0.18666981160640717
2022-04-29 22:17:21,019 - train.py - INFO - Model is dumped
2022-04-29 22:17:21,019 - train.py - INFO - best labels_accuracy till now:0.8282670378684998
2022-04-29 22:17:21,019 - train.py - INFO - epoch: 5
2022-04-29 22:17:33,877 - train.py - INFO - Model is dumped
2022-04-29 22:17:33,877 - train.py - INFO - epoch: 5 checkpoint saved to ./exps/seq2edit_lang8/epoch5.th

Can you suggest why the reproduced result is about one point lower?

Thank you very much for your time and effort:)

preprocess data

On a machine with 128 GB of RAM, using my own constructed data (train.src.char and train.tgt.char are each about 21 GB), running preprocess_data.py fails with "Cannot allocate memory".

What script generates a gold.m2 with multiple reference answers? The file I generate myself keeps having problems, with large blocks of blank lines.

INPUT_FILE=./data/input_grammar.txt
REF_FILE=./data/ref_grammar.txt
REF_PARA_FILE=./data/gram.ref.para
REF_M2_FILE=./data/gram.ref.m2.char

Step1. extract edits from hypothesis file.

paste $INPUT_FILE $REF_FILE | awk '{print NR"\t"$p}' > $REF_PARA_FILE # only for single hypothesis situation

python parallel_to_m2.py -f $REF_PARA_FILE -o $REF_M2_FILE -g char # char-level evaluation

Question about the prediction length limit in seq2seq predict

In seq2seq predict.py, max_len and padding are already set when tokenizing; why is there an additional check below for whether the length exceeds 100? Is there a special case?

A question about preprocess_data

Hello,
src_sent : 而且世界有很多病的话
tgt_sent : 而且世界有很多病的话

The train_label produced by preprocess_data.py is:
$STARTSEPL|||SEPR$KEEP 而SEPL|||SEPR$KEEP 且SEPL|||SEPR$KEEP 在SEPL|||SEPR$DELETE 世SEPL|||SEPR$KEEP 界SEPL|||SEPR$APPEND_上 有SEPL|||SEPR$KEEP 很SEPL|||SEPR$KEEP 多SEPL|||SEPR$KEEP 病SEPL|||SEPR$KEEP 的SEPL|||SEPR$KEEP 话SEPL|||SEPR$REPLACE_?

My understanding is:
在-DELETE
上-APPEND

But the label says:
世-DELETE
上-APPEND

Am I misunderstanding something?

RuntimeError: CUDA out of memory.

When I run pipeline.sh and it reaches CUDA_VISIBLE_DEVICES=$CUDA_DEVICE python train.py --tune_bert 1 \ ..., the following error occurs just as training starts:

Data is loaded
load pretrained model
Model is set
Start training

epoch: 0
Traceback (most recent call last):
File "train.py", line 315, in
main(args)
File "train.py", line 205, in main
trainer.train()
File "/data/jydong/anaconda3/envs/SE/lib/python3.8/site-packages/allennlp/training/trainer.py", line 966, in train
return self._try_train()
File "/data/jydong/anaconda3/envs/SE/lib/python3.8/site-packages/allennlp/training/trainer.py", line 1001, in _try_train
train_metrics = self._train_epoch(epoch)
File "/data/jydong/anaconda3/envs/SE/lib/python3.8/site-packages/allennlp/training/trainer.py", line 716, in _train_epoch
batch_outputs = self.batch_outputs(batch, for_training=True)
File "/data/jydong/anaconda3/envs/SE/lib/python3.8/site-packages/allennlp/training/trainer.py", line 604, in batch_outputs
output_dict = self._pytorch_model(**batch)
File "/data/jydong/anaconda3/envs/SE/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/jydong/MuCGEC-main/models/seq2edit-based-CGEC/gector/seq2labels_model.py", line 129, in forward
ret_train = self.decode(encoded_text, batch_size, sequence_length, mask, labels, d_tags, metadata)
File "/data/jydong/MuCGEC-main/models/seq2edit-based-CGEC/gector/seq2labels_model.py", line 175, in decode
class_probabilities_labels = F.softmax(logits_labels, dim=-1).view(
File "/data/jydong/anaconda3/envs/SE/lib/python3.8/site-packages/torch/nn/functional.py", line 1512, in softmax
ret = input.softmax(dim)
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 10.76 GiB total capacity; 9.52 GiB already allocated; 51.44 MiB free; 9.71 GiB reserved in total by PyTorch)

I saw in earlier issues that multi-GPU training is not supported. Besides switching to a GPU with more memory, is there any other way to get this model to run normally?

Demo API

Which of the following strategies does the "grammatical error correction (native speaker)" endpoint of the demo at http://139.224.234.18:5002/ use?
1) calling the seq2edit model directly; 2) calling the seq2seq model directly; 3) merging seq2edit + seq2seq; 4) seq2edit + seq2seq + rules

Can training be continued?

Hello! Is it possible to continue training your released models on my own domain-specific data?
