miductc-competition's Introduction

Intelligent Text Proofreading Competition (文本智能校对大赛)

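Latest news

Date        Event
2022.7.19   Fixed a bug in the metric calculation; see metric.py. Thanks to @HillZhang1999 for the report and the contribution.
2022.7.21   Updated the baseline's results on the leaderboard A dataset.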

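Schedule

Date        Event
2022.7.13   Competition launches: registration, the competition website, the preliminary leaderboard A dataset, and the leaderboard A submission portal open.
2022.8.12   Registration closes; the preliminary leaderboard A evaluation portal closes.
2022.8.13   The preliminary leaderboard B dataset and evaluation portal open.
2022.8.17   The preliminary leaderboard B dataset and evaluation portal close.
2022.8.18   The final-round dataset and evaluation portal open.
2022.8.20   The final-round dataset and evaluation portal close.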

Winning teams

Rank   Team                                                                      Score
1      苏州大学-阿里达摩院联队 (Soochow University-Alibaba DAMO Academy joint team)   0.7637
2      Grandmaly                                                                 0.7338
3      语言组小分队                                                                0.6916
4      YanSun的团队 (YanSun's team)                                               0.6779
5      TAL-有错必改                                                               0.6528
6      NLPIR                                                                     0.6425

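Task description

The task takes web text as input and asks participants to detect and correct the errors in it, building a Chinese text proofreading system. Given a passage of text, the system detects the erroneous characters and words, identifies the error types, corrects them, and outputs the corrected text.

Text proofreading is also called text error correction. For background, see two related natural language processing tasks: Grammatical Error Correction (GEC) and Chinese Spelling Check (CSC).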
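To make the input/output relationship of the task concrete, the sketch below derives character-level edits from a source sentence and its corrected version using Python's standard `difflib`. The function name `extract_edits`, the dictionary keys, and the example sentence are illustrative assumptions; the competition's actual data and submission formats are not described here and may differ.

```python
from difflib import SequenceMatcher


def extract_edits(source: str, corrected: str) -> list:
    """Return the character-level edits that turn `source` into `corrected`.

    Hypothetical helper for illustration; not part of the baseline code.
    """
    edits = []
    matcher = SequenceMatcher(a=source, b=corrected, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged span, nothing to report
        edits.append({
            "type": op,                   # "replace", "delete", or "insert"
            "start": i1,                  # span in the source text
            "end": i2,
            "wrong": source[i1:i2],       # text found in the source
            "right": corrected[j1:j2],    # text in the corrected version
        })
    return edits


# Illustrative sentence pair (not taken from the competition data).
print(extract_edits("我喜欢吃平果", "我喜欢吃苹果"))
```

For this pair, the only edit reported is a replacement of 平 with 苹 at source position 4.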

Baseline overview

Model

GECToR is provided as the baseline model. See the GECToR paper and the GECToR source code for reference.
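GECToR treats correction as sequence tagging: each input token receives an edit tag, and applying the tags yields the corrected text. The sketch below shows that idea with the basic tags from the GECToR paper (`$KEEP`, `$DELETE`, `$REPLACE_x`, `$APPEND_x`). The function `apply_tags` is hypothetical, and the exact label strings used by this baseline are an assumption, since they are not stated here.

```python
def apply_tags(tokens: list, tags: list) -> list:
    """Apply one GECToR-style edit tag per token and return the new tokens.

    Illustrative only: the tag strings are assumed, not read from the baseline.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    output = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            output.append(token)                       # leave the token unchanged
        elif tag == "$DELETE":
            continue                                   # drop the token
        elif tag.startswith("$REPLACE_"):
            output.append(tag[len("$REPLACE_"):])      # substitute a new token
        elif tag.startswith("$APPEND_"):
            output.append(token)                       # keep the token ...
            output.append(tag[len("$APPEND_"):])       # ... and insert one after it
        else:
            raise ValueError(f"unknown tag: {tag}")
    return output


tokens = list("我喜欢吃平果")
tags = ["$KEEP"] * 4 + ["$REPLACE_苹", "$KEEP"]
print("".join(apply_tags(tokens, tags)))  # 我喜欢吃苹果
```

Because every token gets exactly one tag, a model of this kind predicts all edits in a single pass, rather than generating the corrected sentence token by token.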

Code structure

├── command
│   └── train.sh       # training script
├── data
├── logs
├── pretrained_model
└── src
    ├── __init__.py
    ├── baseline       # baseline system
    ├── corrector.py   # text proofreading entry point
    ├── evaluate.py    # metric evaluation
    ├── metric.py      # metric calculation
    ├── prepare_for_upload.py  # generates the result file for submission
    └── train.py       # training entry point

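Usage

  • Dataset: download the dataset from the competition website.
  • A baseline proofreading system is provided. For the baseline's training parameters, see src/baseline/trainer.py.
  • The baseline supports BERT-style pretrained models, which can be downloaded from HuggingFace, for example chinese-roberta-wwm-ext.
  • The baseline is for reference only. Teams may build on it or use a different approach entirely.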
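One possible way to fetch a BERT-style checkpoint into the `pretrained_model` directory is sketched below with the Hugging Face `transformers` library. The Hub identifier `hfl/chinese-roberta-wwm-ext`, the helper names `local_model_dir` and `download_pretrained`, and the assumption that the baseline loads a directory saved by `save_pretrained` are not taken from this README.

```python
from pathlib import Path


def local_model_dir(repo_id: str, root: str = "pretrained_model") -> Path:
    """Map a Hub id such as 'org/name' to '<root>/name' (hypothetical layout)."""
    return Path(root) / repo_id.split("/")[-1]


def download_pretrained(repo_id: str = "hfl/chinese-roberta-wwm-ext",
                        root: str = "pretrained_model") -> Path:
    """Download the tokenizer and weights, then save them under `root`.

    Requires the `transformers` package and network access.
    """
    # Imported lazily so that the path helper works without transformers.
    from transformers import AutoModel, AutoTokenizer

    target = local_model_dir(repo_id, root)
    target.mkdir(parents=True, exist_ok=True)
    AutoTokenizer.from_pretrained(repo_id).save_pretrained(target)
    AutoModel.from_pretrained(repo_id).save_pretrained(target)
    return target


if __name__ == "__main__":
    print(download_pretrained())
```

Under this assumed layout, the saved directory matches the `--in_model_dir "pretrained_model/chinese-roberta-wwm-ext"` value used in the training command.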

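Baseline performance

  • Setup: the baseline was trained on the leaderboard A training set (excluding preliminary_extend_train.json) with distributed training on a single machine with 4 GPUs.
  • Result: the checkpoint from the end of the 4th epoch scored about 0.3587 when submitted to leaderboard A.

The training parameters were as follows: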

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m src.train \
--in_model_dir "pretrained_model/chinese-roberta-wwm-ext" \
--out_model_dir "model/ctc_train" \
--epochs "50" \
--batch_size "158" \
--max_seq_len "128" \
--learning_rate "5e-5" \
--train_fp "data/comp_data/preliminary_a_data/preliminary_train.json" \
--test_fp "data/comp_data/preliminary_a_data/preliminary_val.json" \
--random_seed_num "42" \
--check_val_every_n_epoch "0.5" \
--early_stop_times "20" \
--warmup_steps "-1" \
--dev_data_ratio "0.01" \
--training_mode "ddp" \
--amp true \
--freeze_embedding false

Start training

cd command && sh train.sh

Other public datasets

Related resources

Contributors

bitallin
