fdchongli / TwoWaysToImproveCSC

This is the official code for the paper "Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models".

License: MIT License

Python 92.74% Shell 7.26%

twowaystoimprovecsc's People

Contributors

fdchongli
twowaystoimprovecsc's Issues

Some questions about the model results

[image: table of experiment results]

These are some results from my experiments; I am not sure whether they are reasonable.

I noticed one issue: a BERT model trained on the Wang et al. (2018) data and then tested directly on the SIGHAN15 test set performs very poorly (Table 1), whereas a SpellGCN model trained on the same Wang et al. (2018) dataset reaches about 60% correction F1 on the SIGHAN2015 test set.

Fine-tuning on the SIGHAN2015 training set does indeed reach 74% quickly.

Fine-tuning on the merged SIGHAN2015, SIGHAN2013, and SIGHAN2014 training sets still does not reach 74.7%.

Can I conclude that the model is actually overfitting? And on a CSC task with no in-domain labeled data at all, would the results be rather poor?

About the confusion set

I see that the confusion set used in SpellGCN looks like this:
[image: excerpt of the SpellGCN confusion set]

However, in the Bert/save/confusion.file file the confusion set has only 4922 entries. What processing was done here?
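
For illustration, a minimal sketch of how one might load and count the entries in that file (the line format of Bert/save/confusion.file is an assumption here, not taken from the repository):

# Hypothetical sketch: count the confusion-set entries, assuming each line
# looks like "字:候选字们" (a character, a colon, then its candidate characters).
def load_confusion(path):
    confusion = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            key, _, candidates = line.partition(":")
            confusion[key] = set(candidates)
    return confusion

conf = load_confusion("Bert/save/confusion.file")
print(len(conf), "characters have confusion candidates")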

Questions about the training details for the first-stage 271K corpus

Hello! Thank you for open-sourcing the code! I am trying to reproduce the first-stage training of Soft-Masked BERT from scratch, with the same hyperparameters as yours (lr=2e-5, batch_size=20), and I also randomly sampled 1000 sentences from the 271K corpus as the development set. However, my training converged at epoch 8: after that the validation loss starts to rise while the training loss keeps falling. When I then trained the saved epoch-8 model on SIGHAN2013, the results were very poor. So I would like to ask: at which epoch did your first-stage training converge?
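
For reference, a minimal sketch of the dev-set split and early stopping described above (the corpus path and the patience value are assumptions; this is not the repository's training script):

import random

# Hypothetical sketch: hold out 1000 random sentences from the 271K corpus
# as a dev set, and stop once the validation loss stops improving.
def split_dev(corpus_path, dev_size=1000, seed=42):
    with open(corpus_path, encoding="utf-8") as f:
        lines = [l.rstrip("\n") for l in f if l.strip()]
    random.Random(seed).shuffle(lines)
    return lines[dev_size:], lines[:dev_size]   # (train, dev)

class EarlyStopper:
    """Signal a stop after `patience` epochs without dev-loss improvement."""
    def __init__(self, patience=2):
        self.best, self.patience, self.bad_epochs = float("inf"), patience, 0

    def step(self, valid_loss):
        if valid_loss < self.best:
            self.best, self.bad_epochs = valid_loss, 0
            return False    # improved: keep training, checkpoint this epoch
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience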

Pre-training code

Are there plans to open-source the pre-training code and the data processing? I tried to reproduce them myself, but the results were very poor, so something must be wrong somewhere.

Inference code

Is there code for inference? Thanks!

pretrain problem

Hello,

Regarding the pre-training stage:
1. How large a corpus did you use (e.g., on the order of ten million sentences)?
2. What learning rate did you use?
3. How many steps did you train for?
4. Is the pre-training objective the same as BERT's, or the same as the CSC task's?

Thank you very much!

Is my understanding of the training settings correct?

Hello! Thank you very much for open-sourcing this work; it has been very helpful to me!

My understanding of how each of the models provided for download on the network drive was trained is as follows; I hope you can correct me where I am wrong.

The BERT folder contains three sub-folders (baseline, preTrain, advTrain).

0. Taking vanilla BERT as an example

1.  baseline
1.1 baseline initial
train_data, valid_data = train_val_split("the 270K dataset generated by Wang et al. (2018)")
bert = BertModel.from_pretrained('bert-base-chinese', return_dict=True)
lr = 2e-5
batch_size = 20

mode: baseline
[output]: baseline/initial/model.pkl

1.2 baseline sighan13
train_data, valid_data = train_val_split("TwoWaysToImproveCSC/BERT/data/13train.txt")
bert = model.load_state_dict(torch.load("baseline/initial/model.pkl"))
lr = 2e-5
batch_size = 20

mode: finetune
[output]: baseline/sighan13/model.pkl



2. preTrain
2.1 preTrain initial
train_data_path = "pseudo training samples generated from the Wiki-zh and Weibo datasets (25% of the characters randomly replaced)"
bert = BertModel.from_pretrained('bert-base-chinese', return_dict=True)
lr = 2e-5
batch_size = 20

mode: pretrain
[output]: preTrain/initial/model.pkl

2.2 preTrain sighan13
train_data, valid_data = train_val_split("TwoWaysToImproveCSC/BERT/data/13train.txt")
bert = model.load_state_dict(torch.load("preTrain/initial/model.pkl"))
lr = 2e-5
batch_size = 20

mode: finetune
[output]: preTrain/sighan13/model.pkl


3. advTrain
3.1 advTrain sighan13
train_data, valid_data = train_val_split("TwoWaysToImproveCSC/BERT/data/13train.txt")
bert = model.load_state_dict(torch.load("preTrain/initial/model.pkl"))
lr = 2e-5
batch_size = 20
train_ratio = 0.02
attack_ratio = 0.02

mode: adtrain
[output]: advTrain/sighan13/model.pkl
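
To make the checkpoint chain concrete, a hedged sketch of what loading one of the saved .pkl files before further fine-tuning might look like (the exact model class, and whether the file stores a state_dict, are assumptions rather than the repository's code):

import torch
from transformers import BertModel

# Hypothetical sketch of the chain described above: start from
# bert-base-chinese, load a previously saved checkpoint, then continue
# training in the next mode (finetune / adtrain).
bert = BertModel.from_pretrained("bert-base-chinese", return_dict=True)

state = torch.load("preTrain/initial/model.pkl", map_location="cpu")
bert.load_state_dict(state, strict=False)   # strict=False in case the checkpoint adds task heads

optimizer = torch.optim.AdamW(bert.parameters(), lr=2e-5)
batch_size = 20
# ... then fine-tune on TwoWaysToImproveCSC/BERT/data/13train.txt as listed above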

May I ask whether the training settings above are correct, in particular the choice of training data and which pre-trained checkpoint is loaded at each stage?

Thank you very much!

Question about the exact data used at each training stage

Hello, congratulations on the paper being accepted at a top conference; it is very interesting work. I have a few questions.
1. In the network drive with the downloadable models, the BERT folder contains three folders (baseline, preTrain, advTrain), and the baseline folder in turn contains four folders (initial, sighan13, sighan14, sighan15). Does initial stand for BERT without fine-tuning? On what data, and in what order, were the sighan13/14/15 models trained? And why are they kept separate instead of using a single model for all test sets?

I downloaded the model under sighan15 and ran prediction on data/15test.txt (on CPU), getting the results below. They do not seem to match the corresponding numbers in the paper, and I am not sure why.
----Task: sighan15 begin !----
detection sentence accuracy:0.8354545454545454,precision:0.7482517482517482,recall:0.7896678966789668,F1:0.7684021543985636
correction sentence accuracy:0.8254545454545454,precision:0.7290209790209791,recall:0.7693726937269373,F1:0.748653500897666
sentence target modify:542,sentence sum:1100,sentence modified accurate:417
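
For comparison, here is my own back-of-the-envelope reconstruction of the sentence-level correction numbers from the counters printed above (the meaning of each counter is my interpretation, not the repository's evaluation code):

# Hypothetical reading of the counters above:
#   target_modify     = gold sentences containing at least one error (542)
#   modified_accurate = model-modified sentences whose output matches gold (417)
#   modified          = sentences the model changed; not printed, but
#                       417 / 0.7290... implies roughly 572
def sentence_prf(modified_accurate, modified, target_modify):
    precision = modified_accurate / modified
    recall = modified_accurate / target_modify
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(sentence_prf(417, 572, 542))
# -> roughly (0.729, 0.769, 0.749), matching the printed correction scores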

Could the code for processing the Wikipedia and Weibo data be open-sourced?

Following the pipeline given in this project step by step, both training and the baseline reach the results reported in the paper. However, at the stage of constructing the pre-training data, my processing yields roughly 6 million Wikipedia sentences and 4 million Weibo sentences, and the model trained on them still falls somewhat short of the paper's results.
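
For illustration, a minimal sketch of the kind of pseudo-data construction the pre-training stage relies on, i.e. randomly replacing about 25% of the characters in a clean sentence (the confusion-set lookup and fallback policy are assumptions, not the unreleased preprocessing code):

import random

# Hypothetical sketch: corrupt a clean sentence by replacing ~25% of its
# characters, preferring confusion-set candidates and falling back to a
# plain vocabulary sample otherwise.
def corrupt(sentence, confusion, vocab, ratio=0.25, rng=random.Random(0)):
    chars = list(sentence)
    n_replace = max(1, int(len(chars) * ratio))
    for i in rng.sample(range(len(chars)), k=min(n_replace, len(chars))):
        candidates = confusion.get(chars[i]) or vocab
        chars[i] = rng.choice(list(candidates))
    return "".join(chars), sentence   # (noisy input, clean target)

vocab = list("的一是在不了有和人这")            # placeholder vocabulary
confusion = {"的": "地得", "在": "再"}          # placeholder confusion set
print(corrupt("我在这里等你", confusion, vocab))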
