fdchongli / twowaystoimprovecsc
This is the official code for the paper "Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models".
License: MIT License
Hello! Thank you for open-sourcing the code! I am trying to reproduce the first-stage training of SoftMaskedBert from scratch, with the same hyperparameters as yours (lr=2e-5, batch_size=20), and likewise sampling 1,000 sentences from the 271K corpus as a dev set. However, my training converged at epoch 8: after that the valid loss started rising while the training loss kept falling. When I took the saved epoch-8 model and trained it on SIGHAN 2013, the results were very poor. May I ask at which epoch your first-stage training converged?
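The pattern described above (valid loss rising after epoch 8 while training loss keeps falling) is the usual overfitting signature, so the standard remedy is to keep the checkpoint with the lowest validation loss and stop once it no longer improves. A minimal sketch, where `train_fn`/`eval_fn` are hypothetical stand-ins for the repo's actual per-epoch training and evaluation routines:

```python
import copy

def train_with_early_stopping(model, train_fn, eval_fn, max_epochs=20, patience=2):
    """Keep the checkpoint with the lowest validation loss.

    train_fn(model) runs one training epoch; eval_fn(model) returns the
    validation loss. Both are placeholders for the real training loop.
    """
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_fn(model)
        val_loss = eval_fn(model)
        if val_loss < best_loss:
            # New best checkpoint: snapshot it and reset the patience counter.
            best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # valid loss rose `patience` epochs in a row
                break
    return best_state, best_loss
```

With this in place, the model reported on SIGHAN would always be the best-on-dev checkpoint rather than whatever the final epoch produced.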
Do you have plans to open-source the pretraining code and the data-processing scripts?
I tried to reproduce the results, but they came out very poor; something must be wrong somewhere.
Is there any inference code available? Thanks!
The Weibo dataset download link is no longer valid.
Hello,
A few questions about the pretraining stage:
1. How large a corpus did you use (e.g., on the order of ten million sentences)?
2. What learning rate did you use?
3. How many steps did you train for?
4. Is the pretraining sub-task the same as BERT's, or the same as the CSC task?
Many thanks!
Hello, and many thanks for open-sourcing this work; it has been very helpful to me!
Below is my understanding of how each of the models you provide for download on the network drive was trained; I would be grateful for your corrections.
BERT文件夹下有三个文件夹(baseline, preTrain, advTrain)
0. Taking vanilla BERT as an example
1. baseline
1.1 baseline initial
train_data, valid_data = train_val_split("the 270K dataset generated by Wang et al. (2018)")
bert = BertModel.from_pretrained('bert-base-chinese', return_dict=True)
lr = 2e-5
batch_size = 20
mode:baseline
[output]: baseline/initial/model.pkl
1.2 baseline sighan13
train_data, valid_data = train_val_split("TwoWaysToImproveCSC/BERT/data/13train.txt")
model.load_state_dict(torch.load("baseline/initial/model.pkl"))
lr = 2e-5
batch_size = 20
mode:finetune
[output]: baseline/sighan13/model.pkl
2. preTrain
2.1 preTrain initial
train_data_path = "pseudo training samples generated from the Wiki-zh and Weibo corpora (randomly replacing 25% of the characters)"
bert = BertModel.from_pretrained('bert-base-chinese', return_dict=True)
lr = 2e-5
batch_size = 20
mode:pretrain
[output]: preTrain/initial/model.pkl
2.2 preTrain sighan13
train_data, valid_data = train_val_split("TwoWaysToImproveCSC/BERT/data/13train.txt")
model.load_state_dict(torch.load("preTrain/initial/model.pkl"))
lr = 2e-5
batch_size = 20
mode:finetune
[output]: preTrain/sighan13/model.pkl
3. advTrain
3.1 advTrain sighan13
train_data, valid_data = train_val_split("TwoWaysToImproveCSC/BERT/data/13train.txt")
model.load_state_dict(torch.load("preTrain/initial/model.pkl"))
lr = 2e-5
batch_size = 20
train_ratio = 0.02
attack_ratio = 0.02
mode:adtrain
[output]: advTrain/sighan13/model.pkl
Could I ask whether the training settings above are correct, especially the choice of training data and the pretrained model loaded at each step?
Thank you very much!
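For step 2.1 above, the pseudo training pairs could be generated along the lines below. This is only a sketch: the confusion set here is a toy placeholder, not the paper's actual phonologically/visually similar character sets, and the 25% replacement ratio follows the description in the settings above.

```python
import random

def make_pseudo_pair(sentence, confusion_set, replace_ratio=0.25, rng=random):
    """Build a (corrupted, correct) training pair for CSC pretraining.

    confusion_set maps a character to a list of confusable substitutes;
    its contents are an assumption here, not the paper's real resource.
    """
    chars = list(sentence)
    # Only characters that have confusable substitutes can be corrupted.
    candidates = [i for i, c in enumerate(chars) if c in confusion_set]
    k = max(1, int(len(chars) * replace_ratio)) if candidates else 0
    for i in rng.sample(candidates, min(k, len(candidates))):
        chars[i] = rng.choice(confusion_set[chars[i]])
    return "".join(chars), sentence  # (input with errors, gold correction)
```

Running this over the Wiki-zh and Weibo corpora would yield the (noisy input, clean target) pairs the pretrain mode expects.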
Hello, congratulations on the top-conference paper; it is very interesting work. I have a few questions.
1. In the network drive you provide for downloading models, the BERT folder contains three subfolders (baseline, preTrain, advTrain), and the baseline folder in turn contains four subfolders (initial, sighan13, sighan14, sighan15). Does "initial" denote BERT before fine-tuning? On what data, and in what order, were the sighan13/14/15 models trained, and why are they kept separate rather than using one model for all test sets?
I downloaded the model under sighan15 and ran prediction on data/15test.txt (on CPU), obtaining the results below; they do not match the corresponding numbers in the paper, and I am not sure why.
----Task: sighan15 begin !----
detection sentence accuracy:0.8354545454545454,precision:0.7482517482517482,recall:0.7896678966789668,F1:0.7684021543985636
correction sentence accuracy:0.8254545454545454,precision:0.7290209790209791,recall:0.7693726937269373,F1:0.748653500897666
sentence target modify:542,sentence sum:1100,sentence modified accurate:417
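When comparing numbers like the ones above against the paper, it helps to be explicit about how sentence-level correction metrics are computed. The sketch below follows the common SIGHAN convention (a positive is any sentence the model modified; a true positive must match the gold exactly); it is an assumption about, not a copy of, this repo's scoring script.

```python
def sentence_correction_metrics(srcs, golds, preds):
    """Sentence-level accuracy/precision/recall/F1 for spelling correction."""
    n_wrong = sum(g != s for s, g in zip(srcs, golds))      # sentences needing correction
    n_modified = sum(p != s for s, p in zip(srcs, preds))   # sentences the model changed
    tp = sum(p != s and p == g for s, g, p in zip(srcs, golds, preds))
    acc = sum(p == g for g, p in zip(golds, preds)) / len(srcs)
    prec = tp / n_modified if n_modified else 0.0
    rec = tp / n_wrong if n_wrong else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```

A mismatch with the paper can come from a different convention here (e.g., character-level vs. sentence-level scoring, or whether unchanged-but-wrong sentences count as false negatives), so checking the convention is worth doing before suspecting the model weights.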
Following the pipeline given in this project step by step, my trained baseline matches the results reported in the paper. But at the stage of constructing pretraining data, my processing yields roughly 6M Wikipedia sentences and 4M Weibo sentences, and the resulting model falls somewhat short of the paper's numbers.