hemingkx / wordseg

A PyTorch implementation of BiLSTM / BERT / RoBERTa (+ BiLSTM + CRF) models for Chinese Word Segmentation (中文分词).

Python 100.00%
bert pytorch roberta chinese-word-segmentation bilstm-crf bert-crf

wordseg's Introduction

Chinese Word Segmentation

This project implements baseline models for the Chinese Word Segmentation (CWS) task. The models include:

  • BiLSTM-CRF + pretrained embedding (unigram/unigram+bigram)
  • BERT-base + X (softmax/CRF/BiLSTM-CRF)
  • RoBERTa + X (softmax/CRF/BiLSTM-CRF)

This project is an extension of CLUENER2020.

A project walkthrough is available in the Zhihu article "紧追SOTA！教你用Pytorch上手BERT中文分词！".

Dataset

The dataset is the Peking University (PKU) corpus from the SIGHAN 2005 Second International Chinese Word Segmentation Bakeoff.

If needed, the PKU dataset and the scoring script are available from Baidu Netdisk 👇:

Link: https://pan.baidu.com/s/1yZrDSrvAXi-0mCxRp3L-sw Password: 31qd

Unpack the dataset and the scoring script into the ./data/ directory of the model you want to run.
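For reference, the SIGHAN bakeoff corpora are plain text with one sentence per line and words separated by whitespace. The line below is illustrative, not quoted from the corpus:

```
迈向  充满  希望  的  新  世纪
```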

Model

This project implements the following baseline models for CWS, each under its own directory:

  • BiLSTM-CRF ( + pretrained embedding)
  • BERT-Softmax
  • BERT-CRF
  • BERT-BiLSTM-CRF

Depending on which pretrained checkpoint is loaded, each BERT-base-X model can be turned into the corresponding RoBERTa-X model; a minimal sketch of the architecture family follows.
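As a rough illustration (not the repo's exact code), the sketch below shows a BERT-BiLSTM-CRF model. It assumes transformers==2.2.2, whose BertModel returns a (sequence_output, pooled_output) tuple, and the third-party pytorch-crf package for the CRF layer; both assumptions are mine, not the repo's stated dependencies. Dropping the LSTM gives BERT-CRF, and replacing the CRF with a per-token softmax gives BERT-Softmax.

```python
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # third-party pytorch-crf package (assumed here)

class BertBiLSTMCRF(nn.Module):
    def __init__(self, bert_path, num_tags, lstm_hidden=256):
        super().__init__()
        # Pointing bert_path at a RoBERTa-wwm folder (which uses the BERT
        # architecture) yields the RoBERTa variant of the same model.
        self.bert = BertModel.from_pretrained(bert_path)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask)[0]  # (B, T, hidden)
        emissions = self.fc(self.lstm(hidden)[0])                        # (B, T, num_tags)
        mask = attention_mask.bool()
        if tags is not None:                           # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)   # inference: best tag sequences
```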

Requirements

This repo was tested on Python 3.6+ and PyTorch 1.5.1. The main requirements are:

  • tqdm
  • scikit-learn
  • pytorch >= 1.5.1
  • 🤗transformers == 2.2.2

To set up the environment, run:

pip install -r requirements.txt

Pretrained Models Required

Download the pretrained BERT models in advance, including:

  • pytorch_model.bin
  • vocab.txt

Place them in the corresponding pretrained-model folders under ./pretrained_bert_models:

bert-base-chinese model: download link

Note that the link above only provides the TensorFlow version; convert it to a PyTorch checkpoint as huggingface suggests (see the sketch below).
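A minimal conversion sketch, assuming transformers==2.2.2 with TensorFlow installed; the paths are illustrative. If your transformers version does not export load_tf_weights_in_bert, use the conversion script bundled with the library instead.

```python
import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

config = BertConfig.from_json_file("pretrained_bert_models/bert-base-chinese/bert_config.json")
model = BertForPreTraining(config)
# Copy the TensorFlow checkpoint weights into the PyTorch model.
load_tf_weights_in_bert(model, config, "pretrained_bert_models/bert-base-chinese/bert_model.ckpt")
torch.save(model.state_dict(), "pretrained_bert_models/bert-base-chinese/pytorch_model.bin")
```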

chinese_roberta_wwm_large model: download link

To skip the conversion, PyTorch versions of both models can be obtained directly from Baidu Netdisk 😊:

Link: https://pan.baidu.com/s/1rhleLywF_EuoxB2nmA212w Password: isc5

Pretrained Embeddings Required

The unigram and bigram embeddings used by BiLSTM-CRF both come from Chinese-Word-Vectors:

Various Co-occurrence Information → Character Feature → Word → Character (1-2) → Context Word Vectors (Baidu Netdisk download link)

For details on why this file is used, see 中文字符向量 #18 in the Chinese-Word-Vectors repo. A loading sketch follows.
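A hedged sketch of reading a Chinese-Word-Vectors text-format file into an nn.Embedding layer; the function name, vocabulary handling, and path are illustrative, not the repo's code.

```python
import numpy as np
import torch
import torch.nn as nn

def load_pretrained_embedding(path, word2id, dim=300):
    """Build an embedding matrix, filling rows for known words from the file."""
    weights = np.random.normal(0, 0.01, (len(word2id), dim)).astype("float32")
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "vocab_size dim" header line
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in word2id:
                weights[word2id[parts[0]]] = np.asarray(parts[1:dim + 1], dtype="float32")
    return nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)
```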

Results

The results (F1 score) of each model on the dataset are shown in the table below.

Throughout, BERT refers to the bert-base-chinese model and RoBERTa to the RoBERTa-wwm-ext-large model.

| Model | F1 Score | Recall | Precision | OOV Rate | OOV Recall | IV Recall |
| --- | --- | --- | --- | --- | --- | --- |
| BiLSTM+CRF | 0.927 | 0.922 | 0.933 | 0.058 | 0.622 | 0.940 |
| BiLSTM+CRF (unigram) | 0.922 | 0.927 | 0.916 | 0.058 | 0.538 | 0.939 |
| BiLSTM+CRF (unigram+bigram) | 0.945 | 0.947 | 0.944 | 0.058 | 0.649 | 0.962 |
| BERT+Softmax | 0.965 | 0.961 | 0.968 | 0.058 | 0.850 | 0.968 |
| BERT+CRF | 0.965 | 0.962 | 0.969 | 0.058 | 0.858 | 0.968 |
| BERT+BiLSTM+CRF | 0.964 | 0.960 | 0.969 | 0.058 | 0.861 | 0.966 |
| RoBERTa+CRF | 0.962 | 0.960 | 0.965 | 0.058 | 0.827 | 0.968 |

We also implemented K-fold cross-validation in the BiLSTM-CRF code; the results are shown below, followed by a sketch of the split.

| Model | K-fold | Average F1 Score | Average Test Loss |
| --- | --- | --- | --- |
| BiLSTM+CRF | 10 | 0.923 | 887.08 |
| BiLSTM+CRF (unigram) | 10 | 0.918 | 640.43 |
| BiLSTM+CRF (unigram+bigram) | 10 | 0.943 | 506.95 |
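A minimal sketch of a 10-fold split with scikit-learn (already listed in the requirements); it only illustrates the setup, not the repo's exact training loop.

```python
import numpy as np
from sklearn.model_selection import KFold

sentences = np.arange(1000)  # stand-in for indices of the training sentences
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, dev_idx) in enumerate(kf.split(sentences)):
    train_set, dev_set = sentences[train_idx], sentences[dev_idx]
    # Train a fresh BiLSTM-CRF on train_set, then record its F1 on dev_set.
    print(f"fold {fold}: {len(train_set)} train / {len(dev_set)} dev sentences")
```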

Parameter Setting

1. Model parameters

The basic BERT/RoBERTa parameters are set in ./experiments/seg/config.json. In addition, the config.json files in the two pretrained-model folders under ./pretrained_bert_models set, besides the basic BERT/RoBERTa parameters, the parameters of the 'X' model (e.g. the LSTM); change these as needed.

2. Other parameters

Environment paths and the remaining hyperparameters are set in ./config.py.

Usage

Change into the directory of the chosen model and run:

python run.py

When the run finishes, the best model and the training log are saved under ./experiments/. Bad cases from the test set are saved to ./case/bad_case.txt.

Evaluation

The official SIGHAN 2005 script can be used to compute F1 score and the other evaluation metrics.

For BiLSTM-CRF, run the following from the ./data/ directory:

perl scripts/score training_vocab.txt test.txt output.txt > score.txt

For BERT + X, run the following from the ./data/ directory:

perl scripts/score training_vocab.txt test.txt res.txt > score.txt

MultiGPU

Because the RoBERTa model uses a lot of GPU memory, the BERT part of this project supports multi-GPU training; BERT-base can simply be trained on a single GPU. All of the commands below go through torch.distributed.launch; a sketch of the setup they rely on follows the commands.

Single-GPU training:

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 run.py

Multi-GPU training:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 run.py

Background training:

CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python -m torch.distributed.launch --nproc_per_node=4 run.py >/dev/null 2>&1 &
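These commands rely on the standard torch.distributed.launch contract, sketched below (illustrative, not the repo's exact code): the launcher spawns one process per GPU, sets RANK and WORLD_SIZE in the environment, and passes --local_rank to each process. Running run.py without the launcher is what triggers the "environment variable RANK expected, but not set" error quoted in the issues below.

```python
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)  # injected by torch.distributed.launch
args = parser.parse_args()

if args.local_rank != -1:
    torch.cuda.set_device(args.local_rank)                 # bind this process to its GPU
    torch.distributed.init_process_group(backend="nccl")   # env:// rendezvous reads RANK/WORLD_SIZE
```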

Notes

1. Log saving

The current model's train.log is already saved under ./experiments/. Before re-running a model, move train.log out of that path so it is not overwritten.

2. About the code comments

This project is an extension of CLUENER2020, so some code comments were never updated; this has no effect on running the project.

wordseg's People

Contributors

hemingkx, sunyiwen1998


wordseg's Issues

awesome work!!!!!!

A great repo that helped an NLP beginner like me quickly get into Chinese word segmentation. Could you share the trained model? Thanks!

error

pretrained_bert_models/bert-base-chinese/
bert_config.json
bert_model.ckpt.index
bert_model.ckpt.meta
config.json
pytorch_model.bin
readme
vocab.txt

device: cpu
--------Process Done!--------
Model name 'pretrained_bert_models/bert-base-chinese/' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased). Assuming 'pretrained_bert_models/bert-base-chinese/' is a path or url to a directory containing tokenizer files.
Didn't find file pretrained_bert_models/bert-base-chinese/added_tokens.json. We won't load it.
Didn't find file pretrained_bert_models/bert-base-chinese/special_tokens_map.json. We won't load it.
Didn't find file pretrained_bert_models/bert-base-chinese/tokenizer_config.json. We won't load it.
loading file pretrained_bert_models/bert-base-chinese/vocab.txt
loading file None
loading file None
loading file None
Model name 'pretrained_bert_models/bert-base-chinese/' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased). Assuming 'pretrained_bert_models/bert-base-chinese/' is a path or url to a directory containing tokenizer files.
Didn't find file pretrained_bert_models/bert-base-chinese/added_tokens.json. We won't load it.
Didn't find file pretrained_bert_models/bert-base-chinese/special_tokens_map.json. We won't load it.
Didn't find file pretrained_bert_models/bert-base-chinese/tokenizer_config.json. We won't load it.
loading file pretrained_bert_models/bert-base-chinese/vocab.txt
loading file None
loading file None
loading file None
--------Dataset Build!--------

FileNotFoundError: [Errno 2] No such file or directory:

Traceback (most recent call last):
File "E:\Program Files\PycharmProjects\bert\WordSeg-main\BERT-Softmax\metrics.py", line 189, in
output2res()
File "E:\Program Files\PycharmProjects\bert\WordSeg-main\BERT-Softmax\metrics.py", line 167, in output2res
with open(config.output_dir, 'r', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'E:\Program Files\PycharmProjects\bert\WordSeg-main\BERT-Softmax/data/output.txt'

For this error, do I need to create the output .txt file in advance?

AttributeError

Training BERT-BiLSTM-CRF keeps failing with an AttributeError. Any suggestions for fixing this? Thanks.

test error:ValueError: cannot copy sequence with size 19 to array axis with dimension 18

File "run.py", line 49, in test
val_metrics = evaluate(test_loader, model, mode='test')
File "/zhangleisx4614/code/WordSeg-main/BERT-CRF/train.py", line 85, in evaluate
for idx, batch_samples in enumerate(dev_loader):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in next
data = self._next_data()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/zhangleisx4614/code/WordSeg-main/BERT-CRF/data_loader.py", line 96, in collate_fn
batch_labels[j][:cur_tags_len] = labels[j]
ValueError: cannot copy sequence with size 19 to array axis with dimension 18

Provide the trained model

Hi, thanks for releasing such great open-source code. Will you provide the trained model later?

Problem running the BERT model, asking for help

While running the bert+softmax model I keep hitting the error below and can't get past it. Has anyone else run into this, and does anyone know how to fix it? Thanks, everyone!
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
