determined22 / zh-ner-tf

A very simple BiLSTM-CRF model for Chinese Named Entity Recognition 中文命名实体识别 (TensorFlow)

Perl 33.22% Python 66.78%
named-entity-recognition bilstm-crf-model tensorflow

zh-ner-tf's Introduction

A simple BiLSTM-CRF model for Chinese Named Entity Recognition

This repository contains the code for building a very simple character-based BiLSTM-CRF sequence labeling model for the Chinese Named Entity Recognition task. Its goal is to recognize three types of named entities: PERSON, LOCATION and ORGANIZATION.

This code works with Python 3 and TensorFlow 1.2. The repository https://github.com/guillaumegenthial/sequence_tagging was a great help.

Model

This model is similar to the models described in papers [1] and [2]. Its structure looks like the following illustration:

Network

For a Chinese sentence, each character in the sentence is assigned a tag from the set {O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG}.

The first layer, the look-up layer, transforms each character from a one-hot vector into a character embedding. In this code the embedding matrix is initialized randomly. Linguistic knowledge could be added later: for example, tokenize the text, use pre-trained word-level embeddings, and augment each character embedding with the word embedding of its enclosing token. Character embeddings can also be obtained by combining lower-level features (see Section 4.1 of paper [2] and Section 3.3 of paper [3] for details).
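
As a rough sketch of what this layer does (names such as word_ids and the dimensions are illustrative assumptions, not the exact code in model.py):

    import tensorflow as tf

    vocab_size, embedding_dim = 4000, 300   # illustrative sizes
    word_ids = tf.placeholder(tf.int32, shape=[None, None], name="word_ids")

    with tf.variable_scope("words"):
        # Randomly initialized embedding matrix; a pre-trained matrix
        # could be fed in here instead.
        _word_embeddings = tf.get_variable(
            name="_word_embeddings",
            shape=[vocab_size, embedding_dim],
            initializer=tf.random_uniform_initializer(-0.25, 0.25),
            dtype=tf.float32)
        # (batch, max_len) int ids -> (batch, max_len, embedding_dim) vectors
        word_embeddings = tf.nn.embedding_lookup(_word_embeddings, word_ids)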

The second layer, the BiLSTM layer, can efficiently use both past and future input information and extracts features automatically.
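
A minimal sketch of such a layer in TF 1.x, reusing the word_embeddings tensor from the previous sketch (hidden_dim and sequence_lengths are assumed names):

    hidden_dim = 300
    sequence_lengths = tf.placeholder(tf.int32, shape=[None],
                                      name="sequence_lengths")

    with tf.variable_scope("bi-lstm"):
        cell_fw = tf.contrib.rnn.LSTMCell(hidden_dim)
        cell_bw = tf.contrib.rnn.LSTMCell(hidden_dim)
        # One LSTM reads the sentence left-to-right, the other right-to-left.
        (output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
            cell_fw, cell_bw, word_embeddings,
            sequence_length=sequence_lengths, dtype=tf.float32)
        # Concatenate both directions: (batch, max_len, 2 * hidden_dim)
        output = tf.concat([output_fw, output_bw], axis=-1)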

The third layer, the CRF layer, assigns the tag for each character in the sentence. If we used a Softmax layer for labeling instead, we might get ungrammatical tag sequences, because Softmax labels each position independently: we know that 'I-LOC' cannot follow 'B-PER', but Softmax doesn't. Compared to Softmax, a CRF layer can use sentence-level tag information and model the transitions between adjacent tags.
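
A sketch of the projection and CRF loss on top of the BiLSTM output from the previous sketch (num_tags = 7 for the tag set above; variable names are assumptions):

    num_tags = 7                          # O + B-/I- for PER, LOC, ORG
    labels = tf.placeholder(tf.int32, shape=[None, None], name="labels")

    with tf.variable_scope("proj"):
        W = tf.get_variable("W", shape=[2 * hidden_dim, num_tags],
                            dtype=tf.float32)
        b = tf.get_variable("b", shape=[num_tags],
                            initializer=tf.zeros_initializer(),
                            dtype=tf.float32)
        max_len = tf.shape(output)[1]
        flat = tf.reshape(output, [-1, 2 * hidden_dim])  # merge batch and time
        logits = tf.reshape(tf.matmul(flat, W) + b, [-1, max_len, num_tags])

    # The CRF learns a (num_tags x num_tags) transition matrix, so illegal
    # bigrams such as B-PER -> I-LOC are penalized at the sequence level.
    log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
        logits, labels, sequence_lengths)
    loss = -tf.reduce_mean(log_likelihood)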

Dataset

           #sentence   #PER    #LOC    #ORG
train      46364       17615   36517   20571
test       4365        1973    2877    1331

It looks like a portion of the MSRA corpus. I downloaded the dataset from the link in ./data_path/original/link.txt

data files

The directory ./data_path contains:

  • the preprocessed data files, train_data and test_data
  • a vocabulary file word2id.pkl that maps each character to a unique id

To generate the vocabulary file, please refer to the code in data.py.
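
A hedged sketch of what vocab_build() in data.py roughly does (the special tokens and the id scheme here are assumptions, not the repository's exact logic):

    import pickle

    def build_vocab(corpus_path, vocab_path):
        """Map each character seen in the corpus to a unique integer id."""
        word2id = {}
        with open(corpus_path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:                       # blank lines separate sentences
                    continue
                char, _tag = line.split()
                if char not in word2id:
                    word2id[char] = len(word2id) + 1   # id 0 reserved for padding
        word2id['<UNK>'] = len(word2id) + 1            # unseen characters at test time
        with open(vocab_path, 'wb') as f:
            pickle.dump(word2id, f)

    build_vocab('./data_path/train_data', './data_path/word2id.pkl')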

data format

Each data file should be in the following format:

中	B-LOC
国	I-LOC
很	O
大	O

句	O
子	O
结	O
束	O
是	O
空	O
行	O

If you want to use your own dataset, please:

  • transform your corpus to the above format
  • generate a new vocabulary file
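
For reference, here is a minimal reader for the two-column format shown above (read_corpus() in data.py plays this role in the repository; this is a simplified sketch, not its exact code):

    def read_corpus(corpus_path):
        """Return a list of (characters, tags) pairs, one per sentence."""
        data, sent_, tag_ = [], [], []
        with open(corpus_path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line:
                    char, tag = line.split()
                    sent_.append(char)
                    tag_.append(tag)
                elif sent_:                         # blank line: sentence boundary
                    data.append((sent_, tag_))
                    sent_, tag_ = [], []
        if sent_:                                   # no trailing blank line
            data.append((sent_, tag_))
        return data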

How to Run

train

python main.py --mode=train

test

python main.py --mode=test --demo_model=1521112368

Please set the parameter --demo_model to the model you want to test; 1521112368 is a model trained by me.

An official evaluation tool for computing metrics: here (click 'Instructions')

My test performance:

P        R        F        F (PER)   F (LOC)   F (ORG)
0.8945   0.8752   0.8847   0.8688    0.9118    0.8515

demo

python main.py --mode=demo --demo_model=1521112368

You can input one Chinese sentence and the model will return the recognition result:

demo_pic

Reference

[1] Bidirectional LSTM-CRF Models for Sequence Tagging

[2] Neural Architectures for Named Entity Recognition

[3] Character-Based LSTM-CRF with Radical-Level Features for Chinese Named Entity Recognition

[4] https://github.com/guillaumegenthial/sequence_tagging

zh-ner-tf's People

Contributors

determined22

zh-ner-tf's Issues

Questions about the corpus and word2vec

First of all, thank you for answering my previous questions; they helped me a lot. I checked my corpus carefully, found a real bug, and after fixing it the training ran successfully. Now I have two questions I hope you can answer, thanks!
1. Is there a limit on the sentence length per line of the corpus? If so, where is it enforced?
While checking my corpus I initially suspected that some sentences were too long and were breaking the training, so I simply split the long sentences. In the process I found some consecutive blank-line problems (which I believe were the main issue with my corpus).
2. How should word vectors trained with word2vec be plugged in? I saw related code in the program; what format should the word-vector model be in?

TensorFlow 1.2 compatibility issue

The code runs fine on my TensorFlow 1.0 machine, but on my TensorFlow 1.2 machine it fails with:
NotFoundError (see above for traceback): Key bi-lstm/bidirectional_rnn/fw/lstm_cell/kernel not found in checkpoint
How can I fix this?

A few small questions

If I use word vectors trained with word2vec, I need to generate my own word2id.pkl file whether or not I change the training corpus; is that understanding correct?
How are accuracy and precision computed during training?
Word-vector training usually doesn't map digits to a special token; why did you choose this approach, and does it improve the results?

Could you share your pre-trained embeddings?

First of all, thank you for the code. I trained my model with randomly initialized embeddings, but the results were not ideal. I would really appreciate the pre-trained embeddings you mentioned using. Many thanks.

Error at test time after training on my own data

============= demo =============
Please input your sentence:
宏远信息科技有限公司
Traceback (most recent call last):
File "/Users/zhangchangrui/Downloads/zh-NER-TF-master/main.py", line 123, in
ORG = get_entity(tag, demo_sent)
File "/Users/zhangchangrui/Downloads/zh-NER-TF-master/utils.py", line 19, in get_entity
ORG = get_ORG_entity(tag_seq, char_seq)
File "/Users/zhangchangrui/Downloads/zh-NER-TF-master/utils.py", line 81, in get_ORG_entity
org += char
UnboundLocalError: local variable 'org' referenced before assignment
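
For reference, this crash occurs when the first ORG tag the extractor meets is an I-ORG with no preceding B-ORG, so the local variable org is never assigned. A defensive rewrite of get_ORG_entity along these lines avoids it (the function name matches utils.py; the body is an assumed sketch, not the repository's code):

    def get_ORG_entity(tag_seq, char_seq):
        ORG, org = [], ''
        for char, tag in zip(char_seq, tag_seq):
            if tag == 'B-ORG':
                if org:                    # close any entity still open
                    ORG.append(org)
                org = char
            elif tag == 'I-ORG':
                org += char                # safe even for a stray leading I-ORG
            else:
                if org:
                    ORG.append(org)
                    org = ''
        if org:                            # entity running to end of sentence
            ORG.append(org)
        return ORG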

Why the special handling when building label2tag?

In the evaluate function:

        for tag, label in self.tag2label.items():
            label2tag[label] = tag if label != 0 else label

when building the label2tag mapping, label = 0 gets special treatment.
I see that at evaluation time (the conlleval function) you convert it back:

tag = '0' if tag == 'O' else tag

What potential problem would the straightforward handling have?

The test method cannot be called more than once

I extracted lines 101 to 122 of main.py into a method for testing, and found that the method cannot be called a second time. Could you tell me where the problem is?

Call site:
PER, LOC, ORG = ge.entity_extractor("**人民银行是一个银行")
PER1, LOC1, ORG1 = ge.entity_extractor("上海")

Error message:
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 408, in _true_getter
use_resource=use_resource, constraint=constraint)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 747, in _get_single_variable
name, "".join(traceback.format_list(tb))))
ValueError: Variable bi-lstm/bidirectional_rnn/fw/lstm_cell/kernel already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

File "/Users/xuli/PycharmProjects/entity_extraction/model.py", line 71, in biLSTM_layer_op
dtype=tf.float32)
File "/Users/xuli/PycharmProjects/entity_extraction/model.py", line 37, in build_graph
self.biLSTM_layer_op()
File "/Users/xuli/PycharmProjects/entity_extraction/get_entity.py", line 93, in entity_extractor
model.build_graph()
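
One common fix for this TF 1.x error is to build the graph and restore the session exactly once, and have the extractor only run inference on subsequent calls. A hedged sketch (ckpt_path and the surrounding wiring are assumptions, not the repository's code):

    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():
        model.build_graph()                  # create variables exactly once
        saver = tf.train.Saver()
    sess = tf.Session(graph=graph)
    saver.restore(sess, ckpt_path)           # ckpt_path: your checkpoint file

    def entity_extractor(sentence):
        # Every call reuses the already-built graph instead of rebuilding it.
        demo_data = [(list(sentence), ['O'] * len(sentence))]
        tag = model.demo_one(sess, demo_data)
        return get_entity(tag, list(sentence))

Setting reuse=tf.AUTO_REUSE on the variable scopes is the other route the error message suggests, but that flag only exists in newer TF versions; building once is simpler.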

How to transform the original data to the train data?

Hi there,
I have some questions about the data preprocessing. I want to know how you transformed the original data in data_path/original/train1.txt into the training data in data_path/train_data. Since some data in data_path/train_data, like B-LOC, do not exist in train1.txt, I want to know how you did that transformation.
Thank you very much!

Missing B-LOC before I-LOC

I was trying to run the demo mode with my own model, trained on my own corpus. Unfortunately I got an error saying:

============= demo =============
INFO:tensorflow:Restoring parameters from ./data_path_save/1513130611/checkpoints/model-560
Restoring parameters from ./data_path_save/1513130611/checkpoints/model-560
Please input your sentence:
马云
['I-LOC', 'I-LOC'] # print out the tags model predicts
Traceback (most recent call last):
  File "main.py", line 114, in <module>
    PER, LOC, ORG = get_entity(tag, demo_sent)
  File "/home/fanfan/py/nlp/ner/utils.py", line 16, in get_entity
    LOC = get_LOC_entity(tag_seq, char_seq)
  File "/home/fanfan/py/nlp/ner/utils.py", line 56, in get_LOC_entity
    loc += char
UnboundLocalError: local variable 'loc' referenced before assignment

It seems the cause is that an I-LOC was predicted without a B-LOC ahead of it. I thought LSTM+CRF was able to learn this constraint and would know that such a sequence is illegal. Any idea why this happened?

Thank you.
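
For reference, the CRF's transition matrix is learned rather than hard-constrained, so with limited training data it can still emit a stray I-LOC with no preceding B-LOC. One workaround (a sketch, not the author's code) is to normalize the predicted sequence before entity extraction:

    def normalize_tags(tag_seq):
        """Promote an I-X with no preceding B-X/I-X of the same type to B-X."""
        fixed, prev = [], 'O'
        for tag in tag_seq:
            if tag.startswith('I-') and prev not in ('B-' + tag[2:], 'I-' + tag[2:]):
                fixed.append('B-' + tag[2:])
            else:
                fixed.append(tag)
            prev = fixed[-1]
        return fixed

    print(normalize_tags(['I-LOC', 'I-LOC']))   # ['B-LOC', 'I-LOC']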

A question about word_embedding

Hi, I have a question about _word_embedding: why did you set its trainable flag to True? In my opinion, _word_embedding is used to map words to vectors so they can be used as the input of the LSTM; why should _word_embedding change during training?

thank you so much

ValueError: None values not supported.

When I ran the training mode in main.py, an error appeared:

ValueError: None values not supported.

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 688, in runfile
execfile(filename, namespace)

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "E:/BiLSTM+CRF/BILSTM+CRF/zh-NER-TF-master/main.py", line 76, in
model.build_graph()

File "E:\BiLSTM+CRF\ BILSTM+CRF\zh-NER-TF-master\model.py", line 43, in build_graph
self.trainstep_op()

File "E:\BiLSTM+CRF\BILSTM+CRF\zh-NER-TF-master\model.py", line 135, in trainstep_op
grads_and_vars_clip = [[tf.clip_by_value(g, -self.clip_grad, self.clip_grad), v] for g, v in grads_and_vars]

File "E:\BiLSTM+CRF\BILSTM+CRF\zh-NER-TF-master\model.py", line 135, in
grads_and_vars_clip = [[tf.clip_by_value(g, -self.clip_grad, self.clip_grad), v] for g, v in grads_and_vars]

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\ops\clip_ops.py", line 59, in clip_by_value
t = ops.convert_to_tensor(t, name="t")

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 676, in convert_to_tensor
as_ref=False)

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 741, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\framework\constant_op.py", line 113, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\framework\constant_op.py", line 102, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\framework\tensor_util.py", line 364, in make_tensor_proto
raise ValueError("None values not supported.")

ValueError: None values not supported.

Low GPU utilization during training

Hi, when training the BiLSTM I noticed that GPU utilization is quite low, only around 11%. Is there anything I need to configure?

A small question I hope you can answer

At test time the code uses the last model from training. Shouldn't we save the model that performs best on the evaluation set and use that one for testing?
Looking forward to your reply, thank you.

Model does not work

When I use the pre-trained model '1521112368' for testing, it raises "InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [4272,300] rhs shape= [3905,300]". I wonder if you used a different word2id or something? How should I reuse this pre-trained model?
Thank you so much!

A small question

Since the BiLSTM can label the corpus automatically, does that mean we don't need to label it ourselves? Or is it only the test set that doesn't need labels?
Is the test set used to evaluate how good the model is? If so, why is the final test done on a single sentence?

Question about running the code

I just got your code and plan to study it carefully. Could you tell me the correct way to run it?

Your test performance

Could you please share your test performance? I cannot load your pre-trained model and want to compare it with the model I trained myself. Thanks.

Improving the results

The F1 score of this basic model is fairly low, even lower than that of a plain CRF. Could you advise what to add to improve the results?

Test set

Is the test_data file in data_path used for testing? Or is it used as a reference to compare against the test output and verify whether the recognition results are correct?

It feels like a word-segmentation lookup

I tried one sentence:
小明在山上吃饭
and got no output at all.
In practice it looks a lot like a word-segmentation lookup, with none of the effect of a deep network.

What is the data set?

Thanks for your code, it helps me a lot. Can you tell me what the data set in your code is? Is it MSRA?

How long does training take?

I added some new data to the original training data and retrained the model. My machine is a MacBook Air with 4 GB of RAM. It has been running for about a whole day now and is still going; I don't know how much longer it needs. Can I force-stop it now, and would the force-stopped model still be usable?

TensorFlow 1.2.1 has no tf.contrib.layers.xavier_initializer(); how should this be set?

TensorFlow 1.2.1.
I changed the initializer, but then ran into a problem. With mode=demo I get:
ValueError: Shape must be rank 3 but is rank 6 for 'train_step/gradients/bi-lstm/concat_grad/Slice' (op: 'Slice') with input shapes: [?,?,600,?,?,300], [3], [3].

    with tf.variable_scope("proj"):
        W = tf.get_variable(name="W",
                            shape=[2 * self.hidden_dim, self.num_tags],
                            initializer=tf.zeros_initializer,
                            # tf.random_normal_initializer,
                            # tf.constant(0.5,dtype=None,shape=[2 * self.hidden_dim, self.num_tags]),
                            # tf.contrib.layers.xavier_initializer(),
                            dtype=tf.float32
                            )

        b = tf.get_variable(name="b",
                            shape=[self.num_tags],
                            initializer=tf.zeros_initializer,
                            # ([self.num_tags]),
                            # initializer=tf.zeros_initializer(),
                            dtype=tf.float32
                            )
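
If tf.contrib.layers.xavier_initializer() really is unavailable in your build, the same Glorot-uniform initialization can be reproduced with core ops only (a sketch using the standard Glorot bound; whether it also resolves the rank error above is a separate question):

    import numpy as np
    import tensorflow as tf

    fan_in, fan_out = 2 * self.hidden_dim, self.num_tags
    limit = np.sqrt(6.0 / (fan_in + fan_out))   # Glorot/Xavier uniform bound
    W = tf.get_variable(name="W",
                        shape=[fan_in, fan_out],
                        initializer=tf.random_uniform_initializer(-limit, limit),
                        dtype=tf.float32)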

Is the final CRF layer a full CRF with feature functions, or just Viterbi decoding?

Hello, I'm new to this area and came here from your blog. This question has bothered me for a long time; I hope you can help.
After the LSTM outputs a k-dimensional vector representing the scores for the k labels, how are the transition probabilities obtained? Given the transition probabilities, is Viterbi decoding alone enough? If so, isn't what follows not a full CRF?
How is the LSTM connected to the CRF here?

No results in demo mode

Traceback (most recent call last):
File "main.py", line 121, in
tag = model.demo_one(sess, demo_data)
File "D:\npl\test_predata\model.py", line 177, in demo_one
for seqs, labels in batch_yield(sent, self.batch_size, self.vocab, self.tag2label, shuffle=False):
File "D:\npl\test_predata\data.py", line 142, in batch_yield
label_ = [tag2label[tag] for tag in tag_]
File "D:\npl\test_predata\data.py", line 142, in
label_ = [tag2label[tag] for tag in tag_]
KeyError: 'O'
Could you take a look at what's going on?
Related: my data has only two tags, T and F, because I want to build a model that takes a full name as input and outputs its abbreviation.
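
For reference, demo mode feeds placeholder labels through batch_yield, and the placeholder is hard-coded as 'O'; with a custom tag set such as T/F, 'O' is missing from tag2label, hence the KeyError. A hedged workaround (assumed names from main.py/data.py) is to use any tag that actually exists:

    # Instead of demo_data = [(demo_sent, ['O'] * len(demo_sent))]:
    dummy_tag = next(iter(tag2label))               # any existing tag, e.g. 'T'
    demo_data = [(list(demo_sent), [dummy_tag] * len(demo_sent))]
    tag = model.demo_one(sess, demo_data)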

word2vec

Hello, I'm a beginner. I'd like to know how word vectors are added, and what format they should be in.

test and demo

I ran into a problem with test. During training everything is normal and the metrics reach roughly 80%+, but when testing with the trained model they all drop to below 1%. Why is that?

Errors when generating the word2id file; how can I fix them?

I trained on my own corpus, with only two tags, 'T' and 'F', but I keep getting this not-enough-values error. How can I fix it?
File "D:\npl\test_predata\data.py", line 154, in
vocab_build('D:/npl/test_predata/word2id.pkl','D:/npl/test_predata/data1.txt', 5)
File "D:\npl\test_predata\data.py", line 40, in vocab_build
data = read_corpus(corpus_path)
File "D:\npl\test_predata\data.py", line 23, in read_corpus
[char, label] = line.strip().split()
ValueError: not enough values to unpack (expected 2, got 1)
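
For reference, line.strip().split() fails whenever a line does not contain exactly one character and one tag: a space-only line, a missing tag, or a BOM at the start of the file can all cause this. A hedged guard inside read_corpus() along these lines avoids the crash while keeping blank lines as sentence boundaries (a sketch, not the repository's exact code):

    def read_corpus(corpus_path):
        data, sent_, tag_ = [], [], []
        with open(corpus_path, encoding='utf-8') as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) == 2:              # well-formed "char tag" line
                    sent_.append(parts[0])
                    tag_.append(parts[1])
                elif sent_:                      # blank or malformed line: boundary
                    data.append((sent_, tag_))
                    sent_, tag_ = [], []
        if sent_:
            data.append((sent_, tag_))
        return data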
