determined22 / zh-ner-tf

A very simple BiLSTM-CRF model for Chinese Named Entity Recognition 中文命名实体识别 (TensorFlow)

Perl 33.22% Python 66.78%
named-entity-recognition bilstm-crf-model tensorflow

zh-ner-tf's Introduction

A simple BiLSTM-CRF model for Chinese Named Entity Recognition

This repository contains the code for building a very simple character-based BiLSTM-CRF sequence labeling model for the Chinese Named Entity Recognition task. Its goal is to recognize three types of named entities: PERSON, LOCATION and ORGANIZATION.

This code works with Python 3 and TensorFlow 1.2. The repository https://github.com/guillaumegenthial/sequence_tagging was a great help.

Model

This model is similar to the models described in papers [1] and [2]. Its structure looks like the following illustration:

Network

For a Chinese sentence, each character in the sentence is assigned a tag from the set {O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG}.

The first layer, the look-up layer, transforms each character from a one-hot vector into a character embedding. In this code the embedding matrix is initialized randomly. Linguistic knowledge could be added later: for example, tokenize the text, use pre-trained word-level embeddings, and augment each character embedding with the word embedding of its enclosing token. Character embeddings can also be obtained by combining lower-level features (see Section 4.1 of paper [2] and Section 3.3 of paper [3] for details).
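
As a rough sketch of what this layer does (names such as word_ids and the dimensions are illustrative assumptions, not the exact code in model.py):

    import tensorflow as tf

    vocab_size, embedding_dim = 4000, 300   # illustrative sizes
    word_ids = tf.placeholder(tf.int32, shape=[None, None], name="word_ids")

    with tf.variable_scope("words"):
        # Randomly initialized embedding matrix; a pre-trained matrix
        # could be fed in here instead.
        _word_embeddings = tf.get_variable(
            name="_word_embeddings",
            shape=[vocab_size, embedding_dim],
            initializer=tf.random_uniform_initializer(-0.25, 0.25),
            dtype=tf.float32)
        # (batch, max_len) int ids -> (batch, max_len, embedding_dim) vectors
        word_embeddings = tf.nn.embedding_lookup(_word_embeddings, word_ids)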

The second layer, the BiLSTM layer, can efficiently use both past and future input information and extracts features automatically.
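
A minimal sketch of such a layer in TF 1.x, reusing the word_embeddings tensor from the previous sketch (hidden_dim and sequence_lengths are assumed names):

    hidden_dim = 300
    sequence_lengths = tf.placeholder(tf.int32, shape=[None],
                                      name="sequence_lengths")

    with tf.variable_scope("bi-lstm"):
        cell_fw = tf.contrib.rnn.LSTMCell(hidden_dim)
        cell_bw = tf.contrib.rnn.LSTMCell(hidden_dim)
        # One LSTM reads the sentence left-to-right, the other right-to-left.
        (output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
            cell_fw, cell_bw, word_embeddings,
            sequence_length=sequence_lengths, dtype=tf.float32)
        # Concatenate both directions: (batch, max_len, 2 * hidden_dim)
        output = tf.concat([output_fw, output_bw], axis=-1)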

The third layer, the CRF layer, assigns the tag for each character in the sentence. If we used a Softmax layer for labeling instead, we might get ungrammatical tag sequences, because Softmax labels each position independently: we know that 'I-LOC' cannot follow 'B-PER', but Softmax doesn't. Compared to Softmax, a CRF layer can use sentence-level tag information and model the transitions between adjacent tags.
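
A sketch of the projection and CRF loss on top of the BiLSTM output from the previous sketch (num_tags = 7 for the tag set above; variable names are assumptions):

    num_tags = 7                          # O + B-/I- for PER, LOC, ORG
    labels = tf.placeholder(tf.int32, shape=[None, None], name="labels")

    with tf.variable_scope("proj"):
        W = tf.get_variable("W", shape=[2 * hidden_dim, num_tags],
                            dtype=tf.float32)
        b = tf.get_variable("b", shape=[num_tags],
                            initializer=tf.zeros_initializer(),
                            dtype=tf.float32)
        max_len = tf.shape(output)[1]
        flat = tf.reshape(output, [-1, 2 * hidden_dim])  # merge batch and time
        logits = tf.reshape(tf.matmul(flat, W) + b, [-1, max_len, num_tags])

    # The CRF learns a (num_tags x num_tags) transition matrix, so illegal
    # bigrams such as B-PER -> I-LOC are penalized at the sequence level.
    log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
        logits, labels, sequence_lengths)
    loss = -tf.reduce_mean(log_likelihood)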

Dataset

           #sentence   #PER    #LOC    #ORG
train      46364       17615   36517   20571
test       4365        1973    2877    1331

It looks like a portion of the MSRA corpus. I downloaded the dataset from the link in ./data_path/original/link.txt

data files

The directory ./data_path contains:

  • the preprocessed data files, train_data and test_data
  • a vocabulary file word2id.pkl that maps each character to a unique id

To generate the vocabulary file, please refer to the code in data.py.
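
A hedged sketch of what vocab_build() in data.py roughly does (the special tokens and the id scheme here are assumptions, not the repository's exact logic):

    import pickle

    def build_vocab(corpus_path, vocab_path):
        """Map each character seen in the corpus to a unique integer id."""
        word2id = {}
        with open(corpus_path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:                       # blank lines separate sentences
                    continue
                char, _tag = line.split()
                if char not in word2id:
                    word2id[char] = len(word2id) + 1   # id 0 reserved for padding
        word2id['<UNK>'] = len(word2id) + 1            # unseen characters at test time
        with open(vocab_path, 'wb') as f:
            pickle.dump(word2id, f)

    build_vocab('./data_path/train_data', './data_path/word2id.pkl')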

data format

Each data file should be in the following format:

中	B-LOC
国	I-LOC
很	O
大	O

句	O
子	O
结	O
束	O
是	O
空	O
行	O

If you want to use your own dataset, please:

  • transform your corpus to the above format
  • generate a new vocabulary file
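
For reference, here is a minimal reader for the two-column format shown above (read_corpus() in data.py plays this role in the repository; this is a simplified sketch, not its exact code):

    def read_corpus(corpus_path):
        """Return a list of (characters, tags) pairs, one per sentence."""
        data, sent_, tag_ = [], [], []
        with open(corpus_path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line:
                    char, tag = line.split()
                    sent_.append(char)
                    tag_.append(tag)
                elif sent_:                         # blank line: sentence boundary
                    data.append((sent_, tag_))
                    sent_, tag_ = [], []
        if sent_:                                   # no trailing blank line
            data.append((sent_, tag_))
        return data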

How to Run

train

python main.py --mode=train

test

python main.py --mode=test --demo_model=1521112368

Please set the parameter --demo_model to the model you want to test; 1521112368 is a model trained by me.

An official evaluation tool for computing metrics: here (click 'Instructions')

My test performance:

P        R        F        F (PER)   F (LOC)   F (ORG)
0.8945   0.8752   0.8847   0.8688    0.9118    0.8515

demo

python main.py --mode=demo --demo_model=1521112368

You can input one Chinese sentence and the model will return the recognition result:

demo_pic

Reference

[1] Bidirectional LSTM-CRF Models for Sequence Tagging

[2] Neural Architectures for Named Entity Recognition

[3] Character-Based LSTM-CRF with Radical-Level Features for Chinese Named Entity Recognition

[4] https://github.com/guillaumegenthial/sequence_tagging

zh-ner-tf's People

Contributors

determined22

zh-ner-tf's Issues

Questions about the corpus and word2vec

First of all, thank you for answering my previous questions; they helped me a lot. I checked my corpus carefully, found a real bug, and after fixing it the training ran successfully. Now I have two questions I hope you can answer, thanks!
1. Is there a limit on the sentence length per line of the corpus? If so, where is it enforced?
While checking my corpus I initially suspected that some sentences were too long and were breaking the training, so I simply split the long sentences. In the process I found some consecutive blank-line problems (which I believe were the main issue with my corpus).
2. How should word vectors trained with word2vec be plugged in? I saw related code in the program; what format should the word-vector model be in?

TensorFlow 1.2 compatibility issue

The code runs fine on my TensorFlow 1.0 machine, but on my TensorFlow 1.2 machine it fails with:
NotFoundError (see above for traceback): Key bi-lstm/bidirectional_rnn/fw/lstm_cell/kernel not found in checkpoint
How can I fix this?

A few small questions

If I use word vectors trained with word2vec, I need to generate my own word2id.pkl file whether or not I change the training corpus; is that understanding correct?
How are accuracy and precision computed during training?
Word-vector training usually doesn't map digits to a special token; why did you choose this approach, and does it improve the results?

Could you share your pre-trained embeddings?

First of all, thank you for the code. I trained my model with randomly initialized embeddings, but the results were not ideal. I would really appreciate the pre-trained embeddings you mentioned using. Many thanks.

Error at test time after training on my own data

============= demo =============
Please input your sentence:
宏远信息科技有限公司
Traceback (most recent call last):
File "/Users/zhangchangrui/Downloads/zh-NER-TF-master/main.py", line 123, in
ORG = get_entity(tag, demo_sent)
File "/Users/zhangchangrui/Downloads/zh-NER-TF-master/utils.py", line 19, in get_entity
ORG = get_ORG_entity(tag_seq, char_seq)
File "/Users/zhangchangrui/Downloads/zh-NER-TF-master/utils.py", line 81, in get_ORG_entity
org += char
UnboundLocalError: local variable 'org' referenced before assignment
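
For reference, this crash occurs when the first ORG tag the extractor meets is an I-ORG with no preceding B-ORG, so the local variable org is never assigned. A defensive rewrite of get_ORG_entity along these lines avoids it (the function name matches utils.py; the body is an assumed sketch, not the repository's code):

    def get_ORG_entity(tag_seq, char_seq):
        ORG, org = [], ''
        for char, tag in zip(char_seq, tag_seq):
            if tag == 'B-ORG':
                if org:                    # close any entity still open
                    ORG.append(org)
                org = char
            elif tag == 'I-ORG':
                org += char                # safe even for a stray leading I-ORG
            else:
                if org:
                    ORG.append(org)
                    org = ''
        if org:                            # entity running to end of sentence
            ORG.append(org)
        return ORG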

Why the special handling when building label2tag?

In the evaluate function:

        for tag, label in self.tag2label.items():
            label2tag[label] = tag if label != 0 else label

when building the label2tag mapping, label = 0 gets special treatment.
I see that at evaluation time (the conlleval function) you convert it back:

tag = '0' if tag == 'O' else tag

What potential problem would the straightforward handling have?

The test method cannot be called more than once

I extracted lines 101 to 122 of main.py into a method for testing, and found that the method cannot be called a second time. Could you tell me where the problem is?

Call site:
PER, LOC, ORG = ge.entity_extractor("**人民银行是一个银行")
PER1, LOC1, ORG1 = ge.entity_extractor("上海")

Error message:
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 408, in _true_getter
use_resource=use_resource, constraint=constraint)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 747, in _get_single_variable
name, "".join(traceback.format_list(tb))))
ValueError: Variable bi-lstm/bidirectional_rnn/fw/lstm_cell/kernel already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

File "/Users/xuli/PycharmProjects/entity_extraction/model.py", line 71, in biLSTM_layer_op
dtype=tf.float32)
File "/Users/xuli/PycharmProjects/entity_extraction/model.py", line 37, in build_graph
self.biLSTM_layer_op()
File "/Users/xuli/PycharmProjects/entity_extraction/get_entity.py", line 93, in entity_extractor
model.build_graph()
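
One common fix for this TF 1.x error is to build the graph and restore the session exactly once, and have the extractor only run inference on subsequent calls. A hedged sketch (ckpt_path and the surrounding wiring are assumptions, not the repository's code):

    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():
        model.build_graph()                  # create variables exactly once
        saver = tf.train.Saver()
    sess = tf.Session(graph=graph)
    saver.restore(sess, ckpt_path)           # ckpt_path: your checkpoint file

    def entity_extractor(sentence):
        # Every call reuses the already-built graph instead of rebuilding it.
        demo_data = [(list(sentence), ['O'] * len(sentence))]
        tag = model.demo_one(sess, demo_data)
        return get_entity(tag, list(sentence))

Setting reuse=tf.AUTO_REUSE on the variable scopes is the other route the error message suggests, but that flag only exists in newer TF versions; building once is simpler.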

How to transform the original data to the train data?

Hi there,
I have some questions about the data preprocessing. I want to know how you transformed the original data in data_path/original/train1.txt into the training data in data_path/train_data. Since some data in data_path/train_data, like B-LOC, do not exist in train1.txt, I want to know how you did that transformation.
Thank you very much!

Missing B-LOC before I-LOC

I was trying to run the demo mode with my own model, trained on my own corpus. Unfortunately I got an error saying:

============= demo =============
INFO:tensorflow:Restoring parameters from ./data_path_save/1513130611/checkpoints/model-560
Restoring parameters from ./data_path_save/1513130611/checkpoints/model-560
Please input your sentence:
马云
['I-LOC', 'I-LOC'] # print out the tags model predicts
Traceback (most recent call last):
  File "main.py", line 114, in <module>
    PER, LOC, ORG = get_entity(tag, demo_sent)
  File "/home/fanfan/py/nlp/ner/utils.py", line 16, in get_entity
    LOC = get_LOC_entity(tag_seq, char_seq)
  File "/home/fanfan/py/nlp/ner/utils.py", line 56, in get_LOC_entity
    loc += char
UnboundLocalError: local variable 'loc' referenced before assignment

It seems the cause is that an I-LOC was predicted without a B-LOC ahead of it. I thought LSTM+CRF was able to learn this constraint and would know that such a sequence is illegal. Any idea why this happened?

Thank you.
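
For reference, the CRF's transition matrix is learned rather than hard-constrained, so with limited training data it can still emit a stray I-LOC with no preceding B-LOC. One workaround (a sketch, not the author's code) is to normalize the predicted sequence before entity extraction:

    def normalize_tags(tag_seq):
        """Promote an I-X with no preceding B-X/I-X of the same type to B-X."""
        fixed, prev = [], 'O'
        for tag in tag_seq:
            if tag.startswith('I-') and prev not in ('B-' + tag[2:], 'I-' + tag[2:]):
                fixed.append('B-' + tag[2:])
            else:
                fixed.append(tag)
            prev = fixed[-1]
        return fixed

    print(normalize_tags(['I-LOC', 'I-LOC']))   # ['B-LOC', 'I-LOC']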

A question about word_embedding

Hi, I have a question about _word_embedding: why did you set its trainable flag to True? In my opinion, _word_embedding is used to map words to vectors so they can be used as the input of the LSTM; why should _word_embedding change during training?

thank you so much

ValueError: None values not supported.

When I ran the training mode in main.py, an error appeared:

ValueError: None values not supported.

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 688, in runfile
execfile(filename, namespace)

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "E:/BiLSTM+CRF/BILSTM+CRF/zh-NER-TF-master/main.py", line 76, in
model.build_graph()

File "E:\BiLSTM+CRF\ BILSTM+CRF\zh-NER-TF-master\model.py", line 43, in build_graph
self.trainstep_op()

File "E:\BiLSTM+CRF\BILSTM+CRF\zh-NER-TF-master\model.py", line 135, in trainstep_op
grads_and_vars_clip = [[tf.clip_by_value(g, -self.clip_grad, self.clip_grad), v] for g, v in grads_and_vars]

File "E:\BiLSTM+CRF\BILSTM+CRF\zh-NER-TF-master\model.py", line 135, in
grads_and_vars_clip = [[tf.clip_by_value(g, -self.clip_grad, self.clip_grad), v] for g, v in grads_and_vars]

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\ops\clip_ops.py", line 59, in clip_by_value
t = ops.convert_to_tensor(t, name="t")

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 676, in convert_to_tensor
as_ref=False)

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 741, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\framework\constant_op.py", line 113, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\framework\constant_op.py", line 102, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))

File "D:\ANACONDA INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\framework\tensor_util.py", line 364, in make_tensor_proto
raise ValueError("None values not supported.")

ValueError: None values not supported.

Low GPU utilization during training

Hi, when training the BiLSTM I noticed that GPU utilization is quite low, only around 11%. Is there anything I need to configure?

A small question I hope you can answer

At test time the code uses the last model from training. Shouldn't we save the model that performs best on the evaluation set and use that one for testing?
Looking forward to your reply, thank you.

Model does not work

When I use the pre-trained model '1521112368' for testing, it raises "InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [4272,300] rhs shape= [3905,300]". I wonder if you used a different word2id or something? How should I reuse this pre-trained model?
Thank you so much!

A small question

Since the BiLSTM can label the corpus automatically, does that mean we don't need to label it ourselves? Or is it only the test set that doesn't need labels?
Is the test set used to evaluate how good the model is? If so, why is the final test done on a single sentence?

Question about running the code

I just got your code and plan to study it carefully. Could you tell me the correct way to run it?

Your test performance

Could you please share your test performance? I cannot load your pre-trained model and want to compare it with the model I trained myself. Thanks.

Improving the results

The F1 score of this basic model is fairly low, even lower than that of a plain CRF. Could you advise what to add to improve the results?

Test set

Is the test_data file in data_path used for testing? Or is it used as a reference to compare against the test output and verify whether the recognition results are correct?

It feels like a word-segmentation lookup

I tried one sentence:
小明在山上吃饭
and got no output at all.
In practice it looks a lot like a word-segmentation lookup, with none of the effect of a deep network.

What is the data set?

Thanks for your code, it helps me a lot. Can you tell me what the data set in your code is? Is it MSRA?

How long does training take?

I added some new data to the original training data and retrained the model. My machine is a MacBook Air with 4 GB of RAM. It has been running for about a whole day now and is still going; I don't know how much longer it needs. Can I force-stop it now, and would the force-stopped model still be usable?

TensorFlow 1.2.1 has no tf.contrib.layers.xavier_initializer(); how should this be set?

TensorFlow 1.2.1.
I changed the initializer, but then ran into a problem. With mode=demo I get:
ValueError: Shape must be rank 3 but is rank 6 for 'train_step/gradients/bi-lstm/concat_grad/Slice' (op: 'Slice') with input shapes: [?,?,600,?,?,300], [3], [3].

    with tf.variable_scope("proj"):
        W = tf.get_variable(name="W",
                            shape=[2 * self.hidden_dim, self.num_tags],
                            initializer=tf.zeros_initializer,
                            # tf.random_normal_initializer,
                            # tf.constant(0.5,dtype=None,shape=[2 * self.hidden_dim, self.num_tags]),
                            # tf.contrib.layers.xavier_initializer(),
                            dtype=tf.float32
                            )

        b = tf.get_variable(name="b",
                            shape=[self.num_tags],
                            initializer=tf.zeros_initializer,
                            # ([self.num_tags]),
                            # initializer=tf.zeros_initializer(),
                            dtype=tf.float32
                            )
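
If tf.contrib.layers.xavier_initializer() really is unavailable in your build, the same Glorot-uniform initialization can be reproduced with core ops only (a sketch using the standard Glorot bound; whether it also resolves the rank error above is a separate question):

    import numpy as np
    import tensorflow as tf

    fan_in, fan_out = 2 * self.hidden_dim, self.num_tags
    limit = np.sqrt(6.0 / (fan_in + fan_out))   # Glorot/Xavier uniform bound
    W = tf.get_variable(name="W",
                        shape=[fan_in, fan_out],
                        initializer=tf.random_uniform_initializer(-limit, limit),
                        dtype=tf.float32)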

Is the final CRF layer a full CRF with feature functions, or just Viterbi decoding?

Hello, I'm new to this area and came here from your blog. This question has bothered me for a long time; I hope you can help.
After the LSTM outputs a k-dimensional vector representing the scores for the k labels, how are the transition probabilities obtained? Given the transition probabilities, is Viterbi decoding alone enough? If so, isn't what follows not a full CRF?
How is the LSTM connected to the CRF here?

No results in demo mode

Traceback (most recent call last):
File "main.py", line 121, in
tag = model.demo_one(sess, demo_data)
File "D:\npl\test_predata\model.py", line 177, in demo_one
for seqs, labels in batch_yield(sent, self.batch_size, self.vocab, self.tag2label, shuffle=False):
File "D:\npl\test_predata\data.py", line 142, in batch_yield
label_ = [tag2label[tag] for tag in tag_]
File "D:\npl\test_predata\data.py", line 142, in
label_ = [tag2label[tag] for tag in tag_]
KeyError: 'O'
Could you take a look at what's going on?
Related: my data has only two tags, T and F, because I want to build a model that takes a full name as input and outputs its abbreviation.
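
For reference, demo mode feeds placeholder labels through batch_yield, and the placeholder is hard-coded as 'O'; with a custom tag set such as T/F, 'O' is missing from tag2label, hence the KeyError. A hedged workaround (assumed names from main.py/data.py) is to use any tag that actually exists:

    # Instead of demo_data = [(demo_sent, ['O'] * len(demo_sent))]:
    dummy_tag = next(iter(tag2label))               # any existing tag, e.g. 'T'
    demo_data = [(list(demo_sent), [dummy_tag] * len(demo_sent))]
    tag = model.demo_one(sess, demo_data)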

word2vec

Hello, I'm a beginner. I'd like to know how word vectors are added, and what format they should be in.

test and demo

I ran into a problem with test. During training everything is normal and the metrics reach roughly 80%+, but when testing with the trained model they all drop to below 1%. Why is that?

Errors when generating the word2id file; how can I fix them?

I trained on my own corpus, with only two tags, 'T' and 'F', but I keep getting this not-enough-values error. How can I fix it?
File "D:\npl\test_predata\data.py", line 154, in
vocab_build('D:/npl/test_predata/word2id.pkl','D:/npl/test_predata/data1.txt', 5)
File "D:\npl\test_predata\data.py", line 40, in vocab_build
data = read_corpus(corpus_path)
File "D:\npl\test_predata\data.py", line 23, in read_corpus
[char, label] = line.strip().split()
ValueError: not enough values to unpack (expected 2, got 1)
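
For reference, line.strip().split() fails whenever a line does not contain exactly one character and one tag: a space-only line, a missing tag, or a BOM at the start of the file can all cause this. A hedged guard inside read_corpus() along these lines avoids the crash while keeping blank lines as sentence boundaries (a sketch, not the repository's exact code):

    def read_corpus(corpus_path):
        data, sent_, tag_ = [], [], []
        with open(corpus_path, encoding='utf-8') as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) == 2:              # well-formed "char tag" line
                    sent_.append(parts[0])
                    tag_.append(parts[1])
                elif sent_:                      # blank or malformed line: boundary
                    data.append((sent_, tag_))
                    sent_, tag_ = [], []
        if sent_:
            data.append((sent_, tag_))
        return data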
