scofield7419 / sequence-labeling-bilstm-crf



A BiLSTM-CRF model implemented in TensorFlow for sequence labeling tasks.

License: GNU General Public License v3.0

Python 4.39% HTML 6.08% CSS 14.82% JavaScript 74.36% PHP 0.02% Makefile 0.01% CoffeeScript 0.31%

bilstm-crf ner nlp python35 sequence-labeling tensorflow

sequence-labeling-bilstm-crf's Introduction

Hi there Coding hard

sequence-labeling-bilstm-crf's People


billpei tianyunzqs spbohai luofeng1994 rongchen89 leyiwang bringtree zhangjiulong fence qjin2016 leeeeoliu june1990 davidwuzc blakezuo ppn029012 1sfortheelder gokunwu dongxf369 zhaoguangxiang curiouscowboy robertyin-sa facingwaller hanksantford qenvelope alicebupt zhoujiang2013 csm94 liangxi627 zyc130130 airob eminemrain waiteryee1 youlei5898 zsgchinese vpegasus yanaguo silasxue binkes hotlize skywindy cedar33 jayvischeng shangxiaokeng connietong haif-liu hwaking squirrel1982 jliangnku savour-yu chaoongithub liushui9404 1202zhyl wutonghua fishexpert yuhanyu lcy081099 zhangxuemiao rtygbwwwerr junman xueguohua lzjtt2017 xiaxyun fuxia0425 daniel1586 lxldfzr zizhanchen dx2048 yaolinxia zouchl yanminbit joey10huawei siviltaram cxncu001 cosastro yanghaocsg flyrae javelir hydercps fendaq ben2017a meccy liyuejul winston17 junailin 1780041410 jz3707 liurenstrong qianyiwei xuman-amy coeasy mstar1992 hust-bxc mollyhoo zhouyonglong jeremyzhang866 tianyikenan shinichr george86028 del18687058912 cjopengler

sequence-labeling-bilstm-crf's Issues

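About the dataset

@scofield7419 Hello, what kind of dataset did you use? When I train with the bundled train.in, it fails with an error at the 12th iteration, and the F1 value is -1.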

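How to freeze the model

The code is very well written and easy to use.
Prediction on a GPU takes roughly 0.1 s per sentence. Could you add a function that exports and freezes the model, to make it easier to publish the model?
For example, convert_variables_to_constants: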
https://www.jianshu.com/p/091415b114e2

checkpoint issue

I trained the model on my own data, but when I switch to testing (mode = test), the program reports that it cannot find the model. Why? Did training not save the model?

It takes too long to predict a sentence

I ran a quick test and trained a model on example_datasets2, but prediction takes too long. Could you give me some advice on how to speed it up?

How to calculate the metrics based on test.out and test.csv

The HandBook says that "calcu_measure_testout.py" can compute the metrics based on test.out and test.csv, but running this file does not produce the result. I need your help!

Improve the code

Could you improve the code so that it shows the accuracy, recall and F value of each label when testing finishes?
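
A minimal sketch of the kind of per-label report requested here, assuming token-level scoring over two flat lists of label strings and assuming that "accuracy" in the request means precision. `per_label_scores` is a hypothetical helper written for illustration, not a function from this repository.

```python
from collections import Counter


def per_label_scores(gold, pred):
    """Return {label: (precision, recall, f1)} for two equal-length label lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos, pred_count, gold_count = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        gold_count[g] += 1
        pred_count[p] += 1
        if g == p:
            true_pos[g] += 1
    scores = {}
    for label in sorted(set(gold_count) | set(pred_count)):
        # A label that was never predicted (or never present) scores 0.0.
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / gold_count[label] if gold_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy data, made up for the example.
gold = ["B", "I", "E", "O", "S", "O"]
pred = ["B", "I", "O", "O", "S", "S"]
scores = per_label_scores(gold, pred)
for label, (p, r, f) in scores.items():
    print("%s\tP=%.2f\tR=%.2f\tF1=%.2f" % (label, p, r, f))
```

Entity-level scoring (counting a B ... E span as one unit) would need the spans to be decoded first; the sketch above only counts individual tokens.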

Logic problem in self.prepare (DataManger.py)

I tested this project on a new dataset, but I always get:

training set size: 0 validating set size:0

I traced where this comes from; it is in the prepare function:

def prepare(self, tokens, labels, is_padding=True, return_psyduo_label=False):
        X = []
        y = []
        y_psyduo = []
        tmp_x = []
        tmp_y = []
        tmp_y_psyduo = []

        for record in zip(tokens, labels):
            c = record[0]
            l = record[1]
            if c == -1:  # empty line
                if len(tmp_x) <= self.max_sequence_length:
                    X.append(tmp_x)
                    y.append(tmp_y)
                    if return_psyduo_label: y_psyduo.append(tmp_y_psyduo)
                tmp_x = []
                tmp_y = []
                if return_psyduo_label: tmp_y_psyduo = []
            else:
                tmp_x.append(c)
                tmp_y.append(l)
                if return_psyduo_label: tmp_y_psyduo.append(self.label2id["O"])
        if is_padding:
            X = np.array(self.padding(X))
        else:
            X = np.array(X)
        y = np.array(self.padding(y))
        if return_psyduo_label:
            y_psyduo = np.array(self.padding(y_psyduo))
            return X, y_psyduo

        return X, y

Depending on is_padding and return_psyduo_label:

        if is_padding:
            X = np.array(self.padding(X))
        else:
            X = np.array(X)

X always ends up empty. Could you take a look?

Thanks
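
In the function quoted above, a sentence is appended to X only when a token equal to -1 is seen, so a token stream that contains no -1 separators yields no sentences at all, and a final sentence without a trailing separator is dropped. The sketch below reproduces that behaviour in isolation; it assumes this is the cause of the empty sets, which the report does not confirm. `split_sentences` and its `flush_tail` option are hypothetical and not part of the repository. Unlike the quoted code, the sketch also skips empty buffers.

```python
def split_sentences(tokens, labels, max_len, sep=-1, flush_tail=False):
    """Group a flat token/label stream into sentences at each `sep` token."""
    X, y = [], []
    tmp_x, tmp_y = [], []
    for c, l in zip(tokens, labels):
        if c == sep:
            # Sentences are only stored here, when a separator is seen.
            if tmp_x and len(tmp_x) <= max_len:
                X.append(tmp_x)
                y.append(tmp_y)
            tmp_x, tmp_y = [], []
        else:
            tmp_x.append(c)
            tmp_y.append(l)
    # Optional: keep a last sentence that has no trailing separator.
    if flush_tail and tmp_x and len(tmp_x) <= max_len:
        X.append(tmp_x)
        y.append(tmp_y)
    return X, y


# Made-up ids; the first stream has no separator at all.
no_sep = split_sentences([5, 8, 3, 9, 4], [1, 2, 0, 1, 2], max_len=10)
flushed = split_sentences([5, 8, 3, 9, 4], [1, 2, 0, 1, 2], max_len=10, flush_tail=True)
with_sep = split_sentences([5, 8, -1, 3, 9, 4], [1, 2, -1, 0, 1, 2], max_len=10)
print(no_sep)    # no sentences: nothing was ever appended
print(flushed)   # one sentence once the tail is flushed
print(with_sep)  # only the first sentence; the second has no closing separator
```

If the stream behaves like `no_sep`, it may be worth checking whether blank lines in the input file are actually mapped to -1 and whether any sentence exceeds max_sequence_length.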

What is the purpose of self.targets_weight in BILSTM_CRF.py?

Hello, your code covers both char-sequence and word-sequence labeling, and both models contain this statement:
y_train_weight_batch = 1 + np.array((y_train_batch == label2id['B']) | (y_train_batch == label2id['E']), float)
self.targets_weight:y_train_weight_batch
It does not seem to take part in any computation. What is the purpose of self.targets_weight in BILSTM_CRF.py?
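
For reference, the quoted expression evaluates to a weight of 2.0 at positions labelled B or E and 1.0 everywhere else. The sketch below is a pure-Python equivalent of that NumPy line; the label ids are made up, and `boundary_weights` is a hypothetical name. It shows only what the expression computes, not whether the model's loss actually uses the result.

```python
# Hypothetical label ids, chosen only for the example.
label2id = {"O": 0, "B": 1, "I": 2, "E": 3, "S": 4}


def boundary_weights(y_batch, label2id):
    """Mirror `1 + np.array((y == B) | (y == E), float)` with nested lists."""
    b, e = label2id["B"], label2id["E"]
    return [[1.0 + float(label == b or label == e) for label in row]
            for row in y_batch]


y_train_batch = [[1, 2, 3, 0],   # B I E O
                 [4, 0, 1, 3]]   # S O B E
weights = boundary_weights(y_train_batch, label2id)
print(weights)
```

Weights of this shape are commonly multiplied into a per-token loss so that boundary tokens count double, but that use is an assumption here.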

Dataset establishment

Could you share how the dataset was generated? Many thanks.

What is dev_file used for?

What is dev_file used for? What is the difference between dev_file and test_file?

Typo

There is a typo in the README.md file: "world embedding" should be "word embedding".

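Does it support Chinese word segmentation and named entity recognition?

I am not sure whether word-based labeling can support Chinese word segmentation and named entity recognition at the same time.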

ValueError: Cannot reshape a tensor with 28800 elements to shape [128, 6, 6] (4608 elements) We've got an error while stopping in post-mortem: <type 'exceptions.KeyboardInterrupt'>

Could you help with this? When running bilstm_crf_word embedding with the command python train.py train.in model -v validation.in -e 10, I got the following error:
transitions = tf.reshape(tf.concat(0, [transitions] * self.batch_size), [self.batch_size, 6, 6])

ValueError: Cannot reshape a tensor with 28800 elements to shape [128, 6, 6] (4608 elements)
We've got an error while stopping in post-mortem: <type 'exceptions.KeyboardInterrupt'>

How can I fix this?
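
The numbers in the error message can be checked by hand: 28800 elements spread over a batch of 128 leave 225 per item, and 225 = 15 × 15, whereas the reshape target [128, 6, 6] holds only 4608. The sketch below does that arithmetic. Reading the result as "the transition matrix is 15 × 15 while the code expects 6 × 6" is an inference from the message, not something confirmed in the thread, and `implied_matrix_side` is a hypothetical helper.

```python
import math


def implied_matrix_side(total_elements, batch_size):
    """Return n such that total_elements == batch_size * n * n, or raise."""
    per_item, remainder = divmod(total_elements, batch_size)
    if remainder:
        raise ValueError("element count is not a multiple of the batch size")
    side = int(round(math.sqrt(per_item)))
    if side * side != per_item:
        raise ValueError("per-item element count is not a perfect square")
    return side


side = implied_matrix_side(28800, 128)
expected_elements = 128 * 6 * 6
print(side)               # side of the square matrix implied by the tensor
print(expected_elements)  # element count of the hard-coded [128, 6, 6] shape
```

If that reading is right, replacing the hard-coded 6 with the size actually used to build the transitions tensor would make the two counts agree.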

Divide by Zero error in engines/BiLSTM_CRF

I have been trying to use the code on my own data. I followed the instructions and changed the system.config file accordingly. However, when I train the model, I keep getting the following division-by-zero error:

File "/sequence-labeling-BiLSTM/engines/BiLSTM_CRFs.py", line 302, in train
    val_results[k] /= num_val_iterations
ZeroDivisionError: division by zero
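
One way this error can arise, assuming num_val_iterations is obtained by integer-dividing the validation set size by the batch size (the traceback does not show how it is computed): a validation set smaller than one batch gives zero iterations. The sketch below illustrates that case and a guarded average. `num_iterations` and `average_results` are hypothetical helpers, not code from the repository.

```python
def num_iterations(num_examples, batch_size, drop_last=True):
    """Number of batches; with drop_last, an incomplete final batch is skipped."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        return num_examples // batch_size
    return (num_examples + batch_size - 1) // batch_size  # ceiling division


def average_results(totals, num_iters):
    """Divide accumulated metrics by the batch count, with a clear error on zero."""
    if num_iters == 0:
        raise ValueError("validation produced no batches; "
                         "check the validation set size against the batch size")
    return {name: value / num_iters for name, value in totals.items()}


floor_iters = num_iterations(100, 128)                  # 100 examples, batch of 128
ceil_iters = num_iterations(100, 128, drop_last=False)
averaged = average_results({"f1": 1.5, "loss": 3.0}, 2)
print(floor_iters, ceil_iters, averaged)
```

An empty validation set gives zero iterations under either rounding, so it is also worth confirming that the validation file is actually being read.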

The speed of prediction is slow

It takes at least 5 minutes to load the vocab and dataManager, and prediction is too slow. I set CUDA_VISIBLE_DEVICES=1, but CUDA is not used to extract entities. I would like to know why; I have made sure that CUDA is configured correctly for tensorflow-gpu.

char2id and label2id

Are char2id and label2id generated in advance?

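A question about the CRF layer

From the source code, the CRF layer appears to be implemented by hand. Have you considered using the interfaces that TensorFlow already provides, such as tf.contrib.crf.crf_log_likelihood? That would be more convenient.
However, I ran into some problems when using the built-in interface myself, and I would appreciate some guidance.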

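Reporting a bug

The dataset has 14 tags, but parts of the code hard-code the number of tags as 6, presumably designed around an O B I E S O scheme?
It seems this should be changed to num_classes.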

num_steps

Why does num_steps have to be the same for training and testing?

something not mentioned

1. For prediction, the sentence length should be shorter than 1000, according to utils.extractEntity_
2. If an entity ends at the end of the sentence, it cannot be predicted because of a bug in utils.extractEntity, line 86, which can be changed to:
reg_str = r'([0-9][0-9][0-9]B'+label_hyphen + tag_str + r' )([0-9][0-9][0-9]I'+label_hyphen + tag_str + r' )*([0-9][0-9][0-9]E'+label_hyphen + tag_str + r')|([0-9][0-9][0-9]S'+label_hyphen + tag_str + r' )'

Thanks as always for your code.
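
The effect of the proposed pattern can be tried in isolation. The sketch below assumes that labels are serialised as a three-digit position followed by the label and joined with single spaces (for example "001B-PER"), and that the pattern before the change required a space after the E label; neither detail is shown in this thread. `entity_pattern` and `encode` are hypothetical helpers, and "PER" is a made-up tag.

```python
import re

IDX = r"[0-9][0-9][0-9]"  # three-digit position, so only positions 000-999 fit


def entity_pattern(tag_str, label_hyphen="-", space_after_end=False):
    """Build the B (I)* E | S pattern; the E part optionally needs a space."""
    tag = label_hyphen + tag_str
    end = IDX + "E" + tag + (" " if space_after_end else "")
    return ("(" + IDX + "B" + tag + " )"
            + "(" + IDX + "I" + tag + " )*"
            + "(" + end + ")"
            + "|(" + IDX + "S" + tag + " )")


def encode(labels):
    """Serialise labels as '000O 001B-PER ...' (assumed format)."""
    return " ".join("%03d%s" % (i, lab) for i, lab in enumerate(labels))


text = encode(["O", "B-PER", "I-PER", "E-PER"])  # entity ends the sentence
strict = re.search(entity_pattern("PER", space_after_end=True), text)
relaxed = re.search(entity_pattern("PER"), text)
single = re.search(entity_pattern("PER"), encode(["O", "S-PER"]))
print(strict)            # no match: nothing follows the final E label
print(relaxed.group(0))  # the full B I E span is found
print(single)            # the S branch still expects a trailing space
```

Under these assumptions the relaxed pattern recovers an entity that ends the sentence, while a single-token S entity in the last position would still be missed, because that branch keeps its trailing space. The three-digit position prefix is also consistent with the length limit of 1000 mentioned in point 1.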