zhengyanzhao1997 / nlp-model

Python 100.00%

nlp-model's Introduction

NLP-model

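NLP models I have built during study and work, covering paper reproductions and models used in production.

For detailed code walkthroughs, see my blog: https://blog.csdn.net/weixin_45839693

Language: Python 3.8

Framework: TensorFlow 2.0, Transformers 3.1.0

Models available so far: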

Sentence-BERT: NLP-model/model/TF_model/Train_Sentence-BERT.py

Bert-Last_3embedding_concat sentiment classification model: NLP-model/model/TF_model/Train_Bert-Last_3embedding_concat_classification.py

SQuAD: baseline model for the machine reading comprehension task of the 2020 Language and Intelligence Challenge: NLP-model/model/TF_model/SQuAD_baseline.py

Relation extraction with a subject-aware cascaded pointer network: NLP-model/model/TF_model/Information_extraction/三元组抽取_指针标注.py

Relation extraction with Multi-head Selection: NLP-model/model/TF_model/Information_extraction/关系抽取_Multi-head Selection.py

Relation extraction with Deep Biaffine Attention: NLP-model/model/TF_model/Information_extraction/关系抽取_Deep Biaffine Attention.py

Unified Language Model for news summarization: NLP-model/model/TF_model/Unified Language Model

NEZHA relative-position model (for long text), used for legal document summarization: NLP-model/TF_model/model/NEZHA

SDP 2021 @ NAACL LongSumm, collection of the first-place models: NLP-model/model/TF_model/Longsumm

Framework: torch 1.8.0, Transformers 4.1.5

Models available so far:

2021 Sohu Campus Text Matching Algorithm Competition, P-tuning-Bert baseline: NLP-model/model/Torch_model/Souhu_TextMatch

2021 Sohu Campus Text Matching Algorithm Competition, Layer_conditional_norm baseline: NLP-model/model/Torch_model/Souhu_TextMatch

SimCSE paper reproduction, unsupervised and supervised contrastive learning: NLP-model/model/Torch_model/SimCSE-Chinese

Nested named entity recognition with GlobalPointer, TPLinker, Tencent Multi-head, and Deep Biaffine: NLP-model/model/Torch_model/ExtractionEntities

Efficient-GlobalPointer for joint event extraction: NLP-model/model/Torch_model/ExtractionEntities/GPLinker_DUEE

nlp-model's Issues

Code raises an error during validation

Hello, when I ran the two models multi_head_selection.py and deep_biaffine_attention.py, both failed with the following error message:

ValueError: Dimension 1 in both shapes must be equal, but are 1 and 128. Shapes are [32,1,128] and [32,128,1]. for '{{node model_2/biaffine_2/concat}} = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32](model_2/biaffine_2/Tile, model_2/biaffine_2/Tile_1, model_2/biaffine_2/concat/axis)' with input shapes: [32,1,128,144], [32,128,1,144], [] and with computed input tensors: input[2] = <-1>.

0it [01:22, ?it/s]
2021-02-03 09:05:28.842743: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.

What is causing this?

A question about supervised SimCSE

Why does the loss computation construct its own labels instead of using the labels in the dataset?
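
One way to read this, offered as a sketch rather than a description of the repository's code: in contrastive training the "label" of a row is the index of its positive partner inside the batch, so it follows from the batch layout and not from a class column in the dataset. The helper below assumes rows 2i and 2i+1 are a positive pair; `pair_labels` is a hypothetical name introduced only for illustration.

```python
def pair_labels(n_rows):
    """For each row, return the index of its positive partner.

    Assumes (hypothetically) that rows 2i and 2i+1 form a positive pair.
    """
    if n_rows % 2:
        raise ValueError("expected an even number of rows")
    # Even rows point to the next row, odd rows point to the previous one.
    return [i + 1 if i % 2 == 0 else i - 1 for i in range(n_rows)]


print(pair_labels(6))  # [1, 0, 3, 2, 5, 4]
```

Under this reading, the dataset's entailment and contradiction annotations would already have been used when the sentence groups were assembled, which is why they need not appear again in the loss.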

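Is there a preprocessing script or sample data for GlobalPointer and the other NER methods?

Hello, I saw in your blog that you compared the four methods on CMeEE, but the code does not use the original data format provided by the Alibaba Cloud annotation competition. Is there a preprocessing script or a data sample for GlobalPointer and the other NER methods that I could refer to?

Which dataset do GlobalPointer and Multi-head Selection use?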

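In the SimCSE train dataset, why is source passed twice at the tokenize stage instead of repeating input_ids during training?

As in the title.

I ran a quick experiment. Passing source twice at tokenization works as expected and the loss drops quickly. See the code below, which tokenizes [source, source]: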

class TrainDataset(Dataset):
...
    def text_to_id(self, source):
        sample = self.tokenizer([source, source],max_length=self.maxlen,truncation=True,padding='max_length',return_tensors='pt')
        return sample
...

When source is passed only once at tokenization and input_ids is repeated during training, the loss never decreases:

class TrainDataset(Dataset):
...
    def text_to_id(self, source):
        sample = self.tokenizer(source,max_length=self.maxlen,truncation=True,padding='max_length',return_tensors='pt')
        return sample
...

def train(dataloader,testdata, model, optimizer):
    ...
    for batch, data in enumerate(dataloader):
        input_ids = data['input_ids'].view(len(data['input_ids'])*2,-1).repeat(2,1).to(device)
        ...

Why does the loss fail to decrease?

Even if the samples within a batch differ a lot, training should still have some effect.
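
A possible explanation, sketched with nested lists instead of tensors: `view(2B, -1)` keeps each row whole only when the batch really holds two sequences per sample, and `repeat(2, 1)` copies the batch end to end rather than duplicating each row in place. The sketch assumes the loss pairs adjacent rows (2i, 2i+1) and that a single tokenized string yields a `[1, L]` tensor, so the batch is `[B, 1, L]`. `view_rows` and `tile_rows` are hypothetical stand-ins for the tensor operations.

```python
def view_rows(batch, n_rows):
    """Mimic tensor.view(n_rows, -1) on a nested [B, k, L] list (row-major order)."""
    flat = [tok for sample in batch for seq in sample for tok in seq]
    width = len(flat) // n_rows
    return [flat[i * width:(i + 1) * width] for i in range(n_rows)]


def tile_rows(rows, times):
    """Mimic tensor.repeat(times, 1): the whole batch is copied end to end."""
    return [list(row) for _ in range(times) for row in rows]


a, b = [101, 11, 12, 102], [101, 21, 22, 102]   # two toy token-id sequences

# tokenizer([source, source]) gives [B, 2, L]; view(2B, -1) keeps rows whole.
twice = view_rows([[a, a], [b, b]], n_rows=4)    # a, a, b, b

# tokenizer(source) gives [B, 1, L]; view(2B, -1) cuts every row in half.
halved = view_rows([[a], [b]], n_rows=4)

# Keeping rows whole and then tiling yields a, b, a, b instead of a, a, b, b.
tiled = tile_rows(view_rows([[a], [b]], n_rows=2), 2)

print(twice == [a, a, b, b], halved[0], tiled == [a, b, a, b])
```

If the loss does expect adjacent pairs, `torch.repeat_interleave(input_ids, 2, dim=0)` on a `[B, L]` tensor would produce the a, a, b, b order, and the attention mask would need the same treatment.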

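Hello, sorry to bother you: while debugging your 三元组抽取_指针标注.py I hit an error I cannot resolve

Hello! Sorry to bother you. I am at an early stage of learning relation extraction, and your blog has helped me a great deal. I debugged your code "三元组抽取_指针标注.py" and, apart from

this, made no other changes.

It raised

this error. I changed bert_model = TFBertForSequenceClassification.from_pretrained(pretrained_path), but I still do not understand it. Have you run into a similar error before? Is this a problem with the model input? In which direction should I try to modify the code?

Looking forward to your reply.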

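A question about the code

Starting at line 64 of this code:
origin = source['origin']
entailment = source['entailment']
contradiction = source['contradiction']
What does source represent in these three statements, and where is it defined?
Thank you for your help.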

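About the dataset

Hello, I have a question about the input to this transformer. We want to use a different dataset. It is also sequence data, but it consists of biological gene fragments rather than a text corpus. How should I change the input, or can I simply swap the data in directly?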

A problem in the test code

model.load_state_dict(torch.load(save_path))
corrcoef = test(deving_data, model)
print(f"dev_corrcoef: {corrcoef:>4f}")

Shouldn't model.eval() be called before test?
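
To illustrate why the call matters, here is a toy pure-Python stand-in for a dropout layer. It is not PyTorch code and `ToyDropout` is a hypothetical class, but it mirrors the general behaviour of dropout: random zeroing with rescaling in training mode, identity in evaluation mode.

```python
import random


class ToyDropout:
    """Toy stand-in for a dropout layer (hypothetical, for illustration only)."""

    def __init__(self, p=0.5, seed=0):
        self.p = p
        self.training = True            # layers start in training mode
        self._rng = random.Random(seed)

    def eval(self):
        self.training = False
        return self

    def __call__(self, xs):
        if not self.training:
            return list(xs)             # evaluation mode: inputs pass through
        keep = 1.0 - self.p
        # Training mode: drop each value with probability p, rescale the rest.
        return [x / keep if self._rng.random() < keep else 0.0 for x in xs]


layer = ToyDropout(p=0.5)
x = [1.0] * 8
train_out = layer(x)      # noisy: every entry is either 0.0 or 2.0
layer.eval()
eval_out_1 = layer(x)     # deterministic
eval_out_2 = layer(x)
print(eval_out_1 == eval_out_2 == x)
```

In the snippet from the issue, the corresponding order would be `model.load_state_dict(...)`, then `model.eval()`, then `test(...)`, unless `test` already switches the mode internally.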

The nuion_data_pre.json dataset

Hello, could you share a copy of the nuion_data_pre.json dataset?

GPLinker_DUEE

Using the DUEE 1.0 data
Epoch=200

The argu F1 is extremely low for the first several epochs. Is this normal?

Train argu F1: 0.000011
100% 1498/1498 [00:13<00:00, 114.86it/s]
1.1173184357540651e-13 1.0 5.5865921787706374e-14
5.651313930488519e-14 1.0 2.8256569652443394e-14
Higher F1: 0.000000

Train argu F1: 0.000048
100% 1498/1498 [00:12<00:00, 117.61it/s]
1.1173184357540651e-13 1.0 5.5865921787706374e-14
5.651313930488519e-14 1.0 2.8256569652443394e-14

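About the loss computation in supervised SimCSE

Hello, I have some questions about the supervised SimCSE loss. The model takes three sentences as input: the original sentence, a positive example, and a negative example. When the loss is computed, the negative example is dropped, so only the similarity between the original sentence and the positive is considered, and there is no loss term between the original sentence and the negative. This differs from the supervised loss in the original paper. Why are the negatives not considered here?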
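
For reference, the supervised objective in the SimCSE paper places both the in-batch positives and the hard negatives in the denominator. The pure-Python sketch below follows that formulation on plain lists of floats. `supervised_simcse_loss`, its argument names, and the toy vectors are illustrative assumptions and are not taken from the repository; the default temperature of 0.05 is the value used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_simcse_loss(anchors, positives, negatives, tau=0.05):
    """Mean loss where anchor i is scored against every positive and every negative."""
    losses = []
    for i, h in enumerate(anchors):
        logits = [cosine(h, p) / tau for p in positives]
        logits += [cosine(h, n) / tau for n in negatives]   # hard negatives
        m = max(logits)                                      # stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denominator - logits[i])           # target: own positive
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.1], [0.1, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]
loss_good = supervised_simcse_loss(anchors, positives, negatives)
loss_swapped = supervised_simcse_loss(anchors, negatives, positives)
print(loss_good < loss_swapped)
```

Dropping the `negatives` term from `logits` reduces this to the in-batch-negatives-only variant that the question describes.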

AttributeError: 'tuple' object has no attribute 'last_hidden_state'

Hello, I tried to run your code with roberta-chinesetiny, but got the following error.
How can I fix it?
output = outputs.last_hidden_state[:,0]

AttributeError: 'tuple' object has no attribute 'last_hidden_state'
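
This error usually means the model returned a plain tuple rather than an output object, which older Transformers versions do by default. A hedged workaround is to read the hidden states in a way that accepts both styles. The sketch below uses nested lists in place of tensors so it runs without any deep-learning library; `get_last_hidden_state` and `cls_vector` are hypothetical helper names.

```python
from types import SimpleNamespace


def get_last_hidden_state(outputs):
    """Return token-level hidden states from tuple-style or attribute-style outputs."""
    if isinstance(outputs, tuple):
        return outputs[0]               # tuple style: hidden states come first
    return outputs.last_hidden_state    # object style


def cls_vector(hidden):
    """Take position 0 of every sequence, i.e. hidden[:, 0] for a tensor."""
    return [seq[0] for seq in hidden]


# Toy data: batch of 1, sequence length 2, hidden size 2.
hidden = [[[0.1, 0.2], [0.3, 0.4]]]
tuple_style = (hidden, None)
object_style = SimpleNamespace(last_hidden_state=hidden)

print(cls_vector(get_last_hidden_state(tuple_style)))
print(cls_vector(get_last_hidden_state(object_style)))
```

With a real model, `outputs[0][:, 0]` should work for both return styles, and passing `return_dict=True` to the model call is another option on versions that support it.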

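A question about GlobalPointer that I hope you can answer

Hello, I was fortunate to come across your posts on CSDN. After reading the relation extraction series I have great admiration for your work (especially since it comes with both PyTorch and TF implementations).

I do have a small question about GlobalPointer. I assume your implementation also referred to Su Jianlin's source code. In his GlobalPointer implementation he uses sequence masking, namely logits = sequence_masking(logits, mask, '-inf', 2) and logits = sequence_masking(logits, mask, '-inf', 3), yet no mask matrix is passed in when GlobalPointer is called, namely model = build_transformer_model(config_path, checkpoint_path) and output = GlobalPointer(len(categories), 64)(model.output). Why is that?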
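
To make the two masking calls concrete, the sketch below applies the same idea to one head of one sample: a span score is kept only if both its start and its end are real tokens. This is a pure-Python illustration, not the bert4keras implementation, and `mask_span_logits` is a hypothetical name. As for where the mask comes from, Keras layers can receive a mask propagated from upstream layers through the `mask` argument of `call`; whether that is what happens in the quoted code is an assumption that should be checked against the library version in use.

```python
NEG_INF = float("-inf")


def mask_span_logits(logits, mask):
    """Set logits[i][j] to -inf when start i or end j is a padding position.

    logits: square [seq_len][seq_len] list of span scores for one head.
    mask:   [seq_len] list of 1 (real token) or 0 (padding).
    """
    n = len(mask)
    return [
        [logits[i][j] if mask[i] and mask[j] else NEG_INF for j in range(n)]
        for i in range(n)
    ]


scores = [[0.5, 1.2, 0.3],
          [0.1, 0.9, 0.7],
          [0.4, 0.2, 0.8]]
masked = mask_span_logits(scores, mask=[1, 1, 0])   # last position is padding
for row in masked:
    print(row)
```

Masking rows corresponds to the call on axis 2 and masking columns to the call on axis 3 of a `[batch, heads, seq_len, seq_len]` tensor.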

RuntimeError: The size of tensor a (768) must match the size of tensor b (128) at non-singleton dimension 2

Hello, sorry to bother you. I hit the following error when testing unsupervised SimCSE. Could you give me some pointers?
2021-07-05 16:28:46.674 Traceback (most recent call last):

2021-07-05 16:28:46.674 File "/var/dl/runtime/script/unsupervised.py", line 263, in

2021-07-05 16:28:46.674 train(train_iter,testing_data, model, optimizer)

2021-07-05 16:28:46.674 File "/var/dl/runtime/script/unsupervised.py", line 216, in train

2021-07-05 16:28:46.674 loss = compute_loss(pred)

2021-07-05 16:28:46.674 File "/var/dl/runtime/script/unsupervised.py", line 177, in compute_loss

2021-07-05 16:28:46.674 similarities = similarities-torch.eye(y_pred.shape[0],device='cuda') * 1e12

2021-07-05 16:28:46.674 RuntimeError: The size of tensor a (768) must match the size of tensor b (128) at non-singleton dimension 2

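A question about SimCSE evaluation

Hello, what final correlation scores did the supervised and unsupervised SimCSE models reach on the dataset you used?
I trained the unsupervised model on 10,000 examples and only got 0.23 on dev. I am not sure whether this score is normal.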

The file cnsd_snli_v1.0.trainproceed.txt does not exist

The problem has been solved.