xiangking / ark-nlp
A private NLP coding package that quickly implements SOTA solutions.
License: Apache License 2.0
CUDA is not installed. How can I train on the CPU? I currently get the following error:
AssertionError: Torch not compiled with CUDA enabled
Python 3.8.10
ark-nlp 0.0.6
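A hedged workaround (not an ark-nlp-specific API): pick the device dynamically so that a CPU-only PyTorch build never hits a .cuda() call. `dl_module` is an illustrative name for the model object, as used in the ark-nlp examples.

import torch

# Fall back to CPU when the build has no CUDA support
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dl_module = dl_module.to(device)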
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
... (the same assertion repeats for threads [33,0,0] through [42,0,0])
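For context, this assertion usually means an index fed to an embedding lookup is out of range, for example a token id at or above the vocabulary size, or a sequence longer than the model's max_position_embeddings; running the same batch on CPU typically yields a readable IndexError instead. A hedged sanity check, where `input_ids` and `config` are illustrative names for the batch tensor and model config:

# Hypothetical pre-forward sanity check
assert input_ids.max().item() < config.vocab_size
assert input_ids.size(1) <= config.max_position_embeddings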
Hello, when loading data for an NER task using the example together with GlobalPointerBert, I found that once the code reaches convert_to_ids, the server's CPU usage becomes extremely heavy: htop shows the 64-core server at roughly 5000% CPU. What could be the cause?
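A hedged mitigation rather than a confirmed diagnosis: nested parallelism can oversubscribe a 64-core machine during tokenization, so capping the thread pools before heavy preprocessing is worth trying.

import os
# Set the limits before importing torch so its thread pools pick them up
os.environ['OMP_NUM_THREADS'] = '4'
os.environ['MKL_NUM_THREADS'] = '4'

import torch
torch.set_num_threads(4)  # cap intra-op parallelism for CPU ops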
Feature request: add truncation handling when encoding two sentences, or one sentence plus a condition (see the sketch after the tokenizer example below):
Python 3.8.10
ark-nlp 0.0.7
tokenizer.tokenize('森麥康 小米3 M4 M5 5C 5X 5S 5Splus mi 6 6X电源开机音量按键排线侧键 小米5C 开机音量排线')
>>>
['森', '麥', '康', '小', '米', '3', 'm', '4', 'm', '5', '5', 'c', '5', 'x',
 '5', 's', '5', 's', 'p', 'l', 'u', 's', 'm', 'i', '6', '6', 'x',
 '电', '源', '开', '机', '音', '量', '按', '键', '排', '线', '侧', '键',
 '小', '米', '5', 'c', '开', '机', '音', '量', '排', '线']
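Returning to the truncation request above: a minimal sketch of the classic longest-first strategy such a feature could use (illustrative code, not existing ark-nlp internals):

def truncate_pair(tokens_a, tokens_b, max_len):
    # Longest-first truncation for a BERT-style sentence pair
    while len(tokens_a) + len(tokens_b) > max_len:
        # Trim the longer segment so both sides keep proportionate context
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()
    return tokens_a, tokens_b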
The example in the Introduction docs is too long and complex.
Many settings could be wrapped into a default configuration, such as epochs, batch size, optimizer, etc.
Users could customize their own configuration through string options instead of declaring a specific class.
The whole process could also be wrapped into a default class (called a pipeline in Hugging Face).
Here is an example:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")
[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
'score': 0.1073106899857521,
'token': 4827,
'token_str': 'fashion'},
{'sequence': "[CLS] hello i'm a role model. [SEP]",
'score': 0.08774490654468536,
'token': 2535,
'token_str': 'role'},
{'sequence': "[CLS] hello i'm a new model. [SEP]",
'score': 0.05338378623127937,
'token': 2047,
'token_str': 'new'},
{'sequence': "[CLS] hello i'm a super model. [SEP]",
'score': 0.04667217284440994,
'token': 3565,
'token_str': 'super'},
{'sequence': "[CLS] hello i'm a fine model. [SEP]",
'score': 0.027095865458250046,
'token': 2986,
'token_str': 'fine'}]
In this way, we can refine the README docs and make them clear and easy to understand.
For users who want to customize their models further, we can provide more complex example scripts under the test directory.
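A hypothetical sketch of what such a wrapper could look like in ark-nlp; every name below is illustrative, none of it is existing ark-nlp API:

>>> from ark_nlp import pipeline   # illustrative import, does not exist today
>>> nlp = pipeline(
...     task='text-classification',
...     model='bert-base-chinese',
...     optimizer='adamw',   # string options replace explicit class declarations
...     epochs=3,
...     batch_size=32,
... )
>>> nlp.fit(train_dataset)
>>> nlp('这家餐厅的菜很好吃')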
Compared with bert4keras, tokenization and ID conversion here feel very slow. Is there something that has not been optimized?
Running the GlobalPointer model on a single machine with multiple GPUs produces the above error; the same multi-GPU code runs the other models without any error.
Hello, could an EarlyStopping strategy be added in a future release?
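A minimal early-stopping sketch to illustrate the request (plain Python, not ark-nlp API):

class EarlyStopping:
    # Stop training when a monitored metric stops improving
    def __init__(self, patience=3, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = None, 0

    def step(self, metric):
        # Returns True once `metric` has failed to improve for `patience` epochs
        if self.best is None or metric > self.best + self.min_delta:
            self.best, self.bad_epochs = metric, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience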
Python 3.8.10
ark-nlp 0.0.6
Failed to read a local file.
from ark_nlp.dataset import SentenceClassificationDataset
train_dataset = SentenceClassificationDataset('../data/task_datasets/cMedTC/train_data.csv')
Error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-52c3ab3bcf08> in <module>
----> 1 train_dataset = SentenceClassificationDataset('../data/task_datasets/cMedTC/train_data.csv')
2 # dev_dataset = SentenceClassificationDataset('../data/source_datasets/cMedTC/dev_data.csv')
~/anaconda3/lib/python3.7/site-packages/ark_nlp/dataset/base/_dataset.py in __init__(self, data, categories, is_retain_df, is_retain_dataset, is_train, is_test)
~/anaconda3/lib/python3.7/site-packages/ark_nlp/dataset/base/_dataset.py in _load_dataset(self, data_path)
~/anaconda3/lib/python3.7/site-packages/ark_nlp/dataset/base/_dataset.py in _read_data(self, data_path, data_format, skiprows)
ValueError: The data format does not exist
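A hedged workaround, assuming the Dataset also accepts an in-memory pandas DataFrame (the traceback shows the failure happens inside the extension-based format detection of _read_data):

import pandas as pd
from ark_nlp.dataset import SentenceClassificationDataset

# Read the CSV yourself and hand the DataFrame over, bypassing _read_data
train_data_df = pd.read_csv('../data/task_datasets/cMedTC/train_data.csv')
train_dataset = SentenceClassificationDataset(train_data_df)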
def _on_backward(
    self,
    inputs,
    outputs,
    logits,
    loss,
    gradient_accumulation_steps=1,
    **kwargs
):
    # If more than one GPU is in use, average the loss
    if self.n_gpu > 1:
        loss = loss.mean()
    # With gradient accumulation, divide by the number of accumulation steps
    if gradient_accumulation_steps > 1:
        loss = loss / gradient_accumulation_steps
    loss.backward()
    self.pgd.backup_grad()
    # Adversarial training
    for t in range(self.K):
        self.pgd.attack(is_first_attack=(t == 0))  # add an adversarial perturbation to the embeddings; back up param.data on the first attack
        if t != self.K - 1:
            self.module.zero_grad()
        else:
            self.pgd.restore_grad()
        logits = self.module(**inputs)
        logits, loss_adv = self._get_train_loss(inputs, outputs, **kwargs)
        # If more than one GPU is in use, average the adversarial loss
        if self.n_gpu > 1:
            loss_adv = loss_adv.mean()
        # With gradient accumulation, divide by the number of accumulation steps
        if gradient_accumulation_steps > 1:
            loss_adv = loss_adv / gradient_accumulation_steps
        loss_adv.backward()
    self.pgd.restore()  # restore the embedding parameters
    self._on_backward_record(loss, **kwargs)
    return loss
Adding PGD adversarial training after loss.backward() raises an error:
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.
Could the author explain what is going on here?
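A likely cause, judging from the snippet: inside the attack loop the fresh forward result is discarded, and self._get_train_loss(inputs, outputs, **kwargs) recomputes the loss from the stale outputs of the first forward pass, whose graph was already freed by the first loss.backward(). Assuming _get_train_loss derives the loss from its outputs argument, feeding it the new forward result avoids backing through the freed graph, with no retain_graph needed:

        # Inside the attack loop: use the NEW forward pass, not the stale `outputs`
        outputs = self.module(**inputs)   # builds a fresh graph on every attack step
        logits, loss_adv = self._get_train_loss(inputs, outputs, **kwargs)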
It looks like a network problem, but I can access https://huggingface.co/models normally on my end.
I then downloaded the pretrained weights nghuyong/ernie-1.0-base-zh from that address and pointed the code at the absolute path. The program now runs, but the accuracy of the previously trained models is completely off. Were the weight files changed, or is there another cause?
See: #46
Investigation showed that the loss graph had been freed, so I adopted loss.backward(retain_graph=True), replacing every loss.backward() with loss.backward(retain_graph=True). Training then runs normally.
However, the F1 score is lower than with FGM. How can PGD adversarial training be added effectively?
(The quoted _on_backward code and RuntimeError are identical to the snippet shown earlier.)
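For reference, the widely circulated PGD training loop (pgd, K, and compute_loss are placeholder names) runs the adversarial forward and backward inside every attack step, so each step backs through its own fresh graph and retain_graph is unnecessary:

loss.backward()                 # gradients on the clean input
pgd.backup_grad()
for t in range(K):
    pgd.attack(is_first_attack=(t == 0))  # perturb embeddings; back up params on the first step
    if t != K - 1:
        module.zero_grad()                # intermediate steps only refine the perturbation
    else:
        pgd.restore_grad()                # last step: restore the clean gradients, then accumulate
    outputs = module(**inputs)
    loss_adv = compute_loss(outputs)      # placeholder for the task loss
    loss_adv.backward()
pgd.restore()                   # restore the original embedding parameters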
Python 3.8.10
ark-nlp 0.0.7
from ark_nlp.dataset import BIONERDataset as Dataset
from ark_nlp.dataset import BIONERDataset as CrfBertNERDataset
from ark_nlp.processor.tokenizer.transfomer import TokenTokenizer as Tokenizer
from ark_nlp.processor.tokenizer.transfomer import TokenTokenizer as CrfBertNERTokenizer
from ark_nlp.nn import BertConfig as CrfBertConfig
from ark_nlp.nn import BertConfig as ModuleConfig
from ark_nlp.model.ner.crf_bert.crf_bert import CrfBert
from ark_nlp.model.ner.crf_bert.crf_bert import CrfBert as Module
from ark_nlp.factory.optimizer import get_default_crf_bert_optimizer as get_default_model_optimizer
from ark_nlp.factory.optimizer import get_default_crf_bert_optimizer as get_default_crf_bert_optimizer
from ark_nlp.factory.task import BIONERTask as Task
from ark_nlp.factory.task import BIONERTask as CrfBertNERTask
from ark_nlp.factory.predictor import BIONERPredictor as Predictor
from ark_nlp.factory.predictor import BIONERPredictor as CrfBertNERPredictor
Hello:
In gobalpoint_bert.ipynb, how should the EfficientGlobalPointerBert / gobalpoint_bert model be saved and loaded, and how should the tokenizer and the trained model be stored and reloaded?
Directly saving EfficientGlobalPointerBert, or model.module, both raise errors.
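A hedged workaround along standard PyTorch lines, assuming model is the task object and model.module the underlying torch.nn.Module as in the ark-nlp examples: save only the state dict rather than pickling the whole object.

import torch

# Save just the parameters, not the whole (often unpicklable) object
torch.save(model.module.state_dict(), 'globalpointer_bert.pth')

# To reload: rebuild the module exactly as before training, then load the weights
dl_module.load_state_dict(torch.load('globalpointer_bert.pth', map_location='cpu'))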
Feature request: add support for the SimBERT model.
After saving a model trained with `from ark_nlp.model.tc.bert import Bert` in ark_nlp, I tried to load the checkpoint into transformers' BertModel via load_state_dict, but the parameter shapes do not match. The error:
size mismatch for pooler.dense.bias: copying a param with shape torch.Size([4]) from checkpoint, the shape in current model is torch.Size([768]).
PS: I also found that the parameter list of the Bert trained with ark_nlp differs from transformers' BertModel by two entries: "classifier.weight" and "classifier.bias".
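This is largely expected: ark_nlp's Bert is a classification model (encoder plus a task-sized pooler/classifier head), while transformers' BertModel is the bare encoder. A hedged sketch for extracting only the encoder weights; the key prefixes are assumptions that depend on how ark_nlp names its submodules:

import torch
from transformers import BertModel

state_dict = torch.load('ark_nlp_bert.pth', map_location='cpu')
# Drop task-specific heads; the prefix names here are assumptions
encoder_state = {k: v for k, v in state_dict.items()
                 if not k.startswith(('classifier.', 'pooler.'))}

bert = BertModel.from_pretrained('bert-base-chinese')
# strict=False tolerates the remaining key differences and reports them
missing, unexpected = bert.load_state_dict(encoder_state, strict=False)
print('missing:', missing, 'unexpected:', unexpected)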
Hello, during decoding, where is the span_mask produced in _convert_to_transfomer_ids actually used?
Python 3.8.10
ark-nlp 0.0.6
Setting the device to CPU does not take effect:
model = Task(dl_module, optimizer, 'lsce', device='cpu')
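A hedged workaround until the device argument is honored: hide all GPUs before torch initializes CUDA, so every tensor and module lands on the CPU regardless of any internal .cuda() calls.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''  # must run before torch touches CUDA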
Environment info:
ark-nlp 0.0.9
python 3.9
Information:
When using a BERT model, SpanTokenizer produces incorrect token_mapping indices.
For example, given the following input (underscores denote spaces):
input:B o s e _ S o u n d S p o r t _ F r e e _ 真 无 线 蓝 牙 耳 机
tokens:['[UNK]', '[unused1]', '[UNK]', '[unused1]', '[UNK]', '[unused1]', '真', '无', '线', '蓝', '牙', '耳', '机']
token_mapping:[[0], [1], [2], [3], [4], [5], [21], [22], [23], [24], [25], [26], [27]]
The correct token_mapping should be:
[[0,1,2,3], [4], [5,6,7,8,9,10,11,12,13,14], [15], [16,17,18,19], [20], [21], [22], [23], [24], [25], [26], [27]]
What does the input data format for relation extraction look like?
The docs say "each element of the list is a dict organized as follows" ([head entity, start char offset of head entity, end char offset of head entity, relation type, tail entity, start char offset of tail entity, end char offset of tail entity]).
What does the "dict" mentioned there mean? I don't quite understand.
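For illustration only: the bracketed layout above reads like a seven-element list per relation, so a single sample presumably looks something like the sketch below; the field names and nesting are assumptions, not a confirmed ark-nlp format.

# Hypothetical sample shape inferred from the bracketed description above
sample = {
    'text': '小米手机由小米公司生产',
    'label': [
        # [head entity, head start, head end, relation, tail entity, tail start, tail end]
        ['小米手机', 0, 3, '生产商', '小米公司', 5, 8],
    ],
}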
Hello, https://github.com/DataArk/GAIIC2022-Product-Title-Entity-Recognition-Baseline already provides a fine-tuning example. To improve results, pretraining is needed. How can pretraining be implemented on top of ark-nlp? Could you provide an example?
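ark-nlp is not documented here as shipping a pretraining entry point, so one common route is to run continued MLM pretraining with the transformers library and then load the resulting weights as the encoder for the ark-nlp fine-tuning step. A minimal sketch; the corpus path and hyperparameters are placeholders:

import torch
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = BertForMaskedLM.from_pretrained('bert-base-chinese')

# One raw text per line; 'corpus.txt' is a placeholder for the domain corpus
texts = open('corpus.txt', encoding='utf-8').read().splitlines()
encodings = tokenizer(texts, truncation=True, max_length=128)

class MLMDataset(torch.utils.data.Dataset):
    def __init__(self, enc): self.enc = enc
    def __len__(self): return len(self.enc['input_ids'])
    def __getitem__(self, i): return {k: v[i] for k, v in self.enc.items()}

# The collator masks 15% of tokens on the fly and pads each batch
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='mlm_out', num_train_epochs=3),
    train_dataset=MLMDataset(encodings),
    data_collator=collator,
)
trainer.train()
model.save_pretrained('mlm_out')   # load 'mlm_out' as the encoder when fine-tuning
tokenizer.save_pretrained('mlm_out')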
# RoPE encoding
if self.RoPE:
    pos = SinusoidalPositionEmbedding(self.head_size, 'zero')(inputs)
    # cos_pos = pos[..., 1::2].repeat(1, 1, 2)
    # sin_pos = pos[..., ::2].repeat(1, 1, 2)
    cos_pos = pos[..., 1::2].repeat_interleave(2, dim=-1)  # after the fix
    sin_pos = pos[..., ::2].repeat_interleave(2, dim=-1)   # after the fix
Hi, isn't there a small problem in your RoPE implementation? According to Su Jianlin's blog, it should be the modified code above, right?
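The distinction the fix relies on, shown on a toy tensor: repeat tiles the whole slice, while repeat_interleave duplicates each element in place, which is what RoPE's pairwise rotation expects.

>>> import torch
>>> pos = torch.tensor([[1., 2., 3., 4.]])
>>> pos[..., 1::2].repeat(1, 2)                  # tiled: tensor([[2., 4., 2., 4.]])
>>> pos[..., 1::2].repeat_interleave(2, dim=-1)  # element-wise: tensor([[2., 2., 4., 4.]])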
Feature request: .py scripts that can be run directly from the terminal.
That would be more user-friendly.