
ark-nlp's People

Contributors

jimme0421, xiangking, zhw666888, zrealshadow


ark-nlp's Issues

No CUDA

CUDA is not installed. How can I train on the CPU? Right now I get the following error:
AssertionError: Torch not compiled with CUDA enabled
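A minimal sketch of the standard PyTorch fallback; whether ark-nlp's task classes accept a device argument is not verified here, so the safest route is to pick the device yourself and move the module onto it:

import torch

# Fall back to CPU when this build of torch has no CUDA support.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)  # `model` is whatever nn.Module you are training (placeholder name)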

Fix: SpanTokenizer uses '[blank]' for spaces, but raises no error when the pretrained model's vocabulary does not contain that symbol

Environment info

Python 3.8.10
ark-nlp 0.0.6

Information

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [94,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(the same assertion repeats for threads [33,0,0] through [42,0,0])
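The `srcIndex < srcSelectDimSize` assertion is the classic symptom of a token id falling outside the embedding table. A defensive check sketch, assuming a Hugging Face style tokenizer/model pair ('[blank]' is the placeholder named in the title; the checkpoint name is just an example):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('bert-base-chinese')

# Fail loudly instead of feeding an out-of-vocabulary id into the embedding lookup.
if '[blank]' not in tokenizer.get_vocab():
    tokenizer.add_special_tokens({'additional_special_tokens': ['[blank]']})
    model.resize_token_embeddings(len(tokenizer))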

convert_to_ids consumes a huge amount of CPU

Hello. While loading data for an NER task using the example code with GlobelPointerBert, I noticed that once execution reaches convert_to_ids, CPU usage on the server becomes extreme: htop shows roughly 5000% across the 64 cores. What could be causing this?
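If the culprit is PyTorch's intra-op thread pool fanning out during preprocessing, a minimal mitigation sketch (standard PyTorch calls; whether convert_to_ids is the actual hotspot is an assumption):

import torch

# Cap intra-op and inter-op thread counts before any heavy work starts;
# oversubscription on a 64-core box often shows up exactly as ~5000% CPU.
# Alternatively, set OMP_NUM_THREADS / MKL_NUM_THREADS in the environment.
torch.set_num_threads(4)
torch.set_num_interop_threads(4)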

Fix: TokenTokenizer silently drops spaces during tokenization

Environment info

Python 3.8.10
ark-nlp 0.0.7

Information

tokenizer.tokenize('森麥康 小米3 M4 M5 5C 5X 5S 5Splus mi 6 6X电源开机音量按键排线侧键 小米5C 开机音量排线')

>>> ['森', '麥', '康', '小', '米', '3', 'm', '4', 'm', '5', '5', 'c', '5', 'x',
     '5', 's', '5', 's', 'p', 'l', 'u', 's', 'm', 'i', '6', '6', 'x', '电', '源',
     '开', '机', '音', '量', '按', '键', '排', '线', '侧', '键', '小', '米', '5',
     'c', '开', '机', '音', '量', '排', '线']
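A small workaround sketch, assuming the goal is to keep the swallowed spaces recoverable; using '[unused1]' as the placeholder mirrors the SpanTokenizer convention mentioned in another issue below and is an assumption, not the library's behavior:

text = '森麥康 小米3 M4 M5 5C 5X 5S 5Splus mi 6 6X电源开机音量按键排线侧键 小米5C 开机音量排线'

# Tokenize each whitespace-delimited chunk separately and re-insert a
# reserved placeholder token for every space, so word boundaries survive.
tokens = []
for i, chunk in enumerate(text.split(' ')):
    if i > 0:
        tokens.append('[unused1]')  # stand-in for the space TokenTokenizer drops
    tokens.extend(tokenizer.tokenize(chunk))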

New Feature: Add a Pipeline and refine the Introduction in the docs.

Description

The example in the Introduction docs is too long and complex.
Many settings can be wrapped into a default configuration, such as epoch, batch size, optimizer, etc.
Users could then customize their own configuration through string options instead of declaring a specific class.

The whole process can also be wrapped into a default class (called a pipeline in huggingface).
Here is an example:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.08774490654468536,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a new model. [SEP]",
  'score': 0.05338378623127937,
  'token': 2047,
  'token_str': 'new'},
 {'sequence': "[CLS] hello i'm a super model. [SEP]",
  'score': 0.04667217284440994,
  'token': 3565,
  'token_str': 'super'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]

In this way we can refine the README docs and make them clear and easy to understand.
For users who want to customize their models further, we can provide more complex example scripts under the test directory.
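A sketch of what such a wrapper could look like in ark-nlp; every name here (ark_pipeline, the task string, the option names) is hypothetical, simply mirroring the huggingface call above:

# Hypothetical API sketch -- none of these names exist in ark-nlp today.
# Defaults (optimizer, epochs, batch size) live inside the pipeline and are
# overridable with plain keyword/string options instead of class instances.
classifier = ark_pipeline(
    'text-classification',
    model='nghuyong/ernie-1.0-base-zh',
    optimizer='adamw',
    epochs=3,
    batch_size=32,
)
classifier('这家餐厅的服务很好')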

Fix: ValueError: The data format does not exist

Environment info

Python 3.8.10
ark-nlp 0.0.6

Information

Reading a local file fails.

from ark_nlp.dataset import SentenceClassificationDataset

train_dataset = SentenceClassificationDataset('../data/task_datasets/cMedTC/train_data.csv')

Error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-52c3ab3bcf08> in <module>
----> 1 train_dataset = SentenceClassificationDataset('../data/task_datasets/cMedTC/train_data.csv')
      2 # dev_dataset = SentenceClassificationDataset('../data/source_datasets/cMedTC/dev_data.csv')

~/anaconda3/lib/python3.7/site-packages/ark_nlp/dataset/base/_dataset.py in __init__(self, data, categories, is_retain_df, is_retain_dataset, is_train, is_test)

~/anaconda3/lib/python3.7/site-packages/ark_nlp/dataset/base/_dataset.py in _load_dataset(self, data_path)

~/anaconda3/lib/python3.7/site-packages/ark_nlp/dataset/base/_dataset.py in _read_data(self, data_path, data_format, skiprows)

ValueError: The data format does not exist
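One workaround sketch, assuming (as the `_read_data` frame in the traceback suggests) that the loader dispatches on file extension, and that the Dataset also accepts an in-memory DataFrame; the latter is an assumption, not verified against the library:

import pandas as pd
from ark_nlp.dataset import SentenceClassificationDataset

# Read the file yourself so the on-disk format no longer matters.
train_data_df = pd.read_csv('../data/task_datasets/cMedTC/train_data.csv')
train_dataset = SentenceClassificationDataset(train_data_df)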

Error when adding PGD adversarial training

def _on_backward(
        self,
        inputs,
        outputs,
        logits,
        loss,
        gradient_accumulation_steps=1,
        **kwargs
    ):

        # If more than one GPU is in use, average the loss
        if self.n_gpu > 1:
            loss = loss.mean()
        # If gradient accumulation is in use, divide by the number of accumulation steps
        if gradient_accumulation_steps > 1:
            loss = loss / gradient_accumulation_steps

        loss.backward()
        self.pgd.backup_grad()
        # Adversarial training
        for t in range(self.K):
            self.pgd.attack(is_first_attack=(t == 0))  # add the adversarial perturbation to the embeddings; back up param.data on the first attack
            if t != self.K - 1:
                self.module.zero_grad()
            else:
                self.pgd.restore_grad()
            logits = self.module(**inputs)
            logits, loss_adv = self._get_train_loss(inputs, outputs, **kwargs)
            # If more than one GPU is in use, average the adversarial loss
            if self.n_gpu > 1:
                loss_adv = loss_adv.mean()
            # If gradient accumulation is in use, divide by the number of accumulation steps
            if gradient_accumulation_steps > 1:
                loss_adv = loss_adv / gradient_accumulation_steps
            loss_adv.backward()
        self.pgd.restore()  # restore the embedding parameters

        self._on_backward_record(loss, **kwargs)

        return loss

Adding PGD adversarial training after loss.backward() raises the following error:

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.

Could the author explain what is going on here?
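The second backward fails because `_get_train_loss` is fed the `outputs` of the first forward pass, whose autograd graph has already been freed by the first loss.backward(). A minimal sketch of the adversarial loop with a fresh forward pass per step; the method names are taken from the snippet above, and their exact signatures in ark-nlp are assumptions:

for t in range(self.K):
    self.pgd.attack(is_first_attack=(t == 0))
    if t != self.K - 1:
        self.module.zero_grad()
    else:
        self.pgd.restore_grad()
    # Re-run the forward pass under the perturbed embeddings so that
    # loss_adv owns a fresh graph; reusing the stale `outputs` is what
    # triggers "Trying to backward through the graph a second time".
    adv_outputs = self.module(**inputs)
    logits, loss_adv = self._get_train_loss(inputs, adv_outputs, **kwargs)
    loss_adv.backward()
self.pgd.restore()  # restore the original embedding parameters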

The tokenizer raises an error

Yesterday I hit a problem: loading the pretrained weights produced the error in the screenshot below.
[screenshot of the error; image not preserved in this page]

It looks like a network problem, but I can access https://huggingface.co/models normally from this machine.
I then downloaded the weights nghuyong/ernie-1.0-base-zh from that site and pointed the code at the absolute path. The program now runs, but the accuracy of all my previously trained models is wrong. Were the pretrained weights changed, or is something else going on?
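If the worry is that the hub checkpoint silently changed, a sketch using the standard transformers revision pinning (revision='main' here is just the default moving branch; in practice you would substitute a specific commit hash of nghuyong/ernie-1.0-base-zh):

from transformers import AutoModel, AutoTokenizer

# `revision` accepts a branch name, tag, or commit hash; pinning a commit
# makes the download reproducible instead of tracking the moving main branch.
model = AutoModel.from_pretrained('nghuyong/ernie-1.0-base-zh', revision='main')
tokenizer = AutoTokenizer.from_pretrained('nghuyong/ernie-1.0-base-zh', revision='main')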

How can PGD adversarial training be added effectively? Reference: https://github.com/xiangking/ark-nlp/issues/46

See #46. Digging in showed that the loss graph had been freed, so I replaced every loss.backward() with loss.backward(retain_graph=True). Training then runs normally,
but the F1 score comes out lower than with FGM. How can PGD adversarial training be added effectively?

(The `_on_backward` override and the resulting RuntimeError are identical to the ones quoted in #46 above.)
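Note that in the quoted snippet `loss_adv` is computed from the stale `outputs` of the first forward pass, so the freshly perturbed logits are discarded; with retain_graph=True the loop then mostly re-accumulates the unperturbed gradient, which would explain the drop relative to FGM. Recomputing the loss from a fresh forward pass inside the loop, as sketched under #46 above, avoids both the error and the need for retain_graph.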

Fix: invoking CrfBert through the model interface actually runs bert + softmax

Environment info

Python 3.8.10
ark-nlp 0.0.7

Information

from ark_nlp.dataset import BIONERDataset as Dataset
from ark_nlp.dataset import BIONERDataset as CrfBertNERDataset

from ark_nlp.processor.tokenizer.transfomer import TokenTokenizer as Tokenizer
from ark_nlp.processor.tokenizer.transfomer import TokenTokenizer as CrfBertNERTokenizer

from ark_nlp.nn import BertConfig as CrfBertConfig
from ark_nlp.nn import BertConfig as ModuleConfig

from ark_nlp.model.ner.crf_bert.crf_bert import CrfBert
from ark_nlp.model.ner.crf_bert.crf_bert import CrfBert as Module

from ark_nlp.factory.optimizer import get_default_crf_bert_optimizer as get_default_model_optimizer
from ark_nlp.factory.optimizer import get_default_crf_bert_optimizer as get_default_crf_bert_optimizer

from ark_nlp.factory.task import BIONERTask as Task
from ark_nlp.factory.task import BIONERTask as CrfBertNERTask

from ark_nlp.factory.predictor import BIONERPredictor as Predictor
from ark_nlp.factory.predictor import BIONERPredictor as CrfBertNERPredictor
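Because Task and Predictor here are plain aliases of BIONERTask and BIONERPredictor, the CRF layer inside CrfBert is never exercised at training or decoding time, so the model effectively degenerates to bert + softmax.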

Bert model parameter dimensions don't match

I trained and saved a model using ark_nlp's from ark_nlp.model.tc.bert import Bert, then tried to load the checkpoint into transformers' BertModel with load_state_dict, and the parameter dimensions don't match. The error is:
size mismatch for pooler.dense.bias: copying a param with shape torch.Size([4]) from checkpoint, the shape in current model is torch.Size([768]).

PS: I also found that the model trained with ark_nlp's Bert ends up differing from transformers' BertModel by two parameters, namely "classifier.weight" and "classifier.bias".
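A common workaround sketch when a checkpoint carries task-head weights the target class doesn't have (standard PyTorch/transformers calls; the checkpoint path is a placeholder, and this cannot fix the pooler.dense shape mismatch itself, only skip it):

import torch
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-chinese')
state_dict = torch.load('ark_nlp_bert.pth', map_location='cpu')  # placeholder path

# Keep only keys whose name and shape both match the target model,
# then load non-strictly so extra or missing heads are ignored.
target = model.state_dict()
filtered = {k: v for k, v in state_dict.items()
            if k in target and v.shape == target[k].shape}
result = model.load_state_dict(filtered, strict=False)
print(result)  # inspect which keys were skipped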

span_mask

Hi, at decoding time, where is the span_mask built in _convert_to_transfomer_ids actually used?

SpanTokenizer produces incorrect token_mapping indices

Environment info:
ark-nlp 0.0.9
python 3.9

Information:
When using a bert model, SpanTokenizer produces an incorrect token_mapping.
For example, for the following input (underscores mark spaces):
input: B o s e _ S o u n d S p o r t _ F r e e _ 真 无 线 蓝 牙 耳 机
tokens: ['[UNK]', '[unused1]', '[UNK]', '[unused1]', '[UNK]', '[unused1]', '真', '无', '线', '蓝', '牙', '耳', '机']
token_mapping: [[0], [1], [2], [3], [4], [5], [21], [22], [23], [24], [25], [26], [27]]
The correct token_mapping should be:
[[0,1,2,3], [4], [5,6,7,8,9,10,11,12,13,14], [15], [16,17,18,19], [20], [21], [22], [23], [24], [25], [26], [27]]
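A sketch of the mapping logic the reporter expects, under the simplifying assumption (inferred from the example above, not from the library source) that SpanTokenizer emits one token per whitespace-separated ASCII chunk, one '[unused1]' per space, and one token per CJK character:

def span_token_mapping(text):
    # Reconstruct token -> character-offset mapping: each space maps to its
    # own offset (the '[unused1]' slot), runs of non-space ASCII map as one
    # span, and each CJK character maps alone.
    mapping, span = [], []
    for i, ch in enumerate(text):
        if ch == ' ':
            if span:
                mapping.append(span)
                span = []
            mapping.append([i])    # the '[unused1]' token
        elif ord(ch) < 128:
            span.append(i)         # extend the current ASCII span
        else:
            if span:
                mapping.append(span)
                span = []
            mapping.append([i])    # one CJK character = one token
    if span:
        mapping.append(span)
    return mapping

print(span_token_mapping('Bose SoundSport Free 真无线蓝牙耳机'))
# -> [[0, 1, 2, 3], [4], [5, ..., 14], [15], [16, 17, 18, 19], [20], [21], ..., [27]]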

Input data format

What format should the input data for relation extraction take?

The docs say "each element of the list is a dict organized as follows" ([head entity, index of the head entity's first character in the text, index of the head entity's last character, relation type, tail entity, index of the tail entity's first character, index of the tail entity's last character]).

What does "dict" mean here? The bracketed description looks like a list, so I don't quite understand.
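A sketch of what one training record could look like under that description; the key names ('text', 'label') are assumptions modeled on other ark-nlp examples, not confirmed by the docs:

sample = {
    'text': '小明出生于北京',
    'label': [
        # [head, head_start, head_end, relation, tail, tail_start, tail_end]
        ['小明', 0, 1, '出生地', '北京', 5, 6],
    ],
}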

RoPE implementation detail

# RoPE encoding
if self.RoPE:
    pos = SinusoidalPositionEmbedding(self.head_size, 'zero')(inputs)
    # cos_pos = pos[..., 1::2].repeat(1, 1, 2)
    # sin_pos = pos[..., ::2].repeat(1, 1, 2)
    cos_pos = pos[..., 1::2].repeat_interleave(2, dim=-1)  # after the fix
    sin_pos = pos[..., ::2].repeat_interleave(2, dim=-1)  # after the fix

Hi, isn't there a small problem with the RoPE implementation here? Following Su Jianlin's blog post, it should be the corrected code above, shouldn't it?
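A quick numeric check of why the change matters: RoPE rotates adjacent dimension pairs (d0 with d1, d2 with d3, ...), so each cos/sin value must appear twice in a row, [c0, c0, c1, c1, ...]; `repeat` instead tiles the whole vector, [c0, c1, c0, c1, ...]. A runnable demonstration with a toy tensor:

import torch

pos = torch.tensor([[10., 11., 20., 21.]])    # toy position encoding, last dim = head_size
cos_half = pos[..., 1::2]                     # odd slots -> tensor([[11., 21.]])

print(cos_half.repeat(1, 2))                  # tensor([[11., 21., 11., 21.]])  wrong pairing
print(cos_half.repeat_interleave(2, dim=-1))  # tensor([[11., 11., 21., 21.]])  pairs adjacent dims, as RoPE needs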
