
lebert's Issues

Where can I download the file "tencent_vocab.txt"?

Hi @liuwei1206, congratulations on your ACL 2021 acceptance, and thanks a lot for your work. I am very interested in your research and would like to go through the code to learn more about your model.
Could you give me some instructions on where to download the file "tencent_vocab.txt"?

There are some undefined variables and objects

Great job, and thank you for sharing it publicly. But has the author of this project actually run it? I have run into a lot of problems.
I did manage to run the code, but it needed changes; it is not friendly to newcomers.

wcbert_modeling.py
line 455: BaseModelOutputWithPooling
line 531: word_pooling_type

Trainer.py
line 197
line 278~314


Problems on my own task

Here's the running error.

Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated
Traceback (most recent call last):
File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/wen/anaconda3/envs/pytorch/bin/python3', '-u', 'Trainer.py', '--local_rank=0', '--do_train', '--do_eval', '--do_predict', '--evaluate_during_training', '--data_dir=data/dataset/COIE/origin', '--output_dir=data/result/COIE/origin/lebertcrf', '--config_name=data/berts/bert/config.json', '--model_name_or_path=data/berts/bert/pytorch_model.bin', '--vocab_file=data/berts/bert/vocab.txt', '--word_vocab_file=data/vocab/tencent_vocab.txt', '--max_scan_num=1000000', '--max_word_num=5', '--label_file=data/dataset/COIE/origin/labels.txt', '--word_embedding=data/embedding/Tencent_AILab_ChineseEmbedding.txt', '--saved_embedding_dir=data/dataset/COIE/origin', '--model_type=WCBertCRF_Token', '--seed=106524', '--per_gpu_train_batch_size=4', '--per_gpu_eval_batch_size=16', '--learning_rate=1e-5', '--max_steps=-1', '--max_seq_length=256', '--num_train_epochs=20', '--warmup_steps=190', '--save_steps=600', '--logging_steps=100']' died with <Signals.SIGKILL: 9>.

And here's the log.

2021-07-07 15:40:29:INFO: Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
2021-07-07 15:40:29:INFO: Training/evaluation parameters Namespace(adam_epsilon=1e-08, config_name='data/berts/bert/config.json', data_dir='data/dataset/COIE/origin', default_label='O', device=device(type='cuda', index=0), do_eval=True, do_predict=True, do_shuffle=True, do_train=True, evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, label_file='data/dataset/COIE/origin/labels.txt', learning_rate=1e-05, local_rank=0, logging_dir='data/log', logging_steps=100, max_grad_norm=1.0, max_scan_num=1000000, max_seq_length=256, max_steps=-1, max_word_num=5, model_name_or_path='data/berts/bert/pytorch_model.bin', model_type='WCBertCRF_Token', n_gpu=1, no_cuda=False, nodes=1, num_train_epochs=20, output_dir='data/result/COIE/origin/lebertcrf', overwrite_cache=True, per_gpu_eval_batch_size=16, per_gpu_train_batch_size=4, save_steps=600, save_total_limit=50, saved_embedding_dir='data/dataset/COIE/origin', seed=106524, sgd_momentum=0.9, vocab_file='data/berts/bert/vocab.txt', warmup_steps=190, weight_decay=0.0, word_embed_dim=200, word_embedding='data/embedding/Tencent_AILab_ChineseEmbedding.txt', word_vocab_file='data/vocab/tencent_vocab.txt')

I hope you can reply. Thanks.

What's your pretrained model?

Hello, I want to ask which pretrained model you used: bert-base-chinese, bert-wwm, or chinese-roberta-wwm? In your paper you say BERT-base, so I would like to know the specific pretrained model.

Some undefined variables

wcbert_modeling.py
Line 99, config.chunk_size_feed_forward is not defined in data/berts/bert/config.json
Line 119, config.layer_norm_eps is not defined in data/berts/bert/config.json
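
For reference, a minimal workaround sketch in Python, assuming the standard Hugging Face BERT defaults are acceptable here (chunk_size_feed_forward defaults to 0 and layer_norm_eps to 1e-12 in BertConfig); it patches the attributes after loading the provided config.json:

    from transformers import BertConfig

    config = BertConfig.from_json_file("data/berts/bert/config.json")
    # patch only if the file (or the installed transformers version) does not define them
    if not hasattr(config, "chunk_size_feed_forward"):
        config.chunk_size_feed_forward = 0   # 0 disables feed-forward chunking
    if not hasattr(config, "layer_norm_eps"):
        config.layer_norm_eps = 1e-12        # standard BERT LayerNorm epsilon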

Looking forward to your reply.

Where can I download the checkpoints?

Hello, I have downloaded all the files you provide on GitHub, but I didn't find the checkpoints. Can you tell me how to download them? Providing links would be best, thank you.

Question about the mask for words matched to characters in a sentence

    for idy in range(sent_length):
        now_words = matched_words[idy]
        now_word_ids = self.word_vocab.convert_items_to_ids(now_words)
        matched_word_ids[idy][:len(now_word_ids)] = now_word_ids
        matched_word_mask[idy][:len(matched_word_ids)] = 1

Here matched_word_ids has length max_seq_len, while matched_word_mask[idy] has length max_num_word, and max_seq_len > max_num_word. So regardless of whether the current character matches any word, won't the pad positions still take part in the attention computation?
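
If that reading is right, the intended assignment is presumably sliced by the number of matched words rather than by len(matched_word_ids). A sketch of that assumption (not confirmed by the author):

    for idy in range(sent_length):
        now_words = matched_words[idy]
        now_word_ids = self.word_vocab.convert_items_to_ids(now_words)
        matched_word_ids[idy][:len(now_word_ids)] = now_word_ids
        # assumption: only the slots actually filled with matched words are unmasked
        matched_word_mask[idy][:len(now_word_ids)] = 1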

IndexError: list index out of range

def evaluate(model, args, dataset, label_vocab, global_step, description="dev", write_file=False):
    """
    evaluate the model's performance
    """
    dataloader = get_dataloader(dataset, args, mode='dev')
    if (not args.do_train) and (not args.no_cuda) and args.local_rank != -1:
        model = model.cuda()
        model = torch.nn.parallel.DistributedDataParallel(
            model,
            device_ids=[args.local_rank],
            output_device=args.local_rank,
            find_unused_parameters=True
        )

    batch_size = dataloader.batch_size
    if args.local_rank == 0 or args.local_rank == -1:
        logger.info("***** Running %s *****", description)
        logger.info("  Num examples = %d", len(dataloader.dataset))
        logger.info("  Batch size = %d", batch_size)
    eval_losses = []
    model.eval()

    all_input_ids = None
    all_label_ids = None
    all_predict_ids = None
    all_attention_mask = None

    for batch in tqdm(dataloader, desc=description):
        # new batch data: [input_ids, token_type_ids, attention_mask, matched_word_ids,
        # matched_word_mask, boundary_ids, labels]
        batch_data = (batch[0], batch[2], batch[1], batch[3], batch[4], batch[5], batch[6])
        new_batch = batch_data
        batch = tuple(t.to(args.device) for t in new_batch)
        inputs = {"input_ids": batch[0], "attention_mask": batch[1], "token_type_ids": batch[2],
                  "matched_word_ids": batch[3], "matched_word_mask": batch[4],
                  "boundary_ids": batch[5], "labels": batch[6], "flag": "Predict"}
        batch_data = None
        new_batch = None

        with torch.no_grad():
            outputs = model(**inputs)
            preds = outputs[0]

=========================================================================
Training has no problem, but weibo/labels.txt has 28 tags; with the two default tags '<pad>' and '<unk>' that makes 30, yet ids as large as 31 appear in the predicted values.

O
B-PER.NOM
E-PER.NOM
B-LOC.NAM
E-LOC.NAM
B-PER.NAM
I-PER.NAM
E-PER.NAM
S-PER.NOM
B-GPE.NAM
E-GPE.NAM
B-ORG.NAM
I-ORG.NAM
E-ORG.NAM
I-PER.NOM
S-GPE.NAM
B-ORG.NOM
E-ORG.NOM
I-LOC.NAM
I-ORG.NOM
B-LOC.NOM
I-LOC.NOM
E-LOC.NOM
B-GPE.NOM
E-GPE.NOM
I-GPE.NAM
S-PER.NAM
S-LOC.NOM

class ItemVocabFile():
    """
    Build vocab from file.
    Note, each line is an item in vocab, or each items[0] is in vocab
    """
    def __init__(self, files, is_word=False, has_default=False, unk_num=0):
        self.files = files
        self.item2idx = {}
        self.idx2item = []
        self.item_size = 0
        self.is_word = is_word
        if not has_default and not self.is_word:
            self.item2idx['<pad>'] = self.item_size
            self.idx2item.append('<pad>')
            self.item_size += 1
            self.item2idx['<unk>'] = self.item_size
            self.idx2item.append('<unk>')
            self.item_size += 1
        # for unk words
        for i in range(unk_num):
            self.item2idx['<unk>{}'.format(i+1)] = self.item_size
            self.idx2item.append('<unk>{}'.format(i+1))
            self.item_size += 1

        self.init_vocab()
        print('=======labels info========')
        print(self.item2idx)
        print(self.idx2item)

=======labels info========
{'<pad>': 0, '<unk>': 1, 'O': 2, 'B-PER.NOM': 3, 'E-PER.NOM': 4, 'B-LOC.NAM': 5, 'E-LOC.NAM': 6, 'B-PER.NAM': 7, 'I-PER.NAM': 8, 'E-PER.NAM': 9, 'S-PER.NOM': 10, 'B-GPE.NAM': 11, 'E-GPE.NAM': 12, 'B-ORG.NAM': 13, 'I-ORG.NAM': 14, 'E-ORG.NAM': 15, 'I-PER.NOM': 16, 'S-GPE.NAM': 17, 'B-ORG.NOM': 18, 'E-ORG.NOM': 19, 'I-LOC.NAM': 20, 'I-ORG.NOM': 21, 'B-LOC.NOM': 22, 'I-LOC.NOM': 23, 'E-LOC.NOM': 24, 'B-GPE.NOM': 25, 'E-GPE.NOM': 26, 'I-GPE.NAM': 27, 'S-PER.NAM': 28, 'S-LOC.NOM': 29}
['<pad>', '<unk>', 'O', 'B-PER.NOM', 'E-PER.NOM', 'B-LOC.NAM', 'E-LOC.NAM', 'B-PER.NAM', 'I-PER.NAM', 'E-PER.NAM', 'S-PER.NOM', 'B-GPE.NAM', 'E-GPE.NAM', 'B-ORG.NAM', 'I-ORG.NAM', 'E-ORG.NAM', 'I-PER.NOM', 'S-GPE.NAM', 'B-ORG.NOM', 'E-ORG.NOM', 'I-LOC.NAM', 'I-ORG.NOM', 'B-LOC.NOM', 'I-LOC.NOM', 'E-LOC.NOM', 'B-GPE.NOM', 'E-GPE.NOM', 'I-GPE.NAM', 'S-PER.NAM', 'S-LOC.NOM']

IndexError: list index out of range

When I run run_ner.sh with --do_evaluate, I encounter this kind of problem:

Traceback (most recent call last):
  File "Trainer.py", line 598, in <module>
    main()
  File "Trainer.py", line 574, in main
    train(model, args, train_dataset, dev_dataset, test_dataset, label_vocab, tb_writer)
  File "Trainer.py", line 377, in train
    metrics, _ = evaluate(model, args, dev_dataset, label_vocab, global_step, description="Dev", write_file=True)
  File "Trainer.py", line 465, in evaluate
    all_label_ids, all_predict_ids, all_attention_mask, label_vocab)
  File "LEBERT/function/metrics.py", line 40, in seq_f1_with_mask
    tmp_pred.append(label_vocab.convert_id_to_item(all_pred_labels[i][j]).replace("M-", "I-"))
  File "LEBERT/feature/vocab.py", line 84, in convert_id_to_item
    return self.idx2item[id]
IndexError: list index out of range
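
For what it's worth, the out-of-range ids line up with the two extra START/STOP states the CRF allocates (see the issue "In the CRF, why do you add 2 to the tagset_size?" below). A defensive sketch for debugging only, assuming one just wants evaluation not to crash; the fallback to a default label is my assumption, not the author's fix:

    def convert_id_to_item_safe(label_vocab, idx, default_item='O'):
        # ids >= len(idx2item) can only come from the CRF's virtual START/STOP states;
        # fall back to a default label instead of raising IndexError
        if 0 <= idx < len(label_vocab.idx2item):
            return label_vocab.idx2item[idx]
        return default_item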

ontonote4 checkpoint and experimental replication problems

Hi, I met some problems when replicating your experiments.

  1. I used the ontonote4 checkpoint you provided to do prediction. It shows that crf.transitions has shape torch.Size([20, 20]) in the checkpoint, but this dataset only has 17 labels, so crf.transitions should be [19, 19]. Could you please check the checkpoint file for ontonote4?
  2. I used a single GPU to train the models on the weibo and ontonote4 datasets without changing any code or parameters. However, the best F1 score is 0.68 on weibo and 0.80 on ontonote4, which is lower than your results. If that is because you used distributed training, could you please provide the detailed parameters for distributed training, or single-GPU training parameters that can reach your scores?

Many thanks in advance.

Results all "O" on weibo

Your work is really amazing!
I am currently studying your code. When I try to train LEBERT on the weibo dataset, the predicted results are all "O", even though I haven't made any changes to your code. However, using the checkpoint you provide does give good results.
What could be the reason for this? How can I train the model myself to obtain checkpoints like the ones you provide?
I would really appreciate your help!

In the CRF, why do you add 2 to the tagset_size?

I can't understand the following code. Why do you add 2 to self.tagset_size, and what are START_TAG and STOP_TAG?

        # We add 2 here, because of START_TAG and STOP_TAG
        # transitions (f_tag_size, t_tag_size), transition value from f_tag to t_tag
        init_transitions = torch.zeros(self.tagset_size + 2, self.tagset_size + 2)

which results in the following error:

File "E:/Paper/NER/LEBERT_original_code/LEBERT/Trainer.py", line 602, in <module>
    main()
  File "E:/Paper/NER/LEBERT_original_code/LEBERT/Trainer.py", line 579, in main
    train(model, args, train_dataset, dev_dataset, test_dataset, label_vocab, tb_writer)
  File "E:/Paper/NER/LEBERT_original_code/LEBERT/Trainer.py", line 382, in train
    metrics, _ = evaluate(model, args, test_dataset, label_vocab, global_step, description="Test", write_file=True)
  File "E:/Paper/NER/LEBERT_original_code/LEBERT/Trainer.py", line 463, in evaluate
    acc, p, r, f1, all_true_labels, all_pred_labels = seq_f1_with_mask(
  File "E:\Paper\NER\LEBERT_original_code\LEBERT\function\metrics.py", line 37, in seq_f1_with_mask
    tmp_pred.append(label_vocab.convert_id_to_item(all_pred_labels[i][j]).replace("M-", "I-"))
  File "E:\Paper\NER\LEBERT_original_code\LEBERT\feature\vocab.py", line 81, in convert_id_to_item
    return self.idx2item[id]
IndexError: list index out of range

When I modified it to init_transitions = torch.zeros(self.tagset_size, self.tagset_size), it works.
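
For context, many CRF layers keep two virtual tags, START and STOP, so that the transition from the beginning of a sequence and the transition into the end of a sequence can be scored; that is why the transition matrix is (tagset_size + 2) x (tagset_size + 2). A generic sketch of the convention (illustrative, not the repo's exact implementation):

    import torch

    tagset_size = 30                    # real labels use ids 0 .. 29
    START_TAG = tagset_size             # virtual id 30, only ever a "from" state
    STOP_TAG = tagset_size + 1          # virtual id 31, only ever a "to" state
    init_transitions = torch.zeros(tagset_size + 2, tagset_size + 2)
    # a path y_1 .. y_n is scored with transitions[START_TAG, y_1] and transitions[y_n, STOP_TAG];
    # Viterbi decoding should therefore never emit ids >= tagset_size

If decoded ids equal to tagset_size or tagset_size + 1 ever reach convert_id_to_item, you get exactly the IndexError shown above.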

word_embedding.txt

FileNotFoundError: [Errno 2] No such file or directory: 'data/embedding/word_embedding.txt'

Where can I download this file "word_embedding.txt"? Or does "word_embedding.txt" simply replace "Tencent_AILab_ChineseEmbedding.tar.gz"?

Thank you very much!

AttributeError: 'tuple' object has no attribute 'hidden_states'

2021-09-19 20:23:44:INFO: ***** Running dev *****
2021-09-19 20:23:44:INFO:   Num examples = 271
2021-09-19 20:23:44:INFO:   Batch size = 16
dev:   0%|                                               | 0/17 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/root/python_proj/LEBERT/Trainer.py", line 601, in <module>
    main()
  File "/root/python_proj/LEBERT/Trainer.py", line 584, in main
    eval_output, _ = evaluate(model, args, dev_dataset, label_vocab, global_steps, "dev", write_file=True)
  File "/root/python_proj/LEBERT/Trainer.py", line 444, in evaluate
    outputs = model(**inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/python_proj/LEBERT/wcbert_modeling.py", line 498, in forward
    outputs = self.bert(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/python_proj/LEBERT/wcbert_modeling.py", line 463, in forward
    hidden_states=encoder_outputs.hidden_states,
AttributeError: 'tuple' object has no attribute 'hidden_states'

Process finished with exit code 1

Running the code reports that the tuple has no hidden_states. My transformers version is 4.10.2 — could this be a transformers version issue?
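
For context, in transformers 4.x the encoder returns a ModelOutput object (with a .hidden_states attribute) only when return_dict=True; otherwise it returns a plain tuple, which is what the AttributeError above indicates. One option is to install the modified transformers version the repo was written against (4.7.0.dev0 appears in another issue in this thread). A hedged, defensive sketch for the call site (the exact variable names in wcbert_modeling.py are an assumption on my part):

    # handle both tuple-style and ModelOutput-style returns from the encoder
    if isinstance(encoder_outputs, tuple):
        sequence_output = encoder_outputs[0]
        # with output_hidden_states=True, a tuple return also carries the hidden states
        hidden_states = encoder_outputs[1] if len(encoder_outputs) > 1 else None
    else:
        sequence_output = encoder_outputs.last_hidden_state
        hidden_states = encoder_outputs.hidden_states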

IndexError: list index out of range

Hello, when reproducing the code on a single GPU, I get the error IndexError: list index out of range during evaluation:
test: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:03<00:00, 4.32it/s]
Traceback (most recent call last):
File "Trainer.py", line 597, in <module>
main()
File "Trainer.py", line 591, in main
eval_output, _ = evaluate(model, args, test_dataset, label_vocab, global_steps, "test", write_file=True)
File "Trainer.py", line 464, in evaluate
all_label_ids, all_predict_ids, all_attention_mask, label_vocab)
File "/root/autodl-tmp/function/metrics.py", line 37, in seq_f1_with_mask
tmp_pred.append(label_vocab.convert_id_to_item(all_pred_labels[i][j]).replace("M-", "I-"))
File "/root/autodl-tmp/feature/vocab.py", line 81, in convert_id_to_item
return self.idx2item[id]
IndexError: list index out of range

But I can't find the error in the code. How should I handle this situation?

Question about do_predict

I'm not very familiar with torch. When only do_predict is run, I couldn't find any load_state_dict-style operation, so how does the model make predictions? Could you give me a few pointers, thanks. Should the author add model.load_state_dict(torch.load(model_name)) to the do_predict branch?
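
A minimal sketch of what is being suggested, assuming a prediction-only run should first restore a saved checkpoint (the checkpoint path below is hypothetical; use whatever --output_dir and --save_steps actually produced):

    import torch

    # hypothetical checkpoint path for illustration
    checkpoint_path = "data/result/NER/weibo/lebertcrf/checkpoint-600/pytorch_model.bin"
    state_dict = torch.load(checkpoint_path, map_location=args.device)
    model.load_state_dict(state_dict)
    model.to(args.device)
    model.eval()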

Question about the bilinear attention

Why do you use bilinear attention?
The word embeddings are already dimension-aligned with the character embeddings through a non-linear transformation, so what was the consideration behind using bilinear attention?

batch size

Why do you set the batch size to 4 or 20? Have you tried larger values, such as 64 or 128?

A line of code I don't understand

Where you compute the bi-linear attention in the class BertLayer, as in the following code:

    if self.has_word_attn:
        assert input_word_mask is not None

        # transform
        # paper's (W_1 * X + b_1) X:(N,L,W,D) W_1:(D,d_c) => word_outputs (N,L,W,H)
        word_outputs = self.word_transform(input_word_embeddings)  # [N, L, W, D]
        # paper's tanh(W_1 * X + b_1) => word_outputs(N,L,W,d_c)
        word_outputs = self.act(word_outputs)
        # paper's W_2 * (tanh(W_1 * X + b_1)) + b_2 , W_2(d_c,d_c) => word_outputs(N,L,W,H)
        word_outputs = self.word_word_weight(word_outputs)
        word_outputs = self.dropout(word_outputs)

        # attention_output = attention_output.unsqueeze(2) # [N, L, D] -> [N, L, 1, D]
        # W_attn: the weight matrix of bilinear attention: (d_c, d_c)
        # layer_output.unsqueeze(2) -> (batch_size, seq_length, 1, hidden_size) => (N,L,1,H)
        # alpha => (batch_size, seq_length, 1, hidden_size): (N,L,1,H)
        alpha = torch.matmul(layer_output.unsqueeze(2), self.attn_W)  # [N, L, 1, H]
        # word_outputs:(N,L,W,H)  transpose(word_outputs, 2, 3) -> (N,L,H,W)
        alpha = torch.matmul(alpha, torch.transpose(word_outputs, 2, 3))  # [N, L, 1, W], bi-linear transform end.
        alpha = alpha.squeeze()  # [N, L, W]
        alpha = alpha + (1 - input_word_mask.float()) * (-10000.0)
        alpha = torch.nn.Softmax(dim=-1)(alpha)  # [N, L, W]
        alpha = alpha.unsqueeze(-1)  # [N, L, W, 1]
        weighted_word_embedding = torch.sum(word_outputs * alpha, dim=2)  # [N, L, D]
        layer_output = layer_output + weighted_word_embedding

        layer_output = self.dropout(layer_output)
        layer_output = self.fuse_layer_norm(layer_output)

I don't understand this line:
alpha = alpha + (1 - input_word_mask.float()) * (-10000.0)
Can you give me an explanation?
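
For reference, that line is the standard additive-mask trick applied before a softmax: slots where input_word_mask is 0 (padded word positions) get -10000 added to their scores, so after the softmax their attention weights are effectively zero and padded words do not contribute to the weighted sum. A small illustrative example (values are made up):

    import torch

    scores = torch.tensor([2.0, 1.0, 0.5, 0.3])   # bilinear scores for 4 word slots
    mask = torch.tensor([1.0, 1.0, 0.0, 0.0])      # last two slots are padding
    masked = scores + (1 - mask) * (-10000.0)       # padded slots drop to about -10000
    weights = torch.softmax(masked, dim=-1)
    print(weights)  # approximately [0.7311, 0.2689, 0.0000, 0.0000]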

Reproducing the LEBERT experiments

Following the idea of the LEBERT paper, I reproduced the experiments. LEBERT basically reaches the metrics reported in the paper, although my reproduced BERT baseline is somewhat better than the one in the paper. LEBERT is only 0.5-1.0 points above BERT on the four datasets, which is not a very large gain; I am not sure whether this is related to the quality of the word embeddings. The reproduction details are here:
https://github.com/yangjianxin1/LEBERT-NER-Chinese

Can a single GPU use DDP (distributed data parallel)?

First, I want to say that your code runs fine when I don't use DDP, but with DDP it doesn't work, and I only have one GPU.
I noticed this in your shell scripts:

CUDA_VISIBLE_DEVICES=5 python3 -m torch.distributed.launch --master_port 13517 --nproc_per_node=1
I understand this is the DDP way to launch GPU training.

So, can I train the model with DDP on a single GPU? Why does it work for you?

Problem running the NER code

Hello,
First of all, thanks for open-sourcing the code!
I ran into some problems while trying to reproduce the paper's results, and I hope you can spare some time to answer them.
1. When running the weiboNER experiment with the same hyperparameters as in the paper, the training loss behaves abnormally (it oscillates while decreasing, and the dev/test F1 is 0 for the first few epochs). The training log has been sent to you by email.
2. Environment and setup: GPU: A100-SXM4-40GB; torch: 1.8.1+cu111; training: single GPU.
Looking forward to your reply and guidance, thanks!

Trainer does not support the LEBertCRF_Token model type

When I prepare all the datasets and run ./run_ner.sh, it throws this error:

Traceback (most recent call last):
  File "Trainer.py", line 593, in <module>
    main()
  File "Trainer.py", line 552, in main
    model = model.cuda()
UnboundLocalError: local variable 'model' referenced before assignment
Traceback (most recent call last):
  File "/home/human/miniconda3/envs/qzqExp/lib/python3.7/runpy.py", line 193, in _run_module_as_main

After checking the code:

https://github.com/liuwei1206/LEBERT/blob/main/Trainer.py#L537-L550

there is no 'LEBertCRF_Token' branch in Trainer.py.
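
For reference, the UnboundLocalError means none of the model_type branches in main() matched, so the local variable model was never assigned. A hedged sketch of a dispatch that fails loudly instead (the class name is a placeholder, not necessarily the one Trainer.py imports; WCBertCRF_Token is the type used elsewhere in this thread):

    # hypothetical mapping for illustration; fill in the classes Trainer.py actually imports
    MODEL_CLASSES = {
        "WCBertCRF_Token": WCBertCRFForTokenClassification,
    }
    if args.model_type not in MODEL_CLASSES:
        # fail loudly instead of letting `model` stay undefined
        raise ValueError("Unsupported --model_type: {}".format(args.model_type))
    model = MODEL_CLASSES[args.model_type].from_pretrained(
        args.model_name_or_path,
        num_labels=label_vocab.get_item_size(),
    )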

Reproducing the Note4 (OntoNotes 4) results

@liuwei1206 Hello, my script follows the note4 shell script you provided:

CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --master_port 13017 --nproc_per_node=1 \
    Trainer.py --do_train --do_eval --do_predict --evaluate_during_training \
    --data_dir="data/dataset/NER/note4" \
    --output_dir="data/result/NER/note4/wcbertcrf" \
    --config_name="data/berts/bert/config.json" \
    --model_name_or_path="/home/root1/lizheng/pretrainModels/torch/chinese/bert-base-chinese/pytorch_model.bin" \
    --vocab_file="/home/root1/lizheng/pretrainModels/torch/chinese/bert-base-chinese/vocab.txt" \
    --word_vocab_file="data/vocab/tencent_vocab.txt" \
    --max_scan_num=1500000 \
    --max_word_num=5 \
    --label_file="data/dataset/NER/note4/labels.txt" \
    --word_embedding="data/embedding/word_embedding.txt" \
    --saved_embedding_dir="data/dataset/NER/note4" \
    --model_type="WCBertCRF_Token" \
    --seed=106524 \
    --per_gpu_train_batch_size=4 \
    --per_gpu_eval_batch_size=32 \
    --learning_rate=1e-5 \
    --max_steps=-1 \
    --max_seq_length=256 \
    --num_train_epochs=20 \
    --warmup_steps=190 \
    --save_steps=600 \
    --logging_steps=300

But the test F1 is only around 80. Could the difference be because you trained on multiple GPUs while I trained on a single GPU? Could you also check whether the script above is correct?

Problems running the code on a Mac with an Intel CPU

Hello, thank you very much for open-sourcing this code.
When I run your code on a MacBook Pro 16 (16-inch, 2019, Intel Core i7) with the NER/weibo checkpoint and data, I get an error (see the attached screenshot).
Is there any solution? I look forward to your reply.

For LEBERT NER tasks, how do I set the "model_type" parameter?

I'm planning to do Chinese NER with the LEBERT model.
From my understanding I have to set model_type="LEBertCRF_Token" to train the LEBERT model, but I got an error: "UnboundLocalError: local variable 'model' referenced before assignment". Is there anything else I have to change? Any example or lead would be really helpful.

IndexError: list index out of range

Hello, and thanks for open-sourcing the code. When I run do_evaluate and do_predict, I always get the error below, whether with the data you provide or with my own data. I don't know what causes this problem and would like to ask for your advice.

Traceback (most recent call last):
File "Trainer.py", line 598, in <module>
main()
File "Trainer.py", line 574, in main
train(model, args, train_dataset, dev_dataset, test_dataset, label_vocab, tb_writer)
File "Trainer.py", line 377, in train
metrics, _ = evaluate(model, args, dev_dataset, label_vocab, global_step, description="Dev", write_file=True)
File "Trainer.py", line 465, in evaluate
all_label_ids, all_predict_ids, all_attention_mask, label_vocab)
File "LEBERT/function/metrics.py", line 40, in seq_f1_with_mask
tmp_pred.append(label_vocab.convert_id_to_item(all_pred_labels[i][j]).replace("M-", "I-"))
File "LEBERT/feature/vocab.py", line 84, in convert_id_to_item
return self.idx2item[id]
IndexError: list index out of range

Problems running the shell script after modifying the transformers source code

/opt/conda/lib/python3.6/site-packages/transformers-4.7.0.dev0-py3.6.egg/transformers/tokenization_utils_base.py:1631: FutureWarning: Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated and won't be possible anymore in v5. Use a model identifier or the path to a directory instead.
FutureWarning,
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'Trainer.py', '--local_rank=0', '--do_train', '--do_eval', '--do_predict', '--evaluate_during_training', '--data_dir=data/dataset/NER/weibo', '--output_dir=data/result/NER/weibo/lebertcrf', '--config_name=data/berts/bert/config.json', '--model_name_or_path=data/berts/bert/pytorch_model.bin', '--vocab_file=data/berts/bert/vocab.txt', '--word_vocab_file=data/vocab/tencent_vocab.txt', '--max_scan_num=100', '--max_word_num=5', '--label_file=data/dataset/NER/weibo/labels.txt', '--word_embedding=data/embedding/word_embedding.txt', '--saved_embedding_dir=data/dataset/NER/weibo', '--model_type=LEBertCRF_Token', '--seed=106524', '--per_gpu_train_batch_size=4', '--per_gpu_eval_batch_size=4', '--learning_rate=1e-5', '--max_steps=-1', '--max_seq_length=256', '--num_train_epochs=2', '--warmup_steps=190', '--save_steps=600', '--logging_steps=600']' died with <Signals.SIGKILL: 9>.

labels.txt for ontonote4

Hello, I would like to run the model you trained on ontonote4, but I don't know the label ids. Could you provide the labels.txt file?

When running with fp16, some variables are not defined.

    if args.fp16 and _use_native_amp:
        scaler.scale(loss).backward()

You may have forgotten to define scaler; it can be defined as follows:
scaler = torch.cuda.amp.GradScaler()
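
For completeness, a minimal sketch of the native-AMP pattern that snippet belongs to (generic PyTorch usage with illustrative names, not the repo's exact training loop):

    import torch

    scaler = torch.cuda.amp.GradScaler()        # create once, before the training loop

    for batch in train_dataloader:              # illustrative loop
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(**batch)[0]
        scaler.scale(loss).backward()           # scaled backward pass
        scaler.step(optimizer)                  # unscales gradients, then optimizer.step()
        scaler.update()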

AttributeError: 'BertConfig' object has no attribute 'add_layers'

Hi, I recently came across your paper and ran into the following problem while reproducing it. I am running on a single machine; any guidance would be appreciated.

2021-07-02 02:46:18:INFO: Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
2021-07-02 02:46:18:INFO: Training/evaluation parameters Namespace(adam_epsilon=1e-08, config_name='data/berts/bert/config.json', data_dir='data/dataset/NER/weibo', default_label='O', device=device(type='cpu'), do_eval=True, do_predict=True, do_shuffle=True, do_train=True, evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, label_file='data/dataset/NER/weibo/labels.txt', learning_rate=1e-05, local_rank=-1, logging_dir='data/log', logging_steps=100, max_grad_norm=1.0, max_scan_num=1000000, max_seq_length=256, max_steps=-1, max_word_num=5, model_name_or_path='data/berts/bert/pytorch_model.bin', model_type='WCBertCRF_Token', n_gpu=0, no_cuda=False, nodes=1, num_train_epochs=20, output_dir='data/dataset/NER/output', overwrite_cache=True, per_gpu_eval_batch_size=16, per_gpu_train_batch_size=4, save_steps=600, save_total_limit=50, saved_embedding_dir='data/dataset/NER/weibo', seed=106524, sgd_momentum=0.9, vocab_file='data/berts/bert/vocab.txt', warmup_steps=190, weight_decay=0.0, word_embed_dim=200, word_embedding='data/Tencent_AILab_ChineseEmbedding.txt', word_vocab_file='data/tencent_vocab.txt')
['data/tencent_vocab.txt']
Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated
Traceback (most recent call last):
File "Trainer.py", line 597, in <module>
main()
File "Trainer.py", line 547, in main
num_labels=label_vocab.get_item_size())
File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_utils.py", line 947, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/datadrive/LEBERT/wcbert_modeling.py", line 466, in __init__
self.bert = WCBertModel(config)
File "/datadrive/LEBERT/wcbert_modeling.py", line 324, in __init__
self.encoder = BertEncoder(config)
File "/datadrive/LEBERT/wcbert_modeling.py", line 207, in __init__
self.add_layers = config.add_layers
AttributeError: 'BertConfig' object has no attribute 'add_layers'
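
For context, the stack trace shows wcbert_modeling.py reading a custom field config.add_layers that a stock bert-base config.json does not contain, so the config shipped with the repo (or a patched one) is needed. A hedged sketch of patching it in Python (the value shown is purely an assumption about the expected format, not the repo's documented setting):

    from transformers import BertConfig

    config = BertConfig.from_json_file("data/berts/bert/config.json")
    if not hasattr(config, "add_layers"):
        # assumption: add_layers lists the Transformer layers that get the lexicon adapter
        config.add_layers = [1]
    config.save_pretrained("data/berts/bert")   # writes the field back into config.json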
