ERICA

Source code and dataset for the ACL 2021 paper "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning".

The code is based on Hugging Face's transformers. The trained models and pre-training data can be downloaded from Google Drive.

Quick Start

You can quickly run our code with the following steps:

  • Install dependencies as described in the following section.
  • cd into the pretrain or finetune directory, then download and pre-process the data for pre-training or fine-tuning.

1. Dependencies

Run the following script to install dependencies.

pip install -r requirement.txt

You need to install transformers and apex manually.

transformers: We use Hugging Face transformers (version 2.5.0) to implement BERT and RoBERTa. For convenience, we have included a copy of transformers in code/pretrain/ so you can import it easily, and we have modified a few lines of the BertForMaskedLM class in src/transformers/modeling_bert.py while keeping the rest of the code unchanged.

You just need to run

pip install .

to install transformers manually.

apex: Install apex following the official guide.

Process pre-training data

The folder prepare_pretrain_data contains the code for processing the pre-training data.

2. Pretraining

To pretrain ERICA_bert:

cd code/pretrain

python -m torch.distributed.launch --nproc_per_node 8  main.py  \
    --model DOC  --lr 3e-5 --batch_size_per_gpu 16 --max_epoch 105  \
    --gradient_accumulation_steps 16    --save_step 500  --temperature 0.05  \
    --train_sample  --save_dir ckpt_doc_dw_f_alpha_1_uncased --n_gpu 8  --debug 1  --add_none 1 \
    --alpha 1 --flow 0 --dataset_name none.json  --wiki_loss 1 --doc_loss 1 \
    --change_dataset 1  --start_end_token 0 --bert_model bert \
    --pretraining_size -1 --ablation 0 --cased 0

Some explanations of the hyper-parameters:

  • temperature: the \tau used in the contrastive-learning loss (a minimal sketch of such a loss is given after this list);
  • debug: whether to run in debug mode (we provide an example_debug file for pre-training);
  • add_none: whether to add no_relation pairs in the RD loss;
  • alpha: the proportion of masking (1 means no masking; in our experiments we found masking unhelpful, as described in the main paper, so none of our models use masking during pre-training, but we keep the option for further research);
  • flow: if masking is used, whether to apply a linear decay;
  • wiki_loss: whether to add the ED loss;
  • doc_loss: whether to add the RD loss;
  • start_end_token: use an alternative entity-encoding method;
  • cased: whether to use the cased version of BERT.
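
For reference, here is a minimal sketch of a temperature-scaled contrastive (InfoNCE-style) loss. It only illustrates how the temperature enters the loss; it is not the exact implementation in this repository, and all names in it are made up.

import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    # anchor: (d,), positive: (d,), negatives: (K, d)
    # Similarities between the anchor and the positive plus the K negatives.
    candidates = torch.cat([positive.unsqueeze(0), negatives], dim=0)    # (K+1, d)
    sims = F.cosine_similarity(anchor.unsqueeze(0), candidates, dim=-1)  # (K+1,)
    # Dividing by the temperature sharpens the softmax; the positive sits at index 0.
    logits = (sims / temperature).unsqueeze(0)                           # (1, K+1)
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits, target)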

3. Fine-tuning

Enter the folder for each downstream task (document-level / sentence-level relation extraction, entity typing, and question answering) to fine-tune. Before fine-tuning, we assume you have already pre-trained an ERICA model. Execute the bash script in each folder to reproduce our results.

erica's Issues

Loss for entities

Thanks for your interesting work!

When I went through the code, I didn't find the data preparation and the loss for entities (pre-training) described in the paper.
Could you help me find the related code? Thanks.

Problem with fine-tuning on the DocRED dataset

| epoch 19 | step 9700 | ms/b 35103.27 | train loss 0.00095783 | NA acc: 0.99 | not NA acc: 0.95 | tot acc: 0.98
| epoch 19 | step 9750 | ms/b 1007.05 | train loss 0.00080489 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 9800 | ms/b 1034.59 | train loss 0.00080026 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 9850 | ms/b 1064.13 | train loss 0.00079512 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 9900 | ms/b 1235.63 | train loss 0.00084200 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 9950 | ms/b 1050.25 | train loss 0.00081005 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 10000 | ms/b 1057.81 | train loss 0.00076923 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 10050 | ms/b 1031.88 | train loss 0.00076228 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 10100 | ms/b 1010.93 | train loss 0.00079324 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 10150 | ms/b 1004.57 | train loss 0.00081850 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98

dev set evaluation
ALL : Theta 0.9079 | F1 0.5720 | AUC 0.5670
Ignore ma_f1 0.5501 | input_theta 0.9079 test_result F1 0.5494 | AUC 0.5376
test set evaluation
ma_f1 0.0000 | input_theta 0.9079 test_result F1 0.0000 | AUC 0.0000
Ignore ma_f1 0.0000 | input_theta 0.9079 test_result F1 0.0000 | AUC 0.0000

Finish training
Best epoch = 19

I fine-tuned on the DocRED dataset following the guidelines, with num_train_epochs set to 20, and got the results above: F1 on the validation set is 0.5494, but all metrics on the test set are 0.

So why are all the test-set metrics 0?

The sentence-level RE fine-tuning code can't be reproduced successfully

Hi! Here is a problem I need your help with.
After running 'bash run.sh', the code hangs for a long time at main.py [line 125]:

loss, output = model(**inputs)

Namespace(adam_epsilon=1e-08, batch_size_per_gpu=32, ckpt_to_load='../pretrain/ckpt/ERICA_bert_uncased_RP', cuda='0', dataset='tacred', encoder='bert', entity_marker=True, gpu=device(type='cuda'), hidden_size=768, lr=3e-05, max_epoch=8, max_grad_norm=1, max_length=100, mode='CM', optim='adamw', seed=42, train_prop=1.0, warmup_steps=500, weight_decay=1e-05)
Use all train data!
pre process train.txt
The number of sentence in which tokenizer can't find head/tail entity is 0
pre process dev.txt
The number of sentence in which tokenizer can't find head/tail entity is 0
pre process test.txt
The number of sentence in which tokenizer can't find head/tail entity is 0
********* load from ckpt/../pretrain/ckpt/ERICA_bert_uncased_RP ***********
successful load ckpt
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Begin train...
We will train model in 17032 steps

I tried deleting [main.py, line 103] 'model = nn.DataParallel(model)' and adding 'model.cuda()', but got this error:

Traceback (most recent call last):
  File "main.py", line 311, in <module>
    train(args, model, train_dataloader, dev_dataloader, test_dataloader)
  File "main.py", line 131, in train
    loss, output = model(**inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/ERICA/finetune/Sent_level_RE/code/re/model.py", line 36, in forward
    outputs = self.bert(input_ids, mask)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_bert.py", line 753, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_bert.py", line 178, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select

Is it possible that the batch data [main.py, line 114] is stored on the CPU while the model is stored on the GPU?
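
If so, a likely fix (a sketch only; the actual variable names in main.py may differ) is to move every batch tensor onto the model's device before the forward pass:

import torch

def to_device(batch, device):
    # Move every tensor in the batch dict onto the given device; leave non-tensors untouched.
    return {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch.items()}

# Inside the training loop (names follow the traceback above):
#   device = next(model.parameters()).device
#   loss, output = model(**to_device(inputs, device))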

How are multiple positives in one text handled?

Hello, I would like to ask about the "Entity Discrimination" and "Relation Discrimination" tasks in the paper, which are trained with contrastive learning. Taking "Entity Discrimination" as an example, suppose there is a sentence: "A and B co-founded company C." This sentence contains two triples, (C, founded by, A) and (C, founded by, B), so for (C, founded by, ) both A and B are positives. However, in the contrastive-learning formula it looks like there is only one positive per text. Is my understanding correct? I look forward to your reply, thank you.
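
For context, the single-positive contrastive objective the question refers to has the general form below; when an anchor x_i has several positives, a common convention (illustrative only, not necessarily the paper's formulation) is to sum this term over every positive p in the positive set \mathcal{P}(i):

\mathcal{L}_i = -\sum_{p \in \mathcal{P}(i)} \log \frac{\exp\left(\mathrm{sim}(x_i, x_p)/\tau\right)}{\sum_{j \neq i} \exp\left(\mathrm{sim}(x_i, x_j)/\tau\right)}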

Scripts for preparing pre-training data

Hi, thank you for your great work. I am trying to reproduce the pre-training data, and possibly extend it to other languages. May I know how you extract all_triples and all_qs from the Wikipedia dump (latest-all.json)?
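
The authors' extraction scripts are not shown here, but as a rough starting point (a sketch only; the exact all_triples / all_qs formats are assumptions), entity-to-entity triples and their Q-ids can be streamed from latest-all.json, which is Wikidata's full entity dump:

import json

def iter_wikidata_triples(dump_path):
    # Stream (subject_Q, property_P, object_Q) triples from the Wikidata JSON dump.
    # The dump is one large JSON array with one entity per line.
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in "[]":
                continue
            entity = json.loads(line)
            subj = entity.get("id")
            for prop, statements in entity.get("claims", {}).items():
                for st in statements:
                    snak = st.get("mainsnak", {})
                    if snak.get("snaktype") != "value":
                        continue
                    value = snak.get("datavalue", {}).get("value", {})
                    if isinstance(value, dict) and value.get("entity-type") == "item":
                        yield subj, prop, value["id"]  # a (subject, property, object) triple of Wikidata IDs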
