ERICA

Source code and dataset for the ACL 2021 paper "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning".

The code is based on Hugging Face's transformers. The trained models and pre-training data can be downloaded from Google Drive.

Quick Start

You can quickly run our code with the following steps:

  • Install dependencies as described in the following section.
  • cd into the pretrain or finetune directory, then download and pre-process the data for pre-training or fine-tuning.

1. Dependencies

Run the following script to install dependencies.

pip install -r requirement.txt

You need to install transformers and apex manually.

transformers: We use Hugging Face transformers (version 2.5.0) to implement BERT and RoBERTa. For convenience, we have included a copy of transformers in code/pretrain/ so you can import it easily, and we have modified a few lines of the BertForMaskedLM class in src/transformers/modeling_bert.py while keeping the rest of the code unchanged.

You just need to run

pip install .

to install transformers manually.

apex: Install apex following the official guide.

Process pre-training data

The folder prepare_pretrain_data contains the code for processing the pre-training data.

2. Pretraining

To pretrain ERICA_bert:

cd code/pretrain

python -m torch.distributed.launch --nproc_per_node 8  main.py  \
    --model DOC  --lr 3e-5 --batch_size_per_gpu 16 --max_epoch 105  \
    --gradient_accumulation_steps 16    --save_step 500  --temperature 0.05  \
    --train_sample  --save_dir ckpt_doc_dw_f_alpha_1_uncased --n_gpu 8  --debug 1  --add_none 1 \
    --alpha 1 --flow 0 --dataset_name none.json  --wiki_loss 1 --doc_loss 1 \
    --change_dataset 1  --start_end_token 0 --bert_model bert \
    --pretraining_size -1 --ablation 0 --cased 0

Some explanations of the hyper-parameters:

  • temperature: the \tau used in the contrastive-learning loss (a minimal sketch of such a loss is given after this list);
  • debug: whether to run in debug mode (we provide an example_debug file for pre-training);
  • add_none: whether to add no_relation pairs in the RD loss;
  • alpha: the proportion of masking (1 means no masking; in our experiments we found masking unhelpful, as described in the main paper, so none of our models use masking during pre-training, but we keep the option for further research);
  • flow: if masking is used, whether to apply a linear decay;
  • wiki_loss: whether to add the ED loss;
  • doc_loss: whether to add the RD loss;
  • start_end_token: use an alternative entity-encoding method;
  • cased: whether to use the cased version of BERT.
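
For reference, here is a minimal sketch of a temperature-scaled contrastive (InfoNCE-style) loss. It only illustrates how the temperature enters the loss; it is not the exact implementation in this repository, and all names in it are made up.

import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    # anchor: (d,), positive: (d,), negatives: (K, d)
    # Similarities between the anchor and the positive plus the K negatives.
    candidates = torch.cat([positive.unsqueeze(0), negatives], dim=0)    # (K+1, d)
    sims = F.cosine_similarity(anchor.unsqueeze(0), candidates, dim=-1)  # (K+1,)
    # Dividing by the temperature sharpens the softmax; the positive sits at index 0.
    logits = (sims / temperature).unsqueeze(0)                           # (1, K+1)
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits, target)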

3. Fine-tuning

Enter the folder for each downstream task (document-level / sentence-level relation extraction, entity typing, and question answering) to fine-tune. Before fine-tuning, we assume you have already pre-trained an ERICA model. Execute the bash script in each folder to reproduce our results.

erica's Issues

Loss for entities

Thanks for your interesting work!

When I went through the code, I didn't find the data preparation and the loss for entities (pre-training) described in the paper.
Could you help me find the related code? Thanks.

Problem with fine-tuning on the DocRED dataset

| epoch 19 | step 9700 | ms/b 35103.27 | train loss 0.00095783 | NA acc: 0.99 | not NA acc: 0.95 | tot acc: 0.98
| epoch 19 | step 9750 | ms/b 1007.05 | train loss 0.00080489 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 9800 | ms/b 1034.59 | train loss 0.00080026 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 9850 | ms/b 1064.13 | train loss 0.00079512 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 9900 | ms/b 1235.63 | train loss 0.00084200 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 9950 | ms/b 1050.25 | train loss 0.00081005 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 10000 | ms/b 1057.81 | train loss 0.00076923 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 10050 | ms/b 1031.88 | train loss 0.00076228 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 10100 | ms/b 1010.93 | train loss 0.00079324 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
| epoch 19 | step 10150 | ms/b 1004.57 | train loss 0.00081850 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98

dev set evaluation
ALL : Theta 0.9079 | F1 0.5720 | AUC 0.5670
Ignore ma_f1 0.5501 | input_theta 0.9079 test_result F1 0.5494 | AUC 0.5376
test set evaluation
ma_f1 0.0000 | input_theta 0.9079 test_result F1 0.0000 | AUC 0.0000
Ignore ma_f1 0.0000 | input_theta 0.9079 test_result F1 0.0000 | AUC 0.0000

Finish training
Best epoch = 19

I fine-tuned on the DocRED dataset following the guidelines, with num_train_epochs set to 20, and got the results above: F1 on the validation set is 0.5494, but all metrics on the test set are 0.

So why are all the test-set metrics 0?

The sentence-level RE fine-tuning code can't be reproduced successfully

Hi! Here is a problem I need your help with.
After running 'bash run.sh', the code hangs for a long time at main.py [line 125]:

loss, output = model(**inputs)

Namespace(adam_epsilon=1e-08, batch_size_per_gpu=32, ckpt_to_load='../pretrain/ckpt/ERICA_bert_uncased_RP', cuda='0', dataset='tacred', encoder='bert', entity_marker=True, gpu=device(type='cuda'), hidden_size=768, lr=3e-05, max_epoch=8, max_grad_norm=1, max_length=100, mode='CM', optim='adamw', seed=42, train_prop=1.0, warmup_steps=500, weight_decay=1e-05)
Use all train data!
pre process train.txt
The number of sentence in which tokenizer can't find head/tail entity is 0
pre process dev.txt
The number of sentence in which tokenizer can't find head/tail entity is 0
pre process test.txt
The number of sentence in which tokenizer can't find head/tail entity is 0
********* load from ckpt/../pretrain/ckpt/ERICA_bert_uncased_RP ***********
successful load ckpt
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Begin train...
We will train model in 17032 steps

I tried deleting [main.py, line 103] 'model = nn.DataParallel(model)' and adding 'model.cuda()', but got this error:

Traceback (most recent call last):
  File "main.py", line 311, in <module>
    train(args, model, train_dataloader, dev_dataloader, test_dataloader)
  File "main.py", line 131, in train
    loss, output = model(**inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/ERICA/finetune/Sent_level_RE/code/re/model.py", line 36, in forward
    outputs = self.bert(input_ids, mask)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_bert.py", line 753, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_bert.py", line 178, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select

Is it possible that the batch data [main.py, line 114] is stored on the CPU while the model is stored on the GPU?
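
If so, a likely fix (a sketch only; the actual variable names in main.py may differ) is to move every batch tensor onto the model's device before the forward pass:

import torch

def to_device(batch, device):
    # Move every tensor in the batch dict onto the given device; leave non-tensors untouched.
    return {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in batch.items()}

# Inside the training loop (names follow the traceback above):
#   device = next(model.parameters()).device
#   loss, output = model(**to_device(inputs, device))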

How are multiple positives in one text handled?

Hello, I would like to ask about the "Entity Discrimination" and "Relation Discrimination" tasks in the paper, which are trained with contrastive learning. Taking "Entity Discrimination" as an example, suppose there is a sentence: "A and B co-founded company C." This sentence contains two triples, (C, founded by, A) and (C, founded by, B), so for (C, founded by, ) both A and B are positives. However, in the contrastive-learning formula it looks like there is only one positive per text. Is my understanding correct? I look forward to your reply, thank you.
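
For context, the single-positive contrastive objective the question refers to has the general form below; when an anchor x_i has several positives, a common convention (illustrative only, not necessarily the paper's formulation) is to sum this term over every positive p in the positive set \mathcal{P}(i):

\mathcal{L}_i = -\sum_{p \in \mathcal{P}(i)} \log \frac{\exp\left(\mathrm{sim}(x_i, x_p)/\tau\right)}{\sum_{j \neq i} \exp\left(\mathrm{sim}(x_i, x_j)/\tau\right)}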

Scripts for preparing pre-training data

Hi, thank you for your great work. I am trying to reproduce the pre-training data, and possibly extend it to other languages. May I know how you extract all_triples and all_qs from the Wikipedia dump (latest-all.json)?
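
The authors' extraction scripts are not shown here, but as a rough starting point (a sketch only; the exact all_triples / all_qs formats are assumptions), entity-to-entity triples and their Q-ids can be streamed from latest-all.json, which is Wikidata's full entity dump:

import json

def iter_wikidata_triples(dump_path):
    # Stream (subject_Q, property_P, object_Q) triples from the Wikidata JSON dump.
    # The dump is one large JSON array with one entity per line.
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in "[]":
                continue
            entity = json.loads(line)
            subj = entity.get("id")
            for prop, statements in entity.get("claims", {}).items():
                for st in statements:
                    snak = st.get("mainsnak", {})
                    if snak.get("snaktype") != "value":
                        continue
                    value = snak.get("datavalue", {}).get("value", {})
                    if isinstance(value, dict) and value.get("entity-type") == "item":
                        yield subj, prop, value["id"]  # a (subject, property, object) triple of Wikidata IDs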
