
KeBioLM

Improving Biomedical Pretrained Language Models with Knowledge. Accepted by BioNLP 2021. Paper: https://arxiv.org/abs/2104.10344

Introduction

KeBioLM: Knowledge enhanced Biomedical pretrained Language Model

KeBioLM first applies a text-only encoding layer to learn entity representations, then applies a text-entity fusion encoding layer to aggregate them. KeBioLM is trained with three pretraining tasks (a minimal sketch of the masking strategy follows the list):

  • Masked Language Model: extends whole word masking to whole entity masking.
  • Entity Detection: predicts B/I/O tags for NER.
  • Entity Linking: links predicted entities to UMLS.
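
As a rough illustration, the sketch below (not the authors' code) masks every wordpiece of a linked entity together, which is what "whole entity masking" means in practice. The token list and entity spans are hypothetical inputs.

import random

def whole_entity_mask(tokens, entity_spans, mask_token="[MASK]", mask_prob=0.15):
    # entity_spans: list of (start, end) token-index pairs marking linked entities.
    masked = list(tokens)
    for start, end in entity_spans:
        # Decide once per entity, then mask all of its wordpieces together.
        if random.random() < mask_prob:
            for i in range(start, end):
                masked[i] = mask_token
    return masked

# Masks the whole 5-wordpiece entity span, never a fragment of it.
print(whole_entity_mask(["acute", "my", "##elo", "##id", "leukemia"], [(0, 5)], mask_prob=1.0))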

Pre-trained Model

You can download our model from Google Drive. The release contains the pre-trained weights pytorch_model.bin, a tokenizer vocabulary vocab.txt (identical to PubMedBERT's), and an entity dictionary entity.jsonl used for the entity linking task during pretraining.
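
Since entity.jsonl stores one JSON object per line, it can be inspected with a few lines of Python; the fields inside each object are not documented here, so treat whatever prints as the ground truth rather than this sketch.

import json

# Peek at the first entry of the entity dictionary.
with open("entity.jsonl", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        print(entry)
        break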

Environment

All code is tested under Python 3.7, PyTorch 1.7.0, and Transformers 3.4.0.
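
For example, assuming a Python 3.7 environment is already active, matching library versions can be installed with pip:

pip install torch==1.7.0 transformers==3.4.0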

Fine-tune KeBioLM for NER and RE

BLURB Dataset

Download the BLURB dataset from here.

NER

For example, to fine-tune on the BC5CDR-disease dataset:

cd ner
CUDA_VISIBLE_DEVICES=0 python \
run_ner.py \
--data_dir $BC5CDR_DATASET \
--model_name_or_path $KEBIOLM_CHECKPOINT_PATH \
--output_dir $OUTPUT_DIR \
--num_train_epochs 60 \
--do_train --do_eval --do_predict --overwrite_output_dir \
--gradient_accumulation_steps 2 \
--learning_rate 3e-5 \
--warmup_steps 1710 \
--evaluation_strategy epoch \
--max_seq_length 512 \
--per_device_train_batch_size 8 \
--eval_accumulation_steps 1 \
--load_best_model_at_end --metric_for_best_model f1

To fine-tune on your own task, prepare train.tsv, dev.tsv and test.tsv in the same folder. If your task contains tags beyond B, I and O, such as B-disease and I-disease, also provide a label file that lists one label per line:

O
B-disease
I-disease
B-symptom
I-symptom

Pass this label file with --labels $label_file.
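
If you need to generate such a label file, a minimal sketch like the following works, assuming a CoNLL-style train.tsv with one token<TAB>tag pair per line and blank lines between sentences (the output name labels.txt is just an example):

# Collect the set of tags from the last tab-separated column.
labels = set()
with open("train.tsv", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            labels.add(line.split("\t")[-1])

# Write one label per line, as run_ner.py expects.
with open("labels.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(labels)) + "\n")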

RE

To fine-tune on the DDI dataset:

cd re
CUDA_VISIBLE_DEVICES=0 python \
run.py \
--task_name ddi \
--data_dir $DDI_DATASET \
--model_name_or_path $KEBIOLM_CHECKPOINT_PATH \
--output_dir $OUTPUT_DIR \
--num_train_epochs 60 \
--do_train --do_eval --do_predict --overwrite_output_dir \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--warmup_steps 9486 \
--evaluation_strategy epoch \
--max_seq_length 256 \
--per_device_train_batch_size 16 \
--eval_accumulation_steps 1 \
--load_best_model_at_end --metric_for_best_model f1

Since the DDI, ChemProt and GAD datasets have different formats and labels, you should specify --task_name as ddi, chemprot or gad accordingly.

Hyperparameters for BLURB dataset

To reproduce the results of our paper, try training models with the following hyperparameters. All models are trained for 60 epochs with linear warmup over the first 10% of steps.

Dataset     Learning rate   Sequence length   Batch size   Gradient accumulation
BC5chem     3e-5            512               8            2
BC5dis      1e-5            512               8            2
NCBI        1e-5            512               8            2
BC2GM       3e-5            512               8            2
JNLPBA      1e-5            512               8            2
ChemProt    1e-5            256               16           1
DDI         1e-5            256               16           1
GAD         1e-5            128               16           1

The BC2GM, GAD and DDI datasets have relatively high variance; you can try different seeds by setting --seed $seed_number.

UMLS Knowledge Probing

For a relation triple (s, r, o) in UMLS, we generate two queries: [CLS] [MASK] r o [SEP] and [CLS] s r [MASK] [SEP], and ask language models to restore the masked entities. We collect 143,771 queries covering 922 relation types.
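
Schematically, query construction looks like the sketch below (an illustration, not the repository's code). In practice the masked entity may span several wordpieces, so decoding searches over [MASK] span lengths; the triple used here is hypothetical.

def make_queries(s, r, o, n_masks=1):
    # The masked entity is represented by one or more [MASK] tokens.
    mask_span = " ".join(["[MASK]"] * n_masks)
    head_query = f"[CLS] {mask_span} {r} {o} [SEP]"  # recover the subject
    tail_query = f"[CLS] {s} {r} {mask_span} [SEP]"  # recover the object
    return head_query, tail_query

print(make_queries("aspirin", "may_treat", "headache", n_masks=2))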

Rebuild the dataset

To rebuild our dataset for UMLS knowledge probing, you need the UMLS 2020AA release. After installing UMLS 2020AA, you should have a folder $UMLS_DIR containing MRCONSO.RRF, MRREL.RRF and MRSTY.RRF. Use probe/build.py to rebuild the probing dataset dataset.txt based on UMLS LUIs and relations.

cd probe
python build.py $UMLS_DIR

The rebuilding process takes about 10 minutes.

Probing

The default probing setting is max_length_of_[MASK] = 10 and beam_width = 5. To probe dataset.txt with SciBERT (or any other BERT-based language model):

cd probe
python beam_batch_decode.py $SCIBERT_PATH dataset.txt

To probe dataset.txt for KeBioLM:

cd probe
python beam_batch_decode.py $KEBIOLM_CHECKPOINT_PATH dataset.txt

Probing with beam_width = 5 takes a very long time (over one day on a V100); you may split dataset.txt and decode the shards on multiple GPUs in parallel, as sketched below.
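
A minimal way to split the file into shards, one per GPU, is the following sketch; the shard naming scheme is just an example.

def split_file(path, n_shards):
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    # Round-robin the queries so shards have roughly equal size.
    for i in range(n_shards):
        with open(f"{path}.shard{i}", "w", encoding="utf-8") as out:
            out.writelines(lines[i::n_shards])

split_file("dataset.txt", 4)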

To evaluate the probing results, run:

cd probe
python metric.py $predict_file dataset.txt $UMLS_DIR

Citation

@inproceedings{yuan-etal-2021-improving,
    title = "Improving Biomedical Pretrained Language Models with Knowledge",
    author = "Yuan, Zheng  and
      Liu, Yijia  and
      Tan, Chuanqi  and
      Huang, Songfang  and
      Huang, Fei",
    booktitle = "Proceedings of the 20th Workshop on Biomedical Language Processing",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.bionlp-1.20",
    doi = "10.18653/v1/2021.bionlp-1.20",
    pages = "180--190",
    abstract = "Pretrained language models have shown success in many natural language processing tasks. Many works explore to incorporate the knowledge into the language models. In the biomedical domain, experts have taken decades of effort on building large-scale knowledge bases. For example, UMLS contains millions of entities with their synonyms and defines hundreds of relations among entities. Leveraging this knowledge can benefit a variety of downstream tasks such as named entity recognition and relation extraction. To this end, we propose KeBioLM, a biomedical pretrained language model that explicitly leverages knowledge from the UMLS knowledge bases. Specifically, we extract entities from PubMed abstracts and link them to UMLS. We then train a knowledge-aware language model that firstly applies a text-only encoding layer to learn entity representation and then applies a text-entity fusion encoding to aggregate entity representation. In addition, we add two training objectives as entity detection and entity linking. Experiments on the named entity recognition and relation extraction tasks from the BLURB benchmark demonstrate the effectiveness of our approach. Further analysis on a collected probing dataset shows that our model has better ability to model medical knowledge.",
}
