suamin / icd-bert

ICD-BERT: Multi-label Classification of ICD-10 Codes with BERT (CLEF 2019)

Home Page: http://ceur-ws.org/Vol-2380/paper_67.pdf

License: MIT License

Python 100.00%
multi-label-classification bert clef icd-10

icd-bert's Introduction

MLT-DFKI at CLEF eHealth Task 1: Multi-label Classification with BERT

Code for our submission at CLEF eHealth Task 1: Multilingual Information Extraction. For details, check here.

Requirements

If you're using the new transformers library, it is recommended to create a virtual environment, as this code was written against the older pytorch-pretrained-bert package (the two packages can coexist without issues):

pip install pytorch-pretrained-bert

For migration to the new library, look here. For the baseline experiments, install scikit-learn as well.

Data

Raw data can be found under exps-data/data/*.txt (provided by the task organizers).

Pre-processed data can be found under exps-data/data/{train, dev, test}_data.pkl as pickled files. English translations are also provided for reproducibility (the Google Translate API was used to obtain them).
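A minimal sketch of how one of the pickled splits might be loaded and inspected. The helper names `load_split` and `describe` are hypothetical and not part of the repository. The layout of the unpickled object is not documented here, so the sketch only reports its type and length instead of assuming a structure.

```python
import pickle
from pathlib import Path


def load_split(data_dir, split):
    """Load one pickled split ("train", "dev" or "test") from data_dir.

    Hypothetical helper: only the file naming ({split}_data.pkl) is taken
    from the README; nothing is assumed about the stored object.
    """
    if split not in {"train", "dev", "test"}:
        raise ValueError(f"unknown split: {split!r}")
    path = Path(data_dir) / f"{split}_data.pkl"
    with path.open("rb") as f:
        return pickle.load(f)


def describe(obj):
    """Summarize an unpickled object as (type name, length or None)."""
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size
```

For example, `describe(load_split("exps-data/data", "dev"))` shows what kind of container the dev split holds before any further processing. Only unpickle files from a source you trust.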

ICD-10 Metadata can be found under exps-data/codes_and_titles_{de, en}.txt, where each line is tab delimited as [ICD Code Description] \t [ICD Code].
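The tab-delimited layout described above can be read with a few lines of Python. This is a sketch only: the function names are hypothetical, and UTF-8 encoding is an assumption about the files.

```python
def parse_code_titles(lines):
    """Map ICD code -> description from lines of the form
    "<description><TAB><code>".
    """
    code_to_title = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so a description containing tabs stays intact.
        title, sep, code = line.rpartition("\t")
        if not sep:
            raise ValueError(f"expected a tab-delimited line, got: {line!r}")
        code_to_title[code.strip()] = title.strip()
    return code_to_title


def load_code_titles(path):
    """Read a codes_and_titles_{de, en}.txt file (UTF-8 is assumed)."""
    with open(path, encoding="utf-8") as f:
        return parse_code_titles(f)
```

Keying the dictionary by code rather than by description makes it easy to look up the title of a predicted label.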

Pre-trained Models

For static word embeddings, we used English and German vectors provided by fastText. For domain specific vectors, we used PubMed word2vec (only for English).

For contextualized word embeddings, we used BERT-base-cased and BioBERT for English and Multilingual-BERT-base-cased for German.

Store all the models under a directory MODELS.

Running BERT Models

Set the path to the model (e.g. BioBERT):

export BERT_MODEL=$MODELS/pubmed_pmc_470k

Convert TF checkpoint to PyTorch model

This script is provided by the transformers library, but it may have changed in newer versions, so it is recommended to use the one installed with pytorch-pretrained-bert:

python convert_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path $BERT_MODEL/biobert_model.ckpt \
    --bert_config_file $BERT_MODEL/bert_config.json \
    --pytorch_dump_path $BERT_MODEL/pytorch_model.bin

Fine-tune the model

Configure the paths:

export DATA_DIR=exps-data/data
export BERT_EXPS_DIR=tmp/bert-exps-dir
export CUDA_VISIBLE_DEVICES=0,1,2,3

Run the model:

python bert_multilabel_run_classifier.py \
    --data_dir $DATA_DIR \
    --use_data en \
    --bert_model $BERT_MODEL \
    --task_name clef \
    --output_dir $BERT_EXPS_DIR/output \
    --cache_dir $BERT_EXPS_DIR/cache \
    --max_seq_length 256 \
    --num_train_epochs 20.0 \
    --do_train \
    --do_eval \
    --train_batch_size 64

Results for the English BERT models (BioBERT, BERT-base-cased) can be reproduced with 20 epochs; multilingual BERT needs 25 epochs.
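For readers new to the multi-label setting, the sketch below shows one common formulation: each document gets a multi-hot target vector over the ICD codes, the classifier emits one logit per code, training minimizes binary cross-entropy, and every code whose sigmoid probability reaches a threshold is predicted. It uses only the standard library. Whether bert_multilabel_run_classifier.py uses exactly this loss and a 0.5 threshold is an assumption, and all function names are hypothetical.

```python
import math


def multi_hot(labels, label_list):
    """Encode a collection of ICD codes as a 0/1 vector over label_list."""
    index = {code: i for i, code in enumerate(label_list)}
    vec = [0.0] * len(label_list)
    for code in labels:
        vec[index[code]] = 1.0
    return vec


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def predict_codes(logits, label_list, threshold=0.5):
    """Return every code whose sigmoid probability reaches the threshold."""
    return [c for z, c in zip(logits, label_list) if sigmoid(z) >= threshold]
```

Unlike a softmax classifier, this treats each code as an independent yes/no decision, so a document can receive zero, one, or many codes.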

Inference

Run predictions (to switch between the test and dev sets, change the file names manually in the processor):

python bert_multilabel_run_classifier.py \
    --data_dir $DATA_DIR \
    --use_data en \
    --bert_model $BERT_EXPS_DIR/output \
    --task_name clef \
    --output_dir $BERT_EXPS_DIR/output \
    --cache_dir $BERT_EXPS_DIR/cache \
    --max_seq_length 256 \
    --do_eval

Evaluate

Use the official evaluation.py script to evaluate:

python evaluation.py --ids_file=$DATA_DIR/ids_development.txt \
                     --anns_file=$DATA_DIR/anns_train_dev.txt \
                     --dev_file=$BERT_EXPS_DIR/output/preds_development.txt \
                     --out_file=$BERT_EXPS_DIR/output/eval_output.txt
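For a quick sanity check before running the official script, set-based precision, recall and F1 can be computed directly. The sketch below is not a replacement for evaluation.py: micro-averaging over documents is an assumption about the metric, the function name is hypothetical, and the inputs are plain dictionaries rather than the files used above.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall and F1.

    gold and pred map a document id to a set of ICD codes.
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(pred):
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # codes predicted and correct
        fp += len(p - g)   # codes predicted but not in gold
        fn += len(g - p)   # gold codes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Documents that appear in only one of the two mappings still count, so a document with no predictions contributes its gold codes as misses.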

Running Other Models

Change configurations here (no CLI yet). Main parameters are:

lang: can be one of {en, de}

load_pretrain_ft: whether to use fastText pre-trained embeddings, works for both languages.

load_pretrain_pubmed: whether to use PubMed embeddings, works for English only.

pretrain_file: path to pre-trained vectors, should be one of path/to/cc.{en, de}.300.vec when load_pretrain_ft=True and path/to/pubmed2018_w2v_400D.bin when load_pretrain_pubmed=True.

model_name: name of the model; can be one of {cnn, han, slstm, clstm}.

For other hyperparameters, check here.
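The constraints between these parameters can be checked up front. The sketch below is a hypothetical dataclass, not the repository's configuration object: the field names follow the list above, while the default values and the rule that only one embedding source is active at a time are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ExperimentConfig:
    # Defaults are illustrative assumptions, not the repository's values.
    lang: str = "en"
    load_pretrain_ft: bool = False
    load_pretrain_pubmed: bool = False
    pretrain_file: str = ""
    model_name: str = "cnn"

    def validate(self):
        """Raise ValueError if the settings contradict the rules above."""
        if self.lang not in {"en", "de"}:
            raise ValueError("lang must be 'en' or 'de'")
        if self.model_name not in {"cnn", "han", "slstm", "clstm"}:
            raise ValueError(f"unknown model_name: {self.model_name!r}")
        # Assumption: a single pretrain_file implies one embedding source.
        if self.load_pretrain_ft and self.load_pretrain_pubmed:
            raise ValueError("enable only one of fastText or PubMed vectors")
        if self.load_pretrain_pubmed and self.lang != "en":
            raise ValueError("PubMed embeddings are available for English only")
        if self.load_pretrain_ft:
            expected = f"cc.{self.lang}.300.vec"
            if not self.pretrain_file.endswith(expected):
                raise ValueError(f"pretrain_file should point to {expected}")
        if self.load_pretrain_pubmed:
            if not self.pretrain_file.endswith("pubmed2018_w2v_400D.bin"):
                raise ValueError(
                    "pretrain_file should point to pubmed2018_w2v_400D.bin"
                )
        return self
```

Calling `ExperimentConfig(lang="de", load_pretrain_ft=True, pretrain_file="path/to/cc.de.300.vec").validate()` passes, while requesting PubMed vectors with `lang="de"` raises an error before any training starts.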

After all the models have been run and their results placed under one directory (the folder names have to be checked manually), use predict.py to reproduce the numbers found in Results.txt.

Citation

If you find our work useful, please consider citing:

@inproceedings{amin2019mlt,
  title={MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT},
  author={Amin, Saadullah and Neumann, G{\"u}nter and Dunfield, Katherine Ann and Vechkaeva, Anna and Chapman, Kathryn Annette and Wixted, Morgan Kelly},
  booktitle={Proceedings of the 20th Conference and Labs of the Evaluation Forum (Working Notes)},
  url = {http://ceur-ws.org/Vol-2380/paper_67.pdf},
  pages = {1--15},
  year = {2019}
}
