This project forked from 3778/icd-prediction-mimic

Predicting ICD Codes from Clinical Notes

Home Page: https://arxiv.org/abs/2008.01515

License: Apache License 2.0

Predicting ICD Codes from Clinical Notes

This repository contains code for training and evaluating several neural network models that predict ICD codes from discharge summaries in the publicly accessible MIMIC-III dataset (v. 1.4). The models are described in the paper "Predicting Multiple ICD-10 Codes from Brazilian-Portuguese Clinical Notes", which uses the results on MIMIC-III as a benchmark. The implemented models are:

  • Logistic Regression
  • Convolutional Neural Network
  • Recurrent Neural Network with Gated Recurrent Units
  • Convolutional Neural Network with Attention (based on CAML)

Dependencies

This project depends on:

  • python==3.6.9
  • numpy==1.19.0
  • scikit-learn==0.23.1
  • pandas==0.25.3
  • nltk==3.4.4
  • scipy==1.4.1
  • gensim==3.8.3
  • tensorflow==2.1.0 (preferably tensorflow-gpu)

General pipeline:

1. In data/, place the files below:

  • NOTEEVENTS.csv.gz (from MIMIC-III)
  • DIAGNOSES_ICD.csv.gz (from MIMIC-III)
  • {train,dev,test}_full_hadm_ids.csv (all 3 from CAML)

2. Run MIMIC_preprocessing.py to select discharge summaries and merge MIMIC-III tables.

The MIMIC-III tables NOTEEVENTS and DIAGNOSES_ICD are loaded and merged on admission ID (HADM_ID). From NOTEEVENTS, only a single discharge summary is kept per admission ID.

Outputs data/mimic3_data.pkl, a DataFrame containing 4 columns:

  • HADM_ID: the admission IDs of each patient stay.
  • SUBJECT_ID: the patient IDs. A patient may have multiple admissions, hence a SUBJECT_ID may be linked to multiple HADM_IDs.
  • TEXT: the discharge summaries, one for each HADM_ID.
  • ICD9_CODE: a list of ICD codes assigned to each stay (i.e. to each HADM_ID).
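The merge described in this step can be sketched in plain Python so that it runs without the dataset. This is an illustration only, not the code in MIMIC_preprocessing.py: the rows are invented, the helper name merge_tables is hypothetical, and two choices are assumptions (keeping the first discharge summary seen for an admission, and dropping admissions that have no diagnosis codes).

```python
from collections import defaultdict

# Toy stand-ins for rows of NOTEEVENTS and DIAGNOSES_ICD (all values invented).
note_rows = [
    {"SUBJECT_ID": 1, "HADM_ID": 100, "CATEGORY": "Discharge summary", "TEXT": "first summary"},
    {"SUBJECT_ID": 1, "HADM_ID": 100, "CATEGORY": "Discharge summary", "TEXT": "addendum"},
    {"SUBJECT_ID": 1, "HADM_ID": 100, "CATEGORY": "Nursing", "TEXT": "nursing note"},
    {"SUBJECT_ID": 2, "HADM_ID": 200, "CATEGORY": "Discharge summary", "TEXT": "second summary"},
]
diagnosis_rows = [
    {"HADM_ID": 100, "ICD9_CODE": "4019"},
    {"HADM_ID": 100, "ICD9_CODE": "25000"},
    {"HADM_ID": 200, "ICD9_CODE": "41401"},
]


def merge_tables(note_rows, diagnosis_rows):
    """Return one record per admission with the four columns listed above."""
    # Collect every ICD code assigned to each admission ID.
    codes_by_admission = defaultdict(list)
    for row in diagnosis_rows:
        codes_by_admission[row["HADM_ID"]].append(row["ICD9_CODE"])

    merged = {}
    for row in note_rows:
        if row["CATEGORY"] != "Discharge summary":
            continue  # only discharge summaries are used
        hadm_id = row["HADM_ID"]
        # Assumption: keep the first summary seen; skip admissions without codes.
        if hadm_id in merged or hadm_id not in codes_by_admission:
            continue
        merged[hadm_id] = {
            "HADM_ID": hadm_id,
            "SUBJECT_ID": row["SUBJECT_ID"],
            "TEXT": row["TEXT"],
            "ICD9_CODE": codes_by_admission[hadm_id],
        }
    return list(merged.values())


records = merge_tables(note_rows, diagnosis_rows)
for record in records:
    print(record)
```

Each resulting record corresponds to one row of the DataFrame stored in data/mimic3_data.pkl.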

3. Run MIMIC_train_w2v.py to train Word2Vec word embeddings for the neural network models.

This script builds the training instances by filtering data/mimic3_data.pkl with the admission IDs in data/train_full_hadm_ids.csv, then trains gensim.models.Word2Vec word embeddings on them.

Outputs:

  • MIMIC_emb_train_vecW2V_SIZE.pkl: an embedding matrix with shape (vocab_length, embedding_dimension), in which every row contains the embedding vector of a word.
  • MIMIC_dict_train_vecW2V_SIZE.pkl: a dictionary linking words to the respective row indexes in the embedding matrix.
  • w2v_model.model: the trained Word2Vec instance.
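To make the relationship between the embedding matrix and the dictionary concrete, here is a minimal sketch in plain Python. It does not call gensim: the vectors are random placeholders, the function names are hypothetical, and dropping out-of-vocabulary words during encoding is an assumption rather than the script's documented behaviour.

```python
import random


def build_embedding_outputs(word_vectors):
    """Turn a {word: vector} mapping into (matrix, word_to_row)."""
    matrix = []       # one row per word: shape (vocab_length, embedding_dimension)
    word_to_row = {}  # word -> row index in the matrix
    for word, vector in word_vectors.items():
        word_to_row[word] = len(matrix)
        matrix.append(list(vector))
    return matrix, word_to_row


def encode(tokens, word_to_row):
    """Map tokens to row indexes (assumption: unknown words are dropped)."""
    return [word_to_row[token] for token in tokens if token in word_to_row]


# Placeholder vectors standing in for trained Word2Vec embeddings.
rng = random.Random(0)
toy_vectors = {word: [rng.random() for _ in range(4)] for word in ["chest", "pain", "fever"]}

matrix, word_to_row = build_embedding_outputs(toy_vectors)
indexes = encode(["chest", "pain", "unseenword"], word_to_row)
print(len(matrix), len(matrix[0]))  # vocab_length, embedding_dimension
print(indexes)
```

A neural network model can then look up row `word_to_row[w]` of the matrix to obtain the vector for word `w`.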

4. Now, any model can be trained and evaluated:

4.1. Run MIMIC_train_baselines.py, for LR and Constant models.

  • For Constant: Computes the k most frequently occurring ICD codes in the training set and predicts them for all test samples. Nothing is stored here.

  • For LR: Computes TF-IDF features on the training set, then fits the LR model to the training set. After training, the weights of the epoch with the best micro F1 on the validation set are restored and threshold-optimized metrics are computed for all subsets. The fitted model is stored in the TensorFlow SavedModel format.
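The Constant baseline is simple enough to sketch in a few lines. The following is an illustration with invented label lists and hypothetical function names, not the code in MIMIC_train_baselines.py.

```python
from collections import Counter


def top_k_codes(train_labels, k):
    """Return the k most frequent ICD codes across all training samples."""
    counts = Counter(code for codes in train_labels for code in codes)
    return [code for code, _ in counts.most_common(k)]


def predict_constant(train_labels, n_test_samples, k):
    """Predict the same k codes for every test sample."""
    top = top_k_codes(train_labels, k)
    return [list(top) for _ in range(n_test_samples)]


# Invented training labels: one list of ICD codes per admission.
train_labels = [["4019", "25000"], ["4019"], ["4019", "41401", "25000"]]

predictions = predict_constant(train_labels, n_test_samples=2, k=2)
print(predictions)
```

Because the prediction ignores the input text entirely, this baseline gives a floor against which the learned models can be compared.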

4.2. Run MIMIC_train_nn.py, for CNN, GRU and CNN-Att.

This script loads the data splits and Word2Vec embeddings, then fits the desired model to the training set. After training, the weights of the epoch with the best micro F1 on the validation set are restored and threshold-optimized metrics are computed for all subsets.

The fitted model is stored in the TensorFlow SavedModel format.
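One way to read "threshold-optimized metrics" is sketched below: a single decision threshold is chosen to maximize micro F1 on held-out data and then reused elsewhere. This is an assumption about the procedure, not a description of MIMIC_train_nn.py; the function names, the candidate grid, and the toy probabilities are all hypothetical.

```python
def micro_f1(y_true, y_prob, threshold):
    """Micro-averaged F1 over all (sample, label) pairs at a given threshold."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for truth, prob in zip(true_row, prob_row):
            predicted = prob >= threshold
            if predicted and truth:
                tp += 1
            elif predicted and not truth:
                fp += 1
            elif truth:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(y_true, y_prob, candidates):
    """Pick the candidate threshold with the highest micro F1."""
    return max(candidates, key=lambda t: micro_f1(y_true, y_prob, t))


# Toy validation data: 2 samples, 3 labels (values invented).
y_true = [[1, 0, 1], [0, 1, 0]]
y_prob = [[0.9, 0.4, 0.35], [0.2, 0.6, 0.1]]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

threshold = best_threshold(y_true, y_prob, candidates)
print(threshold, micro_f1(y_true, y_prob, threshold))
```

Micro averaging pools true positives, false positives, and false negatives across all labels before computing F1, so frequent codes weigh more heavily than rare ones.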

5. In notebooks/, you will find:

  • MIMIC_overview.ipynb, where some data analyses from MIMIC-III discharge summaries are provided.
  • MIMIC_analyze_predictions.ipynb, where additional analyses of the predictions of a trained model with W2V embeddings can be computed. The outputs shown are from our CNN-Att model. Edit the first cell with the desired model name and run all cells.
