
Labrador: Exploring the Limits of Masked Language Modeling for Laboratory Data

Get Started | arXiv paper | HuggingFace Models | Data | License

Visual Abstract

Labrador is a pre-trained continuous Transformer model for masked lab modeling.

Laboratory data are a rich source of information about a patient's health. They are often used to diagnose and monitor disease, and to guide treatment. However, lab values are continuous and frequently missing, which makes them difficult to model with the Transformer architecture.

Labrador solves this problem by jointly embedding lab values with a token for the lab test identifier so that the quantitative and qualitative information from each test is combined into a single representation.
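As a rough illustration, a joint embedding of this kind might look like the following sketch (numpy, with made-up names and dimensions; the additive combination of the two vectors is an assumption for illustration, not Labrador's exact layer):

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8
VOCAB_SIZE = 50  # number of distinct lab test codes (hypothetical)

# Lookup table for lab-code tokens, as in a standard embedding layer.
code_embeddings = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))
# A learned projection that maps a scalar lab value into the same space.
value_projection = rng.normal(size=(EMBED_DIM,))

def embed_lab(code: int, value: float) -> np.ndarray:
    """Combine a lab-code embedding with its continuous value into one vector."""
    code_vec = code_embeddings[code]          # qualitative: which test was run
    value_vec = value * value_projection      # quantitative: the measured value
    return code_vec + value_vec               # single joint representation

joint = embed_lab(code=7, value=1.3)
print(joint.shape)  # (8,)
```

The key point is that each (test identifier, value) pair enters the Transformer as one vector, rather than the value being discretized into tokens.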

Labrador is pre-trained on a large corpus of 100 million lab tests from over 260,000 patients. We rigorously evaluate Labrador on intrinsic and extrinsic tasks, including four real-world problems: cancer diagnosis, COVID-19 diagnosis, predicting elevated alcohol consumption, and ICU mortality due to sepsis. We find that Labrador is superior to BERT across all evaluations, but both are outperformed by XGBoost, indicating that transfer learning from continuous EHR data is still an open problem.

We discuss the limitations of our approach and suggest future directions for research in the corresponding paper, Labrador: Exploring the Limits of Masked Language Modeling for Laboratory Data.

Pre-trained Labrador and BERT models are available for download from the Hugging Face model hub.

Get Started

System packages

This repository requires git, make, and Python 3.10 or later. You can install these using several methods. We recommend brew (install guide), which is available for macOS and Linux; Windows users should first install and activate WSL-Ubuntu.

We recommend using pyenv to manage Python versions.

brew update
brew install git make pyenv gzip
pyenv init

Follow pyenv's instructions to add contents to your shell configuration files. Restart your shell and install Python 3.10 or later.

pyenv install 3.10.0

Clone this GitHub repository

git clone https://github.com/DavidBellamy/labrador

Virtual environment

The following setup commands and all files in scripts/ should be run from the root directory of this repository.

Ensure that Python 3.10 or later is selected in your active shell (pyenv shell 3.10.0 if using pyenv), then create a virtual environment and activate it:

python -m venv venv && source venv/bin/activate

Install requirements, download model weights and run tests

make setup

make setup installs this repository as a Python package in editable mode along with all of its requirements (requirements.txt), downloads the model weights for Labrador and BERT from Hugging Face to the model_weights/ directory, and runs the project's tests found in tests/.

🚨 Make sure that all tests pass before proceeding.

The Makefile

The Makefile contains a number of commands for running the files in scripts/. From these, you can determine the order in which the scripts should be run as well as the command-line arguments each script expects.

Labrador model architecture

Labrador's architecture is based on the BERT model with some modifications. It has a continuous embedding layer and a continuous prediction head, as shown in the figure below.

We define a bag of labs as the collection of lab tests ordered at the same point in time for a patient. Labrador is pre-trained as a masked language model (MLM) on bags of labs with the objective of predicting the masked lab code and value given the other labs in the bag.
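The masking step can be sketched as follows (a minimal illustration with hypothetical names; Labrador's actual implementation differs in details such as how continuous values are masked):

```python
import random

MASK = "<MASK>"

def mask_bag(bag, mask_prob=0.15, seed=0):
    """Randomly mask (code, value) pairs in a bag of labs for MLM pre-training.

    Returns the masked bag and a dict mapping masked positions to the
    original pairs, which serve as the prediction targets.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, (code, value) in enumerate(bag):
        if rng.random() < mask_prob:
            targets[i] = (code, value)   # model must recover both code and value
            masked.append((MASK, MASK))  # code and value are masked jointly
        else:
            masked.append((code, value))
    return masked, targets

# One bag: lab tests ordered at the same point in time for one patient.
bag = [("sodium", 140.0), ("potassium", 4.1), ("creatinine", 0.9), ("glucose", 95.0)]
masked_bag, targets = mask_bag(bag, mask_prob=0.5)
```

The model then predicts the held-out code and value at each masked position from the unmasked labs in the same bag.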

Figure 1

Data

The data used to pre-train Labrador are a subset of the MIMIC-IV dataset. This dataset is freely available but requires a data use agreement, so we cannot share it directly. We use the following tables:

  • labevents.csv
  • admissions.csv
  • patients.csv
  • d_labitems.csv

Directly from PhysioNet

These files can be downloaded by signing the data use agreement on the MIMIC-IV page on PhysioNet and executing the following command:

make mimic_data physionet_username=*insert_your_username*

Download speeds from PhysioNet are usually quite low, so the download may take a few hours.

From Google Cloud

Alternatively, the data can be downloaded a few hundred times faster from Google Cloud, but this requires some additional setup.

  1. Connect your PhysioNet account to your Google account in Cloud Settings
  2. Request Google Cloud access to the dataset
  3. Set up a Google Cloud project and billing account
  4. Install google-cloud-sdk
  5. Authenticate with gcloud auth login in the terminal
  6. Run make mimic_data google_project_id=*your_project_id* tool=gsutil

Downstream tasks data

The data used in the sepsis mortality prediction evaluation are also derived from the MIMIC-IV dataset. You can download sepsis cohort data via Google BigQuery:

SELECT `subject_id`, `stay_id`, `sofa_time`, `sofa_score` FROM `physionet-data.mimiciv_derived.sepsis3`;

The resulting table should be saved as data/raw/mimic4_sepsis_cohort.csv.

The data we used for the cancer diagnosis, COVID-19 diagnosis, and elevated alcohol consumption evaluations are open source, so we share them directly in data/.

Making the pre-training data

This is done in 3 steps. First, raw EHR data are converted to JSON lines for each patient. Then, the JSON lines are converted to bags of labs. Finally, the bags of labs are converted to TFRecords for faster training.

Note: these steps require a lot of memory and disk space.

  • Raw -> JSON lines: output file pattern is {model}_{split}_patients.jsonl, e.g. bert_train_patients.jsonl
    • Scripts: pretraining_raw_data_to_{model}.py, where {model} in [bert, labrador].
  • JSON lines -> Bags of labs: output file pattern is {model}_{split}_bags.jsonl, e.g. bert_train_bags.jsonl
    • Scripts: pretraining_jsonl_to_{model}_bags.py, where {model} in [bert, labrador].
  • Bags of labs -> TFRecords: output directory pattern is {model}_tfrecords_{split}/, e.g. bert_tfrecords_train
    • Scripts: pretraining_bags_to_{model}_tfrecords.py, where {model} in [bert, labrador].
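The middle step, grouping a patient's lab events into bags, can be sketched as follows (column names follow MIMIC-IV's labevents table; the grouping key and itemids here are illustrative assumptions, not the scripts' exact logic):

```python
from collections import defaultdict

def to_bags(lab_events):
    """Group lab events into bags: all tests for one patient at one charttime."""
    bags = defaultdict(list)
    for event in lab_events:
        key = (event["subject_id"], event["charttime"])
        bags[key].append((event["itemid"], event["valuenum"]))
    return dict(bags)

# Example rows shaped like MIMIC-IV labevents (itemids are illustrative).
events = [
    {"subject_id": 1, "charttime": "2180-07-23 05:00", "itemid": 50912, "valuenum": 0.9},
    {"subject_id": 1, "charttime": "2180-07-23 05:00", "itemid": 50971, "valuenum": 4.1},
    {"subject_id": 1, "charttime": "2180-07-24 05:00", "itemid": 50912, "valuenum": 1.1},
]
bags = to_bags(events)
print(len(bags))  # 2 bags: two distinct charttimes for patient 1
```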

Citing Labrador

If you use Labrador in your research, please cite:

License

This work is licensed under the MIT License. You can find the full text of the license in the LICENSE file.

Contributors

  • davidbellamy
  • mshavliuk

