
multicenter-sepsis

This is the repository for the paper: Predicting sepsis using deep learning across international sites: a retrospective development and validation study

Reference:

@article{moor2023predicting,
  title={Predicting sepsis using deep learning across international sites: a retrospective development and validation study},
  author={Moor, Michael and Bennett, Nicolas and Ple{\v{c}}ko, Drago and Horn, Max and Rieck, Bastian and Meinshausen, Nicolai and B{\"u}hlmann, Peter and Borgwardt, Karsten},
  journal={eClinicalMedicine},
  volume={62},
  pages={102124},
  year={2023},
  publisher={Elsevier}
}

Disclaimer:

We plan to clean up the following components:

  • R code for data loading / harmonization
  • Python code for pre-processing (feature extraction), normalization etc. (assumes a Dask pipeline that can be run on a large CPU server or cluster)
  • Python code for model development (both deep learning models in PyTorch, and classic models using sklearn), fine-tuning, calibration

Acknowledgements:

This project was a massive effort, spanning four years and more than 1.5K commits.

Code contributors:

Michael, Nicolas, Max, Bastian, and Drago

Data setup

In order to set up the datasets, the R package ricu (available via CRAN) is required alongside access credentials for PhysioNet and a download token for AmsterdamUMCdb. This information can then be made available to ricu by setting the environment variables RICU_PHYSIONET_USER, RICU_PHYSIONET_PASS and RICU_AUMC_TOKEN.

install.packages("ricu")
Sys.setenv(
    RICU_PHYSIONET_USER = "my-username",
    RICU_PHYSIONET_PASS = "my-password",
    RICU_AUMC_TOKEN = "my-token"
)

Then, by sourcing the files in r/utils (which requires further R packages to be installed; see r/utils/zzz-demps.R), the function export_data() becomes available. This loads data according to the specification in config/features.json on an hourly grid, performs some patient filtering, and concludes with missingness imputation/feature augmentation steps. The script under r/scripts/create_dataset.R can be used to carry out these steps.

install.packages(
    c("here", "arrow", "bigmemory", "jsonlite", "data.table", "readr",
      "optparse", "assertthat", "cli", "memuse", "dplyr",
      "biglasso", "ranger", "qs", "lightgbm", "cowplot", "roll")
)

invisible(
  lapply(list.files(here::here("r", "utils"), full.names = TRUE), source)
)

for (x in c("mimic", "eicu", "hirid", "aumc")) {

  if (!is_data_avail(x)) {
    msg("setting up `{x}`\n")
    setup_src_data(x)
  }

  msg("exporting data for `{x}`\n")
  export_data(x)
}

When export_data() is called with its default dest_dir argument of data_path("export"), this creates one parquet file per data source under data-export. The procedure can also be run on the PhysioNet demo datasets, for debugging and to make sure the pipeline runs through:

install.packages(
  c("mimic.demo", "eicu.demo"),
  repos = "https://eth-mds.github.io/physionet-demo"
)

for (x in c("mimic_demo", "eicu_demo")) {
  export_data(x)
}

Python pipeline (for the machine learning / modelling side)

For transparency, we include the full list of requirements used throughout this study in requirements_full.txt. However, some individual packages may no longer be supported, so to get started you may want to begin with requirements_minimal.txt.

For example, activate your virtual environment and run:
pip install -r requirements_minimal.txt

For setting up this project, we ran:
>pipenv install
>pipenv shell
Feel free to also check out the Pipfile / Pipfile.lock.

Datasets

(internal usage: run source scripts/download_from_euler.sh)
All data will be downloaded to:
datasets/downloads/

Source code

src:

  • torch: PyTorch-based pipeline and models (currently an attention model)
    TODO: add documentation for training a model
  • sklearn: sklearn-based pipeline for boosted trees baselines

Preprocessing

Running the preprocessing

source scripts/run_preprocessing.sh

Note that the preprocessed data (as parquet files) contain two different label columns: sep3 and utility. sep3 is the sepsis label, while utility is a regression target derived from the sepsis label, inspired by the PhysioNet 2019 Challenge for sepsis prediction. The utility score is a bit more complex to use, as it cannot be used directly across different datasets (due to prevalence differences). We have a solution for this (lambda parameters), but it is not part of this paper; feel free to contact us if interested.

If you are not using our scripts (which automatically take care of this), make sure not to use sep3 or utility as a feature for training!
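For illustration, a minimal sketch (in Python, using pandas) of separating the label columns from the features; the parquet path is a placeholder and may differ from the repo layout:

import pandas as pd

# Load one preprocessed split; the path is illustrative only.
df = pd.read_parquet("datasets/downloads/mimic_features.parquet")

# 'sep3' and 'utility' are the label columns described above and must not
# end up in the feature matrix.
label_cols = ["sep3", "utility"]
X = df.drop(columns=[c for c in label_cols if c in df.columns])
y = df["sep3"]  # binary sepsis label; 'utility' is the regression target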

Training

Model overview

  • src/torch: PyTorch-based pipeline and models (currently GRU and attention models)
  • src/sklearn: sklearn-based pipeline for LightGBM and LogReg models

Running the LightGBM hyperparameter search

>source scripts/run_lgbm.sh <results_folder_name>

After having run the LightGBM hyperparameter search, run repetitions with:

>source scripts/run_lgbm_rep.sh <results_folder_name>

Running the baseline models hyperparameter search + repetitions (in one)

>source scripts/run_baselines.sh <results_folder_name>

Deep models / torch pipeline

We currently run these jobs on bs-slurm-02.

First, create a sweep on wandb.ai. Then, using the sweep id (only the id, not the entire id path), run:
>source scripts/wandb/submit_job.sh sweep-id
In this submit_job script you can configure the variable n_runs, i.e. how many evaluations should be run (e.g. 25 during a coarse or fine tuning search, or 5 for repetition runs).
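The sweep configuration itself lives on wandb; as a minimal sketch of how such a sweep id can be produced programmatically (the project name and parameter ranges below are hypothetical, not the study's actual search space):

import wandb

# Hypothetical sweep config; only the returned id is passed to submit_job.sh.
sweep_config = {
    "method": "random",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-3},
        "dropout": {"values": [0.1, 0.3, 0.5]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="multicenter-sepsis")
print(sweep_id)  # pass the id (not the full entity/project/id path)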

Training a single dataset and model

Fitting an attention model on PhysioNet: #TODO update this

Evaluation pipeline

Shallow models + Baselines

>source scripts/eval_sklearn.sh <results_folder_name>
where the results folder refers to the output folder of the hyperparameter search. Make sure that the eval_sklearn script reads all the methods you wish to evaluate. This script assumes that repetitions are already available.

Deep models

First, determine the best run of your sweep, which gives you a run-id (a sketch for finding it programmatically follows the commands below). Apply this model to all datasets:
>source scripts/wandb/submit_evals.sh run-id
Once this has completed, the prediction files can be processed in the patient-level evaluation:
>source scripts/eval_torch.sh run-id
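A minimal sketch for finding the best run via the wandb public API (entity and project names are placeholders):

import wandb

api = wandb.Api()
sweep = api.sweep("my-entity/my-project/sweep-id")
best = sweep.best_run()  # ranked by the sweep's optimization metric
print(best.id)  # the run-id to pass to the evaluation scripts above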

For evaluating a repetition sweep, run (on Slurm):
>pipenv run python scripts/wandb/get_repetition_runs.py sweep-id1 sweep-id2 ...
and once completed, run (again on a CPU server):
>python scripts/wandb/get_repetition_evals.py sweep-id1 sweep-id2 ...

Results and plots

For gathering all repetition results, run:
>python -m scripts.plots.gather_data --input_path results/evaluation_validation/evaluation_output_subsampled --output_path results/evaluation_validation/plots/

For creating ROC plots, run:
>python scripts/plots/plot_roc.py --input_path results/evaluation/plots/result_data.csv

For creating precision/earliness plots, run:
>python -m scripts.plots.plot_scatterplots results/evaluation/plots/result_data.csv --r 0.80 --point-alpha 0.35 --line-alpha 1.0 --output results/evaluation/plots/
For the scatter data, in order to return 50 measures (5 repetition splits × 10 subsamplings), set --aggregation micro.
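To illustrate the counting (the column names here are hypothetical, not the script's actual schema): micro aggregation keeps all 5 × 10 = 50 points, whereas aggregating per split would first average the 10 subsamples within each repetition:

import pandas as pd

# 5 repetition splits with 10 subsampled measures each (toy values).
df = pd.DataFrame({
    "split": [s for s in range(5) for _ in range(10)],
    "precision": [i / 50 for i in range(50)],
})
micro = df                                              # all 50 measures
per_split = df.groupby("split", as_index=False).mean()  # 5 averaged measures
print(len(micro), len(per_split))  # 50 5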

Pooled predictions

First, we need to create a mapping from experiments (data_train, data_eval, model, etc.) to the prediction files:
>python scripts/map_model_to_result_files.py <path_to_predictions> --output_path <output_json_path>
Use --overwrite to overwrite an existing mapping json.
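The exact schema of the mapping json is defined by that script; a hedged sketch of how it can be consumed downstream (the path and key format are assumptions):

import json

# Placeholder path; use the output_json_path passed above.
with open("results/mapping.json") as f:
    mapping = json.load(f)

# Assumed shape: experiment keys (train/eval dataset, model, ...) mapped
# to prediction file locations.
for experiment, prediction_files in mapping.items():
    print(experiment, prediction_files)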

Next we actually pool the predictions:
>source scripts/pool_predictions.sh

Then, we evaluate them:
>source scripts/eval_pooled.sh
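The "max" in the output paths below suggests that per-patient prediction scores are pooled via an element-wise maximum across models; a hedged numpy sketch of that idea (the actual logic lives in scripts/pool_predictions.sh):

import numpy as np

# Scores from two models, aligned per patient/time step (toy values).
scores_model_a = np.array([0.1, 0.7, 0.4])
scores_model_b = np.array([0.3, 0.5, 0.9])
pooled = np.maximum(scores_model_a, scores_model_b)  # element-wise max
print(pooled)  # [0.3 0.7 0.9]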
To create plots with the pooled predictions, run:
>python -m scripts.plots.gather_data --input_path results/evaluation_test/prediction_pooled_subsampled/max/evaluation_output --output_path results/evaluation_test/prediction_pooled_subsampled/max/plots/
>python scripts/plots/plot_roc.py --input_path results/evaluation_test/prediction_pooled_subsampled/max/plots/result_data_subsampled.csv
For computing precision/earliness, run:
>python -m scripts.plots.plot_scatterplots results/evaluation_test/prediction_pooled_subsampled/max/plots/result_data_subsampled.csv --r 0.80 --point-alpha 0.35 --line-alpha 1.0 --output results/evaluation_test/prediction_pooled_subsampled/max/plots/
For the heatmap including the pooled predictions, run:
>python -m scripts.make_heatmap results/evaluation_test/plots/roc_summary_subsampled.csv --pooled_path results/evaluation_test/prediction_pooled_subsampled/max/plots/roc_summary_subsampled.csv
