
multicenter-sepsis

This is the repository for the paper: Predicting sepsis using deep learning across international sites: a retrospective development and validation study

Reference:

@article{moor2023predicting,
  title={Predicting sepsis using deep learning across international sites: a retrospective development and validation study},
  author={Moor, Michael and Bennett, Nicolas and Ple{\v{c}}ko, Drago and Horn, Max and Rieck, Bastian and Meinshausen, Nicolai and B{\"u}hlmann, Peter and Borgwardt, Karsten},
  journal={eClinicalMedicine},
  volume={62},
  pages={102124},
  year={2023},
  publisher={Elsevier}
}

Disclaimer:

We plan to clean up the following components:

  • R code for data loading / harmonization
  • Python code for pre-processing (feature extraction), normalization, etc. (assumes a Dask pipeline that can be run on a large CPU server or cluster)

Acknowledgements:

This project was a massive effort stretching over four years and more than 1.5K commits.

Code contributors:

Michael, Nicolas, Max, Bastian, and Drago

Data setup

To set up the datasets, the R package ricu (available via CRAN) is required, alongside access credentials for PhysioNet and a download token for AmsterdamUMCdb. This information can be made available to ricu by setting the environment variables RICU_PHYSIONET_USER, RICU_PHYSIONET_PASS, and RICU_AUMC_TOKEN.

install.packages("ricu")
Sys.setenv(
    RICU_PHYSIONET_USER = "my-username",
    RICU_PHYSIONET_PASS = "my-password",
    RICU_AUMC_TOKEN = "my-token"
)

Then, by sourcing the files in r/utils, which requires further R packages to be installed (see r/utils/zzz-demps.R), the function export_data() becomes available. This loads the data specified in config/features.json onto an hourly grid, performs some patient filtering, and concludes with some missingness-imputation and feature-augmentation steps. The script under r/scripts/create_dataset.R can be used to carry out these steps.

install.packages(
    c("here", "arrow", "bigmemory", "jsonlite", "data.table", "readr",
      "optparse", "assertthat", "cli", "memuse", "dplyr",
      "biglasso", "ranger", "qs", "lightgbm", "cowplot", "roll")
)

invisible(
  lapply(list.files(here::here("r", "utils"), full.names = TRUE), source)
)

for (x in c("mimic", "eicu", "hirid", "aumc")) {

  if (!is_data_avail(x)) {
    msg("setting up `{x}`\n")
    setup_src_data(x)
  }

  msg("exporting data for `{x}`\n")
  export_data(x)
}

When export_data() is called with its default argument of data_path("export") for dest_dir, it creates one parquet file per data source under data-export. The procedure can also be run using the PhysioNet demo datasets, for debugging and to make sure it runs through:

install.packages(
  c("mimic.demo", "eicu.demo"),
  repos = "https://eth-mds.github.io/physionet-demo"
)

for (x in c("mimic_demo", "eicu_demo")) {
  export_data(x)
}

Python pipeline (machine learning / modelling side)

For transparency, we include the full list of requirements used throughout this study in
requirements_full.txt. However, some individual packages may no longer be supported, so to get started you may want to begin with
requirements_minimal.txt

For example, activate your virtual environment and run:
pip install -r requirements_minimal.txt

For setting up this project, we ran:
>pipenv install
>pipenv shell

Feel free to also check out the Pipfile / Pipfile.lock.

Datasets

Make sure that all exported data is put here:
datasets/downloads/
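
For orientation, a plausible layout after exporting all four sources might look like this (the file names are illustrative; export_data() writes one parquet file per data source):

datasets/downloads/
    mimic.parquet
    eicu.parquet
    hirid.parquet
    aumc.parquet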

Source code

src:

  • torch: pytorch-based pipeline and models (currently an attention model)
    TODO: add documentation for training a model
  • sklearn: sklearn-based pipeline for boosted-trees baselines

Preprocessing

Running the preprocessing

source scripts/run_preprocessing.sh

Note that the preprocessed data (as parquet files) contains two different label columns, 'sep3' and 'utility', where sep3 is the sepsis label and utility is a regression target (derived from the sepsis label), inspired by the PhysioNet 2019 Challenge for sepsis prediction. The utility score is a bit more complex to use, as it cannot be applied directly across different datasets (due to prevalence differences). We have a solution for this (lambda parameters), but it is not part of this paper. Feel free to contact us if interested.

If you are not using our scripts (which automatically take care of this), make sure not to use either sep3 or utility as a feature for training (see the sketch below)!
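
For instance, if you roll your own training code, a minimal sketch for excluding the label columns (pandas assumed; the file path is illustrative, only the sep3 and utility column names are taken from above):

import pandas as pd

# Illustrative path; point this at one of the preprocessed parquet files.
df = pd.read_parquet("datasets/downloads/preprocessed_mimic.parquet")

LABEL_COLUMNS = ["sep3", "utility"]  # never feed these to a model as features

y = df["sep3"]                      # binary sepsis label, the classification target
X = df.drop(columns=LABEL_COLUMNS)  # everything else may serve as features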

Training

Model overview

  • src/torch: pytorch-based pipeline and models (currently GRU and attention model)
  • src/sklearn: sklearn-based pipeline for lightGBM and LogReg models

Running the LightGBM hyperparameter search

>source scripts/run_lgbm.sh <results_folder_name>

After having run the LightGBM hyperparameter search, run repetitions with:

>source scripts/run_lgbm_rep.sh <results_folder_name>

Running the baseline models hyperparameter search + repetitions (in one)

>source scripts/run_baselines.sh <results_folder_name>

Deep models / torch pipeline

We currently run these jobs on bs-slurm-02.

First, create a sweep on wandb.ai. Then, using the sweep id (only the id, not the entire id path), run:
>source scripts/wandb/submit_job.sh sweep-id
In this submit_job script you can configure the variable n_runs, i.e. how many evaluations should be run (e.g. 25 during a coarse or fine hyperparameter search, or 5 for repetition runs), as in the sketch below.
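
A sketch (the exact placement of the variable inside scripts/wandb/submit_job.sh may differ):

n_runs=25  # number of sweep evaluations to run; e.g. 25 for searches, 5 for repetitions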

Example sweep for hyperparameter search of training an attention model on MIMIC:

method: random
metric:
  goal: minimize
  name: online_val/loss
parameters:
  batch_size:
    values:
      - 16
      - 32
      - 64
      - 128
  cost:
    value: 5
  d_model:
    values:
      - 32
      - 64
      - 128
      - 256
  dataset:
    value: MIMIC
  dropout:
    values:
      - 0.3
      - 0.4
      - 0.5
      - 0.6
      - 0.7
  gpus:
    value: -1
  ignore_statics:
    value: "True"
  label_propagation:
    value: 6
  label_propagation_right:
    value: 24
  learning_rate:
    distribution: log_uniform
    max: -7
    min: -9
  max_epochs:
    value: 100
  model:
    value: AttentionModel
  n_layers:
    value: 2
  norm:
    value: rezero
  task:
    value: classification
  weight_decay:
    values:
      - 0.1
      - 0.01
      - 0.001
      - 0.0001
program: src/torch/train_model.py

This can be directly copied into Weights & Biases to create a new sweep.
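
Alternatively, assuming the wandb CLI is installed and the configuration above is saved as sweep.yaml, the sweep can also be created from the command line:
>wandb sweep sweep.yaml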

Training a single dataset and model

Example command for training an attention model on MIMIC:

python src/torch/train_model.py --batch_size=16 --d_model=256 --dataset=MIMIC --dropout=0.5 --gpus=-1 --ignore_statics=True --label_propagation=6 --label_propagation_right=24 --learning_rate=0.0002 --max_epochs=100 --model=AttentionModel --n_layers=2 --norm=rezero --task=classification --weight_decay=0.001  

Evaluation pipeline

Shallow models + Baselines

>source scripts/eval_sklearn.sh <results_folder_name>

where the results folder refers to the output folder of the hyperparameter search. Make sure that the eval_sklearn script reads all the methods you wish to evaluate. This script assumes that repetitions are available.

Deep models

First, determine the best run of your sweep, which gives you a run-id. Then apply this model to all datasets:
>source scripts/wandb/submit_evals.sh run-id
Once this has completed, the prediction files can be processed in the patient-level evaluation:
>source scripts/eval_torch.sh run-id

For evaluating a repetition sweep, run (on Slurm):
>pipenv run python scripts/wandb/get_repetition_runs.py sweep-id1 sweep-id2 ..
and once completed, run (again on a CPU server):
>python scripts/wandb/get_repetition_evals.py sweep-id1 sweep-id2 ...

Results and plots

For gathering all repetition results, run:
>python -m scripts.plots.gather_data --input_path results/evaluation_validation/evaluation_output_subsampled --output_path results/evaluation_validation/plots/

For creating ROC plots, run:
>python scripts/plots/plot_roc.py --input_path results/evaluation/plots/result_data.csv

For creating precision/earliness plots, run:
>python -m scripts.plots.plot_scatterplots results/evaluation/plots/result_data.csv --r 0.80 --point-alpha 0.35 --line-alpha 1.0 --output results/evaluation/plots/

For the scatter data, in order to return all 50 measures (5 repetition splits × 10 subsamplings), set --aggregation micro.
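
For example, appended to the command above:
>python -m scripts.plots.plot_scatterplots results/evaluation/plots/result_data.csv --r 0.80 --point-alpha 0.35 --line-alpha 1.0 --output results/evaluation/plots/ --aggregation micro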

Pooled predictions

First, we need to create a mapping from experiments (data_train, data_eval, model, etc.) to the prediction files:
>python scripts/map_model_to_result_files.py <path_to_predictions> --output_path <output_json_path>
Use --overwrite to overwrite an existing mapping json.

Next we actually pool the predictions:
>source scripts/pool_predictions.sh

Then, we evaluate them:
>source scripts/eval_pooled.sh
To create plots with the pooled predictions, run:
>python -m scripts.plots.gather_data --input_path results/evaluation_test/prediction_pooled_subsampled/max/evaluation_output --output_path results/evaluation_test/prediction_pooled_subsampled/max/plots/
>python scripts/plots/plot_roc.py --input_path results/evaluation_test/prediction_pooled_subsampled/max/plots/result_data_subsampled.csv
For computing precision/earliness, run:
>python -m scripts.plots.plot_scatterplots results/evaluation_test/prediction_pooled_subsampled/max/plots/result_data_subsampled.csv --r 0.80 --point-alpha 0.35 --line-alpha 1.0 --output results/evaluation_test/prediction_pooled_subsampled/max/plots/
And for the heatmap including the pooled predictions:
>python -m scripts.make_heatmap results/evaluation_test/plots/roc_summary_subsampled.csv --pooled_path results/evaluation_test/prediction_pooled_subsampled/max/plots/roc_summary_subsampled.csv


Issues

Unable to access `bun_cr` feature in data frame

Running into a strange issue: the feature bun_cr shows up in the data frame columns, but I am unable to access the column afterwards. Posting the standard error I received:

Traceback (most recent call last):
  File "/local0/software/python/python3_bleeding_edge/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/local0/software/python/python3_bleeding_edge/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/rieckb/Projects/multicenter-sepsis/src/sklearn/analyse_distributions.py", line 220, in <module>
    X_train, X_val, X_test
  File "/home/rieckb/Projects/multicenter-sepsis/src/sklearn/analyse_distributions.py", line 119, in analyse_feature_distributions
    df[feature],
  File "/home/rieckb/.local/share/virtualenvs/multicenter-sepsis-SWAM21d5/lib/python3.6/site-packages/pandas/core/frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/rieckb/.local/share/virtualenvs/multicenter-sepsis-SWAM21d5/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2893, in get_loc
    raise KeyError(key) from err
KeyError: 'bun_cr'
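
A quick diagnostic sketch for this kind of mismatch (hypothetical; the path is illustrative and df stands for the data frame in question) is to inspect the column index for stray whitespace or duplicates:

import pandas as pd

# Illustrative path; point this at the data frame that raised the KeyError.
df = pd.read_parquet("datasets/downloads/preprocessed_mimic.parquet")

cols = pd.Index(df.columns)
print(cols[cols.str.contains("bun_cr")])  # exact spelling, hidden whitespace
print(cols[cols.duplicated()])            # duplicated column names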

Wavelet transforms

Looking at the code in src/sklearn/data/transformers.py, I could not find anything related to wavelet transforms. Didn't we agree to include something like this? @dplecko and I got some decent results out of this some time ago. Including this shouldn't be hard, using for example kymatio. Or was there some issue with utility score evaluation? As far as I remember @mi92 mentioned that some challenge projects were doing something similar, so they must have figured out evaluation.

Attention Model

Currently we use batch norm; possibly we can also use saved layer-norm statistics.

  • Test alternatives to LayerNorm:
    • ReZero is All You Need
    • Transformers without tears

Simple GRU Model

  • same train/val split
  • same features! (without look-back)

Adapt eval script outputs to agree upon format

Format for evaluation output json should be of the following form:

{
    "model": <modelname>,
    "model_path": <path to loaded model>,
    "model_checksum": <md5 of loaded model>,
    "model_params": <nested dict>,
    "dataset_train": <Name of dataset which was used to train classifier>,
    "dataset_eval": <Name of dataset on which this evaluation was run>,
    "split": <one of train, validation and test>,
    "average_precision": 0.0,
    "auroc": 0.0,
    "accuracy": 0.0,
    "balanced_accuracy": 0.0,
    "physionet2019_score": 0.0,
    "labels": [[patient1], [...]],
    "prediction": [[patient1], [...]],
    "scores": [[patient1], [...]],
    "ids": [1, 2, 3, 4]
}

There is a difference in performance between the masked evaluation and non-masked evaluation

This indicates that we have some degree of future information leakage. Output from training an attention model for 2 epochs on PreprocessedPhysionet2019:

TEST RESULTS
{'val_auroc': 0.9071204701741901,
 'val_average_precision': 0.11100927963326301,
 'val_balanced_accuracy': 0.8131479410920233,
 'val_loss': 1.044224500656128,
 'val_physionet2019_score': tensor(0.4154)}
--------------------------------------------------------------------------------
MASKED TEST RESULTS
{'auroc': 0.6951147134689225,
 'average_precision': 0.04438458271936486,
 'balanced_accuracy': 0.6281635931258406,
 'physionet2019_score': -0.3737012987012949}

Sklearn model failing on challenge dataset

Traceback (most recent call last):
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
    r = call_item()
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 608, in __call__
    return self.func(*args, **kwargs)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/joblib/parallel.py", line 256, in __call__
    for func, args, kwargs in self.items]
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/joblib/parallel.py", line 256, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 560, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 607, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/sklearn/metrics/_scorer.py", line 88, in __call__
    *args, **kwargs)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/sklearn/metrics/_scorer.py", line 213, in _score
    **self._kwargs)
  File "/net/bs-filesvr01/export/group/borgwardt/Projects/sepsis/multicenter-sepsis/src/evaluation/sklearn_utils.py", line 169, in __call__
    patient, patient_y, self.shift_onset_label)
  File "/net/bs-filesvr01/export/group/borgwardt/Projects/sepsis/multicenter-sepsis/src/evaluation/sklearn_utils.py", line 100, in shift_onset_label
    is_case = nanany(label)
  File "/net/bs-filesvr01/export/group/borgwardt/Projects/sepsis/multicenter-sepsis/src/evaluation/sklearn_utils.py", line 16, in nanany
    return np.any(array[~np.isnan(array)])
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Error when reloading model for final evaluation after training on EICU

Loading model with best physionet score...
Testing: 0it [00:00, ?it/s]/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:23: UserWarning: The dataloader, test dataloader 0, do
  warnings.warn(*args, **kwargs)
Traceback (most recent call last):
  File "/home/mimoor/.pyenv/versions/3.7.4/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/mimoor/.pyenv/versions/3.7.4/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/net/bs-filesvr01/export/group/borgwardt/Projects/sepsis/multicenter-sepsis/src/torch/train_model.py", line 78, in <module>
    main(hyperparam_draw, model_cls)
  File "/net/bs-filesvr01/export/group/borgwardt/Projects/sepsis/multicenter-sepsis/src/torch/train_model.py", line 54, in main
    trainer.test(loaded_model)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1064, in test
    self.fit(model)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in fit
    self.single_gpu_train(model)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 503, in single_gpu_train
    self.run_pretrain_routine(model)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 982, in run_pretrain_routine
    self.run_evaluation(test_mode=True)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 377, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 277, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 448, in evaluation_forward
    output = model.test_step(*args)
  File "/net/bs-filesvr01/export/group/borgwardt/Projects/sepsis/multicenter-sepsis/src/torch/models/base_model.py", line 139, in test_step
    return self._shared_eval(batch, batch_idx, prefix='val')
  File "/net/bs-filesvr01/export/group/borgwardt/Projects/sepsis/multicenter-sepsis/src/torch/models/base_model.py", line 64, in _shared_eval
    output = self.forward(data, lengths).squeeze(-1)
  File "/net/bs-filesvr01/export/group/borgwardt/Projects/sepsis/multicenter-sepsis/src/torch/models/attention_model.py", line 237, in forward
    for layer in self.layers:
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/net/bs-filesvr01/export/group/borgwardt/Projects/sepsis/multicenter-sepsis/src/torch/models/attention_model.py", line 102, in forward
    x = self.w_1(x).permute(2, 0, 1)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mimoor/.local/share/virtualenvs/multicenter-sepsis-6cH0FM1U/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size 256 106 1, expected input[32, 104, 125] to have 106 channels, but got 104 channels instead

Make the large feature set (of the shallow models) compatible with the deep models, to test whether the manual feature engineering is helpful there too

  • For this, ensure that imputation is handled properly (currently the shallow models use forward filling while the deep models use missingness indicators, so use one or the other; for this we also need to check whether we can simply drop the imputation step in the preprocessing pipeline or whether this causes problems with downstream transforms).

  • The preprocessed dataset should be able to load the small feature set (~47 features) as well as the large one (~700 features).

Evaluation script for sklearn pipeline

Should apply a pickled model to a dataset and save the results in a json file.
The output file should have a structure similar to that of the torch eval scripts' output.
