glucobench's Introduction

GlucoBench

The official implementation of the paper "GlucoBench: Curated List of Continuous Glucose Monitoring Datasets with Prediction Benchmarks." If you find our work interesting and plan to reuse the code, please cite us as:

@article{sergazinov2023glucobench,
  author  = {Renat Sergazinov and Valeriya Rogovchenko and Elizabeth Chun and Nathaniel Fernandes and Irina Gaynanova},
  title   = {GlucoBench: Curated List of Continuous Glucose Monitoring Datasets with Prediction Benchmarks},
  journal = {arXiv},
  year    = {2023},
}

Dependencies

We recommend setting up a clean Python environment with conda by running conda create -n glucobench python=3.10. Then install all dependencies by running pip install -r requirements.txt.

To run the Latent ODE model, install torchdiffeq.
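
For reference, torchdiffeq is available on PyPI and can be installed into the same environment:

pip install torchdiffeq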

Code organization

The code is organized as follows:

  • bin/: training commands for all models
  • config/: configuration files for all datasets
  • data_formatter/
    • base.py: performs all pre-processing for all CGM datasets
  • exploratory_analysis/: notebooks with processing steps for pulling the data and converting to .csv files
  • lib/
    • gluformer/: model implementation
    • latent_ode/: model implementation
    • *.py: hyper-parameter tuning, training, validation, and testing scripts
  • output/: hyper-parameter optimization and testing logs
  • paper_results/: code for producing the tables and plots found in the paper
  • utils/: helper functions for model training and testing
  • raw_data.zip: web-pulled CGM data (processed using exploratory_analysis)
  • environment.yml: conda environment file

Data

The datasets are distributed under the licenses listed in the table below and can be downloaded from the corresponding sources.

Dataset   | License              | Number of patients | CGM frequency
Colas     | Creative Commons 4.0 | 208                | 5 minutes
Dubosson  | Creative Commons 4.0 | 9                  | 5 minutes
Hall      | Creative Commons 4.0 | 57                 | 5 minutes
Broll     | GPL-2                | 5                  | 5 minutes
Weinstock | Creative Commons 4.0 | 200                | 5 minutes

To process the data, follow the instructions in the exploratory_analysis/ folder. Processed datasets should be saved in the raw_data/ folder. We provide examples in the raw_data.zip file.
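
As a minimal sketch, the provided examples can be unpacked with Python's standard library (the internal layout of the archive is not documented here, so the target directory may need adjusting):

import zipfile

# Unpack the example CGM files shipped with the repository.
# Adjust the target path if the archive already contains a top-level raw_data/ folder.
with zipfile.ZipFile('raw_data.zip') as archive:
    archive.extractall('.')
    print(archive.namelist()[:5])  # peek at the first few archived paths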

How to reproduce results?

Setting up the environment

We recommend setting up a clean Python environment with conda and installing all dependencies by running conda env create -f environment.yml. Then activate the environment by running conda activate glunet.

Changing the configs

The config/ folder stores the best hyper-parameters (selected by Optuna) for each dataset and model, as well as the dataset-specific parameters for interpolation, dropping, splitting, and scaling. To train and evaluate a model with these defaults, we can simply run:

python ./lib/model.py --dataset dataset --use_covs False --optuna False
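
Here, model and dataset are placeholders for a model script in lib/ and a dataset name from the table above. For example (the script name nhits.py is an assumption about how the per-model scripts are named; check lib/ for the actual file names):

python ./lib/nhits.py --dataset weinstock --use_covs False --optuna False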

Changing the hyper-parameters

To change the search grid for hyper-parameters, we need to modify the ./lib/model.py file. Specifically, we look at the objective() function and modify the trial.suggest_* parameters to set the desired ranges. Once we are done, we can run the following command to re-run the hyper-parameter optimization:

python ./lib/model.py --dataset dataset --use_covs False --optuna True
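
As an illustrative sketch of the objective() pattern (the hyper-parameter names, ranges, and training helper below are assumptions, not the repository's actual code):

import optuna

def train_and_validate(in_len: int, lr: float, batch_size: int) -> float:
    # Stand-in for the actual training/validation routine in ./lib/<model>.py;
    # here it just returns a dummy score so the sketch runs end to end.
    return lr * in_len / batch_size

def objective(trial: optuna.Trial) -> float:
    # Example search space built with trial.suggest_*; the real objective()
    # defines model-specific parameters and ranges.
    in_len = trial.suggest_int('in_len', 96, 192, step=12)
    lr = trial.suggest_float('lr', 1e-4, 1e-2, log=True)
    batch_size = trial.suggest_categorical('batch_size', [32, 48, 64])
    # Train with the sampled values and return the validation loss to minimize.
    return train_and_validate(in_len=in_len, lr=lr, batch_size=batch_size)

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)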

How to work with the repository?

We provide a detailed example of the workflow in the example.ipynb notebook. For clarification, we provide some general suggestions below in order of increasing complexity.

Just the data

To start experimenting with the data, we can run the following snippet:

import yaml
from data_formatter.base import DataFormatter

# `dataset` is the name of one of the config files in ./config/
# (corresponding to the datasets listed in the table above)
with open(f'./config/{dataset}.yaml', 'r') as f:
    config = yaml.safe_load(f)
formatter = DataFormatter(config)

This snippet creates a DataFormatter object, which automatically pre-processes the data upon initialization. The pre-processing steps can be controlled via the config/ files. The DataFormatter object exposes the following attributes:

  1. formatter.train_data: training data (as pandas.DataFrame)
  2. formatter.val_data: validation data
  3. formatter.test_data: testing data, both in-distribution and out-of-distribution (see the snippet after this list)
     i. formatter.test_data.loc[~formatter.test_data.index.isin(formatter.test_idx_ood)]: in-distribution testing data
     ii. formatter.test_data.loc[formatter.test_data.index.isin(formatter.test_idx_ood)]: out-of-distribution testing data
  4. formatter.data: unscaled full data
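
For example, the in-distribution and out-of-distribution test splits can be separated using only the attributes above:

# Split the test set into in-distribution (ID) and out-of-distribution (OOD) parts.
ood_mask = formatter.test_data.index.isin(formatter.test_idx_ood)
test_id = formatter.test_data.loc[~ood_mask]   # in-distribution test data
test_ood = formatter.test_data.loc[ood_mask]   # out-of-distribution test data
print(len(formatter.train_data), len(formatter.val_data), len(test_id), len(test_ood))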

Integration with PyTorch

Training models with PyTorch typically boils down to (1) defining a Dataset class with __getitem__() method, (2) wrapping it into a DataLoader, (3) defining a torch.nn.Module class with forward() method that implements the model, and (4) optimizing the model with torch.optim in a training loop.

Parts (1) and (2) crucially depend on the definition of the Dataset class. Essentially, given the data in table format (e.g. formatter.train_data), how do we sample input-output pairs and pass the covariate information? The Dataset classes in utils/darts_dataset.py, adapted from the Darts library, offer one way to wrap the data. The classes differ in what information is provided to the model:

  1. SamplingDatasetPast: supports only past covariates
  2. SamplingDatasetDual: supports only future-known covariates
  3. SamplingDatasetMixed: supports both past and future-known covariates

Below we give an example of loading the data and wrapping it into a Dataset:

from utils.darts_processing import load_data
from utils.darts_dataset import SamplingDatasetDual

formatter, series, scalers = load_data(seed=0,
                                       dataset=dataset,  # dataset name, as in the config files
                                       use_covs=True,
                                       cov_type='dual',
                                       use_static_covs=True)
# in_len, out_len, and max_samples_per_ts are user-chosen integers
# (input/output window lengths and the number of samples drawn per series)
dataset_train = SamplingDatasetDual(series['train']['target'],
                                    series['train']['future'],
                                    output_chunk_length=out_len,
                                    input_chunk_length=in_len,
                                    use_static_covariates=True,
                                    max_samples_per_ts=max_samples_per_ts)

Parts (3) and (4) are model-specific, so we omit their discussion. For inspiration, we suggest taking a look at the lib/gluformer/model.py and lib/latent_ode/trainer_glunet.py files.
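
As a minimal, generic sketch of parts (2) and (4) (the structure of the sampled batches depends on the concrete Dataset class and model, so `model` and `compute_loss` below are hypothetical placeholders):

import torch
from torch.utils.data import DataLoader

# Part (2): wrap the sampling dataset in a standard PyTorch DataLoader.
loader = DataLoader(dataset_train, batch_size=32, shuffle=True)

# Part (4): the usual optimization loop, shown schematically. `model` is a
# torch.nn.Module implementing the forecaster and `compute_loss` is a
# hypothetical helper mapping one batch to a scalar loss.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(10):
    for batch in loader:
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        optimizer.step()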

glucobench's People

Contributors

acutinha, irinagain, lylchun, mrsergazinov, nathaniel-fernandes, nckasman, rvaleriya, ushrestha57


glucobench's Issues

Environment creation problem and error in the example.ipynb

  1. I attempted to create an environment using the recommended command conda env create -f environment.yml as outlined in your repository's README file (tried on both Windows and Linux). However, I encountered the following errors:
    [screenshot of the error messages omitted]

Note that the command recommended in your README exports the current conda environment's configuration to a YAML file named environment.yml rather than installing the environment's dependencies. Also, the darts library is missing from the dependencies.

  2. Upon creating my own environment and attempting to execute example.ipynb, I encountered dimensionality issues while running cell 23. Specifically, the error occurred at the following line:
    model.fit_from_dataset(dataset_train, verbose=False)
    See details:

ValueError                                Traceback (most recent call last)
Cell In[31], line 18
      3 model = NHiTSModel(input_chunk_length=formatter.params['nhits']['in_len'],
      4                    output_chunk_length=formatter.params['length_pred'],
      5                    num_stacks=3,
        (...)
     15                    batch_size=formatter.params['nhits']['batch_size'],
     16                    pl_trainer_kwargs={"accelerator": "gpu", "devices": [0]})
     17 # fit model
---> 18 model.fit_from_dataset(dataset_train, verbose=False)

File c:\Users\Maria\.conda\envs\llmtime\lib\site-packages\darts\utils\torch.py:112, in random_method.<locals>.decorator(self, *args, **kwargs)
    110 with fork_rng():
    111     manual_seed(self._random_instance.randint(0, high=MAX_TORCH_SEED_VALUE))
--> 112 return decorated(self, *args, **kwargs)

File c:\Users\Maria\.conda\envs\llmtime\lib\site-packages\darts\models\forecasting\torch_forecasting_model.py:929, in TorchForecastingModel.fit_from_dataset(self, train_dataset, val_dataset, trainer, verbose, epochs, num_loader_workers)
    877 @random_method
    878 def fit_from_dataset(
    879     self,
        (...)
    885     num_loader_workers: int = 0,
    886 ) -> "TorchForecastingModel":
    887     """
...
    573     else past_target,
    574     static_covariates,
    575 )

ValueError: not enough values to unpack (expected 3, got 2)

This issue also persists when attempting to use the Transformer and TFT models.

Questions:
1) Is it possible you could provide the packages via an alternative method, e.g., a requirements.txt?
2) Do you have any ideas about the problem I encountered and how to resolve it?
