
ML4H

ML4H is a toolkit for machine learning on clinical data of all kinds, including genetics, labs, imaging, clinical notes, and more. The diverse data modalities of biomedicine offer different perspectives on the underlying challenge of understanding human health. For this reason, ML4H is built on a foundation of multimodal, multitask modeling, hoping to leverage all available data to help power research and inform clinical care. Our tools help apply clinical research standards to ML models by carefully considering bias and longitudinal outcomes. Our project grew out of efforts at the Broad Institute to make it easy to work with the UK Biobank on the Google Cloud Platform, and has since expanded to include proprietary data from academic medical centers. To put cutting-edge AI and ML to use making the world healthier, we're fostering interdisciplinary collaborations across industry and academia. We'd love to work with you too!

ML4H is best described with Five Verbs: Ingest, Tensorize, TensorMap, Model, Evaluate

  • Ingest: collect files onto one system
  • Tensorize: write raw files (XML, DICOM, NIFTI, PNG) into HD5 files
  • TensorMap: tag data (typically from an HD5) with an interpretation and a method for generation
  • ModelFactory: connect TensorMaps with a trainable neural network architecture, loss function, and optimization strategy
  • Evaluate: generate plots that enable domain-driven inspection of models and results
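
As a purely illustrative sketch of how the middle verbs fit together (the class and field names below echo the concepts above but are simplified stand-ins, not the real ml4h API):

from dataclasses import dataclass
from typing import Callable, Tuple

import numpy as np

@dataclass
class ToyTensorMap:
    # Toy stand-in for ml4h's TensorMap: an interpretation plus a recipe for generation.
    name: str
    shape: Tuple[int, ...]
    interpretation: str
    tensor_from_file: Callable

def ecg_from_hd5(hd5):
    # "Generation": read one tensor out of a tensorized HD5 file.
    return np.asarray(hd5['ecg_rest'], dtype=np.float32)

ecg_in = ToyTensorMap('ecg_rest', (5000, 12), 'continuous', ecg_from_hd5)
af_out = ToyTensorMap('atrial_fibrillation', (2,), 'categorical',
                      lambda hd5: np.asarray(hd5['af_label'], dtype=np.float32))

# ModelFactory would connect ecg_in -> af_out with an architecture, loss, and
# optimizer; Evaluate would then plot the trained model's predictions.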

Getting Started

Advanced Topics:

  • Tensorizing Data (going from raw data to arrays suitable for modeling; see ml4h/tensorize/README.md and TENSORIZE.md)

Setting up your local environment

Clone the repo to your home directory:

cd ~
git clone https://github.com/broadinstitute/ml4h.git

Run the CPU docker (this step does not work on Apple silicon). The first time you do this, the docker image will need to download, which takes a while:

docker run -v ${HOME}:/home/ -it ghcr.io/broadinstitute/ml4h:tf2.9-latest-cpu

Then once inside the docker try to run the tests (again, not on Apple silicon):

python -m pytest /home/ml4h/tests/test_recipes.py

If the tests pass (ignoring warnings), you're off to the races! Next, connect to some tensorized data, or check out the introductory Jupyter notebooks.

Setting up your cloud environment (optional; currently only GCP is supported)

Make sure you have installed the Google Cloud SDK (gcloud). With Homebrew, you can use

brew install --cask google-cloud-sdk

Make sure you have configured your development environment. In particular, you will probably have to complete the steps to prepare the Google Cloud CLI and enable the required Google services.

Setting up a remote VM

To create a VM without a GPU run:

./scripts/vm_launch/launch_instance.sh ${USER}-cpu

With GPU (not recommended unless you need something beefy and expensive):

./scripts/vm_launch/launch_dl_instance.sh ${USER}-gpu

This will take a few moments to run, after which you will have a VM in the cloud. Remember to shut it off from the command line or console when you are not using it!

Now SSH onto your instance (replace with the proper machine and project names; note that you can also use regular old ssh if you have the external IP provided by the script, or if you log in from the GCP console):

gcloud --project your-gcp-project compute ssh ${USER}-gpu --zone us-central1-a

Next, clone this repo on your instance (you should copy your GitHub key over to the VM; if you have two-factor authentication set up, you need to generate an SSH key on your VM and add it to your GitHub settings as described here):

git clone git@github.com:broadinstitute/ml4h.git

Because we don't know everyone's username, you need to run one more script to make sure that you are added as a docker user and that you have permission to pull down our docker instances from GCP's gcr.io. Run this while you're logged into your VM:

./ml4h/scripts/vm_launch/run_once.sh

Note that you may see warnings like the ones below, but these are expected:

WARNING: Unable to execute `docker version`: exit status 1
This is expected if `docker` is not installed, or if `dockerd` cannot be reached...
Configuring docker-credential-gcr as a registry-specific credential helper. This is only supported by Docker client versions 1.13+
/home/username/.docker/config.json configured to use this credential helper for GCR registries

You need to log out after that (exit) then ssh back in so everything takes effect.

Finish setting up docker, test out a jupyter notebook

Now let's run a Jupyter notebook. On your VM run:

${HOME}/ml4h/scripts/jupyter.sh

Add a -c if you want a CPU version.

This will start a notebook server on your VM. If you see a Docker error like

docker: Error response from daemon: driver failed programming external connectivity on endpoint agitated_joliot (1fa914cb1fe9530f6599092c655b7036c2f9c5b362aa0438711cb2c405f3f354): Bind for 0.0.0.0:8888 failed: port is already allocated.

override the default port (8888) like so:

${HOME}/ml4h/scripts/jupyter.sh -p 8889

The command also outputs two command lines in red. Copy the line that looks like this:

gcloud compute ssh ${USER}@${USER}-gpu -- -NnT -L 8889:localhost:8889

Open a terminal on your local machine and paste that command.

If you get a public key error, run: gcloud compute config-ssh

Now open a browser on your laptop and go to http://localhost:8888 (or whichever port you forwarded, e.g. 8889).

Set up VS Code to connect to the GCP VM (which makes your coding much easier)

Step 1: install VS Code.

Step 2: configure the SSH key: gcloud compute config-ssh --project "broad-ml4cvd"

Step 3: install the Remote-SSH extension in VS Code.

Step 4: connect to the VM by pressing F1, typing "Remote-SSH: Connect to Host...", and selecting the VM you want to connect to (e.g. dianbo-dl.us-central1-a.broad-ml4cvd).

Step 5: open the folder you want to work on in the VM, type in your Broad password, and you are good to go!

Contributing code

Want to contribute code to this project? Please see CONTRIBUTING for developer setup and other details.

Releases

Ideally, each release should be available on our GitHub releases page. In addition, the version number in setup.py should be incremented, and the pip-installable ml4h package on PyPI should be updated.

If the release changed the Docker image, both new images (CPU & GPU) should update the "latest" tag and be pushed with appropriate tags (e.g. tf2.9-latest-gpu for the latest GPU image, tf2.9-latest-cpu for the CPU) to both GCR (gcr.io/broad-ml4cvd/deeplearning) and the ml4h GitHub container repo (ghcr.io/broadinstitute/ml4h).

Command line interface

The ml4h package is designed to be accessible through the command line using "recipes". To get started, please see RECIPE_EXAMPLES.


ml4h's Issues

GitHub Issues template

What
Pre-populate our issues with @christopherreeder's template:

**What**
Summarize the issue in 1-2 sentences.

**Why**
Describe why this issue should be solved, new feature implemented, etc.

**How**
High-level overview of how you propose to address.

**Acceptance Criteria**
Unambiguous milestones; if any are incomplete, the PR cannot be merged.

so meta

Why
To save us time when we make new issues, and improve adherence to the nice template.

How
Instructions: https://help.github.com/en/github/building-a-strong-community/configuring-issue-templates-for-your-repository

I would make one but do not have sufficient permissions for this repo.

If temporarily elevated (@lucidtronix @gpbatra), I'd be glad to set it up.

Acceptance Criteria
This repo has a GitHub Issues template that is a selectable option for pre-populating new issues.

Manipulate and tensorize Partners ECG XML files

What
Code to manipulate and tensorize Partners ECG XML files currently lives in a different repo: https://github.com/mit-ccrg/partners-ecg

Why
These are capabilities that should exist within ML4CVD.

How
Move scripts related to XMLs from original repo to ml/ingest.
Move scripts related to tensorization to ml/ml4cvd/tensorize/.

Acceptance Criteria
Scripts to manipulate and tensorize Partners ECG XML files live inside of ML4CVD, and maintain original functionality.

macOS keeps correcting tensorization to "tenderization" but I am OK with that.

make_multimodal refactor

It's hard to update the interface to make_multimodal_to_multilabel_model because it's used in so many places and has so many arguments.
It should have default arguments for some of its parameters, and maybe be called via

make_multimodal_to_multilabel_model(**args)

If we follow the **args strategy, we could add **kwargs to the signature of make_multimodal_to_multilabel_model, which would absorb any extra arguments so we could do minimal processing on the command-line args before passing them. The downside of this strategy is that it would sometimes fail quietly: e.g. misspelling an argument with a default would not give an error; it would just use the default value.
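
To illustrate the quiet-failure downside (toy signature, not the real one):

def make_multimodal_to_multilabel_model(learning_rate=1e-3, dropout=0.5, **kwargs):
    # **kwargs silently absorbs anything that doesn't match a named parameter.
    print(f'learning_rate={learning_rate}, dropout={dropout}')

args = {'learning_rate': 1e-4, 'droput': 0.1}  # note the typo in 'dropout'
make_multimodal_to_multilabel_model(**args)
# prints learning_rate=0.0001, dropout=0.5 -- the misspelled 'droput' lands in
# **kwargs without an error, so the default dropout is quietly used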

Phecode_mapping -- include date.

How do we choose which date to include in phecode_mapping output table?

Admidate, operdate, etc?

Should we have a best value, or include all dates?

Should we create an ordered disease progression table, so multiple phecodes don't appear over the years if a patient is assigned a phecode early on?

We should pick one protocol and stick with it, including MPG mappings.

Speed up training

The majority of training time is spent processing the data with CPUs.
In the image below blue is the fitting function, mauve is the GPU fitting, and green is processing the data.

[screenshot: profiler timeline of a training run]

This was found with the following command line:

./scripts/tf.sh /home/ndiamant/ml/ml4cvd/recipes.py --mode train --tensors /mnt/disks/ecg-bike-tensors/2019-10-10/ --input_tensors ecg-bike-pretest --output_tensors ecg-bike-new-hrr --batch_size 64 --epochs 5 --training_steps 5 --validation_steps 2 --inspect_models

This video from TensorFlow suggests some TensorFlow-specific ideas.

By adding use_multiprocessing=True and workers=16 I've been able to get a big increase in data processing speed. That suggests a potential way forward of increasing the number of CPUs on our machines, which is relatively cheap.
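
For reference, a minimal self-contained sketch of that change (TF 2.x, where fit() still accepts workers and use_multiprocessing; the dummy Sequence stands in for an ml4h TensorGenerator):

import numpy as np
import tensorflow as tf

class DummySequence(tf.keras.utils.Sequence):
    # Stand-in for a TensorGenerator whose __getitem__ does CPU-bound work.
    def __len__(self):
        return 5

    def __getitem__(self, idx):
        x = np.random.rand(64, 100).astype(np.float32)
        y = np.random.rand(64, 1).astype(np.float32)
        return x, y

if __name__ == '__main__':  # guard matters once multiprocessing workers spawn
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(100,))])
    model.compile(optimizer='adam', loss='mse')
    model.fit(DummySequence(), epochs=2, workers=16, use_multiprocessing=True)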

If you have fairly fixed tensor maps, it might be worth saving preprocessing results to a csv or something like that, but that seems like a waste of developer time to figure out.

Hyperoptimize over tmaps

What
Optimize which augmentations, sizes, and covariates to use to minimize loss.
Specify

  • covariates, any subset of which can be used
  • necessary tmaps, all of which must be used
  • choice tmaps, one of which must be used

Why
We don't have a clean way of optimizing over augmentations / shapes / covariates.

How
Helper functions in hyperparameters.py that allow you to specify the above.

Acceptance Criteria
Easy to understand/use optimization over input tmaps

1 Python package

Currently tensorize is a package separate from ml4cvd. This will lead to code duplication. Ideally, all this code will be packaged into one pip-installable Python package: tensorize should be integrated into ml4cvd, or, if there is a reason this is not possible, both packages should be set up as pip-installable Python packages.

user and group of results directory (specified by --output_folder) are both root

This does not break anything, but does complicate my mediocre workflow where I save results in a folder synced via Dropbox. So it is a matter of convenience.

I do this because viewing results, sharing them with collaborators / advisors, formatting figures into manuscripts, etc. is easier on macOS than a Linux terminal with no GUI.

I presume the team has an efficient way of viewing results and moving them from GCP to emails, Keynote slides, etc. Would be good to hear how it is done now.

Assess deduplication strategy

  • Ask GE to explain fields DBList and Workflow_ID. May relate to import process.
  • Assess if keeping only first example of a (patientID, AcquisitionTime) loses any patients.
  • Manually inspect 25 duplicates and all associated data, to check if we are throwing out data.

Intelligent picking of defaults for model structure

What
Option to guess depth, number of channels etc. based on input and output shapes (like EfficientNet)

Why
Makes first attempt at any task better, makes the code more useful for newcomers.

How
Not sure. Probably first attempt would be an EfficientNet implementation with u_connect for segmentation and autoencoding.

Acceptance Criteria

  • Option to pick smart hyperparameters
  • The smart option performs well on a variety of tasks: Maybe HRR prediction, age regression from brain MRI, C-MRI segmentation, ECG rhythm classification

Expand the capture-net of phecode mapping

Current phecode mapping only uses icd10_diag from the main hesin table. There are secondary codes in other hesin tables: icd_oper, icd9, etc. Those should also be mapped, but the interpretation will be sensitive.

Let's keep an eye on this and decide when to triage once we know more about how we want to structure phenotypes.

Validation Plots A B C D framework

What
Evaluation plots should include the ABCD framework proposed in Towards better clinical prediction models: seven steps for development and an ABCD for validation. The ABCDs are: calibration-in-the-large, or the model intercept (A); calibration slope (B); discrimination, with a concordance statistic (C); and clinical usefulness, with decision-curve analysis (D).

Why
Gives several perspectives on model validity and using previously published framework by an external group is easy to defend. We currently get very little insight into model calibration and do no decision curve analysis.

How
This can probably be separated into 2 or more PRs, hopefully committed by different folks. We basically have C already (ROC curve and C statistic).

Acceptance Criteria
train and test modes generate calibration and decision-curve plots (see the two screenshots attached to this issue).

Multiple output tensors

If N > 1 output tensors have the same name, only the first one is used and an obscure error is thrown ("you are using N loss functions but your model has only a single output").

To fix, we should check for duplicate output TensorMap names.

Also, we should check whether this bug applies to input TensorMaps.
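
A sketch of such a check (hypothetical helper; the real fix would run wherever the output TensorMaps are collected):

from collections import Counter

def check_unique_tmap_names(tensor_maps):
    # Fail fast with a clear message instead of Keras' obscure loss-count error.
    counts = Counter(tm.name for tm in tensor_maps)
    duplicates = [name for name, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(f'Duplicate TensorMap names: {duplicates}')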

Cannot run >1 instance of recipes on prem because second GPU not utilized

Title is speculative and reflects my hypothesis.

I am currently running explore on 2.6M ECGs. Even though no training is performed, almost all of the memory of one of my GPUs is in use:

er498@mithril > nvidia-smi
Mon Mar  2 21:23:23 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   37C    P8    19W / 250W |  10600MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:65:00.0 Off |                  N/A |
|  0%   38C    P8     1W / 250W |    272MiB / 11018MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     26811      C   python                                     10589MiB |
|    1      1567      G   /usr/lib/xorg/Xorg                            14MiB |
|    1      1661      G   /usr/bin/gnome-shell                          10MiB |
|    1     26811      C   python                                       235MiB |
+-----------------------------------------------------------------------------+

When I try to run a second instance of explore before the first run completes, I get a CUDA_ERROR_OUT_OF_MEMORY error, which suggests ML4CVD tries to use the first GPU and does not utilize the second GPU although it is recognized:

er498@mithril > sh run_explore.sh
mkdir: cannot create directory ‘/mnt/ml4cvd’: Permission denied
Attempting to run Docker with
    docker run -it
        --rm
        --ipc=host
        -v /home/er498/jupyter/root/:/root/
        -v /home/er498/:/home/er498/
        -v /mnt/:/mnt/
        gcr.io/broad-ml4cvd/deeplearning:tf2-latest-gpu python /home/er498/repos/ml/ml4cvd/recipes.py --mode explore --tensors /data/partners_ecg/hd5_subset --input_tensors partners_ecg_patientid partners_ecg_date partners_ecg_dob partners_ecg_read_md_raw partners_ecg_read_pc_raw partners_ecg_rate partners_ecg_qrs partners_ecg_pr partners_ecg_qt partners_ecg_qtc --test_modulo 0 --output_folder /home/er498/ml4cvd_results/ --id explore_partners_ecg_subset
Processing /home/er498/repos/ml
Building wheels for collected packages: ml4cvd
  Building wheel for ml4cvd (setup.py) ... done
  Created wheel for ml4cvd: filename=ml4cvd-0.0.1-py3-none-any.whl size=403522 sha256=736f1b2fa148fab99dfb4397c1cf3561ecaa3c260dc6807b304b642d81702cc9
  Stored in directory: /tmp/pip-ephem-wheel-cache-0_dzl_rq/wheels/9c/5b/fa/03f47092853802b5352de00dc549ae7baf4101b7e30db46407
Successfully built ml4cvd
Installing collected packages: ml4cvd
Successfully installed ml4cvd-0.0.1
2020-03-02 21:21:56.021050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-03-02 21:21:56.022306: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-03-02 21:21:57.425267: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-03-02 21:21:57.440138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-03-02 21:21:57.441509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-03-02 21:21:57.441538: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-02 21:21:57.441567: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-02 21:21:57.442968: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-02 21:21:57.443219: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-02 21:21:57.444728: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-02 21:21:57.445622: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-02 21:21:57.445659: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-02 21:21:57.450824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-03-02 21:21:57.451042: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-03-02 21:21:57.483917: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3099995000 Hz
2020-03-02 21:21:57.486154: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5a03ee0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-02 21:21:57.486188: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-03-02 21:21:57.841317: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5985690 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-03-02 21:21:57.841357: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-03-02 21:21:57.841369: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-03-02 21:21:57.842855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-03-02 21:21:57.844243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-03-02 21:21:57.844309: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-02 21:21:57.844333: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-02 21:21:57.844432: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-02 21:21:57.844482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-02 21:21:57.844532: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-02 21:21:57.844581: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-02 21:21:57.844613: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-02 21:21:57.848874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-03-02 21:21:57.848949: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-02 21:21:58.258708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-02 21:21:58.258732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 1
2020-03-02 21:21:58.258738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N N
2020-03-02 21:21:58.258742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1:   N N
2020-03-02 21:21:58.260170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5)
2020-03-02 21:21:58.261248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 9986 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2020-03-02 21:21:58.264000: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 185.62M (194641920 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Traceback (most recent call last):
  File "/home/er498/repos/ml/ml4cvd/recipes.py", line 17, in <module>
    from ml4cvd.arguments import parse_args
  File "/usr/local/lib/python3.6/dist-packages/ml4cvd/arguments.py", line 24, in <module>
    from ml4cvd.tensor_maps_by_hand import TMAPS
  File "/usr/local/lib/python3.6/dist-packages/ml4cvd/tensor_maps_by_hand.py", line 1, in <module>
    from ml4cvd.tensor_from_file import normalized_first_date, TMAPS
  File "/usr/local/lib/python3.6/dist-packages/ml4cvd/tensor_from_file.py", line 478, in <module>
    loss=weighted_crossentropy(np.array(_get_lead_cm(32)[1]), 'ecg_median_categorical'))
  File "/usr/local/lib/python3.6/dist-packages/ml4cvd/metrics.py", line 33, in weighted_crossentropy
    exec(string_globe, globals(), locals())
  File "<string>", line 4, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 814, in variable
    constraint=constraint)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 260, in __call__
    return cls._variable_v2_call(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 254, in _variable_v2_call
    shape=shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 235, in <lambda>
    previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 2645, in default_variable_creator_v2
    shape=shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 262, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1411, in __init__
    distribute_strategy=distribute_strategy)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1557, in _init_from_args
    graph_mode=self._in_graph_mode)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 232, in eager_safe_variable_handle
    shape, dtype, shared_name, name, graph_mode, initial_value)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 164, in _variable_handle_from_shape_and_dtype
    math_ops.logical_not(exists), [exists], name="EagerVariableNameReuse")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_logging_ops.py", line 55, in _assert
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [0] [Op:Assert] name: EagerVariableNameReuse

Args attached:
arguments_2020-03-02_17-34.txt

Expected behavior when running ML4CVD on prem is to use other GPUs that have available memory.

@paolodi: earlier, to use the other GPU so we could train two models at the same time on Mithril, we had to modify tf.sh.

If the current codebase does not support more than one GPU, this could be a good issue to tackle in a PR that extends ML4CVD to on-prem hardware.
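
One plausible workaround (an assumption on my part, not something the current codebase does) is to pin each process to a free GPU before TensorFlow initializes:

import os

# Pin this process to the second GPU; a concurrent run would use '0'.
# Must be set before TensorFlow initializes CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import tensorflow as tf

# Also avoid grabbing all of the visible device's memory up front.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)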

Hyperparameter optimization crashes after 24 hours

Long-running issue with hyperparameter optimization: models are not garbage collected, so long runs tend to slow down and stall, typically after 24 hours. We've tried several things to fix this with no luck so far.

Currently we have:

finally:
    del model
    gc.collect()

but it doesn't seem to help. Previously we tried things like:

import logging
from keras import backend as K  # TF1-era Keras, where K.tf exposes TensorFlow

def limit_mem():
    try:
        K.clear_session()
        cfg = K.tf.ConfigProto()
        cfg.gpu_options.allow_growth = True
        K.set_session(K.tf.Session(config=cfg))
    except AttributeError:
        logging.exception('Could not clear session. Maybe you are using Theano backend?')
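
Another avenue worth trying (a hedged suggestion, not yet verified on this codebase) is clearing the Keras session between trials with the TF2 API:

import gc

import tensorflow as tf

def run_trial(build_and_train):
    # Build, train, and score one hyperparameter configuration, then drop the
    # global graph state that Keras accumulates across models.
    model, score = build_and_train()
    del model
    tf.keras.backend.clear_session()
    gc.collect()
    return score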

Make Encoders and Decoders block agnostic

What
Right now ConvEncoder and ConvDecoder build DenseBlocks. Instead they should get passed a list of blocks and chain them together.

Why
Allows more flexible swapping of the many block types we want to experiment with, e.g. attention convolution, efficient block, etc.

How

  • Make block factory functions
  • Edit the ConvEncoder and ConvDecoder classes

Acceptance Criteria

  • Block types can be easily swapped from command line
  • Old behavior maintained or improved upon with default args

Explore mode for recipes

What
Implement Explore mode in recipes.py that provides summary statistics for specified input tensors of specified HD5 files.

Why
Understanding basic information of one's data is a vital first step before training models.

How
Iterates through three tmap interpretations (categorical, continuous, and language). For each interpretation, all user-specified input tensor maps that match that type are extracted from HD5 files into a Pandas DataFrame, from which summary statistics are calculated and saved as CSV files.

At the end, all input tensors are concatenated into a large dataframe and saved to a CSV file. Each row is a patient. Columns are: 1) tensor maps (or, if the tmap has channels, a tuple of (tmap, cm)), 2) errors (if any are thrown during opening the tensor), and 3) full path to the tensor on disk. This large CSV file will be ingested into a database for future queries.

Acceptance Criteria
Summary statistics and a big CSV file with all tensors are generated.
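
A toy pandas sketch of that flow (hypothetical column names; the real implementation extracts TensorMaps from HD5 files):

import pandas as pd

# One row per patient: tensor values, any read error, and the source path.
rows = [
    {'age': 61.0, 'sex': 'female', 'error': None, 'path': '/data/t1.hd5'},
    {'age': 48.0, 'sex': 'male', 'error': 'KeyError', 'path': '/data/t2.hd5'},
]
df = pd.DataFrame(rows)

# Continuous tmaps get describe(); categorical tmaps get value counts.
df['age'].describe().to_csv('continuous_summary.csv')
df['sex'].value_counts().to_csv('categorical_summary.csv')
df.to_csv('all_tensors.csv', index=False)  # the big CSV for future queries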

Memory allocation error when starting test workers during training

What
Remove the reliance of train_multimodal_multitask on big_batch_from_minibatch_generator

Why
In the train recipe, test workers use big_batch_from_minibatch_generator which frequently leads to a memory allocation error.

How
Write an alternative to big_batch_from_minibatch_generator that does not gather minibatches into larger batches but still produces inputs, outputs, and paths in the format expected by _predict_and_evaluate; see the sketch after the acceptance criteria.

Acceptance Criteria
Memory allocation errors caused by big_batch_from_minibatch_generator no longer occur during training runs
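
A minimal sketch of that alternative (hypothetical name; assumes the generator yields (inputs, outputs, paths) tuples as described above):

def minibatches_for_evaluation(generator, steps):
    # Stream minibatches straight to evaluation instead of concatenating them
    # into one big batch, so peak memory stays at a single batch's size.
    for _ in range(steps):
        inputs, outputs, paths = next(generator)
        yield inputs, outputs, paths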

Keep track of fields used for tensorization

Currently we store the list of field ids used for tensorizing categorical and continuous fields in BigQuery's shared_data.tensorization_fieldids table. It was created from a csv file in GCS (gs://ml4cvd/data/fieldids.csv). This helps eliminate fetching of phenotype table rows we don't care about. However, when we need to add new fields to be tensorized, directly updating the table would make it difficult to audit, especially if we run into tensorization issues due to it.

One way we could keep field-adding easy while still having a log of what changed is to check the file into the repo and have the tensorization pipeline re-create the tensorization_fieldids table from it.

GCS Client usage in Dataflow pipelines

Dataflow pipeline runs write tensors to GCS buckets using GCS' Python client. From Stackdriver logs, it appears a single client ends up being used by all Dataflow workers. This can be problematic. Investigate whether we can have one client per 'task' (unit of work Dataflow sends to a worker at once) and/or assess how risky having a single client for the entire run would be in the future.

Tensorization Pipeline

Our current tensorization process involves 4 main steps:

  1. Run Dataflow jobs to tensorize data from the BigQuery database. This means 6 separate Dataflow runs, one for each of the fields: ['categorical', 'continuous', 'icd', 'disease', 'death', 'phecode_disease']

  2. Tensorize bulk data with tensor_writer_ukbb.py. This file has code to tensorize abdominal, cardiac, and brain MRI as well as resting and exercise ECG.

  3. Merge tensors with merge_hd5s.py. Doing intersections or in-place merging as appropriate.

  4. Append any CSV or TSV data with the `append[categorical,continuous]_[csv,tsv]` recipe modes.

Each step is currently run separately. We would also like to be able to run them all at once and store intermediate tensors in google cloud buckets.

One jupyter server script

Currently we have scripts/dl_jupyter.sh and scripts/jupyter.sh. These should be merged, and the merged script should allow the Docker image to be selected between GPU and CPU on the command line, the way it works in scripts/tf.sh.

"Presentation mode" for plots

What
Implement a flag that has the effect of increasing the size of the font in plots

Why
Plots as they are currently generated are difficult to read when put in slides for presentations.

How
Abstract the font size specification in plots.py so that each plot type is capable of generating plots with a font size appropriate for presentation-viewing or non-presentation-viewing.

Acceptance Criteria
A presentation-mode flag can be used

Long term code updates

  • Update tests so we actually run them.
  • Tests should include linting and performance comparisons.
  • Important code like TensorGenerator should have thoroughly tested properties even for multiprocessing, e.g. output batch values should match the hd5 they were read from.
  • multimodal_multitask_worker should be simplified. Take in a batch post processing function and a path generator function so it doesn't have to handle siamese and mixup logic and whatever other new logic we'll want to add. In progress in #77.
  • Make names like tensor_maps_in consistent across functions.

Mesh TensorMaps: Start with bounding box

What
Implement a TensorMap that succinctly describes the bounding box of a 3-D object (e.g., a 3-D image segmentation).

Why
ML4CVD is designed to naturally handle models that can enhance the extraction of structural and functional information from widespread diagnostic assessments via limited training on rare information-rich modalities. For example, the codebase has been successfully employed to perform complex segmentation tasks on cardiac MRI, and to implement state-of-the-art models that infer derived features of segmentation, such as LV mass, from cheap and widely available ECGs.

In the current pipeline, however, the models have often been asked to treat rich 3-D information either as a collection of 2-D assessments (slice-by-slice), or as fully unconstrained 3-D objects embedded in structured grids (3-D images). Introducing intermediate and succinct representations of the 3-D objects (e.g., via parametric models and meshes) could increase model performance, enhance interpretability, and provide helpful regularizers for complex multi-task models.

How
The implementation of general 3-D Mesh TensorMaps is a complex task that might be better tackled in several substeps. In this first step, we will leverage existing TMAPs that extract the main axis of 3-D objects via SVD on centroids of arbitrary cross-sections. Rather than limiting the extraction to a single axis, we will expand SVD to extract the 3 orientation angles and use boundary detection algorithms from VTK to extract a meaningful bounding box.

Acceptance Criteria
A TMAP returning a bounding box of SAX cardiac MRI segmentation is tested as the target of supervised training from 1) a cardiac MRI and 2) an ECG model.

Improve clarity of logfile contents

What

  1. Display training, validation, and test set size at the end of the log file for train mode (and potentially other modes).

  2. Clearly portray how many epochs are actually completed (due to patience).

Why
It is helpful to know the number of tensors used for training, validation, and test, as well as the label count within each set.

Label count makes sense for categorical. Less clear how we best handle this for regression models.

It is also important to know when early stopping occurred.

Currently this information is not consolidated in one place in the log file. It also is spread out over workers.

How
Aggregate over workers.

Acceptance Criteria
After running recipes with train mode, the number of tensors used for training, validation, and test sets, as well label counts in each set, and the number of epochs actually run before early stopping, are summarized at the end of the log file.

Tests should live in a top-level directory

Import the (soon-to-be-created) ml4cvd package.

Different testing files for tensorization, training, and evaluation.

We should have one set of quick-running tests that can execute throughout development, and a suite of longer tests that give more scientific validation for PR merges and/or major refactors.

Our training and inference tests are very formulaic. They are given a list of input TensorMaps, output TensorMaps, architecture parameters, and expected performance metrics. In Java this could all be abstracted with a DataProvider. Is there something similar in Python?
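
In Python, pytest's parametrize decorator plays the DataProvider role; a sketch with hypothetical tmaps and a stubbed-out training helper:

import pytest

def train_and_evaluate(inputs, outputs):
    # Stand-in for the real training recipe; returns a performance metric.
    return 0.75

TRAIN_CASES = [
    (['ecg_rest'], ['atrial_fibrillation'], 0.70),
    (['mri_slice'], ['lv_mass'], 0.50),
]

@pytest.mark.parametrize('inputs,outputs,min_metric', TRAIN_CASES)
def test_train_performance(inputs, outputs, min_metric):
    assert train_and_evaluate(inputs, outputs) >= min_metric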

Ideally, we will setup continuous or nightly integration testing as well.

bug in shuffle paths in tensor generators

https://github.com/broadinstitute/ml/blob/a71a97f6f8777e243c4d76f63bb2709156001c85/ml4cvd/tensor_generators.py#L48-L53

The intended behavior is to return the current path and then increment.

However, at the last index, by setting self.idx = 0, shuffling the paths, and returning self.paths[self.idx - 1], a random path is returned. This results in a duplicate path.

We think this only generated N duplicates, where N is typically 3 (train, val, test).

PR #137 fixes this by saving the current path in a temp variable that is returned; this is unaffected by the case where we reach the last index and shuffle paths.
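
In the spirit of that fix, a simplified sketch of the corrected iteration (hypothetical class, not the actual TensorGenerator code):

import random

class PathIterator:
    def __init__(self, paths):
        self.paths = list(paths)
        self.idx = 0

    def __next__(self):
        if self.idx >= len(self.paths):
            self.idx = 0
            random.shuffle(self.paths)
        path = self.paths[self.idx]  # save the current path before incrementing
        self.idx += 1
        return path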

This bug is illustrated in a simple test with 11 ECG .hd5 files. See attached screenshot:

Note: MRNs are scrambled to protect PHI!

[screenshot: table comparing actual MRNs to MRNs returned by the tensor generator]

  • Column 0 is index, column 1 is actual MRN, and column 2 is the MRN from tensor generator.
  • Green rows 0-5 are training, red rows 6-9 are validation, and blue row 10 is test.
  • Row 5 is the last index in the training generator. Here, a random prior tensor in the training generator is returned (MRN highlighted in green) instead of the expected one (6559753).
  • Row 9 is the last index in the validation generator. Here, a random prior tensor in the validation generator is returned (MRN highlighted in red) instead of the expected one (6571743).
  • This bug does not manifest in test generator because there is only one element. Therefore a randomly returned tensor is the only tensor in the test generator.
  • Bug fix results in actual MRNs.

Harmonize tensorization parallelization

What
Tensorization is parallelized via three different schemes. This should be unified.

Why
Let's minimize redundancy within the codebase.

How
TBD

Acceptance Criteria
Tensorization is performed in parallel via a single approach.

MRI processing

Overall

  • currently works with the ml4cvd conda install, so the Python 3.6 env seems consistent (locally tested only!)
  • pip install apache-beam[gcp] --ignore-installed PyYAML
  • when testing, please apply careful scale-out limits

Per file:

  • defines.py
    • should be renamed to 'constants' or something
    • this is redundant with defines.py in ml4cvd, so please prune so that we're not defining things twice (I just copied this over quickly, so there's lots of unnecessary ECG stuff in here)
    • where should these constants go? If they're only used by tensorize, here. If they're used more broadly, then higher up in the library.
  • process_mri_test.py
    • can just be deleted
  • process_mri.py
    • prune imports to only what's needed
    • up to you whether to try the setup/teardown of the client for tasks, or leave it as a separate issue.
    • rest of the function works well, you may have better approaches for deleting/creating folder data.
  • experiment2.py
    • rename to 'tensorize_mri' or something
    • unhardcode pipeline options
    • creating list of blobs is a bit of a hot mess:
      • Is there a native function for passing in a list of blobs, instead of creating a list ourselves? You can see that if the number of blobs grows large, this violates naive parallelization
      • currently, list of blobs is everything in a directory, you would have to upload every MRI file to a directory in gs (appropriate, IMO)
      • we currently pass every zip file in that directory to process_mri, which checks to see if the fieldid is in the ALLOWED_MRI_FIELD_IDS. This is really wasteful. We should be checking when we create the list of blobs if the zip file is one that we want. That way, we can decrease the amount of data we put on the network by a substantial amount.

Create only relevant TensorMaps at run time

What
TensorMap creation is slow, especially if it requires parsing a CSV. This slowness is compounded if the CSV is read over a network, e.g. from MAD3 or ERISOne.

See #171 (comment)

and https://github.com/broadinstitute/ml/blob/6141bef96d04b65e4a3573cbdd3705fb3ebb3a5e/ml4cvd/arguments.py#L175-L183

Why
This happens because we have a ton of TensorMaps, and we load many of them even though we only use a few whenever running ML4CVD.

How
Implementation details TBD. I think this is worth a video chat to discuss. Relates to #143 (organize tensor_from_file).

Acceptance Criteria
When ML4CVD is run, only the user-specified input and output tensor maps should be created, and the rest are not.

Assess handling of non-float values while writing tensor files

Related code snippet from tensor_writer:

float_value = to_float_or_false(value)

if float_value is not False:
    hd5.create_dataset(hd5_dataset_name, data=[float_value])
else:
    logging.warning("Cannot cast to float from '{}' for field id '{}' and sample id '{}'".format(value, field_id, sample_id))

Standardize MRN cleaning

What
MRNs are cleaned at several places throughout the code. This should be performed in a uniform way, preferably by calling the same function.

Why
To ensure internal consistency that applies best practices from what we've learned about Partners MRNs.

How
We should discuss with clinical collaborators who have experience with MRN abnormalities, e.g. Steve Lubitz and @shaankhurshid.

Acceptance Criteria

  • Any time an MRN is cleaned in the code base, e.g. ml4cvd/tensor_writer_partners.py, it should use the same function.
  • This function is developed with input from clinical collaborators.

Handle normalizations and regularizations more flexibly

What
Model normalizers and regularizers placed automatically in correct order

Why
Normalizers and regularizers are currently placed in fixed positions, which prevents the use of some useful techniques. For example, L2 regularization is impossible right now because the regularization layer comes after the activation instead of after the layer.

How
Dictionary mapping regularization enum or string to position in block; see the sketch after the acceptance criteria below.

Acceptance Criteria

  • Dictionary mapping regularization enum or string to position in block
  • New regularizations + normalizations properly used (especially L2, layer norm)
  • Command line allows multiple regularizations
  • Command line fails early on nonexistent regularizations and normalizations
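
A possible shape for that mapping (illustrative; the names are assumptions, not current code):

from enum import Enum, auto

class BlockPosition(Enum):
    BEFORE_ACTIVATION = auto()
    AFTER_ACTIVATION = auto()

# Map each normalization/regularization name to where it sits inside a block,
# so weight penalties and norms attach before the activation while dropout
# follows it; the command line can then validate names against this dict.
REGULARIZER_POSITIONS = {
    'batch_norm': BlockPosition.BEFORE_ACTIVATION,
    'layer_norm': BlockPosition.BEFORE_ACTIVATION,
    'l2': BlockPosition.BEFORE_ACTIVATION,
    'dropout': BlockPosition.AFTER_ACTIVATION,
}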

Organize `tensor_from_file`

What
New folder ml4cvd/tensor_maps/.
Specific TensorMaps go into their own files, e.g. ml4cvd/tensor_maps/ecg_bike_tensors.py.

Eventually we move the folder to its own repo as per Marcus's suggestion!

Why
I have a branch with a 1,700-line and growing tensor_from_file. You should not have to import specific libraries, like biosppy or vtk, for TensorMaps that don't use them.

Set output_folder to path outside of ml repo

What
When calling recipes.py, if a user sets an --output_folder path that is not within the repo directory, no results are saved on the host machine.

It would be great if a user could specify any path on their machine in which to save results from running ML4CVD!

Why
The repo should contain code. Results should live in a different directory. Results in the repo directory can clutter the output of git status and subsequent adds, commits, and pushes. Having to move results out of the repo directory adds a step to user workflow.

How
I think this limitation is due to a Docker mount setting. The solution is probably to mount the home directory, and accept a limitation that --output_folder must be within ~/ and not upstream of that. Mounting / seems problematic.

Acceptance Criteria
User can set any --output_folder regardless of whether it is in the repo directory, and results appear in a subdirectory specified by id arg.
