broadinstitute / ml4h
TensorGenerator should have thoroughly tested properties, even for multiprocessing, e.g. output batch values should match the hd5 files they were read from.
multimodal_multitask_worker should be simplified: take in a batch post-processing function and a path generator function so it doesn't have to handle siamese and mixup logic and whatever other new logic we'll want to add. In progress in #77.
tensor_maps_in should be consistent across functions.
What
When calling recipes.py, if a user sets an --output_folder path that is not within the repo directory, no results are saved on the host machine.
It would be great if a user could specify any path on their machine in which to save results from running ML4CVD!
Why
The repo should contain code. Results should live in a different directory. Results in the repo directory can clutter the output of git status
and subsequent adds, commits, and pushes. Having to move results out of the repo directory adds a step to user workflow.
How
I think this limitation is due to a Docker mount setting. The solution is probably to mount the home directory and accept the limitation that --output_folder must be within ~/ and not upstream of it. Mounting / seems problematic.
Acceptance Criteria
User can set any --output_folder, regardless of whether it is in the repo directory, and results appear in a subdirectory specified by the id arg.
fun!
Related code snippet from tensor_writer:
float_value = to_float_or_false(value)
if float_value is not False:
    hd5.create_dataset(hd5_dataset_name, data=[float_value])
else:
    logging.warning("Cannot cast to float from '{}' for field id '{}' and sample id '{}'".format(value, field_id, sample_id))
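The helper is presumably shaped like the following sketch (the name and behavior are inferred from the snippet, not confirmed from the source):

```python
# Presumed shape of the helper used in the snippet above.
def to_float_or_false(value):
    """Return float(value) if the cast succeeds, else False."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return False
```

Note that the snippet checks `is not False` rather than plain truthiness: a legitimate value of 0.0 is falsy but must still be written to the hd5.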
What
Implement a flag that has the effect of increasing the size of the font in plots
Why
Plots as they are currently generated are difficult to read when put in slides for presentations.
How
Abstract the font size specification in plots.py so that each plot type is capable of generating plots with a font size appropriate for presentation-viewing or non-presentation-viewing.
Acceptance Criteria
A presentation-mode flag can be used
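The abstraction could be as thin as a global font-size switch that every plot type inherits; a sketch assuming matplotlib, with hypothetical mode names and sizes:

```python
import matplotlib.pyplot as plt

# Hypothetical sizes; plots.py would pick the mode from a presentation-mode flag.
FONT_SIZES = {'default': 10, 'presentation': 18}

def set_plot_mode(mode: str = 'default') -> None:
    """Set font-related rcParams once so every subsequent plot inherits the size."""
    size = FONT_SIZES[mode]
    plt.rcParams.update({
        'font.size': size,
        'axes.titlesize': size + 2,
        'axes.labelsize': size,
        'xtick.labelsize': size - 2,
        'ytick.labelsize': size - 2,
        'legend.fontsize': size - 2,
    })
```

Each plot function then needs no per-plot font arguments; the flag is applied once at startup.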
It's hard to update the interface to make_multimodal_to_multilabel_model because it's used in so many places and has so many arguments.
It should have default arguments for some of its parameters, and maybe be called via make_multimodal_to_multilabel_model(**args).
If we follow the **args strategy we could add **kwargs to the signature of make_multimodal_to_multilabel_model, which would absorb any extra arguments so we could do minimal processing on the command-line args before passing them. The downside of this strategy is that it would sometimes fail quietly: e.g. misspelling an argument with a default would not raise an error; it would just use the default value.
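A toy illustration of that quiet-failure mode (function and argument names are hypothetical, not the real signature):

```python
# Toy illustration of the quiet-failure mode described above.
def make_model(*, learning_rate=1e-3, dropout=0.0, **kwargs):
    # **kwargs absorbs unrecognized arguments without complaint
    return {'learning_rate': learning_rate, 'dropout': dropout}

args = {'learning_rate': 1e-2, 'droput': 0.5}  # note the misspelled 'dropout'
model = make_model(**args)
assert model['dropout'] == 0.0  # the misspelling is silently swallowed, no TypeError
```

Without **kwargs in the signature, the same call would raise TypeError: unexpected keyword argument 'droput', which is the trade-off being weighed.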
@lucidtronix commented on Thu Mar 14 2019
Modality by Disease count table. Histograms for continuous and categorical fields. Counts and shapes of every tensor group and type. Check for all 0 tensors.
Count the types and sources of tensors that are merged and output the log.
Title is speculative and reflects my hypothesis.
I am currently running explore on 2.6M ECGs. Even though no training is performed, almost all of the memory on one of my GPUs is in use:
er498@mithril $ nvidia-smi
Mon Mar 2 21:23:23 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:17:00.0 Off | N/A |
| 0% 37C P8 19W / 250W | 10600MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:65:00.0 Off | N/A |
| 0% 38C P8 1W / 250W | 272MiB / 11018MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 26811 C python 10589MiB |
| 1 1567 G /usr/lib/xorg/Xorg 14MiB |
| 1 1661 G /usr/bin/gnome-shell 10MiB |
| 1 26811 C python 235MiB |
+-----------------------------------------------------------------------------+
When I try to run a second instance of explore before the first run completes, I get a CUDA_ERROR_OUT_OF_MEMORY error, which suggests ML4CVD tries to use the first GPU and does not utilize the second GPU although it is recognized:
er498@mithril $ sh run_explore.sh
mkdir: cannot create directory ‘/mnt/ml4cvd’: Permission denied
Attempting to run Docker with
docker run -it
--rm
--ipc=host
-v /home/er498/jupyter/root/:/root/
-v /home/er498/:/home/er498/
-v /mnt/:/mnt/
gcr.io/broad-ml4cvd/deeplearning:tf2-latest-gpu python /home/er498/repos/ml/ml4cvd/recipes.py --mode explore --tensors /data/partners_ecg/hd5_subset --input_tensors partners_ecg_patientid partners_ecg_date partners_ecg_dob partners_ecg_read_md_raw partners_ecg_read_pc_raw partners_ecg_rate partners_ecg_qrs partners_ecg_pr partners_ecg_qt partners_ecg_qtc --test_modulo 0 --output_folder /home/er498/ml4cvd_results/ --id explore_partners_ecg_subset
Processing /home/er498/repos/ml
Building wheels for collected packages: ml4cvd
Building wheel for ml4cvd (setup.py) ... done
Created wheel for ml4cvd: filename=ml4cvd-0.0.1-py3-none-any.whl size=403522 sha256=736f1b2fa148fab99dfb4397c1cf3561ecaa3c260dc6807b304b642d81702cc9
Stored in directory: /tmp/pip-ephem-wheel-cache-0_dzl_rq/wheels/9c/5b/fa/03f47092853802b5352de00dc549ae7baf4101b7e30db46407
Successfully built ml4cvd
Installing collected packages: ml4cvd
Successfully installed ml4cvd-0.0.1
2020-03-02 21:21:56.021050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-03-02 21:21:56.022306: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-03-02 21:21:57.425267: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-03-02 21:21:57.440138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-03-02 21:21:57.441509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-03-02 21:21:57.441538: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-02 21:21:57.441567: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-02 21:21:57.442968: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-02 21:21:57.443219: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-02 21:21:57.444728: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-02 21:21:57.445622: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-02 21:21:57.445659: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-02 21:21:57.450824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-03-02 21:21:57.451042: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-03-02 21:21:57.483917: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3099995000 Hz
2020-03-02 21:21:57.486154: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5a03ee0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-02 21:21:57.486188: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-03-02 21:21:57.841317: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5985690 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-03-02 21:21:57.841357: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-03-02 21:21:57.841369: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-03-02 21:21:57.842855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-03-02 21:21:57.844243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-03-02 21:21:57.844309: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-02 21:21:57.844333: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-02 21:21:57.844432: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-02 21:21:57.844482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-02 21:21:57.844532: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-02 21:21:57.844581: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-02 21:21:57.844613: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-02 21:21:57.848874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-03-02 21:21:57.848949: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-02 21:21:58.258708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-02 21:21:58.258732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0 1
2020-03-02 21:21:58.258738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N N
2020-03-02 21:21:58.258742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1: N N
2020-03-02 21:21:58.260170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5)
2020-03-02 21:21:58.261248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 9986 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2020-03-02 21:21:58.264000: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 185.62M (194641920 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Traceback (most recent call last):
File "/home/er498/repos/ml/ml4cvd/recipes.py", line 17, in <module>
from ml4cvd.arguments import parse_args
File "/usr/local/lib/python3.6/dist-packages/ml4cvd/arguments.py", line 24, in <module>
from ml4cvd.tensor_maps_by_hand import TMAPS
File "/usr/local/lib/python3.6/dist-packages/ml4cvd/tensor_maps_by_hand.py", line 1, in <module>
from ml4cvd.tensor_from_file import normalized_first_date, TMAPS
File "/usr/local/lib/python3.6/dist-packages/ml4cvd/tensor_from_file.py", line 478, in <module>
loss=weighted_crossentropy(np.array(_get_lead_cm(32)[1]), 'ecg_median_categorical'))
File "/usr/local/lib/python3.6/dist-packages/ml4cvd/metrics.py", line 33, in weighted_crossentropy
exec(string_globe, globals(), locals())
File "<string>", line 4, in <module>
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 814, in variable
constraint=constraint)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 260, in __call__
return cls._variable_v2_call(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 254, in _variable_v2_call
shape=shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 235, in <lambda>
previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 2645, in default_variable_creator_v2
shape=shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 262, in __call__
return super(VariableMetaclass, cls).__call__(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1411, in __init__
distribute_strategy=distribute_strategy)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1557, in _init_from_args
graph_mode=self._in_graph_mode)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 232, in eager_safe_variable_handle
shape, dtype, shared_name, name, graph_mode, initial_value)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 164, in _variable_handle_from_shape_and_dtype
math_ops.logical_not(exists), [exists], name="EagerVariableNameReuse")
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_logging_ops.py", line 55, in _assert
_ops.raise_from_not_ok_status(e, name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [0] [Op:Assert] name: EagerVariableNameReuse
Args attached:
arguments_2020-03-02_17-34.txt
Expected behavior when running ML4CVD on prem is to use other GPUs that have available memory.
Per @paolodi earlier: to use the other GPU so we could train two models at the same time on Mithril, we had to modify tf.sh.
If the current codebase does not support more than one GPU, this could be a good issue to tackle in a PR that extends ML4CVD to on-prem hardware.
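A hedged sketch of per-process GPU selection in TF 2.x (the device index here is illustrative and would come from a CLI flag; this is not what tf.sh currently does):

```python
import tensorflow as tf

# Sketch: pin this process to one GPU and allocate memory on demand, so a second
# `explore` run can use the other device instead of failing with
# CUDA_ERROR_OUT_OF_MEMORY. Illustration only; the index is a hypothetical flag value.
gpus = tf.config.experimental.list_physical_devices('GPU')
if len(gpus) > 1:
    tf.config.experimental.set_visible_devices(gpus[1], 'GPU')  # use the idle GPU
    tf.config.experimental.set_memory_growth(gpus[1], True)     # grow instead of grabbing ~10 GiB up front
```

Alternatively, exporting CUDA_VISIBLE_DEVICES=1 in the environment before the docker run achieves the same isolation without code changes.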
This does not break anything, but does complicate my mediocre workflow where I save results in a folder synced via Dropbox. So it is a matter of convenience.
I do this because viewing results, sharing them with collaborators / advisors, formatting figures into manuscripts, etc. is easier on macOS than a Linux terminal with no GUI.
I presume the team has an efficient way of viewing results and moving them from GCP to emails, Keynote slides, etc. Would be good to hear how it is done now.
Encoders, decoders, and the bottleneck should all be separate functions. Likely object-oriented, as Keras recommends, with sub-models and blocks implementing Layer.
Preliminary refactor in https://github.com/broadinstitute/ml/tree/nd_effectless_model_refactor
What
Model normalizers and regularizers placed automatically in correct order
Why
Normalizers and regularizers are currently placed in fixed positions, which prevents the use of some useful techniques. For example, L2 regularization is currently impossible because the regularization layer comes after the activation instead of after the layer.
How
Dictionary mapping regularization enum or string to position in block.
Acceptance Criteria
The intended behavior is to return the current path and then increment. However, at the last index, by setting self.idx = 0, shuffling the paths, and returning self.paths[self.idx - 1], a random path is returned instead. This results in a duplicate path.
We think this only generated N duplicates, where N is typically 3 (train, val, test).
PR #137 fixes this by saving the current path in a temp variable that is returned; this is unaffected by the case where we reach the last index and shuffle paths.
This bug is illustrated in a simple test with 11 ECG .hd5 files; see the attached screenshot.
Note: MRNs are scrambled so as to protect PHI!
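The fix described in PR #137 amounts to something like this sketch (the class and attribute names are assumed, not copied from the source):

```python
import random

class PathIterator:
    """Cycle over paths, reshuffling each epoch. Sketch of the fixed iterator."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.idx = 0

    def __next__(self):
        path = self.paths[self.idx]      # save the current path BEFORE any shuffle
        self.idx += 1
        if self.idx == len(self.paths):  # epoch boundary: reset index and reshuffle
            self.idx = 0
            random.shuffle(self.paths)
        return path                      # unaffected by the shuffle above
```

Because the return value is captured before the shuffle, the last path of an epoch can no longer be replaced by a random (possibly duplicate) one.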
How do we choose which date to include in phecode_mapping output table?
Admidate, operdate, etc?
Should we have a best value, or include all dates?
Should we create an ordered disease progression table, so multiple phecodes don't appear over the years if a patient is assigned a phecode early on?
Should pick one protocol and stick with it, including MPG mappings.
What
New folder ml4cvd/tensor_maps/.
Specific TensorMaps go into their own files, e.g. ml4cvd/tensor_maps/ecg_bike_tensors.py.
Eventually we move the folder to its own repo as per Marcus's suggestion!
Why
I have a branch with a 1700-line and growing tensor_from_file. You should not have to import specific libraries, like biosppy or vtk, for TensorMaps that don't use them.
I want to add more options to make_multimodal_to_multilabel_model
Add support for TensorBoard
Long-running issue with hyperparameter optimization: models are not garbage collected, so long runs tend to slow down and stall; this typically happens after 24 hours. We've tried several things to fix this with no luck so far.
Currently we have:
finally:
    del model
    gc.collect()
but it doesn't seem to help. Previously we tried things like:
def limit_mem():
    try:
        K.clear_session()
        cfg = K.tf.ConfigProto()
        cfg.gpu_options.allow_growth = True
        K.set_session(K.tf.Session(config=cfg))
    except AttributeError as e:
        logging.exception('Could not clear session. Maybe you are using Theano backend?')
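For the TF 2.x images used elsewhere in this repo, the equivalent cleanup between trials would look roughly like the sketch below (names are hypothetical, and there is no guarantee it resolves the leak):

```python
import gc
import tensorflow as tf

def run_trial(build_model, params):
    """One hyperparameter trial with explicit cleanup. Sketch; names hypothetical."""
    model = build_model(**params)
    try:
        # ... model.fit(...) and evaluation would go here ...
        return None
    finally:
        del model
        tf.keras.backend.clear_session()  # drop Keras' accumulated global graph state
        gc.collect()
```

Another option worth trying is running each trial in a separate subprocess, so the OS reclaims everything when the process exits.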
What
MRNs are cleaned at several places throughout the code. This should be performed in a uniform way, preferably by calling the same function.
Why
To ensure internal consistency that applies best practices from what we've learned about Partners MRNs.
How
We should discuss with clinical collaborators who have experience with MRN abnormalities, e.g. Steve Lubitz and @shaankhurshid.
Acceptance Criteria
Wherever MRNs are cleaned, e.g. in ml4cvd/tensor_writer_partners.py, the same function should be used.
The majority of training time is spent processing the data with CPUs.
In the image below blue is the fitting function, mauve is the GPU fitting, and green is processing the data.
(This was found with the command line ./scripts/tf.sh /home/ndiamant/ml/ml4cvd/recipes.py --mode train --tensors /mnt/disks/ecg-bike-tensors/2019-10-10/ --input_tensors ecg-bike-pretest --output_tensors ecg-bike-new-hrr --batch_size 64 --epochs 5 --training_steps 5 --validation_steps 2 --inspect_models)
This video from TensorFlow suggests some TensorFlow-specific ideas.
By adding use_multiprocessing=True and workers=16 I've been able to get a big increase in data-processing speed. That suggests a potential way forward: increasing the number of CPUs on our machines, which is relatively cheap.
If you have fairly fixed tensor maps, it might be worth saving preprocessing results to a csv or something like that, but that seems like a waste of developer time to figure out.
The disk should be attached by the ml4cvd-image.sh file: https://console.cloud.google.com/storage/browser/ml4cvd/projects/jamesp/server-config/?project=broad-ml4cvd
but it's not working currently.
Also the ml4cvd-image.sh and other files in the server-config directory should be added to the new repo.
What
Option to guess depth, number of channels etc. based on input and output shapes (like EfficientNet)
Why
Makes first attempt at any task better, makes the code more useful for newcomers.
How
Not sure. Probably first attempt would be an EfficientNet implementation with u_connect for segmentation and autoencoding.
Acceptance Criteria
Our current tensorization process involves 4 main steps:
Run Dataflow jobs to tensorize data from the BigQuery database. This means six separate Dataflow runs, one for each of the fields: ['categorical', 'continuous', 'icd', 'disease', 'death', 'phecode_disease'].
Tensorize bulk data with tensor_writer_ukbb.py
. This file has code to tensorize abdominal, cardiac, and brain MRI as well as resting and exercise ECG.
Merge tensors with merge_hd5s.py
. Doing intersections or in-place merging as appropriate.
Append any CSV or TSV data with the `append[categorical,continuous]_[csv,tsv]` recipe modes.
Each step is currently run separately. We would also like to be able to run them all at once and store intermediate tensors in google cloud buckets.
Dataflow pipeline runs write tensors to GCS buckets using GCS' Python client. From Stackdriver logs, it appears a single client ends up being used by all Dataflow workers. This can be problematic. Investigate whether we can have one client per 'task' (unit of work Dataflow sends to a worker at once) and/or assess how risky having a single client for the entire run would be in the future.
What
Tensorization is parallelized via three different schemes. This should be unified.
Why
Let's minimize redundancy within the codebase.
How
TBD
Acceptance Criteria
Tensorization is performed in parallel via a single approach.
Overall:
pip install apache-beam[gcp] --ignore-installed PyYAML
Per file:
defines.py: there is already a defines.py in ml4cvd, so please prune so that we're not defining twice. (I just copied this over quickly, so there's lots of unnecessary ECG stuff in here.)
process_mri_test.py
process_mri.py
experiment2.py
What
Pre-populate our issues with @christopherreeder's template:
**What**
Summarize the issue in 1-2 sentences.
**Why**
Describe why this issue should be solved, new feature implemented, etc.
**How**
High-level overview of how you propose to address.
**Acceptance Criteria**
Unambiguous milestones; if any are incomplete, the PR cannot be merged.
so meta
Why
To save us time when we make new issues, and improve adherence to the nice template.
How
Instructions: https://help.github.com/en/github/building-a-strong-community/configuring-issue-templates-for-your-repository
I would make one but do not have sufficient permissions for this repo.
If temporarily elevated (@lucidtronix @gpbatra), I'd be glad to set it up.
Acceptance Criteria
This repo has a GitHub Issues template that is a selectable option for pre-populating new issues.
Current phecode mapping only uses icd10_diag from the main hesin table. There are secondary codes in other hesin tables (icd_oper, icd9, etc.). Those should also be mapped, but the interpretation will be sensitive.
Let's keep an eye on this and decide when to triage once we know more about how we want to structure phenotypes.
The hesin tables refer to sample_id as eid; these should be changed here and downstream.
If N > 1 output tensors have the same name, only the first one is used and an obscure error is thrown ("you are using N loss functions but your model has only a single output").
To fix, we should check for duplicates among output TensorMap names.
We should also check whether this bug applies to input TensorMaps.
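A minimal fail-fast check, assuming only that TensorMaps expose a .name attribute, could run on both the input and output lists before model construction:

```python
from collections import Counter

def check_unique_names(tensor_maps):
    """Fail fast with a clear message instead of Keras' obscure loss-count error.

    Sketch: tensor_maps is any iterable of objects with a .name attribute.
    """
    counts = Counter(tm.name for tm in tensor_maps)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f'Duplicate TensorMap names: {duplicates}')
```

Calling this once for input tensor maps and once for output tensor maps covers both directions of the suspected bug.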
Currently we store the list of field ids used for tensorizing categorical and continuous fields in BigQuery's shared_data.tensorization_fieldids
table. It was created from a csv file in GCS (gs://ml4cvd/data/fieldids.csv
). This helps eliminate fetching of phenotype
table rows we don't care about. However, when we need to add new fields to be tensorized, directly updating the table would make it difficult to audit, especially if we run into tensorization issues due to it.
One way we could keep field adding easy while having a log of what changed, is by checking in the tsv file into the repo, and have the tensorization pipeline re-create the tensorization_fieldids
table from it.
Currently we have scripts/dl_jupyter.sh and scripts/jupyter.sh. These should be merged, and the Docker image should be selectable between GPU and CPU on the command line, the way it works in scripts/tf.sh.
Currently tensorize is a package separate from ml4cvd. This will lead to code duplication. Ideally, all this code will be packaged into one pip-installable Python package: tensorize should be integrated into ml4cvd, or, if there is a reason this is not possible, both packages should be set up as pip-installable Python packages.
Import the (soon-to-be-created) ml4cvd package.
Different testing files for tensorization, training, and evaluation.
We should have one set of quick tests that can execute throughout development, and a suite of longer tests that give more scientific validation for PR merges and/or major refactors.
Our training and inference tests are very formulaic. They are given a list of input TensorMaps, output TensorMaps, architecture parameters, and expected performance metrics. In Java this could all be abstracted with a DataProvider. Is there something similar in Python?
Ideally, we will setup continuous or nightly integration testing as well.
What
Right now ConvEncoder
and ConvDecoder
build DenseBlocks
. Instead they should get passed a list of blocks and chain them together.
Why
Allows more flexible swapping of the many block types we want to experiment with, e.g. attention convolution, efficient block, etc.
How
Pass a list of pre-built blocks into the ConvEncoder and ConvDecoder classes.
Acceptance Criteria
What
Remove the reliance of train_multimodal_multitask on big_batch_from_minibatch_generator
Why
In the train recipe, test workers use big_batch_from_minibatch_generator, which frequently leads to a memory allocation error.
How
Write an alternative to big_batch_from_minibatch_generator that does not gather minibatches into larger batches but still produces inputs, outputs, and paths in the format expected by _predict_and_evaluate.
Acceptance Criteria
Memory allocation errors caused by big_batch_from_minibatch_generator no longer occur during training runs
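A minimal sketch of the alternative, assuming the (inputs, outputs, paths) tuple layout implied by the names above; _predict_and_evaluate would then consume batches iteratively rather than as one concatenated batch:

```python
def iterate_minibatches(generator, steps):
    """Yield (inputs, outputs, paths) one minibatch at a time.

    Sketch of an alternative to big_batch_from_minibatch_generator: peak memory
    stays at one minibatch instead of steps * batch_size, while the consumer
    still sees the same tuple layout per batch.
    """
    for _ in range(steps):
        inputs, outputs, paths = next(generator)
        yield inputs, outputs, paths
```

The design trade-off is that any metric needing the full test set at once (e.g. a global ROC curve) must accumulate predictions incrementally instead of receiving one big array.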
What
Implement a TensorMap that succinctly describes the bounding box of a 3-D object (e.g., a 3-D image segmentation).
Why
ML4CVD is designed to naturally handle models that can enhance the extraction of structural and functional information from widespread diagnostic assessments via limited training on rare information-rich modalities. For example, the codebase has been successfully employed to perform complex segmentation tasks on cardiac MRI, and to implement state-of-the-art models that infer derived features of segmentation, such as LV mass, from cheap and widely available ECGs.
In the current pipeline, however, the models have often been asked to treat rich 3-D information either as a collection of 2-D assessments (slice-by-slice), or as fully unconstrained 3-D objects embedded in structured grids (3-D images). Introducing intermediate and succinct representations of the 3-D objects (e.g., via parametric models and meshes) could increase model performance, enhance interpretability, and provide helpful regularizers for complex multi-task models.
How
The implementation of general 3-D Mesh TensorMaps is a complex task that might be better tackled in several substeps. In this first step, we will leverage existing TMAPs that extract the main axis of 3-D objects via SVD on centroids of arbitrary cross-sections. Rather than limiting the extraction to a single axis, we will expand SVD to extract the 3 orientation angles and use boundary detection algorithms from VTK to extract a meaningful bounding box.
Acceptance Criteria
A TMAP returning a bounding box of SAX cardiac MRI segmentation is tested as the target of supervised training from 1) a cardiac MRI and 2) an ECG model.
What
Implement Explore mode in recipes.py that provides summary statistics for specified input tensors of specified HD5 files.
Why
Understanding basic information of one's data is a vital first step before training models.
How
Iterates through three tmap interpretations (categorical, continuous, and language). For each interpretation, all user-specified input tensor maps that match that type are extracted from HD5 files into a Pandas DataFrame, from which summary statistics are calculated and saved as CSV files.
At the end, all input tensors are concatenated into a large DataFrame and saved to a CSV file. Each row is a patient. Columns are: 1) tensor maps (or, if the tmap has channels, a tuple of (tmap, cm)), 2) errors (if any are thrown while opening the tensor), and 3) the full path to the tensor on disk. This large CSV file will be ingested into a database for future queries.
Acceptance Criteria
Summary statistics and a big CSV file with all tensors are generated.
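The continuous-interpretation pass described in How could be sketched with pandas like this (the function name and example columns are hypothetical; the real explore mode builds the DataFrame by extracting each user-specified tensor map from HD5 files):

```python
import pandas as pd

def summarize_continuous(df: pd.DataFrame) -> pd.DataFrame:
    """Summary statistics for continuous tmaps: one row per tensor map. Sketch only."""
    stats = df.describe().T            # count, mean, std, min, quartiles, max per column
    stats['missing'] = df.isna().sum() # how many tensors failed or were absent
    return stats

# Hypothetical columns named after Partners ECG tmaps:
df = pd.DataFrame({'partners_ecg_rate': [60.0, 75.0, None],
                   'partners_ecg_qt': [400.0, 420.0, 410.0]})
summary = summarize_continuous(df)
```

Each interpretation (categorical, language) would get an analogous pass, with counts per channel instead of quartiles where appropriate.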
What
Evaluation plots should include the ABCD framework proposed in Towards better clinical prediction models: seven steps for development and an ABCD for validation. The ABCDs are: calibration-in-the-large, or the model intercept (A); calibration slope (B); discrimination, with a concordance statistic (C); and clinical usefulness, with decision-curve analysis (D).
Why
It gives several perspectives on model validity, and using a previously published framework from an external group is easy to defend. We currently get very little insight into model calibration and do no decision-curve analysis.
How
This can probably be separated into two or more PRs, hopefully committed by different folks. We basically have C already (ROC curve and C statistic).
Acceptance Criteria
train and test modes generate plots like the attached examples.
What
TensorMap creation is slow, especially if it requires parsing a CSV. This slowness is compounded if the CSV is read over a network, e.g. from MAD3 or ERISOne.
See #171 (comment)
Why
This happens because we have a ton of TensorMaps, and we load many of them even though we only use a few whenever running ML4CVD.
How
Implementation details TBD. I think this is worth a video chat to discuss. Relates to #143 (organize tensor_from_file).
Acceptance Criteria
When ML4CVD is run, only the user-specified input and output tensor maps should be created, and the rest are not.
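One possible shape for this, with hypothetical names: register factory callables rather than constructed TensorMaps, so expensive work (e.g. parsing a CSV over the network) runs only for the maps actually requested:

```python
# Sketch of lazy TensorMap construction; registry names and factory style hypothetical.
TMAP_FACTORIES = {}
_BUILT = {}

def register_tmap(name):
    """Register a zero-argument factory; nothing expensive runs at import time."""
    def wrap(factory):
        TMAP_FACTORIES[name] = factory
        return factory
    return wrap

def get_tmap(name):
    """Build a TensorMap on first request and cache it."""
    if name not in _BUILT:
        _BUILT[name] = TMAP_FACTORIES[name]()
    return _BUILT[name]

@register_tmap('ecg_rate')
def _ecg_rate():
    # Stand-in for an expensive construction, e.g. parsing a remote CSV.
    return {'name': 'ecg_rate'}
```

Import of a tensor-map module then registers names only; the CSV parsing cost is paid per requested map, once.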
What
Code to manipulate and tensorize Partners ECG XML files currently lives in a different repo: https://github.com/mit-ccrg/partners-ecg
Why
These are capabilities that should exist within ML4CVD.
How
Move scripts related to XMLs from original repo to ml/ingest
.
Move scripts related to tensorization to ml/ml4cvd/tensorize/
.
Acceptance Criteria
Scripts to manipulate and tensorize Partners ECG XML files live inside of ML4CVD, and maintain original functionality.
macOS keeps correcting tensorization to "tenderization" but I am OK with that.
What
Display training, validation, and test set sizes at the end of the log file for train mode (and potentially other modes).
Clearly portray how many epochs actually completed (due to patience).
Why
It is helpful to know the number of tensors used for training, validation, and test, as well as the label count within each set.
Label count makes sense for categorical. Less clear how we best handle this for regression models.
It is also important to know when early stopping occurred.
Currently this information is not consolidated in one place in the log file. It also is spread out over workers.
How
Aggregate over workers.
Acceptance Criteria
After running recipes with train mode, the number of tensors used for the training, validation, and test sets, as well as label counts in each set and the number of epochs actually run before early stopping, are summarized at the end of the log file.
https://www.tensorflow.org/addons/api_docs/python/tfa/optimizers has RAdam, cyclic learning rates, and model weight averaging, all of which we want. We should use it.
In progress in #170.
What
Optimize which augmentations, sizes, and covariates to use to minimize loss.
Specify
Why
We don't have a clean way of optimizing over augmentations / shapes / covariates
How
Helper functions in hyperparameters.py
that allow you to specify the above.
Acceptance Criteria
Easy to understand/use optimization over input tmaps