broadinstitute / ml4h
TensorGenerator should have thoroughly tested properties, even for multiprocessing, e.g. output batch values should match the hd5 files they were read from.
multimodal_multitask_worker should be simplified: take in a batch post-processing function and a path generator function so it doesn't have to handle siamese and mixup logic and whatever other new logic we'll want to add. In progress in #77.
tensor_maps_in should be consistent across functions.
What
When calling recipes.py, if a user sets an --output_folder path that is not within the repo directory, no results are saved on the host machine.
It would be great if a user could specify any path on their machine in which to save results from running ML4CVD!
Why
The repo should contain code. Results should live in a different directory. Results in the repo directory can clutter the output of git status
and subsequent adds, commits, and pushes. Having to move results out of the repo directory adds a step to user workflow.
How
I think this limitation is due to a Docker mount setting. The solution is probably to mount the home directory and accept the limitation that --output_folder must be within ~/ and not upstream of it. Mounting / seems problematic.
Acceptance Criteria
User can set any --output_folder, regardless of whether it is in the repo directory, and results appear in a subdirectory specified by the id arg.
fun!
Related code snippet from tensor_writer:
float_value = to_float_or_false(value)
if float_value is not False:
    hd5.create_dataset(hd5_dataset_name, data=[float_value])
else:
    logging.warning("Cannot cast to float from '{}' for field id '{}' and sample id '{}'".format(value, field_id, sample_id))
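The helper is presumably shaped like the following sketch (the name and behavior are inferred from the snippet, not confirmed from the source):

```python
# Presumed shape of the helper used in the snippet above.
def to_float_or_false(value):
    """Return float(value) if the cast succeeds, else False."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return False
```

Note that the snippet checks `is not False` rather than plain truthiness: a legitimate value of 0.0 is falsy but must still be written to the hd5.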
What
Implement a flag that has the effect of increasing the size of the font in plots
Why
Plots as they are currently generated are difficult to read when put in slides for presentations.
How
Abstract the font size specification in plots.py so that each plot type is capable of generating plots with a font size appropriate for presentation-viewing or non-presentation-viewing.
Acceptance Criteria
A presentation-mode flag can be used
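The abstraction could be as thin as a global font-size switch that every plot type inherits; a sketch assuming matplotlib, with hypothetical mode names and sizes:

```python
import matplotlib.pyplot as plt

# Hypothetical sizes; plots.py would pick the mode from a presentation-mode flag.
FONT_SIZES = {'default': 10, 'presentation': 18}

def set_plot_mode(mode: str = 'default') -> None:
    """Set font-related rcParams once so every subsequent plot inherits the size."""
    size = FONT_SIZES[mode]
    plt.rcParams.update({
        'font.size': size,
        'axes.titlesize': size + 2,
        'axes.labelsize': size,
        'xtick.labelsize': size - 2,
        'ytick.labelsize': size - 2,
        'legend.fontsize': size - 2,
    })
```

Each plot function then needs no per-plot font arguments; the flag is applied once at startup.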
It's hard to update the interface to make_multimodal_to_multilabel_model because it's used in so many places and has so many arguments.
It should have default arguments for some of its parameters, and maybe be called via make_multimodal_to_multilabel_model(**args).
If we follow the **args strategy we could add **kwargs to the signature of make_multimodal_to_multilabel_model, which would absorb any extra arguments so we could do minimal processing on the command-line args before passing them. The downside of this strategy is that it would sometimes fail quietly: e.g. misspelling an argument with a default would not raise an error; it would just use the default value.
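A toy illustration of that quiet-failure mode (function and argument names are hypothetical, not the real signature):

```python
# Toy illustration of the quiet-failure mode described above.
def make_model(*, learning_rate=1e-3, dropout=0.0, **kwargs):
    # **kwargs absorbs unrecognized arguments without complaint
    return {'learning_rate': learning_rate, 'dropout': dropout}

args = {'learning_rate': 1e-2, 'droput': 0.5}  # note the misspelled 'dropout'
model = make_model(**args)
assert model['dropout'] == 0.0  # the misspelling is silently swallowed, no TypeError
```

Without **kwargs in the signature, the same call would raise TypeError: unexpected keyword argument 'droput', which is the trade-off being weighed.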
@lucidtronix commented on Thu Mar 14 2019
Modality by Disease count table. Histograms for continuous and categorical fields. Counts and shapes of every tensor group and type. Check for all 0 tensors.
Count the types and sources of tensors that are merged and output the log.
Title is speculative and reflects my hypothesis.
I am currently running explore on 2.6M ECGs. Even though no training is performed, almost all of the memory on one of my GPUs is in use:
er498@mithril $ nvidia-smi
Mon Mar 2 21:23:23 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:17:00.0 Off | N/A |
| 0% 37C P8 19W / 250W | 10600MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:65:00.0 Off | N/A |
| 0% 38C P8 1W / 250W | 272MiB / 11018MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 26811 C python 10589MiB |
| 1 1567 G /usr/lib/xorg/Xorg 14MiB |
| 1 1661 G /usr/bin/gnome-shell 10MiB |
| 1 26811 C python 235MiB |
+-----------------------------------------------------------------------------+
When I try to run a second instance of explore before the first run completes, I get a CUDA_ERROR_OUT_OF_MEMORY error, which suggests ML4CVD tries to use the first GPU and does not utilize the second GPU although it is recognized:
er498@mithril $ sh run_explore.sh
mkdir: cannot create directory ‘/mnt/ml4cvd’: Permission denied
Attempting to run Docker with
docker run -it
--rm
--ipc=host
-v /home/er498/jupyter/root/:/root/
-v /home/er498/:/home/er498/
-v /mnt/:/mnt/
gcr.io/broad-ml4cvd/deeplearning:tf2-latest-gpu python /home/er498/repos/ml/ml4cvd/recipes.py --mode explore --tensors /data/partners_ecg/hd5_subset --input_tensors partners_ecg_patientid partners_ecg_date partners_ecg_dob partners_ecg_read_md_raw partners_ecg_read_pc_raw partners_ecg_rate partners_ecg_qrs partners_ecg_pr partners_ecg_qt partners_ecg_qtc --test_modulo 0 --output_folder /home/er498/ml4cvd_results/ --id explore_partners_ecg_subset
Processing /home/er498/repos/ml
Building wheels for collected packages: ml4cvd
Building wheel for ml4cvd (setup.py) ... done
Created wheel for ml4cvd: filename=ml4cvd-0.0.1-py3-none-any.whl size=403522 sha256=736f1b2fa148fab99dfb4397c1cf3561ecaa3c260dc6807b304b642d81702cc9
Stored in directory: /tmp/pip-ephem-wheel-cache-0_dzl_rq/wheels/9c/5b/fa/03f47092853802b5352de00dc549ae7baf4101b7e30db46407
Successfully built ml4cvd
Installing collected packages: ml4cvd
Successfully installed ml4cvd-0.0.1
2020-03-02 21:21:56.021050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-03-02 21:21:56.022306: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-03-02 21:21:57.425267: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-03-02 21:21:57.440138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-03-02 21:21:57.441509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-03-02 21:21:57.441538: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-02 21:21:57.441567: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-02 21:21:57.442968: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-02 21:21:57.443219: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-02 21:21:57.444728: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-02 21:21:57.445622: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-02 21:21:57.445659: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-02 21:21:57.450824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-03-02 21:21:57.451042: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-03-02 21:21:57.483917: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3099995000 Hz
2020-03-02 21:21:57.486154: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5a03ee0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-02 21:21:57.486188: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-03-02 21:21:57.841317: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5985690 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-03-02 21:21:57.841357: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-03-02 21:21:57.841369: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-03-02 21:21:57.842855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-03-02 21:21:57.844243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-03-02 21:21:57.844309: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-02 21:21:57.844333: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-02 21:21:57.844432: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-02 21:21:57.844482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-02 21:21:57.844532: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-02 21:21:57.844581: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-02 21:21:57.844613: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-02 21:21:57.848874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-03-02 21:21:57.848949: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-02 21:21:58.258708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-02 21:21:58.258732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0 1
2020-03-02 21:21:58.258738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N N
2020-03-02 21:21:58.258742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1: N N
2020-03-02 21:21:58.260170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5)
2020-03-02 21:21:58.261248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 9986 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2020-03-02 21:21:58.264000: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 185.62M (194641920 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Traceback (most recent call last):
File "/home/er498/repos/ml/ml4cvd/recipes.py", line 17, in <module>
from ml4cvd.arguments import parse_args
File "/usr/local/lib/python3.6/dist-packages/ml4cvd/arguments.py", line 24, in <module>
from ml4cvd.tensor_maps_by_hand import TMAPS
File "/usr/local/lib/python3.6/dist-packages/ml4cvd/tensor_maps_by_hand.py", line 1, in <module>
from ml4cvd.tensor_from_file import normalized_first_date, TMAPS
File "/usr/local/lib/python3.6/dist-packages/ml4cvd/tensor_from_file.py", line 478, in <module>
loss=weighted_crossentropy(np.array(_get_lead_cm(32)[1]), 'ecg_median_categorical'))
File "/usr/local/lib/python3.6/dist-packages/ml4cvd/metrics.py", line 33, in weighted_crossentropy
exec(string_globe, globals(), locals())
File "<string>", line 4, in <module>
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 814, in variable
constraint=constraint)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 260, in __call__
return cls._variable_v2_call(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 254, in _variable_v2_call
shape=shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 235, in <lambda>
previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 2645, in default_variable_creator_v2
shape=shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 262, in __call__
return super(VariableMetaclass, cls).__call__(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1411, in __init__
distribute_strategy=distribute_strategy)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1557, in _init_from_args
graph_mode=self._in_graph_mode)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 232, in eager_safe_variable_handle
shape, dtype, shared_name, name, graph_mode, initial_value)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 164, in _variable_handle_from_shape_and_dtype
math_ops.logical_not(exists), [exists], name="EagerVariableNameReuse")
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_logging_ops.py", line 55, in _assert
_ops.raise_from_not_ok_status(e, name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [0] [Op:Assert] name: EagerVariableNameReuse
Args attached:
arguments_2020-03-02_17-34.txt
Expected behavior when running ML4CVD on prem is to use other GPUs that have available memory.
Per @paolodi earlier: to use the other GPU so we could train two models at the same time on Mithril, we had to modify tf.sh.
If the current codebase does not support more than one GPU, this could be a good issue to tackle in a PR that extends ML4CVD to on-prem hardware.
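A hedged sketch of per-process GPU selection in TF 2.x (the device index here is illustrative and would come from a CLI flag; this is not what tf.sh currently does):

```python
import tensorflow as tf

# Sketch: pin this process to one GPU and allocate memory on demand, so a second
# `explore` run can use the other device instead of failing with
# CUDA_ERROR_OUT_OF_MEMORY. Illustration only; the index is a hypothetical flag value.
gpus = tf.config.experimental.list_physical_devices('GPU')
if len(gpus) > 1:
    tf.config.experimental.set_visible_devices(gpus[1], 'GPU')  # use the idle GPU
    tf.config.experimental.set_memory_growth(gpus[1], True)     # grow instead of grabbing ~10 GiB up front
```

Alternatively, exporting CUDA_VISIBLE_DEVICES=1 in the environment before the docker run achieves the same isolation without code changes.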
This does not break anything, but does complicate my mediocre workflow where I save results in a folder synced via Dropbox. So it is a matter of convenience.
I do this because viewing results, sharing them with collaborators / advisors, formatting figures into manuscripts, etc. is easier on macOS than a Linux terminal with no GUI.
I presume the team has an efficient way of viewing results and moving them from GCP to emails, Keynote slides, etc. Would be good to hear how it is done now.
Encoders, decoders, and the bottleneck should all be separate functions. Likely object-oriented, as Keras recommends, with sub-models and blocks implementing Layer.
Preliminary refactor in https://github.com/broadinstitute/ml/tree/nd_effectless_model_refactor
What
Model normalizers and regularizers placed automatically in correct order
Why
Normalizers and regularizers are currently placed in fixed positions, which prevents the use of some useful techniques. For example, L2 regularization is currently impossible because the regularization layer comes after the activation instead of after the layer.
How
Dictionary mapping regularization enum or string to position in block.
Acceptance Criteria
The intended behavior is to return the current path and then increment. However, at the last index, by setting self.idx = 0, shuffling the paths, and returning self.paths[self.idx - 1], a random path is returned instead. This results in a duplicate path.
We think this only generated N duplicates, where N is typically 3 (train, val, test).
PR #137 fixes this by saving the current path in a temp variable that is returned; this is unaffected by the case where we reach the last index and shuffle paths.
This bug is illustrated in a simple test with 11 ECG .hd5 files; see the attached screenshot.
Note: MRNs are scrambled so as to protect PHI!
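The fix described in PR #137 amounts to something like this sketch (the class and attribute names are assumed, not copied from the source):

```python
import random

class PathIterator:
    """Cycle over paths, reshuffling each epoch. Sketch of the fixed iterator."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.idx = 0

    def __next__(self):
        path = self.paths[self.idx]      # save the current path BEFORE any shuffle
        self.idx += 1
        if self.idx == len(self.paths):  # epoch boundary: reset index and reshuffle
            self.idx = 0
            random.shuffle(self.paths)
        return path                      # unaffected by the shuffle above
```

Because the return value is captured before the shuffle, the last path of an epoch can no longer be replaced by a random (possibly duplicate) one.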
How do we choose which date to include in phecode_mapping output table?
Admidate, operdate, etc?
Should we have a best value, or include all dates?
Should we create an ordered disease progression table, so multiple phecodes don't appear over the years if a patient is assigned a phecode early on?
Should pick one protocol and stick with it, including MPG mappings.
What
New folder ml4cvd/tensor_maps/.
Specific TensorMaps go into their own files, e.g. ml4cvd/tensor_maps/ecg_bike_tensors.py.
Eventually we move the folder to its own repo as per Marcus's suggestion!
Why
I have a branch with a 1700-line and growing tensor_from_file. You should not have to import specific libraries, like biosppy or vtk, for TensorMaps that don't use them.
I want to add more options to make_multimodal_to_multilabel_model
Add support for TensorBoard
Long-running issue with hyperparameter optimization: models are not garbage collected, so long runs tend to slow down and stall; this typically happens after 24 hours. We've tried several things to fix this with no luck so far.
Currently we have:
finally:
    del model
    gc.collect()
but it doesn't seem to help. Previously we tried things like:
def limit_mem():
    try:
        K.clear_session()
        cfg = K.tf.ConfigProto()
        cfg.gpu_options.allow_growth = True
        K.set_session(K.tf.Session(config=cfg))
    except AttributeError as e:
        logging.exception('Could not clear session. Maybe you are using Theano backend?')
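For the TF 2.x images used elsewhere in this repo, the equivalent cleanup between trials would look roughly like the sketch below (names are hypothetical, and there is no guarantee it resolves the leak):

```python
import gc
import tensorflow as tf

def run_trial(build_model, params):
    """One hyperparameter trial with explicit cleanup. Sketch; names hypothetical."""
    model = build_model(**params)
    try:
        # ... model.fit(...) and evaluation would go here ...
        return None
    finally:
        del model
        tf.keras.backend.clear_session()  # drop Keras' accumulated global graph state
        gc.collect()
```

Another option worth trying is running each trial in a separate subprocess, so the OS reclaims everything when the process exits.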
What
MRNs are cleaned at several places throughout the code. This should be performed in a uniform way, preferably by calling the same function.
Why
To ensure internal consistency that applies best practices from what we've learned about Partners MRNs.
How
We should discuss with clinical collaborators who have experience with MRN abnormalities, e.g. Steve Lubitz and @shaankhurshid.
Acceptance Criteria
Wherever MRNs are cleaned, e.g. in ml4cvd/tensor_writer_partners.py, the same function should be used.
The majority of training time is spent processing the data with CPUs.
In the image below blue is the fitting function, mauve is the GPU fitting, and green is processing the data.
(This was found with the command line ./scripts/tf.sh /home/ndiamant/ml/ml4cvd/recipes.py --mode train --tensors /mnt/disks/ecg-bike-tensors/2019-10-10/ --input_tensors ecg-bike-pretest --output_tensors ecg-bike-new-hrr --batch_size 64 --epochs 5 --training_steps 5 --validation_steps 2 --inspect_models)
This video from TensorFlow suggests some TensorFlow-specific ideas.
By adding use_multiprocessing=True and workers=16 I've been able to get a big increase in data-processing speed. That suggests a potential way forward: increasing the number of CPUs on our machines, which is relatively cheap.
If you have fairly fixed tensor maps, it might be worth saving preprocessing results to a csv or something like that, but that seems like a waste of developer time to figure out.
The disk should be attached by the ml4cvd-image.sh file: https://console.cloud.google.com/storage/browser/ml4cvd/projects/jamesp/server-config/?project=broad-ml4cvd
but it's not working currently.
Also the ml4cvd-image.sh and other files in the server-config directory should be added to the new repo.
What
Option to guess depth, number of channels etc. based on input and output shapes (like EfficientNet)
Why
Makes first attempt at any task better, makes the code more useful for newcomers.
How
Not sure. Probably first attempt would be an EfficientNet implementation with u_connect for segmentation and autoencoding.
Acceptance Criteria
Our current tensorization process involves 4 main steps:
Run Dataflow jobs to tensorize data from the BigQuery database. This means six separate Dataflow runs, one for each of the fields: ['categorical', 'continuous', 'icd', 'disease', 'death', 'phecode_disease'].
Tensorize bulk data with tensor_writer_ukbb.py
. This file has code to tensorize abdominal, cardiac, and brain MRI as well as resting and exercise ECG.
Merge tensors with merge_hd5s.py
. Doing intersections or in-place merging as appropriate.
Append any CSV or TSV data with the `append[categorical,continuous]_[csv,tsv]` recipe modes.
Each step is currently run separately. We would also like to be able to run them all at once and store intermediate tensors in google cloud buckets.
Dataflow pipeline runs write tensors to GCS buckets using GCS' Python client. From Stackdriver logs, it appears a single client ends up being used by all Dataflow workers. This can be problematic. Investigate whether we can have one client per 'task' (unit of work Dataflow sends to a worker at once) and/or assess how risky having a single client for the entire run would be in the future.
What
Tensorization is parallelized via three different schemes. This should be unified.
Why
Let's minimize redundancy within the codebase.
How
TBD
Acceptance Criteria
Tensorization is performed in parallel via a single approach.
Overall:
pip install apache-beam[gcp] --ignore-installed PyYAML
Per file:
defines.py: there is already a defines.py in ml4cvd, so please prune so that we're not defining twice. (I just copied this over quickly, so there's lots of unnecessary ECG stuff in here.)
process_mri_test.py
process_mri.py
experiment2.py
What
Pre-populate our issues with @christopherreeder's template:
**What**
Summarize the issue in 1-2 sentences.
**Why**
Describe why this issue should be solved, new feature implemented, etc.
**How**
High-level overview of how you propose to address.
**Acceptance Criteria**
Unambiguous milestones; if any are incomplete, the PR cannot be merged.
so meta
Why
To save us time when we make new issues, and improve adherence to the nice template.
How
Instructions: https://help.github.com/en/github/building-a-strong-community/configuring-issue-templates-for-your-repository
I would make one but do not have sufficient permissions for this repo.
If temporarily elevated (@lucidtronix @gpbatra), I'd be glad to set it up.
Acceptance Criteria
This repo has a GitHub Issues template that is a selectable option for pre-populating new issues.
Current phecode mapping only uses icd10_diag from the main hesin table. There are secondary codes in other hesin tables (icd_oper, icd9, etc.). Those should also be mapped, but the interpretation will be sensitive.
Let's keep an eye on this and decide when to triage once we know more about how we want to structure phenotypes.
The hesin tables refer to sample_id as eid; these should be changed here and downstream.
If N > 1 output tensors have the same name, only the first one is used and an obscure error is thrown ("you are using N loss functions but your model has only a single output").
To fix, we should check for duplicates among output TensorMap names.
We should also check whether this bug applies to input TensorMaps.
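A minimal fail-fast check, assuming only that TensorMaps expose a .name attribute, could run on both the input and output lists before model construction:

```python
from collections import Counter

def check_unique_names(tensor_maps):
    """Fail fast with a clear message instead of Keras' obscure loss-count error.

    Sketch: tensor_maps is any iterable of objects with a .name attribute.
    """
    counts = Counter(tm.name for tm in tensor_maps)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f'Duplicate TensorMap names: {duplicates}')
```

Calling this once for input tensor maps and once for output tensor maps covers both directions of the suspected bug.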
Currently we store the list of field ids used for tensorizing categorical and continuous fields in BigQuery's shared_data.tensorization_fieldids
table. It was created from a csv file in GCS (gs://ml4cvd/data/fieldids.csv
). This helps eliminate fetching of phenotype
table rows we don't care about. However, when we need to add new fields to be tensorized, directly updating the table would make it difficult to audit, especially if we run into tensorization issues due to it.
One way we could keep field adding easy while having a log of what changed, is by checking in the tsv file into the repo, and have the tensorization pipeline re-create the tensorization_fieldids
table from it.
Currently we have scripts/dl_jupyter.sh and scripts/jupyter.sh. These should be merged, and the Docker image should be selectable between GPU and CPU on the command line, the way it works in scripts/tf.sh.
Currently tensorize is a package separate from ml4cvd. This will lead to code duplication. Ideally, all this code will be packaged into one pip-installable Python package: tensorize should be integrated into ml4cvd, or, if there is a reason this is not possible, both packages should be set up as pip-installable Python packages.
Import the (soon-to-be-created) ml4cvd package.
Different testing files for tensorization, training, and evaluation.
We should have one set of quick tests that can execute throughout development, and a suite of longer tests that give more scientific validation for PR merges and/or major refactors.
Our training and inference tests are very formulaic. They are given a list of input TensorMaps, output TensorMaps, architecture parameters, and expected performance metrics. In Java this could all be abstracted with a DataProvider. Is there something similar in Python?
Ideally, we will setup continuous or nightly integration testing as well.
What
Right now ConvEncoder
and ConvDecoder
build DenseBlocks
. Instead they should get passed a list of blocks and chain them together.
Why
Allows more flexible swapping of the many block types we want to experiment with, e.g. attention convolution, efficient block, etc.
How
Pass a list of pre-built blocks into the ConvEncoder and ConvDecoder classes.
Acceptance Criteria
What
Remove the reliance of train_multimodal_multitask on big_batch_from_minibatch_generator
Why
In the train recipe, test workers use big_batch_from_minibatch_generator, which frequently leads to a memory allocation error.
How
Write an alternative to big_batch_from_minibatch_generator that does not gather minibatches into larger batches but still produces inputs, outputs, and paths in the format expected by _predict_and_evaluate.
Acceptance Criteria
Memory allocation errors caused by big_batch_from_minibatch_generator no longer occur during training runs
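A minimal sketch of the alternative, assuming the (inputs, outputs, paths) tuple layout implied by the names above; _predict_and_evaluate would then consume batches iteratively rather than as one concatenated batch:

```python
def iterate_minibatches(generator, steps):
    """Yield (inputs, outputs, paths) one minibatch at a time.

    Sketch of an alternative to big_batch_from_minibatch_generator: peak memory
    stays at one minibatch instead of steps * batch_size, while the consumer
    still sees the same tuple layout per batch.
    """
    for _ in range(steps):
        inputs, outputs, paths = next(generator)
        yield inputs, outputs, paths
```

The design trade-off is that any metric needing the full test set at once (e.g. a global ROC curve) must accumulate predictions incrementally instead of receiving one big array.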
What
Implement a TensorMap that succinctly describes the bounding box of a 3-D object (e.g., a 3-D image segmentation).
Why
ML4CVD is designed to naturally handle models that can enhance the extraction of structural and functional information from widespread diagnostic assessments via limited training on rare information-rich modalities. For example, the codebase has been successfully employed to perform complex segmentation tasks on cardiac MRI, and to implement state-of-the-art models that infer derived features of segmentation, such as LV mass, from cheap and widely available ECGs.
In the current pipeline, however, the models have often been asked to treat rich 3-D information either as a collection of 2-D assessments (slice-by-slice), or as fully unconstrained 3-D objects embedded in structured grids (3-D images). Introducing intermediate and succinct representations of the 3-D objects (e.g., via parametric models and meshes) could increase model performance, enhance interpretability, and provide helpful regularizers for complex multi-task models.
How
The implementation of general 3-D Mesh TensorMaps is a complex task that might be better tackled in several substeps. In this first step, we will leverage existing TMAPs that extract the main axis of 3-D objects via SVD on centroids of arbitrary cross-sections. Rather than limiting the extraction to a single axis, we will expand SVD to extract the 3 orientation angles and use boundary detection algorithms from VTK to extract a meaningful bounding box.
Acceptance Criteria
A TMAP returning a bounding box of SAX cardiac MRI segmentation is tested as the target of supervised training from 1) a cardiac MRI and 2) an ECG model.
What
Implement Explore mode in recipes.py that provides summary statistics for specified input tensors of specified HD5 files.
Why
Understanding basic information of one's data is a vital first step before training models.
How
Iterates through three tmap interpretations (categorical, continuous, and language). For each interpretation, all user-specified input tensor maps that match that type are extracted from HD5 files into a Pandas DataFrame, from which summary statistics are calculated and saved as CSV files.
At the end, all input tensors are concatenated into a large DataFrame and saved to a CSV file. Each row is a patient. Columns are: 1) tensor maps (or, if the tmap has channels, a tuple of (tmap, cm)), 2) errors (if any are thrown while opening the tensor), and 3) the full path to the tensor on disk. This large CSV file will be ingested into a database for future queries.
Acceptance Criteria
Summary statistics and a big CSV file with all tensors are generated.
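The continuous-interpretation pass described in How could be sketched with pandas like this (the function name and example columns are hypothetical; the real explore mode builds the DataFrame by extracting each user-specified tensor map from HD5 files):

```python
import pandas as pd

def summarize_continuous(df: pd.DataFrame) -> pd.DataFrame:
    """Summary statistics for continuous tmaps: one row per tensor map. Sketch only."""
    stats = df.describe().T            # count, mean, std, min, quartiles, max per column
    stats['missing'] = df.isna().sum() # how many tensors failed or were absent
    return stats

# Hypothetical columns named after Partners ECG tmaps:
df = pd.DataFrame({'partners_ecg_rate': [60.0, 75.0, None],
                   'partners_ecg_qt': [400.0, 420.0, 410.0]})
summary = summarize_continuous(df)
```

Each interpretation (categorical, language) would get an analogous pass, with counts per channel instead of quartiles where appropriate.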
What
Evaluation plots should include the ABCD framework proposed in Towards better clinical prediction models: seven steps for development and an ABCD for validation. The ABCDs are: calibration-in-the-large, or the model intercept (A); calibration slope (B); discrimination, with a concordance statistic (C); and clinical usefulness, with decision-curve analysis (D).
Why
It gives several perspectives on model validity, and using a previously published framework from an external group is easy to defend. We currently get very little insight into model calibration and do no decision-curve analysis.
How
This can probably be separated into two or more PRs, hopefully committed by different folks. We basically have C already (ROC curve and C statistic).
Acceptance Criteria
train and test modes generate plots like the attached examples.
What
TensorMap creation is slow, especially if it requires parsing a CSV. This slowness is compounded if the CSV is read over a network, e.g. from MAD3 or ERISOne.
See #171 (comment)
Why
This happens because we have a ton of TensorMaps, and we load many of them even though we only use a few whenever running ML4CVD.
How
Implementation details TBD. I think this is worth a video chat to discuss. Relates to #143 (organize tensor_from_file).
Acceptance Criteria
When ML4CVD is run, only the user-specified input and output tensor maps should be created, and the rest are not.
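One possible shape for this, with hypothetical names: register factory callables rather than constructed TensorMaps, so expensive work (e.g. parsing a CSV over the network) runs only for the maps actually requested:

```python
# Sketch of lazy TensorMap construction; registry names and factory style hypothetical.
TMAP_FACTORIES = {}
_BUILT = {}

def register_tmap(name):
    """Register a zero-argument factory; nothing expensive runs at import time."""
    def wrap(factory):
        TMAP_FACTORIES[name] = factory
        return factory
    return wrap

def get_tmap(name):
    """Build a TensorMap on first request and cache it."""
    if name not in _BUILT:
        _BUILT[name] = TMAP_FACTORIES[name]()
    return _BUILT[name]

@register_tmap('ecg_rate')
def _ecg_rate():
    # Stand-in for an expensive construction, e.g. parsing a remote CSV.
    return {'name': 'ecg_rate'}
```

Import of a tensor-map module then registers names only; the CSV parsing cost is paid per requested map, once.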
What
Code to manipulate and tensorize Partners ECG XML files currently lives in a different repo: https://github.com/mit-ccrg/partners-ecg
Why
These are capabilities that should exist within ML4CVD.
How
Move scripts related to XMLs from original repo to ml/ingest
.
Move scripts related to tensorization to ml/ml4cvd/tensorize/
.
Acceptance Criteria
Scripts to manipulate and tensorize Partners ECG XML files live inside of ML4CVD, and maintain original functionality.
macOS keeps correcting tensorization to "tenderization" but I am OK with that.
What
Display training, validation, and test set sizes at the end of the log file for train mode (and potentially other modes).
Clearly portray how many epochs actually completed (due to patience).
Why
It is helpful to know the number of tensors used for training, validation, and test, as well as the label count within each set.
Label count makes sense for categorical. Less clear how we best handle this for regression models.
It is also important to know when early stopping occurred.
Currently this information is not consolidated in one place in the log file. It also is spread out over workers.
How
Aggregate over workers.
Acceptance Criteria
After running recipes with train mode, the number of tensors used for the training, validation, and test sets, as well as label counts in each set and the number of epochs actually run before early stopping, are summarized at the end of the log file.
https://www.tensorflow.org/addons/api_docs/python/tfa/optimizers has RAdam, cyclic learning rates, and model weight averaging, all of which we want. We should use it.
In progress in #170.
What
Optimize which augmentations, sizes, and covariates to use to minimize loss.
Specify
Why
We don't have a clean way of optimizing over augmentations / shapes / covariates
How
Helper functions in hyperparameters.py
that allow you to specify the above.
Acceptance Criteria
Easy to understand/use optimization over input tmaps