mahmoodlab / clam

Data-efficient and weakly supervised computational pathology on whole slide images - Nature Biomedical Engineering

Home Page: http://clam.mahmoodlab.org

License: GNU General Public License v3.0

Python 100.00%
histopathology pathology weakly-supervised-learning whole-slide-imaging data-efficient computational-pathology mahmoodlab bioimage-informatics deep-learning tcga-data

clam's Introduction

CLAM

Data Efficient and Weakly Supervised Computational Pathology on Whole Slide Images. Nature Biomedical Engineering

ArXiv | Journal Link | Interactive Demo | Cite

TL;DR: CLAM is a high-throughput and interpretable method for data-efficient whole slide image (WSI) classification using slide-level labels, without any ROI extraction or patch-level annotations, and is capable of handling multi-class subtyping problems. Tested on three different WSI datasets, trained models adapt to independent test cohorts of WSI resections and biopsies, as well as smartphone microscopy images (photomicrographs).

CLAM: A Deep-Learning-based Pipeline for Data Efficient and Weakly Supervised Whole-Slide-level Analysis

Pre-requisites • Installation • Segmentation and Patching • Feature Extraction • Weakly Supervised Training • Testing • Trained Models • Heatmap Visualization • Examples • Pre-print • Demo • Cite

How does CLAM work? Clustering-constrained Attention Multiple Instance Learning (CLAM) is a deep-learning-based weakly-supervised method that uses attention-based learning to automatically identify sub-regions of high diagnostic value in order to accurately classify the whole slide, while also utilizing instance-level clustering over the representative regions identified to constrain and refine the feature space.

© Mahmood Lab - This code is made available under the GPLv3 License and is available for non-commercial academic purposes.

Updates:

  • 04/06/2024: UNI and CONCH are now available to select as pretrained encoders. See Using CONCH / UNI as Pretrained Encoders for more details. Please make sure all dependencies are installed correctly by installing the latest env.yml file (see Installation guide for details), and using the corresponding clam_latest conda environment.
  • 03/19/2024: We are releasing UNI and CONCH, a pair of SOTA pretrained encoders that produce strong representations for histopathology images and enhance performance on various computational pathology workflows, including the MIL-based CLAM workflow.
  • 05/24/2021: Script for heatmap visualization now available via create_heatmaps.py, with the configuration template located in heatmaps/configs. See Heatmap visualization for demo and instructions.
  • 03/01/2021: A new, fast patching/feature extraction pipeline is now available. TL;DR: since CLAM only requires image features for training, it is not necessary to save the actual image patches. The new pipeline removes this overhead by saving only the coordinates of image patches during "patching" and loading these regions on the fly from the WSIs during feature extraction. This is significantly faster than the old pipeline and usually takes only 1-2 s for "patching" and a couple of minutes to featurize a WSI. To use the new pipeline, make sure you are calling create_patches_fp.py and extract_features_fp.py instead of the old create_patches.py and extract_features.py scripts.

Note: while we hope that the newest update will require minimal changes to the user's workflow, if needed, you may reference the old version of the code base here. Please report any issues in the public forum.

Warning: the latest update will by default resize image patches to 224 x 224 before extracting features with the pretrained encoder. This change makes the protocol more consistent with the evaluation setup used in UNI, CONCH, and other studies. If you wish to preserve the original size of the image patches generated during patching, or use a different image size for feature extraction, you can do so by specifying --target_patch_size in extract_features_fp.py.

RE update 03/01/21: note that the README has been updated to use the new, faster pipeline by default. If you still wish to use the old pipeline, refer to: Guide for Old Pipeline. The old pipeline saves tissue patches, which is significantly slower and takes up a lot of storage space, but it can still be useful if you need to work with the original image patches instead of feature embeddings.

Installation:

Please refer to our Installation guide for detailed instructions on how to get started.

WSI Segmentation and Patching

The first step focuses on segmenting the tissue and excluding any holes. The segmentation of specific slides can be adjusted by tuning the individual parameters (e.g. dilated vessels appearing as holes may be important for certain sarcomas). The following example assumes that digitized whole slide image data in well-known standard formats (.svs, .ndpi, .tiff, etc.) are stored under a folder named DATA_DIRECTORY:
DATA_DIRECTORY/
	├── slide_1.svs
	├── slide_2.svs
	└── ...

Basic, Fully Automated Run

python create_patches_fp.py --source DATA_DIRECTORY --save_dir RESULTS_DIRECTORY --patch_size 256 --seg --patch --stitch 

The above command will segment every slide in DATA_DIRECTORY using default parameters, extract all patches within the segmented tissue regions, create a stitched reconstruction for each slide using its extracted patches (optional), and generate the following folder structure at the specified RESULTS_DIRECTORY:

RESULTS_DIRECTORY/
	├── masks
    		├── slide_1.png
    		├── slide_2.png
    		└── ...
	├── patches
    		├── slide_1.h5
    		├── slide_2.h5
    		└── ...
	├── stitches
    		├── slide_1.png
    		├── slide_2.png
    		└── ...
	└── process_list_autogen.csv

The masks folder contains the segmentation results (one image per slide). The patches folder contains arrays of extracted tissue patches from each slide (one .h5 file per slide, where each entry corresponds to the coordinates of the top-left corner of a patch). The stitches folder contains downsampled visualizations of stitched tissue patches (one image per slide; optional, not used for downstream tasks). The auto-generated csv file process_list_autogen.csv contains a list of all slides processed, along with the segmentation/patching parameters used.
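For example, a minimal sketch for inspecting the saved coordinates of one slide (this assumes the .h5 files expose the coordinates under a 'coords' dataset; check the keys with h5py if your version differs):

import h5py

# hypothetical path to one of the coordinate files produced above
h5_path = 'RESULTS_DIRECTORY/patches/slide_1.h5'
with h5py.File(h5_path, 'r') as f:
    print(list(f.keys()))            # inspect the available datasets
    coords = f['coords'][:]          # assumed key; (N, 2) array of top-left (x, y) coordinates
    print('number of patches:', len(coords))
    print('first patch coordinate:', coords[0])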

Additional flags that can be passed include:

  • --custom_downsample: factor for custom downscale (not recommended, ideally should first check if native downsamples exist)
  • --patch_level: which downsample pyramid level to extract patches from (default is 0, the highest available resolution)
  • --no_auto_skip: by default, the script will skip over files for which patched .h5 files already exist in the destination folder; this toggle can be used to override that behavior

Some parameter templates are also available and can be readily deployed as good default choices:

  • bwh_biopsy.csv: used for segmenting biopsy slides scanned at BWH (Scanned using Hamamatsu S210 and Aperio GT450)
  • bwh_resection.csv: used for segmenting resection slides scanned at BWH
  • tcga.csv: used for segmenting TCGA slides

Simply pass the name of the template file to the --preset argument, for example, to use the biopsy template:

python create_patches_fp.py --source DATA_DIRECTORY --save_dir RESULTS_DIRECTORY --patch_size 256 --preset bwh_biopsy.csv --seg --patch --stitch

Custom Default Segmentation Parameters

For advanced usage, in addition to using the default, single set of parameters defined in the script create_patches_fp.py, the user can define custom templates of parameters depending on the dataset. These templates are expected to be stored under presets, and contain values for each of the parameters used during segmentation and patching.

The list of segmentation parameters is as follows:

  • seg_level: downsample level on which to segment the WSI (default: -1, which uses the downsample in the WSI closest to 64x downsample)
  • sthresh: segmentation threshold (positive integer, default: 8, using a higher threshold leads to less foreground and more background detection)
  • mthresh: median filter size (positive, odd integer, default: 7)
  • use_otsu: use otsu's method instead of simple binary thresholding (default: False)
  • close: additional morphological closing to apply following initial thresholding (positive integer or -1, default: 4)

The list of contour filtering parameters is as follows:

  • a_t: area filter threshold for tissue (positive integer, the minimum size of detected foreground contours to consider, relative to a reference patch size of 512 x 512 at level 0; e.g. a value of 10 means only detected foreground contours larger than ten 512 x 512 patches at level 0 will be processed; default: 100)
  • a_h: area filter threshold for holes (positive integer, the minimum size of detected holes/cavities in foreground contours to avoid, once again relative to 512 x 512 sized patches at level 0, default: 16)
  • max_n_holes: maximum number of holes to consider per detected foreground contour (positive integer, default: 10; a higher maximum leads to more accurate patching but increases computational cost)

The list of segmentation visualization parameters is as follows:

  • vis_level: downsample level to visualize the segmentation results (default: -1, which uses the downsample in the WSI closest to 64x downsample)
  • line_thickness: line thickness used to draw the segmentation results (positive integer, in terms of the number of pixels occupied by the drawn line at level 0, default: 250)

The list of patching parameters is as follows:

  • use_padding: whether to pad the border of the slide (default: True)
  • contour_fn: contour checking function used to decide whether a patch should be considered foreground or background (choices: 'four_pt' - checks if all four points in a small grid around the center of the patch are inside the contour, 'center' - checks if the center of the patch is inside the contour, 'basic' - checks if the top-left corner of the patch is inside the contour; default: 'four_pt')
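To illustrate, a hedged sketch for writing such a custom preset csv; the column names are assumed to match the parameter names listed above, so compare against the files shipped under presets before relying on it:

import pandas as pd

# hypothetical preset covering the segmentation, filtering, visualization and patching parameters above
preset = pd.DataFrame([{
    'seg_level': -1, 'sthresh': 8, 'mthresh': 7, 'close': 4, 'use_otsu': False,
    'a_t': 100, 'a_h': 16, 'max_n_holes': 10,
    'vis_level': -1, 'line_thickness': 250,
    'use_padding': True, 'contour_fn': 'four_pt',
}])
preset.to_csv('presets/my_dataset.csv', index=False)

The resulting file can then be passed via --preset my_dataset.csv, just like the built-in templates.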

Two-Step Run (Manually Adjust Parameters for Specific Slides)

To ensure high-quality segmentation and extraction of relevant tissue patches, the user has the option of first performing segmentation (typically around 1 s per slide), inspecting the segmentation results, tweaking the parameters for select slides if necessary, and then extracting patches using the tweaked parameters. That is, first run:

python create_patches_fp.py --source DATA_DIRECTORY --save_dir RESULTS_DIRECTORY --patch_size 256 --seg  

The above command will segment every slide in DATA_DIRECTORY using default parameters and generate the csv file, but will NOT patch just yet (the patches and stitches folders will remain empty).

The csv file can be tweaked for specific slides and passed to the script via --process_list CSV_FILE_NAME so that the script uses the user-updated specifications. Before tweaking the segmentation parameters, the user should make a copy of the csv file and give it a new name (e.g. process_list_edited.csv), because otherwise the file with the default name is overwritten the next time the command is run. The user can then tweak the parameters for specific slides by changing their corresponding fields in the csv file. The process column stores a binary variable (0 or 1) indicating whether the script should process a specific slide. This allows the user to toggle on just a select few slides to quickly confirm whether the tweaked parameters produce satisfactory results. For example, to re-segment just slide_1.svs using user-updated parameters, make the appropriate changes to its fields, update its process cell to 1, save the csv file, and pass its name to the same command as above:

python create_patches_fp.py --source DATA_DIRECTORY --save_dir RESULTS_DIRECTORY --patch_size 256 --seg --process_list process_list_edited.csv

When satisfied with the segmentation results, the user should set the process cell to 1 for all slides that need to be processed, save the csv file, and run patching with the saved csv file (just like in the fully automated use case, with the additional csv file argument):

python create_patches_fp.py --source DATA_DIRECTORY --save_dir RESULTS_DIRECTORY --patch_size 256 --seg --process_list CSV_FILE_NAME --patch --stitch
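The csv edits above can also be scripted; a minimal pandas sketch (the process and slide_id column names are assumed from process_list_autogen.csv, and seg_level is just an example of a parameter to tweak):

import pandas as pd

df = pd.read_csv('RESULTS_DIRECTORY/process_list_autogen.csv')
df['process'] = 0                                  # skip everything by default
mask = df['slide_id'] == 'slide_1.svs'             # select the slide to re-segment
df.loc[mask, 'process'] = 1                        # toggle it on
df.loc[mask, 'seg_level'] = 1                      # example parameter tweak
df.to_csv('process_list_edited.csv', index=False)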

Weakly-Supervised Learning using Slide-Level Labels with CLAM

Feature Extraction (GPU Example)

CUDA_VISIBLE_DEVICES=0 python extract_features_fp.py --data_h5_dir DIR_TO_COORDS --data_slide_dir DATA_DIRECTORY --csv_path CSV_FILE_NAME --feat_dir FEATURES_DIRECTORY --batch_size 512 --slide_ext .svs

The above command expects the coordinate .h5 files to be stored under DIR_TO_COORDS and uses a batch size of 512 to extract 1024-dim features from each tissue patch of each slide, producing the following folder structure:

FEATURES_DIRECTORY/
    ├── h5_files
            ├── slide_1.h5
            ├── slide_2.h5
            └── ...
    └── pt_files
            ├── slide_1.pt
            ├── slide_2.pt
            └── ...

where each .h5 file contains an array of extracted features along with their patch coordinates (note that for faster training, a .pt file is also created for each slide, containing just the patch features). The csv file is expected to contain a list of slide filenames (without the filename extensions) to process; the easiest option is to take the csv file auto-generated by the previous segmentation/patching step and delete the filename extensions.
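A quick sanity check of the extracted features (a sketch; the 'features' and 'coords' keys are assumed and may differ in your version):

import torch
import h5py

feats = torch.load('FEATURES_DIRECTORY/pt_files/slide_1.pt')
print(feats.shape)                   # expected: (num_patches, 1024) with the default ResNet50 encoder

with h5py.File('FEATURES_DIRECTORY/h5_files/slide_1.h5', 'r') as f:
    print(list(f.keys()))            # inspect the available datasets
    print(f['features'].shape, f['coords'].shape)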

Using CONCH / UNI as Pretrained Encoders

If using UNI or CONCH, first refer to their respective Hugging Face pages below to request and download the model weights (pytorch_model.bin).

UNI: https://huggingface.co/MahmoodLab/UNI

CONCH: https://huggingface.co/MahmoodLab/CONCH

After successfully downloading the model checkpoints, you need to set the CONCH_CKPT_PATH and UNI_CKPT_PATH environment variables to the paths of the pretrained encoder checkpoints before running the feature extraction script. For example, if you have downloaded the pretrained UNI and CONCH checkpoints and placed them in the checkpoints/uni and checkpoints/conch folders respectively, you can set the environment variables as follows:

export CONCH_CKPT_PATH=checkpoints/conch/pytorch_model.bin
export UNI_CKPT_PATH=checkpoints/uni/pytorch_model.bin

When running extract_features_fp.py, also set --model_name to either 'uni_v1' or 'conch_v1' to use the respective encoder.

Note that these encoder models (especially UNI, which uses ViT-L) are more computationally expensive and require more GPU memory than the default ResNet50 encoder, so expect longer runtimes and reduce the batch size accordingly if you run out of GPU memory. UNI produces 1024-dim features, while CONCH produces 512-dim features.
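The environment variables can also be set from Python before the feature extraction code is imported; a minimal sketch equivalent to the shell exports above (paths are placeholders):

import os

os.environ['CONCH_CKPT_PATH'] = 'checkpoints/conch/pytorch_model.bin'
os.environ['UNI_CKPT_PATH'] = 'checkpoints/uni/pytorch_model.bin'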

Datasets

The data used for training and testing are expected to be organized as follows:

DATA_ROOT_DIR/
    ├──DATASET_1_DATA_DIR/
        ├── h5_files
                ├── slide_1.h5
                ├── slide_2.h5
                └── ...
        └── pt_files
                ├── slide_1.pt
                ├── slide_2.pt
                └── ...
    ├──DATASET_2_DATA_DIR/
        ├── h5_files
                ├── slide_a.h5
                ├── slide_b.h5
                └── ...
        └── pt_files
                ├── slide_a.pt
                ├── slide_b.pt
                └── ...
    └──DATASET_3_DATA_DIR/
        ├── h5_files
                ├── slide_i.h5
                ├── slide_ii.h5
                └── ...
        └── pt_files
                ├── slide_i.pt
                ├── slide_ii.pt
                └── ...
    └── ...

Namely, each dataset is expected to be a subfolder (e.g. DATASET_1_DATA_DIR) under DATA_ROOT_DIR, and the features extracted for each slide in the dataset are stored as a .pt file under the pt_files folder of this subfolder. Each dataset is also expected to be described by a csv file containing at least 3 columns: case_id, slide_id, and one or more label columns for the slide-level labels. Each case_id is a unique identifier for a patient, while slide_id is a unique identifier for a slide that corresponds to the name of an extracted feature .pt file. This distinction is necessary because one patient often has multiple slides, which might also have different labels. When train/val/test splits are created, we also make sure that slides from the same patient do not end up in different splits. The slide ids should be consistent with what was used during the feature extraction step. We provide 2 dummy examples of such dataset csv files in the dataset_csv folder: one for binary tumor vs. normal classification (task 1) and one for multi-class tumor subtyping (task 2).
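As an illustration, a hedged sketch for assembling such a csv (a toy example only; see the dummy files in dataset_csv for the exact format expected by the code). The slide_id values must match the .pt filenames without extension:

import pandas as pd

rows = [
    {'case_id': 'patient_0', 'slide_id': 'slide_1', 'label': 'normal_tissue'},
    {'case_id': 'patient_0', 'slide_id': 'slide_2', 'label': 'tumor_tissue'},
    {'case_id': 'patient_1', 'slide_id': 'slide_3', 'label': 'tumor_tissue'},
]
pd.DataFrame(rows).to_csv('dataset_csv/my_task_dummy.csv', index=False)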

Dataset objects used for actual training/validation/testing can be constructed using the Generic_MIL_Dataset Class (defined in datasets/dataset_generic.py). Examples of such dataset objects passed to the models can be found in both main.py and eval.py.

For training, look under main.py:

if args.task == 'task_1_tumor_vs_normal':
    args.n_classes=2
    dataset = Generic_MIL_Dataset(csv_path = 'dataset_csv/tumor_vs_normal_dummy_clean.csv',
                            data_dir= os.path.join(args.data_root_dir, 'tumor_vs_normal_feat_resnet'),
                            shuffle = False, 
                            seed = args.seed, 
                            print_info = True,
                            label_dict = {'normal_tissue':0, 'tumor_tissue':1},
                            label_col = 'label',
                            ignore=[])

The user would need to pass:

  • csv_path: the path to the dataset csv file
  • data_dir: the path to saved .pt features
  • label_dict: a dictionary that maps labels in the label column to numerical values
  • label_col: name of the label column (optional, by default it's 'label')
  • ignore: labels to ignore (optional, by default it's an empty list)

Finally, the user should register the 'task' corresponding to this dataset object in the --task argument choices, as shown below:

parser.add_argument('--task', type=str, choices=['task_1_tumor_vs_normal',  'task_2_tumor_subtyping'])
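A new task can be registered analogously; a sketch mirroring the snippet above, to be added alongside the existing if/elif chain in main.py (the task name, csv path, and feature folder are placeholders), with the new name also appended to the choices of the --task argument:

elif args.task == 'my_new_task':
    args.n_classes = 2
    dataset = Generic_MIL_Dataset(csv_path = 'dataset_csv/my_task_dummy.csv',
                            data_dir = os.path.join(args.data_root_dir, 'my_task_features'),
                            shuffle = False,
                            seed = args.seed,
                            print_info = True,
                            label_dict = {'normal_tissue':0, 'tumor_tissue':1},
                            label_col = 'label',
                            ignore = [])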

Training Splits

For evaluating the algorithm's performance, multiple folds (e.g. 10-fold) of train/val/test splits can be used. Example 10-fold 80/10/10 splits for the two dummy datasets can be found under the splits folder. These splits can be automatically generated using the create_splits_seq.py script with minimal modification just like with main.py. For example, tumor_vs_normal splits can be created by calling:

python create_splits_seq.py --task task_1_tumor_vs_normal --seed 1 --k 10

The script uses the Generic_WSI_Classification_Dataset class, whose constructor expects the same arguments as Generic_MIL_Dataset (without the data_dir argument). For details, please refer to the dataset definition in datasets/dataset_generic.py.
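A hedged sketch of constructing such a dataset object directly, mirroring the Generic_MIL_Dataset example from the training section but without data_dir (the remaining constructor arguments are assumed to be identical; verify against datasets/dataset_generic.py):

from datasets.dataset_generic import Generic_WSI_Classification_Dataset

dataset = Generic_WSI_Classification_Dataset(csv_path = 'dataset_csv/tumor_vs_normal_dummy_clean.csv',
                            shuffle = False,
                            seed = 1,
                            print_info = True,
                            label_dict = {'normal_tissue':0, 'tumor_tissue':1},
                            label_col = 'label',
                            ignore = [])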

GPU Training Example for Binary Positive vs. Negative Classification (e.g. Lymph Node Status)

Note: --embed_dim should be set to 512 for CONCH, and 1024 for UNI and resnet50_trunc.

CUDA_VISIBLE_DEVICES=0 python main.py --drop_out 0.25 --early_stopping --lr 2e-4 --k 10 --exp_code task_1_tumor_vs_normal_CLAM_50 --weighted_sample --bag_loss ce --inst_loss svm --task task_1_tumor_vs_normal --model_type clam_sb --log_data --data_root_dir DATA_ROOT_DIR --embed_dim 1024

GPU Training Example for Subtyping Problems (e.g. 3-class RCC Subtyping)

CUDA_VISIBLE_DEVICES=0 python main.py --drop_out 0.25 --early_stopping --lr 2e-4 --k 10 --exp_code task_2_tumor_subtyping_CLAM_50 --weighted_sample --bag_loss ce --inst_loss svm --task task_2_tumor_subtyping --model_type clam_sb --log_data --subtyping --data_root_dir DATA_ROOT_DIR --embed_dim 1024

Note: We have included the option to use a single-attention-branch CLAM model, which performs favorably in most experiments; it can be selected via --model_type clam_sb (single branch) or clam_mb (multi branch). clam_sb is the default choice. Additionally, the user can adjust the number of patches used for clustering via --B.

By default, results will be saved to results/exp_code, corresponding to the exp_code input argument from the user. If tensorboard logging is enabled (with the --log_data argument), the user can go into the results folder for the particular experiment and run:

tensorboard --logdir=.

This should open a browser window and show the logged training/validation statistics in real time. For information on each argument, see:

python main.py -h

Testing and Evaluation Script

The user also has the option of using the evaluation script to test the performance of trained models. Examples corresponding to the models trained above are provided below:

CUDA_VISIBLE_DEVICES=0 python eval.py --k 10 --models_exp_code task_1_tumor_vs_normal_CLAM_50_s1 --save_exp_code task_1_tumor_vs_normal_CLAM_50_s1_cv --task task_1_tumor_vs_normal --model_type clam_sb --results_dir results --data_root_dir DATA_ROOT_DIR --embed_dim 1024
CUDA_VISIBLE_DEVICES=0 python eval.py --k 10 --models_exp_code task_2_tumor_subtyping_CLAM_50_s1 --save_exp_code task_2_tumor_subtyping_CLAM_50_s1_cv --task task_2_tumor_subtyping --model_type clam_sb --results_dir results --data_root_dir DATA_ROOT_DIR --embed_dim 1024

Once again, for information on each commandline argument, see:

python eval.py -h

By adding your own custom datasets into eval.py the same way as you do for main.py, you can also easily test trained models on independent test sets.

Heatmap Visualization

Heatmaps can be generated in bulk via create_heatmaps.py by filling out the config file, storing it in /heatmaps/configs, and then running create_heatmaps.py with the --config NAME_OF_CONFIG_FILE flag. A demo template is included (config_template.yaml) for lung subtyping on two WSIs from CPTAC. To run the demo (raw results are saved in heatmaps/heatmap_raw_results and final results are saved in heatmaps/heatmap_production_results):

CUDA_VISIBLE_DEVICES=0 python create_heatmaps.py --config config_template.yaml

See /heatmaps/configs/config_template.yaml for explanations for each configurable option.

Similar to feature extraction, if using UNI / CONCH, set the environment variables before running the script. See Using CONCH / UNI as Pretrained Encoders for more details.

Trained Model Checkpoints

For reproducibility, all trained models used can be accessed here. The 3 main folders (tcga_kidney_cv, tcga_cptac_lung_cv and camelyon_40x_cv) correspond to models for RCC subtyping trained on TCGA, NSCLC subtyping trained on TCGA and CPTAC, and Lymph Node Metastasis (Breast) detection trained on Camelyon16+17, respectively. In each main folder, each subfolder corresponds to one set of 10-fold cross-validation experiments. For example, the subfolder tcga_kidney_cv_CLAM_50_s1 contains the 10 checkpoints corresponding to the 10 cross-validation folds for TCGA RCC subtyping, trained using CLAM with multiple attention branches and 50% of cases in the full training set.

For reproducibility, these models can be evaluated on data prepared by following the same pipeline described in the sections above, by calling eval.py with the appropriate arguments that specify the model options (either --model_type clam_mb or --model_type mil should be set; for evaluation only, the --subtyping flag does not make a difference), as well as where the model checkpoints (--results_dir and --models_exp_code) and data (--data_root_dir and --task) are stored.

Examples

Please refer to our pre-print and interactive demo for detailed results on three different problems and adaptability across data sources, imaging devices and tissue content.

Visualize additional examples here: http://clam.mahmoodlab.org

Issues

  • Please report all issues on the public forum.

License

© Mahmood Lab - This code is made available under the GPLv3 License and is available for non-commercial academic purposes.

Funding

This work was funded by NIH NIGMS R35GM138216.

Reference

If you find our work useful in your research or if you use parts of this code please consider citing our paper:

Lu, M.Y., Williamson, D.F.K., Chen, T.Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat Biomed Eng 5, 555–570 (2021). https://doi.org/10.1038/s41551-020-00682-w

@article{lu2021data,
  title={Data-efficient and weakly supervised computational pathology on whole-slide images},
  author={Lu, Ming Y and Williamson, Drew FK and Chen, Tiffany Y and Chen, Richard J and Barbieri, Matteo and Mahmood, Faisal},
  journal={Nature Biomedical Engineering},
  volume={5},
  number={6},
  pages={555--570},
  year={2021},
  publisher={Nature Publishing Group}
}

clam's People

Contributors

andrew-weisman, faisalml, fedshyvana, keithcallenberg, richarizardd, scjjb, shubhaminnani


clam's Issues

Division by zero

Hi,
I am trying to run your code, but when I run create_patches.py I get a "division by zero" error on line 201, i.e. seg_times /= total.
I checked the value of total from lines 92-98:
mask = df['process'] == 1
process_stack = df[mask]
total = len(process_stack)
seg_times = 0.
patch_times = 0.
stitch_times = 0.
In the csv file, the value for each entry in the "process" column is 0. Kindly help me with that. Also, the masks being saved in the masks folder are the same as the original images, not binary segmented masks. Kindly help me with that too.
Thank you

Best Regards

Graceful failure

Hi,

When creating patches, I think the h5 file creation failed for one of my svs files (not sure why) and, as a result, the entire program halted. Would it be possible to catch errors gracefully?

Need help in understanding the clustering step

Thanks for such an amazing work!!!

I'm working on a 3-class molecular subtyping task. I've used single-attention-branch CLAM model.

I'm not very clear on the clustering step. For a 3-class subtype, my understanding is that positive/negative class clusters are computed for each of the 3 classes. Am I understanding it correctly?
When training the model, I see print statements only for class 0 & class 1 clustering accuracy. Another point to note is that class 0 clustering accuracy is ~ 100% and class 1 clustering accuracy is ~ 0%

Proper visualization of the heatmap

Hi there, I am trying to test a single kidney WSI from TCGA. I was able to test my data; however, I am facing some issues with the visualization of the heatmap. My attention score and coords shapes are [3, 80560] and (80560, 2), respectively.
If I pass the parameters as below, I obtain an image, but it doesn't make sense compared to what you have shown in your demos.
Any help would be appreciated, thanks.

    abc = WholeSlideImage("./DATA_DIRECTORY/slide_0.svs")
    heatMap = abc.visHeatmap(A_SCORE, coords)

I called visHeatmap just like this,


What coordinates should be handed to visHeatmap?

I was wondering what coordinates have to be used as input to the visHeatmap function defined in the WholeSlideImage class.
Are these the same coordinates that are present in the h5 files? If so, does it matter at which level of the pyramid you sampled from (e.g. level 1 instead of level 0)?

Finally, is it also possible to create fine attention maps from this function?

ModuleNotFoundError: No module named 'topk'

So I followed all the steps, starting with initializing the environment from the provided "clam.yaml" file, but when running the training part I am facing this error. I searched the dependencies in the yaml file but couldn't find any entry for "topk".

error:
"""
Init Model... Traceback (most recent call last):
File "main.py", line 218, in
results = main(args)
File "main.py", line 49, in main
results, test_auc, val_auc, test_acc, val_acc = train(datasets, i, args)
File "/home/abhiraj/DDP/CLAM/CLAM/utils/core_utils.py", line 140, in train
from topk import SmoothTop1SVM
ModuleNotFoundError: No module named 'topk'
"""

please let me know how to resolve this

(also in create_splits_seq.py there seems to be a syntax error in line
if (args.task == tcga_kidney)
it should have been
if (args.task == tcga_kidney_cv)
according to me)

Segmentation Parameters for Small Tissue Content

I tested your code with TCGA-DA-A95X-01Z-00-DX1.F66442E6-F0C9-4528-B000-E47FCAA964FD
(breast histology BRAF mutated svs WSI downloaded from TCGA database)
Your code was not reproducible for this sample, even in the data preprocessing part.

I reported the issue here, many thanks!

Idea behind using top-k loss function

Hi,

Thanks for providing such a wonderful work with Whole Slide Images. I am curious to know about few challenges and the way to tackle it.
My first question is about your intuition behind using a top-k loss function instead of the cross-entropy loss function. My second question: I am solving a similar problem using a feature-based approach, but since WSIs are very large and lesions are small, the feature space seems sparse and, therefore, the model raises lots of false alarms even for negative slides. Can top-k help in better localizing the lesion in a WSI?

I will really appreciate if you can clear my queries.
Thank you.

Format of h5 files

Hi Sam, sorry, I am not totally clear about your question: are you saying you already have the tiles (presumably stored in some standard image format like .jpg or .png) and are wondering how you might generate the h5/pt files used as inputs for training a model?
Max

Originally posted by @fedshyvana in #22 (comment)

My apologies for the delay in answering this; I had not seen your response. Yes, your understanding is correct: I have already generated the tiles for my cohort in JPG format, and I am wondering how to use these as input. This is necessary as the slide images and the annotations are in pretty strange formats at present.

create_splits_seq val_num and test_num

First, thanks for making such nice work public!
More than an issue, I just have a doubt about what exactly val_num and test_num should be in the create_splits_seq file:
are they indices for fixing some of the WSIs as val and test, or are they IDs?
Since they are required, I've just set them as val_num, test_num = (1,2),(3,4), and it seems that the partitions are created OK.
Thanks!

For the two-class case, what is the purpose of the second classifier in CLAM_SB with the same data as input?

Hi Again,

I have one more question. I got confused after going deeper in the script.

The problem statement I have is given slide labels [0 (negative) and 1 (positive)], I want to generate pseudo labels based on attention scores.
Now, the number of classes as per the script is n_classes = 2 for the CAMELYON dataset.
Based on this, CLAM_SB creates two classifier branches [line 92 in model_clam.py]. Following on from this, at line 165 in the same script, the instance classifiers are looped over based on the number of classes, and the instance loss is computed when the instance label equals 1. However, my doubt is why two classifiers are required when the instance loss is calculated only when the instance label is 1.
For example, say I have a positive slide [which contains both lesion and normal tissue].
From the slide-level label I will mark each instance as positive, which is fed into the first of the two classifiers, and then the most attended and least attended tiles are picked for classification.
I don't get why we feed the same data and info into the second classifier.

Am I understanding it wrong?
thanks in advance

How to resolve runtime error in `inst_eval` caused by hardcoding k_sample

def inst_eval(self, A, h, classifier):
    device = h.device
    if len(A.shape) == 1:
        A = A.view(1, -1)
    top_p_ids = torch.topk(A, self.k_sample)[1][-1]
    top_p = torch.index_select(h, dim=0, index=top_p_ids)
    top_n_ids = torch.topk(-A, self.k_sample, dim=1)[1][-1]
    top_n = torch.index_select(h, dim=0, index=top_n_ids)
    p_targets = self.create_positive_targets(self.k_sample, device)
    n_targets = self.create_negative_targets(self.k_sample, device)

    all_targets = torch.cat([p_targets, n_targets], dim=0)
    all_instances = torch.cat([top_p, top_n], dim=0)
    logits = classifier(all_instances)
    all_preds = torch.topk(logits, 1, dim=1)[1].squeeze(1)
    instance_loss = self.instance_loss_fn(logits, all_targets)
    return instance_loss, all_preds, all_targets

In my data, there is a large variance in the number of instances per slide. Hard-coding k_sample for this function does not work:

i) Runtime error when # of instances < k_sample
ii) When 2 * k_sample > # of instances on the current slide, we will label some instances as both positive and negative??

I plan to use dynamic k_sample. For each slide that contains N instances, I will draw N // 2 positive samples and N // 2 negative samples.

Do you see any problem with this naive approach ?

This code relies on sorting order to determine the classification "target".

If I only have 2 instances with very similar scores, I am not sure the naive approach described above would make sense. (In this case, we just create the targets arbitrarily.)
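A minimal sketch of the dynamic k_sample guard described above (a hypothetical helper, not part of the repository):

import torch

def dynamic_k_sample(A: torch.Tensor, k_sample: int) -> int:
    # cap k_sample so the top-k positive and negative picks never overlap on small bags
    n_instances = A.shape[-1]        # A: attention scores of shape (1, N) or (N,)
    return max(1, min(k_sample, n_instances // 2))

# example: a bag with 6 instances and k_sample hard-coded to 8
A = torch.rand(1, 6)
print(dynamic_k_sample(A, k_sample=8))   # -> 3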

Format of h5 files

Many thanks for making this incredible repository. I am hoping to use tiles that I have pre-generated, as we have pathologist-defined tumor annotations. I am slightly unclear on how to convert the tiles into the h5/pt file formats you are using as input. Are you able to provide any advice on this? Kind regards, Sam Kleeman, PhD Student, Cold Spring Harbor Laboratory.

Visualizing attention scores on whole slide image

Hi,
Thanks for your great work. I trained the model on a new dataset.
Now, I want to visualize attention scores on the whole slide image. Is there any functionality available in the code itself for this? Or could you recommend me steps to do this?

I want this for interpretability. Any help would be appreciated.

What are the use of val_num, test_num at dataset.create_splits function?

Hi,
Thank you for publishing your codebase for CLAM.
I'm trying to use your codebase to evaluate results on a TCGA dataset. I was able to complete the patch creation and feature extraction steps, and now I want to train the model. To split the dataset I want to use create_splits_seq, but I don't know which values I should use for val_num and test_num in the dataset.create_splits function. I couldn't find any documentation on it.
Thanks.

Does the package work for non-exclusive multi-class classification ?

Looking at tumor_subtyping_dummy_clean.csv, the subtyping problem seems to have mutually exclusive class labels.

I am trying to apply this package to a problem where a slide can simultaneously belong to multiple classes.

Question1:

Are there any particular factors on the architecture side that would prevent my non-exclusive multi-class classification setup from achieving results as good as the exclusive multi-class classification setup (i.e., the multi-class subtyping investigated in the paper)?

Question2:

In a non-exclusive multi-class classification setup, my label is a vector of shape (1, num_class) containing individual class probabilities.
I will need to replace the CrossEntropyLoss with BCELoss.

Do you see any reason that this change would break the entire architecture ?

How to extract tissue patches as .jpg files

Thanks for sharing this great repository, it works really well! I was wondering if you could provide some guidelines on how to extract and save tissue patches as images instead of h5 files?

Segmentation parameter recommendations for immunohistochemical stained WSI

Hello,
I am working with WSIs of human hippocampal tissue stained with AT8 IHC to detect neurofibrillary tangles. The IHC creates a dark brown stain for positive tissue, while the negative tissue is quite light. When running CLAM on cases with a great deal of positive stain, I am unable to get segmentation of the negatively stained tissue. I have played with combinations of parameters but have not been successful at getting adequate tissue coverage.
Would love any recommendations for better segmentation performance.

Best,
Gabe

Issue creating env from yaml file

Hi,

When I tried creating the env using

conda env create -n clam -f clam.yaml

I get the following error

json.decoder.JSONDecodeError: Unterminated string starting at: line 830302 column 14 (char 25006056)

Here is my full error report

Collecting package metadata (repodata.json): failed

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/conda/core/subdir_data.py", line 263, in _load
        repodata_fn=self.repodata_fn)
      File "/opt/conda/lib/python3.7/site-packages/conda/core/subdir_data.py", line 613, in fetch_repodata_remote_request
        raise Response304ContentUnchanged()
    conda.core.subdir_data.Response304ContentUnchanged

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/conda/exceptions.py", line 1079, in __call__
        return func(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/conda_env/cli/main.py", line 80, in do_call
        exit_code = getattr(module, func_name)(args, parser)
      File "/opt/conda/lib/python3.7/site-packages/conda_env/cli/main_create.py", line 118, in execute
        result[installer_type] = installer.install(prefix, pkg_specs, args, env)
      File "/opt/conda/lib/python3.7/site-packages/conda_env/installers/conda.py", line 32, in install
        prune=getattr(args, 'prune', False), update_modifier=UpdateModifier.FREEZE_INSTALLED)
      File "/opt/conda/lib/python3.7/site-packages/conda/core/solve.py", line 117, in solve_for_transaction
        should_retry_solve)
      File "/opt/conda/lib/python3.7/site-packages/conda/core/solve.py", line 158, in solve_for_diff
        force_remove, should_retry_solve)
      File "/opt/conda/lib/python3.7/site-packages/conda/core/solve.py", line 262, in solve_final_state
        ssc = self._collect_all_metadata(ssc)
      File "/opt/conda/lib/python3.7/site-packages/conda/common/io.py", line 88, in decorated
        return f(*args, **kwds)
      File "/opt/conda/lib/python3.7/site-packages/conda/core/solve.py", line 425, in _collect_all_metadata
        index, r = self._prepare(prepared_specs)
      File "/opt/conda/lib/python3.7/site-packages/conda/core/solve.py", line 1021, in _prepare
        self.subdirs, prepared_specs, self._repodata_fn)
      File "/opt/conda/lib/python3.7/site-packages/conda/core/index.py", line 277, in get_reduced_index
        repodata_fn=repodata_fn)
      File "/opt/conda/lib/python3.7/site-packages/conda/core/subdir_data.py", line 120, in query_all
        result = tuple(concat(executor.map(subdir_query, channel_urls)))
      File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
        yield fs.pop().result()
      File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 428, in result
        return self.__get_result()
      File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
        raise self._exception
      File "/opt/conda/lib/python3.7/concurrent/futures/thread.py", line 57, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/opt/conda/lib/python3.7/site-packages/conda/core/subdir_data.py", line 113, in <lambda>
        package_ref_or_match_spec))
      File "/opt/conda/lib/python3.7/site-packages/conda/core/subdir_data.py", line 125, in query
        self.load()
      File "/opt/conda/lib/python3.7/site-packages/conda/core/subdir_data.py", line 189, in load
        _internal_state = self._load()
      File "/opt/conda/lib/python3.7/site-packages/conda/core/subdir_data.py", line 278, in _load
        mod_etag_headers.get('_mod'))
      File "/opt/conda/lib/python3.7/site-packages/conda/core/subdir_data.py", line 326, in _read_local_repdata
        _internal_state = self._process_raw_repodata_str(raw_repodata_str)
      File "/opt/conda/lib/python3.7/site-packages/conda/core/subdir_data.py", line 364, in _process_raw_repodata_str
        json_obj = json.loads(raw_repodata_str or '{}')
      File "/opt/conda/lib/python3.7/json/__init__.py", line 348, in loads
        return _default_decoder.decode(s)
      File "/opt/conda/lib/python3.7/json/decoder.py", line 337, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/opt/conda/lib/python3.7/json/decoder.py", line 353, in raw_decode
        obj, end = self.scan_once(s, idx)
    json.decoder.JSONDecodeError: Unterminated string starting at: line 830302 column 14 (char 25006056)

`$ /opt/conda/bin/conda-env create -n clam -f clam.yaml`

  environment variables:
            BINARIES_PATH=/opt/deeplearning/binaries
                 CIO_TEST=<not set>
  CONDA_AUTO_UPDATE_CONDA=false
        CONDA_DEFAULT_ENV=base
                CONDA_EXE=/opt/conda/bin/conda
             CONDA_PREFIX=/opt/conda
    CONDA_PROMPT_MODIFIER=(base)
         CONDA_PYTHON_EXE=/opt/conda/bin/python
               CONDA_ROOT=/opt/conda
              CONDA_SHLVL=1
           CURL_CA_BUNDLE=<not set>
              DL_BIN_PATH=/opt/deeplearning/bin
         DL_METADATA_PATH=/opt/deeplearning/metadata
                  DL_PATH=/opt/deeplearning
        ENV_URI_FILE_PATH=/opt/deeplearning/metadata/env_uri
    ENV_VERSION_FILE_PATH=/opt/deeplearning/metadata/env_version
      FRAMEWORK_FILE_PATH=/opt/deeplearning/metadata/framework
        JUPYTER_DEPS_PATH=/opt/deeplearning/jupyter
          LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPT
                          I/lib64
                     PATH=/opt/conda/bin:/usr/local/cuda/bin:/opt/conda/bin:/opt/conda/condabin:
                          /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
       REQUESTS_CA_BUNDLE=<not set>
RESTRICTION_TYPE_FILE_PATH=/opt/deeplearning/restriction
                 SRC_PATH=/opt/deeplearning/src
            SSL_CERT_FILE=<not set>
          TITLE_FILE_PATH=/opt/deeplearning/metadata/title
           TUTORIALS_PATH=/opt/deeplearning/workspace/tutorials
        VERSION_FILE_PATH=/opt/deeplearning/metadata/version
           WORKSPACE_PATH=/opt/deeplearning/workspace

     active environment : base
    active env location : /opt/conda
            shell level : 1
       user config file : /home/Surya/.condarc
 populated config files : /opt/conda/.condarc
          conda version : 4.9.2
    conda-build version : not installed
         python version : 3.7.8.final.0
       virtual packages : __glibc=2.28=0
                          __unix=0=0
                          __archspec=1=x86_64
       base environment : /opt/conda  (writable)
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /opt/conda/pkgs
                          /home/Surya/.conda/pkgs
       envs directories : /opt/conda/envs
                          /home/Surya/.conda/envs
               platform : linux-64
             user-agent : conda/4.9.2 requests/2.25.1 CPython/3.7.8 Linux/4.19.0-11-cloud-amd64 debian/10 glibc/2.28
                UID:GID : 1006:1007
             netrc file : None
           offline mode : False


An unexpected error has occurred. Conda has prepared the above report.

If submitted, this report will be used by core maintainers to improve
future releases of conda.
Would you like conda to send this report to the core maintainers?

I was unable to reproduce this, however, since it worked on a different machine, so it might be a distribution-specific thing. I upgraded conda using conda update conda but still face this issue on my system. Can you provide a few configuration details, such as which conda distribution you used, or a different way to unpack the yaml file?

saving coords without images?

Hello!

Thank you for the great repo!

I was wondering if you recommend a way to save the patches file with just the coords. I am only using the coordinates for my project and the images take up too much memory.

Thanks so much!
Gabe

getting mask contours

Hello!

Thanks again for a fantastic repo, it has proven essential to my work!

I was wondering if there is an easy way (or if you could recommend which lines to alter) for me to save the contour objects created during the masking process. This would be very handy for getting whole tissue area and using the mask for other parts of my project.

Best,
Gabe

Patch file not creating

I tried to run the create_patches.py file using the command from the readme.

python create_patches.py --source dataset_lung --save_dir new --patch_size 256 --seg --patch --stitch

but I always get the following error:

OSError: Unable to open file (unable to open file: name = 'new/patches/TC_W107_4.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

Conda installation issues

Thanks for providing us great open-source code for us to use. I can't wait to try it out!

There are two installation issues I have noted:

  1. Conda spends 5-10 minutes "Solving environment". There may be some issue here (see this message), but I am not too familiar with the conda install process and the channels used to be able to propose a solution, unfortunately.

  2. There is some clash between the setuptools version and openslide-python. If I try to install as is, I get this error. If I set setuptools in clam.yaml to v45.3.0, then the error is resolved and it seems like the requirements are successfully installed...

Training in a multi-label classification setting

Hi,

Thanks for open-sourcing such a great project. I was wondering if the code could be used or easily modified to handle multi-label classification tasks, e.g. in most prostate grading datasets the objective is to predict both the primary and the secondary Gleason score, leading to 2 different labels for each WSI.

Thanks for the info!

Using Feature Based training to detect lesion vs Normal on PANDA Dataset

Hi Mahmood,

I really liked your work; using MIL to fit the WSI in memory was pretty motivating. I am working on the same technique as well. I had one or two questions about it; if you can give suggestions from your experience, it will help me a lot.
-Objective-> Using slide level metadata to train the classifier [#MIL , weakly supervised learning]

-> Setting Data - PANDA (https://www.kaggle.com/c/prostate-cancer-grade-assessment) has 10k slides of different Gleason grading. I am considering negative (normal) grade as 0 and [4+4, 5+5] combined as positive class 1.

-> Now from the previous step, we have a binary problem.
-> Feature Extraction Stage -> I used ResNet50 with the preprocess_input function available in Keras; the features were extracted from conv4_block6_out [?, 32, 32, 1024] with input tiles of shape (512, 512, 3). Tiles are obtained from the WSI, and I used the annotations available with the data to filter out background tiles; only tiles with more than 70% tissue were used.

-> Training Network: I am using a simple attention network to classify the two classes from the features, but the network either predicts 0 for one epoch and 1 for another, with accuracy 56% and 44% respectively. The training data distribution is also the same.
Below is the network:

from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Dropout, multiply
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2
# Mil_Attention and Last_Sigmoid are custom layers from the poster's own codebase;
# input_dim, weight_decay and useGated are defined elsewhere in their script.

data_input = Input(shape=input_dim, dtype='float32', name='input')
conv = Conv2D(filters=512, kernel_size=(1, 1), strides=(2, 2), padding='valid', activation='relu')(data_input)
x = Flatten()(conv)
fc2 = Dense(512, activation='relu', kernel_regularizer=l2(weight_decay), name='fc2')(x)
fc2 = Dropout(0.5)(fc2)
alpha = Mil_Attention(L_dim=128, output_dim=1, kernel_regularizer=l2(weight_decay), name='alpha', use_gated=useGated)(fc2)
x_mul = multiply([alpha, fc2])
out = Last_Sigmoid(output_dim=1, name='FC1_sigmoid')(x_mul)
model = Model(inputs=[data_input], outputs=[out])

Loss Metric:

import tensorflow as tf
from tensorflow.keras import backend as K

class bag_loss(tf.keras.losses.Loss):
    def __init__(self, name='bag_loss'):
        super().__init__(name=name)

    def call(self, y_true, y_pred):
        y_pred = K.mean(y_pred, axis=0, keepdims=False)
        y_pred = tf.squeeze(y_pred, axis=0)
        loss = K.binary_crossentropy(y_true, y_pred)
        loss = tf.keras.backend.cast(loss, dtype=tf.float32)
        return loss

Can you give any suggestions on this?

How can I get my own csv file like the camelyon_clean.csv you put in the folder?

Thanks for the Repository.
I followed the data pipeline to prepare my own dataset and ran into a problem at this step:

CUDA_VISIBLE_DEVICES=0,1 python extract_features.py --data_dir DIR_TO_PATCHES --csv_path CSV_FILE_NAME --feat_dir FEATURES_DIRECTORY --batch_size 512

I am not sure whether CSV_FILE_NAME refers to the process_list_autogen.csv generated by the previous step?

Code adaptations for tuning model

Hi, first, thank you again for sharing this fantastic code! I am tuning the model on my dataset to get better performance and have a few questions:

  1. If we train CLAM with a different B value (e.g., 64) via --B, should we pass the same B when initializing the model for validation?
  2. If we have multiple classes, should we modify the hard-coded class number of instance_loss_fn = SmoothTop1SVM(n_classes = 2) in core_utils.py?
  3. If we enable --no_inst_cluster, what adaptation should we make in core_utils.py and eval.py?

Multi-class slide level prediction

Hi there,

For multi-class MIL, can you point me to the piece of code where the slide-level prediction is made? In your architecture illustration a softmax is applied at the end; however, I failed to find it in your implementation.

I just want to confirm that it is self.classifiers in the model.

Thank you in advance.

Executing the whole pipeline on a new dataset

I was wondering if I can train CLAM on a new dataset. Currently, I have this dataset and I have placed it in the directory. Can you guide me on how to do this? create_patches.py accepts the directory but creates empty folders for patches. Can you guide me on where I am going wrong? The image size is 1360x1024.

I would be grateful for your response.

self.wsi out of scope from patchgen?

Hello, I wanted to say how well the problem and the solution have been formulated in the paper. Specifically, generalization to multi-class MIL is something most of us find hard to crack. This paper will surely have a lot of impact.
On that note, I am trying to run patch extraction on an .mrxs slide. However, when I run the command below:

python create_patches.py --source DATA_DIRECTORY --save_dir RESULTS_DIRECTORY --patch_size 256 --seg --patch --stitch

It starts processing one slide and is able to extract patches, but afterwards I get the error below saying the WSI file was not found:

patches extracted: 706
Bounding Box: 166152 126537 2497 26947
Contour Area: 36043328.0
Traceback (most recent call last):
  File "create_patches.py", line 296, in <module>
    process_list = process_list, auto_skip=args.no_auto_skip)
  File "create_patches.py", line 185, in seg_and_patch
    file_path, patch_time_elapsed = patching(WSI_object = WSI_object, **current_patch_params)
  File "create_patches.py", line 37, in patching
    file_path = WSI_object.createPatches_bag_hdf5(**kwargs, save_coord=True)
  File "/CLAM/wsi_core/WholeSlideImage.py", line 233, in createPatches_bag_hdf5
    for patch in patch_gen:
  File "/CLAM/wsi_core/WholeSlideImage.py", line 291, in _getPatchGenerator
    patch_PIL = self.wsi.read_region((x,y), patch_level, (patch_size, patch_size)).convert('RGB')
  File "/usr/bin/python3.6/site-packages/openslide/__init__.py", line 229, in read_region
    level, size[0], size[1])
  File "/usr/bin/python3.6/site-packages/openslide/lowlevel.py", line 214, in read_region
    _read_region(slide, buf, x, y, level, w, h)
  File "/usr/bin/python3.6/site-packages/openslide/lowlevel.py", line 151, in _check_error
    raise OpenSlideError(err)
openslide.lowlevel.OpenSlideError: Empty input file

For some reason, self.wsi goes out of scope. Any idea what could be the problem?

Why isn't batch size an argument for main.py?

Going through the code, I see that batch size isn't a command-line argument. I later found out that the batch size is hard-coded to 1. This seems like an odd choice. Why can't the batch size be higher? Is there something wrong with using higher batch sizes, or is there instead a challenge in collating slide bags with a variable number of tiles?

Similarly, why is num_workers hard-coded to 4? The optimal value will depend on the workstation.

Skipping files?

I tried to run the following command:

python create_patches.py --source $SOURCE_DIR --save_dir $SAVE_DIR --patch_size 256 --seg --patch --stitch 

It processes the first image fine, but then I get the following error when processing the second image:

progress: 0.00, 0/10616
processing 0005f7aaab2800f6170c399693a96917.tiff
Creating patches for:  0005f7aaab2800f6170c399693a96917 ...
Bounding Box: 3424 3232 5409 21457
Contour Area: 27171456.0
patches extracted: 493
original size: 27648 x 29440
downscaled size for stiching: 432 x 460
number of patches: 493
patch shape: (256, 256, 3)
start stitching 0005f7aaab2800f6170c399693a96917
progress: 0/493 stitched
progress: 50/493 stitched
progress: 100/493 stitched
progress: 150/493 stitched
progress: 200/493 stitched
progress: 250/493 stitched
progress: 300/493 stitched
progress: 350/493 stitched
progress: 400/493 stitched
progress: 450/493 stitched
segmentation took 0.3680613040924072 seconds
patching took 4.518922328948975 seconds
stitching took 0.18634939193725586 seconds


progress: 0.00, 1/10616
processing 000920ad0b612851f8e01bcc880d9b3d.tiff
Creating patches for:  000920ad0b612851f8e01bcc880d9b3d ...

Traceback (most recent call last):
  File "create_patches.py", line 294, in <module>
    process_list = process_list, auto_skip=args.no_auto_skip)
  File "create_patches.py", line 188, in seg_and_patch
    heatmap, stitch_time_elapsed = stitching(file_path, downscale=64)
  File "create_patches.py", line 14, in stitching
    heatmap = StitchPatches(file_path, downscale=downscale, bg_color=(0,0,0), alpha=-1, draw_grid=False)
  File "/home/tmabraham/CLAM/wsi_core/WholeSlideImage.py", line 46, in StitchPatches
    file = h5py.File(hdf5_file_path, 'r')
  File "/home/tmabraham/anaconda3/envs/clam/lib/python3.7/site-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/home/tmabraham/anaconda3/envs/clam/lib/python3.7/site-packages/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (unable to open file: name = 'PANDA/patches/000920ad0b612851f8e01bcc880d9b3d.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

This seems to indicate that it is skipping a file and not creating patches for it. Indeed, when I remove the stitching option, I can see the program skipping through many images. I get terminal output like this:

progress: 0.00, 13/10616
processing 004f6b3a66189b4e88b6a01ba19d7d31.tiff
Creating patches for:  004f6b3a66189b4e88b6a01ba19d7d31 ...
segmentation took 0.22474193572998047 seconds
patching took 3.695487976074219e-05 seconds
stitching took -1 seconds

Whereas for a properly processed image (without stitching option) I get:

progress: 0.00, 6/10616
processing 003046e27c8ead3e3db155780dc5498e.tiff
Creating patches for:  003046e27c8ead3e3db155780dc5498e ...
Bounding Box: 16176 3488 5441 28721
Contour Area: 35853696.0
patches extracted: 638
segmentation took 0.3057291507720947 seconds
patching took 5.923604488372803 seconds
stitching took -1 seconds

Here is an example of the image file in question (viewed with plt.imshow):

[image attachment]

Why is the program skipping images? My understanding is that right now, you use simple binary thresholding, correct? Is it possible that the threshold is not correct for my dataset?
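
For what it's worth, a quick way to check whether the threshold is the culprit is to run a similar binary thresholding on a thumbnail of a skipped slide and compare it with Otsu. The sketch below is only a rough approximation of the segmentation step (HSV saturation, median blur, binary threshold), not the repo's exact code; the file path and threshold value are placeholders:

import cv2
import numpy as np
import openslide

# Rough debugging sketch: compare a fixed binary threshold against Otsu on the
# saturation channel of a thumbnail. Path and threshold are placeholders.
slide = openslide.OpenSlide("PANDA/train_images/004f6b3a66189b4e88b6a01ba19d7d31.tiff")
thumb = np.array(slide.get_thumbnail((1024, 1024)).convert("RGB"))
sat = cv2.medianBlur(cv2.cvtColor(thumb, cv2.COLOR_RGB2HSV)[:, :, 1], 7)

_, fixed_mask = cv2.threshold(sat, 8, 255, cv2.THRESH_BINARY)
_, otsu_mask = cv2.threshold(sat, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# If fixed_mask is nearly empty while otsu_mask is not, the fixed threshold is
# likely unsuitable for this dataset: no tissue contours are found, the slide
# is "patched" in ~0 seconds, and no .h5 file is written, as in the output above.
print("fixed-threshold tissue fraction:", (fixed_mask > 0).mean())
print("otsu tissue fraction:", (otsu_mask > 0).mean())

If the fixed threshold turns out to be the problem, adjusting it (or switching to an Otsu-style threshold) for your dataset should make the contours reappear; check how the segmentation parameters are exposed in your version of the patching script.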

Configuration Files

Hi,
Thank you for making public this implementation of your very nice paper.

I just wanted to let you know that the configuration file clam.yaml does not appear to be present in the docs directory.

Thanks again for your work.
Best,

Attention scores

Thanks for sharing this great repository, it works really well! I was wondering if you could provide some guidelines on how to extract the attention scores from the bag features for a single image?

Jeroen
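
Not an official answer, but a minimal sketch of one way to do this, assuming (a) the per-patch features for the slide were saved as an [N, 1024] tensor by the feature-extraction step and (b) your copy of models/model_clam.py lets forward return only the attention logits via an attention_only flag; please verify both against your version before relying on it. The checkpoint and feature paths are placeholders:

import torch
from models.model_clam import CLAM_SB   # assumes the repo is on your PYTHONPATH

ckpt_path = "results/my_exp_s1/s_0_checkpoint.pt"   # placeholder
feat_path = "features/pt_files/my_slide.pt"         # placeholder

model = CLAM_SB(n_classes=2)
# strict=False hedges against extra keys (e.g. instance-loss buffers) in the checkpoint
model.load_state_dict(torch.load(ckpt_path, map_location="cpu"), strict=False)
model.eval()

features = torch.load(feat_path, map_location="cpu")      # [N, 1024] bag of patch features
with torch.no_grad():
    A = model(features, attention_only=True)              # raw attention logits over patches
    attention_scores = torch.softmax(A, dim=-1).squeeze() # one normalized score per patch

print(attention_scores.shape)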

run eval.py get TypeError

Hi! Thank you for publishing such great work.
I encountered an error while running python eval.py on my own dataset.
Here is the error message:

Traceback (most recent call last):
  File "eval.py", line 134, in <module>
    model, patient_results, test_error, auc, df  = eval(split_dataset, args, ckpt_paths[ckpt_idx])
  File "/home/pch/Documents/hx/CLAM/utils/eval_utils.py", line 53, in eval
    patient_results, test_error, auc, df, _ = summary(model, loader, args)
  File "/home/pch/Documents/hx/CLAM/utils/eval_utils.py", line 70, in summary
    for batch_idx, (data, label) in enumerate(loader):
  File "/home/pch/anaconda3/envs/clam/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/pch/anaconda3/envs/clam/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/pch/anaconda3/envs/clam/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/pch/anaconda3/envs/clam/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/pch/anaconda3/envs/clam/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/pch/Documents/hx/CLAM/utils/utils.py", line 36, in collate_MIL
    img = torch.cat([item[0] for item in batch], dim = 0)
TypeError: expected Tensor as element 0 in argument 0, but got str

Here is the command I used:
CUDA_VISIBLE_DEVICES=1 python eval.py --drop_out --k 10 --models_exp_code tcga_hx_cv_sb_s1 --save_exp_code eval_tcga_hx_cv_sb_s1 --task tcga_hx_cv --model_type clam_sb --results_dir results --data_root_dir dataset/hx_resnet_feature

The directory structure is as follows:

[two screenshots of the directory structure]

I am a little confused about why I am getting a str error. Can you suggest where the problem might be?
Thank you!
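
A hedged guess at the cause, with a small check to run: this error usually means the dataset returned the slide_id string instead of a feature tensor, for example because the feature directory was not resolved or the expected .pt file is missing for that slide. The sketch below only verifies that a feature file exists and loads as a tensor; the pt_files layout is what the feature-extraction script typically produces, and the slide id is a placeholder:

import os
import torch

data_root_dir = "dataset/hx_resnet_feature"   # same value passed to --data_root_dir
slide_id = "SOME_SLIDE_ID"                    # placeholder: take one id from your split csv

pt_path = os.path.join(data_root_dir, "pt_files", slide_id + ".pt")
print("feature file exists:", os.path.isfile(pt_path))

if os.path.isfile(pt_path):
    feats = torch.load(pt_path, map_location="cpu")
    # For a correctly extracted slide this should be a torch.Tensor of shape [N, 1024]
    print(type(feats), getattr(feats, "shape", None))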

Trained CLAM Models

Hi,
Thank you for making public this implementation of your very nice paper.
I want to fine-tune models on my custom dataset. Would you please send me your trained models?

Incomplete masks

Thank you for making such fantastic code publicly available!

When I ran the following command for my dataset,
python create_patches.py --source $SOURCE_DIR --save_dir $SAVE_DIR --patch_size 256 --seg

I got incomplete masks for some slides. Here is an example:

[mask image: TCGA-DD-AAVV-01Z-00-DX1 3D5D6344-3D03-4DCF-A82A-A03ECA4851DC_mask]

Could you please point out what I did wrong and how to fix it? Thank you!

Smooth vs Hard Top K Loss

Hello Team,

Thanks for sharing such great work. I have a query: my problem is binary slide classification [positive, negative]. I used cross-entropy as the instance loss function and am getting good results, but in your work I see you use a TopK loss function.
I would like to ask the reason behind this choice of loss function [smooth or hard]. Also, in your code I believe you use a softmax activation in the final layer, whereas I use a sigmoid [binary problem]. How can I make this switch in my work?
Do I need to replace the sigmoid with a softmax?
Lastly, in the loss function script, svm.py lines 94 to 98, if I understand correctly,
the output from the model has shape [2, 2], where the first axis gives the probability of a tile belonging to class 0 and the second axis the probability of it belonging to class 1, with 0 meaning normal and 1 meaning abnormal. In the smooth case we compute the loss for the abnormal tiles, and otherwise for the normal ones?
Please let me know if my understanding is correct, and also why the axis is switched.

Kind Regards
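
On the sigmoid-vs-softmax part of the question, a generic PyTorch note (independent of the repo's loss code): for a binary problem, two logits with softmax cross-entropy and one logit with sigmoid BCE are mathematically interchangeable, so switching the head does not change what the model can express:

import torch
import torch.nn.functional as F

logits_2class = torch.randn(4, 2)                 # [batch, 2] logits
labels = torch.tensor([0, 1, 1, 0])

ce = F.cross_entropy(logits_2class, labels)       # softmax / 2-class formulation

# Equivalent single-logit formulation: use the difference of the two logits.
logit_1d = logits_2class[:, 1] - logits_2class[:, 0]
bce = F.binary_cross_entropy_with_logits(logit_1d, labels.float())

print(ce.item(), bce.item())                      # identical up to floating-point error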

How to disable mutual exclusivity?

Hi, I am trying to apply CLAM to a multi-class problem. For my data, the label classes are more like cancer grades. Is it possible to remove the mutual exclusivity assumption? Thank you!

What are opportunities for tuning the model?

I have a model trained on a large dataset that obtains a high AUC but only an okay quadratically weighted kappa (~0.82), and I am looking for ways to improve it. Are there any hyperparameters worth tuning? Have you tested this with other datasets? The obvious candidates are the learning rate, dropout, and weight decay, but are there other opportunities for improving performance? For example, has any form of data augmentation or learning-rate schedule been tried with success? Have any parts of the loss function been experimented with? For my problem, I am trying to predict cancer grades that are determined by various heterogeneous patterns, so I am not sure mutual exclusivity necessarily holds; how could it be removed? I have also seen that the number of patches B for instance-level clustering is set to 8. Is this value robust across datasets, or does it need to be tuned?
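
Purely as an illustration of sweeping the knobs mentioned above, here is a sketch that enumerates candidate runs of main.py. The flag names are assumptions based on one version of the script, and the task/exp names are placeholders; check python main.py -h on your copy before launching anything:

import itertools
import subprocess

# Candidate values for learning rate, weight decay (--reg) and the number of
# patches B sampled for instance-level clustering; all flag names assumed.
grid = {
    "--lr": ["1e-4", "2e-4"],
    "--reg": ["1e-5", "1e-4"],
    "--B": ["8", "16"],
}

for values in itertools.product(*grid.values()):
    flags = [tok for pair in zip(grid.keys(), values) for tok in pair]
    exp_code = "sweep_" + "_".join(v.replace("-", "m") for v in values)
    cmd = ["python", "main.py", "--drop_out", "--early_stopping",
           "--task", "my_task", "--model_type", "clam_sb",
           "--exp_code", exp_code] + flags
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)   # uncomment to actually launch each run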

PAIP 2019 datasets

Hi,
Is it possible to use your work on the PAIP 2019 Liver Cancer Segmentation Challenge dataset (https://paip2019.grand-challenge.org/Dataset/)? The challenge is a regional segmentation task: we first need to generate patches and classify whether each patch is viable or whole tumor, based on the viable-tumor ROI provided for each slide in the CSV file, and then we perform the segmentation task. If you think your work can be applied to my dataset, could you kindly explain how and provide more details? I'm new to this task, and your reply and suggestions would be highly appreciated.

best regards
