survpath's Introduction

Code for Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction

Welcome to the official GitHub repository for our CVPR 2024 paper, "Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction". This project was developed by the Mahmood Lab at Harvard Medical School and Brigham and Women's Hospital. The preprint can be accessed here.


Highlights

In our study, we explore the integration of whole-slide images (WSIs) and bulk transcriptomics to enhance patient survival prediction and interpretability. We focus on addressing two key challenges: (1) devising a method for meaningful tokenization of transcriptomics data and (2) capturing dense multimodal interactions between WSIs and transcriptomics. Our proposed model, SurvPath, leverages biological pathway tokens from transcriptomics and histology patch tokens from WSIs, facilitating memory-effective fusion through a multimodal Transformer. SurvPath surpasses unimodal and multimodal baselines across five datasets from The Cancer Genome Atlas, showcasing state-of-the-art performance. Furthermore, our interpretability framework identifies critical multimodal prognostic factors, offering deeper insights into genotype-phenotype interactions and underlying biological mechanisms.

Installation Guide for Linux (using anaconda)

Prerequisites:

  • Linux (Tested on Ubuntu 18.04)
  • NVIDIA GPU (Tested on Nvidia GeForce RTX 2080 Ti x 16) with CUDA 11.0 and cuDNN 7.5
  • Python (3.8.13), h5py (2.10.0), matplotlib (3.6.3), numpy (1.21.6), opencv-python (4.5.1.48), openslide-python (1.2.0), openslide (3.4.1), pandas (1.4.2), pillow (9.0.1), PyTorch (1.6.5), scikit-learn (1.2.1), scipy (1.9.0), torchvision (0.13.1), captum (0.6.0), shap (0.41.0)
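The environment above can be assembled roughly as follows. This is a sketch only, mirroring the version list above; the official repo may ship its own environment file, and the PyTorch pin in particular should be taken from there, so verify versions before relying on this:

```shell
# Hypothetical conda setup mirroring the version list above.
conda create -n survpath python=3.8.13 -y
conda activate survpath

# openslide itself is a C library; install it (e.g. via conda-forge or the
# system package manager) before the Python bindings.
conda install -c conda-forge openslide -y

pip install h5py==2.10.0 matplotlib==3.6.3 numpy==1.21.6 \
    opencv-python==4.5.1.48 openslide-python==1.2.0 pandas==1.4.2 \
    pillow==9.0.1 scikit-learn==1.2.1 scipy==1.9.0 \
    captum==0.6.0 shap==0.41.0
# Install torch and torchvision (0.13.1) separately, matching the
# CUDA 11.0 toolkit noted above.
```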

Downloading TCGA Data and Pathways Compositions

To download diagnostic WSIs (formatted as .svs files), molecular feature data, and other clinical metadata, please refer to the NIH Genomic Data Commons Data Portal and the cBioPortal. WSIs for each cancer type can be downloaded using the GDC Data Transfer Tool. To get the pathway compositions for the 50 Hallmarks, refer to MSigDB. To get the Reactome pathway compositions, refer to PARADIGM.

Processing Whole Slide Images

To process whole-slide images (WSIs), the tissue regions in each biopsy slide are first segmented using Otsu's thresholding on a downsampled WSI via OpenSlide. Non-overlapping 256 x 256 patches are then extracted from the segmented tissue regions at the desired magnification. Next, CTransPath, an SSL-pretrained Swin Transformer, encodes the raw image patches into 768-dim feature vectors, which are saved as one .pt file per WSI. These extracted features (in a .pt file) serve as input to the network. All pre-processing of WSIs is done using the CLAM toolbox.
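As a rough illustration of the tiling step, the non-overlapping patch coordinates can be enumerated over a tissue bounding box. This is a pure-Python sketch for intuition only, not the CLAM toolbox's actual API:

```python
def patch_grid(width, height, patch_size=256):
    """Top-left coordinates of non-overlapping patch_size x patch_size
    tiles that fit entirely inside a (width, height) region."""
    coords = []
    for y in range(0, height - patch_size + 1, patch_size):
        for x in range(0, width - patch_size + 1, patch_size):
            coords.append((x, y))
    return coords

# A 1000 x 600 region yields a 3 x 2 grid of full 256-pixel tiles.
tiles = patch_grid(1000, 600)
print(len(tiles))  # -> 6
```

In the real pipeline, each coordinate would additionally be checked against the tissue segmentation mask before the patch is kept.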

Transcriptomics and Pathway Compositions

We downloaded raw RNA-seq abundance data for the TCGA cohorts from the Xena database and performed normalization in the dataset class. The raw data is included as CSV files in datasets_csv. The Xena database was also used to access disease-specific survival and the associated censorship. Using the Reactome and MSigDB Hallmarks pathway compositions, we selected pathways for which more than 90% of the transcriptomics data was available. The compositions can be found in metadata.
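The pathway-selection criterion can be sketched as follows. This is a toy illustration with made-up pathway and gene names; the variable names are not the repo's:

```python
def select_pathways(pathways, measured_genes, min_coverage=0.9):
    """Keep pathways for which more than `min_coverage` of member
    genes have transcriptomics measurements available."""
    selected = {}
    for name, genes in pathways.items():
        covered = [g for g in genes if g in measured_genes]
        if len(covered) / len(genes) > min_coverage:
            selected[name] = covered
    return selected

pathways = {
    "HALLMARK_A": ["TP53", "EGFR", "MYC", "KRAS"],    # 4/4 measured
    "HALLMARK_B": ["BRCA1", "BRCA2", "ATM", "FAKE"],  # 3/4 measured
}
measured = {"TP53", "EGFR", "MYC", "KRAS", "BRCA1", "BRCA2", "ATM"}
print(list(select_pathways(pathways, measured)))  # -> ['HALLMARK_A']
```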

Training-Validation Splits

To evaluate the algorithm's performance, we partitioned each dataset using 5-fold cross-validation (stratified by the site of histology slide collection). Splits for each cancer type are found in the splits folder, each of which contains splits_{k}.csv for k = 1 to 5. In each splits_{k}.csv, the first column holds the TCGA case IDs used for training, and the second column holds the TCGA case IDs used for validation. Slides from one case are never distributed across the training and validation sets. Alternatively, one can define custom splits, but the files must follow this format. The dataset loader for these train-val splits is defined in the return_splits function of the SurvivalDatasetFactory.
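Reading such a split file can be sketched with the standard library. The two-column layout follows the description above; the header names and demo IDs are assumptions, not taken from the repo:

```python
import csv
import os
import tempfile

def read_split(path):
    """Return (train_ids, val_ids) from a two-column split CSV:
    column 1 = training case IDs, column 2 = validation case IDs.
    The columns may have different lengths, so empty cells are skipped."""
    train, val = [], []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header row
        for row in reader:
            if len(row) > 0 and row[0]:
                train.append(row[0])
            if len(row) > 1 and row[1]:
                val.append(row[1])
    return train, val

# Tiny demonstration with a temporary file.
demo = os.path.join(tempfile.mkdtemp(), "splits_1.csv")
with open(demo, "w") as f:
    f.write("train,val\nTCGA-AA,TCGA-CC\nTCGA-BB,\n")
train, val = read_split(demo)
print(train, val)  # -> ['TCGA-AA', 'TCGA-BB'] ['TCGA-CC']
```

A quick sanity check on any custom split is that no case ID appears in both lists.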

Running Experiments

Refer to the scripts folder for source files to train SurvPath and the baselines presented in the paper, and to the paper itself for the hyperparameters required for training.

Issues

  • The preferred mode of communication is via GitHub issues.
  • If GitHub issues are inappropriate, email [email protected] (and cc [email protected]).
  • Immediate response to minor issues may not be available.

License and Usage

If you find our work useful in your research, please consider citing our paper at:

@article{jaume2023modeling,
  title={Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction},
  author={Jaume, Guillaume and Vaidya, Anurag and Chen, Richard and Williamson, Drew and Liang, Paul and Mahmood, Faisal},
  journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}

Mahmood Lab - This code is made available under the GPLv3 License and is available for non-commercial academic purposes.


survpath's Issues

Some bugs occurred during data loading.

I am grateful to the authors for providing such exceptional work; it has greatly inspired me. While replicating the code, I encountered a bug related to mismatched modal data during data loading (an error screenshot accompanied the original issue).

During the debugging process, I noticed that when utilizing multimodal data, the case_id and slide_id of the current case do not align with the temp index in the omics data.

I found that the issue was caused by the following code in datasets/dataset_survival.py:

        elif self.modality == "coattn":
            patch_features, mask = self._load_wsi_embs_from_path(self.data_dir, slide_ids)
            omic1 = torch.tensor(self.omics_data_dict["rna"][self.omic_names[0]].iloc[idx])
            omic2 = torch.tensor(self.omics_data_dict["rna"][self.omic_names[1]].iloc[idx])
            omic3 = torch.tensor(self.omics_data_dict["rna"][self.omic_names[2]].iloc[idx])
            omic4 = torch.tensor(self.omics_data_dict["rna"][self.omic_names[3]].iloc[idx])
            omic5 = torch.tensor(self.omics_data_dict["rna"][self.omic_names[4]].iloc[idx])
            omic6 = torch.tensor(self.omics_data_dict["rna"][self.omic_names[5]].iloc[idx])

            return (patch_features, omic1, omic2, omic3, omic4, omic5, omic6, label, event_time, c, clinical_data, mask)
        
        elif self.modality == "survpath":
            patch_features, mask = self._load_wsi_embs_from_path(self.data_dir, slide_ids)
            omic_list = []
            for i in range(self.num_pathways):
                omic_list.append(torch.tensor(self.omics_data_dict["rna"][self.omic_names[i]].iloc[idx]))
            
            return (patch_features, omic_list, label, event_time, c, clinical_data, mask)

I modified the code as follows, and it is now running correctly.

        elif self.modality in ["coattn", "coattn_motcat"]:
            patch_features, mask = self._load_wsi_embs_from_path(self.data_dir, slide_ids)
            # Select the omics row by case_id rather than by positional index
            rna = self.omics_data_dict["rna"]
            row = rna[rna["temp_index"] == case_id]
            omic1 = torch.tensor(row[self.omic_names[0]].values[0])
            omic2 = torch.tensor(row[self.omic_names[1]].values[0])
            omic3 = torch.tensor(row[self.omic_names[2]].values[0])
            omic4 = torch.tensor(row[self.omic_names[3]].values[0])
            omic5 = torch.tensor(row[self.omic_names[4]].values[0])
            omic6 = torch.tensor(row[self.omic_names[5]].values[0])

            return (patch_features, omic1, omic2, omic3, omic4, omic5, omic6, label, event_time, c, clinical_data, mask)

        elif self.modality == "survpath":
            patch_features, mask = self._load_wsi_embs_from_path(self.data_dir, slide_ids)
            # Gather one tensor per pathway, again keyed by case_id
            rna = self.omics_data_dict["rna"]
            row = rna[rna["temp_index"] == case_id]
            omic_list = [torch.tensor(row[self.omic_names[i]].values[0]) for i in range(self.num_pathways)]

            return (patch_features, omic_list, label, event_time, c, clinical_data, mask)

I'm not sure whether this is an isolated case, but I feel documenting the situation could help others reproduce this work, so I raised this issue.

Finally, thank you once again for your great work.
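The underlying pitfall in this fix (positional vs. key-based lookup when rows are ordered differently across modalities) can be shown without pandas. The toy data below is illustrative only, not the repo's actual tables:

```python
# Slide table and omics table list the same cases in different orders.
slide_cases = ["case_A", "case_B", "case_C"]
omics_rows = [
    {"temp_index": "case_B", "rna": 0.2},
    {"temp_index": "case_C", "rna": 0.3},
    {"temp_index": "case_A", "rna": 0.1},
]

idx = 0  # loading the first slide, i.e. case_A
case_id = slide_cases[idx]

positional = omics_rows[idx]["rna"]  # wrong: silently grabs case_B's value
by_key = next(r["rna"] for r in omics_rows if r["temp_index"] == case_id)

print(positional, by_key)  # -> 0.2 0.1
```

Positional indexing only works if every table is sorted identically; keying on the case ID is robust to ordering.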

Discussion on the issue of the basis for the selection of 5-fold cross-validation results and the preservation of trained models.

Thanks for sharing the code. I have two questions regarding the experimental setup.

  1. Is the C-index value reported under 5-fold cross-validation the highest C-index achieved in each fold, or the C-index at the point where each fold's loss reaches its minimum?
  2. What criterion do you use to save the model for each fold: maximum C-index or minimum loss?

the folder structure of dataset

Dear authors,

I found your work on transcriptomics, histology, and multimodal fusion for classification tasks to be quite interesting. I would like to know more about the folder structure you used in your experiments. Specifically, I'm interested in understanding how you organized the different data modalities and their corresponding files.

Could you kindly provide some information or details regarding the folder structure you employed in your study? It would greatly help me in better understanding and replicating your experiments.

Number of Parameters

Hi,

Thanks for the interesting work. Could you please share the total number of parameters, or at least an estimate?
I may have misunderstood, but I assume there is a separate SNN block for each pathway. So, if we have 330 pathways, the pathway tokenizer alone would have at least 330 x 256 x 256 parameters, which is about 21 million. I appreciate your clarification.

Thanks
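The commenter's back-of-envelope estimate is easy to reproduce. Note the per-pathway 256-to-256 weight shape is the commenter's assumption, not a figure confirmed by the paper:

```python
num_pathways = 330
in_dim = out_dim = 256  # assumed per-pathway SNN weight shape

params = num_pathways * in_dim * out_dim
print(params)  # -> 21626880, i.e. roughly 21.6 million
```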

Data

The storage location and the expected file type of the WSIs do not seem to be documented in the code. If convenient, could you provide the complete data directory structure? Thank you very much.
