mahmoodlab / hipt

Hierarchical Image Pyramid Transformer - CVPR 2022 (Oral)

License: Other

Jupyter Notebook 98.97% Python 1.03%
computational-pathology cvpr cvpr2022 deep-learning hierarchical-attention-networks high-resolution histopathology pretrained-weights pytorch self-supervised-learning

hipt's Introduction

Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning

Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning, CVPR 2022. [HTML] [arXiv] [Oral]
Richard J. Chen, Chengkuan Chen, Yicong Li, Tiffany Y. Chen, Andrew D. Trister, Rahul G. Krishnan*, Faisal Mahmood*
@inproceedings{chen2022scaling,
    author    = {Chen, Richard J. and Chen, Chengkuan and Li, Yicong and Chen, Tiffany Y. and Trister, Andrew D. and Krishnan, Rahul G. and Mahmood, Faisal},
    title     = {Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {16144-16155}
}
HIPT Illustration
Key Ideas & Main Findings
  1. Hierarchical Image Pyramid Transformer (HIPT) Architecture: Three-stage hierarchical ViT that formulates gigapixel whole-slide images (WSIs) as a disjoint set of nested sequences. HIPT unrolls the WSI into non-overlapping [4096 × 4096] image regions, then unrolls each region into non-overlapping [256 × 256] image patches, and lastly unrolls each patch into non-overlapping [16 × 16] cell tokens. Our method is analogous to hierarchical attention networks in long-document modeling, in which word embeddings within sentences are aggregated to form sentence-level embeddings and subsequently aggregated into document-level embeddings. Inference in HIPT is performed via bottom-up aggregation of [16 × 16] visual tokens in their respective [256 × 256] and [4096 × 4096] windows via Transformer attention to compute a slide-level representation (see the shape sketch after this list).
  2. Learning Context-Aware Token Dependencies in WSIs: Note that Transformer attention is computed only in local windows (instead of across the entire WSI), which makes learning long-range dependencies tractable. Though representation learning for [4096 × 4096] image regions may seem expensive, also note that the patch size at this level is [256 × 256], so the complexity is similar to that of applying ViTs to [256 × 256] image patches with [16 × 16] tokens.
  3. Hierarchical Pretraining: Since encoding [4096 x 4096] images is the same subproblem as encoding [256 x 256] images, we hypothesize that ViT pretraining techniques can generalize to higher resolutions with little modification. DINO is used to pretrain not only ViT-16 in HIPT, but also ViT-256 via [6 x 6] local and [14 x 14] global crops on a 2D grid-of-features (obtained by using ViT-16 as a patch tokenizer for ViT-256).
  4. Self-Supervised Slide-Level Representation Learning: HIPT is evaluated via pretraining + freezing the ViT-16 / ViT-256 stages, with the ViT-4K stage finetuned with slide-level labels, assessed on cancer subtyping and survival prediction tasks in TCGA. We also perform self-supervised KNN evaluation of HIPT embeddings via computing the mean [CLS]-4K tokens extracted from ViT-256, as a proxy for the slide-level embedding. On Renal Cell Carcinoma subtyping, we report that averaged, pretrained HIPT-4K embeddings without any labels perform as well as CLAM-SB.
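A minimal shape walk-through of this bottom-up aggregation is below (an illustrative sketch only: random tensors stand in for the actual ViT encoders, and the dimensions follow the paper and the HIPT_4K code further down).

import torch
from einops import rearrange

region = torch.rand(1, 3, 4096, 4096)                              # one [4096 x 4096] region
patches = region.unfold(2, 256, 256).unfold(3, 256, 256)           # [1, 3, 16, 16, 256, 256]
patches = rearrange(patches, 'b c p1 p2 w h -> (b p1 p2) c w h')   # 256 patches of [3, 256, 256]

cls_256 = torch.rand(256, 384)                                     # stand-in for the 384-dim ViT_256-16 [CLS] token of each patch
grid_256 = rearrange(cls_256, '(p1 p2) d -> 1 d p1 p2', p1=16)     # [1, 384, 16, 16] grid fed to ViT_4K-256
cls_4k = torch.rand(1, 192)                                        # stand-in for the 192-dim region-level [CLS]_4K token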

Updates / TODOs

Please follow this GitHub for more updates.

  • Removing dead code in HIPT_4K library.
  • Better documentation on interpretability code example.
  • Add pretrained models + instructions for hierarchical visualization.
  • Add pre-extracted slide-level embeddings, and code for K-NN evaluation.
  • Add weakly-supervised results for Tensorboard.

Pre-Reqs + Installation

This repository includes not only the code base for HIPT, but also saved HIPT checkpoints and pre-extracted HIPT slide embeddings with ~4.08 GiB of storage, which we version control via Git LFS.

To clone this repository without large files initially:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/mahmoodlab/HIPT.git 	# Pulls just the codebase
git lfs pull --include "*.pth"						# Pulls the pretrained checkpoints
git lfs pull --include "*.pt"						# Pulls pre-extracted slide embeddings
git lfs pull --include "*.pkl"						# Pulls pre-extracted patch embeddings
git lfs pull --include "*.png"						# Pulls demo images (required for 4K x 4K visualization)

To clone all files:

git clone https://github.com/mahmoodlab/HIPT.git

To install Python dependencies:

pip install -r requirements.txt

HIPT Walkthrough

How HIPT Works

Below is a snippet of the standalone two-stage HIPT model architecture, defined in ./HIPT_4K/hipt_4k.py, which can load fully self-supervised weights for nested [16 x 16] and [256 x 256] token aggregation. With a few einops rearrange operations, you can put together multiple ViT encoders and have them scale to large resolutions. HIPT_4K was used for feature extraction of non-overlapping [4096 x 4096] image regions across the TCGA.

import torch
from einops import rearrange, repeat
from HIPT_4K.hipt_model_utils import get_vit256, get_vit4k

class HIPT_4K(torch.nn.Module):
    """
    HIPT Model (ViT_4K-256) for encoding non-square images (with [256 x 256] patch tokens), with 
    [256 x 256] patch tokens encoded via ViT_256-16 using [16 x 16] patch tokens.
    """
    def __init__(self, 
        model256_path: str = 'path/to/Checkpoints/vit256_small_dino.pth',
        model4k_path: str = 'path/to/Checkpoints/vit4k_xs_dino.pth', 
        device256=torch.device('cuda:0'), 
        device4k=torch.device('cuda:1')):

        super().__init__()
        self.model256 = get_vit256(pretrained_weights=model256_path).to(device256)
        self.model4k = get_vit4k(pretrained_weights=model4k_path).to(device4k)
        self.device256 = device256
        self.device4k = device4k
	
    def forward(self, x):
        """
        Forward pass of HIPT (given an image tensor x), outputting the [CLS] token from ViT_4K.
        1. x is center-cropped such that the W / H is divisible by the patch token size in ViT_4K (e.g. - 256 x 256).
        2. x then gets unfolded into a "batch" of [256 x 256] images.
        3. A pretrained ViT_256-16 model extracts the CLS token from each [256 x 256] image in the batch.
        4. These batch-of-features are then reshaped into a 2D feature grid (of width "w_256" and height "h_256".)
        5. This feature grid is then used as the input to ViT_4K-256, outputting [CLS]_4K.

        Args:
          - x (torch.Tensor): [1 x C x W' x H'] image tensor.

        Return:
          - features_cls4k (torch.Tensor): [1 x 192] cls token (d_4k = 192 by default).
        """
        batch_256, w_256, h_256 = self.prepare_img_tensor(x)                    # 1. [1 x 3 x W x H].
        batch_256 = batch_256.unfold(2, 256, 256).unfold(3, 256, 256)           # 2. [1 x 3 x w_256 x h_256 x 256 x 256] 
        batch_256 = rearrange(batch_256, 'b c p1 p2 w h -> (b p1 p2) c w h')    # 2. [B x 3 x 256 x 256], where B = (1*w_256*h_256)


        features_cls256 = []
        for mini_bs in range(0, batch_256.shape[0], 256):                       # 3. B may be too large for ViT_256. We further take minibatches of 256.
            minibatch_256 = batch_256[mini_bs:mini_bs+256].to(self.device256, non_blocking=True)
            features_cls256.append(self.model256(minibatch_256).detach().cpu()) # 3. Extracting ViT_256 features from [256 x 3 x 256 x 256] image batches.

        features_cls256 = torch.vstack(features_cls256)                         # 3. [B x 384], where 384 == dim of ViT-256 [CLS] token.
        features_cls256 = features_cls256.reshape(w_256, h_256, 384).transpose(0,1).transpose(0,2).unsqueeze(dim=0) 
        features_cls256 = features_cls256.to(self.device4k, non_blocking=True)  # 4. [1 x 384 x w_256 x h_256]
        features_cls4k = self.model4k.forward(features_cls256)                  # 5. [1 x 192], where 192 == dim of ViT_4K [CLS] token.
        return features_cls4k

Using the HIPT_4K API

You can use the HIPT_4K model out of the box and plug it into any of your downstream tasks (example below).

from PIL import Image

from HIPT_4K.hipt_4k import HIPT_4K
from HIPT_4K.hipt_model_utils import eval_transforms

model = HIPT_4K()
model.eval()

region = Image.open('HIPT_4K/image_demo/image_4k.png')
x = eval_transforms()(region).unsqueeze(dim=0)
out = model.forward(x)
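The output out is a [1 x 192] region-level [CLS]_4K embedding. To obtain a slide-level representation (e.g., for the K-NN evaluation described above), the [CLS]_4K embeddings of all regions in a WSI can be averaged. A minimal sketch, where the region image paths are placeholders for your own extracted [4096 x 4096] regions:

import torch
from PIL import Image
from HIPT_4K.hipt_4k import HIPT_4K
from HIPT_4K.hipt_model_utils import eval_transforms

model = HIPT_4K()
model.eval()

# Placeholder paths: one image per [4096 x 4096] region extracted from a single WSI.
region_paths = ['region_0.png', 'region_1.png']

with torch.no_grad():
    cls_4k = []
    for p in region_paths:
        x = eval_transforms()(Image.open(p).convert('RGB')).unsqueeze(dim=0)   # [1, 3, H, W]
        cls_4k.append(model(x))                                                # [1, 192] per region

slide_embedding = torch.vstack(cls_4k).mean(dim=0)   # [192]: mean of the region-level [CLS]_4K tokens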

Hierarchical Interpretability

DINO illustration

For hierarchical interpretability, please see the attention visualization notebook, which uses the functions in ./HIPT_4K/hipt_heatmap_utils.py.
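As a rough usage sketch (not the notebook itself; the function name and keyword arguments follow the issue reports further below, so the exact signature may differ), patch-level attention heatmaps can be generated along these lines:

import os
import torch
from PIL import Image
from matplotlib import pyplot as plt
from HIPT_4K.hipt_model_utils import get_vit256
from HIPT_4K.hipt_heatmap_utils import create_patch_heatmaps_indiv

device256 = torch.device('cuda:0')
model256 = get_vit256(pretrained_weights='path/to/Checkpoints/vit256_small_dino.pth').to(device256)

# Crop a [256 x 256] patch out of the demo [4096 x 4096] region.
patch = Image.open('HIPT_4K/image_demo/image_4k.png').crop((0, 0, 256, 256))

# Saves attention heatmaps for the patch to output_dir.
output_dir = 'heatmap_outputs/'
os.makedirs(output_dir, exist_ok=True)
create_patch_heatmaps_indiv(patch=patch, model256=model256,
                            output_dir=output_dir, fname='patch',
                            cmap=plt.get_cmap('coolwarm'), device256=device256)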

Downloading + Preprocessing + Organizing TCGA Data

Using the NIH Genomic Data Commons Data Portal and the cBioPortal, we downloaded diagnostic whole-slide images (WSIs) for 28 cancer types using the GDC Data Transfer Tool. We then used the publicly-available CLAM library for tissue segmentation, tissue patching, and feature extraction, which we modified for extracting both ResNet-50 features (pretrained on ImageNet) and ViT-16 features (pretrained on the TCGA). For patching at [256 × 256] resolution, we used default tissue segmentation parameters. For patching at [4096 × 4096] resolution, we additionally saved each [4096 × 4096] image region, which we used for ViT_256-16 and ViT_4096-256 pretraining (the -16 suffix == using [16 × 16]-sized tokens in a ViT model, the -256 suffix == using [256 × 256]-sized tokens in a ViT model). Extracted TCGA features are organized in the following directories:

Example Directory
TCGA_ROOT_DIR/
    └──tcga_acc/
        ├── ...
    └──tcga_blca/
        ├── ...
    └──tcga_brca/
        └── WSIs/
            ├── slide_1.svs
            ├── slide_2.svs
            └── ...
        └── extracted_mag20x_patch256_fp/
            └── masks/
                ├── slide_1.png
                ├── slide_2.png
                └── ...
            └── patches/
                ├── slide_1.h5
                ├── slide_2.h5
                └── ...
            └── stitches/
                ├── slide_1.png
                ├── slide_2.png
                └── ...
            └── resnet50_trunc_pt_patch_features/
                ├── slide_1.pt
                ├── slide_2.pt
                └── ...
            └── vits_tcga_pancancer_dino_pt_patch_features/
                ├── slide_1.pt
                ├── slide_2.pt
                └── ...
            └── process_list_autogen.csv
        └── extracted_mag20x_patch4096_fp/
            └── masks/
                ├── slide_1.png
                ├── slide_2.png
                └── ...
            └── patches/
                ├── slide_1.h5
                ├── slide_2.h5
                └── ...
            └── stitches/
                ├── slide_1.png
                ├── slide_2.png
                └── ...
            └── tar_patch_4096/
                ├── slide_1.tar
                ├── slide_2.tar
                └── ...
            └── vits_tcga_pancancer_dino_pt_patch_features/
                ├── slide_1.pt
                ├── slide_2.pt
                └── ...
            └── process_list_autogen.csv
    └──tcga_coadread/
        ├── ...
    ...
    └──tcga_ucec/
        ├── ...

Each cancer type is organized as its own folder in TCGA_ROOT_DIR, which contains the subfolders described below. When extracting patches at 20× magnification with non-overlapping patch sizes of 256, we create a results directory called extracted_mag20x_patch256_fp that contains the following files / folders:

Folder Structure
  1. WSIs/: Raw *.svs WSIs for that cancer type
  2. extracted_mag20x_patch256_fp: Extracted features at 20× magnification for [256 × 256] patches (performed only for BRCA, COADREAD, LUAD, LUSC, CCRCC, CHRCC, PRCC, and STAD studies in TCGA). The _fp suffix represents the use of "fast patching" as performed in CLAM, in which coordinates instead of raw patches are saved. This folder contains the following subfolders:
    • masks/: Directory of segmented tissue-containing regions (one image per WSI).
    • patches/: Directory of extracted image patches (one .h5 file per WSI, where each entry corresponds to the coordinates of the top-left corner of a patch).
    • stitches/: Directory of downsampled visualizations of stitched tissue patches, used as a sanity check to inspect whether we patched correctly (one image per WSI).
    • resnet50_trunc_pt_patch_features/: Directory of pre-extracted ResNet-50 features (pretrained on ImageNet) for each patch within each WSI (with patches read via OpenSlide using coordinates in patches/), saved in *.pt format. Each *.pt file is an [M × 1024]-sized Tensor containing the extracted 1024-dim embeddings for the M patches in the WSI.
    • vits_tcga_pancancer_dino_pt_patch_features/: Directory of pre-extracted ViT-16 features (pretrained on TCGA) for each patch within each WSI (with patches read via OpenSlide using coordinates in patches/), saved in *.pt format. Each *.pt file is an [M × 384]-sized Tensor containing the extracted 384-dim embeddings for the M patches in the WSI.
    • process_list_autogen.csv: An auto-generated csv file that contains a list of all slides processed, along with their segmentation/patching parameters used.
  3. extracted_mag20x_patch4096_fp: Extracted features at 20× magnification for [4096 × 4096] image regions, containing the following subfolders:
    • masks/: Same as [256 × 256] setting.
    • patches/: Same as [256 × 256] setting.
    • stitches/: Same as [256 × 256] setting.
    • tar_patch_4096/: Directory of saved [4096 × 4096] image regions for each WSI, stored in a *.tar format using WebDataset API.
    • vits_tcga_pancancer_dino_pt_patch_features/: Directory of pre-extracted ViT-16 features (pretrained on TCGA) for each [4096 × 4096] region within each WSI (with regions read via OpenSlide using coordinates in patches/), saved in *.pt format. Each *.pt file is an [M × 256 × 384]-sized Tensor containing the extracted 384-dim embeddings for the M regions in the WSI, with each region represented as a 256-length sequence of [256 × 256] patch embeddings.
    • process_list_autogen.csv: An auto-generated csv file that contains a list of all slides processed, along with the segmentation/patching parameters used. Note that because a large image resolution is used for patching, not all WSIs are included in the [4096 × 4096] evaluation.

Organizing the folders and subfolders this way for all of the different cancer types (and feature types) makes it easy to run the classification experiments.
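For a quick sanity check of the pre-extracted features, the tensor shapes can be inspected as follows (the paths are just examples following the layout above):

import torch

root = '/path/to/TCGA_ROOT_DIR/tcga_brca/'   # example path following the directory layout above

# [256 x 256] setting: one 384-dim ViT-16 embedding per patch.
feats_256 = torch.load(root + 'extracted_mag20x_patch256_fp/vits_tcga_pancancer_dino_pt_patch_features/slide_1.pt')
print(feats_256.shape)   # torch.Size([M, 384])

# [4096 x 4096] setting: each region is a 256-length sequence of 384-dim patch embeddings.
feats_4k = torch.load(root + 'extracted_mag20x_patch4096_fp/vits_tcga_pancancer_dino_pt_patch_features/slide_1.pt')
print(feats_4k.shape)    # torch.Size([M, 256, 384])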

Hierarchical Pretraining for ViT-16/256 Models + Pretrained Models

Example Directory
TCGA_PRETRAINING_DIR/
  └──patch_256_pretraining/
      ├── patch_1.png
      ├── patch_2.png
      └── ...
  └──region_4096_pretraining/
      ├── slide_1_1.pt
      ├── slide_1_2.pt
      └── ...
  └──ckpts/
      └── pretrain/
          └── vit256_s_dino.pth
          └── vit4k_xs_dino.pth

We set up the following directories for ViT_256-16 and ViT_4K-256 pretraining respectively:

  • .../path/to/patch_256_pretraining/: Directory of raw [256 × 256] patches (as *.png format) extracted from the tar_patch_4096/ subdirectories of each cancer type, used to pretrain ViT_256-16.
  • .../path/to/region_4096_pretraining/: Directory of pre-extracted ViT_4K-256 features for each [4096 × 4096] region across all WSIs (in total: 433779 regions). Each *.pt file is a [256 × 384]-sized Tensor, which is a 256-length sequence of pre-extracted ViT_256-16 features for each [256 × 256] patch. This folder is used to pretrain ViT_4K-256 (a minimal loading sketch follows the pretraining commands below).
  • ./HIPT_4K/Checkpoints/: Directory for holding the pretrained weights, which we use for feature extraction. Our pretraining method largely follows the original DINO framework for conventional [256 × 256] image pretraining using ViT_256-16, which we extend to the [4096 × 4096] setting. Again, note that the -16 suffix refers to using [16 × 16]-sized tokens in a ViT model, and the -256 suffix to using [256 × 256]-sized tokens in a ViT model. The following commands are used for pretraining.
python -m torch.distributed.launch --nproc_per_node=8 main_dino.py --arch vit_small --data_path /path/to/TCGA_PRETRAINING_DIR/patch_256_pretraining/ --output_dir /path/to/TCGA_PRETRAINING_DIR/ckpts/pretrain/ --epochs 100
python -m torch.distributed.launch --nproc_per_node=8 main_dino4k.py --arch vit_xs --data_path /path/to/TCGA_PRETRAINING_DIR/region_4k_pretraining/ --output_dir /path/to/TCGA_PRETRAINING_DIR/ckpts/pretrain/ --epochs 100
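For reference, a minimal sketch of a dataset that serves the pre-extracted [256 × 384] region tensors consumed during ViT_4K-256 pretraining (this is an illustrative assumption, not the loader actually used by main_dino4k.py):

import glob
import os
import torch
from torch.utils.data import Dataset

class Region4KFeatures(Dataset):
    """Serves one pre-extracted [256 x 384] ViT_256-16 feature tensor per [4096 x 4096] region."""
    def __init__(self, root='/path/to/TCGA_PRETRAINING_DIR/region_4096_pretraining/'):
        self.paths = sorted(glob.glob(os.path.join(root, '*.pt')))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return torch.load(self.paths[idx])   # [256, 384] tensor for one region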
SSL Strategy             | ViT SSL | Dataset | Iteration | Batch Size | Arch       | Image Size | Token Size | Dim | Download
Hierarchical Pretraining | DINO    | TCGA    | 400,000   | 256        | ViT-S/16   | 256        | 16         | 384 | Backbone
Hierarchical Pretraining | DINO    | TCGA    | 200,000   | 256        | ViT-XS/256 | 4096       | 256        | 192 | Backbone

Weakly-Supervised Training + Evaluation

Following ViT-16/256 pretraining and pre-extracting instance-level [256 × 256] features using ViT-16, we extend the publicly-available CLAM scaffold code for running 10-fold cross-validation experiments as well as implement several of the current weakly-supervised baselines. Our main method is hipt_lgp (abbreviated for HIPT with Local-Global Pretraining). We make available our saved results directory, evaluation code, and a Jupyter Notebook containing a walkthrough of our method.

Full List of Training Classification Commands
GPU=0
DATAROOT=/path/to/TCGA_ROOT_DIR/
TASK=tcga_brca_subtype
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25 --pretrain_4k vit4k_xs_dino
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25 --pretrain_4k vit4k_xs_dino --freeze_4k
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino --freeze_4k
TASK=tcga_kidney_subtype
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25 --pretrain_4k vit4k_xs_dino
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25 --pretrain_4k vit4k_xs_dino --freeze_4k
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino --freeze_4k
TASK=tcga_lung_subtype
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25 --pretrain_4k vit4k_xs_dino
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25 --pretrain_4k vit4k_xs_dino --freeze_4k
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino --freeze_4k

Analogously, we also use the MCAT scaffold code for survival prediction, and make available our saved results directory / tensorboard logs and evaluation code.

Full List of Training Survival Commands
DATAROOT=/path/to/TCGA_ROOT_DIR/
GPU=0
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_brca --mode pyramid --model_type hipt_lgp --pretrain_4k vit4k_xs_dino --freeze_4k
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_coadread --mode pyramid --model_type hipt_lgp --pretrain_4k vit4k_xs_dino --freeze_4k
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_kirc --mode pyramid --model_type hipt_lgp --pretrain_4k vit4k_xs_dino --freeze_4k
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_kirp --mode pyramid --model_type hipt_lgp --pretrain_4k vit4k_xs_dino --freeze_4k
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_luad --mode pyramid --model_type hipt_lgp --pretrain_4k vit4k_xs_dino --freeze_4k
CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_stad --mode pyramid --model_type hipt_n --pretrain_4k vit4k_xs_dino --freeze_4k

Understanding Baselines, Clarifications, and Future Work

In making the pretrained weights for HIPT fully available, we hope that HIPT can be plugged and played in your experiments, and that you find the same level of improvement :). In building off of this work, we clarify a few details:

  • As slide-level tasks in the TCGA do not have official benchmarks, reported AUC performance may vary with different train-test splits. The results in this work use the following 10-fold CV and 5-fold CV train-test splits, which have been used consistently in prior works. Though the comparisons of MIL architectures are apples-to-apples (all methods use the same pretrained patch-level embeddings), general comparisons with the MIL performance reported in prior works cannot be made, as: 1) different patch-level embeddings are used for training MIL methods (ImageNet ResNet-50 vs. SSL ViT-16), and 2) a number of WSIs were excluded in each cohort due to the lack of tissue content when patching at [4096 x 4096] resolution. To reproduce the results of this paper, you must use the exact train-test splits with the same pretrained embedding type.
  • Despite averaged ViT_4K-256 embeddings performing well in KNN evaluation, averaged ViT_256-16 embeddings did not perform as well as mean ResNet-50 (transferred from ImageNet) embeddings on some of the downstream tasks. Since Hierarchical Pretraining of ViT_4K-256 depends on pre-extracted ViT_256-16 embeddings, there is (of course) considerable room to boost unsupervised and weakly-supervised slide-level performance by refining the ViT_256-16 encoder. A minimal K-NN probe sketch follows this list.
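For reference, a minimal K-NN probe over slide-level embeddings might look like the sketch below (using scikit-learn rather than the repository's knn_classifier; the embeddings and labels are random placeholders that you would replace with mean [CLS]_4K slide embeddings and subtype labels):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: swap in [N, 192] mean [CLS]_4K slide embeddings and [N] subtype labels.
slide_embeddings = np.random.rand(100, 192)
labels = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    slide_embeddings, labels, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X_train, y_train)
print('K-NN accuracy:', knn.score(X_test, y_test))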

Issues

Acknowledgements, License & Usage

  • We thank Felix Yu, Ming Y. Lu, Chunyuan Li, and the BioML group at Microsoft Research New England for their insightful feedback.
  • Code for Weakly-Supervised Subtyping + Survival Classification was largely adapted from CLAM and MCAT.
  • Code for Hierarchical Pretraining was largely adapted by making modifications to DINO.
  • Code for self-supervised evaluation was built on our previous NeurIPS workshop paper.
  • If you found our work useful in your research, please consider citing our work(s):
@article{chen2022self,
    author    = {Chen, Richard J and Krishnan, Rahul G},
    title     = {Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology},
    journal   = {Learning Meaningful Representations of Life, NeurIPS 2021},
    year      = {2021},
}

@inproceedings{chen2022scaling,
    author    = {Chen, Richard J. and Chen, Chengkuan and Li, Yicong and Chen, Tiffany Y. and Trister, Andrew D. and Krishnan, Rahul G. and Mahmood, Faisal},
    title     = {Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {16144-16155}
}

Any work that cites HIPT should also cite the original Vision Transformer and DINO.

© This code is made available under the Commons Clause License and is available for non-commercial academic purposes.

hipt's People

Contributors

arkkienkeli, faisalml, richarizardd


hipt's Issues

Some Questions

Hi @Richarizardd,

I would be glad if you can answer some of my questions below:

  1. Before running main_dino4k.py, I notice you saved all [256-Length x 384-Dim] Tensors for the input, which correspond to extracted ViT-16 features for each 4K x 4K patch. May I know in which part of the code you did that? Do I need to extract patches of 4K x 4K in jpg or png format to get the input tensors?

  2. Should I change the equation below if a batch size of 64 can't be used? How about the learning rate?

args.lr * (args.batch_size_per_gpu * utils.get_world_size()) / 256.

  3. Is HIPT_LGP_FC your main model?

Thanks for your time and kindness!

Recommended GPUs

What are recommended/required GPUs for running the experiments?
Especially, curious about the GPU memory requirements.

Thank you for this awesome work! @Richarizardd

VIT-4096-WSI

What would y'all recommend to use for VIT4096-WSI? I ended up just using an adjusted wrapper of the VisionTransformer4K class, but I feel as if this isn't the best solution.

Training procedure for custom dataset?

Hi great work and repo :)

I wanted to try this on a private dataset of WSIs at my lab. Do you have instructions for doing this? I can't find a clear path. It seems like the 3 stages must be trained independently?

Stochastic behavior when extracting the features for one image

Hi,
Really great work!

I have been extracting features at the first two levels (256 and 4k) for one whole-slide image. Running it four times (to see if it was reproducible), I noticed that I got four different sets of features. Setting torch.backends.cudnn.deterministic=True solved the issue; however, I still do not get why we are having differences, as the weights should be fixed.

The line that is creating the stochastic behavior is the following:
from line 43 in HIPT_4K/hipt_4k.py: self.model256 = get_vit256(pretrained_weights=model256_path).to(device256)
then line 54 in HIPT_4K/hipt_model_utils.py: model256 = vits.__dict__[arch](patch_size=16, num_classes=0) (which is calling vit_small)
then line 165 in: HIPT_4K/vision_transformer.py: self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

This last line is where I found the stochastic behavior. Why is that?
Thanks!

Requirements

Hi, can you open-source the requirements file, i.e., the list of packages that need to be installed?

What do "n" and "h" mean?

ViT4096-256(n = 4, h = 3, d = 192)
ViTWSI-4096(n = 2, h = 3, d = 192)

Hi, what do "n" and "h" mean?

Some questions

Hi, thank you for a great paper.
I would like to ask some questions.

1/ About the training code for Slide-Level Classification.
Are you guys still updating the code? I can't find the file 2-Weakly-Supervised-Train-Val/Model%20Walkthrough.ipynb, or main.py from the training commands.
I only find the training in this eval_linear.py. Is it the training code for Slide-Level Classification?

2/ You set num_classes=0 in this code, so I guess you train the $ViT_{WSI}-4096$ + an additional classification layer together at the fine-tuning stage. But I can't find the details of the classification layer for MIL in the paper or the supplementary.
You only mentioned in the Fine-tuning part of the paper, that the $ViT_{WSI}-4096$ is finetuned but no information about the classification layer.
In addition, in this code, you only train one Linear layer for the classification layer from the feature.
I'm confused. I thought you train the $ViT_{WSI}-4096$ + classification layer.
Can you give me more details about the fine-tuning stage?

3/ Also, is main_dino256.py in the command the same as 1-Hierarchical-Pretraining/main_dino.py?

4/ In the Supplementary, Table 4:
In the 4th case, is it a typo?

  • If you used only the $ViT-16_{PF}$, $ViT-256_{PF}$ then an AP-4096 should be at the red line.

  • If it is not a typo, then how can you get that result with only the 2 $ViT-16_{PF}$, $ViT-256_{PF}$?
    Furthermore, it would mean that using only the 2 pretrained $ViT-16_{PF}$, $ViT-256_{PF}$ from DINO gives the best result (the same as the best result in Table 1), and thus that $ViT_{WSI}-4096$ is not important, right?

Thank you for supporting me.

ModuleNotFoundError: No module named 'nn_encoder_arch'

Thanks for your excellent work!
When I ran the code in ./3-Self-Supervised-Eval/patch_extraction.py, it raised the error ModuleNotFoundError: No module named 'nn_encoder_arch'. I cannot find nn_encoder_arch; what is it? Thank you.

Model Architectures

from nn_encoder_arch.vision_transformer import vit_small
from nn_encoder_arch.resnet_trunc import resnet50_trunc_baseline

git lfs pull

Thanks for your excellent work!
When I run the git lfs pull, the error is reported as follows:
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/mahmoodlab/HIPT.git/info/lfs'

Can you upload the large files(images, checkpoint) to google drive or other cloud disks?

WSI preprocessing

Hi, Thanks for sharing this inspiring work :)

Could you please share the steps and criteria in the preprocessing in this project? For example, the preset parameters used for CLAM preprocessing, and patch selection criteria (e.g. to exclude artefacts and patches without enough tissue, etc.). Or other preprocessing steps such as stain normalisation.

Many thanks!

average number of [4096,4096] regions per slide

Hi,

I'm trying to reproduce your results for the TCGA BRCA dataset. I've thus narrowed down the dataset to the 875 breast slides you use to train & evaluate your models (based on the .csv files you provided here).

As a first step, I ran CLAM segmentation & patching pipeline on this dataset. I used the preset parameters that the group provided for TCGA BRCA (can be found here). When computing the average number of regions per slide, I get avg_M ~ 212.

The paper states that the average number of [4096,4096] regions per slide (avg_M) is around 38 when computed over the 10,678 FFPE slides from 33 cancer types in TCGA. I thought "maybe TCGA BRCA slides are much bigger than the other cancer types', hence why such a big difference when computing avg_M".

To assess whether or not that assumption was true, I downloaded the pre-extracted "region-level" feature embeddings you kindly provide under 3-Self-Supervised-Eval/embeddings_slide_lib/embeddings_slide_lib/vit256mean_tcga_slide_embeddings. From these I can easily tell how many [4096,4096] regions you found for each of the 875 TCGA BRCA slides. After computing the average over these slides, I get avg_M ~ 30.

I was thus wondering if you had an idea of where such a big difference (212 vs. 30) may come from. Do you remember using tcga preset parameters when running CLAM or did you use custom parameters?

Thank you!

name 'get_patch_attention_scores' is not defined

Hi,

Where is the 'get_patch_attention_scores' function?

Here is the error when performing the 256 x 256 Demo (Saving Attention Maps Individually):


NameError Traceback (most recent call last)
/tmp/ipykernel_3452983/1904309063.py in
4 create_patch_heatmaps_indiv(patch=patch, model256=model256,
5 output_dir=output_dir, fname='patch',
----> 6 cmap=light_jet, device256=device256)

/data1/partitionA/CUHKSZ/histopath_2022/codes/histopathology_pretraining/HIPT/HIPT_4K/hipt_heatmap_utils.py in create_patch_heatmaps_indiv(patch, model256, output_dir, fname, threshold, offset, alpha, cmap, device256)
174 patch1 = patch.copy()
175 patch2 = add_margin(patch.crop((16,16,256,256)), top=0, left=0, bottom=16, right=16, color=(255,255,255))
--> 176 b256_1, a256_1 = get_patch_attention_scores(patch1, model256, device256=device256)
177 b256_1, a256_2 = get_patch_attention_scores(patch2, model256, device256=device256)
178 save_region = np.array(patch.copy())

NameError: name 'get_patch_attention_scores' is not defined

Batch-wise extract features

Hello,

I'm trying to extract and save features using the HIPT/hipt_4k.py code. However, it seems to work only with a batch size of 1.
Do you have any suggestions or tips for modifying the code to allow for batch-wise processing?

Thank you!

img_size argument

Thank you for your work and for sharing the code.
I've dived into the repo as I'm trying to run HIPT on a custom WSI dataset to see how it compares to other methods.

Following what's described in the paper, ViT-4096 is supposed to encode a [4096,4096] region into an embedding.
This happens by leveraging ViT-256 to encode each [256,256] patch in that region, then working on a sequence of [CLS] tokens.

When looking at the HIPT_4K class, you use get_vit4k, which returns:

model4k = vits4k.__dict__[arch](num_classes=0)

with arch='vit4k_xs', this is basically a call to:

def vit4k_xs(patch_size=16, **kwargs):
    model = VisionTransformer4K(
        patch_size=patch_size, input_embed_dim=384, output_embed_dim=192,
        depth=6, num_heads=6, mlp_ratio=4, 
        qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    return model

I was wondering if you could shed light on the following:

  1. VisionTransformer & PatchEmbed img_size argument defaults to [224]: not sure why it doesn't default to [256]
  2. VisionTransformer4K img_size argument defaults to [224] : not sure why it doesn't default to [4096]?

point 1. also impacts the shape of some self-supervised pre-trained weights (output of Conv2d w/ kernel_size=16 & stride=16 on a 224 × 224 img gives a 14 × 14 map, while on a 256 × 256 img it gives a 16 × 16 map)

Thanks in advance.

All categories were classified into the same category

I was using a TCGA dataset for a grading task. However, the results showed that all categories were classified into the same category in stage 3, i.e., the final fine-tuning WSI stage. The training losses had converged in the 256 and 4096 stages.

Have you ever encountered this problem, and how did you solve it?
If not, could you tell me what I should do next?

Thanks!

'Weakly-Supervised Training' task - Issue with fast_cluster_ids.pkl file

Hello,
Thank you for the great work and the released code. I am trying to perform the training of the ViT-WSI to get a slide-level classification on a customized dataset. Even though I have structured the TCGA_ROOT_DIR folder as shown in the README and adapted the code in the main.py file to my use-case by adding the following lines

elif args.task == 'tcga_h_subtype':
    args.n_classes = 2

    dataset = Generic_MIL_Dataset(csv_path = './dataset_csv/tcga_h_subset.csv',
                                  data_dir = os.path.join(args.data_root_dir, study_dir),
                                  mode = args.mode,
                                  shuffle = False,
                                  seed = args.seed,
                                  print_info = True,
                                  label_col = 'oncotree_code',
                                  label_dict = {0:0, 1:1},
                                  patient_strat = False,
                                  prop = args.prop)

when running the command

CUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino --freeze_4k

I get the following error:

File "HIPT/2-Weakly-Supervised-Subtyping/datasets/dataset_generic.py", line 411, in __init__
    with open(os.path.join(cluster_dir, 'fast_cluster_ids.pkl'), 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: 'data_dir/extracted_mag20x_patch4096_fp/fast_cluster_ids.pkl'

Which script should I run to produce this file? What does it represent? I cannot find any indication in the README.

Moreover, I noticed that in lines 123 and 125 of the main.py file, the args.pretrain_WSI argument is used but it is not specified in the arg parser.

Thanks!!

Datasets

Thank you for the great work and for sharing the code.

I have noticed that the datasets used in this article (NSCLC and RCC) have been used in other articles by the authors. I have a question: is there any overlap between the pretraining data and these datasets? If possible, could the authors provide the manifest files of the respective TCGA datasets? (I ask because I download TCGA datasets through manifest files, so I record the ID of each WSI slide, and I want to use these two datasets as a task, so I want to match the data you use.) If possible, I can email you, and then you can tell me by email. Please forgive my poor language ability.

Some issue about create patch.

Thanks for your intuitive work on WSI classification. Could you please share some details about the feature extraction? Here are some of my questions on processing tcga_lung.

  1. To extract features from 256*256 patches for the MIL baseline, we set patch_size=256 in create_patches.py of CLAM to generate patches (python create_patches_fp.py --source DATA_DIRECTORY --save_dir RESULTS_DIRECTORY --patch_size 256 --preset bwh_biopsy.csv --seg --patch --stitch). Subsequently, the model is changed to the pre-trained ViT-16 in extract_features_fp.py of CLAM to extract features. Is this correct?
  2. To extract features from 4096*4096 patches for the HIPT baseline, we set patch_size=4096 in create_patches.py of CLAM. Then the model is changed to hipt_4k in extract_features_fp.py of CLAM to extract features (the pre-trained ViT-16 and ViT-4K are both utilized in the hipt_4k model). Is this correct?
  3. If instruction 2 is correct, I wonder how to extract features quickly with hipt_4k, because its default batch_size is set to 1. Did you change hipt_4k for a bigger batch size or use several servers to extract features? (If I process almost 900 slides on two Nvidia 3090s with batch_size = 1, almost 80 days will be consumed.)

Split indices of precomputed embeddings for reproducibility & comparison

Dear Team,

Excellent work - thanks for providing precomputed embeddings!

For reproducibility, benchmarking and comparison, could you please share the indices of the train-val-test split in the patch embeddings under "embeddings_patch_lib/" - e.g. in "bcss_val_vits_tcga_brca_dino.pkl"?

The file contains the embeddings and labels, but there seems to be no way to map them back to which case id / slide the embeddings came from, so it's difficult to use them to benchmark... Looking at patch_extraction_utils.py, the indices in the pickles seem to have something to do with "BCSS/40x/patches/summary.csv"; but that file is not provided either; so we can't reproduce the train-val-test split...

Thanks in advance!

Number of epochs required for finetuning

Hi, my goal is to fine-tune a DINO 1st-stage model from a checkpoint folder (vit256_small_dino.pth) using the Camelyon16 dataset. For an optimal solution, how many epochs are needed?

Extracting features from 4096 x 4096 patches (M x L x D)

Hello @Richarizardd,

Could you please guide me on how to extract vits_tcga_pancancer_dino_pt_patch_features for every 4096 x 4096 patch and save them in the extracted_mag20x_patch4096_fp folder?

Also, I was wondering if there is a feature extraction pipeline available for this task, or do we need to extract features for each 4096 x 4096 patch using HIPT_4k api and then vstack all of them? If you have a code for this, it would be great if you could share it.

Thank you!

Process of training subtyping task on custom WSIs dataset using provided pretrained model

Hi, thank you for your wonderful work and sharing the codes.

We hope to use the provided feature extractor on our own dataset (i.e., skip the self-supervised learning process).

However, we are not clear about how to organize the folders for weakly-supervised training on our own dataset.

We'd like to ask you how to construct a folder like this by modifying CLAM's create_patches_fp.py and extract_features_fp.py:

Custom_Dataset_Features_Folder/
    ├── h5_files
            ├── slide_1.h5
            ├── slide_2.h5
            └── ...
    └── pt_files
            ├── slide_1.pt
            ├── slide_2.pt
            └── ...

Or, if we are not on the right track, can you provide the code or details for the weakly-supervised training setup?

Thank you!

Inquiry about Dataset

Hi Richard,

Could you please let me know how to download the same datasets as yours? Additionally, I tried git cloning your repository, but I couldn't unzip all zip folders in HIPT/2-Weakly-Supervised-Subtyping/dataset_csv. What is the procedure for unzipping it? Thank you

Issue with Attention Visualization

Hello, thank you for the work that you published, very exciting!
I am trying to play with the code and took the code from the Attention Visualization notebook.
I get the following error when I am trying to run create_hierarchical_heatmaps_indiv

Traceback (most recent call last):
  File "hipt4kinference_attentionvisualization.py", line 28, in <module>
    create_hierarchical_heatmaps_indiv(region, model256, model4k,
  File "/storage01/nikitam/HIPT/HIPT_4K/hipt_heatmap_utils.py", line 388, in create_hierarchical_heatmaps_indiv
    score256_1 = concat_scores256(a256_1[:,i,:,:], size=(s//16,)*2)
TypeError: concat_scores256() missing 2 required positional arguments: 'w_256' and 'h_256'

Thank you!

Unable to find *.pt files for region_4096_pretraining

Am I right in expecting the patch-level feature *.pt files (433779 files, each containing a 256 x 384 tensor) used for pretraining the second stage of HIPT to be present in the HIPT/3-Self-Supervised-Eval/embeddings_patch_lib/ directory?

Currently, I only see the following pickle files in that directory.

25M     bcss_train_resnet50_trunc.pkl
9.3M    bcss_train_vits_tcga_brca_dino.pkl
4.5M    bcss_val_resnet50_tcga_brca_simclr.pkl
2.3M    bcss_val_resnet50_trunc.pkl
868K    bcss_val_vits_tcga_brca_dino.pkl
19M     breastpathq_train_resnet50_tcga_brca_simclr.pkl
9.4M    breastpathq_train_resnet50_trunc.pkl
3.6M    breastpathq_train_vits_tcga_brca_dino.pkl
1.5M    breastpathq_val_resnet50_tcga_brca_simclr.pkl
744K    breastpathq_val_resnet50_trunc.pkl
280K    breastpathq_val_vits_tcga_brca_dino.pkl
783M    crc100knonorm_train_resnet50_tcga_brca_simclr.pkl
393M    crc100knonorm_train_resnet50_trunc.pkl
149M    crc100knonorm_train_vits_tcga_brca_dino.pkl
57M     crc100knonorm_val_resnet50_tcga_brca_simclr.pkl
29M     crc100knonorm_val_resnet50_trunc.pkl
11M     crc100knonorm_val_vits_tcga_brca_dino.pkl
783M    crc100k_train_resnet50_tcga_brca_simclr.pkl
393M    crc100k_train_resnet50_trunc.pkl
149M    crc100k_train_vits_tcga_brca_dino.pkl
57M     crc100k_val_resnet50_tcga_brca_simclr.pkl
29M     crc100k_val_resnet50_trunc.pkl
11M     crc100k_val_vits_tcga_brca_dino.pkl

Thanks in advance.

Worse Performance in CAMELYON16 only

Hi @Richarizardd, I have a question concerning the worse performance I got on the CAMELYON16 dataset only. Here are the results of my experiment after following all of the settings and using the pretrained models provided:

  • CAMELYON16
    1 fold: Train: 242 WSIs, Val: 28 WSIs, Test: 129 WSIs
    Mean Test AUC across 10 folds: 0.709 ± 0.024
    Mean Test ACC across 10 folds: 0.764 ± 0.021

  • UCEC
    1 fold: Train: 668 WSIs, Val: 58 WSIs, Test: 238 WSIs
    Mean Test AUC across 10 folds: 0.991 ± 0.004
    Mean Test ACC across 10 folds: 0.958 ± 0.012

  • our own dataset (leica colon)
    1 fold: Train: 469 WSIs, Val: 49 WSIs, Test: 223 WSIs
    Mean Test AUC across 10 folds: 0.977 ± 0.04
    Mean Test ACC across 10 folds: 0.941 ± 0.009

Are you perhaps able to explain why the mean test AUC and ACC on CAMELYON16 aren't that good? Could it be that the pretraining dataset is very different from the training dataset? Is it because there aren't many training slides? In fact, I trained CLAM with the same dataset and distribution for CAMELYON16, and it achieved AUC and ACC of around 85. It would be greatly appreciated if you could share your insight. Thank you~

Labels for cancer subtyping

Hi, first of all thank you for the amazing work and codebase. I see that the data folds for the subtype classification task are provided in HIPT/2-Weakly-Supervised-Subtyping/splits/10foldcv_subtype.

However in the .csv files, the labels are not provided. I tried retrieving the diagnoses for all the .svs in the brca cohort using the gdc api and these are the different diagnoses:

'Adenoid cystic carcinoma', 'Apocrine adenocarcinoma', 'Basal cell carcinoma, NOS', 'Carcinoma, NOS', 'Cribriform carcinoma, NOS', 'Infiltrating duct and lobular carcinoma', 'Infiltrating duct carcinoma, NOS', 'Infiltrating duct mixed with other types of carcinoma', 'Infiltrating lobular mixed with other types of carcinoma', 'Intraductal micropapillary carcinoma', 'Intraductal papillary adenocarcinoma with invasion', 'Large cell neuroendocrine carcinoma', 'Lobular carcinoma, NOS', 'Medullary carcinoma, NOS', 'Metaplastic carcinoma, NOS', 'Mucinous adenocarcinoma', 'Paget disease and infiltrating duct carcinoma of breast', 'Papillary carcinoma, NOS', 'Phyllodes tumor, malignant', 'Pleomorphic carcinoma', 'Secretory carcinoma of breast', 'Tubular adenocarcinoma'

I would like to ask how you derived the ILC and IDC labels that you use, or if you could provide the labels for the .svs files.

I am asking particularly about the brca cohort since it is the one I started inspecting, but probably a similar question would arise from the other cohorts, and their labels would be useful as well (:

Creating patches and extracting features for [4096 x 4096]

@Richarizardd @faisalml - I appreciate your intuitive work. I have been using CLAM for quite some time, but I have encountered an obstacle as follows:

[Preface] - I use an in-house dataset, and CLAM works fine. I recently read your paper and was curious to generate the hierarchical attention maps for the custom dataset. I have the splits and features for [256 x 256] patches, but how do I connect the existing [256 x 256] to the newly extracted [4096 x 4096] features? I have read the open and closed issues. However, I am not finding a lucid explanation.

Consider a WSI with ~20000 [256 x 256] patches, and I have Resnet50 features already extracted and stored on my disk using CLAM's scripts. @Richarizardd has mentioned that I have to change [256 x 256] to [4096 x 4096] while creating patches and extracting the features. In doing this, is the hierarchy still preserved? For example, if I extract a [4096 x 4096] patch hp1, how do I correlate it with the existing [256 x 256] patches in my directory? Is it using the [x,y] coordinates? Is the trajectory of my understanding of the pre-processing reasonable? Am I missing something?

In addition to this, where do I find ViT-16 features pretrained on TCGA (ref)? Is it from

"from vision_transformer import vit_small\n",

Do I use this instead of resnet_custom in the feature extraction

Or is it from

features_cls256 = []

Please correct me if I am wrong @Richarizardd @faisalml. Thank you.

knn_classifier

def knn_classifier(train_features, train_labels, test_features, test_labels, k, T, num_classes=2): the function is given a num_classes parameter of 1000; when this value is changed to 2 and the binary classification problem is run, an error is reported: RuntimeError: start (0) + length (5) exceeds dimension size (2). How should I use KNN for a binary classification problem?

Some questions

Thank you for your open-source code, but I have some questions: where is the get_patch_attention_scores function? I can't find it.

.csv metadata file for colorectal dataset

hi, thanks for the interesting paper! I am trying to replicate the results for survival prediction but cannot find the .csv file (not splits, the dataset csv itself) for colon and rectal cancer, in either the HIPT codebase or the MCAT codebase. Could you please let me know where I can find this?

tar_patch_4096 with webdataset API

Hello,
Thank you for your great work.

I am trying to recreate the folder structure for weakly supervised learning.

I am trying to create the tar_patch_4096 folder.

The description is:

Directory of saved [4096 × 4096] image regions for each WSI, stored in a *.tar format using WebDataset API.

I am not sure how to proceed here, Do I take the .h5 patches and turn them into .tar files?

Thank you,

Juan

the reasoning behind attention heatmaps code

Hi, I thoroughly went through the attention heatmap generation code and there is one thing I have trouble understanding.
I'd love to hear your take on this as it would allow me to get the part of the picture I'm missing.

To keep it simple, let's focus on the create_patch_heatmaps_indiv function.

patch2 = add_margin(patch.crop((16,16,256,256)), top=0, left=0, bottom=16, right=16, color=(255,255,255))

In the line above, you're taking the (240, 240) bottom crop of the input patch, then pasting it in the top-left corner of a white (256, 256) image. Then, you retrieve the attention scores for the original input patch, as well as for patch2.
Eventually, you combine both attention scores in the following lines:

new_score256_2 = np.zeros_like(score256_2)
new_score256_2[offset_2:s, offset_2:s] = score256_2[:(s-offset_2), :(s-offset_2)]
overlay256 = np.ones_like(score256_2)*100
overlay256[offset_2:s, offset_2:s] += 100
score256 = (score256_1+new_score256_2)/overlay256

Here all you do is restrict the attention scores from scores256_2 to those corresponding to the tissue crop in patch2.
Then, you sum scores256_1 and new_scores256_2, making sure to divide the portion where scores256_1 and scores256_2 overlap by a twice-as-large weight (200), because they represent the same tissue crop.

I drew a summary of what is happening:

attention_patch_summary
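In toy numpy terms (the shapes and offset are made up purely for illustration), the blend above is essentially:

import numpy as np

s, offset = 8, 2
score256_1 = np.random.rand(s, s) * 100          # attention scores for the original patch
score256_2 = np.random.rand(s, s) * 100          # attention scores for the shifted patch

# Shift scores256_2 back so its valid region aligns with the original patch.
new_score256_2 = np.zeros_like(score256_2)
new_score256_2[offset:s, offset:s] = score256_2[:(s - offset), :(s - offset)]

# Weight of 100 where only one map contributes, 200 where both overlap.
overlay256 = np.ones_like(score256_2) * 100
overlay256[offset:s, offset:s] += 100

score256 = (score256_1 + new_score256_2) / overlay256   # blended attention map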

My question then boils down to: what is the reasoning behind blending a crop and not simply computing scores256 via:

_, a256 = get_patch_attention_scores(patch, model256, device256=device256)
score256 = get_scores256(a256[:,i,:,:], size=(s,)*2)
score256 = score256 / 100

Thanks!

Question regarding Evaluation

Hello Richard,
I have a few questions regarding your paper. I hope you can answer them.

  1. Is it possible to use your implementation for slides with fewer than 4096 x 4096 pixels? I am very curious about this because if you use biopsy slides with tissue gathered by fine needles, you usually have slides with a high background ratio and a tissue area smaller than 4096 x 4096 pixels.
  2. Why are you not incorporating CLAM-SB into your survival prediction comparison, since it seems to be the second-best classification model in your slide-level classification task? Also, are you using multiple WSIs per patient for survival prediction or just one WSI?

I hope you can understand my questions, and I am looking forward to a reply!

Thanks, Fabian

cannot load pretrained model weights

Hi! I am trying to load pretrained dino weights found in HIPT/HIPT_4K/Checkpoints using the following code:

pretrained_weights4k = 'vit256_small_dino.pth'
state_dict = torch.load(pretrained_weights4k)['teacher']

but I am getting an error: UnpicklingError: invalid load key, 'v'.

Also, running the notebook HIPT_4K Inference + Attention Visualization, I have checked that when get_vit256 or get_vit4k are called, os.path.isfile(pretrained_weights) returns False inside these functions, so the code doesn't load the model weights.

Can you please help with this issue?

Thank you!

ModuleNotFoundError: No module named 'vision_transformer'

Hello, even after installing the dependencies (pip install -r requirements.txt), I got the message "ModuleNotFoundError: No module named 'vision_transformer'" while executing "from HIPT_4K.hipt_4k import HIPT_4K". If you can tell me where I went wrong or how to import this "vision_transformer" module, I would appreciate it. Thanks.

subtyping: training loss not really decreasing

Hi, I'm trying to replicate the subtyping results you report on TCGA BRCA as a sanity check before applying HIPT to a different dataset. Hence, I'm using the same slides, same splits & same labels as given in the repo. For now, I've stuck to training & evaluating on fold_0.

I'm having trouble when training a model: my training loss barely goes down (see the picture below: the loss plateaus after epoch 6, with training AUC around 0.50).

[W&B chart of the training loss, 09_11_2022 20:44:57]

After diving deeply into the code, there are a few things I'd love your help understanding:

In HIPT_LGP_FC you set self.local_vit = vit4k_xs() ; based on the following lines, it means self.local_vit is an instance of VisionTransformer4K with patch_size = 16

def vit4k_xs(patch_size=16, **kwargs):
    model = VisionTransformer4K(
        patch_size=patch_size, input_embed_dim=384, output_embed_dim=192,
        depth=6, num_heads=6, mlp_ratio=4,
        qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    return model

Then, looking at the VisionTransformer4K class, the default img_size argument is [224].
Combined with patch_size = 16, this means that num_patches = 196 (line 170), which is used on line 174 to instantiate self.pos_embed:
class VisionTransformer4K(nn.Module):
    """ Vision Transformer 4K """
    def __init__(self, num_classes=0, img_size=[224], input_embed_dim=384, output_embed_dim=192,
                 depth=12, num_heads=12, mlp_ratio=4., qkv_bias=False, qk_scale=None,
                 drop_rate=0., attn_drop_rate=0., drop_path_rate=0., norm_layer=nn.LayerNorm, num_prototypes=64, **kwargs):
        super().__init__()
        embed_dim = output_embed_dim
        self.num_features = self.embed_dim = embed_dim
        self.phi = nn.Sequential(*[nn.Linear(input_embed_dim, output_embed_dim), nn.GELU(), nn.Dropout(p=drop_rate)])
        num_patches = int(img_size[0] // 16)**2
        print("# of Patches:", num_patches)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

Hence, if we feed HIPT_LGP_FC a tensor of shape [M, 256, 384] as done in the model walkthrough notebook, the interpolate_pos_encoding method gets called at some point during the forward pass. Given x.shape = [M, 257, 384] and pos_embed.shape = [1, 197, 192], we have npatch = 256 and N = 196: the condition npatch == N on line 204 is False, so we need to interpolate the positional embedding:
def interpolate_pos_encoding(self, x, w, h):
    npatch = x.shape[1] - 1
    N = self.pos_embed.shape[1] - 1
    if npatch == N and w == h:
        return self.pos_embed
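To make the npatch / N mismatch concrete, here is the arithmetic only (a sketch, using the values quoted above):

# default img_size=[224] together with the hard-coded 16:
N = (224 // 16) ** 2        # = 196 grid positions, so pos_embed stores 196 + 1 = 197 tokens
# feeding [M, 256, 384] regions means a 16 x 16 grid of patch tokens per region:
npatch = 16 * 16            # = 256
# npatch != N, so interpolate_pos_encoding resizes the 14 x 14 positional grid to 16 x 16 on every forward pass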

  1. Why is the patch_size argument passed when instantiating VisionTransformer4K actually not used in VisionTransformer4K.__init__()? Instead, a hard-coded value of 16 is used (line 170, see below):

    num_patches = int(img_size[0] // 16)**2

  2. Why is the img_size argument passed when instantiating VisionTransformer4K left at its default (i.e. img_size = [224]) rather than set to [256]? I get that during self-supervised pretraining you use crops of size [224, 224], but during subtyping we're using the full [256, 256] patch grid, so I guess we should use img_size = [256], shouldn't we? Doing so, the previously discussed condition npatch == N would become True, so we would no longer need to interpolate the positional embedding (see the sketch after this list).

  3. Given that we pass a tensor of shape [M, 256, 384] to HIPT_LGP_FC, which gets reshaped to [M, 384, 16, 16] before being passed to HIPT_LGP_FC.local_vit, the following line gives B = M.

    B, embed_dim, w, h = x.shape

    Then, the following line defines cls_tokens as a tensor of shape [M, 1, 192]. Isn't there a confusion between B (supposed to be the batch size) and M (the number of [4096, 4096] regions per slide)? Shouldn't the cls_tokens tensor be of shape [batch_size, 1, 192]?
    # add the [CLS] token to the embed patch tokens
    cls_tokens = self.cls_token.expand(B, -1, -1)

  4. I've also tried training only the global aggregation layers by directly feeding the region-level pre-extracted features (of shape [M, 192]), without success (training loss not really decreasing either). Could you confirm that this should work just as well as training the intermediate transformer + the global aggregation layers on the [M, 256, 384] features?
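As a follow-up to point 2, a sketch of the instantiation being suggested there (hypothetical; it assumes the vit4k_xs definition quoted above is in scope):

# With img_size=[256], num_patches = (256 // 16) ** 2 = 256, which matches the
# [M, 256, 384] input, so interpolate_pos_encoding would return self.pos_embed unchanged.
model = vit4k_xs(img_size=[256])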

Thanks!

ModuleNotFoundError: No module named 'utils.gpu_utils'

Hello, I am trying to reproduce the results in the 'Weakly-Supervised Training + Evaluation' section; however, when I run the code, I get this error. There is no file named gpu_utils under HIPT/2-Weakly-Supervised-Subtyping/utils. I also checked the CLAM and MCAT GitHub repos but could not find the file in their utils folders.

Confusion regarding the number of epochs the HIPT model was pre-trained for

Thank you for the great work and for sharing the code.

The paper mentions that the model was trained for 400K iterations with a batch size of 256, which amounts to 102,400,000 patches and seems to be about the same as the size of the dataset used for pretraining. So it seems like the model was trained for just 1 epoch, but the training script in the README

python -m torch.distributed.launch --nproc_per_node=8 main_dino.py --arch vit_small --data_path /path/to/TCGA_PRETRAINING_DIR/patch_256_pretraining/ --output_dir /path/to/TCGA_PRETRAINING_DIR/ckpts/pretrain/ --epochs 100

seems to suggest that it was pretrained for 100 epochs. Could you please clarify this detail? Thanks in advance.
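Spelling out the arithmetic behind the question (the dataset size here is the issue's own assumption, not a figure taken from the paper):

iterations = 400_000
batch_size = 256
patches_seen = iterations * batch_size            # 102,400,000 patches
dataset_size = 102_400_000                        # assumed "about the same as the size of the dataset"
epochs_equivalent = patches_seen / dataset_size   # ~1 epoch, vs. --epochs 100 in the README command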

Pretrained ViTWSI-4096 model

Thanks for sharing this excellent work! The method is both amazing and elegant.

I wonder if there is a pretrained ViTWSI-4096 (n = 2, h = 3, d = 192) that aggregates the [CLS]-4096 tokens and generates a slide-level representation.

Reshaping ViT256-16 output to be consistent with ViT4096-256 and the following ViT-WSI

Hello,

I have two questions regarding your implementation for using it with MIL.

1.) My first question is about connecting ViT256-16, ViT4096-256, and ViT-WSI to build a complete network.

As you state in the model walkthrough notebook for weakly supervised subtyping, the input shape with pre-extracted ViT256-16 tokens must be:

Input: $[M \times L \times D]$ Tensor, where:

  • M: Number of (non-overlapping) $[4096 \times 4096]$ Image regions in a WSI (On Average: 38)
  • L: Number of (non-overlapping) $[256 \times 256]$ Image Patches in a $[4096 \times 4096]$ Image Region (Default: 256)
  • D: Embedding Dimension (Default: 384)

Now I am a little confused about the output shape of ViT256-16. With the given implementation, the output shape of ViT256-16 is $[(M \cdot L) \times D]$. My first thought for matching the input shape expected by ViT4096-256 is simply to use x = x.reshape(M, L, D), or equivalently x = einops.rearrange(x, '(M L) D -> M L D', M=M, L=L), for a given tensor of shape [M*L, D], but I am not sure whether this correctly preserves the spatial dimensions and axis order?
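A tiny self-contained check of that reshape (a sketch with toy sizes; it assumes the ViT256-16 features are stored region by region, with each region's patch tokens contiguous and in row-major order, since the model later reshapes [M, 256, 384] into a [M, 384, 16, 16] grid):

import torch
import einops

M, L, D = 4, 16, 8                                      # toy sizes for illustration
x_flat = torch.arange(M * L * D, dtype=torch.float32).reshape(M * L, D)

x_reshape = x_flat.reshape(M, L, D)
x_rearrange = einops.rearrange(x_flat, '(M L) D -> M L D', M=M, L=L)

assert torch.equal(x_reshape, x_rearrange)              # the two calls are equivalent
assert torch.equal(x_reshape[2, 5], x_flat[2 * L + 5])  # patch 5 of region 2: the region index varies slowest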

2.) How do you handle multiple WSIs per patient in your framework?

Do you simply use the patient label as the label for every WSI? I have not found anything about merging multiple WSIs for one patient in your code (e.g. the train loop in core utils simply iterates over WSIs with their labels).

Using the Features in CLAM

Hi @Richarizardd ,
Thanks for the great work.
I am trying the subtyping part of the repository. I want to ask whether the slide features with dimension N × 192 (where N is the number of patches) provided here https://github.com/mahmoodlab/HIPT/tree/master/3-Self-Supervised-Eval/embeddings_slide_lib/embeddings_slide_lib/vit256mean_tcga_slide_embeddings

Q1: Can they be used directly in the CLAM training pipeline provided in the subtyping part?

Q2: If we want to use the hipt_lgp or hipt_n models from the repo, would we need to feed them N × 384 features instead? I am assuming these are not provided in the repository?

Q3: Extracting patches at 4096 should give us 192-dimensional features, and extracting patches at 256 should give us 384-dimensional features?
We extract features for 4096 regions with ViT-4K and for 256-pixel patches with ViT-256, but when I tried to extract features with ViT-256, the feature size was still N × 192. Does anything need to be changed when extracting ViT-256 features?
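For reference, a toy shape summary consistent with the vit4k_xs definition quoted in the subtyping issue above (input_embed_dim=384, output_embed_dim=192); the tensors here are random placeholders, not extracted features:

import torch

n_patches_256, n_regions_4k = 1000, 38           # hypothetical counts for one slide
feats_vit256 = torch.randn(n_patches_256, 384)   # one 384-d [CLS] token per [256 x 256] patch (ViT-256)
feats_vit4k = torch.randn(n_regions_4k, 192)     # one 192-d token per [4096 x 4096] region (ViT-4K projects 384-d inputs to 192-d)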

Thanks!

Survival Code

Hi,
Thanks for sharing the code.
It seems that hipt_lgp is not implemented in the MCAT repo yet, but this repo says we should use the command below, which I assume refers to the MCAT repo (please correct me if I'm wrong):

python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_brca --mode pyramid --model_type hipt_lgp --pretrain_4k vit4k_xs_dino --freeze_4k

I wonder if you can provide more info on how to use the MCAT code with the hipt_lgp backbone for survival prediction, especially when no omics data are available in the dataset? Or could you share the main file and the other files required for this task?

Thanks.

Pretrained-WSI

Thank you for your great work on the HIPT model!

In HIPT_LGP_FC, it seems that the code for loading a pretrained WSI-level model is just pass. Is there a way to load pretrained WSI-level weights? Also, is the pretrained model for this stage included in the repository?
