wl-zhao / vpd

[ICCV 2023] VPD is a framework that leverages the high-level and low-level knowledge of a pre-trained text-to-image diffusion model for downstream visual perception tasks.

Home Page: https://vpd.ivg-research.xyz

License: MIT License

Python 36.26% Shell 0.36% Makefile 0.02% Jupyter Notebook 63.36%

vpd's Introduction

VPD


Created by Wenliang Zhao*, Yongming Rao*, Zuyan Liu*, Benlin Liu, Jie Zhou, Jiwen Lu

This repository contains PyTorch implementation for paper "Unleashing Text-to-Image Diffusion Models for Visual Perception" (ICCV 2023).

VPD (Visual Perception with Pre-trained Diffusion Models) is a framework that leverages the high-level and low-level knowledge of a pre-trained text-to-image diffusion model for downstream visual perception tasks.

[intro figure]

[Project Page] [arXiv]

Installation

Clone this repo, and run

git submodule init
git submodule update

Download the checkpoint of stable-diffusion (we use v1-5 by default) and put it in the checkpoints folder. Please also follow the instructions in stable-diffusion to install the required packages.
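
As a quick sanity check that the downloaded checkpoint is readable before wiring it in, something like the following can be used. This is only a sketch; the filename is an assumption, so use whichever checkpoint file you downloaded.

# Hedged sanity check: confirm the Stable Diffusion checkpoint loads and exposes a
# state_dict. The filename is an assumption; adjust it to the file you downloaded.
import torch

ckpt = torch.load("checkpoints/v1-5-pruned-emaonly.ckpt", map_location="cpu")
state = ckpt.get("state_dict", ckpt)
print(f"{len(state)} entries in the checkpoint's state_dict")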

Semantic Segmentation with VPD

Equipped with a lightweight Semantic FPN and trained for 80K iterations on $512\times512$ crops, our VPD can achieve 54.6 mIoU on ADE20K.

Please check segmentation.md for detailed instructions.

Referring Image Segmentation with VPD

VPD achieves 73.46, 63.93, and 63.12 oIoU on the validation sets of RefCOCO, RefCOCO+, and G-Ref, respectively.

Dataset P@0.5 P@0.6 P@0.7 P@0.8 P@0.9 oIoU Mean IoU
RefCOCO 85.52 83.02 78.45 68.53 36.31 73.46 75.67
RefCOCO+ 76.69 73.93 69.68 60.98 32.52 63.93 67.98
RefCOCOg 75.16 71.16 65.60 55.04 29.41 63.12 66.42

Please check refer.md for detailed instructions on training and inference.

Depth Estimation with VPD

VPD obtains 0.254 RMSE on the NYUv2 depth estimation benchmark, establishing a new state of the art.

Method RMSE d1 d2 d3 REL log_10
VPD 0.254 0.964 0.995 0.999 0.069 0.030

Please check depth.md for detailed instructions on training and inference.

License

MIT License

Acknowledgements

This code is based on stable-diffusion, mmsegmentation, LAVT, and MIM-Depth-Estimation.

Citation

If you find our work useful in your research, please consider citing:

@article{zhao2023unleashing,
  title={Unleashing Text-to-Image Diffusion Models for Visual Perception},
  author={Zhao, Wenliang and Rao, Yongming and Liu, Zuyan and Liu, Benlin and Zhou, Jie and Lu, Jiwen},
  journal={ICCV},
  year={2023}
}

vpd's People

Contributors

junyi42, liuzuyan, raoyongming, wl-zhao

vpd's Issues

Is using the test file name for inference a fair practice?

Isn't it wrong to use the name of the test image file during inference? Suppose I named the files img1.png, img2.png, and so on; then the code would not work. Also, you cannot run inference on images whose class_id you don't know, or that don't fall into one of the class_ids.

Reproduce the performance of ADE20K

I am struggling to reproduce the reported ADE20K performance at 8K iterations (I only get 45.64 mIoU). Do you have any suggestions? This is my config:

lr_config = dict(policy='poly', power=0.9, min_lr=1e-6, by_epoch=False,
                 warmup='linear',
                 warmup_iters=150,
                 warmup_ratio=1e-6)

optimizer = dict(type='AdamW', lr=0.001, weight_decay=0.0001,
                 paramwise_cfg=dict(custom_keys={'unet': dict(lr_mult=0.1),
                                                 'encoder_vq': dict(lr_mult=0.0),
                                                 'text_encoder': dict(lr_mult=0.0),
                                                 'norm': dict(decay_mult=0.)}))

Class embedding

Hi,
thanks for the interesting work! Could you please also share the code for creating the class embeddings, so that I can try it on my own dataset?
More specifically, the text embedding for one prompt has a shape of [77, 768], but in class_embedding.pt each class only has an embedding of shape [1, 768]. Did you average over the 77 tokens?
Thanks a lot!
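
For reference, a minimal sketch of one way to produce per-class [1, 768] embeddings by mean-pooling the 77 CLIP token embeddings. Whether VPD actually pools this way, and which prompt template it uses, are assumptions, not something confirmed by the repo.

# Hedged sketch: per-class text embeddings via mean pooling of CLIP token embeddings.
# The pooling strategy and the prompt template are assumptions, not VPD's confirmed recipe.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

class_names = ["wall", "building", "sky"]  # replace with your dataset's classes
embeddings = []
with torch.no_grad():
    for name in class_names:
        tokens = tokenizer(f"a photo of a {name}", padding="max_length",
                           max_length=77, return_tensors="pt")
        out = text_encoder(**tokens).last_hidden_state   # [1, 77, 768]
        embeddings.append(out.mean(dim=1))                # [1, 768]

torch.save(torch.cat(embeddings), "class_embeddings.pth")  # [num_classes, 768]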

Inconsistent Training images number for NYUv2 in the preprint

Hi, thanks for your work.

Please correct me if I am wrong, but the number of training images for the NYUv2 dataset is given as 24k in your preprint. However, the original NYUv2 dataset, as well as the NYUv2 dataset prepared using your instructions, has only 795 images. Do you use any extra training data, or is it just a typo in the paper? Thanks

[screenshot of the preprint]

An error in dump_nyu_text_embeddings file

There is an error in the file VPD/depth/dump_nyu_text_embeddings at line 14:
from clip import FrozenCLIPEmbedderContext
There is no corresponding class FrozenCLIPEmbedderContext in https://github.com/openai/CLIP.git.
The stable-diffusion repo uses FrozenCLIPTextEmbedder to implement the text encoder.
So I want to know whether there is an error in this code. Thank you very much.

About class_embeddings

Hi, thanks for open-sourcing VPD. I want to ask how to obtain the "class_embeddings.pth" for a new dataset. Do I have to run CLIP inference on that dataset every time I want to train VPD on a new dataset?

How do you handle the info loss from VAE encoding?

Using internal features from a latent diffusion model means the width/height of what you get is only 1/8 of the original image. I'm trying to implement my own feature-extraction pipeline and have run into this information-loss issue, which makes my segmentation results a total mess. I have also read the code in this repo but haven't figured out how you handle it, which is why I'm asking here. Thanks in advance!
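
For reference, the usual remedy in dense prediction is to let the task head upsample the 1/8-resolution features or logits back to the input size. The sketch below is a generic pattern under that assumption, not necessarily VPD's exact mechanism.

# Hedged sketch: upsample coarse (1/8-resolution) logits back to the input size.
# A generic dense-prediction pattern, not necessarily how VPD resolves the issue.
import torch
import torch.nn.functional as F

def upsample_logits(logits, image_size):
    # logits: [B, num_classes, H/8, W/8] predicted on top of diffusion features
    return F.interpolate(logits, size=image_size, mode="bilinear", align_corners=False)

coarse = torch.randn(1, 150, 64, 64)        # e.g. a 512x512 input gives a 64x64 latent grid
full = upsample_logits(coarse, (512, 512))  # [1, 150, 512, 512]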

class embedding

Hi!
Thanks for the great work.
In the original stable diffusion there is no pooling of the text embedding, so the output is [20, 77, 768] instead of [20, 768]. So I am wondering: what is the purpose of using pretrained weights if you change the prompt embeddings to a pooled version? As I understand it, that is not how the original backbone was trained.

How can I get the depth metrics reported in this paper?

Hi. Thanks for sharing this code and good paper.

Recently I have been trying to get depth estimation metrics using the pretrained model (vpd_depth_480x480.pth),
but when I evaluated the pretrained model, these error messages came out:

Unexpected Keys: ['model_ema.decay', 'model_ema.num_updates']
Traceback (most recent call last):
  File "test.py", line 173, in <module>
    main()
  File "test.py", line 47, in main
    model.load_state_dict(model_weight, strict=False)
  File "/home/jslee/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for VPDDepth:
    size mismatch for decoder.deconv_layers.3.weight: copying a param with shape torch.Size([32, 32, 2, 2]) from checkpoint, the shape in current model is torch.Size([32, 32, 3, 3]).
    size mismatch for decoder.deconv_layers.6.weight: copying a param with shape torch.Size([32, 32, 2, 2]) from checkpoint, the shape in current model is torch.Size([32, 32, 4, 4]).

What am I doing wrong? Why don't the pretrained weights match the model built for the NYU input images?

What does A32/A64 mean?

When using VPD for semantic segmentation, does A32/A64 mean that only the size-32/64 attention maps are used? In the source code, attn16, attn32, and attn64 are all used; how can I reproduce the results of VPD_A32 and VPD_A64?

mm versions for segmentation

Hi thanks for sharing the code for your project.

I see in a previous issue you said you were using mmcv==1.7.1 and mmsegmentation==1.0.0, but mmsegmentation==1.0.0 is the new version of the mm library and doesn't have the modules imported in segmentation/train.py.
For example, the modules Config and DictAction have moved into the mmengine.config module.

Is it possible that you are using a different version of the libraries, or perhaps the train.py script is outdated?

Can you release the training log for segmentation?

Hello, I have been reading your paper recently and find your ideas very interesting. However, I ultimately couldn't fully reproduce the results of the segmentation.

I am using this configuration: https://github.com/wl-zhao/VPD/blob/main/segmentation/configs/fpn_vpd_sd1-5_512x512_gpu8x2.py
The checkpoint is also v1-5-pruned-emaonly.ckpt. However, my final mIoU is 53.06, which is a little lower than the released result (53.7). Could you therefore release the training log for segmentation? I believe it would greatly help me reproduce the results.

Thank you.

Depth training code

Just wondering if you were planning on sharing the depth training scripts -- thanks!

NYU test set predictions

Hi,
Thanks for sharing your work! It's an exciting result.
Could you please share the predictions on NYU-Depth v2 in raw 16-bit PNG format for easy comparison? Thanks

Struggling to setup VPD

Hi, I'm having trouble running train.py (I want to run it without the train.sh file). I get the following error:
ModuleNotFoundError: No module named 'vpd.models'
Do you have any idea what could be causing this error?
Also, just to be sure, is this the correct directory structure?
-/VPD
-----/checkpoints
---------v1-5-pruned.ckpt
-----/depth
-----/segmentation
-----etc...

Loss calculation error during training

Hi @wl-zhao , I am following all the instructions mentioned here for training - https://github.com/wl-zhao/VPD/blob/main/depth/README.md
When I try to run the training script, I get an error, specifically in the loss function, https://github.com/wl-zhao/VPD/blob/main/depth/utils_depth/criterion.py#L17

IndexError: The shape of the mask [3, 480, 480] at index 1 does not match the shape of the indexed tensor [3, 1, 480, 480] at index 1
The error makes sense since pred.shape is [3, 1, 480, 480] & gt.shape is [3, 480, 480].

I can fix it by squeezing dim=1 instead, at https://github.com/wl-zhao/VPD/blob/main/depth/train.py#L228.

However, I want to confirm whether this is the right change or whether I am missing something. The only config change I have made is updating num_workers for the dataloader, which should not impact anything.
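
For reference, a minimal sketch of the squeeze workaround described above, with stand-in tensors; this is illustrative only, not an official fix.

# Hedged sketch: squeeze the singleton channel so the prediction matches the
# ground truth's [B, H, W] shape before computing the masked loss.
import torch

pred = torch.rand(3, 1, 480, 480)   # stand-in for the model output
gt = torch.rand(3, 480, 480)        # stand-in for the ground-truth depth

pred = pred.squeeze(dim=1)          # [3, 480, 480]
mask = gt > 0                       # valid-pixel mask, analogous to the repo's criterion
loss = torch.abs(pred[mask] - gt[mask]).mean()  # placeholder loss for illustration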

Image encoder clarification

Hi,
thank you for sharing this interesting work.
Could you please share which encoder is used to encode the images into the latent space?
In the paper you mention using a frozen VQGAN encoder. However, in the v1-inference.yaml configuration file used to set up stable diffusion, a KL autoencoder is used as the first-stage model.
thanks!

Failed to reproduce the accuracy on ADE20k

Hello, thanks for your excellent work! I am currently training a segmentation model on ADE20k using the configuration you provided. However, the final mIoU I achieve is 53.1, which is lower than the reported 53.7. I have included my log. Is there any advice you could give on how to improve my training process? Thank you.

Blurred depth maps

Hi,

I tried visualizing the depth map on one of the test images. But I am getting some blurring in the depth map:

[depth map screenshot]

On the other hand models like ZoeDepth produce a sharper depth map:

[ZoeDepth depth map]

Why might this be happening?

Looking forward to your reply.
Thanks

Cross-Attention maps

How can I visualize the cross-attention maps for text prompts as in Fig. 1(b)?

Hi, I would like to ask how to visualize the prompts for a given picture using VPD. Could you please help me with it?

Training Time

Hi, thank you for your great work! May I ask the GPU types you used and the training time for each of the tasks (semantic segmentation, referral expression, and depth estimation)? Thank you very much!

typo in the paper

Hi,
In the paper there is a sentence
"NYUv2 contains 24K images for training and 645 images"
But I think the correct number of test images should be 654, not 645.

error when using xformers

I have run your code successfully.
To improve training efficiency, I installed xformers in the environment.
However, the whole network then suddenly fails, with attn16, attn32, and attn64 all becoming empty ([]).
Can you help me with this?

Reproducing 4k/8k segmentation results

Hi,
Can you share the config you used to train 4k and 8k iters? Most importantly, the learning rate schedule you were using?

Trying to reproduce the results.

Thank you.

resolution

Hello!
Thanks for the great work!
The attention map does not work when the image resolution is not 512x512 and is, for example, 256x256.
There is a problem in the process_attn part because the attention map is not of size 256, so the square root is 11, which does not exist in up_attn. Is there a way to make this work for other resolutions, especially non-square ones?
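
For reference, a minimal sketch of how the reshape could be made resolution-aware, under the assumption that the attention tensor is laid out as [B, H*W, n_text_tokens]; this is not the repo's code.

# Hedged sketch: derive the attention map's spatial shape from the latent resolution
# instead of assuming a square sqrt of the token count. Tensor layout is an assumption.
import torch

def reshape_attn(attn, latent_hw, down):
    h, w = latent_hw[0] // down, latent_hw[1] // down
    assert attn.shape[1] == h * w, "token count must match the latent grid"
    return attn.permute(0, 2, 1).reshape(attn.shape[0], attn.shape[2], h, w)

# e.g. a 256x384 image -> 32x48 latent grid; a block downsampled 2x further -> 16x24
attn = torch.rand(1, 16 * 24, 77)
maps = reshape_attn(attn, latent_hw=(32, 48), down=2)   # [1, 77, 16, 24]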

segmentation training code

Hi authors, nice and inspiring work! I want to reproduce your segmentation results but only find the test scripts. Could you provide the training code? Thanks in advance.

Having encountered fatal error in running inference on Segmentation task: ADE20K

Here my GPU is a GTX 1080.
GPU driver: NVIDIA-SMI 470.182.03, Driver Version: 470.182.03, CUDA Version: 11.4
PyTorch: 1.11.0
Python: 3.8.5
It seems that something went wrong with the parallel launch, but I'm not sure.
I have followed the instructions step by step.

=============================================================
Traceback (most recent call last):
  File "./test.py", line 11, in <module>
    from mmcv.cnn.utils import revert_sync_batchnorm
ImportError: cannot import name 'revert_sync_batchnorm' from 'mmcv.cnn.utils' (/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/mmcv/cnn/utils/__init__.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6417) of binary: /home/ycy/anaconda3/envs/ldm/bin/python3
Traceback (most recent call last):
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

error running depth inference on NYUv2

Hello! Thanks for your great work on VPD!
I am trying to reproduce the depth estimation results on NYUv2 using test.sh and keep getting the following error. Any help here is greatly appreciated. Thanks a lot!
Kalyani

[error screenshot]

VPDSeg doesn't run - questions on backbone parameters

Hi! Thank you for releasing this wonderful work. I have a couple questions when playing with the backbone from this repo:

  1. The output channels from self.unet(latents, t, c_crossattn=[c_crossattn]) are [320, 650, 1290, 1280]. Why do you have [320, 790, 1430, 1280] for the FPN in the VPDSeg config? Am I missing anything? Using the config I got the following error:
RuntimeError: Given groups=1, weight of size [256, 790, 1, 1], expected input[1, 650, 32, 32] to have 790 channels, but got 650 channels instead
  2. When using distributed training, I got errors about parameters not receiving gradients, for the following parameters:
img_backbone.unet.unet.diffusion_model.out.0.weight
img_backbone.unet.unet.diffusion_model.out.0.bias
img_backbone.unet.unet.diffusion_model.out.2.weight
img_backbone.unet.unet.diffusion_model.out.2.bias

which seems to be because self.out from UNetModel is never used in the forward defined in the wrapper:

def register_hier_output(model):
    self = model.diffusion_model
    from ldm.modules.diffusionmodules.util import checkpoint, timestep_embedding

    def forward(x, timesteps=None, context=None, y=None, **kwargs):
        """
        Apply the model to an input batch.
        :param x: an [N x C x ...] Tensor of inputs.
        :param timesteps: a 1-D batch of timesteps.
        :param context: conditioning plugged in via crossattn
        :param y: an [N] Tensor of labels, if class-conditional.
        :return: an [N x C x ...] Tensor of outputs.
        """
        assert (y is not None) == (self.num_classes
                                   is not None), "must specify y if and only if the model is class-conditional"
        hs = []
        t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False)
        emb = self.time_embed(t_emb)

        if self.num_classes is not None:
            assert y.shape == (x.shape[0], )
            emb = emb + self.label_emb(y)

        h = x.type(self.dtype)
        for module in self.input_blocks:
            # import pdb; pdb.set_trace()
            h = module(h, emb, context)
            hs.append(h)
        h = self.middle_block(h, emb, context)
        out_list = []

        for i_out, module in enumerate(self.output_blocks):
            h = torch.cat([h, hs.pop()], dim=1)
            h = module(h, emb, context)
            if i_out in [1, 4, 7]:
                out_list.append(h)
        h = h.type(x.dtype)

        out_list.append(h)
        return out_list

    self.forward = forward

I'm using the SD commit 21f890f as suggested. To double-check if I'm using the right thing without manually deleting self.out, how did you avoid this issue in the training?

Can you please help me with these questions? Thank you very much in advance!
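
For reference, a common generic workaround for DDP's "parameters not receiving gradients" error is to enable find_unused_parameters; whether this matches the authors' training setup is an assumption.

# Hedged sketch: allow DDP to tolerate parameters (e.g. the UNet's unused `out` head)
# that never receive gradients. Not confirmed as the authors' solution.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed has already been initialized by the launch script.
model = nn.Linear(4, 4)   # stand-in for VPDSeg with the wrapped UNet
local_rank = 0            # normally taken from the launcher's environment
model = DDP(model.cuda(local_rank), device_ids=[local_rank],
            find_unused_parameters=True)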

Inference on NYU Depth doesn't work

Hi, I am trying to run inference on the NYU Depth dataset. The model seems to predict a single value (5) for all pixels. I loaded the "v1-5-pruned-emaonly.ckpt" model from HuggingFace. Also, to load the model I had to remove the "['model']" part in line 44 of test.py, as it otherwise throws an error.

I would really appreciate any help. Thanks.
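
For reference, a hedged sketch of a tolerant loading pattern for the checkpoint-format mismatch described above; it is illustrative only, not the repo's loading code.

# Hedged sketch: fall back gracefully whether or not the checkpoint nests its
# weights under a 'model' key.
import torch

ckpt = torch.load("vpd_depth_480x480.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
# model.load_state_dict(state_dict, strict=False)  # as in the repo's test.py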

Question about "S"

Hi,

Thanks for sharing this great work.

I have some questions regarding "S". In section 3.2, you mentioned that "S" contains all category names associated with the task, and in section 3.4, you indicated that "S" varies across different tasks.

As per my understanding, for tasks such as referring image segmentation and depth estimation, |S| = 1, representing the given text or a specific category. However, for semantic segmentation I am uncertain about "S", which contains all category names relevant to the task. Does it refer to all category names present in the current image? If so, how is the loss function designed to establish a link between textual information and image content?

Inconsistent configuration between the repo and the paper

Hi, thanks for your wonderful work! However, I noticed that the config for the semantic segmentation task is not consistent with the paper, and thus I failed to reproduce the numbers.

Could you please provide configs aligned with those in the main paper, to foster further research?

Thanks!

Can you release the training logs on Depth estimation on NYUv2?

Hi, thank you for providing this exploratory work on diffusion for visual perception tasks.

When I train on the NYUv2 dataset, I find that it converges slowly, and I wonder if there is something wrong with my training process.

Could you therefore release the training log for depth estimation? I believe it would greatly help me reproduce the results.
Thank you.

How to use VPD for classification?

Hi, thanks for open-sourcing the code for VPD!
I'm very interested in your work, and I wonder if I can use VPD for zero-shot image classification. Here is my plan:

  1. Get embeddings (class_embeddings.pth) for the different class labels.
  2. Get the cross-attention map for a given image with c_crossattn (which does not go through a text adapter).
  3. Find the most responsive embedding on the cross-attention map; that is the classification result.

My question is: how do I get the logits for each embedding, so I can use softmax to get the prediction scores? A code example would be great.

I managed to obtain the outs (x_c1, x_c2, x_c3, x_c4) and class_embeddings. What should I do next? Any help would be appreciated.
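
For reference, one possible (unconfirmed) way to turn cross-attention responses into per-class logits is to pool each class's attention map spatially and apply softmax; the tensor layout below is an assumption.

# Hedged sketch: spatially pool one cross-attention map per class embedding into a
# logit, then softmax over classes. Layout and pooling choice are assumptions.
import torch

num_classes, h, w = 20, 64, 64
attn = torch.rand(num_classes, h, w)   # stand-in: one cross-attention map per class

logits = attn.flatten(1).mean(dim=1)   # spatially pool each class's response
probs = torch.softmax(logits, dim=0)   # prediction scores over classes
pred_class = probs.argmax().item()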

NYUv2 datasets

Hello, thank you for such great work. I have encountered a problem: the current NYUv2 dataset download link is broken, and I don't know how to obtain split.mat. Can you help me? Thank you so much.
