wl-zhao / vpd

[ICCV 2023] VPD is a framework that leverages the high-level and low-level knowledge of a pre-trained text-to-image diffusion model for downstream visual perception tasks.

Home Page: https://vpd.ivg-research.xyz

License: MIT License

Python 36.26% Shell 0.36% Makefile 0.02% Jupyter Notebook 63.36%

vpd's Introduction

VPD


Created by Wenliang Zhao*, Yongming Rao*, Zuyan Liu*, Benlin Liu, Jie Zhou, Jiwen Lu

This repository contains PyTorch implementation for paper "Unleashing Text-to-Image Diffusion Models for Visual Perception" (ICCV 2023).

VPD (Visual Perception with Pre-trained Diffusion Models) is a framework that leverages the high-level and low-level knowledge of a pre-trained text-to-image diffusion model for downstream visual perception tasks.

[intro figure]

[Project Page] [arXiv]

Installation

Clone this repo, and run

git submodule init
git submodule update

Download the checkpoint of stable-diffusion (we use v1-5 by default) and put it in the checkpoints folder. Please also follow the instructions in stable-diffusion to install the required packages.
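
As a quick sanity check that the downloaded checkpoint is readable before wiring it in, something like the following can be used. This is only a sketch; the filename is an assumption, so use whichever checkpoint file you downloaded.

# Hedged sanity check: confirm the Stable Diffusion checkpoint loads and exposes a
# state_dict. The filename is an assumption; adjust it to the file you downloaded.
import torch

ckpt = torch.load("checkpoints/v1-5-pruned-emaonly.ckpt", map_location="cpu")
state = ckpt.get("state_dict", ckpt)
print(f"{len(state)} entries in the checkpoint's state_dict")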

Semantic Segmentation with VPD

Equipped with a lightweight Semantic FPN and trained for 80K iterations on $512\times512$ crops, our VPD can achieve 54.6 mIoU on ADE20K.

Please check segmentation.md for detailed instructions.

Referring Image Segmentation with VPD

VPD achieves 73.46, 63.93, and 63.12 oIoU on the validation sets of RefCOCO, RefCOCO+, and G-Ref, respectively.

Dataset P@0.5 P@0.6 P@0.7 P@0.8 P@0.9 oIoU Mean IoU
RefCOCO 85.52 83.02 78.45 68.53 36.31 73.46 75.67
RefCOCO+ 76.69 73.93 69.68 60.98 32.52 63.93 67.98
RefCOCOg 75.16 71.16 65.60 55.04 29.41 63.12 66.42

Please check refer.md for detailed instructions on training and inference.

Depth Estimation with VPD

VPD obtains 0.254 RMSE on the NYUv2 depth estimation benchmark, establishing a new state of the art.

Method RMSE d1 d2 d3 REL log_10
VPD 0.254 0.964 0.995 0.999 0.069 0.030

Please check depth.md for detailed instructions on training and inference.

License

MIT License

Acknowledgements

This code is based on stable-diffusion, mmsegmentation, LAVT, and MIM-Depth-Estimation.

Citation

If you find our work useful in your research, please consider citing:

@article{zhao2023unleashing,
  title={Unleashing Text-to-Image Diffusion Models for Visual Perception},
  author={Zhao, Wenliang and Rao, Yongming and Liu, Zuyan and Liu, Benlin and Zhou, Jie and Lu, Jiwen},
  journal={ICCV},
  year={2023}
}

vpd's People

Contributors

junyi42, liuzuyan, raoyongming, wl-zhao

vpd's Issues

Is using the test file name for inference a fair practice?

Isn't it wrong to use the name of the test image file during inference? Suppose I named the files img1.png, img2.png, and so on; then the code would not work. Also, you cannot run inference on images whose class_id you don't know, or that don't fall into one of the class_ids.

Reproduce the performance of ADE20K

I am struggling to reproduce the reported ADE20K performance at 8K iterations (I only get 45.64 mIoU). Do you have any suggestions? This is my config:

lr_config = dict(policy='poly', power=0.9, min_lr=1e-6, by_epoch=False,
                 warmup='linear',
                 warmup_iters=150,
                 warmup_ratio=1e-6)

optimizer = dict(type='AdamW', lr=0.001, weight_decay=0.0001,
                 paramwise_cfg=dict(custom_keys={'unet': dict(lr_mult=0.1),
                                                 'encoder_vq': dict(lr_mult=0.0),
                                                 'text_encoder': dict(lr_mult=0.0),
                                                 'norm': dict(decay_mult=0.)}))

Class embedding

Hi,
thanks for the interesting work! Could you please also share the code for creating the class embeddings, so that I can try it on my own dataset?
More specifically, the text embedding for one prompt has a shape of [77, 768], but in class_embedding.pt each class only has an embedding of shape [1, 768]. Did you average over the 77 tokens?
Thanks a lot!
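
For reference, a minimal sketch of one way to produce per-class [1, 768] embeddings by mean-pooling the 77 CLIP token embeddings. Whether VPD actually pools this way, and which prompt template it uses, are assumptions, not something confirmed by the repo.

# Hedged sketch: per-class text embeddings via mean pooling of CLIP token embeddings.
# The pooling strategy and the prompt template are assumptions, not VPD's confirmed recipe.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

class_names = ["wall", "building", "sky"]  # replace with your dataset's classes
embeddings = []
with torch.no_grad():
    for name in class_names:
        tokens = tokenizer(f"a photo of a {name}", padding="max_length",
                           max_length=77, return_tensors="pt")
        out = text_encoder(**tokens).last_hidden_state   # [1, 77, 768]
        embeddings.append(out.mean(dim=1))                # [1, 768]

torch.save(torch.cat(embeddings), "class_embeddings.pth")  # [num_classes, 768]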

Inconsistent Training images number for NYUv2 in the preprint

Hi, thanks for your work.

Please correct me if I am wrong, but the number of training images for the NYUv2 dataset is given as 24k in your preprint. However, the original NYUv2 dataset, as well as the NYUv2 dataset prepared using your instructions, has only 795 images. Do you use any extra training data, or is it just a typo in the paper? Thanks

[screenshot of the preprint]

An error in dump_nyu_text_embeddings file

There is an error in the file VPD/depth/dump_nyu_text_embeddings at line 14:
from clip import FrozenCLIPEmbedderContext
There is no corresponding class FrozenCLIPEmbedderContext in https://github.com/openai/CLIP.git.
The stable-diffusion repo uses FrozenCLIPTextEmbedder to implement the text encoder.
So I want to know whether there is an error in this code. Thank you very much.

About class_embeddings

Hi, thanks for open-sourcing VPD. I want to ask how to obtain the "class_embeddings.pth" for a new dataset. Do I have to run CLIP inference on that dataset every time I want to train VPD on a new dataset?

How do you handle the info loss from VAE encoding?

Using internal features from a latent diffusion model means the width/height of what you get is only 1/8 of the original image. I'm trying to implement my own feature-extraction pipeline and have run into this information-loss issue, which makes my segmentation results a total mess. I have also read the code in this repo but haven't figured out how you handle it, which is why I'm asking here. Thanks in advance!
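
For reference, the usual remedy in dense prediction is to let the task head upsample the 1/8-resolution features or logits back to the input size. The sketch below is a generic pattern under that assumption, not necessarily VPD's exact mechanism.

# Hedged sketch: upsample coarse (1/8-resolution) logits back to the input size.
# A generic dense-prediction pattern, not necessarily how VPD resolves the issue.
import torch
import torch.nn.functional as F

def upsample_logits(logits, image_size):
    # logits: [B, num_classes, H/8, W/8] predicted on top of diffusion features
    return F.interpolate(logits, size=image_size, mode="bilinear", align_corners=False)

coarse = torch.randn(1, 150, 64, 64)        # e.g. a 512x512 input gives a 64x64 latent grid
full = upsample_logits(coarse, (512, 512))  # [1, 150, 512, 512]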

class embedding

Hi!
Thanks for the great work.
In the original stable diffusion there is no pooling of the text embedding, so the output is [20, 77, 768] instead of [20, 768]. So I am wondering: what is the purpose of using pretrained weights if you change the prompt embeddings to a pooled version? As I understand it, that is not how the original backbone was trained.

How can I get the depth metrics reported in this paper?

Hi. Thanks for sharing this code and good paper.

Recently I have been trying to get depth estimation metrics using the pretrained model (vpd_depth_480x480.pth),
but when I evaluated the pretrained model, these error messages came out:

Unexpected Keys: ['model_ema.decay', 'model_ema.num_updates']
Traceback (most recent call last):
  File "test.py", line 173, in <module>
    main()
  File "test.py", line 47, in main
    model.load_state_dict(model_weight, strict=False)
  File "/home/jslee/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for VPDDepth:
    size mismatch for decoder.deconv_layers.3.weight: copying a param with shape torch.Size([32, 32, 2, 2]) from checkpoint, the shape in current model is torch.Size([32, 32, 3, 3]).
    size mismatch for decoder.deconv_layers.6.weight: copying a param with shape torch.Size([32, 32, 2, 2]) from checkpoint, the shape in current model is torch.Size([32, 32, 4, 4]).

What am I doing wrong? Why don't the pretrained weights match the model built for the NYU input images?

What does A32/A64 mean?

When using VPD for semantic segmentation, does A32/A64 mean that only the size-32/64 attention maps are used? In the source code, attn16, attn32, and attn64 are all used; how can I reproduce the results of VPD_A32 and VPD_A64?

mm versions for segmentation

Hi thanks for sharing the code for your project.

I see in a previous issue you said you were using mmcv==1.7.1 and mmsegmentation==1.0.0, but mmsegmentation==1.0.0 is the new version of the mm library and doesn't have the modules imported in segmentation/train.py.
For example, the modules Config and DictAction have moved into the mmengine.config module.

Is it possible that you are using a different version of the libraries, or perhaps the train.py script is outdated?

Can you release the training log for segmentation?

Hello, I have been reading your paper recently and find your ideas very interesting. However, I ultimately couldn't fully reproduce the results of the segmentation.

I am using this configuration: https://github.com/wl-zhao/VPD/blob/main/segmentation/configs/fpn_vpd_sd1-5_512x512_gpu8x2.py
The checkpoint is also v1-5-pruned-emaonly.ckpt. However, my final mIoU is 53.06, which is a little lower than the released result (53.7). Could you therefore release the training log for segmentation? I believe it would greatly help me reproduce the results.

Thank you.

Depth training code

Just wondering if you were planning on sharing the depth training scripts -- thanks!

NYU test set predictions

Hi,
Thanks for sharing your work! It's an exciting result.
Could you please share the predictions on NYU-Depth v2 in raw 16-bit PNG format for easy comparison? Thanks

Struggling to setup VPD

Hi, I'm having trouble running train.py (I want to run it without the train.sh file). I get the following error:
ModuleNotFoundError: No module named 'vpd.models'
Do you have any idea what could be causing this error?
Also, just to be sure, is this the correct directory structure?
-/VPD
-----/checkpoints
---------v1-5-pruned.ckpt
-----/depth
-----/segmentation
-----etc...

Loss calculation error during training

Hi @wl-zhao , I am following all the instructions mentioned here for training - https://github.com/wl-zhao/VPD/blob/main/depth/README.md
When I try to run the training script, I get an error, specifically in the loss function, https://github.com/wl-zhao/VPD/blob/main/depth/utils_depth/criterion.py#L17

IndexError: The shape of the mask [3, 480, 480] at index 1 does not match the shape of the indexed tensor [3, 1, 480, 480] at index 1
The error makes sense since pred.shape is [3, 1, 480, 480] & gt.shape is [3, 480, 480].

I can fix it by squeezing dim=1 instead, at https://github.com/wl-zhao/VPD/blob/main/depth/train.py#L228.

However, I want to confirm whether this is the right change or whether I am missing something. The only config change I have made is updating num_workers for the dataloader, which should not impact anything.
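
For reference, a minimal sketch of the squeeze workaround described above, with stand-in tensors; this is illustrative only, not an official fix.

# Hedged sketch: squeeze the singleton channel so the prediction matches the
# ground truth's [B, H, W] shape before computing the masked loss.
import torch

pred = torch.rand(3, 1, 480, 480)   # stand-in for the model output
gt = torch.rand(3, 480, 480)        # stand-in for the ground-truth depth

pred = pred.squeeze(dim=1)          # [3, 480, 480]
mask = gt > 0                       # valid-pixel mask, analogous to the repo's criterion
loss = torch.abs(pred[mask] - gt[mask]).mean()  # placeholder loss for illustration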

Image encoder clarification

Hi,
thank you for sharing this interesting work.
Could you please share which encoder is used to encode the images into the latent space?
In the paper you mention using a frozen VQGAN encoder. However, in the v1-inference.yaml configuration file used to set up stable diffusion, a KL autoencoder is used as the first-stage model.
thanks!

Failed to reproduce the accuracy on ADE20k

Hello, thanks for your excellent work! I am currently training a segmentation model on ADE20k using the configuration you provided. However, the final mIoU I achieve is 53.1, which is lower than the reported 53.7. I have included my log. Is there any advice you could give on how to improve my training process? Thank you.

Blurred depth maps

Hi,

I tried visualizing the depth map on one of the test images. But I am getting some blurring in the depth map:

[depth map screenshot]

On the other hand models like ZoeDepth produce a sharper depth map:

[ZoeDepth depth map]

Why might this be happening?

Looking forward to your reply.
Thanks

Cross-Attention maps

How can I visualize the cross-attention maps for text prompts as in Fig. 1(b)?

Hi, I would like to ask how to visualize the prompts for a given picture using VPD. Could you please help me with it?

Training Time

Hi, thank you for your great work! May I ask the GPU types you used and the training time for each of the tasks (semantic segmentation, referral expression, and depth estimation)? Thank you very much!

typo in the paper

Hi,
In the paper there is a sentence
"NYUv2 contains 24K images for training and 645 images"
But I think the correct number of test images should be 654, not 645.

error when using xformers

I have run your code successfully.
To improve training efficiency, I installed xformers in the environment.
However, the whole network then suddenly fails, with attn16, attn32, and attn64 all becoming empty ([]).
Can you help me with this?

Reproducing 4k/8k segmentation results

Hi,
Can you share the config you used to train 4k and 8k iters? Most importantly, the learning rate schedule you were using?

Trying to reproduce the results.

Thank you.

resolution

Hello!
Thanks for the great work!
The attention map does not work when the image resolution is not 512x512 and is, for example, 256x256.
There is a problem in the process_attn part because the attention map is not of size 256, so the square root is 11, which does not exist in up_attn. Is there a way to make this work for other resolutions, especially non-square ones?
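
For reference, a minimal sketch of how the reshape could be made resolution-aware, under the assumption that the attention tensor is laid out as [B, H*W, n_text_tokens]; this is not the repo's code.

# Hedged sketch: derive the attention map's spatial shape from the latent resolution
# instead of assuming a square sqrt of the token count. Tensor layout is an assumption.
import torch

def reshape_attn(attn, latent_hw, down):
    h, w = latent_hw[0] // down, latent_hw[1] // down
    assert attn.shape[1] == h * w, "token count must match the latent grid"
    return attn.permute(0, 2, 1).reshape(attn.shape[0], attn.shape[2], h, w)

# e.g. a 256x384 image -> 32x48 latent grid; a block downsampled 2x further -> 16x24
attn = torch.rand(1, 16 * 24, 77)
maps = reshape_attn(attn, latent_hw=(32, 48), down=2)   # [1, 77, 16, 24]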

segmentation training code

Hi authors, nice and inspiring work! I want to reproduce your segmentation results but only find the test scripts. Could you provide the training code? Thanks in advance.

Having encountered fatal error in running inference on Segmentation task: ADE20K

Here my GPU is a GTX 1080.
GPU driver: NVIDIA-SMI 470.182.03, Driver Version: 470.182.03, CUDA Version: 11.4
PyTorch: 1.11.0
Python: 3.8.5
It seems that something went wrong with the parallel launch, but I'm not sure.
I have followed the instructions step by step.

=============================================================
Traceback (most recent call last):
  File "./test.py", line 11, in <module>
    from mmcv.cnn.utils import revert_sync_batchnorm
ImportError: cannot import name 'revert_sync_batchnorm' from 'mmcv.cnn.utils' (/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/mmcv/cnn/utils/__init__.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6417) of binary: /home/ycy/anaconda3/envs/ldm/bin/python3
Traceback (most recent call last):
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ycy/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

error running depth inference on NYUv2

Hello! Thanks for your great work on VPD!
I am trying to reproduce the depth estimation results on NYUv2 using test.sh and keep getting the following error. Any help here is greatly appreciated. Thanks a lot!
Kalyani

[error screenshot]

VPDSeg doesn't run - questions on backbone parameters

Hi! Thank you for releasing this wonderful work. I have a couple questions when playing with the backbone from this repo:

  1. The output channels from self.unet(latents, t, c_crossattn=[c_crossattn]) are [320, 650, 1290, 1280]. Why do you have [320, 790, 1430, 1280] for the FPN in the VPDSeg config? Am I missing anything? Using the config I got the following error:
RuntimeError: Given groups=1, weight of size [256, 790, 1, 1], expected input[1, 650, 32, 32] to have 790 channels, but got 650 channels instead
  2. When using distributed training, I got errors about parameters not receiving gradients, for the following parameters:
img_backbone.unet.unet.diffusion_model.out.0.weight
img_backbone.unet.unet.diffusion_model.out.0.bias
img_backbone.unet.unet.diffusion_model.out.2.weight
img_backbone.unet.unet.diffusion_model.out.2.bias

which seems to be because self.out from UNetModel is never used in the forward defined in the wrapper:

def register_hier_output(model):
    self = model.diffusion_model
    from ldm.modules.diffusionmodules.util import checkpoint, timestep_embedding

    def forward(x, timesteps=None, context=None, y=None, **kwargs):
        """
        Apply the model to an input batch.
        :param x: an [N x C x ...] Tensor of inputs.
        :param timesteps: a 1-D batch of timesteps.
        :param context: conditioning plugged in via crossattn
        :param y: an [N] Tensor of labels, if class-conditional.
        :return: an [N x C x ...] Tensor of outputs.
        """
        assert (y is not None) == (self.num_classes
                                   is not None), "must specify y if and only if the model is class-conditional"
        hs = []
        t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False)
        emb = self.time_embed(t_emb)

        if self.num_classes is not None:
            assert y.shape == (x.shape[0], )
            emb = emb + self.label_emb(y)

        h = x.type(self.dtype)
        for module in self.input_blocks:
            # import pdb; pdb.set_trace()
            h = module(h, emb, context)
            hs.append(h)
        h = self.middle_block(h, emb, context)
        out_list = []

        for i_out, module in enumerate(self.output_blocks):
            h = torch.cat([h, hs.pop()], dim=1)
            h = module(h, emb, context)
            if i_out in [1, 4, 7]:
                out_list.append(h)
        h = h.type(x.dtype)

        out_list.append(h)
        return out_list

    self.forward = forward

I'm using the SD commit 21f890f as suggested. To double-check if I'm using the right thing without manually deleting self.out, how did you avoid this issue in the training?

Can you please help me with these questions? Thank you very much in advance!
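
For reference, a common generic workaround for DDP's "parameters not receiving gradients" error is to enable find_unused_parameters; whether this matches the authors' training setup is an assumption.

# Hedged sketch: allow DDP to tolerate parameters (e.g. the UNet's unused `out` head)
# that never receive gradients. Not confirmed as the authors' solution.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed has already been initialized by the launch script.
model = nn.Linear(4, 4)   # stand-in for VPDSeg with the wrapped UNet
local_rank = 0            # normally taken from the launcher's environment
model = DDP(model.cuda(local_rank), device_ids=[local_rank],
            find_unused_parameters=True)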

Inference on NYU Depth doesn't work

Hi, I am trying to run inference on the NYU Depth dataset. The model seems to predict a single value (5) for all pixels. I loaded the "v1-5-pruned-emaonly.ckpt" model from HuggingFace. Also, to load the model I had to remove the "['model']" part in line 44 of test.py, as it otherwise throws an error.

I would really appreciate any help. Thanks.
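
For reference, a hedged sketch of a tolerant loading pattern for the checkpoint-format mismatch described above; it is illustrative only, not the repo's loading code.

# Hedged sketch: fall back gracefully whether or not the checkpoint nests its
# weights under a 'model' key.
import torch

ckpt = torch.load("vpd_depth_480x480.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
# model.load_state_dict(state_dict, strict=False)  # as in the repo's test.py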

Question about "S"

Hi,

Thanks for sharing this great work.

I have some questions regarding "S". In section 3.2, you mentioned that "S" contains all category names associated with the task, and in section 3.4, you indicated that "S" varies across different tasks.

As per my understanding, for tasks such as referring image segmentation and depth estimation, |S| = 1, representing the given text or a specific category. However, for semantic segmentation I am uncertain about "S", which contains all category names relevant to the task. Does it refer to all category names present in the current image? If so, how is the loss function designed to establish a link between textual information and image content?

Inconsistent configuration between the repo and the paper

Hi, thanks for your wonderful work! However, I noticed that the config for the semantic segmentation task is not consistent with the paper, and thus I failed to reproduce the numbers.

Could you please provide configs aligned with those in the main paper, to foster further research?

Thanks!

Can you release the training logs on Depth estimation on NYUv2?

Hi, thank you for providing this exploratory work on diffusion for visual perception tasks.

When I train on the NYUv2 dataset, I find that it converges slowly, and I wonder if there is something wrong with my training process.

Could you therefore release the training log for depth estimation? I believe it would greatly help me reproduce the results.
Thank you.

How to use VPD for classification?

Hi, thanks for open-sourcing the code for VPD!
I'm very interested in your work, and I wonder if I can use VPD for zero-shot image classification. Here is my plan:

  1. Get embeddings (class_embeddings.pth) for the different class labels.
  2. Get the cross-attention map for a given image with c_crossattn (which does not go through a text adapter).
  3. Find the most responsive embedding on the cross-attention map; that is the classification result.

My question is: how do I get the logits for each embedding, so I can use softmax to get the prediction scores? A code example would be great.

I managed to obtain the outs (x_c1, x_c2, x_c3, x_c4) and class_embeddings. What should I do next? Any help would be appreciated.
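
For reference, one possible (unconfirmed) way to turn cross-attention responses into per-class logits is to pool each class's attention map spatially and apply softmax; the tensor layout below is an assumption.

# Hedged sketch: spatially pool one cross-attention map per class embedding into a
# logit, then softmax over classes. Layout and pooling choice are assumptions.
import torch

num_classes, h, w = 20, 64, 64
attn = torch.rand(num_classes, h, w)   # stand-in: one cross-attention map per class

logits = attn.flatten(1).mean(dim=1)   # spatially pool each class's response
probs = torch.softmax(logits, dim=0)   # prediction scores over classes
pred_class = probs.argmax().item()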

NYUv2 datasets

Hello, thank you for such great work. I have encountered a problem: the current NYUv2 dataset download link is broken, and I don't know how to obtain split.mat. Can you help me? Thank you so much.
