Created by Wenliang Zhao*, Yongming Rao*, Zuyan Liu*, Benlin Liu, Jie Zhou, Jiwen Lu
This repository contains PyTorch implementation for paper "Unleashing Text-to-Image Diffusion Models for Visual Perception".
VPD (Visual Perception with Pre-trained Diffusion Models) is a framework that leverages the high-level and low-level knowledge of a pre-trained text-to-image diffusion model for downstream visual perception tasks.
Clone this repo, and run

```shell
git submodule init
git submodule update
```
Download the checkpoint of Stable Diffusion (we use v1-5 by default) and put it in the `checkpoints` folder.
Equipped with a lightweight Semantic FPN and trained for 80K iterations on ADE20K, VPD delivers strong semantic segmentation performance.
Please check segmentation.md for detailed instructions.
VPD achieves 73.25, 63.51, and 62.80 oIoU on the validation sets of RefCOCO, RefCOCO+, and G-Ref, respectively.
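For reference, oIoU (overall IoU) is typically computed by accumulating intersection and union counts over the whole dataset before dividing, so larger objects contribute more than in a per-image mean IoU. A minimal sketch (function name and toy masks are illustrative, not from this repo):

```python
import numpy as np

def overall_iou(preds, gts):
    """Overall IoU: total intersection / total union accumulated
    over all (prediction, ground-truth) binary mask pairs."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union

# Toy example: two 2x2 binary masks sharing one foreground pixel.
pred = np.array([[1, 1], [0, 0]], dtype=bool)
gt = np.array([[1, 0], [0, 0]], dtype=bool)
print(overall_iou([pred], [gt]))  # 1 intersection / 2 union = 0.5
```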
Please check refer.md for detailed instructions.
VPD obtains 0.254 RMSE on the NYUv2 depth estimation benchmark, establishing a new state of the art.
| Method | RMSE | d1 | d2 | d3 | REL | log_10 |
|---|---|---|---|---|---|---|
| VPD | 0.254 | 0.964 | 0.995 | 0.999 | 0.069 | 0.030 |
Please check depth.md for detailed instructions.
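The columns in the table above are the standard monocular depth metrics: RMSE, threshold accuracies d1–d3 (fraction of pixels with max ratio below 1.25, 1.25², 1.25³), absolute relative error, and mean log10 error. A minimal sketch of how they are conventionally computed, assuming positive depth values (this is not the repo's evaluation code):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth estimation metrics over flat arrays
    of predicted and ground-truth depths (both strictly positive)."""
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "RMSE": np.sqrt(np.mean((pred - gt) ** 2)),
        "d1": np.mean(ratio < 1.25),
        "d2": np.mean(ratio < 1.25 ** 2),
        "d3": np.mean(ratio < 1.25 ** 3),
        "REL": np.mean(np.abs(pred - gt) / gt),
        "log_10": np.mean(np.abs(np.log10(pred) - np.log10(gt))),
    }

# Toy example with near-perfect predictions.
pred = np.array([1.0, 2.1, 2.9])
gt = np.array([1.0, 2.0, 3.0])
m = depth_metrics(pred, gt)
```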
MIT License
This code is based on mmsegmentation, LAVT, and MIM-Depth-Estimation.
If you find our work useful in your research, please consider citing:
```
@article{zhao2023unleashing,
  title={Unleashing Text-to-Image Diffusion Models for Visual Perception},
  author={Zhao, Wenliang and Rao, Yongming and Liu, Zuyan and Liu, Benlin and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2303.02153},
  year={2023}
}
```