This project forked from alaalab/instructcv

Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"

License: Other

InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

Yulu Gan, Sungwoo Park, Alexander Schubert, Anthony Philippakis and Ahmed Alaa

Paper | HuggingFace 🤗 Demo

🌟 Official PyTorch implementation of InstructCV. The master branch works with PyTorch 1.5+.

Overview

Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these advances, the application of text-to-image generative models to standard visual recognition tasks in computer vision remains limited. The current de facto approach for these tasks is to design model architectures and loss functions tailored to the task at hand. In this project, we develop a unified language interface for computer vision tasks that abstracts away task-specific design choices and enables task execution by following natural language instructions. Our approach casts multiple computer vision tasks as text-to-image generation problems. Here, the text is an instruction describing the task, and the resulting image is a visually encoded task output. To train our model, we pool commonly used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification. We then use a large language model to paraphrase prompt templates that convey the specific task to be conducted on each image. Through this process, we create a multi-modal and multi-task training dataset comprising input and output images along with annotated instructions. Following the InstructPix2Pix architecture, we apply instruction-tuning to a text-to-image diffusion model using our constructed dataset, steering its functionality from a generative model to an instruction-guided multi-task vision learner.
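As a rough illustration of the dataset construction described above, the sketch below pairs each input image with an instruction drawn from a pool of task templates and with a target image that encodes the task output. The template wording, the `Example` record, and the `build_example` helper are all hypothetical. The actual templates, their LLM paraphrases, and the data format are defined by the dataset preparation scripts in this repository, not by this snippet.

```python
import random
from dataclasses import dataclass

# Hypothetical templates for illustration only. In the project, an LLM
# paraphrases prompt templates; those paraphrases are not reproduced here.
TEMPLATES = {
    "segmentation": ["Segment the {category}.", "Create a mask for the {category}."],
    "detection": ["Detect the {category}.", "Draw a box around the {category}."],
    "depth": ["Estimate the depth map of this image."],
    "classification": ["Is there a {category} in this image?"],
}


@dataclass(frozen=True)
class Example:
    """One training record: instruction text, input image, target image."""
    instruction: str
    input_image: str   # path to the original image
    target_image: str  # path to the image that visually encodes the task output


def build_example(task, input_image, target_image, category="", rng=None):
    """Sample one template for `task` and fill in the object category."""
    if rng is None:
        rng = random.Random()
    template = rng.choice(TEMPLATES[task])
    # str.format ignores unused keyword arguments, so templates without
    # a {category} placeholder (e.g. depth) are returned unchanged.
    return Example(template.format(category=category), input_image, target_image)


if __name__ == "__main__":
    example = build_example("segmentation", "dog.jpg", "dog_mask.png", category="dog")
    print(example)
```

Sampling a different template per image is one simple way to expose the model to varied phrasings of the same task. Whether the project samples per image or per epoch is not stated in this README.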

Set up the environment

Install dependencies by running:

# Step 0. Set up the environment.
conda env create -f environment.yaml
conda activate lvi
# Step 1 (optional). Skip this step if you do not plan to run the baselines.
# Install TensorFlow first: https://www.tensorflow.org/install/pip
pip install git+https://github.com/philferriere/cocoapi.git#subdirectory=PythonAPI
pip install -U openmim
mim install mmcv-full
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
pip install -v -e .
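
After setup, it can be useful to confirm that the installed PyTorch meets the 1.5+ requirement stated above. The helper below is a minimal sketch and is not part of this repository. `meets_minimum` is a hypothetical name, and it only compares the major and minor components of a version string.

```python
import re


def meets_minimum(version, minimum=(1, 5)):
    """Return True if a version string such as '1.13.1+cu117' is >= `minimum`.

    Only the leading 'major.minor' part is compared; local build tags
    (for example '+cu117') and pre-release suffixes are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= tuple(minimum)


if __name__ == "__main__":
    print(meets_minimum("1.13.1+cu117"))
```

Inside the activated `lvi` environment, calling `meets_minimum(torch.__version__)` after `import torch` should return `True` if the requirement is met.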

Get Started

See Preparing Datasets for InstructCV.

See Getting Started with InstructCV for detailed instructions on training and inference with InstructCV.

InstructCV-RP checkpoint

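Results for InstructCV-RP:

| Task | Metric | Dataset | InstructCV-RP |
|---|---|---|---|
| Depth Estimation | RMSE ⬇ | NYUv2 | 0.297 |
| Depth Estimation | RMSE ⬇ | SUNRGB-D | 0.279 |
| Semantic Segmentation | mIoU ⬆ | ADE-20K | 47.235 |
| Semantic Segmentation | mIoU ⬆ | VOC | 52.125 |
| Classification | Acc ⬆ | Oxford-Pets | 82.135 |
| Classification | Acc ⬆ | ImageNet-sub | 74.665 |
| Object Detection | mAP ⬆ | COCO | 48.500 |
| Object Detection | mAP ⬆ | VOC | 61.700 |

Download: checkpoint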
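
For readers unfamiliar with the metrics reported above, the following sketch shows the standard textbook definitions of RMSE (lower is better) and mean IoU (higher is better) on flat lists of values. It is an assumption that the project's evaluation follows these definitions. The evaluation scripts in the repository may apply masking, scaling, or per-image averaging that this snippet does not model, and the function names are illustrative.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences of depths."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and of equal length")
    squared_error = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared_error / len(pred))


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes, as a fraction in [0, 1].

    Classes that appear in neither `pred` nor `target` are skipped.
    Multiply by 100 to obtain a percentage.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    print(rmse([1.0, 2.0], [1.0, 4.0]))
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```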

Demo

The pre-trained Stable Diffusion model is subject to the original Stable Diffusion license terms.

Acknowledgement

This codebase is largely based on CompVis/stable_diffusion and InstructPix2Pix.

Citation

If you find our work useful in your research, please cite:

@article{gan2023instructcv,
  title={InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists},
  author={Gan, Yulu and Park, Sungwoo and Schubert, Alexander and Philippakis, Anthony and Alaa, Ahmed},
  journal={arXiv preprint arXiv:2310.00390},
  year={2023}
}
