CLIPpy

Unofficial implementation of the paper "Perceptual Grouping in Contrastive Vision-Language Models".

Paper Link | Gradio Demo | Colab Notebook

Abstract: Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.

Setup

Refer to requirements.txt.

Pre-trained Model

We replicate the work in the CLIPpy paper under scaled-down settings (due to resource and compute limitations). We train on the CC-12M dataset (some files are missing due to download issues) using a single node with 8 V100 GPUs. After training for 5000 iterations, we reach close to the reported segmentation performance on the PASCAL VOC dataset. Our pre-trained checkpoint is provided here.

Inference

We present examples of using CLIPpy in the following notebooks:

Results

Model     ImageNet (acc)    VOC (mIoU)    VOC (JI)
CLIPpy    45.3              50.8          47.5
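
For readers unfamiliar with the segmentation metrics in the table, the following is a minimal sketch of how mean intersection-over-union (mIoU) can be computed from two flattened per-pixel label maps. It is an illustration only and not the evaluation code behind the numbers above; the function name `mean_iou` and the toy label maps are hypothetical. The Jaccard index (JI) of two pixel sets is the same intersection-over-union ratio.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over the classes that occur in the prediction or the target.

    `pred` and `target` are flattened per-pixel class labels of equal length.
    (Hypothetical helper for illustration, not the repository's evaluator.)
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")

    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both maps (intersection) or in either map (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 6-pixel example with 3 classes.
pred = [0, 0, 1, 1, 1, 2]
target = [0, 1, 1, 1, 2, 2]
print(f"mIoU = {mean_iou(pred, target, num_classes=3):.3f}")
```

Benchmark implementations commonly accumulate intersections and unions over the whole dataset before dividing, rather than averaging per image as this toy version would.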

Analysis

We provide some analysis on the feature space of CLIPpy.

  • bird to different backgrounds vs average background (TBA)
  • feature space visualization (TBA)

Training

We build on the OpenCLIP repository, initialize from the pre-trained DINO and Sentence-T5 checkpoint weights, and follow the training schedule from the CLIPpy paper. We also directly use the patch dropout implementation in OpenCLIP.
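
As a rough illustration of the idea behind patch dropout, the sketch below randomly keeps a fraction of the patch tokens of a single image during training. It is a simplified pure-Python version written for this README, not the OpenCLIP implementation; the function name `patch_dropout` and the parameters `drop_prob` and `keep_first_token` are hypothetical.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def patch_dropout(
    tokens: Sequence[T],
    drop_prob: float,
    rng: Optional[random.Random] = None,
    keep_first_token: bool = True,
) -> List[T]:
    """Randomly drop a fraction `drop_prob` of patch tokens (training only).

    If `keep_first_token` is True, the first token (e.g. a class token) is
    always kept and only the remaining tokens are subsampled.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = rng or random.Random()

    first = list(tokens[:1]) if keep_first_token else []
    patches = list(tokens[1:]) if keep_first_token else list(tokens)
    if drop_prob == 0.0 or not patches:
        return first + patches

    # Keep at least one patch so the sequence never becomes empty.
    num_keep = max(1, int(len(patches) * (1.0 - drop_prob)))
    # Sample indices without replacement; sorting preserves spatial order.
    keep_idx = sorted(rng.sample(range(len(patches)), num_keep))
    return first + [patches[i] for i in keep_idx]


# A 224x224 image with 16x16 patches gives 14 * 14 = 196 patch tokens.
tokens = ["cls"] + list(range(196))
kept = patch_dropout(tokens, drop_prob=0.5, rng=random.Random(0))
print(len(kept))  # 1 class token + 98 surviving patches
```

Dropping patches shortens the token sequence seen by the vision transformer, which reduces per-step compute and memory during training.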

We run the training job as follows:

torchrun --nproc_per_node 8 -m training.main \
    --train-data "${ROOT}/data/cc12m/{00000..01242}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 1024 \
    --precision fp32 \
    --grad-checkpointing \
    --workers 8 \
    --imagenet-val "${ROOT}/data/imagenet/val/" \
    --model "clippy-B-16" \
    --report-to "wandb" \
    --wandb-project-name "clip" \
    --zeroshot-frequency 1 \
    --name "clippy001"
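
The arithmetic implied by these flags can be worked through as below. This is a back-of-the-envelope sketch under two assumptions that the README does not state: that `--batch-size` is a per-GPU value (as in upstream OpenCLIP, which this fork may or may not follow), and that the 5000-iteration checkpoint was trained with exactly this command. All variable names are illustrative.

```python
import math

# Values copied from the command above.
num_gpus = 8                 # --nproc_per_node
per_gpu_batch = 1024         # --batch-size (assumed to be per GPU)
num_samples = 10_968_539     # --train-num-samples
first_shard, last_shard = 0, 1242  # {00000..01242}.tar

# Value from the "Pre-trained Model" section.
iterations = 5000

global_batch = num_gpus * per_gpu_batch                  # samples per optimizer step
num_shards = last_shard - first_shard + 1                # brace range is inclusive
steps_per_epoch = math.ceil(num_samples / global_batch)  # approximate
epochs_trained = iterations / steps_per_epoch
samples_per_shard = num_samples / num_shards

print(f"global batch size : {global_batch}")
print(f"number of shards  : {num_shards}")
print(f"steps per epoch   : {steps_per_epoch}")
print(f"epochs in {iterations} its : {epochs_trained:.2f}")
print(f"samples per shard : {samples_per_shard:.0f}")
```

Under these assumptions, 5000 iterations correspond to a little under four passes over the downloaded portion of CC-12M.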

Citation

@article{clippy2022,
  title={Perceptual Grouping in Contrastive Vision-Language Models},
  author={Kanchana Ranasinghe and Brandon McKinzie and Sachin Ravi and Yinfei Yang and Alexander Toshev and Jonathon Shlens},
  journal={ArXiv},
  year={2022},
  volume={abs/2210.09996}
}
