Top-Down Visual Attention from Analysis by Synthesis

This is the official codebase of AbSViT, from the following paper:

Top-Down Visual Attention from Analysis by Synthesis, CVPR 2023
Baifeng Shi, Trevor Darrell, and Xin Wang
UC Berkeley, Microsoft Research

Website | Paper

To-Dos

  • Finetuning on Vision-Language datasets

Environment

Install PyTorch 1.7.0+ and torchvision 0.8.1+ from the official website.

requirements.txt lists all the dependencies:

pip install -r requirements.txt

In addition, please also install the magickwand library:

apt-get install libmagickwand-dev
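
A quick way to confirm that an existing environment meets the minimum versions above is a small check script. This is only a sketch: the helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical and not part of this codebase, and the version parsing is deliberately simple (it ignores local build tags such as `+cu117`).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the Environment section above.
MINIMUMS = {"torch": "1.7.0", "torchvision": "0.8.1"}


def parse_version(text):
    """Turn a string such as '1.13.1+cu117' into the tuple (1, 13, 1)."""
    core = text.split("+")[0]  # drop local build tags like '+cu117'
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:  # pad so that '1.7' compares equal to '1.7.0'
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Return a {package: status} report without importing the packages."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[name] = installed + " (ok)"
        else:
            report[name] = installed + " (older than " + minimum + ")"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(name + ": " + status)
```

The script only reads installed package metadata, so it does not check the system-level magickwand library.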

Demo

ImageNet demo: demo/demo.ipynb gives an example of visualizing AbSViT's attention map on single-object and multi-object images in ImageNet. Because the model is trained only on single-object recognition, its top-down attention is quite weak in this demo.

VQA demo: vision_language/demo/visualize_attention.ipynb gives an example of how AbSViT's top-down attention is adaptive to different questions on the same image.

Model Zoo

Name       ImageNet  ImageNet-C (↓)  PASCAL VOC  Cityscapes  ADE20K  Weights
ViT-Ti     72.5      71.1            -           -           -       model
AbSViT-Ti  74.1      66.7            -           -           -       model
ViT-S      80.1      54.6            -           -           -       model
AbSViT-S   80.7      51.6            -           -           -       model
ViT-B      80.8      49.3            80.1        75.3        45.2    model
AbSViT-B   81.0      48.3            81.3        76.8        47.2    model
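
To read the table as per-size gains, the sketch below subtracts each ViT row from the matching AbSViT row for the two columns that every model reports. The numbers are copied from the table. The `RESULTS` dictionary and the `deltas` helper are illustrative names, and the reading of the ImageNet-C column as "lower is better" follows the (↓) marker in the header.

```python
# Rows copied from the Model Zoo table: (ImageNet, ImageNet-C).
RESULTS = {
    "ViT-Ti": (72.5, 71.1),
    "AbSViT-Ti": (74.1, 66.7),
    "ViT-S": (80.1, 54.6),
    "AbSViT-S": (80.7, 51.6),
    "ViT-B": (80.8, 49.3),
    "AbSViT-B": (81.0, 48.3),
}


def deltas(size):
    """Return (ImageNet change, ImageNet-C change) of AbSViT over ViT."""
    base_acc, base_c = RESULTS["ViT-" + size]
    abs_acc, abs_c = RESULTS["AbSViT-" + size]
    # Round to one decimal to match the precision of the table.
    return round(abs_acc - base_acc, 1), round(abs_c - base_c, 1)


if __name__ == "__main__":
    for size in ("Ti", "S", "B"):
        acc_gain, c_change = deltas(size)
        print(size, "ImageNet:", acc_gain, "ImageNet-C:", c_change)
    # Ti ImageNet: 1.6 ImageNet-C: -4.4
    # S ImageNet: 0.6 ImageNet-C: -3.0
    # B ImageNet: 0.2 ImageNet-C: -1.0
```

In this table, both the ImageNet gain and the ImageNet-C reduction are largest for the tiny model and shrink as model size grows.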

Evaluation on Image Classification

For example, to evaluate AbSViT_small on ImageNet, run

python main.py --model absvit_small_patch16_224 --data-path path/to/imagenet --eval --resume path/to/checkpoint

To evaluate on a robustness benchmark, add one of --inc_path /path/to/imagenet-c, --ina_path /path/to/imagenet-a, --inr_path /path/to/imagenet-r, or --insk_path /path/to/imagenet-sketch to test on ImageNet-C, ImageNet-A, ImageNet-R, or ImageNet-Sketch, respectively.

To test accuracy under adversarial attacks, add --fgsm_test or --pgd_test.
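
When running several of these evaluations in a row, it can help to assemble the command line programmatically. The sketch below uses only the flags listed above. The helper `build_eval_command` and the short benchmark keys are hypothetical, and it allows at most one benchmark and one attack per call, which is a conservative assumption since this README does not say whether the options can be combined.

```python
import shlex

# Flags as listed in this README; the dictionary keys are illustrative labels.
ROBUSTNESS_FLAGS = {
    "imagenet-c": "--inc_path",
    "imagenet-a": "--ina_path",
    "imagenet-r": "--inr_path",
    "imagenet-sketch": "--insk_path",
}
ATTACK_FLAGS = {"fgsm": "--fgsm_test", "pgd": "--pgd_test"}


def build_eval_command(model, data_path, checkpoint,
                       benchmark=None, benchmark_path=None, attack=None):
    """Return the evaluation command as an argument list."""
    cmd = ["python", "main.py", "--model", model, "--data-path", data_path,
           "--eval", "--resume", checkpoint]
    if benchmark is not None:
        if benchmark_path is None:
            raise ValueError("benchmark_path is required with a benchmark")
        cmd += [ROBUSTNESS_FLAGS[benchmark], benchmark_path]
    if attack is not None:
        cmd.append(ATTACK_FLAGS[attack])
    return cmd


if __name__ == "__main__":
    cmd = build_eval_command(
        "absvit_small_patch16_224", "path/to/imagenet", "path/to/checkpoint",
        benchmark="imagenet-c", benchmark_path="/path/to/imagenet-c",
    )
    # Print a copy-pastable shell command; pass cmd to subprocess.run to run it.
    print(shlex.join(cmd))
```

An unknown benchmark or attack name raises a KeyError, which catches typos before a long evaluation run starts.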

Evaluation on Semantic Segmentation

Please see segmentation for instructions.

Training

Take AbSViT_small as an example. We use a single node with 8 GPUs for training:

python -m torch.distributed.launch --nproc_per_node=8 --master_port 12345 main.py --model absvit_small_patch16_224 --data-path path/to/imagenet --output_dir output/here --num_workers 8 --batch-size 128 --warmup-epochs 10

To train a different model architecture, change the --model argument. We provide ViT_{tiny, small, base} and AbSViT_{tiny, small, base}.
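
When adapting the training command to a different number of GPUs, it is useful to know the effective batch size it implies. The sketch below works through the arithmetic under two assumptions that this README does not state: that --batch-size is a per-GPU value (as is common in DeiT-style training scripts), and that training uses the ImageNet-1k training set of 1,281,167 images. The function names are illustrative.

```python
import math

IMAGENET_1K_TRAIN_IMAGES = 1_281_167  # size of the ImageNet-1k training split


def effective_batch_size(num_gpus, per_gpu_batch_size):
    """Total images per optimizer step, assuming a per-GPU batch size."""
    return num_gpus * per_gpu_batch_size


def steps_per_epoch(num_images, batch_size, drop_last=True):
    """Optimizer steps per epoch; drop_last discards the final partial batch."""
    if drop_last:
        return num_images // batch_size
    return math.ceil(num_images / batch_size)


if __name__ == "__main__":
    # Values from the training command above: 8 GPUs, --batch-size 128.
    total = effective_batch_size(8, 128)
    steps = steps_per_epoch(IMAGENET_1K_TRAIN_IMAGES, total)
    print("effective batch size:", total)     # 1024
    print("steps per epoch:", steps)          # 1251
    print("warmup steps (10 epochs):", 10 * steps)  # 12510
```

Under these assumptions, halving the number of GPUs while keeping --batch-size 128 would halve the effective batch size, so the per-GPU batch size or the learning rate may need adjusting to match the original setup.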

Finetuning on Vision-Language Dataset

Please see vision_language for instructions.

Links

This codebase is built upon the official code of "Visual Attention Emerges from Recurrent Sparse Reconstruction" and "Towards Robust Vision Transformer".

Citation

If you found this code helpful, please consider citing our work:


@inproceedings{shi2023top,
  title={Top-Down Visual Attention from Analysis by Synthesis},
  author={Shi, Baifeng and Darrell, Trevor and Wang, Xin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2102--2112},
  year={2023}
}
