Official code for "Top-Down Visual Attention from Analysis by Synthesis" (CVPR 2023 highlight)


Top-Down Visual Attention from Analysis by Synthesis

This is the official codebase of AbSViT, from the following paper:

Top-Down Visual Attention from Analysis by Synthesis, CVPR 2023
Baifeng Shi, Trevor Darrell, and Xin Wang
UC Berkeley, Microsoft Research

Website | Paper


To-Dos

  • Finetuning on Vision-Language datasets

Environment

Install PyTorch 1.7.0+ and torchvision 0.8.1+ from the official website.

requirements.txt lists all the dependencies:

pip install -r requirements.txt

In addition, please install the MagickWand library:

apt-get install libmagickwand-dev

Demo

ImageNet demo: demo/demo.ipynb gives an example of visualizing AbSViT's attention map on single-object and multi-object images from ImageNet. Since the model is trained only on single-object recognition, the top-down attention is quite weak.

VQA demo: vision_language/demo/visualize_attention.ipynb gives an example of how AbSViT's top-down attention is adaptive to different questions on the same image.
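The notebooks render the attention as a spatial heatmap over the input image. As background (a generic sketch, not the notebooks' actual code), a length-196 vector of per-patch attention scores from a 224x224 ViT with 16-pixel patches can be reshaped to 14x14 and upsampled for overlay:

```python
import numpy as np

def attention_to_heatmap(attn_tokens, grid=14, patch=16):
    """Turn per-patch attention scores (length grid*grid) into a
    2-D heatmap at roughly image resolution."""
    heat = np.asarray(attn_tokens, dtype=float).reshape(grid, grid)
    # Normalize to [0, 1] for display.
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    # Nearest-neighbor upsample: each grid cell becomes a patch x patch block.
    return np.kron(heat, np.ones((patch, patch)))

scores = np.random.rand(196)             # e.g. CLS-to-patch attention scores
heatmap = attention_to_heatmap(scores)   # shape (224, 224)
```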

Model Zoo

Name        ImageNet   ImageNet-C (↓)   PASCAL VOC   Cityscapes   ADE20K   Weights
ViT-Ti      72.5       71.1             -            -            -        model
AbSViT-Ti   74.1       66.7             -            -            -        model
ViT-S       80.1       54.6             -            -            -        model
AbSViT-S    80.7       51.6             -            -            -        model
ViT-B       80.8       49.3             80.1         75.3         45.2     model
AbSViT-B    81.0       48.3             81.3         76.8         47.2     model

Evaluation on Image Classification

For example, to evaluate AbSViT_small on ImageNet, run

python main.py --model absvit_small_patch16_224 --data-path path/to/imagenet --eval --resume path/to/checkpoint
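The --eval run reports top-1 accuracy on the validation set. As a reminder of the metric (toy numbers, not the repo's evaluation code), a minimal sketch:

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    preds = np.argmax(logits, axis=1)
    return float(np.mean(preds == labels))

logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3],
                   [0.9, 0.1, 0.0]])
labels = np.array([0, 1, 2])
acc = top1_accuracy(logits, labels)   # two of three predictions are correct
```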

To evaluate on robustness benchmarks, add one of --inc_path /path/to/imagenet-c, --ina_path /path/to/imagenet-a, --inr_path /path/to/imagenet-r, or --insk_path /path/to/imagenet-sketch to test on ImageNet-C, ImageNet-A, ImageNet-R, or ImageNet-Sketch, respectively.
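ImageNet-C reports an error-style number (lower is better, hence the ↓ in the Model Zoo table). A hedged sketch of averaging top-1 error over corruption types and severity levels, with hypothetical numbers and without the benchmark's per-corruption CE weighting:

```python
import numpy as np

# Hypothetical accuracies[corruption] = list over 5 severity levels.
accs = {
    "gaussian_noise": [0.60, 0.55, 0.48, 0.40, 0.31],
    "motion_blur":    [0.65, 0.58, 0.50, 0.44, 0.37],
}

# Unweighted mean top-1 error (%) over all corruptions and severities.
errors = [1.0 - a for sev in accs.values() for a in sev]
mean_error = 100.0 * np.mean(errors)
```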

To test accuracy under adversarial attacks, add --fgsm_test or --pgd_test.
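FGSM perturbs the input by a fixed step in the sign of the input gradient of the loss. A self-contained sketch on a toy linear regressor with an analytic gradient (illustrative only, not the repo's attack code):

```python
import numpy as np

def fgsm_perturb(x, w, y, eps):
    """One FGSM step against a toy linear regressor with
    loss = 0.5 * (w.x - y)**2.  The input gradient is (w.x - y) * w,
    and FGSM moves x by eps in the sign of that gradient."""
    grad = (np.dot(w, x) - y) * w
    return x + eps * np.sign(grad)

w = np.array([1.0, -2.0, 0.5])      # fixed "model" weights
x = np.array([0.2, 0.1, -0.3])      # clean input
x_adv = fgsm_perturb(x, w, y=0.0, eps=0.1)
# Each coordinate moves by exactly eps, and the loss increases.
```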

Evaluation on Semantic Segmentation

Please see segmentation for instructions.

Training

Take AbSViT_small as an example. We train on a single node with 8 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --master_port 12345  main.py --model absvit_small_patch16_224 --data-path path/to/imagenet  --output_dir output/here  --num_workers 8 --batch-size 128 --warmup-epochs 10

To train a different architecture, change the --model argument. We provide ViT_{tiny, small, base} and AbSViT_{tiny, small, base}.

Finetuning on Vision-Language Dataset

Please see vision_language for instructions.

Links

This codebase is built upon the official code of "Visual Attention Emerges from Recurrent Sparse Reconstruction" and "Towards Robust Vision Transformer".

Citation

If you find this code helpful, please consider citing our work:


@inproceedings{shi2023top,
  title={Top-Down Visual Attention from Analysis by Synthesis},
  author={Shi, Baifeng and Darrell, Trevor and Wang, Xin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2102--2112},
  year={2023}
}


absvit's Issues

Questions about top_down_transform

Hi, thanks a lot for your great work; I have some questions about top_down_transform.

Here, why do we multiply masked_x by top_down_transform again, given that we have already obtained the selected features?

top_down_transform = prompt[..., None] @ prompt[..., None].transpose(-1, -2)
x = x @ top_down_transform * 5
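For intuition, here is a numpy sketch (toy shapes, not the repo's code) of what the rank-1 transform does: multiplying by prompt prompt^T projects every token feature onto the prompt direction, so the features are re-aligned with the prompt rather than merely selected:

```python
import numpy as np

# Toy shapes, purely illustrative: 4 tokens with 6-dim features.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))                 # token features
prompt = rng.normal(size=6)
prompt = prompt / np.linalg.norm(prompt)    # unit-norm prompt direction

# prompt @ prompt^T is a rank-1 projection onto the prompt direction.
top_down_transform = np.outer(prompt, prompt)
projected = x @ top_down_transform

# Equivalent view: each token keeps only its component along the prompt,
# i.e. (x . prompt) * prompt.
expected = (x @ prompt)[:, None] * prompt[None, :]
```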

Unsatisfactory demo result

Thanks first for your impressive work; it is really fascinating and inspires me a lot. My problem is that when I tried demo.ipynb, I obtained an unsatisfactory result: the prior mask and feedback attention with the "phone" prompt, as well as the later classification example, all produced meaningless pictures. I would like to know what may be wrong and how I can fix it. Thanks again!

Connection error

(Pdb) dm = VQAv2DataModule(config)
*** ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

How can I solve this, or is there a workaround?

About the td token (vector)

In Figure 3 of the paper, the vector td is passed to Att. Is the td vector a sentence embedding produced by something like BERT, or the embedding of just one or two words (e.g., "bridge" or "dog" in the examples)? And how are the td vector and x_bu combined to get V?

Thanks a lot!

How to set the feedback_aug value?

Hi, thanks for sharing your work. Could you tell me how to set the self.feedback_aug value in absvit.py when dealing with different scenes?
