microsoft / focal-transformer Goto Github PK

[NeurIPS 2021 Spotlight] Official code for "Focal Self-attention for Local-Global Interactions in Vision Transformers"

License: MIT License

Python 100.00%

focal-transformer's Introduction

Focal Transformer [NeurIPS 2021 Spotlight]

This is the official implementation of our Focal Transformer -- "Focal Self-attention for Local-Global Interactions in Vision Transformers", by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.

Introduction

Our Focal Transfomer introduced a new self-attention mechanism called focal self-attention for vision transformers. In this new mechanism, each token attends the closest surrounding tokens at fine granularity but the tokens far away at coarse granularity, and thus can capture both short- and long-range visual dependencies efficiently and effectively.

With our Focal Transformers, we achieved superior performance over the state-of-the-art vision Transformers on a range of public benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M and a larger size of 89.8M achieve 83.6 and 84.0 Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state-of-the-art methods for 6 different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation.

🎞️ Video by The AI Epiphany

Next Generation Architecture

We had developed FocalNet, a next generation of architecture built based on the focal mechanism. It is much faster and more effective. Check it out at: https://github.com/microsoft/FocalNet!

Faster Focal Transformer

As you may notice, though the theoritical GFLOPs of our Focal Transformer is comparable to prior works, its wall-clock efficiency lags behind. Therefore, we are releasing a faster version of Focal Transformer, which discard all the rolling and unfolding operations used in our first version.

Model	Pretrain	Use Conv	Resolution	acc@1	acc@5	#params	FLOPs	Throughput (imgs/s)	Checkpoint	Config
Focal-T	IN-1K	No	224	82.2	95.9	28.9M	4.9G	319	download	yaml
Focal-fast-T	IN-1K	Yes	224	82.4	96.0	30.2M	5.0G	483	download	yaml
Focal-S	IN-1K	No	224	83.6	96.2	51.1M	9.4G	192	download	yaml
Focal-fast-S	IN-1K	Yes	224	83.6	96.4	51.5M	9.4G	293	download	yaml
Focal-B	IN-1K	No	224	84.0	96.5	89.8M	16.4G	138	download	yaml
Focal-fast-B	IN-1K	Yes	224	84.0	96.6	91.2M	16.4G	203	download	yaml

Benchmarking

Image Classification Throughput with Image Resolution

Model	Top-1 Acc.	GLOPs (224x224)	224x224	448x448	896 x 896
DeiT-Small/16	79.8	4.6	939	101	20
PVT-Small	79.8	3.8	794	172	31
CvT-13	81.6	4.5	746	125	14
ViL-Small	82.0	5.1	397	87	17
Swin-Tiny	81.2	4.5	760	189	48
Focal-Tiny	82.2	4.9	319	105	27
PVT-Medium	81.2	6.7	517	111	20
CvT-21	82.5	7.1	480	85	10
ViL-Medium	83.3	9.1	251	53	8
Swin-Small	83.1	8.7	435	111	28
Focal-Small	83.6	9.4	192	63	17
ViT-Base/16	77.9	17.6	291	57	8
Deit-Base/16	81.8	17.6	291	57	8
PVT-Large	81.7	9.8	352	77	14
ViL-Base	83.2	13.4	145	35	5
Swin-Base	83.4	15.4	291	70	17
Focal-Base	84.0	16.4	138	44	11

Image Classification on ImageNet-1K

Model	Pretrain	Use Conv	Resolution	acc@1	acc@5	#params	FLOPs	Checkpoint	Config
Focal-T	IN-1K	No	224	82.2	95.9	28.9M	4.9G	download	yaml
Focal-T	IN-1K	Yes	224	82.7	96.1	30.8M	5.2G	download	yaml
Focal-S	IN-1K	No	224	83.6	96.2	51.1M	9.4G	download	yaml
Focal-S	IN-1K	Yes	224	83.8	96.5	53.1M	9.7G	download	yaml
Focal-B	IN-1K	No	224	84.0	96.5	89.8M	16.4G	download	yaml
Focal-B	IN-1K	Yes	224	84.2	97.1	93.3M	16.8G	download	yaml

Object Detection and Instance Segmentation on COCO

Mask R-CNN

Backbone	Pretrain	Lr Schd	#params	FLOPs	box mAP	mask mAP
Focal-T	ImageNet-1K	1x	49M	291G	44.8	41.0
Focal-T	ImageNet-1K	3x	49M	291G	47.2	42.7
Focal-S	ImageNet-1K	1x	71M	401G	47.4	42.8
Focal-S	ImageNet-1K	3x	71M	401G	48.8	43.8
Focal-B	ImageNet-1K	1x	110M	533G	47.8	43.2
Focal-B	ImageNet-1K	3x	110M	533G	49.0	43.7

RetinaNet

Backbone	Pretrain	Lr Schd	#params	FLOPs	box mAP
Focal-T	ImageNet-1K	1x	39M	265G	43.7
Focal-T	ImageNet-1K	3x	39M	265G	45.5
Focal-S	ImageNet-1K	1x	62M	367G	45.6
Focal-S	ImageNet-1K	3x	62M	367G	47.3
Focal-B	ImageNet-1K	1x	101M	514G	46.3
Focal-B	ImageNet-1K	3x	101M	514G	46.9

Other detection methods

Backbone	Pretrain	Method	Lr Schd	#params	FLOPs	box mAP
Focal-T	ImageNet-1K	Cascade Mask R-CNN	3x	87M	770G	51.5
Focal-T	ImageNet-1K	ATSS	3x	37M	239G	49.5
Focal-T	ImageNet-1K	RepPointsV2	3x	45M	491G	51.2
Focal-T	ImageNet-1K	Sparse R-CNN	3x	111M	196G	49.0

Semantic Segmentation on ADE20K

Backbone	Pretrain	Method	Resolution	Iters	#params	FLOPs	mIoU	mIoU (MS)
Focal-T	ImageNet-1K	UPerNet	512x512	160k	62M	998G	45.8	47.0
Focal-S	ImageNet-1K	UPerNet	512x512	160k	85M	1130G	48.0	50.0
Focal-B	ImageNet-1K	UPerNet	512x512	160k	126M	1354G	49.0	50.5
Focal-L	ImageNet-22K	UPerNet	640x640	160k	240M	3376G	54.0	55.4

Getting Started

Please follow get_started_for_image_classification.md to get started for image classification.
Please follow get_started_for_object_detection.md to get started for object detection.
Please follow get_started_for_semantic_segmentation.md to get started for semantic segmentation.

Citation

If you find this repo useful to your project, please consider to cite it with following bib:

@misc{yang2021focal,
    title={Focal Self-attention for Local-Global Interactions in Vision Transformers}, 
    author={Jianwei Yang and Chunyuan Li and Pengchuan Zhang and Xiyang Dai and Bin Xiao and Lu Yuan and Jianfeng Gao},
    year={2021},
    eprint={2107.00641},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgement

Our codebase is built based on Swin-Transformer. We thank the authors for the nicely organized code!

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

focal-transformer's People

Contributors

Stargazers

Watchers

Forkers

jwyang neudeep qiliangfan objectdetection jireh-father classicvalues yzhong22 eriksennema trendingtechnology sibtainrazajamali snapbuy mrcodechef ai-machine-vision-lab stjordanis peternara ricklentz ajinkyapuar ruifenggong standardgalactic brunotech bweng001 gyq716 johnnyccw hengxyz zyg11 liu4lin kimjisoo12 hiyyg zhangxiruiseu duanweiwe dumpmemory sehun0124 shellysheynin xingshulicc zhangruibo1 kx-zk houxiaonan crashmoon gstoica27 mldl hwijune jlqzzz happyboy1234 python-repository-hub junhsss tinyloop aoteman233 bueeciel9 ts0923 yutao1008 xiaoqiangzhou uestcwcw alvacv cwwangsaratr alsiatang lucainiaoge yanasong9

focal-transformer's Issues

star

good job

Can you give me the segmentation model，thanks

Inference of Instance Segmentation Model

Hey, great work!
How can I run inference with your instance segmentation model?

Welcome update to OpenMMLab 2.0

I am Vansin, the technical operator of OpenMMLab. In September of last year, we announced the release of OpenMMLab 2.0 at the World Artificial Intelligence Conference in Shanghai. We invite you to upgrade your algorithm library to OpenMMLab 2.0 using MMEngine, which can be used for both research and commercial purposes. If you have any questions, please feel free to join us on the OpenMMLab Discord at https://discord.gg/amFNsyUBvm or add me on WeChat (van-sin) and I will invite you to the OpenMMLab WeChat group.

Here are the OpenMMLab 2.0 repos branches:

	OpenMMLab 1.0 branch	OpenMMLab 2.0 branch
MMEngine		0.x
MMCV	1.x	2.x
MMDetection	0.x 、1.x、2.x	3.x
MMAction2	0.x	1.x
MMClassification	0.x	1.x
MMSegmentation	0.x	1.x
MMDetection3D	0.x	1.x
MMEditing	0.x	1.x
MMPose	0.x	1.x
MMDeploy	0.x	1.x
MMTracking	0.x	1.x
MMOCR	0.x	1.x
MMRazor	0.x	1.x
MMSelfSup	0.x	1.x
MMRotate	1.x	1.x
MMYOLO		0.x

Attention: please create a new virtual environment for OpenMMLab 2.0.

Can't download ImageNet-1k pretrained weights

Hi, the links to download the models seems to be inactive

This XML file does not appear to have any style information associated with it. The document tree is shown below. <Error> <Code>PublicAccessNotPermitted</Code> <Message> Public access is not permitted on this storage account. RequestId:c571d73d-401e-0050-1e5c-166149000000 Time:2023-11-13T18:11:05.4831543Z </Message> </Error>

FocalTransformerV2 not using expand_size

The paper (and focalTv1) implements a window size of 7 with expand size of 3 to match a 13x13 zone, in your implementation in v2 expand_size is never used (even if it is declared), I believe that it is because you replaced it by topK closest position (which represent a zone of sqrt(128)xsqrt(128)) in your config. Am I right ? And you are then projecting the topK coordinates using a Linear layer, right ? Have you tested the model with this configuration ?

about detection and segmentation

Hi Im a big fan of focal-transformer..
I wonder when are detection and segmentation updated?

Please let me know the approximate schedule.
Thanks.

how can i extract features ?

Link error

https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-tiny-is224-ws7.pth

ResourceNotFound
The specified resource does not exist. RequestId:933ca8ee-601e-0028-4af4-f455bc000000 Time:2022-11-10T11:05:52.4511547Z

How the load the pretrained model on classification when trained on segmentation?

The positional embedding is related to the input size, is it possible to load the pretrained models on classification? By the way, it seems that torch.unfold is faster than torch.roll on GPUs.

Relationship between focal window size and focal region size

I was confused by the relationship between focal window size and focal region size. Can you explained it more clearly? Take stage 1 level 0 as an example, sw=1, sr = 13, sw*sr could not be divided by the Ouput Size 56. I could not understand why sr is 13. Thanks a lot if you could help me.

Link expires？

Link expires？https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-base-useconv-is224-ws7.pth
Can you reupload this model？

Your computational complexity is much higher than swin transformer

I take swin transformer to test a segmentation and compare you. Found you twice as long as swin?

Focal Transformer on 1-D Data

Great Work !

My query is that : can this method be run on 1-D features ? Because the pooling operations done before flattening is done for 2-D representation !

Some problems with the reproduction process about focal block

Hello, first of all, thank you for providing the Focal Transformer module,

I have some questions:

The image you are processing is 224224 and window size = 7. I wonder if it is reasonable for me to change the window size to 8 in order to make it divisible when I input 512512 at the beginning.
Since you didn't give the segmentation code, IT seems to me that you set focal level as 2 in your demo. Should I set level values as 1, 2 and 3 respectively when processing? To find the best fit.
As for num_heads, I see that you are patch_embeding, and the number of channels becomes 96. Then in focal attention,num_heads = 2 in order to divide evenly.
Suppose that the size of the image I input is 3232, since patch_size = 4, the length of the vector is 64, then the size I input should increase successively, such as 6464, 128*128. Should I also increase ptach_size so that the final vector length is still 64?

If you can help me, I will be very grateful to you

Hugging Face Hub Integration

Hi there!

Focal Transfomers look very interesting! I see you currently save your model checkpoints through links to a hosted server. Would you be interested in sharing the pretrained models in the Hugging Face Hub through the Microsoft organization?

The Hub offers free hosting of over 20K models, and it would make your work more accessible and visible to the rest of the community. Some of the benefits of sharing your models would be:

versioning
commit history and diffs
branches
repos provide useful metadata about their tasks, languages, metrics, etc

Creating the repos and adding new models should be a relatively straightforward process if you've used Git before. This is a step-by-step guide explaining the process in case you're interested. Please let us know if you would be interested and if you have any questions.

Happy to hear your thoughts,
Omar and the Hugging Face team

cc @LysandreJik

Inference on CPU

Can Focal-Transformer inference on CPU? Or it is a GPU only model?

Pretrain Model IN 22k

Hi there!
I am very interested in your work! Could you provide the pretrain model for Image Classification on ImageNet-22Kt for l so that we could make better use of your model for our own datasets and get better results?

about pool_method

Hi,I want to know doing sub-windows pooling，If I want to follow the sub-window-pooling method introduced in the paper, which pooling method should I select.

thank you very much

num_heads value

hi,I'm sorry to bother you . I would like to know whether the num_heads values in each of the four stages in your code are the same, or whether the num_heads values in each stage are fixed or related to something else

How to get q,k,v?

In your code, I do not understand how to get q, k, v from x and x_pooled. I have been confused by the roll operation on k_windows and the unfold operation on k_pooled_k for several days. Take stage 1 as an example, for level 0, since the sw is 1, I think the sr should be window_size//sw, that is 7. And for level 1, since the sw is 7, I think the sr should be output_size//sw, that is 8. Therefore the number of k is 7*7+8*8 = 113. But in your paper, you set sr as 13 at lavel 0 , sr as 7 at level 1. Why 13 and 7? And in your code, the number of k is 7*7+4*7*7-4*(7-3)*(7-3)+7*7=230, which is different from 7*7+13*13=218. As a suggestion, the window attention should be writen more clearly, and more comments are need in your code. Thanks a lot. If there is something wrong with what I said, please forgive me.

Confusion about window size at different focal level

Thanks for your great work!
But I don't understand the meaning of the following code (class FocalTransformerBlock in ./classification/focal_transformer.py) at Link

            for k in range(self.focal_level-1):     
                window_size_glo = math.floor(self.window_size_glo / (2 ** k))
                pooled_h = math.ceil(H / self.window_size) * (2 ** k)
                pooled_w = math.ceil(W / self.window_size) * (2 ** k)
                H_pool = pooled_h * window_size_glo
                W_pool = pooled_w * window_size_glo

I guss the purpose of this is to make the H and W of x_level_k a multiple of window size and to facilitate the pool operation. But how does this actually work?

Why calculate window_size_glo?
What is meaning of pooled_h ? (math.ceil(H / self.window_size) does not change over iteration)
Why not change window_size over iteration directly?

Could you please explain to me in details?
Thanks in advance.