kevinz-code / csra
Official code of the ICCV 2021 paper "Residual Attention: A Simple but Effective Method for Multi-Label Recognition"
License: GNU Affero General Public License v3.0
Thank you for your excellent work; it has benefited me a lot. Is there any code for running prediction/inference?
@Kevinz-code thanks for providing the source code and great work. I have a few queries, which are mentioned below.
Please do share your thoughts. Thanks in advance.
Thanks for your good work. When I combine MHA with ASL (https://github.com/Alibaba-MIIL/ASL), I find that the result decreases. Did you try combining your work with ASL?
Thanks for your code. I just wonder why we need to normalize the 'classifier' before the 'flatten' op. Does it perform better than without normalizing?
Thank you for your explanation.
Hello, this is great work! Have you tried using the CSRA module with a ResNet-50 backbone? If so, did it improve performance?
I'm trying to implement CSRA using MobileNet as the backbone, but I'm running into some trouble. This is kind of related to #5.
First of all, from the paper it was not clear to me whether CSRA is to be applied before, after or instead of the classifier.
Now, I have a question: which version of MobileNet was CSRA implemented with? In my case, I'm trying to use MobileNetV3-Large. (The paper states it is MobileNetV2.)
In my use case, I would like to use the MobileNetV3 classification head, except with a different number of target classes. Where is CSRA supposed to be placed?
This is the structure of the MobileNetV3 classifier:
Is CSRA supposed to replace the Avg Pool on the (7,7,960) tensor? To replace the 1x1 Conv after the (1,1,1280) tensor? Or to take place after the last 1x1 Conv?
I think most of the confusion comes from Fig 1 and Fig 2 in the CSRA paper.
In Fig 1, the output of the backbone is run through the classifier, then through CSRA. It is stated that Fig 1 is a special case of CSRA, but it still remains confusing.
In Fig 2, f seems to act directly as the class scores, while the text preceding Eq. 6 states: "Finally, all these class-specific feature vectors are sent to the classifier to obtain the final logits." It is not clear in Fig 2 that the result of the CSRA module is sent to the classifier, which adds to the confusion about where the CSRA module is supposed to be placed.
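For what it is worth, here is a minimal sketch of one possible placement, assuming CSRA replaces the global average pooling plus the final classifier and consumes the backbone feature map directly. The simplified single-head module, the lam/T values, and the torchvision calls below are my own illustration (the repo's classifier-weight normalization is omitted), not this repo's code:

import torch
import torch.nn as nn
from torchvision import models

class SimpleCSRAHead(nn.Module):
    # Simplified single-head CSRA: per-location class scores, then an
    # average-pooled branch plus an attention-pooled branch.
    def __init__(self, d, num_classes, lam=0.1, T=1.0):
        super().__init__()
        self.head = nn.Conv2d(d, num_classes, kernel_size=1, bias=False)
        self.lam = lam
        self.T = T

    def forward(self, x):                        # x: (B, d, H, W)
        score = self.head(x).flatten(2)          # (B, C, H*W) per-location class scores
        base_logit = score.mean(dim=2)           # global-average branch
        att = torch.softmax(score * self.T, dim=2)
        att_logit = (att * score).sum(dim=2)     # attention-pooled branch
        return base_logit + self.lam * att_logit

backbone = models.mobilenet_v3_large(weights="IMAGENET1K_V1").features   # outputs (B, 960, 7, 7)
model = nn.Sequential(backbone, SimpleCSRAHead(d=960, num_classes=20))
logits = model(torch.randn(1, 3, 224, 224))      # (1, 20)

Under this reading, CSRA takes the place of the Avg Pool on the (7,7,960) tensor and of the classifier, rather than being appended after them.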
Thanks for your excellent work!
I noticed you use ViT-L16-224 to train on COCO, with an input size of 448.
May I ask how to use the ViT-224 pretrained weights with a 448 input size?
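In case it helps, the usual trick (not necessarily what this repo does) is to bilinearly interpolate the patch position embeddings from the 14x14 grid of a 224 model to the 28x28 grid of a 448 input, keeping the class-token embedding as-is; the function below is only an illustrative sketch. With 16x16 patches, a 448 input gives 784 patches plus the class token, i.e. 785 positions.

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=28):
    # pos_embed: (1, 1 + old_grid*old_grid, D), class token first
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    d = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bilinear", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_tok, patch_pos], dim=1)

new_pos = resize_pos_embed(torch.randn(1, 197, 1024))    # (1, 197, 1024) -> (1, 785, 1024)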
Hello, by combining the code and your paper, I have the following questions (about vit_csra):
In the code, the class token is not used as input to the last CSRA module, so why is the class token still set up in the "VIT_CSRA" code?
Has the last MLP head used for classification in the Vision Transformer been deleted directly?
Hi, thanks for your excellent work! But I'm confused about the details of the baseline-model settings in your paper.
Take training resnet-101 without cutmix on coco2014 as an example:
With the following training configurations as baseline setting, I get 81.3 mAP after 7 epochs (30 in total, still in training process...), which is much higher than that in your paper (79.4 mAP).
python main.py --num_heads 4 --lam 0 --dataset coco --num_cls 80 --checkpoint coco14/resnet101
So, what is the correct settings to reproduce the baseline result as in your paper? Thanks again.
I really appreciate your help!
Hi Kevinz, thanks for your awesome work. I'd like to do a visual analysis to get a better understanding of the CSRA. Could you please give me some advice on how to visualize the attention score (or heatmap, attention image)? Thank you very much!
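Not the authors' answer, but one simple way to get a per-class heatmap is to take the per-location class score map (the classifier applied to the backbone feature map before any pooling), upsample it to the image size, and overlay it on the input. The names backbone and classifier_weight below are placeholders for whatever the actual modules are called in this repo:

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def class_heatmap(backbone, classifier_weight, image, cls_idx, img_size=448):
    # backbone(image): (1, d, H, W); classifier_weight: (C, d)
    with torch.no_grad():
        feat = backbone(image)
        score = torch.einsum("cd,bdhw->bchw", classifier_weight, feat)
    heat = score[0, cls_idx]                                   # (H, W) spatial scores for one class
    heat = F.interpolate(heat[None, None], size=(img_size, img_size),
                         mode="bilinear", align_corners=False)[0, 0]
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
    plt.imshow(heat.cpu(), cmap="jet")                         # overlay on the input image in practice
    plt.axis("off")
    plt.show()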
According to the paper, the base_logit (denoted as g in the paper) should be computed from the global feature vector obtained by averaging the features over all spatial locations. This is stated in the equation:
g = (1/49) * sum_{k=1..49} x_k
Here, x_k represents the feature at location k, and we sum over all 49 locations and take the average. This operation is class-agnostic, meaning it is not specific to any class and is the same for all classes. The global feature vector g represents the overall content of the image, irrespective of specific classes, and serves as a baseline representation of the image content.
In my implementation I have this:
def forward(self, x):
    B, _, H, W = x.size()                    # batch size, _, height, width
    # Compute class-specific attention scores
    logits = self.classifier(x)              # size: (B, C, H, W)
    logits = logits.view(B, self.C, -1)      # size: (B, C, H*W)
    # Compute class-specific feature vectors
    x_flatten = x.view(B, self.d, -1)        # size: (B, d, H*W)
    # Compute global feature vector
    g = torch.mean(x_flatten, dim=2)         # size: (B, d)
I am computing base_logit (or g) as per the paper’s method. The original implementation seems to be computing something different for base_logit, which doesn’t align with the paper’s description. It’s computing the average class-specific score for each class across all spatial locations, which is not what g represents according to the paper.
This is the original implementation:
def forward(self, x):
    # x: (B, d, H, W)
    # normalize the classifier weights, then compute per-location class scores
    score = self.head(x) / torch.norm(self.head.weight, dim=1, keepdim=True).transpose(0, 1)
    score = score.flatten(2)                 # score: (B, C, H*W)
    base_logit = torch.mean(score, dim=2)    # size: (B, C)
Is there a reason for this?
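A possible explanation (my own reasoning, not the authors'): because the classifier is a 1x1 convolution, i.e. linear, averaging the per-location class scores is the same as applying the classifier to the averaged feature g, so the two formulations coincide apart from the extra weight normalization in the repo. A quick numerical check with made-up shapes:

import torch
import torch.nn as nn

B, d, H, W, C = 2, 2048, 7, 7, 80                       # illustrative shapes only
x = torch.randn(B, d, H, W)
head = nn.Conv2d(d, C, kernel_size=1, bias=False)

# Paper's view: classify the global (spatially averaged) feature g.
g = x.flatten(2).mean(dim=2)                            # (B, d)
logit_from_g = g @ head.weight.view(C, d).t()           # (B, C)

# Repo's view: average the per-location class scores.
logit_from_scores = head(x).flatten(2).mean(dim=2)      # (B, C)

print(torch.allclose(logit_from_g, logit_from_scores, atol=1e-4))   # expect True up to float error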
Hi,
thank you so much for your great work. I'm doing a project on multi-label classification, so I wonder how I can apply your pretrained model for image feature extraction. What I need is to extract the features of an image. Could you please give me some hints?
Best regards,
Hui
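Not an official recipe, but one straightforward option is to keep only the backbone (everything before the CSRA head) and use the spatially averaged feature map as the image descriptor; the ResNet-101 surgery below is a generic torchvision sketch, not this repo's API:

import torch
from torchvision import models

resnet = models.resnet101()
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool and fc
# optionally load the matching backbone weights from a CSRA checkpoint here

with torch.no_grad():
    feat_map = backbone(torch.randn(1, 3, 448, 448))             # (1, 2048, 14, 14)
    feature = feat_map.flatten(2).mean(dim=2)                     # (1, 2048) image feature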
Can I use this repo on a dataset with partial labels?
Thanks for sharing your code. I have some questions about your project.
In val.py, the definition of the following argument is empty:
parser.add_argument("--load_from", default="models_local/resnet101_voc07_head1_lam0.1_94.7.pth", type=str)
And I cannot find the code that saves a model with this name.
How do I use val.py in your project? And could you explain the model save path more clearly?
Hi,
I have two questions:
First: why didn't you use k-fold cross-validation?
Second: why do you use a different learning rate for the classifier? Is it for faster convergence?
I am trying to adapt CSRA to EfficientNet-B3 on my multi-label dataset. Although I have tried various numbers of heads and lambda values, I am getting worse results than the baseline model. What is your opinion? Is there something else I should try?
Also, there is class imbalance in my dataset. Do I need data augmentation to counter the class imbalance? Is CSRA a method affected by data augmentation?
Thanks
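Regarding the imbalance question above: augmentation applied uniformly does not change the class ratios; one common alternative is to oversample images that contain rare classes. A minimal, self-contained sketch with dummy data (not this repo's pipeline):

import torch
from torch.utils.data import TensorDataset, WeightedRandomSampler, DataLoader

images = torch.randn(1000, 3, 64, 64)                     # dummy images
labels = (torch.rand(1000, 20) > 0.9).float()             # dummy multi-hot labels, 20 classes

class_freq = labels.sum(dim=0).clamp(min=1)               # positives per class
sample_weight = (labels / class_freq).sum(dim=1) + 1e-3   # images with rare classes weigh more
sampler = WeightedRandomSampler(sample_weight, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(images, labels), batch_size=16, sampler=sampler)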
I want to run visualize.py, but the cam_all parameter is not defined. How can I run this file to produce visualizations?
Hello, how do I use CSRA with ViT? There is no specific information in the paper.
When I try to load the "vit_L16_224_coco_head8_86.5.pth" model in val.py, I get the following error.
Error(s) in loading state_dict for VIT_CSRA:
Missing key(s) in state_dict: "classifier.multi_head.0.head.weight".
Unexpected key(s) in state_dict: "head1.weight", "head1.bias", "head2.weight", "head2.bias", "head3.weight", "head3.bias", "head4.weight", "head4.bias", "head5.weight", "head5.bias", "head6.weight", "head6.bias", "head7.weight", "head7.bias", "head8.weight", "head8.bias", "head.weight", "head.bias".
size mismatch for pos_embed: copying a param with shape torch.Size([1, 785, 1024]) from checkpoint, the shape in current model is torch.Size([1, 197, 1024]).
File "C:\Users\osivaz61\Desktop\projects\python\retina\diseaseDetection\CSRA-master\val.py", line 79, in main
model.load_state_dict(torch.load(args.load_from))
File "C:\Users\osivaz61\Desktop\projects\python\retina\diseaseDetection\CSRA-master\val.py", line 97, in
main()
As far as I understand, the vit_csra.py file is not updated. Can you share the updated code?
Thanks
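One way to see exactly how the released checkpoint differs from the current VIT_CSRA definition is to diff the two state dicts; in the sketch below, model is assumed to be the VIT_CSRA instance built in val.py, so the snippet only runs inside that script:

import torch

ckpt = torch.load("vit_L16_224_coco_head8_86.5.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt)
model_state = model.state_dict()

print(sorted(set(state) - set(model_state)))          # keys only in the checkpoint
print(sorted(set(model_state) - set(state)))          # keys only in the model
for k in set(state) & set(model_state):
    if state[k].shape != model_state[k].shape:
        print(k, tuple(state[k].shape), "vs", tuple(model_state[k].shape))

The pos_embed mismatch (785 vs 197 positions) also suggests the checkpoint was trained at a 448 input, so the positional embedding has to be interpolated, or the model built for 448, before loading.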
Why should the provided model be run through val.py?
main.py already includes the test file; does its reported test result represent the real result, or should the test result be produced by val.py?
I am waiting for your reply.
According to formula 5 and formula 6 in the paper, the class-specific residual attention (CSRA) feature f should be sent to the classifier to obtain the final logits, but in your code you use f as the final logits. What is the difference?
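A possible reading (my own, not an official answer): because the classifier is linear, applying it after the attention pooling is the same as pooling the per-location scores. Writing s^c_k = w_c^T x_k for the score of class c at location k, and taking the c-th logit from f^c = g + lambda * a^c as w_c^T f^c (as Fig. 2 suggests), we get
w_c^T f^c = w_c^T g + lambda * sum_k softmax(s^c)_k * (w_c^T x_k) = base_logit_c + lambda * sum_k softmax(s^c)_k * s^c_k,
which is exactly the quantity the code pools directly from the score maps, so using the pooled scores as the final logits appears consistent with formulas 5 and 6.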
Hello, I'm training VIT_B and get the above error. It is probably because the required input size for ViT is 224 while the dataset used is wider_attribute. How should I modify it?
Hi Kevinz, thanks for your awesome work. I want to know whether you plan to release the weights of the ViT that was finetuned on the MS-COCO dataset?