
[CVPR'23] Hard Patches Mining for Masked Image Modeling

Home Page: https://arxiv.org/pdf/2304.05919.pdf

License: Apache License 2.0

ade20k coco computer-vision cvpr2023 imagenet masked-autoencoder masked-image-modeling pattern-recognition self-supervised-learning unsupervised-learning

hpm's Introduction

Hard Patches Mining for Masked Image Modeling

Official Implementation of our paper "Hard Patches Mining for Masked Image Modeling", in CVPR 2023.

by Haochen Wang, Kaiyou Song, Junsong Fan, Yuxi Wang, Jin Xie, and Zhaoxiang Zhang

[arXiv] [Paper]

🔔 🔔 🔔 An extension of this paper is available at [arXiv], where we successfully adapt HPM to masked video modeling benchmarks with almost no modifications! The code will be released soon.

Notes

ImageNet-1K Pretrain: See PRETRAIN.md.
ImageNet-1K Finetune: See FINETUNE.md.

Motivation

Abstract. Masked image modeling (MIM) has attracted much research attention due to its promising potential for learning scalable visual representations. In typical approaches, models usually focus on predicting the specific contents of masked patches, and their performance is highly dependent on pre-defined mask strategies. Intuitively, this procedure can be considered as training a student (the model) to solve given problems (predicting masked patches). However, we argue that the model should not only solve given problems, but also stand in the shoes of a teacher and produce more challenging problems by itself. To this end, we propose Hard Patches Mining (HPM), a brand-new framework for MIM pre-training. We observe that the reconstruction loss can naturally serve as a metric of the difficulty of the pre-training task. Therefore, we introduce an auxiliary loss predictor, which first predicts patch-wise losses and then decides where to mask next. It adopts a relative relationship learning strategy to prevent overfitting to exact reconstruction loss values. Experiments under various settings demonstrate the effectiveness of HPM in constructing masked images. Furthermore, we empirically find that solely introducing the loss prediction objective already leads to powerful representations, verifying the efficacy of being aware of which patches are hard to reconstruct.
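As a rough illustration of the two ideas above (an auxiliary loss predictor trained with a relative objective, and masking guided by predicted difficulty), a minimal PyTorch sketch is given below. This is not the repository's implementation: names such as `pred_loss`, `recon_loss`, and the `hard_fraction` schedule knob are hypothetical, and both tensors are assumed to have shape `(batch, num_patches)`.

```python
# Minimal sketch, not the official HPM code. Assumes per-patch reconstruction
# losses `recon_loss` from the MIM decoder and per-patch predictions `pred_loss`
# from an auxiliary loss-prediction head; both of shape (batch, num_patches).
import torch
import torch.nn.functional as F


def relative_loss_objective(pred_loss, recon_loss):
    """Ranking-style objective: the predictor only needs to order patches by
    difficulty rather than regress the exact reconstruction loss values."""
    # Pairwise differences between all patch pairs within each image: (B, N, N).
    diff_pred = pred_loss.unsqueeze(2) - pred_loss.unsqueeze(1)
    diff_true = recon_loss.unsqueeze(2) - recon_loss.unsqueeze(1)
    # Target is 1 where patch i truly has a larger loss than patch j.
    target = (diff_true > 0).float()
    # Logistic loss on the predicted differences (RankNet-style formulation);
    # self-pairs on the diagonal only contribute a constant.
    return F.binary_cross_entropy_with_logits(diff_pred, target)


def difficulty_guided_mask(pred_loss, mask_ratio=0.75, hard_fraction=0.5):
    """Build a binary mask (True = masked). A `hard_fraction` of the mask budget
    goes to the patches predicted hardest; the rest is filled at random, so
    `hard_fraction` can be ramped up as an easy-to-hard schedule."""
    B, N = pred_loss.shape
    num_mask = int(N * mask_ratio)
    num_hard = int(num_mask * hard_fraction)

    mask = torch.zeros(B, N, dtype=torch.bool, device=pred_loss.device)
    # 1) Mask the patches with the highest predicted loss.
    hard_idx = pred_loss.topk(num_hard, dim=1).indices
    mask.scatter_(1, hard_idx, True)
    # 2) Fill the remaining budget with uniformly random visible patches.
    for b in range(B):
        visible = (~mask[b]).nonzero(as_tuple=False).squeeze(1)
        perm = visible[torch.randperm(visible.numel(), device=pred_loss.device)]
        mask[b, perm[: num_mask - num_hard]] = True
    return mask
```

The sketch uses a pairwise logistic (ranking) loss as one plausible reading of the "relative relationship learning strategy"; the actual objective and masking schedule in the repository may differ.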

Method

Results

| Method | Model    | PT Epochs | Top-1 Acc.  | Checkpoint     | mIoU        |
| ------ | -------- | --------- | ----------- | -------------- | ----------- |
| MAE    | ViT-B/16 | 200       | 82.2        | --             | 40.5        |
| HPM    | ViT-B/16 | 200       | 83.0 (+0.8) | --             | 42.1 (+1.6) |
| MAE    | ViT-B/16 | 1600      | 83.6        | --             | 48.1        |
| HPM    | ViT-B/16 | 800       | 84.2 (+0.6) | [Google Drive] | 48.5 (+0.4) |
| MAE    | ViT-L/16 | 1600      | 85.1        | --             | 53.6        |
| HPM    | ViT-L/16 | 800       | 85.8 (+0.7) | [Google Drive] | 54.6 (+1.0) |

Acknowledgement

The pretraining and finetuning of our project are based on DeiT, MAE and UM-MAE. The linear probing is based on MAE. The kNN classification is based on DINO. Thanks for their wonderful work.

For object detection and semantic segmentation, please refer to Detectron2 and MMSegmentation, respectively. The corresponding configurations can be found here (detection) and here (segmentation).

License

This project is released under the Apache License 2.0. See LICENSE for details.

Citation

@inproceedings{wang2023hard,
  author    = {Wang, Haochen and Song, Kaiyou and Fan, Junsong and Wang, Yuxi and Xie, Jin and Zhang, Zhaoxiang},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  title     = {Hard Patches Mining for Masked Image Modeling},
  year      = {2023},
}

hpm's People

Contributors

haochen-wang409


hpm's Issues

question about COCO detection

Hi, nice to read such an interesting paper. The pretraining code is well-detailed, but this repository lacks information regarding COCO detection downstream finetuning. I am curious about how to conduct COCO detection finetuning after MAE pretraining. Could you provide more details about the detection architecture implementation and experiment config?

Question about heatmap

Hello, it is very nice to read this paper. I am conducting research on 'learn where to mask' just like you. Could you please explain how the heatmap in your paper is generated and how to convert weights to RGB for the heatmap? I have tried several approaches, but none of them look as visually appealing as yours. Thank you very much.

Questions Regarding Pretraining Experiment Configuration

I have some questions about the ImageNet-1K pretraining configuration for the HPM project, and would appreciate more detail to better understand the project's performance and resource requirements.

1. How many GPUs were used during pretraining?
2. Please provide detailed information about the GPU models or specifications used for pretraining.
3. What was the total training time for pretraining?
4. What was the GPU memory consumption per card during pretraining?

input mismatch of loss predictor

Hi,

Very interesting paper. I have a question regarding the implementation.

In your paper, the teacher model predicts the loss for each patch based on the fully visible image, while the student model learns to predict the losses of masked patches based on the unmasked patches. Is there a mismatch between training and inference of the loss predictor?

Request for certain experimental matters

Hi, thanks for your great work, I have some experimental matters here.

  1. For the ViT-Base HPM pre-training config, the paper only provides settings for 200 epochs, and I'm not sure whether pretrain_base.sh in this repository is for 800 epochs. Could you provide configs for 800 and 1600 epochs?
  2. Why is the optimizer for the linear probing experiments different from the original LARS optimizer in MAE?
  3. Have you tested different batch sizes during pre-training? Does batch_size significantly impact model performance?

Looking forward to hearing back from you soon. Thank you!

Availability of Pre-trained Models

I am interested in your project and keen on experimenting with it for my research. I was wondering if you have any plans to provide pre-trained models that users can use directly without going through the training process themselves. Access to pre-trained models would significantly reduce the time and resources required to get started with the project.

Thank you for considering this request. Looking forward to your response.

Question about pretrain models on 800 epochs

Hello, nice to read this outstanding paper. I have some questions about the settings used when pretraining for 800 epochs: are they the same as the 200-epoch settings in Tables S1 and S2 (e.g., pretrain weight decay 0.05, pretrain layer-wise lr decay 0.8, 100 finetuning epochs)?

Confused about "model_teacher" in pre-training code

Hello,

Thank you for your great work. I was going through main_pretrain.py and noticed that you have three models: model_teacher, model, and model_ema. From your paper I understand that two models take part in the pre-training process: one model (model?) and an EMA-based model (model_ema?). So my question is: what role does model_teacher play?

Thank You!
