
mae's Introduction

Masked Autoencoders: A PyTorch Implementation

This is a PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners:

@Article{MaskedAutoencoders2021,
  author  = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  journal = {arXiv:2111.06377},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  year    = {2021},
}
  • The original implementation was in TensorFlow+TPU. This re-implementation is in PyTorch+GPU.

  • This repo is a modification on the DeiT repo. Installation and preparation follow that repo.

  • This repo is based on timm==0.3.2, for which a fix is needed to work with PyTorch 1.8.1+ (a sketch of the commonly used patch is shown below).
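As a hedged sketch, assuming the incompatibility is the widely reported removal of torch._six.container_abcs (an assumption on my part, not stated in this README), the commonly circulated patch edits timm/models/layers/helpers.py roughly as follows:

    # hedged sketch of the commonly used patch to timm/models/layers/helpers.py
    # assumption: timm 0.3.2 imports container_abcs from torch._six, which newer
    # PyTorch releases no longer provide
    import torch

    TORCH_MAJOR = int(torch.__version__.split('.')[0])
    TORCH_MINOR = int(torch.__version__.split('.')[1])

    if TORCH_MAJOR == 1 and TORCH_MINOR < 8:
        from torch._six import container_abcs
    else:
        import collections.abc as container_abcs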

Catalog

  • Visualization demo
  • Pre-trained checkpoints + fine-tuning code
  • Pre-training code

Visualization demo

Run our interactive visualization demo using a Colab notebook (no GPU is needed).

Fine-tuning with pre-trained checkpoints

The following table provides the pre-trained checkpoints used in the paper, converted from TF/TPU to PT/GPU:

                         ViT-Base   ViT-Large   ViT-Huge
pre-trained checkpoint   download   download    download
md5                      8cad7c     b8b06e      9bdbb0
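To sanity-check a download against the md5 prefixes above, a minimal sketch (the file name below is a placeholder for whichever checkpoint you downloaded):

    # compute the md5 of a downloaded checkpoint and compare its prefix with the table
    import hashlib

    path = 'mae_pretrain_vit_base.pth'   # placeholder: use whichever file you downloaded
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):   # read in 1 MB chunks
            md5.update(chunk)
    print(md5.hexdigest()[:6])   # e.g. should start with 8cad7c for ViT-Base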

The fine-tuning instruction is in FINETUNE.md.

By fine-tuning these pre-trained models, we rank #1 in these classification tasks (detailed in the paper):

                                   ViT-B   ViT-L   ViT-H   ViT-H448   prev best
ImageNet-1K (no external data)     83.6    85.9    86.9    87.8       87.1

The following are evaluations of the same model weights (fine-tuned on the original ImageNet-1K):
ImageNet-Corruption (error rate)   51.7    41.8    33.8    36.8       42.5
ImageNet-Adversarial               35.9    57.1    68.2    76.7       35.8
ImageNet-Rendition                 48.3    59.9    64.4    66.5       48.7
ImageNet-Sketch                    34.5    45.3    49.6    50.9       36.0

The following are transfer learning results, obtained by fine-tuning the pre-trained MAE on the target dataset:
iNaturalist 2017                   70.5    75.7    79.3    83.4       75.4
iNaturalist 2018                   75.4    80.1    83.0    86.8       81.2
iNaturalist 2019                   80.5    83.4    85.7    88.3       84.1
Places205                          63.9    65.8    65.9    66.8       66.0
Places365                          57.9    59.4    59.8    60.3       58.0

Pre-training

The pre-training instruction is in PRETRAIN.md.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

mae's People

Contributors

endernewton


mae's Issues

Different masking strategies

Thank you so much for releasing the code.

In the paper, there are experiments with different masking strategies such as random, block, and grid. The code implementation includes only the random strategy. Are there any plans to also release the code for the other strategies?

mae_visualize models vs mae_pretrain_full models

Hello,

thank you for the great work and the great repo. I was playing with different pre-trained models for visualization. When I use the mae_visualize_vit_base.pth checkpoint, I get reconstruction results like those in the demo and the paper, such as below:
[screenshot: reconstructions with mae_visualize_vit_base.pth]

However, when I use the mae_pretrain_vit_base_full.pth checkpoint, the results are as below:
[screenshot: reconstructions with mae_pretrain_vit_base_full.pth]

mask_ratio=0.75 for both results.
So here are my questions:

  1. Can you please clarify the difference between the visualize and full checkpoints, and why the results look worse with the full checkpoint?
  2. If I want to fine-tune an MAE model (both the encoder and decoder) for reconstruction on a custom dataset, which checkpoint is recommended?

I would appreciate it if you could help me with these questions.

fine-tuned checkpoint

Hello, do you by any chance have a checkpoint for the model fine-tuned on ImageNet, rather than just the pre-trained one?

Sample log files / Tensorboard summaries

It would be great to publish the log files generated by the pre-training/fine-tuning/linear-probe runs, or their TensorBoard summary files (or screenshots), for reference.

More concretely, I am trying to replicate the base-config pre-training and linear probe as closely as possible in JAX, but I am getting 60% accuracy, which is much lower than the expected 68%. Inspecting these logs/TB summaries would help me find the possible discrepancies.

This is the loss curve for pixel prediction (not normalized): [plot omitted]

And these are the linear probe results: [plot omitted]

Does this implementation support timm > 0.3.2?

Hi,

Thanks for releasing this awesome work!

I noticed that the code checks the timm version:

assert timm.__version__ == "0.3.2" # version check

Does this implementation support a newer timm, e.g., 0.4.12?

Thanks!

Best,
Vera
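For reference, one unofficial workaround is to relax the hard version pin so a newer timm can at least be tried; this is an assumption on my part, not something the repo supports, and API differences may still surface elsewhere:

    # hedged sketch: replace the exact-version assert with a minimum-version check
    import timm
    from packaging import version

    assert version.parse(timm.__version__) >= version.parse("0.3.2"), \
        "this repo was developed against timm 0.3.2"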

Problems reproducing ViT-Base

Thanks for your great work.
My reproduction result for ViT-B is only 83.3, which is 0.3 lower than the paper's result, and I have no idea what may cause this.
My experiment follows this repo exactly, except for the following changes:
a. I use 32 V100s with batch size 128 (32 x 128 = 4096). The recommended setting in this repo is 64 V100s with batch size 64 (64 x 64 = 4096).
b. I didn't use submitit_pretrain.py but directly used main_pretrain.py for multi-node training.

Are these two changes potential causes of the performance gap?
Any suggestions would be deeply appreciated~

Interpolation mode for downstream tasks

Hello, thank you for creating this awesome algorithm!
I have a problem applying MAE pre-trained weights to downstream tasks. Is the interpolation mode of the position encoding bicubic?
If I train Mask R-CNN, will the performance improve by simply replacing the DeiT weights with MAE weights?
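For context, here is a minimal sketch of how a square position-embedding grid is typically resized with bicubic interpolation when the input resolution changes; the repo ships a similar helper (interpolate_pos_embed in util/pos_embed.py), and the function below is an illustrative stand-in rather than the repo's code:

    import torch
    import torch.nn.functional as F

    def resize_pos_embed(pos_embed, old_grid, new_grid, num_extra_tokens=1):
        # pos_embed: (1, num_extra_tokens + old_grid**2, D), cls token(s) first
        extra = pos_embed[:, :num_extra_tokens]
        grid = pos_embed[:, num_extra_tokens:]
        dim = grid.shape[-1]
        grid = grid.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
        grid = F.interpolate(grid, size=(new_grid, new_grid),
                             mode='bicubic', align_corners=False)
        grid = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
        return torch.cat([extra, grid], dim=1)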

Mentioning a community tutorial

Hi @endernewton!

We were in touch via email regarding MAE as well. Thanks for your help there!

@ariG23498 and I implemented a minimal version of MAE in TensorFlow/Keras with the CIFAR-10 dataset, in the hope that it would benefit the community. We have open-sourced our experiments here: https://github.com/ariG23498/mae-scalable-vision-learners. We have also written an introductory blog post on it: https://keras.io/examples/vision/masked_image_modeling/.

We would really appreciate it if you could give it a mention from the README of this repository.

Questions on object detection model

Hi, thank you for releasing such wonderful work. I have a question regarding the modifications made to the ViT model to adapt it to FPN: as mentioned in the "Benchmarking Detection" paper, in addition to up/down-sampling operations (which are straightforward), the ViT attention is also modified to use local window attention, interleaved with 4 global attention blocks. In this case, how can a ViT pre-trained by MAE (which uses global attention throughout) be adapted to the new attention mechanism?

Thanks! I look forward to your reply.

Fine-Tuning in unsupervised way

Hello. After reading the paper, I wanted to train ViT-MAE to extract features in an unsupervised way (to do image clustering). After looking at the code, I have the impression that the fine-tuning script performs supervised classification.

So I would like to know how to load the pre-trained ViT-Base model and continue unsupervised training on my own dataset. Should I use submitit_pretrain.py or main_pretrain.py?

Sorry to ask such a basic question; I am new to PyTorch.

Thank you very much.
Respectfully

Include cross-entropy loss in pre-training?

According to the paper, the reconstruction loss is the only loss used in pre-training. However, I noticed that the cls token is appended to the decoder's input sequence but dropped after the decoder.

mae/models_mae.py

Lines 193 to 194 in 6a2ba40

# remove cls token
x = x[:, 1:, :]

I am wondering if you have tried including a cross-entropy loss between the prediction from the cls token and the ground-truth label during pre-training. Just like the NSP (next sentence prediction) task in BERT training, intuitively, including this cross-entropy loss would help the model learn more high-level information.
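To make the question concrete, here is a hedged sketch of what such an auxiliary loss could look like. Method names follow models_mae.py; the classification head, the 1:1 loss weighting, and the use of labels during pre-training are hypothetical additions, and the released MAE uses the reconstruction loss only.

    import torch
    import torch.nn as nn
    import models_mae  # from this repo

    model = models_mae.mae_vit_base_patch16()
    cls_head = nn.Linear(768, 1000)           # hypothetical auxiliary head (768 = ViT-Base dim)

    imgs = torch.randn(4, 3, 224, 224)
    labels = torch.randint(0, 1000, (4,))     # would require labels during pre-training

    latent, mask, ids_restore = model.forward_encoder(imgs, mask_ratio=0.75)
    pred = model.forward_decoder(latent, ids_restore)
    recon_loss = model.forward_loss(imgs, pred, mask)
    cls_loss = nn.functional.cross_entropy(cls_head(latent[:, 0]), labels)  # cls token
    loss = recon_loss + cls_loss              # hypothetical 1:1 weighting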

Replication of ImageNet Scratch Vit-Base and Vit-Large

Hi @KaimingHe,
thank you for your wonderful work and the initiative to open-source it.

I have been banging my head against this for a month now, and any help would be deeply appreciated!
I am trying to replicate the IN-1K from-scratch training of ViT-Base and ViT-Large, and so far I have failed. I get around 81.9 for ViT-Base (82.3 in the paper) and 82 for ViT-Large (82.6 in the paper) using the recipes mentioned below, both taken from here.

I also saw your reply in #28 and the FINETUNE.md page. I noticed that the supervised recipe in the paper is different from the fine-tuning one, hence the two different recipes I tried.

PyTorch version: 1.10.0
CUDA: 11.3
timm version: 0.5.0
Setup: single host with 8 GPUs; base lr = 1e-4, effective batch size 4096 => effective lr = 0.0016.
With the recipe mentioned in the paper (ViT-Base 82.3), I get 81.9:

            --grad_accum=4 \
            --batch-size=128 \
            --lr=0.0016 \
            --weight-decay=0.30 \
            --opt="AdamW" \
            --opt-betas 0.9 0.95 \
            --sched="cosine" \
            --warmup-epochs=20 \
            --epochs=300 \
            --aa="rand-m9-mstd0.5" \
            --smoothing=0.1 \
            --mixup=0.8 \
            --cutmix=1.0 \
            --drop-path=0.1 \
            --model-ema \
            --model-ema-decay=0.9999 \
            --model="vit_base_patch16_224" \

Following your advice in #28 and the FINETUNE.md page, I was inspired to try this different recipe, with which I get 81.2 accuracy.
Here too, single host with 8 GPUs; base lr = 1e-3, batch size 1024 (as mentioned in the fine-tuning instructions), so effective lr = 0.004.
Also, the paper does not mention layer decay or a different --aa strategy.

            --grad_accum=4 \
            --batch-size=32 \
            --lr=0.004 \
            --weight-decay=0.05 \
            --layer_decay=0.65 \
            --opt="AdamW" \
            --opt-betas 0.9 0.95 \
            --sched="cosine" \
            --warmup-epochs=20 \
            --epochs=300 \
            --aa="rand-m9-mstd0.5-inc1" \
            --smoothing=0.1 \
            --mixup=0.8 \
            --cutmix=1.0 \
            --drop-path=0.1 \
            --model-ema \
            --model-ema-decay=0.9999 \
            --model="vit_base_patch16_224" \
            --reprob=0.25 \

What is wrong with these recipes, and why don't they replicate the from-scratch performance reported in the paper?

decoder checkpoint of the pre-trained MAE

Hi,

Thank you for the awesome work and glad to see that the codes are released!

Are there any plans to release the decoder checkpoints of ViT-Base or ViT-Huge in the near future?

I was trying to run the visualization demo with the ViT-Base model and found that the released checkpoint consists only of the encoder part. (I only found the ViT-Large model's decoder checkpoint in the visualization demo, i.e., mae_visualize_vit_large.pth.)

Thank you very much for your time!

Best regards,

Jihoon

list of available models

Hello and thanks for the great work and the great repo :)

I wonder if you could add a single list with a link and a description for every available model. I think it would help clarify things and speed up adoption.

Something like

model name                    model link   description
mae_visualize_vit_large.pth   link         pre-trained encoder + decoder
mae_visualize_vit_base.pth    link         pre-trained encoder + decoder
mae_pretrain_vit_base.pth     link         pre-trained encoder
-                             -            segmentation model
etc.

Mask ratio in forward method

I was wondering why you put the mask_ratio parameter in the forward method.

Did you experiment with changing the mask ratio during training?
Would that be possible?

Thanks for this great repo!
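Because mask_ratio is an argument of forward() rather than a constructor argument, varying it per iteration is at least mechanically possible. A hedged sketch follows; the schedule is hypothetical, and the released results use a fixed ratio of 0.75:

    import torch
    import models_mae  # from this repo

    model = models_mae.mae_vit_base_patch16()
    imgs = torch.randn(4, 3, 224, 224)

    for epoch in range(3):
        ratio = 0.6 + 0.05 * epoch                      # hypothetical curriculum: 0.6 -> 0.7
        loss, pred, mask = model(imgs, mask_ratio=ratio)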

ADE20k Learning Rate

Section A.4 of the paper says "We search for the optimal lr for each entry in Table 5" when referring to segmentation on ADE20K. Could you share these learning rates? My reproduced baselines are about 2 points too low.

Thanks!

There is a spelling error in FINETUNE.md

The patch size of the ViT-Huge model is 14, but the model parameter is written as vit_huge_patch16; the correct value should be vit_huge_patch14. I switched it to vit_huge_patch14, and the fine-tuning result is *Acc@1 86.926 Acc@5 98.094 loss 0.584.

How sensitive is MAE to batch size?

Thanks for releasing the code!

I am trying to play with MAE but with only a limited number of GPUs. I saw in the docs that we can use accum_iter to make the effective batch size the same as in the paper (4096). But what if we use smaller batch sizes (the typical case for a university lab)? Do you have any data points on how MAE performs with batch sizes smaller than 4096, e.g., ranging from 256 to 1024?

Thank you!
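For reference, here is a minimal sketch of the scaling convention the repo documents (effective batch size = batch size per GPU x number of GPUs x accum_iter, and lr = blr x effective batch size / 256); the concrete numbers below are just an example:

    batch_size_per_gpu = 64
    num_gpus = 8
    accum_iter = 8                    # example value chosen to reach 4096
    blr = 1.5e-4                      # base learning rate used for MAE pre-training

    eff_batch_size = batch_size_per_gpu * num_gpus * accum_iter   # 4096
    lr = blr * eff_batch_size / 256                               # 0.0024
    print(eff_batch_size, lr)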

MAE with extra GAN loss

Hello,

thank you for releasing the code! I have seen that the Colab notebook includes a visualization of an MAE trained with an extra GAN loss, which was not mentioned in the paper.
I was wondering, does it improve the results (linear probe/fine-tuning) in any way? If so, are you planning to release its code?

Thanks

How to pre-train MAE on 1 node with 8 GPUs?

I'm trying to pre-train a vit_small model on 1 node with 8 GPUs, but submitit_pretrain.py fails with:
Traceback (most recent call last):
  File "submitit_pretrain.py", line 131, in <module>
    main()
  File "submitit_pretrain.py", line 89, in main
    args.job_dir = get_shared_folder() / "%j"
  File "submitit_pretrain.py", line 39, in get_shared_folder
    raise RuntimeError("No shared folder available")
RuntimeError: No shared folder available

It seems that I have no shared folder, so how can I pre-train MAE on 1 node with 8 GPUs?
Thanks for your exciting work!

Fine-tuning on IN-1K with 1 node and 8 GPUs

I use the following command to fine-tune ViT-Base.

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 main_finetune.py \
    --accum_iter 4 \
    --batch_size 32 \
    --model vit_base_patch16 \
    --finetune mae_pretrain_vit_base.pth \
    --epochs 100 \
    --blr 5e-4 --layer_decay 0.65 \
    --weight_decay 0.05 --drop_path 0.1 --mixup 0.8 --cutmix 1.0 --reprob 0.25 \
    --dist_eval --data_path ${IMAGENET1K_DIR}

However, the loss is always 6.9068 and the accuracy is 0.10%. I find that len(data_loader) == 3 and data_iter_step only takes the values 0, 1, 2. The parameters do not update because

loss_scaler(loss, optimizer, clip_grad=max_norm,
            parameters=model.parameters(), create_graph=False,
            update_grad=(data_iter_step + 1) % accum_iter == 0)

the condition (data_iter_step + 1) % accum_iter == 0 is never satisfied.
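A tiny illustration of the arithmetic, using the values from the report above:

    accum_iter = 4
    for data_iter_step in range(3):               # len(data_loader) == 3
        update = (data_iter_step + 1) % accum_iter == 0
        print(data_iter_step, update)             # False, False, False -> never updates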

Importing a fine-tuned MAE into another project

I am trying to load an MAE model into another project. I use the VisionTransformer class from https://github.com/facebookresearch/mae/blob/main/models_vit.py#L56 (I also tried the MAE class from https://github.com/facebookresearch/mae/blob/main/models_mae.py#L223).

I tried creating the model using that class and then loading the weights from https://github.com/facebookresearch/mae/blob/main/FINETUNE.md.

However, my code does not seem to load the weights from the checkpoint: when I print the model before and after loading the checkpoint, the weights haven't changed. Does anyone have an idea why this happens?

Here is my implementation:

# Create the model (imports added for completeness)
from functools import partial

import torch
import torch.nn as nn
from models_vit import VisionTransformer

model = VisionTransformer(patch_size=16, embed_dim=768, depth=12, num_heads=12,
                          mlp_ratio=4, qkv_bias=True,
                          norm_layer=partial(nn.LayerNorm, eps=1e-6))

# Load the checkpoint
checkpoint = torch.load('PATH/TO/mae_finetuned_vit_base.pth', map_location='cpu')
checkpoint_model = checkpoint['model']

# Load the state dict; with strict=False, inspect the returned message to see
# which keys were actually matched
msg = model.load_state_dict(checkpoint_model, strict=False)
print(msg.missing_keys, msg.unexpected_keys)

Edit: simplified code and clarified my error

loss fluctuates during training

Thanks for the great work!

I wanted to pre-train a mae_vit_base_patch16 model on ImageNet-1K, and I found that the loss fluctuates during training. I was wondering if this is normal for MAE pre-training.
[screenshot: training loss curve]

Distributed training with Horovod costs more time per epoch

Bandwidth: 10 Gbps
Machine: 8x V100
Nodes: 8
Dataset: ImageNet

When training ResNet-50, the training time is the same as in the paper (about 3 min per epoch), but when training vit_base_patch16 the training time is 10 min per epoch (peak receive bandwidth 7 Gbps).
After using the method in the link, the training time is 6 min per epoch (peak receive bandwidth 4.4 Gbps).
I wonder whether bandwidth is the bottleneck for MAE training. What bandwidth did you use in the paper?

Doubts about the Position Encoding

Thank you so much for releasing the code.

I had a small doubt about the position encoding generated for the 2-D grid. In the file pos_embed.py, the embedding dimension is divided into two groups of size D/2, and I am not sure of the reason for it. If we have to move to higher dimensions, say 3, how should the same be done? Should we divide the dimension into three blocks (and make sure that each block has an even size)?
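For the 3-D case the question describes, here is a hedged sketch; it is my own extension, not repo code, and it assumes the helper get_1d_sincos_pos_embed_from_grid from util/pos_embed.py, splitting the embedding dimension into three even-sized chunks, one per axis:

    import numpy as np
    from util.pos_embed import get_1d_sincos_pos_embed_from_grid  # repo helper

    def get_3d_sincos_pos_embed(embed_dim, grid_size):
        # embed_dim divisible by 6 so each of the three chunks has even size
        assert embed_dim % 6 == 0
        d, h, w = np.meshgrid(np.arange(grid_size), np.arange(grid_size),
                              np.arange(grid_size), indexing='ij')
        emb_d = get_1d_sincos_pos_embed_from_grid(embed_dim // 3, d.reshape(-1))
        emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 3, h.reshape(-1))
        emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 3, w.reshape(-1))
        return np.concatenate([emb_d, emb_h, emb_w], axis=1)   # (grid_size**3, embed_dim)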

Question about positional embedding?

Hi, thanks for the great work!

I have a small question regarding the positional embedding:
In this line

grid = np.meshgrid(grid_w, grid_h) # here w goes first

it seems to return [grid_w, grid_h]

However, in line

emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0]) # (H*W, D/2)

it seems that grid_w is used for making emb_h? Or did I miss something?
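A small standalone illustration of numpy's ordering here (not repo code): with the default indexing='xy', the first array returned by np.meshgrid varies along the width axis, so grid[0] indeed holds the w coordinates.

    import numpy as np

    grid_w = np.arange(3)                 # 0, 1, 2
    grid_h = np.arange(2)                 # 0, 1
    grid = np.meshgrid(grid_w, grid_h)    # here w goes first
    print(grid[0])   # [[0 1 2] [0 1 2]]  -> w coordinates
    print(grid[1])   # [[0 0 0] [1 1 1]]  -> h coordinates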

expected loss value on custom dataset

Thank you so much for releasing the code.

I wanted to train the model from scratch on a custom dataset. I was wondering if you could share the loss value obtained during the pre-training step. I was getting a very high loss value and was wondering what the possible reason could be.
[screenshot: training loss]

Also, is there some correlation between the amount of masking and the reconstruction loss? Can a high loss value indicate that the masking ratio is low?
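For what it's worth, here is a paraphrase of how the reconstruction loss is normalized, modeled on forward_loss in models_mae.py and ignoring the optional --norm_pix_loss target normalization: the per-patch MSE is averaged over the masked patches only, so the loss is not directly scaled by how many patches are masked.

    import torch

    def masked_mse(pred, target, mask):
        # pred, target: (N, L, patch_dim); mask: (N, L) with 1 for masked patches
        loss = (pred - target) ** 2
        loss = loss.mean(dim=-1)                    # mean squared error per patch, (N, L)
        return (loss * mask).sum() / mask.sum()     # averaged over masked patches only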
