
Comments (3)

QuentinDuval commented on August 30, 2024

Hi @nemtiax,

Thanks a lot for your interest in VISSL and your question :)

To the best of my understanding, the issue described in the MoCo paper comes from the ability to "cheat" by using local batch statistics rather than global batch statistics. This means the issue only exists when you have several GPUs, each using different statistics because their batches are different. With a single GPU for training, this is no longer an issue, as the global statistics are the same as the local statistics.

This is why in VISSL, we only do the shuffling and un-shuffling when distributed training is enabled:
https://github.com/facebookresearch/vissl/blob/main/vissl/hooks/moco_hooks.py#L164
https://github.com/facebookresearch/vissl/blob/main/vissl/hooks/moco_hooks.py#L171

So in short, there should be no issue with MoCo on 1 GPU, at least not due to this (there could be issues with the batch size being too small to get good results, as is the case for SimCLR).
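To illustrate the idea (this is only a rough sketch of the pattern, not VISSL's actual hook, which lives in moco_hooks.py linked above), a shuffle gated on distributed training could look something like this with PyTorch's torch.distributed API; the helper name maybe_shuffle_for_bn is made up for illustration:

```python
import torch
import torch.distributed as dist

def maybe_shuffle_for_bn(keys: torch.Tensor):
    """Hypothetical helper (not VISSL's hook): shuffle the key batch across
    GPUs so each worker's BatchNorm sees a mix of samples."""
    # On a single GPU the local batch statistics *are* the global ones,
    # so there is nothing to shuffle.
    if not (dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1):
        return keys, None

    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # Gather the full batch from all workers (no gradients flow through the
    # key encoder in MoCo, so a plain all_gather is fine here).
    gathered = [torch.zeros_like(keys) for _ in range(world_size)]
    dist.all_gather(gathered, keys)
    full_batch = torch.cat(gathered, dim=0)

    # Every rank must apply the *same* random permutation.
    perm = torch.randperm(full_batch.shape[0], device=keys.device)
    dist.broadcast(perm, src=0)

    # Each rank keeps its own shuffled slice; `perm` lets the caller undo
    # the shuffle after the key encoder's forward pass.
    local = full_batch[perm].chunk(world_size, dim=0)[rank]
    return local, perm
```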

As for replacing BatchNorm with LayerNorm, I have never tried it personally for MoCo, but LayerNorm has indeed risen in popularity, particularly in vision transformers. It could very well work here.
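If you want to try it, one common substitute is GroupNorm with a single group, which behaves like LayerNorm over the conv feature maps; torchvision's ResNet lets you inject it through its norm_layer argument. This is just a sketch of the swap, not something provided by VISSL:

```python
import torch.nn as nn
from torchvision.models import resnet50

def layer_norm_2d(num_channels: int) -> nn.Module:
    # GroupNorm with one group normalizes over (C, H, W) per sample,
    # which is the usual "LayerNorm for conv features" substitute.
    return nn.GroupNorm(num_groups=1, num_channels=num_channels)

# Build the encoder with every BatchNorm2d replaced by the LayerNorm-style layer.
encoder = resnet50(norm_layer=layer_norm_2d)
```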

Thank you,
Quentin

nemtiax commented on August 30, 2024

Great, that makes sense to me. As you say, the MoCo appendix suggests that the method of "cheating" through BN is to identify which sub-batch contains the target, which would not be an issue for a single GPU.

Thanks for your help!

nemtiax commented on August 30, 2024

Documenting what I found in case someone lands on this issue via a Google search in the future. I trained MoCo on CIFAR-10 using one GPU with both BN and LN. The LN model seems to get stuck in a degenerate solution for many epochs before eventually breaking out and beginning to learn. As might be expected, even after recovering and beginning to learn, the final performance of the LN model is far worse: it loses roughly 25 points of accuracy compared to the BN model (~86% -> ~62%) after fine-tuning a linear classifier on the descriptors.
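For anyone reproducing this, the linear evaluation above is the standard protocol: freeze the pretrained encoder and train only a linear classifier on its descriptors. A rough, self-contained sketch (the resnet18 backbone and dummy data here are placeholders, not my exact setup):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Linear probe: freeze the encoder, train only a linear head on its features.
encoder = resnet18(num_classes=128)   # stand-in for the MoCo-pretrained backbone
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

head = nn.Linear(128, 10)             # CIFAR-10 has 10 classes
optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)    # dummy CIFAR-sized batch
labels = torch.randint(0, 10, (8,))

with torch.no_grad():
    features = encoder(images)        # frozen descriptors
loss = criterion(head(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```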

Training loss, orange is BN, blue is LN: [figure omitted]

Gradients on a representative sample of model weights (note that the LN model has very small gradients for the first ~60 epochs): [figure omitted]

I don't yet have an explanation for exactly what is going wrong in the LN case, but my hypothesis is that BN helps prevent the model from pushing all the images in a batch to the same descriptor, which makes it easier to start learning.
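One simple way to check for this kind of collapse (just a diagnostic sketch, not something taken from my training code) is to track the mean pairwise cosine similarity of the descriptors within a batch; values near 1.0 mean everything has been pushed to the same direction:

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(descriptors: torch.Tensor) -> float:
    """Mean off-diagonal cosine similarity of a (batch, dim) descriptor matrix.
    Values close to 1.0 indicate the batch has collapsed to a single direction."""
    z = F.normalize(descriptors, dim=1)
    sim = z @ z.t()
    mask = ~torch.eye(sim.shape[0], dtype=torch.bool, device=sim.device)
    return sim[mask].mean().item()

# Example: a healthy batch of random descriptors has similarity near 0.
print(mean_pairwise_cosine(torch.randn(256, 128)))
```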
