
Comments (3)

QuentinDuval commented on August 30, 2024

Hi @nemtiax,

Thanks a lot for your interest in VISSL and your question :)

To the best of my understanding, the issue described in the MoCo paper comes from the ability to "cheat" by using local batch statistics rather than global batch statistics. This means the issue only exists when you have several GPUs, each using different statistics because their batches are different. With a single GPU for training, this is no longer an issue, as the global statistics are the same as the local statistics.

This is why in VISSL, we only do the shuffling and un-shuffling when distributed training is enabled:
https://github.com/facebookresearch/vissl/blob/main/vissl/hooks/moco_hooks.py#L164
https://github.com/facebookresearch/vissl/blob/main/vissl/hooks/moco_hooks.py#L171

So in short, there should be no issue with MoCo on 1 GPU, at least not due to this (there could be issues with the batch size being too small to get good results, as is the case for SimCLR).
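To illustrate the idea (this is only a rough sketch of the pattern, not VISSL's actual hook, which lives in moco_hooks.py linked above), a shuffle gated on distributed training could look something like this with PyTorch's torch.distributed API; the helper name maybe_shuffle_for_bn is made up for illustration:

```python
import torch
import torch.distributed as dist

def maybe_shuffle_for_bn(keys: torch.Tensor):
    """Hypothetical helper (not VISSL's hook): shuffle the key batch across
    GPUs so each worker's BatchNorm sees a mix of samples."""
    # On a single GPU the local batch statistics *are* the global ones,
    # so there is nothing to shuffle.
    if not (dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1):
        return keys, None

    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # Gather the full batch from all workers (no gradients flow through the
    # key encoder in MoCo, so a plain all_gather is fine here).
    gathered = [torch.zeros_like(keys) for _ in range(world_size)]
    dist.all_gather(gathered, keys)
    full_batch = torch.cat(gathered, dim=0)

    # Every rank must apply the *same* random permutation.
    perm = torch.randperm(full_batch.shape[0], device=keys.device)
    dist.broadcast(perm, src=0)

    # Each rank keeps its own shuffled slice; `perm` lets the caller undo
    # the shuffle after the key encoder's forward pass.
    local = full_batch[perm].chunk(world_size, dim=0)[rank]
    return local, perm
```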

As for replacing BatchNorm with LayerNorm, I have never tried it personally for MoCo, but LayerNorm has indeed risen in popularity, particularly in vision transformers. It could very well work here.
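If you want to try it, one common substitute is GroupNorm with a single group, which behaves like LayerNorm over the conv feature maps; torchvision's ResNet lets you inject it through its norm_layer argument. This is just a sketch of the swap, not something provided by VISSL:

```python
import torch.nn as nn
from torchvision.models import resnet50

def layer_norm_2d(num_channels: int) -> nn.Module:
    # GroupNorm with one group normalizes over (C, H, W) per sample,
    # which is the usual "LayerNorm for conv features" substitute.
    return nn.GroupNorm(num_groups=1, num_channels=num_channels)

# Build the encoder with every BatchNorm2d replaced by the LayerNorm-style layer.
encoder = resnet50(norm_layer=layer_norm_2d)
```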

Thank you,
Quentin

nemtiax commented on August 30, 2024

Great, that makes sense to me. As you say, the MoCo appendix suggests that the method of "cheating" through BN is to identify which sub-batch contains the target, which would not be an issue for a single GPU.

Thanks for your help!

nemtiax commented on August 30, 2024

Documenting what I found in case someone lands on this issue via a Google search in the future. I trained MoCo on CIFAR-10 using one GPU with both BN and LN. The LN model seems to get stuck in a degenerate solution for many epochs before eventually breaking out and beginning to learn. As might be expected, even after recovering and beginning to learn, the final performance of the LN model is far worse: it loses roughly 25 points of accuracy compared to the BN model (~86% -> ~62%) after fine-tuning a linear classifier on the descriptors.
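For anyone reproducing this, the linear evaluation above is the standard protocol: freeze the pretrained encoder and train only a linear classifier on its descriptors. A rough, self-contained sketch (the resnet18 backbone and dummy data here are placeholders, not my exact setup):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Linear probe: freeze the encoder, train only a linear head on its features.
encoder = resnet18(num_classes=128)   # stand-in for the MoCo-pretrained backbone
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

head = nn.Linear(128, 10)             # CIFAR-10 has 10 classes
optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)    # dummy CIFAR-sized batch
labels = torch.randint(0, 10, (8,))

with torch.no_grad():
    features = encoder(images)        # frozen descriptors
loss = criterion(head(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```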

Training loss, orange is BN, blue is LN: [figure omitted]

Gradients on a representative sample of model weights (note that the LN model has very small gradients for the first ~60 epochs): [figure omitted]

I don't yet have an explanation for exactly what is going wrong in the LN case, but my hypothesis is that BN helps prevent the model from pushing all the images in a batch to the same descriptor, which makes it easier to start learning.
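One simple way to check for this kind of collapse (just a diagnostic sketch, not something taken from my training code) is to track the mean pairwise cosine similarity of the descriptors within a batch; values near 1.0 mean everything has been pushed to the same direction:

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(descriptors: torch.Tensor) -> float:
    """Mean off-diagonal cosine similarity of a (batch, dim) descriptor matrix.
    Values close to 1.0 indicate the batch has collapsed to a single direction."""
    z = F.normalize(descriptors, dim=1)
    sim = z @ z.t()
    mask = ~torch.eye(sim.shape[0], dtype=torch.bool, device=sim.device)
    return sim[mask].mean().item()

# Example: a healthy batch of random descriptors has similarity near 0.
print(mean_pairwise_cosine(torch.randn(256, 128)))
```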
