
Comments (7)

aurickq commented on May 19, 2024

@gaow0007 you are right. On a single GPU, the approximation for the gradient noise scale doesn't work very well, so we avoid scaling the batch size and learning rate to mitigate convergence issues.

from adaptdl.

gaow0007 commented on May 19, 2024

Thanks a lot!


jaywonchung commented on May 19, 2024

@aurickq Hi, could you elaborate on why the approximation for GNS is not very good for single GPU training? Is it because adaptdl is using the approximation for multi-GPU training from Appendix A.1 in the original GNS paper?


aurickq commented on May 19, 2024

> @aurickq Hi, could you elaborate on why the approximation for GNS is not very good for single GPU training? Is it because adaptdl is using the approximation for multi-GPU training from Appendix A.1 in the original GNS paper?

Sure. You are right that it is because we are using the formula from Appendix A.1 of the GNS paper, which depends on having two or more gradients evaluated using the current model parameters. When there are two or more GPUs, then the per-GPU gradients can be used. When there is only a single GPU, then the formula results in a division by zero.
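For reference, the Appendix A.1 estimator can be sketched as follows. This is a minimal illustration of the formula, not AdaptDL's actual code; the function name and variable names here are made up:

```python
import numpy as np

def gns_estimate(per_gpu_grads, per_gpu_batch_size):
    """Estimate the "simple" gradient noise scale from per-GPU gradients
    (Appendix A.1 of the GNS paper), all evaluated at the same parameters.

    per_gpu_grads: list of flat gradient vectors, one per GPU.
    """
    k = len(per_gpu_grads)          # number of GPUs / replicas
    b_small = per_gpu_batch_size    # batch size on one GPU
    b_big = k * per_gpu_batch_size  # total batch size across GPUs
    g_big = np.mean(per_gpu_grads, axis=0)
    sq_small = np.mean([np.sum(g * g) for g in per_gpu_grads])
    sq_big = np.sum(g_big * g_big)
    # Unbiased estimates of |G|^2 and tr(Sigma).  With a single GPU,
    # b_big == b_small, so both denominators below are zero -- this is
    # the division by zero mentioned above.
    grad_sqnorm = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
    noise = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)
    return noise / grad_sqnorm
```

The key point is that both unbiased estimators difference two batch sizes (`b_small` and `b_big`), so they are only defined when the per-GPU gradients give two distinct batch-size scales.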

With that said, AdaptDL can still compute the GNS on a single GPU using one of the following ways:

  1. When gradient accumulation is used, AdaptDL can use the per-step gradients rather than the per-GPU gradients to achieve the same thing. However, gradient accumulation on a single GPU is unlikely to speed up training, so this is rarely done.
  2. Otherwise, AdaptDL tries to use the gradient from the current step together with the gradient from the previous step. However, since these two gradients are evaluated using different model parameters, it is only a biased estimation. We do not use this estimation for scaling the learning rate since it can noticeably degrade validation accuracy.
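The second fallback can be sketched like this. The class name and structure are hypothetical and only illustrate the idea of treating two consecutive steps as if they were two replicas; AdaptDL's real implementation differs:

```python
import numpy as np

class SingleGPUNoiseScale:
    """Biased single-GPU GNS estimate: pair the gradient from the current
    step with the gradient from the previous step.  Biased because the two
    gradients are evaluated at different model parameters."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.prev_grad = None

    def update(self, grad):
        result = None
        if self.prev_grad is not None:
            b_small = self.batch_size      # one step's batch
            b_big = 2 * self.batch_size    # two steps pooled together
            g_big = 0.5 * (grad + self.prev_grad)
            sq_small = 0.5 * (np.sum(grad ** 2) + np.sum(self.prev_grad ** 2))
            sq_big = np.sum(g_big ** 2)
            grad_sqnorm = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
            noise = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)
            result = noise / max(grad_sqnorm, 1e-12)  # biased estimate
        self.prev_grad = grad
        return result
```

The first call returns `None` (no previous gradient yet); subsequent calls return the biased estimate. As noted above, this value is good enough for batch-size decisions but is not used to scale the learning rate.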


jaywonchung commented on May 19, 2024

Thanks a lot for the detailed explanation!

I blindly tried running adaptdl with one replica after removing the num_replicas == 1 condition (taking the False branch in np.where), but a non-NaN efficiency value is actually computed. It seems that _grad_params.spr and _grad_params.var are used. Are these the first approach you described, or am I just computing some random value?


aurickq commented on May 19, 2024

> Thanks a lot for the detailed explanation!
>
> I blindly tried running adaptdl with one replica after removing the num_replicas == 1 condition (taking the False branch in np.where), but a non-NaN efficiency value is actually computed. It seems that _grad_params.spr and _grad_params.var are used. Are these the first approach you described, or am I just computing some random value?

Assuming accumulation == False (otherwise that line of code would not run in the first place), it should be taking the second approach, which approximates the GNS using the gradient from the previous step.


jaywonchung commented on May 19, 2024

I see. Thanks for the reply! 👍

