
Comments (7)

aurickq commented on May 19, 2024

@gaow0007 you are right. On a single GPU, the approximation for the gradient noise scale doesn't work very well, so we avoid scaling the batch size and learning rate to mitigate convergence issues.

from adaptdl.

gaow0007 commented on May 19, 2024

Thanks a lot!


jaywonchung commented on May 19, 2024

@aurickq Hi, could you elaborate on why the approximation for GNS is not very good for single GPU training? Is it because adaptdl is using the approximation for multi-GPU training from Appendix A.1 in the original GNS paper?


aurickq commented on May 19, 2024

> @aurickq Hi, could you elaborate on why the approximation for GNS is not very good for single GPU training? Is it because adaptdl is using the approximation for multi-GPU training from Appendix A.1 in the original GNS paper?

Sure. You are right that it is because we are using the formula from Appendix A.1 of the GNS paper, which depends on having two or more gradients evaluated using the current model parameters. When there are two or more GPUs, then the per-GPU gradients can be used. When there is only a single GPU, then the formula results in a division by zero.
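For reference, the Appendix A.1 estimator can be sketched as follows. This is a minimal illustration of the formula, not AdaptDL's actual code; the function name and variable names here are made up:

```python
import numpy as np

def gns_estimate(per_gpu_grads, per_gpu_batch_size):
    """Estimate the "simple" gradient noise scale from per-GPU gradients
    (Appendix A.1 of the GNS paper), all evaluated at the same parameters.

    per_gpu_grads: list of flat gradient vectors, one per GPU.
    """
    k = len(per_gpu_grads)          # number of GPUs / replicas
    b_small = per_gpu_batch_size    # batch size on one GPU
    b_big = k * per_gpu_batch_size  # total batch size across GPUs
    g_big = np.mean(per_gpu_grads, axis=0)
    sq_small = np.mean([np.sum(g * g) for g in per_gpu_grads])
    sq_big = np.sum(g_big * g_big)
    # Unbiased estimates of |G|^2 and tr(Sigma).  With a single GPU,
    # b_big == b_small, so both denominators below are zero -- this is
    # the division by zero mentioned above.
    grad_sqnorm = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
    noise = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)
    return noise / grad_sqnorm
```

The key point is that both unbiased estimators difference two batch sizes (`b_small` and `b_big`), so they are only defined when the per-GPU gradients give two distinct batch-size scales.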

With that said, AdaptDL can still compute the GNS on a single GPU using one of the following ways:

  1. When gradient accumulation is used, AdaptDL can use the per-step gradients rather than the per-GPU gradients to achieve the same thing. However, gradient accumulation on a single GPU is unlikely to speed up training, so this is rarely done.
  2. Otherwise, AdaptDL tries to use the gradient from the current step together with the gradient from the previous step. However, since these two gradients are evaluated using different model parameters, it is only a biased estimation. We do not use this estimation for scaling the learning rate since it can noticeably degrade validation accuracy.
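The second fallback can be sketched like this. The class name and structure are hypothetical and only illustrate the idea of treating two consecutive steps as if they were two replicas; AdaptDL's real implementation differs:

```python
import numpy as np

class SingleGPUNoiseScale:
    """Biased single-GPU GNS estimate: pair the gradient from the current
    step with the gradient from the previous step.  Biased because the two
    gradients are evaluated at different model parameters."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.prev_grad = None

    def update(self, grad):
        result = None
        if self.prev_grad is not None:
            b_small = self.batch_size      # one step's batch
            b_big = 2 * self.batch_size    # two steps pooled together
            g_big = 0.5 * (grad + self.prev_grad)
            sq_small = 0.5 * (np.sum(grad ** 2) + np.sum(self.prev_grad ** 2))
            sq_big = np.sum(g_big ** 2)
            grad_sqnorm = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
            noise = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)
            result = noise / max(grad_sqnorm, 1e-12)  # biased estimate
        self.prev_grad = grad
        return result
```

The first call returns `None` (no previous gradient yet); subsequent calls return the biased estimate. As noted above, this value is good enough for batch-size decisions but is not used to scale the learning rate.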


jaywonchung commented on May 19, 2024

Thanks a lot for the detailed explanation!

I blindly tried running adaptdl with one replica after removing the num_replicas == 1 condition (taking the False branch in np.where), but a non-NaN efficiency value is actually computed. It seems that _grad_params.spr and _grad_params.var are used. Are these the first approach you described, or am I just computing some random value?


aurickq commented on May 19, 2024

> Thanks a lot for the detailed explanation!
>
> I blindly tried running adaptdl with one replica after removing the num_replicas == 1 condition (taking the False branch in np.where), but a non-NaN efficiency value is actually computed. It seems that _grad_params.spr and _grad_params.var are used. Are these the first approach you described, or am I just computing some random value?

Assuming accumulation == False (otherwise that line of code would not run in the first place), it should be taking the second approach, which approximates the GNS using the gradient from the previous step.


jaywonchung commented on May 19, 2024

I see. Thanks for the reply! 👍

