Comments (7)
@gaow0007 you are right. On a single GPU, the approximation for the gradient noise scale doesn't work very well, so we avoid scaling the batch size and learning rate to mitigate convergence issues.
from adaptdl.
Thanks a lot!
@aurickq Hi, could you elaborate on why the approximation for GNS is not very good for single-GPU training? Is it because adaptdl is using the approximation for multi-GPU training from Appendix A.1 in the original GNS paper?
Sure. You are right that it is because we are using the formula from Appendix A.1 of the GNS paper, which depends on having two or more gradients evaluated using the current model parameters. When there are two or more GPUs, then the per-GPU gradients can be used. When there is only a single GPU, then the formula results in a division by zero.
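The Appendix A.1 estimator described above can be sketched in a few lines. This is a minimal illustration under my own naming and array layout, not AdaptDL's actual code: from `k` per-replica gradients, it forms unbiased estimates of the true gradient norm and the trace of the gradient covariance, and both denominators vanish when `k == 1` (the small and big batch sizes coincide), which is the division by zero mentioned above.

```python
import numpy as np

def gns_estimate(per_replica_grads, per_replica_batch_size):
    """Sketch of the Appendix A.1 gradient noise scale estimator.

    per_replica_grads: (k, d) array, one flattened gradient per replica.
    per_replica_batch_size: number of samples each replica's gradient used.
    """
    k = len(per_replica_grads)
    b_small = per_replica_batch_size        # batch size of each per-replica gradient
    b_big = k * per_replica_batch_size      # batch size of the averaged gradient
    g_big = per_replica_grads.mean(axis=0)  # gradient averaged across replicas
    sq_big = np.sum(g_big ** 2)             # |G_big|^2
    sq_small = np.mean(np.sum(per_replica_grads ** 2, axis=1))  # mean |G_small|^2
    # Unbiased estimates of |G|^2 and tr(Sigma).  Both denominators are
    # zero when b_small == b_big, i.e. when k == 1 (a single GPU).
    grad_sqr = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
    grad_var = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)
    return grad_var / grad_sqr  # simple noise scale B_noise
```

With `k == 2` or more replicas this gives a finite estimate each step; with `k == 1` the two batch sizes are equal and the formula is undefined.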
With that said, AdaptDL can still compute the GNS on a single GPU in one of the following ways:
- When gradient accumulation is used, AdaptDL can use the per-step gradients rather than the per-GPU gradients to achieve the same thing. However, gradient accumulation on a single GPU is unlikely to speed up training, so this is rarely done.
- Otherwise, AdaptDL tries to use the gradient from the current step together with the gradient from the previous step. However, since these two gradients are evaluated using different model parameters, it is only a biased estimation. We do not use this estimation for scaling the learning rate since it can noticeably degrade validation accuracy.
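The second approach above can be sketched as follows (a hedged illustration under my own naming, not AdaptDL's implementation): pairing the current gradient with the previous step's gradient is algebraically the same as treating them as two "replica" gradients at the same batch size, which reduces the Appendix A.1 formulas to `g_prev . g_curr` as the estimate of the squared gradient norm and `(B/2) * |g_curr - g_prev|^2` as the estimate of the gradient variance.

```python
import numpy as np

class PrevStepGNS:
    """Biased single-GPU GNS estimate that pairs each gradient with the
    gradient from the previous step (sketch, not AdaptDL's actual code)."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.prev = None  # gradient from the previous step

    def update(self, grad):
        grad = np.asarray(grad, dtype=float)
        if self.prev is None:
            self.prev = grad
            return None  # need two gradients before estimating
        # Treating (prev, grad) as two replica gradients at the same batch
        # size, Appendix A.1 reduces to: prev . grad estimates |G|^2, and
        # (B/2) * |grad - prev|^2 estimates tr(Sigma).  The estimate is
        # biased because prev was evaluated at different model parameters.
        grad_sqr = float(np.dot(self.prev, grad))
        grad_var = 0.5 * self.batch_size * float(np.sum((grad - self.prev) ** 2))
        self.prev = grad
        return grad_var / grad_sqr  # simple noise scale
```

This is why the biased estimate is acceptable for reporting statistics but, as noted above, risky to feed back into learning-rate scaling.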
Thanks a lot for the detailed explanation!
I blindly tried running adaptdl with one replica after removing the `num_replicas == 1` condition (taking the `False` branch in `np.where`), but some non-NaN number for efficiency is actually computed. It seems that `_grad_params.spr` and `_grad_params.var` are used - are these the first approach you described, or am I just computing some random value?
Assuming `accumulation == False` (otherwise that line of code would not run in the first place), it should be taking the second approach, where it approximates the GNS based on gradients from the previous step.
I see. Thanks for the reply! 👍