Comments (8)

mcarilli commented on August 27, 2024

Yes, we have tested full convergence with fp16.

I noticed you aren't using loss scaling. For mixed precision training, loss scaling is an important step (although it is not necessary for FP32 training). main.py supports static loss scaling, which uses a constant loss scale throughout training. I believe for our convergence runs we used a static loss scale of 128:

python main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 --static-loss-scale 128 /imagenet
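
Conceptually, static loss scaling amounts to something like the following (a simplified sketch, not the actual main.py code): scale the loss before backward so small FP16 gradients don't underflow, then unscale the gradients before the optimizer step.

```python
import torch

# Simplified sketch of static loss scaling (illustration only, not main.py itself).
LOSS_SCALE = 128.0

def training_step(model, criterion, optimizer, inputs, targets):
    loss = criterion(model(inputs), targets)

    optimizer.zero_grad()
    # Scale the loss so that small FP16 gradients don't underflow to zero.
    (loss * LOSS_SCALE).backward()

    # Undo the scaling on the gradients before the weight update.
    for p in model.parameters():
        if p.grad is not None:
            p.grad.data.div_(LOSS_SCALE)

    optimizer.step()
    return loss.item()
```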

If you wish to try dynamic loss scaling instead, which automatically adjusts the loss scale whenever it encounters a NaN/inf, you can run the main_fp16_optimizer example:

python main_fp16_optimizer.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 --dynamic-loss-scale /imagenet
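
The dynamic loss scaling logic is roughly the following (a simplified sketch; the real implementation lives in apex): if any gradient comes back inf/NaN, skip the step and halve the scale, and after a long enough run of clean steps double it again.

```python
import torch

# Simplified sketch of dynamic loss scaling (not the actual apex implementation).
class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 16, growth_interval=1000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def _grads_finite(self, params):
        return all(torch.isfinite(p.grad).all() for p in params if p.grad is not None)

    def step(self, optimizer, params):
        params = list(params)
        if self._grads_finite(params):
            for p in params:                      # unscale, then update
                if p.grad is not None:
                    p.grad.data.div_(self.scale)
            optimizer.step()
            self.good_steps += 1
            if self.good_steps % self.growth_interval == 0:
                self.scale *= 2.0                 # try a larger scale again
        else:
            self.scale /= 2.0                     # overflow: skip the step, back off
            self.good_steps = 0
```

The training loop would multiply the loss by scaler.scale before calling backward(), and call scaler.step(optimizer, model.parameters()) in place of optimizer.step().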

Also, in general, V100 is better than P100 for mixed precision training. Volta tensor cores take in FP16 data and do the accumulate for GEMMs and convolutions in FP32. Pascal doesn't have tensor cores (it only supports FP16 through vectorized instructions), so it is forced to do accumulates in FP16, which is less stable.

Edit: I'm told PyTorch on P100 is smart enough to avoid calling cuBLAS/cuDNN functions that perform accumulates in FP16. Instead, when operating on FP16 data, it calls cuBLAS/cuDNN functions that internally upconvert to FP32 arithmetic. So the numerical stability of these ops on P100 should be roughly equivalent to V100, but P100 still doesn't have tensor cores, so naturally it will not give the same performance as V100.
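
As a side note, a quick way to check whether a given GPU has tensor cores is its compute capability: 7.0 and up (Volta and newer) has them, while P100 is 6.0. (has_tensor_cores below is just a throwaway helper for illustration, not something from apex or torch.)

```python
import torch

def has_tensor_cores(device=0):
    # Tensor cores were introduced with Volta (compute capability 7.0).
    return torch.cuda.get_device_capability(device) >= (7, 0)

print(torch.cuda.get_device_name(0), "->", has_tensor_cores(0))
```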


wangdongxuking61 commented on August 27, 2024

Thank you for your answer. I am now running your suggested command python -m torch.distributed.launch --nproc_per_node=8 main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet on my server with 8 P100 GPUs. I will see the result tomorrow.

Could you please explain why you use --static-loss-scale 128 instead of --static-loss-scale 1 (the default value when it is not set)?
As the paper MIXED PRECISION TRAINING says (just below Table 1):

Loss-scaling technique was not required for successful mixed precision training of these networks.
While all tensors in the forward and backward passes were in FP16, a master copy of weights was
updated in FP32.

After reading the paper, I thought there was no need to set --static-loss-scale when running main.py, so I left it at the default of 1. Then I met the problem described in this issue. Would you mind running python -m torch.distributed.launch --nproc_per_node=8 main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet and seeing whether it hits the same problem I did after about 16 epochs? That would help a lot! Thank you very much.
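
For what it's worth, my understanding of the FP32 master copy the paper describes is roughly this (just a sketch to check my understanding, not the actual main.py code): the model computes in FP16, but the optimizer updates FP32 copies of the weights, which are then copied back into the FP16 model.

```python
import torch

# Sketch of FP32 master weights for an FP16 model (illustration only).
def make_master_params(model):
    masters = [p.detach().clone().float() for p in model.parameters()]
    for m in masters:
        m.requires_grad = True
    return masters  # build the optimizer over these, not over model.parameters()

def master_step(model, masters, optimizer):
    # Copy FP16 gradients into the FP32 master copies.
    for m, p in zip(masters, model.parameters()):
        if p.grad is not None:
            m.grad = p.grad.detach().float()
    optimizer.step()
    # Copy the updated FP32 weights back into the FP16 model.
    with torch.no_grad():
        for m, p in zip(masters, model.parameters()):
            p.copy_(m)
```

The optimizer is constructed over the master parameters, so the update itself happens in FP32 even though the forward and backward passes run in FP16.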


wangdongxuking61 commented on August 27, 2024

@mcarilli Hi, I have tested full convergence successfully with python main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 --static-loss-scale 128 /imagenet

But I don't know why I got non-convergence and NaN gradients when setting --static-loss-scale 1.
Looking forward to your reply. Thank you.


mcarilli commented on August 27, 2024

Glad to hear it!

Honestly, I don't know why a static loss scale of 1 didn't work. The run in the paper may have used different versions of cuDNN or cuBLAS with different stability properties. In general, loss scaling is usually helpful, so there's no reason not to use it, especially when Apex tools can handle it for you automatically.


wangdongxuking61 commented on August 27, 2024

Hmm... I think apex/example/imagenet/main.py doesn't handle NaN or Inf, which causes the non-convergence. I will try to edit main.py to handle NaN or Inf.
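
Something like this is what I have in mind (just a sketch): skip the optimizer step whenever any gradient is non-finite.

```python
import torch

# Sketch: only update the weights if all gradients are finite.
def step_if_finite(model, optimizer):
    for p in model.parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print("Non-finite gradient detected, skipping this step")
            optimizer.zero_grad()
            return False
    optimizer.step()
    return True
```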


mcarilli commented on August 27, 2024

Editing main.py to check for infs/nans is a useful exercise for learning purposes, but it's not necessary if you use Apex's tools.

FP16_Optimizer and Amp both provide automated dynamic loss scaling, which checks for infs and nans and adjusts the loss scale on the fly. For FP16_Optimizer, dynamic loss scaling is an option you pass to the constructor; for Amp, it is the default mode. All of this is done transparently to the user. You can see how FP16_Optimizer is used by comparing main.py with main_fp16_optimizer.py, and run the example with dynamic loss scaling:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS main_fp16_optimizer.py -a resnet50 --fp16 --b 256 --dynamic-loss-scale --workers 4 ./

The real intention of main.py is educational, to illustrate the sort of operations FP16_Optimizer is doing under the hood. For simplicity, it does not show inf/nan checking or dynamic loss scaling.
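
Wrapping an existing optimizer looks roughly like this (simplified; see main_fp16_optimizer.py and the FP16_Optimizer docs for the exact API):

```python
import torch
from apex.fp16_utils import FP16_Optimizer

# A tiny FP16 model just for illustration.
model = torch.nn.Linear(10, 10).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Wrap the optimizer; dynamic loss scaling checks for infs/NaNs automatically.
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

# In the training loop, call optimizer.backward(loss) instead of loss.backward().
inputs = torch.randn(4, 10).cuda().half()
loss = model(inputs).float().sum()
optimizer.backward(loss)  # handles the loss scaling for you
optimizer.step()          # updates the FP32 master weights; adjusts the scale on overflow
```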


wangdongxuking61 commented on August 27, 2024

Thanks, I have already got convergence using main_fp16_optimizer.py with --dynamic-loss-scale.
I am just curious why --static-loss-scale 1 didn't work. I will figure it out.


mcarilli commented on August 27, 2024

I wouldn't worry too much about that. It's probably not your fault, or the script's fault. The paper says they used PyTorch, but it doesn't say which versions of cuDNN/cuBLAS they were using, which can definitely make a difference. In my opinion it's better to use loss scaling for safety, and to understand the theoretical reasons why loss scaling is helpful.
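
The core of it is FP16 underflow: FP16 can't represent magnitudes much below about 6e-8, so small gradient values flush to zero unless the loss (and therefore every gradient) is scaled up first. A two-line check:

```python
import torch

g = torch.tensor(1e-8)
print(g.half())          # tensor(0., dtype=torch.float16) -- underflows to zero
print((g * 128).half())  # a small nonzero FP16 value, the information survives
```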

This talk starting at slide 12
http://on-demand.gputechconf.com/gtc/2018/presentation/s8923-training-neural-networks-with-mixed-precision-theory-and-practice.pdf
and this talk starting at slide 31
http://on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-1_Internal%20Speaker_Michael%20Carilli_PDF%20For%20Sharing.pdf
are good resources.

