Comments (8)

mcarilli commented on August 27, 2024

Yes, we have tested full convergence with fp16.

I noticed you aren't using loss scaling. For mixed precision training, loss scaling is an important step (although it is not necessary for FP32 training). main.py supports static loss scaling, which uses a constant loss scale throughout training. I believe for our convergence runs we used a static loss scale of 128:

python main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 --static-loss-scale 128 /imagenet
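
Conceptually, static loss scaling amounts to something like the following (a simplified sketch, not the actual main.py code): scale the loss before backward so small FP16 gradients don't underflow, then unscale the gradients before the optimizer step.

```python
import torch

# Simplified sketch of static loss scaling (illustration only, not main.py itself).
LOSS_SCALE = 128.0

def training_step(model, criterion, optimizer, inputs, targets):
    loss = criterion(model(inputs), targets)

    optimizer.zero_grad()
    # Scale the loss so that small FP16 gradients don't underflow to zero.
    (loss * LOSS_SCALE).backward()

    # Undo the scaling on the gradients before the weight update.
    for p in model.parameters():
        if p.grad is not None:
            p.grad.data.div_(LOSS_SCALE)

    optimizer.step()
    return loss.item()
```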

If you wish to try dynamic loss scaling instead, which automatically adjusts the loss scale whenever it encounters a NaN/inf, you can run the main_fp16_optimizer example:

python main_fp16_optimizer.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 --dynamic-loss-scale /imagenet
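
The dynamic loss scaling logic is roughly the following (a simplified sketch; the real implementation lives in apex): if any gradient comes back inf/NaN, skip the step and halve the scale, and after a long enough run of clean steps double it again.

```python
import torch

# Simplified sketch of dynamic loss scaling (not the actual apex implementation).
class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 16, growth_interval=1000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def _grads_finite(self, params):
        return all(torch.isfinite(p.grad).all() for p in params if p.grad is not None)

    def step(self, optimizer, params):
        params = list(params)
        if self._grads_finite(params):
            for p in params:                      # unscale, then update
                if p.grad is not None:
                    p.grad.data.div_(self.scale)
            optimizer.step()
            self.good_steps += 1
            if self.good_steps % self.growth_interval == 0:
                self.scale *= 2.0                 # try a larger scale again
        else:
            self.scale /= 2.0                     # overflow: skip the step, back off
            self.good_steps = 0
```

The training loop would multiply the loss by scaler.scale before calling backward(), and call scaler.step(optimizer, model.parameters()) in place of optimizer.step().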

Also, in general, V100 is better than P100 for mixed precision training. Volta tensor cores take in FP16 data and do the accumulate for GEMMs and convolutions in FP32. Pascal doesn't have tensor cores (it only supports FP16 through vectorized instructions), so it is forced to do accumulates in FP16, which is less stable.

Edit: I'm told PyTorch on P100 is smart enough to avoid calling cuBLAS/cuDNN functions that perform accumulates in FP16. Instead, when operating on FP16 data, it calls cuBLAS/cuDNN functions that internally upconvert to FP32 arithmetic. So the numerical stability of these ops on P100 should be roughly equivalent to V100, but P100 still doesn't have tensor cores, so naturally it will not give the same performance as V100.
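
As a side note, a quick way to check whether a given GPU has tensor cores is its compute capability: 7.0 and up (Volta and newer) has them, while P100 is 6.0. (has_tensor_cores below is just a throwaway helper for illustration, not something from apex or torch.)

```python
import torch

def has_tensor_cores(device=0):
    # Tensor cores were introduced with Volta (compute capability 7.0).
    return torch.cuda.get_device_capability(device) >= (7, 0)

print(torch.cuda.get_device_name(0), "->", has_tensor_cores(0))
```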


wangdongxuking61 commented on August 27, 2024

Thank you for your answer. I am now running your suggested command python -m torch.distributed.launch --nproc_per_node=8 main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet on my server with 8 P100 GPUs. I will see the result tomorrow.

Could you please explain why you use --static-loss-scale 128 instead of --static-loss-scale 1 (the default value when it is not set)?
As the paper MIXED PRECISION TRAINING says (just below Table 1):

Loss-scaling technique was not required for successful mixed precision training of these networks.
While all tensors in the forward and backward passes were in FP16, a master copy of weights was
updated in FP32.

After reading the paper, I thought there was no need to set --static-loss-scale when running main.py, so I left it at the default of 1. Then I met the problem described in this issue. Would you mind running python -m torch.distributed.launch --nproc_per_node=8 main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet and seeing whether it hits the same problem I did after about 16 epochs? That would help a lot! Thank you very much.
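
For what it's worth, my understanding of the FP32 master copy the paper describes is roughly this (just a sketch to check my understanding, not the actual main.py code): the model computes in FP16, but the optimizer updates FP32 copies of the weights, which are then copied back into the FP16 model.

```python
import torch

# Sketch of FP32 master weights for an FP16 model (illustration only).
def make_master_params(model):
    masters = [p.detach().clone().float() for p in model.parameters()]
    for m in masters:
        m.requires_grad = True
    return masters  # build the optimizer over these, not over model.parameters()

def master_step(model, masters, optimizer):
    # Copy FP16 gradients into the FP32 master copies.
    for m, p in zip(masters, model.parameters()):
        if p.grad is not None:
            m.grad = p.grad.detach().float()
    optimizer.step()
    # Copy the updated FP32 weights back into the FP16 model.
    with torch.no_grad():
        for m, p in zip(masters, model.parameters()):
            p.copy_(m)
```

The optimizer is constructed over the master parameters, so the update itself happens in FP32 even though the forward and backward passes run in FP16.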


wangdongxuking61 commented on August 27, 2024

@mcarilli Hi, I have tested full convergence successfully with python main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 --static-loss-scale 128 /imagenet

But I don't know why I got non-convergence and NaN gradients when setting --static-loss-scale 1.
Looking forward to your reply. Thank you.


mcarilli commented on August 27, 2024

Glad to hear it!

Honestly, I don't know why a static loss scale of 1 didn't work. The run in the paper may have used different versions of cuDNN or cuBLAS with different stability properties. In general, loss scaling is usually helpful, so there's no reason not to use it, especially when Apex tools can handle it for you automatically.


wangdongxuking61 commented on August 27, 2024

Hmm... I think apex/example/imagenet/main.py doesn't handle NaN or Inf, which causes the non-convergence. I will try to edit main.py to handle NaN or Inf.
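
Something like this is what I have in mind (just a sketch): skip the optimizer step whenever any gradient is non-finite.

```python
import torch

# Sketch: only update the weights if all gradients are finite.
def step_if_finite(model, optimizer):
    for p in model.parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print("Non-finite gradient detected, skipping this step")
            optimizer.zero_grad()
            return False
    optimizer.step()
    return True
```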


mcarilli commented on August 27, 2024

Editing main.py to check for infs/nans is a useful exercise for learning purposes, but it's not necessary if you use Apex's tools.

FP16_Optimizer and Amp both provide automated dynamic loss scaling, which checks for infs and nans and adjusts the loss scale on the fly. For FP16_Optimizer, dynamic loss scaling is an option you pass to the constructor; for Amp, it is the default mode. All of this is done transparently to the user. You can see how FP16_Optimizer is used by comparing main.py with main_fp16_optimizer.py, and run the example with dynamic loss scaling:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS main_fp16_optimizer.py -a resnet50 --fp16 --b 256 --dynamic-loss-scale --workers 4 ./

The real intention of main.py is educational, to illustrate the sort of operations FP16_Optimizer is doing under the hood. For simplicity, it does not show inf/nan checking or dynamic loss scaling.
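
Wrapping an existing optimizer looks roughly like this (simplified; see main_fp16_optimizer.py and the FP16_Optimizer docs for the exact API):

```python
import torch
from apex.fp16_utils import FP16_Optimizer

# A tiny FP16 model just for illustration.
model = torch.nn.Linear(10, 10).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Wrap the optimizer; dynamic loss scaling checks for infs/NaNs automatically.
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

# In the training loop, call optimizer.backward(loss) instead of loss.backward().
inputs = torch.randn(4, 10).cuda().half()
loss = model(inputs).float().sum()
optimizer.backward(loss)  # handles the loss scaling for you
optimizer.step()          # updates the FP32 master weights; adjusts the scale on overflow
```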


wangdongxuking61 commented on August 27, 2024

Thanks, I have already got convergence using main_fp16_optimizer.py with --dynamic-loss-scale.
I am just curious why --static-loss-scale 1 didn't work. I will figure it out.


mcarilli commented on August 27, 2024

I wouldn't worry too much about that. It's probably not your fault, or the script's fault. The paper says they used PyTorch, but it doesn't say which versions of cuDNN/cuBLAS they were using, which can definitely make a difference. In my opinion it's better to use loss scaling for safety, and to understand the theoretical reasons why loss scaling is helpful.
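
The core of it is FP16 underflow: FP16 can't represent magnitudes much below about 6e-8, so small gradient values flush to zero unless the loss (and therefore every gradient) is scaled up first. A two-line check:

```python
import torch

g = torch.tensor(1e-8)
print(g.half())          # tensor(0., dtype=torch.float16) -- underflows to zero
print((g * 128).half())  # a small nonzero FP16 value, the information survives
```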

This talk starting at slide 12
http://on-demand.gputechconf.com/gtc/2018/presentation/s8923-training-neural-networks-with-mixed-precision-theory-and-practice.pdf
and this talk starting at slide 31
http://on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-1_Internal%20Speaker_Michael%20Carilli_PDF%20For%20Sharing.pdf
are good resources.

