
ZeroDivisionError in backward · apex · CLOSED (6 comments)

nvidia commented on August 27, 2024
ZeroDivisionError in backward


Comments (6)

mcarilli commented on August 27, 2024

@carlc-nv is the primary developer of Amp; I've let him know.


cbcase commented on August 27, 2024

Hi @furybubu, thanks for reporting this issue.

A couple questions:

  • When you say "working CNN," you mean that the model trains acceptably in fp32?
  • Could you share a little more about the details of the model, optimizer, and dataset?
  • If you are OK sharing more about the model, could you do the following:
    • add verbose=True to the amp init call (i.e., amp_handle = amp.init(verbose=True, ...))
    • Run one iteration of the model
    • Share the output

That last step will log exactly which casts amp is inserting into the model (a minimal sketch of such a run follows below).
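For reference, a minimal sketch of that verbose run, assuming the amp_handle interface used in this thread (model, data, target, criterion, and optimizer are placeholders for the user's existing training code):

```python
from apex import amp

# Ask amp to log every cast it inserts (the verbose flag mentioned above).
amp_handle = amp.init(verbose=True)

# One iteration is enough to see the inserted casts in the output.
output = model(data)                      # placeholder model / data
loss = criterion(output, target)          # placeholder criterion / target
with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```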

The specific issue you are observing is that the "fp16 loss scale" is becoming increasingly small until it becomes zero. This suggests to me there is a different fp16-related issue, since the loss scale decreases only when there is an inf or a NaN in the gradient -- and that should not happen for many iterations in a row (which it would have to for the loss scale to get all the way to zero).
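For intuition, dynamic loss scaling behaves roughly like the sketch below (a simplified illustration of the behavior described above, not apex's actual implementation): the scale is cut on every overflow and only grows back after a run of clean steps, so it can only collapse toward zero if overflows keep happening.

```python
class ToyLossScaler:
    """Simplified dynamic loss scaling, for illustration only."""

    def __init__(self, init_scale=2.0 ** 15, factor=2.0, window=2000):
        self.scale = init_scale
        self.factor = factor
        self.window = window        # clean steps required before growing again
        self.clean_steps = 0

    def update(self, overflow_seen):
        if overflow_seen:
            # An inf/NaN in the gradients halves the scale. Many overflows in a
            # row drive the scale toward zero, at which point dividing by it
            # (as in the traceback quoted later in this thread) fails.
            self.scale /= self.factor
            self.clean_steps = 0
        else:
            self.clean_steps += 1
            if self.clean_steps % self.window == 0:
                self.scale *= self.factor
```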


furybubu commented on August 27, 2024

Hi @cbcase,
Yes, by "working CNN" I meant a CNN trained in fp32, not with mixed precision. I cannot share much about my model, but I will try to give you as much as I can:
My model has 5 conv layers interspersed with maxpool layers and a couple of fully connected layers at the end. I use the Adam optimizer, nothing too fancy.
The dataset is pretty large, about a million volumetric samples. I use a batch size of 20 split across 2 GPUs (10 each) with DataParallel. My model trains beautifully when I do not enable mixed precision training.

I will try to rerun it with the verbose flag to see if I get more clues in the output.
Thanks!
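For concreteness, a hypothetical stand-in for that kind of setup (not the reporter's actual model; all layer sizes and shapes are made up) could look like this:

```python
import torch
import torch.nn as nn

class VolumetricCNN(nn.Module):
    """Made-up 3D CNN: conv/maxpool blocks followed by fully connected layers."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 4 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = nn.DataParallel(VolumetricCNN().cuda())   # batch of 20 split 10/10 across 2 GPUs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

data = torch.randn(20, 1, 32, 32, 32).cuda()      # dummy volumetric batch
target = torch.randn(20, 1).cuda()
```

Running such a model through the verbose iteration sketched earlier would produce cast logs like the ones in the next comment.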


furybubu commented on August 27, 2024

So I basically get things like this:
```
Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (linear) Float->Half (conv3d) Float->Half (linear) Float->Half (conv3d) Float->Half (conv3d) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Half->Float (mse_loss)
```
And then same error.


cbcase commented on August 27, 2024

Hi @furybubu,

I'm looking at adding better debugging support for when there are mixed precision issues. If you're interested in being a guinea pig, I've pushed work-in-progress changes to this branch: https://github.com/NVIDIA/apex/tree/amp_debug. You can check it out and install it in the usual way.

Right now, there's just one function handle.run_debug(model, loss_fn) that will print out a "debug report" of sorts. The input arguments are:

  • model: your PyTorch model Module
  • loss_fn: a function that, when invoked, will return the loss on a fixed input / output pair

Here's what that looks like in practice:

Sample original code:

```python
data, target = load_data()  # however you load data
output = model(data)
loss = criterion(output, target)
...
```

To run debug:

```python
data, target = load_data()

def loss_fn():
    output = model(data)
    return criterion(output, target)

handle.run_debug(model, loss_fn)
```

The debug script will do three things:

  1. Run forward / backward in mixed precision (without any loss scale) and print out any observed inf / nan values and in which module they occur. I believe this can help us diagnose the issue you are seeing.
  2. Print the gradient norm and absolute max value for each model parameter. This is probably not so useful in your case, though it may make it easier to see where the overflow values occur (a plain-PyTorch approximation of this step is sketched after this list).
  3. Find the largest possible loss scale without overflow and compare the gradients computed in fp32 and with mixed precision. This can help identify bugs in mixed precision code (i.e., apex).
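If installing the debug branch isn't an option, step 2 can be approximated in plain PyTorch after a backward pass (a rough sketch, not the actual run_debug implementation):

```python
def print_grad_stats(model):
    """Print the gradient norm and absolute max for each named parameter."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.data.float()
        print("{}: norm={:.4e} absmax={:.4e}".format(
            name, g.norm().item(), g.abs().max().item()))
```

Calling print_grad_stats(model) right after a backward pass shows which layers' gradients are unusually large or have become inf/NaN.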

Let us know if you're able to try this out and what you learn! In particular, I would be interested to hear:

  • Which modules / parameter names show overflows on the first iteration?
  • The same, but for a model that has been trained in fp32 for a bit (so the parameters are no longer at their initial values)


yuribd commented on August 27, 2024

Hi all!

I wonder if other people have reported a similar issue, and what the solution was. I observe the same issue in my case (see below). At the same time, using FP16_Optimizer with dynamic_loss_scale=True works just fine.

```
     36             if p.grad is not None:
     37                 self._has_overflow = scale_check_overflow(p.grad.data,
---> 38                                                           1. / scale)
     39             if self._has_overflow:
     40                 break

ZeroDivisionError: float division by zero
```

That's using the approach suggested here.

It reduces the loss scale gradually from 2^15 down to 8 and then breaks:

```
Overflowed with loss scale 16384.0.  Reducing loss scale and replaying
Overflowed with loss scale 8192.0.  Reducing loss scale and replaying
Overflowed with loss scale 4096.0.  Reducing loss scale and replaying
Overflowed with loss scale 2048.0.  Reducing loss scale and replaying
Overflowed with loss scale 1024.0.  Reducing loss scale and replaying
Overflowed with loss scale 512.0.  Reducing loss scale and replaying
Overflowed with loss scale 256.0.  Reducing loss scale and replaying
Overflowed with loss scale 128.0.  Reducing loss scale and replaying
Overflowed with loss scale 64.0.  Reducing loss scale and replaying
Overflowed with loss scale 32.0.  Reducing loss scale and replaying
Overflowed with loss scale 16.0.  Reducing loss scale and replaying
Overflowed with loss scale 8.0.  Reducing loss scale and replaying
```
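For comparison, the FP16_Optimizer path mentioned above looks roughly like this (a sketch based on the apex.fp16_utils API; exact arguments and model/input handling may differ by apex version, and model, data, target, criterion are placeholders):

```python
import torch
from apex.fp16_utils import FP16_Optimizer, network_to_half

# Convert the model to half precision and wrap the optimizer;
# dynamic loss scaling is then handled inside FP16_Optimizer.
model = network_to_half(model.cuda())
optimizer = FP16_Optimizer(torch.optim.Adam(model.parameters(), lr=1e-4),
                           dynamic_loss_scale=True)

optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
optimizer.backward(loss)   # replaces loss.backward()
optimizer.step()
```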

