
ZeroDivisionError in backward · apex · CLOSED (6 comments)

nvidia commented on August 27, 2024
ZeroDivisionError in backward


Comments (6)

mcarilli commented on August 27, 2024

@carlc-nv is the primary developer of Amp; I've let him know.


cbcase commented on August 27, 2024

Hi @furybubu, thanks for reporting this issue.

A couple questions:

  • When you say "working CNN," you mean that the model trains acceptably in fp32?
  • Could you share a little more about the details of the model, optimizer, and dataset?
  • If you are OK sharing more about the model, could you do the following:
    • add verbose=True to the amp init call (i.e., amp_handle = amp.init(verbose=True, ...))
    • Run one iteration of the model
    • Share the output

That last step will log exactly which casts amp is inserting into the model (a minimal sketch of such a run follows below).
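For reference, a minimal sketch of that verbose run, assuming the amp_handle interface used in this thread (model, data, target, criterion, and optimizer are placeholders for the user's existing training code):

```python
from apex import amp

# Ask amp to log every cast it inserts (the verbose flag mentioned above).
amp_handle = amp.init(verbose=True)

# One iteration is enough to see the inserted casts in the output.
output = model(data)                      # placeholder model / data
loss = criterion(output, target)          # placeholder criterion / target
with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```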

The specific issue you are observing is that the "fp16 loss scale" is becoming increasingly small until it becomes zero. This suggests to me there is a different fp16-related issue, since the loss scale decreases only when there is an inf or a NaN in the gradient -- and that should not happen for many iterations in a row (which it would have to for the loss scale to get all the way to zero).
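For intuition, dynamic loss scaling behaves roughly like the sketch below (a simplified illustration of the behavior described above, not apex's actual implementation): the scale is cut on every overflow and only grows back after a run of clean steps, so it can only collapse toward zero if overflows keep happening.

```python
class ToyLossScaler:
    """Simplified dynamic loss scaling, for illustration only."""

    def __init__(self, init_scale=2.0 ** 15, factor=2.0, window=2000):
        self.scale = init_scale
        self.factor = factor
        self.window = window        # clean steps required before growing again
        self.clean_steps = 0

    def update(self, overflow_seen):
        if overflow_seen:
            # An inf/NaN in the gradients halves the scale. Many overflows in a
            # row drive the scale toward zero, at which point dividing by it
            # (as in the traceback quoted later in this thread) fails.
            self.scale /= self.factor
            self.clean_steps = 0
        else:
            self.clean_steps += 1
            if self.clean_steps % self.window == 0:
                self.scale *= self.factor
```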


furybubu commented on August 27, 2024

Hi @cbcase,
Yes, by "working CNN" I meant a CNN trained in fp32, not with mixed precision. I cannot share much about my model, but I will try to give you as much as I can:
My model has 5 conv layers interspersed with maxpool layers and a couple of fully connected layers at the end. I use the Adam optimizer, nothing too fancy.
The dataset is pretty large, about a million volumetric samples. I use a batch size of 20 split across 2 GPUs (10 each) with DataParallel. My model trains beautifully when I do not enable mixed precision training.

I will try to rerun it with the verbose flag to see if I get more clues in the output.
Thanks!
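For concreteness, a hypothetical stand-in for that kind of setup (not the reporter's actual model; all layer sizes and shapes are made up) could look like this:

```python
import torch
import torch.nn as nn

class VolumetricCNN(nn.Module):
    """Made-up 3D CNN: conv/maxpool blocks followed by fully connected layers."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 4 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = nn.DataParallel(VolumetricCNN().cuda())   # batch of 20 split 10/10 across 2 GPUs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

data = torch.randn(20, 1, 32, 32, 32).cuda()      # dummy volumetric batch
target = torch.randn(20, 1).cuda()
```

Running such a model through the verbose iteration sketched earlier would produce cast logs like the ones in the next comment.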


furybubu commented on August 27, 2024

So I basically get things like this:
```
Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (conv3d) Float->Half (linear) Float->Half (conv3d) Float->Half (linear) Float->Half (conv3d) Float->Half (conv3d) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Float->Half (linear) Half->Float (mse_loss)
```
And then same error.


cbcase commented on August 27, 2024

Hi @furybubu,

I'm looking at adding better debugging support for when there are mixed precision issues. If you're interested in being a guinea pig, I've pushed work-in-progress changes to this branch: https://github.com/NVIDIA/apex/tree/amp_debug. You can check it out and install it in the usual way.

Right now, there's just one function handle.run_debug(model, loss_fn) that will print out a "debug report" of sorts. The input arguments are:

  • model: your PyTorch model Module
  • loss_fn: a function that, when invoked, will return the loss on a fixed input / output pair

Here's what that looks like in practice:

Sample original code:

```python
data, target = load_data()  # however you load data
output = model(data)
loss = criterion(output, target)
...
```

To run debug:

```python
data, target = load_data()

def loss_fn():
    output = model(data)
    return criterion(output, target)

handle.run_debug(model, loss_fn)
```

The debug script will do three things:

  1. Run forward / backward in mixed precision (without any loss scale) and print out any observed inf / nan values and in which module they occur. I believe this can help us diagnose the issue you are seeing.
  2. Print the gradient norm and absolute max value for each model parameter. This is probably not so useful in your case, though it may make it easier to see where the overflow values occur (a plain-PyTorch approximation of this step is sketched after this list).
  3. Find the largest possible loss scale without overflow and compare the gradients computed in fp32 and with mixed precision. This can help identify bugs in mixed precision code (i.e., apex).
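If installing the debug branch isn't an option, step 2 can be approximated in plain PyTorch after a backward pass (a rough sketch, not the actual run_debug implementation):

```python
def print_grad_stats(model):
    """Print the gradient norm and absolute max for each named parameter."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.data.float()
        print("{}: norm={:.4e} absmax={:.4e}".format(
            name, g.norm().item(), g.abs().max().item()))
```

Calling print_grad_stats(model) right after a backward pass shows which layers' gradients are unusually large or have become inf/NaN.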

Let us know if you're able to try this out and what you learn! In particular, I would be interested to hear:

  • Which modules / parameter names show overflows on the first iteration?
  • The same, but for a model that has been trained in fp32 for a bit (so the parameters are no longer at their initial values)


yuribd commented on August 27, 2024

Hi all!

I wonder if other people have reported a similar issue, and what the solution was. I observe the same issue in my case (see below). At the same time, using FP16_Optimizer with dynamic_loss_scale=True works just fine.

```
     36             if p.grad is not None:
     37                 self._has_overflow = scale_check_overflow(p.grad.data,
---> 38                                                           1. / scale)
     39             if self._has_overflow:
     40                 break

ZeroDivisionError: float division by zero
```

That's using the approach suggested here.

It reduces the loss scale gradually from 2^15 down to 8 and then breaks:

```
Overflowed with loss scale 16384.0.  Reducing loss scale and replaying
Overflowed with loss scale 8192.0.  Reducing loss scale and replaying
Overflowed with loss scale 4096.0.  Reducing loss scale and replaying
Overflowed with loss scale 2048.0.  Reducing loss scale and replaying
Overflowed with loss scale 1024.0.  Reducing loss scale and replaying
Overflowed with loss scale 512.0.  Reducing loss scale and replaying
Overflowed with loss scale 256.0.  Reducing loss scale and replaying
Overflowed with loss scale 128.0.  Reducing loss scale and replaying
Overflowed with loss scale 64.0.  Reducing loss scale and replaying
Overflowed with loss scale 32.0.  Reducing loss scale and replaying
Overflowed with loss scale 16.0.  Reducing loss scale and replaying
Overflowed with loss scale 8.0.  Reducing loss scale and replaying
```
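For comparison, the FP16_Optimizer path mentioned above looks roughly like this (a sketch based on the apex.fp16_utils API; exact arguments and model/input handling may differ by apex version, and model, data, target, criterion are placeholders):

```python
import torch
from apex.fp16_utils import FP16_Optimizer, network_to_half

# Convert the model to half precision and wrap the optimizer;
# dynamic loss scaling is then handled inside FP16_Optimizer.
model = network_to_half(model.cuda())
optimizer = FP16_Optimizer(torch.optim.Adam(model.parameters(), lr=1e-4),
                           dynamic_loss_scale=True)

optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
optimizer.backward(loss)   # replaces loss.backward()
optimizer.step()
```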

