Error with latest pytorch head · apex · 13 comments · CLOSED

nvidia commented on August 27, 2024

Error with latest pytorch head

Comments (13)

ngimel commented on August 27, 2024

What cudnn version are you using? Can you please collect cudnn API logs (http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#api-logging)?
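cuDNN's API logging is controlled by environment variables described in that guide; a minimal sketch, assuming the variables are set before cuDNN is first initialized (e.g. before importing torch, or exported in the shell that launches the script):

    import os

    # cuDNN API logging knobs from the developer guide; must be set before
    # cuDNN is first initialized, so set them before importing torch.
    os.environ["CUDNN_LOGINFO_DBG"] = "1"          # enable informational logging
    os.environ["CUDNN_LOGDEST_DBG"] = "cudnn.log"  # destination: a file, stdout, or stderr

    import torch
    print(torch.backends.cudnn.version())          # confirm cuDNN is visible to PyTorch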

mcarilli commented on August 27, 2024

Apex utilities and scripts are meant to work and are regularly tested with top of tree Pytorch (1.0). The python-only utilities should support all versions from 0.4 to present. The cpp and cuda extensions are only officially supported for 1.0.

So your use case (with the latest Pytorch head) should succeed, although it's possible a change may have entered very recently that broke something. I'll try to reproduce on my end. In the meantime, to sandbox the issue, can you try running main.py instead of main_fp16_optimizer.py? main.py implements mixed precision training manually, so if this also fails, it may indicate a deeper problem (e.g. with cudnn as Natalia says).
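For context, the pattern main.py follows is roughly the classic manual recipe (an illustrative sketch only, not the actual example script): an FP16 copy of the model for forward/backward, an FP32 master copy of the weights for the optimizer, and a static loss scale so small gradients survive FP16.

    import torch

    # Illustrative manual mixed-precision step: FP16 model/activations,
    # FP32 master weights owned by the optimizer, static loss scaling.
    model = torch.nn.Linear(1024, 1024).cuda().half()
    master_params = [p.detach().clone().float() for p in model.parameters()]
    for mp in master_params:
        mp.requires_grad = True
    optimizer = torch.optim.SGD(master_params, lr=0.1)
    loss_scale = 128.0

    data = torch.randn(32, 1024, device="cuda", dtype=torch.half)
    target = torch.randn(32, 1024, device="cuda", dtype=torch.half)

    loss = torch.nn.functional.mse_loss(model(data), target)
    (loss * loss_scale).backward()                  # backward runs in FP16 on the scaled loss

    # Unscale the FP16 grads into the FP32 master params, step, then copy back.
    for mp, p in zip(master_params, model.parameters()):
        mp.grad = p.grad.detach().float() / loss_scale
    optimizer.step()
    with torch.no_grad():
        for mp, p in zip(master_params, model.parameters()):
            p.copy_(mp)
    model.zero_grad()

If this manual path also hits the illegal memory access, the problem is below apex (in cudnn or the framework) rather than in the fp16 optimizer utilities.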

yaysummeriscoming commented on August 27, 2024

Thanks for the quick replies. Sorry, I'm using cuda 9.2 and cudnn 7.3.0. I tried using main.py, but it leads to the same error, this time on line 339 where the backward pass is called.

I've enabled cudnn debug logging and attached the log output.

apex error cudnn log word.docx

Interestingly, whilst I was trying to run the cudnn logger on just the backward pass line, I got a different error:

THCudaCheck FAIL file=/home/user/pytorch/aten/src/THCUNN/generic/Threshold.cu line=67 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/user/.pycharm_helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/home/user/.pycharm_helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/user/.pycharm_helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "/home/user/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/user/project/apex/examples/imagenet/main_mine.py", line 505, in <module>
    main()
  File "/home/user/project/apex/examples/imagenet/main_mine.py", line 230, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "/home/user/project/apex/examples/imagenet/main_mine.py", line 339, in train
    loss.backward()
  File "/home/user/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /home/user/pytorch/aten/src/THCUNN/generic/Threshold.cu:67

I! CuDNN (v7300) function cudnnDestroy() called:
i! Time: 2018-11-03T11:20:21.576597 (0d+0h+0m+18s since start)
i! Process=3997; Thread=3997; GPU=NULL; Handle=NULL; StreamId=NULL.

This is with execution halted for a few seconds before running that line.

mcarilli commented on August 27, 2024

Thank you for the thorough investigation. This definitely smells like some sort of cudnn error, although not one I've observed before on this particular script. I'll try it with cuda 9.2 and cudnn 7.3.0. Are you running in one of the official containers (e.g. nightly-devel-cuda9.2-cudnn7 from https://hub.docker.com/r/pytorch/pytorch/tags/)?

yaysummeriscoming commented on August 27, 2024

Thanks for the prompt support! No, I'm running a Google Cloud image based on Ubuntu 16.04.5.

ngimel commented on August 27, 2024

Can you please try with cudnn 7.3.1? There were a few bugs present in 7.3.0 that were fixed in 7.3.1.
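After rebuilding, a quick sanity check (just standard PyTorch calls, not something specific to apex) to confirm which CUDA/cuDNN build PyTorch actually linked against:

    import torch

    # Versions as PyTorch sees them; cuDNN is reported as an integer,
    # e.g. 7301 for 7.3.1.
    print(torch.__version__)
    print(torch.version.cuda)
    print(torch.backends.cudnn.version())
    print(torch.backends.cudnn.is_available())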

yaysummeriscoming commented on August 27, 2024

I rebuilt the latest pytorch head (6/11/18) together with cudnn 7.3.1 as you suggested; however, it's still failing on the backward pass.

mcarilli commented on August 27, 2024

Ok, I've reproduced the error. It seems to be caused by excessive memory use. Try running with batch size 224 instead of 256. 256 used to fit (just barely) on a 16GB card, so maybe there's been some creep in either Pytorch's memory usage or cudnn's workspace usage. The error occurs, as far as I can tell, on the very first backward pass, when cudnn is still running its internal heuristics (as commanded by cudnn.benchmark = True) to choose and cache the fastest algorithms for the new gemm and convolution sizes it's discovering.
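If it helps to confirm that diagnosis, here is a rough sketch (assuming torchvision is installed; it is not the example script itself) of one FP16 iteration at the reduced batch size with benchmarking on, reporting peak device memory:

    import torch
    import torchvision

    torch.backends.cudnn.benchmark = True             # same setting as the example script
    model = torchvision.models.resnet50().cuda().half()
    criterion = torch.nn.CrossEntropyLoss().cuda()

    batch_size = 224                                   # reduced from 256
    images = torch.randn(batch_size, 3, 224, 224, device="cuda", dtype=torch.half)
    target = torch.randint(0, 1000, (batch_size,), device="cuda")

    loss = criterion(model(images), target)
    loss.backward()                                    # first backward triggers cudnn's algorithm search
    torch.cuda.synchronize()
    print("peak memory: %.2f GiB" % (torch.cuda.max_memory_allocated() / 2**30))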

yaysummeriscoming commented on August 27, 2024

Alright I finally got a chance to test this today. Google cloud preemptible V100 GPUs have been in very short supply the last few days. I suspect this has something to do with the CVPR2019 paper deadline this week.

Lowering the batch size indeed fixes the problem, but I'm now getting some really weird maximum batch sizes before hitting some kind of OOM error:

Resnet50
Fp32: 128
Fp16: 224

Resnet18
Fp32: 512
Fp16: 224

Generally I’ve observed that you can normally double the batch size when moving to fp16 with Apex. For resnet50 this holds, although previously I was able to use a max batch size of 256. For resnet18, I’m now only able to train with half the maximum batch size of standard float32 training…. Am I missing something here?

mcarilli commented on August 27, 2024

Yeah that is weird. I'll try to reproduce this memory usage pattern.

yaysummeriscoming commented on August 27, 2024

So I've rebuilt pytorch from the dev head today (21/11/18) together with cuda 10 and cudnn 7.4.1, and I'm no longer seeing this memory issue.

I can now train resnet18 with a max batch size of 768 and resnet50 with a max batch size of 256 in fp16 mode.

mcarilli commented on August 27, 2024

Sorry I didn't get a chance to look into this yet, I've had a lot of input streams recently/always, but I'm glad upgrading fixed the issue. Now I know what to look for in the future, so thanks for reporting this!

yaysummeriscoming commented on August 27, 2024

No worries, thanks for taking the time to make such a great package!
