Error with latest pytorch head · apex · 13 comments · CLOSED

nvidia commented on August 27, 2024

Error with latest pytorch head

Comments (13)

ngimel commented on August 27, 2024

What cudnn version are you using? Can you please collect cudnn API logs (http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#api-logging)?
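cuDNN's API logging is controlled by environment variables described in that guide; a minimal sketch, assuming the variables are set before cuDNN is first initialized (e.g. before importing torch, or exported in the shell that launches the script):

    import os

    # cuDNN API logging knobs from the developer guide; must be set before
    # cuDNN is first initialized, so set them before importing torch.
    os.environ["CUDNN_LOGINFO_DBG"] = "1"          # enable informational logging
    os.environ["CUDNN_LOGDEST_DBG"] = "cudnn.log"  # destination: a file, stdout, or stderr

    import torch
    print(torch.backends.cudnn.version())          # confirm cuDNN is visible to PyTorch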

mcarilli commented on August 27, 2024

Apex utilities and scripts are meant to work and are regularly tested with top of tree Pytorch (1.0). The python-only utilities should support all versions from 0.4 to present. The cpp and cuda extensions are only officially supported for 1.0.

So your use case (with the latest Pytorch head) should succeed, although it's possible a change may have entered very recently that broke something. I'll try to reproduce on my end. In the meantime, to sandbox the issue, can you try running main.py instead of main_fp16_optimizer.py? main.py implements mixed precision training manually, so if this also fails, it may indicate a deeper problem (e.g. with cudnn as Natalia says).
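For context, the pattern main.py follows is roughly the classic manual recipe (an illustrative sketch only, not the actual example script): an FP16 copy of the model for forward/backward, an FP32 master copy of the weights for the optimizer, and a static loss scale so small gradients survive FP16.

    import torch

    # Illustrative manual mixed-precision step: FP16 model/activations,
    # FP32 master weights owned by the optimizer, static loss scaling.
    model = torch.nn.Linear(1024, 1024).cuda().half()
    master_params = [p.detach().clone().float() for p in model.parameters()]
    for mp in master_params:
        mp.requires_grad = True
    optimizer = torch.optim.SGD(master_params, lr=0.1)
    loss_scale = 128.0

    data = torch.randn(32, 1024, device="cuda", dtype=torch.half)
    target = torch.randn(32, 1024, device="cuda", dtype=torch.half)

    loss = torch.nn.functional.mse_loss(model(data), target)
    (loss * loss_scale).backward()                  # backward runs in FP16 on the scaled loss

    # Unscale the FP16 grads into the FP32 master params, step, then copy back.
    for mp, p in zip(master_params, model.parameters()):
        mp.grad = p.grad.detach().float() / loss_scale
    optimizer.step()
    with torch.no_grad():
        for mp, p in zip(master_params, model.parameters()):
            p.copy_(mp)
    model.zero_grad()

If this manual path also hits the illegal memory access, the problem is below apex (in cudnn or the framework) rather than in the fp16 optimizer utilities.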

yaysummeriscoming commented on August 27, 2024

Thanks for the quick replies. Sorry, I'm using cuda 9.2 and cudnn 7.3.0. I tried using main.py, but it leads to the same error, this time on line 339 where the backward pass is called.

I've enabled cudnn debug logging and attached the log output.

apex error cudnn log word.docx

Interestingly, whilst I was trying to run the cudnn logger on just the backward pass line, I got a different error:

THCudaCheck FAIL file=/home/user/pytorch/aten/src/THCUNN/generic/Threshold.cu line=67 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/user/.pycharm_helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/home/user/.pycharm_helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/user/.pycharm_helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "/home/user/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/user/project/apex/examples/imagenet/main_mine.py", line 505, in <module>
    main()
  File "/home/user/project/apex/examples/imagenet/main_mine.py", line 230, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "/home/user/project/apex/examples/imagenet/main_mine.py", line 339, in train
    loss.backward()
  File "/home/user/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /home/user/pytorch/aten/src/THCUNN/generic/Threshold.cu:67

I! CuDNN (v7300) function cudnnDestroy() called:
i! Time: 2018-11-03T11:20:21.576597 (0d+0h+0m+18s since start)
i! Process=3997; Thread=3997; GPU=NULL; Handle=NULL; StreamId=NULL.

This is with execution halted for a few seconds before running that line.

mcarilli commented on August 27, 2024

Thank you for the thorough investigation. This definitely smells like some sort of cudnn error, although not one I've observed before on this particular script. I'll try it with cuda 9.2 and cudnn 7.3.0. Are you running in one of the official containers (e.g. nightly-devel-cuda9.2-cudnn7 from https://hub.docker.com/r/pytorch/pytorch/tags/)?

yaysummeriscoming commented on August 27, 2024

Thanks for the prompt support! No, I'm running a Google Cloud image based on Ubuntu 16.04.5.

ngimel commented on August 27, 2024

Can you please try with cudnn 7.3.1? There were a few bugs present in 7.3.0 that were fixed in 7.3.1.
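After rebuilding, a quick sanity check (just standard PyTorch calls, not something specific to apex) to confirm which CUDA/cuDNN build PyTorch actually linked against:

    import torch

    # Versions as PyTorch sees them; cuDNN is reported as an integer,
    # e.g. 7301 for 7.3.1.
    print(torch.__version__)
    print(torch.version.cuda)
    print(torch.backends.cudnn.version())
    print(torch.backends.cudnn.is_available())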

yaysummeriscoming commented on August 27, 2024

I rebuilt the latest pytorch head (6/11/18) together with cudnn 7.3.1 as you suggested; however, it's still failing on the backward pass.

mcarilli commented on August 27, 2024

Ok, I've reproduced the error. It seems to be caused by excessive memory use. Try running with batch size 224 instead of 256. 256 used to fit (just barely) on a 16GB card, so maybe there's been some creep in either Pytorch's memory usage or cudnn's workspace usage. The error occurs, as far as I can tell, on the very first backward pass, when cudnn is still running its internal heuristics (as commanded by cudnn.benchmark = True) to choose and cache the fastest algorithms for the new gemm and convolution sizes it's discovering.
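If it helps to confirm that diagnosis, here is a rough sketch (assuming torchvision is installed; it is not the example script itself) of one FP16 iteration at the reduced batch size with benchmarking on, reporting peak device memory:

    import torch
    import torchvision

    torch.backends.cudnn.benchmark = True             # same setting as the example script
    model = torchvision.models.resnet50().cuda().half()
    criterion = torch.nn.CrossEntropyLoss().cuda()

    batch_size = 224                                   # reduced from 256
    images = torch.randn(batch_size, 3, 224, 224, device="cuda", dtype=torch.half)
    target = torch.randint(0, 1000, (batch_size,), device="cuda")

    loss = criterion(model(images), target)
    loss.backward()                                    # first backward triggers cudnn's algorithm search
    torch.cuda.synchronize()
    print("peak memory: %.2f GiB" % (torch.cuda.max_memory_allocated() / 2**30))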

yaysummeriscoming commented on August 27, 2024

Alright I finally got a chance to test this today. Google cloud preemptible V100 GPUs have been in very short supply the last few days. I suspect this has something to do with the CVPR2019 paper deadline this week.

Lowering the batch size indeed fixes the problem, but I'm now getting some really weird maximum batch sizes before hitting some kind of OOM error:

Resnet50
Fp32: 128
Fp16: 224

Resnet18
Fp32: 512
Fp16: 224

Generally I’ve observed that you can normally double the batch size when moving to fp16 with Apex. For resnet50 this holds, although previously I was able to use a max batch size of 256. For resnet18, I’m now only able to train with half the maximum batch size of standard float32 training…. Am I missing something here?

mcarilli commented on August 27, 2024

Yeah that is weird. I'll try to reproduce this memory usage pattern.

yaysummeriscoming commented on August 27, 2024

So I've rebuilt pytorch from the dev head today (21/11/18) together with cuda 10 and cudnn 7.4.1, and I'm no longer seeing this memory issue.

I can now train resnet18 with a max batch size of 768 and resnet50 with a max batch size of 256 in fp16 mode.

mcarilli commented on August 27, 2024

Sorry I didn't get a chance to look into this yet, I've had a lot of input streams recently/always, but I'm glad upgrading fixed the issue. Now I know what to look for in the future, so thanks for reporting this!

yaysummeriscoming commented on August 27, 2024

No worries, thanks for taking the time to make such a great package!
