Hi, Thanks for your wonderful work and detailed tutorial. I am just a fresh ne

About RuntimeError: CUDA out of memory (light-weight-refinenet issue, 4 comments, closed)

drsleep commented on July 18, 2024

About RuntimeError: CUDA out of memory

from light-weight-refinenet.

Comments (4)

arindamrc commented on July 18, 2024

I'm observing similar behaviour as well. I can train only with a batch size of 1, and the GPU memory isn't fully utilized either. I'm training on a GTX 1080 with 8 GB of VRAM.

DrSleep commented on July 18, 2024

Can't help with this one; I would suggest making sure that no other GPU processes are running alongside.
I think a 1080 should be enough for a batch size of 1. For reference, I am using a 1080 Ti with a batch size of 6.
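
One way to check for other GPU processes is to query `nvidia-smi` from Python. This is only a rough sketch, not part of the repository: the helper names `parse_compute_apps` and `list_gpu_processes` are made up for illustration, and it assumes `nvidia-smi` is on the PATH.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows (nvidia-smi CSV, no header)."""
    apps = []
    for line in csv_text.strip().splitlines():
        try:
            # Split off the first and last fields so a comma inside the
            # process name does not break the row.
            pid, rest = line.split(",", 1)
            name, used_memory = rest.rsplit(",", 1)
            apps.append({
                "pid": int(pid.strip()),
                "name": name.strip(),
                "used_memory": used_memory.strip(),
            })
        except ValueError:
            # Skip rows that do not have the expected three fields.
            continue
    return apps


def list_gpu_processes():
    """Return compute processes reported by nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_compute_apps(result.stdout)
```

Calling `list_gpu_processes()` before starting training and printing the result should show whether anything else is already holding GPU memory.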

rfairhurst commented on July 18, 2024

I realize you can't help, but I am also getting this error. I am using an Nvidia Quadro P4000 with 8 GB of VRAM.

The Task Manager shows very low GPU memory usage until the program prints:

Train epoch: 0 [0/132] Avg. Loss: 3.711 Avg. Time: 2.425

Then the GPU memory usage jumps in under a second to over 90% and throws the error:

  File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 425, in <module>
    main()
  File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 409, in main
    args.freeze_bn[task_idx])
  File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 280, in train_segmenter
    loss.backward()
  File "C:\Users\rfairhur\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\torch\tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\rfairhur\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\torch\autograd\__init__.py", line 93, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 230.00 MiB (GPU 0; 8.00 GiB total capacity; 5.81 GiB already allocated; 159.27 MiB free; 333.44 MiB cached)

I believe my batch size is set to 1. Anyway, I will search Google about this error to see if there is anything I can try.
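
For anyone reading the error message above, the numbers can be pulled apart to see why the allocation failed. The snippet below is just an illustrative sketch; `parse_oom_message` is a made-up helper, and it assumes the message follows the same wording as the one quoted in this comment.

```python
import re

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message):
    """Extract the memory figures (converted to MiB) from a CUDA OOM message."""
    fields = {}
    requested = re.search(r"Tried to allocate ([\d.]+) (MiB|GiB)", message)
    if requested:
        fields["requested"] = float(requested.group(1)) * UNIT_TO_MIB[requested.group(2)]
    pattern = r"([\d.]+) (MiB|GiB) (total capacity|already allocated|free|cached)"
    for value, unit, label in re.findall(pattern, message):
        fields[label] = float(value) * UNIT_TO_MIB[unit]
    return fields


message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 230.00 MiB "
    "(GPU 0; 8.00 GiB total capacity; 5.81 GiB already allocated; "
    "159.27 MiB free; 333.44 MiB cached)"
)
fields = parse_oom_message(message)

# The request is larger than what the message reports as free.
shortfall = fields["requested"] - fields["free"]
print(f"requested {fields['requested']:.2f} MiB, free {fields['free']:.2f} MiB, "
      f"short by {shortfall:.2f} MiB")
```

Here the 230.00 MiB request exceeds the 159.27 MiB reported as free, while about 5.81 GiB of the 8 GiB card is already allocated, so the backward pass has no room for the extra buffer.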

rfairhurst commented on July 18, 2024

Apparently I was wrong about my batch size setting. It must have been set to 6 or higher, because when I made sure it was set to 5 or less the training ran successfully, but it failed if I set the batch size to 6.
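
The trial-and-error above (fails at 6, works at 5) could be automated along these lines. This is only a sketch under assumptions: `find_largest_batch_size` and `fake_train_step` are hypothetical names, and the fake step merely simulates the behaviour I saw on my card rather than calling the real training code.

```python
def find_largest_batch_size(try_batch, candidates):
    """Return the largest candidate for which try_batch does not run out of memory."""
    for batch_size in sorted(candidates, reverse=True):
        try:
            try_batch(batch_size)
        except RuntimeError as err:
            # Only swallow out-of-memory errors; re-raise anything else.
            if "out of memory" not in str(err):
                raise
            continue
        return batch_size
    return None


def fake_train_step(batch_size):
    """Stand-in for one real training step; simulates an 8 GB card failing at 6."""
    if batch_size >= 6:
        raise RuntimeError("CUDA out of memory. (simulated)")


best = find_largest_batch_size(fake_train_step, range(1, 9))
print(best)  # 5
```

With real PyTorch code, `try_batch` would run one forward and backward pass at the given batch size, and it would probably be worth calling `torch.cuda.empty_cache()` between attempts.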
