🐛 Describe the bug I want to train a model on HPC using SLURM and

CUDA out of memory still exists after using FSDP

TuyetHan commented on May 8, 2024

CUDA out of memory still exists after using FSDP

Comments (1)

awgu commented on May 8, 2024

It looks like you are seeing out-of-memory (OOM) because your activation size is too large, which is not directly related to FSDP:

 File "/project/p_trancal/trsclbjob/lib/python3.10/site-packages/torch/nn/modules/activation.py", line 1126, in forward
   attn_mask = F._canonical_mask(
 File "/project/p_trancal/trsclbjob/lib/python3.10/site-packages/torch/nn/functional.py", line 5115, in _canonical_mask
   torch.zeros_like(mask, dtype=target_type)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 29.07 GiB. GPU 3 has a total capacity of 39.43 GiB of which 25.15 GiB is free. Including non-PyTorch memory, this process has 14.27 GiB memory in use. Of the allocated memory 11.74 GiB is allocated by PyTorch, and 932.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

You may want to check your input activation sizes since you are trying to allocate a 29.07 GiB attn_mask.
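As a rough way to check this, the size of a dense mask can be estimated from its shape before launching the job. The traceback shows the allocation coming from `torch.zeros_like(mask, dtype=target_type)`, so the new tensor has the same shape as the mask that was passed in. The sketch below is only an illustration: the helper name `mask_gib`, the `(batch * num_heads, tgt_len, src_len)` layout, the 4-byte element size, and the example shapes are all assumptions, not values taken from the original report.

```python
def mask_gib(batch, num_heads, tgt_len, src_len, bytes_per_elem=4):
    """Estimate the size in GiB of a dense attention mask.

    Assumes a mask laid out as (batch * num_heads, tgt_len, src_len)
    and a 4-byte element type (e.g. float32) unless told otherwise.
    """
    n_elems = batch * num_heads * tgt_len * src_len
    return n_elems * bytes_per_elem / 2**30


# Hypothetical shapes, chosen only to show how fast the mask grows.
per_head_mask = mask_gib(batch=8, num_heads=8, tgt_len=4096, src_len=4096)
shared_2d_mask = mask_gib(batch=1, num_heads=1, tgt_len=4096, src_len=4096)

print(f"3D mask, one slice per batch item and head: {per_head_mask} GiB")
print(f"2D mask shared across batch and heads:      {shared_2d_mask} GiB")
```

The mask grows quadratically with sequence length, so if the estimate for your actual shapes comes out near the 29.07 GiB in the error, the mask itself is the likely culprit. Things to look at in that case are the sequence length, the batch size, and whether a mask expanded per batch item and head is needed at all.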
