Comments (6)
Hi @lstein, thanks for reporting! This issue is similar to #21094. If you do the following, the memory should be freed:
import gc
import torch
from transformers import T5EncoderModel, BitsAndBytesConfig

print("\n* With quantized model *")
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = T5EncoderModel.from_pretrained(
    FULL_MODEL,
    torch_dtype=torch.float16,
    subfolder='text_encoder_3',
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
)
print('After loading, VRAM usage=', torch.cuda.memory_allocated())

referrers = gc.get_referrers(model)
print('Referrers = ', len(referrers))

# Drop the reference instead of using `del model`, then collect and empty the cache
model = None
print('After model deletion, VRAM usage=', torch.cuda.memory_allocated())
gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=', torch.cuda.memory_allocated())
LMK if this works on your side!
from transformers.
Hi @SunMarc, thanks so much for the rapid response! Unfortunately this doesn't seem to work on my side. Can I confirm that the only difference in the proposed solution is to replace del model with model = None?
Here is the output from a cut-and-paste of the proposed solution:
* With quantized model *
Downloading shards: 100%|███████████████████████████████████████████████████████| 2/2 [00:00<00:00, 6781.41it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.06it/s]
After loading, VRAM usage= 7918596096
Referrers = 7
After model deletion, VRAM usage= 7918596096
After gc_collect and empty_cache, VRAM usage= 7918596096
I’ll check out the other issue thread to try to understand the problem better.
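The "Referrers = 7" line above hints that something besides the local name may still be holding the model. A torch-free sketch (using a hypothetical FakeModel stand-in and a hypothetical cache dict) of how gc.get_referrers can surface a lingering holder:

```python
import gc

class FakeModel:
    """Stand-in for a large model object."""
    pass

model = FakeModel()
cache = {"encoder": model}  # hypothetical extra holder, e.g. a framework-level cache

# gc.get_referrers lists every gc-tracked object that currently references `model`;
# inspecting their types can reveal which container is keeping the model alive.
referrers = gc.get_referrers(model)
print(any(r is cache for r in referrers))  # → True: the cache dict shows up
```

If a container like this survives `model = None`, the parameters cannot be collected no matter how many times gc.collect() and empty_cache() are called.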
I've tried the various strategies described in Issue #21094, including calling accelerate.release_memory(), but without success so far. Also no change when explicitly loading with device_map="auto" or device_map="cuda".
Yeah, that's right. I replaced del model with model = None and it worked on my side. Here's my output:
* With quantized model *
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Downloading shards: 100%|████████████████████████████████████████████████| 2/2 [00:00<00:00, 18040.02it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████| 2/2 [00:21<00:00, 10.75s/it]
After loading, VRAM usage= 7918596096
After model deletion, VRAM usage= 7918596096
After gc_collect and empty_cache, VRAM usage= 0
Just to be sure, when checking nvidia-smi, the memory is not freed, is that right ?
I'm glad to hear it is working on your end! There must be some difference in our environments. Could you let me know what versions of Python, transformers, torch, accelerate and CUDA you're using? My info is at the top, except for the CUDA library, which is version 12.2.
Yes, I've confirmed with nvidia-smi that the memory is indeed allocated and used until the process ends.
I appreciate your working through this with me. It's become a bit of a blocker.
Interestingly, this almost works:
model = T5EncoderModel.from_pretrained(...)
state_dict = model.state_dict()
for k, v in state_dict.items():
    state_dict[k] = None
model = None
gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=', torch.cuda.memory_allocated())
The last line prints VRAM usage= 8192, so aside from a tiny residual allocation, the model parameters are freed.
Hi @lstein, I'm on the main branch of transformers and accelerate. As for torch, it's v2.3.0, and my CUDA version is 12.2 according to nvidia-smi.