Comments (6)
Hi @lstein, thanks for reporting! This issue is similar to #21094. If you do the following, the memory should be freed:
import gc
import torch
from transformers import T5EncoderModel, BitsAndBytesConfig

print("\n* With quantized model *")
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = T5EncoderModel.from_pretrained(
    FULL_MODEL,
    torch_dtype=torch.float16,
    subfolder='text_encoder_3',
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
)
print('After loading, VRAM usage=', torch.cuda.memory_allocated())

referrers = gc.get_referrers(model)
print('Referrers = ', len(referrers))

# Drop the reference instead of using `del model`, then collect and empty the cache
model = None
print('After model deletion, VRAM usage=', torch.cuda.memory_allocated())
gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=', torch.cuda.memory_allocated())
LMK if this works on your side!
from transformers.
Hi @SunMarc, thanks so much for the rapid response! Unfortunately this doesn't seem to work on my side. Can I confirm that the only difference in the proposed solution is to replace del model with model = None?
Here is the output from a cut-and-paste of the proposed solution:
* With quantized model *
Downloading shards: 100%|███████████████████████████████████████████████████████| 2/2 [00:00<00:00, 6781.41it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.06it/s]
After loading, VRAM usage= 7918596096
Referrers = 7
After model deletion, VRAM usage= 7918596096
After gc_collect and empty_cache, VRAM usage= 7918596096
I’ll check out the other issue thread to try to understand the problem better.
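The "Referrers = 7" line above hints that something besides the local name may still be holding the model. A torch-free sketch (using a hypothetical FakeModel stand-in and a hypothetical cache dict) of how gc.get_referrers can surface a lingering holder:

```python
import gc

class FakeModel:
    """Stand-in for a large model object."""
    pass

model = FakeModel()
cache = {"encoder": model}  # hypothetical extra holder, e.g. a framework-level cache

# gc.get_referrers lists every gc-tracked object that currently references `model`;
# inspecting their types can reveal which container is keeping the model alive.
referrers = gc.get_referrers(model)
print(any(r is cache for r in referrers))  # → True: the cache dict shows up
```

If a container like this survives `model = None`, the parameters cannot be collected no matter how many times gc.collect() and empty_cache() are called.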
I've tried the various strategies described in Issue #21094, including calling accelerate.release_memory(), but without success so far. Also no change when explicitly loading with device_map="auto" or device_map="cuda".
Yeah, that's right. I replaced del model with model = None and it worked on my side. Here's my output:
* With quantized model *
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Downloading shards: 100%|████████████████████████████████████████████████| 2/2 [00:00<00:00, 18040.02it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████| 2/2 [00:21<00:00, 10.75s/it]
After loading, VRAM usage= 7918596096
After model deletion, VRAM usage= 7918596096
After gc_collect and empty_cache, VRAM usage= 0
Just to be sure, when checking nvidia-smi, the memory is not freed, is that right ?
I'm glad to hear it is working on your end! There must be some difference in our environments. Could you let me know what versions of Python, transformers, torch, accelerate and CUDA you're using? My info is at the top, except for the CUDA library, which is version 12.2.
Yes, I've confirmed with nvidia-smi that the memory is indeed allocated and used until the process ends.
I appreciate your working through this with me. It's become a bit of a blocker.
Interestingly, this almost works:
model = T5EncoderModel.from_pretrained(...)
state_dict = model.state_dict()
for k, v in state_dict.items():
    state_dict[k] = None
model = None
gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=', torch.cuda.memory_allocated())
The last line prints VRAM usage= 8192, so aside from a tiny residual allocation, the model parameters are freed.
Hi @lstein, I'm on the main branch of transformers and accelerate. As for torch, it's v2.3.0, and my CUDA version is 12.2 according to nvidia-smi.