Comments (6)
I'd suggest introducing a quantization_map concept that would act similarly to device_map. It would guide creation of the empty model object, and then the weights and quant components would be loaded as parameters / buffers using standard transformers functions.
For existing quantized models, the quantization_map would have a default value like {"embedding": None, "layers": "existing_q_method", ...}. Perhaps its values should be not just strings, but sub-dicts with quantization params.
The quantization_map would be stored in and loaded from the model config, similarly to the quantization config.
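To make the idea concrete, here is a rough sketch of what such a quantization_map might look like; the keys and field names are illustrative assumptions, not an existing transformers API:

# Hypothetical quantization_map, analogous to device_map:
# module name/pattern -> quantization spec (None = keep dense).
quantization_map = {
    "model.embed_tokens": None,
    "model.layers": {"method": "awq", "bits": 4, "group_size": 128},
    "lm_head": {"method": "awq", "bits": 4, "group_size": 128},
}

It could then be serialized into config.json next to quantization_config and consulted when materializing the empty (meta) model, before the state dict is loaded.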
I'd also use this occasion to revive #28874, which streamlines the saving format for bnb.
I also have a question about the memory usage of quantizing the lm_head. I found that quantizing a huge lm_head currently seems to increase the memory usage 😳 I tried to quantize Gemma's 256k-vocab lm_head to 4-bit with AWQ but found that the memory footprint increased, which didn't happen when quantizing a smaller 32k-vocab lm_head.
Because I am not an expert in quantization, could anyone explain this? I am wondering whether this is due to the dimensions of lm_head, because for Gemma, vocab_size >> hidden_size 🤔
(Please correct me if there is anything wrong with my code below.)
from transformers import AutoModelForCausalLM, AwqConfig
from transformers.integrations.awq import replace_with_awq_linear

model_id = "google/gemma-2b"
# model_id = "TinyLlama/TinyLlama_v1.1"

quantization_config = AwqConfig(bits=4)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Swap nn.Linear modules for AWQ linear layers; drop
# modules_to_not_convert to quantize the lm_head as well.
model, _ = replace_with_awq_linear(
    model,
    quantization_config=quantization_config,
    modules_to_not_convert=["lm_head"],
)
model.cuda()

print('Model Memory Footprint:', model.get_memory_footprint(), 'bytes')
print('Model Structure:\n', model)
# Results:
# Gemma-2b (256k vocab_size):
#   quantize everything incl. lm_head:  3,399,459,840 bytes
#   lm_head kept dense (modules_to_not_convert=["lm_head"]):  3,127,075,840 bytes
# TinyLlama-1.1b (32k vocab_size):
#   quantize everything incl. lm_head:    799,929,088 bytes
#   lm_head kept dense (modules_to_not_convert=["lm_head"]):  1,028,025,088 bytes
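For context, some back-of-the-envelope arithmetic on what a 4-bit AWQ lm_head should occupy. The layout below (4-bit values packed eight per int32, fp16 scales and packed zeros per group of 128) is my assumption about AWQ's usual GEMM kernel format, so treat the numbers as an estimate:

# Estimated size of a 4-bit AWQ linear with Gemma-2b's lm_head shape
# (assumed layout: int32-packed qweight, fp16 scales, int32-packed qzeros).
in_features, out_features, group_size = 2048, 256000, 128
qweight = in_features * (out_features // 8) * 4                  # 262,144,000 bytes
scales = (in_features // group_size) * out_features * 2          #   8,192,000 bytes
qzeros = (in_features // group_size) * (out_features // 8) * 4   #   2,048,000 bytes
print(qweight + scales + qzeros)                                 # 272,384,000 bytes

Under that assumption the quantized lm_head would be far smaller than a dense fp16 one (about 1.05 GB), yet 272,384,000 bytes is exactly the difference between the two Gemma footprints above. One hedged guess: Gemma ties lm_head to embed_tokens, so before conversion the dense head adds nothing to get_memory_footprint(), and quantizing it adds a brand-new tensor on top.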
Besides, I would also like to know why you think embeddings appear easier to quantize than heads 😳 Could you please elaborate on this? @galqiwi
cc: @younesbelkada
cc @SunMarc too :)
@Orion-Zheng, this issue is a feature request, i.e. we discuss a potential future enhancement of transformers here.
Your question seems important, but it is not related to this feature request, and I currently do not know enough about AWQ to help, though someone else likely does. It looks like your question may belong in the forum.
Speaking of this feature request, I am now trying to put together a minimal prototype of a quantized embedding for bitsandbytes and will be back with news (hopefully) by Thursday. If you have any tips or suggestions on how to implement this, please comment.
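In case a concrete strawman helps: one possible shape for such a prototype, using bitsandbytes' Params4bit and dequantize_4bit. The class name and the dequantize-then-index strategy are my own assumptions, not the planned design:

import torch
import torch.nn as nn
import torch.nn.functional as F
import bitsandbytes as bnb

class Embedding4bit(nn.Module):
    # Hypothetical 4-bit embedding: keep the table in NF4 storage and
    # dequantize it on each forward pass before the lookup.
    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()
        weight = torch.empty(num_embeddings, embedding_dim, dtype=torch.float16)
        nn.init.normal_(weight)
        # Params4bit quantizes its payload when moved to a CUDA device.
        self.weight = bnb.nn.Params4bit(weight, requires_grad=False, quant_type="nf4")

    def forward(self, input_ids):
        w = bnb.functional.dequantize_4bit(self.weight.data, self.weight.quant_state)
        return F.embedding(input_ids, w)

emb = Embedding4bit(256000, 2048).cuda()  # quantization happens on .cuda()
print(emb(torch.tensor([[1, 2, 3]], device="cuda")).shape)

Dequantizing the whole table on every forward is wasteful, of course; a real implementation would ideally dequantize only the gathered rows, but that needs kernel support.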