Comments (6)
I'd suggest introducing a quantization_map concept that would act similarly to device_map. It would guide creation of the empty model object, and then the weights and quant components would be loaded as parameters / buffers using standard transformers functions.
For existing quantized models, the quantization_map would have a default value like {"embedding": None, "layers": "existing_q_method", ...}. Perhaps its values should be not just strings, but sub-dicts with quantization params.
The quantization_map would be stored in and loaded from the model config, similarly to the quantization config.
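To make the idea concrete, here is a rough sketch of what such a quantization_map might look like; the keys and field names are illustrative assumptions, not an existing transformers API:

# Hypothetical quantization_map, analogous to device_map:
# module name/pattern -> quantization spec (None = keep dense).
quantization_map = {
    "model.embed_tokens": None,
    "model.layers": {"method": "awq", "bits": 4, "group_size": 128},
    "lm_head": {"method": "awq", "bits": 4, "group_size": 128},
}

It could then be serialized into config.json next to quantization_config and consulted when materializing the empty (meta) model, before the state dict is loaded.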
I'd also use this occasion to revive #28874, which streamlines the saving format for bnb.
I also have a question about the memory usage of quantizing the lm_head. I found that quantizing a huge lm_head currently seems to increase the memory usage 😳 I tried to quantize Gemma's 256k-vocab lm_head to 4-bit with AWQ but found that the memory footprint increased, which didn't happen when quantizing a smaller 32k-vocab lm_head.
Because I am not an expert in quantization, could anyone explain this? I am wondering whether this is due to the dimensions of lm_head, because for Gemma, vocab_size >> hidden_size 🤔
(Please correct me if there is anything wrong with my code below.)
from transformers import AutoModelForCausalLM, AwqConfig
from transformers.integrations.awq import replace_with_awq_linear

model_id = "google/gemma-2b"
# model_id = "TinyLlama/TinyLlama_v1.1"

quantization_config = AwqConfig(bits=4)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Swap nn.Linear modules for AWQ linear layers; drop
# modules_to_not_convert to quantize the lm_head as well.
model, _ = replace_with_awq_linear(
    model,
    quantization_config=quantization_config,
    modules_to_not_convert=["lm_head"],
)
model.cuda()

print('Model Memory Footprint:', model.get_memory_footprint(), 'bytes')
print('Model Structure:\n', model)
# Results:
# Gemma-2b (256k vocab_size):
#   quantize everything incl. lm_head:  3,399,459,840 bytes
#   lm_head kept dense (modules_to_not_convert=["lm_head"]):  3,127,075,840 bytes
# TinyLlama-1.1b (32k vocab_size):
#   quantize everything incl. lm_head:    799,929,088 bytes
#   lm_head kept dense (modules_to_not_convert=["lm_head"]):  1,028,025,088 bytes
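For context, some back-of-the-envelope arithmetic on what a 4-bit AWQ lm_head should occupy. The layout below (4-bit values packed eight per int32, fp16 scales and packed zeros per group of 128) is my assumption about AWQ's usual GEMM kernel format, so treat the numbers as an estimate:

# Estimated size of a 4-bit AWQ linear with Gemma-2b's lm_head shape
# (assumed layout: int32-packed qweight, fp16 scales, int32-packed qzeros).
in_features, out_features, group_size = 2048, 256000, 128
qweight = in_features * (out_features // 8) * 4                  # 262,144,000 bytes
scales = (in_features // group_size) * out_features * 2          #   8,192,000 bytes
qzeros = (in_features // group_size) * (out_features // 8) * 4   #   2,048,000 bytes
print(qweight + scales + qzeros)                                 # 272,384,000 bytes

Under that assumption the quantized lm_head would be far smaller than a dense fp16 one (about 1.05 GB), yet 272,384,000 bytes is exactly the difference between the two Gemma footprints above. One hedged guess: Gemma ties lm_head to embed_tokens, so before conversion the dense head adds nothing to get_memory_footprint(), and quantizing it adds a brand-new tensor on top.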
Besides, I would also like to know why you think embeddings appear easier to quantize than heads 😳 Could you please elaborate on this? @galqiwi
cc: @younesbelkada
cc @SunMarc too :)
@Orion-Zheng, this issue is a feature request, i.e. we discuss a potential future enhancement of transformers here.
Your question seems important, but it is not related to this feature request, and I currently do not know enough about AWQ to help, though someone else likely does. It looks like your question may belong in the forum.
Speaking of this feature request, I am now trying to put together a minimal prototype of a quantized embedding for bitsandbytes and will be back with news (hopefully) by Thursday. If you have any tips or suggestions on how to implement this, please comment.
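In case a concrete strawman helps: one possible shape for such a prototype, using bitsandbytes' Params4bit and dequantize_4bit. The class name and the dequantize-then-index strategy are my own assumptions, not the planned design:

import torch
import torch.nn as nn
import torch.nn.functional as F
import bitsandbytes as bnb

class Embedding4bit(nn.Module):
    # Hypothetical 4-bit embedding: keep the table in NF4 storage and
    # dequantize it on each forward pass before the lookup.
    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()
        weight = torch.empty(num_embeddings, embedding_dim, dtype=torch.float16)
        nn.init.normal_(weight)
        # Params4bit quantizes its payload when moved to a CUDA device.
        self.weight = bnb.nn.Params4bit(weight, requires_grad=False, quant_type="nf4")

    def forward(self, input_ids):
        w = bnb.functional.dequantize_4bit(self.weight.data, self.weight.quant_state)
        return F.embedding(input_ids, w)

emb = Embedding4bit(256000, 2048).cuda()  # quantization happens on .cuda()
print(emb(torch.tensor([[1, 2, 3]], device="cuda")).shape)

Dequantizing the whole table on every forward is wasteful, of course; a real implementation would ideally dequantize only the gathered rows, but that needs kernel support.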