Comments (6)

poedator commented on July 24, 2024

I'd suggest introducing a quantization_map concept that would act similarly to device_map. It would guide the creation of the empty model object, and then the weights and quant components would be loaded as parameters / buffers using standard transformers functions.
For the existing quantized models, the quantization_map would have a default value like {"embedding": None, "Layers": "existing_q_method", ...}. Perhaps its values should be not just strings but sub-dicts with quantization params.
The quantization_map would be stored in and loaded from the model config, similarly to the quantization config.
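For illustration, such a quantization_map might look something like this (the module names and sub-dict fields here are just a guess at a possible shape, not an existing transformers API):

# hypothetical sketch of the proposed quantization_map, mirroring device_map's per-module layout
quantization_map = {
    "model.embed_tokens": None,                                   # keep in full precision
    "model.layers": {"method": "existing_q_method", "bits": 4},   # sub-dict with quantization params
    "lm_head": {"method": "existing_q_method", "bits": 8},        # could differ per module
}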

I'd also use this occasion to revive #28874, which streamlines the saving format for bnb.

Orion-Zheng commented on July 24, 2024

I also have a question about the memory usage of quantizing the lm_head. I found that quantizing a huge lm_head currently seems to increase memory usage 😳 I tried to quantize Gemma's 256k lm_head to 4-bit with AWQ but found that the memory footprint increased, which didn't happen when quantizing a smaller 32k lm_head.
Since I am not an expert in quantization, could anyone explain this? I am wondering whether it is due to the dimensions of the lm_head, because for Gemma, vocab_size >> hidden_size 🤔
(Please correct me if there is anything wrong with my code below.)

from transformers import AutoModelForCausalLM, AwqConfig
from transformers.integrations.awq import replace_with_awq_linear

model_id = "google/gemma-2b"
# model_id = "TinyLlama/TinyLlama_v1.1"
quantization_config = AwqConfig(bits=4)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Swap nn.Linear modules for (empty) 4-bit AWQ linear layers;
# drop `modules_to_not_convert` to also convert the lm_head.
model, _ = replace_with_awq_linear(
    model, quantization_config=quantization_config,
    modules_to_not_convert=["lm_head"]
)
model.cuda()
print('Model Memory Footprint:', model.get_memory_footprint(), 'bytes')
print('Model Structure: \n', model)

# Gemma-2b (256k vocab_size)
# quantize everything including `lm_head`: 3,399,459,840 bytes
# quantize everything except `lm_head`:    3,127,075,840 bytes  # with `modules_to_not_convert=["lm_head"]`

# TinyLlama-1.1b (32k vocab_size)
# quantize everything including `lm_head`: 799,929,088 bytes
# quantize everything except `lm_head`:    1,028,025,088 bytes  # with `modules_to_not_convert=["lm_head"]`
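For reference, here is a rough back-of-the-envelope estimate of the size of a single 4-bit AWQ linear layer versus fp16. The layout assumed below (int32-packed 4-bit weights, fp16 scales, packed 4-bit zero points, default group_size=128) is my reading of the usual AWQ storage format, not something verified against the AWQ source:

def awq_4bit_linear_bytes(in_features, out_features, group_size=128):
    # assumed layout: packed 4-bit weights + fp16 scales + packed 4-bit zero points per group
    qweight = in_features * out_features // 2                     # 4 bits per weight
    scales = (in_features // group_size) * out_features * 2       # one fp16 scale per group
    qzeros = (in_features // group_size) * out_features // 2      # one 4-bit zero point per group
    return qweight + scales + qzeros

def fp16_linear_bytes(in_features, out_features):
    return in_features * out_features * 2

# Gemma-2b lm_head: hidden_size=2048, vocab_size=256000
print(awq_4bit_linear_bytes(2048, 256_000), 'vs', fp16_linear_bytes(2048, 256_000))

Note that this only accounts for the layer's own tensors; get_memory_footprint sums all parameters and buffers of the model, so the total can also be affected by whether the lm_head shares its storage with the embedding, as it does in some models.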

Besides, I would also like to know why you think embeddings appear to be easier to quantize than heads 😳 Could you please elaborate on this? @galqiwi

galqiwi commented on July 24, 2024

cc: @younesbelkada

amyeroberts commented on July 24, 2024

cc @SunMarc too :)

galqiwi commented on July 24, 2024

@Orion-Zheng, this issue is a feature request, i.e. we are discussing a potential future enhancement of transformers here.
Your question seems important, but it is not related to this feature request, and I currently do not know enough about AWQ to help, though someone else likely does. It looks like your question may belong on the forum.

galqiwi commented on July 24, 2024

Speaking of this feature request, I am now trying to put together a minimal prototype of a quantized embedding for bitsandbytes and will be back with news (hopefully) by Thursday. If you have any tips or suggestions on how to implement this, please comment.
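In case it helps the discussion, here is one possible minimal sketch of a 4-bit quantized embedding built on bitsandbytes' quantize_4bit / dequantize_4bit functions. This is only an illustration, not the prototype being worked on; the class name and the naive "dequantize the whole table on every forward" strategy are assumptions made for brevity:

import torch
import torch.nn as nn
import torch.nn.functional as F
import bitsandbytes.functional as bnbF

class Quantized4bitEmbedding(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, weight: torch.Tensor, blocksize: int = 64):
        super().__init__()
        # quantize_4bit expects a CUDA tensor and returns the packed weight plus a QuantState
        qweight, self.quant_state = bnbF.quantize_4bit(
            weight.to('cuda', torch.float16), blocksize=blocksize, quant_type='nf4'
        )
        self.register_buffer('qweight', qweight)

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        # naive: dequantize the full table and index into it;
        # a real implementation would only dequantize the rows it needs
        # (input_ids must live on the same device as the table)
        weight = bnbF.dequantize_4bit(self.qweight, self.quant_state)
        return F.embedding(input_ids, weight)

Usage would be something like emb = Quantized4bitEmbedding(model.get_input_embeddings().weight); the open question for transformers integration is how the qweight and quant_state get saved and reloaded, which is where something like the quantization_map idea above would come in.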
