ist-daslab / gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".

Home Page: https://arxiv.org/abs/2210.17323

License: Apache License 2.0

Python 95.72% C++ 0.51% Cuda 3.77%

gptq's Introduction

GPTQ

This repository contains the code for the ICLR 2023 paper GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers. The current release includes the following features:

  • An efficient implementation of the GPTQ algorithm: gptq.py (a simplified sketch of the core update follows this list)
  • Compressing all models from the OPT and BLOOM families to 2/3/4 bits, including weight grouping: opt.py, bloom.py, zeroShot/
  • Evaluating the perplexity of quantized models on several language generation tasks: opt.py, bloom.py
  • Evaluating the performance of quantized models on several ZeroShot tasks: zeroShot/
  • A 3-bit quantized matrix full-precision vector product CUDA kernel: quant_cuda_kernel.cu, quant_cuda.cpp, setup_cuda.py
  • Benchmarking code for individual matrix-vector products and for language generation with quantized models: test_kernel.py, opt.py
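Regarding the first item above, here is a much-simplified sketch of GPTQ's core column-by-column update (no lazy-batch blocking, no --act-order, no grouping). It assumes Hinv_chol is the upper Cholesky factor of the inverse (damped) Hessian and that scale, zero and maxq define an ordinary uniform quantization grid; it illustrates the idea only and is not the code in gptq.py:

import torch

def gptq_quantize_columns(W, Hinv_chol, scale, zero, maxq):
    # W: (rows, cols) weight matrix; Hinv_chol: upper Cholesky factor of H^-1.
    W = W.clone()
    Q = torch.zeros_like(W)
    for i in range(W.shape[1]):
        w = W[:, i]
        d = Hinv_chol[i, i]
        # Round column i to the nearest point on the quantization grid.
        q = torch.clamp(torch.round(w / scale + zero), 0, maxq)
        Q[:, i] = scale * (q - zero)
        # Propagate the quantization error to the not-yet-quantized columns,
        # weighted by the corresponding row of the inverse-Hessian factor.
        err = (w - Q[:, i]) / d
        W[:, i:] -= err.unsqueeze(1) * Hinv_chol[i, i:].unsqueeze(0)
    return Q

# Toy usage: with an identity inverse-Hessian factor this reduces to round-to-nearest.
W = torch.randn(4, 8)
Hinv_chol = torch.eye(8)
Q = gptq_quantize_columns(W, Hinv_chol, scale=0.25, zero=8, maxq=15)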

New Features

Update July 2023:

  • Added the --static-groups option, which determines all group grids in advance rather than dynamically during quantization. As a result, --act-order does not require any inference changes (which may cause slowdowns) when used together with this option.
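A sample invocation combining the two flags (model and dataset chosen for illustration only):

CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-1.3b c4 --wbits 3 --groupsize 128 --act-order --static-groups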

Together with the camera-ready version of the paper, we have added several updates to this repository:

  • Slightly adjusted preprocessing of C4 and PTB for more realistic evaluations (used in our updated results); can be activated via the flag --new-eval.
  • Optimized 3bit kernels, which are considerably faster especially on the A100, e.g. 1.9x -> 3.25x generation speedup for OPT-175B; can be activated via --faster-kernel.
  • A minimal LLaMa integration (for more complete features see the GPTQ-for-LLaMA repository), which demonstrates two new tricks: --act-order (quantizing columns in order of decreasing activation size) and --true-sequential (performing sequential quantization even within a single Transformer block). These fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general.

Here is a summary of LLaMa results:

Wiki2 PPL   FP16   4bit-RTN   4bit-GPTQ   3bit-RTN   3bit-GPTQ   3g128-GPTQ
LLaMa-7B    5.68   6.29       6.09        25.54      8.07        6.61
LLaMa-13B   5.09   5.53       5.36        11.40      6.63        5.62
LLaMa-30B   4.10   4.54       4.45        14.89      5.69        4.80
LLaMa-65B   3.53   3.92       3.84        10.59      5.04        4.17

Here is a sample command:

python llama.py LLAMA_HF_FOLDER c4 --wbits 4 --true-sequential --act-order --new-eval

The --act-order heuristic also dramatically improves accuracy on the OPT-66B outlier model: 9.55 to 9.34 and 14.16 to 9.95 PPL on Wiki2 for 4bit and 3bit, respectively.

Dependencies

  • torch: tested on v1.10.1+cu111
  • transformers: tested on v4.21.2 (the LLaMa integration currently requires installing the main branch from source, plus sentencepiece)
  • datasets: tested on v1.17.0
  • (to run 3-bit kernels: setup for compiling PyTorch CUDA extensions, see also https://pytorch.org/tutorials/advanced/cpp_extension.html, tested on CUDA 11.4)

All experiments were run on a single 80GB NVIDIA A100. However, most experiments will also work on a GPU with much less memory.
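A quick sanity check that the environment roughly matches these versions (a minimal snippet, not part of the repository):

import torch, transformers, datasets
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda, "| GPU available:", torch.cuda.is_available())
print("transformers:", transformers.__version__, "| datasets:", datasets.__version__)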

Language Generation

OPT

# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 [--groupsize 1024]

To run other OPT models, replace opt-125m with one of: opt-350m, opt-1.3b, opt-2.7b, opt-6.7b, opt-13b, opt-66b. For the 175B-parameter model, you have to request access from Meta and then convert it to a local HuggingFace checkpoint using their scripts in metaseq. Once you have such a checkpoint, simply pass its path instead of facebook/opt-125m.

BLOOM

# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4 --wbits 4 [--groupsize 1024]

To run other BLOOM models replace bloom-560m with one of: bloom-1b1, bloom-1b7, bloom-3b, bloom-7b1, bloom.

ZeroShot

See zeroShot/ folder.

3-bit CUDA Kernels

# Install kernels
python setup_cuda.py install

# Benchmark performance for FC2 layer of OPT-175B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py

# Benchmark language generation with 3-bit OPT-175B:
# OPT175B denotes the name of the folder with the HuggingFace OPT-175b checkpoint (see above)

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python opt.py OPT175B c4 --wbits 3 --save opt175b-3bit.pt
# Benchmark generating a 128 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python opt.py OPT175B c4 --load opt175b-3bit.pt --benchmark 128
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python opt.py OPT175B c4 --benchmark 128

Please note that our 3-bit kernels are currently only optimized for OPT-175B running on 1xA100 or 2xA6000 and may thus yield suboptimal performance on smaller models or on other GPUs.

Cite

If you found this work useful, please consider citing:

@article{frantar-gptq,
  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers}, 
  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  year={2022},
  journal={arXiv preprint arXiv:2210.17323}
}

gptq's People

Contributors

bofeng2477, dalistarh, efrantar, sashkboos, xiuyu-li


gptq's Issues

GPTQ for BERT

I'm looking for the GPTQ implementation for BERT; why isn't it in the repository? I want to try a 4-bit implementation for a speed comparison, and to try other models as well.

Inference of the Quantised Model (OPT-13B)

Hey!
Huge congratulations on your achievement and thank you for sharing!
I am following the steps to quantise an OPT model (13B) that I have finetuned. I wish to serve this model for inference.
Will I simply be able to save the quantised model, and load it into the transformers library?

If not, what's the best way to do this?

All the very best

Reproduction of the results in the paper

@efrantar
Following the instructions in README.md, baseline and RTN perplexities match exactly as listed in Tables 2-3 in the paper.
However, GPTQ perplexity does not.

Is this due to differences in the calibration samples? Or are the results in the tables statistics over multiple runs with different random seeds?
Could you share the command that reproduces the results in the paper?

Much appreciated!

Feature Request: Add Saving Quantized Weights Functionality to bloom.py

Description:

Hi there,

I noticed that the opt.py file in the repository provides a method for saving quantized weights, but this functionality is not available in the bloom.py file. I was wondering if it would be possible to add this feature to bloom.py as well.

Being able to save quantized weights is a really useful feature for optimizing the size of models, and it would be great to have this functionality available in all relevant files in the repository.

If this feature could be added to bloom.py, I think it would be a really helpful addition for anyone who is working with this file.

Thank you for your time and consideration.

Best regards,

Compatibility of Quant3Linear and 4-bit quantization

Hi! I've noticed that the quantization layer packs the quantized weights using the Quant3Linear class.

However, it seems to me that this only suits 2-bit and 3-bit weights. If the original weights in intweight are 4-bit, some bits would be lost.
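For illustration only (this is not the repository's pack routine), packing values with a 3-bit layout and unpacking them again drops the top bit of any value that actually needs 4 bits:

def pack3(vals):
    # Pack each value into 3 bits of a single integer word.
    word = 0
    for i, v in enumerate(vals):
        word |= (int(v) & 0b111) << (3 * i)
    return word

def unpack3(word, n):
    # Read the values back out, 3 bits at a time.
    return [(word >> (3 * i)) & 0b111 for i in range(n)]

print(unpack3(pack3([7, 3, 12]), 3))  # [7, 3, 4] -- the high bit of 12 is lost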

Could you explain the logic behind this? Thanks!

opt_eval error

After quantizing opt-125m and saving the quantized model, when I use opt_eval I get an error: Only supports a single token currently

Test on CNN model containing group conv by GPTQ method

Hi,
to support CNN models, I modified the GPTQ code as follows (a rough sketch of the second change is shown after this list):
1. support group conv;
2. use symmetric quantization without a zero-point parameter.
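A rough sketch of the second change, assuming per-output-channel scales (my own illustration, not the repository's Quantizer class):

import torch

def quantize_symmetric(w, bits=4):
    # w: (out_channels, n) flattened weights; symmetric grid, no zero point.
    qmax = 2 ** (bits - 1) - 1                           # e.g. 7 for 4 bits
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                                      # dequantized weights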

But I found that the performance is not good on mobilenetv2/mnasnet1_0 models when quantizing to 4 bits.
Here are my results:

model        FP32    GPTQ_W4 sym
mbv2         71.88   60.84 (84.64%)
mnasnet1_0   73.47   64.71 (88.08%)

Your paper only reports resnet18/resnet50 quantization results; have you tested GPTQ on the mobilenetv2/mnasnet1_0 models?

Looking forward to your reply...

Application to T5 / UL2 family

Do you expect this to work for the T5 architecture (and, consequently, the very similar UL2 family)? If not, what do you suspect would be an issue, and do you expect that some adjustments would need to be made?

Google recently released Flan-UL2 which is 20B parameters in size. GPTQ could be a real life-saver here.

qweight is empty when I gave --save option

As I want to get the quantized model produced by the GPTQ algorithm, I passed the --save option when running the Python script.

However, the qweight of each layer is empty because of the pack function in the Quant3Linear class (quant.py).
I think the while loop (lines 147-170) is not executed, so qweight is just an empty ndarray.

If I comment out the while loop, I can get the qweight.
What is the role of the while loop? Can I just comment it out and run the transformers?

How to run the quantized model for predictions on my prompts?

I am able to quantize the LLaMA 7B model to 4 bits. But how can I run it for my predictions? If I try the transformers library, I get an error.

Python 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("llama_7b_4bit_2.bin")
Traceback (most recent call last):
File "/home/intel-spc/Documents/tarun/t2/tar/lib/python3.10/site-packages/transformers/configuration_utils.py", line 659, in _get_config_dict
config_dict = cls._dict_from_json_file(resolved_config_file)
File "/home/intel-spc/Documents/tarun/t2/tar/lib/python3.10/site-packages/transformers/configuration_utils.py", line 750, in _dict_from_json_file
text = reader.read()
File "/usr/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "/home/intel-spc/Documents/tarun/t2/tar/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 456, in from_pretrained
config, kwargs = AutoConfig.from_pretrained(
File "/home/intel-spc/Documents/tarun/t2/tar/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 944, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/intel-spc/Documents/tarun/t2/tar/lib/python3.10/site-packages/transformers/configuration_utils.py", line 574, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/intel-spc/Documents/tarun/t2/tar/lib/python3.10/site-packages/transformers/configuration_utils.py", line 662, in _get_config_dict
raise EnvironmentError(
OSError: It looks like the config file at 'llama_7b_4bit_2.bin' is not a valid JSON file.

H_inv not updated

After each quantization step, H_inv should be updated, but in the fasterquant code, H_inv is not updated. Is this a bug?

GPTQ on BERT based

Hi all,

I hope this message finds everyone well. I have read the paper and found a table which compares the performance of OBQ and GPTQ on a BERT-based model. Could anyone help me find the code or an implementation of GPTQ for BERT-based models? Thanks for your help.

Why are PPL so low on PTB?

Hello

Many thanks for your work, it's great to (finally) see results reported on openly available LLMs 😊
However, I was surprised when I saw perplexities on PTB for OPT and BLOOM models: 10.33 and 13.63 respectively.
Indeed, the GPT-3 paper reports a PPL of 20.50 on this dataset, and I was wondering whether you had any explanation for this (nearly 2x) difference?

Thanks!

pack_model takes too long time

I used auto_gptq to quantize a large language model whose transformer has 80 layers. I found that each layer needs almost 4 minutes to pack, so I have to wait several hours before the whole packing step finishes. Are there any suggestions for solving this problem? Can the model-packing step be sped up?

How to adopt GPTQ on Conv2d with `groups` attribute?

Hi,

Thanks for your impressive work! It really helps me quantize lots of large models.
Recently, I tried to implement GPTQ on grouped Conv2d layers, but the results do not seem to be good.
Could you provide some hints to support GPTQ on grouped Conv2d?

Here is my rough implementation now:

  1. In the add_batch function, divide inp into the different groups and store a Hessian for each group.
  2. In the fasterquant function, divide W into the different groups and apply GPTQ to each chunk of W with its corresponding Hessian.
  3. Concatenate the per-group Q chunks into the full Q (a rough sketch follows this list).
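Roughly, steps 2-3 look like the following sketch (gptq_quantize is a hypothetical stand-in for running GPTQ on a single weight chunk, not the repository's API):

import torch

def quantize_grouped_conv(W, H_per_group, groups, gptq_quantize):
    # W: (out_channels, in_channels_per_group * kh * kw), one row per output channel.
    out_per_group = W.shape[0] // groups
    chunks = []
    for g in range(groups):
        Wg = W[g * out_per_group:(g + 1) * out_per_group]
        # Quantize this group's rows against the Hessian built from its own inputs.
        chunks.append(gptq_quantize(Wg, H_per_group[g]))
    return torch.cat(chunks, dim=0)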

Thank you in advance.

quantized GPTJ - error on inference

Hi there, I'm trying to quantize a finetuned version of GPT-J through the https://github.com/AlpinDale/gptq-gptj repo.

To quantize the model I use this command:

CUDA_VISIBLE_DEVICES=0 python gptj.py ../finetuned6B/checkpoint-3000/ c4 --wbits 4 --save GPTJQ.pt

The process completes successfully and the file GPTJQ.pt is produced. The only warning I get is:

Token indices sequence length is longer than the specified maximum sequence length for this model (3403 > 2048). Running this sequence through the model will result in indexing errors.

When I run inference through this command:

CUDA_VISIBLE_DEVICES=0 python gptj-inference.py EleutherAI/gpt-j-6b --wbits 4 --load GPTJQ.pt --text "Hello"

I get the following error. What am I doing wrong?

Thank you very much for any help!

The error:

CUDA extension not installed.
Loading model ...
Traceback (most recent call last):
File "gptj-inference.py", line 120, in
model = load_quant(args.model, args.load, args.wbits)
File "gptj-inference.py", line 55, in load_quant
model.load_state_dict(torch.load(checkpoint))
File "/home/gianmarco/miniconda3/envs/gpt_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GPTJForCausalLM:
Missing key(s) in state_dict: "transformer.h.0.attn.k_proj.qzeros", "transformer.h.0.attn.k_proj.scales", "transformer.h.0.attn.k_proj.bias", "transformer.h.0.attn.k_proj.qweight", ... (analogous qzeros/scales/bias/qweight keys for the v_proj, q_proj, out_proj, mlp.fc_in and mlp.fc_out modules of transformer.h.0 through transformer.h.27).
Unexpected key(s) in state_dict: "transformer.h.0.attn.k_proj.weight", "transformer.h.0.attn.v_proj.weight", "transformer.h.0.attn.q_proj.weight", "transformer.h.0.attn.out_proj.weight", "transformer.h.0.mlp.fc_in.weight", "transformer.h.0.mlp.fc_out.weight", ... (and the same weight keys for transformer.h.1 through transformer.h.27).

Application to GPT-J family

Congratulations on your achievement.

Can you give us some hints and recommendations for adapting the procedure in order to quantize the GPT-J model family?

How to run on multi GPUs?

I'm trying to run OPT-30B on 4x 2080Ti GPUs. However, the following error message appears when loading parameters.

Starting ...
Ready.
Traceback (most recent call last):
  File "opt.py", line 424, in <module>
    quantizers = opt_sequential(model, dataloader, DEV)
  File "/home/cciip/miniconda3/envs/int/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "opt.py", line 83, in opt_sequential
    gptq[name] = GPTQ(subset[name])
  File "/home/cciip/private/tianjie/gptq/gptq.py", line 29, in __init__
    self.H = torch.zeros((self.columns, self.columns), device=self.dev)
RuntimeError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 10.75 GiB total capacity; 9.30 GiB already allocated; 77.62 MiB free; 9.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How can I make it work?

Why no update to Hinv

In the gptq.py fasterquant function, there seems to be no update to Hinv during the quantization process. Could you explain the intuition behind this? I got a bit lost on how the introduction of the Cholesky decomposition in the paper eliminates the updates of Hinv.

running speed slow on NVIDIA vGPU

I tested GPTQ quantization of Qwen-7B on a vGPU with half of an A10's performance.

  • Driver Version:470.161.03
  • CUDA Version: 11.4

I have noticed that both the context-processing speed and the decoding speed are particularly slow:

  • context(500 tokens) processing speed: 48 tokens/s
  • decode speed: 1.6 token/s

Then I tested another model, https://huggingface.co/ClueAI/ChatYuan-large-v2, and its speed was within expectations. So I guess that GPTQ does not work well on vGPUs?

The code is nothing special; it looks like:

from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
...

Regarding the method for computing the Hessian matrix.

I would like to ask about line 61 in your gptq.py file: inp = math.sqrt(2 / self.nsamples) * inp.float(). According to the paper, it seems that it should be written as follows: inp = math.sqrt(tmp / self.nsamples) * inp.float(). After making this modification, I noticed a reduction in quantization error. Could you please verify if my understanding is correct, and if there might be any misunderstanding on my part?
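For context, here is a small self-contained check of what that scaling does when all samples are processed in one add_batch call (my own illustration; the repository also rescales H when accumulating across batches): multiplying each input by sqrt(2 / nsamples) before the outer product yields H = (2 / nsamples) * sum_i x_i x_i^T.

import math
import torch

nsamples, d = 8, 4
X = torch.randn(nsamples, d)

H_direct = (2 / nsamples) * X.t() @ X

H_accum = torch.zeros(d, d)
for i in range(nsamples):
    x = math.sqrt(2 / nsamples) * X[i]
    H_accum += torch.outer(x, x)

print(torch.allclose(H_direct, H_accum, atol=1e-5))  # True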

AssertionError

File "/usr/local/lib/python3.9/dist-packages/datasets/load.py", line 1675, in load_dataset
builder_instance = load_dataset_builder(
File "/usr/local/lib/python3.9/dist-packages/datasets/load.py", line 1452, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/usr/local/lib/python3.9/dist-packages/datasets/load.py", line 1177, in dataset_module_factory
raise e1 from None
File "/usr/local/lib/python3.9/dist-packages/datasets/load.py", line 1156, in dataset_module_factory
return HubDatasetModuleFactoryWithoutScript(
File "/usr/local/lib/python3.9/dist-packages/datasets/load.py", line 743, in init
assert self.name.count("/") == 1
AssertionError

When I use this command, python3 opt.py facebook/opt-125m c4, I get the above error.
Could you please help me solve this issue?

Question about the difference between the pseudocode and the implementation

The Hessian inverse information in your pseudocode is computed via a Cholesky decomposition of H's inverse. In the code, you use cholesky first, then cholesky_inverse, and then cholesky again. I am not sure of the reason for the difference. Is the cholesky_inverse kernel necessary here? Can I just compute H's inverse and then apply cholesky?
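For what it's worth, a small numerical check (my own sketch, not the repository's code) suggests the two routes agree up to numerical error; I assume cholesky_inverse is used because inverting from an existing Cholesky factor is cheaper and more stable than a general matrix inverse:

import torch

d = 6
A = torch.randn(d, d)
H = A @ A.t() + d * torch.eye(d)          # symmetric positive definite

# Route in the code: cholesky -> cholesky_inverse -> cholesky
L1 = torch.linalg.cholesky(torch.cholesky_inverse(torch.linalg.cholesky(H)))

# Direct route: general inverse, then cholesky
L2 = torch.linalg.cholesky(torch.linalg.inv(H))

print(torch.allclose(L1, L2, atol=1e-4))  # True (up to numerical error)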

Thank you so much.

How should I verify the speedup effect of the algorithm?

Hi~ Thank you for your great work! It seems that GPTQ should lead to significant speedups for end-to-end inference. But after quantizing BLOOM-7B to INT8 with GPTQ, I found it twice as slow as the FP16 model. How can I achieve the speedups shown in the paper?

License issues

Hi, I've forked this repo but it has no license. Could you please add one? Thanks.

OpenCL Support

Please add OpenCL support so that it can be used on GPUs that support OpenCL but not CUDA.
Then we could use something like quant_opencl.cpp instead of quant_cuda.cpp.

Can --save work with --groupsize in opt.py?

Hello there, nice work!

If I understand correctly, when groupsize is set above 0, the quantizer in the gptq module of opt.py is only responsible for the current group. opt_pack3 relies on the quantizer.pack function, which only holds the zeros and scales of the last group if groupsize is set above 0.

So can --save and --groupsize work together in opt.py right now?

ValueError: not enough values to unpack (expected 2, got 1)

Hello,
I tried your instructions and got a ValueError. Am I benchmarking correctly? Thank you.

CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 3 --save opt125m-3bit.pt

CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --load opt125m-3bit.pt --benchmark 128
Loading model ...
Done.
Found cached dataset json (/$HOME/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
Found cached dataset json (/$HOME/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
Benchmarking ...
Traceback (most recent call last):
File "/$HOME/gptq/opt.py", line 455, in
...
File "/$HOM/mambaforge/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 637, in forward
batch_size, seq_length = input_shape
ValueError: not enough values to unpack (expected 2, got 1)

LAMBADA evaluation accuracy

Hello, I've been experimenting with GPTQ and trying to replicate your LAMBADA zero-shot results. But I have been getting significantly lower accuracy (10-15% lower for OPT specifically) compared to the paper, even for the FP16 baseline. I'm using your pipeline based on LM evaluation harness. I was wondering if you have seen this before?
