Giter Site home page Giter Site logo

btobolaski / llm-awq Goto Github PK

View Code? Open in Web Editor NEW

This project forked from mit-han-lab/llm-awq

0.0 1.0 0.0 744 KB

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

License: MIT License

Shell 3.47% C++ 0.37% Python 65.59% C 0.34% Cuda 30.22%

llm-awq's Introduction

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [Paper]

Efficient and accurate low-bit weight quantization (INT3/4) for LLMs, supporting instruction-tuned models and multi-modal LMs.

overview

The current release supports:

  • AWQ search for accurate quantization.
  • Pre-computed AWQ model zoo for LLMs (LLaMA, OPT, Vicuna, LLaVA; load to generate quantized weights).
  • Memory-efficient 4-bit Linear in PyTorch.
  • Efficient CUDA kernel implementation for fast inference (support context and decoding stage).
  • Examples on 4-bit inference of an instruction-tuned model (Vicuna) and multi-modal LM (LLaVA).

Contents

Install

  1. Clone this repository and navigate to AWQ folder
git clone https://github.com/mit-han-lab/llm-awq
cd llm-awq
  1. Install Package
conda create -n awq python=3.10 -y
conda activate awq
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  1. Install efficient W4A16 (4-bit weight, 16-bit activation) CUDA kernel
cd awq/kernels
python setup.py install

AWQ Model Zoo

We provide pre-computed AWQ search results for multiple model families, including LLaMA, OPT, Vicuna, and LLaVA. To get the pre-computed AWQ search results, run:

# git lfs install  # install git lfs if not already
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache

The detailed support list:

Models Sizes INT4-g128 INT3-g128
LLaMA 7B/13B/30B/65B
OPT 125m/1.3B/2.7B/6.7B/13B/30B
Vicuna 7B/13B
LLaVA 13B

Examples

AWQ can be easily applied to various LMs thanks to its good generalization, including instruction-tuned models and multi-modal LMs. It provides an easy-to-use tool to reduce the serving cost of LLMs.

Here we provide two examples of AWQ application: Vicuna-7B (chatbot) and LLaVA-13B (visual reasoning) under ./examples directory. AWQ can easily reduce the GPU memory of model serving and speed up token generation. It provides accurate quantization, providing reasoning outputs. You should be able to observe memory savings when running the models with 4-bit weights.

Note that we perform AWQ using only textual calibration data, depsite we are running on multi-modal input. Please refer to ./examples for details.

overview

Usage

We provide several sample script to run AWQ (please refer to ./scripts). We use OPT-6.7B as an example.

  1. Perform AWQ search and save search results (we already did it for you):
python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/opt-6.7b-w4-g128.pt
  1. Evaluate the AWQ quantized model on WikiText-2 (simulated pseudo quantization)
python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/opt-6.7b-w4-g128.pt \
    --q_backend fake
  1. Generate real quantized weights (INT4)
mkdir quant_cache
python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/opt-6.7b-w4-g128.pt \
    --q_backend real --dump_quant quant_cache/opt-6.7b-w4-g128-awq.pt
  1. Load and evaluate the real quantized model (now you can see smaller gpu memory usage)
python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/opt-6.7b-w4-g128-awq.pt

Reference

If you find AWQ useful or relevant to your research, please kindly cite our paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}

Related Projects

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers

Vicuna and FastChat

LLaVA: Large Language and Vision Assistant

llm-awq's People

Contributors

kentang-mit avatar sakits avatar tonylins avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.