
efeslab / atom


[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

CMake 0.64% Cuda 60.86% C++ 8.58% Python 25.14% Shell 0.27% Jupyter Notebook 4.41% Makefile 0.05% C 0.05%

atom's People

Contributors

cylinbao, eltociear, happierpig


atom's Issues

ppl on ptb

When I ran your code, the perplexity of the llama-7B model on the PTB dataset came out greater than 20, whether quantized or not. Another paper also reports a score different from yours. Why is that?


Question regarding the efficiency evaluation

Hello, regarding the efficiency evaluation experiment, there appears to be code only for evaluating the throughput and latency of Atom and SmoothQuant. How were the throughput and latency results for FP16 and AWQ obtained?
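
For reference, an FP16 baseline can be timed directly with Hugging Face Transformers. The sketch below is only an illustration under my own assumptions (the model id, prompt, and generation length are placeholders), not necessarily how the paper's FP16 or AWQ numbers were produced:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "huggyllama/llama-7b"  # placeholder model id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()

ids = tok("The quick brown fox", return_tensors="pt").input_ids.cuda()
torch.cuda.synchronize()
start = time.time()
out = model.generate(ids, max_new_tokens=128)
torch.cuda.synchronize()
latency = time.time() - start
# Throughput = generated tokens divided by wall-clock time.
print(f"latency {latency:.2f} s, {(out.shape[1] - ids.shape[1]) / latency:.1f} tok/s")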

LLM model load hanging problem

Hello,
When I follow the guidance and try to reproduce the results, I encounter the problem shown in the screenshot below.

[Screenshot: model loading hangs]

Error: tensors not on the same device

"I would like to ask if there is a solution to this problem, as the error occurred without any changes to the code." :RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Question about KV Cache quantization

Hi, thanks for your great work!
I have a small question about KV cache quantization. Did you use PagedAttention to accelerate 4-bit KV cache quantization? If so, where is the corresponding CUDA kernel code? Thank you.
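
For reference, asymmetric 4-bit group quantization of a KV tensor can be expressed in a few lines of PyTorch. This is only a sketch of the arithmetic, not Atom's actual CUDA kernel or its attention integration:

import torch

def quantize_kv_4bit(kv: torch.Tensor, group_size: int = 128):
    """Asymmetric 4-bit quantization over contiguous groups of `group_size` values."""
    g = kv.reshape(-1, group_size)
    mn = g.amin(dim=1, keepdim=True)
    scale = (g.amax(dim=1, keepdim=True) - mn).clamp(min=1e-8) / 15.0
    q = torch.clamp(torch.round((g - mn) / scale), 0, 15).to(torch.uint8)
    return q.reshape(kv.shape), scale, mn   # dequantize per group as q * scale + mn

kv = torch.randn(2, 32, 128)                # (heads, seq, head_dim); divisible by group_size
q, scale, mn = quantize_kv_4bit(kv)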

Question about the synchronization in the low-precision kernel

First of all, thank you for your wonderful work! I have a few small questions about the low precision matrix multiplication code.

In kernels/include/GEMM/Dense_layer_gemm_i4_o16.cuh, in the kernel compute_gemm_imma, I noticed that when performing cp.async synchronization, only the statement

asm volatile("cp.async.wait_group %0;\n" ::"n"(STAGE - 2));

is involved.

My question is: has the correctness of this kernel been verified? In my understanding, a complete kernel should also involve statements like

asm volatile("cp.async.wait_group %0;\n" ::"n"(STAGE - 3));

and asm volatile("cp.async.wait_group %0;\n" ::"n"(0));

In other words, we should see statements like "wait for the last two memory transactions to complete" somewhere in the code, right? Of course, it is possible that I have missed some of the code. If you can help explain the code, I would be very grateful!

AssertionError

I attempted the W4A4 operation on the OPT-350M model and was able to obtain the corresponding results. However, after switching to the 2.7B model, I encountered a size-mismatch error at line 238 in quant.py. Upon printing, I found the size to be [32, 2048, 160], whereas for the 350M model it was [16, 2048, 128]. How should I resolve this error?

issue with `c4` dataset for eval

Thanks for the great work, guys! Trying to run the W4A4 perplexity evaluation, HF datasets complains with "ValueError: BuilderConfig 'allenai--c4' not found". Removing 'allenai--c4' from [datautils.py](https://github.com/efeslab/Atom/blob/main/model/datautils.py#L49) and keeping only 'allenai/c4' lets the script complete the run (a possible fix is sketched after the traceback below). I wonder whether it is OK to remove it, or whether I am missing something.

Traceback (most recent call last):
  File "/data/home/hamidnazeri/Atom/model/llama.py", line 232, in <module>
    dataloader, testloader = get_loaders(
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/Atom/model/datautils.py", line 175, in get_loaders
    return get_c4(nsamples, seed, seqlen, model, tokenizer)
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/Atom/model/datautils.py", line 51, in get_c4
    traindata = load_dataset(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/load.py", line 1852, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/builder.py", line 373, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/builder.py", line 539, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', 'en.noclean', 'realnewslike', 'multilingual', 'af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en-multi', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'und', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Latn', 'zu']
model,bit,wiki2,ptb,c4,ptb-new,c4-new
/data/home/hamidnazeri/PiPPy/
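
A possible fix, sketched here under the assumption that a recent `datasets` release is installed: drop the removed 'allenai--c4' builder config and load a single shard of the 'allenai/c4' hub dataset directly, as the usual GPTQ-style loaders do.

from datasets import load_dataset

# Hedged sketch: the 'allenai--c4' config no longer exists, so point data_files
# at one C4 shard on the 'allenai/c4' hub dataset instead.
traindata = load_dataset(
    "allenai/c4",
    data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
    split="train",
)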

the ppl for llama-7b is very large

I have reproduced the results of llama-7b.
The ppl on WikiText2 matches Table 3 (6.16).
The ppl on c4 also matches Table 3 (7.694).
But the ppl on ptb is very large (32.879), whereas it is 9.62 in Table 3. Why?

Question about calibration data

Hi, I am wondering which dataset should be used as the calibration dataset. I want to evaluate the quantized model on my own dataset. Which dataset should the calibration data be generated from: my own dataset, or a public dataset like WikiText?

Is it possible to add support for other models?

Hi, this is great work. I see in /Atom/model/main.py that only the llama, opt, and mixtral models appear to be supported. If I need to add support for a Qwen model, which files should I change?

not including dynamic quantization when reproducing results, why?

I have reproduced the results of llama-7b; the ppl on WikiText2 matches Table 3 (6.16).
But the code you provide does not include dynamic quantization for activations. As far as I know, dynamic quantization also causes quantization error. Why is dynamic quantization omitted from your code?
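
For context, per-token dynamic (fake) quantization of activations can be simulated in a few lines. This is a generic sketch of the idea, not the exact code path Atom uses:

import torch

def fake_dynamic_quant(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-token fake quantization: each row gets its own scale at runtime."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

x = torch.randn(8, 4096)        # (tokens, hidden_dim)
x_q = fake_dynamic_quant(x)     # quantization error is (x_q - x)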

RuntimeError when quantizing a llama model

Hi, when I tried to quantize a llama model, I hit the following error:

Traceback (most recent call last):
  File "/workspace/code/atom-main/model/main.py", line 205, in <module>
    act_scales = get_act_stats_func(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/yangshangtong/code/atom-main/model/outlier.py", line 95, in get_act_stats_llama
    outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 750, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 681, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: The expanded size of the tensor (2048) must match the existing size (32768) at non-singleton dimension 3.  Target sizes: [1, 52, 2048, 2048].  Tensor sizes: [1, 1, 32768, 32768]

the command is:

python model/main.py /workspace/model/llama2-7B wikitext2 \
	--act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \
	--reorder --act_sort_metric hessian \
	--a_clip_ratio 0.9 --w_clip_ratio 0.85 \
	--keeper 128 --keeper_precision 3 --kv_cache \
	--eval_ppl

What should I do to solve this problem? Looking forward to your reply.

Question about end-to-end efficiency evaluation of Atom

Thanks for your great work! I have a small question here.

Why is the matrix dimension (bs, (hidden_dim - group_size) // 2) and not (bs, hidden_dim - group_size) here?
What does the "// 2" mean? Is it some kind of hardware acceleration trick? Could you elaborate? Thank you.

" a = torch.randint(\n",
" 16,\n",
" 128, (bs, (hidden_dim - group_size) // 2),\n",
" dtype=torch.uint8).cuda()\n",
" b = torch.randint(\n",
" 16,\n",
" 128, (hidden_dim, (hidden_dim - group_size) // 2),\n",
" dtype=torch.uint8).cuda()\n",

TypeError: QLlamaDecoderLayer.forward() got an unexpected keyword argument 'cache_position'

Hi there, I followed all the steps of this project until I encountered an issue while running the following command:
python model/main.py decapoda-research-llama-7b-hf wikitext2 \
	--wbits 4 --abits 4 --a_sym --w_sym \
	--act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \
	--reorder --act_sort_metric hessian \
	--a_clip_ratio 0.9 --w_clip_ratio 0.85 \
	--keeper 128 --keeper_precision 3 --kv_cache --use_gptq \
	--eval_ppl --eval_common_sense

Env

  1. GPU: Nvidia RTX 4090
  2. Same as you, I used the nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 container
  3. The versions of Python and the dependency libraries are consistent with your requirements.txt
  4. For convenience, I temporarily skipped the kernel-benchmark compilation step
  5. By the way, for quick validation, I changed this line to evaluate only the wikitext2 dataset

Describe the issue

When running loglikelihood requests, the TypeError shown in the issue title occurred (screenshot omitted).

I tried to make changes based on this issue, setting cache_position=None in transformers/models/llama/modeling_llama.py, but that doesn't work either.

Any suggestions will be greatly appreciated!
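
One possible workaround, sketched here as a hypothetical shim rather than the project's official fix: newer transformers versions pass extra keyword arguments such as cache_position to decoder layers, so a wrapper can drop any kwargs the quantized layer does not accept. Pinning transformers to the version in requirements.txt may be the simpler route.

import torch.nn as nn

class QDecoderLayerCompat(nn.Module):
    """Hypothetical shim: forwards calls to a quantized layer, dropping unknown kwargs."""
    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer

    def forward(self, hidden_states, **kwargs):
        kwargs.pop("cache_position", None)   # newer transformers adds this kwarg
        return self.layer(hidden_states, **kwargs)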
