
efeslab / atom


[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

CMake 0.64% Cuda 60.86% C++ 8.58% Python 25.14% Shell 0.27% Jupyter Notebook 4.41% Makefile 0.05% C 0.05%

atom's People

Contributors

cylinbao, eltociear, happierpig


atom's Issues

ppl on ptb

When I ran your code, the perplexity of the llama-7B model on the PTB dataset came out greater than 20, whether quantized or not. Another paper also reports a score different from yours. Why is that?


Question regarding the efficiency evaluation

Hello, regarding the efficiency evaluation experiment, there appears to be code only for evaluating the throughput and latency of Atom and SmoothQuant. How were the throughput and latency results for FP16 and AWQ obtained?
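
For reference, an FP16 baseline can be timed directly with Hugging Face Transformers. The sketch below is only an illustration under my own assumptions (the model id, prompt, and generation length are placeholders), not necessarily how the paper's FP16 or AWQ numbers were produced:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "huggyllama/llama-7b"  # placeholder model id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()

ids = tok("The quick brown fox", return_tensors="pt").input_ids.cuda()
torch.cuda.synchronize()
start = time.time()
out = model.generate(ids, max_new_tokens=128)
torch.cuda.synchronize()
latency = time.time() - start
# Throughput = generated tokens divided by wall-clock time.
print(f"latency {latency:.2f} s, {(out.shape[1] - ids.shape[1]) / latency:.1f} tok/s")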

LLM model load hanging problem

Hello,
When I follow the guidance and try to reproduce the results, I encounter the problem shown in the screenshot below.

[Screenshot: model loading hangs]

Error: tensors not on the same device

"I would like to ask if there is a solution to this problem, as the error occurred without any changes to the code." :RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Question about KV Cache quantization

Hi, thanks for your great work!
I have a small question about KV cache quantization. Did you use PagedAttention to accelerate 4-bit KV cache quantization? If so, where is the corresponding CUDA kernel code? Thank you.
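
For reference, asymmetric 4-bit group quantization of a KV tensor can be expressed in a few lines of PyTorch. This is only a sketch of the arithmetic, not Atom's actual CUDA kernel or its attention integration:

import torch

def quantize_kv_4bit(kv: torch.Tensor, group_size: int = 128):
    """Asymmetric 4-bit quantization over contiguous groups of `group_size` values."""
    g = kv.reshape(-1, group_size)
    mn = g.amin(dim=1, keepdim=True)
    scale = (g.amax(dim=1, keepdim=True) - mn).clamp(min=1e-8) / 15.0
    q = torch.clamp(torch.round((g - mn) / scale), 0, 15).to(torch.uint8)
    return q.reshape(kv.shape), scale, mn   # dequantize per group as q * scale + mn

kv = torch.randn(2, 32, 128)                # (heads, seq, head_dim); divisible by group_size
q, scale, mn = quantize_kv_4bit(kv)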

Question about the synchronization in the low-precision kernel

First of all, thank you for your wonderful work! I have a few small questions about the low precision matrix multiplication code.

In kernels/include/GEMM/Dense_layer_gemm_i4_o16.cuh, in the kernel compute_gemm_imma, I noticed that when performing cp.async synchronization, only the statement

asm volatile("cp.async.wait_group %0;\n" ::"n"(STAGE - 2));

is involved.

My question is: has the correctness of this kernel been verified? In my understanding, a complete kernel should also involve statements like

asm volatile("cp.async.wait_group %0;\n" ::"n"(STAGE - 3));

and asm volatile("cp.async.wait_group %0;\n" ::"n"(0));

In other words, we should see statements like "wait for the last two memory transactions to complete" somewhere in the code, right? Of course, it is possible that I have missed some of the code. If you can help explain the code, I would be very grateful!

AssertionError

I attempted the W4A4 operation on the OPT-350M model and was able to obtain the corresponding results. However, after switching to the 2.7B model, I encountered a size-mismatch error at line 238 in quant.py. Upon printing, I found the size to be [32, 2048, 160], whereas for the 350M model it was [16, 2048, 128]. How should I resolve this error?

issue with `c4` dataset for eval

Thanks for the great work, guys! Trying to run the W4A4 perplexity evaluation, HF datasets complains with "ValueError: BuilderConfig 'allenai--c4' not found". Removing 'allenai--c4' from [datautils.py](https://github.com/efeslab/Atom/blob/main/model/datautils.py#L49) and keeping only 'allenai/c4' lets the script complete the run (a possible fix is sketched after the traceback below). I wonder whether it is OK to remove it, or whether I am missing something.

Traceback (most recent call last):
  File "/data/home/hamidnazeri/Atom/model/llama.py", line 232, in <module>
    dataloader, testloader = get_loaders(
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/Atom/model/datautils.py", line 175, in get_loaders
    return get_c4(nsamples, seed, seqlen, model, tokenizer)
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/Atom/model/datautils.py", line 51, in get_c4
    traindata = load_dataset(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/load.py", line 1852, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/builder.py", line 373, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/builder.py", line 539, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', 'en.noclean', 'realnewslike', 'multilingual', 'af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en-multi', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'und', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Latn', 'zu']
model,bit,wiki2,ptb,c4,ptb-new,c4-new
/data/home/hamidnazeri/PiPPy/
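
A possible fix, sketched here under the assumption that a recent `datasets` release is installed: drop the removed 'allenai--c4' builder config and load a single shard of the 'allenai/c4' hub dataset directly, as the usual GPTQ-style loaders do.

from datasets import load_dataset

# Hedged sketch: the 'allenai--c4' config no longer exists, so point data_files
# at one C4 shard on the 'allenai/c4' hub dataset instead.
traindata = load_dataset(
    "allenai/c4",
    data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
    split="train",
)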

the ppl for llama-7b is very large

I have reproduced the results of llama-7b.
The ppl on WikiText2 matches Table 3 (6.16).
The ppl on c4 also matches Table 3 (7.694).
But the ppl on ptb is very large (32.879), whereas it is 9.62 in Table 3. Why?

Question about calibration data

Hi, I am wondering which dataset should be used as the calibration dataset. I want to evaluate the quantized model on my own dataset. Which dataset should the calibration data be generated from: my own dataset, or a public dataset like WikiText?

Is it possible to add support for other models?

Hi, this is great work. I see in /Atom/model/main.py that only the llama, opt, and mixtral models appear to be supported. If I need to add support for a Qwen model, which files should I change?

not including dynamic quantization when reproducing results, why?

I have reproduced the results of llama-7b; the ppl on WikiText2 matches Table 3 (6.16).
But the code you provide does not include dynamic quantization for activations. As far as I know, dynamic quantization also causes quantization error. Why is dynamic quantization omitted from your code?
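
For context, per-token dynamic (fake) quantization of activations can be simulated in a few lines. This is a generic sketch of the idea, not the exact code path Atom uses:

import torch

def fake_dynamic_quant(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-token fake quantization: each row gets its own scale at runtime."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

x = torch.randn(8, 4096)        # (tokens, hidden_dim)
x_q = fake_dynamic_quant(x)     # quantization error is (x_q - x)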

RuntimeError when quantizing a llama model

Hi, when I tried to quantize a llama model, I hit the following error:

Traceback (most recent call last):
  File "/workspace/code/atom-main/model/main.py", line 205, in <module>
    act_scales = get_act_stats_func(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/yangshangtong/code/atom-main/model/outlier.py", line 95, in get_act_stats_llama
    outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 750, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 681, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: The expanded size of the tensor (2048) must match the existing size (32768) at non-singleton dimension 3.  Target sizes: [1, 52, 2048, 2048].  Tensor sizes: [1, 1, 32768, 32768]

the command is:

python model/main.py /workspace/model/llama2-7B wikitext2 \
	--act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \
	--reorder --act_sort_metric hessian \
	--a_clip_ratio 0.9 --w_clip_ratio 0.85 \
	--keeper 128 --keeper_precision 3 --kv_cache \
	--eval_ppl

What should I do to solve this problem? Looking forward to your reply.

Question about end-to-end efficiency evaluation of Atom

Thanks for your great work! I have a small question here.

Why is the matrix dimension (bs, (hidden_dim - group_size) // 2) and not (bs, hidden_dim - group_size) here?
What does the "// 2" mean? Is it some kind of hardware acceleration trick? Could you elaborate? Thank you.

" a = torch.randint(\n",
" 16,\n",
" 128, (bs, (hidden_dim - group_size) // 2),\n",
" dtype=torch.uint8).cuda()\n",
" b = torch.randint(\n",
" 16,\n",
" 128, (hidden_dim, (hidden_dim - group_size) // 2),\n",
" dtype=torch.uint8).cuda()\n",

TypeError: QLlamaDecoderLayer.forward() got an unexpected keyword argument 'cache_position'

Hi there, I followed all the steps of this project until I encountered an issue while running the following command:
python model/main.py decapoda-research-llama-7b-hf wikitext2 \
	--wbits 4 --abits 4 --a_sym --w_sym \
	--act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \
	--reorder --act_sort_metric hessian \
	--a_clip_ratio 0.9 --w_clip_ratio 0.85 \
	--keeper 128 --keeper_precision 3 --kv_cache --use_gptq \
	--eval_ppl --eval_common_sense

Env

  1. GPU: Nvidia RTX 4090
  2. Same as you, I used the nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 container
  3. The versions of Python and the dependency libraries are consistent with your requirements.txt
  4. For convenience, I temporarily skipped the kernel-benchmark compilation step
  5. By the way, for quick validation, I changed this line to evaluate only the wikitext2 dataset

Describe the issue

When running loglikelihood requests, the TypeError shown in the issue title occurred (screenshot omitted).

I tried to make changes based on this issue, setting cache_position=None in transformers/models/llama/modeling_llama.py, but that doesn't work either.

Any suggestions will be greatly appreciated!
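
One possible workaround, sketched here as a hypothetical shim rather than the project's official fix: newer transformers versions pass extra keyword arguments such as cache_position to decoder layers, so a wrapper can drop any kwargs the quantized layer does not accept. Pinning transformers to the version in requirements.txt may be the simpler route.

import torch.nn as nn

class QDecoderLayerCompat(nn.Module):
    """Hypothetical shim: forwards calls to a quantized layer, dropping unknown kwargs."""
    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer

    def forward(self, hidden_states, **kwargs):
        kwargs.pop("cache_position", None)   # newer transformers adds this kwarg
        return self.layer(hidden_states, **kwargs)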
