efeslab / atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Hello, regarding the efficiency evaluation experiment: it seems there is only code for evaluating the throughput and latency of Atom and SmoothQuant. How were the throughput and latency results for FP16 and AWQ obtained?
Great work!
I was wondering whether you have an online or offline inference API that can be used to test throughput/latency performance, and whether that performance would be superior to vLLM, TGI, LightLLM, etc.?
Hope to get your answer, thanks!
"I would like to ask if there is a solution to this problem, as the error occurred without any changes to the code." :RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
Hi, thanks for your great work!
I have a small question about KV cache quantization. Did you use PagedAttention to accelerate 4-bit KV cache quantization? If so, where is the corresponding CUDA kernel code? Thank you.
First of all, thank you for your wonderful work! I have a few small questions about the low precision matrix multiplication code.
In kernels/include/GEMM/Dense_layer_gemm_i4_o16.cuh, in the kernel compute_gemm_imma, I noticed that the only cp.async synchronization statement in the entire kernel is
asm volatile("cp.async.wait_group %0;\n" ::"n"(STAGE - 2));
My question is: has the correctness of this kernel been verified? In my understanding, a complete kernel should also involve statements such as
asm volatile("cp.async.wait_group %0;\n" ::"n"(STAGE - 3));
and asm volatile("cp.async.wait_group %0;\n" ::"n"(0));
In other words, somewhere in the code we should see statements like "wait for the last two memory transactions to complete", right? Of course, it is possible that I have missed some of the code. If you could help explain the code, I would be very grateful!
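For anyone else following this thread, here is a minimal Python simulation (purely illustrative, not the kernel's actual code) of the cp.async commit_group / wait_group bookkeeping. It shows one common pattern in which wait_group(STAGE - 2) alone is sufficient: if a group is committed every iteration (an empty one once the tiles run out), the pending count stays uniform, and only a final drain is needed. Whether compute_gemm_imma relies on this pattern would need to be confirmed against the source.

```python
from collections import deque

STAGE = 4          # number of shared-memory pipeline buffers
NUM_TILES = 8      # K-dimension tiles streamed through the pipeline
pending = deque()  # committed but not-yet-waited-on copy groups

def commit_group(tile):
    """Model cp.async.commit_group: bundle the issued copies into a group."""
    pending.append(tile)

def wait_group(n):
    """Model cp.async.wait_group N: block until at most N groups remain."""
    while len(pending) > n:
        done = pending.popleft()  # oldest group is now guaranteed complete
        print(f"  copy group for tile {done} landed")

# Prologue: prefetch the first STAGE - 1 tiles.
for t in range(STAGE - 1):
    commit_group(t)

# Main loop: wait_group(STAGE - 2) leaves at most STAGE - 2 groups pending,
# so the oldest tile (the one consumed this iteration) has arrived.
for t in range(NUM_TILES):
    wait_group(STAGE - 2)
    print(f"compute on tile {t}")
    nxt = t + STAGE - 1
    # Committing a group every iteration (empty once the tiles run out)
    # keeps the pending count uniform, so no wait_group(STAGE - 3) is needed.
    commit_group(nxt if nxt < NUM_TILES else None)

wait_group(0)  # final drain before the kernel exits
```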
I attempted the W4A4 operation on the OPT-350M model and was able to obtain the corresponding results. However, after switching to the 2.7B model, I encountered a size-mismatch error at line 238 in quant.py. Printing the tensor, I found its size to be [32, 2048, 160], whereas for the 350M model it was [16, 2048, 128]. How should I resolve this error?
Thanks for the great work, guys! When trying to run the W4A4 perplexity evaluation, HF datasets complains with "ValueError: BuilderConfig 'allenai--c4' not found". Removing 'allenai--c4' from [datautils.py](https://github.com/efeslab/Atom/blob/main/model/datautils.py#L49) and keeping only 'allenai/c4' lets the script complete the run. I wonder whether it is OK to remove it, or whether I am missing something.
Traceback (most recent call last):
File "/data/home/hamidnazeri/Atom/model/llama.py", line 232, in <module>
dataloader, testloader = get_loaders(
File "/opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/Atom/model/datautils.py", line 175, in get_loaders
return get_c4(nsamples, seed, seqlen, model, tokenizer)
File "/opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/Atom/model/datautils.py", line 51, in get_c4
traindata = load_dataset(
File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/load.py", line 2129, in load_dataset
builder_instance = load_dataset_builder(
File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/load.py", line 1852, in load_dataset_builder
builder_instance: DatasetBuilder = builder_cls(
File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/builder.py", line 373, in __init__
self.config, self.config_id = self._create_builder_config(
File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/builder.py", line 539, in _create_builder_config
raise ValueError(
ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', 'en.noclean', 'realnewslike', 'multilingual', 'af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en-multi', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'und', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Latn', 'zu']
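For reference, a minimal sketch of the adjusted call in get_c4, assuming GPTQ-style explicit data files (the shard name below is illustrative, not necessarily the one datautils.py uses):

```python
from datasets import load_dataset

# Newer versions of HF datasets reject the legacy 'allenai--c4' builder
# config; passing only the dataset id with explicit data files works.
traindata = load_dataset(
    "allenai/c4",
    data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
    split="train",
)
```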
I have reproduced the results of LLaMA-7B.
The ppl on WikiText2 is the same as in Table 3 (6.16).
The ppl on C4 is also the same as in Table 3 (7.694).
But the ppl on PTB is very large (32.879), whereas it is 9.62 in Table 3. Why?
Hi, I am wondering which dataset should be used as the calibration dataset. I want to evaluate the quantized model on my own dataset. Should the calibration data be generated from my own dataset, or from a public dataset like WikiText?
Hi, this is great work. I found that /Atom/model/main.py seems to support only the LLaMA, OPT, and Mixtral models. If I need to add support for the Qwen model, which files should I change?
Hi, I want to know how to load the quantized weights for evaluation.
I have reproduced the results of LLaMA-7B; the ppl on WikiText2 is the same as in Table 3 (6.16).
But the code you provide does not include dynamic quantization for activations. As far as I know, dynamic quantization also causes quantization error. Why is dynamic quantization omitted from your code?
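For context, a minimal sketch (not Atom's actual implementation) of what symmetric per-group dynamic activation quantization looks like; the scale is computed from each incoming activation at run time, and the round-trip error is the quantization error referred to above:

```python
import torch

def dynamic_quant_int4(x: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Symmetric per-group dynamic fake-quantization to the INT4 range [-8, 7].

    "Dynamic" means the scale is derived from the activation itself at run
    time, rather than from a pre-computed calibration statistic.
    """
    xg = x.reshape(-1, group_size)
    scale = xg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(xg / scale), -8, 7)
    return (q * scale).reshape(x.shape)  # dequantized output

x = torch.randn(4, 1024)
err = (x - dynamic_quant_int4(x)).abs().mean()  # the quantization error
```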
Hi, when I tried to quantize a LLaMA model, I encountered the following error:
Traceback (most recent call last):
File "/workspace/code/atom-main/model/main.py", line 205, in <module>
act_scales = get_act_stats_func(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/yangshangtong/code/atom-main/model/outlier.py", line 95, in get_act_stats_llama
outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0]
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 750, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 681, in forward
attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: The expanded size of the tensor (2048) must match the existing size (32768) at non-singleton dimension 3. Target sizes: [1, 52, 2048, 2048]. Tensor sizes: [1, 1, 32768, 32768]
the command is:
python model/main.py /workspace/model/llama2-7B wikitext2 \
--act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \
--reorder --act_sort_metric hessian \
--a_clip_ratio 0.9 --w_clip_ratio 0.85 \
--keeper 128 --keeper_precision 3 --kv_cache \
--eval_ppl
What should I do to solve this problem? Looking forward to your reply.
Thanks for your great work! I have a small question here.
Why is the matrix dimension (bs, (hidden_dim - group_size) // 2) rather than (bs, hidden_dim - group_size) here?
What does this "// 2" mean? Is it some kind of hardware acceleration trick? Could you elaborate? Thank you.
Atom/kernels/baselines/python-api.ipynb
Lines 285 to 292 in 7e3618b
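A plausible explanation, worth confirming against the kernel: two INT4 values are packed into each 8-bit element, so a row of hidden_dim - group_size 4-bit activations occupies (hidden_dim - group_size) // 2 bytes. A minimal sketch of such packing (illustrative, not the repository's code):

```python
import torch

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit values (range [-8, 7]) into uint8 storage.

    An input of shape (bs, n) becomes (bs, n // 2): each output byte
    holds one value in its low nibble and the next in its high nibble.
    """
    nibbles = (q.to(torch.int16) & 0xF).to(torch.uint8).reshape(q.shape[0], -1, 2)
    return nibbles[:, :, 0] | (nibbles[:, :, 1] << 4)

q = torch.randint(-8, 8, (4, 256))  # 256 INT4 activations per row
packed = pack_int4(q)               # shape (4, 128): hence the "// 2"
```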
Hi there, I followed all the steps of this project until I encountered an issue while running the following command:
python model/main.py decapoda-research-llama-7b-hf wikitext2 \
--wbits 4 --abits 4 --a_sym --w_sym \
--act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \
--reorder --act_sort_metric hessian \
--a_clip_ratio 0.9 --w_clip_ratio 0.85 \
--keeper 128 --keeper_precision 3 --kv_cache --use_gptq \
--eval_ppl --eval_common_sense
Environment: Docker image nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04, with dependencies installed from requirements.txt.
When running the loglikelihood requests, the TypeError shown in the screenshot occurred.
I tried to make changes based on this issue, setting cache_position=None in transformers/models/llama/modeling_llama.py, but that also doesn't work.
Any suggestions will be greatly appreciated!