Comments (2)
Hi,
Since the A100 has very high memory bandwidth but comparatively little non-tensor-core compute (which is what we use for matrix-vector products), the initial simple kernels gave only a moderate speedup over FP16 execution on the A100. Meanwhile, on e.g. the A6000 they were already pretty close, on large matrices, to the optimal 5.3x expected from 3-bit compression (16 bits / 3 bits ≈ 5.3).
The new kernels decode two quantized weights at a time into a fused 2xFP16 (half2) value, using a look-up table stored in fast shared memory (replicated across banks so that reads avoid bank conflicts); each fused weight pair is then multiplied by two fused FP16 inputs in a single step. This greatly reduces the kernel's dequantization and compute overhead, which was substantial on the A100, so the bandwidth savings from quantization now translate into better overall speedups.
In general, the new kernels may also be a bit faster on less powerful GPUs, but there the original simple kernels already worked quite well.
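For illustration, here is a minimal CUDA sketch of that idea; it is not the actual gptq kernel. It assumes a simplified packing (ten 3-bit weights per 32-bit word, never crossing word boundaries) and omits the per-channel scale/zero handling a real kernel needs; `vecquant3_matvec_sketch` and all parameter names are hypothetical.

```cuda
// Sketch of paired dequantization: two 3-bit weights are decoded at once
// into a fused half2 via a shared-memory look-up table, then FMA'd with
// two fused FP16 inputs per step. Scale/zero application is omitted.
#include <cuda_fp16.h>

__global__ void vecquant3_matvec_sketch(
    const unsigned int* __restrict__ w_packed, // height x width packed words
    const half2*        __restrict__ x,        // input, fused as half2 pairs
    half*               __restrict__ y,        // output vector (width entries)
    int height, int width)
{
    // 64-entry table mapping a 6-bit code (two 3-bit weights) to a fused
    // half2 value, replicated 32x so every warp lane hits its own bank.
    __shared__ half2 deq2[64][32];
    int lane = threadIdx.x % 32;
    for (int val = threadIdx.x / 32; val < 64; val += blockDim.x / 32) {
        deq2[val][lane] = __halves2half2(__int2half_rn(val & 0x7),  // low 3 bits
                                         __int2half_rn(val >> 3));  // high 3 bits
    }
    __syncthreads();

    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;

    half2 res2 = __float2half2_rn(0.0f);
    for (int row = 0; row < height; ++row) {
        unsigned int word = w_packed[row * width + col];
        // Decode ten 3-bit weights two at a time; each LUT hit yields a
        // fused weight pair that is multiplied with a fused input pair.
        #pragma unroll
        for (int k = 0; k < 5; ++k) {
            half2 w2 = deq2[(word >> (6 * k)) & 0x3f][lane];
            res2 = __hfma2(w2, x[row * 5 + k], res2);
        }
    }
    // Collapse the two accumulator halves into a single output element.
    y[col] = __hadd(__low2half(res2), __high2half(res2));
}
```

The table layout `deq2[64][32]` is the key design point: each half2 entry is 4 bytes (one bank word), and indexing by `lane` in the second dimension means the 32 lanes of a warp always read from 32 distinct banks, so the look-ups are conflict-free.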
@efrantar Thank you so much for the explanation. I think this will help others perform further hardware-specific quantization optimizations.
Related Issues (20)
- How to adopt GPTQ on Conv2d with `groups` attribute?
- How can we use this lib to quantize Falcon 7B / 40B models?
- How to run the quantized model for predictions on my prompts?
- How should I verify the speedup effect of the algorithm?
- LAMBADA evaluation accuracy
- Why is the wikitext-2 ppl calculated in the code lower than the ppl from lm-evaluation-harness?
- Use modified Cholesky decomposition instead of regularized Cholesky
- About the CUDA code, I think "tmp2 >> 30" should be "tmp2 >> 31"
- H_inv not updated
- Running speed slow on NVIDIA vGPU
- pack_model takes too long
- act-order on inference
- Compatibility of Quant3Linear and 4-bit quantization
- How to run an INT8 model converted by GPTQ? Please advise
- How to load and re-evaluate GPTQ pseudo-quantized weights saved in .pt format
- Regarding the method for computing the Hessian matrix
- AssertionError
- GPTQ on BERT-based models
- Reconstruct a quantized model layer in torch
- Test GPTQ on a CNN model containing group conv