I am trying to implement the following: I have two S8 sources I need to multiply e


avoiding a very slow binary operation about onednn HOT 9 CLOSED

dagdoron commented on June 1, 2024

avoiding a very slow binary operation

from onednn.

Comments (9)

igorsafo commented on June 1, 2024

Hi @dagdoron, sorry for the late response; I was trying to understand what happens in the example and how it can be mitigated. The issue is that the binary primitive is optimized for operations where the src0 and dst memory descriptors are similar (same tag, same data type), while src1 can be broadcast or have a different data type. In your case src0 and dst have different data types, so binary dispatches to the reference implementation.

For (de-)quantization, oneDNN uses the reorder primitive, which has optimized implementations for reorders between different data types. So one potential solution is to use a reorder with a src scale (it must be in f32) to dequantize the data from s8 to f32, and then use a separate binary primitive to complete the dequantization.
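To make the suggested two-step split concrete, here is a minimal scalar sketch of the arithmetic each step performs. It is plain C++ and not oneDNN API code: the function names `dequantize_s8` and `mul_per_channel_nhwc` are hypothetical, the single common scale is an assumption, and the per-channel factor mirrors the 1xCx1x1 src1 shape seen in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1 (hypothetical helper): the arithmetic of a reorder with one
// common f32 source scale, i.e. convert s8 to f32 and multiply by the scale.
std::vector<float> dequantize_s8(const std::vector<int8_t>& src, float scale) {
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = static_cast<float>(src[i]) * scale;
    return dst;
}

// Step 2 (hypothetical helper): the arithmetic of binary_mul when src1 is
// broadcast per channel over an nhwc tensor. In nhwc the channel index is
// the fastest-moving one, so element i belongs to channel i % C.
void mul_per_channel_nhwc(std::vector<float>& data,
                          const std::vector<float>& per_channel) {
    const std::size_t channels = per_channel.size();
    assert(channels > 0 && data.size() % channels == 0);
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] *= per_channel[i % channels];
}
```

In the real library the first step would be a reorder primitive and the second a binary primitive; the sketch only shows which part of the computation lands in which step.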

I see that your dequantization involves a shift and a scale, but why do you multiply by the shift rather than add/subtract it? This prevents implementing the whole dequantization as a single reorder. Here is the quantization scheme supported by oneDNN: https://oneapi-src.github.io/oneDNN/dev_guide_attributes_quantization.html


dagdoron commented on June 1, 2024

Hi @igorsafo
Thanks,
I'll try a reorder before the binary and let you know if that helps.
We are using a slightly different dequantization scheme, where the shifts are actual bit shifts, and we try to emulate the HW by multiplying by 2^x instead of shifting.
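A small sketch of what "multiplying by 2^x instead of shifting" can look like, assuming a left shift with a non-negative x (the thread does not state the shift direction). The helper name `shift_left_via_mul` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Emulates `value << shift` with a float multiply by 2^shift.
// Scaling by a power of two is exact in floating point, so the result
// matches the integer shift as long as |value| <= 2^24 (exactly
// representable as float) and the product fits in int32_t.
int32_t shift_left_via_mul(int32_t value, int shift) {
    assert(shift >= 0 && shift < 31);
    const float factor = std::ldexp(1.0f, shift);  // 2^shift
    return static_cast<int32_t>(static_cast<float>(value) * factor);
}
```

For example, `shift_left_via_mul(3, 4)` yields 48, the same as `3 << 4`. Outside the stated range the float conversion can round, which is one reason a native integer shift would be preferable.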


dagdoron commented on June 1, 2024

Hi @igorsafo

I've changed src0 to s32 to match the type of tmp0, so src0 and dst now have the same type,

e.g.
std::vector<int32_t> src0(512, 1);
memory::desc src0_md = memory::desc(dims, dt::s32, tag::nhwc);

The execution still falls back to ref:
onednn_verbose,create:cache_miss,cpu,binary,ref:any,undef,src_s32::blocked:acdb::f0 src_f32::blocked:acdb::f0 dst_s32::blocked:acdb::f0,attr-post-ops:binary_mul:s8:2 ,alg:binary_mul,1x8x8x8:1x8x1x1,0.104004
onednn_verbose,exec,cpu,binary,ref:any,undef,src_s32::blocked:acdb::f0 src_f32::blocked:acdb::f0 dst_s32::blocked:acdb::f0,attr-post-ops:binary_mul:s8:2 ,alg:binary_mul,1x8x8x8:1x8x1x1,1.11914
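When scanning many such lines, the fallback can be detected programmatically. The sketch below assumes the comma-separated layout of the lines quoted in this thread, where the fifth field holds the implementation name (`ref:any`, `jit:uni`); other oneDNN versions may lay out the fields differently. The helper names are hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits one verbose line into its comma-separated fields.
std::vector<std::string> split_verbose(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) fields.push_back(field);
    return fields;
}

// Returns true when the implementation field (index 4 in the layout
// assumed here) starts with "ref", i.e. the reference implementation ran.
bool uses_reference_impl(const std::string& line) {
    const std::vector<std::string> fields = split_verbose(line);
    return fields.size() > 4 && fields[4].rfind("ref", 0) == 0;
}
```

Feeding each captured log line to `uses_reference_impl` flags the primitives that did not get a JIT kernel.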


igorsafo commented on June 1, 2024

Yes, I was able to reproduce it. Another limitation of the JIT implementation I found is that it doesn't support the s32 data type. I will create an internal ticket to track this issue.


igorsafo commented on June 1, 2024

@dagdoron s32 support for the JIT implementation is in progress.

A separate question: would a shift operation serve you better, or is binary mul with s32 support enough for your use cases?


dagdoron commented on June 1, 2024

@igorsafo
s32 would be good enough. However, supporting shifts would be best: it would save us some cycles converting them, and I guess integer shifts may be faster than float mul.


igorsafo commented on June 1, 2024

@dagdoron Could you please try the latest version of the master branch? The support was added in 46135fd


dagdoron commented on June 1, 2024

@igorsafo - Thanks for the fast response and fix

With this commit, the binary primitive executes the jit version:

onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_s8::blocked:acdb::f0 dst_s32::blocked:acdb::f0,,,1x256x128x128,0.911133
onednn_verbose,exec,cpu,binary,jit:uni,undef,src_s32::blocked:acdb::f0 src_f32::blocked:acdb::f0 dst_s32::blocked:acdb::f0,attr-scratchpad:user attr-post-ops:binary_mul:s8:2 ,alg:binary_mul,1x256x128x128:1x256x1x1,7.27001
onednn_verbose,exec,cpu,binary,jit:uni,undef,src_s8::blocked:acdb::f0 src_f32::blocked:acdb::f0 dst_s8:a:blocked:acdb::f0,attr-scratchpad:user attr-post-ops:binary_mul:s8:2+binary_add:s32:14:acdb ,alg:binary_mul,1x256x128x128:1x256x1x1,1.86791


igorsafo commented on June 1, 2024

Great to know! I added an internal request for a shift operation, but there is no guarantee it will be implemented until we have more use cases and users, because implementing and maintaining it would require considerably more resources on our side.

I am closing this issue since the performance issue is fixed. Feel free to re-open or create a separate issue if you have any other requests.

