Comments (9)
Hi @dagdoron, sorry for the late response. I was trying to understand what happens in the example and how it can be mitigated. The issue is that the binary primitive is optimized for operations where the src0 and dst memory descriptors are similar (same tag, same data type), while src1 can be broadcast or have another data type. In your case src0 and dst have different data types, so binary dispatches to the reference implementation.
For (de-)quantization, oneDNN uses the reorder primitive, which has optimized implementations for reorders between different data types. So one potential solution would be to use a reorder with a src scale (it must be in f32) to dequantize data from s8 to f32, then separately use the binary primitive to complete the dequantization.
I see that your dequantization involves a shift and a scale, but why do you multiply by the shift rather than add or subtract it? This prevents implementing the whole dequantization as a single reorder. Here is the quantization that is supported by oneDNN: https://oneapi-src.github.io/oneDNN/dev_guide_attributes_quantization.html
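For readers following along, the two-step split suggested above can be illustrated with a plain C++ reference sketch. It does not call oneDNN: the function names `dequantize_s8` and `mul_per_channel` are hypothetical, and the data is assumed to be in nhwc layout so that the channel index varies fastest. The first function mirrors the arithmetic a reorder with an f32 src scale is expected to perform, and the second mirrors the broadcast multiply that would be left to binary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1 (arithmetic of a scaled s8 -> f32 reorder): dst = scale * src.
std::vector<float> dequantize_s8(const std::vector<int8_t> &src, float scale) {
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = scale * static_cast<float>(src[i]);
    return dst;
}

// Step 2 (arithmetic of binary_mul with a 1xCx1x1 src1 on nhwc data):
// every element is multiplied by the factor of its channel.
std::vector<float> mul_per_channel(const std::vector<float> &src0,
                                   const std::vector<float> &per_channel) {
    assert(!per_channel.empty() && src0.size() % per_channel.size() == 0);
    std::vector<float> dst(src0.size());
    const std::size_t channels = per_channel.size();
    for (std::size_t i = 0; i < src0.size(); ++i)
        dst[i] = src0[i] * per_channel[i % channels]; // nhwc: channel is innermost
    return dst;
}
```

A reference like this is mainly useful for checking the output of the real primitives on a small tensor before measuring performance.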
from onednn.
Hi @igorsafo
Thanks,
I'll try a reorder before the binary and let you know if that helps.
We are using a slightly different dequantization scheme, where the shifts are actual bit shifts, and we try to emulate the HW by multiplying by 2^x instead of shifting.
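The emulation described here relies on multiplying by 2^x giving the same result as a left shift by x bits. A minimal sketch of that equivalence follows. The names `shift_as_scale` and `emulated_shift` are hypothetical, only non-negative (left) shifts are covered, and the result is assumed to fit in s32.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Float factor that emulates a left shift by `shift` bits.
float shift_as_scale(int shift) {
    assert(shift >= 0 && shift < 31);
    return std::ldexp(1.0f, shift); // exactly 2^shift
}

// Emulated shift: multiply in float, then convert back to s32.
// Assumes the product fits in int32_t.
int32_t emulated_shift(int32_t value, int shift) {
    return static_cast<int32_t>(static_cast<float>(value) * shift_as_scale(shift));
}
```

Multiplying by a power of two is exact in floating point, but converting the s32 input to f32 is exact only for magnitudes up to 2^24, so the float path can differ from a true integer shift for larger inputs.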
from onednn.
Hi @igorsafo
I've changed src0 to s32 to match the type of tmp0, so now src0 and dst have the same type,
e.g.
std::vector<int32_t> src0(512, 1);
memory::desc src0_md = memory::desc(dims, dt::s32, tag::nhwc);
The execution still falls back to ref:
onednn_verbose,create:cache_miss,cpu,binary,ref:any,undef,src_s32::blocked:acdb::f0 src_f32::blocked:acdb::f0 dst_s32::blocked:acdb::f0,attr-post-ops:binary_mul:s8:2 ,alg:binary_mul,1x8x8x8:1x8x1x1,0.104004
onednn_verbose,exec,cpu,binary,ref:any,undef,src_s32::blocked:acdb::f0 src_f32::blocked:acdb::f0 dst_s32::blocked:acdb::f0,attr-post-ops:binary_mul:s8:2 ,alg:binary_mul,1x8x8x8:1x8x1x1,1.11914
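Checking for a fallback like this can be automated by reading the implementation name out of each verbose line. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the implementation name is the fifth comma-separated field, as in the lines quoted in this thread. Other oneDNN versions may lay out the verbose line differently.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split a verbose line on commas (interior empty fields are kept).
std::vector<std::string> split_fields(const std::string &line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) fields.push_back(field);
    return fields;
}

// Implementation name, assumed to be the fifth field as in the lines above.
std::string impl_name(const std::string &line) {
    const std::vector<std::string> fields = split_fields(line);
    assert(fields.size() > 4);
    return fields[4];
}

// True when the primitive was dispatched to a reference implementation.
bool is_ref_fallback(const std::string &line) {
    return impl_name(line).rfind("ref", 0) == 0; // name starts with "ref"
}
```

Applied to the two lines above, `impl_name` yields `ref:any`, so `is_ref_fallback` reports a fallback.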
from onednn.
Yes, I was able to reproduce it. Another limitation of the JIT implementation I found is that it doesn't support the s32 data type. I will create an internal ticket to track this issue.
from onednn.
@dagdoron s32 support is in progress for the jit implementation.
A separate question: would a shift operation serve you better, or is binary mul with s32 support enough for your use cases?
from onednn.
@igorsafo
s32 would be good enough. However, if you can support shifts, that would be best: it would save us some cycles converting them, and I guess integer shifts may be faster than float mul.
from onednn.
@dagdoron Could you please try the latest version of the master branch? The support was added in 46135fd.
from onednn.
@igorsafo - Thanks for the fast response and fix.
With this commit, the binary is executing the jit version:
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_s8::blocked:acdb::f0 dst_s32::blocked:acdb::f0,,,1x256x128x128,0.911133
onednn_verbose,exec,cpu,binary,jit:uni,undef,src_s32::blocked:acdb::f0 src_f32::blocked:acdb::f0 dst_s32::blocked:acdb::f0,attr-scratchpad:user attr-post-ops:binary_mul:s8:2 ,alg:binary_mul,1x256x128x128:1x256x1x1,7.27001
onednn_verbose,exec,cpu,binary,jit:uni,undef,src_s8::blocked:acdb::f0 src_f32::blocked:acdb::f0 dst_s8
from onednn.
Great to know! I added an internal request about the shift operation, but there is no guarantee it will be implemented until we have more use cases and users, because implementing and maintaining it will require many more resources on our side.
I am closing this issue since the performance issue is fixed. Feel free to re-open or create a separate issue if you have any other requests.
from onednn.