Comments (6)
Hi @vineel96 ,
The additional buffer for B is a way to implicitly change the layout of the B matrix at execution time to speed up computation. However, unlike caching, this procedure happens on every execution, because oneDNN primitives are stateless (for example, to support multi-threaded execution). The feature is an implementation detail, so it may be used by some implementations and ignored by others.
Here is the algorithm:
- (Answers your question 1) The brgemm initialization function init_brgemm_matmul_conf initializes use_buffer_b using the method you pointed at (oneDNN/src/cpu/x64/matmul/brgemm_matmul_utils.cpp, line 1128 in d68912d), so buffer_b is either required or not.
- If buffer_b is required:
  - An additional scratchpad memory is registered so that at execution time either the user or the library provides this buffer (oneDNN/src/cpu/x64/matmul/brgemm_matmul_utils.cpp, line 1473 in d68912d).
  - Brgemm strides are updated, because when B is copied into the scratchpad the copy routine may change the leading dimensions or the layout in general (oneDNN/src/cpu/x64/matmul/brgemm_matmul_utils.hpp, lines 236 to 244 in d68912d).
  - An additional kernel copy_B_kernel_ is created; it is responsible for copying matrix B into the scratchpad buffer (oneDNN/src/cpu/x64/matmul/brgemm_matmul.cpp, line 219 in d68912d).
- (Answers your questions 2 & 3) On each matmul execution, data from B is copied into buffer_b using the kernel copy_B_kernel_ as part of copy_b_chunk_in_buffer (oneDNN/src/cpu/x64/matmul/brgemm_matmul.cpp, lines 314 to 315 in d68912d).
- The compute kernel uses the scratchpad memory buffer_b (oneDNN/src/cpu/x64/matmul/brgemm_matmul.cpp, line 1113 in d68912d).
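The steps above can be sketched in plain C++. This is a conceptual model only, not oneDNN code; the blocking factor, layout, and function names are illustrative. A "copy kernel" repacks row-major B into a scratchpad with a different layout (hence the updated strides), and the "compute kernel" then reads B only through that scratchpad:

```cpp
#include <cassert>
#include <vector>

// Conceptual model of use_buffer_b (illustrative, not oneDNN code):
// the copy kernel repacks row-major B (K x N) into a scratchpad where each
// block of n_blk columns is stored contiguously, changing the effective
// leading dimension the compute kernel sees.
void copy_b_into_scratchpad(const float *B, float *buffer_b,
        int K, int N, int n_blk) {
    int n_blocks = (N + n_blk - 1) / n_blk;
    for (int nb = 0; nb < n_blocks; ++nb)
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < n_blk; ++j) {
                int n = nb * n_blk + j;
                // Zero-pad the tail block so the compute kernel can always
                // read full n_blk-wide rows.
                buffer_b[(nb * K + k) * n_blk + j]
                        = (n < N) ? B[k * N + n] : 0.f;
            }
}

// Compute kernel: C (M x N) = A (M x K) * B, reading B only through the
// repacked scratchpad buffer.
void matmul_with_buffer_b(const float *A, const float *buffer_b, float *C,
        int M, int K, int N, int n_blk) {
    int n_blocks = (N + n_blk - 1) / n_blk;
    for (int m = 0; m < M; ++m)
        for (int nb = 0; nb < n_blocks; ++nb)
            for (int j = 0; j < n_blk; ++j) {
                int n = nb * n_blk + j;
                if (n >= N) break;
                float acc = 0.f;
                for (int k = 0; k < K; ++k)
                    acc += A[m * K + k] * buffer_b[(nb * K + k) * n_blk + j];
                C[m * N + n] = acc;
            }
}
```

Because the primitive is stateless, the repack step runs on every execution, which matches the behavior described above.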
create_brgemm_matmul_copy_b itself does not copy matrix B; it generates a JIT kernel responsible for copying matrix B at primitive::execute(). jit_brgemm_matmul_copy_b_f32_t::generate() contains all the JIT code for the copy routine. copy_16_x_n_block() and compute_k_loop() are parts of this copy routine, so they are called within generate(). For example, copy_16_x_n_block() emits the instructions responsible for the copying (vmovups instructions). The result of generate() is code generated at runtime (just-in-time) that will be executed at primitive::execute() and destroyed later. Compared to regular code, JIT code lives on the heap (it is allocated via mmap), so profilers/debuggers might not see it. Please read the following page: https://oneapi-src.github.io/oneDNN/dev_guide_inspecting_jit.html It will help you dump the binary code of the copy routine so you can inspect it.
- Oh, this is a good question! As far as I remember, EVEX_compress_addr() was introduced as an optimization for the Xeon Phi family of processors. The reason is that if a vmovups instruction has an immediate offset bigger than 0x200, the instruction encoding becomes longer, which results in bigger code size. This is useful on any AVX-512-capable CPU, but it is not required, so vmovups(zmm_src, EVEX_compress_addr(reg_src, i * src_stride_)) can be safely replaced by vmovups(zmm_src, ptr[reg_src + i * src_stride_]).
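The compression being exploited here is the EVEX "disp8*N" rule from the Intel SDM: a displacement can be encoded in a single byte if it is a multiple of the memory operand size N and the quotient fits in a signed 8-bit integer. A minimal sketch of that rule (the function name is illustrative, not a oneDNN or Xbyak helper):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the EVEX disp8*N compressed-displacement rule that helpers like
// EVEX_compress_addr() exploit (function name here is illustrative).
// A displacement fits the compressed one-byte form if it is a multiple of
// the memory operand size N (e.g. 64 for a full ZMM load) and the scaled
// value fits in a signed 8-bit integer; otherwise a full 32-bit
// displacement is emitted, making the instruction longer.
bool fits_compressed_disp8(int64_t disp, int64_t operand_size /* N */) {
    if (disp % operand_size != 0) return false;
    int64_t scaled = disp / operand_size;
    return scaled >= -128 && scaled <= 127;
}
```

So for full 64-byte ZMM loads, compressed displacements cover multiples of 64 up to 127 * 64 bytes; beyond that the encoder falls back to the 4-byte displacement form, which is the code-size growth described above.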
Thank you for the answers. I have been trying to port matmul to an aarch64 machine. When we enable use_buffer_b, we get 0's as output for all input shapes; when we don't enable use_buffer_b, it gives non-zero output. Can you suggest what might be causing the zero output?
I see, thanks for the details. There are at least the following possible issues:
- Please make sure you initialize the contexts correctly for the copy and compute kernels. In oneDNN, contexts are used to pass arguments (which contain memory pointers to data) between primitives. The primitive itself has a context with all the arguments passed by the user. The brgemm matmul implementation should then take these arguments and pass them to the contexts of the copy kernel and the brgemm kernel:
  - Init brgemm_ctx based on the primitive ctx (oneDNN/src/cpu/x64/matmul/brgemm_matmul.cpp, lines 261 to 262 in cec7b41).
  - Init the copy routine ctx based on brgemm_ctx (oneDNN/src/cpu/x64/matmul/brgemm_matmul.cpp, lines 665 to 666 in cec7b41).
- Please make sure that in the use_buffer_b case the copy routine is actually invoked. In addition, as I explained before, you can inspect the JIT code of the copy routine to make sure it is correct. If it is not correct, it could copy into the wrong memory, so tr_src would still contain zeros after the copy kernel runs.
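The argument-forwarding chain can be modeled in plain C++ to show the failure mode. The struct and field names below are hypothetical, not oneDNN's real context types; the point is that each layer must copy the pointers it needs from the layer above, and a missed assignment leaves the copy kernel reading from or writing to the wrong buffer, which shows up as all-zero output:

```cpp
#include <cassert>

// Hypothetical model of the ctx forwarding chain (not oneDNN's real types):
// primitive ctx (user args) -> brgemm ctx -> copy-kernel ctx.
struct primitive_ctx_t { const float *src, *weights; float *dst; };
struct brgemm_ctx_t { const float *A, *B; float *C; float *scratch_b; };
struct copy_b_ctx_t { const float *src; float *tr_src; };

brgemm_ctx_t init_brgemm_ctx(const primitive_ctx_t &ctx, float *scratch) {
    // Every pointer the kernels need must be taken from the primitive ctx;
    // a field left unset here is a typical cause of all-zero output.
    return {ctx.src, ctx.weights, ctx.dst, scratch};
}

copy_b_ctx_t init_copy_b_ctx(const brgemm_ctx_t &bctx) {
    // The copy kernel reads from B and writes into the scratchpad (tr_src).
    return {bctx.B, bctx.scratch_b};
}
```

Verifying that the pointers reaching the innermost kernel are identical to the user-provided ones is a quick sanity check before digging into the JIT code itself.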
In general, I would recommend reducing the matmul size to something like 4x4x4 or even smaller, filling the weights and sources with dummy numbers like 1, 2, 3, ..., and checking the data before kernel execution, in between, and after to see where the issue comes from.
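That debugging recipe can be scripted. A minimal sketch, assuming nothing oneDNN-specific, just a plain row-major reference matmul to compare the implementation's output and intermediate buffers against:

```cpp
#include <cassert>
#include <vector>

// Debugging sketch (illustrative): shrink the problem to 4x4x4, fill the
// inputs with 1, 2, 3, ... and compute a reference result; then compare the
// implementation's output and intermediate buffers against it.
std::vector<float> reference_matmul(const std::vector<float> &A,
        const std::vector<float> &B, int M, int K, int N) {
    std::vector<float> C(M * N, 0.f);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k)
                C[m * N + n] += A[m * K + k] * B[k * N + n];
    return C;
}

// With known non-zero inputs, an all-zero output (or an all-zero buffer_b
// after the copy kernel ran) immediately localizes the failing step.
bool all_zero(const std::vector<float> &v) {
    for (float x : v)
        if (x != 0.f) return false;
    return true;
}
```

Checking buffer_b with a helper like all_zero right after the copy kernel runs distinguishes "copy kernel wrote nothing / wrote elsewhere" from "compute kernel read the wrong buffer".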
Hi @igorsafo,
Thank you very much for the answers.
- I profiled the code flow through the create_brgemm_matmul_copy_b function, but I could not actually find the place where copy_B_kernel_ copies matrix B's data into the scratchpad buffer. The flow is:
  [create_brgemm_matmul_copy_b() -> jit_brgemm_matmul_copy_b_f32_t -> create_kernel() -> generate()]
  In the generate() function I see JIT assembly functions, but where exactly is matrix B's data fetched into the scratchpad?
- In brgemm_matmul_copy_utils.cpp, the functions copy_16_x_n_block(), compute_k_loop() and generate() are also used for copying data into buffer_b. How is each function responsible for copying B's data into the scratchpad buffer?
- Also, what is the significance of the EVEX_compress_addr() function in copy_16_x_n_block()?
Thanks for insights @igorsafo.