
Comments (6)

igorsafo commented on June 8, 2024

Hi @vineel96 ,

The additional buffer for B is a way to implicitly change the layout of B's memory at execution time to speed up computations. However, unlike caching, this happens on every execution, because oneDNN primitives are stateless (for example, to support multi-threaded execution). The feature is an implementation detail, so it may be used by some implementations and ignored by others.

Here is the algorithm:

  1. (Answers your question 1) The brgemm initialization function init_brgemm_matmul_conf initializes use_buffer_b using the method you pointed at:
    bgmmc.use_buffer_b = bm_conf_utils.use_buffer_b();
    Depending on the matmul/brgemm configuration, the additional buffer for matrix B (buffer_b) is either required or not.
  2. If buffer_b is required:
    1. Additional scratchpad memory is registered, so that at execution either the user or the library provides this buffer:
      if (bgmmc.use_buffer_b) {
    2. The brgemm strides are updated, because when B is copied into the scratchpad the copy routine may change the leading dimension or the layout in general:
      inline dim_t get_actual_LDB() const {
          if (bgmmc.wei_tag == format_tag::acbd && !bgmmc.use_buffer_b) {
              assert(bgmmc.b_dt_sz == bgmmc.tr_b_dt_sz);
              return bgmmc.B_strides[1] / bgmmc.b_dt_sz;
          }
          bool use_blocked_LDB = bgmmc.is_amx || bgmmc.use_buffer_b
                  || bgmmc.wei_tag != plain_tensor_layout_tag;
          return use_blocked_LDB ? bgmmc.wei_n_blk : bgmmc.N;
      }
    3. An additional kernel, copy_B_kernel_, is created; it is responsible for copying matrix B into the scratchpad buffer:
      CHECK(create_brgemm_matmul_copy_b(copy_B_kernel_, &bgmmc));
    4. (Answers your questions 2 & 3) On each matmul execution, data from B is copied into buffer_b by the kernel copy_B_kernel_ as part of copy_b_chunk_in_buffer:
      if (bgmmc.use_buffer_b && !skip_copy_b)
          copy_b_chunk_in_buffer(brgmm_ctx, ithr, b, nb, kc);
    5. The compute kernel uses the scratchpad memory buffer_b:
      addr_batch[b_iter].ptr.B = (bgmmc_.use_buffer_b)

from onednn.

igorsafo commented on June 8, 2024
  1. create_brgemm_matmul_copy_b itself does not copy matrix B; it generates a JIT kernel that is responsible for copying matrix B at primitive::execute().
  2. jit_brgemm_matmul_copy_b_f32_t::generate() contains all of the JIT code for the copy routine. copy_16_x_n_block() and compute_k_loop() are parts of this copy routine, so they are called within generate(). For example, copy_16_x_n_block() emits the instructions responsible for the copying (vmovups instructions). The result of generate() is code generated at runtime (Just-In-Time) that will be executed at primitive::execute() and destroyed later. Unlike regular code, JIT code lives on the heap (it is allocated via mmap), so profilers/debuggers might not see it. Please read the following page: https://oneapi-src.github.io/oneDNN/dev_guide_inspecting_jit.html It will help you dump the binary code of the copy routine so you can inspect it.
  3. Oh, this is a good question! As far as I remember, EVEX_compress_addr() was introduced as an optimization for the Xeon Phi family of processors. The reason is that if a vmovups instruction has an immediate offset bigger than 0x200, the instruction encoding grows, which results in bigger code size. It is useful on any avx512-capable CPU, but it is not required, so vmovups(zmm_src, EVEX_compress_addr(reg_src, i * src_stride_)) can be safely replaced by vmovups(zmm_src, ptr[reg_src + i * src_stride_]).


igorsafo commented on June 8, 2024

> Thank you for the answers. I have been trying to port matmul to an aarch64 machine. When we enable use_buffer_b, we get 0s as output for all input shapes; when we don't enable use_buffer_b, the output is non-zero. Can you suggest what might be causing the zero output?

I see, thanks for the details. There are at least the following possible issues:

  • Please make sure you initialize the contexts correctly for the copy and compute kernels. In oneDNN, contexts are used to pass arguments (which contain memory pointers to data) between primitives and kernels. The primitive itself has a context with all the arguments passed by the user; the brgemm matmul implementation should then take these arguments and pass them to the contexts of the copy kernel and the brgemm kernel.
  • Please make sure that, in the use_buffer_b case, the copy routine is actually invoked. In addition, as I explained before, you can inspect the JIT code of the copy routine to make sure it is correct. If it is incorrect, it could copy into the wrong memory, so tr_src would still contain zeros after the copy kernel runs.

In general, I would recommend reducing the matmul size to something like 4x4x4 or even smaller, filling the weights and sources with dummy numbers like 1, 2, 3, ..., and checking the data before, between, and after the kernel executions to see where the issue comes from.


vineel96 commented on June 8, 2024

Hi @igorsafo,
Thank you very much for the answers.

  1. I profiled the code flow through the create_brgemm_matmul_copy_b function, but I could not find the place where copy_B_kernel_ actually copies matrix B's data into the scratchpad buffer. The flow is:
    [create_brgemm_matmul_copy_b() -> jit_brgemm_matmul_copy_b_f32_t -> create_kernel() -> generate()]
    In the generate() function I see JIT assembly functions, but where exactly is matrix B's data fetched into the scratchpad?
  2. In brgemm_matmul_copy_utils.cpp, in
    void jit_brgemm_matmul_copy_b_f32_t::copy_16_x_n_block(
    the copy_16_x_n_block(), compute_k_loop(), and generate() functions are also used for copying data into buffer_b. What is each function responsible for when copying B's data into the scratchpad buffer?
  3. Also, what is the significance of the EVEX_compress_addr() function inside copy_16_x_n_block()?


vineel96 commented on June 8, 2024

Thank you for the answers.
I have been trying to port matmul to an aarch64 machine. When we enable use_buffer_b, we get 0s as output for all input shapes; when we don't enable use_buffer_b, the output is non-zero.
Can you suggest what might be causing the zero output?


vineel96 commented on June 8, 2024

Thanks for the insights, @igorsafo.

