
Comments (6)

igorsafo commented on June 8, 2024

Hi @vineel96 ,

The additional buffer for B is a way to implicitly change the layout of B's memory at execution time to speed up computations. However, unlike caching, this happens on every execution, because oneDNN primitives are stateless (for example, to support multi-threaded execution). The feature is an implementation detail, so it may be used by some implementations and ignored by others.

Here is the algorithm:

  1. (Answers your question 1) The brgemm initialization function init_brgemm_matmul_conf initializes use_buffer_b using the method you pointed at:
    bgmmc.use_buffer_b = bm_conf_utils.use_buffer_b();
    Depending on the matmul/brgemm configuration, the additional buffer for matrix B (buffer_b) is either required or not.
  2. If buffer_b is required:
    1. Additional scratchpad memory is registered, so that at execution either the user or the library provides this buffer:
      if (bgmmc.use_buffer_b) {
    2. The brgemm strides are updated, because when B is copied into the scratchpad the copy routine may change the leading dimension or the layout in general:
      inline dim_t get_actual_LDB() const {
          if (bgmmc.wei_tag == format_tag::acbd && !bgmmc.use_buffer_b) {
              assert(bgmmc.b_dt_sz == bgmmc.tr_b_dt_sz);
              return bgmmc.B_strides[1] / bgmmc.b_dt_sz;
          }
          bool use_blocked_LDB = bgmmc.is_amx || bgmmc.use_buffer_b
                  || bgmmc.wei_tag != plain_tensor_layout_tag;
          return use_blocked_LDB ? bgmmc.wei_n_blk : bgmmc.N;
      }
    3. An additional kernel, copy_B_kernel_, is created; it is responsible for copying matrix B into the scratchpad buffer:
      CHECK(create_brgemm_matmul_copy_b(copy_B_kernel_, &bgmmc));
    4. (Answers your questions 2 & 3) On each matmul execution, data from B is copied into buffer_b by the kernel copy_B_kernel_ as part of copy_b_chunk_in_buffer:
      if (bgmmc.use_buffer_b && !skip_copy_b)
          copy_b_chunk_in_buffer(brgmm_ctx, ithr, b, nb, kc);
    5. The compute kernel uses the scratchpad memory buffer_b:
      addr_batch[b_iter].ptr.B = (bgmmc_.use_buffer_b)

from onednn.

igorsafo commented on June 8, 2024
  1. create_brgemm_matmul_copy_b itself does not copy matrix B; it generates a JIT kernel that is responsible for copying matrix B at primitive::execute().
  2. jit_brgemm_matmul_copy_b_f32_t::generate() contains all of the JIT code for the copy routine. copy_16_x_n_block() and compute_k_loop() are parts of this copy routine, so they are called within generate(). For example, copy_16_x_n_block() emits the instructions responsible for the copying (vmovups instructions). The result of generate() is code generated at runtime (Just-In-Time) that will be executed at primitive::execute() and destroyed later. Unlike regular code, JIT code lives on the heap (it is allocated via mmap), so profilers/debuggers might not see it. Please read the following page: https://oneapi-src.github.io/oneDNN/dev_guide_inspecting_jit.html It will help you dump the binary code of the copy routine so you can inspect it.
  3. Oh, this is a good question! As far as I remember, EVEX_compress_addr() was introduced as an optimization for the Xeon Phi family of processors. The reason is that if a vmovups instruction has an immediate offset bigger than 0x200, the instruction encoding grows, which results in bigger code size. It is useful on any avx512-capable CPU, but it is not required, so vmovups(zmm_src, EVEX_compress_addr(reg_src, i * src_stride_)) can be safely replaced by vmovups(zmm_src, ptr[reg_src + i * src_stride_]).


igorsafo commented on June 8, 2024

> Thank you for the answers. I have been trying to port matmul to an aarch64 machine. When we enable use_buffer_b, we get 0s as output for all input shapes; when we don't enable use_buffer_b, the output is non-zero. Can you suggest what might be causing the zero output?

I see, thanks for the details. There are at least the following possible issues:

  • Please make sure you initialize the contexts correctly for the copy and compute kernels. In oneDNN, contexts are used to pass arguments (which contain memory pointers to data) between primitives and kernels. The primitive itself has a context with all the arguments passed by the user; the brgemm matmul implementation should then take these arguments and pass them to the contexts of the copy kernel and the brgemm kernel.
  • Please make sure that, in the use_buffer_b case, the copy routine is actually invoked. In addition, as I explained before, you can inspect the JIT code of the copy routine to make sure it is correct. If it is incorrect, it could copy into the wrong memory, so tr_src would still contain zeros after the copy kernel runs.

In general, I would recommend reducing the matmul size to something like 4x4x4 or even smaller, filling the weights and sources with dummy numbers like 1, 2, 3, ..., and checking the data before, between, and after the kernel executions to see where the issue comes from.


vineel96 commented on June 8, 2024

Hi @igorsafo,
Thank you very much for the answers.

  1. I profiled the code flow through the create_brgemm_matmul_copy_b function, but I could not find the place where copy_B_kernel_ actually copies matrix B's data into the scratchpad buffer. The flow is:
    [create_brgemm_matmul_copy_b() -> jit_brgemm_matmul_copy_b_f32_t -> create_kernel() -> generate()]
    In the generate() function I see JIT assembly functions, but where exactly is matrix B's data fetched into the scratchpad?
  2. In brgemm_matmul_copy_utils.cpp, in
    void jit_brgemm_matmul_copy_b_f32_t::copy_16_x_n_block(
    the copy_16_x_n_block(), compute_k_loop(), and generate() functions are also used for copying data into buffer_b. What is each function responsible for when copying B's data into the scratchpad buffer?
  3. Also, what is the significance of the EVEX_compress_addr() function inside copy_16_x_n_block()?


vineel96 commented on June 8, 2024

Thank you for the answers.
I have been trying to port matmul to an aarch64 machine. When we enable use_buffer_b, we get 0s as output for all input shapes; when we don't enable use_buffer_b, the output is non-zero.
Can you suggest what might be causing the zero output?


vineel96 commented on June 8, 2024

Thanks for the insights, @igorsafo.

