I'm testing different input tensor sizes to see their effect on the pri…

Matmul - tensor size effect on performance (oneDNN issue, closed)

ItayLasch commented on June 15, 2024

Comments (3)

vpirogov commented on June 15, 2024

@ItayLasch, the oneDNN programming model assumes that primitives are created once and reused during model execution (either explicitly or via the oneDNN primitive cache), so we do not provide any guarantees about creation time. If you need the creation time to be zero, there are two options:

Use matmul with runtime dimensions (see example here).
Use the dnnl_?gemm API.

While the creation time will be zero in these cases, the execution time will suffer compared to a fully specialized matmul primitive.

Hope this helps!

ItayLasch commented on June 15, 2024

I don't want creation time to be zero. I wanted to know whether there are particular input dimensions for which performance is worse or better than usual. The only variable I'm looking at right now is the size of the input and output dimensions.

mgouicem commented on June 15, 2024

Adding @msotoflo @ankalinin

"Is there any rule of thumb for deciding whether to use this primitive based on the tensor sizes?"
"I noticed that the output dimensions seem to affect performance more than the input sizes."

It is really implementation- and architecture-dependent. In general, here are some guidelines, though their impact on final performance will vary:

Use multiples of the hardware vector length to avoid tail handling (e.g., 64 should be OK for most instruction sets and data types).
Use multiples of the number of cores if you are targeting a particular platform. This allows perfect load balancing.
Try to avoid large powers of 2 to make full use of the cache. CPU caches are typically N-way set associative, so for the L1 cache, accesses with a large power-of-2 stride map to the same few sets, which prevents the whole cache from being used effectively (see here).
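The three guidelines can be turned into a quick shape check. The Python sketch below is a heuristic, not part of oneDNN: `sizing_warnings`, `l1_sets_touched`, and the "large power of 2" threshold of 256 are hypothetical, and the default L1 geometry (32 KiB, 8-way, 64-byte lines, hence 64 sets) is an assumption that varies across CPUs. `l1_sets_touched` counts how many distinct cache sets a column walk hits in a row-major f32 matrix, which shows why a leading dimension of 1024 is a bad case.

```python
from typing import List, Optional


def is_power_of_two(x: int) -> bool:
    """True if x is a positive power of two."""
    return x > 0 and (x & (x - 1)) == 0


def sizing_warnings(dim: int, vec: int = 64, cores: Optional[int] = None,
                    large_pow2: int = 256) -> List[str]:
    """Flag a dimension that breaks one of the sizing guidelines (heuristic only)."""
    warnings = []
    if dim % vec:
        warnings.append(f"{dim} is not a multiple of {vec}: expect tail handling")
    if cores is not None and dim % cores:
        warnings.append(f"{dim} is not a multiple of {cores} cores: uneven load balancing")
    if dim >= large_pow2 and is_power_of_two(dim):
        warnings.append(f"{dim} is a large power of 2: possible cache-set conflicts")
    return warnings


def l1_sets_touched(row_elems: int, n_rows: int, elem_bytes: int = 4,
                    cache_bytes: int = 32 * 1024, ways: int = 8,
                    line_bytes: int = 64) -> int:
    """Count distinct L1 sets hit when reading one column of a row-major matrix.

    Consecutive rows are row_elems * elem_bytes apart, so the access to row r
    lands in set (r * stride // line_bytes) % n_sets (assumed geometry).
    """
    n_sets = cache_bytes // (ways * line_bytes)
    stride = row_elems * elem_bytes
    return len({(r * stride // line_bytes) % n_sets for r in range(n_rows)})
```

Under these assumptions, `l1_sets_touched(1024, 64)` is 1: every row of a 1024-wide f32 matrix maps to the same set, so the column walk can only use 8 lines (512 bytes) of the 32 KiB cache. Padding the row to 1088 elements (still a multiple of 64) spreads the accesses over 16 sets, and `sizing_warnings(1088, cores=8)` returns an empty list.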

Hope that helps.