
Comments (11)

Maratyszcza avatar Maratyszcza commented on May 20, 2024

Which convolution parameters and algorithm do you use?

from nnpack.

conansherry avatar conansherry commented on May 20, 2024

Default params: the AUTO algorithm with the BLOCK_BASED transform strategy.

conansherry avatar conansherry commented on May 20, 2024

This is my prototxt.
The input data is 1 x 3 x 60 x 60. (I also tried another input size, 640 x 480; it was likewise slower than OpenBLAS, costing about 2x the time.)
My OpenBLAS is compiled with NDK r12b, including gfortran.

austingg avatar austingg commented on May 20, 2024

@conansherry NNPACK only supports convolutions with stride 1; when stride > 1, NNPACK also falls back to im2col + SGEMM.
However, I wonder why it costs 2x the time compared to OpenBLAS.
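For readers unfamiliar with the fallback path mentioned above: im2col unfolds each input patch into a column so that a strided convolution becomes a single matrix multiply. A minimal NumPy sketch of the idea (an illustration only, not NNPACK's actual implementation):

```python
import numpy as np

def im2col(x, kh, kw, stride):
    """Unfold a (C, H, W) input into a (C*kh*kw, out_h*out_w) patch matrix."""
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, idx] = patch.ravel()
            idx += 1
    return cols, out_h, out_w

def conv_im2col(x, weights, stride):
    """Convolution as one GEMM: weights (out_c, in_c, kh, kw) -> (out_c, out_h, out_w)."""
    out_c, in_c, kh, kw = weights.shape
    cols, out_h, out_w = im2col(x, kh, kw, stride)
    return (weights.reshape(out_c, -1) @ cols).reshape(out_c, out_h, out_w)
```

The patch matrix duplicates overlapping input pixels, which is why im2col trades extra memory for the ability to use a highly tuned SGEMM.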

conansherry avatar conansherry commented on May 20, 2024

@austingg Oh, I see; I checked the source code and you are right.
OpenBLAS built with gfortran is the best BLAS library on Android according to my experiments, compared to pure-C OpenBLAS and Eigen.

austingg avatar austingg commented on May 20, 2024

@conansherry Thanks for sharing your experiment results. I will run some further experiments on OpenBLAS with gfortran.

conansherry avatar conansherry commented on May 20, 2024

@Maratyszcza @austingg Does NNPACK only support specific kernel sizes like 3x3 or 16x16? In my new test, kernel size 5 with stride 1 produces wrong results.

conansherry avatar conansherry commented on May 20, 2024

@Maratyszcza @austingg Oh, I looked at the Caffe2 implementation. I use the tuple-based transform strategy and everything is OK. I also checked the source code in convolution-inference.c; the other modes are not implemented.
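The FFT-based paths being debugged here must, when correct, match direct convolution exactly (up to floating-point error). A small NumPy check of that equivalence for a 5x5 kernel at stride 1, via the convolution theorem (an illustration of the math, not NNPACK's code):

```python
import numpy as np

def conv2d_direct(x, k):
    """Valid cross-correlation of a 2-D input with a 2-D kernel."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

def conv2d_fft(x, k):
    """Same result via FFT: pointwise-multiply spectra, then crop the valid region.
    The kernel is flipped because the FFT computes convolution, not correlation."""
    kh, kw = k.shape
    h, w = x.shape[0] + kh - 1, x.shape[1] + kw - 1
    spec = np.fft.rfft2(x, (h, w)) * np.fft.rfft2(k[::-1, ::-1], (h, w))
    full = np.fft.irfft2(spec, (h, w))
    return full[kh-1:x.shape[0], kw-1:x.shape[1]]
```

If an NNPACK transform strategy gives results that diverge from such a direct reference, that points at an unimplemented or buggy code path rather than an inherent limitation of the algorithm.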

conansherry avatar conansherry commented on May 20, 2024

@Maratyszcza @austingg
Here I will share some results from my experiments.
Android mobile phone: XIAOMI 5 Plus. All libraries run in single-threaded mode.
The net contains 4 conv layers and two inner-product layers; all conv layers have stride 1 and kernel size 5 or 3.
I ran the program 10 times for each experiment. Here are the results; based on them, I will continue to use OpenBLAS in my library. Thank you for sharing the NNPACK source; it also does a good job on multi-core CPUs.

OpenBLAS with gfortran (times in ms)
time forward 11.841000
time forward 10.097000
time forward 10.139000
time forward 10.583000
time forward 10.498000
time forward 10.358000
time forward 10.501000
time forward 10.440000
time forward 10.524000
time forward 10.268000

NNPACK FFT16x16 (ms)
time forward 32.105999
time forward 28.781000
time forward 29.034000
time forward 61.912998
time forward 31.129999
time forward 27.649000
time forward 27.438000
time forward 26.731001
time forward 31.448000
time forward 28.899000

NNPACK FFT8x8 (ms)
time forward 21.823999
time forward 21.607000
time forward 13.321000
time forward 15.339000
time forward 33.285000
time forward 19.327000
time forward 20.174000
time forward 16.476999
time forward 15.926000
time forward 16.066000

NNPACK AUTO (ms)
time forward 19.642000
time forward 20.684000
time forward 17.167999
time forward 15.738000
time forward 15.673000
time forward 14.938000
time forward 14.289000
time forward 17.891001
time forward 17.363001
time forward 16.375000

NNPACK SGEMM (ms)
time forward 23.778999
time forward 22.764000
time forward 33.705002
time forward 34.299000
time forward 28.004000
time forward 30.851999
time forward 25.034000
time forward 25.563999
time forward 33.702999
time forward 23.247999
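Averaging the ten runs per backend makes the comparison easier to read. The values below are transcribed from the lists above (rounded to three decimals; units assumed to be milliseconds):

```python
# Forward times per backend, transcribed from the benchmark lists above.
times = {
    "openblas":     [11.841, 10.097, 10.139, 10.583, 10.498,
                     10.358, 10.501, 10.440, 10.524, 10.268],
    "nnpack_fft16": [32.106, 28.781, 29.034, 61.913, 31.130,
                     27.649, 27.438, 26.731, 31.448, 28.899],
    "nnpack_fft8":  [21.824, 21.607, 13.321, 15.339, 33.285,
                     19.327, 20.174, 16.477, 15.926, 16.066],
    "nnpack_auto":  [19.642, 20.684, 17.168, 15.738, 15.673,
                     14.938, 14.289, 17.891, 17.363, 16.375],
    "nnpack_sgemm": [23.779, 22.764, 33.705, 34.299, 28.004,
                     30.852, 25.034, 25.564, 33.703, 23.248],
}
means = {name: round(sum(v) / len(v), 2) for name, v in times.items()}
# openblas ~10.52, nnpack_auto ~16.98, nnpack_fft8 ~19.33,
# nnpack_sgemm ~28.10, nnpack_fft16 ~32.51
```

On this single-threaded workload OpenBLAS is consistently fastest, with NNPACK AUTO the best of the NNPACK modes.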

austingg avatar austingg commented on May 20, 2024

@conansherry That's a pretty good result: a CNN application that costs only about 10 ms on a mobile device.

According to my research, gfortran is only needed for LAPACK; the conv layers only use GEMM. Have you run any experiments without gfortran? Correct me if I am wrong.

Maratyszcza avatar Maratyszcza commented on May 20, 2024
  1. The implicit GEMM algorithm is similar to Caffe's im2col + SGEMM, but it is optimized for a smaller memory footprint. This memory-footprint optimization can make it slower than im2col + SGEMM.
  2. For stride > 1 cases, only the implicit GEMM algorithm is supported in NNPACK.
  3. When the number of input channels to a convolution is small, the operation is similar to an outer product: it is intrinsically memory-bound, and NNPACK's fast algorithms do not help with performance.
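Point 3 can be made concrete with a back-of-the-envelope arithmetic-intensity estimate. This is a rough model (fp32, each tensor touched exactly once, layer sizes chosen for illustration), not a statement about NNPACK internals:

```python
def conv_arithmetic_intensity(c_in, c_out, k, h_out, w_out):
    """Rough FLOPs-per-byte estimate for a dense k x k, stride-1 convolution."""
    flops = 2 * c_in * c_out * k * k * h_out * w_out          # multiply-adds
    h_in, w_in = h_out + k - 1, w_out + k - 1                 # 'valid' padding
    bytes_moved = 4 * (c_in * h_in * w_in                     # read input once
                       + c_in * c_out * k * k                 # read kernel once
                       + c_out * h_out * w_out)               # write output once
    return flops / bytes_moved

# A 3-channel first layer (as in the net discussed above) vs. a deeper layer:
low = conv_arithmetic_intensity(c_in=3, c_out=32, k=5, h_out=56, w_out=56)
high = conv_arithmetic_intensity(c_in=64, c_out=32, k=5, h_out=56, w_out=56)
```

With few input channels the FLOPs-per-byte ratio is low, so the layer is limited by memory bandwidth and a faster transform algorithm cannot speed it up much; the ratio grows roughly linearly with `c_in`.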
