Comments (7)
Torch's MM-based convolutions on CPU use a lot more memory, and the shapes are probably not as well optimized for OpenBLAS on ARM (Torch unfolds all mini-batches and does a single MM call, rather than a per-batch unfold + GEMM as Caffe does). I'd suggest trying out:
https://github.com/mvitez/OpenBLAS-conv
https://github.com/mvitez/thnets
This very old fork of Torch also has optimized NEON assembly convolutions, but only for 32-bit ARM: soumith/torch-android@af6dc1e
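To make the difference concrete, here is a minimal NumPy sketch of the two unfolding strategies described above. This is illustrative pseudocode, not the actual Torch or Caffe internals; all function and variable names are my own. The point is that unfolding the whole mini-batch before one large GEMM needs roughly N times the im2col scratch memory of the per-sample approach.

```python
import numpy as np

def im2col(x, kh, kw):
    # x: (C, H, W) -> columns of shape (C*kh*kw, out_h*out_w), stride 1, no padding
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w), dtype=x.dtype)
    row = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                cols[row] = x[c, i:i + out_h, j:j + out_w].ravel()
                row += 1
    return cols

def conv_per_batch(x, w):
    # Caffe-style: unfold one sample at a time, one small GEMM per sample.
    # Only one sample's im2col buffer is alive at any moment.
    N, C, H, W = x.shape
    F, _, kh, kw = w.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    wm = w.reshape(F, -1)                              # (F, C*kh*kw)
    out = np.empty((N, F, out_h, out_w), dtype=x.dtype)
    for n in range(N):
        out[n] = (wm @ im2col(x[n], kh, kw)).reshape(F, out_h, out_w)
    return out

def conv_all_batches(x, w):
    # Unfold every sample up front, then a single large GEMM:
    # the scratch buffer is N times bigger than in conv_per_batch.
    N, C, H, W = x.shape
    F, _, kh, kw = w.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    wm = w.reshape(F, -1)
    cols = np.concatenate([im2col(x[n], kh, kw) for n in range(N)], axis=1)
    res = wm @ cols                                    # (F, N*out_h*out_w)
    return res.reshape(F, N, out_h * out_w).transpose(1, 0, 2) \
              .reshape(N, F, out_h, out_w)
```

Both variants compute identical outputs; they differ only in peak scratch memory and in whether BLAS sees one large GEMM or N smaller ones.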
from convnet-benchmarks.
That being said, 20x seems hugely suspect, as they are both calling GEMM.
@soumith Thanks for the pointers! I'll try out the codes you've linked and post back here. If it is indeed a difference in unfolding all batches vs. per-batch unfolds (in Caffe), then it should make a huge difference. Thanks!
@soumith I've validated AlexNet using thnets (https://github.com/mvitez/thnets), and the TX1's ARM A57 CPU is now within 18% of the Caffe implementation (4.54 FPS on thnets vs. 5.3 FPS on Caffe) at batch_size = 4. I attribute the speedup to assembly-level intrinsics plus highly optimized OpenBLAS kernels for the ARM platform.
I couldn't verify your claim that Torch unfolds all batches and performs a single MM call (and that Caffe unfolds per batch and performs multiple MM calls). Running ~/tegrastats to monitor memory usage, it appears that Torch (in my original Torch benchmark) actually uses less memory than Caffe.
Anyways, you solved my problem :) thanks man.
Hi, these numbers seem very bad... In case it helps, I am developing my own toolkit for academic purposes:
https://github.com/RParedesPalacios/Layers
(I still have to upload the source code)
AlexNet with batch = 100, forward-only (inference), takes roughly 2 seconds. I use lowering with the whole batch unfolded. I will try to upload the code so you can try it.
regards
@RParedesPalacios, are we talking about inference on the TX1's ARM CPU? If so, isn't 50 images per second unreasonable?
Say it takes roughly 720 MFLOP to perform a single forward pass for one image [1]. 50 images per second would then translate to about 36 GFLOP/s (720 MFLOP * 50 images/s). I suspect the peak performance of the TX1's ARM CPUs cannot surpass 10 GFLOP/s even with NEON and multithreading enabled.
[1] https://groups.google.com/forum/#!topic/caffe-users/cUD3IF5NMOk
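The back-of-the-envelope check above can be written out explicitly. Note the 10 GFLOP/s figure for the TX1's A57 cluster is an assumption from the comment, not a measured number:

```python
# Roofline-style sanity check of the claimed 50 FPS AlexNet inference rate.
flop_per_image = 720e6        # AlexNet forward pass, per reference [1]
claimed_fps = 50
required_gflops = flop_per_image * claimed_fps / 1e9   # GFLOP/s needed

assumed_peak_gflops = 10.0    # assumed rough ceiling for the TX1 CPU cluster

print(f"required: {required_gflops:.0f} GFLOP/s, "
      f"assumed peak: {assumed_peak_gflops:.0f} GFLOP/s")
```

Since the required throughput exceeds the assumed CPU peak by more than 3x, the claimed frame rate cannot hold on the CPU alone under these assumptions.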
@carlodelmundo Hi, I misunderstood the point: I read "... Intel Xeon E5-2637 CPU ..." and thought the following numbers referred to that, but on a second read I see that all the numbers refer to the ARM CPU.
Sorry for that, regards!