soumith / convnet-benchmarks
Easy benchmarking of all publicly accessible implementations of convnets
License: MIT License
Intel released a small blog post recently claiming crazy-fast speeds for ConvNets on their Haswell CPU line.
I took their Caffe implementation, painfully installed the dependencies, and the numbers look almost too good to be true. Either someone refutes me, or these are very cool numbers.
Link to blog-post:
https://software.intel.com/en-us/articles/single-node-caffe-scoring-and-training-on-intel-xeon-e5-series-processors
A full [forward + backward] on AlexNet on a Desktop 6-core Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz takes an average of 164ms EDIT: 268 ms.
Just for comparison, the latest and greatest NVIDIA Titan-X does the same round-trip in 96 ms. An older generation GPU like Tesla K40 is slower, pegging at around 200+ ms.
I tried to get VGG working, but ran into assertions about unimplemented code paths. Regardless, if AlexNet is this fast, the others will probably be in the ballpark.
Can someone else try the Intel stuff? I need a couple more sanity checks before I can believe this result. Look at how little time they are spending in the convolution layers, even the biggest ones: https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L329-L365
Once there are examples of standard networks running on tensorflow.
Hi,
when I execute luarocks install ccn2
I got the following error:
CMake Error at ccn2_generated_bias_kernels.cu.o.cmake:264 (message):
Error generating file
/tmp/luarocks_ccn2-scm-1-5691/cuda-convnet2.torch/build/CMakeFiles/ccn2.dir//./ccn2_generated_bias_kernels.cu.o
make[2]: *** [CMakeFiles/ccn2.dir/./ccn2_generated_bias_kernels.cu.o] Error 1
make[1]: *** [CMakeFiles/ccn2.dir/all] Error 2
make: *** [all] Error 2
many thanks
Hi, I'm new to deep learning.
I have a few newbie questions.
For the benchmarks, you have AlexNet, Overfeat, OxfordNet, GoogleNet, etc. These are models.
And in each model category, there are libraries and classes. So how did you test them?
What I mean is: Overfeat, for example, is written in C++ and as such has its own way to compile and run. To benchmark Overfeat using different libraries such as Torch, TensorFlow, Caffe, etc., did you simply rewrite the model in each library? Or is there some kind of linker program (e.g. something that maps TensorFlow to AlexNet)?
I know this is a very basic question. Thank you so much for your help!
Thanks for your awesome work!
I am a little confused about the benchmark numbers you got: is the time in the tables for a single image or for a batch of images? I looked into the imagenet_winners folder and found that for the first three nets the batch size is 128 (if I understand correctly), but for the VGG net the batch size is 64. So when you got your results, did you divide the time by the batch size?
As far as I can tell, the OpenCL-based benchmarks are also done on the Nvidia GPU. Nvidia purposely nerfed their OpenCL drivers to promote CUDA (AMD is on OpenCL 2.0 while Nvidia is on 1.1, and the two are not equal in performance).
Non-machine-learning benchmarks show more or less performance parity when OpenCL is run on an AMD GPU of similar capabilities.
So as it is right now, I would say the OpenCL results are extremely misleading.
Hi -- thanks for the benchmarks!
I noticed that you do a sparse-to-dense conversion and use softmax_cross_entropy_with_logits. Have you tried eliding the sparse-to-dense conversion and using sparse_softmax_cross_entropy_with_logits? In my experience the sparse version is faster.
Also, reduce_mean does not have a GPU kernel. reduce_sum with a division would prevent a GPU -> CPU -> GPU step when calculating the loss.
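For anyone wondering why the sparse op can be faster: the two losses are mathematically identical, the sparse form just skips materializing the one-hot labels. A minimal numpy sketch of the equivalence (illustrative only, not TensorFlow's actual implementation):

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax along the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))      # batch of 4 examples, 10 classes
labels = rng.integers(0, 10, size=4)   # sparse integer class labels

# dense form: materialize one-hot labels, then the full cross-entropy sum
one_hot = np.eye(10)[labels]
dense_loss = -(one_hot * np.log(softmax(logits))).sum(axis=-1)

# sparse form: index the correct class directly; no one-hot tensor built
sparse_loss = -np.log(softmax(logits)[np.arange(4), labels])

assert np.allclose(dense_loss, sparse_loss)
# mean == sum / N, so reduce_sum plus a division yields the same scalar loss
assert np.isclose(dense_loss.mean(), dense_loss.sum() / 4)
```

The same reasoning applies to the reduce_mean point: summing and dividing by the batch size produces an identical scalar, just computed where the kernel support is better.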
I see you are getting markedly slow results with Caffe-Greentea. Which backends are you using, and do you know if they are the best available?
In Fabian Tschopp's ( @naibaf7 ) tech report, http://arxiv.org/pdf/1509.03371.pdf, table 6.10 shows a 20x variation in performance depending on which manufacturer's libraries are used.
It could be interesting to benchmark this new effort: http://gpuopen.com/compute-product/hccaffe/
I added more complete numbers for the convolution layer benchmarks, now benchmarking forward+backward calls for 5 different layer configurations.
Caffe ends up being the fastest by quite a bit, mainly because its :backward calls are quite fast compared to the rest!
The scripts are provided, of course, so the benchmark is entirely reproducible, and the raw command-line outputs are checked in for each library!
Let me know if you see any discrepancies.
The Caffe kernel was also recently updated to no longer be limited to square images or square filters, so it is quite flexible as well!
Awesome work by the Caffe guys, and looking forward to further optimization (banded approach for im2col anyone?).
Cheers,
Soumith
It would be helpful to have the full Matthew's Net configuration or Alex's Net configuration (or even better, include Alex's One Weird Trick configuration as well; since cuda-convnet2 is not released yet, maybe wait until cuda-convnet2's release) tested on all the convnet implementations. Different strides do impact performance a bit (in terms of cache friendliness).
I have been trying to get my hands on a pre-trained OverFeat caffenet model file. Anybody willing to share? I've found some prototxt's around, but (unsurprisingly) never the actual caffemodel.
Time to take these benchmarks forward to a more meaningful metric (it has taken so long, but it is, after all, a side project for fun).
I've added benchmarks for the following networks:
VGG Model A (2nd place Imagenet 2014)
Overfeat Fast (1st place Imagenet 2013 Detection)
AlexNet (the holy network)
So far I covered only two sets of kernels:
In the next week or two, I will try to get Caffe in there as well (if there's a volunteer, this will happen faster; if not, it will happen at my own pace, as I am not exactly well versed in Caffe's configuration files).
I will try to get cuda-convnet2 in as well, but it is failing some numFilters assertions (it supports only certain multiples of input/output plane configurations); I will have to look into it.
I am looking for a volunteer to do this on the Theano side, @benanne ??
For the rest of the libraries, they are mostly poorly documented, and it took me a lot of effort to get the first round of benchmarks. Their kernels are not really at the top of the table either, so I will skip them.
Now, coming to GoogleNet: I coded the config in Torch (https://github.com/soumith/convnet-benchmarks/blob/master/torch7/imagenet_winners/googlenet.lua), but it is too big to fit on a Titan Black (even with batch size 1). I will try to benchmark it across 4 K40 cards (12GB each); I have a server with all 4 cards in a single machine, and Torch7 supports multi-GPU now. Let's see, it will be an interesting exercise.
Have a happy NIPS everyone. The day CuDNN R2 releases, I will have those numbers as well (that day is in the near future, I believe).
Please refer to my open issue in Toronto-convnet. Can you please help verify the performance gain of this tool in a multi-GPU setting?
TorontoDeepLearning/convnet#4
Hello, thanks for putting these together. I saw that there is a convnet.js placeholder folder and became very curious what the numbers would be, so I hacked together a quick version for ConvNetJS. For L1, the code is (using node.js):
var convnetjs = require("../../node_modules/convnetjs");
// L1 Conv Layer definition
var opt = { in_sx:128, in_sy:128, in_depth:3, sx:11, filters:96, stride: 1, pad: 0 };
var layer = new convnetjs.ConvLayer(opt);
// create a random input volume
var x = new convnetjs.Vol(128, 128, 3);
// run it through batch_size number of times
var batch_size = 128;
var dtall = 0;
for(var i=0;i<batch_size;i++) { // batch of 128
var t0 = +new Date();
layer.forward(x); // forward
var t1 = +new Date();
var dt = t1 - t0;
dtall += dt;
console.log(i + ' took ' + dt + 'ms. Estimating full batch to take ' + (dtall/(i+1))*batch_size + 'ms');
}
console.log('total: ' + dtall + 'ms');
and this gives me a total running time of 346320ms, or 346 seconds, on my machine :D So quite expensive: about 3400 times slower than Caffe. The current implementation uses naive nested inner loops, as seen here:
https://github.com/karpathy/convnetjs/blob/master/src/convnet_layers_dotproducts.js
The only JS optimization used is typed arrays. I tried to optimize it a bit by being more clever with bounds checking and removing the innermost out-of-bounds check, but it doesn't help much. I think the biggest gains by far, going forward, are in my very preliminary port to WebGL. This isn't on git yet, but I'm working on it (very slowly, admittedly...) on the side, and will report back when it's up.
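For readers curious what those "naive multiple inner loops" amount to, here is a simplified numpy sketch of a direct-convolution forward pass (single image, HWC layout; illustrative only, not the actual ConvNetJS code):

```python
import numpy as np

def conv_forward_naive(x, w, stride=1, pad=0):
    """Naive direct convolution with explicit nested loops.
    x: (H, W, C) input volume; w: (K, K, C, F) bank of F KxK filters."""
    K, _, _, F = w.shape
    if pad:
        x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out_h = (x.shape[0] - K) // stride + 1
    out_w = (x.shape[1] - K) // stride + 1
    out = np.zeros((out_h, out_w, F))
    for f in range(F):              # for each filter
        for i in range(out_h):      # slide vertically
            for j in range(out_w):  # slide horizontally
                patch = x[i*stride:i*stride+K, j*stride:j*stride+K, :]
                out[i, j, f] = (patch * w[:, :, :, f]).sum()
    return out

x = np.random.rand(16, 16, 3)
w = np.random.rand(3, 3, 3, 4)
print(conv_forward_naive(x, w).shape)  # (14, 14, 4)
```

The dot-product work per output element is the same as in an optimized kernel; the slowdown comes from doing it one scalar at a time with no blocking or vectorization.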
Andrej
Added the relevant config files and a basic README.
Looks like I have to add some low-level clocking and synchronization code inside to do layer-wise benchmarking (right now the entire network :forward is asynchronous). I dropped an email to Alex, and he said that's the best approach as well.
The PR isn't merged yet. But when it is, it would be great to add it to the benchmark.
Hi,
can you please add matconvnet to these benchmarks?
Thanks,
-Amir
List of libraries to rerun for Titan-X:
Layer-wise benchmarks
Imagenet models benchmarks
Target date: April 15th, 2015
Initial multi-gpu benchmark (4-GPU).
Target date: April 24th, 2015
Hi, I tried to run the googlenet script with mxnet, but it is not working. Can you please provide a pointer?
gnetv1.py is copied from your repo: /soumith/convnet-benchmarks/tree/master/mxnet
work-station$ python gnetv1.py
('Temp Space: ', 'Total 3258 MB allocated')
('Avg forward per batch: ', 0.3881040978431702)
[09:27:57] ./dmlc-core/include/dmlc/logging.h:241: [09:27:57] ./mshadow/mshadow/./tensor_blob.h:617: Check failed: (this->shape_.Size()) == (shape.Size()) TBlob.get_with_shape: new and old shape do not match total elements
[09:27:57] ./dmlc-core/include/dmlc/logging.h:241: [09:27:57] src/engine/./threaded_engine.h:295: [09:27:57] ./mshadow/mshadow/./tensor_blob.h:617: Check failed: (this->shape_.Size()) == (shape.Size()) TBlob.get_with_shape: new and old shape do not match total elements
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPEto NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
terminate called after throwing an instance of 'dmlc::Error'
what(): [09:27:57] src/engine/./threaded_engine.h:295: [09:27:57] ./mshadow/mshadow/./tensor_blob.h:617: Check failed: (this->shape_.Size()) == (shape.Size()) TBlob.get_with_shape: new and old shape do not match total elements
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPEto NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
Aborted (core dumped)
workstation$
[reserved for review]
Are you interested in benchmarking the OpenCV DNN module?
The Overfeat and VGG benchmarks fail with the fbcunn kernels on a Titan Black card.
Even if I reduce the batch size, the benchmark runs longer but I still get this (for Overfeat):
Running on device: GeForce GTX TITAN Black
ModelType: OverFeat[fast] Kernels: fbfft Input shape: 128x3x231x231
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0430 14:30:32.984395 32763 SpatialConvolutionCuFFTTuner.cpp:117] START exploring FFT perf for pass = updateOutput (b x p x f) = 128x96x256 (input rows x cols) = 28x28 (filter rows x cols) = 5x5 (common rows x cols) = 28x28
I0430 14:30:35.358149 32763 SpatialConvolutionCuFFTTuner.cpp:122] Found best result Buffer=921.31M strategy = (Many) (b x p x f) = 128x96x256 (input rows x cols) = 28x28 (filter rows x cols) = 5x5 (common rows x cols) = 28x28 GReductions(virtual fmas)/s = 3021.61 time = 20.41ms
I0430 14:30:35.358180 32763 SpatialConvolutionCuFFTTuner.cpp:117] START exploring FFT perf for pass = updateGradInput (b x p x f) = 128x96x256 (input rows x cols) = 28x28 (filter rows x cols) = 5x5 (common rows x cols) = 28x28
I0430 14:30:37.495909 32763 SpatialConvolutionCuFFTTuner.cpp:122] Found best result Buffer=921.31M strategy = (Many) (b x p x f) = 128x96x256 (input rows x cols) = 28x28 (filter rows x cols) = 5x5 (common rows x cols) = 28x28 GReductions(virtual fmas)/s = 3157.00 time = 19.53ms
I0430 14:30:37.495929 32763 SpatialConvolutionCuFFTTuner.cpp:117] START exploring FFT perf for pass = accGradParameters(b x p x f) = 128x96x256 (input rows x cols) = 28x28 (filter rows x cols) = 5x5 (common rows x cols) = 28x28
I0430 14:30:39.731259 32763 SpatialConvolutionCuFFTTuner.cpp:122] Found best result Buffer=1047.13M strategy = (Many) (b x p x f) = 128x96x256 (input rows x cols) = 28x28 (filter rows x cols) = 5x5 (common rows x cols) = 28x32 GReductions(virtual fmas)/s = 2990.89 time = 20.61ms
I0430 14:30:39.738314 32763 SpatialConvolutionCuFFTTuner.cpp:117] START exploring FFT perf for pass = updateOutput (b x p x f) = 128x256x512 (input rows x cols) = 14x14 (filter rows x cols) = 3x3 (common rows x cols) = 14x14
I0430 14:30:42.112609 32763 SpatialConvolutionCuFFTTuner.cpp:122] Found best result Buffer=798.49M strategy = (Many) (b x p x f) = 128x256x512 (input rows x cols) = 14x14 (filter rows x cols) = 3x3 (common rows x cols) = 14x14 GReductions(virtual fmas)/s = 1716.03 time = 17.25ms
I0430 14:30:42.112627 32763 SpatialConvolutionCuFFTTuner.cpp:117] START exploring FFT perf for pass = updateGradInput (b x p x f) = 128x256x512 (input rows x cols) = 14x14 (filter rows x cols) = 3x3 (common rows x cols) = 14x14
I0430 14:30:44.346437 32763 SpatialConvolutionCuFFTTuner.cpp:122] Found best result Buffer=797.44M strategy = (Many) (b x p x f) = 128x256x512 (input rows x cols) = 14x14 (filter rows x cols) = 3x3 (common rows x cols) = 14x14 GReductions(virtual fmas)/s = 1881.80 time = 15.73ms
I0430 14:30:44.346457 32763 SpatialConvolutionCuFFTTuner.cpp:117] START exploring FFT perf for pass = accGradParameters(b x p x f) = 128x256x512 (input rows x cols) = 14x14 (filter rows x cols) = 3x3 (common rows x cols) = 14x14
I0430 14:30:46.663117 32763 SpatialConvolutionCuFFTTuner.cpp:122] Found best result Buffer=797.44M strategy = (Many) (b x p x f) = 128x256x512 (input rows x cols) = 14x14 (filter rows x cols) = 3x3 (common rows x cols) = 14x14 GReductions(virtual fmas)/s = 1732.12 time = 17.09ms
I0430 14:30:46.668112 32763 SpatialConvolutionCuFFTTuner.cpp:117] START exploring FFT perf for pass = updateOutput (b x p x f) = 128x512x1024 (input rows x cols) = 14x14 (filter rows x cols) = 3x3 (common rows x cols) = 14x14
I0430 14:30:47.615033 32763 SpatialConvolutionCuFFTTuner.cpp:99] std::bad_alloc
I0430 14:30:47.615075 32763 SpatialConvolutionCuFFTTuner.cpp:122] Found best result Buffer=1963.54M strategy = (Batch) (b x p x f) = 128x512x1024 (input rows x cols) = 14x14 (filter rows x cols) = 3x3 (common rows x cols) = 14x14 GReductions(virtual fmas)/s = 1559.80 time = 75.89ms
I0430 14:30:47.615084 32763 SpatialConvolutionCuFFTTuner.cpp:117] START exploring FFT perf for pass = updateGradInput (b x p x f) = 128x512x1024 (input rows x cols) = 14x14 (filter rows x cols) = 3x3 (common rows x cols) = 14x14
I0430 14:30:54.954167 32763 SpatialConvolutionCuFFTTuner.cpp:122] Found best result Buffer=2531.26M strategy = (Batch) (b x p x f) = 128x512x1024 (input rows x cols) = 14x14 (filter rows x cols) = 3x3 (common rows x cols) = 14x14 GReductions(virtual fmas)/s = 2655.10 time = 44.59ms
I0430 14:30:54.954197 32763 SpatialConvolutionCuFFTTuner.cpp:117] START exploring FFT perf for pass = accGradParameters(b x p x f) = 128x512x1024 (input rows x cols) = 14x14 (filter rows x cols) = 3x3 (common rows x cols) = 14x14
I0430 14:31:02.475479 32763 SpatialConvolutionCuFFTTuner.cpp:122] Found best result Buffer=2504.00M strategy = (Many) (b x p x f) = 128x512x1024 (input rows x cols) = 14x14 (filter rows x cols) = 3x3 (common rows x cols) = 14x14 GReductions(virtual fmas)/s = 2141.34 time = 55.28ms
/home/allezard/torch/install/bin/luajit: /home/allezard/torch/extra/cutorch/lib/THC/THCStorage.c(112) : cuda runtime error : invalid device pointer
Segmentation fault (core dumped)
Has anyone seen this before?
Is it possible to run the test on a Titan Black?
Thanks
r0.7 should have much better performance, as it allows using CUDA >= 7.0 and cuDNN R4, among other improvements.
With CuDNN R3 coming in, improvements to Nervana, a new kid on the block called Chainer, and faster Facebook kernels, I will be doing a minor re-run of the benchmarks to see how things have improved.
Target date: August 15th.
I am still thinking quite a lot on how to take the benchmarks forward, beyond ConvNets, beyond Images (into NLP, Video and Audio) and beyond single-GPU. If any domain experts have suggestions (especially for Audio and NLP), please do write to me.
The only thing that stopped me from multi-GPU benchmarks was the lack of enough frameworks to benchmark. This seems to have changed somewhat, and a decent number of frameworks now support multi-GPU, so I will plan on that.
More fun to come soon.
Checklist:
Theano offers a convolution similar to what Caffe offers, using a Toeplitz matrix. See: http://deeplearning.net/software/theano/library/tensor/nnet/conv.html.
This is enabled by setting THEANO_FLAGS=optimizer_including=conv_gemm.
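The Toeplitz/gemm trick is the im2col lowering: unroll each input patch into a row, and the convolution becomes one big matrix multiply. A minimal single-channel numpy sketch (illustrative only, not Theano's actual implementation):

```python
import numpy as np

def im2col(x, k, stride=1):
    """Unroll the kxk patches of a 2-D input x into rows of a matrix."""
    H, W = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    cols = np.empty((out_h * out_w, k * k))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i*stride:i*stride+k, j*stride:j*stride+k].ravel()
    return cols

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.random.default_rng(0).normal(size=(3, 3))

# convolution as a single GEMM against the unrolled patches
gemm_out = (im2col(x, 3) @ w.ravel()).reshape(3, 3)

# reference: direct sliding-window correlation
ref = np.array([[(x[i:i+3, j:j+3] * w).sum() for j in range(3)] for i in range(3)])
assert np.allclose(gemm_out, ref)
```

The appeal is that the heavy lifting lands in a highly tuned GEMM routine, at the cost of materializing the (redundant) patch matrix.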
Thank you for adding Chainer results to the tables! I am surprised the result is superior to that of native Caffe, since the implementation of Convolution2D in Chainer is basically the same as Caffe's. My hypothesis is that the Chainer results were measured in an environment with cuDNN installed, in which case Chainer automatically detects it and switches to the cuDNN implementation. If so, could you add a note indicating that cuDNN is enabled, to avoid confusion? (If this hypothesis is wrong, ignore this message.)
One more question: how do you set up test_iter and test_interval (as in the Caffe solver.prototxt) in your model?
Google's TensorFlow benchmarks are here!
I've run the benchmarks on the Imagenet Winners.
When I saw issues with the numbers, memory etc., I emailed @Yangqing to confirm what I'm seeing, and that it is expected.
With that disclaimer out of the way, here are some things that you should know about TensorFlow (as of the pip version that I installed today):
Coming to the benchmarks:
AlexNet (One Weird Trick paper) - Input 128x3x224x224
Library | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|
CuDNN-R3 (Torch) | 96 | 32 | 64 |
Nervana (Neon) | 101 | 32 | 69 |
CuDNN-R2 (Torch) | 231 | 70 | 161 |
TensorFlow | 326 | 96 | 230 |
Overfeat [fast] - Input 128x3x231x231
Library | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|
CuDNN-R3 (Torch) | 326 | 113 | 213 |
fbfft (Torch) | 342 | 114 | 227 |
CuDNN-R2 (Torch) | 810 | 234 | 576 |
TensorFlow | 1084 | 316 | 768 |
OxfordNet [Model-A] - Input 64x3x224x224
Library | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|
Nervana | 590 | 180 | 410 |
CuDNN-R3 (Torch) | 615 | 196 | 418 |
CuDNN-R2 (Torch) | 1099 | 342 | 757 |
TensorFlow | 1840 | 545 | 1295 |
GoogleNet V1 - Input 16x3x224x224
Library | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|
CuDNN-R2 (Torch) | 564 | 174 | 390 |
TensorFlow | 590 | 54 | 536 |
Note that at batch size of 16, googlenet with CuDNN-R2 + Torch likely runs into dispatching overhead, so it's an exotic comparison, but not practically very interesting or encouraging.
There you go.
I'm assuming that the first release of TensorFlow is still quite unpolished, and that they will improve it over time with various memory and time optimizations baked in.
I noticed that the previous cuda-convnet result was replaced by a better one, and this time the wrapper used is pylearn2. I think both results are relevant and interesting, as both the kernel and the library used will affect performance. It would also be very interesting (for me personally at least) to see how the pylearn2-wrapped cuda-convnet compares with using cuda-convnet's own Python bindings, for example (and of course Torch's).
Additionally, some libraries (like Theano/pylearn2 and Torch) support different kernels, so it would be useful to get numbers for all of the options.
So I thought it would be useful to have separate "library" and "kernel" columns to indicate more clearly which libraries have been benchmarked, and which kernels were used, instead of listing a subset of library+kernel combinations. Just an idea :)
On a somewhat related note, I apologize if I was a bit too eager spreading the link to this repo around, as some people seem to be reacting strongly to these results :) I just thought it's really cool that someone is taking the time to compare these various options and publishing some hard numbers. Kudos to you!
Hi,
Since I recently wanted to do something similar for RNNs, I wanted to ask: what exactly is your setup for benchmarking? E.g. how many runs/tests do you do, how long do you let them warm up (burn-in period), and similar details? Also, on Nvidia, how did you measure memory usage? All I get from nvidia-smi
is memory utilization in %. Theano, for instance, sometimes preallocates almost all of the memory, so Memory Usage shows nearly all of the 12GB while utilization is as low as 30%. Any other details you would share would be great!
TL;DR: when you get hot, you run slowly. And it's a bit hard to predict when you get hot enough...
Now the long version: In the last few days we were confused by fluctuations we observe in running exactly the same benchmark code, so I think it'll probably be worth sharing here.
Basically, when you run a benchmark long enough to make the GPU really hot, Titan X seems to be running at full speed for a while, and then throttles down significantly to make the temperature stable at 84 degrees. You can observe this behavior by running nvidia-smi continuously and observe the number of watts it draws. (If you are interested, on two of our Titan Xs, we did not observe the fan to go 100% before throttling down the power. We did not force the GPU to run on P0.)
The implication is that a long enough burn-in period is necessary to get a stable speed number. If the burn-in is not long enough, a lot of factors may affect the final reported speed: (1) how many iterations you run (later iterations may get increasingly slower); (2) whether you ran something immediately before the benchmark (so the GPU has not cooled down yet); (3) whether you are in Reykjavik (we did not test this).
Empirically, it seems that to get a stable number, burn-in periods as long as a few hundred iterations and/or tens of seconds are necessary, which is often longer than one would expect. For example, in Caffe I only did burn-in for one iteration, just enough for the framework to make all memory allocations.
We have observed a fluctuation of about 10% between a cold GPU and a hot GPU - maybe non-trivial enough to be careful about. Just my little observation that you might find useful.
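A sketch of what such a burn-in-aware timing loop might look like (plain-Python wall-clock timing; on a GPU you would additionally need to synchronize the device, e.g. cutorch.synchronize() in Torch, before reading the clock):

```python
import time
import statistics

def benchmark(fn, warmup_s=10.0, iters=100):
    """Run fn untimed until warmup_s seconds have elapsed (burn-in, so
    clocks and temperatures settle), then report the median of iters
    timed runs, in milliseconds."""
    t0 = time.perf_counter()
    while time.perf_counter() - t0 < warmup_s:
        fn()  # burn-in iterations, results discarded
    times = []
    for _ in range(iters):
        t = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t) * 1000.0)
    # median is more robust than mean to throttling spikes mid-run
    return statistics.median(times)
```

The warmup_s and iters values are illustrative; per the observations above, tens of seconds of burn-in may be needed before the throttled steady state is reached.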
Hi,
are you able to run the cuda-convnet2 benchmark on osx?
Thanks,
Gerald
It appears that the speedtest command has been removed from caffe.bin since the benchmark instructions were last updated:
$ ./caffe/build/tools/caffe.bin speedtest --net_proto_file=conv.prototxt --run_iterations=1 --speedtest_with_gpu --logtostderr=1
ERROR: unknown command line flag 'logtostderr'
ERROR: unknown command line flag 'net_proto_file'
ERROR: unknown command line flag 'run_iterations'
ERROR: unknown command line flag 'speedtest_with_gpu'
$ ./caffe/build/tools/caffe.bin
caffe.bin: command line brew
usage: caffe
commands:
train train or finetune a model
test score a model
device_query show GPU diagnostic information
time benchmark model execution time
Also, net_speed_benchmark.bin says 'Deprecated. Use caffe time ...'.
But when using the 'time' command with the benchmark models conv*.prototxt, the gradient is not computed with respect to the bottom data, as one would expect for the first layer of a network.
My benchmark timings on an NVIDIA Titan-X (libcudnn.so.7.0.64) are much slower than Soumith's. My Titan-X is stock from NVIDIA, modestly overclocked (+250 MHz) and overvolted (+0.05 V) (see below for the guide). Does anybody have any idea why? Do I need to flash a custom BIOS?
user@dnn1:~/convnet-benchmarks/torch7/imagenet_winners$ th benchmark.lua
Running on device: GeForce GTX TITAN X
ModelType: AlexNet Kernels: cudnn Input shape: 128x3x224x224
cudnn :updateOutput(): 41.66
cudnn :updateGradInput(): 62.85
cudnn :accGradParameters(): 59.55
cudnn :Forward: 41.66
cudnn :Backward: 122.41
cudnn :TOTAL: 164.07
ModelType: OverFeat[fast] Kernels: cudnn Input shape: 128x3x231x231
cudnn :updateOutput(): 136.36
cudnn :updateGradInput(): 251.24
cudnn :accGradParameters(): 209.67
cudnn :Forward: 136.36
cudnn :Backward: 460.91
cudnn :TOTAL: 597.27
ModelType: VGG Model-A Kernels: cudnn Input shape: 64x3x224x224
cudnn :updateOutput(): 248.66
cudnn :updateGradInput(): 338.22
cudnn :accGradParameters(): 272.91
cudnn :Forward: 248.66
cudnn :Backward: 611.13
cudnn :TOTAL: 859.79
ModelType: GoogleNet Kernels: cudnn Input shape: 128x3x224x224
cudnn :updateOutput(): 120.70
cudnn :updateGradInput(): 210.48
cudnn :accGradParameters(): 141.56
cudnn :Forward: 120.70
cudnn :Backward: 352.04
cudnn :TOTAL: 472.74
./build/tools/caffe time --model convnet-benchmarks/caffe/imagenet_winners/alexnet.prototxt -gpu=0 --iterations=10
I0923 15:29:33.813731 4276 caffe.cpp:354] Average time per layer:
I0923 15:29:33.813736 4276 caffe.cpp:357] conv1 forward: 7.28601 ms.
I0923 15:29:33.813745 4276 caffe.cpp:360] conv1 backward: 16.4974 ms.
I0923 15:29:33.813751 4276 caffe.cpp:357] conv1/relu forward: 0.802221 ms.
I0923 15:29:33.813758 4276 caffe.cpp:360] conv1/relu backward: 1.17941 ms.
I0923 15:29:33.813765 4276 caffe.cpp:357] pool1/3x3_s2 forward: 0.540144 ms.
I0923 15:29:33.813771 4276 caffe.cpp:360] pool1/3x3_s2 backward: 2.35192 ms.
I0923 15:29:33.813776 4276 caffe.cpp:357] conv2/5x5_s1 forward: 9.20351 ms.
I0923 15:29:33.813782 4276 caffe.cpp:360] conv2/5x5_s1 backward: 32.1066 ms.
I0923 15:29:33.813789 4276 caffe.cpp:357] cpnv2/relu forward: 0.563315 ms.
I0923 15:29:33.813796 4276 caffe.cpp:360] cpnv2/relu backward: 0.836832 ms.
I0923 15:29:33.813802 4276 caffe.cpp:357] pool2/3x3_s2 forward: 0.38049 ms.
I0923 15:29:33.813807 4276 caffe.cpp:360] pool2/3x3_s2 backward: 1.83551 ms.
I0923 15:29:33.813813 4276 caffe.cpp:357] conv3/3x3_s1 forward: 4.51088 ms.
I0923 15:29:33.813819 4276 caffe.cpp:360] conv3/3x3_s1 backward: 18.8913 ms.
I0923 15:29:33.813825 4276 caffe.cpp:357] conv3/relu forward: 0.275491 ms.
I0923 15:29:33.813832 4276 caffe.cpp:360] conv3/relu backward: 0.412963 ms.
I0923 15:29:33.813838 4276 caffe.cpp:357] conv4/3x3_s1 forward: 5.82792 ms.
I0923 15:29:33.813844 4276 caffe.cpp:360] conv4/3x3_s1 backward: 23.6848 ms.
I0923 15:29:33.813850 4276 caffe.cpp:357] conv4/relu forward: 0.185322 ms.
I0923 15:29:33.813856 4276 caffe.cpp:360] conv4/relu backward: 0.277126 ms.
I0923 15:29:33.813863 4276 caffe.cpp:357] conv5/3x3_s1 forward: 4.08288 ms.
I0923 15:29:33.813868 4276 caffe.cpp:360] conv5/3x3_s1 backward: 15.7675 ms.
I0923 15:29:33.813874 4276 caffe.cpp:357] conv5/relu forward: 0.189014 ms.
I0923 15:29:33.813880 4276 caffe.cpp:360] conv5/relu backward: 0.280253 ms.
I0923 15:29:33.813886 4276 caffe.cpp:357] pool5/3x3_s2 forward: 0.12167 ms.
I0923 15:29:33.813892 4276 caffe.cpp:360] pool5/3x3_s2 backward: 0.681011 ms.
I0923 15:29:33.813899 4276 caffe.cpp:357] fc6 forward: 2.85131 ms.
I0923 15:29:33.813905 4276 caffe.cpp:360] fc6 backward: 3.13384 ms.
I0923 15:29:33.813911 4276 caffe.cpp:357] fc7 forward: 0.799648 ms.
I0923 15:29:33.813917 4276 caffe.cpp:360] fc7 backward: 1.42024 ms.
I0923 15:29:33.813923 4276 caffe.cpp:357] fc8 forward: 0.46623 ms.
I0923 15:29:33.813930 4276 caffe.cpp:360] fc8 backward: 0.492112 ms.
I0923 15:29:33.813942 4276 caffe.cpp:365] Average Forward pass: 38.1855 ms.
I0923 15:29:33.813948 4276 caffe.cpp:367] Average Backward pass: 119.916 ms.
I0923 15:29:33.813954 4276 caffe.cpp:369] Average Forward-Backward: 158.163 ms.
user@dnn1:~$ nvidia-smi -q -d TEMPERATURE,POWER,CLOCK,PERFORMANCE
==============NVSMI LOG==============
Timestamp : Wed Sep 23 15:23:32 2015
Driver Version : 346.82
Attached GPUs : 1
GPU 0000:01:00.0
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Temperature
GPU Current Temp : 52 C
GPU Shutdown Temp : 97 C
GPU Slowdown Temp : 92 C
Power Readings
Power Management : Supported
Power Draw : 279.96 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 150.00 W
Max Power Limit : 275.00 W
Clocks
Graphics : 1427 MHz
SM : 1427 MHz
Memory : 3505 MHz
Applications Clocks
Graphics : 1392 MHz
Memory : 3505 MHz
Default Applications Clocks
Graphics : 999 MHz
Memory : 3505 MHz
Max Clocks
Graphics : 1642 MHz
SM : 1642 MHz
Memory : 3505 MHz
SM Clock Samples
Duration : 29.80 sec
Number of Samples : 100
Max : 1465 MHz
Min : 1253 MHz
Avg : 1423 MHz
Memory Clock Samples
Duration : 29.80 sec
Number of Samples : 100
Max : 3505 MHz
Min : 3505 MHz
Avg : 3505 MHz
Clock Policy
Auto Boost : On
Auto Boost Default : On
user@dnn1:~$ cat ~/overclock.sh
sudo nvidia-smi -i 0 -pm 1
sudo nvidia-smi -i 0 -ac 3505,1392
DISPLAY=:0 nvidia-settings -a [gpu:0]/GPUPowerMizerMode=1
DISPLAY=:0 nvidia-settings -a [gpu:0]/GPUOverVoltageOffset=50000
Hi,
How can I run the alexnet.py code on multiple GPUs with tensorflow?
I'm having a problem running the Torch benchmark: I'm getting CUDNN_STATUS_NOT_SUPPORTED.
I followed the instructions and installed all required dependencies, and Torch works fine on simple (non-convolutional) nets. Here is the output:
alexeyk@gcrgpu1071:~/convnet-benchmarks/torch7/imagenet_winners$ th benchmark.lua
Running on device: Tesla K40m
ModelType: AlexNet Kernels: cudnn Input shape: 128x3x224x224
/var/storage/shared/ipgsp/sys/jobs/application_1447977864059_0183/torch/install/bin/luajit: ...77864059_0183/torch/install/share/lua/5.1/cudnn/init.lua:58: Error in CuDNN: CUDNN_STATUS_NOT_SUPPORTED
stack traceback:
  [C]: in function 'error'
  ...77864059_0183/torch/install/share/lua/5.1/cudnn/init.lua:58: in function 'errcheck'
  ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:385: in function 'updateOutput'
  ...64059_0183/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
  ...64059_0183/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
  benchmark.lua:55: in main chunk
  [C]: in function 'dofile'
  ...0183/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
  [C]: at 0x00406670
Any ideas what I might be doing wrong?
The benchmarks this time around are interesting, with some fairly clear trends emerging for the near future.
First, some appreciation for where things are,
Pushing these boundaries so fast, in such a short time-frame, is quite something. There are two sets of teams who have made this happen:
The result of Nervana pushing the limits of compute is that others competing to be faster had to play smarter. Nervana has pushed the limits so hard that the GPU can't run at boosted clock speeds for long and has to slow down a little bit.
Nervana had the flexibility to choose the ideal data layout for the task, and they used it to its maximum potential, combined with very low-level optimizations and hand-coded assembly.
The CuDNN and Facebook teams did not have this kind of flexibility because they were constrained to support existing frameworks such as Caffe and Torch, which had frozen themselves to the BDHW data layout, not the most ideal layout for convolutions in the spatial domain.
Switching to FFT-based convolutions and optimizing the hell out of them was an obvious choice.
However, there has been skepticism that FFT-based convolutions take too much extra memory.
This was borne out by the Facebook convolutions (FBFFT, FBCuFFT), which were fairly fast but took an unreasonable amount of extra memory.
However, FFT-based convolutions don't necessarily need a lot of extra memory, especially if one writes the full FFT pipeline from scratch. In fact, Nicolas Vasilache from Facebook demonstrated that FFT-based convolutions don't need any extra memory with a single-threaded implementation, though he did not optimize it further to achieve competitive performance. He also showcased a tiling strategy for FFT-based convolutions that speeds them up quite a bit while reducing the extra memory needed.
NVIDIA with their R3 release of CuDNN show that their FFT based convolutions can be very competitive in speed with Nervana kernels, and faster in some cases. (See imagenet_winners in README.md on the main page for more details)
One has to remember that FFT-based convolutions take the same time to compute regardless of the kernel size (except with a tiled FFT strategy). So whether you have a 3x3 convolution layer or a 15x15 one, it takes the same time.
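To make the point concrete, here is a minimal NumPy sketch (not the benchmarked GPU kernels) of a full 2-D convolution via FFT. The transform size is set only by the padded output shape, so the cost is essentially independent of how big the kernel is:

```python
import numpy as np

def conv2d_fft(image, kernel):
    # Full 2-D linear convolution via pointwise multiplication in the
    # frequency domain. The FFT length depends only on the padded output
    # shape, so a 3x3 kernel and a 15x15 kernel cost (nearly) the same.
    s = (image.shape[0] + kernel.shape[0] - 1,
         image.shape[1] + kernel.shape[1] - 1)
    return np.fft.irfft2(np.fft.rfft2(image, s) * np.fft.rfft2(kernel, s), s)
```

The result matches a direct (spatial-domain) full convolution, but the direct method's cost grows with the kernel area while the FFT method's does not.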
NVIDIA fused many of the CUDA kernels in their implementation, which reduces the amount of extra memory the FFT convolutions need, and it is only a matter of time before they release completely fused kernels that need barely any extra memory.
| Network | Extra Memory |
|---|---|
| AlexNet | 324 MB |
| VGG-A | 2.6 GB |
| Overfeat | 2.59 GB |
| GoogleNet | 202 MB |
The overall trend I see is that:
p.s.: sorry for not finishing the Chainer benchmarks; I am having some issues running things. My Chainer install seems to have some strange CUDA issues. I will update the README with those results in a week or two when I get time. Overall, my first impression of Chainer is that I am a bit annoyed by the 15-30 seconds it takes to compile the compute graph. If I read the documentation hard enough, I'll probably find a mode that starts up faster; I haven't gotten that far yet!
Hi,
I am going to start with this code, but I want to ask: is it compatible with an NVIDIA GeForce 660 GPU? I am running Ubuntu 12.04 and have both CUDA 7 and CUDA 6 on the system.
Will this be OK, with no problems?
Sorry for asking such a question, but I am trying to use deep learning with Torch for the first time, and I don't want to spend much time on troubleshooting. Your guidance and experience will save me time.
Regards
Ihsan
I noticed that in the current Theano benchmark script, the FilterActs class is instantiated without any arguments, which means the partial_sum option defaults to None, the most conservative value in terms of memory usage and performance. In my experience, setting partial_sum=1 can improve performance significantly (at the cost of more memory usage).
Supposedly you can tune it even further for specific image/filter sizes, but I've never observed any significant performance improvements from doing that. More info is in my blog post about the wrappers: http://benanne.github.io/2014/04/03/faster-convolutions-in-theano.html (see "Tuning the time-memory trade-off with partial_sum").
Do with this information what you want :) maybe you just want to keep everything to its default values, but I thought I'd mention it.
this PR in Theano speed up fft convolution:
When it gets merged, it would be great to update the benchmark. On one computer, for the L1 layer, I get this speed before the PR:
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft fprop: 899.721267363 GFLOP/s ( tm = 0.138061901855 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop weights: 0.0 GFLOP/s ( tm = 0.198237329102 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop inputs: 0.0 GFLOP/s ( tm = 1.09417402344 )
After the PR:
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft fprop: 896.253465383 GFLOP/s ( tm = 0.13859609375 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop weights: 0.0 GFLOP/s ( tm = 0.198546276855 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop inputs: 0.0 GFLOP/s ( tm = 0.231154541016 )
The bprop against the inputs is over 4x faster!
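For reference, the speedups follow directly from the tm (time) values above; only the bprop with respect to the inputs changed meaningfully:

```python
# Speedups implied by the "tm" (time) numbers above: before-PR time
# divided by after-PR time for each pass of conv2d_fft on the L1 layer.
before = {"fprop": 0.138061901855,
          "bprop weights": 0.198237329102,
          "bprop inputs": 1.09417402344}
after = {"fprop": 0.13859609375,
         "bprop weights": 0.198546276855,
         "bprop inputs": 0.231154541016}

for op in before:
    print(f"{op}: {before[op] / after[op]:.2f}x")  # bprop inputs: 4.73x
```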
Jigar Doshi ( @artvandelay ) has volunteered to take a crack at this: https://twitter.com/jigarkdoshi/status/691758082075070464
This issue shall serve as a place to announce and discuss new results, to avoid the discussion being spread over several pull requests that just happened to be there (#11, #12).
So with 33f2122, Caffe seems to be really fast in the backward pass. Looking at the code, are you 100% sure that propagate_down is set to True? Otherwise it would time the gradient w.r.t. the weights only.
Are you interested in benchmarking the OpenCV DNN module?
In addition to runtime, it would be good to know the peak amount of GPU RAM used by each of the implementations of convnets.
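One rough way to measure this, without touching the benchmark code, is to poll nvidia-smi in a background thread while the benchmark runs. This is a sketch, not part of the benchmark suite; peak_gpu_mem, the polling interval, and the injectable sample function are all illustrative choices:

```python
import subprocess
import threading
import time

def read_gpu_mem_mib(gpu_index=0):
    # Current GPU memory usage in MiB via nvidia-smi (NVIDIA driver required).
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", f"--id={gpu_index}"])
    return int(out.decode().strip())

def peak_gpu_mem(run_benchmark, sample=read_gpu_mem_mib, interval=0.05):
    # Run the benchmark while periodically sampling memory usage;
    # return the highest reading observed.
    peak = [sample()]
    done = threading.Event()

    def poll():
        while not done.is_set():
            peak[0] = max(peak[0], sample())
            time.sleep(interval)

    t = threading.Thread(target=poll)
    t.start()
    try:
        run_benchmark()
    finally:
        done.set()
        t.join()
    return max(peak[0], sample())
```

Polling can miss short allocation spikes between samples, so this gives a lower bound on the true peak; per-process instrumentation inside each framework would be more precise.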
I see that the benchmarks are now being run on 6.5.
I am curious how performance compares in CUDA 6 vs CUDA 6.5.
While duplicating the table seems like too much, I wonder how the "cherry-picked" numbers (i.e., the best) compare in the two CUDA versions.
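Extracting just the cherry-picked numbers is easy to script once both runs exist. A sketch, with placeholder timings that are not measured results:

```python
# Compare the best (fastest) implementation per network across two CUDA
# versions without duplicating the whole table. All numbers below are
# placeholders for illustration, not benchmark results.
cuda60 = {
    "AlexNet":  {"cudnn": 250.0, "fbfft": 180.0, "ccn2": 220.0},
    "Overfeat": {"cudnn": 900.0, "fbfft": 820.0, "ccn2": 870.0},
}
cuda65 = {
    "AlexNet":  {"cudnn": 232.0, "fbfft": 136.0, "ccn2": 210.0},
    "Overfeat": {"cudnn": 850.0, "fbfft": 800.0, "ccn2": 860.0},
}

for net in cuda60:
    lib60, t60 = min(cuda60[net].items(), key=lambda kv: kv[1])
    lib65, t65 = min(cuda65[net].items(), key=lambda kv: kv[1])
    print(f"{net}: CUDA 6.0 best {lib60} {t60} ms -> CUDA 6.5 best {lib65} {t65} ms")
```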
Hi,
I'm seeing an out-of-memory error with torch7/imagenet_winners on a Titan X for "VGG Model-A, Kernels: fbfft". All the dependencies are up to date.
stdout:
Running on device: Graphics Device
ModelType: AlexNet Kernels: cudnn Input shape: 128x3x224x224
cudnn :updateOutput(): 69.61
cudnn :updateGradInput(): 83.50
cudnn :accGradParameters(): 79.05
cudnn :Forward: 69.61
cudnn :Backward: 162.55
cudnn :TOTAL: 232.16
ModelType: AlexNet Kernels: nn Input shape: 128x3x224x224
nn :updateOutput(): 139.52
nn :updateGradInput(): 116.48
nn :accGradParameters(): 101.57
nn :Forward: 139.52
nn :Backward: 218.05
nn :TOTAL: 357.57
ModelType: AlexNet Kernels: fbfft Input shape: 128x3x224x224
fbfft :updateOutput(): 44.87
fbfft :updateGradInput(): 46.73
fbfft :accGradParameters(): 44.20
fbfft :Forward: 44.87
fbfft :Backward: 90.93
fbfft :TOTAL: 135.79
ModelType: VGG Model-A Kernels: cudnn Input shape: 64x3x224x224
cudnn :updateOutput(): 339.59
cudnn :updateGradInput(): 399.12
cudnn :accGradParameters(): 352.86
cudnn :Forward: 339.59
cudnn :Backward: 751.98
cudnn :TOTAL: 1091.57
ModelType: VGG Model-A Kernels: nn Input shape: 64x3x224x224
nn :updateOutput(): 348.41
nn :updateGradInput(): 381.35
nn :accGradParameters(): 372.84
nn :Forward: 348.41
nn :Backward: 754.19
nn :TOTAL: 1102.61
ModelType: VGG Model-A Kernels: fbfft Input shape: 64x3x224x224
stderr:
http://pastebin.com/NaCCgzt9
I'm just getting started with this suite and have very little experience with the environment. Any help would be appreciated.
Please close this if it is inappropriate
It seems that benchmarking RNNs/LSTMs is a worthwhile thing to do now, as these models are getting amazing results, many of them computed with novel training algorithms in either Torch or Theano.
It would also be worth demonstrating which implementations make it easiest to create novel architectures - so basically nngraph versus whatever else is out there.
Another important point for researchers is that we have to get our sums right, whatever tools we use to compute our probabilistic graphs with. The same model and training method, implemented in Torch/Theano/PyBrain3/RNNLIB/whatever, should give very close numerical results over a range of hyper-parameters. So this could develop into something more than a one-off speed test on common datasets.
I propose that as part of the algorithm/system testing procedure, new research models and training algorithms should by default be independently implemented in both Torch, and a second 'calculator' to test the accuracy of the claimed results. This step should follow quite soon after the fundamental research has been done, and promising results have been achieved using some calculator. It should come before these new tools are passed on upstream to software engineers and data scientists.
Ideally it should be conducted by a team independent of the original researchers. The skills required for this are a strong enough mathematical background such that research documents/papers can be read, derivations/proofs can be checked, and testing/validation can be done with the minimal amount of code, basically starting from scratch. Since the object of these tests is to validate accuracy and speed of algorithms, there's no need for vast computational resources or datasets either.
This type of multiple route testing/validation is the standard procedure in quant financial/investment trading systems development, and it can easily be adopted here.
Hi,
I tried to run the fbnn Torch model in imagenet_winners, but I keep getting the following error:
/home/ubuntu/torch/install/bin/luajit: ./vgg_a.lua:37: attempt to call local 'SpatialConvolution' (a nil value)
stack traceback:
./vgg_a.lua:37: in function <./vgg_a.lua:1>
benchmark.lua:44: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
I left this in google group, too. Can anyone help?
I simply commented out the irrelevant code in benchmark.lua, as below:
require 'sys'
require 'cunn'
require 'cudnn'
cudnn.benchmark = true -- run manual auto-tuner provided by cudnn
cudnn.verbose = false
require 'fbcunn'
-- require 'nnbhwd' -- not compiling anymore, file an issue
local nets = {}
--nets[#nets+1] = require 'alexnet'
--nets[#nets+1] = require 'overfeat'
nets[#nets+1] = require 'vgg_a'
--nets[#nets+1] = require 'googlenet'
local libs = {}
--libs[#libs+1] = {cudnn.SpatialConvolution, cudnn.SpatialMaxPooling, cudnn.ReLU, 'BDHW', 'cudnn'}
libs[#libs+1] = {fbnn.SpatialConvolution, cudnn.SpatialMaxPooling, cudnn.ReLU, 'BDHW', 'fbnn'}
--libs[#libs+1] = {nn.SpatialConvolutionMM, nn.SpatialMaxPooling, nn.ReLU, 'BDHW', 'nn'}
--libs[#libs+1] = {nn.SpatialConvolutionBHWD, nn.SpatialMaxPoolingBHWD, nn.ReLU, 'BHWD', 'nnBHWD'}
Thanks!