dmikushin / cnn-benchmarks

This project forked from rejunity/cnn-benchmarks

Benchmark of Cloud GPUs using Convolutional Neural Networks

Python 55.81% Lua 44.19%
cnn gpu benchmarking neural-networks

GPU in the Cloud | cnn-benchmarks

Benchmark for GPUs available in computing clouds, using popular Convolutional Neural Network models.

This benchmark is based on jcjohnson/cnn-benchmarks.

Benchmarking your system

Install a recent version of CUDA and cuDNN. Make sure cuBLAS is also installed (in recent versions of CUDA it is packaged separately).

The current codebase has been tested on CUDA 10.1 (V10.1.168) & cuDNN 7.6.4.

Check out the repository:

git clone https://github.com/dmikushin/cnn-benchmarks.git
cd cnn-benchmarks
git submodule init
git submodule update --recursive

Download the model weights:

sudo add-apt-repository ppa:longsleep/golang-backports -y
sudo apt-get update
sudo apt-get install golang-go
export GOPATH=/home/$USER
go get github.com/prasmussen/gdrive
git clone https://github.com/rejunity/cnn-benchmarks.git
cd cnn-benchmarks
bin/gdrive download 0Byvt-AfX75o1STUxZTFpMU10djA
unzip models.zip

Prepare Torch installation:

cd torch
./clean.sh
TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" ./install.sh
source install/bin/torch-activate
cd ..

Run the benchmark and format the results:

export GPU=v100
export CUDNN_VERSION=7604
python run_cnn_benchmarks.py --output_dir outputs/${GPU}_cudnn${CUDNN_VERSION}
python analyze_cnn_benchmark_results.py
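
The output directory name combines the GPU label with the cuDNN version written as an integer (7.6.4 becomes 7604). Here is a minimal Python sketch of that naming. It assumes the intended layout is outputs/<GPU>_cudnn<CUDNN_VERSION> and that the integer follows the major * 1000 + minor * 100 + patch encoding used by cuDNN 7.x. The helper names cudnn_version_code and output_dir are illustrative and are not part of the repository.

```python
import os


def cudnn_version_code(version: str) -> int:
    """Convert a dotted cuDNN 7.x version such as '7.6.4' to its integer form."""
    major, minor, patch = (int(part) for part in version.split("."))
    # Assumption: cuDNN 7.x encoding (major * 1000 + minor * 100 + patch).
    return major * 1000 + minor * 100 + patch


def output_dir(gpu: str, cudnn_code: int, root: str = "outputs") -> str:
    """Build the results directory name, e.g. outputs/v100_cudnn7604 (assumed layout)."""
    return f"{root}/{gpu}_cudnn{cudnn_code}"


if __name__ == "__main__":
    # Fall back to the values used in the commands above when GPU is not set.
    gpu = os.environ.get("GPU", "v100")
    print(output_dir(gpu, cudnn_version_code("7.6.4")))
```

The resulting string can be passed to run_cnn_benchmarks.py through its --output_dir option.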

Results

We use the following GPUs (roughly sorted by performance). Where a GPU is offered by several clouds, the Cloud and Instance Name entries are listed in matching order, separated by slashes:

GPU Cloud Instance Name Arch CUDA Cores FP32 TFLOPS Memory GB Bandwidth GB/s Release Date
Tesla V100 Amazon_EC2/Paperspace P3/V100 Volta 5120 14.03 16 900.1 Jun 2017
Quadro P6000 Paperspace P6000 Pascal 3840 12.63 24 432.8 Oct 2016
Quadro P5000 Paperspace P5000 Pascal 2560 8.87 16 288.3 Oct 2016
Tesla M60 Amazon_EC2/MS_Azure/IBM_Bluemix G3/NVx/M60 Maxwell 2048 4.83 8 160.4 Aug 2015
Quadro M4000 Paperspace GPU+ Maxwell 1664 2.57 8 192.3 Jun 2015
Tesla K80 Amazon_EC2/MS_Azure/Google_Cloud/IBM_Bluemix P2/NCx/K80/K80 Kepler 2496 4.37(?) 12 240.6 Nov 2014
GRID K520 Amazon_EC2 G2 Kepler 1536 2.45 4 160.0 Jul 2013

We use a desktop GTX 1080 Ti GPU and a Xeon E5-2666 v3 CPU (available on AWS EC2 as a c4.4xlarge instance) as reference points.

Some general conclusions from this benchmarking:

  • V100 is the FASTEST card you can get for deep learning in the cloud right now!
  • P6000 == GTX 1080 Ti and P5000 == GTX 1080: Performance within each pair of GPUs is very close on all models. The main difference is significantly more memory in the server-side Quadros.
  • P6000, P5000 and K80 for large models: Quadro P6000, Quadro P5000 and Tesla K80 have enough memory for most tasks: 24GB, 16GB and 12GB respectively.
  • V100 > P6000: Across all models, the Tesla V100 is 1.3x to 1.6x faster than Quadro P6000 and GTX 1080 Ti.
  • P6000 > P5000: Across all models, the Quadro P6000 is 1.3x to 1.65x faster than Quadro P5000.
  • P5000 > M60: Across all models, the Quadro P5000 is 1.75x to 2x faster than Tesla M60.
  • M60 > K80: Across all models, the Tesla M60 is 1.3x to 1.75x faster than Tesla K80.
  • K80 > K520: Across all models, the Tesla K80 is 1.8x to 2.25x faster than GRID K520.
  • Prefer latest cuDNN: cuDNN5.1.10 is slightly faster than 5.1.05 which in turn is faster than 5.0.05.
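
The pairwise ratios in the list above can be approximately recomputed from the Total (ms) columns of the per-model tables. Here is a minimal sketch for one pair, with the timings copied from those tables. The dictionary layout and the function name speedup_range are illustrative choices, not part of the benchmark scripts.

```python
# Total (ms) per minibatch of 16, copied from the per-model tables in this document.
TOTAL_MS = {
    "Tesla V100": {
        "AlexNet": 9.85, "Inception-V1": 28.31, "VGG-16": 76.45,
        "VGG-19": 87.97, "ResNet-18": 20.32, "ResNet-34": 32.51,
        "ResNet-50": 66.09, "ResNet-101": 113.75, "ResNet-152": 162.37,
        "ResNet-200": 208.94,
    },
    "Quadro P6000": {
        "AlexNet": 11.84, "Inception-V1": 39.75, "VGG-16": 122.04,
        "VGG-19": 142.56, "ResNet-18": 31.58, "ResNet-34": 52.46,
        "ResNet-50": 102.78, "ResNet-101": 156.78, "ResNet-152": 218.85,
        "ResNet-200": 297.29,
    },
}


def speedup_range(faster, slower, totals=TOTAL_MS):
    """Return (min, max, per-model ratios) of slower/faster total time."""
    shared = totals[faster].keys() & totals[slower].keys()
    ratios = {m: totals[slower][m] / totals[faster][m] for m in shared}
    return min(ratios.values()), max(ratios.values()), ratios


if __name__ == "__main__":
    low, high, per_model = speedup_range("Tesla V100", "Quadro P6000")
    for model in sorted(per_model):
        print(f"{model}: {per_model[model]:.2f}x")
    print(f"range: {low:.2f}x to {high:.2f}x")
```

Adding another GPU only requires one more entry in TOTAL_MS. Models missing for one of the two cards are skipped automatically.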

The effect of varying minibatch size with VGG-19 when run on Tesla V100:

Batch size Forward (ms) Backward (ms) Total (ms) Speedup (forward) Speedup (total)
1 5.57 11.29 16.85 1.0x 1.0x
2 8.68 14.18 22.86 1.3x 1.5x
4 14.11 23.23 37.34 1.6x 1.8x
8 21.87 38.62 60.50 2.0x 2.2x
16 27.73 60.24 87.97 3.2x 3.1x
32 51.54 115.23 166.77 3.5x 3.2x
64 101.69 225.78 327.46 3.5x 3.3x
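
The text does not define the speedup columns. The values are consistent with comparing per-image time at each batch size against per-image time at batch size 1, and the sketch below assumes that definition. The function name per_image_speedup is illustrative.

```python
def per_image_speedup(batch_size, batch_ms, single_ms):
    """Speedup in per-image time relative to running images one at a time.

    batch_ms:  time for one minibatch of `batch_size` images
    single_ms: time for a minibatch of size 1
    """
    return (single_ms * batch_size) / batch_ms


if __name__ == "__main__":
    # (batch size, forward ms, total ms) rows from the VGG-19 table above.
    rows = [(1, 5.57, 16.85), (16, 27.73, 87.97), (64, 101.69, 327.46)]
    base_forward, base_total = rows[0][1], rows[0][2]
    for batch, forward_ms, total_ms in rows:
        fwd = per_image_speedup(batch, forward_ms, base_forward)
        tot = per_image_speedup(batch, total_ms, base_total)
        print(f"batch {batch}: forward {fwd:.1f}x, total {tot:.1f}x")
```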

The effect of varying minibatch size with ResNet-34 when run on Tesla V100:

Batch size Forward (ms) Backward (ms) Total (ms) Speedup (forward) Speedup (total)
1 3.29 5.45 8.74 1.0x 1.0x
2 5.52 8.19 13.71 1.2x 1.3x
4 5.52 8.19 13.71 2.4x 2.5x
8 7.92 14.82 22.74 3.3x 3.1x
16 10.14 22.37 32.51 5.2x 4.3x
32 17.58 38.84 56.43 6.0x 5.0x
64 33.74 74.76 108.50 6.2x 5.2x

Below we benchmark all models with a minibatch size of 16 and an image size of 224 x 224; this allows large models to run on cards with 8GB of memory.

All benchmarks except V100 were run in Torch on Ubuntu 14.04 with the CUDA 8.0 Release Candidate. V100 benchmarks were run on Ubuntu 16.04.

All settings and models are exactly the same as in jcjohnson/cnn-benchmarks.

See template shell script below to help with downloading the model weights and running the benchmark.

AlexNet

(input 16 x 3 x 224 x 224)
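
The input shape reads as batch size x channels x height x width: 16 RGB images of 224 x 224 pixels. As a rough sketch of what that means for the input alone, the snippet below counts elements and bytes. It assumes 32-bit floats (4 bytes per element), which the document does not state. Activations and weights, which dominate GPU memory use, are not included.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element assumes float32."""
    return prod(shape) * bytes_per_element


if __name__ == "__main__":
    shape = (16, 3, 224, 224)  # batch, channels, height, width
    print(prod(shape), "elements")
    print(tensor_bytes(shape) / 2**20, "MiB for the input batch")
```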

We use the BVLC AlexNet from Caffe.

GPU cuDNN Forward (ms) Backward (ms) Total (ms)
Tesla V100 7.0.04 3.18 6.66 9.85
Quadro P6000 5.1.10 3.86 7.98 11.84
GTX 1080 Ti 5.1.10 4.31 9.58 13.89
Quadro P5000 5.1.10 5.91 13.68 19.58
Tesla M60 5.1.10 10.79 24.53 35.32
Quadro M4000 5.1.05 14.23 29.52 43.75
Tesla K80 5.1.10 15.98 31.63 47.61
GRID K520 5.1.10 39.77 66.51 106.28

Inception-V1

(input 16 x 3 x 224 x 224)

We use the Torch implementation of Inception-V1 from soumith/inception.torch.

GPU cuDNN Forward (ms) Backward (ms) Total (ms)
Tesla V100 7.0.04 7.90 20.41 28.31
GTX 1080 Ti 5.1.10 11.50 25.37 36.87
Quadro P6000 5.1.10 11.87 27.88 39.75
Quadro P5000 5.1.10 16.03 36.83 52.86
Tesla M60 5.1.10 29.46 63.62 93.08
Quadro M4000 5.1.05 40.29 89.48 129.77
Tesla K80 5.1.10 45.43 111.21 156.64
GRID K520 5.1.10 86.28 226.87 313.15
CPU: Dual Xeon E5-2666 v3 None 1569.44 1904.28 3473.72

VGG-16

(input 16 x 3 x 224 x 224)

This is Model D in [3] used in the ILSVRC-2014 competition, available here.

GPU cuDNN Forward (ms) Backward (ms) Total (ms)
Tesla V100 7.0.04 23.76 52.69 76.45
Quadro P6000 5.1.10 38.66 83.38 122.04
GTX 1080 Ti 5.1.10 41.23 86.91 128.14
Quadro P5000 5.1.10 58.16 122.14 180.30
Tesla M60 5.1.10 107.41 233.42 340.83
Quadro M4000 5.1.05 144.84 299.51 444.35
Tesla K80 5.1.10 153.67 295.74 449.40
GRID K520 None 675.96 1937.51 2613.48
CPU: Dual Xeon E5-2666 v3 None 2648.97 4788.71 7437.69

VGG-19

(input 16 x 3 x 224 x 224)

This is Model E in [3] used in the ILSVRC-2014 competition, available here.

GPU cuDNN Forward (ms) Backward (ms) Total (ms)
Tesla V100 7.0.04 27.73 60.24 87.97
Quadro P6000 5.1.10 45.59 96.97 142.56
GTX 1080 Ti 5.1.10 48.15 100.04 148.19
Quadro P5000 5.1.10 67.68 139.79 207.47
Tesla M60 5.1.10 125.61 277.30 402.91
Quadro M4000 5.1.05 169.70 347.80 517.50
Tesla K80 5.1.10 179.85 347.85 527.69
GRID K520 None 826.84 2275.49 3102.33
CPU: Dual Xeon E5-2666 v3 None 3119.22 5684.74 8803.97

ResNet-18

(input 16 x 3 x 224 x 224)

This is the 18-layer model described in [4] and implemented in fb.resnet.torch.

GPU cuDNN Forward (ms) Backward (ms) Total (ms)
Tesla V100 7.0.04 6.16 14.16 20.32
Quadro P6000 5.1.10 10.06 21.52 31.58
GTX 1080 Ti 5.1.10 10.45 22.34 32.78
Quadro P5000 5.1.10 14.58 29.48 44.06
Tesla M60 5.1.10 25.89 52.77 78.67
Quadro M4000 5.1.05 35.13 74.08 109.21
Tesla K80 5.1.10 37.87 74.88 112.74
GRID K520 5.1.10 64.82 140.53 205.36
CPU: Dual Xeon E5-2666 v3 None 606.22 1176.15 1782.37

ResNet-34

(input 16 x 3 x 224 x 224)

This is the 34-layer model described in [4] and implemented in fb.resnet.torch.

GPU cuDNN Forward (ms) Backward (ms) Total (ms)
Tesla V100 7.0.04 10.14 22.37 32.51
GTX 1080 Ti 5.1.10 16.71 34.60 51.31
Quadro P6000 5.1.10 17.11 35.35 52.46
Quadro P5000 5.1.10 24.57 48.04 72.61
Tesla M60 5.1.10 44.07 86.81 130.88
Quadro M4000 5.1.05 59.09 118.13 177.22
Tesla K80 5.1.10 64.79 124.24 189.03
GRID K520 5.1.10 112.04 231.02 343.06
CPU: Dual Xeon E5-2666 v3 None 720.24 1317.49 2037.72

ResNet-50

(input 16 x 3 x 224 x 224)

This is the 50-layer model described in [4] and implemented in fb.resnet.torch.

GPU cuDNN Forward (ms) Backward (ms) Total (ms)
Tesla V100 7.0.04 19.83 46.26 66.09
GTX 1080 Ti 5.1.10 34.14 67.06 101.21
Quadro P6000 5.1.10 34.02 68.76 102.78
Quadro P5000 5.1.10 48.77 98.72 147.49
Tesla M60 5.1.10 91.89 173.12 265.01
Quadro M4000 5.1.05 117.52 228.17 345.69
Tesla K80 5.1.10 124.38 274.43 398.81
CPU: Dual Xeon E5-2666 v3 None 1623.35 3042.77 4666.12

ResNet-101

(input 16 x 3 x 224 x 224)

This is the 101-layer model described in [4] and implemented in fb.resnet.torch.

GPU cuDNN Forward (ms) Backward (ms) Total (ms)
Tesla V100 7.0.04 31.64 82.11 113.75
GTX 1080 Ti 5.1.10 52.18 102.08 154.26
Quadro P6000 5.1.10 52.29 104.49 156.78
Quadro P5000 5.1.10 75.21 148.67 223.88
Tesla M60 5.1.10 142.62 257.42 400.04
Quadro M4000 5.1.05 186.16 350.82 536.98
Tesla K80 5.1.10 199.41 486.11 685.52
CPU: Dual Xeon E5-2666 v3 None 1946.84 3458.39 5405.23

ResNet-152

(input 16 x 3 x 224 x 224)

This is the 152-layer model described in [4] and implemented in fb.resnet.torch.

GPU cuDNN Forward (ms) Backward (ms) Total (ms)
Tesla V100 7.0.04 44.80 117.58 162.37
GTX 1080 Ti 5.1.10 73.52 142.02 215.54
Quadro P6000 5.1.10 73.81 145.04 218.85
Quadro P5000 5.1.10 106.26 204.86 311.13
Tesla M60 5.1.10 200.83 359.60 560.43
Quadro M4000 5.1.05 264.14 482.02 746.16
Tesla K80 5.1.10 283.68 700.15 983.83
CPU: Dual Xeon E5-2666 v3 None 3742.47 6980.75 10723.22

ResNet-200

(input 16 x 3 x 224 x 224)

This is the 200-layer model described in [5] and implemented in fb.resnet.torch.

Even with a batch size of 16, the 8GB GTX 1080 Ti, M4000 and K520 did not have enough memory to run the model.

GPU cuDNN Forward (ms) Backward (ms) Total (ms)
Tesla V100 7.0.04 59.68 149.26 208.94
Quadro P6000 5.1.10 102.36 194.93 297.29
Quadro P5000 5.1.10 146.78 275.36 422.14
Tesla K80 5.1.10 385.33 904.29 1289.63
CPU: Dual Xeon E5-2666 v3 None 5298.52 9668.13 14966.64
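
The per-model tables report milliseconds per minibatch of 16 images. A minimal sketch for converting a table entry into images per second follows. The function name images_per_second is illustrative. It treats Total (ms) as one forward plus one backward pass, as the column headers suggest.

```python
def images_per_second(batch_size, batch_ms):
    """Throughput implied by a per-minibatch time given in milliseconds."""
    return batch_size / (batch_ms / 1000.0)


if __name__ == "__main__":
    # Total (ms) for ResNet-200 at batch size 16, from the table above.
    for name, total_ms in [("Tesla V100", 208.94), ("Tesla K80", 1289.63)]:
        print(f"{name}: {images_per_second(16, total_ms):.1f} images/s")
```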

Citations

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012.

[2] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Andrew Rabinovich. "Going Deeper with Convolutions." CVPR 2015.

[3] Karen Simonyan and Andrew Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition." ICLR 2015.

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition." CVPR 2016.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Identity Mappings in Deep Residual Networks." ECCV 2016.

Contributors

dmikushin, dmitryulyanov, jcjohnson, mantasp, pgericson, rejunity, shmuma, vakons
