
Comments (16)

scott-gray commented on May 22, 2024

Reposting this here:

Looks like cuDNN is catching up. Those VGG numbers are really good for an FFT implementation. Perhaps with a bit more optimization they can overtake spatial domain on 3x3 filters? I wouldn't be surprised if we see much better fp16 numbers from them soon.

My GoogLeNet numbers may look good but I still have a lot of optimizations to make for the smaller feature map values in there. Right now I'm optimized for multiples of 64. I'll get that down to 32 this weekend. My CHWN tensor layout is also really helpful on those inception groupings.

A brand new version of neon is about to be released. You'll be able to run all these networks out of the box (plus lots more). The new syntax is much improved and more Torch- or Keras-like (perhaps even better).

Anyway, here's a changelog of updates since the last version:

No more multiplying by zero to implement padding in fprop and bprop (I now slice both the input and the filter)
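To illustrate the idea for other readers, here's a stride-1 sketch (names are illustrative, not the actual kernel code):

```cuda
// Sketch of padding-by-slicing (stride 1, a 1-D slice of the real 3-D loop):
// clamp the filter tap range so only in-bounds input is ever read, instead
// of multiplying zero-padded values.
int r_lo = max(0, pad - p);          // first valid filter row for output row p
int r_hi = min(R, H + pad - p);      // one past the last valid filter row
for (int r = r_lo; r < r_hi; ++r) {
    int y = p - pad + r;             // 0 <= y < H by construction
    sum += x[y] * w[r];
}
```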

Figured out a different way to do integer division for the now dynamically sized slice lookup table.
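For reference, the standard GPU trick here (a sketch of one well-known technique, not necessarily my exact implementation) is division by a precomputed magic number:

```cuda
// Replace n / d with a multiply-high by a precomputed magic number,
// avoiding the GPU's slow integer-divide path.

// Host side: magic for a divisor d >= 2, computed once per launch.
unsigned magic_for(unsigned d) {
    return (unsigned)(0x100000000ULL / d) + 1;   // floor(2^32 / d) + 1
}

// Device side: exact quotient whenever n * d < 2^32, which comfortably
// holds for conv slice-table indices.
__device__ unsigned fast_div(unsigned n, unsigned magic) {
    return __umulhi(n, magic);
}
```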

No more atomic adds in bprop. I've cast bprop as fprop upside down and the kernels are nearly identical. It requires a dimshuffle on the filter but this just takes microseconds and a small amount of additional memory that can be shared with all conv ops. Bprop used to be bandwidth bound on those atomic adds.
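The identity being used here is the standard one: the input gradient is itself a convolution of the deltas with the filters rotated 180 degrees spatially and with their C and K axes swapped. A hypothetical host-side dimshuffle, assuming a KCRS filter layout:

```cuda
// Hypothetical dimshuffle (assuming KCRS layout): build Ft[C][K][R][S] with
// 180-degree-rotated taps, so that bprop over the deltas can reuse the
// fprop kernel: grad_input = fprop(delta, Ft).
for (int k = 0; k < K; ++k)
    for (int c = 0; c < C; ++c)
        for (int r = 0; r < R; ++r)
            for (int s = 0; s < S; ++s)
                Ft[((c*K + k)*R + (R-1-r))*S + (S-1-s)] =
                    F[((k*C + c)*R + r)*S + s];
```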

Tweaked the p,q block ordering to improve L2 cache performance. I'm using a zigzag pattern now for all operations.
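Roughly like this (illustrative index math only):

```cuda
// Remap a linear block index into a zigzag (boustrophedon) walk over the
// P x Q grid of output tiles: adjacent CTAs in launch order stay spatial
// neighbours, so their overlapping input stays resident in L2.
int p = blockIdx.x / Q;          // tile row
int q = blockIdx.x % Q;          // tile column
if (p & 1)
    q = Q - 1 - q;               // odd rows run right-to-left
```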

The update op already had a mode where you could stack all the gemm ops to eliminate atomic adds, but I've streamlined that stacking operation. Update also now fetches 32 rows deep. This comes at the cost of an instruction cache miss inside the main gemm loop, but it is easily covered by the occupancy. The reason for doing this is the same as for using a 32x33 shared memory block to implement a transpose: with N contiguous, the update op has expensive strided access patterns on both the input and the delta.

I also eliminate all shared memory bank conflicts when storing the global loads to shared memory, using some clever shifting.
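For readers unfamiliar with the trick, here is its textbook form using a padded 32x33 tile; the shifting variant mentioned above staggers the banks the same way without spending the pad column (names like `in`, `row`, `col` are illustrative):

```cuda
// Pad the tile to 32x33 so a column walk steps through 33 banks mod 32,
// i.e. a different bank per thread in the warp.
__shared__ float tile[32][33];

// store phase: coalesced global read, row-major into shared (conflict-free)
tile[threadIdx.y][threadIdx.x] = in[row * width + col];
__syncthreads();
// load phase: read the transpose column-wise, still one element per bank
float v = tile[threadIdx.x][threadIdx.y];
```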

Added a beta param to bprop to allow delta accumulation for inception groupings.
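The semantics are GEMM-like (a sketch, with hypothetical names):

```cuda
// GEMM-style beta semantics: beta = 0 overwrites, beta = 1 accumulates,
// so the branches of an inception grouping can sum their input gradients
// into one shared delta buffer.
for (int i = 0; i < n; ++i)
    grad_x[i] = bprop_out[i] + beta * grad_x[i];
```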


soumith commented on May 22, 2024

The cuDNN guys said that they have more optimizations slated as well that will help GoogLeNet; I will update the numbers after the final release happens.

I'm excited to see the new Neon.

In OpenCL land, there's lots of catching up to do in terms of perf, but I am really happy that the libraries are getting feature-complete. Thanks to Hugh Perkins, Fabian Tschopp and the other OpenCL torch-bearers (no pun intended).

Another interesting data point comes this week from Intel, who claim to train OverFeat in 3-4 hours using 64 Xeon Phi nodes. That comes into distributed territory, and it does not make me super-duper impressed, having seen internal distributed systems based on GPUs that are equally good or better. Distributed training is a separate optimization from optimizing single nodes.
The Intel slides from the Hot Chips conference are behind the conference paywall.


scott-gray commented on May 22, 2024

Optimizing for Overfeat is like shooting fish in a barrel. And the kind of fish that no one particularly wants to eat. I'd be more impressed with Intel's results on SOTA networks of this year.


BlGene commented on May 22, 2024

Has anyone been able to, or been interested in, compiling OpenCL convolutions to FPGAs (using the Altera SDK)? There was the somewhat similar neuflow project, but I haven't heard from them in a while.


soumith commented on May 22, 2024

@BlGene neuflow used to be open-source, but they closed it up and are building a startup around it called TeraDeep. Neuflow was also built for inference, not training, and it used fixed-point arithmetic on FPGAs. If there are good floating-point FPGAs that are affordable, it might be a good idea to start a new community project in that direction.


hughperkins commented on May 22, 2024

@BlGene Do you have some kind of indicative figures for a suitable FPGA in the $100-400 price range, and its relative performance on convolution compared with GPUs in the same price range? It could be a theoretical analysis, and approximate, as long as such caveats are clearly stated.


BlGene commented on May 22, 2024

@soumith @hughperkins

I don't have anything concrete. I did a bit of research and came up with the following. I haven't read the papers in detail, but from what I understand the sentiment seems to be: a lot of work and not much speedup yet (is this correct?). It might be possible to ask people if they are interested in comparable benchmarking... if they feel confident ;)

Quora: http://www.quora.com/Is-implementing-deep-learning-on-FPGAs-a-natural-next-step-after-the-success-with-GPUs

Research Papers:

  1. http://dl.acm.org/citation.cfm?id=2689060
  2. http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=7160081

Companies:

  1. Altera: http://www.slideshare.net/embeddedvision/a04-altera-singh
  2. Altera and Baidu: http://newsroom.altera.com/press-releases/altera-baidu-fpga-cloud-data-centers.htm
  3. Nervana Systems: https://gigaom.com/2014/08/21/nervana-systems-raises-3-3m-to-build-hardware-designed-for-deep-learning/
  4. Auviz Systems: http://auvizsystems.com/products/auvizdnn/

What are your thoughts? Maybe @scott-gray can say something for Nervana?


scott-gray commented on May 22, 2024

I can't really speak much about our own hardware efforts, except to say they should be extremely competitive with GPUs. Generally speaking, any ASIC that's custom-designed for a particular task is going to be faster than more general-purpose hardware.

Though there is this recent bit on Microsoft's efforts with FPGAs:
http://www.theplatform.net/2015/08/27/microsoft-extends-fpga-reach-from-bing-to-deep-learning/

Seems for them the real advantage is in power savings.


hughperkins commented on May 22, 2024

Looks like cuDNN is catching up. Those VGG numbers are really good for an FFT implementation.

Is cuDNN using FFT then? What do we know about how cuDNN works?


soumith commented on May 22, 2024

@hughperkins the CuDNN manual details the available algorithms they use. Also, the headers give hints as well. https://github.com/soumith/cudnn.torch/blob/R5/ffi.lua#L394-L402
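For convenience, this is the enum those header lines mirror (reproduced from memory of the v5 headers, so double-check against your own install):

```cuda
typedef enum {
    CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM         = 0,
    CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM = 1,
    CUDNN_CONVOLUTION_FWD_ALGO_GEMM                  = 2,
    CUDNN_CONVOLUTION_FWD_ALGO_DIRECT                = 3,
    CUDNN_CONVOLUTION_FWD_ALGO_FFT                   = 4,
    CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING            = 5,
    CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD              = 6
} cudnnConvolutionFwdAlgo_t;
```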


hughperkins commented on May 22, 2024

the CuDNN manual details the available algorithms they use.

Ok. To what extent is the cuDNN manual publicly available, and to what extent does one have to click through some agreement where one agrees not to reveal its contents? I guess NVIDIA has larger pockets than I do :-D

(I can find plenty of stuff about v1, e.g. http://on-demand.gputechconf.com/gtc/2014/webinar/gtc-express-sharan-chetlur-cudnn-webinar.pdf and http://arxiv.org/pdf/1410.0759v2.pdf , but I don't seem to be able to find any sources for v2 onwards?)


hughperkins commented on May 22, 2024

(well... it seems there is a paragraph in the Lavin paper https://arxiv.org/pdf/1509.09308.pdf which asserts that cuDNN is using FFT:

"The FFT and convolution theorem have been used to reduce the arithmetic complexity of convnet layers, first by Mathieu et al. [10], then refined by Visalache et al. [12], and then implemented in the NVIDIA cuDNN library[1]"

... so maybe I can just cite that??? (Although figure 1 seems to imply that:

  • there are two cuDNN implementations: one is FFT, one is not, and cuDNN switches between them, probably using a heuristic, e.g. from the text: "cuDNN appears to erroneously select its FFT algorithm for intermediate values of N despite the fact that it performs very poorly, under 2 TFLOPS."
  • for the lowest-level layers, which as far as I know basically dominate the time, non-FFT is being used for all batch sizes

... so it seems that stating "cuDNN is proprietary, so we cannot reason well about how it works" is not an entirely unreasonable position?)
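(For anyone following along: the arithmetic saving that all the FFT approaches rely on is just the convolution theorem,

x ∗ w = IFFT( FFT(x) ⊙ FFT(w) ),

i.e. one elementwise product per frequency bin instead of a full spatial dot product per output, which is why FFT pays off once the filters and feature maps are large enough.)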

Edit: it seems I should cite http://arxiv.org/abs/1412.7580 ; I remember these numbers used to be in convnet-benchmarks, and maybe were even the reason convnet-benchmarks was originally created :-P , but they have vanished since around the time of cuDNN v2-v3, presumably because fbfft is no longer competitive with cuDNN v2-v4?

Edit2: oh wait, fbfft is still there :-)

Edit3: hmmm, I guess fbfft is not dependent on a blas implementation or similar? It just uses its own native code? Therefore easily portable to OpenCL? And very excellent performance. Edit4: noticed fbfft does depend on blas, so removed edit3 :-)


cliffwoolley commented on May 22, 2024

You can easily query which algorithm cuDNN's heuristic has selected for your problem size and memory availability. You can also just pick an algorithm and force cuDNN to use it. Or you can ask cuDNN to try every algorithm it can and report how long each took, so that you can, for example, pick the genuinely fastest available one even if the heuristically chosen one wouldn't have been the optimal choice.
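A minimal sketch of that flow against the v5 API (descriptor setup and error checking omitted; see cudnn.h for the exact signatures):

```cuda
#include <cudnn.h>
#include <stdio.h>

void report_algos(cudnnHandle_t h,
                  cudnnTensorDescriptor_t xDesc, cudnnFilterDescriptor_t wDesc,
                  cudnnConvolutionDescriptor_t convDesc,
                  cudnnTensorDescriptor_t yDesc)
{
    /* Time every algorithm on the actual problem size... */
    cudnnConvolutionFwdAlgoPerf_t perf[8];
    int found = 0;
    cudnnFindConvolutionForwardAlgorithm(h, xDesc, wDesc, convDesc, yDesc,
                                         8, &found, perf);
    for (int i = 0; i < found; ++i)   /* results come back sorted fastest-first */
        printf("algo %d: %.3f ms, %zu bytes workspace\n",
               (int)perf[i].algo, perf[i].time, perf[i].memory);

    /* ...or just ask the heuristic what it would pick. */
    cudnnConvolutionFwdAlgo_t algo;
    cudnnGetConvolutionForwardAlgorithm(h, xDesc, wDesc, convDesc, yDesc,
                                        CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
                                        0, &algo);
    printf("heuristic picks algo %d\n", (int)algo);
}
```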

cuDNN's EULA (while yes, you do have to agree to it to use the library) does not have any non-disclosure clause. Plenty of people have published papers with the results of experimenting with cuDNN already...

Hope this helps,
-Cliff

hughperkins commented on May 22, 2024

@cliffwoolley Thanks! Bang goes my excuse for not learning about cuDNN :-D

Edit: ok, your reading seems plausible. The API itself might plausibly be protected by copyright, but presumably 'fair use' applies as far as stating what is in the API, describing it, and so on?


andravin commented on May 22, 2024

@hughperkins as @cliffwoolley suggested, call cudnnFindConvolutionForwardAlgorithm() to find the fastest cuDNN convolution algorithm for a given layer configuration.

In my paper I used cudnnGetConvolutionForwardAlgorithm(), which cannot be relied on to select the fastest algorithm. As of cuDNN v3, at least, it would select FFT for moderate batch sizes where direct convolution would have been faster. I wanted to compare FFT and Winograd directly at those sizes, so I left it that way, but perhaps it is a bit confusing.

Note that you can explicitly select the algorithm to use in any cuDNN operation. For fprop convolution, the algorithms are enumerated in cudnnConvolutionFwdAlgo_t. As of cuDNN v5, one of the choices is CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD. You could select algorithms manually and see what happens if you are curious about how they compare.
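Roughly like this (cuDNN v5; the descriptors, data pointers, alpha/beta and the workspace allocation are assumed to exist already):

```cuda
/* Force a specific algorithm rather than using the heuristic. */
cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD;
size_t ws_bytes = 0;
cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc, yDesc,
                                        algo, &ws_bytes);
/* cudaMalloc(&workspace, ws_bytes); then run with the forced algorithm: */
cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, convDesc,
                        algo, workspace, ws_bytes, &beta, yDesc, y);
```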


hughperkins commented on May 22, 2024

As of cuDNN v5, one of the choices is CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD.

Heh! Nice :-) By the way, my apologies, I didn't realize until about 17 hours ago who you are. But I know now :-) Or, at least, I am becoming aware of your contribution to http://arxiv.org/abs/1509.09308 , which obviously blows the previous approaches to GPU convolution out of the water :-)

