
Sum function is slow (cuarrays.jl, closed)

clintonTE commented on June 21, 2024
Sum function is slow


Comments (8)

maleadt commented on June 21, 2024

Please try again with latest master:

Standard cpu sum:   20.186 ms (0 allocations: 0 bytes)
Standard cuda sum:   1.527 ms (358 allocations: 11.69 KiB)
Summing using allocating a vector of 1s and dot:   3.392 ms (18 allocations: 352 bytes)
Summing just using dot and a pre-allocated vector of 1s:   2.050 ms (3 allocations: 48 bytes)

It's not surprising that dot, which dispatches to highly optimized CUBLAS kernels in this case, performs well. Our sum kernel ultimately has to support quite a lot of flexibility (e.g. passing a map function, reducing along dimensions, supporting arbitrary element types, etc.).
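
For a rough illustration of that difference, here is a sketch (using the same CUDA.jl API as the MWEs below) of the kinds of calls the generic sum machinery must handle, versus the single fixed operation the dot trick performs:

using CUDA, LinearAlgebra
CUDA.allowscalar(false)

A = CUDA.rand(Float32, 1024, 1024)

sum(A)           # plain full reduction
sum(abs2, A)     # reduction with a mapping function
sum(A; dims=2)   # partial reduction along a dimension

# By contrast, the dot trick is one fixed CUBLAS call on flat vectors:
dot(vec(A), CUDA.ones(length(A)))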


clintonTE commented on June 21, 2024

Makes sense regarding the dot product. Master doesn't work for me, though (the release version still works). Here is the output:

Standard cpu sum:   20.667 ms (0 allocations: 0 bytes)
Standard cuda sum: ERROR: LoadError: Not implemented
Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] mapreducedim!(::Function, ::Function, ::CuArrays.CuArray{Float32,1,Nothing}, ::CuArrays.CuArray{Float32,1,Nothing}, ::Float32) at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:5
 [3] mapreduce(::Function, ::Function, ::CuArrays.CuArray{Float32,1,Nothing}; dims::Function, init::Nothing) at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:38
 [4] mapreduce at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:24 [inlined]
 [5] _sum(::Function, ::CuArrays.CuArray{Float32,1,Nothing}, ::Colon) at .\reducedim.jl:657
 [6] _sum(::CuArrays.CuArray{Float32,1,Nothing}, ::Colon) at .\reducedim.jl:656
 [7] #sum#583 at .\reducedim.jl:652 [inlined]
 [8] sum at .\reducedim.jl:652 [inlined]
 [9] ##core#552(::CuArrays.CuArray{Float32,1,Nothing}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:371
 [10] ##sample#553(::BenchmarkTools.Parameters) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:377
 [11] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{4,Symbol},NamedTuple{(:samples, :evals, :gctrial, :gcsample),Tuple{Int64,Int64,Bool,Bool}}}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:405
 [12] (::Base.var"#inner#2"{Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}},typeof(BenchmarkTools._run),Tuple{BenchmarkTools.Benchmark{Symbol("##benchmark#551")},BenchmarkTools.Parameters}})() at .\essentials.jl:715
 [13] #invokelatest#1 at .\essentials.jl:716 [inlined]
 [14] #run_result#37 at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:32 [inlined]
 [15] run(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}, ::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:94
 [16] #warmup#45 at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:141 [inlined]
 [17] warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:141
 [18] macro expansion at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:481 [inlined]
 [19] mwesum(::Int64) at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:383
 [20] top-level scope at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:396
in expression starting at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:396



maleadt commented on June 21, 2024

You need to upgrade both GPUArrays and CuArrays to master.
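
For example, from the Pkg REPL:

pkg> add GPUArrays#master
pkg> add CuArrays#master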


clintonTE commented on June 21, 2024

Perfect- that worked. Thanks for having a look!

Standard cpu sum:   20.455 ms (0 allocations: 0 bytes)
Standard cuda sum:   1.999 ms (338 allocations: 11.31 KiB)
Summing using allocation and dot:   14.213 ms (18 allocations: 352 bytes)
Summing just using dot and a pre-allocated vector of 1s:   2.587 ms (3 allocations: 48 bytes)


tverho commented on June 21, 2024

With smaller arrays the cuda performance is rather abysmal. For N=256^2:

Standard cpu sum:   4.131 μs (0 allocations: 0 bytes)
Standard cuda sum:   148.908 μs (356 allocations: 11.66 KiB)
Summing using allocating a vector of 1s and dot:   16.415 μs (14 allocations: 272 bytes)
Summing just using dot and a pre-allocated vector of 1s:   12.562 μs (3 allocations: 48 bytes)


clintonTE commented on June 21, 2024

It's actually a bit worse than this: I should have included CUDA.@sync in the benchmarks so that each timing waits for the GPU to finish. With synchronization, the last two results are at least in the same ballpark as the sum function. Here is an updated MWE.

using Revise, LinearAlgebra, CUDA, BenchmarkTools

CUDA.allowscalar(false)

function mwesum(N)
  cpuv = rand(Float32, N)
  print("\nStandard cpu sum: ")
  @btime CUDA.@sync sum($cpuv)

  cuv = CUDA.cu(cpuv)
  print("Standard cuda sum: ")
  @btime CUDA.@sync sum($cuv)

  print("Summing using allocating a vector of 1s and dot: ")
  onesum(v) = dot(v, CUDA.ones(N))  # allocates the vector of 1s on every call
  @btime CUDA.@sync $onesum($cuv)

  cuvones = CUDA.ones(N)
  print("Summing just using dot and a pre-allocated vector of 1s: ")
  onesum2(v) = dot(v, cuvones)  # reuses the pre-allocated vector of 1s
  @btime CUDA.@sync $onesum2($cuv)
end

sleep(0.5)
mwesum(256^2)

This gives me:

Standard cpu sum:   4.471 μs (0 allocations: 0 bytes)
Standard cuda sum:   255.800 μs (366 allocations: 11.83 KiB)
Summing using allocating a vector of 1s and dot:   108.499 μs (24 allocations: 448 bytes)
Summing just using dot and a pre-allocated vector of 1s:   106.599 μs (13 allocations: 224 bytes)

If the issue is worth reopening, perhaps it should be moved to CUDA.jl?


maleadt commented on June 21, 2024

Sure, feel free to open an issue about the performance on small arrays. Do know that kernel launch overhead alone is several µs, let alone transferring the memory, so it's never going to be fast. And it's not possible to fall back to a CPU-based implementation here; that should be done at a higher level.
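
Such a higher-level fallback could be as simple as a size check in user code. A minimal sketch (a hypothetical helper, not part of CuArrays; the threshold is made up and machine-dependent):

using CUDA

# Hypothetical wrapper: sum small arrays on the host, large ones on the
# device. The crossover point would need benchmarking on each machine.
adaptive_sum(v::CuArray; threshold::Int = 1 << 16) =
    length(v) < threshold ? sum(Array(v)) : sum(v)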


tverho commented on June 21, 2024

What do you mean by transferring the memory? Normally the array is in the GPU memory to begin with if the code uses CuArrays.

Falling back to CPU summing is not very fast either, because it does involve a memory transfer. For small arrays it's faster than the CUDA sum, but much slower than summing without the transfer. I added a case for this to the example:

using LinearAlgebra, CUDA, BenchmarkTools

CUDA.allowscalar(false)

function mwesum(N)
  cpuv = rand(Float32, N)
  print("\nStandard cpu sum: ")
  @btime CUDA.@sync sum($cpuv)

  cuv = CUDA.cu(cpuv)
  print("Standard cuda sum: ")
  @btime CUDA.@sync sum($cuv)

  print("Transfer to CPU and sum: ")
  @btime CUDA.@sync sum(collect($cuv))  # collect copies the data to the host

  print("Summing using allocating a vector of 1s and dot: ")
  onesum(v) = dot(v, CUDA.ones(N))  # allocates the vector of 1s on every call
  @btime CUDA.@sync $onesum($cuv)

  cuvones = CUDA.ones(N)
  print("Summing just using dot and a pre-allocated vector of 1s: ")
  onesum2(v) = dot(v, cuvones)  # reuses the pre-allocated vector of 1s
  @btime CUDA.@sync $onesum2($cuv)
end

sleep(0.5)
mwesum(256^2)

Output:

Standard cpu sum:   10.966 μs (10 allocations: 176 bytes)
Standard cuda sum:   151.740 μs (366 allocations: 11.83 KiB)
Transfer to CPU and sum:   62.251 μs (14 allocations: 256.28 KiB)
Summing using allocating a vector of 1s and dot:   24.281 μs (25 allocations: 464 bytes)
Summing just using dot and a pre-allocated vector of 1s:   20.161 μs (14 allocations: 240 bytes)

