Comments (8)
Please try again with latest master:
Standard cpu sum: 20.186 ms (0 allocations: 0 bytes)
Standard cuda sum: 1.527 ms (358 allocations: 11.69 KiB)
Summing using allocating a vector of 1s and dot: 3.392 ms (18 allocations: 352 bytes)
Summing just using dot and a pre-allocated vector of 1s: 2.050 ms (3 allocations: 48 bytes)
It's not surprising that dot, which dispatches to highly optimized CUBLAS kernels in this case, performs well. Our sum
kernel ultimately has to deal with quite a lot of flexibility (e.g. passing a map function, dimensions, type support, etc.).
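As a sketch of that flexibility (assuming a CUDA-capable GPU; timings and exact results vary by hardware), the generic reduction covers cases the dot trick cannot:

```julia
# Sketch: the generic sum kernel supports mapped and dimensional reductions,
# which is part of the flexibility it pays for; dot only covers the plain case.
using CUDA, LinearAlgebra
CUDA.allowscalar(false)

A = CUDA.rand(Float32, 1024, 1024)

sum(A)          # plain reduction
sum(abs2, A)    # fused map + reduce (not expressible as a dot product)
sum(A; dims=1)  # partial reduction along a dimension

# The dot-based trick only reproduces the plain case:
v = vec(A)
dot(v, CUDA.ones(Float32, length(v)))  # same value as sum(v), via CUBLAS
```
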
from cuarrays.jl.
Makes sense regarding the dot product. Master doesn't work for me (though the release version still works). Here is the output:
Standard cpu sum: 20.667 ms (0 allocations: 0 bytes)
Standard cuda sum: ERROR: LoadError: Not implemented
Stacktrace:
[1] error(::String) at .\error.jl:33
[2] mapreducedim!(::Function, ::Function, ::CuArrays.CuArray{Float32,1,Nothing}, ::CuArrays.CuArray{Float32,1,Nothing},
::Float32) at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:5
[3] mapreduce(::Function, ::Function, ::CuArrays.CuArray{Float32,1,Nothing}; dims::Function, init::Nothing) at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:38
[4] mapreduce at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:24 [inlined]
[5] _sum(::Function, ::CuArrays.CuArray{Float32,1,Nothing}, ::Colon) at .\reducedim.jl:657
[6] _sum(::CuArrays.CuArray{Float32,1,Nothing}, ::Colon) at .\reducedim.jl:656
[7] #sum#583 at .\reducedim.jl:652 [inlined]
[8] sum at .\reducedim.jl:652 [inlined]
[9] ##core#552(::CuArrays.CuArray{Float32,1,Nothing}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:371
[10] ##sample#553(::BenchmarkTools.Parameters) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:377
[11] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{4,Symbol},NamedTuple{(:samples, :evals, :gctrial, :gcsample),Tuple{Int64,Int64,Bool,Bool}}}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:405
[12] (::Base.var"#inner#2"{Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, :evals,
:gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}},typeof(BenchmarkTools._run),Tuple{BenchmarkTools.Benchmark{Symbol("##benchmark#551")},BenchmarkTools.Parameters}})() at .\essentials.jl:715
[13] #invokelatest#1 at .\essentials.jl:716 [inlined]
[14] #run_result#37 at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:32 [inlined]
[15] run(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}, ::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples,
:evals, :gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:94
[16] #warmup#45 at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:141 [inlined]
[17] warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:141
[18] macro expansion at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:481 [inlined]
[19] mwesum(::Int64) at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:383
[20] top-level scope at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:396
in expression starting at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:396
You need to upgrade both GPUArrays and CuArrays to master.
Perfect, that worked. Thanks for having a look!
Standard cpu sum: 20.455 ms (0 allocations: 0 bytes)
Standard cuda sum: 1.999 ms (338 allocations: 11.31 KiB)
Summing using allocation and dot: 14.213 ms (18 allocations: 352 bytes)
Summing just using dot and a pre-allocated vector of 1s: 2.587 ms (3 allocations: 48 bytes)
With smaller arrays the CUDA performance is rather abysmal. For N = 256^2:
Standard cpu sum: 4.131 μs (0 allocations: 0 bytes)
Standard cuda sum: 148.908 μs (356 allocations: 11.66 KiB)
Summing using allocating a vector of 1s and dot: 16.415 μs (14 allocations: 272 bytes)
Summing just using dot and a pre-allocated vector of 1s: 12.562 μs (3 allocations: 48 bytes)
It's actually a bit worse than this: I should have included CUDA.@sync
in the benchmarks. Doing so at least brings the last two results into the ballpark of the sum function. Here is an update of the MWE.
using LinearAlgebra, CUDA, BenchmarkTools
CUDA.allowscalar(false)
function mwesum(N)
    cpuv = rand(Float32, N)
    print("\nStandard cpu sum: ")
    @btime CUDA.@sync sum($cpuv)
    cuv = CUDA.cu(cpuv)
    print("Standard cuda sum: ")
    @btime CUDA.@sync sum($cuv)
    print("Summing using allocating a vector of 1s and dot: ")
    onesum(v) = dot(v, CUDA.ones(length(v)))  # use the argument instead of capturing cuv
    @btime CUDA.@sync $onesum($cuv)
    cuvones = CUDA.ones(N)
    print("Summing just using dot and a pre-allocated vector of 1s: ")
    onesum2(v) = dot(v, cuvones)
    @btime CUDA.@sync $onesum2($cuv)
end
sleep(0.5)
mwesum(256^2)
This gives me:
Standard cpu sum: 4.471 μs (0 allocations: 0 bytes)
Standard cuda sum: 255.800 μs (366 allocations: 11.83 KiB)
Summing using allocating a vector of 1s and dot: 108.499 μs (24 allocations: 448 bytes)
Summing just using dot and a pre-allocated vector of 1s: 106.599 μs (13 allocations: 224 bytes)
If the issue is worth reopening, perhaps it should be moved to CUDA.jl?
Sure, feel free to open an issue about the performance on small arrays. Be aware, though, that the kernel launch overhead alone is several µs, let alone transferring the memory, so it's never going to be fast. And it's not possible to fall back to a CPU-based implementation here; that should be done at a higher level.
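One shape such a higher-level fallback could take, as a hedged sketch (the 2^16 cutoff is an assumption, not a measured crossover, and would need tuning per machine):

```julia
# Hypothetical size-based fallback: for small arrays, copy to the host and sum
# there; for large arrays, stay on the GPU where the launch overhead amortizes.
using CUDA

function adaptive_sum(x::CuArray; threshold::Int = 2^16)
    if length(x) < threshold
        sum(Array(x))  # pay the device-to-host transfer, skip the GPU reduction
    else
        sum(x)         # GPU reduction wins once the array is large enough
    end
end
```
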
What do you mean by transferring the memory? Normally the array is in GPU memory to begin with if the code uses CuArrays.
Falling back to a CPU sum is not very fast either, because that does involve a memory transfer. For small arrays it's faster than the CUDA sum, but much slower than avoiding the transfer altogether. I added a case for this to the example:
using LinearAlgebra, CUDA, BenchmarkTools
CUDA.allowscalar(false)
function mwesum(N)
    cpuv = rand(Float32, N)
    print("\nStandard cpu sum: ")
    @btime CUDA.@sync sum($cpuv)
    cuv = CUDA.cu(cpuv)
    print("Standard cuda sum: ")
    @btime CUDA.@sync sum($cuv)
    print("Transfer to CPU and sum: ")
    @btime CUDA.@sync sum(collect($cuv))
    print("Summing using allocating a vector of 1s and dot: ")
    onesum(v) = dot(v, CUDA.ones(length(v)))  # use the argument instead of capturing cuv
    @btime CUDA.@sync $onesum($cuv)
    cuvones = CUDA.ones(N)
    print("Summing just using dot and a pre-allocated vector of 1s: ")
    onesum2(v) = dot(v, cuvones)
    @btime CUDA.@sync $onesum2($cuv)
end
sleep(0.5)
mwesum(256^2)
Output:
Standard cpu sum: 10.966 μs (10 allocations: 176 bytes)
Standard cuda sum: 151.740 μs (366 allocations: 11.83 KiB)
Transfer to CPU and sum: 62.251 μs (14 allocations: 256.28 KiB)
Summing using allocating a vector of 1s and dot: 24.281 μs (25 allocations: 464 bytes)
Summing just using dot and a pre-allocated vector of 1s: 20.161 μs (14 allocations: 240 bytes)