Comments (8)
Please try again with latest master:
Standard cpu sum: 20.186 ms (0 allocations: 0 bytes)
Standard cuda sum: 1.527 ms (358 allocations: 11.69 KiB)
Summing using allocating a vector of 1s and dot: 3.392 ms (18 allocations: 352 bytes)
Summing just using dot and a pre-allocated vector of 1s: 2.050 ms (3 allocations: 48 bytes)
It's not surprising that dot, which dispatches to highly optimized CUBLAS kernels in this case, performs well. Our sum
kernel ultimately has to deal with quite a lot of flexibility (e.g. passing a map function, dimensions, type support, etc.).
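As a sketch of that flexibility (assuming a CUDA-capable GPU; timings and exact results vary by hardware), the generic reduction covers cases the dot trick cannot:

```julia
# Sketch: the generic sum kernel supports mapped and dimensional reductions,
# which is part of the flexibility it pays for; dot only covers the plain case.
using CUDA, LinearAlgebra
CUDA.allowscalar(false)

A = CUDA.rand(Float32, 1024, 1024)

sum(A)          # plain reduction
sum(abs2, A)    # fused map + reduce (not expressible as a dot product)
sum(A; dims=1)  # partial reduction along a dimension

# The dot-based trick only reproduces the plain case:
v = vec(A)
dot(v, CUDA.ones(Float32, length(v)))  # same value as sum(v), via CUBLAS
```
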
from cuarrays.jl.
Makes sense regarding the dot product. Master doesn't work for me (though the release version still works). Here is the output:
Standard cpu sum: 20.667 ms (0 allocations: 0 bytes)
Standard cuda sum: ERROR: LoadError: Not implemented
Stacktrace:
[1] error(::String) at .\error.jl:33
[2] mapreducedim!(::Function, ::Function, ::CuArrays.CuArray{Float32,1,Nothing}, ::CuArrays.CuArray{Float32,1,Nothing},
::Float32) at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:5
[3] mapreduce(::Function, ::Function, ::CuArrays.CuArray{Float32,1,Nothing}; dims::Function, init::Nothing) at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:38
[4] mapreduce at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:24 [inlined]
[5] _sum(::Function, ::CuArrays.CuArray{Float32,1,Nothing}, ::Colon) at .\reducedim.jl:657
[6] _sum(::CuArrays.CuArray{Float32,1,Nothing}, ::Colon) at .\reducedim.jl:656
[7] #sum#583 at .\reducedim.jl:652 [inlined]
[8] sum at .\reducedim.jl:652 [inlined]
[9] ##core#552(::CuArrays.CuArray{Float32,1,Nothing}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:371
[10] ##sample#553(::BenchmarkTools.Parameters) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:377
[11] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{4,Symbol},NamedTuple{(:samples, :evals, :gctrial, :gcsample),Tuple{Int64,Int64,Bool,Bool}}}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:405
[12] (::Base.var"#inner#2"{Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, :evals,
:gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}},typeof(BenchmarkTools._run),Tuple{BenchmarkTools.Benchmark{Symbol("##benchmark#551")},BenchmarkTools.Parameters}})() at .\essentials.jl:715
[13] #invokelatest#1 at .\essentials.jl:716 [inlined]
[14] #run_result#37 at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:32 [inlined]
[15] run(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}, ::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples,
:evals, :gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:94
[16] #warmup#45 at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:141 [inlined]
[17] warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:141
[18] macro expansion at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:481 [inlined]
[19] mwesum(::Int64) at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:383
[20] top-level scope at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:396
in expression starting at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:396
You need to upgrade both GPUArrays and CuArrays to master.
Perfect, that worked. Thanks for having a look!
Standard cpu sum: 20.455 ms (0 allocations: 0 bytes)
Standard cuda sum: 1.999 ms (338 allocations: 11.31 KiB)
Summing using allocation and dot: 14.213 ms (18 allocations: 352 bytes)
Summing just using dot and a pre-allocated vector of 1s: 2.587 ms (3 allocations: 48 bytes)
With smaller arrays the CUDA performance is rather abysmal. For N = 256^2:
Standard cpu sum: 4.131 μs (0 allocations: 0 bytes)
Standard cuda sum: 148.908 μs (356 allocations: 11.66 KiB)
Summing using allocating a vector of 1s and dot: 16.415 μs (14 allocations: 272 bytes)
Summing just using dot and a pre-allocated vector of 1s: 12.562 μs (3 allocations: 48 bytes)
It's actually a bit worse than this: I should have included CUDA.@sync
in the benchmarks. Doing so at least brings the last two results into the ballpark of the sum function. Here is an update of the MWE.
using LinearAlgebra, CUDA, BenchmarkTools
CUDA.allowscalar(false)
function mwesum(N)
    cpuv = rand(Float32, N)
    print("\nStandard cpu sum: ")
    @btime CUDA.@sync sum($cpuv)
    cuv = CUDA.cu(cpuv)
    print("Standard cuda sum: ")
    @btime CUDA.@sync sum($cuv)
    print("Summing using allocating a vector of 1s and dot: ")
    onesum(v) = dot(v, CUDA.ones(length(v)))  # use the argument instead of capturing cuv
    @btime CUDA.@sync $onesum($cuv)
    cuvones = CUDA.ones(N)
    print("Summing just using dot and a pre-allocated vector of 1s: ")
    onesum2(v) = dot(v, cuvones)
    @btime CUDA.@sync $onesum2($cuv)
end
sleep(0.5)
mwesum(256^2)
This gives me:
Standard cpu sum: 4.471 μs (0 allocations: 0 bytes)
Standard cuda sum: 255.800 μs (366 allocations: 11.83 KiB)
Summing using allocating a vector of 1s and dot: 108.499 μs (24 allocations: 448 bytes)
Summing just using dot and a pre-allocated vector of 1s: 106.599 μs (13 allocations: 224 bytes)
If the issue is worth reopening, perhaps it should be moved to CUDA.jl?
Sure, feel free to open an issue about the performance on small arrays. Be aware, though, that the kernel launch overhead alone is several µs, let alone transferring the memory, so it's never going to be fast. And it's not possible to fall back to a CPU-based implementation here; that should be done at a higher level.
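One shape such a higher-level fallback could take, as a hedged sketch (the 2^16 cutoff is an assumption, not a measured crossover, and would need tuning per machine):

```julia
# Hypothetical size-based fallback: for small arrays, copy to the host and sum
# there; for large arrays, stay on the GPU where the launch overhead amortizes.
using CUDA

function adaptive_sum(x::CuArray; threshold::Int = 2^16)
    if length(x) < threshold
        sum(Array(x))  # pay the device-to-host transfer, skip the GPU reduction
    else
        sum(x)         # GPU reduction wins once the array is large enough
    end
end
```
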
What do you mean by transferring the memory? Normally the array is in GPU memory to begin with if the code uses CuArrays.
Falling back to a CPU sum is not very fast either, because that does involve a memory transfer. For small arrays it's faster than the CUDA sum, but much slower than avoiding the transfer altogether. I added a case for this to the example:
using LinearAlgebra, CUDA, BenchmarkTools
CUDA.allowscalar(false)
function mwesum(N)
    cpuv = rand(Float32, N)
    print("\nStandard cpu sum: ")
    @btime CUDA.@sync sum($cpuv)
    cuv = CUDA.cu(cpuv)
    print("Standard cuda sum: ")
    @btime CUDA.@sync sum($cuv)
    print("Transfer to CPU and sum: ")
    @btime CUDA.@sync sum(collect($cuv))
    print("Summing using allocating a vector of 1s and dot: ")
    onesum(v) = dot(v, CUDA.ones(length(v)))  # use the argument instead of capturing cuv
    @btime CUDA.@sync $onesum($cuv)
    cuvones = CUDA.ones(N)
    print("Summing just using dot and a pre-allocated vector of 1s: ")
    onesum2(v) = dot(v, cuvones)
    @btime CUDA.@sync $onesum2($cuv)
end
sleep(0.5)
mwesum(256^2)
Output:
Standard cpu sum: 10.966 μs (10 allocations: 176 bytes)
Standard cuda sum: 151.740 μs (366 allocations: 11.83 KiB)
Transfer to CPU and sum: 62.251 μs (14 allocations: 256.28 KiB)
Summing using allocating a vector of 1s and dot: 24.281 μs (25 allocations: 464 bytes)
Summing just using dot and a pre-allocated vector of 1s: 20.161 μs (14 allocations: 240 bytes)