CUDAnative.jl's Issues

Non-bitstype immutables are passed by value

using CUDAnative, CUDAdrv
const cu = CUDAnative
dev = CuDevice(0)
ctx = CuContext(dev)
@target ptx function kernel(A, b)
    @inbounds A[1] = imag(b)
    nothing
end
A = CuArray(zeros(Float32, (1,)));
x = Complex64(2,2)
@cuda (1, 1) kernel(A, x)

Array(A)
--> ERROR: Illegal memory access (CUDA error #700, ERROR_ILLEGAL_ADDRESS)

Replacing x with a CuArray of length one works!
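A hedged sketch of that workaround (kernel_ref and x_ref are illustrative names), reusing the A array from above: wrap the complex value in a one-element CuArray so it is passed by reference rather than by value.

@target ptx function kernel_ref(A, b)
    @inbounds A[1] = imag(b[1])
    nothing
end

x_ref = CuArray([Complex64(2, 2)])
@cuda (1, 1) kernel_ref(A, x_ref)
Array(A)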

Unused arguments can emit unsupported code

This only seems to happen in the context of our throw hack, where we null any argument and emit a plain trap. However, the argument is still being emitted, which can result in use of an unsupported language feature:

using CUDAnative

function kernel_throwarg()
    throw(ArgumentError("foo"))
    return nothing
end

CUDAnative.code_llvm(kernel_throwarg, Tuple{})

Possible solutions:

  • hook throw at the inference level, letting DCE kill the argument
  • have proper string support
  • have proper GC support
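As a minimal sketch of the first option above (device_throw is a hypothetical helper, not CUDAnative API): a device-side throw that takes no argument at all, so dead-code elimination can drop the unsupported ArgumentError construction entirely.

# Trap instead of throwing; takes no argument, so the error object becomes dead code.
@inline device_throw() = ccall("llvm.trap", llvmcall, Void, ())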

Support for libNVVM

When skimming the code, it looks like things are flexible enough to adapt this approach for OpenCL code generation.
Can we come up with an up-for-grabs list?
A first version doesn't need to be complete and we can add to that list as we go ;)
I assume one could reuse the LLVM-IR generation/linking code almost entirely? Probably switch out a few passes?
Then of course, hooking up the intrinsics and OpenCL driver will be quite a bit of work!
But it should already be possible to create a minimal working prototype, no?

cc: @vchuravy

NVPTX cannot return [2 x i32]

using CUDAnative

const T = Int32

@noinline bar() = (T(42),T(42))
function foo()
    x,_ = bar()
    return x
end
Variables:
  #self#::#foo
  x::Int32
  _::Int32
  #temp#::Int64

Body:
  begin 
      SSAValue(0) = $(Expr(:invoke, MethodInstance for bar(), :(Main.bar)))
      #temp#::Int64 = $(QuoteNode(1))
      SSAValue(3) = (Base.getfield)(SSAValue(0), 1)::Int32
      SSAValue(4) = (Base.add_int)(1, 1)::Int64
      x::Int32 = SSAValue(3)
      #temp#::Int64 = SSAValue(4)
      SSAValue(5) = (Base.getfield)(SSAValue(0), 2)::Int32
      SSAValue(6) = (Base.add_int)(2, 1)::Int64
      _::Int32 = SSAValue(5)
      #temp#::Int64 = SSAValue(6) # line 13:
      return x::Int32
  end::Int32
; Function Attrs: nounwind
define i32 @julia_foo_68115() local_unnamed_addr #0 !dbg !6 {
top:
  %ptls_i8 = tail call i8* asm "movq %fs:0, $0;\0Aaddq $$-10896, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #1
; Filename: /home/tbesard/Projects/Julia-CUDA/CUDAnative/bug.jl
; Source line: 12
  %0 = tail call [2 x i32] @julia_bar_68116(), !dbg !8
  %.elt = extractvalue [2 x i32] %0, 0, !dbg !8
; Source line: 13
  ret i32 %.elt, !dbg !9
}
Unknown return type
UNREACHABLE executed at /home/tbesard/Projects/Julia-CUDA/julia/deps/srccache/llvm-3.9.1/lib/Target/NVPTX/NVPTXAsmPrinter.cpp:379!

It does work with T = Int64, because that is then returned via a [2 x i64]* pointer instead. So this is probably ABI-violating.

Note that this breaks the very handy fldmod1: after JuliaLang/julia#20854, fldmod1 no longer gets inlined:

function foo()
    wid, lane = fldmod1(threadIdx().x, warpsize())
    return lane
end
...
%1 = tail call [2 x i32] @julia_fldmod1_68201(i32 %0, i32 42), !dbg !11
...

We should probably also look into getting these functions inlined all the time.
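A hedged workaround sketch in the meantime: an always-inlined two-result helper (same definition as Base's fldmod1; the names here are illustrative), so no [2 x i32] aggregate return is ever emitted.

using CUDAnative

@inline inl_fldmod1(x, y) = (fld1(x, y), mod1(x, y))

function foo_inlined()
    wid, lane = inl_fldmod1(threadIdx().x, warpsize())
    return lane
end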

CUDA 9 ptxas doesn't support sm_20

This shows as follows:

ptxas fatal   : Value 'sm_20' is not defined for option 'gpu-name'

I don't want to duplicate the compatibility database which now lives in CUDArt though...
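A minimal sketch of how target selection could avoid this (not CUDAnative's actual logic, and the capability list here is illustrative): drop sm_2x from the candidates when the local toolkit is CUDA 9.0 or newer, then pick the highest candidate not exceeding the device capability.

function pick_target(device_cap::VersionNumber, toolkit::VersionNumber)
    candidates = [v"2.0", v"3.0", v"3.5", v"5.0", v"6.0"]
    # CUDA 9 removed Fermi (sm_2x) support from ptxas
    toolkit >= v"9.0" && filter!(c -> c >= v"3.0", candidates)
    return maximum(filter(c -> c <= device_cap, candidates))
end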

Outlined throw not supported

In CuArrays I see a couple of errors under Pkg.test() that don't come up under julia runtests.jl.

Indexing: Error During Test
  Test threw an exception of type ErrorException
  Expression: collect(xs[1:2, 2]) == (collect(xs))[1:2, 2]
  error compiling index_kernel: emit_allocobj for /home/mike/.julia/v0.6/CuArrays/src/indexing.jl:37 requires the dynamic_alloc language feature, which is disabled

PermuteDims: Error During Test
  Test threw an exception of type CUDAdrv.CuError
  Expression: collect(permutedims(xs, (2, 1))) == permutedims(collect(xs), (2, 1))
  ptxas fatal   : Unresolved extern function 'jl_bounds_error_unboxed_int'
  CUDA error: no kernel image is available for execution on the device (code #209, ERROR_NO_BINARY_FOR_GPU)

I assume this must be due to things like forced bounds checks or lower optimisation levels.

Two test failures on Tesla P100 / Power8

GPU is a Tesla P100-SXM2-16GB with CUDA 8.0.

julia> versioninfo()
Julia Version 0.6.0-pre.beta.60
Commit 7421e11 (2017-04-06 23:02 UTC)
Platform Info:
  OS: Linux (powerpc64le-linux-gnu)
  CPU: unknown
  WORD_SIZE: 64
  BLAS: libopenblas (NO_AFFINITY POWER8)
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, generic)

Test 1

child functions: Test Failed
  Expression: ismatch(@r_str("call .+ @julia_codegen_child_"), ir)
Stacktrace:
 [1] macro expansion at /work/scratch/vchuravy/juliapkg/v0.6/CUDAnative/test/codegen.jl:48 [inlined]
 [2] macro expansion at ./test.jl:853 [inlined]
 [3] macro expansion at /work/scratch/vchuravy/juliapkg/v0.6/CUDAnative/test/codegen.jl:0 [inlined]
 [4] macro expansion at ./test.jl:853 [inlined]
 [5] macro expansion at /work/scratch/vchuravy/juliapkg/v0.6/CUDAnative/test/codegen.jl:0 [inlined]
 [6] macro expansion at ./test.jl:853 [inlined]
 [7] anonymous at ./<missing>:?

IR:

; Function Attrs: norecurse nounwind readnone
define void @julia_codegen_parent_69063(i64) local_unnamed_addr #0 !dbg !6 {
top:
  ret void, !dbg !8
}

Test 2

child functions: Test Failed
  Expression: ismatch(@r_str("call.uni \\(retval0\\),\\s+julia_ptx_child_", "m"), asm)
Stacktrace:
 [1] macro expansion at /work/scratch/vchuravy/juliapkg/v0.6/CUDAnative/test/codegen.jl:75 [inlined]
 [2] macro expansion at ./test.jl:853 [inlined]
 [3] macro expansion at /work/scratch/vchuravy/juliapkg/v0.6/CUDAnative/test/codegen.jl:0 [inlined]
 [4] macro expansion at ./test.jl:853 [inlined]
 [5] macro expansion at /work/scratch/vchuravy/juliapkg/v0.6/CUDAnative/test/codegen.jl:0 [inlined]
 [6] macro expansion at ./test.jl:853 [inlined]
 [7] anonymous at ./<missing>:?
//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

	.file	1 "/work/scratch/vchuravy/juliapkg/v0.6/CUDAnative/test/codegen.jl"
	// .globl	julia_ptx_parent_69186

.visible .func julia_ptx_parent_69186(
	.param .b64 julia_ptx_parent_69186_param_0
)
{


	.loc 1 71 0
	ret;
}

	// .globl	julia_ptx_child_69187
.visible .func  (.param .b64 func_retval0) julia_ptx_child_69187(
	.param .b64 julia_ptx_child_69187_param_0
)
{
	.reg .s64 	%rd<3>;

	ld.param.u64 	%rd1, [julia_ptx_child_69187_param_0];
	.loc 1 70 0
	add.s64 	%rd2, %rd1, 1;
	st.param.b64	[func_retval0+0], %rd2;
	ret;
}

Tuples not supported by @cuda

using CUDAnative, CUDAdrv
dev = CUDAdrv.CuDevice(0)
ctx = CUDAdrv.CuContext(dev)
function broadcast_index{T, N}(idx, arg::AbstractArray{T,N}, keep, Idefault)
    idx = Base.Broadcast.newindex(idx, keep, Idefault)
    @inbounds return arg[idx]::T
end
function broadcast_index{T}(idx, arg::T, keep, Idefault)
    arg::T
end
@target ptx function broadcast_kernel(keeps::Tuple, Idefaults::Tuple, A, f, arg_1, arg_2)
    i = Int((blockIdx().x-1) * blockDim().x + threadIdx().x)
    @inbounds if i < length(A) && i > 0
        idx = CartesianIndex(ind2sub(size(A), Int(i)))
        k1 = keeps[1] #(true, true) #<--replacing this by the actual value
        Id = Idefaults[1] #(1, 1) # will make error disappear
        idx2 = Base.Broadcast.newindex(idx, k1, Id)
        A[idx] = f(
            arg_1[idx2],
            arg_2
        )
    end
    nothing
end
function cu_broadcast(f, A, args...)
    len = prod(size(A))
    threads = min(len, 1024)
    blocks = ceil(Int, len/threads)
    @show (threads, blocks)
    shape = indices(A)
    Base.Broadcast.check_broadcast_shape(shape, args...)
    keeps, Idefaults = Base.Broadcast.map_newindexer(shape, args)
    @show keeps, Idefaults
    @cuda (threads, blocks) broadcast_kernel(keeps, Idefaults, A, f, args...)
end
A = CuArray(rand(Float32, 40,40));
B = CuArray(rand(Float32, 40,40));
cu_broadcast(*, A, B, 8f0)
Array(A)

reported by @SimonDanisch

Common interface between backends

Now that I have a working prototype for GLSL transpilation, it'd be nice to have the same julia code compile to GLSL and CUDAnative without hassle!

Shared Memory

In GLSL, it seems keywords like shared are just one qualifier out of a set of similar keywords. So I had the idea of creating an intrinsic type Qualified{Qualifier, Type}.
So you could create shared memory like this:

Qualified{:shared, StaticVector{10, Float32}}()

I'm not sure how well this can work with CUDAnative's code generation...
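A hypothetical sketch of such a wrapper (the names are illustrative; this is not an existing CUDAnative type): a zero-overhead marker that tags a value with a backend qualifier such as :shared.

# Wrap a payload value together with a compile-time qualifier.
struct Qualified{Qualifier, T}
    value::T
end

# e.g. shared storage backed by a plain tuple:
shared = Qualified{:shared, NTuple{32, Float32}}(ntuple(_ -> 0f0, 32))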

intrinsics

There are a lot of shared intrinsics, like memory barriers, work-group index getters, etc.
The problem with them is that we'd need to dispatch on some backend type to select the correct intrinsic name for each backend.
I could in theory just mirror the CUDA names, since I go through the Julia code anyway and can just replace them with the correct names for GLSL.
Any thoughts on this?
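A hypothetical sketch of that dispatch (GLSLBackend and glsl_barrier are made-up placeholders, not real APIs): each backend gets a singleton type, and the shared intrinsics dispatch on it.

abstract type Backend end
struct CUDABackend <: Backend end
struct GLSLBackend <: Backend end

# One generic entry point, backend-specific lowering via dispatch.
synchronize(::CUDABackend) = CUDAnative.sync_threads()
synchronize(::GLSLBackend) = glsl_barrier()   # placeholder for the GLSL intrinsic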

wrong sin is an unrecoverable error

If I call sin.(xs) instead of CUDAnative.sin.(xs) with a GPUArray, for example, this gives an "invalid program counter" error that persists even after you've fixed the call. It would be nice if CUDAnative could recover from this error without a restart.

Proper tracking of compiled functions

Moved over from bugs/host_after_ptx.jl:

#=
This doesn't work because functions are only compiled once, disregarding @target.

More specifically, jl_compile_linfo only compiles if functionObject is not set.
If the function has already been compiled as part of another cycle, its module
has already been finalized, and in turn consumed (ie. removed from
module_for_fname) by finalize_function.

This means that later uses will not trigger a new compilation because the
functionObject is already set, and consequently no module will be finalized
which means no entry in module_for_fname. Consequently, finalization of the
parent function will fail because the required module cannot be found.
=#

@noinline child(x) = x+1

function f_host()
    child(10)
end

@target ptx function f_ptx()
    child(10)
    return nothing
end

function main()
    code_native(f_ptx, ())
    code_native(f_host, ())
end

main()

getindex/setindex! should accept Int32

Index primitives (threadIdx().x and friends) yield 32-bit integers, which aren't currently accepted by CuDeviceArray's getindex/setindex!. Shouldn't the AbstractArray interface enable this?
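A hedged sketch (assuming CuDeviceArray already has Int-based getindex and setindex! methods, which these fall back to): widen any Integer index, as the AbstractArray interface does.

using CUDAnative

# Fallbacks that convert Int32 (or any other Integer) indices to Int.
Base.getindex(A::CuDeviceArray, i::Integer) = A[Int(i)]
Base.setindex!(A::CuDeviceArray, v, i::Integer) = (A[Int(i)] = v)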

@cuprintf generated function throws with KeyError

A couple of issues with this example. First, according to the online docs, the @cuprintf statement needs %ld rather than %d. Secondly, with that change in place, line 7 in the message below refers to my return statement, which is return nothing (corrected from the bare return in the example code). Note that with the @cuprintf statement disabled there is no problem and the kernel returns correctly; there is also no problem if @cuprintf only prints a simple string with no parameters.
I would suggest a PR, but I can't get it to work.

> julia6 hello.jl 
ERROR: LoadError: error compiling hello_world: emit_invoke for /home/colin/cuda/hello.jl:7 requires the runtime language feature, which is disabled
Stacktrace:
 [1] _dump_function_linfo(::Core.MethodInstance, ::UInt64, ::Bool, ::Bool, ::Bool, ::Bool, ::Symbol, ::Bool, ::Base.CodegenParams) at ./reflection.jl:592
 [2] _dump_function(::ANY, ::ANY, ::Bool, ::Bool, ::Bool, ::Bool, ::Symbol, ::Bool, ::Base.CodegenParams) at ./reflection.jl:580
...

Using

julia> versioninfo()
Julia Version 0.6.0-dev.2481
Commit 2d86c42* (2017-01-30 22:50 UTC)
Platform Info:
  OS: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

@cuprintf format string sensitive to variable position

Consider the following toy code, which element-wise scales each value in the matrix by 3. Note that the same pair of variables is printed each time, with a slight variation in their positions in the string ("i j j i"). Format A prints correctly, but format B reports zeros for both i and j. The problem seems to be tied to the position in the format string, not to one variable or the other, nor to the order in which they are defined in the kernel.

using CUDAdrv, CUDAnative

function kernel_mmul(a, c)
    one = 1
    two = 2
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    j = (blockIdx().y-1) * blockDim().y + threadIdx().y
    c[i,j] = a[i,j].*3
    @cuprintf("A %d %ld %d %ld\n",i,j,j,i) # format A
    @cuprintf("B %d %d %d %d\n",i,j,j,i)  # format B    
    return nothing
end
dev = CuDevice(0)
ctx = CuContext(dev)
a = Int32[1 2 3; 2 3 1; 3 1 2]
d_a = CuArray(a)
d_c = similar(d_a) 
@cuda ((1,1),(3,3)) kernel_mmul(d_a, d_c)
c = Array(d_c)
println(a)
println(c)
destroy(ctx)

with output:

A 1 1 1 1
A 2 1 1 2
A 3 1 1 3
A 1 2 2 1
A 2 2 2 2
A 3 2 2 3
A 1 3 3 1
A 2 3 3 2
A 3 3 3 3
B 1 0 1 0
B 2 0 1 0
B 3 0 1 0
B 1 0 2 0
B 2 0 2 0
B 3 0 2 0
B 1 0 3 0
B 2 0 3 0
B 3 0 3 0
Int32[1 2 3; 2 3 1; 3 1 2]
Int32[3 6 9; 6 9 3; 9 3 6]

`fldmod1` results in unsupported array return type

The following snippet results in IR returning a [2 x i32], something NVPTX cannot handle as of LLVM 3.9.1:

using CUDAdrv, CUDAnative

function kernel(a, x, y)
    x1, y1 = fldmod1(x, y)
    a[0] = x1 + y1

    return nothing
end

CUDAnative.code_ptx(kernel, Tuple{CuDeviceArray{UInt32,1}, UInt32, UInt32})
Unknown return type
UNREACHABLE executed at julia/deps/srccache/llvm-3.9.1/lib/Target/NVPTX/NVPTXAsmPrinter.cpp:379!


Thread 1 "julia" received signal SIGABRT, Aborted.
0x00007ffff6cbb04f in raise () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff6cbb04f in raise () from /usr/lib/libc.so.6
#1  0x00007ffff6cbc47a in abort () from /usr/lib/libc.so.6
#2  0x00007ffff3aa57f0 in llvm::llvm_unreachable_internal (msg=0x7ffff5d3db60 "Unknown return type", file=0x7ffff5d3da57 "julia/deps/srccache/llvm-3.9.1/lib/Target/NVPTX/NVPTXAsmPrinter.cpp", line=379)
    at julia/deps/srccache/llvm-3.9.1/lib/Support/ErrorHandling.cpp:118
#3  0x00007ffff522d3c3 in llvm::NVPTXAsmPrinter::printReturnValStr (this=0xd747f0, F=0x246cc38, O=...) at julia/deps/srccache/llvm-3.9.1/lib/Target/NVPTX/NVPTXAsmPrinter.cpp:379
#4  0x00007ffff522f5ab in llvm::NVPTXAsmPrinter::emitDeclaration (this=0xd747f0, F=0x246cc38, O=...) at julia/deps/srccache/llvm-3.9.1/lib/Target/NVPTX/NVPTXAsmPrinter.cpp:627
#5  0x00007ffff5230a28 in llvm::NVPTXAsmPrinter::emitDeclarations (this=0xd747f0, M=..., O=...) at julia/deps/srccache/llvm-3.9.1/lib/Target/NVPTX/NVPTXAsmPrinter.cpp:763
#6  0x00007ffff522de4d in llvm::NVPTXAsmPrinter::emitGlobals (this=0xd747f0, M=...) at julia/deps/srccache/llvm-3.9.1/lib/Target/NVPTX/NVPTXAsmPrinter.cpp:883
#7  0x00007ffff522dbba in llvm::NVPTXAsmPrinter::EmitFunctionEntryLabel (this=0xd747f0) at julia/deps/srccache/llvm-3.9.1/lib/Target/NVPTX/NVPTXAsmPrinter.cpp:456
#8  0x00007ffff449e7ff in llvm::AsmPrinter::EmitFunctionHeader (this=0xd747f0) at julia/deps/srccache/llvm-3.9.1/lib/CodeGen/AsmPrinter/AsmPrinter.cpp:576
#9  0x00007ffff449f183 in llvm::AsmPrinter::EmitFunctionBody (this=0xd747f0) at julia/deps/srccache/llvm-3.9.1/lib/CodeGen/AsmPrinter/AsmPrinter.cpp:842
#10 0x00007ffff44acd38 in llvm::AsmPrinter::runOnMachineFunction (this=0xd747f0, MF=...) at julia/deps/srccache/llvm-3.9.1/include/llvm/CodeGen/AsmPrinter.h:209
#11 0x00007ffff523923b in llvm::NVPTXAsmPrinter::runOnMachineFunction (this=0xd747f0, F=...) at julia/deps/srccache/llvm-3.9.1/lib/Target/NVPTX/NVPTXAsmPrinter.h:327
#12 0x00007ffff3fa6f56 in llvm::MachineFunctionPass::runOnFunction (this=0xd747f0, F=...) at julia/deps/srccache/llvm-3.9.1/lib/CodeGen/MachineFunctionPass.cpp:60
#13 0x00007ffff3d1325f in llvm::FPPassManager::runOnFunction (this=0x486dbc0, F=...) at julia/deps/srccache/llvm-3.9.1/lib/IR/LegacyPassManager.cpp:1526
#14 0x00007ffff3d13575 in llvm::FPPassManager::runOnModule (this=0x486dbc0, M=...) at julia/deps/srccache/llvm-3.9.1/lib/IR/LegacyPassManager.cpp:1547
#15 0x00007ffff3d13d0a in (anonymous namespace)::MPPassManager::runOnModule (this=0x2939480, M=...) at julia/deps/srccache/llvm-3.9.1/lib/IR/LegacyPassManager.cpp:1603
#16 0x00007ffff3d13836 in llvm::legacy::PassManagerImpl::run (this=0x367b6e0, M=...) at julia/deps/srccache/llvm-3.9.1/lib/IR/LegacyPassManager.cpp:1706
#17 0x00007ffff3d14221 in llvm::legacy::PassManager::run (this=0x7fffffffabe0, M=...) at julia/deps/srccache/llvm-3.9.1/lib/IR/LegacyPassManager.cpp:1737
#18 0x00007ffff522731b in LLVMTargetMachineEmit (T=0x3aab7d0, M=0x26f6a90, OS=..., codegen=LLVMAssemblyFile, ErrorMessage=0x7ffdee4b04b0)
    at julia/deps/srccache/llvm-3.9.1/lib/Target/TargetMachineC.cpp:205
#19 0x00007ffff52273bb in LLVMTargetMachineEmitToMemoryBuffer (T=0x3aab7d0, M=0x26f6a90, codegen=LLVMAssemblyFile, ErrorMessage=0x7ffdee4b04b0, OutMemBuf=0x7ffdee4b04c0)
    at julia/deps/srccache/llvm-3.9.1/lib/Target/TargetMachineC.cpp:229
#20 0x00007ffdd452fb3a in ?? ()
#21 0x00007ffdee4b04c0 in ?? ()
#22 0x00007ffdee4b04b0 in ?? ()
#23 0xc2f8aaab3a571500 in ?? ()
#24 0x00007ffdee4b04b0 in ?? ()
#25 0x00007fffffffae38 in ?? ()
#26 0x00007ffdee4b04c0 in ?? ()
#27 0x00007ffdedbb7130 in ?? ()
#28 0x00007ffff7fbd1c8 in ?? ()
#29 0x00007fffffffae10 in ?? ()
#30 0x00007ffdd452fddf in ?? ()
#31 0x00007ffdee3f5090 in ?? ()
#32 0x0000000000000000 in ?? ()


(gdb) f 3
#3  0x00007ffff522d3c3 in llvm::NVPTXAsmPrinter::printReturnValStr (this=0xd747f0, F=0x246cc38, O=...) at julia/deps/srccache/llvm-3.9.1/lib/Target/NVPTX/NVPTXAsmPrinter.cpp:379
379           llvm_unreachable("Unknown return type");

(gdb) call Ty->dump()
[2 x i32]

(gdb) call Ty->getTypeID()
$1 = llvm::Type::ArrayTyID


(gdb) f 13
#13 0x00007ffff3d1325f in llvm::FPPassManager::runOnFunction (this=0x486dbc0, F=...) at julia/deps/srccache/llvm-3.9.1/lib/IR/LegacyPassManager.cpp:1526
1526          LocalChanged |= FP->runOnFunction(F);

(gdb) call F.dump()
; Function Attrs: nounwind
define void @julia_kernel_66759(%CuDeviceArray.0* nocapture readonly, i32, i32) local_unnamed_addr #0 !dbg !11 {
top:
  %ptls_i8 = tail call i8* asm "movq %fs:0, $0;\0Aaddq $$-2680, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #3
  tail call void @llvm.dbg.value(metadata %CuDeviceArray.0* %0, i64 0, metadata !24, metadata !29), !dbg !30
  tail call void @llvm.dbg.value(metadata i32 %1, i64 0, metadata !25, metadata !31), !dbg !30
  tail call void @llvm.dbg.value(metadata i32 %2, i64 0, metadata !26, metadata !31), !dbg !30
  %3 = tail call [2 x i32] @julia_fldmod1_66763(i32 %1, i32 %2) #4, !dbg !30
  %.elt = extractvalue [2 x i32] %3, 0, !dbg !30
  tail call void @llvm.dbg.value(metadata i32 %.elt, i64 0, metadata !27, metadata !31), !dbg !30
  %.elt2 = extractvalue [2 x i32] %3, 1, !dbg !30
  tail call void @llvm.dbg.value(metadata i32 %.elt2, i64 0, metadata !28, metadata !31), !dbg !30
  %4 = add i32 %.elt, %.elt2, !dbg !32
  %5 = getelementptr %CuDeviceArray.0, %CuDeviceArray.0* %0, i64 0, i32 1, !dbg !32
  %6 = load i32*, i32** %5, align 8, !dbg !32, !tbaa !33
  %7 = getelementptr i32, i32* %6, i64 -1, !dbg !32
  store i32 %4, i32* %7, align 8, !dbg !32, !tbaa !36
  ret void, !dbg !38
}

cuda macro requires a literal tuple

Extremely minor, but I just figured I'd note all these small usability issues -- feel free to ignore if it's not what you're going for right now.

This has become a very common pattern for me:

blk, thr = cudims(length(C))
@cuda (blk, thr) ...

which is a bit boilerplate-y. I think the overhead of keyword args, noted in the code, can fairly easily be avoided by splatting a tuple -- or perhaps it will just get fixed with named tuples in future.

wrong alignment for staticSharedMem

using CUDAnative, CUDAdrv, Colors
dev = CUDAdrv.CuDevice(0)
ctx = CUDAdrv.CuContext(dev)

@inline function reduce_block{T}(v0::T)
    shared = CUDAnative.@cuStaticSharedMem(T, 32)
    @inbounds shared[Cuint(1)] = v0
    return
end

@cuda (1, 1) reduce_block((0f0, 0f0, 0f0))
sizeof(RGB{Float32})

Gives: error parsing LLVM-IR, alignment needs to be power of two
Needs a fix here:
https://github.com/JuliaGPU/CUDAnative.jl/blob/master/src/device/intrinsics/memory_shared.jl#L70
I guess we could have something like:

function cu_alignment(::Type{T}) where T
    x = sizeof(T)
    nextnum = ceil(Int, log2(x))
    return 2^nextnum
end

The problem is that we need to return the alignment of the element type...
I'm not sure how to implement that for arbitrary Julia types. I think I read somewhere that structs are aligned to their biggest field? So maybe just find that and call cu_alignment on it?
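For reference, a shorter formulation of the same rounding (a sketch: nextpow rounds up to the next power of two, which is what the log2/2^ computation above does by hand, with a guard for zero-size types; the element-type question remains open):

cu_alignment(::Type{T}) where T = nextpow(2, max(1, sizeof(T)))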

Broken test does not fail

Using:

Julia Version 0.6.0-dev.2464
Commit d233cf9* (2017-01-29 05:47 UTC)
Platform Info:
  OS: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Pkg.test("CUDAdrv") passes but Pkg.test("CUDAnative") fails with two errors:

> Pkg.test("CUDAnative")
INFO: Testing CUDAnative
Test Summary:  | 
base interface | No tests
Test Summary:   | Pass  Total
code generation |   15     15
Test Summary: | Pass  Total
execution     |   10     10
Test Summary: | Pass  Total
device arrays |    2      2
printing: Error During Test
  Got an exception of type UndefVarError outside of a @test
  UndefVarError: ret not defined
  Stacktrace:
   [1] macro expansion at /home/colin/.julia/v0.6/CUDAnative/test/intrinsics.jl:34 [inlined]
   [2] macro expansion at ./test.jl:852 [inlined]
   [3] macro expansion at /home/colin/.julia/v0.6/CUDAnative/test/intrinsics.jl:0 [inlined]
   [4] macro expansion at ./test.jl:852 [inlined]
   [5] macro expansion at /home/colin/.julia/v0.6/CUDAnative/test/intrinsics.jl:0 [inlined]
   [6] macro expansion at ./test.jl:852 [inlined]
   [7] anonymous at ./<missing>:?
   [8] include_from_node1(::String) at ./loading.jl:539
   [9] include(::String) at ./sysimg.jl:14

There is also an error involving CUDAdrv, reported here since the CUDAdrv tests themselves pass; it appears to be JuliaGPU/CUDAdrv.jl#10 as reported by @timholy:

down: Error During Test
  Got an exception of type CUDAdrv.CuError outside of a @test
  No kernel image available/suitable for GPU (CUDA error #209, ERROR_NO_BINARY_FOR_GPU)
  ptxas application ptx input, line 67; error   : Instruction 'shfl' requires .target sm_30 or higher
  ptxas fatal   : Ptx assembly aborted due to errors
  Stacktrace:

CUDA

> cuda/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
cuda/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 610"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 964 MBytes (1011023872 bytes)
  ( 1) Multiprocessors, ( 48) CUDA Cores/MP:     48 CUDA Cores
  GPU Max Clock rate:                            1620 MHz (1.62 GHz)
  Memory Clock rate:                             600 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 65536 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GT 610
Result = PASS

Source build requirement

Known issue, obviously, but I wanted to track the status of it somewhere.

@maleadt can you remind me what your planned workaround was, and also why we can't fix this in Base (if I'm remembering our conversation correctly)?

Testsets/try..catch mess up dispatch in testsuite

let
    @noinline child1() = return nothing
    parent1() = child1()    # static dispatch
    println(code_lowered(parent1, ()))
    code_llvm(parent1, ())
end

try
    @noinline child2() = return nothing
    parent2() = child2()    # dynamic dispatch
    println(code_lowered(parent2, ()))
    code_llvm(parent2, ())
end

yields

CodeInfo[CodeInfo(:(begin 
        nothing
        return ((Core.getfield)(#self#,:child1))()
    end))]

define void @julia_parent1_62203() #0 !dbg !5 {
top:
  ret void
}

vs

CodeInfo[CodeInfo(:(begin 
        nothing
        return ((Core.getfield)((Core.getfield)(#self#,:child2),:contents))()
    end))]

define %jl_value_t* @julia_parent2_62204(%jl_value_t*, %jl_value_t**, i32) #0 !dbg !5 {
top:
  %3 = alloca %jl_value_t**, align 8
  store volatile %jl_value_t** %1, %jl_value_t*** %3, align 8
  %ptls_i8 = call i8* asm "movq %fs:0, $0;\0Aaddq $$-2680, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #3
  %ptls = bitcast i8* %ptls_i8 to %jl_value_t***
  %4 = alloca [5 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [5 x %jl_value_t*], [5 x %jl_value_t*]* %4, i64 0, i64 0
  %5 = getelementptr [5 x %jl_value_t*], [5 x %jl_value_t*]* %4, i64 0, i64 2
  %6 = bitcast %jl_value_t** %5 to i8*
  call void @llvm.memset.p0i8.i32(i8* %6, i8 0, i32 24, i32 8, i1 false)
  %7 = bitcast [5 x %jl_value_t*]* %4 to i64*
  store i64 6, i64* %7, align 8
  %8 = bitcast i8* %ptls_i8 to i64*
  %9 = load i64, i64* %8, align 8
  %10 = getelementptr [5 x %jl_value_t*], [5 x %jl_value_t*]* %4, i64 0, i64 1
  %11 = bitcast %jl_value_t** %10 to i64*
  store i64 %9, i64* %11, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %ptls, align 8
  %12 = getelementptr [5 x %jl_value_t*], [5 x %jl_value_t*]* %4, i64 0, i64 4
  %13 = getelementptr [5 x %jl_value_t*], [5 x %jl_value_t*]* %4, i64 0, i64 3
  %14 = bitcast %jl_value_t* %0 to i64*
  %15 = load i64, i64* %14, align 8
  %16 = bitcast %jl_value_t** %5 to i64*
  store i64 %15, i64* %16, align 8
  store %jl_value_t* inttoptr (i64 140174223563688 to %jl_value_t*), %jl_value_t** %13, align 8
  %17 = call %jl_value_t* @jl_f_getfield(%jl_value_t* null, %jl_value_t** %5, i32 2)
  store %jl_value_t* %17, %jl_value_t** %12, align 8
  %18 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %12, i32 1)
  %19 = load i64, i64* %11, align 8
  store i64 %19, i64* %8, align 8
  ret %jl_value_t* %18
}

The dynamic dispatch caused by the try is obviously incompatible with GPU codegen. As a workaround, we @eval every function definition in our tests, and give it a global unique name.

Clean up the tests if this gets fixed.
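A hedged sketch of that workaround: @eval the definitions under a unique global name so they are top-level generic functions even when the test code sits in a try or testset block.

try
    kname = gensym(:child)
    @eval @noinline $kname() = return nothing
    @eval parent3() = $kname()   # static dispatch again
end
code_llvm(parent3, ())           # should match parent1's output above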

Parameterize CuDeviceArray on its address space

In some cases (i.e. shared memory allocated in the same function) LLVM can infer the AS, but in most cases (e.g. all global loads) it can't. This results in generic ld instructions, which can be slower and have different cache behaviour.

Solution: add an address space argument to the pointerref and pointerset intrinsics, allowing us to perform "AS-typed" loads from getindex and setindex!.

Alternative solution: add an AS type parameter to Ptr, but that's quite insane.

This probably accounts for the 60% slowdown in Rodinia's nw benchmark, where the overhead disappears after warm-up or when using larger data sets, possibly indicating a caching issue.
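A hypothetical sketch of the proposed parameterization (the names are illustrative): carry the address space as a type parameter so getindex/setindex! can emit AS-typed loads and stores. NVPTX numbers the address spaces generic = 0, global = 1, shared = 3.

struct DevicePtr{T, AS}
    ptr::Ptr{T}
end

const AS_Generic = 0
const AS_Global  = 1
const AS_Shared  = 3

# e.g. a pointer known to live in global memory:
# DevicePtr{Float32, AS_Global}(ptr)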

code_llvm etc

It seems strange for CUDAnative to export these functions -- given the conflict with Base, it doesn't actually make them any easier to use (you still have to qualify or import) but does stop you using the Base versions (which I end up doing quite frequently to debug specialisation issues).

Is there another reason to do this that I'm missing?

Broken Link for Compatible Julia

The introduction states "It requires a version of Julia capable of generating PTX code, such as tb/cuda at JuliaGPU/julia." but the link to tb/cuda is broken.

JIT emits identical functions twice

julia> @noinline function child(i)
               if i < 10
                   return i*i
               else
                   return (i-1)*(i+1)
               end
           end
child (generic function with 1 method)

julia> parent(a, i) = (a[0] = child(child(i)); nothing)
parent (generic function with 1 method)

CUDAnative.jl:

julia> CUDAnative.code_llvm(parent, Tuple{CuDeviceArray{Int}, Int})
define void @julia_parent_65887(%jl_value_t.8*, i64) local_unnamed_addr #0 !dbg !7 {
top:
...
  %11 = call i64 @julia_child_65889(i64 %1) #0, !dbg !12
  %12 = call i64 @julia_child_65888(i64 %11) #0, !dbg !12
...
}

Base:

julia> code_llvm(parent, Tuple{CuDeviceArray{Int}, Int})
; WARNING: This code may not match what actually runs.

define void @julia_parent_65893(%jl_value_t*, i64) #0 !dbg !5 {
top:
...
  %11 = call i64 @julia_child_65894(i64 %1) #0
  %12 = call i64 @julia_child_65894(i64 %11) #0
...
}

Probably another function caching bug in combination with the module activation hook.

Segfault using at-which

julia> using CUDAnative

julia> @which warpsize()
warpsize() in CUDAnative at CUDAnative/src/device/intrinsics.jl:237

julia> @which warpsize()+1
julia: julia/deps/srccache/llvm-3.9.1/lib/CodeGen/MachineFunctionAnalysis.cpp:49: virtual bool llvm::MachineFunctionAnalysis::runOnFunction(llvm::Function &): Assertion `!MF && "MachineFunctionAnalysis already initialized!"' failed.

cc @SimonDanisch

Broken PTX code for stand-alone functions returning values

Take the following set of functions:

@noinline function child(i)
    if i < 10
        return i*i
    else
        return (i-1)*(i+1)
    end
end

function parent(a, i)
    a[1] = child(i)
    return nothing
end

When emitting PTX starting at parent, we get:

//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

        .file   1 "/home/tbesard/Projects/Julia-CUDA/CUDAnative/bugs/llvm_retval_emission.jl"
        // .globl       julia_parent_62268
.visible .func  (.param .b64 func_retval0) julia_child_62270
(
        .param .b64 julia_child_62270_param_0
)
;

.visible .entry julia_parent_62268(
        .param .u64 julia_parent_62268_param_0,
        .param .u64 julia_parent_62268_param_1
)
{
        .reg .s32       %r<2>;
        .reg .s64       %rd<6>;

        ld.param.u64    %rd1, [julia_parent_62268_param_0];
        cvta.to.global.u64      %rd2, %rd1;
        ld.param.u64    %rd3, [julia_parent_62268_param_1];
        .loc 1 12 0
        { // callseq 0
        .reg .b32 temp_param_reg;
        .param .b64 param0;
        st.param.b64    [param0+0], %rd3;
        .param .b64 retval0;
        call.uni (retval0), 
        julia_child_62270, 
        (
        param0
        );
        ld.param.b64    %rd4, [retval0+0];
        } // callseq 0
        ld.global.u64   %rd5, [%rd2+8];
        st.u64  [%rd5], %rd4;
        .loc 1 13 0
        ret;
}

        // .globl       julia_child_62270
.visible .func  (.param .b64 func_retval0) julia_child_62270(
        .param .b64 julia_child_62270_param_0
)
{
        .reg .pred      %p<2>;
        .reg .s32       %r<2>;
        .reg .s64       %rd<6>;

        ld.param.u64    %rd1, [julia_child_62270_param_0];
        .loc 1 4 0
        setp.gt.s64     %p1, %rd1, 9;
        @%p1 bra        LBB1_2;
        bra.uni         LBB1_1;
LBB1_2:
        .loc 1 7 0
        add.s64         %rd3, %rd1, -1;
        add.s64         %rd4, %rd1, 1;
        mul.lo.s64      %rd5, %rd3, %rd4;
        st.param.b64    [func_retval0+0], %rd5;
        ret;
LBB1_1:
        .loc 1 5 0
        mul.lo.s64      %rd2, %rd1, %rd1;
        st.param.b64    [func_retval0+0], %rd2;
        ret;
}

Note how the child function is declared twice: once as a forward declaration and once as the regular definition:

.visible .func  (.param .b64 func_retval0) julia_child_62270
(
        .param .b64 julia_child_62270_param_0
)

However, if we emit code starting from the child function, the return value parameter is missing from the definition:

//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

        .file   1 "/home/tbesard/Projects/Julia-CUDA/CUDAnative/bugs/llvm_retval_emission.jl"
        // .globl       julia_child_62327

.visible .entry julia_child_62327(
        .param .u64 julia_child_62327_param_0
)
{
        .reg .pred      %p<2>;
        .reg .s32       %r<2>;
        .reg .s64       %rd<6>;

        ld.param.u64    %rd1, [julia_child_62327_param_0];
        .loc 1 4 0
        setp.gt.s64     %p1, %rd1, 9;
        @%p1 bra        LBB0_2;
        bra.uni         LBB0_1;
LBB0_2:
        .loc 1 7 0
        add.s64         %rd3, %rd1, -1;
        add.s64         %rd4, %rd1, 1;
        mul.lo.s64      %rd5, %rd3, %rd4;
        st.param.b64    [func_retval0+0], %rd5;
        ret;
LBB0_1:
        .loc 1 5 0
        mul.lo.s64      %rd2, %rd1, %rd1;
        st.param.b64    [func_retval0+0], %rd2;
        ret;
}

This understandably confuses ptxas when trying to generate SASS from that function:

ptxas application ptx input, line 30; error   : State space mismatch between instruction and address in instruction 'st'
ptxas application ptx input, line 35; error   : State space mismatch between instruction and address in instruction 'st'
ptxas application ptx input, line 30; error   : Unknown symbol 'func_retval0'
ptxas application ptx input, line 30; fatal   : Label expected for forward reference of 'func_retval0'
ptxas fatal   : Ptx assembly aborted due to errors

ptxas does support generating SASS for non-entry-point functions (take the PTX function definition for child in the first dump of this report; it compiles fine with ptxas).

Repro for all this:

using CUDAnative

@noinline function child(i)
    if i < 10
        return i*i
    else
        return (i-1)*(i+1)
    end
end

function parent(a, i)
    a[1] = child(i)
    return nothing
end

# this works
CUDAnative.code_ptx( parent, Tuple{CuDeviceArray{Int,1}, Int})
CUDAnative.code_sass(parent, Tuple{CuDeviceArray{Int,1}, Int})

# this doesn't: LLVM seems to forget the `(.param .b64 func_retval0)` return parameter
#               when emitting the child function on its own
CUDAnative.code_ptx( child, Tuple{Int})
CUDAnative.code_sass(child, Tuple{Int})

Looks like an LLVM bug, but I haven't looked at it in detail.

Bounds checking only partially implemented

Seems to miss checkbounds_indices. Repro:

using CUDAnative

@target ptx function foo(x::Float64, a)
    a[1] = round(UInt,x)
    nothing
end

code_llvm(foo, (Float64, CuDeviceArray{UInt}))

Yields:

ERROR: LoadError: error compiling foo: JL_TARGET_PTX target does not support generic call to "checkbounds_indices"
 in _dump_function(::Core.MethodInstance, ::Bool, ::Bool, ::Bool, ::Bool) at ./reflection.jl:513
 in _dump_function(::Any, ::Any, ::Bool, ::Bool, ::Bool, ::Bool) at ./reflection.jl:506
 in code_llvm at ./reflection.jl:539 [inlined] (repeats 2 times)
 in code_llvm(::Any, ::Any) at ./reflection.jl:541
 in include_from_node1(::String) at ./loading.jl:541
 in process_options(::Base.JLOptions) at ./client.jl:262
 in _start() at ./client.jl:326

Add detailed installation instructions

I found it a bit of a challenge getting CUDAnative.jl to build. The steps I needed were:

  1. brew cask install cuda
  2. export DYLD_LIBRARY_PATH="/usr/local/cuda/lib"

CUDA 9.0 support

It would be nice to have support for the latest release, and to be able to build the whole CUDA setup for julia in a less hotchpotch fashion.
The release candidate is here:
https://developer.nvidia.com/cuda-release-candidate-download
The corresponding CUDNN release is here (albeit behind login):
https://developer.nvidia.com/cudnn

Currently none of the major frameworks support this, so if Julia were to be first, we could get great people on board.
I'm filing this issue because I am desperately failing to build the GPUArrays package.

Refuse device functions with non-native CC

We should require the Julia native CC (ie. julia_):

julia> foo() = (bar(1); return nothing)
foo (generic function with 1 method)

julia> @noinline bar(i...) = i[1]+1
bar (generic function with 1 method)

julia> @CUDAnative.code_llvm foo()

define void @julia_foo_60827() local_unnamed_addr #0 !dbg !7 {
top:
  %ptls_i8 = tail call i8* asm "movq %fs:0, $0;\0Aaddq $$-10888, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #2
  %ptls = bitcast i8* %ptls_i8 to i8****
  %0 = alloca [3 x i8**], align 8
  %.sub = getelementptr inbounds [3 x i8**], [3 x i8**]* %0, i64 0, i64 0
  %1 = getelementptr [3 x i8**], [3 x i8**]* %0, i64 0, i64 2
  %2 = bitcast i8*** %1 to i64*
  store i64 0, i64* %2, align 8
  %3 = bitcast [3 x i8**]* %0 to i64*
  store i64 2, i64* %3, align 8, !tbaa !9
  %4 = bitcast i8* %ptls_i8 to i64*
  %5 = load i64, i64* %4, align 8
  %6 = getelementptr [3 x i8**], [3 x i8**]* %0, i64 0, i64 1
  %7 = bitcast i8*** %6 to i64*
  store i64 %5, i64* %7, align 8, !tbaa !9
  store i8*** %.sub, i8**** %ptls, align 8
; Filename: REPL[2]
; Source line: 1
  store i8** inttoptr (i64 139863340187792 to i8**), i8*** %1, align 8, !dbg !12, !tbaa !9
  %8 = call i8** @japi1_bar_61089(i8** undef, i8*** %1, i32 1), !dbg !12
  %9 = load i64, i64* %7, align 8, !dbg !12, !tbaa !9
  store i64 %9, i64* %4, align 8, !dbg !12, !tbaa !9
  ret void, !dbg !12
}

julia> @cuda (1,1) foo()
ERROR: ptxas application ptx input, line 100; fatal   : Parsing error near ':': syntax error
ptxas fatal   : Ptx assembly aborted due to errors
CUDA error: no kernel image is available for execution on the device (code #209, ERROR_NO_BINARY_FOR_GPU)

Probably easiest using another CodegenParam.
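Independently of the CodegenParam approach, a coarse check is possible on the generated IR text itself. A hedged sketch (check_calling_conventions is hypothetical, not part of CUDAnative): reject code that still references functions using Julia's boxed calling conventions, like the japi1_bar call visible in the IR above.

# Scan textual IR (or PTX) for japi1_/jlcall_ references and bail out early.
function check_calling_conventions(ir::String)
    for m in eachmatch(r"\b(japi1|jlcall)_\w+", ir)
        error("device code references $(m.match), which uses a non-native calling convention")
    end
end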

Reduce example

Hi, found this not to be working:

using CUDAnative, CUDAdrv
const cu = CUDAnative
dev = CUDAdrv.CuDevice(0)
ctx = CUDAdrv.CuContext(dev)
@target ptx function reduce_warp{T,F<:Function}(val::T, op::F)
    offset = Int(warpsize) ÷ 2
    while offset > 0
        val = op(val, shfl_down(val, offset))::T
        offset ÷= 2
    end
    return val::T
end

@target ptx function reduce_block{T,F<:Function}(val::T, op::F)
    shared = @cuStaticSharedMem(T, 32)

    wid, lane = fldmod1(threadIdx().x, warpsize)

    val = reduce_warp(val, op)::T

    if lane == 1
        @inbounds shared[Int(wid)] = val
    end

    sync_threads()

    # read from shared memory only if that warp existed
    @inbounds val = ((threadIdx().x <= fld(blockDim().x, warpsize)) ? shared[Int(lane)] : zero(T))::T

    if wid == 1
        # final reduce within first warp
        val = reduce_warp(val, op)::T
    end

    return val::T
end
@target ptx function reduce_kernel{F<:Function,T,N}(
        A::AbstractArray{T,N}, out::AbstractArray{T,1}, op::F
    )
    local sum::T = zero(T)

    #reduce multiple elements per thread
    i = Int((blockIdx().x-Int32(1)) * blockDim().x + threadIdx().x)
    while i <= length(A)
       @inbounds sum += A[i]
       i += blockDim().x * gridDim().x
    end
    sum = reduce_block(sum, op)::T
    if (threadIdx().x == 0)
        @inbounds out[Int(blockIdx().x)] = sum;
    end
    nothing
end
N = 2048
A = CuArray(rand(Float32, N));
threads = 512;
blocks = min((N + threads - 1) ÷ threads, 1024);
out = CuArray(Float32, (blocks,))
@cuda (blocks, threads) reduce_kernel(A, out, +);
julia> @cuda (blocks, threads) reduce_kernel(A, out, +);
TRACE: cuMemAlloc(ptr_ref=Base.RefValue{Ptr{Void}}(Ptr{Void} @0x00007ff7d0327088), nbytes=24) = SUCCESS(0)
TRACE: cuMemcpyHtoD(dst.inner=Ptr{CUDAnative.CuDeviceArray{Float32,1}} @0x0000000503180200, pointer_from_objref(src)=Ptr{Void} @0x00007ff7d4b8d690, sizeof(T)=24) = SUCCESS(0)
TRACE: cuMemAlloc(ptr_ref=Base.RefValue{Ptr{Void}}(Ptr{Void} @0x00007ff7d00d4178), nbytes=24) = SUCCESS(0)
TRACE: cuMemcpyHtoD(dst.inner=Ptr{CUDAnative.CuDeviceArray{Float32,1}} @0x0000000503180400, pointer_from_objref(src)=Ptr{Void} @0x00007ff7d4cc3530, sizeof(T)=24) = SUCCESS(0)
DEBUG: Compiling reduce_kernel(CUDAnative.CuDeviceArray{Float32,1}, CUDAnative.CuDeviceArray{Float32,1}, Base.#+)
TRACE: Writing kernel AST to /tmp/JuliaCUDA_9/reduce_kernel-CUDAnativeCuDeviceArrayFloat321.CUDAnativeCuDeviceArrayFloat321.Base.jl
TRACE: Writing kernel LLVM IR to /tmp/JuliaCUDA_9/reduce_kernel-CUDAnativeCuDeviceArrayFloat321.CUDAnativeCuDeviceArrayFloat321.Base.ll
TRACE: Writing kernel PTX assembly to /tmp/JuliaCUDA_9/reduce_kernel-CUDAnativeCuDeviceArrayFloat321.CUDAnativeCuDeviceArrayFloat321.Base.ptx
TRACE: Function entry point: julia_reduce_kernel_65139
TRACE: cuModuleLoadDataEx(handle_ref=Base.RefValue{Ptr{Void}}(Ptr{Void} @0x00007ff7d00f9230), data="
       //
       // Generated by LLVM NVPTX Back-End
       //

       .version 3.2
       .target sm_35
       .address_size 64

        .file   1 "./…

       [snip]

       …7;
       LBB3_8:
       LBB3_7:
        .loc 4 22 0
        mov.b32      %f5, %r30;
        st.param.f32    [func_retval0+0], %f5;
        ret;
       }", length(optionKeys)=7, optionKeys=CUDAdrv.CUjit_option[LOG_VERBOSE,INFO_LOG_BUFFER,INFO_LOG_BUFFER_SIZE_BYTES,GENERATE_DEBUG_INFO,GENERATE_LINE_INFO,ERROR_LOG_BUFFER,ERROR_LOG_BUFFER_SIZE_BYTES], optionVals=Ptr{Void}[Ptr{Void} @0x0000000000000001,Ptr{Void} @0x0000000005bf6980,Ptr{Void} @0x0000000000100000,Ptr{Void} @0x0000000000000001,Ptr{Void} @0x0000000000000001,Ptr{Void} @0x0000000005af68c0,Ptr{Void} @0x0000000000100000]) = ERROR_NO_BINARY_FOR_GPU(209)
ERROR: No kernel image available/suitable for GPU (CUDA error #209, ERROR_NO_BINARY_FOR_GPU)
ptxas fatal   : Unresolved extern function 'jl_throw'
 in macro expansion at /home/s/.julia/v0.6/CUDAdrv/src/base.jl:64 [inlined]
 in CUDAdrv.CuModule(::String) at /home/s/.julia/v0.6/CUDAdrv/src/module.jl:39
 in cufunction(::#reduce_kernel, ::Type{T}) at /home/s/.julia/v0.6/CUDAnative/src/execution.jl:200
 in macro expansion at /home/s/.julia/v0.6/CUDAnative/src/execution.jl:277 [inlined]
 in #generated_cuda#29(::Array{Any,1}, ::CUDAnative.#generated_cuda, ::Tuple{Int64,Int64}, ::#reduce_kernel, ::CUDAdrv.CuArray{Float32,1}, ::CUDAdrv.CuArray{Float32,1}, ::Base.#+) at /home/s/.julia/v0.6/CUDAnative/src/execution.jl:265
 in generated_cuda(::Tuple{Int64,Int64}, ::#reduce_kernel, ::CUDAdrv.CuArray{Float32,1}, ::Vararg{Any,N}) at /home/s/.julia/v0.6/CUDAnative/src/execution.jl:265

I tried to reduce this example further, but somehow got stuck.
My first clue was that fldmod1 was the culprit, but I couldn't nail it down to that.
The jl_throw seems to be in julia_....mod1... (which I guess is the compiled function of Base.mod1?), which is weird since this is the definition of mod1:

mod1{T<:Integer}(x::T, y::T) = mod(x+y-T(1),y)+T(1)
function mod{T<:Integer}(x::T, y::T)
    y == -1 && return T(0)   # avoid potential overflow in fld
    x - fld(x,y)*y
end
function fld{T<:Integer}(x::T, y::T)
    d = div(x,y)
    d - (signbit(x$y) & (d*y!=x))
end

Which shouldn't throw?

User-friendly at-cuprintf

@cuprintf is nice, but has horrible internals. Constant strings should be emitted as module-scope globals, not heap-allocated objects...

bounds check sometimes missing?

Still needs to be verified.

using CUDAdrv, CUDAnative

@target ptx function broken(arr::Ptr{Int32})
    temp = @cuStaticSharedMem(Int32, (2, 1))
    tx = Int(threadIdx().x)

    if tx == 1
        for i = 1:2
            temp[i] = 1 # doesn't yield boundscheck?
        end
    end
    sync_threads()

    Base.pointerset(arr, Base.pointerref(temp.ptr, tx, 8), tx, 8)

    return nothing
end

dev = CuDevice(0)
ctx = CuContext(dev)

d_arr = CuArray(Int32, (2, 1))
@cuda (1,2) broken(d_arr.ptr)
println(Array(d_arr))

destroy(ctx)

UNKNOWN_ERROR(999) during build

I get the following error on Pkg.add("CUDAnative.jl").

INFO: Cloning cache of CUDAnative from https://github.com/JuliaGPU/CUDAnative.jl.git
INFO: Cloning cache of LLVM from https://github.com/maleadt/LLVM.jl.git
INFO: Installing CUDAdrv v0.4.0
INFO: Installing CUDAnative v0.3.0
INFO: Installing LLVM v0.3.4
INFO: Building CUDAdrv
INFO: Building LLVM
INFO: Found 1 unique LLVM installations
WARNING: Only considering bundled LLVM v3.9.1 (define USE_SYSTEM_LLVM=1 to override)
INFO: Performing source build of LLVM extras library
INFO: Building CUDAnative
=============================[ ERROR: CUDAnative ]==============================

LoadError: InitError: CUDAdrv.CuError(999, Nullable{String}())
during initialization of module CUDAdrv
while loading /home/jrun/.julia/v0.6/CUDAnative/deps/build.jl, in expression starting on line 1

================================================================================
ERROR: UndefVarError: CUDAdrv not defined
deserialize_module(::SerializationState{IOStream}) at ./serialize.jl:754
handle_deserialize(::SerializationState{IOStream}, ::Int32) at ./serialize.jl:695
deserialize(::SerializationState{IOStream}) at ./serialize.jl:634
deserialize_datatype(::SerializationState{IOStream}, ::Bool) at ./serialize.jl:968
handle_deserialize(::SerializationState{IOStream}, ::Int32) at ./serialize.jl:674
deserialize(::SerializationState{IOStream}) at ./serialize.jl:634
handle_deserialize(::SerializationState{IOStream}, ::Int32) at ./serialize.jl:681
deserialize(::SerializationState{IOStream}, ::DataType) at ./serialize.jl:1075
handle_deserialize(::SerializationState{IOStream}, ::Int32) at ./serialize.jl:687
deserialize(::SerializationState{IOStream}, ::DataType) at ./serialize.jl:1075
handle_deserialize(::SerializationState{IOStream}, ::Int32) at ./serialize.jl:687
(::Base.Pkg.Entry.##54#55{Dict{Any,Any}})(::IOStream) at ./pkg/entry.jl:639
open(::Base.Pkg.Entry.##54#55{Dict{Any,Any}}, ::String, ::String) at ./iostream.jl:152
build!(::Array{String,1}, ::Dict{Any,Any}, ::Set{Any}) at ./pkg/entry.jl:636
build(::Array{String,1}) at ./pkg/entry.jl:650
resolve(::Dict{String,Base.Pkg.Types.VersionSet}, ::Dict{String,Dict{VersionNumber,Base.Pkg.Types.Available}}, ::Dict{String,Tuple{VersionNumber,Bool}}, ::Dict{String,Base.Pkg.Types.Fixed}, ::Dict{String,VersionNumber}, ::Set{String}) at ./pkg/entry.jl:570
resolve(::Dict{String,Base.Pkg.Types.VersionSet}, ::Dict{String,Dict{VersionNumber,Base.Pkg.Types.Available}}, ::Dict{String,Tuple{VersionNumber,Bool}}, ::Dict{String,Base.Pkg.Types.Fixed}) at ./pkg/entry.jl:479
edit(::Function, ::String, ::Base.Pkg.Types.VersionSet, ::Vararg{Base.Pkg.Types.VersionSet,N} where N) at ./pkg/entry.jl:30
(::Base.Pkg.Entry.##1#3{String,Base.Pkg.Types.VersionSet})() at ./task.jl:335
Stacktrace:
 [1] sync_end() at ./task.jl:287
 [2] macro expansion at ./task.jl:303 [inlined]
 [3] add(::String, ::Base.Pkg.Types.VersionSet) at ./pkg/entry.jl:51
 [4] (::Base.Pkg.Dir.##4#7{Array{Any,1},Base.Pkg.Entry.#add,Tuple{String}})() at ./pkg/dir.jl:36
 [5] cd(::Base.Pkg.Dir.##4#7{Array{Any,1},Base.Pkg.Entry.#add,Tuple{String}}, ::String) at ./file.jl:70
 [6] #cd#1(::Array{Any,1}, ::Function, ::Function, ::String, ::Vararg{String,N} where N) at ./pkg/dir.jl:36
 [7] add(::String) at ./pkg/pkg.jl:117```

Shared memory + multiple function exits cause invalid results

Cause seems to be an added checkbounds, if that even makes sense.

Repro:

using CUDAdrv, CUDAnative

@target ptx function kernel(arr::Ptr{Int32})
    temp = @cuStaticSharedMem(Int32, (2, 1))
    tx = Int(threadIdx().x)

    if tx == 1
        for i = 1:2
            # THIS BREAKS STUFF: checkbounds(temp, i)
            Base.pointerset(temp.ptr, 1, i, 8)
        end
    end
    sync_threads()

    Base.pointerset(arr, Base.pointerref(temp.ptr, tx, 8), tx, 8)

    return nothing
end

dev = CuDevice(0)
ctx = CuContext(dev)

d_arr = CuArray(Int32, (2, 1))
@cuda (1,2) kernel(d_arr.ptr)
println(Array(d_arr))

destroy(ctx)

Result without checkbounds: [1; 1]. With: [1; 0].

cc @cfoket

Int32(Float) constructor fails to compile

using CUDAdrv, CUDAnative

 function truncit(inval)
    aaa = Int32(inval)
    return nothing
 end

 dev = CuDevice(0)
 ctx = CuContext(dev)

 @cuda dev (1, 1) truncit(Float32(2.0))

 destroy(ctx)

Julia output:

ERROR: LoadError: LLVM error: Cannot select: t29: i64 = ExternalSymbol'truncf'
In function: julia_convert_62240
 in handle_error(::Cstring) at /home/nipennem/Documents/julia/julia-cuda/.julia/release/v0.6/LLVM/src/core/context.jl:96
 in LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/nipennem/Documents/julia/julia-cuda/.julia/release/v0.6/LLVM/deps/../lib/3.9/libLLVM_h.jl:301
 in emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/nipennem/Documents/julia/julia-cuda/.julia/release/v0.6/LLVM/src/targetmachine.jl:39
 in mcgen(::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/nipennem/Documents/julia/julia-cuda/.julia/release/v0.6/CUDAnative/src/jit.jl:267
 in compile_function(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/nipennem/Documents/julia/julia-cuda/.julia/release/v0.6/CUDAnative/src/jit.jl:338
 in cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/nipennem/Documents/julia/julia-cuda/.julia/release/v0.6/CUDAnative/src/jit.jl:355
 in macro expansion at /home/nipennem/Documents/julia/julia-cuda/.julia/release/v0.6/CUDAnative/src/execution.jl:191 [inlined]
 in #generated_cuda#43(::Array{Any,1}, ::CUDAnative.#generated_cuda, ::CUDAdrv.CuDevice, ::Tuple{Int64,Int64}, ::#truncit, ::Float32) at /home/nipennem/Documents/julia/julia-cuda/.julia/release/v0.6/CUDAnative/src/execution.jl:177
 in generated_cuda(::CUDAdrv.CuDevice, ::Tuple{Int64,Int64}, ::#truncit, ::Float32, ::Vararg{Float32,N}) at /home/nipennem/Documents/julia/julia-cuda/.julia/release/v0.6/CUDAnative/src/execution.jl:177
 in include_from_node1(::String) at ./loading.jl:532
 in include(::String) at ./sysimg.jl:14
 in process_options(::Base.JLOptions) at ./client.jl:271
 in _start() at ./client.jl:335
while loading /home/nipennem/Documents/julia/rodinia/julia_cuda/backprop/truncf.jl, in expression starting on line 11

It's similar with a Float64, except that there the 'trunc' symbol is not found.
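A hedged workaround sketch: unsafe_trunc lowers to a plain fptosi instruction and avoids the truncf/trunc library calls that the checked Int32(x) conversion appears to pull in here (note that it also skips the InexactError check, hence the name).

function truncit2(inval)
    aaa = unsafe_trunc(Int32, inval)   # fptosi, no libcall
    return nothing
end

# with the dev/ctx from the repro above:
# @cuda dev (1, 1) truncit2(Float32(2.0))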

NVPTX does not support Julia's address spaces

CUDAdrv tests ok, but when I test CUDAnative with:

Julia Version 0.7.0-DEV.665
Commit 401b724fbb (2017-06-20 06:22 UTC)
Platform Info:
  OS: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
Environment:

Pkg.status("CUDAnative")
 - CUDAnative                    0.3.0+             master

I get the eponymous error related to LLVMAddLowerGCFramePass.
Do I have a version issue here?

StepRanges not working

Imported from bugs/fill.jl:

using CUDAdrv, CUDAnative

function fill(arr)
    for i = 1:2:10
        arr[i] = i
    end
    return nothing
end

dev = CuDevice(0)
ctx = CuContext(dev)

d_arr = CuArray(Int, 10)
@cuda dev (1, 1) fill(d_arr)
@show Array(d_arr)

destroy(ctx)

Macro versions of code_*

Similar to how Base wraps code_llvm with @code_llvm, except that it should use the cudaconvert type conversion rules (see how @cuda works), eg. mapping CuArray to CuDeviceArray.
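A minimal sketch of such a macro (hypothetical, not the package's actual implementation): convert each argument with the cudaconvert referenced above, take the resulting types, and forward to CUDAnative.code_llvm.

macro device_code_llvm(ex)
    f = esc(ex.args[1])
    args = map(esc, ex.args[2:end])
    quote
        # apply the same conversions @cuda performs, then reflect on the converted types
        local cargs = map(CUDAnative.cudaconvert, ($(args...),))
        CUDAnative.code_llvm($f, Tuple{map(typeof, cargs)...})
    end
end

# usage: @device_code_llvm kernel(d_arr, 1)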

Arithmetic bug with 64-bit integers

Computing 2^19 + 2 * 1 fails when working with 64-bit integers, but works with 32-bit ones. Consistently reproducible with a GeForce GTX TITAN on drivers 378.13 (current short-lived) and 381.09 (current beta), but works properly on 375.39 (current long-lived). SASS code is identical across driver versions, so this looks to be an even lower-level bug than #4.

Reproduced from a bogus OOB (64-bit indices...) when working on Rodinia/needle.
Looked something like:

function oob(reference)
    index    = 4130784 - 32784 * blockIdx().x + 16 * blockIdx().x + threadIdx().x + 2051

    ref = @cuStaticSharedMem(Int32, (16, 16))
    for ty = 0:15
        i = index + 2049 * ty + 1
        @inbounds ref[threadIdx().x, ty + 1] = reference[i]
    end

    return nothing
end

array = CuArray{Int32}(2049, 2049)
@cuda (1,1) oob(array)

Full repro:

using CUDAdrv, CUDAnative

function kernel{T}(one::T, ptr::Ptr{T})
    val = T(524288) + T(2) * one
    Base.pointerset(ptr, val, 1, 8)
    return nothing
end

dev = CuDevice(0)
ctx = CuContext(dev)

function test(name, T)
    ref = CuArray{T}(1)
    @cuda (1,1) kernel(T(1), pointer(ref))
    println("$name: ", Array(ref)[1])

    if !isfile("$name.ll")
        open("$name.ll", "w") do io
            CUDAnative.code_llvm(io, kernel, Tuple{T, Ptr{T}};
                                 dump_module=true, cap=capability(dev))
        end
    end
end

test("32bit", Int32)
test("64bit", Int64)

destroy(ctx)
32bit: 524290
64bit: 2

Bug has been filed with NVIDIA.

cc @cfoket

unresolved functions

I'm not sure if I'm encountering two different problems here.
But the first is encountered while doing something like this:

for dims in ((4048,), (1024,1024), (77,), (1923,209))
    for T in (Float32,)
        A = GPUArray(rand(Float32, (4048,))) # replace with T and dims for error, works otherwise!
        @test sum(A) ≈ sum(Array(A))
        @test maximum(A) ≈ maximum(Array(A))
        @test minimum(A) ≈ minimum(Array(A))
        @test sumabs(A) ≈ sumabs(Array(A))
        @test prod(A) ≈ prod(Array(A))
    end
end
# yields
ERROR: No kernel image available/suitable for GPU (CUDA error #209, ERROR_NO_BINARY_FOR_GPU)
ptxas fatal   : Unresolved extern function 'julia_mod1_64892'

While trying to reduce the example, I ran into this guy:

using CUDAnative, CUDAdrv
@target ptx function kernel(out)
    wid, lane = fldmod1(threadIdx().x, warpsize)
    out[1] = wid
    nothing
end
dev = CuDevice(0)
ctx = CuContext(dev)
d_out = CuArray(Int32, 1)
@cuda (1,1) kernel(d_out)
julia> fldmod1(Int32(1), Int32(1))
LLVM ERROR: Program used external function 'julia_mod1_64792' which could not be resolved! #quits Julia

Seems like some target selection doesn't work cleanly ;)
starting with cujulia --compilecache=no --precompiled=no doesn't help.

Precompile fails in Julia v.0.6 with "UndefVarError: targets not defined"

I believe this is due to a recent change somewhere; a few days ago precompilation was not a problem.

Output:

julia> using CUDAdrv, CUDAnative
INFO: Recompiling stale cache file /home/colin/.julia/lib/v0.6/CUDAdrv.ji for module CUDAdrv.
WARNING: redefining constant libcuda
INFO: Recompiling stale cache file /home/colin/.julia/lib/v0.6/CUDAnative.ji for module CUDAnative.
ERROR: LoadError: UndefVarError: targets not defined
Stacktrace:
 [1] include_from_node1(::String) at ./loading.jl:539
 [2] include(::String) at ./sysimg.jl:14
 [3] anonymous at ./<missing>:2
while loading /home/colin/.julia/v0.6/CUDAnative/src/CUDAnative.jl, in expression starting on line 7
ERROR: Failed to precompile CUDAnative to /home/colin/.julia/lib/v0.6/CUDAnative.ji.
Stacktrace:
 [1] compilecache(::String) at ./loading.jl:673
 [2] require(::Symbol) at ./loading.jl:431

julia> versioninfo()
Julia Version 0.6.0-dev.2899
Commit 61a291b* (2017-02-22 04:15 UTC)
Platform Info:
  OS: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
