The partial fix is https://reviews.llvm.org/D45008 and https://reviews.llvm.org/D45070. Once they are committed, I'll update this with the revision number that needs to be synced past.
from cudanative.jl.
Observations after having lost some more time on this:
- `exit` behaves differently than `trap` or `brkpt`: replacing calls to `llvm.trap` with inline asm `"exit;"` yielded invalid control flow post-`ptxas`, while `trap` or `brkpt` didn't.
- the bug seems particularly sensitive to thread-divergent branches rather than multiple function exits. However, many of the examples here used `trap` or otherwise unreachable code, in which case LLVM often restructured the CFG to contain these invalid branches.
- instances of the bug can sometimes be spotted with `cuda-memcheck --tool=synccheck` (if the code uses synchronization, of course), printing `Barrier error detected. Divergent thread(s) in warp`
Currently trying out some fairly horrible transformations that replace `llvm.trap` with an inline assembly equivalent, and replace other sources of `unreachable` with branches to whatever's close and hopefully synchronizing or not thread-divergent.
It seems to work OK and passes our fairly comprehensive tests, some of which consistently fail without these transformations (toolkit v10, driver v410.57). Closing this for now, I don't think we can do much better (apart from improving the transformations / moving to LLVM / convincing NVIDIA to fix their stuff).
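The transformation described above can be sketched as a toy CFG rewrite. This is a hedged, simplified model in Python (the dict-based CFG and block names are made up for illustration; CUDAnative's actual pass works on LLVM IR):

```python
# Toy model of the transformation described above (assumption: a simplified
# sketch, not CUDAnative's actual implementation). Blocks that end in
# `unreachable` (after a trap) are rewired to branch to an existing exit
# block, so ptxas never sees trap-terminated control flow.
cfg = {
    "top":   ["check", "shfl"],
    "check": ["oob", "shmem"],
    "oob":   [],            # ends in trap + unreachable
    "shmem": ["shfl"],
    "shfl":  ["end"],
    "end":   [],            # ends in ret
}
terminators = {"top": "br", "check": "br", "oob": "unreachable",
               "shmem": "br", "shfl": "br", "end": "ret"}

# pick the function's return block as "whatever's close"
exit_block = next(b for b, t in terminators.items() if t == "ret")
for block, term in terminators.items():
    if term == "unreachable":
        cfg[block] = [exit_block]   # replace the dead end with a branch
        terminators[block] = "br"

assert cfg["oob"] == ["end"]
```

The rewrite trades correct "this path aborts" semantics for a CFG shape the assembler handles; that trade-off is exactly why it is described as horrible.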
from cudanative.jl.
Not reproducible with this code anymore, but `rodinia/lud.jl` still fails with `--check-bounds=yes`; probably still caused by the same underlying issue.
from cudanative.jl.
New repro, again using shared memory + bounds checking, but this time the invalid value is the result of `__shfl_down` (not touching shared memory at all):
using CUDAdrv, CUDAnative
function kernel(ptr::Ptr{Cint})
shared = @cuStaticSharedMem(Cint, 4)
lane = (threadIdx().x-1) % warpsize
if lane == 0
@boundscheck Base.checkbounds(shared, threadIdx().x)
unsafe_store!(shared.ptr, 0, threadIdx().x)
end
sync_threads()
val = shfl_down(Cint(32), 1, 4)
if lane == 0
unsafe_store!(ptr, val)
end
return
end
dev = CuDevice(0)
ctx = CuContext(dev)
gpu_val = CuArray(Cint, 1)
@cuda dev (1,4) kernel(gpu_val.ptr)
val = Array(gpu_val)[1]
println(val)
destroy(ctx)
Returns 0 with checkbounds, 32 without.
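For reference, the expected `shfl.down` semantics can be modelled on the CPU (a sketch following the CUDA documentation; `shfl_down` here is a hypothetical helper, not the CUDAnative function): every lane reads from `lane + delta` within its `width`-sized segment and keeps its own value when the source falls outside the segment, so a constant input must yield that constant in every lane.

```python
# CPU model of shfl.down (assumption: simplified, per the CUDA docs).
def shfl_down(vals, delta, width):
    out = []
    for lane, v in enumerate(vals):
        seg_end = (lane // width) * width + width  # end of this lane's segment
        src = lane + delta
        # out-of-segment reads keep the lane's own value
        out.append(vals[src] if src < seg_end else v)
    return out

# every lane contributes the constant 32, so every lane must read back 32
assert shfl_down([32, 32, 32, 32], delta=1, width=4) == [32, 32, 32, 32]
```

This is why the `0` returned under `checkbounds` cannot be a legitimate shuffle result; something downstream is corrupting the control flow.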
from cudanative.jl.
Managed to reduce to two sets of LLVM IR, executed using the following snippet:
using CUDAdrv, CUDAnative, LLVM
dev = CuDevice(0)
ctx = CuContext(dev)
for ir_fn in ["bug-working.ll", "bug-broken.ll"]
gpu_val = CuArray(Cint[42])
ir = readstring(ir_fn)
mod = parse(LLVM.Module, ir)
fn = "kernel"
entry = get(functions(mod), "kernel")
ptx = CUDAnative.mcgen(mod, entry, v"3.0")
cuda_mod = CuModule(ptx)
cuda_fun = CuFunction(cuda_mod, fn)
cudacall(cuda_fun, 1, 4, (Ptr{Cint},), gpu_val.ptr)
val = Array(gpu_val)[1]
println(val)
end
destroy(ctx)
Working IR:
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
@shmem = internal addrspace(3) global [4 x i32] zeroinitializer, align 4
define void @kernel(i32*) {
top:
%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%2 = and i32 %1, 31
%3 = icmp eq i32 %2, 0
br i1 %3, label %lane0_boundscheck, label %sync_shfl
lane0_boundscheck:
%4 = icmp ugt i32 %1, 3
br i1 %4, label %lane0_oob, label %lane0_shmem
lane0_oob:
tail call void @llvm.trap()
unreachable
sync_shfl:
tail call void @llvm.nvvm.barrier0()
%5 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 32, i32 1, i32 7199)
br i1 %3, label %lane0_writeback, label %end
lane0_shmem:
%6 = getelementptr [4 x i32], [4 x i32] addrspace(3)* @shmem, i32 0, i32 %1
store i32 0, i32 addrspace(3)* %6, align 8
br label %sync_shfl
lane0_writeback:
store i32 %5, i32* %0, align 8
br label %end
end:
ret void
}
declare void @llvm.trap()
declare i32 @llvm.nvvm.shfl.down.i32(i32, i32, i32)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier0()
Broken IR:
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
@shmem = internal addrspace(3) global [4 x i32] zeroinitializer, align 4
define void @kernel(i32*) {
top:
%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%2 = and i32 %1, 31
%3 = icmp eq i32 %2, 0
br i1 %3, label %lane0_boundscheck, label %sync_shfl
lane0_boundscheck:
%4 = icmp ugt i32 %1, 3
br i1 %4, label %lane0_oob, label %lane0_shmem
sync_shfl:
tail call void @llvm.nvvm.barrier0()
%5 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 32, i32 1, i32 7199)
br i1 %3, label %lane0_writeback, label %end
lane0_oob:
tail call void @llvm.trap()
unreachable
lane0_shmem:
%6 = getelementptr [4 x i32], [4 x i32] addrspace(3)* @shmem, i32 0, i32 %1
store i32 0, i32 addrspace(3)* %6, align 8
br label %sync_shfl
lane0_writeback:
store i32 %5, i32* %0, align 8
br label %end
end:
ret void
}
declare void @llvm.trap()
declare i32 @llvm.nvvm.shfl.down.i32(i32, i32, i32)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier0()
That's right, the only difference between those two is the placement of the `oob` BB...
cc @cfoket
from cudanative.jl.
One layer deeper...
Working PTX:
.version 3.2
.target sm_30
.address_size 64
.visible .entry kernel(
.param .u64 output // single int output
)
{
.reg .pred %p<4>;
.reg .b32 %r<6>;
.reg .b64 %rd<6>;
.shared .align 4 .b8 shmem[16]; // 4 integers
ld.param.u64 %rd1, [output];
// calculate lane, check if 0
mov.u32 %r1, %tid.x;
and.b32 %r2, %r1, 31;
setp.ne.s32 %p1, %r2, 0;
@%p1 bra BB_SHFL;
// bounds check for shmem access
setp.lt.u32 %p2, %r1, 4;
@%p2 bra BB_SHMEM;
bra.uni BB_OOB;
BB_SHMEM:
mul.wide.s32 %rd2, %r1, 4;
mov.u64 %rd3, shmem;
add.s64 %rd4, %rd3, %rd2;
mov.u32 %r4, 0;
st.shared.u32 [%rd4], %r4;
BB_SHFL:
setp.eq.s32 %p3, %r2, 0;
bar.sync 0;
mov.u32 %r5, 32;
shfl.down.b32 %r3, %r5, 1, 7199;
@%p3 bra BB_WRITEBACK;
bra.uni BB_END;
BB_WRITEBACK:
cvta.to.global.u64 %rd5, %rd1;
st.global.u32 [%rd5], %r3;
BB_END:
ret;
BB_OOB:
trap;
}
Broken PTX:
.version 3.2
.target sm_30
.address_size 64
.visible .entry kernel(
.param .u64 output // single int output
)
{
.reg .pred %p<4>;
.reg .b32 %r<6>;
.reg .b64 %rd<6>;
.shared .align 4 .b8 shmem[16]; // 4 integers
ld.param.u64 %rd1, [output];
// calculate lane, check if 0
mov.u32 %r1, %tid.x;
and.b32 %r2, %r1, 31;
setp.ne.s32 %p1, %r2, 0;
@%p1 bra BB_SHFL;
// bounds check for shmem access
setp.gt.u32 %p2, %r1, 3;
@%p2 bra BB_OOB;
bra.uni BB_SHMEM;
BB_SHMEM:
mul.wide.s32 %rd2, %r1, 4;
mov.u64 %rd3, shmem;
add.s64 %rd4, %rd3, %rd2;
mov.u32 %r4, 0;
st.shared.u32 [%rd4], %r4;
BB_SHFL:
setp.eq.s32 %p3, %r2, 0;
bar.sync 0;
mov.u32 %r5, 32;
shfl.down.b32 %r3, %r5, 1, 7199;
@%p3 bra BB_WRITEBACK;
bra.uni BB_END;
BB_WRITEBACK:
cvta.to.global.u64 %rd5, %rd1;
st.global.u32 [%rd5], %r3;
BB_END:
ret;
BB_OOB:
trap;
}
Loader:
using CUDAdrv
dev = CuDevice(0)
ctx = CuContext(dev)
fn = "kernel"
for name in ["bug-working", "bug-broken"]
gpu_val = CuArray(Cint[42])
ptx = readstring("$name.ptx")
cuda_mod = CuModule(ptx)
cuda_fun = CuFunction(cuda_mod, fn)
cudacall(cuda_fun, 1, 4, (Ptr{Cint},), gpu_val.ptr)
val = Array(gpu_val)[1]
println(val)
end
destroy(ctx)
Only difference: the bounds-check branch (>3 or <4):
$ diff bug-working.ptx bug-broken.ptx
22,24c22,24
< setp.lt.u32 %p2, %r1, 4;
< @%p2 bra BB_SHMEM;
< bra.uni BB_OOB;
---
> setp.gt.u32 %p2, %r1, 3;
> @%p2 bra BB_OOB;
> bra.uni BB_SHMEM;
Probably an assembler bug.
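Before blaming the assembler, it's easy to confirm on the CPU that the two branch sequences are logically equivalent (Python sketch; `working`/`broken` just label the two PTX variants):

```python
# The working PTX branches to BB_SHMEM when %r1 < 4, falling through to
# BB_OOB; the broken PTX branches to BB_OOB when %r1 > 3, falling through
# to BB_SHMEM. For unsigned tids these must pick identical targets.
def working(tid):   # setp.lt.u32 %p2, %r1, 4; @%p2 bra BB_SHMEM; bra.uni BB_OOB
    return "BB_SHMEM" if tid < 4 else "BB_OOB"

def broken(tid):    # setp.gt.u32 %p2, %r1, 3; @%p2 bra BB_OOB; bra.uni BB_SHMEM
    return "BB_OOB" if tid > 3 else "BB_SHMEM"

# exhaustively check a generous range of thread ids
assert all(working(t) == broken(t) for t in range(2**16))
```

Since the two sequences select the same target for every tid, the difference in runtime behavior can only come from how the assembler lowers them.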
from cudanative.jl.
Alternative loader, using `ptxas` to generate a cubin (in order to play with `ptxas` optimization flags, but it doesn't seem to matter):
using CUDAdrv
dev = CuDevice(0)
ctx = CuContext(dev)
fn = "kernel"
for name in ["kernel-working", "kernel-broken"]
gpu_val = CuArray(Cint[42])
run(`ptxas -arch=sm_61 -o $name.cubin $name.ptx`)
cuda_mod = CuModule(read("$name.cubin"))
cuda_fun = CuFunction(cuda_mod, fn)
cudacall(cuda_fun, 1, 4, (Ptr{Cint},), gpu_val.ptr)
val = Array(gpu_val)[1]
println(val)
end
destroy(ctx)
from cudanative.jl.
Almost definitely looks like an assembler bug. See the following annotated & prettified Pascal SASS (`sm_61`):
Working version:
kernel:
.text.kernel:
MOV R1, c[0x0][0x20];
S2R R2, SR_TID.X;
SSY `(BB_SHFL); // push BB_SHFL on reconvergence stack
// calculate lane, check if 0
LOP32I.AND R0, R2, 0x1f;
ISETP.NE.AND P0, PT, R0, RZ, PT;
@P0 SYNC // not lane 0, pop BB_SHFL from reconvergence stack
// bounds check for shmem access
ISETP.LT.U32.AND P0, PT, R2, 0x4, PT;
@P0 BRA `(BB_SHMEM);
//BB_OOB:
BPT.TRAP 0x1;
EXIT;
BB_SHMEM:
SHL R2, R2, 0x2;
STS [R2], RZ;
SYNC // pop BB_SHFL from reconvergence stack
BB_SHFL:
// check if lane 0
{ ISETP.EQ.AND P0, PT, R0, RZ, PT;
BAR.SYNC 0x0; }
// shuffle unconditionally
MOV32I R0, 0x20;
SHFL.DOWN PT, R0, R0, 0x1, 0x1c1f;
@!P0 EXIT; // not lane 0, exit
//BB_WRITEBACK:
MOV R2, c[0x0][0x140];
MOV R3, c[0x0][0x144];
STG.E [R2], R0;
EXIT;
.BB_END:
BRA `(.BB_END);
Broken version:
kernel:
.text.kernel:
MOV R1, c[0x0][0x20];
S2R R2, SR_TID.X;
// calculate lane, check if 0
LOP32I.AND R0, R2, 0x1f;
ISETP.NE.AND P0, PT, R0, RZ, PT;
@P0 BRA `(BB_SHFL); // not lane 0, branch to BB_SHFL
// bounds check for shmem access
ISETP.GT.U32.AND P0, PT, R2, 0x3, PT;
@P0 BRA `(BB_OOB);
//BB_SHMEM:
SHL R2, R2, 0x2;
STS [R2], RZ;
BB_SHFL:
// check if lane 0
{ ISETP.EQ.AND P0, PT, R0, RZ, PT;
BAR.SYNC 0x0; }
// shuffle unconditionally
MOV32I R0, 0x20;
SHFL.DOWN PT, R0, R0, 0x1, 0x1c1f;
@!P0 EXIT; // not lane 0, exit
//BB_WRITEBACK:
MOV R2, c[0x0][0x140];
MOV R3, c[0x0][0x144];
STG.E [R2], R0;
EXIT;
BB_OOB:
BPT.TRAP 0x1;
EXIT;
.L_3:
BRA `(.L_3);
.L_18:
The broken version clearly messes up its reconvergence stack, not pushing anything on it despite multiple conditional branches (for some info on how this works, see this paper by Bialas and Strzelecki)...
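To make the difference concrete, here is a toy walk-through of the reconvergence bookkeeping for a 4-lane warp (an intentionally simplified model of the SSY/SYNC stack described in that paper, not the actual hardware algorithm):

```python
# Simplified model of the Pascal-era reconvergence stack for the two SASS
# listings above. Masks cover lanes 3..0 of a 4-thread warp.
FULL = 0b1111

# Working SASS: SSY pushes the reconvergence point before the divergent branch.
stack = [("BB_SHFL", FULL)]   # SSY `(BB_SHFL)
parked = 0b1110               # @P0 SYNC: lanes 1-3 park at the reconvergence point
active = FULL & ~parked       # lane 0 runs the bounds check + shared store alone
_, active = stack.pop()       # lane 0's SYNC pops the entry: all lanes resume
assert active == FULL         # BAR.SYNC is executed by the whole warp

# Broken SASS: plain @P0 BRA with no SSY, so nothing is ever pushed.
stack = []
active = 0b1110               # lanes 1-3 jump straight to BB_SHFL
# lane 0 falls through, but no stack entry merges it back in before the
# barrier, so BAR.SYNC can be reached with a divergent active mask
assert active != FULL
```

This matches the `synccheck` diagnostic mentioned earlier: the barrier is reached by a warp whose lanes never reconverged.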
from cudanative.jl.
And a C++ loader, for reporting purposes.
#include <stdio.h>
#include <cuda.h>
#define CHECK(err) __check(err, __FILE__, __LINE__)
inline void __check(CUresult err, const char *file, const int line) {
if (CUDA_SUCCESS != err) {
const char *name, *descr;
cuGetErrorName(err, &name);
cuGetErrorString(err, &descr);
fprintf(stderr, "CUDA error #%s: %s at %s:%i\n", name, descr, file, line);
abort();
}
}
int test(const char *path) {
CUmodule mod;
CHECK(cuModuleLoad(&mod, path));
CUfunction fun;
CHECK(cuModuleGetFunction(&fun, mod, "kernel"));
int *gpu_val;
CHECK(cuMemAlloc((CUdeviceptr*) &gpu_val, sizeof(int)));
void *args[1] = {&gpu_val};
CHECK(cuLaunchKernel(fun, 1, 1, 1, 4, 1, 1, 0, NULL, args, NULL));
int val;
CHECK(cuMemcpyDtoH(&val, (CUdeviceptr) gpu_val, sizeof(int)));
CHECK(cuModuleUnload(mod));
return val;
}
int main() {
CHECK(cuInit(0));
CUdevice dev;
CHECK(cuDeviceGet(&dev, 0));
CUcontext ctx;
CHECK(cuCtxCreate(&ctx, 0, dev));
printf("working: %d\n", test("kernel-working.ptx"));
printf("broken: %d\n", test("kernel-broken.ptx"));
CHECK(cuCtxDestroy(ctx));
return 0;
}
Will probably submit this to NVIDIA soon, unless anybody still spots us doing something wrong.
from cudanative.jl.
Reported this repro to NVIDIA, bug #1833004. Will disable bounds checking for the time being.
from cudanative.jl.
Could we fix this on the LLVM side? Any bugfix to the assembler is going to be deployed slowly.
from cudanative.jl.
I haven't figured out what PTX pattern exactly triggers the SASS emission bug. Probably the branch to a trap BB. I've asked NVIDIA for some background on the bug, if they deem it a bug, so I'm going to wait for them to respond before sinking more time into this.
from cudanative.jl.
Status update from NVIDIA:
The following items have been modified for this bug:
- Status changed from "Open - pending review" to "Open - Fix being tested"
... but I haven't got access to their bug tracker (I'm only on its CC list), so I can't look at or ask for more details 😕
from cudanative.jl.
At least you know it is in fact their fault :)
from cudanative.jl.
The following items have been modified for this bug:
- Status changed from "Open - Fix being tested" to "Closed - Fixed"
No idea how / starting which version / ... though (still don't allow me access to the bug tracker).
from cudanative.jl.
Revisited this issue. Seems like it's still there, at least on NVIDIA driver 375.39, but I found out that it only reproduces on `sm_61` hardware or newer. I haven't heard back from NVIDIA, so I don't know which driver includes the fix, and the only system with `sm_61` hardware I have is locked to driver 375.39...
Anyone with `sm_61` hw on more recent drivers care to test this? I've updated the repro scripts too.
from cudanative.jl.
I only have access to `sm_60`, but I could test it on that.
from cudanative.jl.
Great! Please send me the output (to verify the bug is still there), the SASS files generated by `ptx.jl` (remove existing ones first), and the driver version. No hurry though; it's not like we can do much about it. But given some extra data points, it might be possible to re-enable bounds checking...
from cudanative.jl.
Bug still there on 375.66 (current long-lived).
from cudanative.jl.
It looks like you've discovered https://bugs.llvm.org/show_bug.cgi?id=27738, or something related. Unfortunately we've gotten zero movement from NVIDIA on this in the ~1.5 years since we discovered it ourselves and brought it to their attention. It's possible that CUDA 9's ptxas will be better, but I don't expect a proper fix except inasmuch as "buy a Volta card and use the new `sync` intrinsics" is a fix.
Yours is the cleanest reduction of this bug I've seen, btw.
from cudanative.jl.
FYI, @timshen91 is rolling out an incomplete fix for this in LLVM, and working on the full fix. He'll post details in the bug.
Empirically, the partial fix he has in hand fixes this problem for everything we've seen on our end. We'd be curious to hear if it fixes anything for you all.
from cudanative.jl.
Oh cool, thanks for the ping! I'll have a look about reproducing, since it's a while ago since I last looked at this. We also mentioned this issue to NVIDIA and they were going to look into giving us more info; if that happens I'll update here.
from cudanative.jl.
Any LLVM whose revision is greater than or equal to r328885 should include my partial fix.
I tried to use 367.48 nvcc and ptxas (but with newer driver) to reproduce the bug but failed. I'll wait for @maleadt for a short period of time and see what will happen. :)
from cudanative.jl.
Similarly, I had to revert to 375.66, as I could not reproduce the issue on 384.111 (Debian stable BPO).
Testing on r329021, it seems like the bug is still there though (on `sm_61`).
I'll recreate a full non-Julia MWE here so that you can test for yourself:
`working.ll`:
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
@shmem = internal addrspace(3) global [4 x i32] zeroinitializer, align 4
define void @kernel(i32*) {
top:
%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%2 = and i32 %1, 31
%3 = icmp eq i32 %2, 0
br i1 %3, label %lane0_boundscheck, label %sync_shfl
lane0_boundscheck:
%4 = icmp ugt i32 %1, 3
br i1 %4, label %lane0_oob, label %lane0_shmem
lane0_oob:
tail call void @llvm.trap()
unreachable
sync_shfl:
tail call void @llvm.nvvm.barrier0()
%5 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 32, i32 1, i32 7199)
br i1 %3, label %lane0_writeback, label %end
lane0_shmem:
%6 = getelementptr [4 x i32], [4 x i32] addrspace(3)* @shmem, i32 0, i32 %1
store i32 0, i32 addrspace(3)* %6, align 8
br label %sync_shfl
lane0_writeback:
store i32 %5, i32* %0, align 8
br label %end
end:
ret void
}
declare void @llvm.trap()
declare i32 @llvm.nvvm.shfl.down.i32(i32, i32, i32)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier0()
!nvvm.annotations = !{!0}
!0 = !{void (i32*)* @kernel, !"kernel", i32 1}
`broken.ll` (only difference is the ordering of the `lane0_oob` and `sync_shfl` BBs):
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
@shmem = internal addrspace(3) global [4 x i32] zeroinitializer, align 4
define void @kernel(i32*) {
top:
%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%2 = and i32 %1, 31
%3 = icmp eq i32 %2, 0
br i1 %3, label %lane0_boundscheck, label %sync_shfl
lane0_boundscheck:
%4 = icmp ugt i32 %1, 3
br i1 %4, label %lane0_oob, label %lane0_shmem
sync_shfl:
tail call void @llvm.nvvm.barrier0()
%5 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 32, i32 1, i32 7199)
br i1 %3, label %lane0_writeback, label %end
lane0_oob:
tail call void @llvm.trap()
unreachable
lane0_shmem:
%6 = getelementptr [4 x i32], [4 x i32] addrspace(3)* @shmem, i32 0, i32 %1
store i32 0, i32 addrspace(3)* %6, align 8
br label %sync_shfl
lane0_writeback:
store i32 %5, i32* %0, align 8
br label %end
end:
ret void
}
declare void @llvm.trap()
declare i32 @llvm.nvvm.shfl.down.i32(i32, i32, i32)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier0()
!nvvm.annotations = !{!0}
!0 = !{void (i32*)* @kernel, !"kernel", i32 1}
Compile to PTX:
llc -mcpu=sm_30 working.ll -o working.ptx
llc -mcpu=sm_30 broken.ll -o broken.ptx
Loader for PTX code:
#include <stdio.h>
#include <cuda.h>
#define CHECK(err) __check(err, __FILE__, __LINE__)
inline void __check(CUresult err, const char *file, const int line) {
if (CUDA_SUCCESS != err) {
const char *name, *descr;
cuGetErrorName(err, &name);
cuGetErrorString(err, &descr);
fprintf(stderr, "CUDA error #%s: %s at %s:%i\n", name, descr, file, line);
abort();
}
}
int test(const char *path) {
CUmodule mod;
CHECK(cuModuleLoad(&mod, path));
CUfunction fun;
CHECK(cuModuleGetFunction(&fun, mod, "kernel"));
int *gpu_val;
CHECK(cuMemAlloc((CUdeviceptr*) &gpu_val, sizeof(int)));
void *args[1] = {&gpu_val};
CHECK(cuLaunchKernel(fun, 1, 1, 1, 4, 1, 1, 0, NULL, args, NULL));
int val;
CHECK(cuMemcpyDtoH(&val, (CUdeviceptr) gpu_val, sizeof(int)));
CHECK(cuModuleUnload(mod));
return val;
}
int main() {
CHECK(cuInit(0));
CUdevice dev;
CHECK(cuDeviceGet(&dev, 0));
CUcontext ctx;
CHECK(cuCtxCreate(&ctx, 0, dev));
printf("working: %d\n", test("working.ptx"));
printf("broken: %d\n", test("broken.ptx"));
CHECK(cuCtxDestroy(ctx));
return 0;
}
Output:
$ clang++ ptx_loader.cpp -o ptx_loader -lcuda
$ ./ptx_loader
working: 32
broken: 0
Note that the generated PTX does differ between LLVM 6.0 and LLVM ToT, but identically so for the working and broken versions:
--- working_6.0.ptx 2018-04-03 10:34:01.000000000 +0200
+++ working_ToT.ptx 2018-04-03 09:57:20.000000000 +0200
@@ -39,12 +39,12 @@
mov.u32 %r5, 32;
shfl.down.b32 %r3, %r5, 1, 7199;
@%p3 bra LBB0_5;
-// %bb.6: // %end
- ret;
+ bra.uni LBB0_6;
LBB0_5: // %lane0_writeback
ld.param.u64 %rd2, [kernel_param_0];
cvta.to.global.u64 %rd1, %rd2;
st.global.u32 [%rd1], %r3;
+LBB0_6: // %end
ret;
LBB0_2: // %lane0_oob
trap;
from cudanative.jl.
I had to revert to 375.66, as I could not reproduce the issue on 384.111 (Debian stable BPO).
I suspect that this is because the driver contains a copy of ptxas, so changing the driver version changes the ptxas version you're using. If you compiled all the way to SASS for your GPU (dunno if your frontend does this) ahead of time using ptxas, then the driver version shouldn't matter.
I can link you to how we do this in XLA if it'd be helpful.
Will leave the analysis here to @timshen91.
from cudanative.jl.
I also reproduced the ptxas miscompile on sm_61 with ptxas 8.0. I modified the launcher to call kernel<<<...>>>(...), and link the pre-compiled ptx into the launcher.
It looks like the `lane0_oob` block breaks the region structure (roughly, a single-entry, single-exit set of basic blocks) of the program control flow graph (CFG). It has a trap instruction.
I attempted four different variations:
a) add a ret after trap.
b) add a bra.uni THE_RET_BLOCK after trap.
c) At ptx level, "inline" the trapping block into the predecessor(s).
d) replace the trap with a ret.
(a) and (b) attempted to fix the control flow graph (CFG) region structure, but they didn't work. Both (c) and (d) work, but I can't extract a principled heuristic from (c) or (d). Hopefully the new ptxas fixes this kind of issue once and for all.
from cudanative.jl.
I suspect that this is because the driver contains a copy of ptxas, so changing the driver version changes the ptxas version you're using.
Yeah, I've been deliberately using the driver for this because I assume it to be faster than having to call `ptxas` (we generate code at run-time, so we care about compiler performance). But with issues like this one, https://github.com/JuliaGPU/CUDAnative.jl/issues/165 (device support of the driver's embedded `ptxas` not matching that of CUDA's `ptxas`, despite reporting the same version), and the fact that it's not possible to probe the embedded compiler's version in order to work around or guard against bugs like this one, maybe I should consider the manual approach.
It has a trap instruction.
Right, I assume this breaks the structured CFG requirement. I'll just avoid emitting `trap` for now; thanks for looking into alternatives though.
By the way, any suggestions on similar fatal error reporting mechanisms? `trap` isn't ideal, both because of this issue and because it leaves CUDA in an unrecoverable state. I guess XLA doesn't require such functionality though.
from cudanative.jl.
By the way, any suggestions on similar fatal error reporting mechanisms? trap isn't ideal, both because of this issue, and because it leaves CUDA in an unrecoverable state. I guess XLA doesn't require such functionality though.
XLA doesn't require this functionality at the moment, but we have talked about adding an assert/trap instruction to XLA. Our idea for implementing it was to use a global variable. Which is ugly, for sure. But I'm not sure how to do the global variable and prevent future kernels from running. That's really what `trap` is for. I guess we could dereference a null pointer or something, although who knows what ptxas will do when it sees that. :-/
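The global-variable idea could look roughly like this, emulated on the CPU (a hedged sketch; `ERROR_FLAG`, the early `return`, and the host-side check are illustrative stand-ins for a device-global word, a guarded exit, and a post-launch readback):

```python
# CPU emulation of "record an error flag instead of trapping".
ERROR_FLAG = [0]          # stands in for a device-global i32

def kernel(tid, out, n):
    if tid >= n:          # where the bounds check would have trapped
        ERROR_FLAG[0] = 1 # would be an atomic store/exchange on a real GPU
        return            # early exit keeps the context recoverable
    out[tid] = 42

out = [0] * 4
for tid in range(6):      # "launch" 6 threads over a 4-element buffer
    kernel(tid, out, 4)

assert out == [42, 42, 42, 42]
assert ERROR_FLAG[0] == 1 # host-side check replaces the trap
```

Note that on a real GPU the early return before a barrier is itself a divergent exit, so this only sidesteps `trap`'s unrecoverable-context problem, not the structured-CFG issue discussed in this thread.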
from cudanative.jl.
Pretty sure I just ran into another occurrence of this bug:
using CUDAnative, CUDAdrv
function cpu(input)
output = Vector{Cint}(2)
for i in 1:2
output[i] = input[1]
end
return output
end
function kernel(input, output, n)
i = threadIdx().x
temp = @cuStaticSharedMem(Cint, 1)
if i == 1
1 <= n || ccall("llvm.trap", llvmcall, Cvoid, ())
temp[1] = input
end
sync_threads()
i <= n || ccall("llvm.trap", llvmcall, Cvoid, ())
unsafe_store!(output, temp[1], i)
end
function gpu(input)
output_gpu = Mem.alloc(Cint, 2)
@cuda threads=2 kernel(input, convert(Ptr{eltype(input)}, output_gpu.ptr), 42)
return Mem.download(Cint, output_gpu, 2)
end
using Test
function main()
input = rand(Cint(1):Cint(100))
@test cpu(input) == gpu(input)
end
`kernel` copies `input[1]` to `output[1]` and `output[2]`, but doesn't produce the correct results when those calls to `llvm.trap` are present (even though the branches are never taken). This is on 396.54 with `sm_35`.
The following PTX is generated:
//
// Generated by LLVM NVPTX Back-End
//
.version 6.0
.target sm_35
.address_size 64
// shmem1 has been demoted
// @ptxcall_kernel_1
.visible .entry ptxcall_kernel_1(
.param .u32 ptxcall_kernel_1_param_0,
.param .u64 ptxcall_kernel_1_param_1,
.param .u64 ptxcall_kernel_1_param_2
)
{
.reg .pred %p<2>;
.reg .b32 %r<7>;
.reg .b64 %rd<4>;
// demoted variable
.shared .align 16 .b8 shmem1[4];
// %bb.0: // %entry
ld.param.u64 %rd1, [ptxcall_kernel_1_param_1];
mov.u32 %r1, %tid.x;
setp.ne.s32 %p1, %r1, 0;
@%p1 bra LBB0_2;
// %bb.1: // %L19.i
ld.param.u32 %r2, [ptxcall_kernel_1_param_0];
st.shared.u32 [shmem1], %r2;
LBB0_2: // %julia_kernel_37106.exit
bar.sync 0;
ld.shared.u32 %r3, [shmem1];
mul.wide.u32 %rd2, %r1, 4;
add.s64 %rd3, %rd1, %rd2;
shr.u32 %r4, %r3, 24;
st.u8 [%rd3+3], %r4;
shr.u32 %r5, %r3, 16;
st.u8 [%rd3+2], %r5;
shr.u32 %r6, %r3, 8;
st.u8 [%rd3+1], %r6;
st.u8 [%rd3], %r3;
ret;
}
// -- End function
//
// Generated by LLVM NVPTX Back-End
//
.version 6.0
.target sm_35
.address_size 64
// shmem1 has been demoted
// @ptxcall_kernel_1
.visible .entry ptxcall_kernel_1(
.param .u32 ptxcall_kernel_1_param_0,
.param .u64 ptxcall_kernel_1_param_1,
.param .u64 ptxcall_kernel_1_param_2
)
{
.reg .pred %p<4>;
.reg .b32 %r<7>;
.reg .b64 %rd<6>;
// demoted variable
.shared .align 16 .b8 shmem1[4];
// %bb.0: // %entry
ld.param.u64 %rd3, [ptxcall_kernel_1_param_2];
mov.u32 %r1, %tid.x;
setp.ne.s32 %p1, %r1, 0;
@%p1 bra LBB0_4;
// %bb.1: // %L19.i
setp.gt.s64 %p2, %rd3, 0;
@%p2 bra LBB0_3;
bra.uni LBB0_2;
LBB0_3: // %L23.i
ld.param.u32 %r2, [ptxcall_kernel_1_param_0];
st.shared.u32 [shmem1], %r2;
LBB0_4: // %L40.i
cvt.u64.u32 %rd1, %r1;
bar.sync 0;
setp.lt.s64 %p3, %rd1, %rd3;
@%p3 bra LBB0_6;
bra.uni LBB0_5;
LBB0_6: // %julia_kernel_37106.exit
ld.param.u64 %rd2, [ptxcall_kernel_1_param_1];
ld.shared.u32 %r3, [shmem1];
shl.b64 %rd4, %rd1, 2;
add.s64 %rd5, %rd2, %rd4;
shr.u32 %r4, %r3, 24;
st.u8 [%rd5+3], %r4;
shr.u32 %r5, %r3, 16;
st.u8 [%rd5+2], %r5;
shr.u32 %r6, %r3, 8;
st.u8 [%rd5+1], %r6;
st.u8 [%rd5], %r3;
ret;
LBB0_5: // %L44.i
trap;
LBB0_2: // %L22.i
trap;
}
// -- End function
Was going to reduce this further (SASS, C++ loader), but these PTX files now seem to hang in both `cuModuleLoad` and `ptxas`; not sure what's going on there, and this has cost me enough time already.
from cudanative.jl.
Ugh, these ptxas bugs are the worst. :(
The PTX LLVM is generating here does not look particularly well-structured to me, though. It's conceivable that better structurization in LLVM would resolve this. I think @timshen91 had been hoping that the current amount of structurization we apply (really, IIRC, it's that we turned off passes that would make the graph less structured) would be sufficient, but maybe you're proving that's not the case.
from cudanative.jl.
Hi @maleadt,
Do you have the LLVM IR, and possibly the set of LLVM flags used to generate the ptx?
from cudanative.jl.
Sure. I'll dump as much relevant info as possible. The original high-level source code is as follows:
function kernel(input::Int32, output::Ptr{Int32}, yes::Bool=true)
i = threadIdx().x
temp = @cuStaticSharedMem(Cint, 1)
if i == 1
yes || no()
temp[1] = input
end
sync_threads()
yes || no()
unsafe_store!(output, temp[1], i)
end
function no()
ccall("llvm.trap", llvmcall, Cvoid, ())
end
That is compiled to the following LLVM IR:
; ModuleID = 'KernelWrapper'
source_filename = "KernelWrapper"
target triple = "nvptx64-nvidia-cuda"
%jl_value_t = type opaque
@shmem1 = addrspace(3) global [1 x i32] zeroinitializer, align 16
define i64 @julia_kernel_36616(i32, i64, i8) local_unnamed_addr {
top:
%3 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !range !2
%4 = icmp eq i32 %3, 0
%5 = and i8 %2, 1
br i1 %4, label %L17, label %L27
L17: ; preds = %top
%6 = icmp eq i8 %5, 0
br i1 %6, label %L19, label %L22
L19: ; preds = %L17
call void @llvm.trap()
unreachable
L22: ; preds = %L17
store i32 %0, i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @shmem1, i64 0, i64 0), align 16, !tbaa !3
br label %L27
L27: ; preds = %top, %L22
call void @llvm.nvvm.barrier0()
%7 = icmp eq i8 %5, 0
br i1 %7, label %L30, label %L33
L30: ; preds = %L27
call void @llvm.trap()
unreachable
L33: ; preds = %L27
%8 = load i32, i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @shmem1, i64 0, i64 0), align 16, !tbaa !3
%9 = zext i32 %3 to i64
%10 = inttoptr i64 %1 to i32*
%11 = getelementptr inbounds i32, i32* %10, i64 %9
store i32 %8, i32* %11, align 1, !tbaa !6
ret i64 %1
}
; Function Attrs: noreturn nounwind
declare void @llvm.trap() #0
; Function Attrs: convergent nounwind
declare void @llvm.nvvm.barrier0() #1
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #2
define void @ptxcall_kernel_1(i32, i64, i8) local_unnamed_addr {
entry:
%3 = call i64 @julia_kernel_36616(i32 %0, i64 %1, i8 %2)
ret void
}
attributes #0 = { noreturn nounwind }
attributes #1 = { convergent nounwind }
attributes #2 = { nounwind readnone }
attributes #3 = { allocsize(1) }
!llvm.module.flags = !{!0}
!nvvm.annotations = !{!1}
!0 = !{i32 1, !"Debug Info Version", i32 3}
!1 = !{void (i32, i64, i8)* @ptxcall_kernel_1, !"kernel", i32 1}
!2 = !{i32 0, i32 1023}
!3 = !{!4, !4, i64 0, i64 0}
!4 = !{!"ptxtbaa_shared", !5, i64 0}
!5 = !{!"ptxtbaa"}
!6 = !{!7, !7, i64 0}
!7 = !{!"jtbaa_data", !8, i64 0}
!8 = !{!"jtbaa"}
Which in turn generates the following PTX:
//
// Generated by LLVM NVPTX Back-End
//
.version 6.0
.target sm_35
.address_size 64
.visible .shared .align 16 .b8 shmem1[4];
// @julia_kernel_36783
.visible .func (.param .b64 func_retval0) julia_kernel_36783(
.param .b32 julia_kernel_36783_param_0,
.param .b64 julia_kernel_36783_param_1,
.param .b32 julia_kernel_36783_param_2
)
{
.reg .pred %p<4>;
.reg .b16 %rs<3>;
.reg .b32 %r<7>;
.reg .b64 %rd<4>;
// %bb.0: // %top
mov.u32 %r1, %tid.x;
setp.ne.s32 %p1, %r1, 0;
ld.param.u8 %rs2, [julia_kernel_36783_param_2];
and.b16 %rs1, %rs2, 1;
@%p1 bra LBB0_4;
// %bb.1: // %L17
setp.ne.s16 %p2, %rs1, 0;
@%p2 bra LBB0_3;
bra.uni LBB0_2;
LBB0_3: // %L22
ld.param.u32 %r2, [julia_kernel_36783_param_0];
st.shared.u32 [shmem1], %r2;
LBB0_4: // %L27
bar.sync 0;
setp.ne.s16 %p3, %rs1, 0;
@%p3 bra LBB0_6;
bra.uni LBB0_5;
LBB0_6: // %L33
ld.param.u64 %rd1, [julia_kernel_36783_param_1];
ld.shared.u32 %r3, [shmem1];
mul.wide.u32 %rd2, %r1, 4;
add.s64 %rd3, %rd1, %rd2;
shr.u32 %r4, %r3, 24;
st.u8 [%rd3+3], %r4;
shr.u32 %r5, %r3, 16;
st.u8 [%rd3+2], %r5;
shr.u32 %r6, %r3, 8;
st.u8 [%rd3+1], %r6;
st.u8 [%rd3], %r3;
st.param.b64 [func_retval0+0], %rd1;
ret;
LBB0_5: // %L30
trap;
LBB0_2: // %L19
trap;
}
// -- End function
// .globl ptxcall_kernel_2 // -- Begin function ptxcall_kernel_2
.visible .entry ptxcall_kernel_2(
.param .u32 ptxcall_kernel_2_param_0,
.param .u64 ptxcall_kernel_2_param_1,
.param .u8 ptxcall_kernel_2_param_2
) // @ptxcall_kernel_2
{
.reg .b32 %r<3>;
.reg .b64 %rd<3>;
// %bb.0: // %entry
ld.param.u32 %r1, [ptxcall_kernel_2_param_0];
ld.param.u64 %rd1, [ptxcall_kernel_2_param_1];
ld.param.u8 %r2, [ptxcall_kernel_2_param_2];
{ // callseq 3, 0
.reg .b32 temp_param_reg;
.param .b32 param0;
st.param.b32 [param0+0], %r1;
.param .b64 param1;
st.param.b64 [param1+0], %rd1;
.param .b32 param2;
st.param.b32 [param2+0], %r2;
.param .b64 retval0;
call.uni (retval0),
julia_kernel_36783,
(
param0,
param1,
param2
);
ld.param.b64 %rd2, [retval0+0];
} // callseq 3
ret;
}
// -- End function
This code generates broken SASS:
code for sm_35
Function : ptxcall_kernel_3
.headerflags @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
/*0008*/ MOV R1, c[0x0][0x44];
/*0010*/ MOV R0, c[0x0][0x150];
/*0018*/ LOP32I.AND R0, R0, 0xff;
/*0020*/ CAL 0x38;
/*0028*/ MOV RZ, RZ;
/*0030*/ EXIT;
/*0038*/ S2R R2, SR_TID.X;
/*0048*/ BFE.U32 R0, R0, 0x800;
/*0050*/ SSY 0xc0;
/*0058*/ ISETP.NE.AND P0, PT, R2, RZ, PT;
/*0060*/ LOP32I.AND R0, R0, 0x1;
/*0068*/ @P0 BRA 0x98;
/*0070*/ ISETP.NE.AND P0, PT, R0, RZ, PT;
/*0078*/ @!P0 NOP.S;
/*0088*/ MOV R3, c[0x0][0x140];
/*0090*/ STS [RZ], R3;
/*0098*/ ISETP.NE.AND P0, PT, R0, RZ, PT;
/*00a0*/ BAR.SYNC 0x0;
/*00a8*/ @P0 BRA 0xd8;
/*00b0*/ BPT.TRAP 0x1;
/*00b8*/ NOP.S;
/*00c8*/ BPT.TRAP 0x1;
/*00d0*/ RET;
/*00d8*/ LDS R0, [RZ];
/*00e0*/ ISCADD R5.CC, R2, c[0x0][0x148], 0x2;
/*00e8*/ MOV32I R3, 0x4;
/*00f0*/ IMAD.U32.U32.HI.X R3, R2, R3, c[0x0][0x14c];
/*00f8*/ MOV R2, R5;
/*0108*/ PRMT R4, RZ, 0x7, R0;
/*0110*/ PRMT R5, RZ, 0x76, R0;
/*0118*/ PRMT R6, RZ, 0x765, R0;
/*0120*/ PRMT R7, R0, 0x7610, R7;
/*0128*/ ST.E.U8 [R2+0x3], R4;
/*0130*/ ST.E.U8 [R2+0x2], R5;
/*0138*/ ST.E.U8 [R2+0x1], R6;
/*0148*/ ST.E.U8 [R2], R7;
/*0150*/ RET;
/*0158*/ BRA 0x158;
/*0160*/ NOP;
/*0168*/ NOP;
/*0170*/ NOP;
/*0178*/ NOP;
Interestingly, changing the kernel wrapper to pass a literal `1` (the result of invoking this kernel with 2 arguments, having the `yes` parameter default to `true`) results in this `1` embedded in the LLVM IR and PTX (which didn't influence optimization, since we don't inline at the LLVM level):
define void @ptxcall_kernel_1(i32, i64) local_unnamed_addr {
entry:
%2 = call i64 @julia_kernel_36630(i32 %0, i64 %1, i8 1), !dbg !65
ret void
}
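For context, this wrapper is what you get when the kernel is launched with only 2 arguments, letting the trailing parameter take its default. A minimal sketch of such an invocation (the signature, body, and default value are assumptions based on the repro above, not the original source):

```julia
using CUDAdrv, CUDAnative

# Hypothetical kernel signature: `yes` defaults to `true`, so a
# 2-argument launch specializes the wrapper on the constant `i8 1`,
# which is what lets ptxas treat the error branches as dead code.
function kernel(a::Cint, ptr::Ptr{Cint}, yes::Bool=true)
    # ... body as in the repro above ...
    return
end

# Launching without the third argument bakes the literal 1 into the
# generated `ptxcall` wrapper (launch syntax elided; requires a GPU):
# @cuda (1, 4) kernel(Cint(42), ptr)
```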
// .globl ptxcall_kernel_2 // -- Begin function ptxcall_kernel_2
.visible .entry ptxcall_kernel_2(
.param .u32 ptxcall_kernel_2_param_0,
.param .u64 ptxcall_kernel_2_param_1
) // @ptxcall_kernel_2
{
.reg .b32 %r<3>;
.reg .b64 %rd<3>;
// %bb.0: // %entry
ld.param.u32 %r1, [ptxcall_kernel_2_param_0];
ld.param.u64 %rd1, [ptxcall_kernel_2_param_1];
mov.u32 %r2, 1;
{ // callseq 0, 0
.reg .b32 temp_param_reg;
.param .b32 param0;
st.param.b32 [param0+0], %r1;
.param .b64 param1;
st.param.b64 [param1+0], %rd1;
.param .b32 param2;
st.param.b32 [param2+0], %r2;
.param .b64 retval0;
call.uni (retval0),
julia_kernel_36676,
(
param0,
param1,
param2
);
ld.param.b64 %rd2, [retval0+0];
} // callseq 0
ret;
}
// -- End function
`ptxas` works with this info and now generates valid SASS:
code for sm_35
Function : ptxcall_kernel_3
.headerflags @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
/*0008*/ MOV R1, c[0x0][0x44];
/*0010*/ CAL 0x28;
/*0018*/ MOV RZ, RZ;
/*0020*/ EXIT;
/*0028*/ S2R R2, SR_TID.X;
/*0030*/ MOV32I R3, 0x4;
/*0038*/ ISETP.NE.AND P0, PT, R2, RZ, PT;
/*0048*/ @!P0 MOV R6, c[0x0][0x140];
/*0050*/ ISCADD R4.CC, R2, c[0x0][0x148], 0x2;
/*0058*/ @!P0 STS [RZ], R6;
/*0060*/ IMAD.U32.U32.HI.X R3, R2, R3, c[0x0][0x14c];
/*0068*/ BAR.SYNC 0x0;
/*0070*/ LDS R0, [RZ];
/*0078*/ MOV R2, R4;
/*0088*/ PRMT R4, RZ, 0x7, R0;
/*0090*/ PRMT R5, RZ, 0x76, R0;
/*0098*/ PRMT R6, RZ, 0x765, R0;
/*00a0*/ PRMT R7, R0, 0x7610, R7;
/*00a8*/ ST.E.U8 [R2+0x3], R4;
/*00b0*/ ST.E.U8 [R2+0x2], R5;
/*00b8*/ ST.E.U8 [R2+0x1], R6;
/*00c8*/ ST.E.U8 [R2], R7;
/*00d0*/ RET;
/*00d8*/ BRA 0xd8;
/*00e0*/ NOP;
/*00e8*/ NOP;
/*00f0*/ NOP;
/*00f8*/ NOP;
You actually don't need the `trap`; a call to `vprintf` + `unreachable` seems sufficient to end up with "unsupported" control flow:
@0 = internal unnamed_addr constant [27 x i8] c"go home ptxas you're drunk\00"
define i64 @julia_kernel_36648(i32, i64, i8) local_unnamed_addr {
top:
%3 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !range !2
%4 = icmp eq i32 %3, 0
%5 = and i8 %2, 1
br i1 %4, label %L17, label %L30
L17: ; preds = %top
%6 = icmp eq i8 %5, 0
br i1 %6, label %L19, label %L25
L19: ; preds = %L17
%7 = call i32 @vprintf(i8* getelementptr inbounds ([27 x i8], [27 x i8]* @0, i64 0, i64 0), i8* null)
unreachable
L25: ; preds = %L17
store i32 %0, i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @shmem1, i64 0, i64 0), align 16, !tbaa !3
br label %L30
L30: ; preds = %top, %L25
call void @llvm.nvvm.barrier0()
%8 = icmp eq i8 %5, 0
br i1 %8, label %L33, label %L39
L33: ; preds = %L30
%9 = call i32 @vprintf(i8* getelementptr inbounds ([27 x i8], [27 x i8]* @0, i64 0, i64 0), i8* null)
unreachable
L39: ; preds = %L30
%10 = load i32, i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @shmem1, i64 0, i64 0), align 16, !tbaa !3
%11 = zext i32 %3 to i64
%12 = inttoptr i64 %1 to i32*
%13 = getelementptr inbounds i32, i32* %12, i64 %11
store i32 %10, i32* %13, align 1, !tbaa !6
ret i64 %1
}
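For reference, Julia source roughly along these lines could lower to the IR above; the function name, message, and error helper are illustrative guesses, and exactly how the `unreachable` is produced depends on how the error path is lowered:

```julia
using CUDAnative

# Illustrative guess at the kernel shape: a print followed by an
# unreachable error path in both the writer and the reader branch.
function kernel(a::Cint, ptr::Ptr{Cint}, yes::Bool)
    shared = @cuStaticSharedMem(Cint, 1)
    if threadIdx().x == 1
        yes || (@cuprintf("go home ptxas you're drunk"); error())  # vprintf + unreachable
        unsafe_store!(shared.ptr, a, 1)
    end
    sync_threads()                                                 # llvm.nvvm.barrier0
    yes || (@cuprintf("go home ptxas you're drunk"); error())      # vprintf + unreachable
    unsafe_store!(ptr, unsafe_load(shared.ptr, 1), threadIdx().x)
    return ptr
end
```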
All this is done with LLVM 6.0, with quite some patches but none specific to NVPTX. Which LLVM flags are relevant here? The PTX MC target is initialized with only an ISA flag set, targeting `sm_35` in this case. I also set `--nvptx-fma-level=1`.
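For completeness, reproducing this outside the Julia process should amount to roughly the following, assuming stock LLVM 6.0 tools and a CUDA toolkit on the path (file names are placeholders):

```shell
# Lower the IR with the NVPTX backend, mirroring the in-process MC setup:
llc -march=nvptx64 -mcpu=sm_35 -nvptx-fma-level=1 kernel.ll -o kernel.ptx

# Assemble with ptxas and dump the resulting SASS for inspection:
ptxas -arch=sm_35 -o kernel.cubin kernel.ptx
cuobjdump -sass kernel.cubin
```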
from cudanative.jl.