The partial fix is https://reviews.llvm.org/D45008 and https://reviews.llvm.org/D45070. Once they are committed, I'll update this with the revision number that needs to be synced past.
from cudanative.jl.
Observations after having lost some more time on this:
- `exit` behaves differently than `trap` or `brkpt`: replacing calls to `llvm.trap` with inline asm `"exit;"` yielded invalid control flow post-`ptxas`, while `trap` or `brkpt` didn't.
- the bug seems particularly sensitive to thread-divergent branches rather than multiple function exits. However, many of the examples here used `trap` or otherwise unreachable code, in which case LLVM often restructured the CFG to contain these invalid branches.
- instances of the bug can sometimes be spotted with `cuda-memcheck --tool=synccheck` (if the code uses synchronization, of course), printing `Barrier error detected. Divergent thread(s) in warp`
Currently trying out some fairly horrible transformations that replace `llvm.trap` with an inline assembly equivalent, and replace other sources of `unreachable` with branches to whatever's close and hopefully synchronizing or not thread-divergent.
It seems to work OK and passes our fairly comprehensive tests, some of which consistently fail without these transformations (toolkit v10, driver v410.57). Closing this for now, I don't think we can do much better (apart from improving the transformations / moving to LLVM / convincing NVIDIA to fix their stuff).
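The transformation described above can be sketched as a toy CFG rewrite. This is a hedged, simplified model in Python (the dict-based CFG and block names are made up for illustration; CUDAnative's actual pass works on LLVM IR):

```python
# Toy model of the transformation described above (assumption: a simplified
# sketch, not CUDAnative's actual implementation). Blocks that end in
# `unreachable` (after a trap) are rewired to branch to an existing exit
# block, so ptxas never sees trap-terminated control flow.
cfg = {
    "top":   ["check", "shfl"],
    "check": ["oob", "shmem"],
    "oob":   [],            # ends in trap + unreachable
    "shmem": ["shfl"],
    "shfl":  ["end"],
    "end":   [],            # ends in ret
}
terminators = {"top": "br", "check": "br", "oob": "unreachable",
               "shmem": "br", "shfl": "br", "end": "ret"}

# pick the function's return block as "whatever's close"
exit_block = next(b for b, t in terminators.items() if t == "ret")
for block, term in terminators.items():
    if term == "unreachable":
        cfg[block] = [exit_block]   # replace the dead end with a branch
        terminators[block] = "br"

assert cfg["oob"] == ["end"]
```

The rewrite trades correct "this path aborts" semantics for a CFG shape the assembler handles; that trade-off is exactly why it is described as horrible.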
from cudanative.jl.
Not reproducible with this code anymore, but `rodinia/lud.jl` still fails with `--check-bounds=yes`; probably still caused by the same underlying issue.
from cudanative.jl.
New repro, again using shared memory + bounds checking, but this time the invalid value is the result of `__shfl_down` (not touching shared memory at all):
using CUDAdrv, CUDAnative
function kernel(ptr::Ptr{Cint})
shared = @cuStaticSharedMem(Cint, 4)
lane = (threadIdx().x-1) % warpsize
if lane == 0
@boundscheck Base.checkbounds(shared, threadIdx().x)
unsafe_store!(shared.ptr, 0, threadIdx().x)
end
sync_threads()
val = shfl_down(Cint(32), 1, 4)
if lane == 0
unsafe_store!(ptr, val)
end
return
end
dev = CuDevice(0)
ctx = CuContext(dev)
gpu_val = CuArray(Cint, 1)
@cuda dev (1,4) kernel(gpu_val.ptr)
val = Array(gpu_val)[1]
println(val)
destroy(ctx)
Returns 0 with checkbounds, 32 without.
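For reference, the expected `shfl.down` semantics can be modelled on the CPU (a sketch following the CUDA documentation; `shfl_down` here is a hypothetical helper, not the CUDAnative function): every lane reads from `lane + delta` within its `width`-sized segment and keeps its own value when the source falls outside the segment, so a constant input must yield that constant in every lane.

```python
# CPU model of shfl.down (assumption: simplified, per the CUDA docs).
def shfl_down(vals, delta, width):
    out = []
    for lane, v in enumerate(vals):
        seg_end = (lane // width) * width + width  # end of this lane's segment
        src = lane + delta
        # out-of-segment reads keep the lane's own value
        out.append(vals[src] if src < seg_end else v)
    return out

# every lane contributes the constant 32, so every lane must read back 32
assert shfl_down([32, 32, 32, 32], delta=1, width=4) == [32, 32, 32, 32]
```

This is why the `0` returned under `checkbounds` cannot be a legitimate shuffle result; something downstream is corrupting the control flow.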
from cudanative.jl.
Managed to reduce to two sets of LLVM IR, executed using the following snippet:
using CUDAdrv, CUDAnative, LLVM
dev = CuDevice(0)
ctx = CuContext(dev)
for ir_fn in ["bug-working.ll", "bug-broken.ll"]
gpu_val = CuArray(Cint[42])
ir = readstring(ir_fn)
mod = parse(LLVM.Module, ir)
fn = "kernel"
entry = get(functions(mod), "kernel")
ptx = CUDAnative.mcgen(mod, entry, v"3.0")
cuda_mod = CuModule(ptx)
cuda_fun = CuFunction(cuda_mod, fn)
cudacall(cuda_fun, 1, 4, (Ptr{Cint},), gpu_val.ptr)
val = Array(gpu_val)[1]
println(val)
end
destroy(ctx)
Working IR:
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
@shmem = internal addrspace(3) global [4 x i32] zeroinitializer, align 4
define void @kernel(i32*) {
top:
%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%2 = and i32 %1, 31
%3 = icmp eq i32 %2, 0
br i1 %3, label %lane0_boundscheck, label %sync_shfl
lane0_boundscheck:
%4 = icmp ugt i32 %1, 3
br i1 %4, label %lane0_oob, label %lane0_shmem
lane0_oob:
tail call void @llvm.trap()
unreachable
sync_shfl:
tail call void @llvm.nvvm.barrier0()
%5 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 32, i32 1, i32 7199)
br i1 %3, label %lane0_writeback, label %end
lane0_shmem:
%6 = getelementptr [4 x i32], [4 x i32] addrspace(3)* @shmem, i32 0, i32 %1
store i32 0, i32 addrspace(3)* %6, align 8
br label %sync_shfl
lane0_writeback:
store i32 %5, i32* %0, align 8
br label %end
end:
ret void
}
declare void @llvm.trap()
declare i32 @llvm.nvvm.shfl.down.i32(i32, i32, i32)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier0()
Broken IR:
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
@shmem = internal addrspace(3) global [4 x i32] zeroinitializer, align 4
define void @kernel(i32*) {
top:
%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%2 = and i32 %1, 31
%3 = icmp eq i32 %2, 0
br i1 %3, label %lane0_boundscheck, label %sync_shfl
lane0_boundscheck:
%4 = icmp ugt i32 %1, 3
br i1 %4, label %lane0_oob, label %lane0_shmem
sync_shfl:
tail call void @llvm.nvvm.barrier0()
%5 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 32, i32 1, i32 7199)
br i1 %3, label %lane0_writeback, label %end
lane0_oob:
tail call void @llvm.trap()
unreachable
lane0_shmem:
%6 = getelementptr [4 x i32], [4 x i32] addrspace(3)* @shmem, i32 0, i32 %1
store i32 0, i32 addrspace(3)* %6, align 8
br label %sync_shfl
lane0_writeback:
store i32 %5, i32* %0, align 8
br label %end
end:
ret void
}
declare void @llvm.trap()
declare i32 @llvm.nvvm.shfl.down.i32(i32, i32, i32)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier0()
That's right, the only difference between those two is the placement of the `oob` BB...
cc @cfoket
from cudanative.jl.
One layer deeper...
Working PTX:
.version 3.2
.target sm_30
.address_size 64
.visible .entry kernel(
.param .u64 output // single int output
)
{
.reg .pred %p<4>;
.reg .b32 %r<6>;
.reg .b64 %rd<6>;
.shared .align 4 .b8 shmem[16]; // 4 integers
ld.param.u64 %rd1, [output];
// calculate lane, check if 0
mov.u32 %r1, %tid.x;
and.b32 %r2, %r1, 31;
setp.ne.s32 %p1, %r2, 0;
@%p1 bra BB_SHFL;
// bounds check for shmem access
setp.lt.u32 %p2, %r1, 4;
@%p2 bra BB_SHMEM;
bra.uni BB_OOB;
BB_SHMEM:
mul.wide.s32 %rd2, %r1, 4;
mov.u64 %rd3, shmem;
add.s64 %rd4, %rd3, %rd2;
mov.u32 %r4, 0;
st.shared.u32 [%rd4], %r4;
BB_SHFL:
setp.eq.s32 %p3, %r2, 0;
bar.sync 0;
mov.u32 %r5, 32;
shfl.down.b32 %r3, %r5, 1, 7199;
@%p3 bra BB_WRITEBACK;
bra.uni BB_END;
BB_WRITEBACK:
cvta.to.global.u64 %rd5, %rd1;
st.global.u32 [%rd5], %r3;
BB_END:
ret;
BB_OOB:
trap;
}
Broken PTX:
.version 3.2
.target sm_30
.address_size 64
.visible .entry kernel(
.param .u64 output // single int output
)
{
.reg .pred %p<4>;
.reg .b32 %r<6>;
.reg .b64 %rd<6>;
.shared .align 4 .b8 shmem[16]; // 4 integers
ld.param.u64 %rd1, [output];
// calculate lane, check if 0
mov.u32 %r1, %tid.x;
and.b32 %r2, %r1, 31;
setp.ne.s32 %p1, %r2, 0;
@%p1 bra BB_SHFL;
// bounds check for shmem access
setp.gt.u32 %p2, %r1, 3;
@%p2 bra BB_OOB;
bra.uni BB_SHMEM;
BB_SHMEM:
mul.wide.s32 %rd2, %r1, 4;
mov.u64 %rd3, shmem;
add.s64 %rd4, %rd3, %rd2;
mov.u32 %r4, 0;
st.shared.u32 [%rd4], %r4;
BB_SHFL:
setp.eq.s32 %p3, %r2, 0;
bar.sync 0;
mov.u32 %r5, 32;
shfl.down.b32 %r3, %r5, 1, 7199;
@%p3 bra BB_WRITEBACK;
bra.uni BB_END;
BB_WRITEBACK:
cvta.to.global.u64 %rd5, %rd1;
st.global.u32 [%rd5], %r3;
BB_END:
ret;
BB_OOB:
trap;
}
Loader:
using CUDAdrv
dev = CuDevice(0)
ctx = CuContext(dev)
fn = "kernel"
for name in ["bug-working", "bug-broken"]
gpu_val = CuArray(Cint[42])
ptx = readstring("$name.ptx")
cuda_mod = CuModule(ptx)
cuda_fun = CuFunction(cuda_mod, fn)
cudacall(cuda_fun, 1, 4, (Ptr{Cint},), gpu_val.ptr)
val = Array(gpu_val)[1]
println(val)
end
destroy(ctx)
Only difference: the bounds-check branch (>3 or <4):
$ diff bug-working.ptx bug-broken.ptx
22,24c22,24
< setp.lt.u32 %p2, %r1, 4;
< @%p2 bra BB_SHMEM;
< bra.uni BB_OOB;
---
> setp.gt.u32 %p2, %r1, 3;
> @%p2 bra BB_OOB;
> bra.uni BB_SHMEM;
Probably an assembler bug.
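Before blaming the assembler, it's easy to confirm on the CPU that the two branch sequences are logically equivalent (Python sketch; `working`/`broken` just label the two PTX variants):

```python
# The working PTX branches to BB_SHMEM when %r1 < 4, falling through to
# BB_OOB; the broken PTX branches to BB_OOB when %r1 > 3, falling through
# to BB_SHMEM. For unsigned tids these must pick identical targets.
def working(tid):   # setp.lt.u32 %p2, %r1, 4; @%p2 bra BB_SHMEM; bra.uni BB_OOB
    return "BB_SHMEM" if tid < 4 else "BB_OOB"

def broken(tid):    # setp.gt.u32 %p2, %r1, 3; @%p2 bra BB_OOB; bra.uni BB_SHMEM
    return "BB_OOB" if tid > 3 else "BB_SHMEM"

# exhaustively check a generous range of thread ids
assert all(working(t) == broken(t) for t in range(2**16))
```

Since the two sequences select the same target for every tid, the difference in runtime behavior can only come from how the assembler lowers them.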
from cudanative.jl.
Alternative loader, using `ptxas` to generate a cubin (in order to play with `ptxas` optimization flags, but it doesn't seem to matter):
using CUDAdrv
dev = CuDevice(0)
ctx = CuContext(dev)
fn = "kernel"
for name in ["kernel-working", "kernel-broken"]
gpu_val = CuArray(Cint[42])
run(`ptxas -arch=sm_61 -o $name.cubin $name.ptx`)
cuda_mod = CuModule(read("$name.cubin"))
cuda_fun = CuFunction(cuda_mod, fn)
cudacall(cuda_fun, 1, 4, (Ptr{Cint},), gpu_val.ptr)
val = Array(gpu_val)[1]
println(val)
end
destroy(ctx)
from cudanative.jl.
Almost definitely looks like an assembler bug. See the following annotated & prettified Pascal SASS (`sm_61`):
Working version:
kernel:
.text.kernel:
MOV R1, c[0x0][0x20];
S2R R2, SR_TID.X;
SSY `(BB_SHFL); // push BB_SHFL on reconvergence stack
// calculate lane, check if 0
LOP32I.AND R0, R2, 0x1f;
ISETP.NE.AND P0, PT, R0, RZ, PT;
@P0 SYNC // not lane 0, pop BB_SHFL from reconvergence stack
// bounds check for shmem access
ISETP.LT.U32.AND P0, PT, R2, 0x4, PT;
@P0 BRA `(BB_SHMEM);
//BB_OOB:
BPT.TRAP 0x1;
EXIT;
BB_SHMEM:
SHL R2, R2, 0x2;
STS [R2], RZ;
SYNC // pop BB_SHFL from reconvergence stack
BB_SHFL:
// check if lane 0
{ ISETP.EQ.AND P0, PT, R0, RZ, PT;
BAR.SYNC 0x0; }
// shuffle unconditionally
MOV32I R0, 0x20;
SHFL.DOWN PT, R0, R0, 0x1, 0x1c1f;
@!P0 EXIT; // not lane 0, exit
//BB_WRITEBACK:
MOV R2, c[0x0][0x140];
MOV R3, c[0x0][0x144];
STG.E [R2], R0;
EXIT;
.BB_END:
BRA `(.BB_END);
Broken version:
kernel:
.text.kernel:
MOV R1, c[0x0][0x20];
S2R R2, SR_TID.X;
// calculate lane, check if 0
LOP32I.AND R0, R2, 0x1f;
ISETP.NE.AND P0, PT, R0, RZ, PT;
@P0 BRA `(BB_SHFL); // not lane 0, branch to BB_SHFL
// bounds check for shmem access
ISETP.GT.U32.AND P0, PT, R2, 0x3, PT;
@P0 BRA `(BB_OOB);
//BB_SHMEM:
SHL R2, R2, 0x2;
STS [R2], RZ;
BB_SHFL:
// check if lane 0
{ ISETP.EQ.AND P0, PT, R0, RZ, PT;
BAR.SYNC 0x0; }
// shuffle unconditionally
MOV32I R0, 0x20;
SHFL.DOWN PT, R0, R0, 0x1, 0x1c1f;
@!P0 EXIT; // not lane 0, exit
//BB_WRITEBACK:
MOV R2, c[0x0][0x140];
MOV R3, c[0x0][0x144];
STG.E [R2], R0;
EXIT;
BB_OOB:
BPT.TRAP 0x1;
EXIT;
.L_3:
BRA `(.L_3);
.L_18:
The broken version clearly messes up its reconvergence stack, not pushing anything on it despite multiple conditional branches (for some info on how this works, see this paper by Bialas and Strzelecki)...
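To make the difference concrete, here is a toy walk-through of the reconvergence bookkeeping for a 4-lane warp (an intentionally simplified model of the SSY/SYNC stack described in that paper, not the actual hardware algorithm):

```python
# Simplified model of the Pascal-era reconvergence stack for the two SASS
# listings above. Masks cover lanes 3..0 of a 4-thread warp.
FULL = 0b1111

# Working SASS: SSY pushes the reconvergence point before the divergent branch.
stack = [("BB_SHFL", FULL)]   # SSY `(BB_SHFL)
parked = 0b1110               # @P0 SYNC: lanes 1-3 park at the reconvergence point
active = FULL & ~parked       # lane 0 runs the bounds check + shared store alone
_, active = stack.pop()       # lane 0's SYNC pops the entry: all lanes resume
assert active == FULL         # BAR.SYNC is executed by the whole warp

# Broken SASS: plain @P0 BRA with no SSY, so nothing is ever pushed.
stack = []
active = 0b1110               # lanes 1-3 jump straight to BB_SHFL
# lane 0 falls through, but no stack entry merges it back in before the
# barrier, so BAR.SYNC can be reached with a divergent active mask
assert active != FULL
```

This matches the `synccheck` diagnostic mentioned earlier: the barrier is reached by a warp whose lanes never reconverged.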
from cudanative.jl.
And a C++ loader, for reporting purposes.
#include <stdio.h>
#include <cuda.h>
#define CHECK(err) __check(err, __FILE__, __LINE__)
inline void __check(CUresult err, const char *file, const int line) {
if (CUDA_SUCCESS != err) {
const char *name, *descr;
cuGetErrorName(err, &name);
cuGetErrorString(err, &descr);
fprintf(stderr, "CUDA error #%s: %s at %s:%i\n", name, descr, file, line);
abort();
}
}
int test(const char *path) {
CUmodule mod;
CHECK(cuModuleLoad(&mod, path));
CUfunction fun;
CHECK(cuModuleGetFunction(&fun, mod, "kernel"));
int *gpu_val;
CHECK(cuMemAlloc((CUdeviceptr*) &gpu_val, sizeof(int)));
void *args[1] = {&gpu_val};
CHECK(cuLaunchKernel(fun, 1, 1, 1, 4, 1, 1, 0, NULL, args, NULL));
int val;
CHECK(cuMemcpyDtoH(&val, (CUdeviceptr) gpu_val, sizeof(int)));
CHECK(cuModuleUnload(mod));
return val;
}
int main() {
CHECK(cuInit(0));
CUdevice dev;
CHECK(cuDeviceGet(&dev, 0));
CUcontext ctx;
CHECK(cuCtxCreate(&ctx, 0, dev));
printf("working: %d\n", test("kernel-working.ptx"));
printf("broken: %d\n", test("kernel-broken.ptx"));
CHECK(cuCtxDestroy(ctx));
return 0;
}
Will probably submit this to NVIDIA soon, unless anybody still spots us doing something wrong.
from cudanative.jl.
Reported this repro to NVIDIA, bug #1833004. Will disable bounds checking for the time being.
from cudanative.jl.
Could we fix this on the LLVM side? Any bugfix to the assembler is going to be deployed slowly.
from cudanative.jl.
I haven't figured out what PTX pattern exactly triggers the SASS emission bug. Probably the branch to a trap BB. I've asked NVIDIA for some background on the bug, if they deem it a bug, so I'm going to wait for them to respond before sinking more time into this.
from cudanative.jl.
Status update from NVIDIA:
The following items have been modified for this bug:
- Status changed from "Open - pending review" to "Open - Fix being tested"
... but I haven't got access to their bug tracker (I'm only on its CC list), so I can't look at or ask for more details 😕
from cudanative.jl.
At least you know it is in fact their fault :)
from cudanative.jl.
The following items have been modified for this bug:
- Status changed from "Open - Fix being tested" to "Closed - Fixed"
No idea how / starting which version / ... though (still don't allow me access to the bug tracker).
from cudanative.jl.
Revisited this issue. Seems like it's still there, at least on NVIDIA driver 375.39, but I found out that it only reproduces on `sm_61` hardware or newer. I haven't heard back from NVIDIA, so I don't know which driver includes the fix, and the only system with `sm_61` hardware I have is locked to driver 375.39...
Anyone with `sm_61` hw on more recent drivers care to test this? I've updated the repro scripts too.
from cudanative.jl.
I only have access to `sm_60`, but I could test it on that.
from cudanative.jl.
Great! Please send me the output (to verify the bug is still there), the SASS files generated by `ptx.jl` (remove existing ones first), and the driver version. No hurry though; it's not like we can do much about it. But given some extra data points, it might be possible to re-enable bounds checking...
from cudanative.jl.
Bug still there on 375.66 (current long-lived).
from cudanative.jl.
It looks like you've discovered https://bugs.llvm.org/show_bug.cgi?id=27738, or something related. Unfortunately we've gotten zero movement from NVIDIA on this in the ~1.5 years since we discovered it ourselves and brought it to their attention. It's possible that CUDA 9's ptxas will be better, but I don't expect a proper fix except inasmuch as "buy a Volta card and use the new `sync` intrinsics" is a fix.
Yours is the cleanest reduction of this bug I've seen, btw.
from cudanative.jl.
FYI, @timshen91 is rolling out an incomplete fix for this in LLVM, and working on the full fix. He'll post details in the bug.
Empirically, the partial fix he has in hand fixes this problem for everything we've seen on our end. We'd be curious to hear if it fixes anything for you all.
from cudanative.jl.
Oh cool, thanks for the ping! I'll have a look about reproducing, since it's a while ago since I last looked at this. We also mentioned this issue to NVIDIA and they were going to look into giving us more info; if that happens I'll update here.
from cudanative.jl.
Any LLVM whose revision is greater than or equal to r328885 should include my partial fix.
I tried to use 367.48 nvcc and ptxas (but with newer driver) to reproduce the bug but failed. I'll wait for @maleadt for a short period of time and see what will happen. :)
from cudanative.jl.
Similarly, I had to revert to 375.66, as I could not reproduce the issue on 384.111 (Debian stable BPO).
Testing on r329021, it seems like the bug is still there though (on `sm_61`).
I'll recreate a full non-Julia MWE here so that you can test for yourself:
`working.ll`:
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
@shmem = internal addrspace(3) global [4 x i32] zeroinitializer, align 4
define void @kernel(i32*) {
top:
%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%2 = and i32 %1, 31
%3 = icmp eq i32 %2, 0
br i1 %3, label %lane0_boundscheck, label %sync_shfl
lane0_boundscheck:
%4 = icmp ugt i32 %1, 3
br i1 %4, label %lane0_oob, label %lane0_shmem
lane0_oob:
tail call void @llvm.trap()
unreachable
sync_shfl:
tail call void @llvm.nvvm.barrier0()
%5 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 32, i32 1, i32 7199)
br i1 %3, label %lane0_writeback, label %end
lane0_shmem:
%6 = getelementptr [4 x i32], [4 x i32] addrspace(3)* @shmem, i32 0, i32 %1
store i32 0, i32 addrspace(3)* %6, align 8
br label %sync_shfl
lane0_writeback:
store i32 %5, i32* %0, align 8
br label %end
end:
ret void
}
declare void @llvm.trap()
declare i32 @llvm.nvvm.shfl.down.i32(i32, i32, i32)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier0()
!nvvm.annotations = !{!0}
!0 = !{void (i32*)* @kernel, !"kernel", i32 1}
`broken.ll` (only difference is the ordering of the `lane0_oob` and `sync_shfl` BBs):
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
@shmem = internal addrspace(3) global [4 x i32] zeroinitializer, align 4
define void @kernel(i32*) {
top:
%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%2 = and i32 %1, 31
%3 = icmp eq i32 %2, 0
br i1 %3, label %lane0_boundscheck, label %sync_shfl
lane0_boundscheck:
%4 = icmp ugt i32 %1, 3
br i1 %4, label %lane0_oob, label %lane0_shmem
sync_shfl:
tail call void @llvm.nvvm.barrier0()
%5 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 32, i32 1, i32 7199)
br i1 %3, label %lane0_writeback, label %end
lane0_oob:
tail call void @llvm.trap()
unreachable
lane0_shmem:
%6 = getelementptr [4 x i32], [4 x i32] addrspace(3)* @shmem, i32 0, i32 %1
store i32 0, i32 addrspace(3)* %6, align 8
br label %sync_shfl
lane0_writeback:
store i32 %5, i32* %0, align 8
br label %end
end:
ret void
}
declare void @llvm.trap()
declare i32 @llvm.nvvm.shfl.down.i32(i32, i32, i32)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier0()
!nvvm.annotations = !{!0}
!0 = !{void (i32*)* @kernel, !"kernel", i32 1}
Compile to PTX:
llc -mcpu=sm_30 working.ll -o working.ptx
llc -mcpu=sm_30 broken.ll -o broken.ptx
Loader for PTX code:
#include <stdio.h>
#include <cuda.h>
#define CHECK(err) __check(err, __FILE__, __LINE__)
inline void __check(CUresult err, const char *file, const int line) {
if (CUDA_SUCCESS != err) {
const char *name, *descr;
cuGetErrorName(err, &name);
cuGetErrorString(err, &descr);
fprintf(stderr, "CUDA error #%s: %s at %s:%i\n", name, descr, file, line);
abort();
}
}
int test(const char *path) {
CUmodule mod;
CHECK(cuModuleLoad(&mod, path));
CUfunction fun;
CHECK(cuModuleGetFunction(&fun, mod, "kernel"));
int *gpu_val;
CHECK(cuMemAlloc((CUdeviceptr*) &gpu_val, sizeof(int)));
void *args[1] = {&gpu_val};
CHECK(cuLaunchKernel(fun, 1, 1, 1, 4, 1, 1, 0, NULL, args, NULL));
int val;
CHECK(cuMemcpyDtoH(&val, (CUdeviceptr) gpu_val, sizeof(int)));
CHECK(cuModuleUnload(mod));
return val;
}
int main() {
CHECK(cuInit(0));
CUdevice dev;
CHECK(cuDeviceGet(&dev, 0));
CUcontext ctx;
CHECK(cuCtxCreate(&ctx, 0, dev));
printf("working: %d\n", test("working.ptx"));
printf("broken: %d\n", test("broken.ptx"));
CHECK(cuCtxDestroy(ctx));
return 0;
}
Output:
$ clang++ ptx_loader.cpp -o ptx_loader -lcuda
$ ./ptx_loader
working: 32
broken: 0
Note that the generated PTX does differ between LLVM 6.0 and LLVM ToT, but identically so for the working and broken versions:
--- working_6.0.ptx 2018-04-03 10:34:01.000000000 +0200
+++ working_ToT.ptx 2018-04-03 09:57:20.000000000 +0200
@@ -39,12 +39,12 @@
mov.u32 %r5, 32;
shfl.down.b32 %r3, %r5, 1, 7199;
@%p3 bra LBB0_5;
-// %bb.6: // %end
- ret;
+ bra.uni LBB0_6;
LBB0_5: // %lane0_writeback
ld.param.u64 %rd2, [kernel_param_0];
cvta.to.global.u64 %rd1, %rd2;
st.global.u32 [%rd1], %r3;
+LBB0_6: // %end
ret;
LBB0_2: // %lane0_oob
trap;
from cudanative.jl.
I had to revert to 375.66, as I could not reproduce the issue on 384.111 (Debian stable BPO).
I suspect that this is because the driver contains a copy of ptxas, so changing the driver version changes the ptxas version you're using. If you compiled all the way to SASS for your GPU (dunno if your frontend does this) ahead of time using ptxas, then the driver version shouldn't matter.
I can link you to how we do this in XLA if it'd be helpful.
Will leave the analysis here to @timshen91.
from cudanative.jl.
I also reproduced the ptxas miscompile on sm_61 with ptxas 8.0. I modified the launcher to call kernel<<<...>>>(...), and link the pre-compiled ptx into the launcher.
It looks like the `lane0_oob` block breaks the region structure (roughly, a single-entry, single-exit set of basic blocks) of the program control flow graph (CFG). It has a trap instruction.
I attempted four different variations:
a) add a ret after trap.
b) add a bra.uni THE_RET_BLOCK after trap.
c) At ptx level, "inline" the trapping block into the predecessor(s).
d) replace the trap with a ret.
(a) and (b) attempted to fix the control flow graph (CFG) region structure, but they didn't work. Both (c) and (d) work, but I can't extract a principled heuristic from (c) or (d). Hopefully the new ptxas fixes this kind of issue once and for all.
from cudanative.jl.
I suspect that this is because the driver contains a copy of ptxas, so changing the driver version changes the ptxas version you're using.
Yeah, I've been deliberately using the driver for this because I assume it to be faster than having to call `ptxas` (we generate code at run-time, so we care about compiler performance). But with issues like this one, https://github.com/JuliaGPU/CUDAnative.jl/issues/165 (device support of the driver's embedded `ptxas` not matching that of CUDA's `ptxas`, despite reporting the same version), and the fact that it's not possible to probe the embedded compiler's version in order to work around or guard against bugs like this one, maybe I should consider the manual approach.
It has a trap instruction.
Right, I assume this breaks the structured CFG requirement. I'll just avoid emitting `trap` for now; thanks for looking into alternatives though.
By the way, any suggestions on similar fatal error reporting mechanisms? `trap` isn't ideal, both because of this issue and because it leaves CUDA in an unrecoverable state. I guess XLA doesn't require such functionality though.
from cudanative.jl.
By the way, any suggestions on similar fatal error reporting mechanisms? trap isn't ideal, both because of this issue, and because it leaves CUDA in an unrecoverable state. I guess XLA doesn't require such functionality though.
XLA doesn't require this functionality at the moment, but we have talked about adding an assert/trap instruction to XLA. Our idea for implementing it was to use a global variable. Which is ugly, for sure. But I'm not sure how to do the global variable and prevent future kernels from running. That's really what `trap` is for. I guess we could dereference a null pointer or something, although who knows what ptxas will do when it sees that. :-/
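The global-variable idea could look roughly like this, emulated on the CPU (a hedged sketch; `ERROR_FLAG`, the early `return`, and the host-side check are illustrative stand-ins for a device-global word, a guarded exit, and a post-launch readback):

```python
# CPU emulation of "record an error flag instead of trapping".
ERROR_FLAG = [0]          # stands in for a device-global i32

def kernel(tid, out, n):
    if tid >= n:          # where the bounds check would have trapped
        ERROR_FLAG[0] = 1 # would be an atomic store/exchange on a real GPU
        return            # early exit keeps the context recoverable
    out[tid] = 42

out = [0] * 4
for tid in range(6):      # "launch" 6 threads over a 4-element buffer
    kernel(tid, out, 4)

assert out == [42, 42, 42, 42]
assert ERROR_FLAG[0] == 1 # host-side check replaces the trap
```

Note that on a real GPU the early return before a barrier is itself a divergent exit, so this only sidesteps `trap`'s unrecoverable-context problem, not the structured-CFG issue discussed in this thread.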
from cudanative.jl.
Pretty sure I just ran into another occurrence of this bug:
using CUDAnative, CUDAdrv
function cpu(input)
output = Vector{Cint}(2)
for i in 1:2
output[i] = input[1]
end
return output
end
function kernel(input, output, n)
i = threadIdx().x
temp = @cuStaticSharedMem(Cint, 1)
if i == 1
1 <= n || ccall("llvm.trap", llvmcall, Cvoid, ())
temp[1] = input
end
sync_threads()
i <= n || ccall("llvm.trap", llvmcall, Cvoid, ())
unsafe_store!(output, temp[1], i)
end
function gpu(input)
output_gpu = Mem.alloc(Cint, 2)
@cuda threads=2 kernel(input, convert(Ptr{eltype(input)}, output_gpu.ptr), 42)
return Mem.download(Cint, output_gpu, 2)
end
using Test
function main()
input = rand(Cint(1):Cint(100))
@test cpu(input) == gpu(input)
end
`kernel` copies `input[1]` to `output[1]` and `output[2]`, but doesn't produce the correct results when those calls to `llvm.trap` are present (even though the branches are never taken). This is on 396.54 with `sm_35`.
The following PTX is generated:
//
// Generated by LLVM NVPTX Back-End
//
.version 6.0
.target sm_35
.address_size 64
// shmem1 has been demoted
// @ptxcall_kernel_1
.visible .entry ptxcall_kernel_1(
.param .u32 ptxcall_kernel_1_param_0,
.param .u64 ptxcall_kernel_1_param_1,
.param .u64 ptxcall_kernel_1_param_2
)
{
.reg .pred %p<2>;
.reg .b32 %r<7>;
.reg .b64 %rd<4>;
// demoted variable
.shared .align 16 .b8 shmem1[4];
// %bb.0: // %entry
ld.param.u64 %rd1, [ptxcall_kernel_1_param_1];
mov.u32 %r1, %tid.x;
setp.ne.s32 %p1, %r1, 0;
@%p1 bra LBB0_2;
// %bb.1: // %L19.i
ld.param.u32 %r2, [ptxcall_kernel_1_param_0];
st.shared.u32 [shmem1], %r2;
LBB0_2: // %julia_kernel_37106.exit
bar.sync 0;
ld.shared.u32 %r3, [shmem1];
mul.wide.u32 %rd2, %r1, 4;
add.s64 %rd3, %rd1, %rd2;
shr.u32 %r4, %r3, 24;
st.u8 [%rd3+3], %r4;
shr.u32 %r5, %r3, 16;
st.u8 [%rd3+2], %r5;
shr.u32 %r6, %r3, 8;
st.u8 [%rd3+1], %r6;
st.u8 [%rd3], %r3;
ret;
}
// -- End function
//
// Generated by LLVM NVPTX Back-End
//
.version 6.0
.target sm_35
.address_size 64
// shmem1 has been demoted
// @ptxcall_kernel_1
.visible .entry ptxcall_kernel_1(
.param .u32 ptxcall_kernel_1_param_0,
.param .u64 ptxcall_kernel_1_param_1,
.param .u64 ptxcall_kernel_1_param_2
)
{
.reg .pred %p<4>;
.reg .b32 %r<7>;
.reg .b64 %rd<6>;
// demoted variable
.shared .align 16 .b8 shmem1[4];
// %bb.0: // %entry
ld.param.u64 %rd3, [ptxcall_kernel_1_param_2];
mov.u32 %r1, %tid.x;
setp.ne.s32 %p1, %r1, 0;
@%p1 bra LBB0_4;
// %bb.1: // %L19.i
setp.gt.s64 %p2, %rd3, 0;
@%p2 bra LBB0_3;
bra.uni LBB0_2;
LBB0_3: // %L23.i
ld.param.u32 %r2, [ptxcall_kernel_1_param_0];
st.shared.u32 [shmem1], %r2;
LBB0_4: // %L40.i
cvt.u64.u32 %rd1, %r1;
bar.sync 0;
setp.lt.s64 %p3, %rd1, %rd3;
@%p3 bra LBB0_6;
bra.uni LBB0_5;
LBB0_6: // %julia_kernel_37106.exit
ld.param.u64 %rd2, [ptxcall_kernel_1_param_1];
ld.shared.u32 %r3, [shmem1];
shl.b64 %rd4, %rd1, 2;
add.s64 %rd5, %rd2, %rd4;
shr.u32 %r4, %r3, 24;
st.u8 [%rd5+3], %r4;
shr.u32 %r5, %r3, 16;
st.u8 [%rd5+2], %r5;
shr.u32 %r6, %r3, 8;
st.u8 [%rd5+1], %r6;
st.u8 [%rd5], %r3;
ret;
LBB0_5: // %L44.i
trap;
LBB0_2: // %L22.i
trap;
}
// -- End function
Was going to reduce this further (SASS, C++ loader), but these PTX files now seem to hang in both `cuModuleLoad` and `ptxas`; not sure what's going on there, and this has cost me enough time already.
from cudanative.jl.
Ugh, these ptxas bugs are the worst. :(
The PTX LLVM is generating here does not look particularly well-structured to me, though. It's conceivable that better structurization in LLVM would resolve this. I think @timshen91 had been hoping that the current amount of structurization we apply (really, IIRC, it's that we turned off passes that would make the graph less structured) would be sufficient, but maybe you're proving that's not the case.
from cudanative.jl.
Hi @maleadt,
Do you have the LLVM IR, and possibly the set of LLVM flags used to generate the ptx?
from cudanative.jl.
Sure. I'll dump as much relevant info as possible. The original high-level source code is as follows:
function kernel(input::Int32, output::Ptr{Int32}, yes::Bool=true)
i = threadIdx().x
temp = @cuStaticSharedMem(Cint, 1)
if i == 1
yes || no()
temp[1] = input
end
sync_threads()
yes || no()
unsafe_store!(output, temp[1], i)
end
function no()
ccall("llvm.trap", llvmcall, Cvoid, ())
end
That is compiled to the following LLVM IR:
; ModuleID = 'KernelWrapper'
source_filename = "KernelWrapper"
target triple = "nvptx64-nvidia-cuda"
%jl_value_t = type opaque
@shmem1 = addrspace(3) global [1 x i32] zeroinitializer, align 16
define i64 @julia_kernel_36616(i32, i64, i8) local_unnamed_addr {
top:
%3 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !range !2
%4 = icmp eq i32 %3, 0
%5 = and i8 %2, 1
br i1 %4, label %L17, label %L27
L17: ; preds = %top
%6 = icmp eq i8 %5, 0
br i1 %6, label %L19, label %L22
L19: ; preds = %L17
call void @llvm.trap()
unreachable
L22: ; preds = %L17
store i32 %0, i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @shmem1, i64 0, i64 0), align 16, !tbaa !3
br label %L27
L27: ; preds = %top, %L22
call void @llvm.nvvm.barrier0()
%7 = icmp eq i8 %5, 0
br i1 %7, label %L30, label %L33
L30: ; preds = %L27
call void @llvm.trap()
unreachable
L33: ; preds = %L27
%8 = load i32, i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @shmem1, i64 0, i64 0), align 16, !tbaa !3
%9 = zext i32 %3 to i64
%10 = inttoptr i64 %1 to i32*
%11 = getelementptr inbounds i32, i32* %10, i64 %9
store i32 %8, i32* %11, align 1, !tbaa !6
ret i64 %1
}
; Function Attrs: noreturn nounwind
declare void @llvm.trap() #0
; Function Attrs: convergent nounwind
declare void @llvm.nvvm.barrier0() #1
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #2
define void @ptxcall_kernel_1(i32, i64, i8) local_unnamed_addr {
entry:
%3 = call i64 @julia_kernel_36616(i32 %0, i64 %1, i8 %2)
ret void
}
attributes #0 = { noreturn nounwind }
attributes #1 = { convergent nounwind }
attributes #2 = { nounwind readnone }
attributes #3 = { allocsize(1) }
!llvm.module.flags = !{!0}
!nvvm.annotations = !{!1}
!0 = !{i32 1, !"Debug Info Version", i32 3}
!1 = !{void (i32, i64, i8)* @ptxcall_kernel_1, !"kernel", i32 1}
!2 = !{i32 0, i32 1023}
!3 = !{!4, !4, i64 0, i64 0}
!4 = !{!"ptxtbaa_shared", !5, i64 0}
!5 = !{!"ptxtbaa"}
!6 = !{!7, !7, i64 0}
!7 = !{!"jtbaa_data", !8, i64 0}
!8 = !{!"jtbaa"}
Which in turn generates the following PTX:
//
// Generated by LLVM NVPTX Back-End
//
.version 6.0
.target sm_35
.address_size 64
.visible .shared .align 16 .b8 shmem1[4];
// @julia_kernel_36783
.visible .func (.param .b64 func_retval0) julia_kernel_36783(
.param .b32 julia_kernel_36783_param_0,
.param .b64 julia_kernel_36783_param_1,
.param .b32 julia_kernel_36783_param_2
)
{
.reg .pred %p<4>;
.reg .b16 %rs<3>;
.reg .b32 %r<7>;
.reg .b64 %rd<4>;
// %bb.0: // %top
mov.u32 %r1, %tid.x;
setp.ne.s32 %p1, %r1, 0;
ld.param.u8 %rs2, [julia_kernel_36783_param_2];
and.b16 %rs1, %rs2, 1;
@%p1 bra LBB0_4;
// %bb.1: // %L17
setp.ne.s16 %p2, %rs1, 0;
@%p2 bra LBB0_3;
bra.uni LBB0_2;
LBB0_3: // %L22
ld.param.u32 %r2, [julia_kernel_36783_param_0];
st.shared.u32 [shmem1], %r2;
LBB0_4: // %L27
bar.sync 0;
setp.ne.s16 %p3, %rs1, 0;
@%p3 bra LBB0_6;
bra.uni LBB0_5;
LBB0_6: // %L33
ld.param.u64 %rd1, [julia_kernel_36783_param_1];
ld.shared.u32 %r3, [shmem1];
mul.wide.u32 %rd2, %r1, 4;
add.s64 %rd3, %rd1, %rd2;
shr.u32 %r4, %r3, 24;
st.u8 [%rd3+3], %r4;
shr.u32 %r5, %r3, 16;
st.u8 [%rd3+2], %r5;
shr.u32 %r6, %r3, 8;
st.u8 [%rd3+1], %r6;
st.u8 [%rd3], %r3;
st.param.b64 [func_retval0+0], %rd1;
ret;
LBB0_5: // %L30
trap;
LBB0_2: // %L19
trap;
}
// -- End function
// .globl ptxcall_kernel_2 // -- Begin function ptxcall_kernel_2
.visible .entry ptxcall_kernel_2(
.param .u32 ptxcall_kernel_2_param_0,
.param .u64 ptxcall_kernel_2_param_1,
.param .u8 ptxcall_kernel_2_param_2
) // @ptxcall_kernel_2
{
.reg .b32 %r<3>;
.reg .b64 %rd<3>;
// %bb.0: // %entry
ld.param.u32 %r1, [ptxcall_kernel_2_param_0];
ld.param.u64 %rd1, [ptxcall_kernel_2_param_1];
ld.param.u8 %r2, [ptxcall_kernel_2_param_2];
{ // callseq 3, 0
.reg .b32 temp_param_reg;
.param .b32 param0;
st.param.b32 [param0+0], %r1;
.param .b64 param1;
st.param.b64 [param1+0], %rd1;
.param .b32 param2;
st.param.b32 [param2+0], %r2;
.param .b64 retval0;
call.uni (retval0),
julia_kernel_36783,
(
param0,
param1,
param2
);
ld.param.b64 %rd2, [retval0+0];
} // callseq 3
ret;
}
// -- End function
This code generates broken SASS:
code for sm_35
Function : ptxcall_kernel_3
.headerflags @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
/*0008*/ MOV R1, c[0x0][0x44];
/*0010*/ MOV R0, c[0x0][0x150];
/*0018*/ LOP32I.AND R0, R0, 0xff;
/*0020*/ CAL 0x38;
/*0028*/ MOV RZ, RZ;
/*0030*/ EXIT;
/*0038*/ S2R R2, SR_TID.X;
/*0048*/ BFE.U32 R0, R0, 0x800;
/*0050*/ SSY 0xc0;
/*0058*/ ISETP.NE.AND P0, PT, R2, RZ, PT;
/*0060*/ LOP32I.AND R0, R0, 0x1;
/*0068*/ @P0 BRA 0x98;
/*0070*/ ISETP.NE.AND P0, PT, R0, RZ, PT;
/*0078*/ @!P0 NOP.S;
/*0088*/ MOV R3, c[0x0][0x140];
/*0090*/ STS [RZ], R3;
/*0098*/ ISETP.NE.AND P0, PT, R0, RZ, PT;
/*00a0*/ BAR.SYNC 0x0;
/*00a8*/ @P0 BRA 0xd8;
/*00b0*/ BPT.TRAP 0x1;
/*00b8*/ NOP.S;
/*00c8*/ BPT.TRAP 0x1;
/*00d0*/ RET;
/*00d8*/ LDS R0, [RZ];
/*00e0*/ ISCADD R5.CC, R2, c[0x0][0x148], 0x2;
/*00e8*/ MOV32I R3, 0x4;
/*00f0*/ IMAD.U32.U32.HI.X R3, R2, R3, c[0x0][0x14c];
/*00f8*/ MOV R2, R5;
/*0108*/ PRMT R4, RZ, 0x7, R0;
/*0110*/ PRMT R5, RZ, 0x76, R0;
/*0118*/ PRMT R6, RZ, 0x765, R0;
/*0120*/ PRMT R7, R0, 0x7610, R7;
/*0128*/ ST.E.U8 [R2+0x3], R4;
/*0130*/ ST.E.U8 [R2+0x2], R5;
/*0138*/ ST.E.U8 [R2+0x1], R6;
/*0148*/ ST.E.U8 [R2], R7;
/*0150*/ RET;
/*0158*/ BRA 0x158;
/*0160*/ NOP;
/*0168*/ NOP;
/*0170*/ NOP;
/*0178*/ NOP;
Interestingly, changing the kernel wrapper to pass a literal `1` (the result of invoking this kernel with 2 arguments, having the `yes` parameter default to `true`) results in this `1` embedded in the LLVM IR and PTX (which didn't influence optimization, since we don't inline at the LLVM level):
define void @ptxcall_kernel_1(i32, i64) local_unnamed_addr {
entry:
%2 = call i64 @julia_kernel_36630(i32 %0, i64 %1, i8 1), !dbg !65
ret void
}
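For context, this wrapper is what you get when the kernel is launched with only 2 arguments, letting the trailing parameter take its default. A minimal sketch of such an invocation (the signature, body, and default value are assumptions based on the repro above, not the original source):

```julia
using CUDAdrv, CUDAnative

# Hypothetical kernel signature: `yes` defaults to `true`, so a
# 2-argument launch specializes the wrapper on the constant `i8 1`,
# which is what lets ptxas treat the error branches as dead code.
function kernel(a::Cint, ptr::Ptr{Cint}, yes::Bool=true)
    # ... body as in the repro above ...
    return
end

# Launching without the third argument bakes the literal 1 into the
# generated `ptxcall` wrapper (launch syntax elided; requires a GPU):
# @cuda (1, 4) kernel(Cint(42), ptr)
```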
// .globl ptxcall_kernel_2 // -- Begin function ptxcall_kernel_2
.visible .entry ptxcall_kernel_2(
.param .u32 ptxcall_kernel_2_param_0,
.param .u64 ptxcall_kernel_2_param_1
) // @ptxcall_kernel_2
{
.reg .b32 %r<3>;
.reg .b64 %rd<3>;
// %bb.0: // %entry
ld.param.u32 %r1, [ptxcall_kernel_2_param_0];
ld.param.u64 %rd1, [ptxcall_kernel_2_param_1];
mov.u32 %r2, 1;
{ // callseq 0, 0
.reg .b32 temp_param_reg;
.param .b32 param0;
st.param.b32 [param0+0], %r1;
.param .b64 param1;
st.param.b64 [param1+0], %rd1;
.param .b32 param2;
st.param.b32 [param2+0], %r2;
.param .b64 retval0;
call.uni (retval0),
julia_kernel_36676,
(
param0,
param1,
param2
);
ld.param.b64 %rd2, [retval0+0];
} // callseq 0
ret;
}
// -- End function
`ptxas` works with this info and now generates valid SASS:
code for sm_35
Function : ptxcall_kernel_3
.headerflags @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
/*0008*/ MOV R1, c[0x0][0x44];
/*0010*/ CAL 0x28;
/*0018*/ MOV RZ, RZ;
/*0020*/ EXIT;
/*0028*/ S2R R2, SR_TID.X;
/*0030*/ MOV32I R3, 0x4;
/*0038*/ ISETP.NE.AND P0, PT, R2, RZ, PT;
/*0048*/ @!P0 MOV R6, c[0x0][0x140];
/*0050*/ ISCADD R4.CC, R2, c[0x0][0x148], 0x2;
/*0058*/ @!P0 STS [RZ], R6;
/*0060*/ IMAD.U32.U32.HI.X R3, R2, R3, c[0x0][0x14c];
/*0068*/ BAR.SYNC 0x0;
/*0070*/ LDS R0, [RZ];
/*0078*/ MOV R2, R4;
/*0088*/ PRMT R4, RZ, 0x7, R0;
/*0090*/ PRMT R5, RZ, 0x76, R0;
/*0098*/ PRMT R6, RZ, 0x765, R0;
/*00a0*/ PRMT R7, R0, 0x7610, R7;
/*00a8*/ ST.E.U8 [R2+0x3], R4;
/*00b0*/ ST.E.U8 [R2+0x2], R5;
/*00b8*/ ST.E.U8 [R2+0x1], R6;
/*00c8*/ ST.E.U8 [R2], R7;
/*00d0*/ RET;
/*00d8*/ BRA 0xd8;
/*00e0*/ NOP;
/*00e8*/ NOP;
/*00f0*/ NOP;
/*00f8*/ NOP;
You actually don't need the `trap`; a call to `vprintf` + `unreachable` seems sufficient to end up with "unsupported" control flow:
@0 = internal unnamed_addr constant [27 x i8] c"go home ptxas you're drunk\00"
define i64 @julia_kernel_36648(i32, i64, i8) local_unnamed_addr {
top:
%3 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !range !2
%4 = icmp eq i32 %3, 0
%5 = and i8 %2, 1
br i1 %4, label %L17, label %L30
L17: ; preds = %top
%6 = icmp eq i8 %5, 0
br i1 %6, label %L19, label %L25
L19: ; preds = %L17
%7 = call i32 @vprintf(i8* getelementptr inbounds ([27 x i8], [27 x i8]* @0, i64 0, i64 0), i8* null)
unreachable
L25: ; preds = %L17
store i32 %0, i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @shmem1, i64 0, i64 0), align 16, !tbaa !3
br label %L30
L30: ; preds = %top, %L25
call void @llvm.nvvm.barrier0()
%8 = icmp eq i8 %5, 0
br i1 %8, label %L33, label %L39
L33: ; preds = %L30
%9 = call i32 @vprintf(i8* getelementptr inbounds ([27 x i8], [27 x i8]* @0, i64 0, i64 0), i8* null)
unreachable
L39: ; preds = %L30
%10 = load i32, i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @shmem1, i64 0, i64 0), align 16, !tbaa !3
%11 = zext i32 %3 to i64
%12 = inttoptr i64 %1 to i32*
%13 = getelementptr inbounds i32, i32* %12, i64 %11
store i32 %10, i32* %13, align 1, !tbaa !6
ret i64 %1
}
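For reference, Julia source roughly along these lines could lower to the IR above; the function name, message, and error helper are illustrative guesses, and exactly how the `unreachable` is produced depends on how the error path is lowered:

```julia
using CUDAnative

# Illustrative guess at the kernel shape: a print followed by an
# unreachable error path in both the writer and the reader branch.
function kernel(a::Cint, ptr::Ptr{Cint}, yes::Bool)
    shared = @cuStaticSharedMem(Cint, 1)
    if threadIdx().x == 1
        yes || (@cuprintf("go home ptxas you're drunk"); error())  # vprintf + unreachable
        unsafe_store!(shared.ptr, a, 1)
    end
    sync_threads()                                                 # llvm.nvvm.barrier0
    yes || (@cuprintf("go home ptxas you're drunk"); error())      # vprintf + unreachable
    unsafe_store!(ptr, unsafe_load(shared.ptr, 1), threadIdx().x)
    return ptr
end
```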
All this is done with LLVM 6.0, with quite some patches but none specific to NVPTX. Which LLVM flags are relevant here? The PTX MC target is initialized with only an ISA flag set, targeting `sm_35` in this case. I also set `--nvptx-fma-level=1`.
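For completeness, reproducing this outside the Julia process should amount to roughly the following, assuming stock LLVM 6.0 tools and a CUDA toolkit on the path (file names are placeholders):

```shell
# Lower the IR with the NVPTX backend, mirroring the in-process MC setup:
llc -march=nvptx64 -mcpu=sm_35 -nvptx-fma-level=1 kernel.ll -o kernel.ptx

# Assemble with ptxas and dump the resulting SASS for inspection:
ptxas -arch=sm_35 -o kernel.cubin kernel.ptx
cuobjdump -sass kernel.cubin
```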
from cudanative.jl.