Comments (8)

maleadt commented on June 24, 2024

Ah, so even setindex can trigger the GC... That would explain this deadlock, which is a separate issue #685, but not the CURAND failures. I'll have a look at fixing the former, for which this backtrace is very helpful.

from cuarrays.jl.

maleadt commented on June 24, 2024

Could you add a call to CUDAdrv.synchronize() before the failing CURAND API call, e.g. in curandGenerateSeeds, to see if we can capture that preexisting failure?

marius311 commented on June 24, 2024

I tried both

@checked function curandGenerateSeeds(generator)
    initialize_api()
    CUDAdrv.synchronize()
    @runtime_ccall((:curandGenerateSeeds, libcurand()), curandStatus_t,
                   (curandGenerator_t,),
                   generator)
end

and

@checked function curandGenerateSeeds(generator)
    CUDAdrv.synchronize()
    initialize_api()
    @runtime_ccall((:curandGenerateSeeds, libcurand()), curandStatus_t,
                   (curandGenerator_t,),
                   generator)
end

and I still get the same error / stack trace, although anecdotally it seems like it takes a little longer to trigger (might just be in my head). Is that what you meant?

maleadt commented on June 24, 2024

Yes, but sadly it doesn't catch anything. I wonder why CURAND thinks there's a preexisting failure then.

Bisecting would be useful. Due to the coupling between CuArrays/CUDAnative/GPUArrays you'll probably have to use the Manifest that's part of CuArrays (only a few commits don't work, you can bisect skip those).

marius311 commented on June 24, 2024

Ok, bisected it to this being the first bad commit: 65a35b1

I checked a couple of times and I'm pretty sure this is it.

I'm using the Manifest like you suggested, so the breakdown is:

  • bad - CuArrays 65a35b1, CUDAapi v4.0.0, CUDAdrv v6.2.0, CUDAnative v3.0.0

  • good - CuArrays 138ece7, CUDAapi v4.0.0, CUDAdrv v6.2.0, CUDAnative v2.10.2 #58c6755

I notice that whenever I switch between these two commits I get

Building the CUDAnative run-time library for your sm_70 device, this might take a while...

which may be relevant.

maleadt commented on June 24, 2024

Hmm, that doesn't help much. Are you using multiple threads or tasks?

marius311 commented on June 24, 2024

My code is single-threaded, and can run in a one-MPI-process-per-GPU configuration. I mentioned above that it sometimes hangs instead of giving me the CURAND_STATUS_PREEXISTING_FAILURE error, but based on your comment and that bisect I ran my code with a single MPI process, and in this case it looks like it's just always hanging. Maybe the CURAND_STATUS_PREEXISTING_FAILURE is a red herring or a side effect of the real issue?

With a single process, I reproduced the hang about 5 times (with the "bad" versions from above); each time I get this identical stack trace if I just kill it:

free at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory/binned.jl:393
unknown function (ip: 0x2aac1ff19ad2)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
macro expansion at /global/homes/m/marius/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:228 [inlined]
macro expansion at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory.jl:218 [inlined]
macro expansion at ./util.jl:234 [inlined]
free at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory.jl:217 [inlined]
_unsafe_free! at /global/u1/m/marius/work/s4/dev/CuArrays/src/array.jl:51
unsafe_free! at /global/u1/m/marius/work/s4/dev/CuArrays/src/array.jl:40
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
jl_apply at /global/u1/m/marius/src/julia-1.4/src/julia.h:1692 [inlined]
run_finalizer at /global/u1/m/marius/src/julia-1.4/src/gc.c:277
jl_gc_run_finalizers_in_list at /global/u1/m/marius/src/julia-1.4/src/gc.c:363
run_finalizers at /global/u1/m/marius/src/julia-1.4/src/gc.c:391 [inlined]
run_finalizers at /global/u1/m/marius/src/julia-1.4/src/gc.c:370
jl_gc_collect at /global/u1/m/marius/src/julia-1.4/src/gc.c:3124
maybe_collect at /global/u1/m/marius/src/julia-1.4/src/gc.c:827 [inlined]
jl_gc_pool_alloc at /global/u1/m/marius/src/julia-1.4/src/gc.c:1142
jl_gc_alloc_ at /global/u1/m/marius/src/julia-1.4/src/julia_internal.h:246 [inlined]
_new_array_ at /global/u1/m/marius/src/julia-1.4/src/array.c:106 [inlined]
_new_array at /global/u1/m/marius/src/julia-1.4/src/array.c:162 [inlined]
jl_alloc_array_1d at /global/u1/m/marius/src/julia-1.4/src/array.c:433
Array at ./boot.jl:405 [inlined]
rehash! at ./dict.jl:193
_setindex! at ./dict.jl:367 [inlined]
setindex! at ./dict.jl:388
macro expansion at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory/binned.jl:384 [inlined]
macro expansion at ./lock.jl:183 [inlined]
alloc at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory/binned.jl:383
unknown function (ip: 0x2aac1fe7fcb5)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
macro expansion at /global/homes/m/marius/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:228 [inlined]
macro expansion at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory.jl:180 [inlined]
macro expansion at ./util.jl:234 [inlined]
alloc at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory.jl:179 [inlined]
CuArray at /global/u1/m/marius/work/s4/dev/CuArrays/src/array.jl:107
CuArray at /global/u1/m/marius/work/s4/dev/CuArrays/src/array.jl:115 [inlined]
similar at ./abstractarray.jl:671 [inlined]
similar at ./abstractarray.jl:670 [inlined]
similar at /global/u1/m/marius/work/s4/dev/CuArrays/src/broadcast.jl:11 [inlined]
copy at ./broadcast.jl:840
materialize at ./broadcast.jl:820
copy at /global/homes/m/marius/.julia/packages/GPUArrays/QDGmr/src/host/abstractarray.jl:173 [inlined]
unsafe_execute! at /global/u1/m/marius/work/s4/dev/CuArrays/src/fft/fft.jl:412 [inlined]
mul! at /global/u1/m/marius/work/s4/dev/CuArrays/src/fft/fft.jl:449 [inlined]
Fourier at /global/homes/m/marius/work/s4/dev/CMBLensing/src/flat_s0.jl:74 [inlined]
Basislike at /global/homes/m/marius/work/s4/dev/CMBLensing/src/generic.jl:56
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
Ð! at /global/homes/m/marius/work/s4/dev/CMBLensing/src/generic.jl:62
v! at /global/homes/m/marius/work/s4/dev/CMBLensing/src/lenseflow.jl:145
unknown function (ip: 0x2aac5fcebc0e)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
RK4Solver at /global/homes/m/marius/work/s4/dev/CMBLensing/src/numerical_algorithms.jl:25
unknown function (ip: 0x2aac5fce911e)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
odesolve at /global/homes/m/marius/work/s4/dev/CMBLensing/src/numerical_algorithms.jl:53
unknown function (ip: 0x2aac5fce85aa)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
back at /global/homes/m/marius/work/s4/dev/CMBLensing/src/flowops.jl:40
#187#back at /global/homes/m/marius/.julia/packages/ZygoteRules/6nssF/src/adjoint.jl:49
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
lnP at /global/homes/m/marius/work/s4/dev/CMBLensing/src/posterior.jl:59 [inlined]
Pullback at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface2.jl:0
#175 at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/lib/lib.jl:170 [inlined]
#344#back at /global/homes/m/marius/.julia/packages/ZygoteRules/6nssF/src/adjoint.jl:49 [inlined]
lnP at /global/homes/m/marius/work/s4/dev/CMBLensing/src/posterior.jl:70 [inlined]
Pullback at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface2.jl:0
lnP at /global/homes/m/marius/work/s4/dev/CMBLensing/src/posterior.jl:53 [inlined]
Pullback at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface2.jl:0
unknown function (ip: 0x2aac5fcdc6e3)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
#460 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:286 [inlined]
Pullback at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface2.jl:0
#36 at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface.jl:36
unknown function (ip: 0x2aac5fcdb043)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
gradient at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface.jl:45
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
jl_apply at /global/u1/m/marius/src/julia-1.4/src/julia.h:1692 [inlined]
do_apply at /global/u1/m/marius/src/julia-1.4/src/builtins.c:643
jl_f__apply_latest at /global/u1/m/marius/src/julia-1.4/src/builtins.c:693
#invokelatest#1 at ./essentials.jl:712 [inlined]
invokelatest at ./essentials.jl:711 [inlined]
#419#420 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/util.jl:272 [inlined]
#419 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/util.jl:272 [inlined]
#418 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:14
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
macro expansion at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:25 [inlined]
macro expansion at /global/homes/m/marius/.julia/packages/ProgressMeter/g1lse/src/ProgressMeter.jl:717 [inlined]
#symplectic_integrate#414 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:23
unknown function (ip: 0x2aac5fc69baa)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
symplectic_integrate##kw at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:17
unknown function (ip: 0x2aac5fc69399)
symplectic_integrate##kw at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:17
unknown function (ip: 0x2aac5fc69195)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
macro expansion at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:284 [inlined]
macro expansion at ./util.jl:234 [inlined]
#458 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:277
iterate at ./generator.jl:47 [inlined]
_collect at ./array.jl:678
collect_similar at ./array.jl:607
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
map at ./abstractarray.jl:2072
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
#sample_joint#449 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:249
unknown function (ip: 0x2aac5c096c3f)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2158 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
sample_joint##kw at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:176
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2158 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
jl_apply at /global/u1/m/marius/src/julia-1.4/src/julia.h:1692 [inlined]
do_call at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:369
eval_value at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:458
eval_stmt_value at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:409 [inlined]
eval_body at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:803
jl_interpret_toplevel_thunk at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:911
jl_toplevel_eval_flex at /global/u1/m/marius/src/julia-1.4/src/toplevel.c:814
jl_toplevel_eval_flex at /global/u1/m/marius/src/julia-1.4/src/toplevel.c:764
slurmstepd: error: *** STEP 550402.8 ON cgpu03 CANCELLED AT 2020-04-16T14:18:58 ***
srun: Terminating job step 550402.8
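
Read from the bottom up, the trace above shows alloc (binned.jl:383) inside a lock block (lock.jl:183) calling setindex! on a Dict, whose rehash! allocates and triggers jl_gc_collect; the collector then runs a finalizer that reaches free (binned.jl:393). If free needs the lock that alloc is still holding, it can never proceed. One general way to avoid this kind of hang is for the finalizer to use trylock and postpone itself when the lock is busy. The sketch below illustrates only that idea; POOL_LOCK, FREED_IDS, Buffer, and release! are hypothetical names, and this is not necessarily the fix that went into CuArrays.

```julia
using Base.Threads: SpinLock

# Hypothetical stand-ins for an allocator's shared state; these names are
# illustrative and do not come from CuArrays.
const POOL_LOCK = SpinLock()   # non-reentrant lock guarding the pool
const FREED_IDS = Int[]        # records which buffers have been released

mutable struct Buffer          # must be mutable to carry a finalizer
    id::Int
end

# Release a buffer without ever blocking on the lock. Returns true if the
# buffer was released now, false if the release was postponed.
function release!(buf::Buffer)
    if !trylock(POOL_LOCK)
        # The lock is held, possibly by the very code whose allocation
        # triggered this GC run. Re-register and retry at a later GC.
        finalizer(release!, buf)
        return false
    end
    try
        push!(FREED_IDS, buf.id)
    finally
        unlock(POOL_LOCK)
    end
    return true
end

# Create a buffer whose release is handled by the non-blocking finalizer.
function make_buffer(id::Int)
    buf = Buffer(id)
    finalizer(release!, buf)
    return buf
end
```

Calling release! by hand while POOL_LOCK is held returns false instead of hanging, and calling it again after unlock returns true, which is the behavior a finalizer needs when it fires in the middle of a locked section.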

marius311 commented on June 24, 2024

This appears to be fixed for me on 2.2.0, presumably by the referenced issue above. Guessing the CURAND thing was just a random side-effect.
