Comments (6)

maleadt commented on July 22, 2024

How does training halt? You need to be getting some kind of error or reason it halts, right?

Can you profile the code and see where the time goes? A typical problem is memory: GPUs have much less RAM, which doesn't compose well with Julia's GC when running close to the memory limit. Yours has 11GB though, so unless your model is huge that should generally work fine.
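
As a rough first check before a full profile, one could measure how much of a step's wall time is spent in the garbage collector. This is only a sketch: `fake_step` is a hypothetical CPU-only stand-in for a training step, not code from the reporter's script, and it says nothing about GPU-side allocation behavior.

```julia
# Hypothetical stand-in for one training step: it allocates short-lived
# temporaries that the GC has to clean up.
function fake_step(n::Int)
    acc = 0.0
    for _ in 1:n
        tmp = rand(Float32, 256, 256)   # short-lived allocation
        acc += sum(tmp)
    end
    return acc
end

fake_step(1)  # warm up so compilation time is not measured

# `@timed` returns (value, elapsed seconds, bytes allocated, GC seconds, ...).
# Positional indexing is used so this works whether the result is a plain
# tuple (Julia 1.3/1.4) or a named tuple (later versions).
stats = @timed fake_step(200)
elapsed, gc_seconds = stats[2], stats[4]
gc_fraction = elapsed > 0 ? gc_seconds / elapsed : 0.0

println("elapsed: ", elapsed, " s, GC: ", gc_seconds, " s, GC share: ",
        round(100 * gc_fraction; digits = 1), " %")
```

A GC share that climbs toward 100% as training proceeds would be consistent with the collector taking up most of the time.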

Also, which version of Julia?

from cuarrays.jl.

lpjiang97 commented on July 22, 2024

> How does training halt? You need to be getting some kind of error or reason it halts, right?

The println statement (second-to-last line) stops printing. I have tried giving it about an hour, with no updates.

> Can you profile the code and see where the time goes? A typical problem is memory: GPUs have much less RAM, which doesn't compose well with Julia's GC when running close to the memory limit. Yours has 11GB though, so unless your model is huge that should generally work fine.

When I look at CuArrays.memory_status(), the usage is very high (99.97%), which I found very weird. I have a similar model written in PyTorch that takes only about 500 MB.

> Also, which version of Julia?

I have tested this on Julia 1.3.1 and Julia 1.4.1 and both have this problem. I also wonder if this is related to #350, but I have tried taking out sqrt and still have the same issue.

Update: if I call GC.gc() at the end of each epoch, the problem goes away.
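
A minimal sketch of that workaround, assuming an epoch-based loop. `run_epoch!` is a hypothetical CPU-only stand-in for the actual Flux training code (which is not shown in this thread); in the real script the temporaries would be CuArrays.

```julia
# Hypothetical stand-in for one epoch of training. It allocates temporaries
# that become garbage, as a real training loop would.
function run_epoch!(losses::Vector{Float64}, nbatches::Int)
    total = 0.0
    for _ in 1:nbatches
        batch = rand(Float32, 128, 128)   # temporary that becomes garbage
        total += sum(abs2, batch)
    end
    push!(losses, total / nbatches)
    return losses
end

losses = Float64[]
for epoch in 1:3
    run_epoch!(losses, 50)
    println("epoch ", epoch, " loss ", losses[end])
    # Workaround described above: force a full collection after each epoch
    # so unreachable arrays are finalized before memory runs out.
    GC.gc()
end
```

A full collection costs time on every call, so this is a stopgap rather than a fix for the underlying problem.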

maleadt commented on July 22, 2024

> Update: if I call GC.gc() at the end of each epoch, the problem goes away.

Ah, so that problem again. I thought training exited, but it hangs, which is consistent with the GC taking up all time. This is a tough problem, but it's good to have another (small-ish) reproducer.

You can also try using the new, WIP memory pool: JULIA_CUDA_MEMORY_POOL=split. It improves performance in some workloads, but ultimately still falls back on the Julia GC, so it might have the same problem with your use case.

colinxs commented on July 22, 2024

Hey @maleadt, I work with @lpjiang97 and spent a bit of time looking into this. For what it's worth, I was able to replicate this on my machine (information below) in 3 of 5 attempts, each in a fresh Julia session. When it does lock up, the stacktrace shows that it's waiting on the lock in either alloc or free:

Stacktrace:
 [1] top-level scope at /home/colinxs/workspace/dev/Experiments/flux/foo/debug0.jl:29
 [2] lock(::Base.Threads.SpinLock) at ./locks-mt.jl:71
 [3] macro expansion at ./lock.jl:181 [inlined]
 [4] free(::CUDAdrv.CuPtr{Nothing}) at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/memory/binned.jl:393
 [5] macro expansion at /home/colinxs/.julia/packages/TimerOutputs/NvIUx/src/TimerOutput.jl:245 [inlined]
 [6] macro expansion at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/memory.jl:231 [inlined]
 [7] macro expansion at ./util.jl:234 [inlined]
 [8] free at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/memory.jl:230 [inlined]
 [9] _unsafe_free!(::CuArray{Float32,2,Nothing}) at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/array.jl:51
 [10] unsafe_free!(::CuArray{Float32,2,Nothing}) at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/array.jl:40

Single GPU (1050)

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_DOWNLOAD = /home/colinxs/pkg/installed/julia
  JULIA_NUM_THREADS = 6
  JULIA_PKG_DEVDIR = /home/colinxs/workspace/juliadev

julia> Pkg.status()
Status `~/workspace/dev/Experiments/flux/foo/Project.toml`
  [3895d2a7] CUDAapi v4.0.0
  [be33ccc6] CUDAnative v3.0.4
  [3a865a2d] CuArrays v2.1.0
  [587475ba] Flux v0.10.4

colinxs commented on July 22, 2024

I should've checked the open issues first; it appears you're already well aware of this: #685

maleadt commented on July 22, 2024

Correct, I suspected a performance issue but the backtrace is useful in identifying the actual issue. I'll have a look at the deadlock, since a couple of users have been running into this.
