Comments (27)

bjarthur commented on July 2, 2024

i have access to an infiniband cluster with an LSF scheduler. am happy to help review/debug/test a PR, but since this is not functionality i need (now anyway) i'm not motivated to write it.

grero commented on July 2, 2024

I would also be interested in something like this. @raminammour, if you have something that sort of works, I'd be interested to take a look.

raminammour commented on July 2, 2024

I still only hack bsub to print out hostnames and use SSHManager; I have not had time to work on an actual LSF manager.
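
(For the record, a rough sketch of that hack; the job submission and output filtering here are assumptions, the point is just to let LSF pick the hosts and hand them to the stock SSHManager:)

    # hedged sketch: ask LSF which host a job lands on, then start workers
    # there over ssh with the built-in manager. bsub also prints job-status
    # lines, so filter them out (the exact output format is an assumption).
    lines = split(readstring(`bsub -I -J gethost hostname`), '\n')
    hosts = filter(l -> !isempty(l) && !startswith(l, "Job") && !startswith(l, "<<"), lines)
    addprocs(hosts)   # one ssh worker per returned hostname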

bjarthur commented on July 2, 2024

please try #74 and send me feedback

raminammour commented on July 2, 2024

Ben,

Thank you for the PR!

I am testing on julia 0.5; I added a few @show statements to see what is going on. Mainly, the process hangs, and bjobs shows the job as running only for a short time. I tried running bsub with -o and -e, but got nothing useful.

Here is the output:

addprocs(ClusterManagers.LSFManager(1,``);dir="/global/j0280401/try_lsf")

bsub_cmd = @cmd("bsub -I \$(manager.flags) -cwd \$dir -J \$jobname \"\$cmd\"") = `bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`
<<Waiting for dispatch ...>>
stream_proc = [open(bsub_cmd) for i = 1:np] = Tuple{Pipe,Base.Process}[(Pipe(closed => open, 0 bytes waiting),Process(`bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`, ProcessRunning))]
(config.io,io_proc) = stream_proc[i] = (Pipe(closed => open, 0 bytes waiting),Process(`bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`, ProcessRunning))
config.userdata = Dict{Symbol,Any}(:task => i,:process => io_proc) = Dict{Symbol,Any}(Pair{Symbol,Any}(:task,1),Pair{Symbol,Any}(:process,Process(`bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`, ProcessRunning)))
push!(launched,config) = WorkerConfig[WorkerConfig(Pipe(closed => open, 0 bytes waiting),#NULL,#NULL,#NULL,#NULL,#NULL,Dict{Symbol,Any}(Pair{Symbol,Any}(:task,1),Pair{Symbol,Any}(:process,Process(`bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`, ProcessRunning))),#NULL,#NULL,#NULL,#NULL,#NULL,#NULL,#NULL,#NULL,#NULL,#NULL)]
<<Starting on r2i6n5.icex.cluster>>
notify(c) =

And the REPL hangs; if I kill it with ^C, it shows nprocs() == 2 but is not connected to the workers.

Hope this helps...

bjarthur commented on July 2, 2024

are you sure you're testing with this PR? i ask, because when i run your example, addprocs(LSFManager(3,``); dir="/tmp"), with an info(bsub_cmd) strategically placed in src/lsf.jl:launch(), i don't see a -cwd flag in the bsub call:

julia> addprocs(LSFManager(3,``); dir="/tmp")
INFO: `bsub -I -env all -J julia-122813 cd /tmp '&&' /home/arthurb/bin/julia-0.6.0/bin/julia --worker=RrC71REPmRmsrOX2`
<<Waiting for dispatch ...>>
<<Waiting for dispatch ...>>
<<Waiting for dispatch ...>>
<<Starting on h07u23>>
<<Starting on h08u14>>
<<Starting on h08u14>>
3-element Array{Int64,1}:
 5
 6
 7

this is on julia 0.6. i've long since migrated from 0.5!

raminammour commented on July 2, 2024

I changed that bit; there are some peculiarities in the environment that sometimes trip up bsub because of different folder naming, and the -cwd flag normally helps (I didn't want the error to come from my environment).

I don't see info(bsub_cmd) on my end; here is the file I see:

export LSFManager, addprocs_lsf

immutable LSFManager <: ClusterManager
    np::Integer    # number of workers to launch
    flags::Cmd     # extra flags passed through to bsub
end

function launch(manager::LSFManager, params::Dict, launched::Array, c::Condition)
    try
        dir = params[:dir]
        exename = params[:exename]
        exeflags = params[:exeflags]

        np = manager.np

        jobname = `julia-$(getpid())`

        # quote the worker command so bsub hands it to a shell on the node
        cmd = `cd $dir '&&' $exename $exeflags $(worker_arg)`
        bsub_cmd = `bsub -I $(manager.flags) -env "all" -J $jobname "$cmd"`

        # one interactive bsub per worker; keep the pipe open to read the
        # worker's host:port line
        stream_proc = [open(bsub_cmd) for i in 1:np]

        for i in 1:np
            config = WorkerConfig()
            config.io, io_proc = stream_proc[i]
            config.userdata = Dict{Symbol, Any}(:task => i, :process => io_proc)
            push!(launched, config)
            notify(c)
        end

    catch e
        println("Error launching workers")
        println(e)
    end
end

manage(manager::LSFManager, id::Int64, config::WorkerConfig, op::Symbol) = nothing

function kill(manager::LSFManager, id::Int64, config::WorkerConfig)
    kill(get(config.userdata)[:process], 15)   # SIGTERM the bsub process
    close(get(config.io))
end

addprocs_lsf(np::Integer; flags::Cmd=``) = addprocs(LSFManager(np, flags))
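
(For anyone trying this PR, a minimal usage sketch of the API above; the queue name is hypothetical:)

    # hypothetical usage; `-q normal` is an assumed queue name, and any
    # other bsub flags can go in `flags`
    using ClusterManagers
    addprocs_lsf(4; flags=`-q normal`)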

raminammour commented on July 2, 2024

On julia 0.6 I get:

addprocs(ClusterManagers.LSFManager(1,``);dir="/global/j0280401/try_lsf")
INFO: `bsub -I -env all -J julia-123850 cd /global/j0280401/try_lsf '&&' /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WD6Hw0RsixmmrKib`
<<Waiting for dispatch ...>>
<<Starting on r1i1n3.icex.cluster>>
ERROR: Unable to read host:port string from worker. Launch command exited with error?
read_worker_host_port(::Pipe) at ./distributed/cluster.jl:236
connect(::ClusterManagers.LSFManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:391
create_worker(::ClusterManagers.LSFManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::ClusterManagers.LSFManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{ClusterManagers.LSFManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335
Stacktrace:
 [1] sync_end() at ./task.jl:287
 [2] macro expansion at ./task.jl:303 [inlined]
 [3] #addprocs_locked#30(::Array{Any,1}, ::Function, ::ClusterManagers.LSFManager) at ./distributed/cluster.jl:344
 [4] (::Base.Distributed.#kw##addprocs_locked)(::Array{Any,1}, ::Base.Distributed.#addprocs_locked, ::ClusterManagers.LSFManager) at ./<missing>:0
 [5] #addprocs#29(::Array{Any,1}, ::Function, ::ClusterManagers.LSFManager) at ./distributed/cluster.jl:319
 [6] (::Base.Distributed.#kw##addprocs)(::Array{Any,1}, ::Base.Distributed.#addprocs, ::ClusterManagers.LSFManager) at ./<missing>:0

EDIT: I added the info command like you suggested and used the original file.

bjarthur commented on July 2, 2024

if you clip out /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WD6Hw0RsixmmrKib and run it on its own from the unix command line, what do you see printed to stdout? should be something like julia_worker:9009#10.36.11.34

raminammour commented on July 2, 2024

I do see that, and then it waits 60 seconds and exits because the master did not connect.

bjarthur commented on July 2, 2024

what do you see if you execute the entire bsub command on the unix command line?

$ bsub -I -env all -J julia-122813 cd /tmp '&&' /home/arthurb/bin/julia-0.6.0/bin/julia --worker=RrC71REPmRmsrOX2
Job <484704> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on h06u04>>
julia_worker:9009#10.36.106.14

raminammour commented on July 2, 2024

That is where it gets tricky: if I submit with -I, it exits before printing the host:port line:

bsub -I -env all -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=kcdccvQrc5wopF7P
Job <407259> is submitted to default queue <LAURE_USERS>.
<<Waiting for dispatch ...>>
<<Starting on r4i7n4.icex.cluster>>

without -I:

bsub -env all -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=kcdccvQrc5wopF7P
Job <407260> is submitted to default queue <LAURE_USERS>.
julia_worker:9009#10.55.167.198

bjarthur commented on July 2, 2024

hmmm. on my system it's the opposite: only works with the -I flag. one option to make it work on all systems would be to leave the flag out, and on those systems that need it, require the user to specify it with addprocs(LSFManager(3,`-I`); dir="/tmp").
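
a minimal sketch of that, against the launch code above (only the default changes, everything else stays):

    # hypothetical: drop -I from the default command; manager.flags carries
    # it on systems that need interactive submission
    bsub_cmd = `bsub $(manager.flags) -env "all" -J $jobname "$cmd"`

and then on systems like mine:

    addprocs(LSFManager(3, `-I`); dir="/tmp")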

raminammour commented on July 2, 2024

From julia, in either case (bsub in interactive mode or not), I don't see the julia_worker:9009#10.55.167.198 line if I add readlines(stream_proc[1][1]).

bjarthur commented on July 2, 2024

if you delete the -I flag on line 19 of lsf.jl (and your -cwd modification) does it not work if you then try addprocs(ClusterManagers.LSFManager(1,`-cwd /global/j0280401/try_lsf`);dir="/global/j0280401/try_lsf") ?

raminammour commented on July 2, 2024

addprocs(ClusterManagers.LSFManager(1,`-cwd /global/j0280401/try_lsf`);dir="/global/j0280401/try_lsf")
INFO: `bsub -cwd /global/j0280401/try_lsf -env all -J julia-123850 cd /global/j0280401/try_lsf '&&' /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WD6Hw0RsixmmrKib`
ERROR: Unable to read host:port string from worker. Launch command exited with error?
read_worker_host_port(::Pipe) at ./distributed/cluster.jl:236
connect(::ClusterManagers.LSFManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:391
create_worker(::ClusterManagers.LSFManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::ClusterManagers.LSFManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{ClusterManagers.LSFManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335
Stacktrace:
 [1] sync_end() at ./task.jl:287
 [2] macro expansion at ./task.jl:303 [inlined]
 [3] #addprocs_locked#30(::Array{Any,1}, ::Function, ::ClusterManagers.LSFManager) at ./distributed/cluster.jl:344
 [4] (::Base.Distributed.#kw##addprocs_locked)(::Array{Any,1}, ::Base.Distributed.#addprocs_locked, ::ClusterManagers.LSFManager) at ./<missing>:0
 [5] #addprocs#29(::Array{Any,1}, ::Function, ::ClusterManagers.LSFManager) at ./distributed/cluster.jl:319
 [6] (::Base.Distributed.#kw##addprocs)(::Array{Any,1}, ::Base.Distributed.#addprocs, ::ClusterManagers.LSFManager) at ./<missing>:0

raminammour commented on July 2, 2024

The farthest I have been able to get is by deleting the cd $dir part:

INFO: `bsub -I -cwd /global/j0280401/try_lsf -J julia-123850 /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WD6Hw0RsixmmrKib`
<<Waiting for dispatch ...>>
<<Starting on r3i2n10.icex.cluster>>
readlines((stream_proc[1])[1]) = String["Job <407302> is submitted to default queue <LAURE_USERS>.", "julia_worker:9016#192.168.159.146", "Master process (id 1) could not connect within 60.0 seconds.", "exiting."]

but it just hangs for 60 seconds and only returns after the worker exits. I have tried detach on the command, to no avail; bsub -Is and bsub -Ip did not help either.

bjarthur commented on July 2, 2024

if it dumps to stdout the host:port info when you run bsub manually on the unix command line, i have no idea why it wouldn't work from within the launch function.

there are a bunch of modifiers to the -I flag. -I{p,S,Sp,Ss,s,X}. might try them all.
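
a quick brute-force sketch to that end (the job name and test command are made up):

    # hypothetical: try each interactive-submission variant from the bsub
    # man page and see which ones echo the command's stdout
    for f in (`-I`, `-Ip`, `-IS`, `-ISp`, `-ISs`, `-Is`, `-IX`)
        println("trying bsub ", f)
        try
            run(`bsub $f -J julia-test hostname`)
        catch err
            println("  failed: ", err)
        end
    end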

raminammour commented on July 2, 2024

I dug deeper yesterday; I had to go back to the code in Base.Distributed, and there is not nearly enough info printed there when something goes wrong.

The host:port printed by julia --worker is the one listed under eth0 in the output of ip addr on the node. That IP address is unreachable (at least over ssh), and thus the call connect(s, host, UInt16(port)) in connect_to_worker fails. I don't know if one can get julia --worker to use the ib0 IP address, which would work.
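
(A hedged illustration of that default, with the address taken from the readlines output above:)

    # what the worker picks by default; 192.168.159.146 is the eth0
    # address from the readlines output earlier in this thread
    julia> getipaddr()
    ip"192.168.159.146"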

I am not sure whether these settings are even standard; certainly not on your system :). All of this goes beyond my expertise, but if you have any suggestions I would be glad to try them.

And thanks again for the PR!

bjarthur commented on July 2, 2024

from https://docs.julialang.org/en/latest/manual/parallel-computing/#ClusterManagers-1:

By default a worker will listen on a free port at the address returned by a call to getipaddr(). A specific address to listen on may be specified by optional argument --bind-to bind_addr[:port]. This is useful for multi-homed hosts.

you could try setting the exeflags keyword argument of addprocs to --bind-to $(xxx), where xxx is a unix command which returns the ib0 address.
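
a rough sketch, hard-coding the ib0 address from your log just to test the idea (getting it dynamically is the hard part, since the lookup has to run on the worker node):

    # hypothetical: bind the worker to a known ib0 address; 10.55.167.198 is
    # taken from the earlier log and hard-coded only for illustration
    addprocs(ClusterManagers.LSFManager(1, ``);
             dir="/global/j0280401/try_lsf",
             exeflags=`--bind-to 10.55.167.198`)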

raminammour commented on July 2, 2024

Ben,

I was thankfully reading the same section; you were right, it works!

The commands that worked for me, in case it helps someone:

        # quoted ";" and "|" reach the remote shell literally, so hostname -i
        # runs on the worker node and xargs appends its output after --bind-to
        cmd = `cd $dir ";" hostname -i "|" xargs $exename $exeflags $(worker_arg) --bind-to`
        bsub_cmd = `bsub -I $(manager.flags) -J $jobname "$cmd"`

Now I just need to figure out how to launch more than one julia worker on the same host, maybe the number of cores by default :)
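
(One hedged possibility for that, not something from the PR: once bsub has allocated a whole node, Julia's built-in ssh-based addprocs can start several workers on it with a (host, count) tuple:)

    # hypothetical: 8 workers on an already-allocated node via the stock
    # SSHManager; the hostname is one from the logs above
    addprocs([("r3i2n10.icex.cluster", 8)]; dir="/global/j0280401/try_lsf")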

I really appreciate your effort and patience!

bjarthur commented on July 2, 2024

does it work to leave my PR intact, and then to addprocs(LSFManager(3,``); exeflags="--bind-to=\$\(hostname\ -l\)") ?

raminammour commented on July 2, 2024

No, but this works on the command line (-i, not -l):

bsub -I -J julia-126054 cd /home/j0280401;/data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WE8ZtbvDKYgAIgdt --bind-to $(hostname -i)

My system never liked -env all (it keeps saying all not found), and the exeflags have to come last in cmd for something like this to work. Also, it is --bind-to, not --bind-to=.

It is just a matter of figuring out the right backticks.

bjarthur commented on July 2, 2024

@raminammour did you ever get LSF working? if so, it'd be nice to merge #74

raminammour commented on July 2, 2024

Hey Ben,

Yes I did. My solution is not too clean, as I had to use the xargs trick to get the right IP address on one system (the cleanest solution would be for getipaddr() to have some options to sway its output when there are multiple addresses). Also, on that system bsub assigns a full node (with many cores), so I mimicked the SSHManager to get extra workers per host.

On another system the pull request worked almost out of the box.

I don't have access to the codes now, but I will soon, and share them.

At any rate, the PR was very useful to me, and to many others I am sure!

Thank you :)

AStupidBear commented on July 2, 2024

@raminammour Can you share your full script? I'm lost in this discussion.

bjarthur commented on July 2, 2024

@AStupidBear see #74
