Comments (27)

bjarthur commented on July 2, 2024

i have access to an infiniband cluster with an LSF scheduler. am happy to help review/debug/test a PR, but since this is not functionality i need (now anyway) i'm not motivated to write it.

grero commented on July 2, 2024

I would also be interested in something like this. @raminammour, if you have something that sort of works, I'd be interested to take a look.

raminammour commented on July 2, 2024

I still only hack bsub to print out hostnames and use SSHManager; I have not had time to work on an actual LSF manager.
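
(For the record, a rough sketch of that hack; the job submission and output filtering here are assumptions, the point is just to let LSF pick the hosts and hand them to the stock SSHManager:)

    # hedged sketch: ask LSF which host a job lands on, then start workers
    # there over ssh with the built-in manager. bsub also prints job-status
    # lines, so filter them out (the exact output format is an assumption).
    lines = split(readstring(`bsub -I -J gethost hostname`), '\n')
    hosts = filter(l -> !isempty(l) && !startswith(l, "Job") && !startswith(l, "<<"), lines)
    addprocs(hosts)   # one ssh worker per returned hostname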

bjarthur commented on July 2, 2024

please try #74 and send me feedback

raminammour commented on July 2, 2024

Ben,

Thank you for the PR!

I am testing on julia 0.5; I added a few @show statements to see what is going on. Mainly, the process hangs, and bjobs shows the job as running only for a short time. I tried running bsub with -o and -e, but got nothing useful.

Here is the output:

addprocs(ClusterManagers.LSFManager(1,``);dir="/global/j0280401/try_lsf")

bsub_cmd = @cmd("bsub -I \$(manager.flags) -cwd \$dir -J \$jobname \"\$cmd\"") = `bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`
<<Waiting for dispatch ...>>
stream_proc = [open(bsub_cmd) for i = 1:np] = Tuple{Pipe,Base.Process}[(Pipe(closed => open, 0 bytes waiting),Process(`bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`, ProcessRunning))]
(config.io,io_proc) = stream_proc[i] = (Pipe(closed => open, 0 bytes waiting),Process(`bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`, ProcessRunning))
config.userdata = Dict{Symbol,Any}(:task => i,:process => io_proc) = Dict{Symbol,Any}(Pair{Symbol,Any}(:task,1),Pair{Symbol,Any}(:process,Process(`bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`, ProcessRunning)))
push!(launched,config) = WorkerConfig[WorkerConfig(Pipe(closed => open, 0 bytes waiting),#NULL,#NULL,#NULL,#NULL,#NULL,Dict{Symbol,Any}(Pair{Symbol,Any}(:task,1),Pair{Symbol,Any}(:process,Process(`bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`, ProcessRunning))),#NULL,#NULL,#NULL,#NULL,#NULL,#NULL,#NULL,#NULL,#NULL,#NULL)]
<<Starting on r2i6n5.icex.cluster>>
notify(c) =

And the REPL hangs; if I kill it with ^C, it shows nprocs() == 2 but is not connected to the workers.

Hope this helps...

bjarthur commented on July 2, 2024

are you sure you're testing with this PR? i ask, because when i run your example, addprocs(LSFManager(3,``); dir="/tmp"), with an info(bsub_cmd) strategically placed in src/lsf.jl:launch(), i don't see a -cwd flag in the bsub call:

julia> addprocs(LSFManager(3,``); dir="/tmp")
INFO: `bsub -I -env all -J julia-122813 cd /tmp '&&' /home/arthurb/bin/julia-0.6.0/bin/julia --worker=RrC71REPmRmsrOX2`
<<Waiting for dispatch ...>>
<<Waiting for dispatch ...>>
<<Waiting for dispatch ...>>
<<Starting on h07u23>>
<<Starting on h08u14>>
<<Starting on h08u14>>
3-element Array{Int64,1}:
 5
 6
 7

this is on julia 0.6. i've long since migrated from 0.5!

raminammour commented on July 2, 2024

I changed that bit; there are some peculiarities in the environment that sometimes trip up bsub because of different folder naming, and the -cwd flag normally helps (I didn't want the error to come from my environment).

I don't see info(bsub_cmd) on my end; here is the file I see:

export LSFManager, addprocs_lsf

immutable LSFManager <: ClusterManager
    np::Integer    # number of workers to launch
    flags::Cmd     # extra flags passed through to bsub
end

function launch(manager::LSFManager, params::Dict, launched::Array, c::Condition)
    try
        dir = params[:dir]
        exename = params[:exename]
        exeflags = params[:exeflags]

        np = manager.np

        jobname = `julia-$(getpid())`

        # quote the worker command so bsub hands it to a shell on the node
        cmd = `cd $dir '&&' $exename $exeflags $(worker_arg)`
        bsub_cmd = `bsub -I $(manager.flags) -env "all" -J $jobname "$cmd"`

        # one interactive bsub per worker; keep the pipe open to read the
        # worker's host:port line
        stream_proc = [open(bsub_cmd) for i in 1:np]

        for i in 1:np
            config = WorkerConfig()
            config.io, io_proc = stream_proc[i]
            config.userdata = Dict{Symbol, Any}(:task => i, :process => io_proc)
            push!(launched, config)
            notify(c)
        end

    catch e
        println("Error launching workers")
        println(e)
    end
end

manage(manager::LSFManager, id::Int64, config::WorkerConfig, op::Symbol) = nothing

function kill(manager::LSFManager, id::Int64, config::WorkerConfig)
    kill(get(config.userdata)[:process], 15)   # SIGTERM the bsub process
    close(get(config.io))
end

addprocs_lsf(np::Integer; flags::Cmd=``) = addprocs(LSFManager(np, flags))
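
(For anyone trying this PR, a minimal usage sketch of the API above; the queue name is hypothetical:)

    # hypothetical usage; `-q normal` is an assumed queue name, and any
    # other bsub flags can go in `flags`
    using ClusterManagers
    addprocs_lsf(4; flags=`-q normal`)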

raminammour commented on July 2, 2024

On julia 0.6 I get:

addprocs(ClusterManagers.LSFManager(1,``);dir="/global/j0280401/try_lsf")
INFO: `bsub -I -env all -J julia-123850 cd /global/j0280401/try_lsf '&&' /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WD6Hw0RsixmmrKib`
<<Waiting for dispatch ...>>
<<Starting on r1i1n3.icex.cluster>>
ERROR: Unable to read host:port string from worker. Launch command exited with error?
read_worker_host_port(::Pipe) at ./distributed/cluster.jl:236
connect(::ClusterManagers.LSFManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:391
create_worker(::ClusterManagers.LSFManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::ClusterManagers.LSFManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{ClusterManagers.LSFManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335
Stacktrace:
 [1] sync_end() at ./task.jl:287
 [2] macro expansion at ./task.jl:303 [inlined]
 [3] #addprocs_locked#30(::Array{Any,1}, ::Function, ::ClusterManagers.LSFManager) at ./distributed/cluster.jl:344
 [4] (::Base.Distributed.#kw##addprocs_locked)(::Array{Any,1}, ::Base.Distributed.#addprocs_locked, ::ClusterManagers.LSFManager) at ./<missing>:0
 [5] #addprocs#29(::Array{Any,1}, ::Function, ::ClusterManagers.LSFManager) at ./distributed/cluster.jl:319
 [6] (::Base.Distributed.#kw##addprocs)(::Array{Any,1}, ::Base.Distributed.#addprocs, ::ClusterManagers.LSFManager) at ./<missing>:0

EDIT: I added the info command like you suggested and used the original file.

bjarthur commented on July 2, 2024

if you clip out /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WD6Hw0RsixmmrKib and run it on its own from the unix command line, what do you see printed to stdout? should be something like julia_worker:9009#10.36.11.34

raminammour commented on July 2, 2024

I do see that, and then it waits 60 seconds and exits because the master did not connect.

bjarthur commented on July 2, 2024

what do you see if you execute the entire bsub command on the unix command line?

$ bsub -I -env all -J julia-122813 cd /tmp '&&' /home/arthurb/bin/julia-0.6.0/bin/julia --worker=RrC71REPmRmsrOX2
Job <484704> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on h06u04>>
julia_worker:9009#10.36.106.14

raminammour commented on July 2, 2024

That is where it gets tricky: if I submit with -I, it exits before printing the host:port line:

bsub -I -env all -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=kcdccvQrc5wopF7P
Job <407259> is submitted to default queue <LAURE_USERS>.
<<Waiting for dispatch ...>>
<<Starting on r4i7n4.icex.cluster>>

without -I:

bsub -env all -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=kcdccvQrc5wopF7P
Job <407260> is submitted to default queue <LAURE_USERS>.
julia_worker:9009#10.55.167.198

bjarthur commented on July 2, 2024

hmmm. on my system it's the opposite: only works with the -I flag. one option to make it work on all systems would be to leave the flag out, and on those systems that need it, require the user to specify it with addprocs(LSFManager(3,`-I`); dir="/tmp").
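
a minimal sketch of that, against the launch code above (only the default changes, everything else stays):

    # hypothetical: drop -I from the default command; manager.flags carries
    # it on systems that need interactive submission
    bsub_cmd = `bsub $(manager.flags) -env "all" -J $jobname "$cmd"`

and then on systems like mine:

    addprocs(LSFManager(3, `-I`); dir="/tmp")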

raminammour commented on July 2, 2024

From julia, in either case (bsub in interactive mode or not), I don't see the julia_worker:9009#10.55.167.198 line if I add readlines(stream_proc[1][1]).

bjarthur commented on July 2, 2024

if you delete the -I flag on line 19 of lsf.jl (and your -cwd modification) does it not work if you then try addprocs(ClusterManagers.LSFManager(1,`-cwd /global/j0280401/try_lsf`);dir="/global/j0280401/try_lsf") ?

raminammour commented on July 2, 2024

addprocs(ClusterManagers.LSFManager(1,`-cwd /global/j0280401/try_lsf`);dir="/global/j0280401/try_lsf")
INFO: `bsub -cwd /global/j0280401/try_lsf -env all -J julia-123850 cd /global/j0280401/try_lsf '&&' /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WD6Hw0RsixmmrKib`
ERROR: Unable to read host:port string from worker. Launch command exited with error?
read_worker_host_port(::Pipe) at ./distributed/cluster.jl:236
connect(::ClusterManagers.LSFManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:391
create_worker(::ClusterManagers.LSFManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::ClusterManagers.LSFManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{ClusterManagers.LSFManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335
Stacktrace:
 [1] sync_end() at ./task.jl:287
 [2] macro expansion at ./task.jl:303 [inlined]
 [3] #addprocs_locked#30(::Array{Any,1}, ::Function, ::ClusterManagers.LSFManager) at ./distributed/cluster.jl:344
 [4] (::Base.Distributed.#kw##addprocs_locked)(::Array{Any,1}, ::Base.Distributed.#addprocs_locked, ::ClusterManagers.LSFManager) at ./<missing>:0
 [5] #addprocs#29(::Array{Any,1}, ::Function, ::ClusterManagers.LSFManager) at ./distributed/cluster.jl:319
 [6] (::Base.Distributed.#kw##addprocs)(::Array{Any,1}, ::Base.Distributed.#addprocs, ::ClusterManagers.LSFManager) at ./<missing>:0

raminammour commented on July 2, 2024

The farthest I have been able to get is by deleting the cd $dir part:

INFO: `bsub -I -cwd /global/j0280401/try_lsf -J julia-123850 /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WD6Hw0RsixmmrKib`
<<Waiting for dispatch ...>>
<<Starting on r3i2n10.icex.cluster>>
readlines((stream_proc[1])[1]) = String["Job <407302> is submitted to default queue <LAURE_USERS>.", "julia_worker:9016#192.168.159.146", "Master process (id 1) could not connect within 60.0 seconds.", "exiting."]

but it just hangs for 60 seconds and only returns after the worker exits. I have tried detach on the command, to no avail; bsub -Is and bsub -Ip did not help either.

bjarthur commented on July 2, 2024

if it dumps to stdout the host:port info when you run bsub manually on the unix command line, i have no idea why it wouldn't work from within the launch function.

there are a bunch of modifiers to the -I flag. -I{p,S,Sp,Ss,s,X}. might try them all.
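
a quick brute-force sketch to that end (the job name and test command are made up):

    # hypothetical: try each interactive-submission variant from the bsub
    # man page and see which ones echo the command's stdout
    for f in (`-I`, `-Ip`, `-IS`, `-ISp`, `-ISs`, `-Is`, `-IX`)
        println("trying bsub ", f)
        try
            run(`bsub $f -J julia-test hostname`)
        catch err
            println("  failed: ", err)
        end
    end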

raminammour commented on July 2, 2024

I dug deeper yesterday; I had to go back to the code in Base.Distributed, and there is not nearly enough info printed there when something goes wrong.

The host:port printed by julia --worker is the one listed under eth0 in the output of ip addr on the node. That IP address is unreachable (at least over ssh), and thus the call connect(s, host, UInt16(port)) in connect_to_worker fails. I don't know if one can get julia --worker to use the ib0 IP address, which would work.
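
(A hedged illustration of that default, with the address taken from the readlines output above:)

    # what the worker picks by default; 192.168.159.146 is the eth0
    # address from the readlines output earlier in this thread
    julia> getipaddr()
    ip"192.168.159.146"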

I am not sure whether these settings are even standard; certainly not on your system :). All of this goes beyond my expertise, but if you have any suggestions I would be glad to try them.

And thanks again for the PR!

bjarthur commented on July 2, 2024

from https://docs.julialang.org/en/latest/manual/parallel-computing/#ClusterManagers-1:

By default a worker will listen on a free port at the address returned by a call to getipaddr(). A specific address to listen on may be specified by optional argument --bind-to bind_addr[:port]. This is useful for multi-homed hosts.

you could try setting the exeflags keyword argument of addprocs to --bind-to $(xxx), where xxx is a unix command which returns the ib0 address.
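
a rough sketch, hard-coding the ib0 address from your log just to test the idea (getting it dynamically is the hard part, since the lookup has to run on the worker node):

    # hypothetical: bind the worker to a known ib0 address; 10.55.167.198 is
    # taken from the earlier log and hard-coded only for illustration
    addprocs(ClusterManagers.LSFManager(1, ``);
             dir="/global/j0280401/try_lsf",
             exeflags=`--bind-to 10.55.167.198`)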

raminammour commented on July 2, 2024

Ben,

I was thankfully reading the same section; you were right, it works!

The commands that worked for me, in case it helps someone:

        # quoted ";" and "|" reach the remote shell literally, so hostname -i
        # runs on the worker node and xargs appends its output after --bind-to
        cmd = `cd $dir ";" hostname -i "|" xargs $exename $exeflags $(worker_arg) --bind-to`
        bsub_cmd = `bsub -I $(manager.flags) -J $jobname "$cmd"`

Now I just need to figure out how to launch more than one julia worker on the same host, maybe the number of cores by default :)
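
(One hedged possibility for that, not something from the PR: once bsub has allocated a whole node, Julia's built-in ssh-based addprocs can start several workers on it with a (host, count) tuple:)

    # hypothetical: 8 workers on an already-allocated node via the stock
    # SSHManager; the hostname is one from the logs above
    addprocs([("r3i2n10.icex.cluster", 8)]; dir="/global/j0280401/try_lsf")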

I really appreciate your effort and patience!

bjarthur commented on July 2, 2024

does it work to leave my PR intact, and then to addprocs(LSFManager(3,``); exeflags="--bind-to=\$\(hostname\ -l\)") ?

raminammour commented on July 2, 2024

No, but this works on the command line (-i, not -l):

bsub -I -J julia-126054 cd /home/j0280401;/data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WE8ZtbvDKYgAIgdt --bind-to $(hostname -i)

My system never liked -env all (it keeps saying all not found), and the exeflags have to come last in cmd for something like this to work. Also, it is --bind-to, not --bind-to=.

It is just a matter of figuring out the right backticks.

bjarthur commented on July 2, 2024

@raminammour did you ever get LSF working? if so, it'd be nice to merge #74

raminammour commented on July 2, 2024

Hey Ben,

Yes I did. My solution is not too clean, as I had to use the xargs trick to get the right IP address on one system (the cleanest solution would be for getipaddr() to have some options to sway its output when there are multiple addresses). Also, on that system bsub assigns a full node (with many cores), so I mimicked the SSHManager to get extra workers per host.

On another system the pull request worked almost out of the box.

I don't have access to the codes now, but I will soon, and share them.

At any rate, the PR was very useful to me, and to many others I am sure!

Thank you :)

AStupidBear commented on July 2, 2024

@raminammour Can you share your full script? I'm lost in this discussion.

bjarthur commented on July 2, 2024

@AStupidBear see #74
