Comments (27)
I have access to an InfiniBand cluster with an LSF scheduler. I'm happy to help review/debug/test a PR, but since this isn't functionality I need (for now, anyway), I'm not motivated to write it myself.
I would also be interested in something like this. @raminammour, if you have something that sort of works, I'd be interested to take a look.
I still only hack `bsub` to print out hostnames and then use the built-in `SSHManager`; I have not had time to work on an actual LSF manager.
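A minimal sketch of that workaround, under some assumptions (this is editorial, not code from the thread): LSF sets `LSB_HOSTS` inside the job, `bsub -I` echoes it back with the host list on the last line of stdout, and passwordless SSH to the compute nodes works.

```julia
# Julia 0.6-era sketch: probe LSF for the granted hosts, then connect to
# them with the stock SSHManager via addprocs(hosts).
out   = readstring(`bsub -I -n 4 echo \$LSB_HOSTS`)  # \$ defers expansion to the node's shell
hosts = split(split(strip(out), '\n')[end])          # last line holds the space-separated hosts
addprocs(hosts)                                      # SSHManager: one worker per listed host
```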
Please try #74 and send me feedback.
Ben,
Thank you for the PR!
I am testing on Julia 0.5; I added a few `@show` statements to see what is going on. Mainly, the process hangs, and `bjobs` only shows the job as running for a short time. I tried running `bsub` with `-o` and `-e`, but got nothing useful.
Here is the output:
```
addprocs(ClusterManagers.LSFManager(1,``);dir="/global/j0280401/try_lsf")
bsub_cmd = @cmd("bsub -I \$(manager.flags) -cwd \$dir -J \$jobname \"\$cmd\"") = `bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`
<<Waiting for dispatch ...>>
stream_proc = [open(bsub_cmd) for i = 1:np] = Tuple{Pipe,Base.Process}[(Pipe(closed => open, 0 bytes waiting),Process(`bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`, ProcessRunning))]
(config.io,io_proc) = stream_proc[i] = (Pipe(closed => open, 0 bytes waiting),Process(`bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`, ProcessRunning))
config.userdata = Dict{Symbol,Any}(:task => i,:process => io_proc) = Dict{Symbol,Any}(Pair{Symbol,Any}(:task,1),Pair{Symbol,Any}(:process,Process(`bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`, ProcessRunning)))
push!(launched,config) = WorkerConfig[WorkerConfig(Pipe(closed => open, 0 bytes waiting),#NULL,#NULL,#NULL,#NULL,#NULL,Dict{Symbol,Any}(Pair{Symbol,Any}(:task,1),Pair{Symbol,Any}(:process,Process(`bsub -I -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/JULIA/julia/bin/julia --worker=kcdccvQrc5wopF7P`, ProcessRunning))),#NULL,#NULL,#NULL,#NULL,#NULL,#NULL,#NULL,#NULL,#NULL,#NULL)]
<<Starting on r2i6n5.icex.cluster>>
notify(c) =
```
And the REPL hangs; if I kill it with `^C`, it shows `nprocs` = 2 but it is not connected to the workers.
Hope this helps...
Are you sure you're testing with this PR? I ask because when I run your example, addprocs(LSFManager(3,``); dir="/tmp"), with an `info(bsub_cmd)` strategically placed in `src/lsf.jl`'s `launch()`, I don't see a `-cwd` flag in the `bsub` call:
```
julia> addprocs(LSFManager(3,``); dir="/tmp")
INFO: `bsub -I -env all -J julia-122813 cd /tmp '&&' /home/arthurb/bin/julia-0.6.0/bin/julia --worker=RrC71REPmRmsrOX2`
<<Waiting for dispatch ...>>
<<Waiting for dispatch ...>>
<<Waiting for dispatch ...>>
<<Starting on h07u23>>
<<Starting on h08u14>>
<<Starting on h08u14>>
3-element Array{Int64,1}:
 5
 6
 7
```
This is on Julia 0.6; I've long since migrated from 0.5!
I changed that bit; there are some peculiarities in my environment that sometimes trip up `bsub` because of differently named folders, and the `-cwd` flag normally helps with that (I didn't want the error to come from my environment).
I don't see `info(bsub_cmd)` on my end; here is the file I see:
```julia
export LSFManager, addprocs_lsf

immutable LSFManager <: ClusterManager
    np::Integer
    flags::Cmd
end

function launch(manager::LSFManager, params::Dict, launched::Array, c::Condition)
    try
        dir = params[:dir]
        exename = params[:exename]
        exeflags = params[:exeflags]
        np = manager.np

        jobname = `julia-$(getpid())`

        cmd = `cd $dir '&&' $exename $exeflags $(worker_arg)`
        bsub_cmd = `bsub -I $(manager.flags) -env "all" -J $jobname "$cmd"`

        stream_proc = [open(bsub_cmd) for i in 1:np]

        for i in 1:np
            config = WorkerConfig()
            config.io, io_proc = stream_proc[i]
            config.userdata = Dict{Symbol, Any}(:task => i, :process => io_proc)
            push!(launched, config)
            notify(c)
        end
    catch e
        println("Error launching workers")
        println(e)
    end
end

manage(manager::LSFManager, id::Int64, config::WorkerConfig, op::Symbol) = nothing

function kill(manager::LSFManager, id::Int64, config::WorkerConfig)
    kill(get(config.userdata)[:process], 15)
    close(get(config.io))
end

addprocs_lsf(np::Integer; flags::Cmd=``) = addprocs(LSFManager(np, flags))
```
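For reference, a minimal usage sketch of the `addprocs_lsf` wrapper defined at the bottom of that file (`-q myqueue` is a hypothetical site-specific flag, passed straight through to `bsub`):

```julia
using ClusterManagers
# Start 4 LSF workers; extra bsub flags go in via the `flags` keyword.
addprocs_lsf(4, flags=`-q myqueue`)
```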
On Julia 0.6 I get:
```
addprocs(ClusterManagers.LSFManager(1,``);dir="/global/j0280401/try_lsf")
INFO: `bsub -I -env all -J julia-123850 cd /global/j0280401/try_lsf '&&' /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WD6Hw0RsixmmrKib`
<<Waiting for dispatch ...>>
<<Starting on r1i1n3.icex.cluster>>
ERROR: Unable to read host:port string from worker. Launch command exited with error?
read_worker_host_port(::Pipe) at ./distributed/cluster.jl:236
connect(::ClusterManagers.LSFManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:391
create_worker(::ClusterManagers.LSFManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::ClusterManagers.LSFManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{ClusterManagers.LSFManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335
Stacktrace:
[1] sync_end() at ./task.jl:287
[2] macro expansion at ./task.jl:303 [inlined]
[3] #addprocs_locked#30(::Array{Any,1}, ::Function, ::ClusterManagers.LSFManager) at ./distributed/cluster.jl:344
[4] (::Base.Distributed.#kw##addprocs_locked)(::Array{Any,1}, ::Base.Distributed.#addprocs_locked, ::ClusterManagers.LSFManager) at ./<missing>:0
[5] #addprocs#29(::Array{Any,1}, ::Function, ::ClusterManagers.LSFManager) at ./distributed/cluster.jl:319
[6] (::Base.Distributed.#kw##addprocs)(::Array{Any,1}, ::Base.Distributed.#addprocs, ::ClusterManagers.LSFManager) at ./<missing>:0
```
EDIT: I added the info command like you suggested and used the original file.
If you clip out `/data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WD6Hw0RsixmmrKib` and run it on its own from the unix command line, what do you see printed to stdout? It should be something like `julia_worker:9009#10.36.11.34`.
I do see that, and then it waits 60 seconds and exits because the master did not connect.
What do you see if you execute the entire `bsub` command on the unix command line?
```
$ bsub -I -env all -J julia-122813 cd /tmp '&&' /home/arthurb/bin/julia-0.6.0/bin/julia --worker=RrC71REPmRmsrOX2
Job <484704> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on h06u04>>
julia_worker:9009#10.36.106.14
```
That is where it gets tricky: if I submit with `-I`, it exits before printing it:
```
bsub -I -env all -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=kcdccvQrc5wopF7P
Job <407259> is submitted to default queue <LAURE_USERS>.
<<Waiting for dispatch ...>>
<<Starting on r4i7n4.icex.cluster>>
```
Without `-I`:
```
bsub -env all -cwd /global/j0280401/try_lsf -J julia-115462 cd /global/j0280401/try_lsf && /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=kcdccvQrc5wopF7P
Job <407260> is submitted to default queue <LAURE_USERS>.
julia_worker:9009#10.55.167.198
```
Hmmm, on my system it's the opposite: it only works with the `-I` flag. One option to make it work on all systems would be to leave the flag out and, on those systems that need it, require the user to specify it with ``addprocs(LSFManager(3,`-I`); dir="/tmp")``.
From Julia, in either case (interactive `bsub` or not), I don't see the `julia_worker:9009#10.55.167.198` line if I add `readlines(stream_proc[1][1])`.
If you delete the `-I` flag on line 19 of lsf.jl (and your `-cwd` modification), does it then work if you try ``addprocs(ClusterManagers.LSFManager(1,`-cwd /global/j0280401/try_lsf`); dir="/global/j0280401/try_lsf")``?
```
addprocs(ClusterManagers.LSFManager(1,`-cwd /global/j0280401/try_lsf`);dir="/global/j0280401/try_lsf")
INFO: `bsub -cwd /global/j0280401/try_lsf -env all -J julia-123850 cd /global/j0280401/try_lsf '&&' /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WD6Hw0RsixmmrKib`
ERROR: Unable to read host:port string from worker. Launch command exited with error?
read_worker_host_port(::Pipe) at ./distributed/cluster.jl:236
connect(::ClusterManagers.LSFManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:391
create_worker(::ClusterManagers.LSFManager, ::WorkerConfig) at ./distributed/cluster.jl:443
setup_launched_worker(::ClusterManagers.LSFManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:389
(::Base.Distributed.##33#36{ClusterManagers.LSFManager,WorkerConfig,Array{Int64,1}})() at ./task.jl:335
Stacktrace:
[1] sync_end() at ./task.jl:287
[2] macro expansion at ./task.jl:303 [inlined]
[3] #addprocs_locked#30(::Array{Any,1}, ::Function, ::ClusterManagers.LSFManager) at ./distributed/cluster.jl:344
[4] (::Base.Distributed.#kw##addprocs_locked)(::Array{Any,1}, ::Base.Distributed.#addprocs_locked, ::ClusterManagers.LSFManager) at ./<missing>:0
[5] #addprocs#29(::Array{Any,1}, ::Function, ::ClusterManagers.LSFManager) at ./distributed/cluster.jl:319
[6] (::Base.Distributed.#kw##addprocs)(::Array{Any,1}, ::Base.Distributed.#addprocs, ::ClusterManagers.LSFManager) at ./<missing>:0
```
The farthest I have been able to get is by deleting the `cd $dir` part:
```
INFO: `bsub -I -cwd /global/j0280401/try_lsf -J julia-123850 /data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WD6Hw0RsixmmrKib`
<<Waiting for dispatch ...>>
<<Starting on r3i2n10.icex.cluster>>
readlines((stream_proc[1])[1]) = String["Job <407302> is submitted to default queue <LAURE_USERS>.", "julia_worker:9016#192.168.159.146", "Master process (id 1) could not connect within 60.0 seconds.", "exiting."]
```
But it just hangs for 60 seconds and only returns after the worker exits. I have tried with `detach` on the command, to no avail; `bsub -Is` and `bsub -Ip` did not help either.
If it dumps the host:port info to stdout when you run `bsub` manually on the unix command line, I have no idea why it wouldn't work from within the `launch` function.
There are a bunch of modifiers to the `-I` flag: `-I{p,S,Sp,Ss,s,X}`. You might try them all.
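A brute-force sketch of trying them all from the REPL (editorial, not from the thread; it assumes this PR's `LSFManager` is loaded and that a failed launch raises an error inside `addprocs`, as seen above — a submission that hangs instead would need `^C`):

```julia
# Try each interactive-mode modifier until one yields a connected worker.
for flag in ("-I", "-Ip", "-IS", "-ISp", "-ISs", "-Is", "-IX")
    try
        ps = addprocs(LSFManager(1, Cmd([flag])); dir="/tmp")
        println("$flag worked"); rmprocs(ps); break
    catch err
        println("$flag failed: $err")
    end
end
```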
I dug deeper yesterday; I had to go back to the code in `Base.Distributed`, and there is not nearly enough info printed there when something goes wrong.
The host:port printed by `julia --worker` is the one under `eth0:...` from `ip addr` on the node. That IP address is unreachable (at least using `ssh`), and thus the `connect(s,host,UInt16(port))` call in `connect_to_worker` does not work. I don't know if one can get `julia --worker` to use the `ib0` IP address, which would work.
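A quick way to check which address a worker would advertise, run on a compute node (a minimal editorial sketch, assuming Julia 0.6 where `getipaddr` lives in Base):

```julia
# getipaddr() is what --worker advertises by default; if this prints the
# unreachable eth0 address rather than the ib0 one, --bind-to is needed.
println(getipaddr())
```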
Not sure if these settings are even standard; certainly not on your system :). All of this goes beyond my expertise, but if you have any suggestions I would be glad to try them.
And thanks again for the PR!
From https://docs.julialang.org/en/latest/manual/parallel-computing/#ClusterManagers-1:
> By default a worker will listen on a free port at the address returned by a call to getipaddr(). A specific address to listen on may be specified by optional argument --bind-to bind_addr[:port]. This is useful for multi-homed hosts.

You could try setting the `exeflags` keyword argument of `addprocs` to `--bind-to $(xxx)`, where `xxx` is a unix command which returns the ib0 address.
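A sketch of what that could look like (an assumption, not something verified here: whether `$(hostname -i)` survives `bsub`'s re-quoting and expands on the execution host is site-dependent, and the `xargs` pipeline in the next comment proved more robust):

```julia
# \$ keeps the dollar sign literal in the Julia Cmd, so the remote shell,
# not Julia, expands $(hostname -i) into the node's address.
addprocs(LSFManager(1, ``); dir="/tmp", exeflags=`--bind-to \$(hostname -i)`)
```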
Ben,
I was thankfully reading the same section; you were right, it works!
The commands that worked for me, in case it helps someone:
```julia
cmd = `cd $dir ";" hostname -i "|" xargs $exename $exeflags $(worker_arg) --bind-to `
bsub_cmd = `bsub -I $(manager.flags) -J $jobname "$cmd"`
```
Now I just need to figure out how to launch more than one Julia worker on the same host, maybe the number of cores by default :)
I really appreciate your effort and patience!
Does it work to leave my PR intact and then use addprocs(LSFManager(3,``); exeflags="--bind-to=\$\(hostname\ -l\)")?
No, but this works on the command line:
```
bsub -I -J julia-126054 cd /home/j0280401;/data/gpfs/Users/j0280401/julia-next/julia-903644385b/bin/julia --worker=WE8ZtbvDKYgAIgdt --bind-to $(hostname -i)
```
(`-i`, not `-l`.)
My system never liked the `-env all` (it keeps saying `all` is not found), the `exeflags` have to come last in `cmd` if something like this is to work, and it is `--bind-to`, not `--bind-to=`.
It is just a matter of figuring out the right backticks to get it right.
@raminammour did you ever get LSF working? If so, it'd be nice to merge #74.
Hey Ben,
Yes I did. My solution is not too clean, as I had to use the `xargs` trick to get the right IP address on one system (the cleanest solution would be for `getipaddr()` to have some options to sway its output when there are multiple addresses). Also, on that system `bsub` assigns a full node (with many cores), so I mimicked the `SSHManager` to get extra workers per host.
On another system the pull request worked almost out of the box.
I don't have access to the code now, but I will soon, and will share it.
At any rate, the PR was very useful to me, and to many others I am sure!
Thank you :)
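A minimal sketch of that extra-workers-per-host idea (editorial, not the author's code; the single-node probe and `Sys.CPU_CORES` as the per-host count are assumptions):

```julia
# Julia 0.6-era sketch: probe LSF for one full node, then fill it with one
# worker per core using the stock SSHManager's (host, count) form.
out  = readstring(`bsub -I hostname`)  # hypothetical single-node probe
host = split(strip(out), '\n')[end]    # last output line is the hostname
addprocs([(host, Sys.CPU_CORES)])      # count-per-host, as addprocs supports
```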
@raminammour Can you share your full script? I'm lost in this discussion.
@AStupidBear see #74