ClusterManagers.jl's Issues

SLURM: `srun` doesn't see julia instance using ClusterManagers?

I started a session with an sbatch script with:

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=1G
#SBATCH --job-name=my_first_parallel_julia
#SBATCH --time=00-00:02:00  # Wall Clock time (dd-hh:mm:ss) [max of 14 days]
#SBATCH --output=julia_in_parallel.output  # output and error messages go to this file

julia  my_first_parallel_julia_on_slurm.jl

Then in my Julia script, I tried to run addprocs_slurm(3), but got this error:

srun: Warning: can't honor --ntasks-per-node set to 2 which 
doesn't match the requested tasks 3 with the number of requested nodes 2.

But if I run addprocs_slurm(4), it works fine, even though it seems like that means I have 5 processes working (the master + 4 workers)...

What am I missing?
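One way to avoid the mismatch (a sketch, not an official recommendation) is to size the worker pool from the allocation sbatch actually granted, so the srun step launched by addprocs_slurm matches the --ntasks/--ntasks-per-node layout:

# sketch: derive the worker count from the sbatch allocation itself
using ClusterManagers
n = parse(Int, get(ENV, "SLURM_NTASKS", "1"))   # SLURM_NTASKS is set by sbatch
addprocs_slurm(n)                               # here: 4 workers on top of the master process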

Segfault using SlurmManager on Comet

Hey, I'm playing around with the Comet supercomputing cluster at SDSC and I've hit the following error when running on the login node.

julia> using ClusterManagers

julia> addprocs(SlurmManager(2), partition="debug", t="00:5:00")
connecting to worker 2 out of 2
signal (6): Aborted
while loading no file, in expression starting on line 0
init_once at /home/etrain75/julia/deps/srccache/libuv-ecbd6eddfac4940ab8db57c73166a7378563ebd3/src/threadpool.c:174
uv_once at /home/etrain75/julia/deps/srccache/libuv-ecbd6eddfac4940ab8db57c73166a7378563ebd3/src/unix/thread.c:239
uv__work_submit at /home/etrain75/julia/deps/srccache/libuv-ecbd6eddfac4940ab8db57c73166a7378563ebd3/src/threadpool.c:184
uv_getaddrinfo at /home/etrain75/julia/deps/srccache/libuv-ecbd6eddfac4940ab8db57c73166a7378563ebd3/src/unix/getaddrinfo.c:186
jl_getaddrinfo at /home/etrain75/julia/src/jl_uv.c:712
getaddrinfo at ./socket.jl:591
getaddrinfo at ./socket.jl:606
connect! at ./socket.jl:703
connect at ./stream.jl:949 [inlined]
connect_to_worker at ./managers.jl:483
jl_call_method_internal at /home/etrain75/julia/src/julia_internal.h:192 [inlined]
jl_apply_generic at /home/etrain75/julia/src/gf.c:1930
connect at ./managers.jl:425
jl_call_method_internal at /home/etrain75/julia/src/julia_internal.h:192 [inlined]
jl_apply_generic at /home/etrain75/julia/src/gf.c:1930
create_worker at ./multi.jl:1570
setup_launched_worker at ./multi.jl:1517
jl_call_method_internal at /home/etrain75/julia/src/julia_internal.h:192 [inlined]
jl_apply_generic at /home/etrain75/julia/src/gf.c:1930
#527 at ./task.jl:309
jl_call_method_internal at /home/etrain75/julia/src/julia_internal.h:192 [inlined]
jl_apply_generic at /home/etrain75/julia/src/gf.c:1930
jl_apply at /home/etrain75/julia/src/julia.h:1392 [inlined]
start_task at /home/etrain75/julia/src/task.c:253
unknown function (ip: 0xffffffffffffffff)
Allocations: 2377991 (Pool: 2377020; Big: 971); GC: 1
Aborted
[etrain75@comet-ln2 ~]$ srun: error: comet-14-01: tasks 0-1: Exited with exit code 1

If I run the same command but from a compute node it works as expected.
I expect it has something to do with running on a constrained system (JuliaLang/julia#14807).

This bug occurs on the ClusterManagers.jl master branch (which someone should tag in METADATA) and on current Julia master:

julia> versioninfo()
Julia Version 0.5.0-rc0+146
Commit 37e6397* (2016-08-03 00:47 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libmkl_rt
  LAPACK: libmkl_rt
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

Split up this package

There is not much shared code between the managers, and most of us only use a single workload/cluster manager, so it is difficult to review PRs.

Sometimes addprocs_sge works, sometimes it doesn't

Hi everyone,

I'm currently trying to use ClusterManagers on a cluster running Sun Grid Engine, but I'm running into some problems. Sometimes addprocs_sge(x) simply works and gives me x workers. However, sometimes I run into the same problems as described in the closed issue #24: when I do addprocs_sge(x), I see from qstat that SGE gave me x processors (i.e. they have status 'r'), but then in Julia I still get a timeout. More specifically, I get the following errors:

julia> addprocs_sge(32)
job id is 53661, waiting for job to start ..........................................Worker 2 terminated.
ERROR (unhandled task failure): EOFError: read end of file
Stacktrace:
[1] unsafe_read(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Ptr{UInt8}, ::UInt64) at ./iobuffer.jl:105
[2] unsafe_read(::TCPSocket, ::Ptr{UInt8}, ::UInt64) at ./stream.jl:752
[3] unsafe_read(::TCPSocket, ::Base.RefValue{NTuple{4,Int64}}, ::Int64) at ./io.jl:361
[4] read at ./io.jl:363 [inlined]
[5] deserialize_hdr_raw at ./distributed/messages.jl:170 [inlined]
[6] message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:157
[7] process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:118
[8] (::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at ./event.jl:73

I have already tried increasing JULIA_WORKER_TIMEOUT from 60 seconds to a few minutes but that does not seem to help. I also tried addprocs_qrsh but I get the same problem: sometimes it works and sometimes it doesn't. I'm not really sure if the problems I'm encountering are due to a problem with Julia or due to a problem with the cluster.
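For reference, setting that variable from within the session looks like the sketch below. Note that it only affects the master's side of the handshake unless the scheduler also exports it to the worker processes (e.g. via qsub's -V flag), which may be part of why raising it appears to have no effect:

# sketch: raise the worker timeout before adding workers
ENV["JULIA_WORKER_TIMEOUT"] = "300"   # seconds; read when the timeout is needed
using ClusterManagers
addprocs_sge(32)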

Whenever "addprocs_sge(x)" does work, it's doing an amazing job! But right now it's too unreliable to be used in the project that I'm working on. I hope there's someone who can help me out. Thanks in advance :)

Forgot to mention: I'm running Julia v0.6. The issue mentioned above was closed because upgrading to v0.4 may have solved the problem.

Support for different PBS versions

TBH I don't know much about PBS at all, but thought I'd put this info here in case it's useful to someone or maybe prompts some changes to this repo.

Basically, I'm on a cluster with,

$ qsub --version
pbs_version = PBSPro_10.0.9.110976

For this version, it seems the flag for an array job is -J rather than -t, so initially I was getting an invalid option -- 't' error. Additionally, once I fixed that, it was also saying cannot submit non-rerunable Array Job, so I had to add -r y. Thus my final qsub command here looks like:

`qsub -N $jobname -j oe -r y -k o -J 1-$np $queue $qsub_env $res_list`

Also, the output files created have a . instead of a - separating the job number here, so I had to change that to

"$home/julia-$(getpid()).o$id.$i"

With those changes, it now works beautifully on my setup. I have no idea how non-standard these things are, but I suppose it would be nice to be able to specify them from the addprocs call so I don't have to hack the code myself, and perhaps even nicer if this could somehow be auto-detected.
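For what it's worth, the kind of knob suggested above might look like the hypothetical helper below (names invented here, not the package's API), which builds the qsub array-job flags from caller-supplied settings:

# hypothetical sketch: let the caller override the array-job flag and rerunnable flag
function pbs_array_flags(np::Integer; array_flag::AbstractString="-t", rerunnable::Bool=false)
    flags = String[array_flag, "1-$np"]
    rerunnable && append!(flags, ["-r", "y"])
    return flags
end

pbs_array_flags(8, array_flag="-J", rerunnable=true)   # => ["-J", "1-8", "-r", "y"]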

Comprehensive tests

One of the issues (and one that became even more apparent during the 1.0 transition) is that it is really hard to test this package.
Without CI, development is slow, since we are likely to break use cases that we can't test.
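In the meantime, a minimal smoke test that skips itself when no scheduler is present would at least exercise the happy path (a sketch, assuming a Slurm installation on the test host):

# sketch of a scheduler-aware smoke test
using Test, Distributed, ClusterManagers

if Sys.which("sinfo") === nothing
    @info "No Slurm installation detected; skipping SlurmManager test"
else
    pids = addprocs_slurm(1)
    @test length(pids) == 1
    @test remotecall_fetch(myid, pids[1]) == pids[1]
    rmprocs(pids)
end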

Ideally we could use docker environments to instantiate a "minimal" cluster environment in which we then can run tests.

As an example see:

addprocs_slurm() fails since Julia 0.5

When trying to start new procs on the Slurm cluster via addprocs(SlurmManager(n)) I get the following error message (this worked with 0.4):

julia> using ClusterManagers; addprocs_slurm(1)
srun: job 900 queued and waiting for resources
srun: job 900 has been allocated resources
Worker 2 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
 in process_hdr(::TCPSocket, ::Bool) at ./multi.jl:1410
 in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1299
 in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1276
 in (::Base.##618#619{TCPSocket,TCPSocket,Bool})() at ./event.jl:68
srun: error: worker50: task 0: Exited with exit code 1

Pressing Ctrl+C after the error message (when nothing further happens) crashes Julia :/

InterruptException()
jl_run_once at /home/centos/buildbot/slave/package_tarball64/build/src/jl_uv.c:142
process_events at ./libuv.jl:82
wait at ./event.jl:147
task_done_hook at ./task.jl:191
unknown function (ip: 0x7f8f0ed650a2)
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:189 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1942
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
finish_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:214 [inlined]
start_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:261
unknown function (ip: 0xffffffffffffffff)

Broken With Julia v0.4

The master branch currently does not work with Julia 0.4, due to several problems:

a) Piping is done by a function now
b) String interpolation
c) Several types (WorkerConfig, Cmd) have changed implementation.

I've submitted a pull request that fixes, for me, addprocs_sge.

The problems, and changes were:

When I run addprocs_sge(n), an error is reported that the shell can't find the julia executable:

julia> ClusterManagers.addprocs_sge(35)
WARNING: src::AbstractCmd |> dest::AbstractCmd is deprecated, use pipeline(src,dest) instead.
 in depwarn at deprecated.jl:73
 in |> at deprecated.jl:50
 in launch at /home/gcam/.julia/v0.4/ClusterManagers/src/qsub.jl:30
 in anonymous at task.jl:63
while loading no file, in expression starting on line 0
Unable to read script file because of error: error opening : No such file or directory
batch queue not available (could not run qsub)
0-element Array{Int64,1}

I presume that in line 30 the interpolation of both jobname and queue does not work as expected when pasting the command that launches julia on the worker nodes. Changing

qsub_cmd = `echo $(Base.shell_escape(cmd))` |> (isPBS ? `qsub -N $jobname $queue -j oe -k o -t 1-$np` : `qsub -N $jobname $queue -terse -j y -t 1-$np`)

to

qsub_options = length(queue) > 0 ? [jobname queue] : jobname
cmd = `cd $dir && $exename $exeflags`
qsub_cmd = pipeline(`echo $(Base.shell_escape(cmd))`, (isPBS ? `qsub -N $qsub_options -j oe -k o -t 1-$np` : `qsub -N $qsub_options -terse -j y -t 1-$np`))

Afterwards, the problem is that we're trying to change the detach attribute of Cmd in line 50, but it's an immutable, so changing

 config.io, io_proc = open(cmd)

To

config.io, io_proc = open(detach(cmd))

Fixes that.

Finally, in line 56, the WorkerConfig type does not have a line_buffered attribute, but the IO within it does (and it defaults to true), so commenting that line out fixes everything.

addprocs(n); addprocs(m, cman=SGEManager()) fails, while the reverse works fine

If I add local workers before adding remote SGE workers, the SGE workers terminate with an ECONNREFUSED error. If I reverse the order and add the SGE workers before the local workers, all is good. I presume this is not the desired behavior. Let me know if there is any way I can help debug. Sample output and versioninfo below.

[arthurb@h01u14 ~]$ juliac
[Julia startup banner] Version 0.3.0-prerelease (2014-02-24 14:04 UTC), master/457bca9* (fork: -1 commits, 126 days), x86_64-redhat-linux

julia> addprocs(16)
16-element Array{Any,1}:
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

julia> using ClusterManagers

julia> addprocs(10, cman=SGEManager())
job id is 6365131, waiting for job to start ..............................
10-element Array{Any,1}:
18
19
20
21
22
23
24
25
26
27

julia> Worker 19 terminated.
Worker 20 terminated.
Worker 21 terminated.
Worker 22 terminated.
Worker 18 terminated.
Worker 25 terminated.
Worker 24 terminated.
Worker 23 terminated.
Worker 27 terminated.
Worker 26 terminated.
From worker 18: fatal error on 18: ERROR: connect: connection refused (ECONNREFUSED)
From worker 18: in wait_connected at stream.jl:265
From worker 18: in connect at stream.jl:871
From worker 18: in Worker at multi.jl:119
From worker 18: in anonymous at task.jl:866
From worker 19: fatal error on 19: ERROR: connect: connection refused (ECONNREFUSED)
From worker 19: in wait_connected at stream.jl:265
From worker 19: in connect at stream.jl:871
From worker 19: in Worker at multi.jl:119
From worker 19: in anonymous at task.jl:866
From worker 20: fatal error on 20: ERROR: connect: connection refused (ECONNREFUSED)
From worker 20: in wait_connected at stream.jl:265
From worker 20: in connect at stream.jl:871
From worker 20: in Worker at multi.jl:119
From worker 20: in anonymous at task.jl:866
From worker 21: fatal error on 21: ERROR: connect: connection refused (ECONNREFUSED)
From worker 21: in wait_connected at stream.jl:265
From worker 21: in connect at stream.jl:871
From worker 21: in Worker at multi.jl:119
From worker 21: in anonymous at task.jl:866
From worker 22: fatal error on 22: ERROR: connect: connection refused (ECONNREFUSED)
From worker 22: in wait_connected at stream.jl:265
From worker 22: in connect at stream.jl:871
From worker 22: in Worker at multi.jl:119
From worker 22: in anonymous at task.jl:866
From worker 23: fatal error on 23: ERROR: connect: connection refused (ECONNREFUSED)
From worker 23: in wait_connected at stream.jl:265
From worker 23: in connect at stream.jl:871
From worker 23: in Worker at multi.jl:119
From worker 23: in anonymous at task.jl:866
From worker 24: fatal error on 24: ERROR: connect: connection refused (ECONNREFUSED)
From worker 24: in wait_connected at stream.jl:265
From worker 24: in connect at stream.jl:871
From worker 24: in Worker at multi.jl:119
From worker 24: in anonymous at task.jl:866
From worker 25: fatal error on 25: ERROR: connect: connection refused (ECONNREFUSED)
From worker 25: in wait_connected at stream.jl:265
From worker 25: in connect at stream.jl:871
From worker 25: in Worker at multi.jl:119
From worker 25: in anonymous at task.jl:866
From worker 26: fatal error on 26: ERROR: connect: connection refused (ECONNREFUSED)
From worker 26: in wait_connected at stream.jl:265
From worker 26: in connect at stream.jl:871
From worker 26: in Worker at multi.jl:119
From worker 26: in anonymous at task.jl:866
From worker 27: fatal error on 27: ERROR: connect: connection refused (ECONNREFUSED)
From worker 27: in wait_connected at stream.jl:265
From worker 27: in connect at stream.jl:871
From worker 27: in Worker at multi.jl:119
From worker 27: in anonymous at task.jl:866

julia>
[arthurb@h01u14 ~]$ juliac
[Julia startup banner] Version 0.3.0-prerelease (2014-02-24 14:04 UTC), master/457bca9* (fork: -1 commits, 126 days), x86_64-redhat-linux

julia> using ClusterManagers

julia> addprocs(10, cman=SGEManager())
job id is 6365134, waiting for job to start ..............................
10-element Array{Any,1}:
2
3
4
5
6
7
8
9
10
11

julia> addprocs(16)
16-element Array{Any,1}:
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

julia> versioninfo()
Julia Version 0.3.0-prerelease
Commit 457bca9* (2014-02-24 14:04 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm

julia> Pkg.status()
7 required packages:

  • ClusterManagers 0.0.1
  • DSP 0.0.1
  • Debug 0.0.0
  • Devectorize 0.2.1
  • Distributions 0.4.0
  • MAT 0.2.2
  • WAV 0.2.2
9 additional packages:
  • ArrayViews 0.4.1
  • BinDeps 0.2.12
  • HDF5 0.2.17
  • NumericExtensions 0.5.4
  • PDMats 0.1.0
  • Polynomial 0.0.0
  • StatsBase 0.3.7
  • URIParser 0.0.1
  • Zlib 0.1.5

julia>

Job hangs - “waiting for job to start” on a PBS Cluster

I'm trying to use ClusterManagers on a PBS cluster interactively, e.g.:

julia> using ClusterManagers

julia> addprocs_pbs(2, queue="default")
job id is 135963, waiting for job to start ................................................................

The job seems to hang even though it appears to run on qstat

Job id Name User Time Use S Queue

135963[].pippen julia-26303 snirgaz 0 R default

Any thoughts?

addprocs_sge works incorrectly if the work directory is not the home directory and the list of default options for qsub includes "-cwd"

In our environment, the list of default options for qsub includes "-cwd" to preserve the current directory. If the current working directory is different from the user's home directory, addprocs_sge doesn't work (it can't find the files with the information about the workers). From the file qsub.jl:

filenames(i) = "$home/julia-$(getpid()).o$id-$i","$home/julia-$(getpid())-$i.o$id","$home/julia-$(getpid()).o$id.$i"

Julia expects these files in the home dir, but they are created in the current dir instead.

Document usage

Hi,

I'm interested in using this with SGE but I'm not sure how to use it. The documentation covers how to write a new manager but has no examples of how to use the existing ones. It seems like this is for starting new Julia workers and not for launching Cmds. That is fine, but it would be nice to have some examples of typical usage.

Glen
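For anyone else landing on this issue, typical usage of the existing SGE manager looks roughly like the sketch below (keyword support varies across versions; the calls shown are the ones that appear elsewhere in this tracker):

# rough usage sketch for the SGE manager
using ClusterManagers                 # on Julia >= 0.7, also `using Distributed`

addprocs_sge(4)                       # submits 4 workers through qsub and waits for them
# equivalently: addprocs(SGEManager(4, ""), qsub_env="", res_list="")

@everywhere println("hello from worker ", myid())
rmprocs(workers())                    # release the SGE slots when finished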

Allow to specify queue name

It would be nice if addprocs_sge() somehow allowed specifying the SGE queue name. For our set-up, the default queue is "short", which kills the SGE workers after 1 hour of CPU time.

The queue name can be specified with the "-q $name" option to qsub.
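Other issues in this tracker show a queue keyword on the qsub-based entry points (e.g. addprocs_pbs(2, queue="default")), so the request here presumably amounts to something like the sketch below (not verified against the release that was current at the time):

# sketch of the requested interface
using ClusterManagers
addprocs_sge(8, queue="long")   # would translate to a `qsub ... -q long` submission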

HTCondor and windows

I'm trying to edit the HTCondor manager to get it running on Windows. I can get jobs submitted, but telnet connections don't seem to work. I have tried just running the generated .bat, but that fails to connect. Running it in msys doesn't work either. I can't guarantee telnet is even installed on the machines.

Is there an alternative way to get a connection going? Unfortunately the "cluster" is just all the PCs in the lab, which run Windows, limiting my current usage to making jobs that run a script, save, and return the output.

Slurm error

Hello

I'm just getting started with HPC; my submission system is Slurm.
I'm getting this error when trying to start up some processes:

julia> using ClusterManagers
julia> addprocs(SlurmManager(2), partition="debug", t="00:5:00")
Error launching Slurm job:
MethodError(length,(:all_to_all,))
0-element Array{Int64,1}

Is it necessary to write a submission script and submit it before this works? If so, what should one look like?

Any suggestions appreciated!

Recommend using `salloc` before using SlurmManager

Using addprocs_slurm to add 160 workers on a cluster, I noticed that workers were being dropped every 220 seconds while waiting for resources to be allocated. Commenting out lines 51:54 in https://github.com/JuliaParallel/ClusterManagers.jl/blob/master/src/slurm.jl

if time() > t0 + 60 + np
    warn("dropping worker: file not created in $(60 + np) seconds")
    break
end

fixed the problem, but this is a hack. Presumably there should be a way to start this timeout countdown only after resources have been allocated / the job has made it through the slurm batch queue.

Proper ClusterManager config for SharedArrays in SGE environment

I am trying to run a Julia script in an SGE environment on my department's compute cluster. The script makes heavy use of SharedArray objects, so I need to tell both Julia and the SGE scheduler to allocate workers on the same node.

I tried it with the SGE ORTE, which is probably better for DArray environments. I asked my sysadmin to add a new parallel environment for OpenMP with this website as a guide.

In the meantime, I want to ensure that my Julia code is at least attempting to correctly allocate the workers locally. I schedule my Julia script with qsub -pe orte 11 script.jl and then add the following code to the top of my script:

# set up cluster manager
using ClusterManagers
addprocs(LocalAffinityManager(np = 10, mode = COMPACT, affinities = Int[]))

# check if all cores on same host (node)
hosts = []
pids = []
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    push!(hosts, host)
    push!(pids, pid)
end
println("driver process is hosted on ", gethostname())
println("number of workers on host is ", nworkers())
println("where are workers hosted?")
display(hosts)

The last SGE run scheduled 6 processes on one node and 5 on the other. I expected that the previous code would show the workers allocated to two hosts. Instead, all worker processes claimed to reside on the same node:

driver process is hosted on compute-0-8.local
number of workers on host is 10
where are workers hosted?
10-element Array{Any,1}:
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"

How should I configure my ClusterManager to place all processes on the same node? (and how do I tell SGE to do the same?!)

Trivial example with Slurm is failing

The commands

using ClusterManagers
addprocs(SlurmManager(2))

fail with

Error launching Slurm job:
MethodError(length,(:all_to_all,))
0-element Array{Int64,1}

It seems that the launch function receives a keyword argument :topology=>:all_to_all that isn't handled well.

Print does not work without workaround

I don't know if this was intended, but println and other output commands do not work, whereas run(`echo "Hello"`) works fine. So I have been overloading println after I start the cluster:

@everywhere import Base.println
@everywhere println(x::AbstractString) = run(`echo "$x"`)

But I wonder if this workaround could find a place in the package?

NodeFile and qsub script

I have been adapting some code from the Julia Google group to parse the node file produced by a qsub command.

The difference from the existing qsub launcher is that qsub should not be called; only the node file is read. The goal is to be able to start parallel jobs without requiring an interactive session.

The idea is to call qsub from the command line; the qsub script then calls Julia, which parses the node file from the environment and adds the workers at the start.

I have a working version of my code, but not wrapped into a ClusterManager:
https://gist.github.com/tlamadon/636d8e468b328d23bb4d

In the code I use addprocs to connect the workers. The problem I have is that I don't know how to wrap this into a ClusterManager. It seems to me that addprocs actually uses the local manager already.

Any suggestion would be very welcome; I will be happy to write the ClusterManager and send a merge request.
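For comparison, the node-file approach can already be driven through the stock SSH-based addprocs without a dedicated ClusterManager; a minimal sketch (assuming passwordless SSH between nodes, a shared filesystem, and current Distributed conventions):

# sketch: add one SSH worker per entry in the PBS node file
using Distributed
nodefile = get(ENV, "PBS_NODEFILE", "")
hosts = isempty(nodefile) ? String[] : readlines(nodefile)
isempty(hosts) || addprocs(hosts)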

stdout not redirected to repl

While stdout for a local process appears in the REPL, that for a remote SGE process does not; see the transcript below. Not shown is that it works fine for remote SSH processes. Is this related to JuliaLang/julia#6030, JuliaLang/julia#5995, and #6?

[arthurb@h06u01 ~]$ juliac
[Julia startup banner] Version 0.3.0-prerelease+2417 (2014-04-02 18:29 UTC), Commit 193cb11* (0 days old master), x86_64-redhat-linux

julia> using ClusterManagers

julia> ClusterManagers.addprocs_sge(1)
job id is 8083782, waiting for job to start ..................................................
1-element Array{Any,1}:
2

julia> addprocs(1)
1-element Array{Any,1}:
3

julia> remotecall_fetch(3,println,"foo")
From worker 3: foo

julia> remotecall_fetch(2,println,"foo")

julia> remotecall_fetch(1,println,"foo")
foo

julia> versioninfo()
Julia Version 0.3.0-prerelease+2417
Commit 193cb11* (2014-04-02 18:29 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm

julia> Pkg.status()
10 required packages:

  • ClusterManagers 0.0.1+ master
  • DSP 0.0.1+ spectrogram
  • Debug 0.0.1
  • Distributions 0.4.2
  • HDF5 0.2.20
  • IProfile 0.2.5
  • MAT 0.2.3
  • PyPlot 1.2.2
  • Stats 0.1.0
  • WAV 0.2.2
11 additional packages:
  • ArrayViews 0.4.2
  • BinDeps 0.2.12
  • Color 0.2.9
  • NumericExtensions 0.6.0
  • NumericFuns 0.2.1
  • PDMats 0.1.1
  • Polynomial 0.1.1
  • PyCall 0.4.2
  • StatsBase 0.3.9
  • URIParser 0.0.1
  • Zlib 0.1.6

julia>

SLURM issues in 0.7+

There remain a couple of issues in the SLURM manager; a working solution is implemented in #95.

Not sure why, but cluster_cookie() throws an error on our (Linux) system; the PR fixes this by avoiding the worker_arg() call in slurm.jl and instead using (Distributed.init_multi(); Distributed.LPROC.cookie), which works fine.
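Spelled out, the workaround described above is just the following (it leans on Distributed internals, so treat it as a stop-gap rather than a stable API):

# the cookie workaround from #95, written out
using Distributed
Distributed.init_multi()
cookie = Distributed.LPROC.cookie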

SGE issue: could not connect

I tried using the following example:

using ClusterManagers
ClusterManagers.addprocs_sge(4)

And got the following output:

job id is 2559006, waiting for job to start ............................................................
    From worker 0:  Master process (id 1) could not connect within 60.0 seconds.
    From worker 0:  exiting.

I noticed (using qstat -U) that the jobs were started:

-bash-4.1$ qstat -U koslickd
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2559006 0.50500 julia-3457 koslickd     t     10/07/2015 21:58:44 [email protected]     1 4
2559006 0.50500 julia-3457 koslickd     t     10/07/2015 21:58:44 [email protected]     1 3
2559006 0.50500 julia-3457 koslickd     t     10/07/2015 21:58:44 [email protected]     1 2
2559006 0.50500 julia-3457 koslickd     r     10/07/2015 21:58:44 [email protected]     1 1

But shortly thereafter, the jobs seemed to disappear:

-bash-4.1$ qstat -U koslickd
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2559006 0.50500 julia-3457 koslickd     r     10/07/2015 21:58:44 [email protected]     1 1

Julia then seemed to freeze indefinitely and had to be killed manually.

Any ideas as to what went wrong?

Thanks,

~David

Update for `0.7`?

Hello,

As of now, ClusterManagers works on 0.7-beta2 with a ton of deprecation warnings, but that is OK. It is, however, broken on 0.7-rc2/rc3; any plans to update? (It won't even compile because of JuliaLang/julia#28499, which is related, so it may need a fix of that first.)

If I find some quick and dirty fixes I will submit them back here.

Thanks!

cwd not retained on SGE workers

Local workers have the same working directory as the master process, but SGE workers stay in the user's home directory. It would be nice if they were at least consistent; I prefer the local behavior. Let me know if the following console log is not clear.

[arthurb@h01u07 ~]$ cd ~
[arthurb@h01u07 ~]$ pwd
/home/arthurb
[arthurb@h01u07 ~]$ cd src
[arthurb@h01u07 src]$ pwd
/home/arthurb/src
[arthurb@h01u07 src]$ juliac
[Julia startup banner] Version 0.3.0-prerelease+2078 (2014-03-17 22:41 UTC), Commit 00ca83c* (1 day old master), x86_64-redhat-linux

julia> addprocs(1)
1-element Array{Any,1}:
2

julia> using ClusterManagers; addprocs(1, cman=SGEManager())
job id is 7264129, waiting for job to start ..............................
1-element Array{Any,1}:
3

julia> pwd()
"/home/arthurb/src"

julia> remotecall_fetch(2,pwd)
"/home/arthurb/src"

julia> remotecall_fetch(3,pwd)
"/home/arthurb"

julia> versioninfo()
Julia Version 0.3.0-prerelease+2078
Commit 00ca83c* (2014-03-17 22:41 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm

julia> Pkg.status()
9 required packages:

  • ClusterManagers 0.0.1
  • DSP 0.0.1+ ifloor
  • Debug 0.0.0
  • Devectorize 0.2.1
  • Distributions 0.4.2
  • MAT 0.2.3
  • PyPlot 1.2.2
  • Stats 0.1.0
  • WAV 0.2.2
11 additional packages:
  • ArrayViews 0.4.1
  • BinDeps 0.2.12
  • Color 0.2.8
  • HDF5 0.2.20
  • NumericExtensions 0.5.6
  • PDMats 0.1.1
  • Polynomial 0.1.1
  • PyCall 0.4.2
  • StatsBase 0.3.9
  • URIParser 0.0.1
  • Zlib 0.1.5

julia>
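A possible workaround for the inconsistency described above (a sketch using the modern Distributed argument order, not something the package does) is to push the master's working directory to every worker after they connect:

# sketch: make the workers' cwd match the master's
using Distributed
let d = pwd()
    for p in workers()
        remotecall_fetch(cd, p, d)   # remotecall_fetch(f, pid, args...)
    end
end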

LSF manager with bsub

Hello,

Are there plans to write a manager for lsf/bsub? If so, would it be able to use the infiniband transport for example.

I managed to "hack" a bsub submission to print out its hostnames and then use the SSHManager in Julia to connect to those. My experience on these fronts is limited, and I have had trouble connecting :all_to_all if the nodes happen to be on different racks.

I tried to mimic qsub.jl, with very limited success :)

Cheers!

Add a nonblocking addprocs_sge (and others)

Currently the use of addprocs_sge (and the others) is blocking: it must obtain the node before it returns. For example, applying @async to a command which calls addprocs_sge is useless.

I would like to know whether it is possible to add a non-blocking version of addprocs_sge. I am happy to help with the coding, but I am not overly familiar with the addprocs structure.

I suggested a workaround in a question on StackOverflow, but I am not sure my approach to the issue is recommended.

Slurm: srun being passed a Julia option?

On Julia 0.6, latest release of ClusterManagers:

julia> using ClusterManagers

julia> addprocs(SlurmManager(2), partition="pdebug", t="00:5:00")
srun: unrecognized option '--enable-threaded-blas=false'
Try "srun --help" for more information

SGEManager procs failing to launch

julia> using Distributed, ClusterManagers

julia> addprocs(SGEManager(1,""), qsub_env="",res_list="")
Error launching workers
MethodError(iterate, (Base.ProcessChain(Base.Process[Process(`echo 'cd /lmb/home/alewis && /lmb/home/alewis/local/julia-1.0.0/bin/julia --worker=EiIxQRHNaEkyxSBA'`, ProcessExited(0)), Process(`qsub -N julia-47120 -terse -j y -R y -t 1-1 -V`, ProcessExited(0))], Base.DevNull(), Pipe(RawFD(0xffffffff) closed => RawFD(0x00000014) open, 0 bytes waiting), Base.TTY(RawFD(0x0000000f) open, 0 bytes waiting)),), 0x00000000000061b3)
0-element Array{Int64,1}

julia> addprocs_sge(1)
Error launching workers
MethodError(iterate, (Base.ProcessChain(Base.Process[Process(`echo 'cd /lmb/home/alewis && /lmb/home/alewis/local/julia-1.0.0/bin/julia --worker=EiIxQRHNaEkyxSBA'`, ProcessRunning), Process(`qsub -N julia-47120 -terse -j y -R y -t 1-1 -V`, ProcessRunning)], Base.DevNull(), Pipe(RawFD(0xffffffff) closed => RawFD(0x00000015) open, 0 bytes waiting), Base.TTY(RawFD(0x0000000f) open, 0 bytes waiting)),), 0x00000000000061b3)
0-element Array{Int64,1}

I've twiddled with a few things, like checking that qsub -N julia-47120 -terse -j y -R y -t 1-1 -V itself runs without error, but I don't really know how to go about debugging this further, short of reading up on each of the functions named in the error or on the launch method. Any suggestions?

As a workaround, I've been queueing myself with qrsh inside a tmux session and then launching julia once I'm out of the queue, or writing Julia scripts that can be called from a qsub script. In doing so I've noticed that adding processes with the addprocs(machines) syntax fails for any hostname not in my .ssh/known_hosts. This is, I expect, unrelated to the main issue, but I've included the error message below just in case.

julia> addprocs(["fmg01"])
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Host key verification failed.
ERROR: Unable to read host:port string from worker. Launch command exited with error?
error(::String) at ./error.jl:33
read_worker_host_port(::Pipe) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:273
connect(::Distributed.SSHManager, ::Int64, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/managers.jl:397
create_worker(::Distributed.SSHManager, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:505
setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:451
(::getfield(Distributed, Symbol("##47#50")){Distributed.SSHManager,WorkerConfig})() at ./task.jl:259
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:226
 [2] #addprocs_locked#44(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::Function, ::Distributed.SSHManager) at ./task.jl:266
 [3] #addprocs_locked at ./none:0 [inlined]
 [4] #addprocs#43(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::Function, ::Distributed.SSHManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:369
 [5] #addprocs at ./none:0 [inlined]
 [6] #addprocs#251(::Bool, ::Cmd, ::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Array{String,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/managers.jl:118
 [7] addprocs(::Array{String,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/managers.jl:117
 [8] top-level scope at none:0

julia here is the precompiled 1.0.0 binary for Linux, running on a shared-filesystem, Scientific Linux 7.4 cluster, with SGE 6.2u3 installed.

Problems with SGE

I'm getting the following using the newest Julia on my SGE cluster. Not sure where to begin debugging this.

In [4]:

ClusterManagers.addprocs_sge(4)
job id is 168009, waiting for job to start ...............................................
connect: connection timed out (ETIMEDOUT)
at In[4]:1
 in yield at multi.jl:1540
 in wait at task.jl:117
 in wait_connected at stream.jl:263
 in connect at stream.jl:878
 in create_worker at multi.jl:1036
 in start_cluster_workers at multi.jl:1000
 in addprocs_internal at multi.jl:1202
 in addprocs at multi.jl:1205
 in addprocs_sge at /home/malmaud/.julia/ClusterManagers/src/qsub.jl:73

srun: unrecognized option '--enable-threaded-blas=false'

I'm having the same issue as #75 on Julia 0.6.2, ClusterManagers 0.1.2. Let me know if you can reproduce?

test.sbatch

#!/bin/bash
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 4
#SBATCH --time 1
ml load julia/0.6
julia ./slurm.jl

slurm.jl

using ClusterManagers
addprocs(SlurmManager(2))

hosts = []
pids = []
for i in workers()
	host, pid = fetch(@spawnat i (gethostname(), getpid()))
	push!(hosts, host)
	push!(pids, pid)
end

for i in workers()
	rmprocs(i)
end

PBSPro qsub does not have -t

Hello. I tried to use the ClusterManagers package on our cluster, which runs PBSPro version 13 (I also have version 14 available). The result is below.
I think the -t option is specific to OpenPBS/Torque (I may be wrong!).
I am happy to help with a port to PBSPro.

Also note that on our cluster you cannot submit directly to a queue.
Jobs go to the entry queue and are then sent to queues depending on the resources you request, including walltime. So the empty queue must work.

John Hearns

julia> addprocs_pbs(12,queue="long")
qsub: invalid option -- 't'
usage: qsub [-a date_time] [-A account_string] [-c interval]
[-C directive_prefix] [-e path] [-f ] [-h ] [-I [-X]] [-j oe|eo] [-J X-Y[:Z]]
[-k o|e|oe] [-l resource_list] [-m mail_options] [-M user_list]
[-N jobname] [-o path] [-p priority] [-P project] [-q queue] [-r y|n]
[-S path] [-u user_list] [-W otherattributes=value...]
[-v variable_list] [-V ] [-z] [script | -- command [arg1 ...]]
qsub --version
batch queue not available (could not run qsub)
0-element Array{Int64,1}

Add transport to ElasticManager

I'm still busy setting things up to get the ElasticManager running on my local network and in a Docker image running on ECS, but I wondered whether a transport option that connects the workers via an SSH tunnel would work? SSH works fine, so it should theoretically be an adaptation of the elastic manager with aspects of the SSH manager included.

I'm unsure as to whether this has any merit. If this seems like a good idea, I should be able to make an attempt at it next week.
Using the elastic manager solves my problem of having PCs going off and on, so with some extra effort to get it running and retrying until connections work over the network, the elastic manager becomes a great solution!

slurm : job credential expired

I see this:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.1 (2017-10-24 22:15 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using ClusterManagers

julia> addprocs_slurm(3)
srun: job 14774 queued and waiting for resources
srun: job 14774 has been allocated resources
srun: error: Task launch for 14774.0 failed on node magi70: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
WARNING: dropping worker: file not created in 63 seconds
WARNING: dropping worker: file not created in 63 seconds
WARNING: dropping worker: file not created in 63 seconds
0-element Array{Int64,1}

and this is my squeue:

florian.oswald@magi3:~$ squeue -u florian.oswald
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             14774   COMPUTE julia-50 florian.  R       0:15      1 magi70
florian.oswald@magi3:~$ squeue -u florian.oswald
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
florian.oswald@magi3:~$ 

Notice that while the job is running, the scheduler allows me to SSH into the compute node without a password, so that requirement is fulfilled.

`addprocs_sge` does not allow user to set queue switch

Hi,

To launch with qsub I need to be able to set the queue with the -P switch (qsub -P queue_name). However, the code uses qsub -q queue_name, which will not work for me. I'm willing to submit a PR but I'm not sure how it should be added. Perhaps with a keyword argument, addprocs_sge(5, queue="fast", queue_switch="-P"), or, if the queue name starts with a - and contains a space, assume it already includes the switch? What do people prefer?

Glen

Change scheduler manager

Hi,

This is not really an issue but more of a question.

Do you know if it is possible to use OAR instead of Slurm? The cluster I want to use is set up with OAR as its manager.

Thank you for your help,

Best regards.

[PackageEvaluator.jl] Your package ClusterManagers may have a testing issue.

This issue is being filed by a script, but if you reply, I will see it.

PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3).

The results of this script are used to generate a package listing enhanced with testing results.

The status of this package, ClusterManagers, on...

  • Julia 0.2 is 'No tests, but package loads.' PackageEvaluator.jl
  • Julia 0.3 is 'No tests, but package loads.' PackageEvaluator.jl

'No tests, but package loads.' can be due to there being no tests (you should write some if you can!) but can also be due to PackageEvaluator not being able to find your tests. Consider adding a test/runtests.jl file.

'Package doesn't load.' is the worst-case scenario. Sometimes this arises because your package doesn't have BinDeps support, or needs something that can't be installed with BinDeps. If this is the case for your package, please file an issue and an exception can be made so your package will not be tested.

This automatically filed issue is a one-off message. Starting soon, issues will only be filed when the testing status of your package changes in a negative direction (gets worse). If you'd like to opt-out of these status-change messages, reply to this message.

Hangs on `addprocs_sge()`

I hope someone can help me. I am trying to do parallel computing with Julia on our SGE grid system, which I normally only feed with shell scripts.

When I run, for example, addprocs_sge(5, res_list="ct=00:01:00"), it immediately shows the received job id and waits for the job to start. A few seconds later, however, I receive error messages indicating that it can't tail the log files in my home directory, which are in fact present:

julia> using ClusterManagers

julia> addprocs_sge(5,res_list="ct=00:01:00")
job id is 5430281, waiting for job to start ........................
tail: /path/to/my/home/julia-59333.o5430281.1: No such file or directory
tail: no files remaining
tail: /path/to/my/home/julia-59333.o5430281.5: No such file or directory
tail: no files remaining

This is the content of one of the log files; it apparently fails to run the julia executable, which is of course accessible (I am using the same binary for the REPL):

***************************************************************
* Submitted on:            Tue Feb 07 09:30:16 2017           *
* Started on:              Tue Feb 07 09:31:18 2017           *
***************************************************************

/var/spool/sge/ccwsge0830/job_scripts/5430281:1: permission denied: /path/to/my/home/apps/julia/julia-0.5.0/bin/julia

***************************************************************
* Ended on:                Tue Feb 07 09:31:36 2017           *
* Exit status:             126                                *
* Consumed                                                    *
*   cpu (HS06):            00:00:00                           *
*   cpu scaling factor:    11.100000                          *
*   cpu time:              0 / 60                             *
*   efficiency:            00 %                               *
*   io:                    0.00000GB                          *
*   vmem:                  N/A                                *
*   maxvmem:               N/A                                *
*   maxrss:                N/A                                *
***************************************************************

Any ideas what's happening here?

Is there any way to specify -l parameters to qsub in the PBS manager?

I need to give a few parameters to the qsub command such as:

-l nodes=2:ppn=16,walltime=1:00:00

I've been looking through the code, and it seems like qsub_env is the only parameter we can tweak, and if I'm not wrong that just passes environment variables through to the assigned hosts.

It doesn't look too hard to add another set of parameters for additional qsub commands like this, but I thought I'd ask just in case I missed something.
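For what it's worth, a res_list keyword already shows up for the SGE entry point elsewhere in this tracker (e.g. addprocs_sge(5, res_list="ct=00:01:00")), so the requested PBS analogue would presumably look like the hypothetical call below (not the current API):

# hypothetical sketch of passing -l resources through addprocs_pbs
using ClusterManagers
addprocs_pbs(32, queue="default", res_list="nodes=2:ppn=16,walltime=1:00:00")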

Passing Environment Variables with qsub

My cluster has an old compiler, so I compiled GCC in my user directory and had to add my lib directory to LD_LIBRARY_PATH. Now launching workers does not work:

/home/---/julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/---/julia/usr/bin/../lib/libjulia.so)

The environment variables are not passed to the launched workers. Adding a -V option to the qsub command fixes this. Should we add it as a default?

Jobfile is incorrect on some PBS clusters

When I submit an array job to the cluster with a command:

qsub -N test -j oe -k o -t 1:5 ./test

I get a list of log files of the form

test-1.o50084
test-2.o50084
test-3.o50084
test-4.o50084
test-5.o50084

which is different from the log-file naming scheme ClusterManagers currently looks for. On the cluster I have access to, qsub --version reports 4.2.10. It seems the output naming scheme changed somewhere before 6.0.0, as can be seen in the linked issue, but this specific detail appears to be undocumented.

When I changed the naming scheme manually in qsub.jl, the cluster worked fine. So this might be an annoying bug for newcomers, with an easy fix. For example, filename(i) could return a list of files to be checked in the loop, or there could be another condition that decides the filename scheme depending on the qsub version.
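The "list of candidate filenames" idea might look like the hypothetical helper below: probe each known qsub log-file naming scheme (the three patterns used in qsub.jl today) and pick whichever actually exists:

# hypothetical helper: return the first worker log file that exists
function find_worker_logfile(home::AbstractString, id, i)
    candidates = ["$home/julia-$(getpid()).o$id-$i",
                  "$home/julia-$(getpid())-$i.o$id",
                  "$home/julia-$(getpid()).o$id.$i"]
    idx = findfirst(isfile, candidates)
    return idx === nothing ? nothing : candidates[idx]
end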

Static compilation fails on ClusterManagers.jl

I'm trying to create a system image of Julia with ClusterManagers.jl embedded. I encountered the following error while building Julia (using the release-0.5 branch):

$ make -j 4
    JULIA usr/lib/julia/sys.o
coreio.jl
exports.jl
essentials.jl
base.jl
generator.jl
reflection.jl
options.jl
promotion.jl
tuple.jl
range.jl
expr.jl
error.jl
bool.jl
number.jl
int.jl
operators.jl
pointer.jl
refpointer.jl
WARNING: Method definition (::Type{#T<:Any})(Any) in module Inference at coreimg.jl:39 overwritten in module Base at sysimg.jl:53.
checked.jl
abstractarray.jl
subarray.jl
array.jl
hashing.jl
rounding.jl
float.jl
complex.jl
rational.jl
multinverses.jl
abstractarraymath.jl
arraymath.jl
float16.jl
simdloop.jl
reduce.jl
reshapedarray.jl
bitarray.jl
intset.jl
dict.jl
set.jl
iterator.jl
build_h.jl
version_git.jl
osutils.jl
c.jl
sysinfo.jl
io.jl
iostream.jl
iobuffer.jl
char.jl
intfuncs.jl
strings/strings.jl
strings/errors.jl
strings/string.jl
strings/types.jl
strings/basic.jl
strings/search.jl
strings/util.jl
strings/io.jl
strings/utf8proc.jl
parse.jl
shell.jl
regex.jl
pcre.jl
show.jl
base64.jl
nullable.jl
version.jl
libc.jl
libdl.jl
env.jl
libuv.jl
uv_constants.jl
event.jl
task.jl
lock.jl
threads.jl
weakkeydict.jl
stream.jl
socket.jl
filesystem.jl
process.jl
multimedia.jl
grisu/grisu.jl
methodshow.jl
floatfuncs.jl
math.jl
cartesian.jl
multidimensional.jl
permuteddimsarray.jl
reducedim.jl
ordering.jl
collections.jl
sort.jl
WARNING: Method definition searchsortedfirst(AbstractArray{T<:Any, 1}, Any) in module Sort at sort.jl:184 overwritten at sort.jl:187.
WARNING: Method definition searchsortedlast(AbstractArray{T<:Any, 1}, Any) in module Sort at sort.jl:184 overwritten at sort.jl:187.
WARNING: Method definition searchsorted(AbstractArray{T<:Any, 1}, Any) in module Sort at sort.jl:184 overwritten at sort.jl:187.
gmp.jl
mpfr.jl
combinatorics.jl
hashing2.jl
dSFMT.jl
random.jl
printf.jl
meta.jl
Enums.jl
serialize.jl
channels.jl
clusterserialize.jl
multi.jl
workerpool.jl
pmap.jl
managers.jl
asyncmap.jl
loading.jl
mmap.jl
sharedarray.jl
datafmt.jl
deepcopy.jl
interactiveutil.jl
replutil.jl
test.jl
i18n.jl
initdefs.jl
Terminals.jl
LineEdit.jl
REPLCompletions.jl
REPL.jl
client.jl
util.jl
linalg/linalg.jl
broadcast.jl
statistics.jl
irrationals.jl
dft.jl
dsp.jl
quadgk.jl
fastmath.jl
libgit2/libgit2.jl
pkg/pkg.jl
stacktraces.jl
profile.jl
dates/Dates.jl
sparse/sparse.jl
threadcall.jl
deprecated.jl
docs/helpdb.jl
docs/helpdb/Base.jl
docs/basedocs.jl
markdown/Markdown.jl
docs/Docs.jl
/home/julia/TopStarredImage/julia/base/precompile.jl
LoadError("sysimg.jl",381,LoadError("/home/julia/TopStarredImage/julia/base/userimg.jl",1,LoadError("/home/julia/TopStarredPackages/v0.5/ClusterManagers/src/ClusterManagers.jl",7,UndefRefError())))
*** This error is usually fixed by running `make clean`. If the error persists, try `make cleanall`. ***
make[1]: *** [/home/julia/TopStarredImage/julia/usr/lib/julia/sys.o] Error 1
make: *** [julia-sysimg-release] Error 2 

OS = Ubuntu 14.04
ClusterManagers version = 0.1.0

Adding processes to HTCondor

When adding a process to an HTCondor cluster I get the following output:
julia> using ClusterManagers
julia> addprocs_htc(1)
Submitting job(s).
Waiting for 1 workers:

Then nothing happens; the function is stuck. I'm on Julia 0.5 and have also tried 0.4 with the same result. I am able to manually submit jobs to the cluster from the terminal.

Edit: I'm running the cluster on localhost and all the server ports are closed except for the SSH port, which might be why the jobs aren't submitted correctly.

Example of reading cookie from (shared) file?

Hi all,

I'm sorry to do this, but I've been looking through the code most of the afternoon and I'm still lost as to how to handle this. Basically, I'm not allowed to have the cluster cookie passed via the command line, since it's visible to all users of the system. It is allowed to have the workers read a file (owned by me, in a shared location) for the cookie.

I've read the docs where it says

Note that environments requiring higher levels of security can implement this via a custom ClusterManager. For example, cookies can be pre-shared and hence not specified as a startup argument.

but I can't figure out how to do this.

Ideally, I'd modify the Slurm connection manager to allow cookies to be read from a particular location, but again, I don't even know where to start. Any advice appreciated.
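A minimal sketch of the pre-shared-cookie idea, assuming a cookie file on a shared filesystem readable only by you (cluster_cookie and start_worker are part of the Distributed stdlib; wiring this into the Slurm manager's launch path is left out here):

# sketch: read the cookie from a file instead of passing it on the command line
using Distributed
cookie = String(strip(read("/path/to/shared/cookie_file", String)))
Distributed.cluster_cookie(cookie)     # master side: set the cookie before addprocs
# worker side: instead of receiving `--worker=<cookie>` on its command line, the
# launched process would read the same file and call Distributed.start_worker(cookie)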

LocalAffinityManager not updated for 1.0

Attempting to do addprocs(LocalAffinityManager(np = 2, mode = COMPACT, affinities = Int[])) gives errors; it looks like affinity.jl has not been updated for 1.0. Below are the errors I got, each after fixing the previous one:

ERROR: UndefVarError: OS_NAME not defined (fix: use Sys.KERNEL)
ERROR: UndefVarError: assert not defined (fix: use @assert)
ERROR: UndefVarError: CPU_CORES not defined (fix: use Sys.CPU_THREADS)
ERROR: MethodError: no method matching iterate(::Base.Process) (fix: ??)
