ClusterManagers.jl's Issues

SLURM: `srun` doesn't see julia instance using ClusterManagers?

I started a session with an sbatch script with:

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=1G
#SBATCH --job-name=my_first_parallel_julia
#SBATCH --time=00-00:02:00  # Wall Clock time (dd-hh:mm:ss) [max of 14 days]
#SBATCH --output=julia_in_parallel.output  # output and error messages go to this file

julia  my_first_parallel_julia_on_slurm.jl

Then in my Julia script, I tried to run addprocs_slurm(3), but got this error:

srun: Warning: can't honor --ntasks-per-node set to 2 which 
doesn't match the requested tasks 3 with the number of requested nodes 2.

But if I run addprocs_slurm(4), it works fine, even though it seems like that means I have 5 processes working (the master + 4 workers)...

What am I missing?
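One way to avoid the mismatch (a sketch, not an official recommendation) is to size the worker pool from the allocation sbatch actually granted, so the srun step launched by addprocs_slurm matches the --ntasks/--ntasks-per-node layout:

# sketch: derive the worker count from the sbatch allocation itself
using ClusterManagers
n = parse(Int, get(ENV, "SLURM_NTASKS", "1"))   # SLURM_NTASKS is set by sbatch
addprocs_slurm(n)                               # here: 4 workers on top of the master process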

Segfault using SlurmManager on Comet

Hey, I'm playing around with the Comet supercomputing cluster at SDSC and I've hit the following error when running on the login node.

julia> using ClusterManagers

julia> addprocs(SlurmManager(2), partition="debug", t="00:5:00")
connecting to worker 2 out of 2
signal (6): Aborted
while loading no file, in expression starting on line 0
init_once at /home/etrain75/julia/deps/srccache/libuv-ecbd6eddfac4940ab8db57c73166a7378563ebd3/src/threadpool.c:174
uv_once at /home/etrain75/julia/deps/srccache/libuv-ecbd6eddfac4940ab8db57c73166a7378563ebd3/src/unix/thread.c:239
uv__work_submit at /home/etrain75/julia/deps/srccache/libuv-ecbd6eddfac4940ab8db57c73166a7378563ebd3/src/threadpool.c:184
uv_getaddrinfo at /home/etrain75/julia/deps/srccache/libuv-ecbd6eddfac4940ab8db57c73166a7378563ebd3/src/unix/getaddrinfo.c:186
jl_getaddrinfo at /home/etrain75/julia/src/jl_uv.c:712
getaddrinfo at ./socket.jl:591
getaddrinfo at ./socket.jl:606
connect! at ./socket.jl:703
connect at ./stream.jl:949 [inlined]
connect_to_worker at ./managers.jl:483
jl_call_method_internal at /home/etrain75/julia/src/julia_internal.h:192 [inlined]
jl_apply_generic at /home/etrain75/julia/src/gf.c:1930
connect at ./managers.jl:425
jl_call_method_internal at /home/etrain75/julia/src/julia_internal.h:192 [inlined]
jl_apply_generic at /home/etrain75/julia/src/gf.c:1930
create_worker at ./multi.jl:1570
setup_launched_worker at ./multi.jl:1517
jl_call_method_internal at /home/etrain75/julia/src/julia_internal.h:192 [inlined]
jl_apply_generic at /home/etrain75/julia/src/gf.c:1930
#527 at ./task.jl:309
jl_call_method_internal at /home/etrain75/julia/src/julia_internal.h:192 [inlined]
jl_apply_generic at /home/etrain75/julia/src/gf.c:1930
jl_apply at /home/etrain75/julia/src/julia.h:1392 [inlined]
start_task at /home/etrain75/julia/src/task.c:253
unknown function (ip: 0xffffffffffffffff)
Allocations: 2377991 (Pool: 2377020; Big: 971); GC: 1
Aborted
[etrain75@comet-ln2 ~]$ srun: error: comet-14-01: tasks 0-1: Exited with exit code 1

If I run the same command but from a compute node it works as expected.
I expect it has something to do with running on a constrained system (JuliaLang/julia#14807).

This bug occurs on the ClusterManagers.jl master branch (which someone should tag in METADATA) and on current Julia master:

julia> versioninfo()
Julia Version 0.5.0-rc0+146
Commit 37e6397* (2016-08-03 00:47 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libmkl_rt
  LAPACK: libmkl_rt
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

Split up this package

There is not much shared code between the managers, and most of us only use a single workload/cluster manager, so it is difficult to review PRs.

Sometimes addprocs_sge works, sometimes it doesn't

Hi everyone,

I'm currently trying to use ClusterManagers on a cluster running Sun Grid Engine, but I'm running into some problems. Sometimes addprocs_sge(x) simply works and gives me x workers. However, sometimes I run into the same problems as described in the closed issue #24: when I do addprocs_sge(x), I see from qstat that SGE gave me x processors (i.e. they have status 'r'), but then in Julia I still get a timeout. More specifically, I get the following errors:

julia> addprocs_sge(32)
job id is 53661, waiting for job to start ..........................................Worker 2 terminated.
ERROR (unhandled task failure): EOFError: read end of file
Stacktrace:
[1] unsafe_read(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Ptr{UInt8}, ::UInt64) at ./iobuffer.jl:105
[2] unsafe_read(::TCPSocket, ::Ptr{UInt8}, ::UInt64) at ./stream.jl:752
[3] unsafe_read(::TCPSocket, ::Base.RefValue{NTuple{4,Int64}}, ::Int64) at ./io.jl:361
[4] read at ./io.jl:363 [inlined]
[5] deserialize_hdr_raw at ./distributed/messages.jl:170 [inlined]
[6] message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:157
[7] process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:118
[8] (::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at ./event.jl:73

I have already tried increasing JULIA_WORKER_TIMEOUT from 60 seconds to a few minutes but that does not seem to help. I also tried addprocs_qrsh but I get the same problem: sometimes it works and sometimes it doesn't. I'm not really sure if the problems I'm encountering are due to a problem with Julia or due to a problem with the cluster.
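For reference, setting that variable from within the session looks like the sketch below. Note that it only affects the master's side of the handshake unless the scheduler also exports it to the worker processes (e.g. via qsub's -V flag), which may be part of why raising it appears to have no effect:

# sketch: raise the worker timeout before adding workers
ENV["JULIA_WORKER_TIMEOUT"] = "300"   # seconds; read when the timeout is needed
using ClusterManagers
addprocs_sge(32)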

Whenever "addprocs_sge(x)" does work, it's doing an amazing job! But right now it's too unreliable to be used in the project that I'm working on. I hope there's someone who can help me out. Thanks in advance :)

Forgot to mention: I'm running Julia v0.6. The issue mentioned above was closed because upgrading to v0.4 may have solved the problem.

Support for different PBS versions

TBH I don't know much about PBS at all, but thought I'd put this info here in case it's useful to someone or maybe prompts some changes to this repo.

Basically, I'm on a cluster with,

$ qsub --version
pbs_version = PBSPro_10.0.9.110976

For this version, it seems the flag for an array job is -J rather than -t, so initially I was getting an invalid option -- 't' error. Additionally, once I fixed that, it was also saying cannot submit non-rerunable Array Job, so I had to add -r y. Thus my final qsub command here looks like:

`qsub -N $jobname -j oe -r y -k o -J 1-$np $queue $qsub_env $res_list`

Also, the output files created have a . instead of a - separating the job number here, so I had to change that to

"$home/julia-$(getpid()).o$id.$i"

With those changes, it now works beautifully on my setup. I have no idea how non-standard these things are, but I suppose it would be nice to be able to specify them from the addprocs call so I don't have to hack the code myself, and perhaps even nicer if this could somehow be auto-detected.
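For what it's worth, the kind of knob suggested above might look like the hypothetical helper below (names invented here, not the package's API), which builds the qsub array-job flags from caller-supplied settings:

# hypothetical sketch: let the caller override the array-job flag and rerunnable flag
function pbs_array_flags(np::Integer; array_flag::AbstractString="-t", rerunnable::Bool=false)
    flags = String[array_flag, "1-$np"]
    rerunnable && append!(flags, ["-r", "y"])
    return flags
end

pbs_array_flags(8, array_flag="-J", rerunnable=true)   # => ["-J", "1-8", "-r", "y"]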

Comprehensive tests

One of the issues (and one that became even more apparent during the 1.0 transition) is that it is really hard to test this package.
Without CI, development is slow, since we are likely to break use cases that we can't test.
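In the meantime, a minimal smoke test that skips itself when no scheduler is present would at least exercise the happy path (a sketch, assuming a Slurm installation on the test host):

# sketch of a scheduler-aware smoke test
using Test, Distributed, ClusterManagers

if Sys.which("sinfo") === nothing
    @info "No Slurm installation detected; skipping SlurmManager test"
else
    pids = addprocs_slurm(1)
    @test length(pids) == 1
    @test remotecall_fetch(myid, pids[1]) == pids[1]
    rmprocs(pids)
end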

Ideally we could use docker environments to instantiate a "minimal" cluster environment in which we then can run tests.

As an example see:

addprocs_slurm() fails since Julia 0.5

When trying to start new procs on the Slurm cluster via addprocs(SlurmManager(n)) I get the following error message (this worked with 0.4):

julia> using ClusterManagers; addprocs_slurm(1)
srun: job 900 queued and waiting for resources
srun: job 900 has been allocated resources
Worker 2 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
 in process_hdr(::TCPSocket, ::Bool) at ./multi.jl:1410
 in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1299
 in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1276
 in (::Base.##618#619{TCPSocket,TCPSocket,Bool})() at ./event.jl:68
srun: error: worker50: task 0: Exited with exit code 1

Pressing Ctrl+C after the error message (when nothing further happens) crashes Julia :/

InterruptException()
jl_run_once at /home/centos/buildbot/slave/package_tarball64/build/src/jl_uv.c:142
process_events at ./libuv.jl:82
wait at ./event.jl:147
task_done_hook at ./task.jl:191
unknown function (ip: 0x7f8f0ed650a2)
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:189 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1942
jl_apply at /home/centos/buildbot/slave/package_tarball64/build/src/julia.h:1392 [inlined]
finish_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:214 [inlined]
start_task at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:261
unknown function (ip: 0xffffffffffffffff)

Broken With Julia v0.4

The master branch currently does not work with Julia 0.4, due to several problems:

a) Piping is done by a function now
b) String interpolation
c) Several types (WorkerConfig, Cmd) have changed implementation.

I've submitted a pull request that fixes, for me, addprocs_sge.

The problems, and changes were:

When I run addprocs_sge(n), an error is reported that the shell can't find the julia executable:

julia> ClusterManagers.addprocs_sge(35)
WARNING: src::AbstractCmd |> dest::AbstractCmd is deprecated, use pipeline(src,dest) instead.
 in depwarn at deprecated.jl:73
 in |> at deprecated.jl:50
 in launch at /home/gcam/.julia/v0.4/ClusterManagers/src/qsub.jl:30
 in anonymous at task.jl:63
while loading no file, in expression starting on line 0
Unable to read script file because of error: error opening : No such file or directory
batch queue not available (could not run qsub)
0-element Array{Int64,1}

I presume that in line 30 the interpolation of both jobname and queue does not work as expected when pasting the command that launches julia on the worker nodes. Changing

qsub_cmd = `echo $(Base.shell_escape(cmd))` |> (isPBS ? `qsub -N $jobname $queue -j oe -k o -t 1-$np` : `qsub -N $jobname $queue -terse -j y -t 1-$np`)

to

qsub_options = length(queue) > 0 ? [jobname queue] : jobname
cmd = `cd $dir && $exename $exeflags`
qsub_cmd = pipeline(`echo $(Base.shell_escape(cmd))`, (isPBS ? `qsub -N $qsub_options -j oe -k o -t 1-$np` : `qsub -N $qsub_options -terse -j y -t 1-$np`))

Afterwards, the problem is that we're trying to change the detach attribute of Cmd in line 50, but it's an immutable, so changing

 config.io, io_proc = open(cmd)

To

config.io, io_proc = open(detach(cmd))

Fixes that.

Finally, in line 56, the WorkerConfig type does not have a line_buffered attribute, but the IO within it does (and it defaults to true), so commenting that line out fixes everything.

addprocs(n); addprocs(m, cman=SGEManager()) fails, while the reverse works fine

If I add local workers before adding remote SGE workers, the SGE workers terminate with an ECONNREFUSED error. If I reverse the order and add the SGE workers before the local workers, all is good. I presume this is not the desired behavior. Let me know if there is any way I can help debug. Sample output and versioninfo below.

[arthurb@h01u14 ~]$ juliac
[Julia startup banner] Version 0.3.0-prerelease (2014-02-24 14:04 UTC), master/457bca9* (fork: -1 commits, 126 days), x86_64-redhat-linux

julia> addprocs(16)
16-element Array{Any,1}:
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

julia> using ClusterManagers

julia> addprocs(10, cman=SGEManager())
job id is 6365131, waiting for job to start ..............................
10-element Array{Any,1}:
18
19
20
21
22
23
24
25
26
27

julia> Worker 19 terminated.
Worker 20 terminated.
Worker 21 terminated.
Worker 22 terminated.
Worker 18 terminated.
Worker 25 terminated.
Worker 24 terminated.
Worker 23 terminated.
Worker 27 terminated.
Worker 26 terminated.
From worker 18: fatal error on 18: ERROR: connect: connection refused (ECONNREFUSED)
From worker 18: in wait_connected at stream.jl:265
From worker 18: in connect at stream.jl:871
From worker 18: in Worker at multi.jl:119
From worker 18: in anonymous at task.jl:866
From worker 19: fatal error on 19: ERROR: connect: connection refused (ECONNREFUSED)
From worker 19: in wait_connected at stream.jl:265
From worker 19: in connect at stream.jl:871
From worker 19: in Worker at multi.jl:119
From worker 19: in anonymous at task.jl:866
From worker 20: fatal error on 20: ERROR: connect: connection refused (ECONNREFUSED)
From worker 20: in wait_connected at stream.jl:265
From worker 20: in connect at stream.jl:871
From worker 20: in Worker at multi.jl:119
From worker 20: in anonymous at task.jl:866
From worker 21: fatal error on 21: ERROR: connect: connection refused (ECONNREFUSED)
From worker 21: in wait_connected at stream.jl:265
From worker 21: in connect at stream.jl:871
From worker 21: in Worker at multi.jl:119
From worker 21: in anonymous at task.jl:866
From worker 22: fatal error on 22: ERROR: connect: connection refused (ECONNREFUSED)
From worker 22: in wait_connected at stream.jl:265
From worker 22: in connect at stream.jl:871
From worker 22: in Worker at multi.jl:119
From worker 22: in anonymous at task.jl:866
From worker 23: fatal error on 23: ERROR: connect: connection refused (ECONNREFUSED)
From worker 23: in wait_connected at stream.jl:265
From worker 23: in connect at stream.jl:871
From worker 23: in Worker at multi.jl:119
From worker 23: in anonymous at task.jl:866
From worker 24: fatal error on 24: ERROR: connect: connection refused (ECONNREFUSED)
From worker 24: in wait_connected at stream.jl:265
From worker 24: in connect at stream.jl:871
From worker 24: in Worker at multi.jl:119
From worker 24: in anonymous at task.jl:866
From worker 25: fatal error on 25: ERROR: connect: connection refused (ECONNREFUSED)
From worker 25: in wait_connected at stream.jl:265
From worker 25: in connect at stream.jl:871
From worker 25: in Worker at multi.jl:119
From worker 25: in anonymous at task.jl:866
From worker 26: fatal error on 26: ERROR: connect: connection refused (ECONNREFUSED)
From worker 26: in wait_connected at stream.jl:265
From worker 26: in connect at stream.jl:871
From worker 26: in Worker at multi.jl:119
From worker 26: in anonymous at task.jl:866
From worker 27: fatal error on 27: ERROR: connect: connection refused (ECONNREFUSED)
From worker 27: in wait_connected at stream.jl:265
From worker 27: in connect at stream.jl:871
From worker 27: in Worker at multi.jl:119
From worker 27: in anonymous at task.jl:866

julia>
[arthurb@h01u14 ~]$ juliac
[Julia startup banner] Version 0.3.0-prerelease (2014-02-24 14:04 UTC), master/457bca9* (fork: -1 commits, 126 days), x86_64-redhat-linux

julia> using ClusterManagers

julia> addprocs(10, cman=SGEManager())
job id is 6365134, waiting for job to start ..............................
10-element Array{Any,1}:
2
3
4
5
6
7
8
9
10
11

julia> addprocs(16)
16-element Array{Any,1}:
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

julia> versioninfo()
Julia Version 0.3.0-prerelease
Commit 457bca9* (2014-02-24 14:04 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm

julia> Pkg.status()
7 required packages:

  • ClusterManagers 0.0.1
  • DSP 0.0.1
  • Debug 0.0.0
  • Devectorize 0.2.1
  • Distributions 0.4.0
  • MAT 0.2.2
  • WAV 0.2.2
9 additional packages:
  • ArrayViews 0.4.1
  • BinDeps 0.2.12
  • HDF5 0.2.17
  • NumericExtensions 0.5.4
  • PDMats 0.1.0
  • Polynomial 0.0.0
  • StatsBase 0.3.7
  • URIParser 0.0.1
  • Zlib 0.1.5

julia>

Job hangs - “waiting for job to start” on a PBS Cluster

I'm trying to use ClusterManagers on a PBS cluster interactively, e.g.:

julia> using ClusterManagers

julia> addprocs_pbs(2, queue="default")
job id is 135963, waiting for job to start ................................................................

The job seems to hang even though it appears to run on qstat

Job id Name User Time Use S Queue

135963[].pippen julia-26303 snirgaz 0 R default

Any thoughts?

addprocs_sge works incorrectly if the work directory is not the home directory and the list of default options for qsub includes "-cwd"

In our environment, the list of default options for qsub includes "-cwd" to preserve the current directory. If the current working directory is different from the user's home directory, addprocs_sge doesn't work (it can't find the files with the information about the workers). From the file qsub.jl:

filenames(i) = "$home/julia-$(getpid()).o$id-$i","$home/julia-$(getpid())-$i.o$id","$home/julia-$(getpid()).o$id.$i"

Julia expects these files in the home dir, but they are created in the current dir instead.

Document usage

Hi,

I'm interested in using this with SGE but I'm not sure how to use it. The documentation covers how to write a new manager but has no examples of how to use the existing ones. It seems like this is for starting new Julia workers and not for launching Cmds. That is fine, but it would be nice to have some examples of typical usage.

Glen
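For anyone else landing on this issue, typical usage of the existing SGE manager looks roughly like the sketch below (keyword support varies across versions; the calls shown are the ones that appear elsewhere in this tracker):

# rough usage sketch for the SGE manager
using ClusterManagers                 # on Julia >= 0.7, also `using Distributed`

addprocs_sge(4)                       # submits 4 workers through qsub and waits for them
# equivalently: addprocs(SGEManager(4, ""), qsub_env="", res_list="")

@everywhere println("hello from worker ", myid())
rmprocs(workers())                    # release the SGE slots when finished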

Allow to specify queue name

It would be nice if addprocs_sge() somehow allowed specifying the SGE queue name. For our set-up, the default queue is "short", which kills the SGE workers after 1 hour of CPU time.

The queue name can be specified with the "-q $name" option to qsub.
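Other issues in this tracker show a queue keyword on the qsub-based entry points (e.g. addprocs_pbs(2, queue="default")), so the request here presumably amounts to something like the sketch below (not verified against the release that was current at the time):

# sketch of the requested interface
using ClusterManagers
addprocs_sge(8, queue="long")   # would translate to a `qsub ... -q long` submission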

HTCondor and windows

I'm trying to edit the HTCondor manager to get it running on Windows. I can get jobs submitted, but telnet connections don't seem to work. I have tried just running the generated .bat, but that fails to connect. Running it in msys doesn't work either. I can't guarantee telnet is even installed on the machines.

Is there an alternative way to get a connection going? Unfortunately the "cluster" is just all the PCs in the lab, which run Windows, limiting my current usage to making jobs that run a script, save, and return the output.

Slurm error

Hello

I'm just getting started with HPC; my submission system is Slurm.
I'm getting this error when trying to start up some processes:

julia> using ClusterManagers
julia> addprocs(SlurmManager(2), partition="debug", t="00:5:00")
Error launching Slurm job:
MethodError(length,(:all_to_all,))
0-element Array{Int64,1}

Is it necessary to write a submission script and submit it before this works? If so, what should one look like?

Any suggestions appreciated!

Recommend using `salloc` before using SlurmManager

Using addprocs_slurm to add 160 workers on a cluster, I noticed that workers were being dropped every 220 seconds while waiting for resources to be allocated. Commenting out lines 51:54 in https://github.com/JuliaParallel/ClusterManagers.jl/blob/master/src/slurm.jl

if time() > t0 + 60 + np
    warn("dropping worker: file not created in $(60 + np) seconds")
    break
end

fixed the problem, but this is a hack. Presumably there should be a way to start this timeout countdown only after resources have been allocated / the job has made it through the slurm batch queue.

Proper ClusterManager config for SharedArrays in SGE environment

I am trying to run a Julia script in an SGE environment on my department's compute cluster. The script makes heavy use of SharedArray objects, so I need to tell both Julia and the SGE scheduler to allocate workers on the same node.

I tried it with the SGE ORTE, which is probably better for DArray environments. I asked my sysadmin to add a new parallel environment for OpenMP with this website as a guide.

In the meantime, I want to ensure that my Julia code is at least attempting to correctly allocate the workers locally. I schedule my Julia script with qsub -pe orte 11 script.jl and then add the following code to the top of my script:

# set up cluster manager
using ClusterManagers
addprocs(LocalAffinityManager(np = 10, mode = COMPACT, affinities = Int[]))

# check if all cores on same host (node)
hosts = []
pids = []
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    push!(hosts, host)
    push!(pids, pid)
end
println("driver process is hosted on ", gethostname())
println("number of workers on host is ", nworkers())
println("where are workers hosted?")
display(hosts)

The last SGE run scheduled 6 processes on one node and 5 on the other. I expected that the previous code would show the workers allocated to two hosts. Instead, all worker processes claimed to reside on the same node:

driver process is hosted on compute-0-8.local
number of workers on host is 10
where are workers hosted?
10-element Array{Any,1}:
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"
 "compute-0-8.local"

How should I configure my ClusterManager to place all processes on the same node? (and how do I tell SGE to do the same?!)

Trivial example with Slurm is failing

The commands

using ClusterManagers
addprocs(SlurmManager(2))

fail with

Error launching Slurm job:
MethodError(length,(:all_to_all,))
0-element Array{Int64,1}

It seems that the launch function receives a keyword argument :topology=>:all_to_all that isn't handled well.

Print does not work without workaround

I don't know if this was intended, but println and other output commands do not work, whereas run(`echo "Hello"`) works fine. So I have been overloading println after I start the cluster:

@everywhere import Base.println
@everywhere println(x::AbstractString) = run(`echo "$x"`)

But I wonder if this workaround could find a place in the package?

NodeFile and qsub script

I have been adapting some code from the Julia Google group to parse the node file produced by a qsub command.

The difference from the existing qsub launcher is that qsub should not be called; only the node file is read. The goal is to be able to start parallel jobs without requiring an interactive session.

The idea is to call qsub from the command line; the qsub script then calls Julia, which parses the node file from the environment and adds the workers at the start.

I have a working version of my code, but not wrapped into a ClusterManager:
https://gist.github.com/tlamadon/636d8e468b328d23bb4d

In the code I use addprocs to connect the workers. The problem I have is that I don't know how to wrap this into a ClusterManager. It seems to me that addprocs actually uses the local manager already.

Any suggestion would be very welcome; I will be happy to write the ClusterManager and send a merge request.
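For comparison, the node-file approach can already be driven through the stock SSH-based addprocs without a dedicated ClusterManager; a minimal sketch (assuming passwordless SSH between nodes, a shared filesystem, and current Distributed conventions):

# sketch: add one SSH worker per entry in the PBS node file
using Distributed
nodefile = get(ENV, "PBS_NODEFILE", "")
hosts = isempty(nodefile) ? String[] : readlines(nodefile)
isempty(hosts) || addprocs(hosts)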

stdout not redirected to repl

While stdout for a local process appears in the REPL, that for a remote SGE process does not; see the transcript below. Not shown is that it works fine for remote SSH processes. Is this related to JuliaLang/julia#6030, JuliaLang/julia#5995, and #6?

[arthurb@h06u01 ~]$ juliac
[Julia startup banner] Version 0.3.0-prerelease+2417 (2014-04-02 18:29 UTC), Commit 193cb11* (0 days old master), x86_64-redhat-linux

julia> using ClusterManagers

julia> ClusterManagers.addprocs_sge(1)
job id is 8083782, waiting for job to start ..................................................
1-element Array{Any,1}:
2

julia> addprocs(1)
1-element Array{Any,1}:
3

julia> remotecall_fetch(3,println,"foo")
From worker 3: foo

julia> remotecall_fetch(2,println,"foo")

julia> remotecall_fetch(1,println,"foo")
foo

julia> versioninfo()
Julia Version 0.3.0-prerelease+2417
Commit 193cb11* (2014-04-02 18:29 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm

julia> Pkg.status()
10 required packages:

  • ClusterManagers 0.0.1+ master
  • DSP 0.0.1+ spectrogram
  • Debug 0.0.1
  • Distributions 0.4.2
  • HDF5 0.2.20
  • IProfile 0.2.5
  • MAT 0.2.3
  • PyPlot 1.2.2
  • Stats 0.1.0
  • WAV 0.2.2
11 additional packages:
  • ArrayViews 0.4.2
  • BinDeps 0.2.12
  • Color 0.2.9
  • NumericExtensions 0.6.0
  • NumericFuns 0.2.1
  • PDMats 0.1.1
  • Polynomial 0.1.1
  • PyCall 0.4.2
  • StatsBase 0.3.9
  • URIParser 0.0.1
  • Zlib 0.1.6

julia>

SLURM issues in 0.7+

There remain a couple of issues in the SLURM manager; a working solution is implemented in #95.

Not sure why, but cluster_cookie() throws an error on our (Linux) system; the PR fixes this by avoiding the worker_arg() call in slurm.jl and instead using (Distributed.init_multi(); Distributed.LPROC.cookie), which works fine.
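Spelled out, the workaround described above is just the following (it leans on Distributed internals, so treat it as a stop-gap rather than a stable API):

# the cookie workaround from #95, written out
using Distributed
Distributed.init_multi()
cookie = Distributed.LPROC.cookie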

SGE issue: could not connect

I tried using the following example:

using ClusterManagers
ClusterManagers.addprocs_sge(4)

And got the following output:

job id is 2559006, waiting for job to start ............................................................
    From worker 0:  Master process (id 1) could not connect within 60.0 seconds.
    From worker 0:  exiting.

I noticed (using qstat -U) that the jobs were started:

-bash-4.1$ qstat -U koslickd
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2559006 0.50500 julia-3457 koslickd     t     10/07/2015 21:58:44 [email protected]     1 4
2559006 0.50500 julia-3457 koslickd     t     10/07/2015 21:58:44 [email protected]     1 3
2559006 0.50500 julia-3457 koslickd     t     10/07/2015 21:58:44 [email protected]     1 2
2559006 0.50500 julia-3457 koslickd     r     10/07/2015 21:58:44 [email protected]     1 1

But shortly thereafter, the jobs seemed to disappear:

-bash-4.1$ qstat -U koslickd
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2559006 0.50500 julia-3457 koslickd     r     10/07/2015 21:58:44 [email protected]     1 1

Julia then seemed to freeze indefinitely and had to be killed manually.

Any ideas as to what went wrong?

Thanks,

~David

Update for `0.7`?

Hello,

As of now, ClusterManagers works on 0.7-beta2 with a ton of deprecation warnings, but that is OK. It is, however, broken on 0.7-rc2/rc3; any plans to update? (It won't even compile because of JuliaLang/julia#28499, which is related, so it may need a fix of that first.)

If I find some quick and dirty fixes I will submit them back here.

Thanks!

cwd not retained on SGE workers

Local workers have the same working directory as the master process, but SGE workers stay in the user's home directory. It would be nice if they were at least consistent; I prefer the local behavior. Let me know if the following console log is not clear.

[arthurb@h01u07 ~]$ cd ~
[arthurb@h01u07 ~]$ pwd
/home/arthurb
[arthurb@h01u07 ~]$ cd src
[arthurb@h01u07 src]$ pwd
/home/arthurb/src
[arthurb@h01u07 src]$ juliac
[Julia startup banner] Version 0.3.0-prerelease+2078 (2014-03-17 22:41 UTC), Commit 00ca83c* (1 day old master), x86_64-redhat-linux

julia> addprocs(1)
1-element Array{Any,1}:
2

julia> using ClusterManagers; addprocs(1, cman=SGEManager())
job id is 7264129, waiting for job to start ..............................
1-element Array{Any,1}:
3

julia> pwd()
"/home/arthurb/src"

julia> remotecall_fetch(2,pwd)
"/home/arthurb/src"

julia> remotecall_fetch(3,pwd)
"/home/arthurb"

julia> versioninfo()
Julia Version 0.3.0-prerelease+2078
Commit 00ca83c* (2014-03-17 22:41 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm

julia> Pkg.status()
9 required packages:

  • ClusterManagers 0.0.1
  • DSP 0.0.1+ ifloor
  • Debug 0.0.0
  • Devectorize 0.2.1
  • Distributions 0.4.2
  • MAT 0.2.3
  • PyPlot 1.2.2
  • Stats 0.1.0
  • WAV 0.2.2
11 additional packages:
  • ArrayViews 0.4.1
  • BinDeps 0.2.12
  • Color 0.2.8
  • HDF5 0.2.20
  • NumericExtensions 0.5.6
  • PDMats 0.1.1
  • Polynomial 0.1.1
  • PyCall 0.4.2
  • StatsBase 0.3.9
  • URIParser 0.0.1
  • Zlib 0.1.5

julia>
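A possible workaround for the inconsistency described above (a sketch using the modern Distributed argument order, not something the package does) is to push the master's working directory to every worker after they connect:

# sketch: make the workers' cwd match the master's
using Distributed
let d = pwd()
    for p in workers()
        remotecall_fetch(cd, p, d)   # remotecall_fetch(f, pid, args...)
    end
end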

LSF manager with bsub

Hello,

Are there plans to write a manager for lsf/bsub? If so, would it be able to use the infiniband transport for example.

I managed to "hack" a bsub submission to print out its hostnames and then use the SSHManager in Julia to connect to those. My experience on these fronts is limited, and I have had trouble connecting :all_to_all if the nodes happen to be on different racks.

I tried to mimic qsub.jl, with very limited success :)

Cheers!

Add a nonblocking addprocs_sge (and others)

Currently the use of addprocs_sge (and the others) is blocking: it must obtain the node before it returns. For example, applying @async to a command which calls addprocs_sge is useless.

I would like to know whether it is possible to add a non-blocking version of addprocs_sge. I am happy to help with the coding, but I am not overly familiar with the addprocs structure.

I suggested a workaround in a question on StackOverflow, but I am not sure my approach to the issue is recommended.

Slurm: srun being passed a Julia option?

On Julia 0.6, latest release of ClusterManagers:

julia> using ClusterManagers

julia> addprocs(SlurmManager(2), partition="pdebug", t="00:5:00")
srun: unrecognized option '--enable-threaded-blas=false'
Try "srun --help" for more information

SGEManager procs failing to launch

julia> using Distributed, ClusterManagers

julia> addprocs(SGEManager(1,""), qsub_env="",res_list="")
Error launching workers
MethodError(iterate, (Base.ProcessChain(Base.Process[Process(`echo 'cd /lmb/home/alewis && /lmb/home/alewis/local/julia-1.0.0/bin/julia --worker=EiIxQRHNaEkyxSBA'`, ProcessExited(0)), Process(`qsub -N julia-47120 -terse -j y -R y -t 1-1 -V`, ProcessExited(0))], Base.DevNull(), Pipe(RawFD(0xffffffff) closed => RawFD(0x00000014) open, 0 bytes waiting), Base.TTY(RawFD(0x0000000f) open, 0 bytes waiting)),), 0x00000000000061b3)
0-element Array{Int64,1}

julia> addprocs_sge(1)
Error launching workers
MethodError(iterate, (Base.ProcessChain(Base.Process[Process(`echo 'cd /lmb/home/alewis && /lmb/home/alewis/local/julia-1.0.0/bin/julia --worker=EiIxQRHNaEkyxSBA'`, ProcessRunning), Process(`qsub -N julia-47120 -terse -j y -R y -t 1-1 -V`, ProcessRunning)], Base.DevNull(), Pipe(RawFD(0xffffffff) closed => RawFD(0x00000015) open, 0 bytes waiting), Base.TTY(RawFD(0x0000000f) open, 0 bytes waiting)),), 0x00000000000061b3)
0-element Array{Int64,1}

I've twiddled with a few things, like checking that qsub -N julia-47120 -terse -j y -R y -t 1-1 -V itself runs without error, but I don't really know how to go about debugging this further, short of reading up on each of the functions named in the error or on the launch method. Any suggestions?

As a workaround, I've been queueing myself with qrsh inside a tmux session and then launching julia once I'm out of the queue, or writing Julia scripts that can be called from a qsub script. In doing so I've noticed that adding processes with the addprocs(machines) syntax fails for any hostname not in my .ssh/known_hosts. This is, I expect, unrelated to the main issue, but I've included the error message below just in case.

julia> addprocs(["fmg01"])
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Host key verification failed.
ERROR: Unable to read host:port string from worker. Launch command exited with error?
error(::String) at ./error.jl:33
read_worker_host_port(::Pipe) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:273
connect(::Distributed.SSHManager, ::Int64, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/managers.jl:397
create_worker(::Distributed.SSHManager, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:505
setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:451
(::getfield(Distributed, Symbol("##47#50")){Distributed.SSHManager,WorkerConfig})() at ./task.jl:259
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:226
 [2] #addprocs_locked#44(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::Function, ::Distributed.SSHManager) at ./task.jl:266
 [3] #addprocs_locked at ./none:0 [inlined]
 [4] #addprocs#43(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::Function, ::Distributed.SSHManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:369
 [5] #addprocs at ./none:0 [inlined]
 [6] #addprocs#251(::Bool, ::Cmd, ::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Array{String,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/managers.jl:118
 [7] addprocs(::Array{String,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/managers.jl:117
 [8] top-level scope at none:0

julia here is the precompiled 1.0.0 binary for Linux, running on a shared-filesystem, Scientific Linux 7.4 cluster, with SGE 6.2u3 installed.

Problems with SGE

I'm getting the following using the newest Julia on my SGE cluster. Not sure where to begin debugging this.

In [4]:

ClusterManagers.addprocs_sge(4)
job id is 168009, waiting for job to start ...............................................
connect: connection timed out (ETIMEDOUT)
at In[4]:1
 in yield at multi.jl:1540
 in wait at task.jl:117
 in wait_connected at stream.jl:263
 in connect at stream.jl:878
 in create_worker at multi.jl:1036
 in start_cluster_workers at multi.jl:1000
 in addprocs_internal at multi.jl:1202
 in addprocs at multi.jl:1205
 in addprocs_sge at /home/malmaud/.julia/ClusterManagers/src/qsub.jl:73

srun: unrecognized option '--enable-threaded-blas=false'

I'm having the same issue as #75 on Julia 0.6.2, ClusterManagers 0.1.2. Let me know if you can reproduce?

test.sbatch

#!/bin/bash
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 4
#SBATCH --time 1
ml load julia/0.6
julia ./slurm.jl

slurm.jl

using ClusterManagers
addprocs(SlurmManager(2))

hosts = []
pids = []
for i in workers()
	host, pid = fetch(@spawnat i (gethostname(), getpid()))
	push!(hosts, host)
	push!(pids, pid)
end

for i in workers()
	rmprocs(i)
end

PBSPro qsub does not have -t

Hello. I tried to use the ClusterManagers package on our cluster, which runs PBSPro version 13 (I also have version 14 available). The result is below.
I think the -t option is specific to OpenPBS/Torque (I may be wrong!).
I am happy to help with a port to PBSPro.

Also note that on our cluster you cannot submit directly to a queue.
Jobs go to the entry queue and are then sent to queues depending on the resources you request, including walltime. So the empty queue must work.

John Hearns

julia> addprocs_pbs(12,queue="long")
qsub: invalid option -- 't'
usage: qsub [-a date_time] [-A account_string] [-c interval]
[-C directive_prefix] [-e path] [-f ] [-h ] [-I [-X]] [-j oe|eo] [-J X-Y[:Z]]
[-k o|e|oe] [-l resource_list] [-m mail_options] [-M user_list]
[-N jobname] [-o path] [-p priority] [-P project] [-q queue] [-r y|n]
[-S path] [-u user_list] [-W otherattributes=value...]
[-v variable_list] [-V ] [-z] [script | -- command [arg1 ...]]
qsub --version
batch queue not available (could not run qsub)
0-element Array{Int64,1}

Add transport to ElasticManager

I'm still busy setting things up to get the ElasticManager running on my local network and in a Docker image running on ECS, but I wondered whether a transport option that connects the workers via an SSH tunnel would work? SSH works fine, so it should theoretically be an adaptation of the elastic manager with aspects of the SSH manager included.

I'm unsure as to whether this has any merit. If this seems like a good idea, I should be able to make an attempt at it next week.
Using the elastic manager solves my problem of having PCs going off and on, so with some extra effort to get it running and retrying until connections work over the network, the elastic manager becomes a great solution!

slurm : job credential expired

I see this:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.1 (2017-10-24 22:15 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using ClusterManagers

julia> addprocs_slurm(3)
srun: job 14774 queued and waiting for resources
srun: job 14774 has been allocated resources
srun: error: Task launch for 14774.0 failed on node magi70: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
WARNING: dropping worker: file not created in 63 seconds
WARNING: dropping worker: file not created in 63 seconds
WARNING: dropping worker: file not created in 63 seconds
0-element Array{Int64,1}

and this is my squeue:

florian.oswald@magi3:~$ squeue -u florian.oswald
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             14774   COMPUTE julia-50 florian.  R       0:15      1 magi70
florian.oswald@magi3:~$ squeue -u florian.oswald
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
florian.oswald@magi3:~$ 

Notice that while the job is running, the scheduler allows me to SSH into the compute node without a password, so that requirement is fulfilled.

`addprocs_sge` does not allow user to set queue switch

Hi,

To launch with qsub I need to be able to set the queue with the -P switch (qsub -P queue_name). However, the code uses qsub -q queue_name, which will not work for me. I'm willing to submit a PR but I'm not sure how it should be added. Perhaps with a keyword argument, addprocs_sge(5, queue="fast", queue_switch="-P"), or, if the queue name starts with a - and contains a space, assume it already includes the switch? What do people prefer?

Glen

Change scheduler manager

Hi,

This is not really an issue but more of a question.

Do you know if it is possible to use OAR instead of Slurm? The cluster I want to use is set up with OAR as its manager.

Thank you for your help,

Best regards.

[PackageEvaluator.jl] Your package ClusterManagers may have a testing issue.

This issue is being filed by a script, but if you reply, I will see it.

PackageEvaluator.jl is a script that runs nightly. It attempts to load all Julia packages and run their tests (if available) on both the stable version of Julia (0.2) and the nightly build of the unstable version (0.3).

The results of this script are used to generate a package listing enhanced with testing results.

The status of this package, ClusterManagers, on...

  • Julia 0.2 is 'No tests, but package loads.' PackageEvaluator.jl
  • Julia 0.3 is 'No tests, but package loads.' PackageEvaluator.jl

'No tests, but package loads.' can be due to there being no tests (you should write some if you can!) but can also be due to PackageEvaluator not being able to find your tests. Consider adding a test/runtests.jl file.

'Package doesn't load.' is the worst-case scenario. Sometimes this arises because your package doesn't have BinDeps support, or needs something that can't be installed with BinDeps. If this is the case for your package, please file an issue and an exception can be made so your package will not be tested.

This automatically filed issue is a one-off message. Starting soon, issues will only be filed when the testing status of your package changes in a negative direction (gets worse). If you'd like to opt-out of these status-change messages, reply to this message.

Hangs on `addprocs_sge()`

I hope someone can help me. I am trying to do parallel computing with Julia on our SGE grid system, which I normally only feed with shell scripts.

When I run, for example, addprocs_sge(5, res_list="ct=00:01:00"), it immediately shows the received job id and waits for the job to start. A few seconds later, however, I receive error messages indicating that it can't tail the log files in my home directory, which are in fact present:

julia> using ClusterManagers

julia> addprocs_sge(5,res_list="ct=00:01:00")
job id is 5430281, waiting for job to start ........................
tail: /path/to/my/home/julia-59333.o5430281.1: No such file or directory
tail: no files remaining
tail: /path/to/my/home/julia-59333.o5430281.5: No such file or directory
tail: no files remaining

This is the content of one of the log files; it apparently fails to run the julia executable, which is of course accessible (I am using the same binary for the REPL):

***************************************************************
* Submitted on:            Tue Feb 07 09:30:16 2017           *
* Started on:              Tue Feb 07 09:31:18 2017           *
***************************************************************

/var/spool/sge/ccwsge0830/job_scripts/5430281:1: permission denied: /path/to/my/home/apps/julia/julia-0.5.0/bin/julia

***************************************************************
* Ended on:                Tue Feb 07 09:31:36 2017           *
* Exit status:             126                                *
* Consumed                                                    *
*   cpu (HS06):            00:00:00                           *
*   cpu scaling factor:    11.100000                          *
*   cpu time:              0 / 60                             *
*   efficiency:            00 %                               *
*   io:                    0.00000GB                          *
*   vmem:                  N/A                                *
*   maxvmem:               N/A                                *
*   maxrss:                N/A                                *
***************************************************************

Any ideas what's happening here?

Is there any way to specify -l parameters to qsub in the PBS manager?

I need to give a few parameters to the qsub command such as:

-l nodes=2:ppn=16,walltime=1:00:00

I've been looking through the code, and it seems like qsub_env is the only parameter we can tweak, and if I'm not wrong that just passes environment variables through to the assigned hosts.

It doesn't look too hard to add another set of parameters for additional qsub commands like this, but I thought I'd ask just in case I missed something.
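For what it's worth, a res_list keyword already shows up for the SGE entry point elsewhere in this tracker (e.g. addprocs_sge(5, res_list="ct=00:01:00")), so the requested PBS analogue would presumably look like the hypothetical call below (not the current API):

# hypothetical sketch of passing -l resources through addprocs_pbs
using ClusterManagers
addprocs_pbs(32, queue="default", res_list="nodes=2:ppn=16,walltime=1:00:00")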

Passing Environment Variables with qsub

My cluster has an old compiler, so I compiled GCC in my user directory and had to add my lib directory to LD_LIBRARY_PATH. Now launching workers does not work:

/home/---/julia/usr/bin/julia: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/---/julia/usr/bin/../lib/libjulia.so)

The environment variables are not passed to the launched workers. Adding a -V option to the qsub command fixes this. Should we add it as a default?

Jobfile is incorrect on some PBS clusters

When I submit an array job to the cluster with a command:

qsub -N test -j oe -k o -t 1:5 ./test

I get a list of log files of the form

test-1.o50084
test-2.o50084
test-3.o50084
test-4.o50084
test-5.o50084

which is different from the log-file naming scheme ClusterManagers currently looks for. On the cluster I have access to, qsub --version reports 4.2.10. It seems the output naming scheme changed somewhere before 6.0.0, as can be seen in the linked issue, but this specific detail appears to be undocumented.

When I changed the naming scheme manually in qsub.jl, the cluster worked fine. So this might be an annoying bug for newcomers, with an easy fix. For example, filename(i) could return a list of files to be checked in the loop, or there could be another condition that decides the filename scheme depending on the qsub version.
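The "list of candidate filenames" idea might look like the hypothetical helper below: probe each known qsub log-file naming scheme (the three patterns used in qsub.jl today) and pick whichever actually exists:

# hypothetical helper: return the first worker log file that exists
function find_worker_logfile(home::AbstractString, id, i)
    candidates = ["$home/julia-$(getpid()).o$id-$i",
                  "$home/julia-$(getpid())-$i.o$id",
                  "$home/julia-$(getpid()).o$id.$i"]
    idx = findfirst(isfile, candidates)
    return idx === nothing ? nothing : candidates[idx]
end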

Static compilation fails on ClusterManagers.jl

I'm trying to create a system image of Julia with ClusterManagers.jl embedded. I encountered the following error while building Julia (using the release-0.5 branch):

$ make -j 4
    JULIA usr/lib/julia/sys.o
coreio.jl
exports.jl
essentials.jl
base.jl
generator.jl
reflection.jl
options.jl
promotion.jl
tuple.jl
range.jl
expr.jl
error.jl
bool.jl
number.jl
int.jl
operators.jl
pointer.jl
refpointer.jl
WARNING: Method definition (::Type{#T<:Any})(Any) in module Inference at coreimg.jl:39 overwritten in module Base at sysimg.jl:53.
checked.jl
abstractarray.jl
subarray.jl
array.jl
hashing.jl
rounding.jl
float.jl
complex.jl
rational.jl
multinverses.jl
abstractarraymath.jl
arraymath.jl
float16.jl
simdloop.jl
reduce.jl
reshapedarray.jl
bitarray.jl
intset.jl
dict.jl
set.jl
iterator.jl
build_h.jl
version_git.jl
osutils.jl
c.jl
sysinfo.jl
io.jl
iostream.jl
iobuffer.jl
char.jl
intfuncs.jl
strings/strings.jl
strings/errors.jl
strings/string.jl
strings/types.jl
strings/basic.jl
strings/search.jl
strings/util.jl
strings/io.jl
strings/utf8proc.jl
parse.jl
shell.jl
regex.jl
pcre.jl
show.jl
base64.jl
nullable.jl
version.jl
libc.jl
libdl.jl
env.jl
libuv.jl
uv_constants.jl
event.jl
task.jl
lock.jl
threads.jl
weakkeydict.jl
stream.jl
socket.jl
filesystem.jl
process.jl
multimedia.jl
grisu/grisu.jl
methodshow.jl
floatfuncs.jl
math.jl
cartesian.jl
multidimensional.jl
permuteddimsarray.jl
reducedim.jl
ordering.jl
collections.jl
sort.jl
WARNING: Method definition searchsortedfirst(AbstractArray{T<:Any, 1}, Any) in module Sort at sort.jl:184 overwritten at sort.jl:187.
WARNING: Method definition searchsortedlast(AbstractArray{T<:Any, 1}, Any) in module Sort at sort.jl:184 overwritten at sort.jl:187.
WARNING: Method definition searchsorted(AbstractArray{T<:Any, 1}, Any) in module Sort at sort.jl:184 overwritten at sort.jl:187.
gmp.jl
mpfr.jl
combinatorics.jl
hashing2.jl
dSFMT.jl
random.jl
printf.jl
meta.jl
Enums.jl
serialize.jl
channels.jl
clusterserialize.jl
multi.jl
workerpool.jl
pmap.jl
managers.jl
asyncmap.jl
loading.jl
mmap.jl
sharedarray.jl
datafmt.jl
deepcopy.jl
interactiveutil.jl
replutil.jl
test.jl
i18n.jl
initdefs.jl
Terminals.jl
LineEdit.jl
REPLCompletions.jl
REPL.jl
client.jl
util.jl
linalg/linalg.jl
broadcast.jl
statistics.jl
irrationals.jl
dft.jl
dsp.jl
quadgk.jl
fastmath.jl
libgit2/libgit2.jl
pkg/pkg.jl
stacktraces.jl
profile.jl
dates/Dates.jl
sparse/sparse.jl
threadcall.jl
deprecated.jl
docs/helpdb.jl
docs/helpdb/Base.jl
docs/basedocs.jl
markdown/Markdown.jl
docs/Docs.jl
/home/julia/TopStarredImage/julia/base/precompile.jl
LoadError("sysimg.jl",381,LoadError("/home/julia/TopStarredImage/julia/base/userimg.jl",1,LoadError("/home/julia/TopStarredPackages/v0.5/ClusterManagers/src/ClusterManagers.jl",7,UndefRefError())))
*** This error is usually fixed by running `make clean`. If the error persists, try `make cleanall`. ***
make[1]: *** [/home/julia/TopStarredImage/julia/usr/lib/julia/sys.o] Error 1
make: *** [julia-sysimg-release] Error 2 

OS = Ubuntu 14.04
ClusterManagers version = 0.1.0

Adding processes to HTCondor

When adding a process to an HTCondor cluster I get the following output:
julia> using ClusterManagers
julia> addprocs_htc(1)
Submitting job(s).
Waiting for 1 workers:

Then nothing happens; the function is stuck. I'm on Julia 0.5 and have also tried 0.4 with the same result. I am able to manually submit jobs to the cluster from the terminal.

Edit: I'm running the cluster on localhost and all the server ports are closed except for the SSH port, which might be why the jobs aren't submitted correctly.

Example of reading cookie from (shared) file?

Hi all,

I'm sorry to do this, but I've been looking through the code most of the afternoon and I'm still lost as to how to handle this. Basically, I'm not allowed to have the cluster cookie passed via the command line, since it's visible to all users of the system. It is allowed to have the workers read a file (owned by me, in a shared location) for the cookie.

I've read the docs where it says

Note that environments requiring higher levels of security can implement this via a custom ClusterManager. For example, cookies can be pre-shared and hence not specified as a startup argument.

but I can't figure out how to do this.

Ideally, I'd modify the Slurm connection manager to allow cookies to be read from a particular location, but again, I don't even know where to start. Any advice appreciated.
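A minimal sketch of the pre-shared-cookie idea, assuming a cookie file on a shared filesystem readable only by you (cluster_cookie and start_worker are part of the Distributed stdlib; wiring this into the Slurm manager's launch path is left out here):

# sketch: read the cookie from a file instead of passing it on the command line
using Distributed
cookie = String(strip(read("/path/to/shared/cookie_file", String)))
Distributed.cluster_cookie(cookie)     # master side: set the cookie before addprocs
# worker side: instead of receiving `--worker=<cookie>` on its command line, the
# launched process would read the same file and call Distributed.start_worker(cookie)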

LocalAffinityManager not updated for 1.0

Attempting to do addprocs(LocalAffinityManager(np = 2, mode = COMPACT, affinities = Int[])) gives errors; it looks like affinity.jl has not been updated for 1.0. Below are the errors I got, each after fixing the previous one:

ERROR: UndefVarError: OS_NAME not defined (fix: use Sys.KERNEL)
ERROR: UndefVarError: assert not defined (fix: use @assert)
ERROR: UndefVarError: CPU_CORES not defined (fix: use Sys.CPU_THREADS)
ERROR: MethodError: no method matching iterate(::Base.Process) (fix: ??)
