Comments (17)
I just pushed another update to amitm/debug which uses Base.start_worker in srun_cmd instead of the --worker option. Can you try with that and also post the debug output here?
from clustermanagers.jl.
Do you have more than one version of Julia installed?
from clustermanagers.jl.
This is most probably due to the workers being on 0.4 and the master on 0.5, or the other way around.
from clustermanagers.jl.
I have just the official 0.5 precompiled binary of Julia on all systems.
I just rechecked and there is no other version of Julia flying around anywhere.
julia also runs fine on the workers when started manually.
Edit: I had 0.4.6 and 0.5.0 installed previously, then removed 0.4.6 and tried again with the same result.
from clustermanagers.jl.
I just tested using ClusterManagers; addprocs_slurm(1) with 0.4.7, and there it runs flawlessly.
from clustermanagers.jl.
Have you done a Pkg.update() on 0.5? You will need a compatible version of ClusterManagers too.
from clustermanagers.jl.
I did Pkg.update(). I also tried Pkg.checkout for ClusterManagers, with the same result.
from clustermanagers.jl.
Unfortunately I don't have access to a SLURM setup to try this out. Can you post any output from the dead worker? I think it is written to jobN.out files in the CWD on the login node.
Ref: ClusterManagers.jl/src/slurm.jl, line 49 in 00b1139
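To gather that output in one go, something like the following could be run from the directory the job was submitted from. This is only a sketch for a current Julia 1.x session: the helper name collect_job_output is made up for illustration and is not part of ClusterManagers, and it assumes the files are named job followed by digits (for example job0000.out).

```julia
# Hypothetical helper, not part of ClusterManagers: collect the contents of
# all files named like job0000.out in `dir` (assumed naming pattern).
function collect_job_output(dir::AbstractString = pwd())
    names = sort(filter(f -> occursin(r"^job\d+\.out$", f), readdir(dir)))
    return [name => read(joinpath(dir, name), String) for name in names]
end

# Print every worker log found in the current working directory.
for (name, text) in collect_job_output()
    println("==== ", name, " ====")
    println(text)
end
```

If nothing is printed, no matching files exist in that directory, which would suggest the worker output went somewhere else.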
from clustermanagers.jl.
I found that addprocs_slurm(n) failed for me but addprocs(SlurmManager(n),...) worked.
from clustermanagers.jl.
Here is the job0000.out:
julia_worker:9009#(theip)
ErrorException("Process(1) - Invalid connection credentials sent by remote.")CapturedException(ErrorException("Process(1) - Invalid connection credentials sent by remote."),Any[( in process_hdr(::TCPSocket, ::Bool) at multi.jl:1400,1),( in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at multi.jl:1299,1),( in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at multi.jl:1276,1),( in (::Base.##618#619{TCPSocket,TCPSocket,Bool})() at event.jl:68,1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
I experience no difference between addprocs_slurm(n) and addprocs(SlurmManager(n)).
from clustermanagers.jl.
Can you check out branch amitm/debug and try? I have added a couple of debug statements which print the local cookie and the slurm command.
from clustermanagers.jl.
Here are the results:
cookie: 2UPo2qVQVRPUhqRD
VERSION: 0.5.0
worker_arg: `--worker 2UPo2qVQVRPUhqRD`
srun_cmd: `srun -J julia-29603 -n 1 -o job%4t.out -D /nfs/numerik/bzfsikor /nfs/datanumerik/bzfsikor/julia/julia-0.5.0/julia-3c9d75391c/bin/julia --worker 2UPo2qVQVRPUhqRD`
from clustermanagers.jl.
Seems you had the right instinct here, it is working now :)
julia> using ClusterManagers; addprocs_slurm(1)
srun_cmd: `srun -J julia-19892 -n 1 -o job%4t.out -D /nfs/datanumerik/bzfsikor/julia/pkgdir/v0.5/ClusterManagers /nfs/datanumerik/bzfsikor/julia/julia-0.5.0/julia-3c9d75391c/bin/julia -e 'Base.start_worker("1E5ahz7GRYCxfGs8")'`
srun: job 1103 queued and waiting for resources
srun: job 1103 has been allocated resources
1-element Array{Int64,1}:
2
So there is something wrong with the --worker command line option? Strange that nobody else had problems...
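For comparison, the two launch styles seen in the debug output can be put side by side. The sketch below only constructs the two argument lists and does not run anything; julia stands in for the full path to the binary, and the cookie is the example value from the earlier debug output. It is meant as an illustration for a current Julia session, not as the code ClusterManagers uses.

```julia
# Illustration only: build (but do not run) the two worker launch commands.
cookie = "2UPo2qVQVRPUhqRD"   # example cookie taken from the debug output above

# Style 1: cookie passed as the argument of the --worker option.
flag_cmd = `julia --worker $cookie`

# Style 2: cookie passed inside an expression evaluated with -e.
expr = "Base.start_worker(\"$cookie\")"
eval_cmd = `julia -e $expr`

# Show how each command is split into arguments.
println(flag_cmd.exec)
println(eval_cmd.exec)
```

In both styles the cookie reaches the worker as part of a single argument, so the difference would have to lie in how the worker process reads it.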
from clustermanagers.jl.
Maybe it has something to do with the local environment on your cluster. What is the locale on the worker nodes? Is it a non-English language?
@vtjnash: do you have any ideas as to what could be the problem here?
The cookie is passed as a required argument to --worker and is read here - https://github.com/JuliaLang/julia/blob/0faf8ce200103839577171d28a4d1545fa827336/base/client.jl#L224
The comparison which is failing on @axsk's system is here - https://github.com/JuliaLang/julia/blob/0faf8ce200103839577171d28a4d1545fa827336/base/multi.jl#L1402-L1406 - basically, the cookie read from the command line is compared to the one read from the socket.
@axsk: are you open to building julia from source for the workers? I can provide a patch with appropriate debug statements to track down this issue.
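As a rough picture of that comparison, here is a simplified sketch. It is not the code in base/multi.jl: the function name check_cookie is hypothetical, the 16-character length is assumed from the cookies shown in the debug output, and an in-memory IOBuffer stands in for the socket. It is written for a current Julia 1.x session.

```julia
# Simplified illustration only; not the code in base/multi.jl.
const COOKIE_LEN = 16   # assumption: matches the length of the cookies posted above

# Read COOKIE_LEN bytes from the stream and compare them with the local cookie.
function check_cookie(io::IO, local_cookie::AbstractString)
    remote_cookie = String(read(io, COOKIE_LEN))
    return remote_cookie == local_cookie
end

local_cookie = "2UPo2qVQVRPUhqRD"   # what the worker got on its command line

# Matching cookie on the "socket": the connection would be accepted.
println(check_cookie(IOBuffer(local_cookie), local_cookie))

# Different cookie on the "socket": the connection would be rejected.
println(check_cookie(IOBuffer("XXXXXXXXXXXXXXXX"), local_cookie))
```

Any difference between the two strings, including a truncated or altered command line argument, makes the check fail, which is consistent with the "Invalid connection credentials" message in the worker log.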
from clustermanagers.jl.
locale returns en_US.UTF-8 everywhere.
I'm open to building julia with the debug statements, but probably not today anymore, since I now have enough work with actually running the code I needed on the cluster :)
from clustermanagers.jl.
Here I am with another update:
I reinstalled all the packages (now on Julia 0.5.2), hence the Base.start_worker patch is gone.
The bug returned, i.e. I get the same version read error, but it only happens about 50% of the time when I run addprocs.
from clustermanagers.jl.
Too old to reproduce. Please check the new release.
from clustermanagers.jl.