Comments (17)

amitmurthy commented on July 2, 2024

I just pushed another update to amitm/debug which uses Base.start_worker in srun_cmd instead of the --worker option. Can you try with that and also post the debug output here?

from clustermanagers.jl.

andreasnoack commented on July 2, 2024

Do you have more than one version of Julia installed?

amitmurthy commented on July 2, 2024

This is most probably due to the workers running 0.4 and the master 0.5, or the other way around.

axsk commented on July 2, 2024

I have just the official 0.5 precompiled binary of Julia on all systems.
I just rechecked and there is no other version of Julia flying around anywhere.

Julia also runs fine on the workers when started manually.

Edit: I had 0.4.6 and 0.5.0 installed previously, then removed 0.4.6 and tried again with the same result.

axsk commented on July 2, 2024

I just tested using ClusterManagers; addprocs_slurm(1) with 0.4.7, and there it runs flawlessly.

amitmurthy commented on July 2, 2024

Have you done a Pkg.update() on 0.5? You will need a compatible version of ClusterManagers too.

axsk commented on July 2, 2024

I did Pkg.update(). I also tried Pkg.checkout for ClusterManagers, with the same result.

amitmurthy commented on July 2, 2024

Unfortunately I don't have access to a SLURM setup to try this out. Can you post any output from the dead worker? I think it is written to jobN.out files in the CWD on the login node.

Ref:

fn = "$exehome/job$(lpad(i, 4, "0")).out"
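As a rough illustration of how that line maps a task index to a file name, here is a minimal sketch. The helper name `job_output_path` and the directory `/path/to/cwd` are hypothetical and not part of ClusterManagers; the sketch assumes Julia 1.x, where `lpad` accepts an integer as its first argument.

```julia
# Hypothetical helper (not part of ClusterManagers): rebuild the per-task
# log file path using the same expression as the referenced line.
job_output_path(exehome::AbstractString, i::Integer) =
    "$exehome/job$(lpad(i, 4, "0")).out"

# The task index is zero-padded to four digits.
println(job_output_path("/path/to/cwd", 0))
println(job_output_path("/path/to/cwd", 12))
```

This lines up with the `-o job%4t.out` option passed to srun, where `%4t` is the task identifier padded to four digits, so the first task writes to job0000.out.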

dmbates commented on July 2, 2024

I found that addprocs_slurm(n) failed for me but addprocs(SlurmManager(n),...) worked.

axsk commented on July 2, 2024

Here is job0000.out:

julia_worker:9009#(theip)
ErrorException("Process(1) - Invalid connection credentials sent by remote.")CapturedException(ErrorException("Process(1) - Invalid connection credentials sent by remote."),Any[( in process_hdr(::TCPSocket, ::Bool) at multi.jl:1400,1),( in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at multi.jl:1299,1),( in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at multi.jl:1276,1),( in (::Base.##618#619{TCPSocket,TCPSocket,Bool})() at event.jl:68,1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

I experience no difference between addprocs_slurm(n) and addprocs(SlurmManager(n)).

amitmurthy commented on July 2, 2024

Can you check out the amitm/debug branch and try it? I have added a couple of debug statements which print the local cookie and the slurm command.

axsk commented on July 2, 2024

Here are the results:

cookie: 2UPo2qVQVRPUhqRD
VERSION: 0.5.0
worker_arg: `--worker 2UPo2qVQVRPUhqRD`
srun_cmd: `srun -J julia-29603 -n 1 -o job%4t.out -D /nfs/numerik/bzfsikor /nfs/datanumerik/bzfsikor/julia/julia-0.5.0/julia-3c9d75391c/bin/julia --worker 2UPo2qVQVRPUhqRD`

axsk commented on July 2, 2024

Seems you had the right instinct here; it is working now :)

julia> using ClusterManagers; addprocs_slurm(1)
srun_cmd: `srun -J julia-19892 -n 1 -o job%4t.out -D /nfs/datanumerik/bzfsikor/julia/pkgdir/v0.5/ClusterManagers /nfs/datanumerik/bzfsikor/julia/julia-0.5.0/julia-3c9d75391c/bin/julia -e 'Base.start_worker("1E5ahz7GRYCxfGs8")'`
srun: job 1103 queued and waiting for resources
srun: job 1103 has been allocated resources

1-element Array{Int64,1}:
 2

So there is something wrong with the --worker command line option? Strange that nobody else had problems...
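For comparison, the two ways of handing the cookie to the worker that appear in the srun_cmd output can be sketched as plain Julia command objects. This is only an illustration of the two argument shapes, not the ClusterManagers source: `exe` is a placeholder for the full path to the Julia binary, the cookie is the one from the earlier debug output, and nothing is launched.

```julia
# Placeholder values (assumptions for illustration only).
cookie = "2UPo2qVQVRPUhqRD"
exe = "julia"

# Shape 1: the cookie is the argument of the --worker option (Julia 0.5 syntax).
flag_cmd = `$exe --worker $cookie`

# Shape 2: the cookie is embedded in an expression evaluated with -e.
code = "Base.start_worker(\"$cookie\")"
eval_cmd = `$exe -e $code`

# Inspect the argument vectors without running either command.
println(flag_cmd.exec)
println(eval_cmd.exec)
```

In the first shape the worker has to recover the cookie by parsing its command line; in the second the cookie arrives as a string literal inside Julia code, which bypasses that parsing step.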

amitmurthy commented on July 2, 2024

Maybe it has something to do with the local environment on your cluster. What is the locale on the worker nodes? Is it a non-English language?

@vtjnash: do you have any ideas as to what could be the problem here?

The cookie is being passed as a required arg with --worker and is read here - https://github.com/JuliaLang/julia/blob/0faf8ce200103839577171d28a4d1545fa827336/base/client.jl#L224

The comparison which is failing on @axsk's system is here - https://github.com/JuliaLang/julia/blob/0faf8ce200103839577171d28a4d1545fa827336/base/multi.jl#L1402-L1406 - basically the cookie read from the command line is compared to the one read from the socket.
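A simplified sketch of that kind of check is shown below. It is not the code in base/multi.jl: `cookie_matches` and `COOKIE_LEN` are hypothetical names, the 16-byte length is assumed from the 16-character cookies in the debug output, an `IOBuffer` stands in for the TCP socket, and the sketch targets Julia 1.x.

```julia
# Assumed cookie length, taken from the 16-character cookies printed in this thread.
const COOKIE_LEN = 16

# Hypothetical helper: read a fixed-length cookie from a stream and compare
# it with the cookie this process was started with.
function cookie_matches(io::IO, local_cookie::AbstractString)
    remote_cookie = String(read(io, COOKIE_LEN))
    return remote_cookie == local_cookie
end

local_cookie = "2UPo2qVQVRPUhqRD"

# A stream carrying the same cookie passes the check.
println(cookie_matches(IOBuffer("2UPo2qVQVRPUhqRD"), local_cookie))

# A stream carrying a different cookie fails it, which is the situation
# reported as "Invalid connection credentials sent by remote."
println(cookie_matches(IOBuffer("1E5ahz7GRYCxfGs8"), local_cookie))
```

Under this model, any step that alters the cookie between the command line and the worker's stored value would make the comparison fail even though the master sent the right bytes.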

@axsk: are you open to building Julia from source for the workers? I can provide a patch with appropriate debug statements to track down this issue.

axsk commented on July 2, 2024

locale returns en_US.UTF-8 everywhere.

I'm open to building Julia with the debug statements, but probably not today, since I now have enough work actually running the code I needed on the cluster :)

axsk commented on July 2, 2024

Here I am with another update:

I reinstalled all the packages (now on Julia 0.5.2), so the Base.start_worker patch is gone.
The bug has returned, i.e. I get the same version read error, but only in about 50% of my attempts to run addprocs.

juliohm commented on July 2, 2024

Too old to reproduce. Please check the new release.
