if i add local workers before adding remote SGE workers, then the SGE workers

addprocs(n); addprocs(m, cman=SGEManager()) fails, while the reverse works fine (clustermanagers.jl, 7 comments, closed)

juliaparallel commented on July 2, 2024

addprocs(n); addprocs(m, cman=SGEManager()) fails, while the reverse works fine

from clustermanagers.jl.

Comments (7)

amitmurthy commented on July 2, 2024

I am not familiar with SGE, but I can take a guess at what is happening.

In Julia, all workers are connected to each other. The way this works is that after the main process (pid 1) launches a worker, the worker writes the ip:port it is listening on to its stdout. pid 1 connects to this address and then sends the new worker a list of host:port addresses (of existing workers) it in turn should connect to. The later workers always initiate a connection to the previously launched workers.
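To make the ordering concrete, here is a toy model of the join sequence described above. It is only a sketch: `WorkerInfo`, `register!`, and the 10.0.0.x addresses are hypothetical names and values made up for illustration, and this is not the code Julia itself runs.

```julia
# Toy model of the join order described above (hypothetical names throughout).
struct WorkerInfo
    pid::Int
    addr::String   # "host:port" the worker printed on its stdout
end

# Register a new worker with pid 1 and return the addresses it is told to dial.
function register!(known::Vector{WorkerInfo}, addr::String)
    pid = length(known) + 2             # pid 1 is the main process
    to_dial = [w.addr for w in known]   # every previously launched worker
    push!(known, WorkerInfo(pid, addr))
    return pid, to_dial
end

known = WorkerInfo[]
pid2, dial2 = register!(known, "10.0.0.2:9009")   # first worker: nobody to dial
pid3, dial3 = register!(known, "10.0.0.3:9009")   # dials worker 2
pid4, dial4 = register!(known, "10.0.0.4:9009")   # dials workers 2 and 3

for (pid, dial) in ((pid2, dial2), (pid3, dial3), (pid4, dial4))
    println("worker ", pid, " dials ", length(dial), " earlier worker(s)")
end
```

The point of the model is that a later worker can only reach an earlier one if the address pid 1 recorded for the earlier worker is routable from the later worker's host.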

What seems to be happening is that while workers on localhost can initiate connections to workers on SGE nodes, the reverse is not true, i.e., workers on SGE nodes are not being allowed to connect outside their local network.

Is this a configurable property of SGE?

amitmurthy commented on July 2, 2024

Or, more likely, a firewall on your localhost is not allowing incoming connections from SGE nodes.

bjarthur commented on July 2, 2024

thanks amit. my main julia process (pid 1) is on the cluster (i ssh in and run julia interactively), as are all the workers. the sysadmin tells me that there is no firewall between nodes.

i'm testing the tcp connection between workers. after starting julia and adding remote workers netstat reports one established tcp connection for each, with a port number corresponding to what's in the julia-xxx.oxxx.x files. nc -z succeeds going to the worker, but fails if i ssh into the worker and test the socket in the reverse direction.

so my question: should i expect a second tcp socket for the incoming traffic, and the problem is that it is not there? or should this sole socket be bidirectional?

it might be relevant that each node in this cluster has two NICs, one facing out to the world, the other facing towards the rest of the nodes in the cluster. julia correctly finds the latter ip addr.

here is a transcript of my test session:

julia> using ClusterManagers

julia> addprocs(1, cman=SGEManager())
job id is 6447451, waiting for job to start ......................................................
1-element Array{Any,1}:
2

julia>
[1]+ Stopped /home/arthurb/src/juliac/julia
[arthurb@h01u14 ~]$ cat julia-18256.o6447451.1
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
julia_worker:9009#172.38.104.21
[arthurb@h01u14 ~]$ hostname --ip-address
172.38.101.24
[arthurb@h01u14 ~]$ netstat -an | grep 172.38.104.21
tcp 0 0 172.38.101.24:40886 172.38.104.21:9009 ESTABLISHED
[arthurb@h01u14 ~]$ nc -z 172.38.104.21 9009; echo $?
Connection to 172.38.104.21 9009 port [tcp/pichat] succeeded!
0
[arthurb@h01u14 ~]$ ssh 172.38.104.21
arthurb@172.38.104.21's password:
[arthurb@h04u11 ~]$ netstat -an | grep 172.38.101.24
tcp 0 0 172.38.104.21:22 172.38.101.24:33349 ESTABLISHED
tcp 0 0 172.38.104.21:9009 172.38.101.24:40886 ESTABLISHED
[arthurb@h04u11 ~]$ nc -z 172.38.101.24 40886; echo $?
1
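On the question of a second socket: a TCP connection is full-duplex, so a single established connection carries traffic in both directions. Port 40886 in the transcript is the ephemeral client-side port of that connection. Nothing listens on it, so a failing `nc -z` against it is expected and does not by itself indicate a problem. A minimal sketch using the `Sockets` standard library of current Julia releases (the loopback address and the variable names are assumptions for illustration, not part of the session above):

```julia
using Sockets

# Listen on a free loopback port, starting the search at 9009.
port, server = listenany(ip"127.0.0.1", 9009)

accept_task = @async accept(server)      # server side waits for one client
client = connect(ip"127.0.0.1", port)    # client side dials the listening port
conn = fetch(accept_task)

# One connection, traffic in both directions.
println(client, "ping")
from_client = readline(conn)
println(conn, "pong")
from_server = readline(client)

# The client's local (ephemeral) port has no listener behind it,
# so a fresh connection attempt to it is refused.
_, client_port = getsockname(client)
refused = try
    close(connect(ip"127.0.0.1", client_port))
    false
catch err
    err isa Base.IOError
end

close(client); close(conn); close(server)
```

Here `from_client` and `from_server` hold the two messages exchanged over the one connection, and `refused` records whether dialing the ephemeral port failed.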

amitmurthy commented on July 2, 2024

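Thanks. For workers added locally with addprocs(n), the main process stores the loopback address, hence the problem: SGE workers launched afterwards are told to connect to an address that is only valid on the main process's own host. I have opened an issue here - JuliaLang/julia#5995.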
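A rough sketch of why the order matters, under the assumption that local workers are recorded as 127.0.0.1. The helper `is_loopback` and the local port numbers are made up for illustration; only the SGE address comes from the transcript above.

```julia
# Hypothetical helper: does this "host:port" string name a loopback address?
is_loopback(addr::AbstractString) =
    startswith(addr, "127.") || startswith(addr, "localhost:")

# Case 1: addprocs(2) first, then an SGE worker.
# The SGE worker is handed the local workers' recorded addresses to dial.
handed_to_sge_worker = ["127.0.0.1:9010", "127.0.0.1:9011"]   # ports made up
unreachable_from_sge = filter(is_loopback, handed_to_sge_worker)

# Case 2: SGE worker first, then addprocs(2).
# The local workers are handed the SGE worker's routable address.
handed_to_local_worker = ["172.38.104.21:9009"]
unreachable_from_local = filter(is_loopback, handed_to_local_worker)

println("local first: ", length(unreachable_from_sge), " unreachable address(es)")
println("SGE first:   ", length(unreachable_from_local), " unreachable address(es)")
```

On the SGE node, 127.0.0.1 refers to that node itself, so the addresses in the first case cannot lead back to the local workers, while the second case only involves an address that both sides can route to.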

nlhepler commented on July 2, 2024

Thanks for fixing this upstream, Amit! I take it the issue is resolved, now?

amitmurthy commented on July 2, 2024

The fix upstream has not yet been merged. But this can be closed here since it is not an issue with ClusterManagers.jl per se.

bjarthur commented on July 2, 2024

fixed here JuliaLang/julia#6030
