Comments (7)

amitmurthy commented on July 2, 2024

I am not familiar with SGE, but I can take a guess at what is happening.

In Julia, all workers are connected to each other. The way this works is that after the main process (pid 1) launches a worker, the worker writes the ip:port it is listening on to its stdout. pid 1 connects to this address and then sends the new worker a list of host:port addresses (of existing workers) it in turn should connect to. The later workers always initiate a connection to the previously launched workers.
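
As a rough illustration of the handshake order described above, the following Python sketch simulates who dials whom. It is not Julia's implementation: `connection_plan` is a hypothetical helper, and it assumes workers are launched one at a time and numbered from pid 2.

```python
def connection_plan(num_workers):
    """Return (initiator_pid, target_pid) pairs in the order they would be made."""
    edges = []
    launched = []  # pids of workers that are already running
    for pid in range(2, num_workers + 2):
        # pid 1 connects to the address the new worker printed on stdout.
        edges.append((1, pid))
        # The new worker then dials every previously launched worker.
        for earlier in launched:
            edges.append((pid, earlier))
        launched.append(pid)
    return edges


if __name__ == "__main__":
    for initiator, target in connection_plan(3):
        print(f"pid {initiator} -> pid {target}")
```

With three workers this yields six connections, one per pair of processes, and every worker-to-worker link is initiated by the later worker. That is why each node must be able to reach the addresses advertised by the processes started before it.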

What seems to be happening is that while workers on localhost can initiate connections to workers on SGE nodes, the reverse is not true, i.e., workers on SGE nodes are not being allowed to connect outside their local network.

Is this a configurable property of SGE?

from clustermanagers.jl.

amitmurthy commented on July 2, 2024

Or, more likely, a firewall on your localhost is not allowing incoming connections from SGE nodes.

bjarthur commented on July 2, 2024

thanks amit. my main julia process (pid 1) is on the cluster (i ssh in and run julia interactively), as are all the workers. the sysadmin tells me that there is no firewall between nodes.

i'm testing the tcp connection between workers. after starting julia and adding remote workers netstat reports one established tcp connection for each, with a port number corresponding to what's in the julia-xxx.oxxx.x files. nc -z succeeds going to the worker, but fails if i ssh into the worker and test the socket in the reverse direction.

so my question: should i expect a second tcp socket for the incoming traffic, and the problem is that it is not there? or should this sole socket be bidirectional?

it might be relevant that each node in this cluster has two NICs, one facing out to the world, the other facing towards the rest of the nodes in the cluster. julia correctly finds the latter ip addr.

here is a transcript of my test session:

julia> using ClusterManagers

julia> addprocs(1, cman=SGEManager())
job id is 6447451, waiting for job to start ......................................................
1-element Array{Any,1}:
2

julia>
[1]+ Stopped /home/arthurb/src/juliac/julia
[arthurb@h01u14 ~]$ cat julia-18256.o6447451.1
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
julia_worker:9009#172.38.104.21
[arthurb@h01u14 ~]$ hostname --ip-address
172.38.101.24
[arthurb@h01u14 ~]$ netstat -an | grep 172.38.104.21
tcp 0 0 172.38.101.24:40886 172.38.104.21:9009 ESTABLISHED
[arthurb@h01u14 ~]$ nc -z 172.38.104.21 9009; echo $?
Connection to 172.38.104.21 9009 port [tcp/pichat] succeeded!
0
[arthurb@h01u14 ~]$ ssh 172.38.104.21
arthurb@172.38.104.21's password:
[arthurb@h04u11 ~]$ netstat -an | grep 172.38.101.24
tcp 0 0 172.38.104.21:22 172.38.101.24:33349 ESTABLISHED
tcp 0 0 172.38.104.21:9009 172.38.101.24:40886 ESTABLISHED
[arthurb@h04u11 ~]$ nc -z 172.38.101.24 40886; echo $?
1
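
One general TCP point may help in reading the transcript: a single established connection carries traffic in both directions, and the port on the connecting side (40886 above) is an ephemeral source port with no listener behind it, so a fresh connection attempt to it is expected to be refused. The Python sketch below reproduces this on the loopback interface. It only illustrates TCP behaviour and assumes nothing about Julia's internals; `tcp_direction_demo` and `recv_exact` are names made up for the example.

```python
import socket


def recv_exact(sock, n):
    """Read exactly n bytes from sock, or raise if the peer closes early."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        data += chunk
    return data


def tcp_direction_demo():
    # The listening socket stands in for the worker; the OS picks a free port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        worker_addr = listener.getsockname()

        # The connecting socket stands in for pid 1 and gets an ephemeral port.
        with socket.create_connection(worker_addr, timeout=5) as master:
            conn, _ = listener.accept()
            with conn:
                conn.settimeout(5)
                master_port = master.getsockname()[1]

                # One connection, traffic in both directions.
                master.sendall(b"ping")
                to_worker = recv_exact(conn, 4)
                conn.sendall(b"pong")
                to_master = recv_exact(master, 4)

                # A new connection to the ephemeral port finds no listener.
                with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
                    probe.settimeout(5)
                    probe.bind(("127.0.0.1", 0))  # force a distinct source port
                    try:
                        probe.connect(("127.0.0.1", master_port))
                        reverse_ok = True
                    except OSError:
                        reverse_ok = False
    return to_worker, to_master, reverse_ok


if __name__ == "__main__":
    print(tcp_direction_demo())
```

So a failing `nc -z` against the ephemeral port does not by itself indicate a blocked network path. The meaningful reverse test is whether the worker's host can open a connection to a port on which the other process is actually listening.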

amitmurthy commented on July 2, 2024

Thanks. The problem is that, for an addprocs on localhost, the main process stores the address as the loopback address. I have opened an issue here: JuliaLang/julia#5995.
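
To see why a stored loopback address would break things, note that 127.0.0.1 always refers to the machine doing the connecting, so a process on another node that is handed it would dial its own machine instead of the intended peer. A minimal Python sketch of a sanity check follows; `usable_from_other_hosts` is a hypothetical helper, not part of Julia or ClusterManagers.jl.

```python
import ipaddress


def usable_from_other_hosts(addr):
    """Return False for addresses that only make sense on the local machine."""
    ip = ipaddress.ip_address(addr)
    # Loopback (127.0.0.0/8, ::1) and the unspecified address (0.0.0.0, ::)
    # cannot be used by a remote peer to reach this host.
    return not (ip.is_loopback or ip.is_unspecified)


if __name__ == "__main__":
    for candidate in ["127.0.0.1", "172.38.101.24"]:
        print(candidate, usable_from_other_hosts(candidate))
```

A check of this kind would only flag the symptom; the actual fix is to advertise an address on the interface that the other nodes can route to.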

nlhepler commented on July 2, 2024

Thanks for fixing this upstream, Amit! I take it the issue is now resolved?

amitmurthy commented on July 2, 2024

The fix upstream has not yet been merged. But this can be closed here since it is not an issue with ClusterManagers.jl per se.

bjarthur commented on July 2, 2024

fixed here JuliaLang/julia#6030
