Comments (7)
I am not familiar with SGE, but I can take a guess at what is happening.
In Julia, all workers are connected to each other. The way this works is that after the main process (pid 1) launches a worker, the worker writes the ip:port it is listening on to its stdout. pid 1 connects to this address and then sends the new worker a list of host:port addresses (of existing workers) it in turn should connect to. The later workers always initiate a connection to the previously launched workers.
What seems to be happening is that while workers on localhost can initiate connections to workers on SGE nodes, the reverse is not true, i.e., workers on SGE nodes are not being allowed to connect outside their local network.
Is this a configurable property of SGE?
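The connection order described above can be sketched in a few lines. This is only a toy model in Python, not Julia's actual implementation; the function name `connection_plan` and the pid list are made up for illustration. It lists who initiates each connection, assuming the main process is pid 1 and workers are given in launch order.

```python
def connection_plan(workers):
    """Return (initiator, target) pairs in the order connections are made.

    `workers` is a list of worker pids in launch order (hypothetical input);
    pid 1 is the main process.
    """
    plan = []
    launched = []
    for pid in workers:
        # pid 1 connects to the address the new worker printed on stdout.
        plan.append((1, pid))
        # The new worker then initiates a connection to every earlier worker.
        for earlier in launched:
            plan.append((pid, earlier))
        launched.append(pid)
    return plan


if __name__ == "__main__":
    for initiator, target in connection_plan([2, 3, 4]):
        print(f"pid {initiator} -> pid {target}")
```

If a later worker cannot reach an earlier one (for example across a network boundary), the pairs where that later worker is the initiator are the ones that would fail.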
from clustermanagers.jl.
Or, more likely, a firewall on your localhost is not allowing incoming connections from SGE nodes.
thanks amit. my main julia process (pid 1) is on the cluster (i ssh in and run julia interactively), as are all the workers. the sysadmin tells me that there is no firewall between nodes.
i'm testing the tcp connection between workers. after starting julia and adding remote workers netstat reports one established tcp connection for each, with a port number corresponding to what's in the julia-xxx.oxxx.x files. nc -z succeeds going to the worker, but fails if i ssh into the worker and test the socket in the reverse direction.
so my question: should i expect a second tcp socket for the incoming traffic, and the problem is that it is not there? or should this sole socket be bidirectional?
it might be relevant that each node in this cluster has two NICs, one facing out to the world, the other facing towards the rest of the nodes in the cluster. julia correctly finds the latter ip addr.
here is a transcript of my test session:
julia> using ClusterManagers
julia> addprocs(1, cman=SGEManager())
job id is 6447451, waiting for job to start ......................................................
1-element Array{Any,1}:
2
julia>
[1]+ Stopped /home/arthurb/src/juliac/julia
[arthurb@h01u14 ~]$ cat julia-18256.o6447451.1
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
julia_worker:9009#172.38.104.21
[arthurb@h01u14 ~]$ hostname --ip-address
172.38.101.24
[arthurb@h01u14 ~]$ netstat -an | grep 172.38.104.21
tcp 0 0 172.38.101.24:40886 172.38.104.21:9009 ESTABLISHED
[arthurb@h01u14 ~]$ nc -z 172.38.104.21 9009; echo $?
Connection to 172.38.104.21 9009 port [tcp/pichat] succeeded!
0
[arthurb@h01u14 ~]$ ssh 172.38.104.21
arthurb@172.38.104.21's password:
[arthurb@h04u11 ~]$ netstat -an | grep 172.38.101.24
tcp 0 0 172.38.104.21:22 172.38.101.24:33349 ESTABLISHED
tcp 0 0 172.38.104.21:9009 172.38.101.24:40886 ESTABLISHED
[arthurb@h04u11 ~]$ nc -z 172.38.101.24 40886; echo $?
1
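On the question of a second socket: a TCP connection is full-duplex, so the single ESTABLISHED connection carries traffic both ways. Port 40886 in the transcript is the ephemeral source port of that outgoing connection, with nothing listening on it, so a fresh `nc -z` probe to it is expected to fail even when the link is healthy. The sketch below illustrates this in Python on the loopback interface only; it is not Julia's code, and `demo` is a hypothetical name.

```python
import socket


def demo():
    # "Worker" side: listen on a port (the OS picks a free one here).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    listen_port = listener.getsockname()[1]

    # "Main process" side: connect out; the OS assigns an ephemeral source port.
    client = socket.create_connection(("127.0.0.1", listen_port))
    ephemeral_port = client.getsockname()[1]
    server_side, _ = listener.accept()

    # The one established connection carries data in both directions.
    client.sendall(b"ping")
    to_worker = server_side.recv(4)
    server_side.sendall(b"pong")
    to_main = client.recv(4)

    # A new connection to the ephemeral port should be refused,
    # because no socket is listening there.
    try:
        probe = socket.create_connection(("127.0.0.1", ephemeral_port), timeout=2)
        probe.close()
        probe_ok = True
    except OSError:
        probe_ok = False

    for s in (client, server_side, listener):
        s.close()
    return to_worker, to_main, probe_ok


if __name__ == "__main__":
    print(demo())
```

So the failed reverse `nc -z` test above does not by itself show that the worker cannot talk back to pid 1 over the existing connection.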
Thanks. The main process is storing its address from the localhost addprocs as the loopback address, which is the cause of the problem. I have opened an issue here: JuliaLang/julia#5995.
Thanks for fixing this upstream, Amit! I take it the issue is now resolved?
The fix upstream has not yet been merged. But this can be closed here since it is not an issue with ClusterManagers.jl per se.
Fixed here: JuliaLang/julia#6030