Ok, so I've written a new TorqueManager that adds some features and significantly improves the speed of launching workers to the cluster. It's currently in my own fork. I've finished coding it and am exercising it in a project I'm working on now to QA it.
https://github.com/davidparks21/ClusterManagers.jl/blob/master/src/torque.jl
Here are the salient points:
- It supports specifying nodes and processes per node. The PBSManager only supports specifying ncpus, which doesn't allow you to request a full node and all processors on it (this was the key feature I was missing).
- I've used network requests, which significantly speeds up deployment. In my tests the workers deploy in a few seconds, instead of 10 or 30 seconds with the PBSManager, which uses STDOUT redirection and the file system for the worker/master handshake. The network requests from the workers rely on bash's network redirection:
julia --worker >> /dev/tcp/master_ip/master_port
The only external dependency is that bash is installed and on the path, which feels quite reasonable.
- I added support for a number of other parameters: -N (job name), -l (additional resources, as long as they work with the array_list parameter), and -p (priority). The other qsub parameters that PBSManager supports (-v/-V and -q) are all included.
- It still uses qsub's array_list (just like PBSManager) to start Julia on the nodes. That limits some of the qsub options it can support, but it's a royal pain to do without qsub's -t array_list functionality.
- All parameters are optional in the TorqueManager constructor, with suitable defaults. Parameters are not passed as part of addprocs(...). I noticed that PBSManager takes parameters from addprocs, but that approach felt odd to me, so if there's disagreement I'm open to discussing it.
- I didn't attempt to support other qsub environments; I only have Torque to work with here and have only set it up for that environment. It should be trivial to extend to other qsub environments. The command-line parameters get messy, but I've kept them organized with future editing in mind.
- It deploys ppn (processes per node) Julia processes on each node, using a simple bash while loop and coprocesses. There's a dependency on bash and nothing else (both the coprocs and the network requests run inside bash).
- stdout writes on the workers show up on the master immediately, rather than waiting on the filesystem IO redirection that introduces a significant delay in PBSManager.
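As a rough illustration of that launcher mechanism, here is a minimal bash sketch. The names master_ip, master_port, and the ppn value are placeholders, and a stand-in echo replaces the actual julia coprocess so the loop is self-contained:

```shell
#!/usr/bin/env bash
# Sketch of the per-node launcher loop (names are illustrative).
# On a real compute node each iteration would instead run something like:
#   coproc julia --worker >> /dev/tcp/$master_ip/$master_port
ppn=4   # processes per node, normally supplied by the manager
i=0
while [ "$i" -lt "$ppn" ]; do
    ( echo "worker $i up" ) &   # stand-in for the julia coprocess
    i=$((i + 1))
done
wait                            # keep the qsub job alive until the workers exit
echo "launched $ppn workers"
```

The background subshells mirror how the coprocesses keep all ppn workers running concurrently under a single bash process per node.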
If it seems useful, I'll be happy to document it after I've finished testing and offer it to this project. Let me know if there are any questions or concerns about it.
Here are a few usage examples:
#Typical usage: hyper queue, 10 nodes w/ 16 processes per node
addprocs( TorqueManager( queue="hyper", nodes=10, ppn=16, job_name="My Job" ))
#Default, launches to default queue, 1 node, 1 process, job name is "JuliaWorker"
addprocs( TorqueManager() )
#All supported parameters demoed
addprocs( TorqueManager( queue="normal", nodes=10, ppn=16, l="mem=200m", job_name="My Job", env=:ALL, priority=100, other_qsub_params="-u runasuser" ))
David
from clustermanagers.jl.
I think a generic manager_args kw parameter to addprocs should be added, passed as-is to the qsub (and other cluster launchers) command.
Yup, I hacked in that solution locally and got a few quick POC tests working that I needed. I did run into a minor hiccup: the parameters that ClusterManagers currently passes to qsub conflicted with my use of ppn=16 (specifically the number of processes).
For the moment I only needed to force 16 processes onto a single host, which wasn't possible as-is, but we should be able to define any combination of hosts/nodes in PBS style. I guess the easiest way will be simply to drop the existing #-processes argument whenever the user defines their own directives. We can depend on the user to define processes/nodes/etc. themselves, with a suitable default if it's omitted.
I'll see about getting that working in the coming weeks and post back here about it then. The hack wasn't too hard, so a clean solution shouldn't be too hard either.
Ah, I'm seeing the crux of the problem now. Using -l nodes=2:ppn=16 only spawns 2 processes from qsub, and there's no environment variable passed in to specify that 16 Julia processes should be started.
The way it's coded now, using -t 1-$num_processes generates a unique output file per process, which the manager reads to get the Julia host/port information.
This approach fundamentally excludes the ability to specify processes per node. To enable this it looks like things need to be re-written from the ground up. The manager will need to specifically know the number of processes per node so it can spawn the right number of julia processes. And it will need to communicate the worker host/port information back to the host julia process in a different way than is being done currently.
@davidparks21 Your fork looks really interesting! Could you open a PR? (I use Torque a lot too.) Also, are you planning to support arguments like -A (account #) or mixed node types (stuff like "I want nodes 20, 22, and any three GPU nodes")?
I don't use -A on my local cluster, but I could add it easily. Any other suggestions on parameters I should include?
I tried to get mixed node types to work, but that didn't seem to play nicely with the -t array_list option, which is important because that's the mechanism I'm relying on to start 1 bash process per node (the same mechanism PBSManager uses; there are other options, but they increase the complexity significantly). That bash process simply loops over the number of ppn (processes per node), launching Julia coprocesses (background processes). If you can find a way to get the -t parameter and mixed node types to work together (along with processes per node, -l ppn=?), I very much want to know about it; I gave it a good effort.
Technically, as it stands, it supports adding custom parameters using -l, such as walltime, but they have to play nicely with the -t parameter or it'll break.
I'll issue the pull request in the coming week. I want to test it locally here to make sure I've shaken out any last bugs.
David
p.s. Sorry for the late reply, I was out at a workshop all week.
Keyword args should be able to override hardcoded options. Maybe the current default args could be exposed as a keyword arg, default_args=..., which the user can completely override. That will help handle any conflicting args.
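One possible shape for that suggestion, as a sketch in Julia. The qsub_flags name and the stock flags shown here are hypothetical, not the package's actual API:

```julia
# Hypothetical sketch: expose the manager's hardcoded qsub flags as a
# keyword argument so the user can replace them wholesale on conflict.
function qsub_flags(np::Int; default_args = `-V -t 1-$np`)
    # Whatever the caller passes as default_args wins completely;
    # otherwise the manager's stock flags are used.
    return default_args
end

qsub_flags(16)                                       # stock flags
qsub_flags(16; default_args = `-l nodes=1:ppn=16`)   # full user override
```

A wholesale override like this sidesteps the question of merging conflicting directives, at the cost of making the user responsible for supplying a complete set.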
@amitmurthy are you referring to the keyword arguments of addprocs? I'm not sure I'm clear about which overrides you mean.
If you just mean the default arguments taken by TorqueManager, then yes, all arguments are optional and have a reasonable default assigned.
If you're talking about addprocs, then I have questions about what addprocs might expect in terms of arguments that need to be overridden in a cluster manager. I saw that all the other cluster managers passed arguments via addprocs rather than just taking arguments in the ClusterManager constructor, but that approach didn't seem to have any benefit that was obvious to me.
Thanks,
Dave
@kshyatt this is an example of the problem I've been having getting torque to play nicely with more specific node assignments like you suggested:
$ echo 'echo $(hostname)' | qsub -q gpu -l nodes=gpu-5:ppn=12+gpu-8:ppn=12 -t 1-2
53580[]
$ cat *53580*
gpu-5.local
gpu-5.local
I've tried a dozen different qsub combinations trying to specify nodes; nothing seems to work when I use the -t (array_list) option to execute the command on each node.
The documentation isn't being particularly cooperative either:
http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/2-jobs/requestingRes.htm
If you have more experience and know of a good way to get torque to run a command per node let me know and I'll see about incorporating it.
I just noticed that if I only request 1 process per node (ppn) and 2 nodes, qsub will execute 2 instances of the command on only 1 of the nodes. That's an issue.
$ echo 'echo $(hostname)' | qsub -q gpu -l nodes=2:ppn=1 -t 1-2
53582[]
$ cat *53582*
gpu-8.local
gpu-8.local
My other thought was to use pbsdsh to execute the commands per node, but its documentation didn't seem like it would be much easier to work with than the array_list parameter from qsub, though I haven't tried yet.
http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/commands/pbsdsh.htm
Hello,
Was this ever opened as a PR? Could it be? It would solve a problem I ran into just this morning and yelled about at innocent postdocs (who then graciously helped me fix it).
Hi @kshyatt, I'm sorry to say that I never resolved the issues I mentioned on June 27th, and I ended up switching from Mocha to TensorFlow as my primary platform, so my focus was drawn away from this. It was quite tricky to get the qsub parameters to function nicely in a multiprocessing environment; qsub seems to be engineered with OpenMP in mind. I suppose a multiprocess manager like this won't be able to implement all PBS directives as well as I had hoped.
My code still has some bugs that I found in testing which is why I never submitted it.
I am getting an error using your package that I can't seem to figure out. The error is:
Received network input: String
got a worker:
Error launching workers, caught exception: [ErrorException("type Void has no field captures")]
ErrorException("type Void has no field captures")
0-element Array{Int64,1}
It turns out that println("got a worker: ", worker_stdout) prints nothing, i.e. the variable worker_stdout is empty, so performing a regex match on it returns nothing.
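A defensive check on the manager side would turn that into a clearer failure. Below is a sketch, assuming a hypothetical worker-announcement pattern (the actual regex in the fork may differ):

```julia
# Hypothetical sketch: guard the match result before touching .captures,
# since match() returns nothing on empty or unexpected worker output.
worker_stdout = ""   # the empty string seen in the report above
m = match(r"julia_worker:(\d+)#([0-9.]+)", worker_stdout)
if m === nothing
    error("worker produced no host/port line; got: ", repr(worker_stdout))
else
    port, ip = m.captures
end
```

With the guard in place, an empty worker_stdout raises a descriptive error instead of the opaque "type Void has no field captures".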