Comments (12)

davidparks21 commented on July 21, 2024

OK, so I've written a new TorqueManager that adds some features and significantly improves the speed of launching workers to the cluster. It's currently in my own fork; I've finished coding it and am now exercising it in a project I'm working on to QA it.

https://github.com/davidparks21/ClusterManagers.jl/blob/master/src/torque.jl

Here are the salient points:

  • It supports specifying nodes and processes per node. The PBSManager only supports specifying ncpus, which doesn't allow you to request a full node and all processors on it (this was the key feature I was missing).
  • I've used network requests, which significantly speeds up deployment. In my tests the workers deploy in a few seconds instead of the 10 to 30 seconds taken by PBSManager, which uses STDOUT redirection and the file system for the worker/master handshake. The network requests from the workers rely on bash's network redirection: julia --worker >> /dev/tcp/master_ip/master_port. The only external dependency is that the bash shell is installed and on the path, which feels quite reasonable.
  • I added support for a number of other parameters: -N (job name), -l (additional resources, so long as they work with the -t array_list parameter), and -p (priority). The other qsub parameters that PBSManager supports, -v/-V and -q, are all included.
  • It still uses qsub's array_list (just like PBSManager) to start Julia on the nodes, which limits some of the qsub options it can support, but it's a royal pain to do without qsub's -t array_list functionality.
  • All parameters are defined as optional in the TorqueManager constructor, with suitable defaults. Parameters are not passed as part of addprocs(...); I noticed that PBSManager takes parameters from addprocs, but that approach felt odd to me, so if there's disagreement I'm open to discussing it.
  • I didn't attempt to support other qsub environments, I only have Torque to work with here and have only set it up for that environment. It should be trivial to extend to other qsub environments. The command line parameters get messy but I've kept them organized with future editing in mind.
  • It deploys ppn (processes per node) Julia processes on each node using a simple bash while loop and coprocesses. The only dependency is the bash shell, nothing else (both the coprocs and the network requests are handled within bash).
  • stdout writes on the workers show up on the master immediately, rather than waiting on the filesystem IO redirect, which has a significant delay in PBSManager.
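To make the launcher mechanism concrete, here is a rough sketch of the per-node bash loop described above. Everything here is a stand-in: PPN is a placeholder for the value the manager substitutes into the qsub script, and echo stands in for the real worker command, which would be something like coproc julia --worker > /dev/tcp/"$MASTER_IP"/"$MASTER_PORT".

```shell
# Hypothetical sketch of the per-node launcher. PPN stands in for the
# processes-per-node count the manager writes into the qsub script.
PPN=4

i=0
while [ "$i" -lt "$PPN" ]; do
    # Real script: one julia coprocess per slot, stdout redirected over
    # bash's built-in /dev/tcp support back to the master.
    echo "launching worker $i" &
    i=$((i + 1))
done
wait   # keep the qsub job alive until all workers exit
```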

If it seems useful I'll be happy to document it after I've finished testing and offer it to this project. And let me know if there are any questions/concerns about it.

Here are a few usage examples:

#Typical usage: hyper queue, 10 nodes w/ 16 processes per node
addprocs( TorqueManager( queue="hyper", nodes=10, ppn=16, job_name="My Job" ))

#Default, launches to default queue, 1 node, 1 process, job name is "JuliaWorker"
addprocs( TorqueManager() )

#All supported parameters demoed
addprocs( TorqueManager( queue="normal", nodes=10, ppn=16, l="mem=200m", job_name="My Job",  env=:ALL, priority=100, other_qsub_params="-u runasuser" ))

David

from clustermanagers.jl.

amitmurthy commented on July 21, 2024

I think addprocs should accept a generic manager_args keyword parameter whose value is appended as-is to the qsub command (and to the commands of the other cluster launchers).
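If I read the suggestion right, usage might look something like this. This is purely a sketch: manager_args does not exist yet, and manager stands for any cluster manager instance.

```julia
# Hypothetical sketch only: whatever string the user passes in manager_args
# would be appended verbatim to the qsub (or other launcher) command line.
addprocs(manager, manager_args="-A myaccount -l walltime=1:00:00")
```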


davidparks21 commented on July 21, 2024

Yup, I hacked that solution in locally to get a few quick POC tests I needed working. I did run into a minor hiccup: the parameters that ClusterManagers currently passes to qsub conflicted with my use of ppn=16 (specifically the number of processes).

For the moment I only needed to force 16 processes onto a single host, which wasn't possible as-is, but we should be able to define any combination of hosts/nodes in PBS style. I guess the easiest way will be to simply drop the existing number-of-processes argument when the user defines their own directives. We can depend on the user to define processes/nodes/etc. themselves, with a suitable default if it's omitted.

I'll see about getting that working in the coming weeks and post back here about it then. The hack wasn't too hard, so a clean solution shouldn't be too hard either.


davidparks21 commented on July 21, 2024

Ah, I'm seeing the crux of the problem now. Using -l nodes=2:ppn=16 only spawns 2 processes from qsub, and there's no environment variable passed in to indicate that 16 julia processes should be started on each node.

The way it's coded now, using -t 1-$num_processes, generates a unique output file per process that the manager reads to get the julia host/port information.

This approach fundamentally excludes the ability to specify processes per node. To enable it, things need to be rewritten from the ground up: the manager will need to know the number of processes per node so it can spawn the right number of julia processes, and it will need to communicate the worker host/port information back to the master julia process in a different way than is done currently.


kshyatt commented on July 21, 2024

@davidparks21 Your fork looks really interesting! Could you open a PR (I use Torque a lot too)? Also, are you planning to support arguments like -A (account #)/mixed node types (stuff like "I want nodes 20, 22, and any three GPU nodes")?


davidparks21 commented on July 21, 2024

I don't use -A on my local cluster, but I could add it easily. Any other suggestions for parameters I should include?

I tried to get mixed node types to work, but that didn't seem to play nicely with the -t array_list option, which is important because it's the mechanism I rely on to start one bash process per node (PBSManager uses the same mechanism; there are other options, but they increase the complexity significantly). That bash process simply loops over ppn (processes per node), launching julia coprocesses (background processes). If you can find a way to make the -t parameter, mixed node types, and processes per node (-l ppn=?) work together, I very much want to know about it; I gave it a good effort.

Technically, as it stands, it supports adding custom parameters via -l, such as walltime, but they have to play nicely with the -t parameter or it'll break.

I'll issue the pull request in the coming week. I want to test it locally here to make sure I've shaken out any last bugs.

David

p.s. Sorry for the late reply, I was out at a workshop all week.


amitmurthy commented on July 21, 2024

Keyword args should be able to override hardcoded options. Maybe the current default args could be exposed as a keyword arg, default_args=..., which the user can completely override. That would help handle any conflicting args.
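One way this could look, as a sketch only: none of these names exist in the package, and it assumes the qsub options are kept in a Dict keyed by flag.

```julia
# Sketch: user-supplied qsub options override the manager's defaults.
default_args = Dict("-q" => "default", "-N" => "JuliaWorker")
user_args    = Dict("-q" => "hyper")            # conflicting flag from the user
qsub_args    = merge(default_args, user_args)   # user's value wins: -q hyper
```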


davidparks21 commented on July 21, 2024

@amitmurthy are you referring to the keyword arguments of addprocs? I'm not sure I'm clear about what overrides you're referring to.

If you just mean the default arguments being taken by TorqueManager, then yes, all arguments are optional and have a reasonable default assigned.

If you're talking about addprocs, then I have questions about what addprocs might expect in terms of arguments that need to be overridden in a cluster manager. I saw that all the other cluster managers pass arguments via addprocs rather than taking them in the ClusterManager constructor, but that approach didn't seem to have any obvious benefit.

Thanks,
Dave


davidparks21 commented on July 21, 2024

@kshyatt this is an example of the problem I've been having getting torque to play nicely with more specific node assignments like you suggested:

$ echo 'echo $(hostname)' | qsub -q gpu -l nodes=gpu-5:ppn=12+gpu-8:ppn=12 -t 1-2
53580[]
$ cat *53580*
gpu-5.local
gpu-5.local

I've tried a dozen different qsub combinations trying to specify nodes, nothing seems to work when I use the -t (array_list) option to execute the command on each node.

The documentation isn't being particularly cooperative either:
http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/2-jobs/requestingRes.htm

If you have more experience and know of a good way to get torque to run a command per node let me know and I'll see about incorporating it.

I just noticed that if I only request 1 process per node (ppn) and 2 nodes, qsub will execute 2 instances of the command on only 1 of the nodes. That's an issue.

$ echo 'echo $(hostname)' | qsub -q gpu -l nodes=2:ppn=1 -t 1-2
53582[]
$ cat *53582*
gpu-8.local
gpu-8.local

My other thought was to use pbsdsh to execute the commands per node, but the documentation didn't suggest it would be much easier to work with than qsub's array_list parameter, though I haven't tried it yet.

http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/commands/pbsdsh.htm
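For reference, a minimal sketch of what the pbsdsh route might look like, untested here and based only on the Torque docs linked above (-u asks pbsdsh to run the command once per allocated node; launch_workers.sh is a hypothetical per-node launcher script):

```shell
#PBS -l nodes=2:ppn=16
# Run the hypothetical per-node launcher once on each allocated node,
# instead of relying on qsub's -t array_list mechanism.
pbsdsh -u bash "$PBS_O_WORKDIR/launch_workers.sh"
```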


kshyatt commented on July 21, 2024

Hello,

Was this ever opened as a PR? Could it be? It would solve a problem I ran into just this morning and yelled at innocent postdocs about (who then graciously helped me fix it).


davidparks21 commented on July 21, 2024

Hi @kshyatt, I'm sorry to say that I never resolved the issues I mentioned on June 27th, and I ended up switching from Mocha to TensorFlow as my primary platform, so my focus was drawn away from this. It was quite tricky to get the parameters of qsub to function nicely in a multiprocessing environment; qsub seems to be engineered with OpenMP in mind. I suppose a multiprocess manager like this won't be able to implement all PBS directives as well as I had hoped.

My code still has some bugs that I found in testing, which is why I never submitted it.


affans commented on July 21, 2024

I am getting an error using your package that I can't seem to figure out. The error is:

Received network input:    String
got a worker: 
Error launching workers, caught exception: [ErrorException("type Void has no field captures")]
ErrorException("type Void has no field captures")
0-element Array{Int64,1}

It turns out that println("got a worker: ", worker_stdout) prints nothing: the variable worker_stdout is empty, so performing a regex match on it returns nothing.
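For what it's worth, the crash itself can be guarded against. In Julia, match returns nothing when the pattern fails to match, and accessing .captures on that result throws exactly the "type Void has no field captures" error shown above (on Julia 0.6, where Nothing was called Void). A sketch of the guard, with an illustrative regex rather than the exact one in torque.jl:

```julia
worker_stdout = ""   # empty, as in the report above

# `match` returns `nothing` on failure; check before touching `.captures`.
m = match(r"julia_worker:(\d+)#([0-9.]+)", worker_stdout)
if m === nothing
    warn("empty or malformed worker handshake: ", repr(worker_stdout))
else
    port, ip = m.captures
end
```

The underlying bug (why worker_stdout is empty at all) would still need to be found, but the guard turns a crash into a diagnosable warning.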

