clusterduck

clusterduck is a hydra launcher plugin for running jobs in batches on a SLURM cluster. It is intended for small tasks on clusters where jobs have exclusive access to a node, such that submitting a single task to a node would be wasteful.

Installation

Install clusterduck with:

pip install .

Developers should note that Hydra plugins are not compatible with the new PEP 660-style editable installs. To perform an editable install, either use compatibility mode:

pip install -e . --config-settings editable_mode=compat

or use strict editable mode:

pip install -e . --config-settings editable_mode=strict

Be aware that strict mode installs do not expose new files created in the project until the installation is performed again.

Examples

The example script requires a few additional dependencies. Install with:

pip install ".[examples]"

To run the example script locally, e.g. looping over both model types twice each, use:

python example/train.py --multirun model=convnet,transformer +iteration="range(2)"

To run the example script with the submitit backend but locally without a cluster, specify the platform like this:

python example/train.py --multirun model=convnet,transformer +iteration="range(2)" +platform=slurm_debug

To run the example script on the HoreKa cluster, use:

python example/train.py --multirun model=convnet,transformer +iteration="range(2)" +platform=horeka

Configuration Options

This plugin is heavily inspired by the hydra-submitit-launcher plugin and provides all of that plugin's parameters. See its documentation for details about those parameters.

Both plugins rely on submitit for the real heavy lifting. See the submitit documentation for more information.

Additional Parameters

Note on terminology: we refer to a hydra job, i.e. one execution of the hydra main function with a set of overrides, as a run, to differentiate it from jobs and tasks as defined by SLURM.

This plugin adds the following parameters:

  • parallel_runs_per_node:
    The number of parallel executions per node, i.e. the number of experiments that will run simultaneously in a single SLURM job. This will depend on the available resources in a node.
  • total_runs_per_node:
    The total number of executions per node, i.e. the number of experiments that will run in a single SLURM job. This will depend on the duration of a run, the parallel_runs_per_node setting, and the time limit you set for the job in SLURM. If not specified, all executions will be run in a single job. However, only parallel_runs_per_node of these executions will be running at any given time.
  • wait_for_completion:
    If set to true, the launcher will keep running in your login node until all SLURM jobs have completed before exiting. Otherwise it will submit the SLURM jobs into the queue and then exit.
  • resources_config:
    Any resources that must be divided up among the parallel runs within a SLURM job (see the sketch after this list for the general idea). The following configurable resources are currently available:
    • cpu: Divides the runs over the available CPUs.
      • Optional argument cpus specifies the CPU ids available to the job. Leave blank to auto-detect.
    • cuda: Divides the runs over the available GPUs.
      • Optional argument gpus specifies the GPU ids available to the job. Leave blank to auto-detect.
    • stagger: Delays the start of each run by the specified number of seconds. This can be useful if you want to avoid starting all runs at the same time, e.g. to avoid overloading the file system.
      • Argument delay specifies the delay in seconds.
  • verbose:
    If set to true, additional debug information will be printed to the SLURM job log (related to scheduling runs within a job and allocating resources) and to each hydra run log (related to setting up the resources for the run). If you are having difficulties with the plugin, setting this to true might help you understand what is going on.
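
Conceptually, dividing the parallel runs over a resource such as GPUs means giving each run a subset of the available device ids, for example via the CUDA_VISIBLE_DEVICES environment variable. The sketch below only illustrates that general idea and is not the plugin's actual implementation; the helper function and commands are hypothetical.

# Illustrative sketch of round-robin GPU assignment for parallel runs.
# NOT clusterduck's actual implementation, just the general idea.
import os
import subprocess


def launch_runs(commands: list[list[str]], gpus: list[int], parallel_runs: int) -> None:
    """Launch at most `parallel_runs` commands at a time, each pinned to one GPU."""
    running: list[subprocess.Popen] = []
    for i, cmd in enumerate(commands):
        if len(running) >= parallel_runs:
            running.pop(0).wait()  # crude: wait for the oldest run to finish first
        env = dict(os.environ)
        # Restrict this run to a single GPU, chosen round-robin from the pool.
        env["CUDA_VISIBLE_DEVICES"] = str(gpus[i % len(gpus)])
        running.append(subprocess.Popen(cmd, env=env))
    for proc in running:
        proc.wait()

# Hypothetical usage: 8 runs, at most 4 at a time, on GPUs 0-3.
# launch_runs([["python", "train.py", f"+iteration={i}"] for i in range(8)],
#             gpus=[0, 1, 2, 3], parallel_runs=4)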

Here is an example of a hydra/launcher config that uses all of the above options:

hydra:
  launcher:
    # launcher/cluster specific options
    timeout_min: 5
    partition: dev_accelerated
    gres: gpu:4
    
    # clusterduck specific options
    parallel_runs_per_node: 4
    total_runs_per_node: 8
    wait_for_completion: False
    resources_config:
      cpu:
      cuda:
        gpus: [0, 1, 2, 3]  # optional, will auto-detect if left blank
      stagger:
        delay: 5

For a working example with multiple configurations, see the example folder.

Development

PyCUDA is a helpful tool for working with CUDA devices outside the context of a machine learning library like PyTorch. We recommend installing it with conda:

conda install pycuda
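
To verify that PyCUDA can see the GPUs on a node (the kind of device information that auto-detection relies on), a quick check along these lines can be used; this snippet is only a usage sketch and not part of the plugin:

# Minimal PyCUDA sketch: list the CUDA devices visible to the current process.
import pycuda.driver as cuda

cuda.init()
for i in range(cuda.Device.count()):
    device = cuda.Device(i)
    print(f"GPU {i}: {device.name()}, {device.total_memory() // 1024**2} MiB")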

Install additional requirements for development using:

pip install ".[all]"
