
Pre-Slurm Tutorial

Introduction

Best practices for preparing your environment to run as a Slurm job. The procedure includes pulling, modifying and converting a container from NGC using NVIDIA Enroot. A multi-node Horovod example is included.

We strongly recommend working with a container as your environment and mounting your code from outside the container.

A flowchart of the process is available as an image in the repository.

Episode 1 - Development

Our recommendation is to start developing from an optimized container from NGC. Chances are the required packages are already installed and no modification is needed. If that is not the case, you can modify the container to fit your requirements.

  1. Pull a relevant container from NGC using NVIDIA Enroot.

    Containers for optimized frameworks (PyTorch, TensorFlow, etc.) are available and updated monthly. To find which container fits your desired environment, visit our Optimized Framework Release Notes and search for the relevant container release.

    Pull command:

    enroot import 'docker://nvcr.io#nvidia/<framework>:<tag>'

    E.g., to pull a 22.03 release TensorFlow container run:

    enroot import 'docker://nvcr.io#nvidia/tensorflow:22.03-tf1-py3'

    A container will be pulled and converted to a local squash file.

  2. Create the container under Enroot's data path.

    enroot create --name <environment_name> <squash_file>

    E.g., to create the TensorFlow container run:

    enroot create --name nvidia_tf nvidia+tensorflow+22.03-tf1-py3.sqsh

    To view all created containers run:

    enroot list

  3. Start and work on the container.

    enroot start --root --rw --mount <local_folder>:<container_folder> <environment_name>
    • --root enables root privileges.
    • --rw enables read and write permissions (any changes inside the container will be saved).
    • --mount enables mounting of a local folder (to mount your code and data).

    More configuration options are available in Enroot's start command documentation.

    To exit the container run exit.
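
    For example, starting the TensorFlow container created above with a local code folder mounted, installing an extra package and then leaving could look like this (the $HOME/my_project path and the scikit-learn package are only illustrative):

    # start the container with root privileges, write access and the project folder mounted
    enroot start --root --rw --mount $HOME/my_project:/workspace/my_project nvidia_tf
    # inside the container: install a missing package, verify it and leave
    pip install scikit-learn
    python -c "import sklearn; print(sklearn.__version__)"
    exit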

Episode 2 - Exporting your environment to a squash file

Slurm uses squash files to run jobs. Therefore, your environment should be exported to a (new) squash file, containing all the changes you performed (if any).

  1. Export your current environment to a squash file.

    enroot export --output <squash_file> <environment_name>

    A new squash file will be locally created.

    Note: move the squash file to a location accessible to Slurm.
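
    For example, to export the nvidia_tf environment from Episode 1 and move the resulting squash file to shared storage (the /shared/squash_files path is only illustrative, use a location that exists on your cluster):

    enroot export --output nvidia_tf_custom.sqsh nvidia_tf
    mv nvidia_tf_custom.sqsh /shared/squash_files/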

  2. Optional: remove old squash files and clear Enroot's data path.

    The original, unmodified squash file can be deleted. Additionally, to delete the created container under Enroot's data path run:

    enroot remove <environment_name>

Episode 3 - Submitting a Slurm job

Slurm jobs can be submitted either via the srun or the sbatch command. To submit a job from the "login" node, use sbatch and prepare a designated script.
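
For example, a quick interactive shell inside a container can be started directly with srun (this assumes the pyxis Slurm plugin, which provides the --container-* flags used in this tutorial, is available; add resource flags as required by your cluster):

srun --container-image /path/to/<squash_file> --pty /bin/bash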

Case A - MPI

Relevant for executing multi-GPU / multi-node runs using MPI. We'll use Horovod's example for that.

Note: this is also relevant for single-GPU runs, but MPI is redundant in that case.

  1. Clone Horovod's repository.

    git clone https://github.com/horovod/horovod
  2. Create a Slurm script file.

    Create a new file, paste the following code and save:

    #!/bin/bash
    #SBATCH --job-name horovod_tf
    #SBATCH --output %x-%j.out
    #SBATCH --error %x-%j.err
    #SBATCH --ntasks 1
    #SBATCH --cpus-per-task 32
    #SBATCH --gpus-per-task 16
    
    srun --container-image $1 \
    --container-mounts $2:/code \
    --no-container-entrypoint \
    /bin/bash -c \
    "python /code/examples/tensorflow/tensorflow_synthetic_benchmark.py \
    --batch-size 256"
    • %x - Job name.
    • %j - Job ID.

    Note: this script is intended to run on 16 GPUs (e.g., 2 nodes with 8 GPUs each); modify it if needed. Notice that only a single task (--ntasks 1) is needed when running with MPI.
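
    For example, a sketch of the same header adjusted for a single node with 8 GPUs (the CPU count is only illustrative, match it to your cluster):

    #SBATCH --ntasks 1
    #SBATCH --cpus-per-task 16
    #SBATCH --gpus-per-task 8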

  3. Submit a new Slurm job.

    sbatch <script_file> <squash_file> <horovods_git_folder>

    Two files will be created locally: one for the output and one for the errors.
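
    For example, assuming the script was saved as horovod_tf.sh and using the illustrative paths from the previous episodes (both paths are only placeholders):

    sbatch horovod_tf.sh /shared/squash_files/nvidia_tf_custom.sqsh $HOME/horovod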

Case B - OpenMP

Relevant for executing single-GPU / multi-GPU runs in a single- or multi-threaded manner, using the framework's native support.

Create a Slurm script identical to Case A, and change the following lines:

#SBATCH --ntasks <number of GPUs>
#SBATCH --cpus-per-task 8
#SBATCH --gpus-per-task 1

This will create a separate task per GPU.
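
For example, a minimal sketch of a complete Case B script for 4 GPUs (the CPU count and the single-GPU training script are only illustrative placeholders):

#!/bin/bash
#SBATCH --job-name native_tf
#SBATCH --output %x-%j.out
#SBATCH --error %x-%j.err
#SBATCH --ntasks 4
#SBATCH --cpus-per-task 8
#SBATCH --gpus-per-task 1

# each of the 4 tasks gets its own GPU and runs the command below once
srun --container-image $1 \
--container-mounts $2:/code \
--no-container-entrypoint \
/bin/bash -c \
"python /code/my_single_gpu_script.py"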
