
Pre-Slurm Tutorial

Introduction

Best practices for preparing your environment to run as a Slurm job. The procedure includes pulling, modifying and converting a container from NGC using NVIDIA Enroot. A multi-node Horovod example is included.

We strongly recommend working with a container as your environment and mounting your code from outside the container.

A flowchart of the process is available as an image in the repository.

Episode 1 - Development

Our recommendation is to start developing from an optimized container from NGC. Chances are the required packages are already installed and no modification is needed. If that is not the case, you can modify the container to fit your requirements.

  1. Pull a relevant container from NGC using NVIDIA Enroot.

    Containers for optimized frameworks (PyTorch, TensorFlow, etc.) are available and updated monthly. To find which container fits your desired environment, visit our Optimized Framework Release Notes and search for the relevant container release.

    Pull command:

    enroot import 'docker://nvcr.io#nvidia/<framework>:<tag>'

    E.g., to pull a 22.03 release TensorFlow container run:

    enroot import 'docker://nvcr.io#nvidia/tensorflow:22.03-tf1-py3'

    A container will be pulled and converted to a local squash file.

  2. Create the container under Enroot's data path.

    enroot create --name <environment_name> <squash_file>

    E.g., to create the TensorFlow container run:

    enroot create --name nvidia_tf nvidia+tensorflow+22.03-tf1-py3.sqsh

    To view all created containers run:

    enroot list

  3. Start and work on the container.

    enroot start --root --rw --mount <local_folder>:<container_folder> <environment_name>
    • --root enables root privileges.
    • --rw enables read and write permissions (any changes inside the container will be saved).
    • --mount enables mounting of a local folder (to mount your code and data).

    More configuration options are available in Enroot's start command documentation.

    To exit the container run exit.
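
    For example, starting the TensorFlow container created above with a local code folder mounted, installing an extra package and then leaving could look like this (the $HOME/my_project path and the scikit-learn package are only illustrative):

    # start the container with root privileges, write access and the project folder mounted
    enroot start --root --rw --mount $HOME/my_project:/workspace/my_project nvidia_tf
    # inside the container: install a missing package, verify it and leave
    pip install scikit-learn
    python -c "import sklearn; print(sklearn.__version__)"
    exit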

Episode 2 - Exporting your environment to a squash file

Slurm uses squash files to run jobs. Therefore, your environment should be exported to a (new) squash file, containing all the changes you performed (if any).

  1. Export your current environment to a squash file.

    enroot export --output <squash_file> <environment_name>

    A new squash file will be locally created.

    Note: move the squash file to a location accessible to Slurm.
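
    For example, to export the nvidia_tf environment from Episode 1 and move the resulting squash file to shared storage (the /shared/squash_files path is only illustrative, use a location that exists on your cluster):

    enroot export --output nvidia_tf_custom.sqsh nvidia_tf
    mv nvidia_tf_custom.sqsh /shared/squash_files/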

  2. Optional: remove old squash files and clear Enroot's data path.

    The original, unmodified squash file can be deleted. Additionally, to delete the created container under Enroot's data path run:

    enroot remove <environment_name>

Episode 3 - Submitting a Slurm job

Slurm jobs can be submitted either via the srun or the sbatch command. To submit a job from the "login" node, use sbatch and prepare a designated script.
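
For example, a quick interactive shell inside a container can be started directly with srun (this assumes the pyxis Slurm plugin, which provides the --container-* flags used in this tutorial, is available; add resource flags as required by your cluster):

srun --container-image /path/to/<squash_file> --pty /bin/bash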

Case A - MPI

Relevant for executing multi-GPU / multi-node runs using MPI. We'll use Horovod's example for that.

Note: this is also relevant for single-GPU runs, but MPI is redundant in that case.

  1. Clone Horovod's repository.

    git clone https://github.com/horovod/horovod
  2. Create a Slurm script file.

    Create a new file, paste the following code and save:

    #!/bin/bash
    #SBATCH --job-name horovod_tf
    #SBATCH --output %x-%j.out
    #SBATCH --error %x-%j.err
    #SBATCH --ntasks 1
    #SBATCH --cpus-per-task 32
    #SBATCH --gpus-per-task 16
    
    srun --container-image $1 \
    --container-mounts $2:/code \
    --no-container-entrypoint \
    /bin/bash -c \
    "python /code/examples/tensorflow/tensorflow_synthetic_benchmark.py \
    --batch-size 256"
    • %x - Job name.
    • %j - Job ID.

    Note: this script is intended to run on 16 GPUs (e.g., 2 nodes with 8 GPUs each); modify it if needed. Notice that only a single task (--ntasks 1) is needed when running with MPI.
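
    For example, a sketch of the same header adjusted for a single node with 8 GPUs (the CPU count is only illustrative, match it to your cluster):

    #SBATCH --ntasks 1
    #SBATCH --cpus-per-task 16
    #SBATCH --gpus-per-task 8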

  3. Submit a new Slurm job.

    sbatch <script_file> <squash_file> <horovods_git_folder>

    Two files will be created locally: one for the output and one for the errors.
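
    For example, assuming the script was saved as horovod_tf.sh and using the illustrative paths from the previous episodes (both paths are only placeholders):

    sbatch horovod_tf.sh /shared/squash_files/nvidia_tf_custom.sqsh $HOME/horovod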

Case B - OpenMP

Relevant for executing single-GPU / multi-GPU runs in a single- or multi-threaded manner, using the framework's native support.

Create a Slurm script identical to Case A, and change the following lines:

#SBATCH --ntasks <number of GPUs>
#SBATCH --cpus-per-task 8
#SBATCH --gpus-per-task 1

This will create a separate task per GPU.
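
For example, a minimal sketch of a complete Case B script for 4 GPUs (the CPU count and the single-GPU training script are only illustrative placeholders):

#!/bin/bash
#SBATCH --job-name native_tf
#SBATCH --output %x-%j.out
#SBATCH --error %x-%j.err
#SBATCH --ntasks 4
#SBATCH --cpus-per-task 8
#SBATCH --gpus-per-task 1

# each of the 4 tasks gets its own GPU and runs the command below once
srun --container-image $1 \
--container-mounts $2:/code \
--no-container-entrypoint \
/bin/bash -c \
"python /code/my_single_gpu_script.py"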
