- Use the squeue command, replacing `<userID>` with your userID:
  squeue -u <userID>
- The default srun request is small (1 hr time limit, 1 core, 1 GB, compute partition & batch QoS):
  srun --pty bash
- General format for srun [do not execute without changing values]:
  srun -p <partition> -q <QoS> -N <Number of Nodes> -n <Number of tasks> -c <cores> --time <time> --pty bash
- Single-core srun interactive job on the shared partition and cpu QoS with 1 CPU and 8 GB RAM for 4 hrs:
  srun -p shared -q cpu -N 1 -n 1 -c 1 --mem 8GB --time 4:00:00 --pty bash
- Single-core srun interactive job on the shared partition and cpu QoS with 1 CPU and 8 GB RAM for 8 hrs (time limit for the dev QoS):
  srun -p shared -q cpu -N 1 -n 1 -c 1 --mem 8GB --time 8:00:00 --pty bash
- Multi-core srun interactive job on the shared partition and cpu QoS with 4 CPUs and 8 GB RAM for 8 hrs (time limit for the dev QoS):
  srun -p shared -q cpu -N 1 -n 1 -c 4 --mem 8GB --time 8:00:00 --pty bash
- Multi-core srun interactive job on the highmem partition (default) and cpu QoS (default) with 4 CPUs and 8 GB RAM for 12 hrs:
  srun -p highmem -q cpu -N 1 -n 1 -c 4 --mem 8GB --time 12:00:00 --pty bash
- Single-core sbatch script on the shared partition & cpu QoS with 1 CPU and 6 GB for 8 hrs (time limit for the dev QoS). LINK.
- Multi-core sbatch script on the shared partition & cpu QoS with 4 CPUs and 24 GB for 8 hrs (time limit for the dev QoS). LINK.
- Multi-core sbatch script on the highmem partition & cpu QoS with 4 CPUs and 48 GB RAM for 96 hrs. LINK.
- Multi-core sbatch script on the shared partition & cpu QoS with 4 CPUs and 48 GB RAM for 96 hrs. LINK.
- Run sbatch jobs with
  sbatch <sbatch_script.sh>
  where `<sbatch_script.sh>` is the name of your SLURM script.
- Set the number of cores with the `--cpus-per-task` flag and leave `--nodes=1` and `--ntasks=1` set to 1 (for most jobs).
- The `--cpus-per-task` value is passed into the job as the `${SLURM_CPUS_PER_TASK}` environment variable (see the example run line after the template script below).
- dev CPU core speed is 3.60 GHz vs 2.70 GHz on compute.
- dev has an 8 hour time limit.
- The dev partition and dev QoS must be requested together.
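For example, a minimal dev interactive request might look like the sketch below. It simply instantiates the general srun format above with the dev partition/QoS pairing and 8 hr limit described here (the partition/QoS names are taken from the notes above; adjust cores, memory, and time to your needs):
  srun -p dev -q dev -N 1 -n 1 -c 1 --mem 8GB --time 8:00:00 --pty bash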
#!/bin/bash
#SBATCH --job-name=MY_JOB # Set job name
#SBATCH --partition=shared # Set the partition
#SBATCH --qos=cpu # Set the QoS
#SBATCH --nodes=1 # Do not change unless you know what you're doing (sets the number of nodes; leave at 1 for non-MPI jobs)
#SBATCH --ntasks=1 # Do not change unless you know what you're doing (sets the number of tasks; leave at 1 for non-MPI jobs)
#SBATCH --cpus-per-task=4 # Set the number of CPUs per task (change to the number of CPUs/threads used by your script/programs)
#SBATCH --mem=24GB # Set to a value ~10-20% greater than max amount of memory the job will use
#SBATCH --time=8:00:00 # Set the max time limit
###-----load modules if needed-------###
module load singularity
###-----run script below this line-------###
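For example, the run section can pass the allocated core count to your program via ${SLURM_CPUS_PER_TASK}. This is only a sketch; my_program and its --threads flag are placeholders for whatever tool you actually run:
  echo "Running with ${SLURM_CPUS_PER_TASK} CPUs"   # value comes from --cpus-per-task above
  my_program --threads ${SLURM_CPUS_PER_TASK}       # placeholder program and flag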
- The high_mem partition is only for jobs requiring more than 760 GB RAM per node.
- The maximum available memory on the high_mem nodes is 3022 GB, just shy of 3 TB (3072 GB in binary units). Note that the '--mem' value must be an integer (no decimals).
- The high_mem partition time limit is 72 hours, so use it with the batch QoS.
#!/bin/bash
#SBATCH --job-name=MY_JOB # Set job name
#SBATCH --partition=highmem # Set the partition
#SBATCH --qos=cpu # Set the QoS
#SBATCH --nodes=1 # Do not change unless you know what you're doing (sets the number of nodes; leave at 1 for non-MPI jobs)
#SBATCH --ntasks=1 # Do not change unless you know what you're doing (sets the number of tasks; leave at 1 for non-MPI jobs)
#SBATCH --cpus-per-task=46 # Set the number of CPUs per task (change to the number of CPUs/threads used by your script/programs)
#SBATCH --mem=1024GB # Here set to ~1 TB (~22 GB per core for the 46 cores above); limited to 3022 GB per node
#SBATCH --time=24:00:00 # Set the max time limit
###-----load modules if needed-------###
module load singularity
###-----run script below this line-------###
- The array index is available inside the job in the variable `${SLURM_ARRAY_TASK_ID}` (see the example run section after the script below).
- Limit the number of array tasks running at once by adding % to the array specification, e.g.
  #SBATCH --array=1-10%1
  for 1 task running at a time.
#!/bin/bash
#SBATCH --job-name=MY_JOB # Set job name
#SBATCH --partition=shared # Set the partition
#SBATCH --qos=cpu # Set the QoS
#SBATCH --nodes=1 # Do not change unless you know what you're doing (sets the number of nodes; leave at 1 for non-MPI jobs)
#SBATCH --ntasks=1 # Do not change unless you know what you're doing (sets the number of tasks; leave at 1 for non-MPI jobs)
#SBATCH --cpus-per-task=4 # Set the number of CPUs per task (change to the number of CPUs/threads used by your script/programs)
#SBATCH --mem=64GB # Set to a value ~10-20% greater than the max amount of memory the job will use (or ~10 GB per core)
#SBATCH --array=1-10 # Run as an array with task IDs 1-10
###-----load modules if needed-------###
module load singularity
###-----run script below this line-------###
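For example, the run section can use the array index to pick one input per task. This is only a sketch; samples.txt (one sample name per line) and the processing step are placeholders:
  SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)   # take the Nth line of the placeholder sample list
  echo "Array task ${SLURM_ARRAY_TASK_ID} processing ${SAMPLE} with ${SLURM_CPUS_PER_TASK} CPUs"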
- The partitions:
  scontrol show partition
- The QoS settings:
  sacctmgr show qos
- Node and partition status:
  sinfo
- Useful workshop website: https://thejacksonlaboratory.github.io/intro-to-hpc/
In general, the shared partition and cpu QoS are a good default configuration. These nodes have 128 cores and 257526 MB RAM (~257 GB) each.
High-memory nodes are available but are billed at 4 credits per CPU hour.
Gotcha: the default partition is the wholenode partition, which allocates all 128 cores of a node regardless of the number specified in the SLURM header.
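To avoid the wholenode default, a minimal sketch is to set the partition and QoS explicitly in the job header (values per the recommendation above):
  #SBATCH --partition=shared
  #SBATCH --qos=cpu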
Partition
Partition | System Default | Default Time (hr:min:sec) | Partition Default QoS | TRESBillingWeights | AllowQos |
---|---|---|---|---|---|
wholenode | yes | 00:30:00 | part-standard | CPU=1.0 | normal,cpu,rcac,benchmarking,dmachi,array-submission |
standard | No | 00:30:00 | part-standard | CPU=1.0 | normal,cpu,rcac,benchmarking,dmachi,array-submission |
shared | No | 00:30:00 | part-shared | CPU=1.0 | normal,cpu,rcac,shared-50,benchmarking,dmachi,mojtaba |
wide | No | 00:30:00 | part-highmem | CPU=1.0 | normal,cpu,rcac,wide-80 |
highmem | No | 00:30:00 | part-shared | CPU=4.0 | normal,cpu,rcac,benchmarking |
debug | No | 00:30:00 | part-debug | CPU=1.0 | normal,cpu,rcac |
gpu | No | 00:30:00 | part-gpu | GRES/gpu=1.0 | normal,gpu,express-gpu,allgpu,rcac,benchmarking,wide-gpu |
gpu-debug | No | 00:30:00 | part-gpu-debug | GRES/gpu=1.0 | normal,gpu,express-gpu,rcac |
QoS
Name | Flags | MaxTRES | MaxTRESPerNode | MaxWall | MaxTRESPU | MaxJobsPU | MaxSubmitPU | MaxTRESPA | MinTRES |
---|---|---|---|---|---|---|---|---|---|
normal | | | | | | | | | |
part-standard | DenyOnLimit | cpu=2048,node=16 | | 4-00:00:00 | | 64 | 128 | | |
part-shared | DenyOnLimit | cpu=128,node=1 | | 4-00:00:00 | cpu=6400 | | | | |
part-wide | DenyOnLimit | cpu=7168,node=56 | | 12:00:00 | | 5 | 10 | | |
part-highmem | DenyOnLimit | cpu=128,node=1 | | 2-00:00:00 | | 2 | 4 | | |
part-debug | DenyOnLimit | cpu=256,node=2 | | 02:00:00 | | 1 | 2 | | |
part-gpu | DenyOnLimit | | | 2-00:00:00 | gres/gpu=12 | | | gres/gpu=32 | gres/gpu=1 |
part-gpu-debug | DenyOnLimit | gres/gpu=2,node=1 | | 00:30:00 | | 1 | 2 | | gres/gpu=1 |
cpu | | | | | | | | | |
gpu | | | | | | | | | |
rcac | OverPartQOS | cpu=128000,gres/gpu=128,node=1000 | | | cpu=128000,gres/gpu=128,node=1000 | 500 | 5000 | | |
express-gpu | DenyOnLimit,OverPartQOS | | | 01:00:00 | gres/gpu=32 | | | | gres/gpu=1 |
shared-50 | DenyOnLimit,OverPartQOS | cpu=6400 | | | | | | | |
benchmarking | DenyOnLimit,OverPartQOS | cpu=96000,node=750 | | | | 500 | 1000 | | |
dmachi | DenyOnLimit,OverPartQOS | cpu=512,node=4 | | 5-00:00:00 | cpu=128000 | 800 | 850 | | |
wide-80 | DenyOnLimit,OverPartQOS | cpu=10240,node=80 | | | | | | | |
allgpu | OverPartQOS | node=16 | cpu=128,gres/gpu=4 | | gres/gpu=64 | | | gres/gpu=64 | |
wide-gpu | DenyOnLimit,OverPartQOS | gres/gpu=32 | | | | | | | |
array-submission | DenyOnLimit,OverPartQOS | | | | | 64 | 2000 | | |
mojtaba | DenyOnLimit,OverPartQOS | cpu=128,node=1 | | 18-00:00:00 | cpu=6400 | | | | |
The Partition information.
scontrol show partition
The QoS information.
sacctmgr show qos \
format="Name%-16,Flags%-23,MaxTRES%-33,MaxTRESPerNode%-18,MaxWall,MaxTRESPU%-33,MaxJobsPU,MaxSubmitPU,MaxTRESPA,MinTRES"
The amount of memory (in MB) on each node:
sinfo -N -l
View your account.
sacctmgr show associations user=$USER format=cluster,account,user,qos
View your cluster utilization ($USER can be replaced with a list of comma-separated users).
sreport cluster userutilizationbyaccount start=2022-02-01 users=$USER
View your cluster utilization in minutes ($USER can be replaced with a list of comma-separated users).
sreport user topusage start=2023-01-01 end=2023-02-19 users=$USER
View user job history (requires a start date, -S).
sacct -u $USER -S 2023-01-01 --format=jobid,jobname,submit,exitcode,elapsed,reqnodes,reqcpus,reqmem,alloccpus,AllocNodes,AllocTRES,ReqTRES
View information about a specific job. Replace ${JOBID} with the specific job ID (see user job history above).
sacct -j ${JOBID} --format=jobid,jobname,submit,exitcode,elapsed,reqnodes,reqcpus,reqmem,alloccpus,AllocNodes,AllocTRES,ReqTRES
#useful Anvil pages