
Comments (8)

biochem-fan commented on July 3, 2024

I cannot directly address your question because I am not a SLURM expert, but the following might give some insight.

In our cluster, we use SLURM only to control the visibility of GPUs per job, not per MPI process (i.e. task). For example, our job template for a "half node" job looks like:

#SBATCH -N 1
#SBATCH --ntasks-per-node=36 # grab half node
#SBATCH --gres=gpu:2 # with 2 (out of 4) GPUs
#SBATCH --mem=256G # and half memory
...

mpirun --oversubscribe -n XXXmpinodesXXX XXXcommandXXX

Users are free to divide the 36 cores into, e.g., 5 MPI processes x 9 threads or 3 MPI processes x 18 threads. Because the first MPI rank of a Refine3D/Class3D job does not perform real computation and does not need a GPU, this is more efficient; --oversubscribe allows it. Because SLURM limits the available cores via cgroups, this does not harm other jobs running on the same node. Your job script wastes an expensive A100 on the first rank by using --gpus-per-task=1.
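For concreteness, a minimal sketch of how the template placeholders might expand for the 5-process x 9-thread split (the input file and output path are hypothetical; --j and --gpu are RELION's usual per-rank thread and GPU options, and an empty --gpu string lets RELION assign devices automatically, but check this against your RELION version):

# 1 leader rank (no GPU work) + 4 worker ranks x 9 threads = 36 cores
mpirun --oversubscribe -n 5 relion_refine_mpi \
    --i particles.star --o Refine3D/job001/run \
    --j 9 --gpu "" \
    ... # remaining Refine3D options as usual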

All RELION processes see two GPUs, 0 and 1 (out of four). Physically they might be devices 0 and 1, 0 and 3, or any other pair, but the cgroup renumbers the allocated GPUs to 0 and 1 and hides the others.


DimitriosBellos commented on July 3, 2024

Thank you for your answer.

The issue is not about the number of MPI procs, the number of threads per MPI proc (aka SLURM task), or what their best combination is.

The issue is regarding the GPUs. Our users want to use more than 4 GPUs per job (maybe 5, 6, 7, etc.) to accelerate their processing pipelines.

The reason they have set --cpus-per-task (aka -c) to 36 is that they have also set --gpus-per-task=1. Each node in Baskerville HPC has 4 GPUs and 144 cores, so because each task should have its own independent GPU, each task uses one quarter of the GPUs in a node, and one quarter of the CPUs is 36.

Ideally, we would like the tasks to be node-agnostic: e.g. for a job with 8 tasks, 3 may run on one node, another 2 on a second node, and the last 3 on 3 different nodes (5 nodes in total). This way the job does not have to wait for all 8 GPUs to be freed up on exactly 2 nodes (4+4 GPUs).

Thus fixing the issue is important. Just because e.g. 3 MPI procs (aka SLURM tasks) land on the same node and each proc's GPU has index 0, RELION should not deduce that they are using the same GPU. SLURM simply reindexes the NVIDIA GPU indices so that each MPI proc (aka SLURM task) can 'see' only 1 GPU, and that GPU has index 0.
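To see this reindexing directly, a quick diagnostic along these lines (purely illustrative, not part of any RELION job) prints what each task can see; with --gpus-per-task=1 every task reports device index 0, but the UUIDs printed by nvidia-smi -L show that the physical devices differ:

# inside an allocation requested with e.g. --ntasks=3 --gpus-per-task=1
srun bash -c 'echo "task $SLURM_PROCID on $(hostname):"; nvidia-smi -L'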

I understand that the first MPI process does not do anything and its GPU is not utilised, but we want to run multi-node jobs, and as far as I know there is no way in SLURM to give different resources to different tasks and thus have the first task use 0 GPUs while all the rest use 1.

I believe this is an issue with how Refine3D MPI has been implemented, because in principle the first MPI process with rank 0 should also act as a parallel worker that does computation; it just has additional coordination work to perform. Typically this is done by guarding the extra work with multiple if rank==0 blocks in the code, which only the first MPI proc executes, while all the computational code is executed by every MPI proc (including rank 0).


biochem-fan commented on July 3, 2024

I don't recommend running RELION GPU jobs over multiple nodes. This is inefficient and leads to fragmentation of jobs.

If you restrict RELION to a single node, you can allocate one task for RELION, give that task 2 or 4 GPUs, and run multiple MPI processes within that one task. Many people use RELION this way on SLURM.

Our users want to use more than 4 GPUs per job (maybe 5, 6, 7, etc.) to accelerate their processing pipelines.

This is a bad idea. RELION does not scale well beyond 4 GPUs. Actually, an A100 is overkill for RELION; one RELION process does not have enough parallelism to saturate an A100. I recommend running 2 or 3 MPI processes per A100 and 2 (or 3 or 4 if the job is really big) GPUs per job. There are discussions regarding this on the CCPEM mailing list.
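A rough sketch of that recommendation as a job script (hypothetical; single node, one SLURM task, 4 GPUs with 2 worker ranks each; the colon-separated --gpu string is RELION's per-worker-rank device mapping, so verify it against your version's documentation):

#SBATCH -N 1
#SBATCH --ntasks-per-node=1     # one task; the MPI ranks run inside it
#SBATCH --cpus-per-task=36
#SBATCH --gres=gpu:4            # GPUs belong to the job, not to tasks

# 1 leader rank + 8 worker ranks = 2 worker ranks per A100
mpirun --oversubscribe -n 9 relion_refine_mpi \
    --j 4 --gpu "0:1:2:3:0:1:2:3" \
    ... # remaining Refine3D options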

I believe this is an issue with how Refine3D MPI has been implemented,

You are correct, but changing the current implementation requires huge refactoring and is very unlikely to happen.


colinpalmer commented on July 3, 2024

To address the specific issue of GPU memory sharing: would it be possible for RELION to use the GPU UUIDs to determine when a GPU is shared between multiple processes, rather than just using the single-digit IDs?
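For reference, the driver already exposes stable UUIDs, so the check could be as simple as comparing what each process reports (a sketch of the idea only, not an existing RELION feature):

# every task may see index 0, but the UUID identifies the physical device;
# two processes share a GPU only if their UUIDs match, regardless of index
nvidia-smi --query-gpu=index,uuid --format=csv,noheader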


biochem-fan commented on July 3, 2024

To address the specific issue of GPU memory sharing: would it be possible for RELION to use the GPU UUIDs to determine when a GPU is shared between multiple processes, rather than just using the single-digit IDs?

If someone writes, tests and sends a pull request for this, I am happy to review it.

But in general, I don't like SLURM controlling how resources are allocated to each process ("task") on the same node. It is completely fine to hide and block access to CPU cores, memory and GPUs allocated to other jobs but resources for a job should be shared among all processes of the same job on the same node. In other words, RELION likes one task per node, regardless of the actual number of MPI processes running on the node.

It is fine to use compartmentalization, virtualization, containerization etc but problems caused by these additional layers of abstraction and complication should be dealt with by those layers, not by RELION.


biochem-fan commented on July 3, 2024

In our cluster above, although we use multiple tasks,

#SBATCH -N 1
#SBATCH --ntasks-per-node=36 # grab half node
#SBATCH --gres=gpu:2 # with 2 (out of 4) GPUs

nvidia-smi within this script shows 2 GPUs and all MPI processes see 2 GPUs. In this case, GPUs are allocated to the entire job, not per task. This is what I want.


DimitriosBellos commented on July 3, 2024

Our use case is multi-node, since due to the size of the data using more than 4 GPUs at a time is necessary. Regarding your prior reply:

"But in general, I don't like SLURM controlling how resources are allocated to each process ("task") on the same node. It is completely fine to hide and block access to CPU cores, memory and GPUs allocated to other jobs but resources for a job should be shared among all processes of the same job on the same node. In other words, RELION likes one task per node, regardless of the actual number of MPI processes running on the node."

I understand your argument, but unfortunately this is how the SLURM scheduler operates. SLURM tasks run independently of each other and are completely agnostic about whether they land on the same node; even when they do, they remain independent, meaning one cannot access the GPUs of another, even if they were spawned by the same job. SLURM schedulers are used on a large number of modern HPCs (ARCHER2, Baskerville, etc.), and thus having RELION run optimally on them would accelerate scientific discoveries and, of course, increase acknowledgements and citations of RELION.

I am just mentioning this because, if RELION had an option to ignore which node each task runs on and instead behave as if all tasks were running on different nodes, this issue would be resolved.

Furthermore, the fact that the first MPI proc with rank 0 is not used to run computations is, in my opinion, a very suboptimal choice, though I totally understand the difficulty and time required to refactor the software to resolve it. The reason I believe it is suboptimal is that even on HPCs that only have CPUs, the RAM allocation is expected to be proportional to how much RAM is installed on a node and how many of the node's CPUs are used per task. Because of this, there may be use cases (depending on the dataset and the processing options used) where the number of threads per MPI proc/task should be high (to accommodate a correspondingly high RAM allocation per task). In these use cases, and of course in use cases where one high-performance GPU is allocated per task, the fact that the first MPI proc with rank 0 performs no computations leads to under-utilising the compute resources of the HPC(s).

As I already mentioned, I understand the difficulty of refactoring RELION in order to:

  • Operate in a fashion that ignores which node each MPI proc / SLURM task runs on, assuming instead that all tasks run on different nodes.
  • Have the first MPI proc with rank 0 also perform computations.

However, the points above might be useful to keep in mind, at least when developing future versions of RELION, since they can lead to better and more optimised performance of RELION on HPCs, accelerate scientific discoveries, and increase the number of acknowledgements and citations of the RELION suite.


biochem-fan commented on July 3, 2024

We do use SLURM ourselves. The difference is that we don't use SLURM's task concept.

I understand that this is not compatible with completely arbitrary node allocation. But in practice we use only 4 GPUs on one node or 2 GPUs on half a node, so this never became a problem. We don't let a job run on 2+1+1 GPUs across 3 nodes, because this is inefficient (i.e. 3x the storage access and synchronization over the network).

Our use case is multi-node, since due to the size of the data using more than 4 GPUs at a time is necessary.

We did extensive tests on scaling and concluded that going beyond 4 GPUs is a very bad idea. When we have 2 million particles, we split the dataset into smaller chunks (say 0.5 M particles each) and run four independent jobs. Such an "embarrassingly parallel" mode of operation is far more efficient than running a big job on 16 GPUs. Certainly the parallelization efficiency of a single job can be improved, but somebody has to work on this.
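A sketch of that embarrassingly parallel pattern (hypothetical names; it assumes the subsets particles_split1.star ... particles_split4.star were already created, e.g. with RELION's subset-splitting job, and that run_class3d.sh reads the SUBSET variable):

# submit four independent single-node jobs, one per ~0.5 M-particle subset
for i in 1 2 3 4; do
    sbatch --export=ALL,SUBSET=particles_split${i}.star run_class3d.sh
done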

RAM allocation is expected to be proportional to how much RAM is installed on a node and how many CPUs of the node are being used per task.

Please look at my example above. We just oversubscribe mpirun, so no CPUs or RAM are wasted.

I see your points and welcome pull requests to address these issues, but I'm afraid it is unlikely that I myself can work on them. I just don't have the time and motivation for it; most users (including SLURM users) use RELION in the way I described above.

