I started a session with an sbatch with: <

SLURM: `srun` doesn't see julia instance using ClusterManagers? (closed issue in clustermanagers.jl)

juliaparallel commented on July 21, 2024

from clustermanagers.jl.

Comments (11)

vchuravy commented on July 21, 2024

I think this is how it is supposed to work. ClusterManagers.jl uses srun to start processes on the cluster within the sbatch allocation, but the first process does not count as one of them, since it is not itself run under srun.

When the job allocation is finally granted for the batch script, Slurm runs a single copy of the batch script on the first node in the set of allocated nodes.
https://slurm.schedmd.com/sbatch.html

So you would need to say:

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=1G
#SBATCH --job-name=my_first_parallel_julia
#SBATCH --time=00-00:02:00  # Wall Clock time (dd-hh:mm:ss) [max of 14 days]
#SBATCH --output=julia_in_parallel.output  # output and error messages go to this file

srun --ntasks=1 julia my_first_parallel_julia_on_slurm.jl

This will run Julia under Slurm and then use srun to start additional workers, but I am not 100% sure that it will work as expected, since I don't have a Slurm cluster handy.
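One way to check what the batch script itself sees is to print the Slurm variables that describe the allocation. The following is only a sketch: the helper name `print_slurm_env` is made up for illustration, and it assumes the job was submitted with `--ntasks` and `--ntasks-per-node` so that Slurm exports `SLURM_NTASKS` and `SLURM_NTASKS_PER_NODE`.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm or ClusterManagers.jl): print the
# Slurm variables that describe the allocation, prefixed with a label so that
# output from different launch styles can be told apart.
print_slurm_env() {
    label=$1
    for name in SLURM_JOB_ID SLURM_NTASKS SLURM_NTASKS_PER_NODE; do
        # Indirect lookup of the variable named in $name; falls back to "unset".
        eval "value=\${$name:-unset}"
        printf '%s %s=%s\n' "$label" "$name" "$value"
    done
}

# Example use inside a batch script:
#   print_slurm_env batch
```

Calling the same function from a script launched with srun and comparing the two outputs should show whether both launch styles see the same allocation.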

nickeubank commented on July 21, 2024

Thanks @vchuravy .

That does seem a little odd to me -- it's letting that first julia process run, and I can't think of a reason SLURM would allow a process to run if it weren't part of that resource allocation. Seems inherently problematic to allow users to just sneak in processes like that.

(Also, the example slurm scripts I've seen, http://www.accre.vanderbilt.edu/?page_id=2154#torque, don't use that srun notation...)

vchuravy commented on July 21, 2024

Yeah, it is an odd quirk of SLURM. The example scripts from NERSC all use `srun`: http://www.nersc.gov/users/computational-systems/cori/running-jobs/example-batch-scripts/ My take is to talk to your admins and ask whether they are okay with running a lightweight master process outside srun. But generally I would use srun for the master process as well.

nickeubank commented on July 21, 2024

Ah, interesting. OK, bringing up with IT staff. Will get back to you with resolution.

nickeubank commented on July 21, 2024

@vchuravy Checked with IT staff, and they say:

Everything executed within the script is run within the allocated resources. Only parallel processes (i.e. MPI) must be launched with srun. Any application launched without srun will just execute one copy of itself on the first node of the job allocation where the batch script runs.

As for the issue you reported in the GitHub repository, after consulting with the Slurm developers, we agree that that is not a Slurm issue, but how ClusterManagers interacts with Slurm. Namely, if it wants to spawn only a subset of the tasks allocated for the job, it should clear the value of the SLURM_NTASKS_PER_NODE environment variable before spawning tasks via srun. Hence it is a Julia bug.
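In shell terms, the suggestion from IT staff could look roughly like the sketch below. The helper name `run_without_ntasks_per_node` is hypothetical, and whether clearing the variable actually silences the Slurm warning has not been verified here. The subshell keeps the variable intact for the rest of the batch script.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm or ClusterManagers.jl): run a command
# with SLURM_NTASKS_PER_NODE removed from its environment only.
run_without_ntasks_per_node() {
    (
        # The unset happens inside a subshell, so the caller's copy survives.
        unset SLURM_NTASKS_PER_NODE
        "$@"
    )
}

# Example use inside a batch script (needs a Slurm allocation):
#   run_without_ntasks_per_node srun --ntasks=1 hostname
```

ClusterManagers.jl would have to do the equivalent from Julia before it calls srun, which is a change to the package and not something a batch script alone can do.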

vchuravy commented on July 21, 2024

Thanks for getting back to me with that. The reasoning seems weird to me, since Julia is starting parallel processes and is relying on srun to set up those instances...
I don't quite see how clearing SLURM_NTASKS_PER_NODE would help.

nickeubank commented on July 21, 2024

Not sure either -- is the exception being thrown by Julia or Slurm? If the latter, then maybe ClusterManagers.jl should store the value and clear it, then allocate according to that value one process at a time in a manner consistent with the initial request, but in a way that doesn't annoy Slurm?

Sorry, not quite sure how this works internally....

vchuravy commented on July 21, 2024

The warning is coming from SLURM and not from Julia. I am confused by the fact that addprocs_slurm(4) works, since that means it can fulfill the request properly and is ignoring the master process in accounting.

nickeubank commented on July 21, 2024

Perhaps the last srun is effectively asking for a new allocation that's small enough it happens to be getting it quickly? I was playing with this over the holidays when I assume system demand was low...

vancleve commented on July 21, 2024

This is helpful I think: https://stackoverflow.com/a/43799481/1596965

juliohm commented on July 21, 2024

Please consult https://github.com/juliohm/julia-distributed-computing for an example of how to parallelise your program with batch scripts.
