
Comments (11)

vchuravy commented on July 21, 2024

I think this is working as intended. ClusterManagers uses srun to start processes on the cluster within the sbatch allocation, but the first process does not count as one of them, since it is not itself run under srun.

When the job allocation is finally granted for the batch script, Slurm runs a single copy of the batch script on the first node in the set of allocated nodes.
https://slurm.schedmd.com/sbatch.html

So you would need to say:

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=1G
#SBATCH --job-name=my_first_parallel_julia
#SBATCH --time=00-00:02:00  # Wall Clock time (dd-hh:mm:ss) [max of 14 days]
#SBATCH --output=julia_in_parallel.output  # output and error messages go to this file

srun --ntasks=1 julia my_first_parallel_julia_on_slurm.jl

This will run Julia under Slurm and then use srun to start additional workers, but I am not 100% sure that it will work as expected, since I don't have a Slurm cluster handy.
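To check what the batch script itself sees, a small diagnostic could be placed before the srun line. This is only a sketch: the function name `print_slurm_task_env` is made up for illustration, and the three variables are ones the sbatch documentation lists as exported to the job. Each prints as "unset" if Slurm did not set it.

```shell
# Hypothetical helper, not part of Slurm or ClusterManagers.jl.
# Prints the task-related variables visible to the batch script,
# substituting "unset" for any variable that is missing or empty.
print_slurm_task_env() {
    echo "SLURM_NTASKS=${SLURM_NTASKS:-unset}"
    echo "SLURM_NTASKS_PER_NODE=${SLURM_NTASKS_PER_NODE:-unset}"
    echo "SLURM_JOB_NODELIST=${SLURM_JOB_NODELIST:-unset}"
}
```

Calling `print_slurm_task_env` once in the batch script writes the values to the job's output file, which makes it easier to compare what was requested with what the later srun calls inherit.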

from clustermanagers.jl.

nickeubank commented on July 21, 2024

Thanks @vchuravy .

That does seem a little odd to me -- it's letting that first julia process run, and I can't think of a reason SLURM would allow a process to run if it weren't part of that resource allocation. Seems inherently problematic to allow users to just sneak in processes like that.

(Also, the example slurm scripts I've seen don't use that srun notation...)

vchuravy commented on July 21, 2024

nickeubank commented on July 21, 2024

Ah, interesting. OK, bringing up with IT staff. Will get back to you with resolution.

nickeubank commented on July 21, 2024

@vchuravy Checked with IT staff, and they say:

Everything executed within the script is run within the allocated resources. Only parallel processes (i.e. MPI) must be launched with srun. Any application launched without srun will just execute one copy of itself on the first node of the job allocation where the batch script runs.

As for the issue you reported in the GitHub repository, after consulting with the Slurm developers, we agree that it is not a Slurm issue, but a matter of how ClusterManagers interacts with Slurm. Namely, if it wants to spawn only a subset of the tasks allocated for the job, it should clear the value of the SLURM_NTASKS_PER_NODE environment variable before spawning tasks via srun. Hence it is a Julia bug.
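The suggestion above can be sketched in shell. This is a minimal illustration, not what ClusterManagers.jl does: the function name `run_without_ntasks_per_node` is hypothetical, and the only thing it relies on is that a variable unset inside a subshell is not inherited by commands started from that subshell.

```shell
# Hypothetical helper, not part of Slurm or ClusterManagers.jl.
# Runs the given command with SLURM_NTASKS_PER_NODE removed from its
# environment. The subshell keeps the calling script's own copy intact.
run_without_ntasks_per_node() {
    (
        unset SLURM_NTASKS_PER_NODE
        "$@"
    )
}
```

In a batch script this could wrap the launch line, for example `run_without_ntasks_per_node julia my_first_parallel_julia_on_slurm.jl`, so that any srun started by Julia does not inherit the variable. Whether that is enough to silence the warning was not confirmed in this thread.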

vchuravy commented on July 21, 2024

Thanks for getting back to me with that. The reasoning seems weird to me, since Julia is starting parallel processes and is relying on srun to set up those instances...
I don't quite see how clearing SLURM_NTASKS_PER_NODE would help.

nickeubank commented on July 21, 2024

Not sure either -- is the exception being thrown by Julia or Slurm? If the latter, then maybe ClusterManagers.jl should store the value and clear it, then allocate one process at a time according to that value, in a manner consistent with the initial request but in a way that doesn't annoy Slurm?

Sorry, not quite sure how this works internally....

vchuravy commented on July 21, 2024

The warning is coming from Slurm and not from Julia. I am confused by the fact that addprocs_slurm(4) works, since that means it can fulfill the request properly and is ignoring the master process in its accounting.

nickeubank commented on July 21, 2024

Perhaps the last srun is effectively asking for a new allocation that's small enough it happens to be getting it quickly? I was playing with this over the holidays when I assume system demand was low...

vancleve commented on July 21, 2024

This is helpful I think: https://stackoverflow.com/a/43799481/1596965

juliohm commented on July 21, 2024

Please consult https://github.com/juliohm/julia-distributed-computing for an example of how to parallelise your program with batch scripts.
