Comments (11)
I think this is how it is supposed to be working. ClusterManager uses srun to start processes on the cluster within the sbatch allocation. But the first process does not count as such, since it is not itself run under srun.
When the job allocation is finally granted for the batch script, Slurm runs a single copy of the batch script on the first node in the set of allocated nodes.
https://slurm.schedmd.com/sbatch.html
So you would need to say:
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=1G
#SBATCH --job-name=my_first_parallel_julia
#SBATCH --time=00-00:02:00 # Wall Clock time (dd-hh:mm:ss) [max of 14 days]
#SBATCH --output=julia_in_parallel.output # output and error messages go to this file
srun --ntasks=1 julia my_first_parallel_julia_on_slurm.jl
This will run julia under Slurm and then use srun to start additional workers, but I am not 100% sure that this will work as expected, since I don't have a Slurm cluster handy.
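The accounting behind the suggested script can be sketched like this (a minimal sketch; the values mirror the example above and the arithmetic is an assumption about how Slurm counts the master task, not something confirmed in this thread):

```shell
# The allocation requests 4 tasks in total. Launching the master julia
# process via "srun --ntasks=1" consumes one of them, which would leave
# 3 task slots for the workers that addprocs spawns via srun.
total_tasks=4
master_tasks=1
worker_slots=$((total_tasks - master_tasks))
echo "task slots left for addprocs workers: $worker_slots"
```

If that accounting is right, asking for 4 workers on top of a master that is itself a counted task would oversubscribe the allocation by one.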
from clustermanagers.jl.
Thanks @vchuravy .
That does seem a little odd to me -- it's letting that first julia process run, and I can't think of a reason SLURM would allow a process to run if it weren't part of that resource allocation. Seems inherently problematic to allow users to just sneak in processes like that.
(Also, the example Slurm scripts I've seen don't use that srun notation...)
Ah, interesting. OK, bringing up with IT staff. Will get back to you with resolution.
@vchuravy Checked with IT staff, and they say:
Everything executed within the script is run within the allocated resources. Only parallel processes (i.e. MPI) must be launched with srun. Any application launched without srun will just execute one copy of itself on the first node of the job allocation where the batch script runs.
As for the issue you reported in the GitHub repository, after consulting with the Slurm developers, we agree that this is not a Slurm issue, but how ClusterManagers interacts with Slurm. Namely, if it wants to spawn only a subset of the tasks allocated for the job, it should clear the value of the SLURM_NTASKS_PER_NODE environment variable before spawning tasks via srun.
Hence it is a Julia bug.
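At the shell level, the fix the Slurm developers describe would presumably look something like the sketch below (the variable values are made up for illustration; whether this actually satisfies srun is untested here, and SLURM_NTASKS / SLURM_NTASKS_PER_NODE are the standard Slurm input environment variables):

```shell
# Simulate the environment an sbatch batch script would inherit
# (values here are made up for illustration).
export SLURM_NTASKS=4
export SLURM_NTASKS_PER_NODE=2

# Per the Slurm developers' advice: clear the per-node task count
# before spawning only a subset of the allocated tasks via srun.
unset SLURM_NTASKS_PER_NODE

# The total task count is still available for deciding how many
# workers to spawn.
echo "workers to spawn: ${SLURM_NTASKS}"
echo "per-node setting: ${SLURM_NTASKS_PER_NODE:-cleared}"
```

The point is that srun would otherwise inherit the per-node hint from the batch script's environment and apply it to the worker launch, even when the worker launch requests fewer tasks than the allocation holds.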
Thanks for getting back to me with that. The reasoning seems weird to me, since Julia is starting parallel processes and is relying on srun to set up those instances... I don't quite see how clearing SLURM_NTASKS_PER_NODE would help.
Not sure either -- is the exception being thrown by julia or Slurm? If the latter, then maybe ClusterManagers.jl should store the value and clear it, then allocate according to that value one process at a time, in a manner consistent with the initial request but without annoying Slurm?
Sorry, not quite sure how this works internally....
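The store-and-clear idea might look something like this sketch (LAUNCHER is a hypothetical stand-in I'm introducing so the snippet runs anywhere; on a real cluster it would be "srun --ntasks=1", and the exported values are again made up for illustration):

```shell
# Hypothetical store-and-clear launch loop. LAUNCHER is a stand-in:
# on a real cluster it would be "srun --ntasks=1"; echo lets this
# sketch run without Slurm installed.
LAUNCHER=${LAUNCHER:-echo}
export SLURM_NTASKS=4
export SLURM_NTASKS_PER_NODE=2

# Remember how many tasks were allocated, then clear the per-node
# hint so each individual launch doesn't inherit it.
saved_ntasks=$SLURM_NTASKS
unset SLURM_NTASKS_PER_NODE

# Launch workers one at a time, up to the originally requested count.
for i in $(seq 1 "$saved_ntasks"); do
  $LAUNCHER "launch worker $i"
done
```

Whether launching one task at a time actually avoids the warning is exactly the open question in this thread; this only illustrates the proposed bookkeeping.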
The warning is coming from Slurm and not from Julia. I am confused by the fact that addprocs_slurm(4) works, since that means it can fulfill the request properly and is ignoring the master process in accounting.
Perhaps the last srun is effectively asking for a new allocation that's small enough that it happens to be granted quickly? I was playing with this over the holidays, when I assume system demand was low...
This is helpful I think: https://stackoverflow.com/a/43799481/1596965
Please consult https://github.com/juliohm/julia-distributed-computing for an example of how to parallelise your program with batch scripts.