giovtorres / slurm-docker-cluster
A Slurm cluster using docker-compose
License: MIT License
My host machine runs Windows. I can't start the containers with docker compose up; the errors are:
I don't understand why this file does not exist; the image build process did not report any issues.
Following the setup from the README, I did:
During registration, I get the following error:
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to slurmdbd:6819: Connection timed out
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection timed out
sacctmgr: error: Problem talking to the database: Connection timed out
I have waited several minutes for all services to be initialized. Here is the tail of my docker-compose logs.
mysql | 2023-05-28 22:13:12 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
mysql | 2023-05-28 22:13:12 0 [Note] InnoDB: Number of transaction pools: 1
mysql | 2023-05-28 22:13:12 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
mysql | 2023-05-28 22:13:12 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
mysql | 2023-05-28 22:13:12 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
mysql | 2023-05-28 22:13:12 0 [Note] InnoDB: Completed initialization of buffer pool
mysql | 2023-05-28 22:13:12 0 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
mysql | 2023-05-28 22:13:12 0 [Note] InnoDB: 128 rollback segments are active.
mysql | 2023-05-28 22:13:12 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
mysql | 2023-05-28 22:13:12 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
mysql | 2023-05-28 22:13:12 0 [Note] InnoDB: log sequence number 46702; transaction id 14
mysql | 2023-05-28 22:13:12 0 [Note] Plugin 'FEEDBACK' is disabled.
mysql | 2023-05-28 22:13:12 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
mysql | 2023-05-28 22:13:12 0 [Warning] You need to use --log-bin to make --expire-logs-days or --binlog-expire-logs-seconds work.
mysql | 2023-05-28 22:13:12 0 [Note] InnoDB: Buffer pool(s) load completed at 230528 22:13:12
mysql | 2023-05-28 22:13:12 0 [Note] Server socket created on IP: '0.0.0.0'.
mysql | 2023-05-28 22:13:12 0 [Note] Server socket created on IP: '::'.
mysql | 2023-05-28 22:13:12 0 [Note] mariadbd: ready for connections.
mysql | Version: '10.10.4-MariaDB-1:10.10.4+maria~ubu2204' socket: '/run/mysqld/mysqld.sock' port: 3306 mariadb.org binary distribution
$ docker ps
21cef238b374 slurm-docker-cluster:19.05.2 "/usr/local/bin/dock…" 14 minutes ago Up 14 minutes 6818/tcp c1
6e9f84fcb3e9 slurm-docker-cluster:19.05.2 "/usr/local/bin/dock…" 14 minutes ago Up 14 minutes 6818/tcp c2
0b081fd724d6 slurm-docker-cluster:19.05.2 "/usr/local/bin/dock…" 14 minutes ago Up 10 minutes 6817/tcp slurmctld
9871f52a1d76 slurm-docker-cluster:19.05.2 "/usr/local/bin/dock…" 14 minutes ago Up 10 minutes 6819/tcp slurmdbd
bd5d5c751998 mariadb:10.10 "docker-entrypoint.s…" 14 minutes ago Up 14 minutes 3306/tcp mysql
Local Machine Details:
$ docker version
Client:
Version: 20.10.21
API version: 1.41
Go version: go1.18.1
Git commit: 20.10.21-0ubuntu1~20.04.2
Built: Thu Apr 27 05:56:19 2023
OS/Arch: linux/amd64
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.24
API version: 1.41 (minimum version 1.12)
Go version: go1.20.4
Git commit: 5d6db84
Built: Wed May 24 23:31:22 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.6.20
GitCommit: 2806fc1057397dbaeefbea0e4e17bddfbd388f38
runc:
Version: 1.1.5
GitCommit:
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Docker compose:
$ docker-compose version
Docker Compose version v2.17.2
I have tried many times, but I still end up with a two-node Slurm cluster.
Hi,
Firstly, thanks for creating the Docker images for Slurm; it was much easier to set up this way. The main reason I wanted to set up Slurm is to read in some job archive data from my university's cluster for some metrics. I went through your instructions and everything ran well, and I was able to copy the archive data to the slurmdbd container. I then accessed it using docker exec -it slurmdbd bash. From there, I tried to load the archive data using the command sacctmgr archive load file=/data/slurm_archive. However, this gave me the following error:
error: slurmdbd: Error with request.
Problem loading archive file: Permission denied
I'm running as root, so I didn't expect any permission errors. I'm wondering whether I missed some setup step, in case you know anything offhand. Thanks.
Hi,
Let's say I have a Slurm cluster like this:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d1c8d0834138 slurm-docker-cluster:19.05.1 "/usr/local/bin/dock…" 2 minutes ago Up 2 minutes 6818/tcp c2
233e2046c817 slurm-docker-cluster:19.05.1 "/usr/local/bin/dock…" 2 minutes ago Up 2 minutes 6818/tcp c1
34defa70798e slurm-docker-cluster:19.05.1 "/usr/local/bin/dock…" 2 minutes ago Up 2 minutes 6817/tcp slurmctld
74c00f465f24 slurm-docker-cluster:19.05.1 "/usr/local/bin/dock…" 2 minutes ago Up 2 minutes 6819/tcp slurmdbd
a6fb4a0a4d6c mysql:5.7 "docker-entrypoint.s…" 2 minutes ago Up 2 minutes 3306/tcp, 33060/tcp mysql
And I want to run an interactive job like the example below:
$ docker exec -it slurmctld srun echo "helloworld"
helloworld
docker logs doesn't show anything regarding the previous job:
$ docker logs slurmctld | grep -r "helloworld"
Is there a way to show stdout from interactive jobs in the container logs?
Thank you
You have
SLURM_TAG=slurm-19-05-2-1 IMAGE_TAG=19.05.2 docker-compose build
But I think you need to use --build-arg to set those variables so that they are used by the Dockerfile.
After bringing up the Slurm cluster, I entered the slurmctld container and submitted a job as follows.
[root@slurmctld /]# cd /data
[root@slurmctld data]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 5-00:00:00 2 idle c[1-2]
[root@slurmctld data]# sbatch --wrap="uptime"
Submitted batch job 1
[root@slurmctld data]# ls
slurm-1.out
Searching for Slurm processes on the host machine after the job has run shows a leftover slurmstepd zombie:
$ ps aux | grep slurm
990 43820 0.0 0.1 243368 7236 ? Ssl 20:23 0:00 /usr/sbin/slurmdbd -Dvvv
990 43981 0.2 0.2 904004 11036 ? Ssl 20:23 0:02 /usr/sbin/slurmctld -i -Dvvv
root 44128 0.0 0.1 130864 5172 ? Ss 20:23 0:00 /usr/sbin/slurmd -Dvvv
root 44209 0.0 0.1 131892 5348 ? Ss 20:23 0:00 /usr/sbin/slurmd -Dvvv
990 44563 0.0 0.0 21600 2728 ? S 20:24 0:00 slurmctld: slurmscriptd
root 44575 0.0 0.1 8920 5492 pts/0 S+ 20:24 0:00 sudo docker exec -ti slurmctld bash
root 44576 0.0 0.0 8920 884 pts/2 Ss 20:24 0:00 sudo docker exec -ti slurmctld bash
root 44577 0.0 0.7 1328360 31112 pts/2 Sl+ 20:24 0:00 docker exec -ti slurmctld bash
root 44659 0.0 0.0 0 0 ? Z 20:25 0:00 [slurmstepd]
No matter what, $SLURM_NTASKS_PER_NODE is not set, so per-task assignments cannot be supported.
Thanks for the package. I have tried to start up the service, but I get this error message from slurmdbd and slurmctld:
slurmctld | ---> Starting the MUNGE Authentication service (munged) ...
slurmctld | munged: Error: Failed to check keyfile "/etc/munge/munge.key": No such file or directory
The build process didn't give me any error message. Any idea what is wrong here?
Hi!
It would be nice to have CentOS 8 and Slurm 20.11.
Hey,
Thanks for the amazing job you did dockerizing Slurm!
I would like to ask if you are aware of an issue with slurmdbd?
slurmdbd: debug2: DBD_JOB_START: START CALL ID:4 NAME:wrap INX:0
slurmdbd: debug2: as_mysql_job_start: called
slurmdbd: error: We should have gotten a new id: Table 'slurm_acct_db.linux_job_table' doesn't exist
slurmdbd: error: It looks like the storage has gone away trying to reconnect
slurmdbd: error: We should have gotten a new id: Table 'slurm_acct_db.linux_job_table' doesn't exist
Is this the reason why the sacct output is empty?
[root@slurmctld /]# sacct
sacct sacctmgr
[root@slurmctld /]# sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
I found this guide:
https://hub.docker.com/r/lvarin/slurm-docker-cluster
but it doesn't actually replicate the c1 workers.
I found this as well:
https://stackoverflow.com/questions/49518376/docker-swarm-how-to-balance-already-running-containers-in-a-swarm-cluster
It looks like I need to modify docker-compose to include something like the following. Please advise.
deploy:
  restart_policy:
    condition: any
  mode: replicated
  replicas: 5
  placement:
    constraints: [node.role == worker]
I built 21.08.6 and 21.08.4.
When I start slurmctld, slurmdbd, mysql, and slurm through docker compose, slurmdbd reports the following error:
/usr/sbin/slurmdbd: error while loading shared libraries: libslurmfull.so: cannot open shared object file: No such file or directory
But I can find the library in the Docker container.
Even if I set LD_LIBRARY_PATH in the container, it still reports this problem.
I don't know what to do.
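A quick way to see which shared libraries the dynamic loader fails to resolve for a binary is ldd. The following is only a diagnostic sketch: missing_libs is a hypothetical helper name, and it assumes ldd is available inside the container.

```shell
# missing_libs BINARY: print the shared libraries that ldd reports as "not found".
# Hypothetical helper for diagnosis only; it does not change anything.
# Exit status is 0 if at least one library is missing, non-zero otherwise.
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found'
}

# Example (run inside the slurmdbd container):
#   missing_libs /usr/sbin/slurmdbd
```

If libslurmfull.so shows up in that list even though the file exists, the directory that holds it is probably outside the loader's search path. The usual remedy is to add that directory to a file under /etc/ld.so.conf.d/ and run ldconfig; which directory that is depends on how Slurm was installed in the image.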
Is it possible to run Slurm jobs with more than one core?
docker build -t slurm-docker-cluster:19.05.1 .
docker-compose up
slurmdbd | standard_init_linux.go:228: exec user process caused: no such file or directory
slurmdbd exited with code 1
slurmctld | standard_init_linux.go:228: exec user process caused: no such file or directory
slurmctld exited with code 1
c2 | standard_init_linux.go:228: exec user process caused: no such file or directory
c1 | standard_init_linux.go:228: exec user process caused: no such file or directory
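One frequent cause of "exec user process caused: no such file or directory" is an entrypoint script saved with Windows (CRLF) line endings, which makes the kernel look for an interpreter whose name ends in a carriage return. Whether that applies to this report is an assumption; the log alone does not confirm it. A minimal sketch for checking a script before rebuilding the image, where has_crlf and strip_crlf are hypothetical helper names:

```shell
# has_crlf FILE: succeed (exit 0) if FILE contains at least one carriage return.
# Hypothetical helper, not part of the repository.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# strip_crlf FILE: remove all carriage returns from FILE.
# Writes back with cat so the file keeps its permissions (e.g. the executable bit).
strip_crlf() {
    tmp=$(mktemp) &&
    tr -d '\r' < "$1" > "$tmp" &&
    cat "$tmp" > "$1" &&
    rm -f "$tmp"
}

# Example (the path is a placeholder for the image's entrypoint script):
#   if has_crlf path/to/entrypoint-script; then strip_crlf path/to/entrypoint-script; fi
```

After converting the file, the image has to be rebuilt for the change to take effect.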
The README.md suggests submitting a test job with the command uptime. However, the procps-ng package is not installed, and uptime does not really work in Docker containers anyway, since it reports the host's uptime. Another command might be more suitable, such as hostname, which prints the host on which the job landed.
Hi there,
this project looks really useful! However, I'm a bit hesitant to use it, since it doesn't state any license? Could you add a license for it, please?
Thank you!
srun -N2 hostname is failing
We need to adjust MaxNodes in slurm.conf
It looks like the mysql:5.7 container doesn't work on ARM processors (M1 Mac).
[+] Running 0/1
⠸ mysql Pulling 1.3s
no matching manifest for linux/arm64/v8 in the manifest list entries
Replacing image: mysql:5.7 with image: mariadb:10.10 seems to correct the issue. I am able to exec into the cluster and run a simple job. I have not done much testing beyond that.
I'd hate to fork this repo just for that small change. Any chance someone would be willing to take a look?
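The image swap described above can be sketched as a single substitution in the compose file. This assumes the file contains the literal text image: mysql:5.7; swap_db_image is a hypothetical helper name, and the change has only been tried with a simple job, as noted.

```shell
# swap_db_image FILE: replace the mysql:5.7 image with mariadb:10.10 in FILE.
# Hypothetical helper. It writes through a temporary file instead of using
# `sed -i`, whose syntax differs between GNU sed and BSD (macOS) sed.
swap_db_image() {
    tmp=$(mktemp) &&
    sed 's|image: mysql:5\.7|image: mariadb:10.10|' "$1" > "$tmp" &&
    cat "$tmp" > "$1" &&
    rm -f "$tmp"
}

# Example (assumes the compose file is named docker-compose.yml):
#   swap_db_image docker-compose.yml
```

Lines that do not mention mysql:5.7 are left untouched, so the rest of the service definition stays as it was.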