slurm-docker-cluster's Issues

slurm container can't start

My host machine runs Windows. Running docker compose up fails to start the containers, with these errors:

  • slurmdbd | exec /usr/local/bin/docker-entrypoint.sh: no such file or directory
  • slurmctld | exec /usr/local/bin/docker-entrypoint.sh: no such file or directory
  • c2 | exec /usr/local/bin/docker-entrypoint.sh: no such file or directory
  • c1 | exec /usr/local/bin/docker-entrypoint.sh: no such file or directory

I don't understand why this file does not exist. As far as I can tell, the image build process completed without any issues.
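
A common cause of this error on Windows hosts is that Git checks out shell scripts with CRLF line endings, so the kernel looks for an interpreter literally named "/bin/bash\r" and reports "no such file or directory" even though the script itself is present. Whether that is what happened here is an assumption. The sketch below shows one way to check a script for carriage returns and write a cleaned copy; the helper names and the demo file are hypothetical stand-ins, not part of the repository.

```shell
#!/bin/sh
# Sketch: detect and strip CRLF line endings from a script.
# The file created below is a hypothetical stand-in for docker-entrypoint.sh.

# Succeeds (exit 0) if the file contains at least one carriage return.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Writes a copy of $1 to $2 with every carriage return removed.
strip_crlf() {
    tr -d '\r' < "$1" > "$2"
}

# Demo on a throwaway file with Windows line endings.
demo=$(mktemp)
printf '#!/bin/bash\r\necho started\r\n' > "$demo"

if has_crlf "$demo"; then
    strip_crlf "$demo" "$demo.lf"
    echo "CRLF found; wrote cleaned copy to $demo.lf"
fi
```

If the real docker-entrypoint.sh does contain carriage returns, the image would presumably need to be rebuilt after fixing it, since the path in the error suggests the script is copied into the image at build time.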

Cannot open connection to slurmdbd

Following the setup from the README, I did:

  • SLURM_TAG=slurm-19-05-2-1 IMAGE_TAG=19.05.2 docker-compose build
  • IMAGE_TAG=19.05.2 docker-compose up -d
  • ./register_cluster.sh

During registration, I get the following error:

sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to slurmdbd:6819: Connection timed out
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection timed out
sacctmgr: error: Problem talking to the database: Connection timed out

I have waited several minutes for all services to be initialized. Here is the tail of my docker-compose logs.

mysql      | 2023-05-28 22:13:12 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
mysql      | 2023-05-28 22:13:12 0 [Note] InnoDB: Number of transaction pools: 1
mysql      | 2023-05-28 22:13:12 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
mysql      | 2023-05-28 22:13:12 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
mysql      | 2023-05-28 22:13:12 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
mysql      | 2023-05-28 22:13:12 0 [Note] InnoDB: Completed initialization of buffer pool
mysql      | 2023-05-28 22:13:12 0 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
mysql      | 2023-05-28 22:13:12 0 [Note] InnoDB: 128 rollback segments are active.
mysql      | 2023-05-28 22:13:12 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
mysql      | 2023-05-28 22:13:12 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
mysql      | 2023-05-28 22:13:12 0 [Note] InnoDB: log sequence number 46702; transaction id 14
mysql      | 2023-05-28 22:13:12 0 [Note] Plugin 'FEEDBACK' is disabled.
mysql      | 2023-05-28 22:13:12 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
mysql      | 2023-05-28 22:13:12 0 [Warning] You need to use --log-bin to make --expire-logs-days or --binlog-expire-logs-seconds work.
mysql      | 2023-05-28 22:13:12 0 [Note] InnoDB: Buffer pool(s) load completed at 230528 22:13:12
mysql      | 2023-05-28 22:13:12 0 [Note] Server socket created on IP: '0.0.0.0'.
mysql      | 2023-05-28 22:13:12 0 [Note] Server socket created on IP: '::'.
mysql      | 2023-05-28 22:13:12 0 [Note] mariadbd: ready for connections.
mysql      | Version: '10.10.4-MariaDB-1:10.10.4+maria~ubu2204'  socket: '/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution
$ docker ps
21cef238b374   slurm-docker-cluster:19.05.2   "/usr/local/bin/dock…"   14 minutes ago   Up 14 minutes   6818/tcp   c1
6e9f84fcb3e9   slurm-docker-cluster:19.05.2   "/usr/local/bin/dock…"   14 minutes ago   Up 14 minutes   6818/tcp   c2
0b081fd724d6   slurm-docker-cluster:19.05.2   "/usr/local/bin/dock…"   14 minutes ago   Up 10 minutes   6817/tcp   slurmctld
9871f52a1d76   slurm-docker-cluster:19.05.2   "/usr/local/bin/dock…"   14 minutes ago   Up 10 minutes   6819/tcp   slurmdbd
bd5d5c751998   mariadb:10.10                  "docker-entrypoint.s…"   14 minutes ago   Up 14 minutes   3306/tcp   mysql

Local Machine Details:

$ docker version
Client:
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.1
 Git commit:        20.10.21-0ubuntu1~20.04.2
 Built:             Thu Apr 27 05:56:19 2023
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.24
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       5d6db84
  Built:            Wed May 24 23:31:22 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.6.20
  GitCommit:        2806fc1057397dbaeefbea0e4e17bddfbd388f38
 runc:
  Version:          1.1.5
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Docker compose:

$ docker-compose version
Docker Compose version v2.17.2

Loading archive data to slurmdb in slurmdbd container

Hi,

First, thanks for creating the Docker images for Slurm; they made setup much easier. My main reason for setting up Slurm is to read in some job archive data from my university's cluster for some metrics. I went through your instructions and everything ran well, and I was able to copy the archive data to the slurmdbd container. I then accessed it using docker exec -it slurmdbd bash. From there, I tried to load the archive data using the command sacctmgr archive load file=/data/slurm_archive. However, this gave me the following error:

error: slurmdbd: Error with request.  
Problem loading archive file: Permission denied

I'm running as root, so I didn't expect any permission errors. I'm just wondering whether there is some setup step I'm missing, in case you know anything offhand. Thanks.

interactive jobs output not showing in docker logs

Hi,

Let's say I have a Slurm cluster like this:

$ docker ps
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS                 NAMES
d1c8d0834138        slurm-docker-cluster:19.05.1   "/usr/local/bin/dock…"   2 minutes ago       Up 2 minutes        6818/tcp              c2
233e2046c817        slurm-docker-cluster:19.05.1   "/usr/local/bin/dock…"   2 minutes ago       Up 2 minutes        6818/tcp              c1
34defa70798e        slurm-docker-cluster:19.05.1   "/usr/local/bin/dock…"   2 minutes ago       Up 2 minutes        6817/tcp              slurmctld
74c00f465f24        slurm-docker-cluster:19.05.1   "/usr/local/bin/dock…"   2 minutes ago       Up 2 minutes        6819/tcp              slurmdbd
a6fb4a0a4d6c        mysql:5.7                      "docker-entrypoint.s…"   2 minutes ago       Up 2 minutes        3306/tcp, 33060/tcp   mysql

And I want to run an interactive job like the example below:

$ docker exec -it slurmctld srun echo "helloworld"
helloworld

docker logs doesn't show anything regarding the previous job:

$ docker logs slurmctld | grep -r "helloworld"

Is there a way to show stdout from interactive jobs in the container logs?

Thank you.
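
For context, docker logs only captures the stdout and stderr of a container's main process (PID 1); output from a docker exec session goes back to the terminal that started it. One possible workaround, sketched below under assumptions, is to mirror a command's output to a second target. The helper run_and_mirror is hypothetical and not part of slurm-docker-cluster, and the demo writes to a temporary file rather than to a real container log.

```shell
#!/bin/sh
# Sketch: run a command and append a copy of its output to a second target.
# run_and_mirror is a hypothetical helper, not part of slurm-docker-cluster.
run_and_mirror() {
    target="$1"
    shift
    # stdout and stderr go to the caller and are also appended to $target.
    "$@" 2>&1 | tee -a "$target"
}

# Demo with a temporary file standing in for /proc/1/fd/1.
log=$(mktemp)
run_and_mirror "$log" echo "helloworld"
```

Inside the slurmctld container, the target could be /proc/1/fd/1 (the stdout of PID 1), for example run_and_mirror /proc/1/fd/1 srun echo "helloworld". This assumes the user running the exec session is allowed to write to that file descriptor.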

slurmstepd zombie process remains after running job on slurm cluster

After bringing up the Slurm cluster, I entered the slurmctld container and submitted a job as follows.

[root@slurmctld /]# cd /data
[root@slurmctld data]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 5-00:00:00 2 idle c[1-2]
[root@slurmctld data]# sbatch --wrap="uptime"
Submitted batch job 1
[root@slurmctld data]# ls
slurm-1.out

Listing the Slurm processes on the host machine after the job has run shows a slurmstepd zombie left behind:

$ ps aux | grep slurm
990 43820 0.0 0.1 243368 7236 ? Ssl 20:23 0:00 /usr/sbin/slurmdbd -Dvvv
990 43981 0.2 0.2 904004 11036 ? Ssl 20:23 0:02 /usr/sbin/slurmctld -i -Dvvv
root 44128 0.0 0.1 130864 5172 ? Ss 20:23 0:00 /usr/sbin/slurmd -Dvvv
root 44209 0.0 0.1 131892 5348 ? Ss 20:23 0:00 /usr/sbin/slurmd -Dvvv
990 44563 0.0 0.0 21600 2728 ? S 20:24 0:00 slurmctld: slurmscriptd
root 44575 0.0 0.1 8920 5492 pts/0 S+ 20:24 0:00 sudo docker exec -ti slurmctld bash
root 44576 0.0 0.0 8920 884 pts/2 Ss 20:24 0:00 sudo docker exec -ti slurmctld bash
root 44577 0.0 0.7 1328360 31112 pts/2 Sl+ 20:24 0:00 docker exec -ti slurmctld bash
root 44659 0.0 0.0 0 0 ? Z 20:25 0:00 [slurmstepd]

slurmctld exited with code 1

Thanks for the package. I tried to start the services but got the following error messages from slurmdbd and slurmctld:

slurmctld    | ---> Starting the MUNGE Authentication service (munged) ...
slurmctld    | munged: Error: Failed to check keyfile "/etc/munge/munge.key": No such file or directory

The build process didn't give me any error message. Any idea what is wrong here?

slurmdb: slurm_acct_db.linux_job_table' doesn't exist

Hey,

Thanks for the amazing job you did dockerizing Slurm!

I would like to ask if you are aware of an issue with slurmdbd?

slurmdbd: debug2: DBD_JOB_START: START CALL ID:4 NAME:wrap INX:0
slurmdbd: debug2: as_mysql_job_start: called
slurmdbd: error: We should have gotten a new id: Table 'slurm_acct_db.linux_job_table' doesn't exist
slurmdbd: error: It looks like the storage has gone away trying to reconnect
slurmdbd: error: We should have gotten a new id: Table 'slurm_acct_db.linux_job_table' doesn't exist

Is this the reason why the sacct output is empty?

[root@slurmctld /]# sacct
sacct     sacctmgr  
[root@slurmctld /]# sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 

/usr/sbin/slurmdbd: error while loading shared libraries: libslurmfull.so

I built 21.08.6 and 21.08.4.
When I start slurmctld, slurmdbd, mysql, and slurm through docker compose, slurmdbd reports the following error:
/usr/sbin/slurmdbd: error while loading shared libraries: libslurmfull.so: cannot open shared object file: No such file or directory

However, I can find the library inside the Docker container.
Even if I set LD_LIBRARY_PATH in the container, the problem is still reported.
I don't know what to do.

standard_init_linux.go:228: exec user process caused: no such file or directory on Windows

Context

  • Windows 10
  • Docker version 20.10.12, build e91ed57
  • Docker Compose version v2.2.3
  • Commit 3e6edfd

Steps to reproduce

  • Clone the project
  • Build the image: docker build -t slurm-docker-cluster:19.05.1 .
  • Run docker-compose up
  • See the following error:
slurmdbd   | standard_init_linux.go:228: exec user process caused: no such file or directory
slurmdbd exited with code 1
slurmctld  | standard_init_linux.go:228: exec user process caused: no such file or directory
slurmctld exited with code 1
c2         | standard_init_linux.go:228: exec user process caused: no such file or directory
c1         | standard_init_linux.go:228: exec user process caused: no such file or directory

uptime command not found

The README.md suggests submitting a test job that runs uptime. However, the procps-ng package is not installed, and uptime is not very meaningful inside a Docker container anyway, since it reports the host's uptime. Another command might be more suitable, such as hostname, which prints the node on which the job landed.

License?

Hi there,

This project looks really useful! However, I'm a bit hesitant to use it, since it doesn't state any license. Could you add a license for it, please?

Thank you!

mysql upgrade to mariadb?

It looks like the mysql:5.7 image doesn't work on ARM processors (M1 Mac).

[+] Running 0/1
 ⠸ mysql Pulling                                                                               1.3s
no matching manifest for linux/arm64/v8 in the manifest list entries

Replacing image: mysql:5.7 with image: mariadb:10.10 seems to correct the issue. I am able to exec into the cluster and run a simple job. I have not done much testing beyond that.

I'd hate to fork this repo just for that small change. Any chance someone would be willing to take a look?
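
The image swap described above can be tried locally without forking. The sketch below prints a copy of a Compose file with the image line replaced; swap_db_image is a hypothetical helper, and the demo fragment is a made-up minimal file rather than the repository's actual docker-compose.yml.

```shell
#!/bin/sh
# Sketch: print a Compose file with the mysql:5.7 image replaced by mariadb:10.10.
# swap_db_image is a hypothetical helper; the original file is left untouched.
swap_db_image() {
    sed 's|image: mysql:5\.7|image: mariadb:10.10|' "$1"
}

# Demo on a throwaway fragment; a real docker-compose.yml has more keys.
compose=$(mktemp)
printf 'services:\n  mysql:\n    image: mysql:5.7\n' > "$compose"
swap_db_image "$compose"
```

Redirecting the output to a separate file and passing it to docker-compose with -f keeps the checked-out repository unchanged. Whether mariadb:10.10 behaves identically to mysql:5.7 for slurmdbd beyond a simple job is not established here.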
