hpc-toolset-tutorial's Introduction

HPC Toolset Tutorial

Tutorial for installing and configuring ColdFront, Open OnDemand, and Open XDMoD: an HPC center management toolset.

Presented by:

Ohio Supercomputer Center (OSC): https://osc.edu
University at Buffalo Center for Computational Research (CCR)

This tutorial aims to demonstrate how three open source applications work in concert to provide a toolset for high performance computing (HPC) centers.

ColdFront is an allocations management portal that provides users an easy way to request access to allocations for a center's resources. HPC systems staff configure the data center's resources with attributes that tie ColdFront's plug-ins to systems such as job schedulers, authentication/account management systems, system monitoring, and Open XDMoD.

Once a user's allocation is activated in ColdFront, they are able to access the resource using Open OnDemand, a web-based portal for accessing HPC services that removes the complexities of HPC system environments from the end user. Through Open OnDemand, users can upload and download files; create, edit, submit, and monitor jobs; create and share apps; run GUI applications; and connect to a terminal, all via a web browser, with no client software to install and configure.

The Open XDMoD portal provides a rich set of features tailored to the role of the user. Sample metrics provided by Open XDMoD include number of jobs, CPUs consumed, wait time, and wall time, with the minimum, maximum, and average of these metrics. Performance and quality-of-service metrics of the HPC infrastructure are also provided, along with application-specific performance metrics (flop/s, I/O rates, network metrics, etc.) for all user applications running on a given resource.

Tutorial Steps

Requirements
Getting Started
Accessing the Applications
ColdFront
Open OnDemand
Open XDMoD

Acknowledgments

Workshops

This tutorial has been presented at the following conferences:

PEARC23
ISC23
PEARC22
PEARC21
PEARC20
Gateways 2020

This overview of the HPC Toolset Tutorial is provided as context for those who find this repo and want to go through the hands-on tutorial without attending the full-day workshop at a conference.

Disclaimer

DO NOT run this project on production systems. This project is for educational purposes only. The container images we publish for the tutorial are configured with hard-coded, insecure passwords and should be run locally, in development, for testing and learning only.
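For reference, getting the tutorial running locally takes only a few commands (a minimal sketch, using the clone URL and hpcts script referenced in the issues below; assumes Docker and Docker Compose are installed):

# Clone the repository and launch the full container stack locally.
git clone https://github.com/ubccr/hpc-toolset-tutorial.git
cd hpc-toolset-tutorial
./hpcts start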

License

This tutorial is released under the GPLv3 license. See the LICENSE file.

hpc-toolset-tutorial's People

Contributors

aebruno, aestoltm, blankenberg, dsajdak, ericfranz, gbyrket, gerald-byrket, johrstrom, jpwhite4, jtpalmer, oglopf, plessbd, pwablito, rg663, ryanrath, tomgreen66, treydock, widyono-cets


hpc-toolset-tutorial's Issues

slurmctld and slurmdbd require restart after ingesting ColdFront data

I was just following the tutorial and found that the user cgray cannot submit a job:

[cgray@frontend ~]$ sbatch --wrap "sleep 60"
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

After restarting the slurmctld and slurmdbd Docker containers, it works:

[cgray@frontend ~]$ sbatch --wrap "sleep 60"
Submitted batch job 2

It seems slurmctld doesn't recognise the user until after the restart. I thought it might be something sssd-related, but a restart seems to do the trick. Is this a known issue?

The slurmctld log shows this for the failure:

[2021-11-04T13:34:21.562] error: User 1002 not found
[2021-11-04T13:34:21.562] _job_create: invalid account or partition for user 1002, account '(null)', and partition 'compute'
[2021-11-04T13:34:21.562] _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified
[2021-11-04T13:40:08.213] error: User 1002 not found
[2021-11-04T13:40:08.213] _job_create: invalid account or partition for user 1002, account '(null)', and partition 'compute'
[2021-11-04T13:40:08.213] _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified

For the successful run:

[2021-11-04T13:40:16.443] _slurm_rpc_submit_batch_job: JobId=2 InitPrio=4294901759 usec=606
[2021-11-04T13:40:17.046] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2021-11-04T13:40:20.048] sched: Allocate JobId=2 NodeList=cpn01 #CPUs=1 Partition=compute
[2021-11-04T13:41:20.242] _job_complete: JobId=2 WEXITSTATUS 0
[2021-11-04T13:41:20.242] _job_complete: JobId=2 done
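For anyone hitting the same problem, the workaround described above can be scripted (a sketch; the container names match those shown in the other issues in this repo):

# Restart the Slurm controller and accounting daemon so they pick up
# the newly ingested ColdFront users, then retry the submission.
docker restart slurmctld slurmdbd
docker exec -it frontend su - cgray -c 'sbatch --wrap "sleep 60"'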

Is it possible to deploy the HPCTS into an EC2 instance?

I have tried several times to deploy this repository into an AWS EC2 instance. I just changed the published IPs in the docker-compose.yml file to 0.0.0.0, as follows:

// --------
frontend:
    image: hpcts:slurm-${HPCTS_VERSION}
    command: ["frontend"]
    hostname: frontend
    container_name: frontend
    networks:
      - compute
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - home:/home
    ports:
      - "0.0.0.0:6222:22"
    depends_on:
      - ldap
      - slurmctld

  coldfront:
    image: hpcts:coldfront-${HPCTS_VERSION}
    build:
      context: ./coldfront
      args:
        HPCTS_VERSION: $HPCTS_VERSION
    command: ["serve"]
    hostname: coldfront
    container_name: coldfront
    networks:
      - compute
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - home:/home
      - srv_www:/srv/www
    expose:
      - "22"
    ports:
      - "0.0.0.0:2443:443"
    depends_on:
      - ldap
      - mysql
      - frontend

  ondemand:
    image: hpcts:ondemand-${HPCTS_VERSION}
    build:
      context: ./ondemand
      args:
        HPCTS_VERSION: $HPCTS_VERSION
    command: ["serve"]
    hostname: ondemand
    container_name: ondemand
    networks:
      - compute
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - home:/home
    expose:
      - "22"
    ports:
      - "0.0.0.0:3443:3443"
      - "0.0.0.0:5554:5554"
    depends_on:
      - ldap
      - frontend

  xdmod:
    image: hpcts:xdmod-${HPCTS_VERSION}
    build:
      context: ./xdmod
      args:
        HPCTS_VERSION: $HPCTS_VERSION
    command: ["serve"]
    hostname: xdmod
    container_name: xdmod
    networks:
      - compute
    volumes:
      - etc_xdmod:/etc/xdmod
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - home:/home
    expose:
      - "22"
    ports:
      - "0.0.0.0:4443:443"
    depends_on:
      - mongodb
      - ldap
      - mysql
      - frontend
 // --------

After executing ./hpcts start, all services were built successfully.

sajid@ubuntu-s-4vcpu-8gb-blr1-01:~/hpc-toolset-tutorial$ docker image list
REPOSITORY   TAG                 IMAGE ID       CREATED        SIZE
hpcts        xdmod-2022.07       8afba6346001   7 hours ago    5.2GB
hpcts        ondemand-2022.07    361496f24072   8 hours ago    4.63GB
hpcts        coldfront-2022.07   9c59638239c4   8 hours ago    4.13GB
hpcts        slurm-2022.07       982417903ebd   23 hours ago   3.83GB
hpcts        base-2022.07        e8b968f77f70   24 hours ago   393MB
hpcts        ldap-2022.07        c21746a48540   24 hours ago   257MB
mongo        5.0                 d98599fdfd65   6 days ago     696MB
mariadb      10.3                cd091d34afbb   6 days ago     387MB
sajid@ubuntu-s-4vcpu-8gb-blr1-01:~/hpc-toolset-tutorial$ docker container list
CONTAINER ID   IMAGE                     COMMAND                  CREATED       STATUS       PORTS                                                    NAMES
ef0cbf65e5ad   mongo:5.0                 "docker-entrypoint.s…"   7 hours ago   Up 7 hours   27017/tcp                                                mongodb
52b13d02c780   hpcts:xdmod-2022.07       "/usr/local/bin/entr…"   7 hours ago   Up 7 hours   22/tcp, 0.0.0.0:4443->443/tcp                            xdmod
29153626bb00   hpcts:ondemand-2022.07    "/usr/local/bin/entr…"   7 hours ago   Up 7 hours   0.0.0.0:3443->3443/tcp, 22/tcp, 0.0.0.0:5554->5554/tcp   ondemand
543296890edb   hpcts:coldfront-2022.07   "/usr/local/bin/entr…"   7 hours ago   Up 7 hours   22/tcp, 0.0.0.0:2443->443/tcp                            coldfront
b356187459e9   hpcts:slurm-2022.07       "/usr/local/bin/entr…"   7 hours ago   Up 7 hours   22/tcp, 6818/tcp                                         cpn01
892a379102e0   hpcts:slurm-2022.07       "/usr/local/bin/entr…"   7 hours ago   Up 7 hours   0.0.0.0:6222->22/tcp                                     frontend
740fd7111023   hpcts:slurm-2022.07       "/usr/local/bin/entr…"   7 hours ago   Up 7 hours   22/tcp, 6818/tcp                                         cpn02
8b534413f47f   hpcts:slurm-2022.07       "/usr/local/bin/entr…"   7 hours ago   Up 7 hours   22/tcp, 6817/tcp                                         slurmctld
2e431a7eaea5   hpcts:slurm-2022.07       "/usr/local/bin/entr…"   7 hours ago   Up 7 hours   22/tcp, 6819/tcp                                         slurmdbd
aebd4868473e   mariadb:10.3              "docker-entrypoint.s…"   7 hours ago   Up 7 hours   3306/tcp                                                 mysql
9c13c02dd691   hpcts:ldap-2022.07        "/container/tool/run"    7 hours ago   Up 7 hours   389/tcp, 636/tcp                                         ldap

However, I can access all applications except OnDemand.

Here is the OnDemand log (docker-compose logs -f ondemand):

ondemand     | ---> Populating /etc/ssh/ssh_known_hosts from frontend for ondemand...
ondemand     | # frontend:22 SSH-2.0-OpenSSH_8.0
ondemand     | # frontend:22 SSH-2.0-OpenSSH_8.0
ondemand     | # frontend:22 SSH-2.0-OpenSSH_8.0
ondemand     | ---> Starting SSSD on ondemand ...
ondemand     | ---> Cleaning NGINX ...
ondemand     | (2022-08-08 21:58:44): [sssd] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
ondemand     | (2022-08-08 21:58:44): [be[implicit_files]] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
ondemand     | (2022-08-08 21:58:44): [be[default]] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
ondemand     | (2022-08-08 21:58:44): [nss] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
ondemand     | (2022-08-08 21:58:44): [pam] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
ondemand     | ---> Starting the MUNGE Authentication service (munged) on ondemand ...
ondemand     | ---> Starting sshd on ondemand...
ondemand     | ---> Running update ood portal...
ondemand     | cp -p /etc/pki/tls/certs/localhost.crt /etc/ood/dex/localhost.crt
ondemand     | chown ondemand-dex:ondemand-dex /etc/ood/dex/localhost.crt
ondemand     | cp -p /etc/pki/tls/private/localhost.key /etc/ood/dex/localhost.key
ondemand     | chown ondemand-dex:ondemand-dex /etc/ood/dex/localhost.key
ondemand     | No change in Apache config.
ondemand     | No change in the Dex config.
ondemand     | Completed successfully!
ondemand     | ---> Starting ondemand-dex...
ondemand     | ---> Starting ondemand httpd24...
ondemand     | time="2022-08-08T21:58:45Z" level=info msg="config issuer: https://hpcts.t99ltd.info:5554"
ondemand     | time="2022-08-08T21:58:45Z" level=info msg="config storage: sqlite3"
ondemand     | time="2022-08-08T21:58:45Z" level=info msg="config static client: OnDemand"
ondemand     | time="2022-08-08T21:58:45Z" level=info msg="config connector: ldap"
ondemand     | time="2022-08-08T21:58:45Z" level=info msg="config skipping approval screen"
ondemand     | time="2022-08-08T21:58:45Z" level=info msg="listening (http/telemetry) on 0.0.0.0:5558"
ondemand     | time="2022-08-08T21:58:45Z" level=info msg="listening (http) on 0.0.0.0:5556"
ondemand     | time="2022-08-08T21:58:45Z" level=info msg="listening (https) on 0.0.0.0:5554"
ondemand     | AH00558: httpd: Could not reliably determine the server's fully qualified domain name, using 172.30.0.9. Set the 'ServerName' directive globally to suppress this message
ondemand     | time="2022-08-09T03:20:15Z" level=info msg="keys expired, rotating"
ondemand     | time="2022-08-09T03:20:15Z" level=info msg="keys rotated, next rotation: 2022-08-09 09:20:15.515348558 +0000 UTC"
ondemand     | 2022-08-09 04:43:24.556688 I | http: TLS handshake error from 59.153.103.253:46942: remote error: tls: unknown certificate

I have modified ondemand/install.sh as follows:

#!/bin/bash
set -e

trap 'ret=$?; test $ret -ne 0 && printf "failed\n\n" >&2; exit $ret' EXIT

log_info() {
  printf "\n\e[0;35m $1\e[0m\n\n"
}

log_info "Setting up Ondemand"
mkdir -p /etc/ood/config/clusters.d
mkdir -p /etc/ood/config/apps/shell
mkdir -p /etc/ood/config/apps/bc_desktop
mkdir -p /etc/ood/config/apps/dashboard
mkdir -p /etc/ood/config/apps/myjobs/templates
echo "DEFAULT_SSHHOST=frontend" > /etc/ood/config/apps/shell/env
echo "OOD_DEFAULT_SSHHOST=frontend" >> /etc/ood/config/apps/shell/env
echo "OOD_SSHHOST_ALLOWLIST=ondemand:cpn01:cpn02" >> /etc/ood/config/apps/shell/env
echo "OOD_DEV_SSH_HOST=ondemand" >> /etc/ood/config/apps/dashboard/env
echo "MOTD_PATH=/etc/motd" >> /etc/ood/config/apps/dashboard/env
echo "MOTD_FORMAT=markdown" >> /etc/ood/config/apps/dashboard/env
echo "OOD_BC_DYNAMIC_JS=1" >> /etc/ood/config/apps/dashboard/env

log_info "Configuring Ondemand ood_portal.yml .."

tee /etc/ood/config/ood_portal.yml <<EOF
---
#
# Portal configuration
#
listen_addr_port:
  - '3443'
servername: hpcts.t99ltd.info
port: 3443
ssl:
   - 'SSLCertificateFile "/etc/pki/tls/certs/localhost.crt"'
   - 'SSLCertificateKeyFile "/etc/pki/tls/private/localhost.key"'
node_uri: "/node"
rnode_uri: "/rnode"
oidc_scope: "openid profile email groups"
dex:
  client_redirect_uris:
    - "https://hpcts.t99ltd.com:4443/simplesaml/module.php/authoidcoauth2/linkback.php"
    - "https://hpcts.t99ltd.com:2443/oidc/callback/"
  client_secret: 334389048b872a533002b34d73f8c29fd09efc50
  client_id: hpcts.t99ltd.com
  connectors:
    - type: ldap
      id: ldap
      name: LDAP
      config:
        host: ldap:636
        insecureSkipVerify: true
        bindDN: cn=admin,dc=example,dc=org
        bindPW: admin
        userSearch:
          baseDN: ou=People,dc=example,dc=org
          filter: "(objectClass=posixAccount)"
          username: uid
          idAttr: uid
          emailAttr: mail
          nameAttr: gecos
          preferredUsernameAttr: uid
        groupSearch:
          baseDN: ou=Groups,dc=example,dc=org
          filter: "(objectClass=posixGroup)"
          userMatchers:
            - userAttr: DN
              groupAttr: member
          nameAttr: cn
  # This is the default, but illustrating how to change
  frontend:
    theme: ondemand
EOF

log_info "Generating new httpd24 and dex configs.."
/opt/ood/ood-portal-generator/sbin/update_ood_portal

log_info "Adding new theme to dex"
sed -i "s/theme: ondemand/theme: hpc-coop/g" /etc/ood/dex/config.yaml

dnf clean all
rm -rf /var/cache/dnf

log_info "Cloning repos to assist with app development.."
mkdir -p /var/git
git clone https://github.com/OSC/bc_example_jupyter.git --bare /var/git/bc_example_jupyter
git clone https://github.com/OSC/ood-example-ps.git --bare /var/git/ood-example-ps

log_info "Enabling app development for hpcadmin..."
mkdir -p /var/www/ood/apps/dev/hpcadmin
ln -s /home/hpcadmin/ondemand/dev /var/www/ood/apps/dev/hpcadmin/gateway
echo 'if [[ ${HOSTNAME} == ondemand ]]; then source scl_source enable ondemand; fi' >> /home/hpcadmin/.bash_profile

When I tried to access OnDemand using this URL: https://hpcts.t99ltd.com:3443, the following error was raised:

[screenshot: https://ibb.co/hZ5qGfK]

Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator at root@localhost to inform them of the time this error occurred, and the actions you performed just before this error.

More information about this error may be available in the server error log

My questions:

  1. Is it possible to override the default localhost configuration?
  2. If it is possible, what am I doing wrong?
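One hedged answer to question 1: the servername and dex issuer are baked into the generated Apache/dex configs, so after changing ood_portal.yml the portal config has to be regenerated (update_ood_portal is the generator script already invoked by install.sh above). A sketch:

# Regenerate the OnDemand portal and dex configs after editing
# /etc/ood/config/ood_portal.yml inside the running container, then
# restart the container so Apache and dex reload them.
docker exec ondemand /opt/ood/ood-portal-generator/sbin/update_ood_portal
docker restart ondemand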

Add Globus Integration

We need to add a section around Globus to the tutorial walkthrough to demonstrate the config and how it works.

No SAML response provided

When logging in to XDMoD at https://ip:4443:

No SAML response provided
You accessed the Assertion Consumer Service interface, but did not provide a SAML Authentication Response. Please note that this endpoint is not intended to be accessed directly.

If you report this error, please also report this tracking number which makes it possible to locate your session in the logs available to the system administrator:

Current containers won't start

Running the tutorial as per instructions results in the following:

xdmod        | You are currently using Open XDMoD 10.0.0, but a newer version
xdmod        | (10.0.2) is available.
xdmod        | 
xdmod        | Do you want to continue (yes, no)? [no] 1
xdmod        | 
xdmod        | '1' is not a valid option.
xdmod        | 
xdmod        | Do you want to continue (yes, no)? [no]
xdmod        | Failed to get prompt
xdmod        | ---> Open XDMoD Setup: hpc resource

I updated the Dockerfile for xdmod and did a docker-compose build, but that failed with this:

#0 72.09 Last metadata expiration check: 0:00:08 ago on Mon Apr 17 00:26:37 2023.
#0 75.57 xdmod-10.0.2-1.0.el8.noarch.rpm                 8.3 MB/s |  27 MB     00:03    
#0 77.53 xdmod-ondemand-10.0.0-1.0.beta1.el8.noarch.rpm   13 kB/s |  26 kB     00:01    
#0 78.47 xdmod-supremm-10.0.0-1.4.beta4.el8.noarch.rpm   328 kB/s | 308 kB     00:00    
#0 80.14 supremm-2.0.0-1.0_beta3.el8.x86_64.rpm          147 kB/s | 246 kB     00:01    
#0 80.20 Error: 
#0 80.20  Problem 1: package xdmod-10.0.2-1.0.el8.noarch requires nodejs(engine) >= 16.18.1, but none of the providers can be installed
#0 80.20   - conflicting requests
#0 80.20   - package nodejs-1:16.18.1-3.module+el8.7.0+1108+49363b0d.x86_64 is filtered out by modular filtering
#0 80.20   - package nodejs-1:16.19.1-1.module+el8.7.0+1178+d52dba78.x86_64 is filtered out by modular filtering
#0 80.20   - package nodejs-1:18.12.1-2.module+el8.7.0+1104+549f92a6.x86_64 is filtered out by modular filtering
#0 80.20   - package nodejs-1:18.14.2-2.module+el8.7.0+1177+510ae886.x86_64 is filtered out by modular filtering
[+] Building 89.0s (7/20)                                                                                                                              
 => [internal] load build definition from Dockerfile                                                                                              0.6s
 => => transferring dockerfile: 1.87kB                                                                                                            0.0s
 => [internal] load .dockerignore                                                                                                                 0.8s
 => => transferring context: 2B                                                                                                                   0.0s
 => [internal] load metadata for docker.io/ubccr/hpcts:slurm-2022.07                                                                              0.0s
 => [internal] load build context                                                                                                                 0.9s 
 => => transferring context: 10.15MB                                                                                                              0.1s
 => [stage-amd64 1/4] FROM docker.io/ubccr/hpcts:slurm-2022.07                                                                                    1.3s
 => [stage-amd64 2/4] RUN dnf install -y https://yum.osc.edu/ondemand/2.0/ondemand-release-web-2.0-1.noarch.rpm                                  71.6s
 => CANCELED [stage-amd64 3/4] RUN dnf install -y netcat ondemand ondemand-dex                                                                   15.1s
failed to solve: process "/bin/sh -c /build/install.sh && rm -rf /build" did not complete successfully: exit code: 1     

I'm still working on this, but it's something to be aware of before you run your tutorial.
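If the failure is the nodejs modular-filtering error shown above, one possible workaround (an assumption, not a confirmed fix from the maintainers) is to enable a matching nodejs module stream before installing the RPMs, e.g. in the xdmod Dockerfile:

# Hypothetical workaround: enable a nodejs stream that satisfies
# xdmod's nodejs(engine) >= 16.18.1 requirement, then retry the install.
dnf -y module reset nodejs
dnf -y module enable nodejs:18
dnf -y install xdmod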

Fix xdmod-supremm el8 x86_64 rpms

xdmod-supremm-10.0 rpm for el8 currently failing with:

#0 25.96 Error: 
#0 25.96   - nothing provides php-pecl-mongo needed by xdmod-supremm-10.0.0-1.0.beta1.el8.noarch

pearc21 xdmod ood integration

The instructions for integrating XDMoD and OOD are on the OOD message of the day (and in the message of the day when you ssh into OOD).

It's fairly straightforward, and it seems to imply everything's already set up on the XDMoD side and the only modifications we need are to OOD itself.

In any case, when I try to run this command I get these errors. Same with the API: it returns errors saying the MySQL DB isn't up.

[hpcadmin@xdmod ~]$ sudo -u xdmod /srv/xdmod/scripts/shred-ingest-aggregate-all.sh
2021-07-13 16:20:31 [notice] xdmod-slurm-helper start (process_start_time: 2021-07-13 16:20:31)
2021-07-13 16:20:31 [critical] Failed to create database connection: SQLSTATE[HY000] [2002] Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) (stacktrace: #0 /usr/share/xdmod/classes/CCR/DB/PDODB.php(88): PDO->__construct('mysql:host=loca...', 'xdmod', '')
#1 /usr/share/xdmod/classes/CCR/DB.php(111): CCR\DB\PDODB->connect()
#2 /usr/bin/xdmod-slurm-helper(137): CCR\DB::factory('shredder')
#3 /usr/bin/xdmod-slurm-helper(21): main()
#4 {main})
2021-07-13 16:20:32 [critical] SQLSTATE[HY000] [2002] Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) (stacktrace: #0 /usr/share/xdmod/classes/CCR/DB/PDODB.php(88): PDO->__construct('mysql:host=loca...', 'xdmod', '')
#1 /usr/share/xdmod/classes/CCR/DB.php(111): CCR\DB\PDODB->connect()
#2 /usr/bin/xdmod-ingestor(207): CCR\DB::factory('hpcdb')
#3 /usr/bin/xdmod-ingestor(21): main()
#4 {main})
2021-07-13T16:20:32.208 [INFO] archive indexer starting
2021-07-13T16:20:32.224 [ERROR] [Errno 2] No such file or directory: '/data/pcp-logs/my_cluster_name'
Traceback (most recent call last):
  File "/bin/indexarchives.py", line 11, in <module>
    load_entry_point('supremm==1.4.0', 'console_scripts', 'indexarchives.py')()
  File "/usr/lib64/python2.7/site-packages/supremm/indexarchives.py", line 473, in runindexing
    logging.debug("processed archive %s (fileio %s, dbacins %s)", archivefile, parse_end - start_time, db_end - parse_end)
  File "/usr/lib64/python2.7/site-packages/supremm/indexarchives.py", line 368, in __exit__
    dbac = XDMoDArchiveCache(self.config)
  File "/usr/lib64/python2.7/site-packages/supremm/xdmodaccount.py", line 316, in __init__
    self.con = getdbconnection(self.dbconfig)
  File "/usr/lib64/python2.7/site-packages/supremm/scripthelpers.py", line 53, in getdbconnection
    return MySQLdb.connect(**dbargs)
  File "/usr/lib64/python2.7/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 193, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (2002, "Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)")
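The truncated 'mysql:host=loca...' in the stack trace suggests the shredder is configured to connect to localhost (and hence the local socket) rather than the tutorial's mysql container. A hedged way to check, assuming the standard XDMoD config location:

# Inspect which database host XDMoD is configured to use; in the
# tutorial setup it should point at the mysql container, not localhost.
docker exec xdmod grep -A4 '^\[database\]' /etc/xdmod/portal_settings.ini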

XDMoD PEARC22 issues

Just creating an issue to keep track of getting the XDMoD containers ready for PEARC22 tutorial. Here's a start to the outstanding issues:

localhost redirect ondemand

Hi,

I am trying to set up the tutorial containers so that they serve the applications on the hostname of the server and not localhost. This is so that I and others at my site can remotely connect to the tutorial and test things out. It saves us time, since each of us doesn't have to install and configure the tutorial containers on our workstations.

I was able to get coldfront and xdmod to work by changing the ports entries in docker-compose.yml from localhost to the IP address of the server.

It appears there is a redirect in the install.sh file for the ondemand container that shows:

dex:
  client_redirect_uris:
    - "https://localhost:4443/simplesaml/module.php/authoidcoauth2/linkback.php"
    - "https://localhost:2443/oidc/callback/"

which I think is the issue? I tried changing that to the hostname of the server, ran cleanup, and then started via hpcts start, but it's still redirecting to localhost:3443 instead of servername:3443. I ran cleanup and hpcts start again, but it's still redirecting to localhost.

Do I need to change the hpcts script to --build instead of --no-build to get the containers to pull in those local changes?

Thanks!
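To answer the closing question: yes, changes to ondemand/install.sh only take effect if the image is rebuilt, because that script runs at image build time. A sketch:

# Rebuild the ondemand image after editing ondemand/install.sh;
# otherwise the prebuilt image (pulled with --no-build) is reused.
docker-compose build ondemand
docker-compose up -d --force-recreate ondemand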

fatal: The AccountingStorageLoc option has been removed

There's some issue in our configuration with slurm 20.11.7

[hpcadmin@ondemand ~]$ sbatch --wrap 'hello world'
sbatch: fatal: The AccountingStorageLoc option has been removed. It is safe to remove from your configuration.

The frontend seems to be in some infinite loop trying to start up correctly.

Attaching to frontend
frontend     | ---> Starting SSSD ...
frontend     | ---> Starting the MUNGE Authentication service (munged) ...
frontend     | scontrol: fatal: The AccountingStorageLoc option has been removed. It is safe to remove from your configuration.
frontend     | -- Waiting for slurmctld to become active ...
frontend     | scontrol: fatal: The AccountingStorageLoc option has been removed. It is safe to remove from your configuration.
frontend     | -- Waiting for slurmctld to become active ...
frontend     | scontrol: fatal: The AccountingStorageLoc option has been removed. It is safe to remove from your configuration.
frontend     | -- Waiting for slurmctld to become active ...
frontend     | scontrol: fatal: The AccountingStorageLoc option has been removed. It is safe to remove from your configuration.
frontend     | -- Waiting for slurmctld to become active ...
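Until the Slurm configs in the repo catch up with 20.11, the obsolete option can be deleted by hand; the error text itself says it is safe to remove. A sketch (the configs live on the shared etc_slurm volume, so one edit covers all containers):

# Drop the removed AccountingStorageLoc option from the Slurm configs,
# then restart the affected containers.
docker exec slurmctld sed -i '/AccountingStorageLoc/d' /etc/slurm/slurm.conf /etc/slurm/slurmdbd.conf
docker restart slurmctld slurmdbd frontend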

coldfront won't start on Linux

The coldfront service isn't staying alive for me. I don't know if this is just my issue or what. I just rebuilt all these containers from the current master 3792d31.

Here's my docker info with Linux Fedora 30 5.6.7-100.fc30.x86_64.

[jeff 03:24:47 hpc-toolset-tutorial(master)] 🐺 docker-compose -v
docker-compose version 1.26.2, build eefe0d31
[jeff 03:25:49 hpc-toolset-tutorial(master)] 🐷 docker -v
Docker version 19.03.8, build afacb8b7f0

It looks like it's trying to start, reaching out to the ldap service looking for the nginx user, and can't find it. I added restart: "always" just to see if it could ever come up (it doesn't).

coldfront exited with code 1
coldfront    | ---> Starting nginx on coldfront...
coldfront    | ---> Starting coldfront in gunicorn...
ldap         | 5f0e0670 conn=1007 fd=13 ACCEPT from IP=172.18.0.11:35298 (IP=0.0.0.0:636)
ldap         | 5f0e0670 conn=1007 fd=13 TLS established tls_ssf=256 ssf=256
ldap         | 5f0e0670 conn=1007 op=0 SRCH base="" scope=0 deref=0 filter="(objectClass=*)"
ldap         | 5f0e0670 conn=1007 op=0 SRCH attr=* altServer namingContexts supportedControl supportedExtension supportedFeatures supportedLDAPVersion supportedSASLMechanisms domainControllerFunctionality defaultNamingContext lastUSN highestCommittedUSN
ldap         | 5f0e0670 conn=1007 op=0 SEARCH RESULT tag=101 err=0 nentries=1 text=
ldap         | 5f0e0670 conn=1007 op=1 BIND dn="cn=admin,dc=example,dc=org" method=128
ldap         | 5f0e0670 slap_global_control: unrecognized control: 1.3.6.1.4.1.42.2.27.8.5.1
ldap         | 5f0e0670 conn=1007 op=1 BIND dn="cn=admin,dc=example,dc=org" mech=SIMPLE ssf=0
ldap         | 5f0e0670 conn=1007 op=1 RESULT tag=97 err=0 text=
ldap         | 5f0e0670 conn=1007 op=2 SRCH base="dc=example,dc=org" scope=2 deref=0 filter="(&(uid=nginx)(objectClass=posixAccount)(&(uidNumber=*)(!(uidNumber=0))))"
ldap         | 5f0e0670 conn=1007 op=2 SRCH attr=objectClass uid userPassword uidNumber gidNumber gecos homeDirectory loginShell krbPrincipalName cn memberOf modifyTimestamp modifyTimestamp shadowLastChange shadowMin shadowMax shadowWarning shadowInactive shadowExpire shadowFlag krbLastPwdChange krbPasswordExpiration pwdAttribute authorizedService accountExpires userAccountControl nsAccountLock host rhost loginDisabled loginExpirationTime loginAllowedTimeMap sshPublicKey userCertificate;binary mail
ldap         | 5f0e0670 conn=1007 op=2 SEARCH RESULT tag=101 err=0 nentries=0 text=
coldfront    | (Tue Jul 14 19:24:32 2020) [sssd[be[default]]] [sysdb_get_real_name] (0x0040): Cannot find user [nginx@default] in cache
coldfront    | (Tue Jul 14 19:24:32 2020) [sssd[be[default]]] [sysdb_get_real_name] (0x0040): Cannot find user [nginx@default] in cache
coldfront    | [2020-07-14 19:24:33 +0000] [44] [INFO] Starting gunicorn 20.0.4
coldfront    | [2020-07-14 19:24:33 +0000] [44] [ERROR] Retrying in 1 second.
coldfront    | [2020-07-14 19:24:34 +0000] [44] [ERROR] Retrying in 1 second.
coldfront    | [2020-07-14 19:24:35 +0000] [44] [ERROR] Retrying in 1 second.
coldfront    | [2020-07-14 19:24:36 +0000] [44] [ERROR] Retrying in 1 second.
coldfront    | [2020-07-14 19:24:37 +0000] [44] [ERROR] Retrying in 1 second.
coldfront    | [2020-07-14 19:24:38 +0000] [44] [ERROR] Can't connect to /srv/www/coldfront/coldfront.sock
ldap         | 5f0e0676 conn=1007 fd=13 closed (connection lost)
coldfront exited with code 1
...

./hpcts start fails to run. Ubuntu focal.

Nice tutorial.
But sorry to report an issue.

OS Ubuntu Focal with the default kernel.
Docker, Version: 20.10.18
Docker Compose version v2.10.2

git clone https://github.com/ubccr/hpc-toolset-tutorial.git
cd hpc-toolset-tutorial
./hpcts start

reports the following error message:

ERROR: Version in "./docker-compose.yml" is unsupported. You might be seeing this error because you're using the wrong Compose file version. Either specify a supported version (e.g "2.2" or "3.3") and place your service definitions under the services key, or omit the version key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/

Edit docker-compose.yml, change version: "3.9" to version: "2.2", then run ./hpcts start again; it reports:

Fetching latest HPC Toolset Images..
Pulling mongodb ... done
Pulling mysql ... done
Starting HPC Toolset Cluster..
ERROR: Service 'base' needs to be built, but --no-build was passed.

Anyway, I have changed ${HPCTS_VERSION} to 2022.07; otherwise, there will be these error messages:

ERROR: no such image: ubccr/hpcts:ldap-"2022.07": invalid reference format.

If I have done something wrong, please let me know.
I am not very familiar with Docker.
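The 'Version in "./docker-compose.yml" is unsupported' message is produced by the legacy docker-compose 1.x parser, so the hpcts script is probably picking up an old binary even though Compose v2 is installed. A hedged way to check, plus one way to set the version variable reported above:

# Find out which compose binary the script is actually invoking.
which -a docker-compose
docker-compose version

# Set the image tag before starting (value taken from the report above).
export HPCTS_VERSION=2022.07
./hpcts start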

document/fix ipv6 errors

This ticket is to document or fix the ipv6 errors related to Coldfront starting up. I'm sure there's some nginx config to just force ipv4.

I'm guessing the user had started a Vagrant or similar VM technology that did not allow ipv6, but I don't know for sure.

---> Starting SSSD on coldfront ...
---> Starting sshd on coldfront...
---> Starting the MUNGE Authentication service (munged) on coldfront ...
---> Starting nginx on coldfront...
nginx: [emerg] socket() [::]:80 failed (97: Address family not supported by protocol)

Here's the relevant config from their startup command.

cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-1160.31.1.el7.x86_64  ipv6.disable=1 
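A possible workaround (an assumption; the exact nginx config path inside the coldfront container may differ) is to drop the IPv6 listen directive so nginx binds to IPv4 only:

# Hypothetical fix: remove the '[::]:80' listen directive from the
# coldfront container's nginx config, then restart the container.
docker exec coldfront sed -i '/listen.*\[::\]:80/d' /etc/nginx/nginx.conf
docker restart coldfront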

MySQL corruption

I've had a few instances where, if I do docker-compose stop and then start everything, the MySQL instance will refuse to start because it needs repair. I wonder if we should not make the mysql volume persist through container restarts? I'm not sure how else to avoid MySQL database corruption, or being flagged that repairs are needed, unless there are some config options we can set to auto-repair.
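One low-effort mitigation (a sketch; docker-compose stop accepts a --timeout flag) would be to stop the database with a generous shutdown timeout so mariadb can flush cleanly instead of being killed mid-write:

# Stop mysql first with a long timeout so mariadb shuts down cleanly,
# then stop the rest of the stack.
docker-compose stop -t 120 mysql
docker-compose stop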

coldfront.project_projectattribute doesn't exist

I'm working my way through the tutorial but I am getting the following error when I try to create a new project:



Request Method: GET
Request URL: https://localhost:2443/project/1/

Django Version: 3.2.17
Python Version: 3.9.13
Installed Applications:
['django_su',
 'django.contrib.admin',
 'django.contrib.auth',
 'django.contrib.contenttypes',
 'django.contrib.sessions',
 'django.contrib.messages',
 'django.contrib.staticfiles',
 'django.contrib.humanize',
 'crispy_forms',
 'sslserver',
 'django_q',
 'simple_history',
 'fontawesome_free',
 'coldfront.core.user',
 'coldfront.core.field_of_science',
 'coldfront.core.utils',
 'coldfront.core.portal',
 'coldfront.core.project',
 'coldfront.core.resource',
 'coldfront.core.allocation',
 'coldfront.core.grant',
 'coldfront.core.publication',
 'coldfront.core.research_output',
 'coldfront.plugins.slurm',
 'mozilla_django_oidc']
Installed Middleware:
['django.middleware.security.SecurityMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'django.middleware.common.CommonMiddleware',
 'django.middleware.csrf.CsrfViewMiddleware',
 'django.contrib.auth.middleware.AuthenticationMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware',
 'django.middleware.clickjacking.XFrameOptionsMiddleware',
 'simple_history.middleware.HistoryRequestMiddleware',
 'mozilla_django_oidc.middleware.SessionRefresh']



Traceback (most recent call last):
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/backends/mysql/base.py", line 73, in execute
    return self.cursor.execute(query, args)
  File "/srv/www/venv/lib64/python3.9/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/srv/www/venv/lib64/python3.9/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/srv/www/venv/lib64/python3.9/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)

The above exception ((1146, "Table 'coldfront.project_projectattribute' doesn't exist")) was the direct cause of the following exception:
  File "/srv/www/venv/lib64/python3.9/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/views/generic/base.py", line 70, in view
    return self.dispatch(request, *args, **kwargs)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/contrib/auth/mixins.py", line 71, in dispatch
    return super().dispatch(request, *args, **kwargs)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/contrib/auth/mixins.py", line 128, in dispatch
    return super().dispatch(request, *args, **kwargs)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/views/generic/base.py", line 98, in dispatch
    return handler(request, *args, **kwargs)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/views/generic/detail.py", line 107, in get
    context = self.get_context_data(object=self.object)
  File "/srv/www/venv/lib64/python3.9/site-packages/coldfront/core/project/views.py", line 124, in get_context_data
    attributes_with_usage = [attribute for attribute in project_obj.projectattribute_set.filter(
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/models/query.py", line 280, in __iter__
    self._fetch_all()
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/models/query.py", line 1324, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/models/query.py", line 51, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/models/sql/compiler.py", line 1175, in execute_sql
    cursor.execute(sql, params)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/backends/utils.py", line 98, in execute
    return super().execute(sql, params)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/srv/www/venv/lib64/python3.9/site-packages/django/db/backends/mysql/base.py", line 73, in execute
    return self.cursor.execute(query, args)
  File "/srv/www/venv/lib64/python3.9/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/srv/www/venv/lib64/python3.9/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/srv/www/venv/lib64/python3.9/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)

Exception Type: ProgrammingError at /project/1/
Exception Value: (1146, "Table 'coldfront.project_projectattribute' doesn't exist")
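The missing table was added by a newer ColdFront migration, so running the pending Django migrations inside the container should create it (a sketch; assumes the coldfront entry point in the container's virtualenv wraps Django's manage.py, as the traceback paths suggest):

# Apply any pending ColdFront database migrations, then restart.
docker exec coldfront /srv/www/venv/bin/coldfront migrate
docker restart coldfront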

xdmod container doesn't start

---> Open XDMoD Aggregate: Job Performance
xdmod | ERROR: unable to find configuration file "supremm_resources.json" in the XDMoD configuration directory "/etc/xdmod".
xdmod | This file should be created using the instructions in the install guide.
xdmod | 2022-07-07 12:42:10 [error] Caught exception while executing: SQLSTATE[42S02]: Base table or view not found: 1146 Table 'modw_supremm.job' doesn't exist (stacktrace: #0 /usr/share/xdmod/classes/CCR/DB/PDODB.php(158): PDOStatement->execute(Array)
xdmod | #1 /usr/share/xdmod/classes/DB/EtlJournalHelper.php(34): CCR\DB\PDODB->query('SELECT UNIX_TIM...')
xdmod | #2 /usr/lib64/xdmod/aggregate_supremm.php(105): DB\EtlJournalHelper->getLastModified()
xdmod | #3 /usr/lib64/xdmod/aggregate_supremm.php(146): run_aggregation(Array, Array)
xdmod | #4 {main})
xdmod | 2022-07-07 12:42:10 [critical] Filter list building failed: SQLSTATE[42S02]: Base table or view not found: 1146 Table 'modw_aggregates.supremmfact_by_year' doesn't exist (stacktrace: #0 /usr/share/xdmod/classes/CCR/DB/PDODB.php(158): PDOStatement->execute(Array)
xdmod | #1 /usr/share/xdmod/classes/DataWarehouse/Query/TimeAggregationUnit.php(84): CCR\DB\PDODB->query('SELECT ...', Array)
xdmod | #2 /usr/share/xdmod/classes/DataWarehouse/Query/Query.php(1378): DataWarehouse\Query\TimeAggregationUnit->getDateRangeIds('0000-01-01', '9999-12-31')
xdmod | #3 /usr/share/xdmod/classes/DataWarehouse/Query/Query.php(165): DataWarehouse\Query\Query->setDuration(NULL, NULL)
xdmod | #4 /usr/share/xdmod/classes/DB/FilterListBuilder.php(74): DataWarehouse\Query\Query->__construct('SUPREMM', 'year', NULL, NULL, 'none')
xdmod | #5 /usr/bin/xdmod-build-filter-lists(91): FilterListBuilder->buildRealmLists('SUPREMM')
xdmod | #6 /usr/bin/xdmod-build-filter-lists(82): build(Array, Object(CCR\Logger))
xdmod | #7 {main})
xdmod exited with code 1

Ondemand not working on an EC2 instance

I am trying to deploy hpc-toolset-tutorial into an EC2 instance. I can access all the applications except OnDemand.
Here is my docker-compose.yml file so far:

version: "3.9"

services:
  ldap:
    image: ubccr/hpcts:ldap-${HPCTS_VERSION}
    build:
      context: ./ldap
    hostname: ldap
    container_name: ldap
    environment:
      - CONTAINER_LOG_LEVEL=debug
      - LDAP_RFC2307BIS_SCHEMA=true
      - LDAP_REMOVE_CONFIG_AFTER_SETUP=false
      - LDAP_TLS_VERIFY_CLIENT=never
    networks:
      - compute

  base:
    image: ubccr/hpcts:base-${HPCTS_VERSION}
    build:
      context: ./base
    networks:
      - compute
    depends_on:
      - ldap

  mongodb:
    image: mongo:${MONGODB_VERSION}
    hostname: mongodb
    container_name: mongodb
    environment:
      - MONGO_INITDB_ROOT_USERNAME=admin
      - MONGO_INITDB_ROOT_PASSWORD=hBbeOfpFLfFT5ZO
    networks:
      - compute
    volumes:
      - ./mongodb:/docker-entrypoint-initdb.d 
      - data_db:/data/db
    expose:
      - "27017"
  mysql:
    image: mariadb:${MARIADB_VERSION}
    hostname: mysql
    container_name: mysql
    environment:
      MYSQL_ALLOW_EMPTY_PASSWORD: "yes"
    networks:
      - compute
    volumes:
      - ./database:/docker-entrypoint-initdb.d
      - ./database:/etc/mysql/conf.d
      - ./slurm/slurmdbd.conf:/etc/slurm/slurmdbd.conf
      - var_lib_mysql:/var/lib/mysql
    expose:
      - "3306"

  slurmdbd:
    image: ubccr/hpcts:slurm-${HPCTS_VERSION}
    build:
      context: ./slurm
      args:
        SLURM_VERSION: $SLURM_VERSION
        HPCTS_VERSION: $HPCTS_VERSION
    command: ["slurmdbd"]
    container_name: slurmdbd
    hostname: slurmdbd
    networks:
      - compute
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - slurmdbd_state:/var/lib/slurmd
    expose:
      - "22"
      - "6819"
    depends_on:
      - base
      - ldap
      - mysql

  slurmctld:
    image: ubccr/hpcts:slurm-${HPCTS_VERSION}
    command: ["slurmctld"]
    container_name: slurmctld
    hostname: slurmctld
    networks:
      - compute
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - home:/home
      - slurmctld_state:/var/lib/slurmd
    expose:
      - "22"
      - "6817"
    depends_on:
      - ldap
      - slurmdbd

  cpn01:
    init: true
    image: ubccr/hpcts:slurm-${HPCTS_VERSION}
    command: ["slurmd"]
    hostname: cpn01
    container_name: cpn01
    networks:
      - compute
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - home:/home
      - cpn01_slurmd_state:/var/lib/slurmd
    expose:
      - "22"
      - "6818"
    depends_on:
      - ldap
      - slurmctld

  cpn02:
    init: true
    image: ubccr/hpcts:slurm-${HPCTS_VERSION}
    command: ["slurmd"]
    hostname: cpn02
    container_name: cpn02
    networks:
      - compute
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - home:/home
      - cpn02_slurmd_state:/var/lib/slurmd
    expose:
      - "22"
      - "6818"
    depends_on:
      - ldap
      - slurmctld

  frontend:
    image: ubccr/hpcts:slurm-${HPCTS_VERSION}
    command: ["frontend"]
    hostname: frontend
    container_name: frontend
    networks:
      - compute
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - home:/home
    ports:
      - "0.0.0.0:6222:22"
    depends_on:
      - ldap
      - slurmctld

  coldfront:
    image: ubccr/hpcts:coldfront-${HPCTS_VERSION}
    build:
      context: ./coldfront
      args:
        HPCTS_VERSION: $HPCTS_VERSION
    command: ["serve"]
    hostname: coldfront
    container_name: coldfront
    networks:
      - compute
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - home:/home
      - srv_www:/srv/www
    expose:
      - "22"
    ports:
      - "0.0.0.0:2443:443"
    depends_on:
      - ldap
      - mysql
      - frontend

  ondemand:
    image: ubccr/hpcts:ondemand-${HPCTS_VERSION}
    build:
      context: ./ondemand
      args:
        HPCTS_VERSION: $HPCTS_VERSION
    command: ["serve"]
    hostname: ondemand
    container_name: ondemand
    networks:
      - compute
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - home:/home
    expose:
      - "22"
    ports:
      - "0.0.0.0:3443:3443"
      - "0.0.0.0:5554:5554"
    depends_on:
      - ldap
      - frontend

  xdmod:
    image: ubccr/hpcts:xdmod-${HPCTS_VERSION}
    build:
      context: ./xdmod
      args:
        HPCTS_VERSION: $HPCTS_VERSION
    command: ["serve"]
    hostname: xdmod
    container_name: xdmod
    networks:
      - compute
    volumes:
      - etc_xdmod:/etc/xdmod
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - home:/home
    expose:
      - "22"
    ports:
      - "0.0.0.0:4443:443"
    depends_on:
      - mongodb
      - ldap
      - mysql
      - frontend

volumes:
  etc_xdmod:
  etc_munge:
  etc_slurm:
  home:
  var_lib_mysql:
  cpn01_slurmd_state:
  cpn02_slurmd_state:
  slurmctld_state:
  slurmdbd_state:
  data_db:
  srv_www:

networks:
  compute:

I have already edited ondemand/install.sh:

#!/bin/bash
set -e

trap 'ret=$?; test $ret -ne 0 && printf "failed\n\n" >&2; exit $ret' EXIT

log_info() {
  printf "\n\e[0;35m $1\e[0m\n\n"
}

log_info "Setting up Ondemand"
mkdir -p /etc/ood/config/clusters.d
mkdir -p /etc/ood/config/apps/shell
mkdir -p /etc/ood/config/apps/bc_desktop
mkdir -p /etc/ood/config/apps/dashboard
mkdir -p /etc/ood/config/apps/myjobs/templates
echo "DEFAULT_SSHHOST=frontend" > /etc/ood/config/apps/shell/env
echo "OOD_DEFAULT_SSHHOST=frontend" >> /etc/ood/config/apps/shell/env
echo "OOD_SSHHOST_ALLOWLIST=ondemand:cpn01:cpn02" >> /etc/ood/config/apps/shell/env
echo "OOD_DEV_SSH_HOST=ondemand" >> /etc/ood/config/apps/dashboard/env
echo "MOTD_PATH=/etc/motd" >> /etc/ood/config/apps/dashboard/env
echo "MOTD_FORMAT=markdown" >> /etc/ood/config/apps/dashboard/env
echo "OOD_BC_DYNAMIC_JS=1" >> /etc/ood/config/apps/dashboard/env

log_info "Configuring Ondemand ood_portal.yml .."

tee /etc/ood/config/ood_portal.yml <<EOF
---
#
# Portal configuration
#
listen_addr_port:
  - '3443'
servername: null
port: 3443
ssl: null
  # - 'SSLCertificateFile "/etc/pki/tls/certs/localhost.crt"'
  # - 'SSLCertificateKeyFile "/etc/pki/tls/private/localhost.key"'
node_uri: "/node"
rnode_uri: "/rnode"
oidc_scope: "openid profile email groups"
dex:
  client_redirect_uris:
    - "https://75.101.240.220:4443/simplesaml/module.php/authoidcoauth2/linkback.php"
    - "https://75.101.240.220:2443/oidc/callback/"
  client_secret: 334389048b872a533002b34d73f8c29fd09efc50
  client_id: null
  connectors:
    - type: ldap
      id: ldap
      name: LDAP
      config:
        host: ldap:636
        insecureSkipVerify: true
        bindDN: cn=admin,dc=example,dc=org
        bindPW: admin
        userSearch:
          baseDN: ou=People,dc=example,dc=org
          filter: "(objectClass=posixAccount)"
          username: uid
          idAttr: uid
          emailAttr: mail
          nameAttr: gecos
          preferredUsernameAttr: uid
        groupSearch:
          baseDN: ou=Groups,dc=example,dc=org
          filter: "(objectClass=posixGroup)"
          userMatchers:
            - userAttr: DN
              groupAttr: member
          nameAttr: cn
  # This is the default, but illustrating how to change
  frontend:
    theme: ondemand
EOF

log_info "Generating new httpd24 and dex configs.."
/opt/ood/ood-portal-generator/sbin/update_ood_portal

log_info "Adding new theme to dex"
sed -i "s/theme: ondemand/theme: hpc-coop/g" /etc/ood/dex/config.yaml

dnf clean all
rm -rf /var/cache/dnf

log_info "Cloning repos to assist with app development.."
mkdir -p /var/git
git clone https://github.com/OSC/bc_example_jupyter.git --bare /var/git/bc_example_jupyter
git clone https://github.com/OSC/ood-example-ps.git --bare /var/git/ood-example-ps

log_info "Enabling app development for hpcadmin..."
mkdir -p /var/www/ood/apps/dev/hpcadmin
ln -s /home/hpcadmin/ondemand/dev /var/www/ood/apps/dev/hpcadmin/gateway
echo 'if [[ ${HOSTNAME} == ondemand ]]; then source scl_source enable ondemand; fi' >> /home/hpcadmin/.bash_profile

Where am I going wrong? TIA

No XDMoD Data for PEARC 21

I'm working through the OOD + XDMoD integration. Got that to work, but I see there are no jobs in XDMoD after I run /srv/xdmod/scripts/shred-ingest-aggregate-all.sh.

Am I missing something? That's the command the instructions say to run; is that the right script?

Here are snippets from the output of that command that may be useful.

2021-07-14T17:12:34.382 [INFO] archive indexer starting
2021-07-14T17:12:34.442 [INFO] archive indexer complete
2021-07-14T17:12:35.005 [WARNING] Autoperiod library not found, TimeseriesPatterns plugins will not do period analysis
2021-07-14T17:12:35.009 [INFO] Processing resource hpc
2021-07-14T17:12:35.026 [INFO] Processing 0 jobs
...
2021-07-14T17:12:34.382 [INFO] archive indexer starting
2021-07-14T17:12:34.442 [INFO] archive indexer complete
2021-07-14T17:12:35.005 [WARNING] Autoperiod library not found, TimeseriesPatterns plugins will not do period analysis
2021-07-14T17:12:35.009 [INFO] Processing resource hpc
2021-07-14T17:12:35.026 [INFO] Processing 0 jobs

The widgets in OnDemand show everything empty:
[screenshot: empty OnDemand widgets]

Same with XDMoD:
[screenshot: empty XDMoD charts]

XDMoD fails after docker-compose stop/start

This is related to #134. It would be nice to support stopping/starting the containers. Steps to reproduce:

$ ./hpcts start
$ docker-compose down
$ docker-compose up

This also doesn't work:

$ ./hpcts start
$ docker-compose stop
$ docker-compose start

XDMoD fails to come back up and just responds with application/json:

[screenshot: XDMoD failure response]

@ryanrath Any thoughts?

jupyter notebook needs additional imports

The jupyter notebook in the OnDemand demo needs this additional import as well as the default rendering setting:

import plotly.io as pio
pio.renderers.default = 'notebook'

Error after logging into Open OnDemand

So I'm going through all the tutorials on my M1 and I've run into an issue with the Login to OnDemand website step (https://github.com/ubccr/hpc-toolset-tutorial/blob/master/coldfront/README.md#login-to-ondemand-website). Specifically, after I log in successfully I'm presented with the following (URL: https://localhost:3443/pun/sys/dashboard):

Error -- nginx: [emerg] unknown directive "passenger_preload_bundler" in /var/lib/ondemand-nginx/config/puns/cgray.conf:48

Here are the logs from the ondemand container:

$ docker logs -f ondemand
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection refused.
-- Waiting for frontend ssh to become active ...
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection refused.
-- Waiting for frontend ssh to become active ...
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection refused.
-- Waiting for frontend ssh to become active ...
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.112.9:22.
Ncat: 0 bytes sent, 0 bytes received in 0.00 seconds.
---> Cleaning NGINX ...
---> Populating /etc/ssh/ssh_known_hosts from frontend for ondemand...
# frontend:22 SSH-2.0-OpenSSH_8.0
# frontend:22 SSH-2.0-OpenSSH_8.0
# frontend:22 SSH-2.0-OpenSSH_8.0
---> Starting SSSD on ondemand ...
---> Starting the MUNGE Authentication service (munged) on ondemand ...
---> Starting sshd on ondemand...
---> Running update ood portal...
(2022-07-08 18:37:49): [sssd] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
(2022-07-08 18:37:49): [be[implicit_files]] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
(2022-07-08 18:37:49): [be[default]] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
(2022-07-08 18:37:49): [nss] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
(2022-07-08 18:37:49): [pam] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
cp -p /etc/pki/tls/certs/localhost.crt /etc/ood/dex/localhost.crt
chown ondemand-dex:ondemand-dex /etc/ood/dex/localhost.crt
cp -p /etc/pki/tls/private/localhost.key /etc/ood/dex/localhost.key
chown ondemand-dex:ondemand-dex /etc/ood/dex/localhost.key
No change in Apache config.
mv /etc/ood/dex/config.yaml /etc/ood/dex/config.yaml.20220708T183749
mv /tmp/dex_config20220708-38-au9wbu /etc/ood/dex/config.yaml
chown ondemand-dex:ondemand-dex /etc/ood/dex/config.yaml
chmod 600 /etc/ood/dex/config.yaml
Backing up previous Dex config to: '/etc/ood/dex/config.yaml.20220708T183749'
Generating new Dex config at: /etc/ood/dex/config.yaml
Completed successfully!

Restart the ondemand-dex service now.

Suggested command:
    sudo systemctl restart ondemand-dex.service

---> Starting ondemand-dex...
---> Starting ondemand httpd24...
[Fri Jul 08 18:37:49.929088 2022] [so:warn] [pid 60:tid 281472875155472] AH01574: module ssl_module is already loaded, skipping
AH00558: httpd: Could not reliably determine the server's fully qualified domain name, using 192.168.112.11. Set the 'ServerName' directive globally to suppress this message
time="2022-07-08T18:37:49Z" level=info msg="Dex Version: , Go Version: go1.17.10, Go OS/ARCH: linux arm64"
time="2022-07-08T18:37:49Z" level=info msg="config issuer: https://localhost:5554"
time="2022-07-08T18:37:49Z" level=info msg="config storage: sqlite3"
time="2022-07-08T18:37:49Z" level=info msg="config static client: OnDemand"
time="2022-07-08T18:37:49Z" level=info msg="config connector: ldap"
time="2022-07-08T18:37:49Z" level=info msg="config skipping approval screen"
time="2022-07-08T18:37:49Z" level=info msg="config refresh tokens rotation enabled: true"
time="2022-07-08T18:37:49Z" level=info msg="keys expired, rotating"
time="2022-07-08T18:37:50Z" level=info msg="keys rotated, next rotation: 2022-07-09 00:37:50.180941796 +0000 UTC"
time="2022-07-08T18:37:50Z" level=info msg="listening (telemetry) on 0.0.0.0:5558"
time="2022-07-08T18:37:50Z" level=info msg="listening (http) on 0.0.0.0:5556"
time="2022-07-08T18:37:50Z" level=info msg="listening (https) on 0.0.0.0:5554"
2022/07/08 19:02:38 http: TLS handshake error from 192.168.112.1:58552: remote error: tls: unknown certificate
time="2022-07-08T19:02:45Z" level=info msg="performing ldap search ou=People,dc=example,dc=org sub (&(objectClass=posixAccount)(uid=cgray))"
time="2022-07-08T19:02:45Z" level=info msg="username \"cgray\" mapped to entry cn=cgray,ou=People,dc=example,dc=org"
time="2022-07-08T19:02:45Z" level=info msg="performing ldap search ou=Groups,dc=example,dc=org sub (&(objectClass=posixGroup)(member=cn=cgray,ou=People,dc=example,dc=org))"
time="2022-07-08T19:02:45Z" level=info msg="login successful: connector \"ldap\", username=\"Carl Grey\", preferred_username=\"cgray\", email=\"[email protected]\", groups=[\"cgray\"]"

Please let me know if you need any additional information.

Thanks!
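The unknown passenger_preload_bundler directive usually means the per-user nginx (PUN) config was generated by a newer OnDemand release than the nginx/Passenger build in the image. A hedged recovery (nginx_stage is OnDemand's helper for managing per-user nginx instances; flags may vary by OOD version):

# Sketch: remove stale per-user nginx configs, then pull fresh images
# so the generated config matches the installed nginx/passenger.
docker exec ondemand /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean
./hpcts cleanup
./hpcts start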

Troubles opening ondemand

I am trying to run the tutorial on a virtual machine. All containers are running, and I can open coldfront and xdmod in the browser.
But I have problems when trying to open https://<vm-ip>:3443; it redirects to localhost:3443.
I did change the IP address in docker-compose.yml, but nothing changed.

Do you know what should be modified when running in a virtual machine? I am not running an extra httpd service.

When running docker compose logs -f I see:

AH00558: httpd: Could not reliably determine the server's fully qualified domain name ...

coldfront check icons bad for colorblind users

Not finding an obvious place for ColdFront bugs, but the icons are pretty bad for colorblind users like me. The check and cross in addition to color do help, but they're small enough that they're also kind of hard to distinguish (both being two lines crossing at right angles).

Arguably it would be a lot better to just have text saying "YES" and "NO", for example.

[screenshot: ColdFront review status icons]

Error message seen when logging in to ondemand

The following message is printed when you ssh to the ondemand instance:

-bash: scl_source: No such file or directory

Steps to reproduce:

  • from the outside, ssh to the frontend as hpcadmin: ssh -p 6222 hpcadmin@localhost
  • from the frontend, ssh to ondemand with this command: ssh ondemand
  • the error message is printed after the MOTD
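
A quick diagnostic sketch (not a fix from the repo) to find which login script still references scl_source on the ondemand host:

$ grep -rn scl_source /etc/profile.d/ /etc/bashrc ~/.bashrc ~/.bash_profile 2>/dev/null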

ColdFront fails to start

The command to start was docker-compose up -d --build

From docker-compose ps:

coldfront                     /usr/local/bin/entrypoint. ...   Exit 1                                                             

Logs:

$ docker-compose logs coldfront
Attaching to coldfront
coldfront    | ---> Starting SSSD on coldfront ...
coldfront    | ---> Starting sshd on coldfront...
coldfront    | ---> Starting the MUNGE Authentication service (munged) on coldfront ...
coldfront    | ---> Starting nginx on coldfront...
coldfront    | ---> Starting coldfront in gunicorn...
coldfront    | (Thu Jul 16 15:30:56 2020) [sssd[be[default]]] [sysdb_get_real_name] (0x0040): Cannot find user [nginx@default] in cache
coldfront    | (Thu Jul 16 15:30:56 2020) [sssd[be[default]]] [sysdb_get_real_name] (0x0040): Cannot find user [nginx@default] in cache
coldfront    | [2020-07-16 15:30:57 +0000] [47] [INFO] Starting gunicorn 20.0.4
coldfront    | [2020-07-16 15:30:57 +0000] [47] [ERROR] Retrying in 1 second.
coldfront    | [2020-07-16 15:30:58 +0000] [47] [ERROR] Retrying in 1 second.
coldfront    | [2020-07-16 15:30:59 +0000] [47] [ERROR] Retrying in 1 second.
coldfront    | [2020-07-16 15:31:00 +0000] [47] [ERROR] Retrying in 1 second.
coldfront    | [2020-07-16 15:31:01 +0000] [47] [ERROR] Retrying in 1 second.
coldfront    | [2020-07-16 15:31:02 +0000] [47] [ERROR] Can't connect to /srv/www/coldfront/coldfront.sock

The SSSD error is expected: it didn't find the nginx user in LDAP, which happens. The only way to avoid that error is to tell SSSD to never look up nginx in LDAP (see the sketch below), but that's likely not why the container failed, since nginx is in /etc/passwd.
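
For reference, a minimal sketch of that SSSD workaround, assuming the stock /etc/sssd/sssd.conf layout: filter_users in the [nss] section makes SSSD resolve those names from local files only.

[nss]
filter_users = root, nginx
filter_groups = root, nginx
# (restart sssd for the change to take effect)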

OnDemand offline after docker-compose stop/start

It would be nice if users could stop the containers and restart them later without losing their state. For example, a user completes the first half of the tutorial, stops the containers, goes to eat lunch, etc. Coming back and starting the containers again should let them pick up where they left off. This flow currently works:

$ ./hpcts start
$ docker-compose down
$ docker-compose up

OnDemand restarts just fine; however, docker-compose down stops and removes the containers (and any networks). The semantics are sketched below.
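
For reference, the docker-compose behavior at play here (general Docker semantics, not repo-specific):

$ docker-compose stop    # stops containers but keeps containers, networks, and volumes
$ docker-compose start   # restarts those same containers with their state intact
$ docker-compose down    # stops AND removes containers and networks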

This flow causes OnDemand to come back up in "offline mode":

$ ./hpcts start
$ docker-compose stop
$ docker-compose start

[screenshot: OnDemand "offline mode" page]

@johrstrom Any thoughts? Seems like we should be able to support the stop/start of the containers.

Create supremm-1.4 el8 x86_64 rpm builds

Is this package still required? If so, we'll need el8 builds. It's currently failing to install with:

#0 25.96 Error: 
#0 25.96   - nothing provides pcp-libs < 5.0 needed by supremm-1.4.1-1.el7.x86_64
#0 25.96   - nothing provides python-pcp >= 4.1 needed by supremm-1.4.1-1.el7.x86_64
#0 25.96   - nothing provides python-pcp < 5.0 needed by supremm-1.4.1-1.el7.x86_64
#0 25.96   - nothing provides python needed by supremm-1.4.1-1.el7.x86_64
#0 25.96   - nothing provides numpy needed by supremm-1.4.1-1.el7.x86_64
#0 25.96   - nothing provides python-pymongo needed by supremm-1.4.1-1.el7.x86_64
#0 25.96   - nothing provides scipy needed by supremm-1.4.1-1.el7.x86_64
#0 25.96   - nothing provides pytz needed by supremm-1.4.1-1.el7.x86_64
#0 25.96   - nothing provides Cython needed by supremm-1.4.1-1.el7.x86_64
#0 25.96   - nothing provides /usr/bin/python needed by supremm-1.4.1-1.el7.x86_64
#0 25.96   - nothing provides MySQL-python needed by supremm-1.4.1-1.el7.x86_64
#0 25.96   - nothing provides python-tzlocal needed by supremm-1.4.1-1.el7.x86_64
#0 25.96 (try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)

OnDemand aarch64 builds

We added multi-arch builds to support M1 Macs. All seems to work fine except for OnDemand, as only x86_64 RPM builds exist. Is it possible to get aarch64 builds? If not, what is arch-specific? We now install TurboVNC directly from their yum repo, which provides aarch64 builds. We also install python-websockify from source. Same for Dex, we just build the binary directly. What else is there?

document WSL/Windows issues

Looks like there's some setup for Windows folks, so getting some muscle memory and documentation around that would be nice.

Here are some things that come to my mind (but there could clearly be more):

  1. Using WSL or Windows Desktop
  2. You're likely going to need to update docker-compose (we probably need a minimum version listed somewhere)
  3. When you install docker/docker-compose, you also likely need to (a sketch follows this list):
    • create the docker group
    • add your user to it
    • start the docker daemon
    • ensure the docker socket is owned by root:docker
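
A sketch of those post-install steps on a typical systemd-based distro (standard Docker setup commands; the exact steps may differ under WSL):

$ sudo groupadd docker
$ sudo usermod -aG docker $USER      # log out and back in for this to take effect
$ sudo systemctl enable --now docker
$ ls -l /var/run/docker.sock         # should show root:docker ownership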

OnDemand PEARC22 issues

Creating an issue to keep track of getting the OnDemand containers ready for PEARC22 tutorial. Here's a start to the outstanding issues:

Single Sign On between Applications

Currently, we have the following:

  1. OOD running Dex (backed by LDAP container IdP)
  2. XDMoD running SAML? (backed by LDAP container IdP)
  3. Coldfront authenticating via LDAP directly

While all of the above allows user logins with the same credentials (user/pass), it is not SSO. Further, it would seem XDMoD and OOD are actually two completely separate SSO systems. Please correct me if I have anything wrong here.

Question for the team: Do we plan on creating a true SSO system to integrate all three applications for the tutorial? If so, how do we plan on doing that?

XDMoD 10.0.0 Prompt

Since the XDMoD 10 release, xdmod-setup prompts about the new version and then fails.

$ docker-compose logs xdmod
Attaching to xdmod
xdmod        | ---> Starting SSSD on xdmod ...
xdmod        | ---> Starting sshd on xdmod...
xdmod        | ---> Starting the MUNGE Authentication service (munged) on xdmod ...
xdmod        | ---> Starting sshd on xdmod...
xdmod        | ERROR 2003 (HY000): Can't connect to MySQL server on 'mysql' (111)
xdmod        | -- Waiting for database to become active ...
xdmod        | ERROR 2003 (HY000): Can't connect to MySQL server on 'mysql' (111)
xdmod        | -- Waiting for database to become active ...
xdmod        | ERROR 2003 (HY000): Can't connect to MySQL server on 'mysql' (111)
xdmod        | -- Waiting for database to become active ...
xdmod        | ---> Open XDMoD Setup: SSO...
xdmod        | ---> Open XDMoD Setup: start
xdmod        | spawn xdmod-setup
xdmod        | You are currently using Open XDMoD 9.5.0, but a newer version
xdmod        | (10.0.0) is available.
xdmod        | 
xdmod        | Do you want to continue (yes, no)? [no] 1
xdmod        | 
xdmod        | '1' is not a valid option.
xdmod        | 
xdmod        | Do you want to continue (yes, no)? [no]
xdmod        | Failed to get prompt
xdmod        | ---> Open XDMoD Setup: hpc resource
xdmod        | spawn xdmod-setup
xdmod        | You are currently using Open XDMoD 9.5.0, but a newer version
xdmod        | (10.0.0) is available.
xdmod        | 
xdmod        | Do you want to continue (yes, no)? [no] 4
xdmod        | 
xdmod        | '4' is not a valid option.
xdmod        | 
xdmod        | Do you want to continue (yes, no)? [no] 1
xdmod        | 
xdmod        | '1' is not a valid option.
xdmod        | 
xdmod        | Do you want to continue (yes, no)? [no]
xdmod        | Failed to get prompt
xdmod        | ---> Open XDMoD Setup: finish
xdmod        | spawn xdmod-setup
xdmod        | You are currently using Open XDMoD 9.5.0, but a newer version
xdmod        | (10.0.0) is available.
xdmod        | 
xdmod        | Do you want to continue (yes, no)? [no] 5
xdmod        | 
xdmod        | '5' is not a valid option.
xdmod        | 
xdmod        | Do you want to continue (yes, no)? [no]
xdmod        | Failed to get prompt
xdmod        | Open XDMoD Import: Hierarchy
xdmod        | (2022-04-15 21:26:51): [be[default]] [sysdb_get_real_name] (0x0040): Cannot find user [xdmod@default] in cache
xdmod        | (2022-04-15 21:26:51): [be[default]] [sysdb_get_real_name] (0x0040): Cannot find user [xdmod@default] in cache
xdmod        | (2022-04-15 21:26:51): [be[default]] [sysdb_get_real_name] (0x0040): Cannot find user [xdmod@default] in cache
xdmod        | (2022-04-15 21:26:51): [be[default]] [sysdb_get_real_name] (0x0040): Cannot find user [xdmod@default] in cache
xdmod        | No entry for terminal type "unknown";
xdmod        | using dumb terminal settings.
xdmod        | SQLSTATE[HY000] [2002] Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
xdmod        | #0 /usr/share/xdmod/classes/CCR/DB/PDODB.php(88): PDO->__construct('mysql:host=loca...', 'xdmod', '')
xdmod        | #1 /usr/share/xdmod/classes/CCR/DB.php(111): CCR\DB\PDODB->connect()
xdmod        | #2 /usr/share/xdmod/classes/CCR/CCRDBHandler.php(55): CCR\DB::factory('logger')
xdmod        | #3 /usr/share/xdmod/classes/CCR/Log.php(288): CCR\CCRDBHandler->__construct(NULL, NULL, NULL, 200)
xdmod        | #4 [internal function]: CCR\Log::getDbHandler('xdmod-import-cs...', Array)
xdmod        | #5 /usr/share/xdmod/classes/CCR/Log.php(192): call_user_func(Array, 'xdmod-import-cs...', Array)
xdmod        | #6 /usr/share/xdmod/classes/CCR/Log.php(113): CCR\Log::getLogger('xdmod-import-cs...', Array)
xdmod        | #7 /usr/bin/xdmod-import-csv(133): CCR\Log::factory('xdmod-import-cs...', Array)
xdmod        | #8 /usr/bin/xdmod-import-csv(27): main()
xdmod        | #9 {main}

Required header yaml.h not found

I was working through this tutorial recently and ran into issues in the OnDemand section. Some Ruby gems failed to compile because the yaml.h header was missing.

I was able to work around the issue by making the following change:

diff --git a/ondemand/install.sh b/ondemand/install.sh
index 546e5ac..f0371bf 100755
--- a/ondemand/install.sh
+++ b/ondemand/install.sh
@@ -79,6 +79,7 @@ log_info "Generating new httpd24 and dex configs.."
 log_info "Adding new theme to dex"
 sed -i "s/theme: ondemand/theme: hpc-coop/g" /etc/ood/dex/config.yaml

+dnf --enablerepo=powertools install -y libyaml-devel
 dnf clean all
 rm -rf /var/cache/dnf

and forcing the ondemand image to be built locally instead of pulling from Docker Hub (a sketch follows below). However, I'm not familiar enough with this package to know if this is the best place for libyaml-devel or if it would be better suited in one of the base images. If this is an appropriate change, I'd be happy to submit a merge request.

BTW, this was run on an Apple M1 Pro.
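
For reference, the local build can be forced with standard compose flags (the service name ondemand matches the compose file):

$ docker-compose build --no-cache ondemand
$ docker-compose up -d ondemand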

slurmctld never starts

After a fresh pull and ./hpcts start, slurmctld never starts.

...
frontend   | -- Waiting for slurmctld to become active ...
ondemand   | nc: connect to frontend (172.19.0.9) port 22 (tcp) failed: Connection refused
ondemand   | -- Waiting for frontend ssh to become active ...
cpn02      | -- slurmctld is not available.  Sleeping ...
cpn01      | -- slurmctld is not available.  Sleeping ...
frontend   | -- Waiting for slurmctld to become active ...
...

However, the slurmctld container is started:

hpc-toolset-tutorial (git)-[master] # docker logs slurmctld
---> Starting SSSD ...
---> Starting the MUNGE Authentication service (munged) ...
---> Starting sshd on the slurmctld...
---> Waiting for slurmdbd to become active before starting slurmctld ...
-- slurmdbd is not available.  Sleeping ...
(2023-03-01 20:07:42): [sssd] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
(2023-03-01 20:07:42): [be[implicit_files]] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
(2023-03-01 20:07:43): [be[default]] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
(2023-03-01 20:07:43): [pam] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
(2023-03-01 20:07:43): [nss] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
-- slurmdbd is not available.  Sleeping ...
-- slurmdbd is not available.  Sleeping ...
-- slurmdbd is not available.  Sleeping ...
-- slurmdbd is not available.  Sleeping ...
-- slurmdbd is now active ...
---> Starting the Slurm Controller Daemon (slurmctld) ...
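
A diagnostic sketch for anyone stuck in the same loop, using standard Slurm commands to check whether the controller is actually reachable:

$ docker exec slurmctld scontrol ping    # reports whether the primary controller responds
$ docker exec frontend sinfo             # should list cpn01/cpn02 once slurmctld is up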

problem with dex build from scratch

Trying to force-build the images (in order to test other changes), I get the following error. I'm no Go developer, so I'm afraid I'm at a loss as to the possible solution. This is running on an M1 Mac running macOS 14.4.

This bug report points to issues with some versions of Go: ent/ent#2155

 > [ondemand stage-arm64 5/8] RUN /build/install-dex-arm64.sh:
[...]
16.89  Install dex 2.32.0...
16.89
16.89 --2024-03-18 20:34:56--  https://github.com/dexidp/dex/archive/v2.32.0.tar.gz
[...]
17.28 --2024-03-18 20:34:56--  https://github.com/OSC/dex/commit/9366a1969bd656daa1df44e0bdd02f14437ed466.patch
[...]
17.58 dex-2.32.0/web/web.go
17.58 /tmp/dex-2.32.0 /tmp /
17.61 go: downloading entgo.io/ent v0.10.1
[...]
32.38 internal error: package "context" without types was imported from "entgo.io/ent"
------
failed to solve: process "/bin/sh -c /build/install-dex-arm64.sh" did not complete successfully: exit code: 2

Open OnDemand Internal Server Error

When trying to open the OnDemand URL https://localhost:3443 in a browser, I get the following error:

Internal Server Error

The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator at root@localhost to inform them of the time this error occurred, and the actions you performed just before this error.

More information about this error may be available in the server error log.

The error appears in the ondemand logs (docker compose logs -f ondemand) as:

ondemand  | Completed successfully!
ondemand  | ---> Starting ondemand-dex...
ondemand  | ---> Starting ondemand httpd24...
ondemand  | AH00558: httpd: Could not reliably determine the server's fully qualified domain name, using 172.19.0.12. Set the 'ServerName' directive globally to suppress this message
ondemand  | time="2024-07-25T14:22:47Z" level=info msg="Dex Version: v2.36.0, Go Version: go1.19.2, Go OS/ARCH: linux amd64"
ondemand  | time="2024-07-25T14:22:47Z" level=info msg="config issuer: https://localhost:5554"
ondemand  | time="2024-07-25T14:22:47Z" level=info msg="config storage: sqlite3"
ondemand  | time="2024-07-25T14:22:47Z" level=info msg="config static client: OnDemand"
ondemand  | time="2024-07-25T14:22:47Z" level=info msg="config connector: ldap"
ondemand  | time="2024-07-25T14:22:47Z" level=info msg="config skipping approval screen"
ondemand  | time="2024-07-25T14:22:47Z" level=info msg="config refresh tokens rotation enabled: true"
ondemand  | time="2024-07-25T14:22:47Z" level=info msg="keys expired, rotating"
ondemand  | time="2024-07-25T14:22:48Z" level=info msg="keys rotated, next rotation: 2024-07-25 20:22:47.998753652 +0000 UTC"
ondemand  | time="2024-07-25T14:22:48Z" level=info msg="listening (telemetry) on 0.0.0.0:5558"
ondemand  | time="2024-07-25T14:22:48Z" level=info msg="listening (http) on 0.0.0.0:5556"
ondemand  | time="2024-07-25T14:22:48Z" level=info msg="listening (https) on 0.0.0.0:5554"
ondemand  | 2024/07/25 14:23:58 http: TLS handshake error from [::1]:52182: local error: tls: bad record MAC
ondemand  | 2024/07/25 14:39:17 http: TLS handshake error from [::1]:51144: local error: tls: bad record MAC
ondemand  | 2024/07/25 14:42:10 http: TLS handshake error from [::1]:40056: local error: tls: bad record MAC

I'm running the containers on Pop!_OS 20.04. The XDMoD and ColdFront interfaces work correctly. The Open OnDemand interface had previously worked as recently as two weeks ago. I have tried running ./hpcts destroy and ./hpcts cleanup, as well as cloning a fresh repository and running the setup again. My colleague can also reproduce the error.

The fact that it had previously worked leads me to believe it could be an expired certificate, maybe?
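
A quick way to test the expired-certificate theory (standard openssl invocation):

$ echo | openssl s_client -connect localhost:3443 2>/dev/null | openssl x509 -noout -dates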

cannot schedule jobs after stopping/starting containers

Previously we (Open OnDemand) were able to submit jobs without the account information (I checked ondemand/README.md and you can see the GIF of this happening). That seems to have changed, at least for hpcadmin.

[hpcadmin@ondemand ~]$  sacctmgr -np list account format=account,partition,user
root|||
sfoster|||
staff|||

When I'm working through the tutorials, I've always just used hpcadmin without also going through the ColdFront tutorial (i.e., setting up account mappings and so on).

Is there something I'm missing?
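
If it is just a missing association, a hedged sketch of adding one by hand with standard sacctmgr usage (the account name staff is taken from the listing above; run it wherever you have Slurm admin rights):

$ sacctmgr -i add user hpcadmin account=staff
$ sacctmgr -np list assoc format=account,user    # the new association should now appear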
