aws-parallelcluster-post-install-scripts's Introduction

AWS ParallelCluster Post-Install Scripts 🚀

This repo contains a set of scripts that can be used to customize AWS ParallelCluster. To run more than one script, use the multi-runner script and pass each script URL followed by that script's arguments, like so:

 CustomActions:
    OnNodeConfigured:
      Script: https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/multi-runner/postinstall.sh
      Args:
        - https://script1.com
        - -arg1
        - -arg2
        - https://script2.com
        - -arg1

| Script | URL | Description |
| --- | --- | --- |
| Spack Setup 👾 | https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/spack/postinstall.sh | Set up the Spack package manager. |
| Multi-Runner 🪄 | https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/multi-runner/postinstall.sh | Run multiple post-install scripts, each with its own arguments. |
| Slurm REST API 🛰️ | https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/rest-api/postinstall.sh | Set up the Slurm REST API. |
| Pyxis + Enroot 📦 | https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/pyxis/postinstall.sh | Run containers with Slurm using Pyxis and Enroot. |
| Docker 🚢 | https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/docker/postinstall.sh | Install Docker. |
| NCCL 🏎️ | https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/nccl/postinstall.sh | Install NCCL, AWS OFI NCCL, and NCCL Tests. |
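
The argument convention used by the multi-runner configuration above (a URL starts a new script, and everything up to the next URL belongs to it) can be sketched as follows. This is only an illustration of the grouping rule, not the actual multi-runner implementation; the function name `group_scripts` is hypothetical, and the sketch assumes arguments contain no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one line per script, the URL followed by its arguments.
group_scripts() {
  local current="" arg
  for arg in "$@"; do
    case "$arg" in
      http://*|https://*)
        # A URL starts a new group; flush the previous group first.
        if [ -n "$current" ]; then
          printf '%s\n' "$current"
        fi
        current="$arg"
        ;;
      *)
        # Any other argument must belong to a script that was already named.
        if [ -z "$current" ]; then
          echo "argument '$arg' appears before any script URL" >&2
          return 1
        fi
        current="$current $arg"
        ;;
    esac
  done
  # Flush the last group, if any.
  if [ -n "$current" ]; then
    printf '%s\n' "$current"
  fi
  return 0
}

group_scripts https://script1.com -arg1 -arg2 https://script2.com -arg1
```

With the example `Args` list, this prints `https://script1.com -arg1 -arg2` on one line and `https://script2.com -arg1` on the next.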

aws-parallelcluster-post-install-scripts's People

Contributors

amazon-auto, jamt9000, lipovsek-aws, mhuguesaws, rkilpadi, sean-smith, verdimrc, wdykas

aws-parallelcluster-post-install-scripts's Issues

Pyxis postinstall script broken due to CentOS 7 end of life

The fuse-overlayfs package at the CentOS 7 mirror URL below is no longer available. Enroot requires this package. Luckily, I think we can install it with yum now.

FUSE_OVERLAYFS_URL=http://mirror.centos.org/centos/7/extras/x86_64/Packages/fuse-overlayfs-0.7.2-6.el7_8.x86_64.rpm
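
One possible shape for the yum-based install suggested in this issue is sketched below. It assumes, as the issue author does, that a `fuse-overlayfs` package is available from the configured yum repositories, which has not been verified here. The `run` wrapper and the `DRY_RUN` variable are hypothetical additions that make it possible to preview the command without executing it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Install fuse-overlayfs from the distribution repositories instead of a pinned mirror URL.
# Assumption: the package name in the repositories is "fuse-overlayfs".
install_fuse_overlayfs() {
  if command -v fuse-overlayfs >/dev/null 2>&1; then
    echo "fuse-overlayfs is already installed"
    return 0
  fi
  if ! run sudo yum install -y fuse-overlayfs; then
    echo "yum could not install fuse-overlayfs; enroot will not work without it" >&2
    return 1
  fi
}

# Preview only; drop DRY_RUN=1 to actually install.
DRY_RUN=1 install_fuse_overlayfs
```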

REST-API fails with "slurmdbd.conf not found"

In a default ParallelCluster (3.9.1) configuration, there may be no slurmdbd.conf file, in which case the recipe fails:

Recipe: @recipe_files::/tmp/slurm_rest_api/slurm_rest_api.rb
  * ruby_block[Create JWT key file] action run
    - execute the ruby block Create JWT key file
  * file[/var/spool/slurm.state/jwt_hs256.key] action create
    - change mode from '0644' to '0600'
    - change owner from 'root' to 'slurm'
    - change group from 'root' to 'slurm'
  * directory[/var/spool/slurm.state] action create
    - change mode from '0700' to '0755'
  * ruby_block[Add JWT configuration to slurm.conf] action run
    - execute the ruby block Add JWT configuration to slurm.conf
  * ruby_block[Add JWT configuration to slurmdbd.conf] action run
    
    ================================================================================
    Error executing action `run` on resource 'ruby_block[Add JWT configuration to slurmdbd.conf]'
    ================================================================================
    
    ArgumentError
    -------------
    File '/opt/slurm/etc/slurmdbd.conf' does not exist
    
    Resource Declaration:
    ---------------------
    # In /tmp/slurm_rest_api/slurm_rest_api.rb
    
     40: ruby_block 'Add JWT configuration to slurmdbd.conf' do
     41:   block do
     42:     file = Chef::Util::FileEdit.new("#{slurm_etc}/slurmdbd.conf")
     43:     file.insert_line_after_match(/AuthType=*/, "AuthAltParameters=jwt_key=#{key_location}")
     44:     file.insert_line_after_match(/AuthType=*/, "AuthAltTypes=auth/jwt")
     45:     file.write_file
     46:   end
     47:   not_if "grep -q auth/jwt #{slurm_etc}/slurmdbd.conf"
     48: end
     49: 
    
    Compiled Resource:
    ------------------
    # Declared in /tmp/slurm_rest_api/slurm_rest_api.rb:40:in `from_file'
    
    ruby_block("Add JWT configuration to slurmdbd.conf") do
      action [:run]
      default_guard_interpreter :default
      declared_type :ruby_block
      cookbook_name "@recipe_files"
      recipe_name "/tmp/slurm_rest_api/slurm_rest_api.rb"
      block #<Proc:0x00007fdb6d578e60 /tmp/slurm_rest_api/slurm_rest_api.rb:41>
      not_if "grep -q auth/jwt /opt/slurm/etc/slurmdbd.conf"
    end
    
    System Info:
    ------------
    chef_version=18.2.7
    platform=ubuntu
    platform_version=22.04
    ruby=ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
    program_name=/usr/bin/cinc-client
    executable=/opt/cinc/bin/cinc-client
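
The failure above comes from editing a file that does not exist. A minimal sketch of a guarded edit is shown below in shell rather than in the Chef recipe's Ruby. The function name `add_jwt_config` is hypothetical, and whether skipping a missing slurmdbd.conf is the right behavior for the REST API setup is an assumption, not something the issue confirms.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add JWT settings after the AuthType= line of a Slurm
# config file, but only when that file exists.
add_jwt_config() {
  local conf="$1" key="$2" tmp status

  # The issue reports that slurmdbd.conf can be missing; skip instead of failing.
  if [ ! -f "$conf" ]; then
    echo "skipping: $conf does not exist" >&2
    return 0
  fi

  # Idempotence guard, mirroring the recipe's not_if.
  if grep -q 'auth/jwt' "$conf"; then
    return 0
  fi

  tmp="$(mktemp)" || return 1
  # Copy every line and add the two settings after the first AuthType= line.
  awk -v key="$key" '
    { print }
    /^AuthType=/ && !done {
      print "AuthAltTypes=auth/jwt"
      print "AuthAltParameters=jwt_key=" key
      done = 1
    }
  ' "$conf" > "$tmp" && cat "$tmp" > "$conf"   # cat keeps the original owner and mode
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Paths taken from the log above.
add_jwt_config /opt/slurm/etc/slurmdbd.conf /var/spool/slurm.state/jwt_hs256.key
```

If the file has no `AuthType=` line, the sketch leaves it unchanged.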

Install nvidia-container-cli

Can we add the following to the post-install scripts to install nvidia-container-cli?

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && sudo apt-get update \
  && sudo apt-get install -y libnvidia-container1 libnvidia-container-tools

Pyxis runtime path cannot be on /fsx

The Pyxis runtime path cannot be on /fsx; otherwise, running a Docker image directly on multiple nodes fails with the errors below.

# NOTE: the command below works fine with -N1.
$ srun -N2 --container-image=alpine grep PRETTY /etc/os-release
...
slurmstepd: error: pyxis:     Can't find a SQUASHFS superblock on /fsx/pyxis/1000/385.0.squashfs
slurmstepd: error: pyxis:     Wrong filesystem or filesystem is corrupted!
slurmstepd: error: pyxis:     Failed to read existing filesystem - will not overwrite - ABORTING!
slurmstepd: error: pyxis:     To force Mksquashfs to write to this block device or file use -noappend
...
srun: error: p4de-st-p4de-1: task 0: Exited with exit code 1
...
slurmstepd: error: pyxis:     [ERROR] No such file or directory: /fsx/pyxis/1000/385.0.squashfs
...
srun: error: p4de-st-p4de-2: task 1: Exited with exit code 1
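
A post-install script could warn about this situation before the plugin is configured. The sketch below checks whether a candidate runtime path sits on a shared filesystem. It assumes GNU `stat`, and assumes /fsx is a network filesystem such as Lustre; the function names and the list of filesystem types are illustrative, not taken from Pyxis.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the filesystem type name denotes a shared
# (network) filesystem. The list is illustrative, not exhaustive.
is_shared_fstype() {
  case "$1" in
    lustre|nfs|nfs4|cifs|smb|smb2|ceph|gpfs) return 0 ;;
    *) return 1 ;;
  esac
}

# Warn and return 1 when the given path lives on a shared filesystem.
check_runtime_path() {
  local path="$1" fstype
  # GNU stat: -f reports on the filesystem, %T prints its type in human-readable form.
  fstype="$(stat -f -c %T "$path")" || return 2
  if is_shared_fstype "$fstype"; then
    echo "WARNING: $path is on a shared filesystem ($fstype); use a node-local path" >&2
    return 1
  fi
  return 0
}

# Example: check a node-local candidate path.
check_runtime_path /tmp || echo "choose a different runtime path"
```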

Spack script failed on invocation of spack

Cluster creation succeeds; however, when I run spack I get:

[ec2-user@ip-172-31-33-190 ~]$ spack
==> Error: /shared/spack/etc/spack/config.yaml:1: Additional properties are not allowed (404 was unexpected)
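
The message "404 was unexpected" on line 1 suggests, though the issue does not confirm it, that config.yaml contains the body of an HTTP 404 response rather than YAML. Under that assumption, a download step could refuse to install such a file. The function names below are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the first line of a file starts with a
# 4xx/5xx status code, e.g. "404: Not Found".
looks_like_http_error() {
  head -n 1 "$1" | grep -Eq '^[45][0-9]{2}([: ]|$)'
}

# Download a config file and install it only when the download succeeded.
fetch_config() {
  local url="$1" dest="$2" tmp
  tmp="$(mktemp)" || return 1
  # --fail makes curl exit non-zero on HTTP errors instead of saving the error body.
  if curl --fail --silent --show-error --location "$url" -o "$tmp" \
      && ! looks_like_http_error "$tmp"; then
    mv "$tmp" "$dest" && chmod 0644 "$dest"
  else
    rm -f "$tmp"
    echo "refusing to install $dest: download of $url failed" >&2
    return 1
  fi
}
```

An existing file can be checked the same way, for example `looks_like_http_error /shared/spack/etc/spack/config.yaml && echo "config.yaml holds an HTTP error body"`.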
