slurm-upgrade's Introduction

Update SlurmDB, Controller, and Client

This Ansible playbook automates updating the SlurmDB, Controller, and Client nodes. It creates backup directories, verifies the configuration, downloads and extracts a new release from GitHub, and restarts the Slurm services.

Related Confluence page: https://confluence.dug.com/pages/viewpage.action?spaceKey=DUGIT&title=20240122+Update+London%27s+SLURM+to+slurm-23-02-7-1-DUG-5

✨ Requirements

  • Ansible installed on the control machine.
  • SSH access to the Slurm nodes (SlurmDB, Controller, and Client).
  • GitHub personal access token for downloading releases.
  • SQL key for database access.

❔ Variables

  • get_url_args: Dictionary containing the destination path for downloaded files.
  • asset_name: The name of the GitHub release asset.
  • github_pat: GitHub personal access token.
  • sql_key: SQL key for database access.
  • backup_dir: Directory path for backups.
  • gh_release_api: GitHub API URL for the latest release.
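
As a rough illustration, the variables above could be collected in a vars file like the one below. Every value is a placeholder chosen for this sketch (the paths, the asset name, and the OWNER/REPO part of the URL are not taken from the playbook), and the two file lookups assume the tokens live in ~/.ssh as described under Usage.

```yaml
# Hypothetical vars file; all values are placeholders, not the playbook's real settings.
get_url_args:
  dest: /tmp/slurm_rollout            # assumed download and extract destination
asset_name: slurm                     # substring matched against release asset names
backup_dir: /tmp/slurm_rollout/backup # assumed backup location

# Read the secrets from files on the control machine instead of hard-coding them.
github_pat: "{{ lookup('ansible.builtin.file', '~/.ssh/github_token.txt') }}"
sql_key: "{{ lookup('ansible.builtin.file', '~/.ssh/sql_token.txt') }}"

# GitHub's "latest release" endpoint; replace OWNER/REPO with the real repository.
gh_release_api: https://api.github.com/repos/OWNER/REPO/releases/latest
```

Keeping the tokens in lookups rather than in the file itself means the vars file can be committed without exposing secrets.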

💻 Hosts

  • lslurmdb: Hosts related to SlurmDB.
  • lslurmcontroller: Hosts related to the Slurm Controller.
  • lslurmclient: Hosts related to the Slurm Client.
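
A minimal sketch of an inventory that defines these three groups, in Ansible's YAML inventory format. The host names are invented for illustration; only the group names come from this README.

```yaml
# inventory/hosts.yaml (host names are hypothetical)
all:
  children:
    lslurmdb:
      hosts:
        slurmdb01.example.com:
    lslurmcontroller:
      hosts:
        slurmctl01.example.com:
    lslurmclient:
      hosts:
        node001.example.com:
        node002.example.com:
```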

🚀 Tasks

  1. Ensure the backup directory exists
    • Creates backup directories on the SlurmDB and SlurmController nodes.
    - name: Ensure the backup directory exists
      ansible.builtin.file:
        path: "{{ backup_dir }}"
        mode: '0755'
        state: directory
  2. Run the sanity check script
    • Runs the sanity_slurm_conf.sh script to verify the Slurm configuration on the Controller.
      ansible.builtin.shell: DEBUG=1 /d/sw/slurm/etc/sanity_slurm_conf.sh
  3. Extract and compare configuration files
    • Compares the current Slurm configuration files and fails if they do not match (make sure all dynamic changes have been written out first).
      failed_when: diff_result.rc != 0 
      when: inventory_hostname in groups['lslurmcontroller']
  4. Obtain GitHub release details
    • Fetches the latest release details from the GitHub API.
      ansible.builtin.set_fact:
        gh_release_asset_url: "{{ gh_release.json.assets | selectattr('name', 'contains', asset_name) | map(attribute='url') | first }}"
        asset_dir_name: "{{ gh_release.json.assets | selectattr('name', 'contains', asset_name) | map(attribute='name') | first | regex_replace('.tar.bz2', '') }}"
  5. Download and extract the release
    • Downloads and extracts the specified release asset from GitHub.
      ansible.builtin.unarchive:
        src: "{{ download_dest.dest }}"
        dest: "{{ get_url_args.dest }}"
        remote_src: yes
  6. Stop Slurm services
    • Stops the SlurmDB, SlurmController, and SlurmClient services and fails if a service is still active.
      failed_when: "'restarted' in slurmdbd_status.state"
  7. Back up the configuration
    • Copies the existing configuration to a backup directory.
      ansible.builtin.copy:
        src: "/var/spool/slurm"
        dest: "{{ get_url_args.dest }}/backup_var_spool_slurm_{{ lookup('pipe', 'date +%Y%m%d') }}"
  8. Create relative symlinks
    • Creates symbolic links for the new Slurm version.
      ansible.builtin.file:
        src: "{{ asset_dir_name }}"
        dest: "{{ get_url_args.dest }}/d/sw/slurm/latest"
        state: link
  9. Restart Slurm services
    • Restarts the Slurm services on the SlurmDB, SlurmController, and SlurmClient nodes and fails if a service ends up stopped.
      failed_when: "'stopped' in slurmdbd_status.state"
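
The excerpts above use gh_release and download_dest without showing the tasks that register them. The following is a minimal sketch of how those two tasks might look, assuming the GitHub REST API with a bearer token. The task names and header choices are assumptions made for this example, not the playbook's actual tasks.

```yaml
# Sketch only: one possible way to produce gh_release and download_dest.
- name: Fetch the latest release metadata
  ansible.builtin.uri:
    url: "{{ gh_release_api }}"
    headers:
      Authorization: "Bearer {{ github_pat }}"
      Accept: application/vnd.github+json
  register: gh_release   # a JSON response is parsed into gh_release.json

# The set_fact excerpt from "Obtain GitHub release details" would run here
# and define gh_release_asset_url and asset_dir_name.

- name: Download the release asset
  ansible.builtin.get_url:
    url: "{{ gh_release_asset_url }}"
    dest: "{{ get_url_args.dest }}"
    headers:
      Authorization: "Bearer {{ github_pat }}"
      # Ask the asset API URL for the binary rather than its JSON description.
      Accept: application/octet-stream
  register: download_dest   # download_dest.dest holds the saved file path
```

The registered download_dest.dest is what the unarchive excerpt then uses as its src.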

📣 Usage

  1. Prepare the environment

    • Ensure you have the required access tokens and keys in ~/.ssh/github_token.txt and ~/.ssh/sql_token.txt.
  2. Run the playbook

     ansible-playbook -i inventory/hosts.yaml playbooks/slurm_rollout.yaml 

🐛 Debugging

  1. Run the playbook with increased verbosity (-vv)
     ansible-playbook -i inventory/hosts.yaml playbooks/slurm_rollout.yaml -vv

📜 Notes

  • This playbook has so far been run only in a test environment. Replace the FastX3 services with the relevant Slurm services, and check that the paths match your configuration.
  • The playbook assumes that the GitHub token and SQL key files are stored in the ~/.ssh directory.
  • Adjust file paths and variables according to your specific setup.
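
One way to make the service names easy to swap is to map each host group to its unit name in a single variable. The sketch below assumes the standard Slurm systemd unit names (slurmdbd, slurmctld, slurmd). The slurm_units variable and the play layout are hypothetical and not part of the playbook.

```yaml
# Sketch only: slurm_units is a hypothetical variable introduced for this example.
- name: Stop the Slurm daemon that belongs to each host group
  hosts: lslurmdb:lslurmcontroller:lslurmclient
  become: true
  vars:
    slurm_units:
      lslurmdb: slurmdbd
      lslurmcontroller: slurmctld
      lslurmclient: slurmd
  tasks:
    - name: Stop the unit that matches this host's group
      ansible.builtin.systemd:
        name: "{{ item.value }}"
        state: stopped
      loop: "{{ slurm_units | dict2items }}"
      # Only act on the unit whose group contains the current host.
      when: inventory_hostname in groups[item.key]
```

Changing the environment then only requires editing the mapping, not every task that touches a service.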

👉 Pending

  • Move the installation into /d/sw/slurm from its localData location.
  • Update the services to the relevant Slurm services.
  • Update host.yaml to list the relevant hosts.
  • Test the key used for dumping the SQL database into the backup directory.
  • Sanity check the size of the dumped SQL database.
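
For the last pending item, a size check could be sketched with the stat and assert modules. The dump file name and the min_dump_bytes threshold are hypothetical and would need to be chosen for the real database.

```yaml
# Sketch only: the file name and min_dump_bytes are assumptions for illustration.
- name: Inspect the database dump
  ansible.builtin.stat:
    path: "{{ backup_dir }}/slurm_acct_db.sql"
  register: sql_dump

- name: Fail if the dump is missing or suspiciously small
  ansible.builtin.assert:
    that:
      - sql_dump.stat.exists
      - (sql_dump.stat.size | default(0)) > (min_dump_bytes | int)
    fail_msg: "SQL dump is missing or smaller than {{ min_dump_bytes }} bytes"
  vars:
    min_dump_bytes: 1024   # placeholder threshold
```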

slurm-upgrade's People

Contributors

fiqrim-dug
