Flight Profile

Manage node provisioning.

Overview

Flight Profile is an interactive node provisioning tool, providing an abstracted, command-line based system for the setup of nodes via Ansible or similar provisioning tools.

Installation

Manual installation

Prerequisites

Flight Profile is developed and tested with Ruby version 2.7.1 and bundler 2.1.4. Other versions may work but currently are not officially supported.

Steps

The following will install from source using Git. The master branch is the current development version and may not be appropriate for a production installation. Instead a tagged version should be checked out.

git clone https://github.com/openflighthpc/flight-profile.git
cd flight-profile
git checkout <tag>
bundle install --path=vendor

Flight Profile requires the presence of an adjacent flight-profile-types directory. The following will install that repository using Git.

cd /path/to/flight-profile/../
git clone https://github.com/openflighthpc/flight-profile-types.git
cd flight-profile-types
git checkout <tag>

This repository contains the cluster types that are used by Flight Profile.

Configuration

To begin, run bin/profile configure. Here, you will set the cluster type to be used (present in flight-profile-types), as well as any required parameters specified in the metadata for that type.

These parameters must be set before you can run Flight Profile.
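As a minimal sketch (the exact prompts depend on the metadata of the selected type):

cd /path/to/flight-profile
bin/profile configure    # interactively prompts for the cluster type and its required parameters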

Defining Questions

Each cluster type requires different parameters, so different questions are asked depending on the selected type when running bin/profile configure. For this reason, each type defines its questions in its own YAML file. For instance, when configuring a Jupyter standalone cluster, the questions could be read from a file named 'path/to/openflight-jupyter-standalone/metadata.yaml'. This section describes the structure of that question metadata file.

The Minimum Structure of Question Metadata YAML Files

The basic structure of the YAML file is shown below:

---
id: openflight-jupyter-standalone   # the unique id of the cluster type
name: 'Openflight Jupyter Standalone'   # the name of the cluster type
description: 'A Single Node Research Environment running Jupyter Notebook'  # the description of the cluster type
questions:  # define the list of questions for this cluster type
  - id: question_1          # the unique id of the question
    env: QUESTION_1         # the name of the environment variable to which the answer will be assigned; it should also be unique
    text: 'question_1:'     # the text of the prompt that will be printed on the console as the label of the input field
    default: answer_1       # the default answer to the question
    validation:             # specify the validation for the answer
      type: string          # specify the type of the answer; this option is not currently validated, but there must be at least one validation item for each question
  - id: question_2          # second question 
    env: QUESTION_2
    text: 'question_2:'
    default: answer_2
    validation:
      type: string

With the above example, two questions will be asked when configuring a Jupyter standalone cluster.

Validation: Format and Validation: Message

format and message are sub-parameters of validation. The former validates the answer against a specified regex pattern, and the latter is the error message shown when the answer is invalid.

validation:
  format: '^[a-zA-Z0-9_\\-]+$'
  message: 'Invalid input: %{value}. Must contain only alphanumeric characters, - and _.'

With the above example, the error message Invalid input: ab(d. Must contain only alphanumeric characters, - and _. will be displayed when the input answer is "ab(d".
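Putting this together, a complete question using format validation might look like the following sketch (the id, env and default values are made up):

questions:
  - id: cluster_name
    env: CLUSTER_NAME
    text: 'cluster_name:'
    default: my_cluster
    validation:
      format: '^[a-zA-Z0-9_\\-]+$'
      message: 'Invalid input: %{value}. Must contain only alphanumeric characters, - and _.'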

Child Questions

Some questions may need to be asked only for certain answers to a previous question. The following example shows how to define such child questions.

questions:
  - id: parent_question
    env: PARENT_QUESTION
    text: 'parent_question:'
    default: child
    validation:
      type: string
    questions:                          # define the child questions
      - id: child_question_daughter
        where: daughter                     # define the condition for asking this child question
        env: CHILD_QUESTION_DAUGHTER
        text: 'child_question_daughter:'
        default: daughter
        validation:
          type: string
      - id: child_question_son
        where: son
        env: CHILD_QUESTION_SON
        text: 'child_question_son:'
        default: son
        validation:
          type: string

For this metadata, the parent_question will be asked first. Then, if the answer is "daughter", only the child_question_daughter will be asked, according to its where option, and the child_question_son will be skipped. Note that if the answer is neither "daughter" nor "son" ("moose", say), then neither child question will be asked.

Boolean Questions

Some questions may require a binary answer, i.e. y/yes or n/no. To define such a question, use the type parameter as demonstrated below:

questions:
  - id: conditional_question
    env: CONDITIONAL_QUESTION
    text: "conditional_question:"
    type: boolean
    default: TRUE
    validation:
      type: bool    # what is defined under validation is not currently enforced, but at least one validation item must be included

For this kind of question, only yes, y, no, or n are valid answers.

Boolean questions can also have child questions. Simply use true or false as the value of the where option for the child questions of a boolean question.
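For example, a sketch of a boolean question with one child question asked only on a "yes" answer (all ids, env names and defaults are illustrative):

questions:
  - id: enable_feature
    env: ENABLE_FEATURE
    text: 'enable_feature:'
    type: boolean
    default: TRUE
    validation:
      type: bool
    questions:                          # child questions of the boolean question
      - id: feature_option
        where: true                     # asked only when the parent is answered yes/y
        env: FEATURE_OPTION
        text: 'feature_option:'
        default: option_value
        validation:
          type: string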

Conditional Dependencies

Questions can create conditional identity dependencies by using the dependencies field. For instance:

questions:
  - id: conditional_dependency_question
    env: CONDITIONAL_DEPENDENCY_QUESTION
    text: "conditional_dependency_question:"
    default: no_dependency
    validation:
      type: string
    dependencies:
      - identity: id_a
        depend_on:
          - id_x
        where: dependency_a
      - identity: id_b
        depend_on:
          - id_y
          - id_z
        where: dependency_b

Given the above question definition:

  • If the answer to the question is "dependency_a", it forms a dependency that id_a depends on id_x.
  • If the answer to the question is "dependency_b", it forms a dependency that id_b depends on both id_y and id_z.
  • If the answer to the question is anything else, neither dependency will be created, or, in a reconfiguration scenario, any existing ones will be discarded.

IMPORTANT NOTE: Updating the conditional dependencies through a reconfiguration can cause side effects for nodes queued by the profile apply process. To avoid this, dequeue the relevant nodes before reconfiguring and re-apply them afterwards.

Operation

A brief usage guide is given here. See the help command for more in-depth details and information specific to each command.

Display the available cluster types with avail. A brief description of the purpose of each type is given along with its name.

Display the available node identities with identities. These are what will be specified when setting up nodes. You can specify a type for which to list the identities with identities TYPE; if you don't specify, the type that was set in configure is used.

Set up one or more nodes with apply HOSTNAME,HOSTNAME... IDENTITY. Hostnames should be submitted as a comma separated list of valid and accessible hostnames on the network. The identity should be one that exists when running identities for the currently configured type.

List brief information for each node that has been set up with list.

View the setup status for a single node with view HOSTNAME. A truncated/stylised version of the Ansible output will be displayed, as well as the long-form command used to run it. See the raw log output by including the --raw option.
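A typical session might look like the following sketch (the hostnames and the compute identity are illustrative and depend on your configured type):

bin/profile avail
bin/profile identities
bin/profile apply node01,node02 compute
bin/profile list
bin/profile view node01 --raw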

Remove on shutdown option

When applying to a set of nodes, you may use the --remove-on-shutdown option. When used, the nodes being applied to will be given a systemd unit that, when stopped (presumably, on system shutdown), attempts to communicate to the applier node that they have shut down and should be remove'd from Profile. The option requires:

  • The shared_secret_path config option to be set (a config sketch follows this list)
  • flight-profile-api set up and running on the same system, using the same shared secret
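
As a sketch of the first requirement, assuming the application config file is etc/config.yml and using a made-up path:

# etc/config.yml (the path value is illustrative)
shared_secret_path: /opt/flight/etc/profile/shared-secret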

Automatically obtaining node identities

When using apply, you may use the --detect-identity option to attempt to determine the identity of a node from its Hunter groups. The groups will be searched for a group name that matches an identity name, and if one is found, that node will be queued for an application of that identity.

When using the --detect-identity option, giving an identity is not required. However, you may still provide one, and that identity will be used for all nodes that could not automatically determine their own identity. For example, if you had a set of 50 nodes, you could modify the groups of one node to include "login", then run profile apply node[0-49] compute --detect-identity; the compute identity would be applied to the 49 nodes whose groups were not modified, while login would be applied to the relevant node.

If you apply an identity to a set of nodes while also using --detect-identity, and any of the nodes in that set detect an identity matching the one you chose to apply, they will all be applied simultaneously.
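As a concrete invocation of the example above (hostnames, groups and the compute/login identities are illustrative):

profile apply node[0-49] compute --detect-identity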

Contributing

Fork the project. Make your feature addition or bug fix. Send a pull request. Bonus points for topic branches.

Read CONTRIBUTING.md for more details.

Copyright and License

Eclipse Public License 2.0, see LICENSE.txt for details.

Copyright (C) 2022-present Alces Flight Ltd.

This program and the accompanying materials are made available under the terms of the Eclipse Public License 2.0 which is available at https://www.eclipse.org/legal/epl-2.0, or alternative license terms made available by Alces Flight Ltd - please direct inquiries about licensing to [email protected].

Flight Profile is distributed in the hope that it will be useful, but WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the Eclipse Public License 2.0 for more details.

flight-profile's Issues

Support genders-style syntax for multi-node apply

If I have 10 nodes and I want to apply to all of them I need to do:

[root@login1 ~]# flight profile apply node01,node02,node03,node04,node05,node06,node07,node08,node09,node10 compute

It'd be really useful if I could use genders syntax such that the following would work:

[root@login1 ~]# flight profile apply node[01-10] compute

Improve Multi-Node Apply

Issue Overview

Applying to 32 compute nodes at once creates a massive load on the host server and deployment grinds to a halt due to the ansible-playbook command being run for each node to be configured.

Considerations

Ansible is usually run on multiple nodes at once and that would significantly reduce the load. Some of the problems to address with this method are:

  1. Separating stdout on a per-node basis (likely to be something addressed in flight-profile so the view command can still work)
  2. Handling multiple nodes being handed through to the Ansible run script

Addressing the Considerations

  1. Separating Stdout
    • Ansible allows output to be written to a directory in separate log files; this can be achieved with the following environment variables (see the sketch after this list):
      • ANSIBLE_LOG_FOLDER=/PATH/TO/OUTPUT/: Set the directory to create log files in (this directory must exist, or Ansible won't log anything anywhere)
      • ANSIBLE_STDOUT_CALLBACK=log_plays: Redirect the standard output to the log file path directory
      • The above means that nothing is printed to the command's stdout, so view would likely need to be restructured to handle this
      • Additionally, it does not look possible to change the filenames, so we may need to handle the separation of the "stages" of a profile run (the pre, main and post bits) ourselves
        • We could create directories for each stage + date, handling the data classification via ANSIBLE_LOG_FOLDER changing between each run command
        • Alternatively, we could add some notes into the logfiles as ansible appends to the logfile for the nodename
  2. Handling Multiple Nodes
    • Currently we have $NODE passed through, and the run scripts within the playbooks do --limit $NODE; for handling multiple nodes at once we'll have to think about how we can hand over all the individual node names
    • Perhaps $NODES could be a space-separated list of node names which the scripts can then separate/use however they wish?
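
A hedged sketch of how these variables could be combined with the existing --limit style of run (the log directory and playbook name are assumptions):

# Illustrative only: the log directory and playbook path are made up
mkdir -p /var/log/flight-profile/node01            # log_plays logs nothing if the directory is missing
export ANSIBLE_LOG_FOLDER=/var/log/flight-profile/node01
export ANSIBLE_STDOUT_CALLBACK=log_plays
ansible-playbook playbook.yaml --limit node01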

To Do

  • Add ansible logging vars to env of apply & remove
  • Update view to work with new logging method
  • Update how multiple nodes are handed to the commands of a type (probably pairs well with the new expanded range options)
  • Update our playbooks to use the new passing of nodes from above
  • Significant testing to ensure playbooks work well with this new style of usage & that logging in view is as expected

Add group name field to profile

At the moment, profiles have two ambiguously defined "names": the file name of the profile (e.g. compute.yaml); and the name field in the profile definition. We should pick one of these and use it as its consistent name to be used on the command line and in informative tables. Deciding which of those to use is left as an exercise for the reader.

It would also be a good idea to decouple the profile's name from the Ansible group it will be using. At the moment, we're forced to name profiles things like head and nodes, because the name is used to fetch the appropriate Ansible group. We should add a new field to profiles, group_name, to directly specify the Ansible group that it should use.
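A sketch of what a profile definition with the proposed field might look like (the values are illustrative):

# compute.yaml (illustrative)
name: compute
group_name: nodes        # the Ansible group to target, decoupled from the profile's name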

Accommodate setup of multiple nodes at once

It would be a big time-saver/convenience to be able to submit multiple deployment jobs at the same time. For example, setting up three compute nodes at the same time. There are a few ways this could be approached; here are some of them:

  • New setup-group command taking any number of hostnames followed by a single profile name (the Ruby splat operator comes to mind when dealing with this)
    • deploy setup-group cnode01 cnode02 cnode03 compute
  • New --group option for setup taking a delimited list of hostnames
    • deploy setup --group cnode01,cnode02,cnode03 compute
    • This approach could allow for multiple groups to be submitted at the same time

For options where multiple hostnames are listed, it may be user-friendly to accept certain range-based inputs. For example, deploy setup cnode[01-03] compute would be expanded to [cnode01, cnode02, cnode03] in the command logic. This would be a rather sophisticated addition, however, and if attempted should be left until the individual hostname specification works properly first.

The question also arises of whether we want these deployment jobs to be run in parallel or sequentially. Ideally they would be run in parallel, but some thought should be given as to whether that would create any problems (and how, if at all, we can solve/get around them).

bad error messages appearing

I encountered bad error messages: e.g.

profile: undefined method `command' for nil:NilClass

when using the flight profile view command

in particular flight profile view <name>

where the <name> I used was one that hadn't been set up yet

full steps on a fresh machine were:

sudo dnf install -y https://repo.openflighthpc.org/openflight/centos/8/x86_64/openflighthpc-release-3-1.noarch.rpm

sudo dnf config-manager --enable openflight-dev

sudo dnf install -y flight-profile

flight profile prepare openflight-slurm-standalone

flight profile configure

flight profile identities

flight profile view localhost

I hadn't applied anything to localhost yet, and it gave a nil:NilClass error.

Include `use_hunter` in env merge in `apply`

Currently, the Ansible playbook relies on a configuration question in flight-profile-types to know whether or not it should be using the hunter specific configuration options. We should move the use_hunter logic out of the configuration questions, and explicitly set it to the value of the equivalent application config key when applying a profile.

Stop invalid node hostnames being saved to inventory

Attempting to set up a non-existent node fails at the first step of the ansible playbook, but this is after the false hostname gets saved to the inventory. The false node also gets displayed by the list command.

Possible solution: for nodes that failed because the hostname was invalid (i.e. failed the first ansible step), remove the failed hostnames from the inventory and delete the appropriate var/inventory/hostname.yaml file.

No proper error message for flight profile clean

When running the command flight profile clean with no arguments, simply:

flight profile clean

I get the error message:

profile: undefined method `delete' for nil:NilClass

Not 100% certain, but it seems like it does actually clean one thing each time the command is run, which I'm not sure is a bug. E.g.

[flight@ip-172-31-19-83 ~]$ flight profile list
┌────────┬──────────┬────────┐
│ Node   │ Identity │ Status │
├────────┼──────────┼────────┤
│ login1 │ login    │ failed │
│ node01 │ compute  │ failed │
│ node02 │ compute  │ failed │
└────────┴──────────┴────────┘
[flight@ip-172-31-19-83 ~]$ flight profile clean
profile: undefined method `delete' for nil:NilClass
[flight@ip-172-31-19-83 ~]$ flight profile list
┌────────┬──────────┬───────────┐
│ Node   │ Identity │ Status    │
├────────┼──────────┼───────────┤
│ node01 │ compute  │ failed    │
│ node02 │ compute  │ failed    │
│ login1 │          │ available │
└────────┴──────────┴───────────┘
[flight@ip-172-31-19-83 ~]$ flight profile clean
profile: undefined method `delete' for nil:NilClass
[flight@ip-172-31-19-83 ~]$ flight profile list
┌────────┬──────────┬───────────┐
│ Node   │ Identity │ Status    │
├────────┼──────────┼───────────┤
│ node02 │ compute  │ failed    │
│ login1 │          │ available │
│ node01 │          │ available │
└────────┴──────────┴───────────┘
[flight@ip-172-31-19-83 ~]$ flight profile clean
profile: undefined method `delete' for nil:NilClass
[flight@ip-172-31-19-83 ~]$ flight profile list
┌────────┬──────────┬───────────┐
│ Node   │ Identity │ Status    │
├────────┼──────────┼───────────┤
│ login1 │          │ available │
│ node01 │          │ available │
│ node02 │          │ available │
└────────┴──────────┴───────────┘

But it should still say something like "specify the node to clean" rather than undefined method.

Investigate Usage of Hunter Groups with Profile Apply

A useful integration option would be for profile to determine the identity to apply to a node from its primary group in hunter.

Some initial thoughts on this:

  • apply flag/option to autodetect the identity type
    • Some verification to determine whether the group matches an available identity in the selected cluster type
  • Stretch/additional functionality - the ability to autoapply to all hunted nodes, selecting all their appropriate identities from the list

Be more thorough when checking script status

The prepare command will inform the user that it was successfully run, as long as the last command in prepare.sh was successful, even if other commands have failed. We should ensure that any failed commands are recognised by profile and deem the prepare process to be a failure if any of the subcommands fail.
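One possible mitigation, sketched here purely for illustration (it is not the current behaviour and would not replace checks in profile itself), would be for a type's prepare.sh to fail fast:

#!/bin/bash
# Sketch only: abort on the first failing command, unset variable or failed pipeline member,
# so the script's exit status reflects every command rather than just the last one.
set -euo pipefail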

Give profiles a description field

Profiles would benefit from a description field, containing a nice description of what the profile is going to do to the node it is applied to. Tied to this, we would like to remove a profile's command from the profiles table output. Replacing it with the to-be-added description field would be ideal.

Remove confirmation prompt when configuring

The configure command prompts the user to confirm their answers once they're done. This isn't really necessary, as there are currently only two questions, and if the user isn't happy with their answers then they can just run configure again.

Ensure type config values are strings before applying

Currently, the profile apply process maps the type question config values to strings and tries to interpolate them into the command spawned out. If a config value isn't a string, and cannot be implicitly converted to a string, an error is raised. We should either (1): only accept string values for config keys, or (2): try and convert all values to strings before submitting, raising a more user-friendly error if any of them cannot be converted.

Investigate possible solutions to simultaneous playbooks having issues running commands.

Currently, profile runs three instances of the profile commands when launched with flight profile apply node01,node02,node03 compute. This is largely okay, but can on rare occasions lead to issues with similar commands running over all nodes (e.g. in post-hooks that execute for the entire playbook instead of limiting to the node it's being run for).

One possible way of circumventing these occasional errors would be to, instead, set the $NODE variable to the comma-separated list of nodes to be run on.

My immediate concern with this implementation is that it'd become tricky in how we handle view with logs.

An alternative solution would be to have a short delay between commands such that commands have less chance of executing the same commands at the same time.

For reference, the issue I've intermittently seen is when a post-hook script running for SLURM multinode needs to add firewall rules to a node; due to multiple attempts happening concurrently, some of the playbooks exit with:

ok: [stu2-node-3.novalocal] => (item=FA:16:3E:24:AD:75)
changed: [stu2-node-5.novalocal] => (item=FA:16:3E:75:8A:6E)
changed: [stu2-node-5.novalocal] => (item=FA:16:3E:27:11:BA)
failed: [stu2-node-5.novalocal] (item=FA:16:3E:A9:C6:55) => {"ansible_loop_var": "item", "changed": false, "item": "FA:16:3E:A9:C6:55", "msg": "ERROR: Exception caught: ALREADY_ENABLED: FA:16:3E:A9:C6:55 Permanent and Non-Permanent(immediate) operation"}
changed: [stu2-node-5.novalocal] => (item=FA:16:3E:BE:22:61)

`clean` command for removing failed nodes from `list` output

Include a clean command, similar to flight desktop, which removes the inventory files for any failed nodes.

Then keep the inventory files for all nodes after setup, regardless of outcome. This will stop them from disappearing from the list output, but also provides a way of removing invalid nodes from the list.

Continue to remove failed nodes from the ansible inventory

Improve type preparation process

The way that cluster types are prepared could do with some further sophistication. This is a meta-issue to track a few features.

Isolated prepare command

Currently, a cluster type can only be prepared when an identity for that type is applied for the first time. It isn't immediately obvious that any preparation is going on (better UX outputs to be handled in a separate issue), and it would be nice to be able to make sure things are "ready to go" without having to commit to applying an identity. The logic for preparing a type is already encapsulated within the Type class; we just need a command to trigger it. This work should include removing the automatic preparation step from apply.

Tracking whether or not a type has been prepared

There should be some indication in the avail command as to whether or not a type has been prepared and is ready for use or not. This is as simple as a true/false flag on Type to be set when the preparation command is complete.

Better separation of type dependencies

Currently, when a cluster type from flight-profile-types is prepared, its dependencies are dumped in the root of the project directory. Ideally, a type should have its own isolated environment to store its dependencies. This could be in the type directory; it could also be in a new var directory, perhaps one called type_envs with one folder per type. Some thought would need to be given as to conveying to the type's prepare script what the isolated environment is. I think that Ruby has a way to restrict a process from interacting with files above a particular directory in the tree.

  • prepare command

  • Preparation tracking

  • Isolated dependencies folder

Give types their own config file

Currently, the application config file at etc/config.yml stores both the top-level application config and the answers to the questions given by the user's chosen flight-profile-type. Not only should they not share a file as a matter of good design, but it also produces real-life issues if the user wants to use a type question that shares an ID with an application config key.

We should separate the config data for Profile types into their own files. It would prevent overlap errors from occurring, and make it slightly easier to switch types if the user so decides.

Fix command logging in setup command

The Running section of the view command should always display the command used to run the deployment. At the moment, if a deployment gets some kind of warning message (in my case, Ansible warning me to use a module other than rpm), the line of the log containing the command is consumed by File::readlines, and the Running section displays the Ansible warning instead. Some bug-hunting is needed here to see what's going on and how to stop it.

Queue Doesn't Resolve When A Previously-Failed Dependency Becomes Complete

Description

In the situation where some apply processes are queued, waiting for another identity to successfully complete, and the identity they require fails, they will remain queued (expected). However, if that identity is force-applied and then succeeds (resolving the dependencies for the queued processes), the processes will remain queued indefinitely.

Steps to replicate

  1. Launch a cluster with building blocks
  2. Setup test playbook (download and run this gist)
  3. Put node into queue (e.g. flight profile apply node01 dep)
  4. Apply one that will fail (flight profile apply gateway1 first)
  5. When it fails, try forcing to gateway1 again (will fail again)
  6. "Fix" the issue (change exit 1 to exit 0 in playbook tasks
  7. Force apply to gateway1 again
  8. It completes!
  9. Node01 remains queued forever

Show In-Progress Task

The view output for a log will show the tasks that have been completed, but some tasks in playbooks take a while to apply; for example, initialising a Kubernetes cluster. When this is taking place, the view command will show:

<snip>
k8s
   ✅ Add Modules for Containerd
   ✅ Load Modules for Containerd
   ✅ Add Sysctl Options for Containerd
   ✅ Apply New Sysctl Options
   ✅ Start & Enable Containerd
   ✅ Start & Enable Kubelet

And the raw will show

<snip>
TASK [k8s : Add Modules for Containerd] ****************************************
changed: [kube1-login1.novalocal]

TASK [k8s : Load Modules for Containerd] ***************************************
changed: [kube1-login1.novalocal] => (item=overlay)
changed: [kube1-login1.novalocal] => (item=br_netfilter)

TASK [k8s : Add Sysctl Options for Containerd] *********************************
changed: [kube1-login1.novalocal]

TASK [k8s : Apply New Sysctl Options] ******************************************
changed: [kube1-login1.novalocal]

TASK [k8s : Start & Enable Containerd] *****************************************
changed: [kube1-login1.novalocal]

TASK [k8s : Start & Enable Kubelet] ********************************************
changed: [kube1-login1.novalocal]

TASK [k8s : Initialize the Cluster] ********************************************

I think it may be worth having some symbol/way of showing what task it is on when applying. E.g.

k8s
   ✅ Add Modules for Containerd
   ✅ Load Modules for Containerd
   ✅ Add Sysctl Options for Containerd
   ✅ Apply New Sysctl Options
   ✅ Start & Enable Containerd
   ✅ Start & Enable Kubelet
   ⏳ Initialize the Cluster

Allow users to force apply a profile

Currently, a node can only have a single profile applied to it. If the user tries to give it another profile, an error is raised saying that it already has a profile. We would like to be able to override this with a --force option. This could cause problems with how the node is set up if it already has changes made to it that are hard to undo/modify, but that's mostly down to the playbook configuration (and, to be honest, should be expected when forcing advised-against changes).

View command doesn't work when node exists

The raise call in Profile::Commands::View#node is overriding the return value of the @node ||= Node.find(@name) above it, returning nil if the node does indeed exist. The raise call should be moved up a level to the main Profile::Commands::View#run logic.

Be more consistent with YAML formatting

In, at least, the etc/config.yml file, some keys are preceded with a colon (specifically, the ones set by configure), and some keys are not (the ones set by the programmer). We should be more consistent with these, ideally removing the colon, and make sure that wherever YAML is dumped by the program, it isn't prepending a colon to the start of the keys.
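
For illustration, a sketch of the two key styles as they might appear in etc/config.yml (the key and value are made up):

:cluster_type: openflight-slurm-standalone    # key with a leading colon (as dumped from a Ruby symbol key)
cluster_type: openflight-slurm-standalone     # plain string key, the preferred style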

Fix PATH error for `use_hunter` option

The use_hunter option currently forks out to Flight Hunter to fetch some extra node information. It assumes that the file specified in the config is executable and that the PATH environment variable is set up to be able to execute the file. If it isn't, the file can't be executed, and we face an error. We should improve the Config.command_path method to take some extra steps to make sure that the PATH is set correctly. It may be worth learning from the implementation that Flight Job uses.

Pass `--force` option into queue

If a node is queued with the --force option, the presence of the option isn't saved and thus the node won't be correctly re-applied after it leaves the queue. The queue already keeps the --remove-on-shutdown option, so the logic already exists and just needs --force added to the list of options.

Better handling for malformed node files

We need a better way of handling files in the var/inventory/ directory that don't conform to the expected node YAML structure. My preferred method for this is using a schema that we can check files against. The top result on Google for a YAML schema validator was last updated in 2009, so it may be a better idea to use a JSON validator and convert the YAML to JSON before validating.

Include IP in Ansible Inventory File in Hunter Mode

Currently ansible inventories are generated with the hunter hostname (translated from the label) such that with the following hunter list:

┌──────────┬────────┬────────────────────────────────┬─────────────┬────────────┐
│ ID       │ Label  │ Hostname                       │ IP          │ Groups     │
├──────────┼────────┼────────────────────────────────┼─────────────┼────────────┤
│ 320a5701 │ login1 │ hunterspeedup-login1.novalocal │ 10.50.1.87  │            │
│ 320a1701 │ node01 │ hunterspeedup-node-1.novalocal │ 10.50.1.23  │ nodes, all │
└──────────┴────────┴────────────────────────────────┴─────────────┴────────────┘

It would create an ansible inventory that looks like

[login]
hunterspeedup-login1.novalocal

[compute]
hunterspeedup-node-1.novalocal

This works absolutely fine in the context of the various cluster types when applying to a login node (itself) first, because the playbook adds hunter nodes to /etc/hosts so DNS works for the non-self nodes. However, it hits the issue that, when adding a new node, we aren't able to resolve the hostname upfront and the playbook fails immediately.

A proposal to fix this is to add the IP address known to hunter to the inventory file as well; this will allow resolution and cluster setup to continue (and means we can have a proper "add node to an existing cluster" workflow). The inventory file for the above hunter list would then look like the following:

[login]
hunterspeedup-login1.novalocal ansible_host=10.50.1.87

[compute]
hunterspeedup-node-1.novalocal ansible_host=10.50.1.23
