OpenShift-PSAP CI Artifacts

This repository contains Ansible roles and playbooks for OpenShift PSAP CI.

Performance & Latency Sensitive Application Platform


Quickstart

Requirements (on localhost):

  • Ansible >= 2.9.5
  • OpenShift Client (oc)
  • A kubeconfig file, referenced by the KUBECONFIG environment variable
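
The requirements above can be checked before running anything against a cluster. The following is a minimal sketch, not part of the repository: the function names `version_ge` and `check_requirements` are hypothetical, the version comparison assumes GNU `sort -V` is available, and the KUBECONFIG check assumes the variable holds a single file path.

```shell
# Sketch: check the Quickstart requirements on localhost.
# version_ge and check_requirements are hypothetical helper names.

# version_ge A B: succeed if version A is greater than or equal to version B.
# Relies on `sort -V` (version sort), which GNU coreutils provides.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_requirements: report every missing prerequisite on stderr,
# return non-zero if at least one is missing.
check_requirements() {
    req_status=0

    if command -v ansible >/dev/null 2>&1; then
        # Extract the first dotted number from the first line of `ansible --version`
        ansible_version=$(ansible --version | head -n 1 | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)
        if ! version_ge "$ansible_version" "2.9.5"; then
            echo "ansible '$ansible_version' is older than 2.9.5" >&2
            req_status=1
        fi
    else
        echo "ansible not found in PATH" >&2
        req_status=1
    fi

    if ! command -v oc >/dev/null 2>&1; then
        echo "oc (OpenShift Client) not found in PATH" >&2
        req_status=1
    fi

    # Assumption: KUBECONFIG points to a single file, not a colon-separated list
    if [ -z "${KUBECONFIG:-}" ] || [ ! -f "$KUBECONFIG" ]; then
        echo "KUBECONFIG is unset or does not point to a file" >&2
        req_status=1
    fi

    return "$req_status"
}
```

Calling `check_requirements` prints one line per missing prerequisite and returns a non-zero status if any check fails, so it can guard a script with `check_requirements || exit 1`.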

CI testing of the GPU Operator

The main goal of this repository is to perform nightly testing of the GPU Operator. This relies on several pieces:

  1. a container image definition;
  2. an entrypoint script that runs inside the container image;
  3. a set of config files and associated jobs for the Prow CI engine.

See there for the nightly CI results.

As an example, the nightly tests currently run commands such as:

run gpu-operator_test-operatorhub    # test the GPU Operator from OperatorHub installation
run gpu-operator_test-master-branch  # test the GPU Operator from its `master` branch
run gpu-operator_test-helm 1.4.0     # test the GPU Operator from Helm installation

These commands in turn trigger toolbox commands that prepare the cluster, install the relevant operators, and validate that the GPUs can be used successfully.

The toolbox commands are described in the section below.

GPU Operator toolbox

See the progress and discussions about the toolbox development in this issue.

GPU Operator

  • Deploy from OperatorHub
toolbox/gpu-operator/deploy_from_operatorhub.sh [<version>]
toolbox/gpu-operator/undeploy_from_operatorhub.sh
  • List the versions available from OperatorHub (not 100% reliable, the connection may time out)
toolbox/gpu-operator/list_version_from_operator_hub.sh

Usage:
  toolbox/gpu-operator/list_version_from_operator_hub.sh [<package-name> [<catalog-name>]]
  toolbox/gpu-operator/list_version_from_operator_hub.sh --help

Defaults:
  package-name: gpu-operator-certified
  catalog-name: certified-operators
  namespace: openshift-marketplace (controlled with NAMESPACE environment variable)
  • Deploy from Helm
toolbox/gpu-operator/list_version_from_helm.sh
toolbox/gpu-operator/deploy_from_helm.sh <helm-version>
toolbox/gpu-operator/undeploy_from_helm.sh
  • Deploy from a custom commit.
toolbox/gpu-operator/deploy_from_commit.sh <git repository> <git reference> [gpu_operator_image_tag_uid]
Example:
toolbox/gpu-operator/deploy_from_commit.sh https://github.com/NVIDIA/gpu-operator.git master
  • Wait for the GPU Operator deployment and validate it
toolbox/gpu-operator/wait_deployment.sh
  • Run GPU-burn to validate that all the GPUs of all the nodes can run workloads
toolbox/gpu-operator/run_gpu_burn.sh [gpu-burn runtime, in seconds]
  • Capture GPU operator possible issues (entitlement, NFD labelling, operator deployment, state of resources in gpu-operator-resources, ...)
toolbox/entitlement/test.sh
toolbox/nfd/has_nfd_labels.sh
toolbox/nfd/has_gpu_nodes.sh
toolbox/gpu-operator/wait_deployment.sh
toolbox/gpu-operator/run_gpu_burn.sh 30
toolbox/gpu-operator/capture_deployment_state.sh

or all in one step:

toolbox/gpu-operator/diagnose.sh
  • Uninstall and cleanup stalled resources
    • Helm (in particular) fails to deploy when any resource is left over from a previously failed deployment, e.g.:
Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: namespace: , name: gpu-operator, existing_kind: rbac.authorization.k8s.io/v1, Kind=ClusterRole, new_kind: rbac.authorization.k8s.io/v1, Kind=ClusterRole
  • This command ensures that the GPU Operator is fully undeployed from the cluster:
toolbox/gpu-operator/cleanup_resources.sh
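
The step-by-step diagnostic sequence listed above can be sketched as a small fail-fast wrapper. This is only an illustration, not the implementation of `toolbox/gpu-operator/diagnose.sh`: the names `run_steps`, `diagnose_manual`, and the `DRY_RUN` variable are hypothetical, and the sketch assumes each toolbox script exits with a non-zero status on failure.

```shell
# Sketch: run toolbox steps in order and stop at the first failure.
# run_steps, diagnose_manual and DRY_RUN are hypothetical names.

# run_steps STEP...: each STEP is a string of the form "script [args...]".
# With DRY_RUN=1, the steps are only printed, not executed.
run_steps() {
    for step in "$@"; do
        echo "==> $step"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            continue
        fi
        # Word splitting is intended here: it separates the script from its arguments
        if ! $step; then
            echo "step failed: $step" >&2
            return 1
        fi
    done
}

# The manual sequence from this section, in the same order.
diagnose_manual() {
    run_steps \
        "toolbox/entitlement/test.sh" \
        "toolbox/nfd/has_nfd_labels.sh" \
        "toolbox/nfd/has_gpu_nodes.sh" \
        "toolbox/gpu-operator/wait_deployment.sh" \
        "toolbox/gpu-operator/run_gpu_burn.sh 30" \
        "toolbox/gpu-operator/capture_deployment_state.sh"
}
```

Running `DRY_RUN=1 diagnose_manual` from the repository root prints the six steps without touching the cluster, which is a cheap way to review the order before a real run.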

NFD

  • Deploy the NFD operator from OperatorHub:
    • Control the install channel from the command-line
toolbox/nfd/deploy_from_operatorhub.sh [nfd_channel, eg: 4.7]
toolbox/nfd/undeploy_from_operatorhub.sh
  • Test the NFD deployment
    • test with the NFD if GPU nodes are available
    • wait with the NFD for GPU nodes to become available
toolbox/nfd/has_nfd_labels.sh
toolbox/nfd/has_gpu_nodes.sh
toolbox/nfd/wait_gpu_nodes.sh
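
Waiting for GPU nodes amounts to polling a check until it succeeds. The internals of `toolbox/nfd/wait_gpu_nodes.sh` are not shown here, so the following is only a generic polling sketch: `wait_until` is a hypothetical helper, and using it with `has_gpu_nodes.sh` assumes that script exits with a non-zero status while no GPU node is labelled.

```shell
# Sketch: poll a command until it succeeds or a timeout expires.
# wait_until is a hypothetical helper, not a toolbox command.

# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 once TIMEOUT_SECONDS have elapsed.
wait_until() {
    wait_timeout=$1
    wait_interval=$2
    shift 2
    wait_elapsed=0
    until "$@"; do
        if [ "$wait_elapsed" -ge "$wait_timeout" ]; then
            echo "timed out after ${wait_elapsed}s waiting for: $*" >&2
            return 1
        fi
        sleep "$wait_interval"
        wait_elapsed=$((wait_elapsed + wait_interval))
    done
}
```

For example, `wait_until 600 10 toolbox/nfd/has_gpu_nodes.sh` would retry every 10 seconds for up to 10 minutes. The elapsed time only counts the sleep intervals, so the real wall-clock wait is somewhat longer when the check itself is slow.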

Cluster

  • Add a GPU node on AWS
./toolbox/cluster/scaleup.sh
  • Specify a machine type in the command-line, and skip scale-up if a node with the given machine-type is already present
./toolbox/cluster/scaleup.sh <machine-type>
  • Entitle the cluster by passing a PEM file (the command checks whether the files should be concatenated or not, etc.), and do nothing if the cluster is already entitled
toolbox/entitlement/deploy.sh --pem /path/to/pem
toolbox/entitlement/deploy.sh --machine-configs /path/to/machineconfigs
toolbox/entitlement/undeploy.sh
toolbox/entitlement/test.sh [--no-inspect]
toolbox/entitlement/wait.sh
  • Capture all the clues required to understand entitlement issues
toolbox/entitlement/inspect.sh
  • Deployment of an entitled cluster
    • already coded, but we need to integrate this repo within the toolbox
    • deploy a cluster with 1 master node
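
The entitlement commands listed above can be chained into a deploy, wait, test sequence. This is a minimal sketch under stated assumptions: `entitle_cluster` and the `TOOLBOX_DIR` variable are hypothetical names, and the sketch assumes each entitlement script signals failure with a non-zero exit status.

```shell
# Sketch: deploy the entitlement from a PEM file, wait for it, then test it.
# entitle_cluster and TOOLBOX_DIR are hypothetical names, not part of the toolbox.
TOOLBOX_DIR=${TOOLBOX_DIR:-toolbox}

# entitle_cluster PEM_FILE
# Returns 2 if the PEM file is not readable, 1 if any toolbox step fails.
entitle_cluster() {
    pem_file=$1
    if [ ! -r "$pem_file" ]; then
        echo "PEM file not readable: $pem_file" >&2
        return 2
    fi
    # Stop at the first failing step
    "$TOOLBOX_DIR/entitlement/deploy.sh" --pem "$pem_file" || return 1
    "$TOOLBOX_DIR/entitlement/wait.sh" || return 1
    "$TOOLBOX_DIR/entitlement/test.sh" || return 1
}
```

From the repository root, `entitle_cluster /path/to/pem` runs the three scripts in order. If the final test fails, `toolbox/entitlement/inspect.sh` is the command listed above for collecting clues.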

CI

  • Build the image used for the Prow CI testing, and run a given command in the Pod
Usage:   toolbox/local-ci/deploy.sh <ci command> <git repository> <git reference> [gpu_operator_image_tag_uid]
Example: toolbox/local-ci/deploy.sh 'run gpu-ci' https://github.com/openshift-psap/ci-artifacts.git master

toolbox/local-ci/cleanup.sh
