opendatahub-io / distributed-workloads
Artifacts for installing the Distributed Workloads stack as part of ODH
License: Apache License 2.0
With the current codeflare-stack-kfdef.yaml file, you are unable to deploy the stack into any namespace other than opendatahub, because the namespace is hardcoded in the KfDef.
This is at odds with the content in our quick start, where we suggest that it is possible to override the default namespace.
IMO we should simply not specify the namespace in the KfDef.
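A minimal sketch of what leaving the namespace out could look like (the application names and repo URI below are illustrative, not the actual file contents):

```yaml
# Sketch only: a KfDef with no hardcoded namespace, so the namespace is taken
# from wherever the KfDef is created (e.g. `oc apply -n <ns>`).
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: codeflare-stack
  # no namespace field here
spec:
  applications:
    - name: codeflare-stack
      kustomizeConfig:
        repoRef:
          name: manifests
          path: codeflare-stack
  repos:
    - name: manifests
      uri: https://github.com/opendatahub-io/odh-manifests/tarball/master
```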
Currently the installation of ODH and CodeFlare is done manually; see the quickstart.
The purpose of this task is to create a make command that sets up ODH and CodeFlare on an available OpenShift instance.
The command should ideally target the new ODH operator.
This automation will be used in tests to set up the environment.
The current Makefile allows installing the CodeFlare tools using a KfDef. However, to be able to test the current repository content, we should be able to install the CodeFlare tools directly from the files in the repository.
Some of the tests are problematic on Mac. Here are the examples:
Based on the changes made in project-codeflare/codeflare-sdk#296, the file paths and contents of the quickstart will need to be updated accordingly.
Right now the Makefile commands install ODH 1.x.
To properly integrate with the new version, we should change the Makefile commands to deploy ODH 2.x and its resources.
Currently there is no validation of the Go code used in tests for PRs.
The purpose of this issue is to implement a validation making sure that test changes brought in by a PR compile and that the code is properly formatted.
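One possible shape for this validation, sketched as a GitHub Actions workflow (the file name, Go version, and `tests/` path are assumptions, not the repository's actual layout):

```yaml
# .github/workflows/verify-tests.yaml (sketch)
name: Verify test code
on: [pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version: '1.20'
      # Fail if any test file is not gofmt-formatted
      - run: test -z "$(gofmt -l tests/)"
      # Fail if the test code does not compile
      - run: go vet ./tests/...
```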
Original ask:
We will want to update the deployment artifacts such that users can specify resource requirements within the kfdef itself.
Hopefully this is something that's supported in the current CRDs...
Create issue templates, similar to what was done for Project CodeFlare: https://github.com/project-codeflare/.github/tree/main/.github/ISSUE_TEMPLATE
A new version of the KubeRay operator is out: https://github.com/ray-project/kuberay/tree/v1.0.0-rc.0/
It needs testing...
In order for an OpenShift cluster to discover GPU hardware, a ClusterPolicy CR must be created; the defaults will do. Add this to both quickstarts, V1 & V2.
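As a sketch, a defaults-only ClusterPolicy for the NVIDIA GPU Operator can be as small as the following (an empty spec is accepted by recent GPU Operator versions; verify against the installed CRD before adding it to the quickstarts):

```yaml
# Sketch: ClusterPolicy relying entirely on operator defaults
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec: {}
```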
Currently when installing the CodeFlare stack, the latest versions of MCAD and InstaScale get added to the cluster.
We want documentation on how to make use of custom CodeFlare stack images, i.e., how to set the controllerImage value in a CRD to the path to the custom image.
Create a downstream Jenkins job for running the integration tests from [1] with the downstream CodeFlare and RHODS operators.
This time the test scope is not limited to KubeRay.
[1] https://github.com/opendatahub-io/distributed-workloads/tree/main/tests/integration
Changes:
vi operator-tests/opendatahub-kubeflow/tests/resources/custom-nb-small.yaml
resources:
  limits:
    cpu: "2"
    memory: 8Gi --> 3Gi
  requests:
    cpu: "1"
    memory: 8Gi --> 3Gi

gpu=0, cpu=1, memMB=8000
to
gpu=0, cpu=3, memMB=4000

min_cpus=1, max_cpus=1, min_memory=4, max_memory=4, gpu=0
to
min_cpus=2, max_cpus=2, min_memory=4, max_memory=4, gpu=0
With these changes, here is how the test runtime changed (note: this is with pre-cached images on the worker nodes):
With the above changes: ./run.sh took 697 seconds.
Without the above changes: it's really slow for mnist_mcad_mini.ipynb and can't complete within the 1200-second timeout; ./run.sh took 1247 seconds, and I'm pretty sure the Ray job wouldn't be able to start due to memory pressure.
We should add a make command to automatically update the manifests and tests in odh-manifests. The make command should raise a PR. We implemented something similar to this for the CodeFlare repository.
One tricky thing we'll need to be careful about is that the basictest files have minor differences in odh-manifests and this repository. The minor differences are mainly around the paths.
Investigate and fix the following test failure when creating a RayJob with the downstream operator - might be related to [1]:
=== NAME TestRayJobSubmissionRest
ray_test.go:185:
Unexpected error:
<*errors.errorString | 0xc0004bbe70>:
incorrect response code: 503 for creating Ray Job, response body: <html>
...
<body>
<div>
<h1>Application is not available</h1>
<p>The application is currently not serving requests at this endpoint. It may not have been started or is still starting.</p>
<div class="alert alert-info">
<p class="info">
Possible reasons you are seeing this page:
</p>
<ul>
<li>
<strong>The host doesn't exist.</strong>
Make sure the hostname was typed correctly and that a route matching this hostname exists.
</li>
<li>
<strong>The host exists, but doesn't have a matching path.</strong>
Check if the URL path was typed correctly and that the route was created using the desired path.
</li>
<li>
<strong>Route and path matches, but all pods are down.</strong>
Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.
</li>
...
}
occurred
Add support for installing the RHODS CodeFlare operator in a similar way to how the RHODS operator is installed in the downstream QE tooling. The goal is to be able to install the downstream CodeFlare operator as a prerequisite for the RHODS operator in the downstream Jenkins job.
Right now the Makefile commands to deploy the NVIDIA and NFD operators create just the operators themselves.
For proper functionality, the respective CRs need to be created too.
The goal is to implement automated creation of the needed CRs for the operators above, so the cluster can be used without any additional configuration.
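For NFD, the CR to automate is likely a default NodeFeatureDiscovery instance along these lines (a sketch; the name and any spec fields should be checked against the operator's documentation):

```yaml
# Sketch: default NodeFeatureDiscovery instance for the NFD operator
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec: {}
```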
We are currently installing KubeRay version v0.3.0. The latest available KubeRay is 0.5 -- we should upgrade to this version and confirm that no existing functionality is broken.
Most users will always want MCAD available in their clusters, but they may not want InstaScale. We should separate out InstaScale into a separate overlay/manifest set so as to increase the flexibility of deployment.
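A sketch of one possible split, assuming kustomize-style manifests (the directory and resource file names below are hypothetical):

```yaml
# base/kustomization.yaml - MCAD only (sketch)
resources:
  - mcad-deployment.yaml
---
# overlays/instascale/kustomization.yaml - opt-in InstaScale layered on the base
resources:
  - ../../base
  - instascale-deployment.yaml
```

Users who want MCAD alone would apply the base; those who also want InstaScale would apply the overlay.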
Small changes needed for CRD cleanup in the quickstart docs...
WAS:
schedulingspecs.mcad.ibm.com
appwrappers.mcad.ibm.com
quotasubtrees.ibm.com
but now is:
quotasubtrees.quota.codeflare.dev
appwrappers.workload.codeflare.dev
schedulingspecs.workload.codeflare.dev
It would be nice to try installing the codeflare-operator as part of the codeflare-kfdef, because currently these are 2 separate steps.
Here is an example to do this.
It looks like the manifest used for testing KubeRay (see https://github.com/opendatahub-io/distributed-workloads/tree/main/ray#create-a-test-cluster) is not available in the repo.
/kind bug
/kind documentation
The last two links in Quick-Start.md result in a 404:
* [Submit batch jobs](https://github.com/project-codeflare/codeflare-sdk/tree/main/demo-notebooks/batch-job)
* [Run an interactive session](https://github.com/project-codeflare/codeflare-sdk/tree/main/demo-notebooks/interactive)
/kind bug
/kind documentation
This should include supporting artifacts such as notebooks, templates, etc.
Confirm that the pod is gone
Investigate the options to run the Distributed workloads tests with the upstream or downstream operator while reusing as much of the automation and infrastructure as possible.
We broke odh-manifests CI earlier this week with the latest release of the CodeFlare stack. This was due to some backwards-incompatible changes in the codeflare-sdk. We should update the deployed manifests so that it's always a tagged version of the images being used.
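One way to enforce this, sketched as a kustomize image override (the image name and tag below are placeholders, not the actual release):

```yaml
# kustomization.yaml fragment (sketch): pin images to a tagged release
# instead of a floating :latest
images:
  - name: quay.io/project-codeflare/codeflare-operator  # placeholder name
    newTag: v1.0.0                                      # placeholder tag
```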
Add smoke UI test for Ray dashboard
Part of the CodeFlare operator re-design, as described in project-codeflare/adr#9.
Accommodate documentation, according to the new design of the operator, e.g., removing references to the removed InstaScale and MCAD resources and deployments.
Part of the CodeFlare operator re-design, as described in project-codeflare/adr#9.
Accommodate manifests, based on the new design of the operator, e.g., removing references to the removed InstaScale and MCAD resources.
Any idea on how to debug this?
==> Deleting NDF Operator
oc delete subscription nfd -n openshift-nfd
Error from server (NotFound): subscriptions.operators.coreos.com "nfd" not found
make[1]: [Makefile:113: delete-ndf-operator] Error 1 (ignored)
export CLUSTER_SERVICE_VERSION=`oc get clusterserviceversion -n openshift-nfd -l operators.coreos.com/nfd.openshift-nfd -o custom-columns=:metadata.name`; \
oc delete clusterserviceversion $CLUSTER_SERVICE_VERSION -n openshift-nfd
error: resource(s) were provided, but no name was specified
make[1]: [Makefile:114: delete-ndf-operator] Error 1 (ignored)
oc delete ns openshift-nfd
It's not something after the delete that's causing the issue, because manually entering that last command causes the hang too.
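The second error ("resource(s) were provided, but no name was specified") suggests the CLUSTER_SERVICE_VERSION lookup came back empty (the subscription was already gone), so oc delete ran without a name. A guarded sketch of the Makefile recipe logic; the echo stands in for the real oc call:

```shell
# Sketch: skip CSV deletion when the lookup returns nothing, instead of
# calling `oc delete clusterserviceversion` with an empty name.
# The echo stands in for the real oc invocation.
delete_csv() {
  csv="$1"
  if [ -n "$csv" ]; then
    echo "oc delete clusterserviceversion $csv -n openshift-nfd"
  else
    echo "skip: no CSV found"
  fi
}

delete_csv ""          # prints: skip: no CSV found
delete_csv "nfd.v4.12" # prints the delete command for that CSV name
```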
Verify that the current tests in https://github.com/opendatahub-io/distributed-workloads/tree/main/tests cover the expected use cases. In case of missing use cases, implement them.
As part of the task, revise the current test implementation and evaluate whether it is meaningful and easy to use, or whether it makes sense to refactor it (i.e., to match https://github.com/project-codeflare/codeflare-operator/tree/main/test/e2e).
I avoided cloning stuff into the codeflare-sdk repository when I was working through the tutorial, just as a matter of habit. A short comment stating to cd into the codeflare-sdk repo if you are not already there might be helpful.
I didn't notice on Friday that the ray.sh script was failing:
./run.sh
++++++ /root/JIM/peak/operator-tests/opendatahub-kubeflow/ray/tests/basictests
Now using project "basictests-bafq" on server "https://api.jimbig412.cp.fyre.ibm.com:6443".
++++ /root/JIM/peak/operator-tests/opendatahub-kubeflow/ray/tests/basictests/ray.sh
./run.sh: line 124: /root/JIM/peak/operator-tests/opendatahub-kubeflow/ray/tests/basictests/ray.sh: Permission denied
failed: /root/JIM/peak/operator-tests/opendatahub-kubeflow/ray/tests/basictests/ray.sh
@MichaelClifford do you think it's OK to remove the operator-tests/opendatahub-kubeflow/ray/tests/basictests/ray.sh
? I can create a quick PR to remove it...
The existing Quick-Start.md is a little confusing on a few steps for new users, and also lacks a Clean-Up section at the end. I'd like to try to get both into the Quick-Start.
fyi: @asm582
Part of the CodeFlare operator re-design, as described in project-codeflare/adr#9.
Accommodate basic and integration tests, based on the new design of the operator, e.g., removing references to the removed InstaScale and MCAD resources and deployments.
Following the Quick Start Guide, I was stuck on the installation until I noticed that the codeflare operator needs to be installed separately. I see it as the first step of the installation docs, but for visibility I suggest making it a topic of the Prerequisites section.
I just created a KfDef with the codeflare-stack and ray-operator manifests, but the kuberay-operator pod is failing on OpenShift because of the rate limit.
Failed to pull image "kuberay/operator:v0.3.0": rpc error: code = Unknown desc = initializing source docker://kuberay/operator:v0.3.0: reading manifest v0.3.0 in docker.io/kuberay/operator: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
The RC 1 candidate of KubeRay 0.6.0 is available: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1689895388577229
I'd like to try a test PR of v0.6.0-rc.1 to make sure that there are no issues with CodeFlare.
Operator pod should come up
The InstaScale controller is deployed as part of the codeflare-stack, but OpenShift CI does not create the resources generated by the InstaScale controller. Investigate how to run a test against the InstaScale controller without creating resources.