opendatahub-io / distributed-workloads
Artifacts for installing the Distributed Workloads stack as part of ODH
License: Apache License 2.0
With the current codeflare-stack-kfdef.yaml file, you are unable to deploy the stack into any namespace other than opendatahub, because the namespace is hardcoded in the KfDef.
This is at odds with the content in our quick start, where we suggest that it is possible to override the default namespace.
IMO we should simply not specify the namespace in the KfDef.
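A minimal sketch of what leaving the namespace out could look like (the application names and repo URI below are illustrative, not the actual file contents):

```yaml
# Sketch only: a KfDef with no hardcoded namespace, so the namespace is taken
# from wherever the KfDef is created (e.g. `oc apply -n <ns>`).
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: codeflare-stack
  # no namespace field here
spec:
  applications:
    - name: codeflare-stack
      kustomizeConfig:
        repoRef:
          name: manifests
          path: codeflare-stack
  repos:
    - name: manifests
      uri: https://github.com/opendatahub-io/odh-manifests/tarball/master
```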
Currently the installation of ODH and CodeFlare is done manually; see the quickstart.
The purpose of this task is to create a make command that sets up ODH and CodeFlare on an available OpenShift instance.
The command should ideally target the new ODH operator.
This automation will be used in tests to set up the environment.
The current Makefile allows installing the CodeFlare tools using a KfDef. However, to be able to test the current repository content, we should be able to install the CodeFlare tools directly from the files in the repository.
Some of the tests are problematic on Mac. Here are the examples:
Based on the changes made in project-codeflare/codeflare-sdk#296, the file paths and contents of the quickstart will need to be updated accordingly.
Right now the Makefile commands install ODH 1.x.
To properly integrate with the new version, we should change the Makefile commands to deploy ODH 2.x and its resources.
Currently there is no validation of the Go code used in tests for PRs.
The purpose of this issue is to implement a validation making sure that test changes brought in by a PR compile and that the code is properly formatted.
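One possible shape for this validation, sketched as a GitHub Actions workflow (the file name, Go version, and `tests/` path are assumptions, not the repository's actual layout):

```yaml
# .github/workflows/verify-tests.yaml (sketch)
name: Verify test code
on: [pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version: '1.20'
      # Fail if any test file is not gofmt-formatted
      - run: test -z "$(gofmt -l tests/)"
      # Fail if the test code does not compile
      - run: go vet ./tests/...
```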
Original ask:
We will want to update the deployment artifacts such that users can specify resource requirements within the kfdef itself.
Hopefully this is something that's supported in the current CRDs...
Create issue templates, similar to what was done for Project CodeFlare: https://github.com/project-codeflare/.github/tree/main/.github/ISSUE_TEMPLATE
A new version of the KubeRay operator is out: https://github.com/ray-project/kuberay/tree/v1.0.0-rc.0/
It needs testing...
In order for an OpenShift cluster to discover GPU hardware, a ClusterPolicy CR must be created; the defaults will do. Add this to both quickstarts, V1 & V2.
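As a sketch, a defaults-only ClusterPolicy for the NVIDIA GPU Operator can be as small as the following (an empty spec is accepted by recent GPU Operator versions; verify against the installed CRD before adding it to the quickstarts):

```yaml
# Sketch: ClusterPolicy relying entirely on operator defaults
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec: {}
```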
Currently when installing the CodeFlare stack, the latest versions of MCAD and InstaScale get added to the cluster.
We want documentation on how to make use of custom CodeFlare stack images, i.e., how to set the controllerImage value in a CRD to the path to the custom image.
Create a downstream Jenkins job for running the integration tests from [1] with the downstream CodeFlare and RHODS operators.
This time the test scope is not limited to KubeRay.
[1] https://github.com/opendatahub-io/distributed-workloads/tree/main/tests/integration
Changes:
vi operator-tests/opendatahub-kubeflow/tests/resources/custom-nb-small.yaml
resources:
  limits:
    cpu: "2"
    memory: 8Gi --> 3Gi
  requests:
    cpu: "1"
    memory: 8Gi --> 3Gi

gpu=0, cpu=1, memMB=8000
to
gpu=0, cpu=3, memMB=4000

min_cpus=1, max_cpus=1, min_memory=4, max_memory=4, gpu=0
to
min_cpus=2, max_cpus=2, min_memory=4, max_memory=4, gpu=0
With these changes, here is how the test runtime changed (note: this is with pre-cached images on the worker nodes):
With the above changes: ./run.sh took 697 seconds.
Without the above changes: it's really slow for mnist_mcad_mini.ipynb and can't complete within the 1200-second timeout; ./run.sh took 1247 seconds, and I'm pretty sure the Ray job wouldn't be able to start due to memory pressure.
We should add a make command to automatically update the manifests and tests in odh-manifests. The make command should raise a PR. We implemented something similar to this for the CodeFlare repository.
One tricky thing we'll need to be careful about is that the basictest files have minor differences in odh-manifests and this repository. The minor differences are mainly around the paths.
Investigate and fix the following test failure when creating a RayJob with the downstream operator - might be related to [1]:
=== NAME TestRayJobSubmissionRest
ray_test.go:185:
Unexpected error:
<*errors.errorString | 0xc0004bbe70>:
incorrect response code: 503 for creating Ray Job, response body: <html>
...
<body>
<div>
<h1>Application is not available</h1>
<p>The application is currently not serving requests at this endpoint. It may not have been started or is still starting.</p>
<div class="alert alert-info">
<p class="info">
Possible reasons you are seeing this page:
</p>
<ul>
<li>
<strong>The host doesn't exist.</strong>
Make sure the hostname was typed correctly and that a route matching this hostname exists.
</li>
<li>
<strong>The host exists, but doesn't have a matching path.</strong>
Check if the URL path was typed correctly and that the route was created using the desired path.
</li>
<li>
<strong>Route and path matches, but all pods are down.</strong>
Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.
</li>
...
}
occurred
Add support for installing the RHODS CodeFlare operator in a similar way to how the RHODS operator is installed in the downstream QE tooling. The goal is to be able to install the downstream CodeFlare operator as a prerequisite for the RHODS operator in the downstream Jenkins job.
Right now the Makefile commands to deploy the NVIDIA and NFD operators create just the operators themselves.
For proper functionality, the respective CRs need to be created too.
The goal is to implement automated creation of the needed CRs for the operators above, so the cluster can be used without any additional configuration.
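For NFD, the CR to automate is likely a default NodeFeatureDiscovery instance along these lines (a sketch; the name and any spec fields should be checked against the operator's documentation):

```yaml
# Sketch: default NodeFeatureDiscovery instance for the NFD operator
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec: {}
```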
We are currently installing KubeRay version v0.3.0. The latest available KubeRay is 0.5 -- we should upgrade to this version and confirm that no existing functionality is broken.
Most users will always want MCAD available in their clusters, but they may not want InstaScale. We should separate out InstaScale into a separate overlay/manifest set so as to increase the flexibility of deployment.
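A sketch of one possible split, assuming kustomize-style manifests (the directory and resource file names below are hypothetical):

```yaml
# base/kustomization.yaml - MCAD only (sketch)
resources:
  - mcad-deployment.yaml
---
# overlays/instascale/kustomization.yaml - opt-in InstaScale layered on the base
resources:
  - ../../base
  - instascale-deployment.yaml
```

Users who want MCAD alone would apply the base; those who also want InstaScale would apply the overlay.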
Small changes needed for CRD cleanup in the quickstart docs...
WAS:
schedulingspecs.mcad.ibm.com
appwrappers.mcad.ibm.com
quotasubtrees.ibm.com
but now is:
quotasubtrees.quota.codeflare.dev
appwrappers.workload.codeflare.dev
schedulingspecs.workload.codeflare.dev
It would be nice to try installing the codeflare-operator as part of the codeflare-kfdef, because currently these are 2 separate steps.
Here is an example to do this.
It looks like the manifest used for testing KubeRay (see https://github.com/opendatahub-io/distributed-workloads/tree/main/ray#create-a-test-cluster) is not available in the repo.
/kind bug
/kind documentation
The last two links in Quick-Start.md result in a 404:
* [Submit batch jobs](https://github.com/project-codeflare/codeflare-sdk/tree/main/demo-notebooks/batch-job)
* [Run an interactive session](https://github.com/project-codeflare/codeflare-sdk/tree/main/demo-notebooks/interactive)
/kind bug
/kind documentation
This should include supporting artifacts such as notebooks, templates, etc.
Confirm that the pod is gone
Investigate the options to run the Distributed workloads tests with the upstream or downstream operator while reusing as much of the automation and infrastructure as possible.
We broke odh-manifests CI earlier this week with the latest release of the CodeFlare stack. This was due to some backwards-incompatible changes in the codeflare-sdk. We should update the deployed manifests so that it's always a tagged version of the images being used.
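One way to enforce this, sketched as a kustomize image override (the image name and tag below are placeholders, not the actual release):

```yaml
# kustomization.yaml fragment (sketch): pin images to a tagged release
# instead of a floating :latest
images:
  - name: quay.io/project-codeflare/codeflare-operator  # placeholder name
    newTag: v1.0.0                                      # placeholder tag
```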
Add smoke UI test for Ray dashboard
Part of the CodeFlare operator re-design, as described in project-codeflare/adr#9.
Accommodate documentation, according to the new design of the operator, e.g., removing references to the removed InstaScale and MCAD resources and deployments.
Part of the CodeFlare operator re-design, as described in project-codeflare/adr#9.
Accommodate manifests, based on the new design of the operator, e.g., removing references to the removed InstaScale and MCAD resources.
Any idea on how to debug this?
==> Deleting NDF Operator
oc delete subscription nfd -n openshift-nfd
Error from server (NotFound): subscriptions.operators.coreos.com "nfd" not found
make[1]: [Makefile:113: delete-ndf-operator] Error 1 (ignored)
export CLUSTER_SERVICE_VERSION=`oc get clusterserviceversion -n openshift-nfd -l operators.coreos.com/nfd.openshift-nfd -o custom-columns=:metadata.name`; \
oc delete clusterserviceversion $CLUSTER_SERVICE_VERSION -n openshift-nfd
error: resource(s) were provided, but no name was specified
make[1]: [Makefile:114: delete-ndf-operator] Error 1 (ignored)
oc delete ns openshift-nfd
It's not something after the delete that's causing the issue, because manually entering that last command causes the hang too.
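The second error ("resource(s) were provided, but no name was specified") suggests the CLUSTER_SERVICE_VERSION lookup came back empty (the subscription was already gone), so oc delete ran without a name. A guarded sketch of the Makefile recipe logic; the echo stands in for the real oc call:

```shell
# Sketch: skip CSV deletion when the lookup returns nothing, instead of
# calling `oc delete clusterserviceversion` with an empty name.
# The echo stands in for the real oc invocation.
delete_csv() {
  csv="$1"
  if [ -n "$csv" ]; then
    echo "oc delete clusterserviceversion $csv -n openshift-nfd"
  else
    echo "skip: no CSV found"
  fi
}

delete_csv ""          # prints: skip: no CSV found
delete_csv "nfd.v4.12" # prints the delete command for that CSV name
```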
Verify that the current tests in https://github.com/opendatahub-io/distributed-workloads/tree/main/tests cover the expected use cases. In case of missing use cases, implement them.
As part of the task, revise the current test implementation and evaluate whether it is meaningful and easy to use, or whether it makes sense to refactor it (i.e., to match https://github.com/project-codeflare/codeflare-operator/tree/main/test/e2e).
I avoided cloning stuff into the codeflare-sdk repository when I was working through the tutorial, just as a matter of habit. A short comment stating to cd into the codeflare-sdk repo if you are not already there might be helpful.
I didn't notice on Friday that the ray.sh script was failing:
./run.sh
++++++ /root/JIM/peak/operator-tests/opendatahub-kubeflow/ray/tests/basictests
Now using project "basictests-bafq" on server "https://api.jimbig412.cp.fyre.ibm.com:6443".
++++ /root/JIM/peak/operator-tests/opendatahub-kubeflow/ray/tests/basictests/ray.sh
./run.sh: line 124: /root/JIM/peak/operator-tests/opendatahub-kubeflow/ray/tests/basictests/ray.sh: Permission denied
failed: /root/JIM/peak/operator-tests/opendatahub-kubeflow/ray/tests/basictests/ray.sh
@MichaelClifford do you think it's OK to remove the operator-tests/opendatahub-kubeflow/ray/tests/basictests/ray.sh
? I can create a quick PR to remove it...
The existing Quick-Start.md is a little confusing on a few steps for new users, and also lacks a Clean-Up section at the end. I'd like to try to get both into the Quick-Start.
fyi: @asm582
Part of the CodeFlare operator re-design, as described in project-codeflare/adr#9.
Accommodate basic and integration tests, based on the new design of the operator, e.g., removing references to the removed InstaScale and MCAD resources and deployments.
Following the Quick Start Guide, I was stuck on the installation until I noticed that the codeflare operator needs to be installed separately. I see it as the first step of the installation docs, but for visibility I suggest making it a topic of the Prerequisites section.
I just created a KfDef with the codeflare-stack and ray-operator manifests, but the kuberay-operator pod is failing on OpenShift because of the rate limit.
Failed to pull image "kuberay/operator:v0.3.0": rpc error: code = Unknown desc = initializing source docker://kuberay/operator:v0.3.0: reading manifest v0.3.0 in docker.io/kuberay/operator: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
The RC 1 candidate of KubeRay 0.6.0 is available: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1689895388577229
I'd like to try a test PR of v0.6.0-rc.1 to make sure that there are no issues with CodeFlare.
Operator pod should come up
The InstaScale controller is deployed as part of the codeflare-stack, but OpenShift CI does not create the resources generated by the InstaScale controller. Investigate how to run a test against the InstaScale controller without creating resources.