
distributed-workloads's People

Contributors

accorvin, anishasthana, astefanutti, christianzaccaria, chughshilpa, eranra, fiona-waters, gregsheremeta, jbusche, jiripetrlik, kpostoffice, maxusmusti, michaelclifford, sutaakar, tedhtchang

distributed-workloads's Issues

Unable to deploy example kfdef to nonstandard namespace

With the current codeflare-stack-kfdef.yaml file, you are unable to deploy the stack into any namespace other than opendatahub. We have the namespace hardcoded in the kfdef.

This is at odds with the content in our quick start where we suggest that it is possible to override the default namespace.

IMO we should simply not specify the namespace in the kfdef.
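
Until the file is changed, the same idea can be tried by stripping the field on the fly. This is only a sketch: it assumes the namespace appears as a two-space-indented `namespace:` line under `metadata`, and `strip_namespace` and `my-namespace` are hypothetical names, not part of the repo.

```shell
# Drop the metadata.namespace line from a manifest read on stdin.
# Assumption: the field is indented by exactly two spaces, as in a
# typical KfDef; deeper-nested "namespace:" keys are left untouched.
strip_namespace() {
  sed '/^  namespace:[[:space:]]/d'
}

# Possible usage (not run here):
#   strip_namespace < codeflare-stack-kfdef.yaml | oc apply -n my-namespace -f -
```

With the field gone, the target namespace comes from `-n` alone instead of conflicting with a hardcoded value.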

Upgrade Makefile to install ODH 2.x

Right now the Makefile commands install ODH 1.x.
To integrate properly with the new version, we should change the Makefile commands to deploy ODH 2.x and its resources.

Doc: CodeFlare Operator with custom MCAD and Instascale images

Currently, when installing the CodeFlare Stack, the latest versions of MCAD and InstaScale get added to the cluster.

Issue

We want documentation on how to make use of custom CodeFlare stack images.

Suggested Resolution

  • Add documentation on how to install custom images of MCAD and/or InstaScale on an OpenShift or Kubernetes cluster.
  • Provide commands that can be used to efficiently replace the controllerImage value in a CRD with the path to the custom image.
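
As a starting point for the second bullet, here is a rough sketch of such a command. It assumes the manifest is plain YAML with `controllerImage:` on a line of its own; `set_controller_image`, the file name `mcad.yaml`, and the example image are all hypothetical.

```shell
# Replace the value of every "controllerImage:" key in YAML read on stdin.
# $1 is the custom image reference. '|' is used as the sed delimiter so
# that image paths containing '/' need no escaping; an image reference
# containing '|' or '&' would need extra escaping (not handled here).
set_controller_image() {
  sed "s|^\([[:space:]]*controllerImage:\).*|\1 $1|"
}

# Possible usage (not run here; file and image names are placeholders):
#   set_controller_image quay.io/example/mcad-controller:dev < mcad.yaml | oc apply -f -
```

Indentation of the key is preserved, so the same helper works wherever the field is nested.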

Tweak of E2E Notebook tests to improve performance

Changes:

  1. vi operator-tests/opendatahub-kubeflow/tests/resources/custom-nb-small.yaml

        resources:
          limits:
            cpu: "2"
            memory: 8Gi  --> to 3Gi
          requests:
            cpu: "1"
            memory: 8Gi  --> to 3Gi

  2. vi operator-tests/opendatahub-kubeflow/tests/resources/mnist_mcad_mini.ipynb

        gpu=0, cpu=1, memMB=8000
     to
        gpu=0, cpu=3, memMB=4000

  3. vi operator-tests/opendatahub-kubeflow/tests/resources/mnist_ray_mini.ipynb

        min_cpus=1, max_cpus=1, min_memory=4, max_memory=4, gpu=0
     to
        min_cpus=2, max_cpus=2, min_memory=4, max_memory=4, gpu=0

With these changes, the time to run the tests went from 1247 seconds to 697 seconds (note: this is with pre-cached images on the worker nodes).

With the above changes: ./run.sh took 697 seconds.

Without the above changes: mnist_mcad_mini.ipynb is really slow and can't complete within the 1200-second timeout; ./run.sh took 1247 seconds, and I'm pretty sure the Ray job wouldn't be able to start due to memory pressure.

Add make command to automatically update odh-manifests

We should add a make command to automatically update the manifests and tests in odh-manifests. The make command should raise a PR. We implemented something similar to this for the CodeFlare repository.

One tricky thing we'll need to be careful about is that the basictest files have minor differences in odh-manifests and this repository. The minor differences are mainly around the paths.

Fix the failing KubeRay tests running with downstream RHODS operator

Investigate and fix the following test failure when creating a RayJob with the downstream operator - might be related to [1]:

=== NAME  TestRayJobSubmissionRest
    ray_test.go:185: 
        Unexpected error:
            <*errors.errorString | 0xc0004bbe70>: 
            incorrect response code: 503 for creating Ray Job, response body: <html>
...
              <body>
                <div>
                  <h1>Application is not available</h1>
                  <p>The application is currently not serving requests at this endpoint. It may not have been started or is still starting.</p>
            
                  <div class="alert alert-info">
                    <p class="info">
                      Possible reasons you are seeing this page:
                    </p>
                    <ul>
                      <li>
                        <strong>The host doesn't exist.</strong>
                        Make sure the hostname was typed correctly and that a route matching this hostname exists.
                      </li>
                      <li>
                        <strong>The host exists, but doesn't have a matching path.</strong>
                        Check if the URL path was typed correctly and that the route was created using the desired path.
                      </li>
                      <li>
                        <strong>Route and path matches, but all pods are down.</strong>
                        Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.
                      </li>
...
            }
        occurred

[1] https://issues.redhat.com/browse/RHODS-11106
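
The 503 page says the application "may not have been started or is still starting", so one thing worth ruling out by hand is a route that simply is not ready yet. The helper below is a generic retry sketch for that kind of manual check, not the fix for the Go test; `retry` and `$ROUTE_HOST` are hypothetical names.

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns 1 if every attempt fails.
retry() {
  max="$1"; delay="$2"; shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Possible usage (not run here): poll the route until it stops returning
# an HTTP error; curl -f exits non-zero on responses such as 503.
#   retry 30 10 curl -ksf -o /dev/null "https://$ROUTE_HOST/"
```

If the route never becomes available even after a generous wait, readiness is probably not the cause and the linked issue is the better lead.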

Add support for creation of required CRs for Nvidia and NFD operators

Right now the Makefile commands that deploy the Nvidia and NFD (Node Feature Discovery) operators create only the operators themselves.
For proper functionality, the respective CRs need to be created too.

The goal is to automate creation of the CRs these operators need, so that the cluster can be used without any additional configuration.

Update KubeRay version to v0.5.0

We are currently installing KubeRay version v0.3.0. The latest available KubeRay is 0.5 -- we should upgrade to this version and confirm that no existing functionality is broken.

Separate InstaScale out from default KFDEF

Most users will always want MCAD available in their clusters, but they may not want InstaScale. We should separate out InstaScale into a separate overlay/manifest set so as to increase the flexibility of deployment.

CRD name changes in cleanup for quickstart docs

Small changes are needed in the CRD cleanup step of the quickstart docs.

Was:
schedulingspecs.mcad.ibm.com appwrappers.mcad.ibm.com quotasubtrees.ibm.com

Now:
quotasubtrees.quota.codeflare.dev
appwrappers.workload.codeflare.dev
schedulingspecs.workload.codeflare.dev
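
The rename can be applied mechanically. The sketch below maps the three old names to the new ones in whatever text is piped through it; `rename_crds` is a hypothetical helper, and it assumes the old names appear verbatim in the docs.

```shell
# Rewrite the old CRD names to the new ones in text read on stdin.
rename_crds() {
  sed -e 's/schedulingspecs\.mcad\.ibm\.com/schedulingspecs.workload.codeflare.dev/g' \
      -e 's/appwrappers\.mcad\.ibm\.com/appwrappers.workload.codeflare.dev/g' \
      -e 's/quotasubtrees\.ibm\.com/quotasubtrees.quota.codeflare.dev/g'
}

# Possible usage (not run here):
#   rename_crds < Quick-Start.md > Quick-Start.md.new
```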

Quick-Start.md links are 404

The last two links in Quick-Start.md result in a 404:

* [Submit batch jobs](https://github.com/project-codeflare/codeflare-sdk/tree/main/demo-notebooks/batch-job)
* [Run an interactive session](https://github.com/project-codeflare/codeflare-sdk/tree/main/demo-notebooks/interactive)

/kind bug
/kind documentation

Use Tagged version of codeflare-sdk stack

We broke odh-manifests CI earlier this week when the manifests picked up the latest release of the CodeFlare stack. This was due to some backwards-incompatible changes in the codeflare-sdk. We should update the deployed manifests so that they always use tagged versions of the images.
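
A check like the following could flag floating image references before they reach CI. It is only a sketch: `is_pinned` is a hypothetical helper, and it treats an explicit tag other than `latest`, or a digest, as pinned.

```shell
# Succeeds when the image reference is pinned by digest or carries an
# explicit tag other than "latest"; fails for untagged or ":latest" refs.
is_pinned() {
  case "$1" in
    *@sha256:*) return 0 ;;   # pinned by digest
  esac
  last="${1##*/}"             # final path component, e.g. "operator:v0.3.0"
  case "$last" in
    *:latest) return 1 ;;     # floating tag
    *:?*)     return 0 ;;     # explicit tag
    *)        return 1 ;;     # no tag, which defaults to latest
  esac
}

# Possible usage (not run here):
#   is_pinned "$image" || echo "unpinned image: $image"
```

Looking only at the final path component keeps a registry port such as `registry:5000/img` from being mistaken for a tag.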

make all-in-one hangs at Deleting NDF Operator

Any idea on how to debug this?

==> Deleting NDF Operator

oc delete subscription nfd -n openshift-nfd
Error from server (NotFound): subscriptions.operators.coreos.com "nfd" not found
make[1]: [Makefile:113: delete-ndf-operator] Error 1 (ignored)
export CLUSTER_SERVICE_VERSION=`oc get clusterserviceversion -n openshift-nfd -l operators.coreos.com/nfd.openshift-nfd -o custom-columns=:metadata.name`; \
oc delete clusterserviceversion $CLUSTER_SERVICE_VERSION -n openshift-nfd
error: resource(s) were provided, but no name was specified
make[1]: [Makefile:114: delete-ndf-operator] Error 1 (ignored)
oc delete ns openshift-nfd

The hang isn't caused by anything that runs after the delete, because manually entering that last command causes the hang too.
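
Separately from the hang, the "no name was specified" error in the log suggests `$CLUSTER_SERVICE_VERSION` was empty when the delete ran. A guard along these lines would skip the delete in that case; `delete_csv` is a hypothetical helper, not something in the Makefile today.

```shell
# delete_csv NAMESPACE NAMES: delete the named ClusterServiceVersions,
# or skip cleanly when NAMES is empty or only whitespace.
delete_csv() {
  ns="$1"
  # Unquoted on purpose: word splitting drops blank lines and padding
  # left over from "-o custom-columns=:metadata.name" output.
  set -- $2
  if [ "$#" -eq 0 ]; then
    echo "no ClusterServiceVersion found in $ns, skipping"
    return 0
  fi
  oc delete clusterserviceversion "$@" -n "$ns"
}

# Possible usage (not run here):
#   delete_csv openshift-nfd "$(oc get clusterserviceversion -n openshift-nfd \
#     -l operators.coreos.com/nfd.openshift-nfd -o custom-columns=:metadata.name)"
#
# For the hanging namespace delete itself, the namespace's status
# conditions usually report what is still being finalized:
#   oc get namespace openshift-nfd -o yaml
```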

ray.sh problem during E2E testing

I didn't notice on Friday that the ray.sh script was failing:

./run.sh 
++++++ /root/JIM/peak/operator-tests/opendatahub-kubeflow/ray/tests/basictests
Now using project "basictests-bafq" on server "https://api.jimbig412.cp.fyre.ibm.com:6443".

++++ /root/JIM/peak/operator-tests/opendatahub-kubeflow/ray/tests/basictests/ray.sh
./run.sh: line 124: /root/JIM/peak/operator-tests/opendatahub-kubeflow/ray/tests/basictests/ray.sh: Permission denied
failed: /root/JIM/peak/operator-tests/opendatahub-kubeflow/ray/tests/basictests/ray.sh

@MichaelClifford do you think it's OK to remove the operator-tests/opendatahub-kubeflow/ray/tests/basictests/ray.sh? I can create a quick PR to remove it...
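
If the script turns out to be worth keeping, the "Permission denied" points at a missing execute bit rather than a broken script. A quick way to spot such files, as a sketch (`list_nonexecutable_scripts` is a hypothetical helper):

```shell
# Print every *.sh file under directory $1 whose owner-execute bit is not set.
list_nonexecutable_scripts() {
  find "$1" -type f -name '*.sh' ! -perm -u+x
}

# Possible usage (not run here):
#   list_nonexecutable_scripts operator-tests/opendatahub-kubeflow
# A file it reports can be fixed locally with chmod +x, and the mode
# change recorded in git with: git update-index --chmod=+x <path>
```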

Kuberay operator image is pulled from Docker Hub

I just created a KfDef with the codeflare-stack and ray-operator manifests, but the kuberay-operator pod is failing on OpenShift because of the Docker Hub pull rate limit.

Failed to pull image "kuberay/operator:v0.3.0": rpc error: code = Unknown desc = initializing source docker://kuberay/operator:v0.3.0: reading manifest v0.3.0 in docker.io/kuberay/operator: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

Investigate InstaScale testing in the E2E test

The InstaScale controller is deployed as part of the codeflare-stack, but OpenShift CI does not create the resources generated by the InstaScale controller. Investigate how to run a test against the InstaScale controller without creating resources.
