
seldon-core-operator's People

Contributors

agathanatasha, beliaev-maksim, ca-scribner, colmbhandal, dnplas, i-chvets, kimwnasptd, knkski, kwmonroe, lucabello, misohu, natalian98, nohaihab, orfeas-k, phoevos, phvalguima, rbarry82, renovate[bot], zeeshanali

seldon-core-operator's Issues

Seldon Core models deployed documentation points to incorrect URL

Reproduce:

  1. Deploy any model in Kubeflow's admin namespace
  2. Go to the endpoint of the model (use port-forward or expose the model using an Istio VirtualService)
  3. Try using the example.

The documentation is incorrect: the path prefix is different, and we use Istio rather than Ambassador.
Screenshot from 2023-03-17 14-02-53
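For reference, with Istio the model is typically reached through the ingress gateway under a /seldon/<namespace>/<deployment-name>/ prefix rather than the Ambassador path shown upstream. A hedged sketch of the corrected request (gateway address, namespace and deployment name below are placeholders, not values from this report):

import requests

# Placeholder values for illustration only; substitute your own.
INGRESS = "http://10.64.140.43"     # istio-ingressgateway LoadBalancer address
NAMESPACE = "admin"                 # Kubeflow user namespace
DEPLOYMENT = "seldon-model"         # SeldonDeployment name

# Seldon models behind Istio are usually exposed under
# /seldon/<namespace>/<deployment>/api/v1.0/predictions
url = f"{INGRESS}/seldon/{NAMESPACE}/{DEPLOYMENT}/api/v1.0/predictions"
payload = {"data": {"ndarray": [[1.0, 2.0, 5.0]]}}

resp = requests.post(url, json=payload, timeout=30)
print(resp.status_code, resp.json())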

`test_remove_with_resources_present` fails

test_remove_with_resources_present fails on the HEAD of main

So far, the issue has been narrowed down to this: only one of the three CRD-creating tests may run beforehand for the failure not to occur.
That means that only one of the following tests is allowed before test_remove_with_resources_present:

  • test_seldon_predictor_server[sklearn.yaml]
  • test_seldon_predictor_server[sklearn2.yaml]
  • test_seldon_deployment

If at least two of the tests from the list above run, then test_remove_with_resources_present fails.
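One hedged way to make the test order-independent would be to clean up SeldonDeployments created by earlier tests before test_remove_with_resources_present runs. A sketch using lightkube and the pytest-operator ops_test fixture (the fixture name and namespace handling are assumptions, not existing test code):

import pytest
from lightkube import Client
from lightkube.generic_resource import create_namespaced_resource

# SeldonDeployment CRD as a generic lightkube resource
SeldonDeployment = create_namespaced_resource(
    "machinelearning.seldon.io", "v1", "SeldonDeployment", "seldondeployments"
)

@pytest.fixture
def clean_seldon_deployments(ops_test):
    """Delete SeldonDeployments left over from earlier tests."""
    client = Client()
    namespace = ops_test.model_name  # tests run in the Juju model's namespace
    for sdep in client.list(SeldonDeployment, namespace=namespace):
        client.delete(SeldonDeployment, sdep.metadata.name, namespace=namespace)
    yield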

Finalise work done for Seldon Core Operator during Observability Workshop

Finalise work done for Seldon Core Operator during Observability Workshop

Work items are tracked in https://warthogs.atlassian.net/browse/KF-775
Branch: https://github.com/canonical/seldon-core-operator/tree/kf-775-gh52-feat-alert-rules
Prometheus deployment https://github.com/canonical/prometheus-k8s-operator

Design

Failure alerts are implemented through integration with the Prometheus charm from the Canonical Observability Stack. Prometheus creates scrape jobs and evaluates the alert rules defined by the Seldon Core Operator charm; it then scrapes the targets, retrieves the defined metrics, and performs the required calculations.
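On the charm side, this kind of integration is typically wired up with the prometheus_scrape charm library; a rough sketch under that assumption (the job contents and alert-rules path below are illustrative, not the charm's actual code):

from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointProvider
from ops.charm import CharmBase


class SeldonCoreOperator(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # Expose the controller-manager metrics endpoint and ship alert rules
        # (rule files would live under src/prometheus_alert_rules).
        self.prometheus_provider = MetricsEndpointProvider(
            self,
            jobs=[{"static_configs": [{"targets": ["*:8080"]}]}],
            alert_rules_path="src/prometheus_alert_rules",
        )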

Testing

  • Set up a MicroK8s cluster and Juju controller:
microk8s enable dns storage metallb:"10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111"
juju bootstrap microk8s uk8s
juju add-model test
  • Deploy Prometheus and Seldon Core Operator and relate them.
juju deploy prometheus-k8s --trust
juju deploy ./seldon-core_ubuntu-20.04-amd64.charm seldon-controller-manager --trust --resource oci-image="docker.io/seldonio/seldon-core-operator:1.14.0"
juju relate prometheus-k8s seldon-controller-manager
  • Navigate to Prometheus dashboard https://<Prometheus-unit-IP>:9090, select Status->Targets
    There should be a Prometheus scrape job entry targeting the Seldon metrics endpoint (http://<Seldon-Controller-Manager-IP>:8080/metrics) with no errors:
    Screenshot from 2022-11-15 05-44-31

  • Deploy a sample Seldon deployment in the same model and check whether any failure alert is reported by navigating to Alerts:

microk8s.kubectl -n test apply -f examples/serve-simple-v1.yaml
  • To simulate a failure, delete the deployment that was created by Seldon and observe the alerts:
microk8s.kubectl -n test delete deploy/seldon-model-example-0-classifier

NOTE: The alert window is 10 minutes and scraping is done once per minute. Make sure at least 2 minutes have passed for a proper rate calculation.
Screenshot from 2022-11-15 14-34-23

Integration tests are failing in Github runners on track/1.15

Bug Description

On track/1.15, integration tests fail in GitHub runners.

To Reproduce

Create a debug PR and observe the integration tests triggered by the workflow in the GitHub runner.
Example real PR that failed after many attempts: #186

Environment

Github runner action On Pull Request https://github.com/canonical/seldon-core-operator/actions/workflows/on_pull_request.yaml

Relevant log output

Output of integration tests that fail in Github runner:
https://github.com/canonical/seldon-core-operator/actions/runs/5694773152/job/16153113024?pr=186

Additional context

There was a series of fixes on the main branch that made the Seldon integration tests more reliable. These should be reviewed and probably ported/cherry-picked into the track/1.15 branch.
#190
#188
#187

Enable access to a Seldon deployed model from a user's Kubeflow Pipeline step

When users deploy models via Seldon into their namespace, they want to be able to access these models from a step in a Kubeflow Pipeline1. This currently does not work: if you try to access the model from a pipeline step, you get the message "403 RBAC: access denied". This may be due to pipeline steps not having Istio sidecars.

For this task to be complete, we need to:

  • debug the access of seldon models from Kubeflow pipelines
  • write a guide that shows how to access models from a Kubeflow pipeline
  • [if possible] write tests that assert this works <-- This might be a full bundle test. Or maybe we just need KFP deployed.

Related is canonical/bundle-kubeflow#557, which describes some challenges in access.
This solves part of #109.
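For the eventual guide, a minimal in-cluster call from a pipeline step might look like the sketch below. The service name, port and path follow common Seldon conventions and are hypothetical here; the 403 indicates the Istio authorization side is what actually needs debugging:

import requests

# Hypothetical names: SeldonDeployment "my-model" with predictor "default"
# served in namespace "admin".
SVC = "http://my-model-default.admin.svc.cluster.local:8000"
URL = f"{SVC}/api/v1.0/predictions"

payload = {"data": {"ndarray": [[1.0, 2.0, 5.0]]}}
resp = requests.post(URL, json=payload, timeout=30)
resp.raise_for_status()  # currently fails with "403 RBAC: access denied"
print(resp.json())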

Footnotes

  1. If at all possible, this should also keep traffic inside the Kubernetes cluster. Going outside and authenticating back through the front door is a last resort.

Re-design Install event handler

The install event handler needs to be redesigned to only deploy K8s resources and update the Pebble layer. All other actions should be performed in the Pebble ready event handler and in main.
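A hedged sketch of the intended shape (the container/event name is an assumption; _apply_k8s_resources, _update_layer and _on_event are the handlers already referenced elsewhere in this repo):

from ops.charm import CharmBase
from ops.model import ActiveStatus, MaintenanceStatus


class SeldonCoreOperator(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.install, self._on_install)
        # container name "seldon-core" is assumed here
        self.framework.observe(self.on.seldon_core_pebble_ready, self._on_pebble_ready)

    def _on_install(self, _):
        # Install only deploys K8s resources and updates the Pebble layer.
        self.unit.status = MaintenanceStatus("Applying K8s resources")
        self._apply_k8s_resources()
        self._update_layer()

    def _on_pebble_ready(self, event):
        # Everything else is handled here and in the common event path.
        self._on_event(event)
        self.unit.status = ActiveStatus()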

Design the user experience around accessing served models via Seldon

When a user serves a model using Seldon in their Kubeflow namespace, there are different use cases for accessing a model that they might want to satisfy:

  1. accessing the model from a Kubeflow Notebook
  2. accessing the model from a Kubeflow Pipeline (workaround demonstrated in #110, but a better solution may be possible)
  3. accessing the model from another user's Kubeflow Notebook (eg userA serves the model and userB's Notebook consumes it)
  4. accessing the model from another user's Kubeflow Pipeline (eg userA serves the model and userB's Pipeline consumes it)
  5. accessing the model from outside Kubeflow with authentication required
  6. accessing the model from outside Kubeflow without authentication required (eg: true public access to the model)

For each of these use cases, we should have:

  • an automated test demonstrating it works
  • a user guide showing a user how to use this feature

Closes canonical/bundle-kubeflow#557

Charm fails to parse custom images with `:` in their name

Bug Description

Images provided through the config receive special treatment as seen here:

for image_name in SPLIT_IMAGES_LIST:
    (
        custom_images[f"{image_name}__image"],
        custom_images[f"{image_name}__version"],
    ) = custom_images[image_name].split(":")

More specifically, the charm attempts to separate the image name from the tag, splitting the string in two using : as a separator. However, this will break the charm installation if the image name contains that special character, which could be the case if a local container registry address is provided for instance (e.g. 172.17.0.2:5000/tensorflow/serving:2.1.0).

To Reproduce

juju deploy seldon-core --trust \
  --config custom_images='{
    "configmap__predictor__tensorflow__tensorflow": "172.17.0.2:5000/tensorflow/serving:2.1.0",
  }'

Environment

This affects the latest/edge version of the seldon-core charm.

Relevant Log Output

unit-seldon-core-0: 14:55:41 ERROR unit.seldon-core/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 629, in <module>
    main(SeldonCoreOperator)
  File "/var/lib/juju/agents/unit-seldon-core-0/charm/venv/ops/main.py", line 439, in main
    framework.reemit()
  File "/var/lib/juju/agents/unit-seldon-core-0/charm/venv/ops/framework.py", line 843, in reemit
    self._reemit()
  File "/var/lib/juju/agents/unit-seldon-core-0/charm/venv/ops/framework.py", line 922, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 438, in _on_install
    self._apply_k8s_resources(force_conflicts=True)
  File "./src/charm.py", line 420, in _apply_k8s_resources
    self.configmap_resource_handler.apply()
  File "./src/charm.py", line 195, in configmap_resource_handler
    context={**self._context, **self._configmap_images},
  File "./src/charm.py", line 289, in _configmap_images
    return self._get_custom_images()
  File "./src/charm.py", line 310, in _get_custom_images
    (
ValueError: too many values to unpack (expected 2)

Additional Context

The tag is the string part following the last :. As a result, we could do something like this:

diff --git a/src/charm.py b/src/charm.py
index 70575d8..9ba4950 100755
--- a/src/charm.py
+++ b/src/charm.py
@@ -310,7 +310,7 @@ class SeldonCoreOperator(CharmBase):
                 (
                     custom_images[f"{image_name}__image"],
                     custom_images[f"{image_name}__version"],
-                ) = custom_images[image_name].split(":")
+                ) = custom_images[image_name].rsplit(":", 1)
 
         except yaml.YAMLError as err:
             self.logger.error(
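A quick illustration of why rsplit is needed when the registry address itself contains a colon:

image = "172.17.0.2:5000/tensorflow/serving:2.1.0"

# split(":") yields three parts and breaks the two-value unpacking above
print(image.split(":"))   # ['172.17.0.2', '5000/tensorflow/serving', '2.1.0']

# rsplit(":", 1) splits only on the last colon, separating name from tag
name, version = image.rsplit(":", 1)
print(name, version)      # 172.17.0.2:5000/tensorflow/serving 2.1.0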

`README.md` is missing information about ingressgateway

This charm can be related to istio-pilot, but that information is missing from the README. We need to provide instructions on how to deploy Seldon's dependencies when the ingress gateway from istio-operators is going to be used.

Deployment stuck with ReplicaSet "X" is progressing

Bug Description

While updating the Seldon charm to 1.17.1 in #216, we ran into the following bug when running the integration tests.

The deployment created by the applied SeldonDeployment gets stuck with the condition ReplicaSet "X" is progressing (and as a result is never ready). The odd part is that the underlying ReplicaSet creates a pod successfully and its status shows readyReplicas: 1. At the same time, the deployment has an observedGeneration of 7 (or thereabouts) while the ReplicaSet has its observedGeneration set to 1.

This looks like the same issue we've hit when trying to update the Seldon ROCKs, described in the comments of canonical/seldonio-rocks#37 (comment).

Debugging

During debugging we tried to manually apply the aforementioned SeldonDeployment YAML with kubectl apply -f and noticed that:

  • when we applied it to the default namespace, deployment was progressing successfully
  • when we applied it to the namespace created by tests, the deployment was stuck (just like what happens in the tests)

Here are the two namespaces' YAML outputs:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    controller.juju.is/id: ec8f226d-bdc2-45de-891d-7cc8b8f501ff
    model.juju.is/id: 73fb3200-e151-46b2-8e87-f814f48f1715
  creationTimestamp: "2023-10-03T08:46:52Z"
  labels:
    app.kubernetes.io/managed-by: juju
    kubernetes.io/metadata.name: test-charm-vmxo
    model.juju.is/name: test-charm-vmxo
    serving.kubeflow.org/inferenceservice: enabled
  name: test-charm-vmxo
  resourceVersion: "439145"
  uid: 82929f14-80fc-4049-b4ab-c74c93ec0e30
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

and

apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2023-10-02T08:39:37Z"
  labels:
    kubernetes.io/metadata.name: default
  name: default
  resourceVersion: "482729"
  uid: 0cf53d5e-b0f5-482e-88a2-6be71e24fe02
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

Tests run in the namespace created by a test Juju model. In that namespace, they try to apply a custom resource, which in turn creates a deployment. The issue is that, while the ReplicaSet creates a pod successfully and its status shows readyReplicas: 1, the deployment gets stuck with the condition ReplicaSet "X" is progressing (and as a result is never ready). The same thing happens if I apply the custom resource manually in the testing namespace.
However, if I apply it to the default namespace, the Deployment goes to Ready as expected.

To Reproduce

  1. Checkout the PR's branch
  2. Set up a cluster with Juju 2.9.45 and MicroK8s 1.24 (also tried with 1.26, but the issue persists)
  3. Run integration tests with tox -e charm-integration or tox -e seldon-servers-integration

Environment

  • Juju 2.9.45
  • Microk8s 1.24
  • Ubuntu 22.04

Relevant log output

Doing `microk8s inspect` and looking at `snap.microk8s.daemon-kubelite/journal.log` we noticed this error during the deployment creation

Oct 03 08:51:43 ip-172-31-36-226 microk8s.daemon-kubelite[39625]: E1003 08:51:43.873775   39625 fieldmanager.go:211] "[SHOULD NOT HAPPEN] failed to update managedFields" VersionKind="/, Kind=" namespace="test-charm-vmxo" name="seldon-model-1-example-0-classifier"
[..]
(also a bunch of those)
Oct 03 08:51:47 ip-172-31-36-226 microk8s.daemon-kubelite[39625]: E1003 08:51:47.585896   39625 deployment_controller.go:495] Operation cannot be fulfilled on replicasets.apps "seldon-model-1-example-0-classifier-9df54f658": the object has been modified; please apply your changes to the latest version and try again

`$ kubectl get deployments -n kubeflow seldon-model-1-example-0-classifier -o yaml`
[...]
status:
  conditions:
  - lastTransitionTime: "2023-09-29T13:15:00Z"
    lastUpdateTime: "2023-09-29T13:15:00Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2023-09-29T13:15:00Z"
    lastUpdateTime: "2023-09-29T13:15:00Z"
    message: ReplicaSet "seldon-model-1-example-0-classifier-b7dfcbcb5" is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing
  observedGeneration: 7
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1

`$ kubectl get replicaset -n kubeflow seldon-model-1-example-0-classifier-b7dfcbcb5 -o yaml`
[..]
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1

Additional context

No response

seldon-core 1.15 is not compatible with k8s 1.25

Bug Description

According to the Seldon docs, Seldon 1.15 is not compatible with k8s 1.25.
PR #207 bumps MicroK8s in the CI to 1.25; it fails as expected with Seldon 1.15.

Seldon will be updated to 1.17 as part of the 1.8 release, as documented in canonical/bundle-kubeflow#643. Seldon 1.17 is compatible with k8s 1.25, so we should update the k8s version in the CI along with the manifests and version update.

Unclear if `gateway-info` is a hard requirement

_get_env_vars() calls _get_istio_gateway(), which returns the name and namespace of the Istio ingress gateway as a string in the namespace/name format. The code suggests that, when the latter returns None, _get_env_vars() will not stop its execution, but _get_istio_gateway() suggests that the unit should go to WaitingStatus if the relation with istio-pilot:gateway-info does not exist.

From the code it is not clear whether gateway-info is a hard requirement that should put the unit into WaitingStatus or even BlockedStatus; the only thing that is clear is that _get_env_vars() does not care whether there is information about the gateway and simply proceeds. If the relation is not necessary 100% of the time, I recommend we reflect that in the code (removing the WaitingStatus) and also in the documentation.
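If the relation turns out to be a hard requirement, the handling could be made explicit along these lines (a sketch under that assumption, not the current code):

from charmed_kubeflow_chisme.exceptions import ErrorWithStatus
from ops.model import WaitingStatus


def _get_env_vars(self):
    """Sketch: treat gateway-info as required and surface WaitingStatus."""
    istio_gateway = self._get_istio_gateway()
    if istio_gateway is None:
        # Make the requirement explicit instead of silently proceeding.
        raise ErrorWithStatus(
            "Waiting for gateway-info relation with istio-pilot", WaitingStatus
        )
    env_vars = {"ISTIO_GATEWAY": istio_gateway}
    # ...the remaining environment variables as before...
    return env_vars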

Update config.yaml with proper default container image

Bug Description

Currently, config.yaml refers to the default Seldon Core Operator container image v1.14. As part of the Kubeflow 1.8 work, a new image is being used, so this default configuration should be updated to use the new ROCK image.

To Reproduce

Refer to corresponding config.yaml line

Environment

config.yaml for this charm.

Relevant log output

N/A

Unable to deploy seldon examples (both versions) in CKF 1.4

In a working Kubeflow 1.4 deployment on MicroK8s (without RBAC enabled).
To replicate, clone the repo and run:
kubectl apply -f examples/serve-simple-v1.yaml

The response is:

Error from server (InternalError): error when creating "examples/serve-simple-v1.yaml": Internal error occurred: failed calling webhook "v1alpha2.vseldondeployment.kb.io": Post "https://seldon-webhook-service.kubeflow.svc:4443/validate-machinelearning-seldon-io-v1alpha2-seldondeployment?timeout=30s": dial TCP

Seldon deployed: seldon-controller-manager res:oci-image@82fd029 active 1 seldon-core charmhub stable 50 kubernetes 10.152.183.16
Microk8s: installed: v1.21.8 (2870) 191MB classic

Make charm's images configurable in track/<last-version> branch

Description

The goal of this task is to make all images configurable so that when this charm is deployed in an airgapped environment, all image resources are pulled from an arbitrary local container image registry (avoiding pulling images from the internet).
This serves as a tracking issue for the required changes and backports to the latest stable track/* Github branch.

TL;DR

Mark the following as done

  • Required changes (in metadata.yaml, config.yaml, src/charm.py)
  • Test on airgap environment
  • Publish to /stable

Required changes

WARNING: No breaking changes should be backported into the track/<version> branch. A breaking change can be anything that requires extra steps to refresh from the previous /stable other than just juju refresh. Please avoid at all costs these situations.

The following files have to be modified and/or verified to enable image configuration:

  • metadata.yaml - the container image(s) of the workload containers have to be specified in this file. This only applies to sidecar charms. Example:
containers:
  training-operator:
    resource: training-operator-image
resources:
  training-operator-image:
    type: oci-image
    description: OCI image for training-operator
    upstream-source: kubeflow/training-operator:v1-855e096
  • config.yaml - in case the charm deploys containers that are used by resource(s) the operator creates. Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: seldon-config
  namespace: {{ namespace }}
data:
  predictor_servers: |-
    {
        "TENSORFLOW_SERVER": {
          "protocols" : {
            "tensorflow": {
              "image": "tensorflow/serving", <--- this image should be configurable
              "defaultImageVersion": "2.1.0"
              },
            "seldon": {
              "image": "seldonio/tfserving-proxy",
              "defaultImageVersion": "1.15.0"
              }
            }
        },
...
  • tools/get-images.sh - a bash script that returns a list of all the images used by this charm. In the case of a multi-charm repo, it is located at the root of the repo and gathers images from all charms in it.

  • src/charm.py - verify that nothing inside the charm code is calling a subprocess that requires internet connection.

Testing

  1. Spin up an airgap environment following canonical/bundle-kubeflow#682 and canonical/bundle-kubeflow#703 (comment)

  2. Build the charm making sure that all the changes for airgap are in place.

  3. Deploy the charms manually and observe the charm go to active and idle.

  4. Additionally, run integration tests or simulate them. For instance, creating a workload (like a PytorchJob, a SeldonDeployment, etc.).

Publishing

After completing the changes and testing, this charm has to be published to its stable risk in Charmhub. For that, wait for the charm to be published to /edge; that is the revision to be promoted to /stable. Use the workflow dispatch for this (Actions > Release charm to other tracks... > Run workflow).

Suggested changes/backports

seldon does not check for container connectivity on config changed

charm goes into error state when "config changed" is called

unit-seldon-controller-manager-0: 04:44:24 ERROR unit.seldon-controller-manager/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/usr/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/pebble.py", line 252, in connect
    self.sock.connect(self.socket_path)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/pebble.py", line 1484, in _request_raw
    response = self.opener.open(request, timeout=self.timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/pebble.py", line 266, in http_open
    return self.do_open(_UnixSocketConnection, req,  # type:ignore
  File "/usr/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 528, in <module>
    main(SeldonCoreOperator)
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 511, in _on_event
    self._update_layer()
  File "./src/charm.py", line 260, in _update_layer
    current_layer = self.container.get_plan()
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 1933, in get_plan
    return self._pebble.get_plan()
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/pebble.py", line 1772, in get_plan
    resp = self._request('GET', '/v1/plan', {'format': 'yaml'})
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/pebble.py", line 1451, in _request
    response = self._request_raw(method, path, query, headers, data)
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/pebble.py", line 1497, in _request_raw
    raise ConnectionError(e.reason)
ops.pebble.ConnectionError: [Errno 2] No such file or directory
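A minimal guard along these lines (a sketch, not the charm's actual handler) would defer the event instead of crashing while the Pebble socket is not yet available:

from ops.model import WaitingStatus


def _on_event(self, event):
    # Bail out early if the workload container is not reachable yet.
    if not self.container.can_connect():
        self.unit.status = WaitingStatus("Waiting for Pebble in workload container")
        event.defer()
        return
    self._update_layer()
    # ...rest of the handler...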

juju refresh doesn't update seldon-webhook-service definition

Found during testing canonical/bundle-kubeflow#462.

The webhook label fix in #14 is only applied on a greenfield deployment with the new revision of the charm. In an upgrade scenario, the Service definition is never updated, so the original issue is still there.

$ juju info seldon-core
name: seldon-core
charm-id: ZGHtHpN4TqAzrUlh9aG1SWxXenopHFRH
...
channels: |
  latest/stable:     52  2022-01-25  (52)  1MB
  latest/candidate:  ↑
  latest/beta:       ↑
  latest/edge:       58  2022-06-01  (58)  7MB

[stable: revision 52]

$ juju deploy seldon-core seldon-controller-manager
Located charm "seldon-core" in charm-hub, revision 52
Deploying "seldon-controller-manager" from charm-hub charm "seldon-core", revision 52 in channel stable on focal

$ microk8s kubectl -n kubeflow describe service/seldon-webhook-service
Name:              seldon-webhook-service
Namespace:         kubeflow
Labels:            app=seldon
                   app.juju.is/created-by=seldon-controller-manager
                   app.kubernetes.io/instance=seldon-core
                   app.kubernetes.io/version=1.9.0
Annotations:       <none>
Selector:          app.kubernetes.io/name=seldon-controller-manager,control-plane=seldon-controller-manager
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.152.183.28
IPs:               10.152.183.28
Port:              <unset>  4443/TCP
TargetPort:        4443/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

-> with control-plane=seldon-controller-manager

[upgrade from 52 to 58]

$ juju refresh seldon-controller-manager --channel edge
Added charm-hub charm "seldon-core", revision 58 in channel edge, to the model
Adding endpoint "grafana-dashboard" to default space "alpha"
Adding endpoint "metrics-endpoint" to default space "alpha"
Leaving endpoints in "alpha": ambassador, istio, keda

$ microk8s kubectl -n kubeflow describe service/seldon-webhook-service
Name:              seldon-webhook-service
Namespace:         kubeflow
Labels:            app=seldon
                   app.juju.is/created-by=seldon-controller-manager
                   app.kubernetes.io/instance=seldon-core
                   app.kubernetes.io/version=1.9.0
Annotations:       <none>
Selector:          app.kubernetes.io/name=seldon-controller-manager,control-plane=seldon-controller-manager
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.152.183.28
IPs:               10.152.183.28
Port:              <unset>  4443/TCP
TargetPort:        4443/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

-> control-plane=seldon-controller-manager is still there even after the refresh.

[edge: revision 58]

$ juju deploy seldon-core seldon-controller-manager --channel edge
Located charm "seldon-core" in charm-hub, revision 58
Deploying "seldon-controller-manager" from charm-hub charm "seldon-core", revision 58 in channel edge on focal

$ microk8s kubectl -n kubeflow describe service/seldon-webhook-service 
Name:              seldon-webhook-service
Namespace:         kubeflow
Labels:            app=seldon
                   app.juju.is/created-by=seldon-controller-manager
                   app.kubernetes.io/instance=seldon-core
                   app.kubernetes.io/version=1.9.0
Annotations:       <none>
Selector:          app.kubernetes.io/name=seldon-controller-manager
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.152.183.84
IPs:               10.152.183.84
Port:              <unset>  4443/TCP
TargetPort:        4443/TCP
Endpoints:         10.1.60.16:4443
Session Affinity:  None
Events:            <none>

-> as expected, there is no control-plane=seldon-controller-manager.

workaround:

$ microk8s kubectl -n kubeflow patch service/seldon-webhook-service --type=json \
    -p='[{"op": "remove", "path": "/spec/selector/control-plane"}]'

upgrade from 1.14 to 1.15 fails due to 409 conflict during k8s resource creation

During upgrade the charm gets stuck with 409 conflict errors during k8s resource creation.

Reproduction steps:

juju deploy seldon-core seldon-controller-manager --channel 1.14/stable
juju trust seldon-controller-manager --scope=cluster
juju refresh seldon-controller-manager --channel 1.15/edge

Which yields logs of:

unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log Rendering manifests
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/namespaces/kubeflow/roles/leader-election-role?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/manager-role?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/manager-sas-role?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/kubeflow-edit-seldon?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/namespaces/kubeflow/rolebindings/leader-election-rolebinding?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/manager-rolebinding?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/manager-sas-rolebinding?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/seldon-validating-webhook-configuration?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/api/v1/namespaces/kubeflow/services/seldon-webhook-service?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log Reconcile completed successfully
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:16 INFO unit.seldon-controller-manager/0.juju-log Rendering manifests
unit-seldon-controller-manager-0: 15:21:17 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions/seldondeployments.machinelearning.seldon.io?fieldManager=lightkube "HTTP/1.1 409 Conflict"
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.juju-log Encountered a conflict: Apply failed with 1 conflict: conflict with "manager" using apiextensions.k8s.io/v1: .spec.versions
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Error in sys.excepthook:
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Traceback (most recent call last):
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     self.emit(record)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/log.py", line 41, in emit
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     self.model_backend.juju_log(record.levelname, self.format(record))
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     return fmt.format(record)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/logging/__init__.py", line 676, in format
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     record.exc_text = self.formatException(record.exc_info)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/logging/__init__.py", line 626, in formatException
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     traceback.print_exception(ei[0], ei[1], tb, None, sio)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/traceback.py", line 103, in print_exception
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     for line in TracebackException(
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/traceback.py", line 617, in format
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     yield from self.format_exception_only()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/traceback.py", line 566, in format_exception_only
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     stype = smod + '.' + stype
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install 
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Original exception was:
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Traceback (most recent call last):
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/generic_client.py", line 188, in raise_for_status
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     resp.raise_for_status()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/httpx/_models.py", line 749, in raise_for_status
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     raise HTTPStatusError(message, request=request, response=self)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install httpx.HTTPStatusError: Client error '409 Conflict' for url 'https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions/seldondeployments.machinelearning.seldon.io?fieldManager=lightkube'
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install For more information check: https://httpstatuses.com/409
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install 
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install During handling of the above exception, another exception occurred:
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install 
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Traceback (most recent call last):
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "./src/charm.py", line 331, in _apply_k8s_resources
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     self.crd_resource_handler.apply()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/charmed_kubeflow_chisme/kubernetes/_kubernetes_resource_handler.py", line 351, in apply
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     raise e
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/charmed_kubeflow_chisme/kubernetes/_kubernetes_resource_handler.py", line 336, in apply
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     apply_many(client=self.lightkube_client, objs=resources, force=force)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/charmed_kubeflow_chisme/lightkube/batch/_many.py", line 64, in apply_many
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     returns[i] = client.apply(
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/client.py", line 457, in apply
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     return self.patch(type(obj), name, obj, namespace=namespace,
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/client.py", line 325, in patch
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     return self._client.request("patch", res=res, name=name, namespace=namespace, obj=obj,
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/generic_client.py", line 245, in request
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     return self.handle_response(method, resp, br)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/generic_client.py", line 196, in handle_response
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     self.raise_for_status(resp)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/generic_client.py", line 190, in raise_for_status
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     raise transform_exception(e)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install lightkube.core.exceptions.ApiError: Apply failed with 1 conflict: conflict with "manager" using apiextensions.k8s.io/v1: .spec.versions
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install 
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install The above exception was the direct cause of the following exception:
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install 
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Traceback (most recent call last):
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "./src/charm.py", line 523, in <module>
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     main(SeldonCoreOperator)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/main.py", line 439, in main
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     framework.reemit()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/framework.py", line 840, in reemit
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     self._reemit()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/framework.py", line 919, in _reemit
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     custom_handler(event)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "./src/charm.py", line 357, in _on_install
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     self._apply_k8s_resources()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "./src/charm.py", line 340, in _apply_k8s_resources
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     raise GenericCharmRuntimeError("CRD resources creation failed") from error
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install <unknown>GenericCharmRuntimeError: CRD resources creation failed
unit-seldon-controller-manager-0: 15:21:17 ERROR juju.worker.uniter.operation hook "install" (via hook dispatching script: dispatch) failed: exit status 1
unit-seldon-controller-manager-0: 15:21:17 INFO juju.worker.uniter awaiting error resolution for "install" hook

where we see 409 conflict errors when creating the CRDs.

(This feels similar to canonical/training-operator#104, but that issue was an upgrade between sidecar charms, whereas this is an upgrade from a podspec charm to a sidecar charm.)
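Since the conflict comes from fields still owned by the old "manager" field manager on the CRD, one hedged mitigation is to force the server-side apply on upgrade so field ownership is transferred (force_conflicts is the parameter already visible in the charm's tracebacks; whether to wire it to upgrade-charm is an open question):

def _on_upgrade_charm(self, _):
    """Sketch: take over fields owned by the previous 'manager' on refresh."""
    # Forcing the server-side apply (force=True in lightkube's Client.apply)
    # transfers field ownership and avoids the 409 on .spec.versions.
    self._apply_k8s_resources(force_conflicts=True)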

`seldon-core-operator` fails to build during integration tests

It seems this charm cannot be built during integration tests, resulting in the following error message:

RuntimeError: Failed to build charm .:
Packing the charm.
Launching environment to pack for base name='ubuntu' channel='20.04' architectures=['amd64'] (may take a while the first time but it's reusable)
charmcraft internal error: LXDError(brief="failed to inject host's snap 'charmcraft' into target environment.", details=None, resolution=None)
Full execution log: '/home/runner/.local/state/charmcraft/log/charmcraft-20230726-124415.757660.log'

Logs and info here and in the latest CI run of #184

Update versions of seldon, subordinate images

SeldonDeployments using our charm to deploy models from artifacts stored via mlflow currently fail to deploy because the seldonio/mlflowserver:1.9.0 image used for the classifier pod has an unpinned package dependency problem. This causes the flask server to fail like in pallets/flask#4455. This has been fixed in SeldonIO/seldon-core#3946, which would fix new versions of the mlflowserver image.

To fix the above issue, we should update the images we specify in our config, e.g. this line:

"defaultImageVersion": "1.9.0",

It is unclear if we need to update all images at the same time, or if we can pick and choose (for example, the main charm runtime is back on version 1.6, and current versions are >1.13.x).

Document examples for Seldon Core Deployment for various scenarios

Provide documentation (Jupyter notebooks, README update, how-to guide, etc.) on Seldon Core Deployment scenarios:

  • Canary deployments.
  • Deployments with authentication.
  • Deployments without authentication.
  • Debugging deployments in Seldon (useful in the field).

Some of the work was completed as part of this PR: #20

  • Create Jira task

Seldon (edge-bundle) failing in charmed kubernetes deployment

After deploying the edge bundle to Charmed Kubernetes on AWS, seldon-core is failing:
(screenshot)

Running

juju deploy kubeflow --channel=latest/edge --trust

Container logs:

โฏ kubectl logs -f seldon-controller-manager-54cd8dcfc-prvxz -n kubeflow
Defaulted container "seldon-core" out of: seldon-core, juju-pod-init (init)
{"level":"info","ts":1674829552.0947893,"logger":"setup","msg":"Intializing operator"}
{"level":"info","ts":1674829552.1163418,"logger":"setup","msg":"CRD not found - trying to create"}
{"level":"error","ts":1674829552.2253666,"logger":"setup","msg":"unable to initialise operator","error":"the server could not find the requested resource","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nmain.main\n\t/workspace/main.go:149\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}

Attaching logs
dump.log

Seldon-only logs:

$ juju debug-log --replay | grep seldon
controller-0: 14:56:45 INFO juju.worker.caasapplicationprovisioner.runner start "seldon-controller-manager"
controller-0: 15:06:24 INFO juju.worker.caasprovisioner started operator for application "seldon-controller-manager"
application-seldon-controller-manager: 15:06:26 INFO juju.cmd running jujud [2.9.34 90e2f047763059f0b8a57941ae0907346464aee8 gc go1.19]
application-seldon-controller-manager: 15:06:26 DEBUG juju.cmd   args: []string{"/var/lib/juju/tools/jujud", "caasoperator", "--application-name=seldon-controller-manager", "--debug"}
application-seldon-controller-manager: 15:06:26 DEBUG juju.agent read agent config, format "2.0"
application-seldon-controller-manager: 15:06:26 INFO juju.worker.upgradesteps upgrade steps for 2.9.34 have already been run.
application-seldon-controller-manager: 15:06:26 INFO juju.cmd.jujud caas operator application-seldon-controller-manager start (2.9.34 [gc])
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "clock" manifold worker started at 2023-01-27 14:06:26.196880263 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "agent" manifold worker started at 2023-01-27 14:06:26.197542092 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "upgrade-steps-gate" manifold worker started at 2023-01-27 14:06:26.198022058 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.introspection introspection worker listening on "@jujud-application-seldon-controller-manager"
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "caas-units-manager" manifold worker started at 2023-01-27 14:06:26.1988468 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.introspection stats worker now serving
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.apicaller connecting with old password
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "api-config-watcher" manifold worker started at 2023-01-27 14:06:26.208160367 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "upgrade-steps-flag" manifold worker started at 2023-01-27 14:06:26.210204205 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "migration-fortress" manifold worker started at 2023-01-27 14:06:26.220874831 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.api successfully dialed "wss://172.31.25.210:17070/model/3ebd7f9f-23d7-4c10-82d4-68a99d6006b4/api"
application-seldon-controller-manager: 15:06:26 INFO juju.api connection established to "wss://172.31.25.210:17070/model/3ebd7f9f-23d7-4c10-82d4-68a99d6006b4/api"
application-seldon-controller-manager: 15:06:26 INFO juju.worker.apicaller [3ebd7f] "application-seldon-controller-manager" successfully connected to "172.31.25.210:17070"
application-seldon-controller-manager: 15:06:26 DEBUG juju.api RPC connection died
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "api-caller" manifold worker completed successfully
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.apicaller connecting with old password
application-seldon-controller-manager: 15:06:26 DEBUG juju.api successfully dialed "wss://172.31.25.210:17070/model/3ebd7f9f-23d7-4c10-82d4-68a99d6006b4/api"
application-seldon-controller-manager: 15:06:26 INFO juju.api connection established to "wss://172.31.25.210:17070/model/3ebd7f9f-23d7-4c10-82d4-68a99d6006b4/api"
application-seldon-controller-manager: 15:06:26 INFO juju.worker.apicaller [3ebd7f] "application-seldon-controller-manager" successfully connected to "172.31.25.210:17070"
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "api-caller" manifold worker started at 2023-01-27 14:06:26.255014228 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "caas-units-manager" manifold worker completed successfully
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "caas-units-manager" manifold worker started at 2023-01-27 14:06:26.263643616 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "upgrade-steps-runner" manifold worker started at 2023-01-27 14:06:26.265520702 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "upgrade-steps-runner" manifold worker completed successfully
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "upgrader" manifold worker started at 2023-01-27 14:06:26.267809453 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "log-sender" manifold worker started at 2023-01-27 14:06:26.267974806 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "migration-minion" manifold worker started at 2023-01-27 14:06:26.268182889 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "migration-inactive-flag" manifold worker started at 2023-01-27 14:06:26.27124827 +0000 UTC
application-seldon-controller-manager: 15:06:26 INFO juju.worker.caasupgrader abort check blocked until version event received
application-seldon-controller-manager: 15:06:26 INFO juju.worker.migrationminion migration phase is now: NONE
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.caasupgrader current agent binary version: 2.9.34
application-seldon-controller-manager: 15:06:26 INFO juju.worker.caasupgrader unblocking abort check
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.logger initial log config: "<root>=DEBUG"
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "proxy-config-updater" manifold worker started at 2023-01-27 14:06:26.284435601 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "charm-dir" manifold worker started at 2023-01-27 14:06:26.284655444 +0000 UTC
application-seldon-controller-manager: 15:06:26 INFO juju.worker.logger logger worker started
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "api-address-updater" manifold worker started at 2023-01-27 14:06:26.284730685 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "logging-config-updater" manifold worker started at 2023-01-27 14:06:26.284764145 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "hook-retry-strategy" manifold worker started at 2023-01-27 14:06:26.314754686 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.logger reconfiguring logging from "<root>=DEBUG" to "<root>=INFO"
application-seldon-controller-manager: 15:06:26 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
application-seldon-controller-manager: 15:06:26 INFO juju.worker.caasoperator.charm downloading ch:amd64/focal/seldon-core-58 from API server
application-seldon-controller-manager: 15:06:26 INFO juju.downloader downloading from ch:amd64/focal/seldon-core-58
application-seldon-controller-manager: 15:06:26 INFO juju.downloader download complete ("ch:amd64/focal/seldon-core-58")
application-seldon-controller-manager: 15:06:26 INFO juju.downloader download verified ("ch:amd64/focal/seldon-core-58")
application-seldon-controller-manager: 15:06:32 INFO juju.worker.caasoperator operator "seldon-controller-manager" started
application-seldon-controller-manager: 15:06:32 INFO juju.worker.caasoperator.runner start "seldon-controller-manager/0"
application-seldon-controller-manager: 15:06:32 INFO juju.worker.leadership seldon-controller-manager/0 promoted to leadership of seldon-controller-manager
application-seldon-controller-manager: 15:06:32 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-seldon-controller-manager-0
application-seldon-controller-manager: 15:06:32 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0 unit "seldon-controller-manager/0" started
application-seldon-controller-manager: 15:06:32 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0 resuming charm install
application-seldon-controller-manager: 15:06:32 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.charm downloading ch:amd64/focal/seldon-core-58 from API server
application-seldon-controller-manager: 15:06:32 INFO juju.downloader downloading from ch:amd64/focal/seldon-core-58
application-seldon-controller-manager: 15:06:32 INFO juju.downloader download complete ("ch:amd64/focal/seldon-core-58")
application-seldon-controller-manager: 15:06:32 INFO juju.downloader download verified ("ch:amd64/focal/seldon-core-58")
application-seldon-controller-manager: 15:06:39 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0 hooks are retried true
application-seldon-controller-manager: 15:06:39 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0 found queued "install" hook
application-seldon-controller-manager: 15:06:40 INFO unit.seldon-controller-manager/0.juju-log Running legacy hooks/install.
application-seldon-controller-manager: 15:06:45 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "install" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:06:45 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0 found queued "leader-elected" hook
application-seldon-controller-manager: 15:06:48 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "leader-elected" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:06:55 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "config-changed" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:06:55 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0 found queued "start" hook
application-seldon-controller-manager: 15:06:55 INFO unit.seldon-controller-manager/0.juju-log Running legacy hooks/start.
application-seldon-controller-manager: 15:06:57 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "start" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:08:12 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "config-changed" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:09:42 INFO juju.worker.caasoperator started pod init on "seldon-controller-manager/0"
application-seldon-controller-manager: 15:11:11 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "update-status" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:16:25 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "update-status" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:22:16 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "update-status" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:27:56 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "update-status" hook (via hook dispatching script: dispatch)

1.6/stable is working for Charmed Kubeflow.

Finalise work done for Seldon Metrics Discovery during Observability Workshop

Finalise work done for Seldon Metrics Discovery during Observability Workshop

Work items are tracked in https://warthogs.atlassian.net/browse/KF-829
Branch: https://github.com/canonical/seldon-core-operator/tree/kf-829-gh68-feat-metrics-discovery
Prometheus deployment https://github.com/canonical/prometheus-k8s-operator

Design

Metrics scraping is implemented through integration with the Prometheus charm from the Canonical Observability Stack (COS). For metrics provided by models, scrape targets can change from model to model and from deployment to deployment, so the Metrics Endpoint Observer provided by COS is integrated: updates to targets are handled by the Metrics Endpoint Observer and relayed to Prometheus by the Seldon Core Operator charm.
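For illustration only, a minimal sketch (not the actual observer implementation) of how model metrics endpoints could be discovered and turned into Prometheus scrape jobs. The port 6000 and /prometheus path come from the example below; the seldon-deployment-id label and the overall shape of the code are assumptions:

from lightkube import Client
from lightkube.resources.core_v1 import Pod


def build_model_scrape_jobs(namespace: str) -> list:
    """Build Prometheus scrape job dicts for Seldon model pods in a namespace."""
    client = Client()
    targets = []
    for pod in client.list(Pod, namespace=namespace):
        labels = pod.metadata.labels or {}
        # Seldon labels model pods with seldon-deployment-id (assumption)
        if "seldon-deployment-id" in labels and pod.status.podIP:
            targets.append(f"{pod.status.podIP}:6000")
    if not targets:
        return []
    # Shape of a scrape job as relayed over the metrics-endpoint relation
    return [{"metrics_path": "/prometheus", "static_configs": [{"targets": targets}]}]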

Testing

  • Setup MicroK8S cluster and Juju controller:
microk8s enable dns storage metallb:"10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111"
juju bootstrap microk8s uk8s
juju add-model test
  • Deploy Prometheus and Seldon Core Operator and relate them.
juju deploy prometheus-k8s --trust
juju deploy ./seldon-core_ubuntu-20.04-amd64.charm seldon-controller-manager --trust --resource oci-image="docker.io/seldonio/seldon-core-operator:1.14.0"
juju relate prometheus-k8s seldon-controller-manager

The resulting deployment should look like this:

Model  Controller  Cloud/Region        Version  SLA          Timestamp
test   uk8s        microk8s/localhost  2.9.34   unsupported  15:28:26-05:00

App                        Version  Status  Scale  Charm           Channel      Rev  Address         Exposed  Message
prometheus-k8s             2.33.5   active      1  prometheus-k8s  stable        79  10.152.183.239  no       
seldon-controller-manager           active      1  seldon-core                    0  10.152.183.182  no       

Unit                          Workload  Agent  Address     Ports  Message
prometheus-k8s/0*             active    idle   10.1.59.80         
seldon-controller-manager/0*  active    idle   10.1.59.79         

Deploy model with custom metrics:

microk8s.kubectl -n test apply -f examples/echo-metrics-v1.yaml

Get IP address of model classifier and use it for prediction request:

microk8s.kubectl -n test get svc | grep echo-metrics-default-classifier
echo-metrics-default-classifier       ClusterIP      10.152.183.34    <none>         9000/TCP,9500/TCP                       25m

Request prediction using IP address of model classifier:

for i in `seq 1 10`; do sleep 0.1 && \
   curl -v -s -H "Content-Type: application/json"    \
  -d '{"data": {"ndarray":[[1.0, 2.0, 5.0]]}}'    \
  http://<echo-metrics-default-classifier-IP>:9000/predict > /dev/null ; \
done

Metrics are available at the pod's IP address on the metrics port:

 microk8s.kubectl -n test describe pod  echo-metrics-default-0-classifier-5bf6cf86cd-r7c8l
 IP:           10.1.59.82
      PREDICTIVE_UNIT_METRICS_SERVICE_PORT:  6000
      PREDICTIVE_UNIT_METRICS_ENDPOINT:      /prometheus
curl http://10.1.59.82:6000/prometheus
  • Navigate to Prometheus dashboard https://<Prometheus-unit-IP>:9090, select Status->Targets

Add test that uses istio ingress

We should add a test (or tests) that confirms Seldon is working with our Istio charms/ingress, and maybe also with Istio+auth. This could be similar to the examples/ingress_canary_and_auth.ipynb example. A rough sketch follows.
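A rough sketch of what such a test could look like, assuming a SeldonDeployment named seldon-model (as in examples/serve-simple-v1.yaml) is applied to NAMESPACE and that ingress_ip comes from a fixture that reads the istio-ingressgateway LoadBalancer address (both are assumptions, not existing fixtures):

import requests

NAMESPACE = "kubeflow"  # namespace the SeldonDeployment is applied to (assumption)


def test_seldon_via_istio_ingress(ingress_ip):
    """Send a prediction request through the Istio ingress gateway."""
    url = f"http://{ingress_ip}/seldon/{NAMESPACE}/seldon-model/api/v1.0/predictions"
    response = requests.post(url, json={"data": {"ndarray": [[1.0, 2.0, 5.0]]}})
    assert response.status_code == 200
    assert "data" in response.json()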

Error deploying the SeldonDeployment in namespace other than kubeflow

Trying to deploy a SeldonDeployment:

$ microk8s kubectl apply -f model-mlflow-local.yaml -n kubeflow
seldondeployment.machinelearning.seldon.io/mlflow created

$ microk8s kubectl apply -f model-mlflow-local.yaml -n admin
Error from server (InternalError): error when creating "model-mlflow-local.yaml": Internal error occurred: failed calling webhook "v1alpha2.vseldondeployment.kb.io": Post "https://seldon-webhook-service.kubeflow.svc:4443/validate-machinelearning-seldon-io-v1alpha2-seldondeployment?timeout=30s": dial tcp 10.152.183.150:4443: connect: connection refused

I have the same secret in both of the namespaces:

$ microk8s kubectl get secrets -A | grep seldon-init-container-secret
admin              seldon-init-container-secret                          Opaque                      6      114m
kubeflow           seldon-init-container-secret                          Opaque                      6      114m

YAML file (requires changing the modelUri):

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - name: classifier
          livenessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://mlflow/71/8476095066fd43af8ae2a6f1511044df/artifacts/model
      envSecretRefName: seldon-init-container-secret
      name: classifier
    name: wine-super-model
    replicas: 1

Deployment was done using kubeflow-lite bundle + mlflow (stable channel). RBAC is enabled on microk8s.
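One quick way to narrow this down is to check whether the webhook Service actually has ready endpoints behind it; the "connection refused" above suggests nothing is listening on 4443. A small lightkube sketch (service name and namespace are taken from the error message; everything else is illustrative):

from lightkube import Client
from lightkube.resources.core_v1 import Endpoints

client = Client()
ep = client.get(Endpoints, "seldon-webhook-service", namespace="kubeflow")
# If the controller-manager pod is not ready, subsets will be empty and
# admission requests from any namespace are refused on port 4443.
print(ep.subsets)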

`tests/integration/test_seldon_servers.py` is flaky in GitHub runners

It seems like we have some flaky test executions in GitHub runners when running tests/integration/test_seldon_servers.py:

  1. GH runners are sometimes overloaded with these workloads, and it may take more time than expected to deploy and verify that SeldonDeployments are working correctly, causing errors like: AssertionError: Waited too long for seldondeployment/mlflow!. A possible fix is to increase the timeout; this is being worked on in #190.

  2. The above causes the SeldonDeployments created by the test_seldon_predictor_server test case to not always be removed, because the failing test case doesn't have a step to ensure cleanup between test cases. Since this test case is parametrised, it will try to deploy SeldonDeployments that may have the same name, which causes conflicts if they were not removed by a previous execution and ends up in failures with the message lightkube.core.exceptions.ApiError: seldondeployments.machinelearning.seldon.io "mlflow" already exists. This error can be found here. A fix is being worked on in #188; a sketch of such a cleanup is shown below.
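The actual fix in #188 may differ, but a cleanup of this kind could look roughly like the fixture below (SeldonDeployment is built with lightkube's generic-resource helper; ops_test is the usual pytest-operator fixture):

import pytest
from lightkube import Client
from lightkube.generic_resource import create_namespaced_resource

SeldonDeployment = create_namespaced_resource(
    "machinelearning.seldon.io", "v1", "SeldonDeployment", "seldondeployments"
)


@pytest.fixture
def cleanup_seldon_deployments(ops_test):
    """Delete any SeldonDeployments left behind by a failed test case."""
    yield
    client = Client()
    namespace = ops_test.model_name
    for sdep in client.list(SeldonDeployment, namespace=namespace):
        client.delete(SeldonDeployment, sdep.metadata.name, namespace=namespace)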

Seldon-istio relation does not work with charmed istio

We need Istio to provide external access to models and canary deployments.

Right now, relating istio-pilot with seldon-controller-manager is not possible. The default values also do not work for CKF; they require a default Istio installation in the istio-system namespace.

test_prometheus_data_set unit test failed

PR #102 upgrades charm libs, which caused test_prometheus_data_set to fail with the error:

 File "/home/runner/work/seldon-core-operator/seldon-core-operator/tests/unit/test_operator.py", line 111, in test_prometheus_data_set
    assert json.loads(
KeyError: 'scrape_jobs'
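For reference, a minimal Harness-based sketch of what this test exercises; the relation name and setup details are assumptions and the real test lives in tests/unit/test_operator.py. The prometheus_scrape provider library is expected to publish a scrape_jobs key in the application relation data, and the lib upgrade stopped that from happening:

import json
from ops.testing import Harness
from charm import SeldonCoreOperator


def test_prometheus_data_set():
    harness = Harness(SeldonCoreOperator)
    harness.set_leader(True)
    harness.begin()
    rel_id = harness.add_relation("metrics-endpoint", "prometheus-k8s")
    harness.add_relation_unit(rel_id, "prometheus-k8s/0")
    # The provider library should have populated scrape_jobs in the app data;
    # the KeyError above means it no longer does so after the upgrade.
    data = harness.get_relation_data(rel_id, harness.charm.app.name)
    assert json.loads(data["scrape_jobs"])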

Severe memory leak caused by endpoint discovery merge

Description

The fix for this issue needs to be merged into two branches:
Merge into:

  • track/1.15 (release KF v1.7)
  • main

After PR #94, every Seldon Deployment causes a severe memory leak and the Kubeflow deployment becomes unresponsive.
This was discovered when the integration test for charm removal was introduced: the remove operation never succeeded and the test failed with timeout errors.

Running the integration tests with the removal test produced these errors:

unit-seldon-controller-manager-0: 20:37:58 ERROR unit.seldon-controller-manager/0.juju-log Uncaught exception while in charm code
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 2615, in _run
    result = run(args, **kwargs)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '('/var/lib/juju/tools/unit-seldon-controller-manager-0/config-get', '--format=json')' returned non-zero exit status 1.
Traceback (most recent call last):
  File "./src/charm.py", line 549, in <module>
    main(SeldonCoreOperator)
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/main.py", line 424, in main
    charm = charm_class(framework)
  File "./src/charm.py", line 65, in __init__
    self._metrics_port = self.model.config["metrics-port"]
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 695, in __getitem__
    return self._data[key]
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 679, in _data
    data = self._lazy_data = self._load()
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 1519, in _load
    return self._backend.config_get()
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 2727, in config_get
    out = self._run('config-get', return_output=True, use_json=True)
  File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 2617, in _run
    raise ModelError(e.stderr)
ops.model.ModelError: ERROR permission denied
unit-seldon-controller-manager-0: 20:37:58 ERROR juju.worker.uniter.operation hook "stop" (via hook dispatching script: dispatch) failed: exit status 1
unit-seldon-controller-manager-0: 20:37:58 INFO juju.worker.uniter unit "seldon-controller-manager/0" shutting down: unit "seldon-controller-manager/0" not found (not found)
unit-prometheus-k8s-0: 20:40:01 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
controller-0: 20:40:56 INFO juju.worker.caasapplicationprovisioner.runner stopped "seldon-controller-manager", err: attempt count exceeded: try again
controller-0: 20:40:56 ERROR juju.worker.caasapplicationprovisioner.runner exited "seldon-controller-manager": attempt count exceeded: try again
controller-0: 20:40:56 INFO juju.worker.caasapplicationprovisioner.runner restarting "seldon-controller-manager" in 3s
controller-0: 20:40:59 INFO juju.worker.caasapplicationprovisioner.runner start "seldon-controller-manager"
ERROR lost connection to pod

Remove integration test fails

Description

The remove integration test fails in its current form.
Re-design/debug the remove integration test to ensure it passes.

Remove redundant configmaps and document which one we choose/how to update

Seldon allows for several ways to pass default configuration values. We use a few of them simultaneously, which results in unexpected behaviour.

On deploy, our charm creates:

  • configmap {CHARM_NAME}-operator-resources-config, which is mounted into the seldon-core manager pod as /tmp/operator-resources
  • configmap seldon-config
    Both have the same or similar content (I only checked quickly; I see some charm config values used to override some settings, so maybe this is only done in one of these ConfigMaps, but this can be confirmed)

Seldon preferentially uses:

  1. configmap seldon-config (from the same namespace the Seldon Core manager is deployed to)
  2. /tmp/operator-resources
    So since we deploy both, /tmp/operator-resources is ignored (or at least ignored whenever there's a value in both; not sure if it merges these sources or just takes the first it sees). This means changing values in the {CHARM_NAME}-operator-resources-config will have no effect.

We should remove the redundancy (or at minimum clearly document it in the code and README). There are pros and cons to both; a sketch for checking which values are actually in effect is included after this list:

  • seldon-config: changes to this file will live-update the defaults without having to restart the manager pod, which is good, but since the name is hard-coded, deploying two Seldon charms to the same model will cause conflicts (I don't think there's a good use case for multiple Seldon deployments in the same model, though)
  • /tmp/operator-resources: is uniquely named and used only by the Seldon it is deployed for, so it won't have conflicts if multiple Seldons are deployed. But changes to the source configmap do not automatically propagate to the pod (the pod must be restarted after the configmap is changed in order to load the new data)
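To check which defaults are actually in effect, reading the seldon-config ConfigMap (the source Seldon prefers) is usually enough. A small lightkube sketch, assuming the charm runs in the kubeflow namespace and the server list lives under a predictor_servers key:

import json
from lightkube import Client
from lightkube.resources.core_v1 import ConfigMap

client = Client()
cm = client.get(ConfigMap, "seldon-config", namespace="kubeflow")
# predictor_servers holds the default image/version per prepackaged server
predictor_servers = json.loads(cm.data["predictor_servers"])
print(list(predictor_servers))  # e.g. SKLEARN_SERVER, TENSORFLOW_SERVER, ...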

seldon-controller-manager failed to upgrade 1.6 to 1.7

Description

seldon-controller-manager failed to upgrade from 1.6 to 1.7.
It failed to reach active/idle:

seldon-controller-manager blocked: K8S resources creation failed

Jira

Merge into:

  • track/1.15 (release KF v1.7)
  • main

Testing

Deploy charm from stable:

juju deploy seldon-core seldon-controller-manager --channel=1.14/stable --trust

Verify the deployment:

juju status
Model         Controller  Cloud/Region        Version  SLA          Timestamp
test-upgrade  uk8s        microk8s/localhost  2.9.34   unsupported  11:25:23-04:00

App                        Version                Status  Scale  Charm        Channel      Rev  Address        Exposed  Message
seldon-controller-manager  res:oci-image@eb811b6  active      1  seldon-core  1.14/stable   92  10.152.183.81  no       

Unit                          Workload  Agent  Address      Ports              Message
seldon-controller-manager/0*  active    idle   10.1.59.111  8080/TCP,4443/TCP  

Build the local charm and execute the refresh command:

juju refresh seldon-controller-manager --path=./seldon-core_ubuntu-20.04-amd64.charm --resource="oci-image=docker.io/seldonio/seldon-core-operator:1.15.0"

Verify that the upgrade was successful:

juju status
Model         Controller  Cloud/Region        Version  SLA          Timestamp
test-upgrade  uk8s        microk8s/localhost  2.9.34   unsupported  11:27:41-04:00

App                        Version                         Status  Scale  Charm        Channel  Rev  Address         Exposed  Message
seldon-controller-manager  .../c5e3s519ko1quc9tqnysy92...  active      1  seldon-core  stable     0  10.152.183.252  no       

Unit                          Workload  Agent  Address      Ports  Message
seldon-controller-manager/0*  active    idle   10.1.59.112   

Handle ConfigMap created by workload container during application remove

Description

Handle ConfigMap created by workload container during application remove

There is a ConfigMap created by the seldon-core workload container to track its leadership:

kubectl -n <namespace> get configmap a33bd623.machinelearning.seldon.io -o=yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"seldon-controller-manager-5ff5b59788-r9c5b_d5372a80-cf0b-49ce-88d6-4c33a1096f28","leaseDurationSeconds":15,"acquireTime":"2023-03-23T18:52:51Z","renewTime":"2023-03-23T18:54:15Z","leaderTransitions":1}'
  creationTimestamp: "2023-03-23T18:49:51Z"
  labels:
    app.juju.is/created-by: seldon-controller-manager
  name: a33bd623.machinelearning.seldon.io
  namespace: kf
  resourceVersion: "4439"
  uid: 7dfa723e-5286-435b-a633-84a25f340506

This ConfigMap has an expiration time of 45 seconds.

The initial problem was detected when testing upgrade: deploying the stable charm and then upgrading to the updated one within the 45 second window failed due to the container's inability to acquire a lock on the ConfigMap above:

 error retrieving resource lock kf/a33bd623.machinelearning.seldon.io

If the upgrade is performed outside of the expiration window, it succeeds.

On application removal, this ConfigMap (a33bd623.machinelearning.seldon.io) is not removed. It should be.
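A minimal sketch of how the charm could handle this on removal (handler name and approach are assumptions, not the final implementation):

from lightkube import Client
from lightkube.core.exceptions import ApiError
from lightkube.resources.core_v1 import ConfigMap


def _on_remove(self, _event):
    """Delete the leader-election ConfigMap left behind by the workload."""
    client = Client()
    try:
        client.delete(
            ConfigMap,
            "a33bd623.machinelearning.seldon.io",
            namespace=self.model.name,
        )
    except ApiError as err:
        # Ignore the case where the ConfigMap is already gone
        if err.status.code != 404:
            raise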

Integration of ROCK images into Seldon Core Operator

Description

Integration of ROCK images.
https://github.com/canonical/seldonio-rocks

The following ROCK images need to be integrated into Seldon Core Operator:
Charm managed:

  • seldonio-rocks/seldon-core-operator: Rock is done, integration tests are done, integrated into CKF

Workload managed:

  • seldonio-rocks/sklearnserver: Rock is done, integration tests are done, integrated into CKF
  • seldonio/mlserver-sklearn
  • seldonio/mlserver-xgboost
  • seldonio/mlserver-mlflow
  • seldonio/mlserver-huggingface
  • seldonio/mlserver-slim
  • seldonio/mlflowserver
  • seldonio-rocks/xgboostserver v1
  • seldonio-rocks/xgboostserver v2
  • seldonio-rocks/seldon-core-s2i-python3
  • seldonio-rocks/seldon-core-s2i-python36 (could be drop candidate)
  • seldonio-rocks/seldon-core-s2i-python37 (could be drop candidate)
  • seldonio-rocks/seldon-core-s2i-python38
  • seldonio-rocks/seldon-core-s2i-python36-gpu (could be drop candidate)
  • seldonio-rocks/seldon-core-s2i-python37-gpu (could be drop candidate)
  • seldonio-rocks/seldon-core-s2i-python38-gpu
  • seldonio-rocks/tensorflow/serving
  • seldonio-rocks/seldonio/tfserving-proxy

Design

The design of how ROCKs are built, tested and integrated is captured in the related specification (KF-044).
Main design points for Seldon ROCKs:

  • metadata.yaml is updated with references to new ROCK image.
  • Seldon Core Operator ConfigMap updated with new ROCK images.

Testing

Existing integration tests are to be re-used to test the functionality of the new ROCK images.
Follow the Seldon documentation for testing:
https://docs.seldon.io/projects/seldon-core/en/latest/reference/apis/index.html
https://docs.seldon.io/projects/seldon-core/en/latest/nav/config/servers.html

Integration tests

Implementation

  • Create ROCK images:
    • seldonio-rocks/seldon-core-operator
    • SKLEARN_SERVER: seldonio-rocks/sklearnserver (seldon)
    • SKLEARN_SERVER-V2: seldonio-rocks/mlserver-sklearn (v2) PR
    • XGBOOST_SERVER: seldonio-rocks/xgboostserver (seldon)
    • XGBOOST_SERVER-V2: seldonio-rocks/mlserver-xgboost (v2)
    • MLFLOW_SERVER: seldonio-rocks/mlflowserver (seldon) PR
    • MLFLOW_SERVER-V2: seldonio-rocks/mlserver-mlflow (v2) PR
    • TEMPO_SERVER-V2: seldonio/mlserver-slim
    • HUGGINGFACE_SERVER-V2: seldonio/mlserver-huggingface PR
    • seldonio-rocks/seldon-core-s2i-python3
    • seldonio-rocks/seldon-core-s2i-python38
    • seldonio-rocks/seldon-core-s2i-python38-gpu
    • seldonio-rocks/tensorflow/serving
    • seldonio-rocks/seldonio/tfserving-proxy
  • Create integration tests to test each image (charm and workload managed).
    • seldonio-rocks/seldon-core-operator
    • SKLEARN_SERVER: seldonio-rocks/sklearnserver (seldon)
    • SKLEARN_SERVER-V2: seldonio-rocks/mlserver-sklearn (v2) PR
    • XGBOOST_SERVER: seldonio-rocks/xgboostserver (seldon)
    • XGBOOST_SERVER-V2: seldonio-rocks/mlserver-xgboost (v2)
    • MLFLOW_SERVER: seldonio-rocks/mlflowserver (seldon)
    • MLFLOW_SERVER-V2: seldonio-rocks/mlserver-mlflow (v2) PR
    • TEMPO_SERVER-V2: seldonio/mlserver-slim
    • HUGGINGFACE_SERVER-V2: seldonio/mlserver-huggingface (no test container)
    • seldonio-rocks/seldon-core-s2i-python3
    • seldonio-rocks/seldon-core-s2i-python38
    • seldonio-rocks/seldon-core-s2i-python38-gpu
    • seldonio-rocks/tensorflow/serving
    • seldonio-rocks/seldonio/tfserving-proxy
  • Update required charm's resources to include/deploy ROCK images.
    • seldonio-rocks/seldon-core-operator
    • SKLEARN_SERVER: seldonio-rocks/sklearnserver (seldon)
    • SKLEARN_SERVER-V2: seldonio-rocks/mlserver-sklearn (v2)
    • XGBOOST_SERVER: seldonio-rocks/xgboostserver (seldon)
    • XGBOOST_SERVER-V2: seldonio-rocks/mlserver-xgboost (v2)
    • MLFLOW_SERVER: seldonio-rocks/mlflowserver (seldon)
    • MLFLOW_SERVER-V2: seldonio-rocks/mlserver-mlflow (v2) (broken model in test container)
    • TEMPO_SERVER-V2: seldonio/mlserver-slim
    • HUGGINGFACE_SERVER-V2: seldonio/mlserver-huggingface (no test container)
    • seldonio-rocks/seldon-core-s2i-python3
    • seldonio-rocks/seldon-core-s2i-python38
    • seldonio-rocks/seldon-core-s2i-python38-gpu
    • seldonio-rocks/tensorflow/serving
    • seldonio-rocks/seldonio/tfserving-proxy
  • Publish ROCK images.
    • seldonio-rocks/seldon-core-operator
    • SKLEARN_SERVER: seldonio-rocks/sklearnserver (seldon)
    • SKLEARN_SERVER-V2: seldonio-rocks/mlserver-sklearn (v2)
    • XGBOOST_SERVER: seldonio-rocks/xgboostserver (seldon)
    • XGBOOST_SERVER-V2: seldonio-rocks/mlserver-xgboost (v2)
    • MLFLOW_SERVER: seldonio-rocks/mlflowserver (seldon)
    • MLFLOW_SERVER-V2: seldonio-rocks/mlserver-mlflow (v2) (broken model in test container)
    • TEMPO_SERVER-V2: seldonio/mlserver-slim
    • HUGGINGFACE_SERVER-V2: seldonio/mlserver-huggingface (no test container)
    • seldonio-rocks/seldon-core-s2i-python3
    • seldonio-rocks/seldon-core-s2i-python38
    • seldonio-rocks/seldon-core-s2i-python38-gpu
    • seldonio-rocks/tensorflow/serving
    • seldonio-rocks/seldonio/tfserving-proxy

Charm missing kubeflow role aggregation rule that gives users access to `SeldonDeployments`

As presently implemented, the combination of kubeflow-roles and Seldon planned for the Kubeflow 1.7 release does not grant users access to SeldonDeployments in their namespaces. This has the effect of making Seldon unusable for Kubeflow users.

In pod spec versions of the seldon charm, Kubeflow users were granted permission to create/edit/* SeldonDeployments in their namespace via this aggregated ClusterRole (for a description of how kubeflow aggregates roles to users, see this readme). This ClusterRole was implemented in the kubeflow roles charm because pod spec did not allow us to create arbitrary ClusterRoles.

Now that the charm has been migrated to sidecar, this ClusterRole could be deployed by this charm. canonical/kubeflow-roles-operator#38 removed the ClusterRole from the central role deployment (possibly because it was thought that this role was now implemented in the seldon charm directly?). The result is that users are not granted the required permissions for SeldonDeployments.

To fix this, we should:

  • do one of:
    • restore the aggregation ClusterRole in kubeflow-roles
    • implement the aggregation ClusterRole here in Seldon (a sketch is included below)
  • add tests that would catch this in future

Regarding adding tests, it is unclear where the best place for them should be. Because this was a transfer of responsibility from one charm to another, I think it could only be caught at the bundle level? Although it feels unsatisfying that we can delete a file in kubeflow-roles and not have a check at the repo level to say that it was important.
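If the role ends up living in this charm, the aggregation ClusterRole it would create could look roughly like the lightkube sketch below (the role name and the exact aggregation label are assumptions following the Kubeflow role-aggregation convention):

from lightkube import Client
from lightkube.models.meta_v1 import ObjectMeta
from lightkube.models.rbac_v1 import PolicyRule
from lightkube.resources.rbac_authorization_v1 import ClusterRole

seldon_edit_role = ClusterRole(
    metadata=ObjectMeta(
        name="seldon-kubeflow-edit",
        labels={"rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit": "true"},
    ),
    rules=[
        PolicyRule(
            apiGroups=["machinelearning.seldon.io"],
            resources=["seldondeployments"],
            verbs=["create", "delete", "get", "list", "patch", "update", "watch"],
        )
    ],
)

Client().create(seldon_edit_role)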

Integration tests failing with bad file descriptor message

Encountered during a PR that just updated a random text file in the repo: #183

The tests were failing for Seldon with the following message:

OSError: [Errno 9] Bad file descriptor
ERROR    juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
Error: The operation was canceled.

https://github.com/canonical/seldon-core-operator/actions/runs/5963901210/job/16178014925?pr=183

Full logs
WARNING  juju.client.connection:connection.py:611 RPC: Connection closed, reconnecting
WARNING  juju.client.connection:connection.py:611 RPC: Connection closed, reconnecting
ERROR    asyncio:base_events.py:1707 Task exception was never retrieved
future: <Task finished name='Task-2000' coro=<Connection.reconnect() done, defined at /home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py:736> exception=OSError(9, 'Bad file descriptor')>
Traceback (most recent call last):
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 745, in reconnect
    res = await connector(
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 823, in _connect_with_login
    await self._connect(endpoints)
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 773, in _connect
    result = await task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 619, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 762, in _try_endpoint
    return await self._open(endpoint, cacert)
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line [402](https://github.com/canonical/seldon-core-operator/actions/runs/5963901210/job/16178014925?pr=183#step:5:403), in _open
    return (await websockets.connect(
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/websockets/py35/client.py", line 12, in __await_impl__
    transport, protocol = await self._creating_connection
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
    transport, protocol = await self._create_connection_transport(
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1066, in _create_connection_transport
    sock.setblocking(False)
OSError: [Errno 9] Bad file descriptor
ERROR    juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
ERROR    asyncio:base_events.py:1707 Task exception was never retrieved
future: <Task finished name='Task-2001' coro=<Connection.reconnect() done, defined at /home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py:736> exception=OSError(9, 'Bad file descriptor')>
Traceback (most recent call last):
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 745, in reconnect
    res = await connector(
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 823, in _connect_with_login
    await self._connect(endpoints)
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 773, in _connect
    result = await task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 619, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 762, in _try_endpoint
    return await self._open(endpoint, cacert)
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 402, in _open
    return (await websockets.connect(
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/websockets/py35/client.py", line 12, in __await_impl__
    transport, protocol = await self._creating_connection
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
    transport, protocol = await self._create_connection_transport(
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1066, in _create_connection_transport
    sock.setblocking(False)
OSError: [Errno 9] Bad file descriptor
ERROR    juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
Error: The operation was canceled.

Integration tests for Tensorflow Seldon servers

Description

Integration tests for Tensorflow Seldon servers are required to validate that Seldon deployments will be successful.
Integration tests for these servers should be a part of Seldon integration tests suite and run on every PR.
These tests will be reused in ROCKs testing.

The following Tensorflow prepackaged Seldon servers should be covered in tests:

        "TENSORFLOW_SERVER": {
          "protocols" : {
            "tensorflow": {
              "image": "tensorflow/serving",
              "defaultImageVersion": "2.1.0"
              },
            "seldon": {
              "image": "seldonio/tfserving-proxy",
              "defaultImageVersion": "1.15.0"
              }
            }
        },

Design

Use parameterized tests to execute the same test for different Seldon servers.
To test, use the specified test deployments from:

For the seldon protocol, the server container image in the test response data is updated based on the deployed seldon-config ConfigMap. This ensures that the test response data contains the proper image and version.
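A sketch of that substitution (the helper name is illustrative; it assumes the server list lives under the predictor_servers key of seldon-config, matching the snippet above):

import json
from lightkube import Client
from lightkube.resources.core_v1 import ConfigMap


def tensorflow_server_image(namespace: str) -> str:
    """Return the image:tag expected in the test response for the seldon protocol."""
    cm = Client().get(ConfigMap, "seldon-config", namespace=namespace)
    servers = json.loads(cm.data["predictor_servers"])
    proto = servers["TENSORFLOW_SERVER"]["protocols"]["seldon"]
    return f"{proto['image']}:{proto['defaultImageVersion']}"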

For one of the Tensorflow tests, the request and response data are relatively large. That data is stored in files, and the common parameterized test is modified to be able to read data from a file (JSON) or use it as is (a JSON object).
E.g.:

    # prepare request data:
    # - if it is string, load it from file specified in by that string
    # - otherwise use it as JSON object
    if isinstance(request_data, str):
        # response test data contains file with JSON data
        with open(f"tests/data/{request_data}") as f:
            request_data = json.load(f)

For the mlflowserver test, an ephemeral-storage=2G request/limit needs to be added to ensure it runs in the GH runner.

Testing

Integration testing should pass when implemented.

Review workaround for temporary fix for bug in python-libjuju

Description

Due to juju/python-libjuju#913 there was a temporary fix introduced in requirements-integration.in:

# FIXME: This is a workaround for https://github.com/juju/python-libjuju/issues/913
# please pin to a released python-libjuju ASAP.
juju @ git+https://github.com/DnPlas/python-libjuju@dnplas-pyyaml-6

When the issue is fixed, this needs to be reviewed/removed.
