Seldon Core Operator
License: Apache License 2.0
Update observability libs
Merge into: track/1.15 (release KF v1.7) and main
test_remove_with_resources_present fails on the HEAD of main.
So far, the issue has been narrowed down to this: only one of the three possible CRDs may be present in order for the test not to fail. That means only one of those tests is allowed to run before test_remove_with_resources_present: if at least two from the above list are present, test_remove_with_resources_present fails.
Work items are tracked in https://warthogs.atlassian.net/browse/KF-775
Branch: https://github.com/canonical/seldon-core-operator/tree/kf-775-gh52-feat-alert-rules
Prometheus deployment https://github.com/canonical/prometheus-k8s-operator
Failure alerts are implemented through integration with the Prometheus charm from the Canonical Observability Stack. Prometheus creates scrape jobs based on the alert rules defined by the Seldon Core Operator charm, then scrapes the targets, retrieves the defined metrics, and performs the required calculations.
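For context, a minimal sketch of how this wiring typically looks on the charm side, assuming the standard prometheus_scrape charm library (the relation name, port, and rules path here are illustrative, not necessarily this charm's exact values):

from ops.charm import CharmBase
from ops.main import main
from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointProvider


class SeldonCoreOperator(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # Advertise a scrape job and ship the bundled alert rule files over
        # the metrics-endpoint relation; Prometheus substitutes each unit's
        # address for the "*" target.
        self.prometheus_provider = MetricsEndpointProvider(
            self,
            relation_name="metrics-endpoint",
            jobs=[{"static_configs": [{"targets": ["*:8080"]}]}],
            alert_rules_path="src/prometheus_alert_rules",
        )


if __name__ == "__main__":
    main(SeldonCoreOperator)

With that in place, the deployment steps below are enough for Prometheus to pick up both the scrape job and the alert rules: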
microk8s enable dns storage metallb:"10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111"
juju bootstrap microk8s uk8s
juju add-model test
juju deploy prometheus-k8s --trust
juju deploy ./seldon-core_ubuntu-20.04-amd64.charm seldon-controller-manager --trust --resource oci-image="docker.io/seldonio/seldon-core-operator:1.14.0"
juju relate prometheus-k8s seldon-controller-manager
Navigate to the Prometheus dashboard at https://<Prometheus-unit-IP>:9090 and select Status -> Targets.
There should be a Prometheus scrape job entry targeting the Seldon metrics endpoint (http://<Seldon-Controller-Manager-IP>:8080/metrics) with no errors:
Deploy a sample Seldon deployment in the same model, then check whether any failure alert is reported by navigating to Alerts:
microk8s.kubectl -n test apply -f examples/serve-simple-v1.yaml
microk8s.kubectl -n test delete deploy/seldon-model-example-0-classifier
NOTE: The alert window is 10 minutes and scraping is done once per minute. Make sure at least 2 minutes have passed for a proper rate calculation.
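To script the verification instead of clicking through the dashboard, the standard Prometheus HTTP API can be queried directly. A minimal sketch (the unit address is a placeholder, and the scheme may be http or https depending on the deployment):

import requests

PROMETHEUS = "http://<Prometheus-unit-IP>:9090"  # placeholder address

resp = requests.get(f"{PROMETHEUS}/api/v1/targets", timeout=10)
resp.raise_for_status()
for target in resp.json()["data"]["activeTargets"]:
    # Expect the Seldon metrics endpoint to show health "up" with no lastError.
    print(target["scrapeUrl"], target["health"], target.get("lastError", ""))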
On track/1.15, integration tests fail in GitHub runners.
Create a debug PR and observe the integration test triggered by the workflow in the GitHub runner.
Example real PR that failed after many attempts: #186
GitHub runner action On Pull Request: https://github.com/canonical/seldon-core-operator/actions/workflows/on_pull_request.yaml
Output of integration tests that fail in Github runner:
https://github.com/canonical/seldon-core-operator/actions/runs/5694773152/job/16153113024?pr=186
There were a series of fixes on the main branch that made Seldon integration more reliable. These should be reviewed and probably ported/cherry-picked into the track/1.15 branch:
#190
#188
#187
When users deploy models via Seldon into their namespace, they want to be able to access these models from a step in a Kubeflow Pipeline. This currently does not work: if you try to access the model from a pipeline step, you get the message "403 RBAC: access denied". This may be due to pipeline steps not having Istio sidecars.
For this task to be complete, we need to:
Related is canonical/bundle-kubeflow#557, which describes some challenges in access.
This solves part of #109.
If at all possible, this should also keep traffic inside the Kubernetes cluster. Going outside and authenticating back through the front door is a last resort.
The install event handler needs to be redesigned to deploy K8s resources and update the Pebble layer. All other actions should be performed in the Pebble ready event handler and in main.
When a user serves a model using Seldon in their Kubeflow namespace, there are different use cases for accessing a model that they might want to satisfy:
For each of these use cases, we should have:
Images provided through the config receive special treatment, as seen in seldon-core-operator/src/charm.py, lines 309 to 313 at 7ee415f:
More specifically, the charm attempts to separate the image name from the tag, splitting the string in two using : as a separator. However, this will break the charm installation if the image name contains that special character, which could be the case if a local container registry address is provided, for instance (e.g. 172.17.0.2:5000/tensorflow/serving:2.1.0).
juju deploy seldon-core --trust \
--config custom_images='{
"configmap__predictor__tensorflow__tensorflow": "172.17.0.2:5000/tensorflow/serving:2.1.0",
}'
This affects the latest/edge version of the seldon-core charm.
unit-seldon-core-0: 14:55:41 ERROR unit.seldon-core/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
File "./src/charm.py", line 629, in <module>
main(SeldonCoreOperator)
File "/var/lib/juju/agents/unit-seldon-core-0/charm/venv/ops/main.py", line 439, in main
framework.reemit()
File "/var/lib/juju/agents/unit-seldon-core-0/charm/venv/ops/framework.py", line 843, in reemit
self._reemit()
File "/var/lib/juju/agents/unit-seldon-core-0/charm/venv/ops/framework.py", line 922, in _reemit
custom_handler(event)
File "./src/charm.py", line 438, in _on_install
self._apply_k8s_resources(force_conflicts=True)
File "./src/charm.py", line 420, in _apply_k8s_resources
self.configmap_resource_handler.apply()
File "./src/charm.py", line 195, in configmap_resource_handler
context={**self._context, **self._configmap_images},
File "./src/charm.py", line 289, in _configmap_images
return self._get_custom_images()
File "./src/charm.py", line 310, in _get_custom_images
(
ValueError: too many values to unpack (expected 2)
The tag is the string part following the last :. As a result, we could do something like this:
diff --git a/src/charm.py b/src/charm.py
index 70575d8..9ba4950 100755
--- a/src/charm.py
+++ b/src/charm.py
@@ -310,7 +310,7 @@ class SeldonCoreOperator(CharmBase):
(
custom_images[f"{image_name}__image"],
custom_images[f"{image_name}__version"],
- ) = custom_images[image_name].split(":")
+ ) = custom_images[image_name].rsplit(":", 1)
except yaml.YAMLError as err:
self.logger.error(
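For illustration, here is the behaviour difference on the registry-addressed image from the report; rsplit splits once from the right, so the registry port no longer breaks the unpacking:

image = "172.17.0.2:5000/tensorflow/serving:2.1.0"

# split(":") yields three parts, so unpacking into two names raises
# ValueError: too many values to unpack (expected 2), as in the traceback above.
print(image.split(":"))   # ['172.17.0.2', '5000/tensorflow/serving', '2.1.0']

# rsplit(":", 1) treats only the last colon as the name/tag separator.
name, version = image.rsplit(":", 1)
print(name, version)      # 172.17.0.2:5000/tensorflow/serving 2.1.0

(Note that an image reference without any tag would still fail to unpack; the charm may want to handle that case explicitly.)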
This charm can be related to istio-pilot, but that information is missing from the README. We need to provide instructions on how to deploy Seldon's dependencies when the ingress gateway from istio-operators is going to be used.
During the effort of updating the Seldon charm to 1.17.1 in #216, we ran into the following bug when running the integration tests.
The deployment created by the applied seldondeployment gets stuck with the condition ReplicaSet "X" is progressing (as a result, it's never ready). The issue, though, is that the underlying ReplicaSet creates a pod successfully and its status shows readyReplicas: 1. At the same time, the deployment has an observedGeneration of 7 (or something like that) while the ReplicaSet has its observedGeneration set to 1.
This looks like the same issue we've hit when trying to update Seldon ROCKs, described in this issue's comments: canonical/seldonio-rocks#37 (comment).
During debugging, we tried to apply the aforementioned seldondeployment manually with kubectl apply -f and noticed that in the default namespace the deployment was progressing successfully. Here are the namespaces' YAML outputs:
apiVersion: v1
kind: Namespace
metadata:
annotations:
controller.juju.is/id: ec8f226d-bdc2-45de-891d-7cc8b8f501ff
model.juju.is/id: 73fb3200-e151-46b2-8e87-f814f48f1715
creationTimestamp: "2023-10-03T08:46:52Z"
labels:
app.kubernetes.io/managed-by: juju
kubernetes.io/metadata.name: test-charm-vmxo
model.juju.is/name: test-charm-vmxo
serving.kubeflow.org/inferenceservice: enabled
name: test-charm-vmxo
resourceVersion: "439145"
uid: 82929f14-80fc-4049-b4ab-c74c93ec0e30
spec:
finalizers:
- kubernetes
status:
phase: Active
and
apiVersion: v1
kind: Namespace
metadata:
creationTimestamp: "2023-10-02T08:39:37Z"
labels:
kubernetes.io/metadata.name: default
name: default
resourceVersion: "482729"
uid: 0cf53d5e-b0f5-482e-88a2-6be71e24fe02
spec:
finalizers:
- kubernetes
status:
phase: Active
Tests run in the namespace created by a test Juju model. In that namespace, they try to apply a custom resource, which in turn creates a deployment. The issue is that, while the ReplicaSet creates a pod (successfully) and its status shows readyReplicas: 1, the deployment gets stuck with the condition ReplicaSet "X" is progressing (as a result, it's never ready). The same thing happens if I try to apply the custom resource manually in the testing namespace.
However, if I apply this to the default namespace, the Deployment goes to Ready as expected.
2.9.45 and MicroK8s 1.24 (also tried with 1.26, but the issue persists). Reproduce with tox -e charm-integration or tox -e seldon-servers-integration.
Doing `microk8s inspect` and looking at `snap.microk8s.daemon-kubelite/journal.log`, we noticed this error during deployment creation:
Oct 03 08:51:43 ip-172-31-36-226 microk8s.daemon-kubelite[39625]: E1003 08:51:43.873775 39625 fieldmanager.go:211] "[SHOULD NOT HAPPEN] failed to update managedFields" VersionKind="/, Kind=" namespace="test-charm-vmxo" name="seldon-model-1-example-0-classifier"
[..]
(also a bunch of those)
Oct 03 08:51:47 ip-172-31-36-226 microk8s.daemon-kubelite[39625]: E1003 08:51:47.585896 39625 deployment_controller.go:495] Operation cannot be fulfilled on replicasets.apps "seldon-model-1-example-0-classifier-9df54f658": the object has been modified; please apply your changes to the latest version and try again
`$ kubectl get deployments -n kubeflow seldon-model-1-example-0-classifier -o yaml`
[...]
status:
conditions:
- lastTransitionTime: "2023-09-29T13:15:00Z"
lastUpdateTime: "2023-09-29T13:15:00Z"
message: Deployment does not have minimum availability.
reason: MinimumReplicasUnavailable
status: "False"
type: Available
- lastTransitionTime: "2023-09-29T13:15:00Z"
lastUpdateTime: "2023-09-29T13:15:00Z"
message: ReplicaSet "seldon-model-1-example-0-classifier-b7dfcbcb5" is progressing.
reason: ReplicaSetUpdated
status: "True"
type: Progressing
observedGeneration: 7
replicas: 1
unavailableReplicas: 1
updatedReplicas: 1
`$ kubectl get replicaset -n kubeflow seldon-model-1-example-0-classifier-b7dfcbcb5 -o yaml`
[..]
status:
availableReplicas: 1
fullyLabeledReplicas: 1
observedGeneration: 1
readyReplicas: 1
replicas: 1
According to the Seldon docs, Seldon version 1.15 is not compatible with K8s 1.25.
PR #207 bumps MicroK8s in the CI to 1.25; it fails as expected with Seldon 1.15.
Seldon will be updated to 1.17 as part of the 1.8 release, as documented in canonical/bundle-kubeflow#643. Seldon 1.17 is compatible with k8s 1.25, so we should update the k8s version in the CI along with the manifests and version update.
_get_env_vars() calls _get_istio_gateway(), which returns the name and namespace of the Istio ingress gateway as a string in the namespace/name format. The code suggests that, in the case of the latter returning None, _get_env_vars() will not stop its execution; but in _get_istio_gateway() it is suggested that the unit should go to WaitingStatus if the relation with istio-pilot:gateway-info does not exist.
From the code, it is not clear whether gateway-info is a hard requirement that should put the unit into WaitingStatus or even BlockedStatus. The only thing that is clear is that _get_env_vars() won't care whether there is information about the gateway and will just proceed. If the relation is not necessary 100% of the time, I recommend we reflect that in the code (removing the WaitingStatus) and also in the documentation.
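As a sketch of the explicit contract being suggested, in plain Python with illustrative names (the env var names and fallback behaviour are assumptions, not the charm's actual code): if the relation really is optional, the fallback should be deliberate rather than a silent pass-through of None.

from typing import Dict, Optional


def build_env_vars(gateway_info: Optional[str]) -> Dict[str, str]:
    """gateway_info is the "namespace/name" string from the
    istio-pilot:gateway-info relation, or None when the relation is absent."""
    env = {}
    if gateway_info is None:
        # Documented, intentional fallback instead of an accidental empty value.
        env["ISTIO_ENABLED"] = "false"
        env["ISTIO_GATEWAY"] = ""
    else:
        env["ISTIO_ENABLED"] = "true"
        env["ISTIO_GATEWAY"] = gateway_info
    return env


print(build_env_vars(None))
print(build_env_vars("kubeflow/kubeflow-gateway"))

If instead the relation is a hard requirement, _get_env_vars() should short-circuit and let the charm set WaitingStatus (or BlockedStatus) rather than building a partial environment.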
Need to update libraries and fix all required tests.
Currently, config.yaml refers to the default Seldon Core Operator container image, v1.14. As part of the Kubeflow 1.8 work, a new image is being used. As a result, this default configuration should be updated to use the new ROCK image.
Refer to the corresponding line in this charm's config.yaml.
In a working Kubeflow 1.4 deployment on MicroK8s (without RBAC enabled).
To replicate clone the repo and run:
kubectl apply -f examples/serve-simple-v1.yaml
The response is:
Error from server (InternalError): error when creating "examples/serve-simple-v1.yaml": Internal error occurred: failed calling webhook "v1alpha2.vseldondeployment.kb.io": Post "https://seldon-webhook-service.kubeflow.svc:4443/validate-machinelearning-seldon-io-v1alpha2-seldondeployment?timeout=30s": dial TCP
Seldon deployed: seldon-controller-manager res:oci-image@82fd029 active 1 seldon-core charmhub stable 50 kubernetes 10.152.183.16
Microk8s: installed: v1.21.8 (2870) 191MB classic
The goal of this task is to make all images configurable so that when this charm is deployed in an airgapped environment, all image resources are pulled from an arbitrary local container image registry (avoiding pulling images from the internet).
This serves as a tracking issue for the required changes and backports to the latest stable track/* GitHub branch.
Mark the following as done
WARNING: No breaking changes should be backported into the track/<version> branch. A breaking change can be anything that requires extra steps to refresh from the previous /stable other than just juju refresh. Please avoid these situations at all costs.
The following files have to be modified and/or verified to enable image configuration:
metadata.yaml - the container image(s) of the workload containers have to be specified in this file. This only applies to sidecar charms. Example:
containers:
training-operator:
resource: training-operator-image
resources:
training-operator-image:
type: oci-image
description: OCI image for training-operator
upstream-source: kubeflow/training-operator:v1-855e096
apiVersion: v1
kind: ConfigMap
metadata:
name: seldon-config
namespace: {{ namespace }}
data:
predictor_servers: |-
{
"TENSORFLOW_SERVER": {
"protocols" : {
"tensorflow": {
"image": "tensorflow/serving", <--- this image should be configurable
"defaultImageVersion": "2.1.0"
},
"seldon": {
"image": "seldonio/tfserving-proxy",
"defaultImageVersion": "1.15.0"
}
}
},
...
tools/get-images.sh - is a bash script that returns a list of all the images that are used by this charm. In the case of a multi-charm repo, this is located at the root of the repo and gathers images from all charms in it.
src/charm.py - verify that nothing inside the charm code is calling a subprocess that requires internet connection.
Spin up an airgap environment following canonical/bundle-kubeflow#682 and canonical/bundle-kubeflow#703 (comment)
Build the charm making sure that all the changes for airgap are in place.
Deploy the charms manually and observe the charm go to active and idle.
Additionally, run integration tests or simulate them. For instance, creating a workload (like a PytorchJob, a SeldonDeployment, etc.).
After completing the changes and testing, this charm has to be published to its stable risk in Charmhub. For that you must wait for the charm to be published to /edge, which is the revision to be promoted to /stable. Use the workflow dispatch for this (Actions>Release charm to other tracks...>Run workflow).
Suggested changes/backports
The k8s autoscaling/v2beta1 API has been migrated to v2, causing this issue in Seldon and blocking Seldon <=1.15 from using k8s 1.25+. Seldon 1.16 will fix this and support k8s 1.25+.
The charm goes into an error state when "config-changed" is called:
unit-seldon-controller-manager-0: 04:44:24 ERROR unit.seldon-controller-manager/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
File "/usr/lib/python3.8/urllib/request.py", line 1354, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "/usr/lib/python3.8/http/client.py", line 1256, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1302, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1251, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1011, in _send_output
self.send(msg)
File "/usr/lib/python3.8/http/client.py", line 951, in send
self.connect()
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/pebble.py", line 252, in connect
self.sock.connect(self.socket_path)
FileNotFoundError: [Errno 2] No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/pebble.py", line 1484, in _request_raw
response = self.opener.open(request, timeout=self.timeout)
File "/usr/lib/python3.8/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/pebble.py", line 266, in http_open
return self.do_open(_UnixSocketConnection, req, # type:ignore
File "/usr/lib/python3.8/urllib/request.py", line 1357, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./src/charm.py", line 528, in <module>
main(SeldonCoreOperator)
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/main.py", line 436, in main
_emit_charm_event(charm, dispatcher.event_name)
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
event_to_emit.emit(*args, **kwargs)
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/framework.py", line 354, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/framework.py", line 830, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/framework.py", line 919, in _reemit
custom_handler(event)
File "./src/charm.py", line 511, in _on_event
self._update_layer()
File "./src/charm.py", line 260, in _update_layer
current_layer = self.container.get_plan()
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 1933, in get_plan
return self._pebble.get_plan()
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/pebble.py", line 1772, in get_plan
resp = self._request('GET', '/v1/plan', {'format': 'yaml'})
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/pebble.py", line 1451, in _request
response = self._request_raw(method, path, query, headers, data)
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/pebble.py", line 1497, in _request_raw
raise ConnectionError(e.reason)
ops.pebble.ConnectionError: [Errno 2] No such file or directory
Found during testing canonical/bundle-kubeflow#462.
The webhook label fix in #14 is only applied for a green-field deployment with the new revision of the charm. In an upgrade scenario, there is no definition update, so the original issue is still there.
$ juju info seldon-core
name: seldon-core
charm-id: ZGHtHpN4TqAzrUlh9aG1SWxXenopHFRH
...
channels: |
latest/stable: 52 2022-01-25 (52) 1MB
latest/candidate: ↑
latest/beta: ↑
latest/edge: 58 2022-06-01 (58) 7MB
[stable: revision 52]
$ juju deploy seldon-core seldon-controller-manager
Located charm "seldon-core" in charm-hub, revision 52
Deploying "seldon-controller-manager" from charm-hub charm "seldon-core", revision 52 in channel stable on focal
$ microk8s kubectl -n kubeflow describe service/seldon-webhook-service
Name: seldon-webhook-service
Namespace: kubeflow
Labels: app=seldon
app.juju.is/created-by=seldon-controller-manager
app.kubernetes.io/instance=seldon-core
app.kubernetes.io/version=1.9.0
Annotations: <none>
Selector: app.kubernetes.io/name=seldon-controller-manager,control-plane=seldon-controller-manager
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.152.183.28
IPs: 10.152.183.28
Port: <unset> 4443/TCP
TargetPort: 4443/TCP
Endpoints: <none>
Session Affinity: None
Events: <none>
-> the selector includes control-plane=seldon-controller-manager
[upgrade from 52 to 58]
$ juju refresh seldon-controller-manager --channel edge
Added charm-hub charm "seldon-core", revision 58 in channel edge, to the model
Adding endpoint "grafana-dashboard" to default space "alpha"
Adding endpoint "metrics-endpoint" to default space "alpha"
Leaving endpoints in "alpha": ambassador, istio, keda
$ microk8s kubectl -n kubeflow describe service/seldon-webhook-service
Name: seldon-webhook-service
Namespace: kubeflow
Labels: app=seldon
app.juju.is/created-by=seldon-controller-manager
app.kubernetes.io/instance=seldon-core
app.kubernetes.io/version=1.9.0
Annotations: <none>
Selector: app.kubernetes.io/name=seldon-controller-manager,control-plane=seldon-controller-manager
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.152.183.28
IPs: 10.152.183.28
Port: <unset> 4443/TCP
TargetPort: 4443/TCP
Endpoints: <none>
Session Affinity: None
Events: <none>
-> control-plane=seldon-controller-manager is still there even after the refresh.
[edge: revision 58]
$ juju deploy seldon-core seldon-controller-manager --channel edge
Located charm "seldon-core" in charm-hub, revision 58
Deploying "seldon-controller-manager" from charm-hub charm "seldon-core", revision 58 in channel edge on focal
$ microk8s kubectl -n kubeflow describe service/seldon-webhook-service
Name: seldon-webhook-service
Namespace: kubeflow
Labels: app=seldon
app.juju.is/created-by=seldon-controller-manager
app.kubernetes.io/instance=seldon-core
app.kubernetes.io/version=1.9.0
Annotations: <none>
Selector: app.kubernetes.io/name=seldon-controller-manager
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.152.183.84
IPs: 10.152.183.84
Port: <unset> 4443/TCP
TargetPort: 4443/TCP
Endpoints: 10.1.60.16:4443
Session Affinity: None
Events: <none>
-> as expected, there is no control-plane=seldon-controller-manager.
workaround:
$ microk8s kubectl -n kubeflow patch service/seldon-webhook-service --type=json \
-p='[{"op": "remove", "path": "/spec/selector/control-plane"}]'
During upgrade, the charm gets stuck with 409 Conflict errors during K8s resource creation.
juju deploy seldon-core seldon-controller-manager --channel 1.14/stable
juju trust seldon-controller-manager --scope=cluster
juju refresh seldon-controller-manager --channel 1.15/edge
Which yields logs of:
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log Rendering manifests
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/namespaces/kubeflow/roles/leader-election-role?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/manager-role?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/manager-sas-role?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/kubeflow-edit-seldon?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/namespaces/kubeflow/rolebindings/leader-election-rolebinding?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/manager-rolebinding?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/manager-sas-rolebinding?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/seldon-validating-webhook-configuration?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/api/v1/namespaces/kubeflow/services/seldon-webhook-service?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log Reconcile completed successfully
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:16 INFO unit.seldon-controller-manager/0.juju-log Rendering manifests
unit-seldon-controller-manager-0: 15:21:17 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions/seldondeployments.machinelearning.seldon.io?fieldManager=lightkube "HTTP/1.1 409 Conflict"
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.juju-log Encountered a conflict: Apply failed with 1 conflict: conflict with "manager" using apiextensions.k8s.io/v1: .spec.versions
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Error in sys.excepthook:
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Traceback (most recent call last):
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install self.emit(record)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/log.py", line 41, in emit
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install self.model_backend.juju_log(record.levelname, self.format(record))
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install return fmt.format(record)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/usr/lib/python3.8/logging/__init__.py", line 676, in format
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install record.exc_text = self.formatException(record.exc_info)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/usr/lib/python3.8/logging/__init__.py", line 626, in formatException
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install traceback.print_exception(ei[0], ei[1], tb, None, sio)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/usr/lib/python3.8/traceback.py", line 103, in print_exception
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install for line in TracebackException(
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/usr/lib/python3.8/traceback.py", line 617, in format
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install yield from self.format_exception_only()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/usr/lib/python3.8/traceback.py", line 566, in format_exception_only
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install stype = smod + '.' + stype
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Original exception was:
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Traceback (most recent call last):
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/generic_client.py", line 188, in raise_for_status
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install resp.raise_for_status()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/httpx/_models.py", line 749, in raise_for_status
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install raise HTTPStatusError(message, request=request, response=self)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install httpx.HTTPStatusError: Client error '409 Conflict' for url 'https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions/seldondeployments.machinelearning.seldon.io?fieldManager=lightkube'
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install For more information check: https://httpstatuses.com/409
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install During handling of the above exception, another exception occurred:
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Traceback (most recent call last):
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "./src/charm.py", line 331, in _apply_k8s_resources
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install self.crd_resource_handler.apply()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/charmed_kubeflow_chisme/kubernetes/_kubernetes_resource_handler.py", line 351, in apply
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install raise e
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/charmed_kubeflow_chisme/kubernetes/_kubernetes_resource_handler.py", line 336, in apply
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install apply_many(client=self.lightkube_client, objs=resources, force=force)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/charmed_kubeflow_chisme/lightkube/batch/_many.py", line 64, in apply_many
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install returns[i] = client.apply(
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/client.py", line 457, in apply
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install return self.patch(type(obj), name, obj, namespace=namespace,
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/client.py", line 325, in patch
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install return self._client.request("patch", res=res, name=name, namespace=namespace, obj=obj,
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/generic_client.py", line 245, in request
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install return self.handle_response(method, resp, br)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/generic_client.py", line 196, in handle_response
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install self.raise_for_status(resp)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/generic_client.py", line 190, in raise_for_status
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install raise transform_exception(e)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install lightkube.core.exceptions.ApiError: Apply failed with 1 conflict: conflict with "manager" using apiextensions.k8s.io/v1: .spec.versions
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install The above exception was the direct cause of the following exception:
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Traceback (most recent call last):
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "./src/charm.py", line 523, in <module>
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install main(SeldonCoreOperator)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/main.py", line 439, in main
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install framework.reemit()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/framework.py", line 840, in reemit
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install self._reemit()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/framework.py", line 919, in _reemit
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install custom_handler(event)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "./src/charm.py", line 357, in _on_install
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install self._apply_k8s_resources()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install File "./src/charm.py", line 340, in _apply_k8s_resources
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install raise GenericCharmRuntimeError("CRD resources creation failed") from error
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install <unknown>GenericCharmRuntimeError: CRD resources creation failed
unit-seldon-controller-manager-0: 15:21:17 ERROR juju.worker.uniter.operation hook "install" (via hook dispatching script: dispatch) failed: exit status 1
unit-seldon-controller-manager-0: 15:21:17 INFO juju.worker.uniter awaiting error resolution for "install" hook
where we see 409 conflict errors when creating the CRDs.
(feels similar to canonical/training-operator#104, but that issue was going between sidecar charms whereas this is going from podspec to sidecar)
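The traceback already shows charmed_kubeflow_chisme threading a force flag into apply_many. A minimal sketch of the underlying mechanism with lightkube (assuming a rendered manifest file and a reachable cluster): passing force=True to a server-side apply takes ownership of fields held by the conflicting field manager ("manager" in the logs above) instead of failing with 409 Conflict.

from lightkube import Client
from lightkube.codecs import load_all_yaml

client = Client(field_manager="lightkube")
with open("rendered_crds.yaml") as f:  # illustrative: the rendered CRD manifests
    for obj in load_all_yaml(f):
        # force=True resolves the server-side-apply conflict by overriding
        # fields owned by the previous manager.
        client.apply(obj, force=True)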
Seems like this charm cannot be built during integration tests, resulting in the following error message:
RuntimeError: Failed to build charm .:
Packing the charm.
Launching environment to pack for base name='ubuntu' channel='20.04' architectures=['amd64'] (may take a while the first time but it's reusable)
charmcraft internal error: LXDError(brief="failed to inject host's snap 'charmcraft' into target environment.", details=None, resolution=None)
Full execution log: '/home/runner/.local/state/charmcraft/log/charmcraft-20230726-124415.757660.log'
SeldonDeployments using our charm to deploy models from artifacts stored via MLflow currently fail to deploy because the seldonio/mlflowserver:1.9.0 image used for the classifier pod has an unpinned package dependency problem. This causes the Flask server to fail as in pallets/flask#4455. This has been fixed in SeldonIO/seldon-core#3946, which would fix new versions of the mlflowserver image.
To fix the above issue, we should update the images we specify in our config like here. It is unclear whether we need to update all images at the same time, or whether we can pick and choose (for example, the main charm runtime is back on version 1.6, and current versions are >1.13.x).
Add upgrade option to tox.ini update-requirements
Merge into: track/1.15 (release KF v1.7) and main
Provide documentation (Jupyter notebooks, README update, how-to guide, etc.) on Seldon Core Deployment scenarios:
Some of the work was completed as part of this PR: #20
The upgrade integration test in its current form fails and is currently skipped.
Re-design/debug the upgrade integration test to ensure it passes.
The initial implementation is in https://github.com/canonical/seldon-core-operator/blob/main/tests/integration/test_upgrade.py
The seldonio/engine image was removed from the project: SeldonIO/seldon-core@6fc7f9b
It is still present in the static list in the image retrieval script, and should be removed.
After deploying the edge bundle to Charmed Kubernetes on AWS, seldon-core is failing.
Running:
juju deploy kubeflow --channel=latest/edge --trust
Container logs:
โฏ kubectl logs -f seldon-controller-manager-54cd8dcfc-prvxz -n kubeflow
Defaulted container "seldon-core" out of: seldon-core, juju-pod-init (init)
{"level":"info","ts":1674829552.0947893,"logger":"setup","msg":"Intializing operator"}
{"level":"info","ts":1674829552.1163418,"logger":"setup","msg":"CRD not found - trying to create"}
{"level":"error","ts":1674829552.2253666,"logger":"setup","msg":"unable to initialise operator","error":"the server could not find the requested resource","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nmain.main\n\t/workspace/main.go:149\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
Attaching logs
dump.log
Seldon-only logs:
โฏ juju debug-log --replay | grep seldon
controller-0: 14:56:45 INFO juju.worker.caasapplicationprovisioner.runner start "seldon-controller-manager"
controller-0: 15:06:24 INFO juju.worker.caasprovisioner started operator for application "seldon-controller-manager"
application-seldon-controller-manager: 15:06:26 INFO juju.cmd running jujud [2.9.34 90e2f047763059f0b8a57941ae0907346464aee8 gc go1.19]
application-seldon-controller-manager: 15:06:26 DEBUG juju.cmd args: []string{"/var/lib/juju/tools/jujud", "caasoperator", "--application-name=seldon-controller-manager", "--debug"}
application-seldon-controller-manager: 15:06:26 DEBUG juju.agent read agent config, format "2.0"
application-seldon-controller-manager: 15:06:26 INFO juju.worker.upgradesteps upgrade steps for 2.9.34 have already been run.
application-seldon-controller-manager: 15:06:26 INFO juju.cmd.jujud caas operator application-seldon-controller-manager start (2.9.34 [gc])
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "clock" manifold worker started at 2023-01-27 14:06:26.196880263 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "agent" manifold worker started at 2023-01-27 14:06:26.197542092 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "upgrade-steps-gate" manifold worker started at 2023-01-27 14:06:26.198022058 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.introspection introspection worker listening on "@jujud-application-seldon-controller-manager"
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "caas-units-manager" manifold worker started at 2023-01-27 14:06:26.1988468 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.introspection stats worker now serving
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.apicaller connecting with old password
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "api-config-watcher" manifold worker started at 2023-01-27 14:06:26.208160367 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "upgrade-steps-flag" manifold worker started at 2023-01-27 14:06:26.210204205 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "migration-fortress" manifold worker started at 2023-01-27 14:06:26.220874831 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.api successfully dialed "wss://172.31.25.210:17070/model/3ebd7f9f-23d7-4c10-82d4-68a99d6006b4/api"
application-seldon-controller-manager: 15:06:26 INFO juju.api connection established to "wss://172.31.25.210:17070/model/3ebd7f9f-23d7-4c10-82d4-68a99d6006b4/api"
application-seldon-controller-manager: 15:06:26 INFO juju.worker.apicaller [3ebd7f] "application-seldon-controller-manager" successfully connected to "172.31.25.210:17070"
application-seldon-controller-manager: 15:06:26 DEBUG juju.api RPC connection died
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "api-caller" manifold worker completed successfully
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.apicaller connecting with old password
application-seldon-controller-manager: 15:06:26 DEBUG juju.api successfully dialed "wss://172.31.25.210:17070/model/3ebd7f9f-23d7-4c10-82d4-68a99d6006b4/api"
application-seldon-controller-manager: 15:06:26 INFO juju.api connection established to "wss://172.31.25.210:17070/model/3ebd7f9f-23d7-4c10-82d4-68a99d6006b4/api"
application-seldon-controller-manager: 15:06:26 INFO juju.worker.apicaller [3ebd7f] "application-seldon-controller-manager" successfully connected to "172.31.25.210:17070"
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "api-caller" manifold worker started at 2023-01-27 14:06:26.255014228 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "caas-units-manager" manifold worker completed successfully
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "caas-units-manager" manifold worker started at 2023-01-27 14:06:26.263643616 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "upgrade-steps-runner" manifold worker started at 2023-01-27 14:06:26.265520702 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "upgrade-steps-runner" manifold worker completed successfully
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "upgrader" manifold worker started at 2023-01-27 14:06:26.267809453 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "log-sender" manifold worker started at 2023-01-27 14:06:26.267974806 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "migration-minion" manifold worker started at 2023-01-27 14:06:26.268182889 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "migration-inactive-flag" manifold worker started at 2023-01-27 14:06:26.27124827 +0000 UTC
application-seldon-controller-manager: 15:06:26 INFO juju.worker.caasupgrader abort check blocked until version event received
application-seldon-controller-manager: 15:06:26 INFO juju.worker.migrationminion migration phase is now: NONE
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.caasupgrader current agent binary version: 2.9.34
application-seldon-controller-manager: 15:06:26 INFO juju.worker.caasupgrader unblocking abort check
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.logger initial log config: "<root>=DEBUG"
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "proxy-config-updater" manifold worker started at 2023-01-27 14:06:26.284435601 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "charm-dir" manifold worker started at 2023-01-27 14:06:26.284655444 +0000 UTC
application-seldon-controller-manager: 15:06:26 INFO juju.worker.logger logger worker started
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "api-address-updater" manifold worker started at 2023-01-27 14:06:26.284730685 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "logging-config-updater" manifold worker started at 2023-01-27 14:06:26.284764145 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.dependency "hook-retry-strategy" manifold worker started at 2023-01-27 14:06:26.314754686 +0000 UTC
application-seldon-controller-manager: 15:06:26 DEBUG juju.worker.logger reconfiguring logging from "<root>=DEBUG" to "<root>=INFO"
application-seldon-controller-manager: 15:06:26 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
application-seldon-controller-manager: 15:06:26 INFO juju.worker.caasoperator.charm downloading ch:amd64/focal/seldon-core-58 from API server
application-seldon-controller-manager: 15:06:26 INFO juju.downloader downloading from ch:amd64/focal/seldon-core-58
application-seldon-controller-manager: 15:06:26 INFO juju.downloader download complete ("ch:amd64/focal/seldon-core-58")
application-seldon-controller-manager: 15:06:26 INFO juju.downloader download verified ("ch:amd64/focal/seldon-core-58")
application-seldon-controller-manager: 15:06:32 INFO juju.worker.caasoperator operator "seldon-controller-manager" started
application-seldon-controller-manager: 15:06:32 INFO juju.worker.caasoperator.runner start "seldon-controller-manager/0"
application-seldon-controller-manager: 15:06:32 INFO juju.worker.leadership seldon-controller-manager/0 promoted to leadership of seldon-controller-manager
application-seldon-controller-manager: 15:06:32 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-seldon-controller-manager-0
application-seldon-controller-manager: 15:06:32 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0 unit "seldon-controller-manager/0" started
application-seldon-controller-manager: 15:06:32 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0 resuming charm install
application-seldon-controller-manager: 15:06:32 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.charm downloading ch:amd64/focal/seldon-core-58 from API server
application-seldon-controller-manager: 15:06:32 INFO juju.downloader downloading from ch:amd64/focal/seldon-core-58
application-seldon-controller-manager: 15:06:32 INFO juju.downloader download complete ("ch:amd64/focal/seldon-core-58")
application-seldon-controller-manager: 15:06:32 INFO juju.downloader download verified ("ch:amd64/focal/seldon-core-58")
application-seldon-controller-manager: 15:06:39 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0 hooks are retried true
application-seldon-controller-manager: 15:06:39 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0 found queued "install" hook
application-seldon-controller-manager: 15:06:40 INFO unit.seldon-controller-manager/0.juju-log Running legacy hooks/install.
application-seldon-controller-manager: 15:06:45 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "install" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:06:45 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0 found queued "leader-elected" hook
application-seldon-controller-manager: 15:06:48 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "leader-elected" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:06:55 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "config-changed" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:06:55 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0 found queued "start" hook
application-seldon-controller-manager: 15:06:55 INFO unit.seldon-controller-manager/0.juju-log Running legacy hooks/start.
application-seldon-controller-manager: 15:06:57 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "start" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:08:12 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "config-changed" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:09:42 INFO juju.worker.caasoperator started pod init on "seldon-controller-manager/0"
application-seldon-controller-manager: 15:11:11 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "update-status" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:16:25 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "update-status" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:22:16 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "update-status" hook (via hook dispatching script: dispatch)
application-seldon-controller-manager: 15:27:56 INFO juju.worker.caasoperator.uniter.seldon-controller-manager/0.operation ran "update-status" hook (via hook dispatching script: dispatch)
1.6/stable is working for Charmed Kubeflow.
I think this use of ApiError might be a bug. The ApiError caught here is a lightkube ApiError, but we're catching errors that are thrown from file operation calls.
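A sketch of the distinction (the code shape here is illustrative, not the charm's actual lines): file operations raise the OSError family, which lightkube's ApiError never covers, so the two failure modes need separate except clauses.

from lightkube import Client
from lightkube.codecs import load_all_yaml
from lightkube.core.exceptions import ApiError


def apply_manifest(client: Client, path: str) -> None:
    # open()/read() raise OSError (FileNotFoundError, PermissionError, ...);
    # an 'except ApiError' around this block would silently miss them.
    try:
        with open(path) as f:
            objs = load_all_yaml(f)
    except OSError as err:
        raise RuntimeError(f"cannot read manifest {path!r}") from err

    # Only the Kubernetes API calls can raise lightkube's ApiError.
    try:
        for obj in objs:
            client.apply(obj)
    except ApiError as err:
        raise RuntimeError("Kubernetes API rejected the apply") from err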
Work items are tracked in https://warthogs.atlassian.net/browse/KF-829
Branch: https://github.com/canonical/seldon-core-operator/tree/kf-829-gh68-feat-metrics-discovery
Prometheus deployment https://github.com/canonical/prometheus-k8s-operator
Failure alerts are implemented through integration with the Prometheus charm from the Canonical Observability Stack. For metrics provided by models, targets can change from model to model and from deployment to deployment, so the Metrics Endpoint Observer provided by COS is integrated: updates to targets are handled by the Metrics Endpoint Observer and relayed to Prometheus by the Seldon Core Operator charm.
microk8s enable dns storage metallb:"10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111"
juju bootstrap microk8s uk8s
juju add-model test
juju deploy prometheus-k8s --trust
juju deploy ./seldon-core_ubuntu-20.04-amd64.charm seldon-controller-manager --trust --resource oci-image="docker.io/seldonio/seldon-core-operator:1.14.0"
juju relate prometheus-k8s seldon-controller-manager
The resulting deployment should look like this:
Model Controller Cloud/Region Version SLA Timestamp
test uk8s microk8s/localhost 2.9.34 unsupported 15:28:26-05:00
App Version Status Scale Charm Channel Rev Address Exposed Message
prometheus-k8s 2.33.5 active 1 prometheus-k8s stable 79 10.152.183.239 no
seldon-controller-manager active 1 seldon-core 0 10.152.183.182 no
Unit Workload Agent Address Ports Message
prometheus-k8s/0* active idle 10.1.59.80
seldon-controller-manager/0* active idle 10.1.59.79
Deploy model with custom metrics:
microk8s.kubectl -n test apply -f examples/echo-metrics-v1.yaml
Get IP address of model classifier and use it for prediction request:
microk8s.kubectl -n test get svc | grep echo-metrics-default-classifier
echo-metrics-default-classifier ClusterIP 10.152.183.34 <none> 9000/TCP,9500/TCP 25m
Request prediction using IP address of model classifier:
for i in `seq 1 10`; do sleep 0.1 && \
curl -v -s -H "Content-Type: application/json" \
-d '{"data": {"ndarray":[[1.0, 2.0, 5.0]]}}' \
http://<echo-metrics-default-classifier-IP>:9000/predict > /dev/null ; \
done
Metrics are available at the pod's IP address on the metrics service port:
microk8s.kubectl -n test describe pod echo-metrics-default-0-classifier-5bf6cf86cd-r7c8l
IP: 10.1.59.82
PREDICTIVE_UNIT_METRICS_SERVICE_PORT: 6000
PREDICTIVE_UNIT_METRICS_ENDPOINT: /prometheus
curl http://10.1.59.82:6000/prometheus
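For scripted checks, the same endpoint can be fetched and parsed with the prometheus_client text parser (pod IP and port taken from the describe output above; adjust as needed):

import requests
from prometheus_client.parser import text_string_to_metric_families

resp = requests.get("http://10.1.59.82:6000/prometheus", timeout=10)
resp.raise_for_status()
for family in text_string_to_metric_families(resp.text):
    for sample in family.samples:
        # Each sample carries (name, labels, value); the custom model metrics
        # emitted by the prediction requests above should appear here.
        print(sample.name, sample.labels, sample.value)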
Navigate to the Prometheus dashboard at https://<Prometheus-unit-IP>:9090 and select Status -> Targets.
We should add a test (or tests) confirming Seldon works with our Istio charms/ingress, and maybe also with Istio+auth. This could be similar to the examples/ingress_canary_and_auth.ipynb example.
When deploying the seldon-core operator with its default application name, seldon-core, the container gets into an error state. This also happens if it is given any other application name besides seldon-controller-manager.
Steps to reproduce:
juju deploy seldon-core
Trying to deploy a SeldonDeployment:
$ microk8s kubectl apply -f model-mlflow-local.yaml -n kubeflow
seldondeployment.machinelearning.seldon.io/mlflow created
$ microk8s kubectl apply -f model-mlflow-local.yaml -n admin
Error from server (InternalError): error when creating "model-mlflow-local.yaml": Internal error occurred: failed calling webhook "v1alpha2.vseldondeployment.kb.io": Post "https://seldon-webhook-service.kubeflow.svc:4443/validate-machinelearning-seldon-io-v1alpha2-seldondeployment?timeout=30s": dial tcp 10.152.183.150:4443: connect: connection refused
I have the same secret in both of the namespaces:
$ microk8s kubectl get secrets -A | grep seldon-init-container-secret
admin seldon-init-container-secret Opaque 6 114m
kubeflow seldon-init-container-secret Opaque 6 114m
YAML file (requires changing the modelUri):
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
name: mlflow
spec:
name: wines
predictors:
- componentSpecs:
- spec:
containers:
- name: classifier
livenessProbe:
initialDelaySeconds: 80
failureThreshold: 200
periodSeconds: 5
successThreshold: 1
httpGet:
path: /health/ping
port: http
scheme: HTTP
readinessProbe:
initialDelaySeconds: 80
failureThreshold: 200
periodSeconds: 5
successThreshold: 1
httpGet:
path: /health/ping
port: http
scheme: HTTP
graph:
children: []
implementation: MLFLOW_SERVER
modelUri: s3://mlflow/71/8476095066fd43af8ae2a6f1511044df/artifacts/model
envSecretRefName: seldon-init-container-secret
name: classifier
name: wine-super-model
replicas: 1
Deployment was done using kubeflow-lite bundle + mlflow (stable channel). RBAC is enabled on microk8s.
It seems like we have some flaky test executions in GitHub runners when running tests/integration/test_seldon_servers.py:
GH runners sometimes get overloaded with these workloads, and it may take more time than expected to deploy and verify that SeldonDeployments are working correctly, causing errors like AssertionError: Waited too long for seldondeployment/mlflow!. A possible fix is to increase the timeout; this is being worked on in #190.
The above causes the SeldonDeployments created by the test_seldon_predictor_server test case to not always be removed, because the test case fails and doesn't have a step to ensure cleanup between test cases. Since this test case is parametrised, it will try to deploy SeldonDeployments that may have the same name, which can cause conflicts if they were not correctly removed in a previous execution, ending in failures with the message lightkube.core.exceptions.ApiError: seldondeployments.machinelearning.seldon.io "mlflow" already exists. This error can be found here. A fix for this is being worked on in #188; a cleanup sketch is given below.
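Pending the real fix in #188, a cleanup between parametrised cases could look like this sketch (assumptions: lightkube, the SeldonDeployment CRD served at version v1, test namespace test, and the deployment name mlflow from the error above):

# Sketch of a cleanup step between parametrised cases, assuming lightkube;
# deletes any leftover SeldonDeployment so re-runs don't hit "already exists".
import pytest
from lightkube import Client
from lightkube.core.exceptions import ApiError
from lightkube.generic_resource import create_namespaced_resource

SeldonDeployment = create_namespaced_resource(
    "machinelearning.seldon.io", "v1", "SeldonDeployment", "seldondeployments"
)

@pytest.fixture
def clean_seldon_deployment():
    yield
    try:
        Client().delete(SeldonDeployment, "mlflow", namespace="test")
    except ApiError:
        pass  # nothing left behind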
We need Istio to provide external access to models and canary deployments.
Right now, relating istio-pilot with seldon-controller-manager is not possible. The default values also do not work for CKF; they require a default Istio installation in the istio-system namespace.
https://github.com/canonical/seldon-core-operator/blob/master/LICENSE leads to a 404 error. Please include a LICENSE file in this repo.
PR #102 upgraded charm libs, which caused test_prometheus_data_set to fail with the error:
File "/home/runner/work/seldon-core-operator/seldon-core-operator/tests/unit/test_operator.py", line 111, in test_prometheus_data_set
assert json.loads(
KeyError: 'scrape_jobs'
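For reference, a sketch of what the failing assertion exercises, assuming the metrics relation is named metrics-endpoint (the prometheus_scrape convention) and the default metrics-port of 8080; the real test is tests/unit/test_operator.py line 111, and details may differ:

# Sketch of the failing unit test; relation name and target are assumptions.
import json
from ops.testing import Harness
from charm import SeldonCoreOperator

harness = Harness(SeldonCoreOperator)
harness.set_leader(True)
harness.begin()
rel_id = harness.add_relation("metrics-endpoint", "prometheus-k8s")
harness.add_relation_unit(rel_id, "prometheus-k8s/0")

# The KeyError means "scrape_jobs" was never written to the app databag,
# likely because the upgraded lib populates relation data differently.
data = harness.get_relation_data(rel_id, harness.model.app.name)
assert json.loads(data["scrape_jobs"])[0]["static_configs"][0]["targets"] == ["*:8080"]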
The fix for this issue needs to be merged into two branches:
track/1.15 (release KF v1.7)
main
After PR #94, every Seldon Deployment causes a severe memory leak and the Kubeflow deployment becomes unresponsive.
This was discovered when the integration test for charm removal was introduced: the remove operation never succeeded, and the test failed with timeout errors.
Running integration tests with the removal test produced these errors:
unit-seldon-controller-manager-0: 20:37:58 ERROR unit.seldon-controller-manager/0.juju-log Uncaught exception while in charm code
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 2615, in _run
result = run(args, **kwargs)
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '('/var/lib/juju/tools/unit-seldon-controller-manager-0/config-get', '--format=json')' returned non-zero exit status 1.
Traceback (most recent call last):
File "./src/charm.py", line 549, in <module>
main(SeldonCoreOperator)
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/main.py", line 424, in main
charm = charm_class(framework)
File "./src/charm.py", line 65, in __init__
self._metrics_port = self.model.config["metrics-port"]
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 695, in __getitem__
return self._data[key]
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 679, in _data
data = self._lazy_data = self._load()
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 1519, in _load
return self._backend.config_get()
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 2727, in config_get
out = self._run('config-get', return_output=True, use_json=True)
File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/model.py", line 2617, in _run
raise ModelError(e.stderr)
ops.model.ModelError: ERROR permission denied
unit-seldon-controller-manager-0: 20:37:58 ERROR juju.worker.uniter.operation hook "stop" (via hook dispatching script: dispatch) failed: exit status 1
unit-seldon-controller-manager-0: 20:37:58 INFO juju.worker.uniter unit "seldon-controller-manager/0" shutting down: unit "seldon-controller-manager/0" not found (not found)
unit-prometheus-k8s-0: 20:40:01 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
controller-0: 20:40:56 INFO juju.worker.caasapplicationprovisioner.runner stopped "seldon-controller-manager", err: attempt count exceeded: try again
controller-0: 20:40:56 ERROR juju.worker.caasapplicationprovisioner.runner exited "seldon-controller-manager": attempt count exceeded: try again
controller-0: 20:40:56 INFO juju.worker.caasapplicationprovisioner.runner restarting "seldon-controller-manager" in 3s
controller-0: 20:40:59 INFO juju.worker.caasapplicationprovisioner.runner start "seldon-controller-manager"
ERROR lost connection to pod
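One defensive pattern that would keep the stop hook from crashing on the config-get failure above, sketched here only as an illustration (not the confirmed fix; the helper and fallback value are hypothetical):

# Minimal sketch, assuming the ops framework used by this charm.
# During unit teardown `config-get` can fail with "permission denied";
# a defensive read keeps the stop hook from raising.
from ops.model import ModelError

DEFAULT_METRICS_PORT = "8080"  # hypothetical fallback value

def _get_config(charm, key, default):
    """Read a config value, falling back when the backend is unavailable."""
    try:
        return charm.model.config[key]
    except ModelError:
        return default

# usage in __init__: self._metrics_port = _get_config(self, "metrics-port", DEFAULT_METRICS_PORT)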
The charm is using version 1.14.0 instead of 1.15 for no apparent reason. We've probably missed this when upgrading from 1.14 to 1.15.
The remove integration test in its current form fails.
Re-design/debug the remove integration test to ensure it passes.
Seldon allows for several ways to pass default configuration values. We use a few of them simultaneously, which results in unexpected behaviour.
On deploy, our charm creates:
- configmap {CHARM_NAME}-operator-resources-config, which is mounted into the seldon-core manager pod as /tmp/operator-resources
- configmap seldon-config (how the manager consumes this cm is not fully verified, but can be confirmed)
Seldon preferentially uses:
- configmap seldon-config (from the same namespace as the seldon core manager is deployed in)
- /tmp/operator-resources
/tmp/operator-resources is ignored (or at least ignored whenever there's a value in both; not sure if it merges these sources or just takes the first it sees). This means changing values in {CHARM_NAME}-operator-resources-config will have no effect.
We should remove the redundancy (or at minimum clearly document it in the code and readme). There are pros and cons to both (a sketch illustrating the difference follows the list):
- seldon-config: changes to this file live-update the defaults without having to restart the manager pod, which is good; but since the name is hard-coded, deploying two seldon charms to the same model will cause conflicts (I don't think there's a good use case for multiple seldon deployments in the same model, though)
- /tmp/operator-resources: is uniquely named and used only by the seldon it is deployed for, so it won't conflict if multiple seldons are deployed; but changes to the source configmap do not automatically propagate to the pod (the pod must be restarted after the configmap is changed in order to load the new data)
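A small sketch illustrating the operational difference (assumptions: lightkube, model/namespace test, and illustrative configmap names and data keys):

# Sketch, assuming lightkube; namespace, configmap names, and data keys are
# illustrative. Shows why only seldon-config edits take effect immediately.
from lightkube import Client
from lightkube.resources.core_v1 import ConfigMap

client = Client()

# seldon-config is read preferentially: a patch here takes effect live.
client.patch(
    ConfigMap, "seldon-config",
    {"data": {"predictor_servers": '{"SKLEARN_SERVER": {}}'}},
    namespace="test",
)

# The charm-created configmap only feeds the mounted /tmp/operator-resources
# files: a patch here is inert until the manager pod is restarted.
client.patch(
    ConfigMap, "seldon-core-operator-resources-config",
    {"data": {"configmap.yaml": "# updated resources"}},
    namespace="test",
)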
seldon-controller-manager failed to upgrade 1.6 to 1.7
Failed to reach active/idle:
seldon-controller-manager blocked: K8S resources creation failed
Merge into:
track/1.15 (release KF v1.7)
main
Deploy charm from stable:
juju deploy seldon-core seldon-controller-manager --channel=1.14/stable --trust
Verify the deployment:
juju status
Model Controller Cloud/Region Version SLA Timestamp
test-upgrade uk8s microk8s/localhost 2.9.34 unsupported 11:25:23-04:00
App Version Status Scale Charm Channel Rev Address Exposed Message
seldon-controller-manager res:oci-image@eb811b6 active 1 seldon-core 1.14/stable 92 10.152.183.81 no
Unit Workload Agent Address Ports Message
seldon-controller-manager/0* active idle 10.1.59.111 8080/TCP,4443/TCP
Build the local charm and execute the refresh command:
juju refresh seldon-controller-manager --path=./seldon-core_ubuntu-20.04-amd64.charm --resource="oci-image=docker.io/seldonio/seldon-core-operator:1.15.0"
Verify that the upgrade was successful:
juju status
Model Controller Cloud/Region Version SLA Timestamp
test-upgrade uk8s microk8s/localhost 2.9.34 unsupported 11:27:41-04:00
App Version Status Scale Charm Channel Rev Address Exposed Message
seldon-controller-manager .../c5e3s519ko1quc9tqnysy92... active 1 seldon-core stable 0 10.152.183.252 no
Unit Workload Agent Address Ports Message
seldon-controller-manager/0* active idle 10.1.59.112
Handle the ConfigMap created by the workload container during application remove.
There is a ConfigMap created by the workload container seldon-core to track its leadership:
kubectl -n <namespace> get configmap a33bd623.machinelearning.seldon.io -o=yaml
apiVersion: v1
kind: ConfigMap
metadata:
annotations:
control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"seldon-controller-manager-5ff5b59788-r9c5b_d5372a80-cf0b-49ce-88d6-4c33a1096f28","leaseDurationSeconds":15,"acquireTime":"2023-03-23T18:52:51Z","renewTime":"2023-03-23T18:54:15Z","leaderTransitions":1}'
creationTimestamp: "2023-03-23T18:49:51Z"
labels:
app.juju.is/created-by: seldon-controller-manager
name: a33bd623.machinelearning.seldon.io
namespace: kf
resourceVersion: "4439"
uid: 7dfa723e-5286-435b-a633-84a25f340506
This ConfigMap has an expiration time of 45 seconds.
The initial problem was detected when testing upgrade: deploying the stable charm and then upgrading to the updated one within that 45-second window failed, due to the container's inability to acquire a lock on the ConfigMap above:
error retrieving resource lock kf/a33bd623.machinelearning.seldon.io
If the upgrade is performed outside of the expiration window, it succeeds.
On application removal this ConfigMap (a33bd623.machinelearning.seldon.io) is not removed. It should be removed; a removal sketch follows.
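A removal sketch, assuming lightkube and a remove event handler in the charm (the ConfigMap name is the one observed above; the hash prefix may differ between Seldon versions):

# Sketch of cleaning up the workload's leader-election ConfigMap on remove.
from lightkube import Client
from lightkube.core.exceptions import ApiError
from lightkube.resources.core_v1 import ConfigMap

LEADER_ELECTION_CM = "a33bd623.machinelearning.seldon.io"

def _on_remove(self, _event):
    """Delete the leader-election ConfigMap left behind by the workload."""
    try:
        # The charm's model name doubles as the workload namespace.
        Client().delete(ConfigMap, LEADER_ELECTION_CM, namespace=self.model.name)
    except ApiError:
        pass  # already gone; nothing to clean up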
Integration of ROCK images.
https://github.com/canonical/seldonio-rocks
The following ROCK images need to be integrated into Seldon Core Operator:
Charm managed:
Workload managed:
Design of how ROCKs are built, tested and integrated is captured in related specification (KF-044).
Main design points for Seldon ROCKs:
Existing integration tests are to be re-used to test functionality of new ROCK images.
Follow the Seldon documentation for testing:
https://docs.seldon.io/projects/seldon-core/en/latest/reference/apis/index.html
https://docs.seldon.io/projects/seldon-core/en/latest/nav/config/servers.html
The current seldon charm is written in the podspec charming pattern. We need to migrate it to the new sidecar pattern.
Jira
As presently implemented, the combination of kubeflow-roles and Seldon planned for the Kubeflow 1.7 release does not grant users access to SeldonDeployments in their namespaces. This has the effect of making Seldon unusable for Kubeflow users.
In pod spec versions of the seldon charm, Kubeflow users were granted permission to create/edit/* SeldonDeployments in their namespace via this aggregated ClusterRole (for a description of how kubeflow aggregates roles to users, see this readme). This ClusterRole was implemented in the kubeflow-roles charm because pod spec did not allow us to create arbitrary ClusterRoles.
Now that the charm has been migrated to sidecar, this ClusterRole could be deployed by this charm. canonical/kubeflow-roles-operator#38 removed the ClusterRole from the central role deployment (possibly because it was thought that this role was now implemented in the seldon charm directly?). The result is that users are not granted the required permissions for SeldonDeployments.
To fix this, we should either (see the sketch below):
- restore the ClusterRole in kubeflow-roles, or
- deploy the ClusterRole here in Seldon
Regarding adding tests, it is unclear where the best place for them would be. Because this was a transfer of responsibility from one charm to another, I think it could only be caught at the bundle level? Although it feels unsatisfying that we can delete a file in kubeflow-roles and not have a check at the repo level to say that was important.
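For reference, a sketch of the aggregated role in either location, written with lightkube as this repo's tests use; the role name is an illustrative assumption and the aggregation label follows the Kubeflow convention described in the readme linked above:

# Sketch of the aggregated ClusterRole granting users access to
# SeldonDeployments; name and label are assumptions, not confirmed values.
from lightkube import Client
from lightkube.models.meta_v1 import ObjectMeta
from lightkube.models.rbac_v1 import PolicyRule
from lightkube.resources.rbac_authorization_v1 import ClusterRole

role = ClusterRole(
    metadata=ObjectMeta(
        name="seldon-kubeflow-edit",
        labels={"rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit": "true"},
    ),
    rules=[
        PolicyRule(
            apiGroups=["machinelearning.seldon.io"],
            resources=["seldondeployments"],
            verbs=["*"],
        )
    ],
)
Client().create(role)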
Encountered during a PR that just updated a random text file in the repo: #183.
The tests were failing for Seldon with the following message:
OSError: [Errno 9] Bad file descriptor
ERROR juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
Error: The operation was canceled.
https://github.com/canonical/seldon-core-operator/actions/runs/5963901210/job/16178014925?pr=183
WARNING juju.client.connection:connection.py:611 RPC: Connection closed, reconnecting
WARNING juju.client.connection:connection.py:611 RPC: Connection closed, reconnecting
ERROR asyncio:base_events.py:1707 Task exception was never retrieved
future: <Task finished name='Task-2000' coro=<Connection.reconnect() done, defined at /home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py:736> exception=OSError(9, 'Bad file descriptor')>
Traceback (most recent call last):
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 745, in reconnect
res = await connector(
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 823, in _connect_with_login
await self._connect(endpoints)
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 773, in _connect
result = await task
File "/usr/lib/python3.8/asyncio/tasks.py", line 619, in _wait_for_one
return f.result() # May raise f.exception().
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 762, in _try_endpoint
return await self._open(endpoint, cacert)
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line [402](https://github.com/canonical/seldon-core-operator/actions/runs/5963901210/job/16178014925?pr=183#step:5:403), in _open
return (await websockets.connect(
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/websockets/py35/client.py", line 12, in __await_impl__
transport, protocol = await self._creating_connection
File "/usr/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
transport, protocol = await self._create_connection_transport(
File "/usr/lib/python3.8/asyncio/base_events.py", line 1066, in _create_connection_transport
sock.setblocking(False)
OSError: [Errno 9] Bad file descriptor
ERROR juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
ERROR asyncio:base_events.py:1707 Task exception was never retrieved
future: <Task finished name='Task-2001' coro=<Connection.reconnect() done, defined at /home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py:736> exception=OSError(9, 'Bad file descriptor')>
Traceback (most recent call last):
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 745, in reconnect
res = await connector(
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 823, in _connect_with_login
await self._connect(endpoints)
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 773, in _connect
result = await task
File "/usr/lib/python3.8/asyncio/tasks.py", line 619, in _wait_for_one
return f.result() # May raise f.exception().
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 762, in _try_endpoint
return await self._open(endpoint, cacert)
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 402, in _open
return (await websockets.connect(
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/websockets/py35/client.py", line 12, in __await_impl__
transport, protocol = await self._creating_connection
File "/usr/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
transport, protocol = await self._create_connection_transport(
File "/usr/lib/python3.8/asyncio/base_events.py", line 1066, in _create_connection_transport
sock.setblocking(False)
OSError: [Errno 9] Bad file descriptor
ERROR juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
Error: The operation was canceled.
Integration tests for Tensorflow Seldon servers are required to validate that Seldon Deployments will be successful.
Integration tests for these servers should be part of the Seldon integration test suite and run on every PR.
These tests will be reused in ROCKs testing.
The following Tensorflow prepackaged Seldon servers should be covered in tests:
"TENSORFLOW_SERVER": {
"protocols" : {
"tensorflow": {
"image": "tensorflow/serving",
"defaultImageVersion": "2.1.0"
},
"seldon": {
"image": "seldonio/tfserving-proxy",
"defaultImageVersion": "1.15.0"
}
}
},
Use parametrised tests to execute the same test for different Seldon servers.
To test, use the specified test deployments from:
For the seldon protocol, the server container image in the test response data is updated based on the deployed seldon-config ConfigMap. This ensures that the test response data contains the proper image and version.
For one of the Tensorflow tests, the request and response data are relatively large. That data is stored in files, and the common parametrised test is modified to be able to read data from a file (JSON) or use it as is (JSON object).
E.g.:
# prepare request data:
# - if it is a string, load it from the file specified by that string
# - otherwise use it as a JSON object
import json

if isinstance(request_data, str):
    # request test data is the name of a file containing JSON data
    with open(f"tests/data/{request_data}") as f:
        request_data = json.load(f)
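Putting this together with the parametrisation described above, the test entry point might look like this sketch (server/protocol pairs and file names are illustrative; the real table lives in tests/integration/test_seldon_servers.py):

# Sketch of the parametrised entry point; data given as a str names a file
# under tests/data/, data given as a dict is used as-is (see snippet above).
import pytest

@pytest.mark.parametrize(
    "server, protocol, request_data, response_data",
    [
        ("TENSORFLOW_SERVER", "tensorflow", "tf-request.json", "tf-response.json"),
        ("TENSORFLOW_SERVER", "seldon", {"data": {"ndarray": [[1.0, 2.0]]}}, "tf-seldon-response.json"),
    ],
)
def test_seldon_predictor_server(server, protocol, request_data, response_data):
    # deploy the SeldonDeployment for server/protocol, send request_data,
    # and compare against response_data (loading either side from file
    # when given as a str)
    ...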
For the mlflowserver test, an ephemeral-storage=2G requirement/limit needs to be added to ensure it runs in the GH runner.
Integration testing should pass when implemented.
Please add to the documentation that Istio is required when deploying. The istio relation is not necessary for seldon-core to be deployed, but when no Istio is deployed the error is strange and hard to troubleshoot - SeldonIO/seldon-core#3646
Due to juju/python-libjuju#913, a temporary fix was introduced in requirements-integration.in:
# FIXME: This is a workaround for https://github.com/juju/python-libjuju/issues/913
# please pin to a released python-libjuju ASAP.
juju @ git+https://github.com/DnPlas/python-libjuju@dnplas-pyyaml-6
When the issue is fixed, this needs to be reviewed/removed.