psfried / roperator Goto Github PK
View Code? Open in Web Editor NEWExperimental Kubernetes Operator kit written in Rust
License: Apache License 2.0
Experimental Kubernetes Operator kit written in Rust
License: Apache License 2.0
Operators may perform additional validation beyond what's possible with the structural validation in the CRD. Additionally, they may call out to external services. Both of those things significantly increase the chances of there being some sort of error when syncing a parent.
We want to make it as easy as possible for operators to provide visibility to errors. Operators are typically used by people who don't (and shouldn't have to) understand all of the details of how everything works behind the scenes. It's a good practice to set a special status
on the parent when an error occurs, because it allows users to easily see when an error occurs. Roperator currently supports this, but it does nothing to help make it easier.
Another good way to provide visibility to errors is emitting Event
s. Unlike parent statuses, it's easy to have a generic process that watches only for events that indicate an error has occurred. Events also form a historical record of errors, which can be especially useful when there's transient issues. The parent status typically only shows the current status, but it might not show a transient error that occurred in the past.
The idea is to add support for events and additionally to also allow operators to opt in to some reasonable automatic defaults for handling errors. This can be done by adding an additional function to the Handler
trait that looks something like the following:
fn on_error(&self, description: &ErrorDescription) -> ErrorResponse;
The ErrorDescription
and ErrorResponse
types would be something like:
pub struct ErrorDescription<'a> {
/// True if the error was returned from a `finalize` function, false if it was from `sync`
pub is_finalize_error: bool,
/// The sync request that was being handled when the error occurred
pub parent: &'a roperator::handler::SyncRequest,
pub error: roperator::error::Error, // a Box<dyn RoperatorError>, which supports downcasting
/// The number of _consecutive_ errors
pub error_count: u64,
}
pub struct ErrorResponse {
/// how long to delay before retrying this particular parent
pub retry_in: std::time::Duration,
/// If event is Some, then roperator will emit an event that includes these details
pub event: Option<EventDetails>,
/// see below for an explanation of how this fields would be used
pub status_error: Option<(String, serde_json::Value)>,
}
pub struct EventDetails {
/// A short, machine-readable string describing the reason for the error event
pub reason: String,
/// A human-readable description of what happened
pub message: String,
}
The on_error
function must not fail, or else we could easily end up in "turtles all the way down" scenario. There's just no reasonable way to try to handle errors in the error handler. The function signature does not allow for returning a Result
in order to make this clear, and to discourage authors from making fallible function calls, e.g. HTTP calls to an external service.
Errors may be cause either by the Handler
itself returning a Result::Err
variant, or by an error that roperator encounters when communicating with the api server. Depending on the cause of an error, it may or may not be possible to emit events or set a parent status. For example, if there error is caused by a network error that prevents communication with the api server, or an RBAC misconfiguration that prevents roperator from creating events. We'll need to make it clear that this error handling is done on a best-effort basis. We should also support testing this error handling behavior in the TestKit
.
Given that errors may originate from outside the Handler
itself (or even after the parent status has already been updated), it makes the semantics of the status_error
field potentially confusing. The following sequence is one example of a potentially tricky situation:
spec
of the parent, which triggers a sync.The reason this is potentially ambiguous is that the status of the parent should continue to reflect the actual state of the child resources, and those resources may still be running just fine. But we want the status to also reflect the fact that we were not able to sync the latest changes to the spec
.
These type of scenarios imply that we should treat the status returned from the error handling function as additive. In other words, it should be added to the existing status instead of replacing it entirely. Thus, the status_error
should not represent the entire status, but rather only the portion that describes the error. If status_error
is Some, then it will contain a tuple that represents both the key and the value to add to the status. For example, returning status_error: Some(("error".to_string(), Value::String("my error message".to_string()) )),
would cause the json "error": "my error message"
to be added to the existing status instead of replacing it.
We want to make it as easy as possible to write operators that follow best practices, but in this case the events behavior should probably be opted into explicitly by the operator authors, since emitting events requires special permissions that may not be obvious.
For the status, though, it may make sense to add a basic error message automatically. We already add the observedGeneration
field to the status, so there's some precedent for that. And the risk or running into authorization errors is minimal, since we already update parent statuses as a matter of course.
Currently, roperator will spin up an HTTP server to serve prometheus metrics and to expose a health check endpoint. This is all done over plain HTTP only, and only the port is configurable, unless you want to disable the endpoints entirely.
We want to add optional TLS support for these endpoints, which will require some additional configuration options. In particular, we'll need
TLS may seem unnecessary for just metrics and a health check, but the ultimate goal is to make it easy for an operator to also support acting as a Validating Admission Webhook, though we'll have a separate issue for that particular feature.
I am working on a CRD operator that is running afoul of the namespace constraint. The stated aim of the constraint is to prevent the complexity of having namespaced parents with non-namespaced children. That's not what I'm doing; my CRD (not namespaced) creates a namespace and then a resource within that namespace. The second child resource runs afoul of the constraint.
This error also contradicts the docs, which say:
If you need resources that span multiple namespaces, then you should use a non-namespaced (a.k.a. Cluster Scoped) parent CRD.
The API for accessing child resources has a few issues that should be ironed out before the next release:
&K8sType
. This will be more consistent with other apis, and also more convenient and readable. For example, RequestChildren::of_type(api_version: &str, kind: &str)
should be changed to of_type(k8s_type: &K8sType)
request.children().of_type_raw(my_type).first()
. Currently, this fails to compile because of temporary references that get dropped.It's fairly common for operators to perform some additional validation of their "parent" resources on top of what can be done in the schema attached to the CRD. Given that they're already doing this validation anyway, it can be desirable to validate the parent resources before allowing them to be modified. Kubernetes has supported this for a while now, in the form of Validating Admission Webhooks. The goal is to add optional configuration that would allow roperator to setup a new http route for validating admission webhooks.
If enabled, users would be expected to pass some sort of validation function that would accept a &ValidationRequest
struct (which would include the fields described here) and would return a ValidationResponse
(which would include these fields).
This feature would not change the current behavior of the operator at all, and would be totally optional. We also could not guarantee that the webhook would be called, or even respected, so we still wouldn't want to make any assumptions about the parent resources in a SyncRequest
.
The CAData::File
variant should really represent paths as a PathBuf
instead of a String
, since paths might contain characters that are not valid utf-8. We should clean that up the next time there's a breaking api change.
I recently came across an rust operator example at https://github.com/Pscheidl/rust-kubernetes-operator-example which uses kube-rs
Does roperator implement similar approach. Can it make use of kube-rs?
Writing correct RBAC resources is tedious and error prone, and there's basically no reason to do it by hand. What I've been thinking of is adding an API for generating RBAC resources from an OperatorConfig
(and possibly some other inputs). This could be used in a few different ways. One option would be something like the following quick sketch:
fn main() {
let operator_config = create_operator_config();
if std::env::args().any(|arg| arg == "--print-rbac-role") {
let rbac_roles = generate_rbac_cluster_role(&operator_config);
println!("{}", rbac_roles);
} else {
// do the usual to run the operator
}
}
I think the bare minimum functionality that would be useful would be to only generate a ClusterRole as a yaml or json string from the OperatorConfig. A more complete feature set might allow the option to generate either a Role or ClusterRole (or maybe base that on the presence of a namespace in the config), allow for choice of format, or have additional options to generate Users and (Cluster)RoleBindings based on some additional input.
The current behavior uses fixed delays when an error is encountered while syncing a parent. We should implement a backoff strategy so that re-tries can happen more quickly at first, and then move to longer delays.
Motivated by this comment and moved from that issue.
One thought did cross my mind about this. When you want the operator to create a namespace and then resources within that namespace, you need to of course create the namespace before the resources within it. Your handler would have to first return a syncResponse with only then namespace resource, and then create the rest of the child resources later, after the namespace shows up in the syncRequest children. This type of thing would probably make an excellent example. I'll try to create a basic example that shows how to do that, since it's likely that others would benefit from it as well.
So, that explains why it was failing on the first request and then succeeding on the second. It makes sense, but I had been thinking of it as equivalent to applying a YAML manifest, which obviously is higher level than this.
I have implemented the logic you suggest, and it seems to work pretty well:
/// Encapsulates the creation of the child objects (e.g., `Namespace` and `Rolebinding`) and
/// returns a list of opaque JSON objects for syncing. Note that it returns only the namespace if a
/// namespace is not included in the `SyncRequest`, and then creates a `RoleBinding` in a
/// subsequent request.
pub fn create_child_objects(request: &SyncRequest) -> anyhow::Result<Vec<Value>> {
let parent: DevSandbox = request.deserialize_parent()?;
let mut children: Vec<Value> = Vec::new();
let names = Names::new(&parent)?;
let ns = create_namespace(&names).map(serde_json::to_value)??;
children.push(ns);
if request.has_child("v1", "Namespace", "", &names.namespace) {
info!("Namespace exists; creating rolebinding");
let rb = create_rolebinding(&names).map(serde_json::to_value)??;
children.push(rb);
} else {
info!("Namespace does not exist; creating it");
}
Ok(children)
}
It strikes me as a little odd that there's no way to check whether something exists using one of the JSON objects. What if you could use a method like request.has_child_value(ns)
to check for a child with a given serializable JSON object?
I have a minikube server in my ~/.kube/config
file, specified with paths:
- cluster:
certificate-authority: /Users/myuser/.minikube/ca.crt
server: https://192.168.99.100:8443
name: minikube
Upon trying to bootstrap a ClientConfig
with this cluster definition in place, it fails with an error:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Format(Message("missing field `certificate-authority-data`", Some(Pos { marker: Marker { index: 4755, line: 16, col: 25 }, path: "clusters[3].cluster" })))', src/libcore/result.rs:1165:5
last commit was 3 years ago ...
CronJob in k8s_types.rs
is defined as under the v1
batch api, but it does not exist in this version. It exists under v1beta1
and is not in v1
.
This has the effect that if you specify .with_child(k8s_types::batch::v1::CronJob, ...)
on your operator config then you get no events at all being processed, the operator appears to hang and do nothing. (which may perhaps be a separate bug?)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.