argoproj-labs / argo-rollouts-manager
Kubernetes Operator for Argo Rollouts controller.
Home Page: https://argo-rollouts-manager.readthedocs.io/en/latest/
License: Apache License 2.0
At present, the operator only supports namespace-scoped installs via the --namespaced parameter, which is currently hardcoded to be enabled.
- The controller should default to a cluster-scoped install.
- Add an optional namespaceScoped bool field to the RolloutsManager CR.
- The cluster-scoped install should ensure a ClusterRole/ClusterRoleBinding exist that allow the rollouts manager to watch the whole cluster.
- Likewise, the cluster-scoped install should NOT run with the --namespaced parameter.
- In the case where there exist multiple RolloutsManager CRs on the cluster, we should detect and report an invalid configuration; the .status field should indicate why they are not being reconciled.
- Unit/E2E tests (using Ginkgo) to verify each of the above.
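A CR using the proposed field might look like this (a sketch; the namespaceScoped field is the hypothetical addition from this proposal, and the default of cluster-scoped follows the behaviour described above):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: RolloutsManager
metadata:
  name: cluster-scoped-rollouts
spec:
  # Proposed field: omit (or set to false) for the default cluster-scoped
  # install; set to true to run the controller with --namespaced.
  namespaceScoped: false
```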
I've contributed initial E2E test fixtures, and E2E tests, as part of #19.
However, all I did was port the existing kuttl tests to Ginkgo.
As part of this current PR, we should add additional E2E tests that more thoroughly test the functionality that is already implemented in this repository.
NOTE: As of this writing, we only support Namespace-scoped Argo Rollouts installs. Thus the tests only need to verify that rollouts can be installed into a single Namespace.
The RolloutManager CR also supports env, extraCommandArgs, a custom image, and a version. We should make sure these are set as expected.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: RolloutManager
metadata:
  name: argo-rollout
  labels:
    example: with-properties
spec:
  env:
  - name: "foo"
    value: "bar"
  extraCommandArgs:
  - --foo
  - bar
  image: "quay.io/random/my-rollout-image"
  version: "sha256:...."
```
The env, extraCommandArgs, custom image, and version can all be modified. We should ensure that when they are modified on an existing RolloutManager, the change is made correctly to the resources.
Likewise, verify that when a RolloutsManager is deleted that all the expected resources are deleted.
Issue Criteria:
All argo-rollouts-manager PRs should have the upstream Argo Rollouts E2E tests run against them.
Not all of the Argo Rollouts E2E tests will work when tested in this configuration, so we can maintain a 'deny list' of tests to ignore (tests that are expected to fail).
The version of Argo Rollouts that we run the E2E tests from should correspond to the same upstream Argo Rollouts version that argo-rollouts-manager is installing.
Within upstream Argo Rollouts, one can enable traffic management plugins and metric plugins via the standard Argo Rollouts configuration ConfigMap.
However, at present, within the Argo Rollouts operator there is no mechanism to specify these via the RolloutManager CR.
As of this writing (once PR #40 has merged), the current operator behaviour is:
However, requiring users to modify the ConfigMap to add plugins is not as desirable as our supporting it 'natively', within the RolloutManager CR.
Native support within the RolloutManager CustomResource might look something like this:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: RolloutManager
metadata:
  name: my-rollout-install
  namespace: rollouts
spec:
  # (...)
  plugins:
    trafficManagement:
    - name: "argoproj-labs/gatewayAPI"
      location: "https://github.com/argoproj-labs/rollouts-plugin-trafficrouter-gatewayapi/releases/download/<version>/gateway-api-plugin-<arch>" # file url also supported
    metric:
    - name: "argoproj-labs/sample-prometheus"
      location: "https://github.com/argoproj-labs/rollouts-plugin-metric-sample-prometheus/releases/download/v0.0.4/metric-plugin-linux-amd64" # file url also supported
      sha256: "dac10cbf57633c9832a17f8c27d2ca34aa97dd3d" # optional sha256 checksum of the plugin executable
```
This will cause the specified values to be inserted into the ConfigMap that is generated by the RolloutManager controller.
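For reference, assuming the operator writes these values into the standard argo-rollouts-config ConfigMap using the upstream trafficRouterPlugins/metricProviderPlugins keys (a sketch, not the exact output of the controller), the generated ConfigMap might look like:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-config
  namespace: rollouts
data:
  trafficRouterPlugins: |
    - name: "argoproj-labs/gatewayAPI"
      location: "https://github.com/argoproj-labs/rollouts-plugin-trafficrouter-gatewayapi/releases/download/<version>/gateway-api-plugin-<arch>"
  metricProviderPlugins: |
    - name: "argoproj-labs/sample-prometheus"
      location: "https://github.com/argoproj-labs/rollouts-plugin-metric-sample-prometheus/releases/download/v0.0.4/metric-plugin-linux-amd64"
      sha256: "dac10cbf57633c9832a17f8c27d2ca34aa97dd3d"
```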
Open Questions to be discussed and/or investigated by implementor:
Issue Criteria:
Addition/removal/modification of the fields
For testing traffic management plugin, we can use the Gateway plugin as an example. An E2E test might do this:
For testing metric plugin, there don't seem to be many plugins that exist. I've found these:
Depends on fix to #22
At present, if the value of the OpenShiftRolloutPlugin URL changes, for example because we are shipping a new version of the plugin, it appears the Argo Rollouts ConfigMap will not be updated.
reconcileConfigMap contains this block of code:
```go
for _, plugin := range actualTrafficRouterPlugins {
	if plugin.Name == OpenShiftRolloutPluginName {
		// Openshift Route Plugin already present, nothing to do
		return nil
	}
}
```
This block of code will exit the function if the plugin already exists, regardless of whether the URL matches the expected value.
Fix Criteria:
The operator currently does not create the argo-rollouts-config ConfigMap at the time of installing the controller and creating the required resources. The argo-rollouts-config ConfigMap is included in the Argo Rollouts install.yaml manifest.
We should also make sure to include a controller to reconcile this ConfigMap going forwards.
At present, it appears this operator would only support cluster-scoped installations of the rollouts-controller. There is a case for having the ability to restrict a rollouts-controller to only target Rollouts in a specific namespace (or namespaces) when operating within multi-tenant clusters. This would be especially useful if the rollouts-controller CR supports features like automated AnalysisRuns, which can be enabled/disabled per controller based on the requirements of the tenants of a given namespace.
Argo rollouts does have a namespace-install manifest : https://github.com/argoproj/argo-rollouts/blob/master/manifests/namespace-install.yaml
Suggested approach:
The RolloutManager controller needs to create a ServiceMonitor so that metrics from Rollouts will automatically be included in Prometheus-gathered metrics, including in OpenShift monitoring (for OpenShift users).
Why is this important?
Scenarios
(Originally a feature request from https://issues.redhat.com/browse/GITOPS-3271)
When the RolloutsManager CR is reconciled, it should create the corresponding ServiceMonitor CR.
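Such a ServiceMonitor might look like this (a sketch; the metrics service name, label selector, and port name are assumptions about the operator's generated resources):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argo-rollouts
  namespace: rollouts
spec:
  selector:
    matchLabels:
      # Assumed label on the Rollouts metrics Service created by the operator.
      app.kubernetes.io/name: argo-rollouts-metrics
  endpoints:
  - port: metrics
```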
A dependency on upstream Argo Rollouts was added to this repository, as part of https://github.com/argoproj-labs/argo-rollouts-manager/pull/15/files#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6.
However, I think this dependency will likely cause us problems in downstream consumers of this repository: depending on Argo Rollouts within go.mod may require us to pull in some of the Argo Rollouts dependencies, which may conflict with other dependencies in our sister projects which vendor in Argo CD: argocd-operator and gitops-operator.
I'm mostly concerned about differing kubernetes versions, go-client versions, and controller-runtime versions between the projects.
Fortunately, we depend on Rollouts very little, and it should be straightforward to remove the reference from our code.
While we're at it, let's also upgrade to controller-runtime v0.16.3 (which is the version used by our sister controllers).
Criteria:
Whenever a new Argo Rollouts version is released at argoproj/argo-rollouts, a script should run automatically that opens a PR on the argoproj-labs/argo-rollouts-manager repo to update to that new version.
Once the PR is opened, the argo-rollouts-manager automated unit/E2E tests will verify whether there are any issues with installation of the release by the operator.
This way, we don't miss security/features updates on rollouts releases, and the amount of toil required to adopt a new version is reduced.
Example PRs generated by the tool:
Hi,
It would be nice to be able to provide labels and annotations to the rollout controller generated from the RolloutManager.
At present, we are running the upstream Argo Rollouts E2E automated tests against argo-rollouts-manager PRs. With each PR, we:
When the Argo Rollouts tests run against our operator, most tests pass! But some fail: for example, here is a list of failures from a recent E2E test run:
--- FAIL: TestAPISIXSuite/TestAPISIXCanarySetHeaderStep (0.48s)
--- FAIL: TestAPISIXSuite/TestAPISIXCanarySetHeaderStep (0.68s)
--- FAIL: TestAPISIXSuite/TestAPISIXCanarySetHeaderStep (0.69s)
--- FAIL: TestFunctionalSuite/TestBlueGreenPromoteFull (2.74s)
--- FAIL: TestFunctionalSuite/TestBlueGreenPromoteFull (2.85s)
--- FAIL: TestFunctionalSuite/TestBlueGreenPromoteFull (2.89s)
--- FAIL: TestFunctionalSuite/TestBlueGreenPromoteFull (2.90s)
--- FAIL: TestFunctionalSuite/TestBlueGreenPromoteFull (2.92s)
--- FAIL: TestFunctionalSuite/TestBlueGreenPromoteFull (3.25s)
--- FAIL: TestFunctionalSuite/TestControllerMetrics (0.13s)
--- FAIL: TestFunctionalSuite/TestControllerMetrics (0.17s)
--- FAIL: TestFunctionalSuite/TestControllerMetrics (0.18s)
(source)
TestControllerMetrics we expect to fail: the test expects that the Argo Rollouts controller is running locally (via make start-e2e), whereas in this case it's running in a Pod on the cluster.
However, it's not clear why TestBlueGreenPromoteFull is failing: I've glanced over the test and everything it's doing seems like it should work (and often does work, on first run).
So, this issue is to investigate why it's failing. This is also a good opportunity to dig in to Rollouts code, both the controller code and the test code.
In hack/run-upstream-argo-rollouts-e2e-tests.sh, change make test-e2e to E2E_TEST_OPTIONS="-run 'TestFunctionalSuite' -testify.m 'TestBlueGreenPromoteFull'" "until-fail.sh" make test-e2e, which will run the TestBlueGreenPromoteFull test over and over, until it fails.
until-fail.sh script:
hack/run-upstream-argo-rollouts-e2e-tests.sh
Strangely, what I have seen is that this TestBlueGreenPromoteFull will initially pass a few times, but after a few runs it will switch to always failing, 100% of the time.
When you attempt to start the Rollouts controller on v1.6.4, it exits with this message:
Failed to init config: validation of config due to (plugin repository (openshift-route-plugin) must be in the format of <namespace>/<name>
This is due to the logic that was added as part of #15.
At present, the resource reconciliation logic is fairly straightforward in terms of its error handling: if we encounter any error whatsoever, we return it immediately. However, this also has the effect of breaking the reconciliation cycle and blocking reconciliation of other resources that come later in the order. Not all resources created are of equal importance and/or worth blocking the reconciliation cycle for. We should consider which resources are critical for the functioning of the controller, and reserve returning errors only for situations where those resources produce some kind of failure.
For instance, errors reconciling the metrics service should not block reconciliation of other resources.
There could be further logic developed to fine-tune how specific errors for specific resources should be handled (whether returned or ignored).
The new operator creates a custom resource with kind "argorollout", which is the representation of the rollouts-controller created on the cluster in response to this resource. However, the resource watched by this controller is also called a "rollout". As a result, users of this operator see two resources: an "argorollout", which is actually for the rollouts-controller, and a "rollout", which is the actual rollout itself. This can be very confusing for users.
Suggested solution:
I would also suggest extending this renaming to parts of the operator code, such as function names, their associated comments, log messages etc. This would make it more clear for people working on this project and make it harder to get confused about whether this operator is acting on the rollouts-controller resources or resources for the rollouts themselves
For example:
- reconcileRolloutsControllerService vs reconcileRolloutsService (and associated comments and log messages)
- reconcileRolloutsControllerRole vs reconcileRolloutsRole
We have logic in deployment.go which attempts to compare the expected (target) contents of the Argo Rollouts Deployment resource with the live on-cluster contents of that resource, and only if they differ will it update the live copy.
However, at present, that logic will always cause an Update, because reflect.DeepEqual() is matching fields with default values that are set in the live object, but not set in the expected object.
We should either decide it's needed, and call it from another function, or remove it.
The file https://github.com/iam-veeramalla/argo-rollouts-manager/blob/main/controllers/default.go has some bad practices, such as hardcoding SHA sums and images. This is not OK: these should come as part of the config, and not be built into the Go code at build time.
The comments are also off.
The comparison !reflect.DeepEqual(actualPodSpec.Containers[0], desiredPodSpec.Containers) in the code snippet is incorrect: it compares actualPodSpec.Containers[0] (a single container) with desiredPodSpec.Containers (a whole slice). DeepEqual between values of different types is always false, so the comparison never behaves as intended and could lead to unintended consequences.
Right now the controller is doing the minimum, and the testing coverage is:
drpaneas@linux:~/go/src/github.com/iam-veeramalla/argo-rollouts-manager$ go test -cover ./...
? github.com/iam-veeramalla/argo-rollouts-manager [no test files]
? github.com/iam-veeramalla/argo-rollouts-manager/api/v1alpha1 [no test files]
ok github.com/iam-veeramalla/argo-rollouts-manager/controllers 0.018s coverage: 57.4% of statements
It would be a good idea to secure this basic functionality by increasing the controllers package coverage to a minimum of 80%.
The Argo Rollouts controller supports HA mode, as described in this FAQ section, by supplying the --leader-elect flag and adjusting the controller deployment replicas. While this may be something that could be achieved via the existing extraCommandArgs field in the operand spec, I think it would be worth considering the introduction of a first-class field in the CR to support HA mode, along the lines of:
```yaml
spec:
  ha:
    enabled: true
    replicas: 5
```
This should automatically add the required flag to the controller deployment container and update the replica count in the deployment itself.
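The resulting change to the controller Deployment might look like this (a sketch; the Deployment and container names are assumptions about the operator's generated resources):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-rollouts
spec:
  replicas: 5            # from spec.ha.replicas
  template:
    spec:
      containers:
      - name: argo-rollouts
        args:
        - --leader-elect # added when spec.ha.enabled is true
```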
I think it would be beneficial to consider a segmented package structure for the existing controllers package, in a way that would make the code more organized and abstracted by generally defined verticals. This could also help us define meaningful interfaces for subpackages and allow for greater maintainability of the code base going forwards
An initial idea for what the new package structure could look like:
└── Controllers/
├── Rollouts/
│ ├── argorollouts_controller.go
│ └── status.go
├── Workloads/
│ ├── deployments.go
│ ├── secrets.go
│ ├── config.go
│ ├── ...
│ └── utils.go
├── Networking/
│ ├── service.go
│ ├── ingress.go
│ ├── route.go
│ ├── ...
│ └── utils.go
├── RBAC/
│ ├── role.go
│ ├── rolebinding.go
│ ├── serviceaccount.go
│ ├── ...
│ └── utils.go
├── Common/
│ └── default.go
├── Testing/
│ └── testing.go
└── Utility/
└── utils.go
As a part of this I would propose to include the following changes:
- utils.go for package-specific utility functions; Utility/utils.go could be reserved only for truly generic utility functions that can be used project-wide.
- watchers.go and reconcile.go contents should be consolidated into argorollouts_controller.go, since their contents are closely related to (and performed by) the main argorollouts_controller.go.
- getPolicyRules should be moved out of utils.go and into the RBAC subpackage, since it is central to the RBAC functionality rather than just being a utility function.

This PR was contributed, which adds code that reconciles the Argo Rollouts ConfigMap settings resource, in order to enable OpenShift Route-based traffic management.
However, no Reconcile unit tests were contributed as part of that PR. We should add them as part of this issue.
Issue Criteria:
At present, there exist some non-Ginkgo-based unit tests in the 'controllers' package of this repository. We should convert them to Ginkgo.
kubectl apply -f examples/basic_argorollout.yaml is wrong, as this file doesn't exist. The correct command would be kubectl apply -f examples/basic_rolloutmanager.yaml.
Doing that, you end up with the following error:
drpaneas@linux:~/go/src/github.com/iam-veeramalla/argo-rollouts-manager$ kubectl apply -f examples/basic_rolloutmanager.yaml
rolloutmanager.argoproj.io/rollout-manager created
2023-05-03T12:23:51+02:00 INFO Reconciling Rollout Managers {"controller": "rolloutmanager", "controllerGroup": "argoproj.io", "controllerKind": "RolloutManager", "RolloutManager": {"name":"rollout-manager","namespace":"default"}, "namespace": "default", "name": "rollout-manager", "reconcileID": "8129326f-b483-4472-8c02-d014c76175e2", "Request.Namespace": "default", "Request.Name": "rollout-manager"}
2023-05-03T12:23:51+02:00 INFO rollouts-controller reconciling rollouts serviceaccount
2023-05-03T12:23:51+02:00 INFO rollouts-controller Creating serviceaccount argo-rollouts
2023-05-03T12:23:51+02:00 INFO rollouts-controller reconciling rollouts roles
2023-05-03T12:23:51+02:00 INFO rollouts-controller Creating role argo-rollouts
2023-05-03T12:23:51+02:00 INFO rollouts-controller reconciling rollouts role bindings
2023-05-03T12:23:51+02:00 INFO rollouts-controller Creating rolebinding argo-rollouts
2023-05-03T12:23:51+02:00 INFO rollouts-controller reconciling rollouts secret
2023-05-03T12:23:51+02:00 INFO rollouts-controller Creating secret argo-rollouts-notification-secret
2023-05-03T12:23:51+02:00 INFO rollouts-controller reconciling rollouts deployment
2023-05-03T12:23:51+02:00 INFO rollouts-controller Creating deployment argo-rollouts
2023-05-03T12:23:51+02:00 INFO rollouts-controller reconciling rollouts service
2023-05-03T12:23:51+02:00 INFO rollouts-controller Creating service argo-rollouts
2023-05-03T12:23:51+02:00 INFO rollouts-controller reconciling status of workloads
2023-05-03T12:23:51+02:00 INFO Reconciling Rollout Managers {"controller": "rolloutmanager", "controllerGroup": "argoproj.io", "controllerKind": "RolloutManager", "RolloutManager": {"name":"rollout-manager","namespace":"default"}, "namespace": "default", "name": "rollout-manager", "reconcileID": "7d5c2dc2-6228-4031-a1cd-f539e5c4f0c9", "Request.Namespace": "default", "Request.Name": "rollout-manager"}
2023-05-03T12:23:51+02:00 INFO rollouts-controller reconciling rollouts serviceaccount
2023-05-03T12:23:51+02:00 INFO rollouts-controller reconciling rollouts roles
2023-05-03T12:23:51+02:00 INFO rollouts-controller reconciling rollouts role bindings
2023-05-03T12:23:51+02:00 INFO rollouts-controller reconciling rollouts secret
2023-05-03T12:23:51+02:00 INFO rollouts-controller reconciling rollouts deployment
2023-05-03T12:23:51+02:00 ERROR Reconciler error {"controller": "rolloutmanager", "controllerGroup": "argoproj.io", "controllerKind": "RolloutManager", "RolloutManager": {"name":"rollout-manager","namespace":"default"}, "namespace": "default", "name": "rollout-manager", "reconcileID": "7d5c2dc2-6228-4031-a1cd-f539e5c4f0c9", "error": "Operation cannot be fulfilled on deployments.apps \"argo-rollouts\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/drpaneas/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/drpaneas/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/drpaneas/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235
2023-05-03T12:23:51+02:00 INFO Reconciling Rollout Managers {"controller": "rolloutmanager", "controllerGroup": "argoproj.io", "controllerKind": "RolloutManager", "RolloutManager": {"name":"rollout-manager","namespace":"default"}, "namespace": "default", "name": "rollout-manager", "reconcileID": "5e74903b-7821-4ca0-a173-1ccaccd822ce", "Request.Namespace": "default", "Request.Name": "rollout-manager"}
This code is generating a list of policy rules for an Argo Rollouts role. It's a common practice to use generated or pre-defined policy rules, especially for RBAC roles and role bindings. This approach allows you to easily define a set of permissions for a given role, without having to manually add or remove permissions every time the manifests change.
However, this code could become outdated if the underlying manifests change. One solution to this problem is to generate the policy rules during the build process or as part of a CI/CD pipeline, using a tool like kubebuilder or kustomize. This way, the policy rules are always up-to-date with the manifests.
Another solution is to include a mechanism for versioning the policy rules. For example, you could store the policy rules in a ConfigMap or Secret and include a version field. When the manifests change, you can update the policy rules and increment the version field. Then, in your RBAC role and role binding definitions, you can reference the specific version of the policy rules that you want to use.
Overall, the use of generated or pre-defined policy rules is a good practice for RBAC roles and role bindings, but you should also consider how to keep the policy rules up-to-date with changes in the underlying manifests.
Personally, I would use a tool that reads the manifests and generates this code as part of the build process, to avoid future issues.