gardener / etcd-druid
ETCD operator managing the lifecycle of ETCD clusters for hosted control planes.
License: Apache License 2.0
Feature (What you would like to be added):
Run multi-node ETCD during maintenance operations, so that it can fail over quickly.
Motivation (Why is this needed?):
Shorter ETCD (=API server=cluster) downtimes during maintenance operations that affect ETCD, like rolling the seed node it runs on or updating the ETCD spec.
Approach/Hint to the implement solution (optional):
An operator that scales out (with node anti-affinity) and later scales in again. The main question will be how to orchestrate that with Gardener, as there are hooks and means missing for that at present.
Feature (What you would like to be added):
Make the Etcd CRD's spec.backup.store section immutable.
Motivation (Why is this needed?):
Make the Etcd CRD's spec.backup.store section immutable so that the storage container location isn't allowed to change mid-usage of an etcd, due to the potential mismatch of snapshotting and restoration locations, which would allow restorations to happen from a different etcd's backup and render the shoot cluster unusable. Refer to gardener/gardener#4454 for a fix already made in Gardener, although we still want druid to be resilient to potential undesirable changes to the Etcd resource.
Approach/Hint to the implement solution (optional):
Since CRD immutability is yet to be supported (refer to kubernetes/kubernetes#65973), it might make more sense to use something like a validating webhook on Etcd resource updates.
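A minimal sketch of such a webhook, assuming a controller-runtime admission handler and that the relevant field is Spec.Backup.Store of the v1alpha1 Etcd type (import path and field names may differ):

package webhook

import (
	"context"
	"net/http"
	"reflect"

	druidv1alpha1 "github.com/gardener/etcd-druid/api/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// EtcdValidator denies updates that modify spec.backup.store.
type EtcdValidator struct {
	decoder *admission.Decoder // injected during webhook setup
}

func (v *EtcdValidator) Handle(ctx context.Context, req admission.Request) admission.Response {
	if req.Operation != "UPDATE" {
		return admission.Allowed("")
	}
	oldEtcd, newEtcd := &druidv1alpha1.Etcd{}, &druidv1alpha1.Etcd{}
	if err := v.decoder.DecodeRaw(req.OldObject, oldEtcd); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}
	if err := v.decoder.Decode(req, newEtcd); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}
	// Reject any change to the backup store section once the Etcd resource exists.
	if !reflect.DeepEqual(oldEtcd.Spec.Backup.Store, newEtcd.Spec.Backup.Store) {
		return admission.Denied("spec.backup.store is immutable")
	}
	return admission.Allowed("")
}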
/cc @amshuman-kr
Feature (What you would like to be added):
Enhance reconciliation to handle the multi-node scenario in etcd-druid. This should include the following topics:
- Ready and AllMembersReady conditions based on the contents of the members section of the Etcd resource status.
- Lease objects for every member pod.
- Services for client and etcd peer (ref) -> TBD (#147)
Motivation (Why is this needed?):
Pick individually executable pieces of the multi-node proposal.
Approach/Hint to the implement solution (optional):
Feature (What you would like to be added):
Unit tests for etcd-druid reconciliation cycle.
Motivation (Why is this needed?):
We should have both positive and negative scenarios covered in the unit tests to improve our own productivity and to avoid regression.
Approach/Hint to the implement solution (optional):
Replace the kubebuilder way of tests (running kube-apiserver and etcd) with mock APIs.
Feature (What you would like to be added):
The etcd-druid should control both the versions for etcd and the backup-restore sidecar.
Motivation (Why is this needed?):
It controls the manifests and configuration for the statefulset, and the versions used must fit them. Hence, it makes sense to control them.
Approach/Hint to the implement solution (optional):
Please use the image vector approach (https://github.com/gardener/gardener/blob/master/charts/images.yaml) with the use of https://github.com/gardener/gardener/tree/master/pkg/utils/imagevector.
It must be possible to overwrite the image vector during deployment time.
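A rough sketch of how the images could be resolved with the gardener imagevector helpers; the image vector path and the image names "etcd" and "etcd-backup-restore" are assumptions here:

package images

import (
	"github.com/gardener/gardener/pkg/utils/imagevector"
)

// resolveImages reads the image vector bundled with etcd-druid (e.g. charts/images.yaml)
// and honours an overwrite file referenced via the IMAGEVECTOR_OVERWRITE environment
// variable, so the images can be replaced at deployment time without rebuilding etcd-druid.
func resolveImages() (etcdImage, backupRestoreImage string, err error) {
	iv, err := imagevector.ReadGlobalImageVectorWithEnvOverride("charts/images.yaml")
	if err != nil {
		return "", "", err
	}
	etcd, err := iv.FindImage("etcd")
	if err != nil {
		return "", "", err
	}
	br, err := iv.FindImage("etcd-backup-restore")
	if err != nil {
		return "", "", err
	}
	return etcd.String(), br.String(), nil
}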
Feature (What you would like to be added):
Move to etcd v3.3.23 or the latest v3.3.x patch release.
Motivation (Why is this needed?):
Approach/Hint to the implement solution (optional):
Describe the bug:
The StatefulSet API prevents creation of StatefulSets where the labels in the template do not match the selector. The Etcd resource, however, does not have similar validation.
Expected behavior:
Etcd resource creation should throw an error when the labels in the template do not match the selector.
How To Reproduce (as minimally and precisely as possible):
Have the selector field set so that it does not match the template in the statefulset.
Logs:
Screenshots (if applicable):
Environment (please complete the following information):
Anything else we need to know?:
Feature (What you would like to be added):
Integrate the backup compression feature from etcd-backup-restore with etcd-druid (by enabling configuration via the Etcd resource spec) and then integrate with gardener.
Motivation (Why is this needed?):
The backup compression feature will be used primarily in the etcd-druid and gardener context.
Approach/Hint to the implement solution (optional):
Keep the default configuration in etcd-druid to be uncompressed backups (for backward compatibility) and the default configuration in gardener integration to be compressed backups.
Feature (What you would like to be added):
The main reconciliation loop in etcd-druid takes care of everything from updating the owned resources to updating the status in the Etcd resource. We should create a separate controller (still part of the etcd-druid controller manager) which reconciles only the status section of the Etcd resource.
Credit: @rfranzke ❤️
Motivation (Why is this needed?):
The main reconciliation loop is triggered only if the watch events pass some predicates. If the status update during the main reconciliation fails for any reason, the status in the Etcd resource might not be updated until the next gardener reconciliation event that matches the predicates.
Approach/Hint to the implement solution (optional):
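A minimal sketch of such a dedicated status controller, assuming a controller-runtime setup; the name Custodian and the requeue interval are illustrative:

package controllers

import (
	"context"
	"time"

	druidv1alpha1 "github.com/gardener/etcd-druid/api/v1alpha1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Custodian reconciles only the status section of the Etcd resource,
// independently of the main reconciliation loop.
type Custodian struct {
	client.Client
}

func (c *Custodian) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	etcd := &druidv1alpha1.Etcd{}
	if err := c.Get(ctx, req.NamespacedName, etcd); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	original := etcd.DeepCopy()
	// ... recompute status (conditions, members, readiness) from the owned resources here ...

	// Patch only the status subresource to avoid conflicts with the main controller.
	if err := c.Status().Patch(ctx, etcd, client.MergeFrom(original)); err != nil {
		return ctrl.Result{}, err
	}
	// Resync periodically so a failed status update is retried even without new watch events.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}

func (c *Custodian) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).For(&druidv1alpha1.Etcd{}).Complete(c)
}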
Describe the bug:
The VPA recommender is missing the permission to get the scale subresource, because of which vertical autoscaling of etcd is not happening.
Expected behavior:
As load increases, etcd should be scaled based on VPA recommendations.
How To Reproduce (as minimally and precisely as possible):
Logs:
Screenshots (if applicable):
Environment (please complete the following information):
Anything else we need to know?:
Feature (What you would like to be added):
The current multi-node ETCD proposal should handle backup health more explicitly. In particular, consider the impact of not cutting off requests when backup upload fails.
Motivation (Why is this needed?):
Approach/Hint to the implement solution (optional):
Feature (What you would like to be added):
Expose the associated monitoring and logging configuration as per the https://github.com/gardener/gardener/blob/master/docs/extensions/logging-and-monitoring.md
Motivation (Why is this needed?):
Though Druid is a standalone component, it is designed to adhere to the gardener extension contract as well. As a result, it might have to take responsibility for exposing its monitoring configuration to gardener-like projects.
Approach/Hint to the implement solution (optional):
Question: How are you guys dealing with the incremental backup files when restoring a cluster? I am asking, because in Kubify we expect one full snapshot file to trigger a restore operation. What is the best way to compact all the incremental backup files into one? Do you have already something handy?
Feature (What you would like to be added):
If there is any issue with the watch connections used by the informers, etcd-druid should detect this and try to automatically recover from it.
Motivation (Why is this needed?):
Manual intervention is required without such detection and automatic recovery.
Approach/Hint to the implement solution (optional):
We need to revendor client-go and possibly controller-runtime to include the fix (kubernetes/kubernetes#87329) that propagates informer errors to the caller, and then possibly react to them.
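A possible direction once the fix is vendored, sketched with the client-go SetWatchErrorHandler API that the fix introduced; the recovery action shown (exiting so the pod restarts) is just one option:

package main

import (
	"os"

	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// installWatchErrorHandler registers a handler that reacts to persistent watch/list failures
// of an informer instead of silently retrying forever.
func installWatchErrorHandler(informer cache.SharedIndexInformer) {
	err := informer.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
		// The default behaviour only logs; here we could trigger an explicit recovery,
		// e.g. report unhealthiness or, as a crude option, exit so the pod is restarted.
		klog.Errorf("watch error in informer, triggering recovery: %v", err)
		os.Exit(1)
	})
	if err != nil {
		klog.Errorf("could not set watch error handler (informer already started?): %v", err)
	}
}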
See also:
Feature (What you would like to be added):
In some infrastructures (Azure), abnormal termination of the etcd container/pod leads to the database directory lock not being released, which causes the backup-restore sidecar to hang while opening the database for verification on etcd container restart.
We should try to detect this scenario and recover from it automatically.
Motivation (Why is this needed?):
This happens rarely (so far only a couple of times in Azure) but requires manual intervention. Typically, a pod restart resolves the issue. But we should try and automate this.
Approach/Hint to the implement solution (optional):
Typically, a pod restart resolves the issue.
Please provide stories that we plan to tackle:
We should support provisioning and management of multi-node etcd clusters via etcd-druid to serve the following goals:
First increment (please break-down into different increments rather than by component, if possible)
Documentation/Proposal
Etcd-Druid
Etcd-Backup-Restore
Additional Improvements
Feature (What you would like to be added):
Druid should use and support Server-Side Apply where applicable once Gardener has dropped support for seed clusters with K8s <= 1.17 (gardener/gardener#4083).
Motivation (Why is this needed?):
Server-Side Apply makes working with the etcd resource more efficient when there is more than one actor (motivated here).
Tasks to be done:
- etcd resource
- etcd status updates
Feature (What you would like to be added):
Make it possible to have a smaller auto-compaction-retention period for etcd (both the main etcd and the embedded ETCD during restoration).
Motivation (Why is this needed?):
A high update rate can overflow memory and storage if the auto-compaction-retention period is long. The current value is 24h (see etcd-druid/charts/etcd/templates/etcd-bootstrap-configmap.yaml, lines 129 to 130 at 8307a62).
Approach/Hint to the implement solution (optional):
We can either change the value to be smaller by default (5m?) and/or we can make it configurable via the Etcd resource spec.
Feature (What you would like to be added):
Enhance the Etcd resource status structure according to the changes proposed here, in particular the member status in the Etcd resource status. This need not include the task of cutting off traffic in case of backup failure yet, as the evaluation/decision on that is pending for the scenario of multi-node ETCD with ephemeral persistence. The etcd-backup-restore needs to consider the following scenarios in the implementation.
etcd-druid can maintain the AllMembersReady and Ready conditions as well as the following transitions for the member status where the member's etcd-backup-restore is unable to update its own status. Probably this has to be done in etcd-druid by enhancing the etcd status controller (custodian).
Motivation (Why is this needed?):
Pick individually executable pieces of the multi-node proposal.
Approach/Hint to the implement solution (optional):
Also, it would be preferable to use StatusWriter.Patch() to avoid race conditions.
Feature (What you would like to be added):
Move out etcd bootstrap script to etcd custom image.
Motivation (Why is this needed?):
To avoid the issue of out-of-sync configmap and statefulset spec (and hence etcd image version) during etcd version updates on Gardener landscapes.
Approach/Hint to the implement solution (optional):
Feature (What you would like to be added):
Enhance the Etcd resource status structure according to the changes proposed here while maintaining backward compatibility for the consumers of the Etcd resource status (such as the gardenlet).
Motivation (Why is this needed?):
Pick individually executable pieces of the multi-node proposal.
Approach/Hint to the implement solution (optional):
For backward compatibility, the existing status fields and the values in them need to be maintained as they are, in both the main etcd-druid controller (especially here and here) as well as the newly separated custodian controller.
Feature (What you would like to be added):
Leader election settings should be increased and made configurable in chart manifests.
Motivation (Why is this needed?):
The default leader election settings in controller-runtime seem to create too much load on the apiserver. It should be possible to configure them to reduce the load on the apiserver without having to make any changes to etcd-druid.
Approach/Hint to the implement solution (optional):
We can introduce command-line flags and chart manifest flags along the lines of gardener/gardener#2667.
Also, it would be desirable to switch to Lease for leader election rather than ConfigMap. But controller-runtime still uses ConfigMap. So, for this, we either have to wait till controller-runtime moves to Lease or we override with a custom newResourceLock factory function in the options.
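A sketch of what the configurable settings could look like when wired into the controller-runtime manager options; the flag names and defaults are assumptions, and newer controller-runtime versions additionally allow selecting the Lease resource lock directly:

package main

import (
	"flag"
	"time"

	"k8s.io/client-go/tools/leaderelection/resourcelock"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	// Hypothetical flags so operators can tune leader election load without code changes.
	leaseDuration := flag.Duration("leader-election-lease-duration", 30*time.Second, "Leader election lease duration.")
	renewDeadline := flag.Duration("leader-election-renew-deadline", 20*time.Second, "Leader election renew deadline.")
	retryPeriod := flag.Duration("leader-election-retry-period", 10*time.Second, "Leader election retry period.")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:             true,
		LeaderElectionID:           "druid-leader-election",
		LeaderElectionResourceLock: resourcelock.LeasesResourceLock, // use Lease instead of ConfigMap
		LeaseDuration:              leaseDuration,
		RenewDeadline:              renewDeadline,
		RetryPeriod:                retryPeriod,
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // controllers would be registered and mgr.Start(...) called here
}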
Describe the bug:
Validations for the below scenarios:
Expected behavior:
How To Reproduce (as minimally and precisely as possible):
Logs:
Screenshots (if applicable):
Environment (please complete the following information):
Anything else we need to know?:
Feature (What you would like to be added):
This repository should also benefit from automatic update PRs of dependent components. etcd-druid deploys etcd-backup-restore, hence, when a new version of it is released then automatic update PRs should be opened by CI, similar to gardener/gardener#2260.
Motivation (Why is this needed?):
Less manual actions.
Approach/Hint to the implement solution (optional):
You need such a script: https://github.com/gardener/gardener/blob/master/hack/.ci/set_dependency_version
You don't need to copy it but can also call it like the extensions do, as you already vendor gardener/gardener: https://github.com/gardener/gardener-extension-provider-aws/blob/master/.ci/set_dependency_version, https://github.com/gardener/gardener-extension-provider-aws/blob/master/hack/tools.go#L23
Currently, druid deploys the etcd statefulset with an ownerReference pointing to the etcd resource, but the blockOwnerDeletion field is set to false as you can see here. However, deletion of the etcd resource should be blocked until all the resources deployed by it, like the statefulset, service and configmap, are deleted.
Ideally, deletion of the etcd resource should guarantee that the etcd server is completely down.
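A small sketch of the two pieces involved, assuming controller-runtime/client-go types; this is illustrative, not necessarily how druid builds its ownerReferences today:

package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/pointer"
	"sigs.k8s.io/controller-runtime/pkg/client"

	druidv1alpha1 "github.com/gardener/etcd-druid/api/v1alpha1"
)

// ownerReferenceFor returns an owner reference that, combined with foreground deletion,
// blocks removal of the Etcd resource until the owned object (e.g. the StatefulSet) is gone.
func ownerReferenceFor(etcd *druidv1alpha1.Etcd) metav1.OwnerReference {
	return metav1.OwnerReference{
		APIVersion:         druidv1alpha1.GroupVersion.String(),
		Kind:               "Etcd",
		Name:               etcd.Name,
		UID:                etcd.UID,
		Controller:         pointer.Bool(true),
		BlockOwnerDeletion: pointer.Bool(true),
	}
}

// deleteWithForeground deletes the StatefulSet with the foreground cascading deletion policy,
// so the owner is only removed after its dependents are gone.
func deleteWithForeground(ctx context.Context, c client.Client, sts *appsv1.StatefulSet) error {
	return c.Delete(ctx, sts, client.PropagationPolicy(metav1.DeletePropagationForeground))
}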
Feature (What you would like to be added):
With the newly introduced compaction command in etcd-backup-restore to asynchronously compact backups (latest full snapshot and its following incremental snapshots into a single full snapshot) in gardener/etcd-backup-restore#301, we should enhance etcd-druid to schedule the backup compaction at regular intervals to limit the number of incremental snapshots at any point in time and hence enhance backup restoration performance.
Motivation (Why is this needed?):
Complete the functionality for the issue #88.
Approach/Hint to the implement solution (optional):
etcd-druid's main controller may create a CronJob as part of its reconciliation cycle. There is no need to include the logic for selecting existing cronjobs based on spec.selector (of the Etcd resources) because of #186.
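A rough sketch of the CronJob the main controller could create, assuming the batch/v1beta1 API; the schedule, image and compaction command/flags are assumptions and would come from the Etcd resource and the image vector:

package controllers

import (
	batchv1 "k8s.io/api/batch/v1"
	batchv1beta1 "k8s.io/api/batch/v1beta1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	druidv1alpha1 "github.com/gardener/etcd-druid/api/v1alpha1"
)

// compactionCronJobFor builds the CronJob that periodically compacts the backups of the
// given Etcd resource by running the etcd-backup-restore compaction command.
func compactionCronJobFor(etcd *druidv1alpha1.Etcd, backupRestoreImage string) *batchv1beta1.CronJob {
	return &batchv1beta1.CronJob{
		ObjectMeta: metav1.ObjectMeta{
			Name:      etcd.Name + "-backup-compaction",
			Namespace: etcd.Namespace,
		},
		Spec: batchv1beta1.CronJobSpec{
			Schedule:          "0 */6 * * *", // illustrative; could be taken from the Etcd spec
			ConcurrencyPolicy: batchv1beta1.ForbidConcurrent,
			JobTemplate: batchv1beta1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy: corev1.RestartPolicyOnFailure,
							Containers: []corev1.Container{{
								Name:    "backup-compaction",
								Image:   backupRestoreImage,
								Command: []string{"etcdbrctl", "compact"}, // store/provider flags omitted
							}},
						},
					},
				},
			},
		},
	}
}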
Feature (What you would like to be added):
Use an optimized informer for Lease resources, concretely only for objects which contain a gardener.cloud/owned-by label.
Motivation (Why is this needed?):
#214 fetches Lease objects for performing health checks on etcd members. It uses the standard Controller-Runtime client which is backed by a cache, so all Lease objects will be considered in the informer's ListWatch function. Since Controller-Runtime v0.9.0 it is possible to set up this cache in a more fine-granular way (see here).
Approach/Hint to the implement solution (optional):
Controller-Runtime has been updated to v0.10.2, so the optimization of the lease informer based on a label is supported with the current version.
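A sketch of how the cache could be restricted, assuming controller-runtime v0.9+ cache options and that the member leases carry the gardener.cloud/owned-by label:

package main

import (
	coordinationv1 "k8s.io/api/coordination/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func newManager() (manager.Manager, error) {
	// Only Lease objects carrying the gardener.cloud/owned-by label end up in the cache,
	// so the informer's ListWatch does not have to track unrelated Leases (e.g. node leases).
	selector, err := labels.Parse("gardener.cloud/owned-by")
	if err != nil {
		return nil, err
	}
	return ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		NewCache: cache.BuilderWithOptions(cache.Options{
			SelectorsByObject: cache.SelectorsByObject{
				&coordinationv1.Lease{}: {Label: selector},
			},
		}),
	})
}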
Describe the bug:
The following test case is failing after the commit 72ec7a0.
Expected behavior:
No test cases should fail.
How To Reproduce (as minimally and precisely as possible):
Run make test on the commit 72ec7a0.
Logs:
• Failure [6.027 seconds]
Druid when etcd resource is created [It] if fields are set in etcd.Spec and TLS enabled, the resources should reflect the spec changes
/tmp/build/a94a8fe5/pull-request-gardener.etcd-druid-pr.master/tmp/src/github.com/gardener/etcd-druid/controllers/etcd_controller_test.go:482
Expected
<string>: ConfigMap
to match fields: {
.Data."bootstrap.sh":
Expected
<string>: "...tus = '143'..."
to equal |
<string>: "...tus == '143..."
}
/tmp/build/a94a8fe5/pull-request-gardener.etcd-druid-pr.master/tmp/src/github.com/gardener/etcd-druid/controllers/etcd_controller_test.go:882
Screenshots (if applicable):
Environment (please complete the following information):
Anything else we need to know?:
Describe the bug:
The etcd controller removes the operation annotation from the Etcd resource after reconciling it, which goes against the gardener extension contract.
Expected behavior:
The etcd controller should remove the operation annotation from the Etcd resource before reconciling it here.
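A sketch of the expected ordering, with an illustrative annotation key and controller-runtime client calls:

package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	druidv1alpha1 "github.com/gardener/etcd-druid/api/v1alpha1"
)

const operationAnnotation = "gardener.cloud/operation" // illustrative annotation key

// removeOperationAnnotation removes the operation annotation *before* the actual
// reconciliation starts, so that a new operation requested while reconciling is not lost.
func removeOperationAnnotation(ctx context.Context, c client.Client, etcd *druidv1alpha1.Etcd) error {
	if _, ok := etcd.Annotations[operationAnnotation]; !ok {
		return nil
	}
	original := etcd.DeepCopy()
	delete(etcd.Annotations, operationAnnotation)
	return c.Patch(ctx, etcd, client.MergeFrom(original))
}

// Reconcile order (sketch): fetch Etcd -> removeOperationAnnotation -> reconcile owned resources -> update status.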
How To Reproduce (as minimally and precisely as possible):
Logs:
Screenshots (if applicable):
Environment (please complete the following information):
Anything else we need to know?:
Feature (What you would like to be added):
In Alicloud China regions, it is sometimes not possible to run apk get. We need to avoid such statements.
Motivation (Why is this needed?):
Shoot cluster can't be created in China regions sometimes.
Approach/Hint to the implement solution (optional):
Feature (What you would like to be added):
We should have performance/load tests for etcd instances integrated with the etcd-druid CI/CD pipelines, which should test at least the following aspects:
- Database size of 8Gi.
- High rate of writes (>500/s) into etcd and high rate of delta snapshots (>4/m of >100Mi snapshots).
Motivation (Why is this needed?):
This will help us understand the limits and help us configure the alert thresholds.
Approach/Hint to the implement solution (optional):
Describe the bug:
etcd-druid panics with nil pointer.
Expected behavior:
How To Reproduce (as minimally and precisely as possible):
Logs:
E0416 09:08:10.455443 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 317 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x14e9740, 0x23fb600)
/go/src/github.com/gardener/etcd-druid/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/src/github.com/gardener/etcd-druid/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x14e9740, 0x23fb600)
/usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/gardener/etcd-druid/controllers.(*EtcdReconciler).getMapFromEtcd(0xc0000d4050, 0xc00086c500, 0x3fb999999999999a, 0x4, 0x0)
/go/src/github.com/gardener/etcd-druid/controllers/etcd_controller.go:881 +0x1276
github.com/gardener/etcd-druid/controllers.(*EtcdReconciler).reconcileEtcd(0xc0000d4050, 0xc00086c500, 0xc00086c500, 0x0, 0x0, 0x0)
/go/src/github.com/gardener/etcd-druid/controllers/etcd_controller.go:724 +0x4d
github.com/gardener/etcd-druid/controllers.(*EtcdReconciler).reconcile(0xc0000d4050, 0x18da080, 0xc000048248, 0xc00086c500, 0xc000845c40, 0x2, 0x2, 0x18a9ec0)
/go/src/github.com/gardener/etcd-druid/controllers/etcd_controller.go:227 +0x27c
github.com/gardener/etcd-druid/controllers.(*EtcdReconciler).Reconcile(0xc0000d4050, 0xc000187300, 0x16, 0xc0005c4d00, 0xb, 0xc000758c00, 0x1, 0xc000758cc8, 0x478588)
/go/src/github.com/gardener/etcd-druid/controllers/etcd_controller.go:189 +0x30b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000184840, 0x15400e0, 0xc00061a400, 0x43eb00)
/go/src/github.com/gardener/etcd-druid/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x162
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000184840, 0x0)
/go/src/github.com/gardener/etcd-druid/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc000184840)
/go/src/github.com/gardener/etcd-druid/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00069c530)
/go/src/github.com/gardener/etcd-druid/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00069c530, 0x3b9aca00, 0x0, 0x1, 0xc00015a0c0)
/go/src/github.com/gardener/etcd-druid/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc00069c530, 0x3b9aca00, 0xc00015a0c0)
/go/src/github.com/gardener/etcd-druid/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/go/src/github.com/gardener/etcd-druid/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x328
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1347ff6]
goroutine 317 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/src/github.com/gardener/etcd-druid/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x14e9740, 0x23fb600)
/usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/gardener/etcd-druid/controllers.(*EtcdReconciler).getMapFromEtcd(0xc0000d4050, 0xc00086c500, 0x3fb999999999999a, 0x4, 0x0)
/go/src/github.com/gardener/etcd-druid/controllers/etcd_controller.go:881 +0x1276
github.com/gardener/etcd-druid/controllers.(*EtcdReconciler).reconcileEtcd(0xc0000d4050, 0xc00086c500, 0xc00086c500, 0x0, 0x0, 0x0)
/go/src/github.com/gardener/etcd-druid/controllers/etcd_controller.go:724 +0x4d
github.com/gardener/etcd-druid/controllers.(*EtcdReconciler).reconcile(0xc0000d4050, 0x18da080, 0xc000048248, 0xc00086c500, 0xc000845c40, 0x2, 0x2, 0x18a9ec0)
/go/src/github.com/gardener/etcd-druid/controllers/etcd_controller.go:227 +0x27c
github.com/gardener/etcd-druid/controllers.(*EtcdReconciler).Reconcile(0xc0000d4050, 0xc000187300, 0x16, 0xc0005c4d00, 0xb, 0xc000758c00, 0x1, 0xc000758cc8, 0x478588)
/go/src/github.com/gardener/etcd-druid/controllers/etcd_controller.go:189 +0x30b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000184840, 0x15400e0, 0xc00061a400, 0x43eb00)
/go/src/github.com/gardener/etcd-druid/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x162
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000184840, 0x0)
/go/src/github.com/gardener/etcd-druid/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc000184840)
/go/src/github.com/gardener/etcd-druid/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00069c530)
/go/src/github.com/gardener/etcd-druid/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00069c530, 0x3b9aca00, 0x0, 0x1, 0xc00015a0c0)
/go/src/github.com/gardener/etcd-druid/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc00069c530, 0x3b9aca00, 0xc00015a0c0)
/go/src/github.com/gardener/etcd-druid/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/go/src/github.com/gardener/etcd-druid/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x328
Screenshots (if applicable):
Environment (please complete the following information):
Anything else we need to know?:
etcd-druid: v0.1.14
Feature (What you would like to be added):
Move ETCD monitoring configuration into etcd-druid according to gardener's extensions monitoring integration.
Motivation (Why is this needed?):
Keep the monitoring configuration close to the component.
Approach/Hint to the implement solution (optional):
Feature (What you would like to be added):
It would be great if we had a standalone helm chart for druid which can be used outside Gardener.
Motivation (Why is this needed?):
Using druid instead of the etcd-operator and utilizing the backup and restore capabilities.
Approach/Hint to the implement solution (optional):
A cool thing would be to have a chart in the charts repository, released inside this github project via https://github.com/helm/chart-releaser-action in the repository's gh-pages branch.
Feature (What you would like to be added):
The status in the etcd resource does not reflect the current snapshot [full/delta]. Update the etcd resource status to reflect the latest snapshot information.
Motivation (Why is this needed?):
It would help in control plane migration to fetch the latest snapshot for update.
Approach/Hint to the implement solution (optional):
Feature (What you would like to be added):
Deploy/maintain the correct PodDisruptionBudget configuration according to the Etcd resource status conditions.
Motivation (Why is this needed?):
Pick individually executable pieces of the multi-node proposal.
Approach/Hint to the implement solution (optional):
The deployment of the PodDisruptionBudget resource is probably best done in the main controller, and the dynamic modification of the resource based on the Etcd resource status is best done in the custodian controller.
It is probably better to deploy the PodDisruptionBudget only for the multi-node case (spec.replicas > 1) because deploying it for the single-node case might block node drain.
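A small sketch of the intended behaviour, assuming the policy/v1beta1 API, that quorum for n members is n/2+1, and illustrative selector labels:

package controllers

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"

	druidv1alpha1 "github.com/gardener/etcd-druid/api/v1alpha1"
)

// pdbFor returns the desired PodDisruptionBudget for a multi-node cluster, or nil for a
// single-node cluster (where a PDB could block node drains).
func pdbFor(etcd *druidv1alpha1.Etcd, replicas int32, allMembersReady bool) *policyv1beta1.PodDisruptionBudget {
	if replicas <= 1 {
		return nil
	}
	// Keep quorum protected while the cluster is healthy; when members are already down,
	// the custodian controller could adjust this value based on the status conditions.
	minAvailable := replicas/2 + 1
	if !allMembersReady {
		minAvailable = replicas // be conservative while the cluster is degraded
	}
	ma := intstr.FromInt(int(minAvailable))
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: etcd.Name, Namespace: etcd.Namespace},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MinAvailable: &ma,
			Selector:     &metav1.LabelSelector{MatchLabels: map[string]string{"instance": etcd.Name}},
		},
	}
}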
Feature (What you would like to be added):
etcd-druid should not hardcode the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict=false" on etcd pods and should let the user configure it using the annotations field in the CRD.
Motivation (Why is this needed?):
Not every etcd is critical to the system. The above-mentioned annotation is specific to cluster-autoscaler and not to etcd. Depending on the use of the etcd, the CRD creator should have the choice to add this annotation.
From gardener's POV, etcd-main is critical but etcd-events is not that critical. So, the annotation should be set for etcd-main but not for etcd-events. If, in the future, we deploy etcd for the cilium networking extension, this annotation would probably not be required there either.
Approach/Hint to the implement solution (optional):
Remove the annotation from https://github.com/gardener/etcd-druid/blob/master/charts/etcd/templates/etcd-statefulset.yaml#L30.
Feature (What you would like to be added):
A new resource EtcdMember should be added to the druid.gardener.cloud/v1alpha1 API group.
Example:
apiVersion: druid.gardener.cloud/v1alpha1
kind: EtcdMember
metadata:
labels:
gardener.cloud/owned-by: etcd-test
name: etcd-test-0 # pod name
namespace: default
ownerReferences:
- apiVersion: druid.gardener.cloud/v1alpha1
blockOwnerDeletion: true
controller: true
kind: etcd
name: etcd-test
uid: <UID>
status:
id: "1"
lastTransitionTime: "2021-07-20T10:34:04Z"
lastUpdateTime: "2021-07-20T10:34:04Z"
name: member1
reason: up and running
role: Member
status: Ready
Every etcd member in a cluster should have a corresponding EtcdMember resource which contains the shown status information. The EtcdMember resource ought to be created and maintained by the backup-restore sidecar. Etcd-Druid may set status: Unknown after heartbeatGracePeriod (ref).
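A rough sketch of how the backup-restore sidecar could maintain its own EtcdMember status, assuming a Go type generated for the proposed resource shown above (the EtcdMember type and its status fields mirror the example and are otherwise assumptions):

package member

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	druidv1alpha1 "github.com/gardener/etcd-druid/api/v1alpha1"
)

// renewStatus is called periodically by the backup-restore sidecar of pod `podName`
// to report that its etcd member is up; etcd-druid would flip the status to Unknown if
// no update arrives within heartbeatGracePeriod.
func renewStatus(ctx context.Context, c client.Client, namespace, podName string) error {
	member := &druidv1alpha1.EtcdMember{} // hypothetical type for the proposed resource
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: podName}, member); err != nil {
		return err
	}
	original := member.DeepCopy()
	member.Status.Status = "Ready"
	member.Status.Reason = "up and running"
	member.Status.LastUpdateTime = metav1.NewTime(time.Now())
	// Patch only this member's own resource, so concurrent members never conflict
	// on a shared Etcd status object.
	return c.Status().Patch(ctx, member, client.MergeFrom(original))
}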
Motivation (Why is this needed?):
The original proposal intended the status information of each etcd member to be part of a []members list in the etcd.status resource. However, this will lead to update conflicts as multiple clients try to update the same resource at nearly the same time, and we cannot use any adequate patch technique (SSA failed for K8s versions <= 1.21, strategic merge not supported for CRDs) to prevent that.
Subtasks
Feature (What you would like to be added):
#163 introduced locking the main and custodian controllers for every update to the Etcd resource and its status.
This should be avoided and the race conditions in the tests should be solved in a different way.
Motivation (Why is this needed?):
Such synchronisation will lead to performance bottlenecks.
Approach/Hint to the implement solution (optional):
Use StatusWriter.Patch()?
Credit: @timuthy
Feature (What you would like to be added):
Functionality to start a job which will use etcd-backup-restore to copy ETCD backups between backup buckets during the restore phase of Control Plane Migration (you can check the revised GEP here).
The ETCD druid can find out whether it should start such a job via an additional field in the etcd resource, providing information about the source backup bucket. All necessary secrets will be handled by the BackupEntry controller and an additional "source" BackupEntry resource.
Motivation (Why is this needed?):
This is needed to start an etcd-backup-restore copy operation which will be used to copy etcd backups between backup buckets. You can check issue 356 on the etcd-backup-restore repo
Approach/Hint to the implement solution (optional):
A POC was already developed for this, however we did not start the etcd-backup-restore copy operation as a job. Still the main functionality and idea is present in the POC. It is outlined here: gardener/gardener#3875
Describe the bug:
After a single-node etcd instance provisioned via etcd-druid terminated abnormally (non-zero exit code), the etcd container restarted and the backup-restore sidecar container (on data directory verification) had the following logs.
current etcd revision (2314180238) is less than latest snapshot revision (2314180239): possible data loss
On circumventing the backup restoration triggered because of this, it was found that the WAL directory (not checked by the backup-restore sidecar) contained more recent revisions which were applied after the restart (without the backup restoration).
Expected behavior:
etcd-druid should try and configure etcd instances to shut down safely (and flush the WAL changes to the database), or soften the impact when that is not possible.
How To Reproduce (as minimally and precisely as possible):
Not known yet.
Logs:
current etcd revision (2314180238) is less than latest snapshot revision (2314180239): possible data loss
Screenshots (if applicable):
Environment (please complete the following information):
Anything else we need to know?:
Document information about how to deploy etcd-druid and which resources are needed by etcd-druid while reconciling an etcd resource.
Feature (What you would like to be added):
Please add validation code for etcd resources, similar to the validation code that already exists for other Gardener extension resources, even though that is technically still dead code.
We are currently working on a new validating webhook in seed-admission-controller for such extension resources (see gardener/gardener#4293); I think we could include the validation of etcd resources there as well. Alternatively, etcd-druid could introduce its own validating webhook if for whatever reason the above option is not good enough.
Motivation (Why is this needed?):
We recently had a rather severe issue that could have been prevented if we had such validation in place, see gardener/gardener-extension-provider-azure#328 (comment). In this particular case, gardenlet was generating an etcd resource with a spec.backup.store.prefix set to -- due to a data race. With validation in place, we could have detected -- as an invalid spec.backup.store.prefix and prevented the reconciliation from continuing. This particular issue is already fixed in gardenlet (see gardener/gardener#4459 and gardener/gardener#4454), but similar issues may occur in the future.
Approach/Hint to the implement solution (optional):
Feature (What you would like to be added):
Currently, the health check of the etcd pods is linked to the backup health (last backup upload succeeded) in addition to just etcd health. But as long as etcd data is backed by persistent volumes (it is now), we can afford for etcd to continue serving incoming requests even when backup upload fails, as long as high-priority alerts are triggered when backup upload fails and follow-up is done to resolve the issue.
Motivation (Why is this needed?):
Avoid bringing down the whole shoot cluster control-plane when backup upload fails, as that basically brings the cluster to a grinding halt. This might be affordable if etcd data is backed by persistent volumes, because for data loss to occur a further data corruption in the persistent volumes is required (while backup upload is failing).
See also https://github.tools.sap/kubernetes-canary/issues-canary/issues/599
Approach/Hint to the implement solution (optional):
The following tasks might have to be checked/evaluated.
Feature (What you would like to be added):
Summarise the roadmap for etcd-druid with links to the corresponding issues.
Motivation (Why is this needed?):
A central place to collect the roadmap as well as the progress.
Approach/Hint to the implement solution (optional):
- StatefulSet (with replicas: 1) with the containers for etcd and etcd-backup-restore the same way it is being done now.
- etcd defragmentation schedule from the CRD to etcd-backup-restore sidecar container.
- etcd cluster
- etcd nodes within the same Kubernetes cluster.
- etcd nodes in the same Kubernetes cluster/namespace as the CRD instance.
- Scale sub-resource implementation for the current CRD
- etcd learners/members during scale up, including quorum adjustment.
- etcd members during scale down, including quorum adjustment.
- etcd cluster
- etcd nodes distributed across availability zones in the hosting Kubernetes cluster
- etcd node in a different Kubernetes cluster.
- etcd node will be provisioned via a separate CRD instance in a different Kubernetes cluster but these nodes will be configured to find each other to form an etcd cluster.
- etcd cluster.
- etcd learners/members during scale up, including quorum adjustment.
- etcd members during scale down, including quorum adjustment.
- etcd cluster
- VerticalPodAutoscaler supports multiple update policies including recreate, initial and off.
- recreate policy is clearly not suitable for a single-node etcd instance because of the implications on frequent, unpredictable and unmanaged down-time.
- initial policy does not make sense for etcd considering the longer database verification time for non-graceful shutdown.
- etcd instance, vertical scaling via the VerticalPodAutoscaler would always be disruptive because of the way scaling is done by VPA. It gives no opportunity to take action before the etcd pod(s) are disrupted for scaling.
- etcd-specific steps to mitigate the disruption during (vertical) scaling if an alternative way is used to vertically scale a CRD instead of the individual pods directly.
- etcd instance, updates would be disruptive.
- etcd-specific steps to mitigate the disruption during updates.
- etcd instance which might mean that the memory requirement for database restoration is almost certain to be proportionate to the database size. However, the memory requirement for backup (full and delta) need not be proportionate to the database size at all. In fact, it is very realistic to expect that the memory requirement for backup be more or less independent of the database size.
Describe the bug:
The main controller reconciles changes to the Etcd resource spec even if the gardener.cloud/reconcile annotation is not added. This is against the gardener extension contract.
Expected behavior:
The main controller should use the predicates in such a way that changes to the Etcd resource spec are reconciled only when the resource is also annotated appropriately. For example, see here.
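A sketch of such a predicate; the annotation key and expected value are illustrative:

package controllers

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

const reconcileAnnotation = "gardener.cloud/operation" // illustrative; expected value: "reconcile"

// hasReconcileAnnotation only lets create/update events pass when the operator (gardenlet)
// has explicitly requested a reconciliation via the annotation, matching the extension contract.
func hasReconcileAnnotation() predicate.Predicate {
	hasAnnotation := func(annotations map[string]string) bool {
		return annotations[reconcileAnnotation] == "reconcile"
	}
	return predicate.Funcs{
		CreateFunc: func(e event.CreateEvent) bool {
			return hasAnnotation(e.Object.GetAnnotations())
		},
		UpdateFunc: func(e event.UpdateEvent) bool {
			return hasAnnotation(e.ObjectNew.GetAnnotations())
		},
	}
}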
How To Reproduce (as minimally and precisely as possible):
Logs:
Screenshots (if applicable):
Environment (please complete the following information):
Anything else we need to know?:
Feature (What you would like to be added):
Move to etcd v3.4.10 or the latest v3.4.x patch release.
Motivation (Why is this needed?):
Approach/Hint to the implement solution (optional):
We might have to build our own custom image for etcd to package the dependencies of the bootstrap script (wget).
Feature (What you would like to be added):
We should create a suite of automated tests that test gardener integration.
Motivation (Why is this needed?):
We should detect as many regression and backward-compatibility issues as possible before merging a PR, to keep the master release-ready at any point in time.
Approach/Hint to the implement solution (optional):
Feature (What you would like to be added):
The reconciliation flow of etcd-druid includes claiming from potentially multiple pre-existing StatefulSet, Service and ConfigMap objects if they exist. This is done by selecting the objects based on spec.selector in the Etcd resource, claiming one of the matching objects (if any) and deleting the rest of the objects (if any). If no matching objects are found then a new object is created.
The logic of claiming from multiple pre-existing objects based on spec.selector was done because of the following reasons.
- The migration scenario when etcd-druid was introduced, i.e. adopting objects created from the time before etcd-druid was introduced minimised and simplified clean up.
- The option of using one StatefulSet for all the members of an ETCD cluster. Another alternative of using one StatefulSet for each member of an ETCD cluster was still open at that time.
Now that the migration scenario as well as the multi-node design don't need the functionality of claiming from multiple pre-existing objects, we can simplify the claim logic to just pick the object to be claimed by the same name as the Etcd resource. We will still need the claim functionality to mark it as claimed, of course.
Motivation (Why is this needed?):
Approach/Hint to the implement solution (optional):
Feature (What you would like to be added): ETCD druid should create multiple ETCD instances (along with ETCDBR instances) as specified in the ETCD CRD.
Motivation (Why is this needed?): To allow bootstrapping of a multi-node ETCD cluster for a shoot cluster.
Approach/Hint to the implement solution (optional):
Refer: #107