
etcd-wrapper's Introduction

Gardener Logo


Gardener implements the automated management and operation of Kubernetes clusters as a service and provides a fully validated extensibility framework that can be adjusted to any programmatic cloud or infrastructure provider.

Gardener is 100% Kubernetes-native and exposes its own Cluster API to create homogeneous clusters on all supported infrastructures. This API differs from SIG Cluster Lifecycle's Cluster API, which only harmonizes how to get to clusters; Gardener's Cluster API goes one step further and also harmonizes the make-up of the clusters themselves. That means Gardener gives you homogeneous clusters with exactly the same bill of materials, configuration, and behavior on all supported infrastructures, as you can see further below in the section on our K8s Conformance Test Coverage.

In 2020, SIG Cluster Lifecycle's Cluster API made a huge step forward with v1alpha3 and the newly added support for declarative control plane management. This made it possible to integrate managed services like GKE or Gardener. We would be more than happy to contribute a Gardener control plane provider if the community is interested. For more information on the relation between the Gardener API and SIG Cluster Lifecycle's Cluster API, please see here.

Gardener's main principle is to leverage Kubernetes concepts for all of its tasks.

In essence, Gardener is an extension API server that comes along with a bundle of custom controllers. It introduces new API objects in an existing Kubernetes cluster (which is called garden cluster) in order to use them for the management of end-user Kubernetes clusters (which are called shoot clusters). These shoot clusters are described via declarative cluster specifications which are observed by the controllers. They will bring up the clusters, reconcile their state, perform automated updates and make sure they are always up and running.

To accomplish these tasks reliably and to offer a high quality of service, Gardener controls the main components of a Kubernetes cluster (etcd, API server, controller manager, scheduler). These so-called control plane components are hosted in Kubernetes clusters themselves (which are called seed clusters). This is the main difference compared to many other OSS cluster provisioning tools: The shoot clusters do not have dedicated master VMs. Instead, the control plane is deployed as a native Kubernetes workload into the seeds (the architecture is commonly referred to as kubeception or inception design). This not only effectively reduces the total cost of ownership but also allows easier implementation of "day-2 operations" (like cluster updates or robustness) by relying on all the mature Kubernetes features and capabilities.

Gardener reuses the identical Kubernetes design to span a scalable multi-cloud and multi-cluster landscape. Such familiarity with known concepts has proven to quickly ease the initial learning curve and accelerate developer productivity:

  • Kubernetes API Server = Gardener API Server
  • Kubernetes Controller Manager = Gardener Controller Manager
  • Kubernetes Scheduler = Gardener Scheduler
  • Kubelet = Gardenlet
  • Node = Seed cluster
  • Pod = Shoot cluster

Please find more information regarding the concepts and a detailed description of the architecture in our Gardener Wiki and our blog posts on kubernetes.io: Gardener - the Kubernetes Botanist (17.5.2018) and Gardener Project Update (2.12.2019).


K8s Conformance Test Coverage

Gardener takes part in the Certified Kubernetes Conformance Program to attest its compatibility with the K8s conformance test suite. Currently, Gardener is certified for K8s versions up to v1.30; see the conformance spreadsheet.

Continuous conformance test results of the latest stable Gardener release are uploaded regularly to the CNCF test grid:

| Provider/K8s  | v1.30 | v1.29 | v1.28 | v1.27 | v1.26 | v1.25 |
| ------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| AWS           | Conformance tests | Conformance tests | Conformance tests | Conformance tests | Conformance tests | Conformance tests |
| Azure         | Conformance tests | Conformance tests | Conformance tests | Conformance tests | Conformance tests | Conformance tests |
| GCP           | Conformance tests | Conformance tests | Conformance tests | Conformance tests | Conformance tests | Conformance tests |
| OpenStack     | Conformance tests | Conformance tests | Conformance tests | Conformance tests | Conformance tests | Conformance tests |
| Alicloud      | Conformance tests | Conformance tests | Conformance tests | Conformance tests | Conformance tests | Conformance tests |
| Equinix Metal | N/A   | N/A   | N/A   | N/A   | N/A   | N/A   |
| vSphere       | N/A   | N/A   | N/A   | N/A   | N/A   | N/A   |

Get an overview of the test results at testgrid.

Start using or developing the Gardener locally

See our documentation in the /docs repository; please find the index here.

Setting up your own Gardener landscape in the Cloud

The quickest way to test drive Gardener is to install it virtually onto an existing Kubernetes cluster, just like you would install any other Kubernetes-ready application. You can do this with our Gardener Helm Chart.

Alternatively, you can use our garden setup project to create a fully configured Gardener landscape, which also includes our Gardener Dashboard.

Feedback and Support

Feedback and contributions are always welcome!

All channels for getting in touch or learning about our project are listed under the community section. We are cordially inviting interested parties to join our bi-weekly meetings.

Please report bugs or suggestions about our Kubernetes clusters or Gardener itself as GitHub issues, or join our Slack channel #gardener (please invite yourself to the Kubernetes workspace here).

Learn More!

Please find further resources about our project here:

etcd-wrapper's People

Contributors

aaronfern, aleksandarsavchev, ashwani2k, ccwienk, gardener-robot, gardener-robot-ci-2, gardener-robot-ci-3, ishan16696, raphaelvogel, seshachalam-yv, shreyas-s-rao, unmarshall


etcd-wrapper's Issues

☂️ Allow external agents to start embedded etcd and set readiness via HTTP API

How to categorize this issue?

/area quality robustness usability
/kind enhancement task

What would you like to be added:
I would like etcd-wrapper to allow external agents to start the embedded etcd and to set its readiness via an HTTP API. This will simplify the etcd bootstrapping process for the backup sidecar by allowing it to reuse the embedded etcd provided by etcd-wrapper for restoration, but with readiness set to false so that etcd does not serve external traffic. Once restoration is successful and etcd is deemed ready to serve traffic, the backup sidecar can set readiness to true to allow external traffic. A minimal sketch of such an HTTP API follows the task list below.

Additionally, I would like etcd-wrapper to provide a mechanism for leadership change notifications, which can be consumed by the backup sidecar or any other external agent via a push or pull mechanism.

Task List:

  • Leadership notification mechanism
  • Expose an HTTP endpoint to start embedded etcd with the config passed in the POST call
  • Expose an HTTP endpoint to set app readiness to ready/unready
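Below is a minimal sketch of what such an HTTP API could look like. The endpoint paths (/start, /readiness/ready, /readiness/unready, /readyz), the startEmbeddedEtcd helper, and the port are illustrative assumptions; they are not part of the actual etcd-wrapper API.

```go
package main

import (
	"io"
	"net/http"
	"sync/atomic"
)

// ready tracks whether the wrapper should currently report itself as ready.
var ready atomic.Bool

// startEmbeddedEtcd is a hypothetical stand-in for starting the embedded etcd
// from the configuration supplied in the request body.
func startEmbeddedEtcd(config []byte) error {
	_ = config // a real implementation would parse this and start etcd
	return nil
}

func main() {
	mux := http.NewServeMux()

	// Start embedded etcd with the configuration passed in the POST body,
	// leaving readiness at false so that no external traffic is served yet.
	mux.HandleFunc("/start", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST required", http.StatusMethodNotAllowed)
			return
		}
		cfg, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		if err := startEmbeddedEtcd(cfg); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	// Allow an external agent (e.g. the backup sidecar) to flip readiness.
	mux.HandleFunc("/readiness/ready", func(w http.ResponseWriter, r *http.Request) {
		ready.Store(true)
	})
	mux.HandleFunc("/readiness/unready", func(w http.ResponseWriter, r *http.Request) {
		ready.Store(false)
	})

	// Readiness probe target: 200 when ready, 503 otherwise.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	_ = http.ListenAndServe(":9095", mux)
}
```

With an API of this shape, the backup sidecar could POST the etcd configuration to the start endpoint, perform restoration against the embedded etcd, and only then flip readiness to true so that external traffic is admitted.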

Create a `release job` for etcd-wrapper

What would you like to be added:
We need a new job dedicated to cutting releases for etcd-wrapper, just like the already existing jobs that run tests whenever a PR is filed.
This will reduce the manual effort needed whenever we need a new release of etcd-wrapper.

Why is this needed:
This is needed to streamline the etcd-wrapper development process and to help with quicker releases.

[BUG] Issues in startEtcd w.r.t select clause

Describe the bug:

In the startEtcd method, the current select clause can exit almost immediately, as the default value of waitReadyTimeout is 0. It was initially assumed that a value of 0 would signify waiting forever, but that is not how time.After with a 0 duration behaves. This would result in an uninitialised etcd being assigned here.

In addition, in cases where a timeout has been defined or a message is received on the etcd.Server.StopNotify() channel, the code currently only logs the error and continues, which could result in a stopped etcd being used. This is incorrect; one should return an error in these cases.

Expected behavior:
The assignment of etcd should be done only after a message has been received on the etcd.Server.ReadyNotify() channel. Also, an error should be returned to the caller when either a timeout has occurred (if a positive timeout has been defined) or a message has been received on etcd.Server.StopNotify().

A better way to wait forever (if no timeout has been defined) is to do the following:

var timeoutCh <-chan time.Time
if a.waitReadyTimeout > 0 {
	// Only arm a timeout when one has been configured; a nil timeoutCh never fires.
	timer := time.NewTimer(a.waitReadyTimeout)
	defer timer.Stop()
	timeoutCh = timer.C
}

select {
case <-etcd.Server.ReadyNotify():
	a.logger.Info("etcd server is now ready to serve client requests")
case <-etcd.Server.StopNotify():
	return fmt.Errorf("etcd server has been aborted, received notification on StopNotify channel")
case <-timeoutCh:
	return fmt.Errorf("timeout after %s waiting for ReadyNotify signal, aborting start of etcd", a.waitReadyTimeout.String())
}

Trick: a receive from a nil channel blocks forever, so the timeout case is simply never selected when no timeout has been configured.

Please also write unit tests for app.go, which are currently missing. This issue should have been caught by unit tests. A sketch of how the wait logic could be made testable follows below.
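As a starting point, here is a minimal sketch of how the wait logic could be unit-tested, assuming it is extracted into a hypothetical waitForEtcdReady helper; the actual structure of app.go may differ. It also exercises the nil-channel trick described above.

```go
package app

import (
	"errors"
	"testing"
	"time"
)

// waitForEtcdReady is a hypothetical, test-friendly extraction of the select
// shown above: it waits on ready/stop channels and an optional timeout.
func waitForEtcdReady(readyCh, stopCh <-chan struct{}, timeout time.Duration) error {
	var timeoutCh <-chan time.Time // nil unless a timeout is configured
	if timeout > 0 {
		timer := time.NewTimer(timeout)
		defer timer.Stop()
		timeoutCh = timer.C
	}
	select {
	case <-readyCh:
		return nil
	case <-stopCh:
		return errors.New("etcd server has been aborted")
	case <-timeoutCh:
		return errors.New("timed out waiting for the ReadyNotify signal")
	}
}

func TestWaitForEtcdReady(t *testing.T) {
	// Ready signal received: no error expected.
	ready := make(chan struct{})
	close(ready)
	if err := waitForEtcdReady(ready, nil, 0); err != nil {
		t.Fatalf("expected success when etcd is ready, got: %v", err)
	}

	// Stop signal received: an error must be returned.
	stop := make(chan struct{})
	close(stop)
	if err := waitForEtcdReady(nil, stop, 0); err == nil {
		t.Fatal("expected an error when etcd has stopped")
	}

	// Timeout configured and neither signal arrives: an error must be returned.
	if err := waitForEtcdReady(nil, nil, 10*time.Millisecond); err == nil {
		t.Fatal("expected an error on timeout")
	}
}
```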

Integration/e2e tests for etcd-wrapper component

What would you like to be added:
Integration and/or e2e tests for etcd-wrapper, so that PRs and releases are thoroughly tested before being integrated into etcd-druid.

  • Decide whether to have integration or e2e tests for etcd-wrapper
  • Write integration/e2e tests
  • Provide easy method for local testing (possibly using KinD cluster)
  • Integrate with CICD pipeline

Why is this needed:
Without integration tests running on etcd-wrapper, a PR that works fine within etcd-wrapper may fail to play well with etcd-backup-restore when integrated into etcd-druid. It is better to catch such issues early, within etcd-wrapper itself, with the help of integration/e2e tests running on every PR and release.

/area quality

[Enhancement] Update ops script to include all etcd commands and relevant paths

Feature (What you would like to be added):
The current script under ops/print-etcd-cert-paths.sh only prints the cert paths. It would be nice to have a list of all relevant paths (data directory, etcd config path) and all popular etcdctl commands used by operators, so that they can simply be copied and executed.

Motivation (Why is this needed?):
Ease the life of an operator by providing fully formed etcd commands and paths to important locations.

Enable codecov for wrapper

How to categorize this issue?

/kind enhancement

What would you like to be added

codecov should be enabled for etcd-wrapper.

Motivation (Why is this needed?)

Enabling this plugin will give updated code coverage for the main branch as well as for all in-progress PRs. This will help in quickly identifying whether code coverage has dropped due to a recent PR.

[Feature] Introduce HTTP endpoint to restart embedded etcd

Feature (What you would like to be added):
Introduce an HTTP endpoint to allow external agents to restart the embedded etcd.

Motivation (Why is this needed?):
Use case:
To update advertise-peer-urls, etcd mandates that the member be restarted after making the member update call.
Refer: https://etcd.io/docs/v3.3/op-guide/runtime-configuration/#update-advertise-peer-urls

Today, etcd-druid works around this missing feature by doing the following (refer to the code):

  1. etcd-druid updates the StatefulSet to ensure that any pending secret volume(s) are mounted and the config map changes are seen by the etcd-backup-restore container.
  2. etcd-backup-restore makes the member update call as part of starting the server. Refer to the code.
  3. To ensure that the update to the peer URL is reflected in the embedded etcd, etcd-druid also triggers a deletion of all existing etcd pods, forcing a restart.

The current implementation in etcd-druid is synchronous, with waits embedded between steps, and it is not crash-friendly. If etcd-druid crashes in the middle of handling the peer URL TLS changes, it could result in a non-functioning etcd cluster. In addition, etcd-backup-restore currently reports the status of peer URL TLS enablement by only looking at the mounted etcd configuration, which does not accurately indicate what the embedded etcd sees.

Therefore, we need to follow the recommendation and ensure that the update is completed by making the member update call, immediately followed by a restart of the member. The endpoint proposed to be exposed by etcd-wrapper will be invoked by the etcd-backup-restore container just after the member-update call.

Approach/Hint to implement the solution (optional):
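One possible shape for such an endpoint is sketched below; the /restart path, the restartEmbeddedEtcd helper, and the port are assumptions for illustration, not the actual etcd-wrapper implementation.

```go
package main

import "net/http"

// restartEmbeddedEtcd is a hypothetical helper that would stop the currently
// running embedded etcd and start it again with the (possibly updated) config.
func restartEmbeddedEtcd() error { return nil }

func registerRestartHandler(mux *http.ServeMux) {
	// Intended to be invoked by etcd-backup-restore right after the member
	// update call, so the new advertise-peer-urls take effect without
	// etcd-druid having to delete the pods.
	mux.HandleFunc("/restart", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST required", http.StatusMethodNotAllowed)
			return
		}
		if err := restartEmbeddedEtcd(); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

func main() {
	mux := http.NewServeMux()
	registerRestartHandler(mux)
	_ = http.ListenAndServe(":9095", mux)
}
```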

[Feature] Readiness Probe in multi-node etcd

Feature (What you would like to be added):
Currently, the readinessProbe of etcd points to the /healthz endpoint of the HTTP server running in the backup sidecar.
This behaviour needs to be updated or improved: the readinessProbe of a clustered etcd should depend on whether an etcd leader is present, and only then should it serve incoming write requests.

Motivation (Why is this needed?):

Approach/Hint to implement the solution (optional):
Approaches (a minimal leader-aware readiness sketch in Go follows after this list):

  1. ETCDCTL_API=3 etcdctl endpoint health --endpoints=${ENDPOINTS} --command-timeout=Xs
    The etcdctl endpoint health command performs a GET on the "health" key (source).

    • It fails when there is no etcd leader or when quorum is lost, as the GET request will fail if no etcd leader is present.

    Advantages of this method (etcdctl endpoint health):

    • We don't have to worry about scenarios causing an outage, as a snapshotter failure will no longer fail the readinessProbe of etcd.
    • If there is no quorum, the kubelet will mark the etcd members as NotReady and they won't be able to serve write or read requests.

    Disadvantages of this method (etcdctl endpoint health):

    • The owner check feature depends on the /healthz endpoint of the HTTP server: when the owner check fails, it fails the readinessProbe of etcd by setting the HTTP status to 503. However, the owner check in the multi-node scenario is already being discussed here.
    • It completely decouples the snapshotter of the backup sidecar from the readinessProbe of etcd; the backup sidecar won't be able to control when to let traffic in.
  2. /health endpoint of etcd.
    The /health endpoint returns false if one of the following conditions is met (source):

    • there is no etcd leader or a leader election is currently going on.
    • the latency of a QGET request exceeds 1 second.

    Advantages and disadvantages of method 2 (/health endpoint):

    • Similar to method 1.
  3. Use the /healthz endpoint of the HTTP server running in the backup sidecar, with modifications such that whenever a backup-restore leader is elected, it sets the HTTP server status to 200 for itself as well as for all backup-restore followers, and sets the HTTP server status to 503 when there is no etcd leader present.
    Advantages of this method (/healthz):

    • We still have some coupling between the snapshotter of the backup sidecar and the readinessProbe of etcd; the backup sidecar will be able to control when to let traffic in for etcd.

    Disadvantages of this method (/healthz):

    • It will take time to implement as well as to handle edge cases.

    Future Scope:

    • Go with method 2, as it gives us the flexibility to set the readinessProbe from the backup sidecar and to switch to gRPC instead of sending REST requests.
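For illustration, below is a minimal sketch of a leader-aware readiness handler in the spirit of methods 1 and 2, using the etcd v3 client's Status call to check whether the queried member currently sees a leader. The endpoint address, port, and handler wiring are assumptions, not the actual etcd-wrapper implementation.

```go
package main

import (
	"context"
	"net/http"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// leaderPresent returns true if the queried member currently knows a leader.
func leaderPresent(ctx context.Context, endpoint string) bool {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		return false
	}
	defer cli.Close()

	resp, err := cli.Status(ctx, endpoint)
	if err != nil {
		return false
	}
	return resp.Leader != 0 // 0 means this member sees no leader
}

func main() {
	// Readiness probe target: ready only when an etcd leader is present.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
		defer cancel()
		if leaderPresent(ctx, "http://127.0.0.1:2379") {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	_ = http.ListenAndServe(":9095", nil)
}
```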

Recategorization of CVE for `Network Exposure`

What would you like to be added:
Recategorization of CVE for Network Exposure.

Why is this needed:
Currently, the CVE label network_exposure is set to private since the etcd-wrapper container does not interact with any endpoints outside of the cluster, nor does it expose any external services. It is only contacted by etcd-backup-restore, kube-apiserver and prometheus. There is an ongoing discussion to move etcd initialization from the backup-restore container to the etcd container, since initialization is a DB-specific operation and finds a better place within the etcd container. Once this is done, the CVE label network_exposure needs to be re-evaluated, since DB validation also checks the backup bucket for a revision sanity check against the DB revision. Since this involves the etcd container contacting the object storage of a public hyperscaler, the value of the label network_exposure will have to be changed to protected.
