gardener / etcd-backup-restore

Collection of components to backup and restore the ETCD of a Kubernetes cluster.

License: Apache License 2.0

Makefile 0.20% Go 96.90% Shell 2.42% Dockerfile 0.05% Smarty 0.43%
etcd backup aws gcp restore kubernetes azure openstack-swift

etcd-backup-restore's Introduction


Gardener implements the automated management and operation of Kubernetes clusters as a service and provides a fully validated extensibility framework that can be adjusted to any programmatic cloud or infrastructure provider.

Gardener is 100% Kubernetes-native and exposes its own Cluster API to create homogeneous clusters on all supported infrastructures. This API differs from SIG Cluster Lifecycle's Cluster API, which only harmonizes how to get to clusters, while Gardener's Cluster API goes one step further and also harmonizes the make-up of the clusters themselves. That means Gardener gives you homogeneous clusters with exactly the same bill of materials, configuration, and behavior on all supported infrastructures, as you can see further below in the section on our K8s Conformance Test Coverage.

In 2020, SIG Cluster Lifecycle's Cluster API made a huge step forward with v1alpha3 and the newly added support for declarative control plane management. This made it possible to integrate managed services like GKE or Gardener. If the community is interested, we would be more than happy to contribute a Gardener control plane provider. For more information on the relation between the Gardener API and SIG Cluster Lifecycle's Cluster API, please see here.

Gardener's main principle is to leverage Kubernetes concepts for all of its tasks.

In essence, Gardener is an extension API server that comes along with a bundle of custom controllers. It introduces new API objects in an existing Kubernetes cluster (which is called garden cluster) in order to use them for the management of end-user Kubernetes clusters (which are called shoot clusters). These shoot clusters are described via declarative cluster specifications which are observed by the controllers. They will bring up the clusters, reconcile their state, perform automated updates and make sure they are always up and running.

To accomplish these tasks reliably and to offer a high quality of service, Gardener controls the main components of a Kubernetes cluster (etcd, API server, controller manager, scheduler). These so-called control plane components are hosted in Kubernetes clusters themselves (which are called seed clusters). This is the main difference compared to many other OSS cluster provisioning tools: the shoot clusters do not have dedicated master VMs. Instead, the control plane is deployed as a native Kubernetes workload into the seeds (an architecture commonly referred to as kubeception or inception design). This not only effectively reduces the total cost of ownership but also allows easier implementation of "day-2 operations" (like cluster updates or robustness) by relying on mature Kubernetes features and capabilities.

Gardener reuses the identical Kubernetes design to span a scalable multi-cloud and multi-cluster landscape. Such familiarity with known concepts has proven to quickly ease the initial learning curve and accelerate developer productivity:

  • Kubernetes API Server = Gardener API Server
  • Kubernetes Controller Manager = Gardener Controller Manager
  • Kubernetes Scheduler = Gardener Scheduler
  • Kubelet = Gardenlet
  • Node = Seed cluster
  • Pod = Shoot cluster

Please find more information regarding the concepts and a detailed description of the architecture in our Gardener Wiki and our blog posts on kubernetes.io: Gardener - the Kubernetes Botanist (17.5.2018) and Gardener Project Update (2.12.2019).


K8s Conformance Test Coverage

Gardener takes part in the Certified Kubernetes Conformance Program to attest its compatibility with the K8s conformance test suite. Currently, Gardener is certified for K8s versions up to v1.30; see the conformance spreadsheet.

Continuous conformance test results of the latest stable Gardener release are uploaded regularly to the CNCF test grid:

Provider/K8s    v1.30  v1.29  v1.28  v1.27  v1.26  v1.25
AWS             [Gardener conformance test badges for v1.25–v1.30]
Azure           [Gardener conformance test badges for v1.25–v1.30]
GCP             [Gardener conformance test badges for v1.25–v1.30]
OpenStack       [Gardener conformance test badges for v1.25–v1.30]
Alicloud        [Gardener conformance test badges for v1.25–v1.30]
Equinix Metal   N/A    N/A    N/A    N/A    N/A    N/A
vSphere         N/A    N/A    N/A    N/A    N/A    N/A

Get an overview of the test results at testgrid.

Start using or developing the Gardener locally

See our documentation in the /docs repository; please find the index here.

Setting up your own Gardener landscape in the Cloud

The quickest way to test drive Gardener is to install it virtually onto an existing Kubernetes cluster, just like you would install any other Kubernetes-ready application. You can do this with our Gardener Helm Chart.

Alternatively, you can use our garden setup project to create a fully configured Gardener landscape which also includes our Gardener Dashboard.

Feedback and Support

Feedback and contributions are always welcome!

All channels for getting in touch or learning about our project are listed under the community section. We are cordially inviting interested parties to join our bi-weekly meetings.

Please report bugs or suggestions about our Kubernetes clusters or Gardener itself as GitHub issues, or join our Slack channel #gardener (please invite yourself to the Kubernetes workspace here).

Learn More!

Please find further resources about our project here:


etcd-backup-restore's Issues

Backup sidecar stops working when facing connection issues

It looks like the backup sidecar, which we are using in the kubify setup, stops working when it faces a connection issue against the blob storage.

ime="2018-11-16T17:18:15Z" level=info msg="GC: Executing garbage collection..."
time="2018-11-16T17:18:15Z" level=info msg="GC: Switching to Hour mode for snapshot 2018-11-16 16:57:54 +0000 UTC"
time="2018-11-16T17:18:15Z" level=info msg="GC: Switching to Day mode for snapshot 2018-10-31 16:00:00 +0000 UTC"
time="2018-11-16T17:18:15Z" level=info msg="GC: Switching to Week mode for snapshot 2018-10-31 16:00:00 +0000 UTC"
time="2018-11-16T17:18:18Z" level=info msg="Successfully saved delta snapshot at: Backup-1542388200/Incr-161068692-161068779-1542388698"
time="2018-11-16T17:18:28Z" level=info msg="Successfully saved delta snapshot at: Backup-1542388200/Incr-161068780-161068912-1542388708"
time="2018-11-16T17:19:09Z" level=error msg="failed to take new delta snapshot: failed to save snapshot: RequestError: send request failed\ncaused by: Put https://s3.eu-central-1.amazonaws.com/etcd-garden.dev.k8s.ondemand.com/etcd-operator/v1/Backup-1542388200/Incr-161068913-161069046-1542388718: dial tcp: lookup s3.eu-central-1.amazonaws.com on 10.241.0.10:53: read udp 10.241.132.102:59781->10.241.0.10:53: i/o timeout"
time="2018-11-16T17:19:15Z" level=info msg="GC: Executing garbage collection..."
time="2018-11-16T17:19:40Z" level=info msg="GC: Switching to Hour mode for snapshot 2018-11-16 16:57:54 +0000 UTC"
time="2018-11-16T17:19:40Z" level=info msg="GC: Switching to Day mode for snapshot 2018-10-31 16:00:00 +0000 UTC"
time="2018-11-16T17:19:40Z" level=info msg="GC: Switching to Week mode for snapshot 2018-10-31 16:00:00 +0000 UTC"
time="2018-11-16T17:20:40Z" level=info msg="GC: Executing garbage collection..."
time="2018-11-16T17:20:40Z" level=info msg="GC: Switching to Hour mode for snapshot 2018-11-16 16:57:54 +0000 UTC"
time="2018-11-16T17:20:40Z" level=info msg="GC: Switching to Day mode for snapshot 2018-10-31 16:00:00 +0000 UTC"
time="2018-11-16T17:20:40Z" level=info msg="GC: Switching to Week mode for snapshot 2018-10-31 16:00:00 +0000 UTC"
time="2018-11-16T17:21:40Z" level=info msg="GC: Executing garbage collection..."
time="2018-11-16T17:21:45Z" level=info msg="GC: Switching to Hour mode for snapshot 2018-11-16 16:57:54 +0000 UTC"
time="2018-11-16T17:21:45Z" level=info msg="GC: Switching to Day mode for snapshot 2018-10-31 16:00:00 +0000 UTC"
time="2018-11-16T17:21:45Z" level=info msg="GC: Switching to Week mode for snapshot 2018-10-31 16:00:00 +0000 UTC"

To make things worse, garbage collection keeps running, so eventually it will remove all old backups but won't create any new ones.
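
A minimal sketch, in Go, of how transient upload failures could be retried with exponential backoff instead of silently stopping the snapshotter; the function names here are illustrative, not the project's actual API:

package retry

import (
    "fmt"
    "time"
)

// UploadWithRetry retries a failure-prone upload with exponential backoff
// instead of giving up for good after the first transient error.
func UploadWithRetry(upload func() error, maxAttempts int) error {
    backoff := 2 * time.Second
    var err error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        if err = upload(); err == nil {
            return nil
        }
        fmt.Printf("snapshot upload failed (attempt %d/%d): %v\n", attempt, maxAttempts, err)
        time.Sleep(backoff)
        backoff *= 2 // double the wait between attempts
    }
    return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}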

Integration test for the HTTP endpoints.

The packages have been tested from an integration perspective. However, tests are also required for the HTTP endpoints of the backup-restore server, since this is how the etcd container interfaces with the etcd-backup-restore container in the Gardener etcd deployment.
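
A minimal sketch of what such an endpoint test could look like using Go's net/http/httptest; the handler and the response body shown here are stand-ins for the real server's:

package server

import (
    "io"
    "net/http"
    "net/http/httptest"
    "testing"
)

// statusHandler is a placeholder; the real server reports one of
// New/Progress/Successful/Failed on /initialization/status.
func statusHandler(w http.ResponseWriter, r *http.Request) {
    io.WriteString(w, "New")
}

func TestInitializationStatus(t *testing.T) {
    srv := httptest.NewServer(http.HandlerFunc(statusHandler))
    defer srv.Close()

    resp, err := http.Get(srv.URL + "/initialization/status")
    if err != nil {
        t.Fatalf("request failed: %v", err)
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        t.Fatalf("expected 200, got %d", resp.StatusCode)
    }
    body, _ := io.ReadAll(resp.Body)
    if string(body) != "New" {
        t.Fatalf("unexpected status body: %q", body)
    }
}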

Scheduled full snapshots are not happening after an ETCD container crash

While doing performance tests on etcd, we noticed that after a crash of the etcd container, the container comes up on its own after some time, but the scheduled full snapshots are no longer taken.
The full snapshot was scheduled for every 10 mins. Even after 30-40 mins, we could not observe a single full snapshot.
I am attaching the backup-restore log file for reference: fullsnapshoterror.log

Steps to reproduce :

  1. Delete the etcd container or crash the container
  2. Check the backup-restore logs; etcd comes up, but the scheduled full snapshots are not taken.

Static code analysis

Gardener informs its stakeholders via its CNCF CII Badge that static code checks are applied using Checkmarx. This repository has findings which have to be assessed by the component owner(s). As required, all high-priority findings have already been assessed. Please find the timeline for assessing the remaining medium-priority findings in the Wiki (restricted access). For the time being, you can ignore the low-priority findings. Please find background information and a link to the Checkmarx project for your repository in the Wiki (restricted access). There you will also find information on how to get a Checkmarx user, which is required to do your assessment in the Checkmarx Web UI.

Expose an API to trigger delta snapshot

Story

As a developer, I want to be able to trigger a delta snapshot of etcd on demand. Triggering via the API should be controlled via a flag.

Motivation

The feature will be handy to run certain tests in the CI/CD pipeline without having to wait for the scheduled time. Controlling the schedule for triggering the delta snapshot is not always an option.

Acceptance Criteria

  • Should be able to trigger an on-demand delta snapshot (see the sketch below)
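
A minimal sketch, assuming a hypothetical /snapshot/delta path and an --enable-snapshot-api flag; the takeDeltaSnapshot hook is a stand-in for the real snapshotter:

package main

import (
    "flag"
    "log"
    "net/http"
)

var enableSnapshotAPI = flag.Bool("enable-snapshot-api", false, "allow triggering snapshots via HTTP")

func takeDeltaSnapshot() error { return nil } // stand-in for the real snapshotter

func main() {
    flag.Parse()
    http.HandleFunc("/snapshot/delta", func(w http.ResponseWriter, r *http.Request) {
        if !*enableSnapshotAPI {
            http.Error(w, "snapshot API disabled", http.StatusForbidden)
            return
        }
        if err := takeDeltaSnapshot(); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        w.Write([]byte("delta snapshot triggered\n"))
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}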

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit tests are provided: Have you written automated unit tests?
  • Integration tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide about ops-relevant changes?
  • User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?

Etcd snapshot corruption bug fix

Backup-restore utilizes a reader object from etcd and a writer to the cloud provider. The backup is therefore pushed directly from etcd to the cloud bucket without taking a local copy or ensuring it is valid. If the snapshot fails partially, e.g. due to a snapshot timeout, a corrupt etcd snapshot gets stored in the cloud bucket.

Change the full snapshot procedure to first take the snapshot locally and then push the local snapshot to the cloud bucket. This will also optimize restoration: if the latest full snapshot is available locally, we do not need to fetch it from the cloud.
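
A minimal sketch of the proposed local-first flow, assuming a simple upload callback; the verification here is only a size check for illustration:

package snapshotter

import (
    "fmt"
    "io"
    "os"
)

// TakeFullSnapshot streams the snapshot from etcd to local disk first and
// uploads only a fully written local file, so a partial read from etcd can
// no longer leave a corrupt object in the bucket.
func TakeFullSnapshot(etcdSnapshot io.Reader, upload func(path string) error) error {
    tmp, err := os.CreateTemp("", "full-snapshot-*.db")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name()) // local copy is temporary in this sketch

    n, err := io.Copy(tmp, etcdSnapshot)
    if err != nil {
        return fmt.Errorf("local snapshot failed, nothing uploaded: %w", err)
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    if n == 0 {
        return fmt.Errorf("empty snapshot, refusing to upload")
    }
    return upload(tmp.Name())
}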

Expose different metrics to monitor backup-restore utility

Story

  • As a user, I want to monitor whether the backup-restore utility is working as expected.
  • As a user, I want to know the time of the last full and delta backups taken, the number of snapshots garbage-collected on the last call, the last restoration triggered, the etcd revision of the latest snapshot, and similar metrics.

Acceptance Criteria

  • A /metrics API is exposed, reporting at least the following (a sketch of such an exporter follows this list):
  • timestamp of the last full snapshot
  • timestamp of the last delta snapshot
  • etcd revision of the latest snapshot
  • network IO throughput
  • count of current snapshots in the snap store
  • time taken for the last snapshot
  • time taken for the last restore
  • count of garbage-collected snapshots
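
A sketch of how such metrics could be exposed with Prometheus' client_golang; the metric names are illustrative, not necessarily the ones the project will choose:

package metrics

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Unix timestamp of the latest snapshot, labelled by kind (full/delta).
    lastSnapshotTimestamp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "etcdbr_snapshot_latest_timestamp",
        Help: "Unix timestamp of the latest snapshot, by kind.",
    }, []string{"kind"})

    // Total number of garbage-collected snapshots.
    gcSnapshotCount = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "etcdbr_snapshot_gc_total",
        Help: "Total number of garbage-collected snapshots.",
    })
)

func init() {
    prometheus.MustRegister(lastSnapshotTimestamp, gcSnapshotCount)
}

// Serve exposes the /metrics endpoint.
func Serve(addr string) error {
    http.Handle("/metrics", promhttp.Handler())
    return http.ListenAndServe(addr, nil)
}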

Simplify etcd sidecar setup

At the moment the etcdbrctl is deployed as a side-car container:

- name: backup
  command:
  - etcdbrctl
  - server
  - --schedule=*/5 * * * *
  - --max-backups=5
  - --data-dir=/var/etcd/data
  - --insecure-transport=true
  - --storage-provider=S3
  - --delta-snapshot-period-seconds=10
  - --garbage-collection-period-seconds=60
  image: eu.gcr.io/gardener-project/gardener/etcdbrctl:0.2.3

and forcing the client (the main container) to execute several REST calls against the running server:

#!/bin/sh
while true;
do
  wget http://localhost:8080/initialization/status -S -O status;
  STATUS=`cat status`;
  case $STATUS in
  "New")
    wget http://localhost:8080/initialization/start -S -O - ;;
  "Progress")
    sleep 1;
    continue;;
  "Failed")
    continue;;
  "Successful")
    exec etcd --data-dir=/var/etcd/data --name=etcd --advertise-client-urls=http://0.0.0.0:2379 --listen-client-urls=http://0.0.0.0:2379 --initial-cluster-state=new --initial-cluster-token=new
    ;;
  esac;
done
until finally starting etcd. This entire setup can be simplified greatly by having etcdbrctl be the one starting etcd, as it knows exactly when to start etcd, or even stop it on demand if needed. As an example:

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: etcd-aws
spec:
  selector:
    matchLabels:
      app: etcd-aws
  serviceName: "etcd-aws"
  replicas: 1
  template:
    metadata:
      labels:
        app: etcd-aws 
    spec:
      containers:
      - name: backup
        command:
        - etcdbrctl
        - run # or some other name
        - --schedule=*/5 * * * *
        - --max-backups=5
        - --data-dir=/var/etcd/data
        - --insecure-transport=true
        - --storage-provider=S3
        - --delta-snapshot-period-seconds=10
        - --garbage-collection-period-seconds=60
        - -- # end of parameter list
        - etcd
        - --data-dir=/var/etcd/data
        - --name=etcd 
        - --advertise-client-urls=http://0.0.0.0:2379 
        - --listen-client-urls=http://0.0.0.0:2379 
        - --initial-cluster-state=new 
        - --initial-cluster-token=new
        image: eu.gcr.io/gardener-project/gardener/etcdbrctl:0.2.3

This looks better, and in this case etcdbrctl does not need to start a REST API server and can work with etcd directly, putting it in charge of etcd, so that misbehaving clients cannot break the workflow.
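
A minimal sketch of such a supervisor mode in Go, assuming the real initialization logic is passed in as validateOrRestore; all names are illustrative:

package supervisor

import (
    "fmt"
    "os"
    "os/exec"
)

// Run validates (or restores) the data directory and only then starts etcd
// as a child process, so no REST polling loop is needed on the etcd side.
func Run(etcdArgs []string, validateOrRestore func() error) error {
    if err := validateOrRestore(); err != nil {
        return err
    }
    if len(etcdArgs) == 0 {
        return fmt.Errorf("no etcd command given after --")
    }
    cmd := exec.Command(etcdArgs[0], etcdArgs[1:]...)
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Start(); err != nil {
        return err
    }
    // etcdbrctl stays in charge: it can also stop etcd on demand via cmd.Process.
    return cmd.Wait()
}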

Automated Backup & Restore

Story

  • As a user, I want my Kubernetes cluster to always show my latest state and not "forget" resources.
  • As a provider, I want etcd backups to be taken frequently and the last one to be restored automatically should the PV be definitively lost or the current persistence be corrupt.

Motivation

Business continuity.

Acceptance Criteria

  • ETCD backups are taken frequently, e.g. every 5 minutes
  • ETCD is restored automatically should the PV be definitely lost or the current persistence be corrupt
  • Backup & restore is tested and validated (cluster health after restore) continuously (actually part of the DoD, but since its so important, its also listed here)
    • ETCD PV is deleted and then recreated from ETCD backup
    • Cluster is deleted and then recreated from the ETCD backup

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?
  • User documentation: Have you informed Sylvia Pick about user relevant changes?

Negative tests for single-node backup restoration

The tests should cover the following scenarios.

  • Parallel backups (full and delta)
    • If backups are already being taken when restoration starts, then the backups should be stopped, discarded and any future scheduling of the backups should be suspended until after restoration.
    • If backups start while restoration is in progress then such backups should fail and should not be retried until after the restoration is completed.
  • Parallel Defragmentation/compaction
    • If defragmentation/compaction is already being done when restoration starts, then such operations should be stopped and discarded and any future scheduling of such operations should be suspended until after restoration.
    • If defragmentation/compaction starts while restoration is in progress then such operations should fail and should not be retried until after the restoration is completed.
  • Download interruption
    The interruption could be anything from broken connections to the process being restarted.
    • If the full snapshot download is interrupted, then it should be retried, preferably without a container restart or at least on the container restart.
    • If any of the delta snapshot downloads is interrupted, then it should be retried, preferably without a container restart or at least on the container restart.
      • In either case, the sequence of applying the delta backups should not be affected.
  • Restoration interruption
    The interruption could be anything from broken connections to the process being restarted.
    • If full snapshot restoration is interrupted, then it should be retried, preferably without a container restart or at least on the container restart.
      For example, what happens if the process is restarted here or here?
    • If a delta snapshot restoration is interrupted, then it should be retried, preferably without a container restart or at least on the container restart.
      For example, what happens if restoration is interrupted here, anywhere here or anywhere here?
      • In either case, the sequence of applying the delta backups should not be affected.
  • Backup corruption
    • If any of the backup files (full or delta) are corrupted then the restoration should fail and ETCD container should never be available/ready.
  • ETCD availability
    • Regardless of how much time or how many iterations are needed (based on the above scenarios) for a restoration to complete, ETCD should not be available/ready until restoration is successfully completed.
  • Data Sanity
    • Regardless of how much time or how many iterations are needed (based on the above scenarios) for a restoration to complete, ETCD should never be restored to any state except the latest backup. If any part of the latest backup is corrupted then the ETCD should never be restored successfully.

Add options to disable delta snapshots

For performance testing, we want to run the tests without any possible effects from delta snapshots or the watch mechanism we use. So, there is a need for a configurable parameter to disable delta snapshots. Currently, a configured delta snapshot interval of less than 1 second is corrected to 1 second in code, with a warning log. We should change this behaviour and use that check to disable delta snapshots and the watch.
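
A minimal sketch of the proposed behaviour change, where a non-positive period means "disabled" instead of being clamped to one second; names are illustrative:

package config

import "time"

// DeltaSnapshotInterval returns the interval and whether delta snapshots
// (and the etcd watch) should run at all.
func DeltaSnapshotInterval(periodSeconds int) (time.Duration, bool) {
    if periodSeconds < 1 {
        // Previously corrected to 1s with a warning; now it means "disabled".
        return 0, false
    }
    return time.Duration(periodSeconds) * time.Second, true
}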

etcd-backup-restore restored the oldest backup rather than the most recent

Recently we deleted the PV of a misbehaving etcd-main of a shoot.

Here are the list of backups of the shoot:

~ $ az storage blob list --account-name bkp... -c backup-store | tee flow-backup-blob-list
~ $ wc -l flow-backup-blob-list
504438 flow-backup-blob-list
~ $ head flow-backup-blob-list
Name                                                              Blob Type    Blob Tier    Length     Content Type              Last Modified              Snapshot
----------------------------------------------------------------  -----------  -----------  ---------  ------------------------  -------------------------  ----------
etcd-main/v1/Backup-1535155202/Full-00000000-09349626-1535155202  BlockBlob    Hot          15740960   application/octet-stream  2018-08-25T00:00:03+00:00
etcd-main/v1/Backup-1535155202/Incr-09349627-09349655-1535155214  BlockBlob    Hot          55162      application/octet-stream  2018-08-25T00:00:14+00:00
etcd-main/v1/Backup-1535155202/Incr-09349656-09349672-1535155224  BlockBlob    Hot          39413      application/octet-stream  2018-08-25T00:00:24+00:00
...
~ $ cat flow-backup-blob-list | grep / | cut -d'/' -f 3 | sort -u | cut -d- -f2 | xargs -I'{}' date -d '@{}' +%F | sort -u
2018-08-25
2018-08-26
2018-08-27
2018-08-28
... cropped (there is one for every day)...
2018-12-27
2018-12-28
2018-12-29
2019-01-02
2019-01-03
~ $

When the etcd-main pod started the next time, it restored Backup-1535155202, which is from 2018-08-25 01:00:02 and is actually the oldest backup:

2019-01-03T11:05:49.480057847Z time="2019-01-03T11:05:49Z" level=info msg="Checking for data directory structure validity..."
2019-01-03T11:05:49.480073547Z time="2019-01-03T11:05:49Z" level=info msg="Data directory structure invalid."
...
2019-01-03T11:05:49.805864268Z time="2019-01-03T11:05:49Z" level=info msg="Finding latest set of snapshot to recover from..."
...
2019-01-03T11:05:51.388486735Z time="2019-01-03T11:05:51Z" level=info msg="Removing data directory(/var/etcd/data/new.etcd.part) for snapshot restoration."
2019-01-03T11:05:51.388536536Z time="2019-01-03T11:05:51Z" level=info msg="Restoring from base snapshot: Backup-1535155202/Full-00000000-09349626-1535155202"
...
...
2019-01-03T11:20:42.477236616Z time="2019-01-03T11:20:42Z" level=info msg="Applying delta snapshot Backup-1535155202/Incr-09436347-09436364-1535205381"
2019-01-03T11:20:42.478725059Z time="2019-01-03T11:20:42Z" level=info msg="Responding to status request with: Progress"

I expect etcd-backup-restore to restore the newest backup, rather than the oldest.
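
For illustration, a minimal sketch of what the base-snapshot selection should do, namely pick the full snapshot with the newest creation timestamp; the Snapshot struct is a stand-in for the real type:

package restorer

import "sort"

type Snapshot struct {
    Kind      string // "Full" or "Incr"
    CreatedAt int64  // unix timestamp encoded in the object name
}

// LatestFullSnapshot returns the newest full snapshot, i.e. the base from
// which the delta snapshots that follow it should then be applied.
func LatestFullSnapshot(snaps []Snapshot) *Snapshot {
    var full []Snapshot
    for _, s := range snaps {
        if s.Kind == "Full" {
            full = append(full, s)
        }
    }
    if len(full) == 0 {
        return nil
    }
    // Sort descending by creation time so index 0 is the newest base snapshot.
    sort.Slice(full, func(i, j int) bool { return full[i].CreatedAt > full[j].CreatedAt })
    return &full[0]
}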

disabling TLS support in the chart is tricky

We previously discussed using the tls field as a toggle to turn TLS support on/off in #59 (comment).

Now, while we're moving from our internal chart to this one, we realised that disabling TLS is quite tricky. Given that helm merges dictionaries, it is actually hard to simply remove a field via an override file:
805b0ab#diff-f18d9ef27be2f36a73a018900706456eR42

Deploying the chart with a values file with tls: {} content won't overwrite the field in the default values file, since helm uses a merge strategy for dicts.

So I can see several options here:

  1. add an explicit flag to enable/disable the TLS feature
  2. set the default value of the tls key in the chart's default values.yaml to {} and comment out anything under the key.
  3. update the comment in the chart's default values.yaml file about how to disable tls by assigning an empty string.

I personally prefer to go with the 2nd option since it's a well-known pattern in most charts, and the chart will also work as-is without any explicit parameters; currently, people have to generate certs and create secrets before being able to use the chart. But I don't have a strong opinion either way.

Reproduce issue

Here is an example of how it can impact the non-TLS scenario:

$ cat etcd-values.yaml
tls: {}

$ helm install --values etcd-values.yaml ./chart
NAME:   vocal-tortoise

...cropped...

$ helm status vocal-tortoise
LAST DEPLOYED: Thu Dec 20 13:39:43 2018
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/ConfigMap
NAME                     AGE
etcd-bootstrap-for-test  34s

==> v1/Service
etcd-for-test-client  34s

==> v1/StatefulSet
etcd-for-test  34s

==> v1/Pod(related)

NAME             READY  STATUS             RESTARTS  AGE
etcd-for-test-0  0/2    ContainerCreating  0         34s

$ kubectl describe pod/etcd-for-test-0
...cropped...
Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    101s                default-scheduler  Successfully assigned default/etcd-for-test-0 to minikube
  Warning  FailedMount  37s (x8 over 101s)  kubelet, minikube  MountVolume.SetUp failed for volume "ca-etcd" : secrets "ca-etcd" not found
  Warning  FailedMount  37s (x8 over 101s)  kubelet, minikube  MountVolume.SetUp failed for volume "etcd-server-tls" : secrets "etcd-server-tls" not found
  Warning  FailedMount  37s (x8 over 101s)  kubelet, minikube  MountVolume.SetUp failed for volume "etcd-client-tls" : secrets "etcd-client-tls" not found

$ helm get values  vocal-tortoise
tls: {}

$ helm get values -a vocal-tortoise
backup:
  backupSecret: etcd-backup
  env: []
  garbageCollectionPolicy: Exponential
  maxBackups: 7
  schedule: 0 */24 * * *
  storageContainer: ""
  storageProvider: Local
  volumeMounts: []
images:
  etcd: quay.io/coreos/etcd:v3.3.10
  etcd-backup-restore: eu.gcr.io/gardener-project/gardener/etcdbrctl:0.4.0
podAnnotations: {}
replicas: 1
role: for-test
tls:
  caSecret: ca-etcd
  clientSecret: etcd-client-tls
  serverSecret: etcd-server-tls

Workaround

For now we applied the workaround to assign an empty string to the tls key in our override (so we use tls: "" in our values file). Here is our workaround:

$ cat etcd-values.yaml
tls: ""

$ helm install --values etcd-values.yaml ./chart
NAME:   ugly-marsupial

...cropped...

$ helm status ugly-marsupial
LAST DEPLOYED: Thu Dec 20 13:29:08 2018
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/ConfigMap
NAME                     AGE
etcd-bootstrap-for-test  32s

==> v1/Service
etcd-for-test-client  32s

==> v1/StatefulSet
etcd-for-test  32s

==> v1/Pod(related)

NAME             READY  STATUS   RESTARTS  AGE
etcd-for-test-0  2/2    Running  0         32s

$ helm get values  ugly-marsupial
tls: ""

$ helm get values -a ugly-marsupial
backup:
  backupSecret: etcd-backup
  env: []
  garbageCollectionPolicy: Exponential
  maxBackups: 7
  schedule: 0 */24 * * *
  storageContainer: ""
  storageProvider: Local
  volumeMounts: []
images:
  etcd: quay.io/coreos/etcd:v3.3.10
  etcd-backup-restore: eu.gcr.io/gardener-project/gardener/etcdbrctl:0.4.0
podAnnotations: {}
replicas: 1
role: for-test
tls: ""

Etcd Snapshot upload timeout

While doing performance tests on etcd, the DB crossed 1 GB of data. While taking a snapshot of the DB, we faced an issue: etcd went into CrashLoopBackOff state.
On RCA, @georgekuruvillak found the issue to be with the timeout; a snapshot of a larger DB takes a lot of time.
Hoping for a code fix.

/kind/bug

[Feature] Automated tests for performance regression

Feature (What you would like to be added):
We should add automated tests to check for performance regression at least before every release if not during every build.

Motivation (Why is this needed?):
Catch issues like #102 and #115 as early as possible.

Approach/Hint to the implement solution (optional):
With the memory leak fixed in #116, the memory usage is predictable again.

A simple way to test regression could be as follows.

  1. Deploy etcd and the backup-restore sidecar with predefined resource limits based on the expected database size during the test.
  2. Run etcd benchmark (multiple times if necessary) to simulate load and create the desired database size.
  3. Verify that the etcd and backup-restore sidecars are still healthy at the end of the test and are never restarted due to the memory limits.

High memory consumption by etcd backup during full snapshots

During a load test on etcd, when a full snapshot happens (full snapshot interval set to 10 mins, delta snapshot interval set to 5 mins), memory consumption by the etcd backup sidecar was too high and shot up further after each subsequent backup.

Our observation: just to take a backup of a 250 MB DB, the etcd backup sidecar consumes 2 GB of memory, whereas etcd itself consumes 475 MB.

From the heap profile, we could see that the "Unmarshal" function is consuming a lot of memory, as can be seen below:

[image: heap profile]

Possible memory leak by Backup sidecar during delta snapshot

We did a few performance tests in the cluster and noticed that memory consumption by the backup sidecar during delta snapshots is high. Even after the delta snapshot, memory usage remains high and does not come down.

[image: memory usage graph]

In this case, we loaded the DB using the benchmark tool. The DB size was 512 MB and the delta snapshot period was 5 mins. We can see that after the delta snapshot (13:14) the memory used by the backup sidecar rose to more than 3 GB, whereas etcd's memory consumption is around 350 MB.

Here are other metrics during the same load:

[image: metrics graph]

We triggered the load around 13:09 and the full snapshot was taken at 13:20. During the full snapshot, there is a small rise in memory for both etcd and the backup sidecar.

etcd-backup-restore version: 0.5.0-dev-pref-test-3-memory and perf-test-2-with-metrics

DB is getting restored but not getting merged with the original DB

In the case of etcd, when the DB crashes, restoration happens from the previous backup, but into a different folder, /var/etcd/data/new.etcd/member/snap, instead of the main DB folder /var/etcd/data/member/snap.

Later we observed that this folder is not merged into the DB folder. The next backup cycle reads only from /var/etcd/data/member/snap, so the restored data (/var/etcd/data/new.etcd) is not backed up in the next cycle.

Etcd liveness and readiness probes failed before delta snapshot

While doing a load test on etcd, we observed that etcd restarts after a workload and before the scheduled delta snapshot.

The pod details show "Liveness probe failed" and "Readiness probe failed", as shown below.

[image: pod events]

Following is the pod status:
[image: pod status]

From the log, we did not get any details about the restart. We are able to reproduce the issue.

Raise appropriate alerts based on the backup-restore metrics

Story

As a provider, I want timely alerts raised based on the backup & restore metrics so that I can take informed decisions.

Motivation

  • A number of metrics are exposed to improve the monitoring of the backup-restore functionality #66
  • Defining some alerts based on these metrics will help Ops react in a timely manner in case any action is required - for example, missing backups or a corrupt/restored etcd

Acceptance Criteria

Define alerts for

  • a backup is missed
  • corrupt/restored etcd

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit tests are provided: Have you written automated unit tests?
  • Integration tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide about ops-relevant changes?
  • User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?

Scaling recommendations for single-node ETCD for larger Kubernetes clusters

Story

  • As a Gardener user, I want to create larger Kubernetes shoot clusters with ~500 to 1000 nodes and the ETCD back-end should support the corresponding load.
  • As a Gardener administrator, I want recommendations about scaling parameters for ETCD that can support larger Kubernetes clusters.

Motivation

  • An ETCD instance with 2.5G RAM (as is currently provisioned by Gardener) can barely support a Kubernetes cluster with ~130 nodes served by 6 kube-apiserver replicas.
  • Memory and CPU are not the only parameters to be considered during scaling.
    Above 6 replicas of kube-apiserver, disk or network IO could also become a bottleneck. This can be seen in the metrics recorded for the above scenario.

Acceptance Criteria

  • Evaluate and recommend CPU, memory, disk/network IO, ETCD configurations to be able to support larger Kubernetes clusters. At least 500 nodes if not 1000.

  • Recommendations should include the recommendations for the backup-restore sidecar as well to handle larger backups (1.5-2G)

  • Integrate the above recommendation with Gardener.

Full snapshot fails for large DB size

We are doing performance tests on etcd.
DB size is 2 GB.
Connection timeout: 300 secs.
We have configured the full snapshot for every 10 mins.

The scheduled full snapshot was at 10:40:00.
From the log, we can see at 10:38: level=warning msg="watch channel responded with err: %v etcdserver: mvcc: required revision has been compacted".

[image: log excerpt]

Still, the full snapshot has not been taken, and delta snapshots are not being taken either.
Etcd and the API server are both in a healthy state.

When the DB size was 1.5 GB, the full snapshot was taken properly.

Incremental/Continuous ETCD Backups

Story

  • As a user, I want my Kubernetes cluster to always show my latest state and not "forget" resources.
  • As a provider, I want etcd backups to be written incrementally/continuously (with full backups now and then) so that I run a near-zero risk of infrastructure inconsistencies after backup.

Motivation

In order to make sure our ETCD backups are as fresh as possible and reflect the actual infrastructure state as well as possible, we should continue the work already begun on incremental/continuous backups.

Acceptance Criteria

  • ETCD full backups are taken regularly, e.g. every hour
  • Logs of ETCD watches are written continuously
  • Logs taken before a new ETCD full backup are deleted
  • When a restore is necessary the last ETCD full backup plus the additional ETCD logs (from the last entry in the full backup) are restored automatically

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

Add mock tests for different cloud providers

Currently, the Swift, GCS, and ABS snapstores don't have any unit tests. So, add mock clients to perform unit tests for:

  • swift

  • GCS

  • ABS

  • Use these mock clients to adapt the existing unit tests from the s3 snapstore.

  • Add erroneous cases as well in the mock clients (see the sketch after this list).
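
A minimal sketch of what such a mock snapstore could look like, assuming a simplified SnapStore interface (the real interface differs); the FailPut switch illustrates the erroneous cases mentioned above:

package mock

import (
    "bytes"
    "fmt"
    "io"
)

// SnapStore is a simplified stand-in for the real snapstore interface.
type SnapStore interface {
    Save(name string, r io.Reader) error
    Fetch(name string) (io.Reader, error)
}

// Store keeps objects in memory and can inject failures on demand.
type Store struct {
    objects map[string][]byte
    FailPut bool // flip on to exercise erroneous cases
}

func NewStore() *Store { return &Store{objects: map[string][]byte{}} }

func (m *Store) Save(name string, r io.Reader) error {
    if m.FailPut {
        return fmt.Errorf("injected save failure for %q", name)
    }
    data, err := io.ReadAll(r)
    if err != nil {
        return err
    }
    m.objects[name] = data
    return nil
}

func (m *Store) Fetch(name string) (io.Reader, error) {
    data, ok := m.objects[name]
    if !ok {
        return nil, fmt.Errorf("snapshot %q not found", name)
    }
    return bytes.NewReader(data), nil
}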

Strong Backup Encryption and Key Management

etcd Backup Strong Encryption
Please encrypt sensitive data when stored persistently. When sensitive data is stored persistently, for example in files, it needs to be protected from unauthorized access. If access control cannot be fully enforced where the storage takes place, then as a second line of defense against access control failures and loss of data confidentiality, the data shall be strongly encrypted. The following encryption requirement shall be met: enable strong encryption at the storage level.

Encryption Key Management
Encryption key management shall ensure that the principles of least privilege and segregation of duties can be followed by the operations team.
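
Purely as an illustration of "encryption at the storage level" applied client-side, a minimal AES-GCM sketch; key sourcing and management (KMS, rotation, least privilege) are deliberately out of scope, and nothing here reflects the project's implementation:

package crypto

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "io"
)

// EncryptSnapshot seals plaintext with AES-GCM; the nonce is prepended to
// the ciphertext so decryption can recover it.
func EncryptSnapshot(key, plaintext []byte) ([]byte, error) {
    block, err := aes.NewCipher(key) // key must be 16, 24, or 32 bytes
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
        return nil, err
    }
    return gcm.Seal(nonce, nonce, plaintext, nil), nil
}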

Etcd restarts with an error under heavy load

The scenario: etcd is deployed as a StatefulSet along with the backup sidecar. We hit etcd with the benchmark tool to generate heavy load and checked the performance. Etcd restarts with the following error:

 2018-11-23 05:52:19.439293 C | etcdserver: failed to purge snap file open /var/etcd/data/new.etcd/member/snap: no such file or directory

Benchmark tool command

benchmark --endpoints=etcd-main-client:2379 --sample=true --conns=100 --clients=1000 put --key-size=1024 --sequential-keys --total=2000000 --val-size=2048 --key /certs/etcd-key --cert /certs/etcd-cert  --cacert /certs/etcd-ca

Observations till now

  • etcd restarted only when heavy load is applied.
  • removing etcd data directory which lead to restoration, and then triggering benchmark reproduced the issue.
  • second run of benchmark on same etcd data directory didn't crash the etcd.

Environment
Etcd version: 3.3.10
Etcd-backup-restore: eb2230a

cc: @nikhilsap

Write unit/integration tests for race conditions

We are still missing unit tests and integration tests covering the wide range of scenarios that can happen in a deployment. We need to add automated tests to simulate and test the code against different race conditions.

Integrate etcd tests with TestMachinery

Story

As an etcd backup-restore developer, I want to integrate the existing tests with the TestMachinery to run the tests in an integrated environment and also be able to automate them.

Motivation

TestMachinery provides a robust test environment and automates cluster creation and similar operations, making it easy for developers to run their tests in an integrated environment and publish the results.

Acceptance Criteria

  • Integrate the existing tests into TestMachinery
  • Ensure the end-to-end automation
  • Validate the results

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit tests are provided: Have you written automated unit tests?
  • Integration tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide about ops-relevant changes?
  • User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?

[Critical] Etcd crash post data directory restoration

It was seen in the staging cluster on GCP for e2e that etcd was failing even after pod restarts. The logs state that the data directory is successfully restored and etcd is started post-restoration. However, etcd crashes immediately on startup.

Possibilities:

  1. Snapshot corruption.
  2. Disk malfunction.

Identify the root cause and fix it.

The backup-restore container logs:

time="2018-07-20T08:27:50Z" level=info msg="Regsitering the http request handlers..."
time="2018-07-20T08:27:50Z" level=info msg="Starting the http server..."
time="2018-07-20T08:27:50Z" level=info msg="Created snapstore from provider: GCS"
time="2018-07-20T08:27:50Z" level=info msg="Validating schedule..."
time="2018-07-20T08:27:50Z" level=info msg="Probing etcd..."
time="2018-07-20T08:27:50Z" level=info msg="Responding to status request with: New"
time="2018-07-20T08:27:50Z" level=info msg="Received start initialization request."
time="2018-07-20T08:27:50Z" level=info msg="Updating status from New to Progress"
time="2018-07-20T08:27:50Z" level=info msg="Checking for data directory structure validity..."
time="2018-07-20T08:27:50Z" level=info msg="Data directory structure invalid."
time="2018-07-20T08:27:50Z" level=info msg="Removing data directory(/var/etcd/data) for snapshot restoration."
time="2018-07-20T08:27:50Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:27:50Z" level=info msg="Finding latest set of snapshot to recover from..."
time="2018-07-20T08:27:51Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:27:52Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:27:53Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:27:54Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:27:55Z" level=info msg="Restoring from base snapshot: Backup-1531824004/Full-00000000-00085834-1531824004"
time="2018-07-20T08:27:56Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:27:57Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:27:58Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:27:59Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:00Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:01Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:02Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:03Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:04Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:05Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:06Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:07Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:08Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:09Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:10Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:10Z" level=info msg="Successfully initialized data directory \"/var/etcd/data\" for etcd."
time="2018-07-20T08:28:11Z" level=info msg="Responding to status request with: Successful"
time="2018-07-20T08:28:11Z" level=info msg="Updating status from Successful to New"
time="2018-07-20T08:28:12Z" level=info msg="Responding to status request with: New"
time="2018-07-20T08:28:12Z" level=info msg="Received start initialization request."
time="2018-07-20T08:28:12Z" level=info msg="Updating status from New to Progress"
time="2018-07-20T08:28:12Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:13Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:14Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:15Z" level=info msg="Responding to status request with: Progress"
^Xtime="2018-07-20T08:28:16Z" level=info msg="Responding to status request with: Progress"
time="2018-07-20T08:28:17Z" level=info msg="Responding to status request with: Progress"

Etcd logs:

2018-07-20 08:28:11.623015 I | etcdmain: etcd Version: 3.3.8
2018-07-20 08:28:11.623029 I | etcdmain: Git SHA: 33245c6b5
2018-07-20 08:28:11.623035 I | etcdmain: Go Version: go1.9.7
2018-07-20 08:28:11.623040 I | etcdmain: Go OS/Arch: linux/amd64
2018-07-20 08:28:11.623045 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2018-07-20 08:28:11.623161 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-07-20 08:28:11.623575 I | embed: listening for peers on http://localhost:2380
2018-07-20 08:28:11.623669 I | embed: listening for client requests on 0.0.0.0:2379
unexpected fault address 0x7fb1f790b000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7fb1f790b000 pc=0x8b54b4]

goroutine 46 [running]:
runtime.throw(0xff2dd4, 0x5)
	/usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc42006b398 sp=0xc42006b378 pc=0x42c145
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:364 +0x29d fp=0xc42006b3e8 sp=0xc42006b398 pc=0x442e2d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Tx).checkBucket.func1(0x7fb1f790b000, 0x3)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/tx.go:445 +0x64 fp=0xc42006b530 sp=0xc42006b3e8 pc=0x8b54b4
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Tx).forEachPage(0xc4202a4000, 0x41d87, 0x3, 0xc42006b670)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/tx.go:607 +0x81 fp=0xc42006b578 sp=0xc42006b530 pc=0x8b3d31
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Tx).forEachPage(0xc4202a4000, 0x5097, 0x2, 0xc42006b670)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/tx.go:613 +0xd7 fp=0xc42006b5c0 sp=0xc42006b578 pc=0x8b3d87
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Tx).forEachPage(0xc4202a4000, 0x5098, 0x1, 0xc42006b670)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/tx.go:613 +0xd7 fp=0xc42006b608 sp=0xc42006b5c0 pc=0x8b3d87
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Tx).forEachPage(0xc4202a4000, 0xbb, 0x0, 0xc42006b670)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/tx.go:613 +0xd7 fp=0xc42006b650 sp=0xc42006b608 pc=0x8b3d87
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Tx).checkBucket(0xc4202a4000, 0xc42021b940, 0xc42006b950, 0xc42006b920, 0xc42028a5a0)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/tx.go:444 +0xe1 fp=0xc42006b6e0 sp=0xc42006b650 pc=0x8b3371
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Tx).checkBucket.func2(0x7fb1b5c0a1b9, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/tx.go:469 +0xcd fp=0xc42006b738 sp=0xc42006b6e0 pc=0x8b5add
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Bucket).ForEach(0xc4202a4018, 0xc42006b7f8, 0x0, 0xc42003c7c8)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/bucket.go:388 +0x124 fp=0xc42006b7a8 sp=0xc42006b738 pc=0x8a21f4
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Tx).checkBucket(0xc4202a4000, 0xc4202a4018, 0xc42003c950, 0xc42003c920, 0xc42028a5a0)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/tx.go:467 +0x160 fp=0xc42006b838 sp=0xc42006b7a8 pc=0x8b33f0
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*DB).freepages(0xc4202a0000, 0x0, 0x0, 0x0)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:979 +0x257 fp=0xc42006ba20 sp=0xc42006b838 pc=0x8a98a7
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*DB).loadFreelist.func1()
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:287 +0x1e4 fp=0xc42006ba88 sp=0xc42006ba20 pc=0x8b4be4
sync.(*Once).Do(0xc4202a0150, 0xc42003cad0)
	/usr/local/go/src/sync/once.go:44 +0xbe fp=0xc42006bac0 sp=0xc42006ba88 pc=0x472c2e
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*DB).loadFreelist(0xc4202a0000)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:283 +0x4e fp=0xc42006baf0 sp=0xc42006bac0 pc=0x8a6a3e
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.Open(0xc42028e8a0, 0x1d, 0x180, 0xc42003cbf8, 0x0, 0x0, 0x1d)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:260 +0x3b8 fp=0xc42006bbc0 sp=0xc42006baf0 pc=0x8a65a8
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.newBackend(0xc42028e8a0, 0x1d, 0x5f5e100, 0x2710, 0x280000000, 0x2)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:129 +0xaf fp=0xc42006bcb8 sp=0xc42006bbc0 pc=0x906e1f
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.New(0xc42028e8a0, 0x1d, 0x5f5e100, 0x2710, 0x280000000, 0x472c2e, 0x10)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:113 +0x48 fp=0xc42006bcf8 sp=0xc42006bcb8 pc=0x906d48
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.newBackend(0x7fff5ca3322c, 0x9, 0x0, 0x0, 0x0, 0x0, 0xc42021e900, 0x1, 0x1, 0xc42021e700, ...)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:36 +0x1ad fp=0xc42006bdb8 sp=0xc42006bcf8 pc=0xbaa54d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.openBackend.func1(0xc42028a540, 0xc420268800)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:56 +0x62 fp=0xc42006bfd0 sp=0xc42006bdb8 pc=0xbcdd32
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc42006bfd8 sp=0xc42006bfd0 pc=0x45d2c1
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.openBackend
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:55 +0xe6

goroutine 1 [select]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.openBackend(0x7fff5ca3322c, 0x9, 0x0, 0x0, 0x0, 0x0, 0xc42021e900, 0x1, 0x1, 0xc42021e700, ...)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:58 +0x1c1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0x7fff5ca3322c, 0x9, 0x0, 0x0, 0x0, 0x0, 0xc42021e900, 0x1, 0x1, 0xc42021e700, ...)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:286 +0x3ee
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc420276000, 0xc420276480, 0x0, 0x0)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:179 +0x870
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc420276000, 0x6, 0xff3ea7, 0x6, 0x1)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:181 +0x40
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:102 +0x151e
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:46 +0x3f
main.main()
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20

goroutine 51 [chan receive]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil.(*MergeLogger).outputLoop(0xc42018db20)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil/merge_logger.go:174 +0x47a
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil.NewMergeLogger
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil/merge_logger.go:92 +0xb8

goroutine 62 [chan receive]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil.(*MergeLogger).outputLoop(0xc4201f6660)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil/merge_logger.go:174 +0x47a
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil.NewMergeLogger
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil/merge_logger.go:92 +0xb8

goroutine 63 [chan receive]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil.(*MergeLogger).outputLoop(0xc4201f66c0)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil/merge_logger.go:174 +0x47a
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil.NewMergeLogger
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/logutil/merge_logger.go:92 +0xb8

goroutine 47 [chan receive]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*DB).freepages.func2(0xc42028a5a0)
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:975 +0x4b
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*DB).freepages
	/tmp/etcd-release-3.3.8/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:974 +0x214

goroutine 78 [syscall]:
os/signal.signal_recv(0x0)
	/usr/local/go/src/runtime/sigqueue.go:131 +0xa6
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.0
	/usr/local/go/src/os/signal/signal_unix.go:28 +0x41

Parameter to limit maximum number of delta changes before a delta snapshot is triggered

Problem

The frequency of delta snapshots is only controlled by the parameter delta-snapshot-period-seconds.
In case of very frequent changes to ETCD, a large number of changes might be accumulated before the next delta snapshot is triggered according to the configured delta-snapshot-period-seconds.

Since this is related to load, the behaviour of the backup/restore sidecar could be unpredictable, including a potential crash of the sidecar if sufficient memory is not available to the sidecar container.

Solution

To make things more robust, we must introduce a parameter (max-changes-per-delta-snapshot?) to limit the number of changes that are accumulated before a delta snapshot is triggered.

A delta snapshot would then be triggered based on the new parameter (max-changes-per-delta-snapshot) or the existing parameter (delta-snapshot-period-seconds), whichever fires earlier.

Alternative

Use a maximum limit on the size of the accumulated delta changes as the parameter instead. Could this be harder to compute than a simple count of changes?
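
For illustration, here is a minimal Go sketch of the dual-trigger loop proposed above: a delta snapshot is taken either when the configured period elapses or when the number of accumulated changes reaches the limit, whichever fires first. The type and flag names (snapshotter, max-changes-per-delta-snapshot) are assumptions for this sketch, not the actual backup-restore implementation.

package main

import (
	"fmt"
	"time"
)

// snapshotter accumulates watch events and triggers a delta snapshot when
// either the period elapses or maxChanges events have piled up.
type snapshotter struct {
	period     time.Duration // delta-snapshot-period-seconds
	maxChanges int           // hypothetical max-changes-per-delta-snapshot
	events     chan string   // events observed since the last delta snapshot
}

func (s *snapshotter) run(take func(events []string)) {
	timer := time.NewTimer(s.period)
	var pending []string
	for {
		select {
		case ev := <-s.events:
			pending = append(pending, ev)
			if len(pending) < s.maxChanges {
				continue // below the change limit: keep accumulating
			}
			if !timer.Stop() {
				<-timer.C // drain a fired timer before Reset
			}
		case <-timer.C:
			// period elapsed: snapshot whatever accumulated so far
		}
		if len(pending) > 0 {
			take(pending) // error handling elided in this sketch
			pending = nil
		}
		timer.Reset(s.period)
	}
}

func main() {
	s := &snapshotter{period: 2 * time.Second, maxChanges: 3, events: make(chan string)}
	go s.run(func(evs []string) { fmt.Printf("delta snapshot with %d events\n", len(evs)) })
	for i := 0; i < 7; i++ {
		s.events <- fmt.Sprintf("event-%d", i)
	}
	time.Sleep(3 * time.Second)
}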

Condensed ETCD Backups

Story

  • As a provider, I want to reduce costs but still have some old etcd backups to pick from (manually) should the need arise.

Motivation

Right now we take backups only for the past 7 days (once a day), but it would be better to take much more frequent backups and "condense" older ones.

Acceptance Criteria

  • Take ETCD backups frequently (to increase the likelihood of being up-to-date with the infrastructure), e.g. every 5 minutes
  • Keep the last 24 hourly backups; of all older backups, keep only the last backup of each day
  • Keep the last 7 daily backups; of all older backups, keep only the last backup of each week
  • Keep only the last 4 weekly backups (one possible condensation rule is sketched after this list)
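
A minimal Go sketch of one way to approximate these condensation rules: snapshots are bucketed by age into hourly, daily, and weekly slots, only the newest snapshot per bucket survives, and everything older than 4 weeks is garbage-collected. This is illustrative only, not the project's implementation, and the bucketing slightly simplifies the criteria above.

package main

import (
	"fmt"
	"sort"
	"time"
)

// bucket maps a snapshot's age to a retention bucket: hourly for the last
// 24h, daily for the last 7 days, weekly for the last 4 weeks. ok=false
// means the snapshot is past the retention horizon and can be deleted.
func bucket(now, t time.Time) (key string, ok bool) {
	switch age := now.Sub(t); {
	case age <= 24*time.Hour:
		return "hour-" + t.UTC().Format("2006-01-02T15"), true
	case age <= 7*24*time.Hour:
		return "day-" + t.UTC().Format("2006-01-02"), true
	case age <= 4*7*24*time.Hour:
		y, w := t.UTC().ISOWeek()
		return fmt.Sprintf("week-%d-%02d", y, w), true
	default:
		return "", false
	}
}

// condense keeps only the newest snapshot per bucket.
func condense(now time.Time, snaps []time.Time) []time.Time {
	sort.Slice(snaps, func(i, j int) bool { return snaps[i].After(snaps[j]) })
	seen := map[string]bool{}
	var kept []time.Time
	for _, t := range snaps {
		if key, ok := bucket(now, t); ok && !seen[key] {
			seen[key] = true
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	now := time.Now()
	var snaps []time.Time // one snapshot every 5 minutes over the last 5 weeks
	for age := time.Duration(0); age < 5*7*24*time.Hour; age += 5 * time.Minute {
		snaps = append(snaps, now.Add(-age))
	}
	fmt.Printf("kept %d of %d snapshots\n", len(condense(now, snaps)), len(snaps))
	// roughly 24 hourly + 6 daily + 3 weekly buckets survive
}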

Remark

These backups would not be used for customer-initiated restore or PiTR (because most likely the infrastructure resources would no longer fit/match and the backup would not be usable anymore without manual processing), but as a safeguard against unnoticed backup corruption. In such cases, these backups would be the last hope for extracting valuable data and processing it manually to restore as much as possible of a lost cluster workload.

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

Expose API to trigger full snapshot

Story

As a developer, I want to be able to trigger a full snapshot of etcd on demand. Triggering via the API should be controlled via a flag.

Motivation

The feature will be handy for running certain tests in the CI/CD pipeline without having to wait for the scheduled snapshot time. Controlling the schedule for triggering the full snapshot is not always an option.

Acceptance Criteria

  • Should be able to trigger an on-demand full snapshot

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit tests are provided: Have you written automated unit tests?
  • Integration tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide about ops-relevant changes?
  • User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?
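
A minimal Go sketch of what such a trigger endpoint could look like, assuming a hypothetical POST route (/snapshot/full) and flag name (--enable-snapshot-trigger); this is not the actual backup-restore API.

package main

import (
	"flag"
	"log"
	"net/http"
)

func main() {
	// Hypothetical flag gating the on-demand trigger, per the story above.
	enabled := flag.Bool("enable-snapshot-trigger", false, "allow triggering full snapshots via HTTP")
	flag.Parse()

	takeFullSnapshot := func() error {
		// Placeholder for the real snapshot logic (snapshot etcd, upload to the object store).
		return nil
	}

	if *enabled {
		http.HandleFunc("/snapshot/full", func(w http.ResponseWriter, r *http.Request) {
			if r.Method != http.MethodPost {
				http.Error(w, "POST required", http.StatusMethodNotAllowed)
				return
			}
			if err := takeFullSnapshot(); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
			w.WriteHeader(http.StatusOK) // snapshot taken; CI/CD can proceed
		})
	}
	log.Fatal(http.ListenAndServe(":8080", nil))
}

A CI/CD job could then call something like curl -X POST http://<sidecar-host>:8080/snapshot/full instead of waiting for the schedule (host and port are, again, illustrative).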

[BUG] etcd container in etcd-main-0 pod gets stuck with OOMKilled status on Azure

Describe the bug:
We had etcd OOMKilled on Azure until it was manually fixed. During the issue, the Shoot's apiserver is not accessible due to the unhealthy etcd.
Etcd can't recover by itself unless an operator steps in manually; our current solution is to manually delete all nodes from the Azure Portal.

Expected behaviour:
Etcd should automatically recover from such issues.

How To Reproduce (as minimally and precisely as possible):
We can't reliably reproduce the issue, but it has sporadically happened on several Shoots' etcds.

Logs:
Unfortunately, the logs have been rotated and are not available anymore. We have logs in ElasticSearch, but they are not easy to extract from there without losing the order (ES keeps log timestamps at ms resolution, so logs emitted within the same ms won't be in the correct order). I tried to go over the logs in the timeline below.

Screenshots (if applicable):
The status of the etcd-main-0 Pod shows that the etcd container is OOMKilled (not the backup-restore container)

Screenshot of the same issue happening twice on a Shoot within a day.

Environment (please complete the following information):

  • Etcd version/commit ID : v3.3.10
  • Etcd-backup-restore version/commit ID: 0.4.1
  • Cloud Provider [All/AWS/GCS/ABS/Swift]: Azure
  • Seed is running on AKS service with Kubernetes version: 1.11.5
  • Etcd PV in use: Azure 256 GiB (Premium SSD) (IOPS limit=1100, Throughput limit (MB/s)=125)
  • Some scaling info about the shoot cluster: nodes:10, pods:217, deploy:83, rs:286, sts:13, ds:6 (no dramatic change on the workload for a while other then regular ongoing deployment rollouts)

Regular apiserver load before the issue:

Amount of resources in use:

Anything else we need to know?:

Although the pod's status field reports the etcd container as OOMKilled, we can't see an increase in the etcd container's memory in any Prometheus metrics. According to the memory usage reported by the etcd container, even the maximum reported level is safely below the OOM-kill threshold. This tells us that the container's memory usage jumps past its limit all of a sudden and the container gets killed immediately; the metrics system never gets a chance to catch the elevated memory usage due to the 15s scrape interval.

We reported the issue in Slack previously:

Used Kibana queries during investigation:

event.involvedObject.namespace.keyword:"shoot--flow--flow" AND event.involvedObject.name.keyword:"etcd-main-0"
kubernetes.namespace_name.keyword:"shoot--flow--flow" AND kubernetes.pod_name.keyword:"etcd-main-0"

Timeline:
Timeline of the last occurrence (all times are UTC):

2019-02-08

recent events on etcd and backup-restore containers
17:17:41  etcd container started and finished compaction
17:19:11  backup-restore container started delta snapshot, msg="Taking delta snapshot for time: 2019-02-08 17:19:11.936458854 +0000 UTC"
17:19:12  backup-restore container completed a delta snapshot (the next log from backup-restore is when the etcd container is restarted): msg="Successfully saved delta snapshot at: Backup-1549584000/Incr-18756018-18757064-1549646351"
17:21:13  etcd container: mvcc: finished scheduled compaction at 18756449 (took 2.546498ms)

seemingly related signs ("took too long" messages can also be considered usual)
17:22:26  etcd "took too long" event spike starts (from this time there are 66 logs in 3 seconds)
17:22:30  last healthy etcd memory datapoint values reported in Prometheus (req: 500MB, usage: 1.23GB, limit: 2.5GB, etcd process_resident_memory_bytes: 1.020GB)
17:22:34  last log line from etcd container: "W | etcdserver: read-only range request "key:"/registry/configmaps/kube-system/" range_end:"/registry/configmaps/kube-system0" " with result "range_response_count:736 size:32098389" took too long (124.058866ms) to execute"

problem starts here
17:22:35  etcd container "Created container" event (pod.status field indicates etcd container is terminated as a result of OOMKilled), first reported event from this pod for a long time
17:22:36  backup-restore: level=info msg="Received start initialization request."
17:22:51  etcd container "Unhealthy", Readiness probe failed: HTTP probe failed with statuscode: 503
17:22:53  backup-restore container "Started container" event
17:22:56  etcd container "Unhealthy", Readiness probe failed: HTTP probe failed with statuscode: 503
17:23:01  etcd container "Unhealthy", Readiness probe failed: HTTP probe failed with statuscode: 503
17:34:48  etcd and backup-restore container "Created container"
17:44:57  etcd container "BackOff", Back-off restarting failed container

... problem persists (etcd container is unhealthy and in BackOff) until manual fix is applied

Investigation and mitigation attempts:

Here are the things we tried, but none helped so far:

  • deleted the PV along with the etcd-main-0 pod and let the whole data set be restored by backup-restore
  • disabled GCM reconciliation and modified the resource requests and limits of the etcd pod/containers; we even tried 16 GB of memory and still got an OOM kill for the etcd container
  • replaced the PV with a faster SSD disk to rule out a disk IOPS issue (the current PV in use is Azure 256 GiB Premium SSD; IOPS limit 1100, throughput limit 125 MB/s)
  • tried to distribute etcd-main-0 pods among nodes manually, in case it was a network bottleneck for remotely attached disks

Workaround:
Manually delete all VMs from the Azure Portal; the next time etcd starts, it doesn't fail.

Optimise the restoration process

Currently, while restoring etcd from the available incremental/delta snapshots, we fetch and apply the snapshots sequentially, one by one. To reduce the snapshot fetch time from the object store, we can parallelise the fetch calls. This can be done at two levels.
Say we have delta snapshots in order A, B, C, ...; then:

  • Once A is fetched, one goroutine starts unmarshalling and applying its events to the embedded etcd while another goroutine simultaneously starts fetching the next delta snapshot, i.e. B. Think of this as a two-stage pipeline (see the sketch below).
  • Alternatively, trigger the fetch calls for all the snapshots (i.e. A, B as well as C) in parallel up front, and apply them sequentially to the embedded etcd.

cc: @georgekuruvillak @amshuman-kr
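
A minimal Go sketch of the first option, the two-stage pipeline: a buffered channel lets one goroutine prefetch the next delta snapshot while the main goroutine applies the current one, preserving the apply order. Here fetch and apply stand in for the object-store download and the embedded-etcd replay; all names are illustrative.

package main

import "fmt"

type fetched struct {
	data []byte
	err  error
}

// restoreDeltas overlaps fetching with applying: while snapshot N is being
// applied, snapshot N+1 is already being downloaded. Applies stay strictly
// in order, which is required for delta snapshots.
func restoreDeltas(names []string, fetch func(string) ([]byte, error), apply func([]byte) error) error {
	ch := make(chan fetched, 1) // buffer of 1 = prefetch one snapshot ahead
	go func() {
		defer close(ch)
		for _, n := range names {
			data, err := fetch(n)
			ch <- fetched{data, err}
			if err != nil {
				return // stop fetching on the first error
			}
		}
	}()
	for f := range ch {
		if f.err != nil {
			return f.err
		}
		if err := apply(f.data); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	names := []string{"Incr-A", "Incr-B", "Incr-C"}
	fetch := func(n string) ([]byte, error) { return []byte(n), nil }
	apply := func(b []byte) error { fmt.Printf("applied %s\n", b); return nil }
	if err := restoreDeltas(names, fetch, apply); err != nil {
		fmt.Println("restore failed:", err)
	}
}

The second option (fetching A, B, and C in parallel up front) trades memory for latency: all snapshots are held in memory at once, which may be risky for large backups.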

Improve DB file verification time

As the etcd DB size grows to around ~1 GB, DB file validation takes 3-5 minutes. We need to analyse this and improve the verification time for the DB file, and also check the effect of IOPS on the duration.

Environment:
Etcd version: 3.3.10
Etcd-backup-restore version: 0.3.1

Preserve Backups after Cluster Deletion

Story

  • As a provider, I want to preserve backups of deleted clusters so that I can help the operator in case the cluster deletion happened accidentally.

Motivation

At present, all backups are deleted when the cluster is deleted, which may happen accidentally, and then we would lose everything at once. It would be better not to delete them right away, but instead keep them for some time and let a garbage collector of sorts (a separate controller rather than a flow step in the Gardener) delete only those older than a certain grace period (e.g. one month).
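
A minimal Go sketch of such a garbage collector, assuming backups are only marked with a deletion timestamp when the cluster is deleted; the names markedAt, gracePeriod, and deleteBackup are illustrative, not part of Gardener.

package main

import (
	"fmt"
	"time"
)

// collectGarbage removes only those backups whose cluster was deleted more
// than gracePeriod ago; newer ones are kept so an operator can still restore
// an accidentally deleted cluster.
func collectGarbage(now time.Time, markedAt map[string]time.Time, gracePeriod time.Duration, deleteBackup func(cluster string) error) {
	for cluster, t := range markedAt {
		if now.Sub(t) <= gracePeriod {
			continue // still within the grace period: keep the backups
		}
		if err := deleteBackup(cluster); err == nil {
			delete(markedAt, cluster)
		}
	}
}

func main() {
	now := time.Now()
	markedAt := map[string]time.Time{
		"shoot--a": now.Add(-40 * 24 * time.Hour), // past the grace period
		"shoot--b": now.Add(-2 * 24 * time.Hour),  // recently deleted: keep
	}
	collectGarbage(now, markedAt, 30*24*time.Hour, func(c string) error {
		fmt.Println("deleting backups of", c)
		return nil
	})
}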

Acceptance Criteria

  • Backups of deleted clusters are not deleted right away, but only after a configurable time interval (e.g. 30 days)

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?
