coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes

Home Page: https://coreos.com/blog/introducing-the-etcd-operator.html

License: Apache License 2.0

Go 94.91% Shell 4.46% Dockerfile 0.63%
etcd kubernetes operator

etcd-operator's Introduction

etcd operator

Project status: archived

This project is no longer actively developed or maintained. The project exists here for historical reference. If you are interested in the future of the project and taking over stewardship, please contact [email protected].

Overview

The etcd operator manages etcd clusters deployed to Kubernetes and automates tasks related to operating an etcd cluster.

There are more spec examples on how to set up clusters with different configurations.

Read Best Practices for more information on how to better use etcd operator.

Read RBAC docs for how to set up RBAC rules for etcd operator if RBAC is in place.

Read Developer Guide for setting up a development environment if you want to contribute.

See the Resources and Labels doc for an overview of the resources created by the etcd-operator.

Requirements

  • Kubernetes 1.8+
  • etcd 3.2.13+

Demo

Getting started

etcd Operator demo

Deploy etcd operator

See instructions on how to install/uninstall etcd operator.

Create and destroy an etcd cluster

$ kubectl create -f example/example-etcd-cluster.yaml

A 3-member etcd cluster will be created.
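
For reference, example/example-etcd-cluster.yaml defines a minimal EtcdCluster resource along these lines (a sketch matching the size and version used throughout this README):

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  version: "3.2.13"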

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
example-etcd-cluster-gxkmr9ql7z   1/1       Running   0          1m
example-etcd-cluster-m6g62x6mwc   1/1       Running   0          1m
example-etcd-cluster-rqk62l46kw   1/1       Running   0          1m

See client service for how to access etcd clusters created by the operator.
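
As a quick sketch, assuming the operator exposes the cluster through a client service named example-etcd-cluster-client on port 2379 (see the client service doc), you can write a key from inside the cluster with a throwaway pod:

$ kubectl run --rm -i --tty etcdctl-test --image quay.io/coreos/etcd:v3.2.13 --restart=Never -- /bin/sh
/ # ETCDCTL_API=3 etcdctl --endpoints http://example-etcd-cluster-client:2379 put foo bar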

If you are working with minikube locally, create a nodePort service and test that etcd is responding:

$ kubectl create -f example/example-etcd-cluster-nodeport-service.json
$ export ETCDCTL_API=3
$ export ETCDCTL_ENDPOINTS=$(minikube service example-etcd-cluster-client-service --url)
$ etcdctl put foo bar

Destroy the etcd cluster:

$ kubectl delete -f example/example-etcd-cluster.yaml

Resize an etcd cluster

Create an etcd cluster:

$ kubectl apply -f example/example-etcd-cluster.yaml

In example/example-etcd-cluster.yaml the initial cluster size is 3. Modify the file and change size from 3 to 5.

$ cat example/example-etcd-cluster.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 5
  version: "3.2.13"

Apply the size change to the cluster CR:

$ kubectl apply -f example/example-etcd-cluster.yaml

The etcd cluster will scale to 5 members (5 pods):

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
example-etcd-cluster-cl2gpqsmsw   1/1       Running   0          5m
example-etcd-cluster-cx2t6v8w78   1/1       Running   0          5m
example-etcd-cluster-gxkmr9ql7z   1/1       Running   0          7m
example-etcd-cluster-m6g62x6mwc   1/1       Running   0          7m
example-etcd-cluster-rqk62l46kw   1/1       Running   0          7m

Similarly, we can decrease the size of the cluster from 5 back to 3 by changing the size field again and reapplying the change.

$ cat example/example-etcd-cluster.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  version: "3.2.13"
$ kubectl apply -f example/example-etcd-cluster.yaml

The etcd cluster should eventually shrink back to 3 pods:

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
example-etcd-cluster-cl2gpqsmsw   1/1       Running   0          6m
example-etcd-cluster-gxkmr9ql7z   1/1       Running   0          8m
example-etcd-cluster-rqk62l46kw   1/1       Running   0          9m
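
As an alternative to editing the manifest, the size can also be patched directly on the EtcdCluster resource (a sketch; assumes the etcdcluster resource type registered by the operator's CRD). For example, to scale back up to 5:

$ kubectl patch etcdcluster example-etcd-cluster --type=merge -p '{"spec": {"size": 5}}'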

Failover

If a minority of etcd members crash, the etcd operator will automatically recover from the failure. Let's walk through this in the following steps.

Create an etcd cluster:

$ kubectl create -f example/example-etcd-cluster.yaml

Wait until all three members are up. Simulate a member failure by deleting a pod:

$ kubectl delete pod example-etcd-cluster-cl2gpqsmsw --now

The etcd operator will recover the failure by creating a new pod example-etcd-cluster-n4h66wtjrg:

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
example-etcd-cluster-gxkmr9ql7z   1/1       Running   0          10m
example-etcd-cluster-n4h66wtjrg   1/1       Running   0          26s
example-etcd-cluster-rqk62l46kw   1/1       Running   0          10m
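
To confirm that the new pod has actually rejoined the etcd cluster, you can list the members from one of the running pods (a sketch; assumes etcdctl and a shell are available in the etcd image, as they are in quay.io/coreos/etcd):

$ kubectl exec example-etcd-cluster-gxkmr9ql7z -- /bin/sh -c "ETCDCTL_API=3 etcdctl member list"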

Destroy the etcd cluster:

$ kubectl delete -f example/example-etcd-cluster.yaml

etcd operator recovery

Let's walk through operator recovery in the following steps.

Create an etcd cluster:

$ kubectl create -f example/example-etcd-cluster.yaml

Wait until all three members are up. Then stop the etcd operator and delete one of the etcd pods:

$ kubectl delete -f example/deployment.yaml
deployment "etcd-operator" deleted

$ kubectl delete pod example-etcd-cluster-8gttjl679c --now
pod "example-etcd-cluster-8gttjl679c" deleted

Next, restart the etcd operator. It should recover itself and the etcd clusters it manages.

$ kubectl create -f example/deployment.yaml
deployment "etcd-operator" created

$ kubectl get pods
NAME                              READY     STATUS    RESTARTS   AGE
example-etcd-cluster-m8gk76l4ns   1/1       Running   0          3m
example-etcd-cluster-q6mff85hml   1/1       Running   0          3m
example-etcd-cluster-xnfvm7lg66   1/1       Running   0          11s
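
If recovery does not appear to make progress, the operator's logs show which clusters it found and how it is reconciling them (a sketch; assumes the example deployment is named etcd-operator):

$ kubectl logs deployment/etcd-operator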

Upgrade an etcd cluster

Create the following yaml file and have it ready:

$ cat upgrade-example.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  version: "3.1.10"
  repository: "quay.io/coreos/etcd"

Create an etcd cluster with the version specified (3.1.10) in the yaml file:

$ kubectl apply -f upgrade-example.yaml
$ kubectl get pods
NAME                              READY     STATUS    RESTARTS   AGE
example-etcd-cluster-795649v9kq   1/1       Running   1          3m
example-etcd-cluster-jtp447ggnq   1/1       Running   1          4m
example-etcd-cluster-psw7sf2hhr   1/1       Running   1          4m

The container image version should be 3.1.10:

$ kubectl get pod example-etcd-cluster-795649v9kq -o yaml | grep "image:" | uniq
    image: quay.io/coreos/etcd:v3.1.10

Now modify upgrade-example.yaml and change the version from 3.1.10 to 3.2.13:

$ cat upgrade-example.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  version: "3.2.13"

Apply the version change to the cluster CR:

$ kubectl apply -f upgrade-example.yaml

Wait ~30 seconds. The container image version should be updated to v3.2.13:

$ kubectl get pod example-etcd-cluster-795649v9kq -o yaml | grep "image:" | uniq
    image: gcr.io/etcd-development/etcd:v3.2.13

Check the other two pods and you should see the same result.
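
To watch the rolling upgrade as the operator replaces members one at a time, you can filter pods by the labels the operator attaches to them (a sketch; the app and etcd_cluster labels are described in the Resources and Labels doc):

$ kubectl get pods -w -l app=etcd,etcd_cluster=example-etcd-cluster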

Backup and Restore an etcd cluster

Note: The provided etcd backup/restore operators are example implementations.

Follow the etcd backup operator walkthrough to backup an etcd cluster.

Follow the etcd restore operator walkthrough to restore an etcd cluster on Kubernetes from backup.

Manage etcd clusters in all namespaces

See instructions on clusterwide feature.
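
With a clusterwide operator in place, an individual EtcdCluster opts in through an annotation on its metadata; a sketch (the annotation key follows the clusterwide feature doc):

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
  annotations:
    etcd.database.coreos.com/scope: clusterwide
spec:
  size: 3
  version: "3.2.13"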

etcd-operator's Issues

support node selector

Only schedule etcd pods onto predefined nodes.

Use cases: nodes with SSDs, nodes with better bandwidth.
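
A hedged sketch of what this could look like on the cluster spec, using a pod policy with a nodeSelector (the pod/nodeSelector field names are an assumption here, not something this issue defines):

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  pod:
    nodeSelector:
      disktype: ssd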

Problem with etcd in hostNetwork

Ran into this today while trying to test the changes from #112

NAME                                READY     STATUS    RESTARTS   AGE
po/etcd-cluster-0000                0/1       Error     0          1m
2016-09-15 00:49:12.196100 I | etcdmain: etcd Version: 3.0.8
2016-09-15 00:49:12.196134 I | etcdmain: Git SHA: d40982f
2016-09-15 00:49:12.196174 I | etcdmain: Go Version: go1.6.3
2016-09-15 00:49:12.196204 I | etcdmain: Go OS/Arch: linux/amd64
2016-09-15 00:49:12.196232 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2016-09-15 00:49:12.196591 I | etcdmain: listening for peers on http://0.0.0.0:2380
2016-09-15 00:49:12.196725 I | etcdmain: listening for client requests on 0.0.0.0:2379
2016-09-15 00:49:12.213419 E | netutil: could not resolve host etcd-cluster-0000:2380
2016-09-15 00:49:12.236475 I | etcdmain: stopping listening for client requests on 0.0.0.0:2379
2016-09-15 00:49:12.236515 I | etcdmain: stopping listening for peers on http://0.0.0.0:2380
2016-09-15 00:49:12.236541 I | etcdmain: --initial-cluster must include etcd-cluster-0000=http://etcd-cluster-0000:2380 given --initial-advertise-peer-urls=http://etcd-cluster-0000:2380

Looks like it's failing to resolve etcd-cluster-0000. I'll investigate.

release workflow

Use the official etcd container.

Publish kube-etcd-controller as a container.

member health checking

Right now we rely on Kubernetes pod management to detect member failure. That is not reliable, since there can be application-level failures. We should use probes to periodically ping etcd's health endpoint to determine the health of each member.
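
A sketch of the kind of probe this suggests, expressed as a standard Kubernetes livenessProbe against etcd's /health endpoint on the client port (illustrative only; the operator would generate something like this in the member pod template):

livenessProbe:
  httpGet:
    path: /health
    port: 2379
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3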

Allow etcd pods to run in host network namespace

If a user wants to run a Kubernetes-hosted etcd for Canal / Calico (i.e. as part of the infrastructure), it needs to be run with host networking.

Running in the host network namespace is currently necessary to avoid a bootstrap issue where the default network policy prevents Calico / Canal from accessing etcd, but the network policy cannot be changed without Calico / Canal accessing etcd.

Not sure if something similar has been discussed before, but is there a plan to have a mechanism for exposing more general Pod configuration parameters like hostNetwork: true (or even things like annotations, labels, etc)?
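
For reference, the cluster spec at the time already carried a HostNetwork flag (it shows up as HostNetwork:false in the controller logs quoted further down this page), so one possible shape of the request is simply a spec field (a sketch; the exact serialized field name in the TPR is an assumption):

spec:
  size: 3
  hostNetwork: true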

doc: run etcd on dedicated nodes

People might want to run etcd on dedicated nodes (nodes that only run etcd) for reliability reasons. The use case is the self-hosted etcd cluster that powers Kubernetes itself; we want extra safety for that cluster.

We should use the taint feature to make this happen.
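
Roughly, the taint-based approach would taint and label the dedicated nodes, then have the operator add a matching toleration and node selector to the etcd pods (the dedicated=etcd key/value here is made up for illustration):

$ kubectl taint nodes node-1 dedicated=etcd:NoSchedule
$ kubectl label nodes node-1 dedicated=etcd

The generated etcd pods would then need a toleration for dedicated=etcd:NoSchedule plus a matching nodeSelector.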

Add Spec.RequireDedicatedNode field.

/cc @colhom opinion?

support resize

  • check cluster state until all existing members are healthy
  • if growing, add member one by one
  • if shrinking, remove member one by one

existing UUID is broken

After creating an EtcdCluster object...

2016-08-05 08:08:52.451942 I | new cluster size: 3
panic: Service "etcd0-24b570cf-f642-4bb2-a8f5-4c8033565f1e" is invalid: metadata.name: Invalid value: "etcd0-24b570cf-f642-4bb2-a8f5-4c8033565f1e": must be no more than 24 characters

This little hack fixed it

diff --git a/main.go b/main.go
index 43a14cf..1c69bdb 100644
--- a/main.go
+++ b/main.go
@@ -129,7 +129,7 @@ func mustCreateClient(host string) *unversioned.Client {
 }

 func generateUUID() string {
-   return uuid.New()
+   return uuid.New()[:8]
 }

 func makeEtcdService(etcdName, uuid string) *api.Service {

#3 would be the ideal fix

panic on pod deletion through API

With kube-etcd-controller running, deleting one of the etcd-cluster-xxxx pods will cause the following panic:

panic: unexpected EOF

goroutine 1 [running]:
panic(0x2023900, 0xc82000e180)
    /usr/lib/go/src/runtime/panic.go:481 +0x3e6
main.(*etcdClusterController).Run(0xc820549f00)
    /home/chom/gopaths/kube-etcd-controller/src/github.com/coreos/kube-etcd-controller/main.go:61 +0x39a
main.main()
    /home/chom/gopaths/kube-etcd-controller/src/github.com/coreos/kube-etcd-controller/main.go:101 +0x166

@xiang90 I'm going to fix this, then push commits for #5

Allow management of existing clusters

A common requirement for using kube-etcd-controller is the ability to migrate an already existing etcd cluster to the management of the controller. This functionality will be necessary in order to enable bootkube (#128) to bootstrap using the controller.

Init container not sharing volumes correctly

Problem

By design, init containers are supposed to initialize volumes in sequence and share them with the other containers. However, this is not working so far: a volume initialized by an init container cannot be used by the other containers.

What's affected

In the disaster recovery workflow, the seed etcd member will:

  1. fetch snapshot from backup, and use "etcdctl snapshot restore" to prepare restored datadir.
  2. start etcd server and have "--data-dir" pointed to the same datadir.

Unfortunately, the restored datadir doesn't show up in step 2. We are moving forward with #94 though, and will fix this later.

Future Plan

Upstream issue: kubernetes/kubernetes#32094
Fix: (TODO)

failed to get event from apiserver: unexpected EOF

time="2016-10-04T18:54:15Z" level=info msg="finding existing clusters..." 
time="2016-10-04T18:54:15Z" level=info msg="etcd cluster controller starts running..." 
time="2016-10-04T18:54:15Z" level=info msg="watching at 2766988" 
time="2016-10-04T18:54:20Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:20Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:20Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:20Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:25Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:25Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:25Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:25Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:30Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:30Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:30Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:30Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:35Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:35Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:35Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:35Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:40Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:40Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:40Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:40Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:45Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:45Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:45Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:45Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:50Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:50Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:50Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:50Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:55Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:55Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:55Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:55Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:55:00Z" level=info msg="Reconciling:" 
time="2016-10-04T18:55:00Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:55:00Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:55:00Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:55:05Z" level=error msg="failed to get event from apiserver: unexpected EOF" 
panic: unexpected EOF

goroutine 1 [running]:
panic(0x18b28e0, 0xc42000a110)
    /home/mischief/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/kube-etcd-controller/pkg/controller.(*Controller).Run(0xc4202ceb70)
    /home/mischief/code/go/src/github.com/coreos/kube-etcd-controller/pkg/controller/controller.go:96 +0x703
main.main()
    /home/mischief/code/go/src/github.com/coreos/kube-etcd-controller/cmd/controller/main.go:30 +0x5e

setup CI for e2e testing

At this initial development stage, we probably want to start with e2e tests that require a running k8s cluster.

@colhom volunteered to work on this :) Thanks!

dns name for etcd endpoints?

Is it currently possible to use a DNS address or some other mechanism for a client to discover the etcd client addresses?

If not, this would be a great feature to make it simpler for clients to connect.

cluster membership reconciling

For each etcd cluster, we actually have three membership states:

  1. The desired size defined in the TPR resource; this is provided by the user.
  2. The running pods in the k8s cluster; this is provided by the API server.
  3. The membership in the etcd cluster; this is fully controlled by our controller.

From the controller's perspective, 1 and 2 are dynamic: they are not controlled directly by the controller itself. The controller's responsibility is to keep 2 and 3 in sync with 1.

There are a few cases where 1, 2, and 3 can be out of sync, and we need to define rules to bring them back in sync.

A general rule is probably: always bring 2 and 3 in sync first (recovery), one member at a time, and only then take care of 1 (resize).

take over existing cluster

There is an existing cluster deployed somewhere, and we have an accessible client addr and peer addr.

We should be able to take over that cluster by moving its members into k8s: repeatedly add one member inside k8s and remove one member outside k8s.

sign off by polvi

Polvi needs to sign off on this work before we can open source it (or we might not).

He has some ideas about how we should do this.

resync state when restarting the controller

Right now we assume that the controller sees all state changes from the beginning. Once it can fail, that assumption no longer holds.

The controller should discover existing clusters by listing pods with the app=etcd selector. Then we can find all existing clusters and make the necessary modifications according to the spec in the third party resources.

e2e test: Missing running pods shown previously

This is not really a controller issue; I just want to keep track of it here.

What I have observed:

  • a pod has been created
  • after 5 seconds in reconcile the pod doesn't show up.

Eventually it does show up. This long delay affects how reconcile works and isn't reasonable to work around.

reorganize code base

kube-etcd-controller/
  cmd/
    controller/
      main.go
    backup/
      main.go
  cluster/
  k8sutil/
  ...

Then we can start to work on the backup cmd.

watch error makes kube-etcd-controller get stuck in a loop

Somehow the watcher gets stuck in an error loop, causing the controller to hit the API server constantly and drive the CPU load very high.

The size of my cluster is 1.

Does it make sense to add a rate limit to the watch loop to prevent accidental high CPU use on the API server?

Here are logs with an additional print statement to dump the watch error.

time="2016-10-03T23:01:17Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:17Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:19Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:19Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:22Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:22Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:22Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:22Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:24Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:24Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:27Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:27Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:27Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:27Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:29Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:29Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:32Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:32Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:32Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:32Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:34Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:34Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:37Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:37Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:37Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:37Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:39Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:39Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:42Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:42Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:42Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:42Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:44Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:44Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:47Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:47Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:47Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:47Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:49Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:49Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:52Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:52Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:52Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:52Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:54Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:54Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:57Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:57Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:57Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:57Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:59Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:59Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:02Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:02Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:02Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:02Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:04Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:04Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:07Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:07Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:07Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:07Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:09Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:09Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:12Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:12Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:12Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:12Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:14Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:14Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:17Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:17Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:17Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:17Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:19Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:19Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:22Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:22Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:22Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:22Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:24Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:24Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:27Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:27Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:27Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:27Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:29Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:29Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:32Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:32Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:32Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:32Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:34Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:34Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:37Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:37Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:37Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:37Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:39Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:39Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 

support pod anti affinity

By default, we should spread the etcd pods that belong to one cluster onto different nodes, so the etcd cluster can achieve better availability.

But we should also support the non-spreading case for quick deployment setups where there is only one kube node.

This should be achieved by adding a policy to the TPR cluster creation spec.
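
In plain Kubernetes terms, the spreading behaviour corresponds to pod anti-affinity keyed on the labels the operator puts on etcd pods; a sketch (illustrative only; the operator would generate this inside the member pod template):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          etcd_cluster: example-etcd-cluster
      topologyKey: kubernetes.io/hostname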

update etcd cluster version

Add the etcd version to the cluster Spec.

If the version is updated, we should update the pod spec of the etcd members one by one, and one version at a time (2.3.1 -> 2.3.2 -> 2.3.3, never 2.3.1 -> 2.3.3 directly).

We should only start a version update when reconciliation matches (the cluster is in its desired state).

heavy load on test infra leads to unexpected timeout

From last failure:

=== RUN   TestDisasterRecovery
created etcd cluster: test-etcd-e3giv
reached to 3 members cluster
deleting etcd cluster: test-etcd-e3giv
ready: [], unready: [test-etcd-e3giv-0003 test-etcd-e3giv-0004]
kube-etcd-controller logs ===
time="2016-10-05T17:29:01Z" level=info msg="Expected membership: test-etcd-e3giv-0002,test-etcd-e3giv-0000,test-etcd-e3giv-0001" 
time="2016-10-05T17:29:01Z" level=info msg="Finish Reconciling" 
time="2016-10-05T17:29:06Z" level=info msg="Reconciling:" 
time="2016-10-05T17:29:06Z" level=info msg="Running pods: test-etcd-e3giv-0002,test-etcd-e3giv-0000" 
time="2016-10-05T17:29:06Z" level=info msg="Expected membership: test-etcd-e3giv-0000,test-etcd-e3giv-0001,test-etcd-e3giv-0002" 
time="2016-10-05T17:29:06Z" level=info msg="Recovering one member" 
time="2016-10-05T17:29:06Z" level=info msg="removed member (test-etcd-e3giv-0001) with ID (5161869728712750051)" 
time="2016-10-05T17:29:26Z" level=error msg="fail to add new member (test-etcd-e3giv-0003): timeout to add etcd member" 
time="2016-10-05T17:29:26Z" level=info msg="Finish Reconciling" 
time="2016-10-05T17:29:26Z" level=error msg="fail to reconcile: timeout to add etcd member" 
time="2016-10-05T17:29:31Z" level=info msg="Reconciling:" 
time="2016-10-05T17:29:31Z" level=info msg="Running pods: test-etcd-e3giv-0002" 
time="2016-10-05T17:29:31Z" level=info msg="Expected membership: test-etcd-e3giv-0000,test-etcd-e3giv-0002" 
time="2016-10-05T17:29:31Z" level=info msg="Disaster recovery" 
time="2016-10-05T17:29:31Z" level=info msg="Made a latest backup successfully" 
time="2016-10-05T17:29:41Z" level=info msg="created cluster (test-etcd-e3giv) with seed member (test-etcd-e3giv-0003)" 
time="2016-10-05T17:29:41Z" level=info msg="Finish Reconciling" 
time="2016-10-05T17:29:46Z" level=info msg="Reconciling:" 
time="2016-10-05T17:29:46Z" level=info msg="Running pods: test-etcd-e3giv-0003" 
time="2016-10-05T17:29:46Z" level=info msg="Expected membership: test-etcd-e3giv-0003" 

kube-etcd-controller logs END ===
--- FAIL: TestDisasterRecovery (165.06s)

The etcd controller seems to be running correctly and is recovering the cluster. But disaster recovery took very long, even taking into account that we back off on transient failures.

My suggestion is that we should detect whether the controller is frozen:

  • whether the controller is dead.
  • whether the controller has made any positive progress in recovering the cluster to the desired state.

etcd Clients are not closed

From the etcd clientv3 godocs:

Make sure to close the client after using it. If the client is not closed, the connection will have leaky goroutines.

The way we handle clients now is pretty ad-hoc. I figure that we'll probably end up doing one of these:

  • Put clients in a managed pool
  • Deal with client creation/cleanup as part of event loop

In either case, this issue exists to remind us to do it once we've got client lifecycle sorted.

e2e TestDisasterRecovery failed

controller panic with

panic: TODO: handle failure of backupnow request
goroutine 682 [running]:
panic(0x13a0180, 0xc4203e7180)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/kube-etcd-controller/pkg/cluster.(*Cluster).disasterRecovery(0xc4203a3e00, 0xc42071fd70, 0x1, 0x13a0180)
        /home/jenkins/workspace/build-152/gopath/src/github.com/coreos/kube-etcd-controller/pkg/cluster/reconcile.go:172 +0x347
github.com/coreos/kube-etcd-controller/pkg/cluster.(*Cluster).reconcile(0xc4203a3e00, 0xc42071fbc0, 0x0, 0x0)
        /home/jenkins/workspace/build-152/gopath/src/github.com/coreos/kube-etcd-controller/pkg/cluster/reconcile.go:73 +0x574
github.com/coreos/kube-etcd-controller/pkg/cluster.(*Cluster).run(0xc4203a3e00)
        /home/jenkins/workspace/build-152/gopath/src/github.com/coreos/kube-etcd-controller/pkg/cluster/cluster.go:124 +0x3a7
created by github.com/coreos/kube-etcd-controller/pkg/cluster.new
        /home/jenkins/workspace/build-152/gopath/src/github.com/coreos/kube-etcd-controller/pkg/cluster/cluster.go:83 +0x19d

panic: failed to create seed member (etcd-cluster-0000): services "etcd-cluster-0000" already exists

my kube-etcd-controller is in a crash loop:

NAME                                READY     STATUS             RESTARTS   AGE
po/kube-etcd-controller             0/1       CrashLoopBackOff   287        1d

2016-09-29T11:31:54.215836829Z time="2016-09-29T11:31:54Z" level=info msg="finding existing clusters..." 
2016-09-29T11:31:54.228707354Z time="2016-09-29T11:31:54Z" level=info msg="etcd cluster controller starts running..." 
2016-09-29T11:31:54.230494618Z time="2016-09-29T11:31:54Z" level=info msg="watching at " 
2016-09-29T11:31:54.232252069Z time="2016-09-29T11:31:54Z" level=info msg="etcd cluster event: ADDED cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"EtcdCluster\", APIVersion:\"coreos.com/v1\"}, ObjectMeta:api.ObjectMeta{Name:\"etcd-cluster\", GenerateName:\"\", Namespace:\"default\", SelfLink:\"/apis/coreos.com/v1/namespaces/default/etcdclusters/etcd-cluster\", UID:\"20957ab4-8543-11e6-8a25-5600000827f5\", ResourceVersion:\"2096521\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:63610640203, nsec:0, loc:(*time.Location)(0x268e280)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:1, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(0xc42019efa0), HostNetwork:false}}" 
2016-09-29T11:31:54.266466557Z panic: failed to create seed member (etcd-cluster-0000): services "etcd-cluster-0000" already exists
2016-09-29T11:31:54.266511504Z 
2016-09-29T11:31:54.266521782Z goroutine 1 [running]:
2016-09-29T11:31:54.266528296Z panic(0x18ac1a0, 0xc420355990)
2016-09-29T11:31:54.266534740Z  /usr/local/go/src/runtime/panic.go:500 +0x1a1
2016-09-29T11:31:54.266542431Z github.com/coreos/kube-etcd-controller/pkg/cluster.new(0xc420128240, 0xc420286250, 0xc, 0x1aadd93, 0x7, 0xc4201b6750, 0x465001, 0x18)
2016-09-29T11:31:54.266548891Z  /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/pkg/cluster/cluster.go:72 +0x1dd
2016-09-29T11:31:54.266555616Z github.com/coreos/kube-etcd-controller/pkg/cluster.New(0xc420128240, 0xc420286250, 0xc, 0x1aadd93, 0x7, 0xc4201b6750, 0x0)
2016-09-29T11:31:54.266562057Z  /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/pkg/cluster/cluster.go:54 +0x62
2016-09-29T11:31:54.266571033Z github.com/coreos/kube-etcd-controller/pkg/controller.(*Controller).Run(0xc420419770)
2016-09-29T11:31:54.266576696Z  /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/pkg/controller/controller.go:78 +0x57d
2016-09-29T11:31:54.266582835Z main.main()
2016-09-29T11:31:54.266588681Z  /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/cmd/controller/main.go:30 +0x5e

Support parameterized namespace

Currently we only use the "default" namespace; we should have a parameter for it.
One use case: in a testing environment, we would have our tests run in a "kube-etcd-${RAND}" namespace.

controller creates TPR

Brendan Burns had this idea that the controller would just automatically create the TPR when it starts.

Only single member clusters can be used as Seeds

Currently, when an EtcdCluster is created with a seed cluster specified that has more than one member, it panics with:

panic: seed cluster contains more than one member

In order to get bootkube (#128) working, I have explored moving from a single-member etcd cluster to a multi-member cluster and have had difficulty performing the migration without losing consensus in the process. The closest issue I could find to this problem is etcd-io/etcd#6420, but from asking around it seems to be a known issue.

What would be required to migrate to a multi-member cluster? Are there strategies for preventing quorum loss when moving from single- to multi-member clusters?
@xiang90 @hongchaodeng

Backoff on transient error of removeOneMember()

The controller encountered an error during removeOneMember():

panic: etcdcli failed to remove one member: etcdserver: unhealthy cluster

From etcd logs:

2016-10-05 04:19:45.063696 I | membership: added member d794859df5e93fae [http://test-etcd-at8eb-0001:2380] to cluster fc2cfeb0a8515d85
2016-10-05 04:19:45.063827 I | membership: added member 869e996aa2b12196 [http://test-etcd-at8eb-0002:2380] to cluster fc2cfeb0a8515d85
2016-10-05 04:19:45.063956 I | membership: added member 29e1e8156d620351 [http://test-etcd-at8eb-0003:2380] to cluster fc2cfeb0a8515d85
2016-10-05 04:19:45.064188 I | membership: added member 9c83fedd19ba5431 [http://test-etcd-at8eb-0004:2380] to cluster fc2cfeb0a8515d85
2016-10-05 04:19:45.070298 I | etcdserver: published {Name:test-etcd-at8eb-0004 ClientURLs:[http://test-etcd-at8eb-0004:2379]} to cluster fc2cfeb0a8515d85
2016-10-05 04:19:45.070353 I | embed: ready to serve client requests
2016-10-05 04:19:45.070641 N | embed: serving insecure client requests on [::]:2379, this is strongly discouraged!
2016-10-05 04:19:49.154008 W | etcdserver: reconfigure breaks active quorum, rejecting remove member 9c83fedd19ba5431

The problem is that only ~4s had elapsed, while etcd needs 5s for its strict quorum check. We can safely retry this error.

Parameterize image option for build script

Currently, the build script is hard-coded to use

gcr.io/coreos-k8s-scale-testing/kubeetcdctrl:latest

If someone changes the code and wants to test or play with the changes, they need to hack the build scripts. For better collaboration, we should support a custom image option.

controller crashes if you don't delete service

I deleted etcd-cluster-0000 from the readme example, and then the controller crashed:

2016-08-17 17:00:16.126909 I | etcd cluster controller starts running...
2016-08-17 17:00:16.156275 I | start watching...
2016-08-17 17:00:16.157034 I | etcd cluster event: ADDED main.EtcdCluster{Kind:"EtcdCluster", ApiVersion:"coreos.com/v1", Metadata:map[string]string{"name":"etcd-cluster", "namespace":"default", "selfLink":"/apis/coreos.com/v1/namespaces/default/etcdclusters/etcd-cluster", "uid":"f5625360-646b-11e6-a0d3-ba2a3bb04016", "resourceVersion":"1686", "creationTimestamp":"2016-08-17T11:15:52Z"}, Size:3}
panic: services "etcd-cluster-0000" already exists

goroutine 27 [running]:
panic(0x18e2f80, 0xc4200ce800)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
main.(*Cluster).create(0xc420400ed0, 0x3)
        /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/cluster.go:91 +0x473
main.(*Cluster).run(0xc420400ed0)
        /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/cluster.go:71 +0x158
created by main.newCluster
        /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/cluster.go:42 +0x12b
