coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes

Home Page: https://coreos.com/blog/introducing-the-etcd-operator.html

License: Apache License 2.0

Go 94.91% Shell 4.46% Dockerfile 0.63%
etcd kubernetes operator

etcd-operator's Introduction

etcd operator

Project status: archived

This project is no longer actively developed or maintained. The project exists here for historical reference. If you are interested in the future of the project and taking over stewardship, please contact [email protected].

Overview

The etcd operator manages etcd clusters deployed to Kubernetes and automates tasks related to operating an etcd cluster.

There are more spec examples on how to set up clusters with different configurations.

Read Best Practices for more information on how to better use etcd operator.

Read RBAC docs for how to set up RBAC rules for etcd operator if RBAC is in place.

Read Developer Guide for setting up a development environment if you want to contribute.

See the Resources and Labels doc for an overview of the resources created by the etcd-operator.

Requirements

  • Kubernetes 1.8+
  • etcd 3.2.13+

Demo

Getting started

etcd Operator demo

Deploy etcd operator

See instructions on how to install/uninstall etcd operator.

Create and destroy an etcd cluster

$ kubectl create -f example/example-etcd-cluster.yaml

A 3-member etcd cluster will be created.
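
For reference, example/example-etcd-cluster.yaml defines a minimal EtcdCluster resource along these lines (a sketch matching the size and version used throughout this README):

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  version: "3.2.13"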

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
example-etcd-cluster-gxkmr9ql7z   1/1       Running   0          1m
example-etcd-cluster-m6g62x6mwc   1/1       Running   0          1m
example-etcd-cluster-rqk62l46kw   1/1       Running   0          1m

See client service for how to access etcd clusters created by the operator.
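
As a quick sketch, assuming the operator exposes the cluster through a client service named example-etcd-cluster-client on port 2379 (see the client service doc), you can write a key from inside the cluster with a throwaway pod:

$ kubectl run --rm -i --tty etcdctl-test --image quay.io/coreos/etcd:v3.2.13 --restart=Never -- /bin/sh
/ # ETCDCTL_API=3 etcdctl --endpoints http://example-etcd-cluster-client:2379 put foo bar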

If you are working with minikube locally, create a nodePort service and test that etcd is responding:

$ kubectl create -f example/example-etcd-cluster-nodeport-service.json
$ export ETCDCTL_API=3
$ export ETCDCTL_ENDPOINTS=$(minikube service example-etcd-cluster-client-service --url)
$ etcdctl put foo bar

Destroy the etcd cluster:

$ kubectl delete -f example/example-etcd-cluster.yaml

Resize an etcd cluster

Create an etcd cluster:

$ kubectl apply -f example/example-etcd-cluster.yaml

In example/example-etcd-cluster.yaml the initial cluster size is 3. Modify the file and change size from 3 to 5.

$ cat example/example-etcd-cluster.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 5
  version: "3.2.13"

Apply the size change to the cluster CR:

$ kubectl apply -f example/example-etcd-cluster.yaml

The etcd cluster will scale to 5 members (5 pods):

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
example-etcd-cluster-cl2gpqsmsw   1/1       Running   0          5m
example-etcd-cluster-cx2t6v8w78   1/1       Running   0          5m
example-etcd-cluster-gxkmr9ql7z   1/1       Running   0          7m
example-etcd-cluster-m6g62x6mwc   1/1       Running   0          7m
example-etcd-cluster-rqk62l46kw   1/1       Running   0          7m

Similarly, we can decrease the size of the cluster from 5 back to 3 by changing the size field again and reapplying the change.

$ cat example/example-etcd-cluster.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  version: "3.2.13"
$ kubectl apply -f example/example-etcd-cluster.yaml

The etcd cluster should eventually shrink back to 3 pods:

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
example-etcd-cluster-cl2gpqsmsw   1/1       Running   0          6m
example-etcd-cluster-gxkmr9ql7z   1/1       Running   0          8m
example-etcd-cluster-rqk62l46kw   1/1       Running   0          9m
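
As an alternative to editing the manifest, the size can also be patched directly on the EtcdCluster resource (a sketch; assumes the etcdcluster resource type registered by the operator's CRD). For example, to scale back up to 5:

$ kubectl patch etcdcluster example-etcd-cluster --type=merge -p '{"spec": {"size": 5}}'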

Failover

If a minority of etcd members crash, the etcd operator will automatically recover from the failure. Let's walk through this in the following steps.

Create an etcd cluster:

$ kubectl create -f example/example-etcd-cluster.yaml

Wait until all three members are up. Simulate a member failure by deleting a pod:

$ kubectl delete pod example-etcd-cluster-cl2gpqsmsw --now

The etcd operator will recover the failure by creating a new pod example-etcd-cluster-n4h66wtjrg:

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
example-etcd-cluster-gxkmr9ql7z   1/1       Running   0          10m
example-etcd-cluster-n4h66wtjrg   1/1       Running   0          26s
example-etcd-cluster-rqk62l46kw   1/1       Running   0          10m
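
To confirm that the new pod has actually rejoined the etcd cluster, you can list the members from one of the running pods (a sketch; assumes etcdctl and a shell are available in the etcd image, as they are in quay.io/coreos/etcd):

$ kubectl exec example-etcd-cluster-gxkmr9ql7z -- /bin/sh -c "ETCDCTL_API=3 etcdctl member list"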

Destroy the etcd cluster:

$ kubectl delete -f example/example-etcd-cluster.yaml

etcd operator recovery

Let's walk through operator recovery in the following steps.

Create an etcd cluster:

$ kubectl create -f example/example-etcd-cluster.yaml

Wait until all three members are up. Then stop the etcd operator and delete one of the etcd pods:

$ kubectl delete -f example/deployment.yaml
deployment "etcd-operator" deleted

$ kubectl delete pod example-etcd-cluster-8gttjl679c --now
pod "example-etcd-cluster-8gttjl679c" deleted

Next, restart the etcd operator. It should recover itself and the etcd clusters it manages.

$ kubectl create -f example/deployment.yaml
deployment "etcd-operator" created

$ kubectl get pods
NAME                              READY     STATUS    RESTARTS   AGE
example-etcd-cluster-m8gk76l4ns   1/1       Running   0          3m
example-etcd-cluster-q6mff85hml   1/1       Running   0          3m
example-etcd-cluster-xnfvm7lg66   1/1       Running   0          11s
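
If recovery does not appear to make progress, the operator's logs show which clusters it found and how it is reconciling them (a sketch; assumes the example deployment is named etcd-operator):

$ kubectl logs deployment/etcd-operator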

Upgrade an etcd cluster

Create the following yaml file and have it ready:

$ cat upgrade-example.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  version: "3.1.10"
  repository: "quay.io/coreos/etcd"

Create an etcd cluster with the version specified (3.1.10) in the yaml file:

$ kubectl apply -f upgrade-example.yaml
$ kubectl get pods
NAME                              READY     STATUS    RESTARTS   AGE
example-etcd-cluster-795649v9kq   1/1       Running   1          3m
example-etcd-cluster-jtp447ggnq   1/1       Running   1          4m
example-etcd-cluster-psw7sf2hhr   1/1       Running   1          4m

The container image version should be 3.1.10:

$ kubectl get pod example-etcd-cluster-795649v9kq -o yaml | grep "image:" | uniq
    image: quay.io/coreos/etcd:v3.1.10

Now modify upgrade-example.yaml and change the version from 3.1.10 to 3.2.13:

$ cat upgrade-example.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  version: "3.2.13"

Apply the version change to the cluster CR:

$ kubectl apply -f upgrade-example.yaml

Wait ~30 seconds. The container image version should be updated to v3.2.13:

$ kubectl get pod example-etcd-cluster-795649v9kq -o yaml | grep "image:" | uniq
    image: gcr.io/etcd-development/etcd:v3.2.13

Check the other two pods and you should see the same result.
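
To watch the rolling upgrade as the operator replaces members one at a time, you can filter pods by the labels the operator attaches to them (a sketch; the app and etcd_cluster labels are described in the Resources and Labels doc):

$ kubectl get pods -w -l app=etcd,etcd_cluster=example-etcd-cluster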

Backup and Restore an etcd cluster

Note: The provided etcd backup/restore operators are example implementations.

Follow the etcd backup operator walkthrough to backup an etcd cluster.

Follow the etcd restore operator walkthrough to restore an etcd cluster on Kubernetes from backup.

Manage etcd clusters in all namespaces

See instructions on clusterwide feature.
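
With a clusterwide operator in place, an individual EtcdCluster opts in through an annotation on its metadata; a sketch (the annotation key follows the clusterwide feature doc):

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
  annotations:
    etcd.database.coreos.com/scope: clusterwide
spec:
  size: 3
  version: "3.2.13"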

etcd-operator's Issues

support node selector

Only schedule etcd pods onto predefined nodes.

Use cases: nodes with SSDs, nodes with better bandwidth.
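
A hedged sketch of what this could look like on the cluster spec, using a pod policy with a nodeSelector (the pod/nodeSelector field names are an assumption here, not something this issue defines):

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  pod:
    nodeSelector:
      disktype: ssd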

Problem with etcd in hostNetwork

Ran into this today while trying to test the changes from #112

NAME                                READY     STATUS    RESTARTS   AGE
po/etcd-cluster-0000                0/1       Error     0          1m
2016-09-15 00:49:12.196100 I | etcdmain: etcd Version: 3.0.8
2016-09-15 00:49:12.196134 I | etcdmain: Git SHA: d40982f
2016-09-15 00:49:12.196174 I | etcdmain: Go Version: go1.6.3
2016-09-15 00:49:12.196204 I | etcdmain: Go OS/Arch: linux/amd64
2016-09-15 00:49:12.196232 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2016-09-15 00:49:12.196591 I | etcdmain: listening for peers on http://0.0.0.0:2380
2016-09-15 00:49:12.196725 I | etcdmain: listening for client requests on 0.0.0.0:2379
2016-09-15 00:49:12.213419 E | netutil: could not resolve host etcd-cluster-0000:2380
2016-09-15 00:49:12.236475 I | etcdmain: stopping listening for client requests on 0.0.0.0:2379
2016-09-15 00:49:12.236515 I | etcdmain: stopping listening for peers on http://0.0.0.0:2380
2016-09-15 00:49:12.236541 I | etcdmain: --initial-cluster must include etcd-cluster-0000=http://etcd-cluster-0000:2380 given --initial-advertise-peer-urls=http://etcd-cluster-0000:2380

Looks like it's failing to resolve etcd-cluster-0000. I'll investigate.

release workflow

Use the official etcd container.

Publish kube-etcd-controller as a container.

member health checking

Right now we rely on Kubernetes pod management to detect member failure. That is not reliable, since there can be application-level failures. We should use probes to periodically ping etcd's health endpoint to determine the health of each member.
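
A sketch of the kind of probe this suggests, expressed as a standard Kubernetes livenessProbe against etcd's /health endpoint on the client port (illustrative only; the operator would generate something like this in the member pod template):

livenessProbe:
  httpGet:
    path: /health
    port: 2379
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3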

Allow etcd pods to run in host network namespace

If a user wants to run a Kubernetes-hosted etcd for Canal / Calico (i.e. as part of the infrastructure), it needs to be run with host networking.

Running in the host network namespace is currently necessary to avoid a bootstrap issue where the default network policy prevents Calico / Canal from accessing etcd, but the network policy cannot be changed without Calico / Canal accessing etcd.

Not sure if something similar has been discussed before, but is there a plan to have a mechanism for exposing more general Pod configuration parameters like hostNetwork: true (or even things like annotations, labels, etc)?
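
For reference, the cluster spec at the time already carried a HostNetwork flag (it shows up as HostNetwork:false in the controller logs quoted further down this page), so one possible shape of the request is simply a spec field (a sketch; the exact serialized field name in the TPR is an assumption):

spec:
  size: 3
  hostNetwork: true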

doc: run etcd on dedicated nodes

People might want to run etcd on dedicated nodes (nodes that only run etcd) for reliability reasons. The use case is the self-hosted etcd cluster that powers Kubernetes itself; we want extra safety for that cluster.

We should use the taint feature to make this happen.
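
Roughly, the taint-based approach would taint and label the dedicated nodes, then have the operator add a matching toleration and node selector to the etcd pods (the dedicated=etcd key/value here is made up for illustration):

$ kubectl taint nodes node-1 dedicated=etcd:NoSchedule
$ kubectl label nodes node-1 dedicated=etcd

The generated etcd pods would then need a toleration for dedicated=etcd:NoSchedule plus a matching nodeSelector.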

Add Spec.RequireDedicatedNode field.

/cc @colhom opinion?

support resize

  • check cluster state until all existing members are healthy
  • if growing, add member one by one
  • if shrinking, remove member one by one

existing UUID is broken

After creating an EtcdCluster object...

2016-08-05 08:08:52.451942 I | new cluster size: 3
panic: Service "etcd0-24b570cf-f642-4bb2-a8f5-4c8033565f1e" is invalid: metadata.name: Invalid value: "etcd0-24b570cf-f642-4bb2-a8f5-4c8033565f1e": must be no more than 24 characters

This little hack fixed it

diff --git a/main.go b/main.go
index 43a14cf..1c69bdb 100644
--- a/main.go
+++ b/main.go
@@ -129,7 +129,7 @@ func mustCreateClient(host string) *unversioned.Client {
 }

 func generateUUID() string {
-   return uuid.New()
+   return uuid.New()[:8]
 }

 func makeEtcdService(etcdName, uuid string) *api.Service {

#3 would be the ideal fix

panic on pod deletion through API

With kube-etcd-controller running, deleting one of the etcd-cluster-xxxx pods will cause the following panic:

panic: unexpected EOF

goroutine 1 [running]:
panic(0x2023900, 0xc82000e180)
    /usr/lib/go/src/runtime/panic.go:481 +0x3e6
main.(*etcdClusterController).Run(0xc820549f00)
    /home/chom/gopaths/kube-etcd-controller/src/github.com/coreos/kube-etcd-controller/main.go:61 +0x39a
main.main()
    /home/chom/gopaths/kube-etcd-controller/src/github.com/coreos/kube-etcd-controller/main.go:101 +0x166

@xiang90 I'm going to fix this, then push commits for #5

Allow management of existing clusters

A common requirement for using kube-etcd-controller is the ability to migrate an already existing etcd cluster to the management of the controller. This functionality will be necessary in order to enable bootkube (#128) to bootstrap using the controller.

Init container not sharing volumes correctly

Problem

By design, init containers are supposed to initialize volumes in sequence and share them with the other containers. However, this is not working so far: a volume initialized by an init container cannot be used by the other containers.

What's affected

In the disaster recovery workflow, the seed etcd member will:

  1. fetch snapshot from backup, and use "etcdctl snapshot restore" to prepare restored datadir.
  2. start etcd server and have "--data-dir" pointed to the same datadir.

Unfortunately, the restored datadir doesn't show up in step 2. We are moving forward with #94 though, and will fix this later.

Future Plan

Upstream issue: kubernetes/kubernetes#32094
Fix: (TODO)

failed to get event from apiserver: unexpected EOF

time="2016-10-04T18:54:15Z" level=info msg="finding existing clusters..." 
time="2016-10-04T18:54:15Z" level=info msg="etcd cluster controller starts running..." 
time="2016-10-04T18:54:15Z" level=info msg="watching at 2766988" 
time="2016-10-04T18:54:20Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:20Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:20Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:20Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:25Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:25Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:25Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:25Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:30Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:30Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:30Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:30Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:35Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:35Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:35Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:35Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:40Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:40Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:40Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:40Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:45Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:45Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:45Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:45Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:50Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:50Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:50Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:50Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:54:55Z" level=info msg="Reconciling:" 
time="2016-10-04T18:54:55Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:54:55Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:54:55Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:55:00Z" level=info msg="Reconciling:" 
time="2016-10-04T18:55:00Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-04T18:55:00Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-04T18:55:00Z" level=info msg="Finish Reconciling" 
time="2016-10-04T18:55:05Z" level=error msg="failed to get event from apiserver: unexpected EOF" 
panic: unexpected EOF

goroutine 1 [running]:
panic(0x18b28e0, 0xc42000a110)
    /home/mischief/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/kube-etcd-controller/pkg/controller.(*Controller).Run(0xc4202ceb70)
    /home/mischief/code/go/src/github.com/coreos/kube-etcd-controller/pkg/controller/controller.go:96 +0x703
main.main()
    /home/mischief/code/go/src/github.com/coreos/kube-etcd-controller/cmd/controller/main.go:30 +0x5e

setup CI for e2e testing

At this initial development stage, we probably want to start with e2e tests that require a running k8s cluster.

@colhom volunteered to work on this :) Thanks!

dns name for etcd endpoints?

Is it currently possible to use a DNS address or some other mechanism for a client to discover the etcd client addresses?

If not, this would be a great feature to make it simpler for clients to connect.

cluster membership reconciling

For each etcd cluster, we actually have three membership states:

  1. The desired size defined in the TPR resource; this is provided by the user.
  2. The running pods in the k8s cluster; this is provided by the API server.
  3. The membership in the etcd cluster; this is fully controlled by our controller.

From the controller's perspective, 1 and 2 are dynamic: they are not controlled directly by the controller itself. The controller's responsibility is to keep 2 and 3 in sync with 1.

There are a few cases where 1, 2, and 3 can be out of sync, and we need to define rules to bring them back in sync.

A general rule is probably: always bring 2 and 3 in sync first (recovery), one member at a time, and only then take care of 1 (resize).

take over existing cluster

There is an existing cluster deployed somewhere, and we have an accessible client addr and peer addr.

We should be able to take over that cluster by moving its members into k8s: repeatedly add one member inside k8s and remove one member outside k8s.

sign off by polvi

Polvi needs to sign off on this work before we can open source it (or we might not).

He has some ideas about how we should do this.

resync state when restarting the controller

Right now we assume that the controller sees all state changes from the beginning. Once it can fail, that assumption no longer holds.

The controller should discover existing clusters by listing pods with the app=etcd selector. Then we can find all existing clusters and make the necessary modifications according to the spec in the third party resources.

e2e test: Missing running pods shown previously

This is not really a controller issue; I just want to keep track of it here.

What I have observed:

  • a pod has been created
  • after 5 seconds in reconcile the pod doesn't show up.

Eventually it does show up. This long delay affects how reconcile works and isn't reasonable to work around.

reorganize code base

kube-etcd-controller/
  cmd/
    controller/
      main.go
    backup/
      main.go
  cluster/
  k8sutil/
  ...

Then we can start to work on the backup cmd.

watch error makes kube-etcd-controller get stuck in a loop

Somehow the watcher gets stuck in an error loop, causing the controller to hit the API server constantly and drive the CPU load very high.

The size of my cluster is 1.

Does it make sense to add a rate limit to the watch loop to prevent accidental high CPU use on the API server?

Here are logs with an additional print statement to dump the watch error.

time="2016-10-03T23:01:17Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:17Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:19Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:19Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:22Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:22Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:22Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:22Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:24Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:24Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:27Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:27Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:27Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:27Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:29Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:29Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:32Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:32Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:32Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:32Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:34Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:34Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:37Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:37Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:37Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:37Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:39Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:39Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:42Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:42Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:42Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:42Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:44Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:44Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:47Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:47Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:47Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:47Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:49Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:49Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:52Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:52Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:52Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:52Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:54Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:54Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:01:57Z" level=info msg="Reconciling:" 
time="2016-10-03T23:01:57Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:01:57Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:01:57Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:01:59Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:01:59Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:02Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:02Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:02Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:02Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:04Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:04Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:07Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:07Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:07Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:07Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:09Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:09Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:12Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:12Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:12Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:12Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:14Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:14Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:17Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:17Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:17Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:17Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:19Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:19Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:22Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:22Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:22Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:22Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:24Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:24Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:27Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:27Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:27Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:27Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:29Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:29Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:32Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:32Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:32Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:32Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:34Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:34Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 
time="2016-10-03T23:02:37Z" level=info msg="Reconciling:" 
time="2016-10-03T23:02:37Z" level=info msg="Running pods: etcd-cluster-0000" 
time="2016-10-03T23:02:37Z" level=info msg="Expected membership: etcd-cluster-0000" 
time="2016-10-03T23:02:37Z" level=info msg="Finish Reconciling" 
time="2016-10-03T23:02:39Z" level=info msg="watching at 2639990" 
time="2016-10-03T23:02:39Z" level=info msg="etcd cluster event error: &controller.Event{Type:\"ERROR\", Object:cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"Status\", APIVersion:\"v1\"}, ObjectMeta:api.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:0, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(nil), HostNetwork:false}}}" 

support pod anti affinity

By default, we should spread the etcd pods that belong to one cluster onto different nodes, so the etcd cluster can achieve better availability.

But we should also support the non-spreading case for quick deployment setups where there is only one kube node.

This should be achieved by adding a policy to the TPR cluster creation spec.
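
In plain Kubernetes terms, the spreading behaviour corresponds to pod anti-affinity keyed on the labels the operator puts on etcd pods; a sketch (illustrative only; the operator would generate this inside the member pod template):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          etcd_cluster: example-etcd-cluster
      topologyKey: kubernetes.io/hostname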

update etcd cluster version

Add the etcd version to the cluster Spec.

If the version is updated, we should update the pod spec of the etcd members one by one, and one version at a time (2.3.1 -> 2.3.2 -> 2.3.3, never 2.3.1 -> 2.3.3 directly).

We should only start a version update when reconciliation matches (the cluster is in its desired state).

heavy load on test infra leads to unexpected timeout

From last failure:

=== RUN   TestDisasterRecovery
created etcd cluster: test-etcd-e3giv
reached to 3 members cluster
deleting etcd cluster: test-etcd-e3giv
ready: [], unready: [test-etcd-e3giv-0003 test-etcd-e3giv-0004]
kube-etcd-controller logs ===
time="2016-10-05T17:29:01Z" level=info msg="Expected membership: test-etcd-e3giv-0002,test-etcd-e3giv-0000,test-etcd-e3giv-0001" 
time="2016-10-05T17:29:01Z" level=info msg="Finish Reconciling" 
time="2016-10-05T17:29:06Z" level=info msg="Reconciling:" 
time="2016-10-05T17:29:06Z" level=info msg="Running pods: test-etcd-e3giv-0002,test-etcd-e3giv-0000" 
time="2016-10-05T17:29:06Z" level=info msg="Expected membership: test-etcd-e3giv-0000,test-etcd-e3giv-0001,test-etcd-e3giv-0002" 
time="2016-10-05T17:29:06Z" level=info msg="Recovering one member" 
time="2016-10-05T17:29:06Z" level=info msg="removed member (test-etcd-e3giv-0001) with ID (5161869728712750051)" 
time="2016-10-05T17:29:26Z" level=error msg="fail to add new member (test-etcd-e3giv-0003): timeout to add etcd member" 
time="2016-10-05T17:29:26Z" level=info msg="Finish Reconciling" 
time="2016-10-05T17:29:26Z" level=error msg="fail to reconcile: timeout to add etcd member" 
time="2016-10-05T17:29:31Z" level=info msg="Reconciling:" 
time="2016-10-05T17:29:31Z" level=info msg="Running pods: test-etcd-e3giv-0002" 
time="2016-10-05T17:29:31Z" level=info msg="Expected membership: test-etcd-e3giv-0000,test-etcd-e3giv-0002" 
time="2016-10-05T17:29:31Z" level=info msg="Disaster recovery" 
time="2016-10-05T17:29:31Z" level=info msg="Made a latest backup successfully" 
time="2016-10-05T17:29:41Z" level=info msg="created cluster (test-etcd-e3giv) with seed member (test-etcd-e3giv-0003)" 
time="2016-10-05T17:29:41Z" level=info msg="Finish Reconciling" 
time="2016-10-05T17:29:46Z" level=info msg="Reconciling:" 
time="2016-10-05T17:29:46Z" level=info msg="Running pods: test-etcd-e3giv-0003" 
time="2016-10-05T17:29:46Z" level=info msg="Expected membership: test-etcd-e3giv-0003" 

kube-etcd-controller logs END ===
--- FAIL: TestDisasterRecovery (165.06s)

The etcd controller seems to be running correctly and is recovering the cluster. But disaster recovery took very long, even taking into account that we back off on transient failures.

My suggestion is that we should detect whether the controller is frozen:

  • whether the controller is dead.
  • whether the controller has made any positive progress in recovering the cluster to the desired state.

etcd Clients are not closed

From the etcd clientv3 godocs:

Make sure to close the client after using it. If the client is not closed, the connection will have leaky goroutines.

The way we handle clients now is pretty ad-hoc. I figure that we'll probably end up doing one of these:

  • Put clients in a managed pool
  • Deal with client creation/cleanup as part of event loop

In either case, this issue exists to remind us to do it once we've got client lifecycle sorted.

e2e TestDisasterRecovery failed

controller panic with

panic: TODO: handle failure of backupnow request
goroutine 682 [running]:
panic(0x13a0180, 0xc4203e7180)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/kube-etcd-controller/pkg/cluster.(*Cluster).disasterRecovery(0xc4203a3e00, 0xc42071fd70, 0x1, 0x13a0180)
        /home/jenkins/workspace/build-152/gopath/src/github.com/coreos/kube-etcd-controller/pkg/cluster/reconcile.go:172 +0x347
github.com/coreos/kube-etcd-controller/pkg/cluster.(*Cluster).reconcile(0xc4203a3e00, 0xc42071fbc0, 0x0, 0x0)
        /home/jenkins/workspace/build-152/gopath/src/github.com/coreos/kube-etcd-controller/pkg/cluster/reconcile.go:73 +0x574
github.com/coreos/kube-etcd-controller/pkg/cluster.(*Cluster).run(0xc4203a3e00)
        /home/jenkins/workspace/build-152/gopath/src/github.com/coreos/kube-etcd-controller/pkg/cluster/cluster.go:124 +0x3a7
created by github.com/coreos/kube-etcd-controller/pkg/cluster.new
        /home/jenkins/workspace/build-152/gopath/src/github.com/coreos/kube-etcd-controller/pkg/cluster/cluster.go:83 +0x19d

panic: failed to create seed member (etcd-cluster-0000): services "etcd-cluster-0000" already exists

my kube-etcd-controller is in a crash loop:

NAME                                READY     STATUS             RESTARTS   AGE
po/kube-etcd-controller             0/1       CrashLoopBackOff   287        1d

2016-09-29T11:31:54.215836829Z time="2016-09-29T11:31:54Z" level=info msg="finding existing clusters..." 
2016-09-29T11:31:54.228707354Z time="2016-09-29T11:31:54Z" level=info msg="etcd cluster controller starts running..." 
2016-09-29T11:31:54.230494618Z time="2016-09-29T11:31:54Z" level=info msg="watching at " 
2016-09-29T11:31:54.232252069Z time="2016-09-29T11:31:54Z" level=info msg="etcd cluster event: ADDED cluster.EtcdCluster{TypeMeta:unversioned.TypeMeta{Kind:\"EtcdCluster\", APIVersion:\"coreos.com/v1\"}, ObjectMeta:api.ObjectMeta{Name:\"etcd-cluster\", GenerateName:\"\", Namespace:\"default\", SelfLink:\"/apis/coreos.com/v1/namespaces/default/etcdclusters/etcd-cluster\", UID:\"20957ab4-8543-11e6-8a25-5600000827f5\", ResourceVersion:\"2096521\", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:63610640203, nsec:0, loc:(*time.Location)(0x268e280)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Spec:cluster.Spec{Size:1, AntiAffinity:false, Version:\"\", Backup:(*backup.Policy)(0xc42019efa0), HostNetwork:false}}" 
2016-09-29T11:31:54.266466557Z panic: failed to create seed member (etcd-cluster-0000): services "etcd-cluster-0000" already exists
2016-09-29T11:31:54.266511504Z 
2016-09-29T11:31:54.266521782Z goroutine 1 [running]:
2016-09-29T11:31:54.266528296Z panic(0x18ac1a0, 0xc420355990)
2016-09-29T11:31:54.266534740Z  /usr/local/go/src/runtime/panic.go:500 +0x1a1
2016-09-29T11:31:54.266542431Z github.com/coreos/kube-etcd-controller/pkg/cluster.new(0xc420128240, 0xc420286250, 0xc, 0x1aadd93, 0x7, 0xc4201b6750, 0x465001, 0x18)
2016-09-29T11:31:54.266548891Z  /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/pkg/cluster/cluster.go:72 +0x1dd
2016-09-29T11:31:54.266555616Z github.com/coreos/kube-etcd-controller/pkg/cluster.New(0xc420128240, 0xc420286250, 0xc, 0x1aadd93, 0x7, 0xc4201b6750, 0x0)
2016-09-29T11:31:54.266562057Z  /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/pkg/cluster/cluster.go:54 +0x62
2016-09-29T11:31:54.266571033Z github.com/coreos/kube-etcd-controller/pkg/controller.(*Controller).Run(0xc420419770)
2016-09-29T11:31:54.266576696Z  /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/pkg/controller/controller.go:78 +0x57d
2016-09-29T11:31:54.266582835Z main.main()
2016-09-29T11:31:54.266588681Z  /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/cmd/controller/main.go:30 +0x5e

Support parameterized namespace

Currently we only use the "default" namespace; we should have a parameter for it.
One use case: in a testing environment, we would have our tests run in a "kube-etcd-${RAND}" namespace.

controller creates TPR

Brendan Burns had this idea that the controller would just automatically create the TPR when it starts.

Only single member clusters can be used as Seeds

Currently, when an EtcdCluster is created with a seed cluster specified that has more than one member, it panics with:

panic: seed cluster contains more than one member

In order to get bootkube (#128) working, I have explored moving from a single-member etcd cluster to a multi-member cluster and have had difficulty performing the migration without losing consensus in the process. The closest issue I could find to this problem is etcd-io/etcd#6420, but from asking around it seems to be a known issue.

What would be required to migrate to a multi-member cluster? Are there strategies for preventing quorum loss when moving from single- to multi-member clusters?
@xiang90 @hongchaodeng

Backoff on transient error of removeOneMember()

The controller encountered an error during removeOneMember():

panic: etcdcli failed to remove one member: etcdserver: unhealthy cluster

From etcd logs:

2016-10-05 04:19:45.063696 I | membership: added member d794859df5e93fae [http://test-etcd-at8eb-0001:2380] to cluster fc2cfeb0a8515d85
2016-10-05 04:19:45.063827 I | membership: added member 869e996aa2b12196 [http://test-etcd-at8eb-0002:2380] to cluster fc2cfeb0a8515d85
2016-10-05 04:19:45.063956 I | membership: added member 29e1e8156d620351 [http://test-etcd-at8eb-0003:2380] to cluster fc2cfeb0a8515d85
2016-10-05 04:19:45.064188 I | membership: added member 9c83fedd19ba5431 [http://test-etcd-at8eb-0004:2380] to cluster fc2cfeb0a8515d85
2016-10-05 04:19:45.070298 I | etcdserver: published {Name:test-etcd-at8eb-0004 ClientURLs:[http://test-etcd-at8eb-0004:2379]} to cluster fc2cfeb0a8515d85
2016-10-05 04:19:45.070353 I | embed: ready to serve client requests
2016-10-05 04:19:45.070641 N | embed: serving insecure client requests on [::]:2379, this is strongly discouraged!
2016-10-05 04:19:49.154008 W | etcdserver: reconfigure breaks active quorum, rejecting remove member 9c83fedd19ba5431

The problem is that only ~4s had elapsed, while etcd needs 5s for its strict quorum check. We can safely retry this error.

Parameterize image option for build script

Currently, the build script is hard-coded to use

gcr.io/coreos-k8s-scale-testing/kubeetcdctrl:latest

If someone changes the code and wants to test or play with the changes, they need to hack the build scripts. For better collaboration, we should support a custom image option.

controller crashes if you don't delete service

I deleted etcd-cluster-0000 from the readme example, and then the controller crashed:

2016-08-17 17:00:16.126909 I | etcd cluster controller starts running...
2016-08-17 17:00:16.156275 I | start watching...
2016-08-17 17:00:16.157034 I | etcd cluster event: ADDED main.EtcdCluster{Kind:"EtcdCluster", ApiVersion:"coreos.com/v1", Metadata:map[string]string{"name":"etcd-cluster", "namespace":"default", "selfLink":"/apis/coreos.com/v1/namespaces/default/etcdclusters/etcd-cluster", "uid":"f5625360-646b-11e6-a0d3-ba2a3bb04016", "resourceVersion":"1686", "creationTimestamp":"2016-08-17T11:15:52Z"}, Size:3}
panic: services "etcd-cluster-0000" already exists

goroutine 27 [running]:
panic(0x18e2f80, 0xc4200ce800)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
main.(*Cluster).create(0xc420400ed0, 0x3)
        /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/cluster.go:91 +0x473
main.(*Cluster).run(0xc420400ed0)
        /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/cluster.go:71 +0x158
created by main.newCluster
        /home/ubuntu/code/golang/src/github.com/coreos/kube-etcd-controller/cluster.go:42 +0x12b
