Comments (9)
Thanks @timuthy for summarizing the issue.
from etcd-druid.
/status in-progress
@amshuman-kr, @shreyas-s-rao and I iterated again over the approach and agreed that we find the `EtcdMember` design confusing because the resource itself consists only of a status and is not meant to be reconciled by any controller.
Instead, the backup-restore sidecar can contribute to the corresponding `pod.status.conditions` with a custom condition type. Extending pod conditions is a supported use case and enables us to maintain extra status information about each etcd member. Druid can then consult these custom conditions to aggregate the status into its `AllMembersReady` and `Ready` conditions on `etcd.status.conditions`.
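To illustrate the aggregation idea, here is a minimal sketch in Go. It does not use the real Kubernetes client types; `PodCondition`, `MemberReadyType` and `aggregate` are hypothetical names made up for this example, and treating `Ready` as "a quorum of members is ready" is an assumption rather than something decided in this thread.

```go
package main

import "fmt"

// PodCondition mirrors the shape of a pod condition (type plus status).
// It is a stand-in for the real Kubernetes type, kept local so the sketch is self-contained.
type PodCondition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// MemberReadyType is a hypothetical custom condition type that the
// backup-restore sidecar would set on its own pod.
const MemberReadyType = "druid.gardener.cloud/EtcdMemberReady"

// memberReady reports whether the custom condition is present and "True".
func memberReady(conds []PodCondition) bool {
	for _, c := range conds {
		if c.Type == MemberReadyType {
			return c.Status == "True"
		}
	}
	return false // a missing condition is treated as not ready
}

// aggregate derives two cluster-level results from the per-pod conditions:
// allMembersReady is true if every member is ready, ready is true if a
// quorum (n/2 + 1) of members is ready (assumed semantics).
func aggregate(pods map[string][]PodCondition) (allMembersReady, ready bool) {
	if len(pods) == 0 {
		return false, false
	}
	readyCount := 0
	for _, conds := range pods {
		if memberReady(conds) {
			readyCount++
		}
	}
	return readyCount == len(pods), readyCount >= len(pods)/2+1
}

func main() {
	pods := map[string][]PodCondition{
		"etcd-0": {{Type: MemberReadyType, Status: "True"}},
		"etcd-1": {{Type: MemberReadyType, Status: "True"}},
		"etcd-2": {{Type: MemberReadyType, Status: "False"}},
	}
	all, ready := aggregate(pods)
	fmt.Printf("AllMembersReady=%t Ready=%t\n", all, ready)
}
```

With two of three members reporting the custom condition as `True`, the sketch yields a cluster that is `Ready` but not `AllMembersReady`.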
#207 describes an alternative of using leases instead of a new `EtcdMember` resource.
After working in the problem space we realized that the `Lease` approach is not ideal, as an `etcd-member` has different status information that should be exposed to etcd-druid in order to take appropriate action, e.g.:
- Is `peer-tls` enabled for the member -> lets Druid trigger an appropriate rollout if changed from `disabled -> enabled` (cc @unmarshall)
- Is the disk corrupted -> lets Druid know to take adequate action, e.g. by triggering quorum restoration (cc @abdasgupta)
After a discussion with @unmarshall, we think it's best to revive the idea of the `EtcdMember` resource instead of continuing to maintain this information in `Lease`s via annotations and other hacks. The design of the `EtcdMember` API can of course be revisited and revised, e.g. not having a `status` but another field.
However, there can also be arguments to have `.spec` and `.status` fields. For instance, via a `.spec` it'd be possible for Druid to trigger a data archiving/movement before quorum restoration (cc @vlerenc).
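As a rough sketch of what such a resource could carry, the Go types below model an `EtcdMember` with both `.spec` and `.status`. All field names (`ArchiveData`, `PeerTLSEnabled`, `DiskCorrupted`), the `Action` values and the `decide` helper are hypothetical and only serve to illustrate the two examples above; the actual API design is still open.

```go
package main

import "fmt"

// EtcdMemberSpec holds requests from Druid to the member (hypothetical field).
type EtcdMemberSpec struct {
	ArchiveData bool // ask the member to archive/move its data, e.g. before quorum restoration
}

// EtcdMemberStatus holds what the member reports about itself (hypothetical fields).
type EtcdMemberStatus struct {
	PeerTLSEnabled bool
	DiskCorrupted  bool
}

// EtcdMember is a simplified stand-in for the proposed resource.
type EtcdMember struct {
	Name   string
	Spec   EtcdMemberSpec
	Status EtcdMemberStatus
}

// Action is what Druid could decide to do for a member (illustrative values).
type Action string

const (
	ActionNone              Action = "None"
	ActionRolloutForPeerTLS Action = "RolloutForPeerTLS"
	ActionHandleCorruptDisk Action = "HandleCorruptDisk"
)

// decide maps the reported member status to an action.
// peerTLSDesired is the desired state taken from the Etcd resource.
func decide(m EtcdMember, peerTLSDesired bool) Action {
	switch {
	case m.Status.DiskCorrupted:
		return ActionHandleCorruptDisk
	case peerTLSDesired && !m.Status.PeerTLSEnabled:
		return ActionRolloutForPeerTLS
	default:
		return ActionNone
	}
}

func main() {
	m := EtcdMember{Name: "etcd-0", Status: EtcdMemberStatus{PeerTLSEnabled: false}}
	fmt.Println(m.Name, decide(m, true))
}
```

Keeping requests in `.spec` and observations in `.status` follows the usual Kubernetes split between desired and observed state, which is the argument for having both fields.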
For etcd-druid to do deterministic recovery/reconciliation, it's important that individual etcd members publish their up-to-date information. Outside Gardener, if etcd-druid is not used, any other external actor can be the consumer of this information, which gives it a holistic view of the state of all members in the etcd cluster.
/assign
> Is the disk corrupted -> lets Druid know to take adequate action, e.g. by triggering quorum restoration (cc @abdasgupta)

Just mentioning a few points:
- When the disk gets corrupted for one cluster member in a multi-node etcd cluster, Druid does not get involved. Please refer to this doc.
- When there is a disk failure for a majority of cluster members in a multi-node etcd cluster, manual intervention is required, as this is a case of permanent quorum loss, which is unlikely to occur.
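The distinction between the two cases comes down to whether the remaining healthy members still form a quorum. A minimal sketch in Go, where `quorum` and `classify` are hypothetical helper names and the returned strings are only illustrative:

```go
package main

import "fmt"

// quorum returns the number of members needed for an etcd cluster of size n
// to make progress (a strict majority).
func quorum(n int) int {
	return n/2 + 1
}

// classify tells which of the two cases applies for a cluster with
// total members, of which corrupted have a failed or corrupted disk.
func classify(total, corrupted int) string {
	healthy := total - corrupted
	if healthy >= quorum(total) {
		// Quorum is intact: the affected member can recover on its own,
		// so Druid does not get involved.
		return "quorum intact: member-level recovery"
	}
	// Quorum is lost permanently: manual intervention is required.
	return "permanent quorum loss: manual intervention"
}

func main() {
	fmt.Println(classify(3, 1))
	fmt.Println(classify(3, 2))
}
```

For a three-member cluster the quorum is two, so one corrupted disk leaves the quorum intact while two corrupted disks do not.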
The state of each etcd member is beneficial not only for deciding the order of rolling the members during an update, but also for deciding the order in which to roll volumes (due to changes in volume size, storage class, etc). Exact details of how the statefulset can be prepared and the volumes then rolled without downtime for a multi-node etcd cluster are captured in this comment from @unmarshall -> gardener/gardener-extension-provider-aws#646
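One way member state could feed into a rolling order is sketched below in Go. The policy shown (members that are not ready first, then ready followers, the leader last) is an assumption for illustration and is not taken from the linked comment; `Member`, `rank` and `rollOrder` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Member is a simplified view of the state an etcd member would publish.
type Member struct {
	Name   string
	Ready  bool
	Leader bool
}

// rank assigns a rolling priority: lower values are rolled earlier.
// Assumed policy: not-ready members first, ready followers next, leader last.
func rank(m Member) int {
	switch {
	case !m.Ready:
		return 0
	case !m.Leader:
		return 1
	default:
		return 2
	}
}

// rollOrder returns member names in the order they would be rolled.
// The input slice is copied so the caller's slice is left untouched.
func rollOrder(members []Member) []string {
	sorted := append([]Member(nil), members...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return rank(sorted[i]) < rank(sorted[j])
	})
	names := make([]string, 0, len(sorted))
	for _, m := range sorted {
		names = append(names, m.Name)
	}
	return names
}

func main() {
	members := []Member{
		{Name: "etcd-0", Ready: true, Leader: true},
		{Name: "etcd-1", Ready: true},
		{Name: "etcd-2", Ready: false},
	}
	fmt.Println(rollOrder(members))
}
```

The same ordering function could be applied to volumes, since each volume belongs to exactly one member.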