Comments (11)
Yes, I think we may want to take a leaf out of KCM's book here and not immediately flick to the unready state. I would expect that in the real world users monitor things like unready nodes and want to remediate that situation. Going unready early may lead to false-positive alerts.
I think in this case specifically, ErrClusterLocked, we would want to leave the Nodes in whatever state they were previously in and requeue the request to try again in, say, 20s. Do we currently track when we last observed the Node? We probably also want a timeout on this behaviour: if we haven't seen the Node in x time, then we assume its status is unknown.
from cluster-api.
I made a PR to fix this bug with a simple approach (not implementing unknownReplicas). Please kindly take a look. Thanks!
/triage needs-discussion
I would like to get @vincepri, @sbueringer, and @JoelSpeed opinions on this one.
Currently, we consider a replica available when we know its Node is ready, and this seems semantically correct to me.
The downside of this formulation is that availability can flick whenever the node status changes, or whenever there are connection problems to the workload cluster and we can no longer retrieve the node status (as in this use case).
If this is still the behavior we all want, then IMO the behavior of KCP and MD is correct: they should both reduce the number of available replicas based on the info available at a given time.
However, what we can do is:
- Possibly consider if and how to make MS recover more quickly in case of ErrClusterLocked. This comes with the caveat that the time of the next reconciliation is not deterministic, because it depends on the content of the controller queue, the number of workers, etc.
- Possibly consider whether we need a way to be more explicit about what's going on (e.g. unknownReplicas)
Thanks @fabriziopandini. The ErrClusterLocked error should go away after a short time, so marking the Node as not ready, or the replica as unknown, immediately after hitting ErrClusterLocked might be overly reactive. For comparison, kube-controller-manager only marks a Node as unhealthy after it has been unresponsive for 40s:
--node-monitor-grace-period duration Default: 40s
Amount of time which we allow running Node to be unresponsive before marking it unhealthy.
Must be N times more than kubelet's nodeStatusUpdateFrequency, where N means number
of retries allowed for kubelet to post node status.
BTW, this could also be impacted by #9810, discussed in #10165 (comment).
/area machineset
/priority important-longterm
Note: CAPI v1.7 has fixes for the ErrClusterLocked error:
Did we already test if this is still an issue with v1.7.0-rc.1 or so? (v1.7.0 will be released today)
Thanks @chrischdi. That PR solves the reconcile latency when hitting ErrClusterLocked errors. This issue describes a different problem: "MD.Status.ReadyReplicas changes from 3 to 0" when hitting the ErrClusterLocked error for the first time.
Yeah, I just noticed that when reading the code:
The fix may not directly change the above behavior, but it would lead to retrying every minute.
This issue is currently awaiting triage.
CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.