MD.Status.ReadyReplicas changes from 3 to 0 when machineset_controller updateStatus() hits "Unable to retrieve Node status" error about cluster-api HOT 11 OPEN

jessehu commented on September 21, 2024

MD.Status.ReadyReplicas changes from 3 to 0 when machineset_controller updateStatus() hits "Unable to retrieve Node status" error

from cluster-api.

Comments (11)

JoelSpeed commented on September 21, 2024

Yes, I think we may want to take a leaf out of KCM's book here and not immediately flick to the unready state. I would expect that, in the real world, users monitor things like unready nodes and want to remediate that situation. Going unready early may lead to false-positive alerts.

I think in this case specifically (ErrClusterLocked), we would want to leave the Nodes in whatever state they were previously in and requeue the request to try again in, say, 20s. Do we currently track when we last observed the Node? We probably also want a timeout on this behaviour: if we haven't seen the Node in x time, we assume its status is unknown.

jessehu commented on September 21, 2024

I made a PR to fix this bug with a simple approach (not implementing unknownReplicas). Please kindly take a look. Thanks!

fabriziopandini commented on September 21, 2024

/triage needs-discussion

I would like to get @vincepri, @sbueringer, and @JoelSpeed opinions on this one.

Currently, we consider a replica available only when we know its node is ready, and this seems semantically correct to me.

The downside of this formulation is that available can flick whenever the node status changes, or whenever there are connection problems to the workload cluster and we cannot retrieve the node status anymore (like in this use case).

If this is still the behavior we all want, then IMO the behavior of KCP and MD is correct: they should both reduce the number of available replicas based on the info available at a given time.

However, what we can do is:

Possibly consider if and how to make MS recover more quickly in case of ErrClusterLocked, with the caveat that the time of the next reconciliation is not deterministic, because it depends on the content of the controller queue, the number of workers, etc.
Possibly consider whether we need a way to be more explicit about what's going on (e.g. unknownReplicas).
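To make the second option concrete, here is a rough sketch of what a separate unknown bucket could look like when counting replicas. This is illustrative only: `UnknownReplicas`, `nodeState`, and `summarize` are hypothetical names, not fields or functions of the Cluster API types.

```go
package main

import "fmt"

// nodeState is a hypothetical three-way view of what the controller knows about a Node.
type nodeState int

const (
	nodeReady nodeState = iota
	nodeNotReady
	nodeUnknown // the Node status could not be retrieved
)

// replicaStatus is a hypothetical status shape; UnknownReplicas does not exist in the real API.
type replicaStatus struct {
	ReadyReplicas   int32
	UnknownReplicas int32
}

// summarize counts ready and unknown replicas separately, so a failed lookup
// is reported as "unknown" rather than silently lowering the ready count.
func summarize(states []nodeState) replicaStatus {
	var s replicaStatus
	for _, st := range states {
		switch st {
		case nodeReady:
			s.ReadyReplicas++
		case nodeUnknown:
			s.UnknownReplicas++
		}
	}
	return s
}

func main() {
	// Three replicas whose Node status could not be retrieved.
	s := summarize([]nodeState{nodeUnknown, nodeUnknown, nodeUnknown})
	fmt.Printf("ready=%d unknown=%d\n", s.ReadyReplicas, s.UnknownReplicas) // prints "ready=0 unknown=3"
}
```

ReadyReplicas still drops to 0 in this scenario, but a consumer of the status can tell "not ready" apart from "could not check".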

jessehu commented on September 21, 2024

Thanks @fabriziopandini. The ErrClusterLocked error should be gone in a short time, so marking the Node as notReady or as an unknown replica immediately after hitting ErrClusterLocked might be overly reactive. Also consider that kube-controller-manager marks a Node as unhealthy only after it has been unresponsive for 40s:

--node-monitor-grace-period duration     Default: 40s
Amount of time which we allow running Node to be unresponsive before marking it unhealthy.
Must be N times more than kubelet's nodeStatusUpdateFrequency, where N means number
of retries allowed for kubelet to post node status.

jessehu commented on September 21, 2024

BTW, this could also be impacted by #9810, discussed in #10165 (comment).

jessehu commented on September 21, 2024

/area machineset

fabriziopandini commented on September 21, 2024

/priority important-longterm

chrischdi commented on September 21, 2024

Note: CAPI v1.7 has fixes for the ErrClusterLocked error:

#9810

Did we already test if this is still an issue with v1.7.0-rc.1 or so? (v1.7.0 will be released today)

jessehu commented on September 21, 2024

Thanks @chrischdi. That PR solves the reconcile latency when hitting ErrClusterLocked errors. This issue describes "MD.Status.ReadyReplicas changes from 3 to 0" when hitting the ErrClusterLocked error for the first time.

chrischdi commented on September 21, 2024

Yeah, I just noticed this when reading the code:

The fix may not directly change the above behavior, but it would lead to retrying every minute.

k8s-ci-robot commented on September 21, 2024

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
