MD.Status.ReadyReplicas changes from 3 to 0 when machineset_controller updateStatus() hits "Unable to retrieve Node status" error (cluster-api issue, open)

jessehu commented on September 21, 2024
MD.Status.ReadyReplicas changes from 3 to 0 when machineset_controller updateStatus() hits "Unable to retrieve Node status" error

from cluster-api.

Comments (11)

JoelSpeed commented on September 21, 2024

Yes, I think we may want to take a leaf out of KCM's book here and not immediately flip to the unready state. I would expect that, in the real world, users monitor things like unready nodes and want to remediate that situation. Going unready early may lead to false-positive alerts.

I think in this case specifically, with ErrClusterLocked, we would want to leave the Nodes in whatever state they were previously in and requeue the request to try again in, say, 20s. Do we currently track when we last observed the Node? We probably also want a timeout on this behaviour: if we haven't seen the Node in x time, then we assume its status is unknown.
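To make the idea concrete, here is a minimal Go sketch of that decision, not the actual machineset_controller code. Every name in it (`resolveNodeState`, `nodeState`, `staleAfter`, the local `errClusterLocked`) is hypothetical, the 5-minute timeout is an arbitrary stand-in for "x time", and it assumes the controller records how long ago it last saw the Node successfully.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errClusterLocked is a local stand-in for the ErrClusterLocked error
// discussed in this thread; it is not the real cluster-api value.
var errClusterLocked = errors.New("cluster is locked already")

const (
	requeueAfter = 20 * time.Second // retry delay suggested above ("say 20s")
	staleAfter   = 5 * time.Minute  // hypothetical timeout (the "x time" above)
)

type nodeState string

const (
	stateReady    nodeState = "Ready"
	stateNotReady nodeState = "NotReady"
	stateUnknown  nodeState = "Unknown"
)

// resolveNodeState decides which state to report for a Node.
// prev is the previously reported state, sinceLastSeen is the time elapsed
// since the Node was last observed successfully, and freshReady/err are the
// result of the current attempt to read the Node.
func resolveNodeState(prev nodeState, sinceLastSeen time.Duration, freshReady bool, err error) (nodeState, time.Duration) {
	if err == nil {
		// A fresh observation always wins and needs no extra requeue.
		if freshReady {
			return stateReady, 0
		}
		return stateNotReady, 0
	}
	if errors.Is(err, errClusterLocked) && sinceLastSeen < staleAfter {
		// Transient lock error and the last observation is recent enough:
		// keep the previous state and try again shortly.
		return prev, requeueAfter
	}
	// Any other error, or the last observation is too old: status is unknown.
	return stateUnknown, requeueAfter
}

func main() {
	// Lock error 30s after the last good observation: previous state is kept.
	fmt.Println(resolveNodeState(stateReady, 30*time.Second, false, errClusterLocked))
	// Lock error 10m after the last good observation: the timeout has passed.
	fmt.Println(resolveNodeState(stateReady, 10*time.Minute, false, errClusterLocked))
}
```

The timeout is what keeps a stale "Ready" from being reported indefinitely if the workload cluster stays unreachable.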

jessehu commented on September 21, 2024

I made a PR to fix this bug with a simple approach (not implementing unknownReplicas). Please kindly take a look. Thanks!

fabriziopandini commented on September 21, 2024

/triage needs-discussion

I would like to get @vincepri, @sbueringer, and @JoelSpeed opinions on this one.

Currently, we consider a replica available when we know its Node is ready, and this seems semantically correct to me.

The downside of this formulation is that available can flip whenever the Node status changes, or whenever there are connection problems to the workload cluster and we can no longer retrieve the Node status (as in this use case).

If this is still the behavior we all want, then IMO the behavior of KCP and MD is correct: they should both reduce the number of available replicas based on the info available at a given time.

However, what we can do is:

  • Possibly consider if and how to make MS recover more quickly in case of ErrClusterLocked. This comes with the caveat that the time of the next reconciliation is not deterministic, because it depends on the content of the controller queue, the number of workers, etc.
  • Possibly consider whether we need a way to be more explicit about what's going on (e.g. unknownReplicas).
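As a rough illustration of the second option, the Go sketch below counts replicas into three buckets instead of two. It is only a sketch: `replicaCounts`, `countReplicas`, and the `nodeState` values are hypothetical names, and no `unknownReplicas` field is assumed to exist in the current API.

```go
package main

import "fmt"

// nodeState is a hypothetical three-valued view of a Node's readiness.
type nodeState int

const (
	nodeReady nodeState = iota
	nodeNotReady
	nodeUnknown // the Node status could not be retrieved
)

// replicaCounts is a hypothetical status shape with an explicit Unknown bucket.
type replicaCounts struct {
	Ready   int32
	Unready int32
	Unknown int32
}

// countReplicas tallies the observed states, so a failure to read Node status
// shows up as Unknown instead of silently lowering Ready.
func countReplicas(states []nodeState) replicaCounts {
	var c replicaCounts
	for _, s := range states {
		switch s {
		case nodeReady:
			c.Ready++
		case nodeNotReady:
			c.Unready++
		default:
			c.Unknown++
		}
	}
	return c
}

func main() {
	// Three replicas whose Node status could not be retrieved.
	c := countReplicas([]nodeState{nodeUnknown, nodeUnknown, nodeUnknown})
	fmt.Printf("%+v\n", c)
}
```

With this shape, the situation reported in this issue would surface as three unknown replicas, not as zero ready ones.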

jessehu commented on September 21, 2024

Thanks @fabriziopandini. The ErrClusterLocked error should go away within a short time, so marking the Node as notReady or as an unknown replica immediately after hitting ErrClusterLocked might be overly reactive. For comparison, kube-controller-manager marks a Node as unhealthy only after it has been unresponsive for 40s:

--node-monitor-grace-period duration     Default: 40s
Amount of time which we allow running Node to be unresponsive before marking it unhealthy.
Must be N times more than kubelet's nodeStatusUpdateFrequency, where N means number
of retries allowed for kubelet to post node status.

jessehu commented on September 21, 2024

BTW this could also be impacted by #9810, discussed in #10165 (comment)

jessehu commented on September 21, 2024

/area machineset

fabriziopandini commented on September 21, 2024

/priority important-longterm

chrischdi commented on September 21, 2024

Note: CAPI v1.7 has fixes for the ErrClusterLocked error:

Did we already test if this is still an issue with v1.7.0-rc.1 or so? (v1.7.0 will be released today)

jessehu commented on September 21, 2024

Thanks @chrischdi. That PR solves the reconcile latency when hitting ErrClusterLocked errors. This issue is about "MD.Status.ReadyReplicas changes from 3 to 0" when hitting the ErrClusterLocked error for the first time.

chrischdi commented on September 21, 2024

Yeah, I just noticed this when reading the code:

The fix may not directly change the above behavior, but it would lead to retrying every minute.

k8s-ci-robot commented on September 21, 2024

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
