Comments (18)
@tmjd Is it feasible to incorporate the suggested solution, that is, to exclude tainted nodes from the node count?
from operator.
That might possibly work. I think we'd want to ensure that typha always deploys at least one instance in that case, though. If all the nodes received the eks.amazonaws.com/nodegroup=unschedulable:NoSchedule taint, things would break if typha were scaled to zero.
Thinking a bit more about this: it might be dangerous to do what was suggested. If all the node groups in a large cluster were being upgraded at once, every node would be ignored for auto-scaling purposes, and you could be reduced to a single typha for hundreds of nodes, which is not recommended for multiple reasons.
Perhaps, instead of ignoring all nodes with that taint, the auto-scaler should ensure that there are fewer typhas than nodes whenever it finds any node carrying that taint. That would leave an available node that typha could be moved to, allowing nodes to be updated.
In large clusters with many nodes, typha would not get scaled down at all, but in small clusters, room would be made for typha to be evicted.
@tmjd I tested with the latest calico operator (v1.21.0), which is under review in aws/amazon-vpc-cni-k8s#1578, and observed that the upgrade to a new k8s release, node deletion, and cluster auto-scaling use cases are working fine. I just wanted to double-check whether the issue has been fixed in the latest release.
I tested the k8s upgrade (two-node cluster) from 1.19 to 1.20 to 1.21, and the upgrade worked fine both times. I also verified that once the "scheduling disabled" taint was added to a node, the calico-typha pod was evicted.
I made the same observation with cluster auto-scaling and with manually adding/deleting nodes in the managed node group. Please note that I tested with a minimum of two nodes, and two typhas were always available.
If the above observations are expected given the changes, then we can wait for the latest changes to be merged. Kindly confirm.
We didn't explicitly fix this issue (AFAIK), but this newer version does include ignoring Unschedulable nodes for typha, so perhaps that has addressed the issue.
Thanks for the confirmation!
Even though it sounds like this might be resolved on EKS, I thought I'd mention this here.
We're expecting to discuss issues around scaling and Typha next week (Sept 8th, 2021) in the Calico community meeting. If anyone here wants to be part of that discussion, feel free to join. You can find the details here: https://github.com/projectcalico/community
@amitkatyal I can confirm your observation that it is working now with the latest version of the operator. I tested both upgrade from EKS 1.20 to 1.21 and also cluster-autoscaler in 1.21. Thank you for testing and sharing!
@tscn Thanks for the confirmation!
@amitkatyal @tmjd While running continuous tests in preparation for a release, we came to the conclusion that the issue still occurs sometimes. We are seeing node group updates where the typha pod is continuously evicted and immediately rescheduled on the same node, i.e. the desired replica count is still higher than the number of untainted nodes in the cluster. The upgrade finally fails with a PodEvictionFailure. We have not found any other cause, so I assume there is some kind of race condition between rescheduling the pod and cordoning the node. We are now trying to switch to operator version 1.22.0, which allows controlPlaneTolerations to be applied to typha pods (as suggested in projectcalico/calico#4695).
I can re-open this issue if it makes sense to look into excluding tainted nodes from the node count for typha, but the solution proposed in https://github.com/tigera/operator/pull/1506/files will also work for us (as long as a single typha instance is sufficient to ensure reliable network policy enforcement in small clusters as well).
@tscn Thanks for the update!
Could you please explain how adding controlPlaneTolerations will fix the issue? As we are deploying Calico using the manifest, how do we apply the controlPlaneTolerations? Does it require modifying the operator manifest?
Regarding the https://github.com/tigera/operator/pull/1506/files solution, I see the PR is still open. Is there a plan to merge this PR, as it would indeed fix the issue?
Could you also confirm that updating an existing cluster to the new release will not cause any issues?
@tscn Could you please clarify? Thanks!!
@amitkatyal I'm just the author of the issue and not a Tigera member or contributor, so questions regarding merging PRs and version compatibility have to be answered by @tmjd or someone else.
The controlPlaneTolerations can be set in the Installation.spec - https://docs.projectcalico.org/reference/installation/api#operator.tigera.io/v1.InstallationSpec. We changed the NoSchedule toleration to use the Equal operator with a custom key/value. After some test iterations it seems to work (though the error is random anyway). We then ran into the problem that the operator pod itself caused a node upgrade to fail: it also has these generous tolerations, and it was rescheduled again and again onto the node that was being evicted by the AWS EKS node upgrade procedure. So we have to patch the operator manifest to remove or change the tolerations.
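Narrowing the operator pod's own tolerations could be done with a patch along these lines (a sketch only: the deployment and namespace names match the upstream tigera-operator manifest, but the key/value pair is a placeholder you must replace with your own):

```yaml
# Strategic-merge patch (e.g. applied via kustomize) that replaces the
# tigera-operator deployment's default tolerate-everything tolerations
# with a single narrow toleration, so the operator pod is not rescheduled
# onto a node that is being drained.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tigera-operator
  namespace: tigera-operator
spec:
  template:
    spec:
      tolerations:
      - effect: NoSchedule
        operator: Equal
        key: your-custom-key      # placeholder: choose your own key
        value: your-custom-value  # placeholder: choose your own value
```

Because the pod spec's tolerations list has no merge key, a strategic-merge patch replaces the whole list rather than appending to it, which is the desired behavior here.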
@tscn Thanks for the update!
@tmjd Could you please help with how we can modify the controlPlaneTolerations (without manual intervention) if we are deploying Calico using the operator manifest recommended by AWS?
As per my understanding, the typha pod is created by the tigera-operator. If so, how do we control the typha pod's tolerations (controlPlaneTolerations) from the tigera-operator manifest?
@amitkatyal controlPlaneTolerations can be applied like this:

```yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  controlPlaneTolerations:
  - effect: NoSchedule
    operator: Equal
    key: your-custom-key
    value: your-custom-value
```
From what I saw, it is not possible to override the tolerations with an empty array; in that case the tolerate-everything defaults are applied again. This means you have to provide a key/value combination of your choosing, but it should be different from the one AWS uses.
An update from our testing: it works, and if you also override the tolerations for the tigera-operator deployment, node group updates for 2-node clusters (one node group per AZ in a 2-AZ deployment) are successful. Tested for the upgrade scenario where we switch the AMI from 1.20 to 1.21.
The bad news is that Cluster Autoscaler still does not work. I am not sure whether I did something wrong in my first test or whether it is a random issue, but in a test on EKS 1.21 with tigera-operator 1.22.0 it does not work. CA reports the same message described here: projectcalico/calico#4695 (comment)
@tscn Thanks for your inputs! Will try as per your suggestion.
Note: tigera-operator versions < v1.22.0 did not apply controlPlaneTolerations to Typha. (This was fixed in ef4c15f.)