Comments (18)
@tmjd Is it feasible to incorporate the suggested solution, that is, to exclude tainted nodes from the node count?
from operator.
That might possibly work. I think we'd want to ensure that typha always deploys at least one instance in that case, though. If all the nodes received the eks.amazonaws.com/nodegroup=unschedulable:NoSchedule taint, things would break if typha were scaled to zero.
Thinking a bit more about this: it might be dangerous to do what was suggested. If all the node groups in a large cluster were being upgraded at once, every node would be ignored for auto-scaling purposes, and you could be reduced to a single typha for hundreds of nodes, which is not recommended for multiple reasons.
Perhaps, instead of ignoring all nodes with that taint, the auto-scaler should ensure that there are fewer typhas than nodes whenever it finds any node carrying that taint. That would leave an available node that typha could be moved to, allowing nodes to be updated.
In large clusters with many nodes, typha would not get scaled down at all, but in small clusters, room would be made for typha to be evicted.
@tmjd I tested with the latest calico operator (v1.21.0), which is under review in aws/amazon-vpc-cni-k8s#1578, and observed that the upgrade to a new k8s release, node deletion, and cluster auto-scaling use cases are working fine. I just wanted to double-check whether the issue has been fixed in the latest release.
I tested the k8s upgrade (two-node cluster) from 1.19 to 1.20 to 1.21, and the upgrade worked fine both times. I also verified that once the "scheduling disabled" taint was added to a node, the calico-typha pod was evicted.
I made the same observation with cluster auto-scaling and with manually adding/deleting nodes in the managed node group. Please note that I tested with a minimum of two nodes, and two typhas were always available.
If the above observations are expected given the changes, then we can wait for the latest changes to be merged. Kindly confirm.
We didn't explicitly fix this issue (AFAIK), but this newer version does include ignoring Unschedulable nodes for typha, so perhaps that has addressed the issue.
Thanks for the confirmation!
Even though it sounds like this might be resolved on EKS, I thought I'd mention this here.
We're expecting to discuss issues around scaling and Typha next week (Sept 8th, 2021) in the Calico community meeting. If anyone here wants to be part of that discussion, feel free to join. You can find the details here: https://github.com/projectcalico/community
@amitkatyal I can confirm your observation that it is working now with the latest version of the operator. I tested both upgrade from EKS 1.20 to 1.21 and also cluster-autoscaler in 1.21. Thank you for testing and sharing!
@tscn Thanks for the confirmation!
@amitkatyal @tmjd While running continuous tests in preparation for a release, we came to the conclusion that the issue still occurs sometimes. We are seeing node group updates where the typha pod is continuously evicted and immediately rescheduled on the same node, i.e. the desired replica count is still higher than the number of untainted nodes in the cluster. The upgrade finally fails with a PodEvictionFailure. We have not found any other cause, so I assume there is some kind of race condition between rescheduling the pod and cordoning the node. We are now trying to switch to operator version 1.22.0, which allows controlPlaneTolerations to be applied to typha pods (as suggested in projectcalico/calico#4695).
I can re-open this issue if it makes sense to look into excluding tainted nodes from the node count for typha, but the solution proposed in https://github.com/tigera/operator/pull/1506/files will also work for us (as long as a single typha instance is sufficient to ensure reliable network policy enforcement in small clusters as well).
@tscn Thanks for the update!
Could you please explain how adding controlPlaneTolerations will fix the issue? As we are deploying Calico using the manifest, how do we apply the controlPlaneTolerations? Does it require modifying the operator manifest?
Regarding the https://github.com/tigera/operator/pull/1506/files solution, I see the PR is still open. Is there a plan to merge this PR, as it would indeed fix the issue?
Could you also confirm that updating an existing cluster to the new release will not cause any issues?
@tscn Could you please clarify? Thanks!!
@amitkatyal I'm just the author of the issue and not a Tigera member or contributor, so questions regarding merging PRs and version compatibility have to be answered by @tmjd or someone else.
The controlPlaneTolerations can be set in the Installation.spec - https://docs.projectcalico.org/reference/installation/api#operator.tigera.io/v1.InstallationSpec. We changed the NoSchedule toleration to use the Equal operator with a custom key/value. After some test iterations it seems to work (though the error is random anyway). We then ran into the problem that the operator pod itself caused a node upgrade to fail: it also has these generous tolerations, and it was rescheduled again and again onto the node that was being evicted by the AWS EKS node upgrade procedure. So we have to patch the operator manifest to remove or change the tolerations.
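Narrowing the operator pod's own tolerations could be done with a patch along these lines (a sketch only: the deployment and namespace names match the upstream tigera-operator manifest, but the key/value pair is a placeholder you must replace with your own):

```yaml
# Strategic-merge patch (e.g. applied via kustomize) that replaces the
# tigera-operator deployment's default tolerate-everything tolerations
# with a single narrow toleration, so the operator pod is not rescheduled
# onto a node that is being drained.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tigera-operator
  namespace: tigera-operator
spec:
  template:
    spec:
      tolerations:
      - effect: NoSchedule
        operator: Equal
        key: your-custom-key      # placeholder: choose your own key
        value: your-custom-value  # placeholder: choose your own value
```

Because the pod spec's tolerations list has no merge key, a strategic-merge patch replaces the whole list rather than appending to it, which is the desired behavior here.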
@tscn Thanks for the update!
@tmjd Could you please help with how we can modify the controlPlaneTolerations (without manual intervention) if we are deploying Calico using the operator manifest recommended by AWS?
As per my understanding, the typha pod is created by the tigera-operator. If so, how do we control the typha pod's tolerations (controlPlaneTolerations) from the tigera-operator manifest?
@amitkatyal controlPlaneTolerations can be applied like this:

```yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  controlPlaneTolerations:
  - effect: NoSchedule
    operator: Equal
    key: your-custom-key
    value: your-custom-value
```
From what I saw, it is not possible to override the tolerations with an empty array; in that case the tolerate-everything defaults are applied again. This means you have to provide a key/value combination of your choosing, but it should be different from the one AWS uses.
An update from our testing: it works, and if you also override the tolerations for the tigera-operator deployment, node group updates for 2-node clusters (one node group per AZ in a 2-AZ deployment) are successful. Tested for the upgrade scenario where we switch the AMI from 1.20 to 1.21.
The bad news is that Cluster Autoscaler still does not work. I am not sure whether I did something wrong in my first test or whether it is a random issue, but in a test on EKS 1.21 with tigera-operator 1.22.0 it does not work. CA reports the same message described here: projectcalico/calico#4695 (comment)
@tscn Thanks for your inputs! Will try as per your suggestion.
Note: tigera-operator versions < v1.22.0 did not apply controlPlaneTolerations to Typha. (This was fixed in ef4c15f.)