clowdhaus / eksup
EKS cluster upgrade guidance
Home Page: https://clowdhaus.github.io/eksup/
License: Apache License 2.0
Ensure either .spec.affinity.podAntiAffinity
or .spec.topologySpreadConstraints
is set to avoid multiple pods being scheduled on the same node. https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
Prefer topology spread constraints over affinity
Inter-pod affinity and anti-affinity require a substantial amount of processing, which can slow down scheduling in large clusters significantly. We do not recommend using them in clusters larger than several hundred nodes.
Report on any workload constructs that do not have either .spec.affinity.podAntiAffinity
or .spec.topologySpreadConstraints
specified
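As a sketch of what the check would look for, a Deployment satisfying it via topology spread constraints might look like the following (all names and the image are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example            # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      # Spread replicas across nodes so a single node drain during the
      # upgrade's rolling replacement cannot take out every replica
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: example
      containers:
        - name: app
          image: example:latest
```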
None
podSecurityPolicy
was deprecated in v1.21
, and removed in v1.25
- users need to ensure they have removed their use of podSecurityPolicy
and migrated to a suitable replacement prior to upgrading to v1.25
Report on the use of podSecurityPolicy
and advise users to switch to Pod Security Admission https://kubernetes.io/docs/concepts/security/pod-security-admission/
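For reference, Pod Security Admission is configured per namespace via labels; a minimal example of what a migrated namespace could look like (namespace name hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: example            # hypothetical name
  labels:
    # Enforce the "restricted" Pod Security Standard in this namespace
    pod-security.kubernetes.io/enforce: restricted
    # Optionally also warn on violations during the migration period
    pod-security.kubernetes.io/warn: restricted
```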
None
As a user, I want to understand if I am at risk of encountering EBS volume service limits that could affect a cluster upgrade (i.e. unable to launch new instances during the surge, rolling-update process due to an EBS volume service limit breach)
# GP2
aws support describe-trusted-advisor-check-result --check-id dH7RR0l6J9
# GP3
aws support describe-trusted-advisor-check-result --check-id dH7RR0l6J3
Report on EBS volume service limit and provide feedback on whether changes are recommended or required prior to starting the upgrade process
None
Given a manifest, convert the manifest to the next stable API version. Some resources only need the API version changed; others will require the schema to be modified to match the new API version
Users should be able to provide a command, either on a per-file basis, or across a directory of files (recursive), searching for the deprecated API version and updating to the next stable version including any schema changes required (where applicable if a mapping is possible)
Possible command(s):
eksup migrate apiextensions.k8s.io/v1beta1 --dir . --recursive
eksup migrate apiextensions.k8s.io/v1beta1 --file manifest.yaml
eksup migrate apiextensions.k8s.io/v1beta1 --dir manifests --recursive --dry-run
None
When running numerous clusters, it is challenging to run eksup
from a CLI on each cluster to track and report on upgrade worthiness. Instead, I would like eksup
to run on the cluster periodically and have the results sent to a central location for tracking and reporting.
Run eksup periodically on the cluster
Running the CLI per cluster is not really scalable for more than 30+ clusters
I want to know the number of available IPs in my data plane subnets both as a whole (the entire data plane) as well as individually (per nodegroup/Fargate profile) to better understand if I may face any restrictions or issues when upgrading data plane components
None
How to use AWS config credentials; how to switch between multiple AWS accounts with --profile
eksup analyze -r us-west-2 -c cluster-name
ERROR eksup::eks::resources: Cluster k8s-devops not found
No response
How to use AWS config credentials; how to switch between multiple AWS accounts with --profile
latest
macOS x86_64
No response
There are many applications that use leader election (e.g. controllers) to achieve redundancy where there's not much practical benefit to running 3 pods over 2. This is of course different from quorum-based HA where a minimum of 3 pods is appropriate
When running eksup, pods with 2 replicas are erroneously reported as violating K8S002 and therefore flagged as not highly available
Some kind of configurable way to override or edit rules for specific workloads would be great, but I appreciate this means introducing a config file, which adds a lot of complication
An ideal solution would allow me to say "workload X should have minimum Y replicas" so that we don't accidentally green light a workload with a single replica
The override could also be a cli flag, but I can see this getting very large for large clusters
Detecting and reporting on deprecated/removed Kubernetes API versions is one of the largest concerns of upgrading Kubernetes clusters. While users may be aware of what APIs are deprecated or removed, identifying if any of those APIs are in use in the cluster is a much more challenging task.
Use the apiserver_requested_deprecated_apis
metric to detect usage of deprecated APIs
- https://kubernetes.io/blog/2020/09/03/warnings/
- https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1693-warnings
- kube-rs/kube#492 for implementation
pluto
or kubent
are recommended to check for deprecated APIs
Currently eksup takes in cluster name and region information when performing the analysis; it should also accept a profile (from AWS config) when performing the analysis.
eksup analyze --cluster clustername --region regionname --profile awsprofilename
When performing eksup analyze, if no profile information is provided, it should take the information from the KUBECONFIG variable or ~/.kube/config file.
The practice of setting a pod.Spec.TerminationGracePeriodSeconds
of 0 seconds is unsafe and strongly discouraged for StatefulSet Pods. Graceful deletion is safe and will ensure that the Pod shuts down gracefully before the kubelet deletes the name from the apiserver.
Report on StatefulSets
where pod.Spec.TerminationGracePeriodSeconds
== 0
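A minimal sketch of the pattern this check would flag (names and image hypothetical):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example            # hypothetical name
spec:
  serviceName: example
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      # This is what the check flags: 0 disables graceful shutdown and is
      # unsafe for StatefulSet pods; omit the field (default 30) or set a
      # value appropriate for the workload instead
      terminationGracePeriodSeconds: 0
      containers:
        - name: app
          image: example:latest
```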
None
Checking to see if there is a workaround for it, as analyze stops going further.
Please skip the ASG check
No response
I'm testing the tool in a non-production cluster, and I'm experiencing some timeouts.
I wonder if the solution would be to add some timeout configurations, or if the tool is not intended to run on a cluster with a certain number of resources.
k get replicasets | wc -l
+ exec kubectl get replicasets --context xxx --namespace yyy
4944
k get pods | wc -l
+ exec kubectl get pods --context xxx --namespace yyy
972
Listing the replica sets via kubectl takes around 17s.
The tool fails with a timeout error.
N/A
No response
eksup analyze --cluster xxx --region us-east-1
latest
macOS x86_64
DEBUG hyper::proto::h1::conn: incoming body decode error: timed out
at /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.28/src/proto/h1/conn.rs:321
TRACE hyper::proto::h1::conn: State::close()
at /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.28/src/proto/h1/conn.rs:948
TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Closed, writing: Closed, keep_alive: Disabled }
at /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.28/src/proto/h1/conn.rs:731
TRACE hyper::proto::h1::conn: shut down IO complete
at /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.28/src/proto/h1/conn.rs:738
TRACE tower::buffer::worker: worker polling for next message
at /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:108
TRACE tower::buffer::worker: buffer already closed
at /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:62
...
TRACE hyper::client::pool: pool closed, canceling idle interval
at /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.28/src/client/pool.rs:759
Error: Failed to list ReplicaSets
Caused by:
0: HyperError: error reading a body from connection: error reading a body from connection: timed out
1: error reading a body from connection: error reading a body from connection: timed out
2: error reading a body from connection: timed out
3: timed out
No response
Users may be interested in different levels of output. For example, some users may want to see only the required checks that failed and ignore the recommended checks. Others may want to see all reported results, even if there are no required/recommended changes.
Configure output levels
--quiet - suppress all output
(default, no flags) - show failed checks on hard requirements
--warn - in addition to failed, show warnings (low number of IPs available for nodes/pods, addon version older than current default, etc.)
--info - in addition to failed and warnings, show informational notices (number of IPs available for nodes/pods, addon version relative to current default and latest, etc.)
No response
eksup should analyze the cluster
Unable to connect to cluster. Ensure kubeconfig file is present and updated to connect to the cluster.
➜ foo AWS_REGION=eu-central-1 aws eks update-kubeconfig --name foo-bar-baz
Updated context arn:aws:eks:eu-central-1:1234567890:cluster/foo-bar-baz in /Users/lorem/.kube/config
➜ foo eksup analyze --cluster foo-bar-baz --region eu-central-1 -v
Error: Unable to connect to cluster. Ensure kubeconfig file is present and updated to connect to the cluster.
Try: aws eks update-kubeconfig --name foo-bar-baz
read AWS creds from environment variables if provided
aws eks update-kubeconfig --name <cluster-name>
eksup analyze --cluster <cluster name> --region <aws region>
0.2.0-alpha3
macOS x86_64
Error: Unable to connect to cluster. Ensure kubeconfig file is present and updated to connect to the cluster.
Try: aws eks update-kubeconfig --name foo-bar-baz
Supports SSO profiles using sso-session for authentication.
Credential chain fails
[sso-session SESSION_NAME]
sso_start_url = https://REDACTED.awsapps.com/start
sso_region = eu-west-1
sso_registration_scopes = sso:account:access
[profile PROFILE_NAME]
sso_account_id = 123456789000
sso_role_name = Administrator
region = eu-west-1
sso_session = SESSION_NAME
Update aws-sdk-rust 😄
eksup analyze -c cluster-v2 -r eu-west-1
latest
Linux x86_64
eksup analyze -c cluster-v2 -r eu-west-1 -v
WARN aws_config::profile::parser::normalize: profile `sso-session SESSION_NAME` ignored because `sso-session SESSION_NAME` was not a valid identifier
at /cargo/registry/src/index.crates.io-6f17d22bba15001f/aws-config-1.1.3/src/profile/parser/normalize.rs:87
WARN aws_config::meta::credentials::chain: provider failed to provide credentials, provider: Profile, error: the credentials provider was not properly configured: ProfileFile provider could not be built: profile `PROFILE_NAME` was not defined: `sso_region` was missing (InvalidConfiguration(InvalidConfiguration { source: "ProfileFile provider could not be built: profile `PROFILE_NAME` was not defined: `sso_region` was missing" }))
at /cargo/registry/src/index.crates.io-6f17d22bba15001f/aws-config-1.1.3/src/meta/credentials/chain.rs:90
eksup is supposed to display the analysis results for the EKS cluster
It's showing the below error although I have set up the kube context correctly.
Error: ApiError: the server could not find the requested resource: NotFound (ErrorResponse { status: "Failure", message: "the server could not find the requested resource", reason: "NotFound", code: 404 })
Caused by:
the server could not find the requested resource: NotFound
Command being used:
eksup analyze -c eks-cluster-test -r us-east-1
No response
Run command below for any eks cluster
eksup analyze -c eks-cluster-test -r us-east-1
latest
macOS x86_64
Error: ApiError: the server could not find the requested resource: NotFound (ErrorResponse { status: "Failure", message: "the server could not find the requested resource", reason: "NotFound", code: 404 })
Caused by:
the server could not find the requested resource: NotFound
EKS requires nodes created by managed nodegroups and Fargate profiles to align with the control plane version (minor versions to be the same) before it will allow the control plane to upgrade
Currently, eksup
reports only on results as they relate to the Kubernetes version skew support policy. This means that nodes created by a managed nodegroup or Fargate profile that are 1 minor version behind the control plane version are shown in the results as a recommended remediation (upgrade to align versions), not a required one. Per EKS requirements, this should be shown as a required remediation since users will not be able to upgrade until they align the node versions to the control plane
N/A
Any nodes created by managed nodegroups or Fargate profiles should report as required remediation if they do not match the control plane version
N/A
latest
macOS x86_64
No response
Add support for ReplicaSet
resources that were not created by a Deployment
(i.e. do not have an ownerReferences
entry)
Report on standalone ReplicaSet
resources that were not created by a higher-order resource
None
K8S001 uses the indicative mood to specify a degraded state:
The version skew between the control plane (API Server) and the data plane (kubelet) violates the Kubernetes version skew policy [...]
There is a version skew between the control plane (API Server) and the data plane (kubelet).
But K8S002 uses it to specify a desired state:
There are at least 3 replicas specified for the resource.
This led to some confusion when I first read about the checks.
https://clowdhaus.github.io/eksup/info/checks/
Use the keywords specified in Best Current Practice 14 and incorporate the phrase specified by RFC 8174 near the top of the page.
I would rewrite K8S001 as follows:
The version skew between the control plane (API Server) and the data plane (kubelet) MUST NOT violate the Kubernetes version skew policy, either currently or after the control plane is upgraded. [Suggestions welcome on the "will violate after upgrade" part, which was difficult to recast.]
And K8S002 as follows:
There MUST be at least 3 replicas specified for the resource.
And version-dependent checks like K8S008 as follows:
With target version < v1.24, Pod volumes SHOULD NOT mount the docker.sock
file.
With target version >= v1.24, Pod volumes MUST NOT mount the docker.sock
file.
And all other checks accordingly.
A user may want to provide a custom kubeconfig
path as opposed to strictly targeting: ~/.kube/config
This could subsequently be used by any end user:
eksup analyze [OPTIONS] --cluster <CLUSTER> --kubeconfig /tmp/123abc-generated-config
The default workflow of:
eksup analyze [OPTIONS] --cluster <CLUSTER>
would result in automatic use of: --kubeconfig ~/.kube/config
(or rather, default to ~/.kube/config
)
Additionally, the user could set the KUBECONFIG
environment variable, resulting in the equivalent of --kubeconfig
:
KUBECONFIG=/tmp/123abc-generated-config eksup analyze [OPTIONS] --cluster <CLUSTER>
No response
Successfully generate output for below analyze command.
eksup analyze --cluster $CLUSTER_NAME --region $AWS_REGION --output analysis.txt
Failed to list PodSecurityPolicies
$ eksup analyze --cluster $CLUSTER_NAME --region $AWS_REGION --output analysis.txt
Error: Failed to list PodSecurityPolicies
Caused by:
0: ApiError: the server could not find the requested resource: NotFound (ErrorResponse { status: "Failure", message: "the server could not find the requested resource", reason: "NotFound", code: 404 })
1: the server could not find the requested resource: NotFound
eksup analyze --cluster $CLUSTER_NAME --region $AWS_REGION --output analysis.txt
No response
Prepare an EKS v1.26 cluster;
Run below command:
eksup analyze --cluster $CLUSTER_NAME --region $AWS_REGION --output analysis.txt
latest
Linux x86_64
Error: Failed to list PodSecurityPolicies
Caused by:
0: ApiError: the server could not find the requested resource: NotFound (ErrorResponse { status: "Failure", message: "the server could not find the requested resource", reason: "NotFound", code: 404 })
1: the server could not find the requested resource: NotFound
Currently, it's required to have at least 5 free IPs for the control plane cross-account ENIs to facilitate an upgrade. However, this isn't the full story, since the cross-account ENIs will be created in at least 2 different availability zones.
Change the EKS001
check to ensure there are at least two subnets in different AZs with at least 4 available IPs each for the control plane cross-account ENIs - awsdocs/amazon-eks-user-guide#688
Use the current guidance of 5 free IPs, but this is misleading (if you only have 5 free IPs, all in one subnet, the upgrade would fail)
Add support to analyze EKS clusters that use a MixedInstancePolicy.
eksup analyze --cluster foo --region us-west-2
Error: Launch template not found, launch configuration is not supported
Note: The error message provided above is misleading or limiting. The cluster being scanned does not use launch configurations.
Enable eksup
to autodetect a mixed instance policy.
Add support to disable ASG checks.
With eksctl
being the official CLI for EKS, and terraform-aws-eks
being a popular method for deploying clusters, it would be helpful to show relevant code snippets for the commands/changes required with these tools to facilitate an upgrade
Provide relevant code snippets for eksctl
and terraform-aws-eks
for performing the upgrade
No response
When running the CLI against clusters of various configurations, the CLI should not panic nor show panic output to users
Using unwrap()
is throwing panic errors back to users instead of handling them gracefully within the flow of execution or returning a more useful feedback message to the user
N/A - remove this from the template
No response
N/A
latest
macOS x86_64
No response
When running eksup ...
and the CLI session is unable to successfully connect to the cluster, typically due to expired credentials or a missing kubeconfig, the error message returned should notify the user of the specific error and how to remediate it (aws eks update-kubeconfig ...
, get AWS credentials, etc.)
eksup
fails and returns a vague error that exposes the low-level internals of eksup
and is not helpful to users
eksup analyze -c <cluster> -r <region> (without a kubeconfig or AWS credentials)
No response
See above
latest
all
Error: HyperError: error trying to connect: dns error: failed to lookup address information: Name or service not known
Caused by:
0: error trying to connect: dns error: failed to lookup address information: Name or service not known
1: dns error: failed to lookup address information: Name or service not known
2: failed to lookup address information: Name or service not known
The Dockershim has been removed starting with Kubernetes v1.24
, and users who are mounting the docker.sock
in their pods will be impacted if they upgrade to v1.24
Report on the use of workloads that mount the docker.sock
, which requires users to remediate prior to upgrading to v1.24
- https://github.com/aws-containers/kubectl-detector-for-docker-socket
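The pattern this check would look for is a hostPath volume pointing at the Docker socket, e.g. (pod name and image hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example            # hypothetical name
spec:
  containers:
    - name: app
      image: example:latest
      volumeMounts:
        - name: dockersock
          mountPath: /var/run/docker.sock
  volumes:
    # Breaks on v1.24+ nodes, where dockershim (and its socket) no longer exists
    - name: dockersock
      hostPath:
        path: /var/run/docker.sock
```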
None
When interrogating resources for upgrade readiness, the Job definitions created by CronJobs should be excluded from the results, since their definition is already covered by the CronJob spec
Currently, the Job specs created by CronJobs are reported in the findings in addition to the CronJob that defines them
N/A
Filter out Jobs that have an ownerReferences
apiVersion: batch/v1
kind: Job
metadata:
creationTimestamp: "2023-02-24T17:15:00Z"
generation: 1
labels:
controller-uid: 079d519d-72ba-4a77-a0d7-bb134ea425b0
job-name: bad-cron-27954315
name: bad-cron-27954315
namespace: cronjob
ownerReferences:
- apiVersion: batch/v1
blockOwnerDeletion: true
controller: true
kind: CronJob
name: bad-cron
uid: 1f1e67a0-198b-4cc1-961b-e2baff5f21b0
N/A
latest
macOS x86_64
N/A
From the EKS documentation for the kube-proxy
addon:
Ensure eksup
is validating against these requirements and reporting the necessary information back to users
None
As a user, I want to understand if I am at risk of encountering EC2 instance service limits that could affect a cluster upgrade (i.e. unable to launch new instances during the surge, rolling-update process due to an EC2 instance service limit breach)
aws support describe-trusted-advisor-check-result --check-id 0Xc6LMYG8P
Report on EC2 instance service limit and provide feedback on whether changes are recommended or required prior to starting the upgrade process
None
As a user who uses the CLI to analyze their cluster for upgrade readiness, I want to have the results formatted so that I can quickly and easily understand what checks have passed or failed
A tabular format is commonly used in this scenario, provided the number of columns is kept to a minimum to fit within a 120-character-wide window
JSON format - this will also be supported but is not as readable as a table and is intended more for machines rather than users
Expected to generate a report for cluster upgrade using
eksup analyze --cluster <cluster-name> --region <region>
getting error
Error: Launch template not found, launch configuration is not supported
aws eks update-kubeconfig --name <cluster> --region <region>
eksup analyze --cluster <cluster-name> --region <region>
No response
aws eks update-kubeconfig --name --region
eksup analyze --cluster --region
latest
macOS arm64
Error: Launch template not found, launch configuration is not supported
Ensure that .spec.containers[*].readinessProbe
is set to provide the appropriate feedback data to the control plane when performing rolling upgrades to minimize the potential for service disruption
Ensure that .spec.containers[*].readinessProbe is set
.spec.containers[*].livenessProbe, if set, is NOT the same as .spec.containers[*].readinessProbe
.spec.containers[*].startupProbe is set if .spec.containers[*].livenessProbe is set
None
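As an illustration of those conditions, a container spec satisfying the check might look like the following (container name, image, ports, and endpoint paths are all hypothetical):

```yaml
containers:
  - name: app
    image: example:latest
    ports:
      - containerPort: 8080
    # Gates traffic during rolling updates; distinct endpoint from liveness
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
    # Restarts the container when unhealthy; must differ from readiness
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    # Set whenever a liveness probe is set; protects slow-starting
    # containers from premature liveness failures
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 2
```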
Starting in Kubernetes v1.17
, the in-tree storage plugin was marked as deprecated, and it will be removed in EKS v1.23
. Users need to install the EBS CSI driver prior to upgrading to EKS v1.23
The in-tree Amazon EBS storage provisioner is deprecated. If you are upgrading your cluster to version 1.23, then you must first install the Amazon EBS driver before updating your cluster. For more information, see Amazon EBS CSI migration frequently asked questions. If you have pods running on a version 1.22 or earlier cluster, then you must install the Amazon EBS driver before updating your cluster to version 1.23 to avoid service interruption. https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi-migration-faq.html
Report on the kubernetes.io/aws-ebs StorageClass
and recommend users update to the new API resource groups provided by the CSI
None
To ensure services are configured for high availability to reduce the chance of disruption or downtime during an upgrade, users should have a podDisruptionBudget
set, with at least one of minAvailable
or maxUnavailable
provided, for each workload construct (Deployment
, ReplicaSet
, ReplicationController
, StatefulSet
)
Report on any workload constructs that do not have an associated podDisruptionBudget
, or whose associated podDisruptionBudget
does not have minAvailable
or maxUnavailable
configured
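A minimal sketch of an associated PodDisruptionBudget (names hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example            # hypothetical name
spec:
  # Set exactly one of minAvailable or maxUnavailable; here, eviction
  # during a node drain is blocked if it would leave fewer than 2 pods
  minAvailable: 2
  selector:
    matchLabels:
      app: example
```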
None
When running eksup
from the CLI, users lack context as to what is happening or how much is left until the results are returned. It's common practice to provide some sort of indication of the progress of the execution
Add progress indicator https://github.com/console-rs/indicatif for a quality of life improvement
No progress indicator