kubernetes-sigs / karpenter Goto Github PK

View Code? Open in Web Editor NEW

472.0 24.0 161.0 5.31 MB

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

License: Apache License 2.0

Makefile 0.23% Shell 1.13% Go 98.39% Smarty 0.25%

karpenter's Introduction

Karpenter

Karpenter improves the efficiency and cost of running workloads on Kubernetes clusters by:

Watching for pods that the Kubernetes scheduler has marked as unschedulable
Evaluating scheduling constraints (resource requests, nodeselectors, affinities, tolerations, and topology spread constraints) requested by the pods
Provisioning nodes that meet the requirements of the pods
Removing the nodes when the nodes are no longer needed

Karpenter Implementations

Karpenter is a multi-cloud project with implementations by the following cloud providers:

AWS
Azure

Community, discussion, contribution, and support

If you have any questions or want to get the latest project news, you can connect with us in the following ways:

Using and Deploying Karpenter? Reach out in the #karpenter channel in the Kubernetes slack to ask questions about configuring or troubleshooting Karpenter.
Contributing to or Developing with Karpenter? Join the #karpenter-dev channel in the Kubernetes slack to ask in-depth questions about contribution or to get involved in design discussions.
Join our alternating working group meetings where we share the latest project updates, answer questions, and triage issues:
- Bi-weekly meetings alternating between Mondays @ 9:00 PT (convert to your timezone) and Thursdays @ 14:00 PT (convert to your timezone) on Zoom - the password is 77777.
- Invites are managed through our Calendar
- Add future questions or read past discussions in our Working Group Log

Pull Requests and feedback on issues are very welcome! See the issue tracker if you're unsure where to start, especially the Good first issue and Help wanted tags, and also feel free to reach out to discuss.

See also our contributor guide and the Kubernetes community page for more details on how to get involved.

Code of conduct

Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.

Talks

09/08/2022 Workload Consolidation with Karpenter
05/19/2022 Scaling K8s Nodes Without Breaking the Bank or Your Sanity
03/25/2022 Karpenter @ AWS Community Day 2022
12/20/2021 How To Auto-Scale Kubernetes Clusters With Karpenter
11/30/2021 Karpenter vs Kubernetes Cluster Autoscaler
11/19/2021 Karpenter @ Container Day
05/14/2021 Groupless Autoscaling with Karpenter @ Kubecon
05/04/2021 Karpenter @ Container Day

karpenter's People

Contributors

Stargazers

Watchers

Forkers

ellistarn bwagner5 pranav-saraswat njtran jonathan-innis tzneal dewjam grosser wkaczynski billrayburn ra-grover engedaam iankouls-aws yuyuvn 0x32c2cac3info sergkondr navvis-dev noksa digitalis-io lucass4 raykrueger willthames danieljuravski deanfoley jaypipes jashandeep-sohi ku524 kahirokunn hagen93 cebernardi masturd jackfrancis tallaxes rafiramadhana khareyash05 joshuapare cjerad guikcd fedosin tasdikrahman ravihari james-quigley bryce-soghigian wojtekdmyszewicz gegdcindy fffinkel abeer-stripe mi-kobaaa onlinehead jmdeal chu-yik acorn-io seanpm2001 caesarxuchao jacobvaldemar garvinp-stripe shikai93 alec-rabold sadath-12 gfcroft imdevin567 nathangeology o11n eddieceausu rwipfelexo lexicoder dims achal1304 rkirk-r7 nomdot mallow111 gjr9999 splichy steveshidev pelotech wmgroot nantiferov nklya sreeram-venkitesh coder12git yashpimple ricardoapl gjtempleton nikmohan123 changhyuni sanskarzz smcavallo zillow mattsre horiodino chenfeilee jigisha620 acrlabs jignyasamishra smartnews scubadam cybernetics rounoff domgoodwin avineshtripathi

karpenter's Issues

CNCF donation?

Are there any plans to donate this project to the CNCF or host it in any outside organization?
Right now karpenter only supports aws, however I anticipate the community will add support for other clouds soon. Having the project under a well-known umbrella could help drive adoption.

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Discussion: Extending Cloud Providers

Tell us about your request

There's a growing number of requests to extend the cloud providers that Karpenter supports from AWS to other cloud providers (Azure, GCP, Orcale, etc.). This issue is intended to cover the discussion of extending the number of cloud providers broadly.

Additional Context

Tracking Tasks

#754

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Karpenter publish webhook metrics on TCP/9090 by default

Version

Karpenter Version: v0.22.0

Kubernetes Version: v1.23.14-eks-ffeb93d

Expected Behavior

Port 9090 is closed by default.

Actual Behavior

There are some metrics published. And the port could be configured via the env variable METRICS_PROMETHEUS_PORT and it looks like default Knative metrics exporter

Some more context in Karpenter Slack

Steps to Reproduce the Problem

kubectl -n karpenter port-forward karpenter-796b44965b-j56gn 9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090
Handling connection for 9090

curl localhost:9090/metrics
# HELP webhook_go_alloc The number of bytes of allocated heap objects.
# TYPE webhook_go_alloc gauge
webhook_go_alloc{name=""} 2.4627608e+07
...
# HELP webhook_work_queue_depth Depth of the work queue
# TYPE webhook_work_queue_depth gauge
webhook_work_queue_depth{reconciler="ConfigMapWebhook"} 0
webhook_work_queue_depth{reconciler="DefaultingWebhook"} 0
webhook_work_queue_depth{reconciler="ValidationWebhook"} 0
webhook_work_queue_depth{reconciler="WebhookCertificates"} 0

Resource Specs and Logs

We run Karpenter with the following options:

    - args:
      - --webhook-port=10269
      - --metrics-port=10270
      - --health-probe-port=10271

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Typed Controller should remove auto-patching

Tell us about your request

Karpenter typed controller patches the reconciled object. We should investigate updating instead of patching and rely on optimistic locking.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Currently, we just patch, meaning two concurrently occuring patches could succeed, when only one should.

Are you currently working around this issue?

Not

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Consolidation ttl: `spec.disruption.consolidateAfter`

Tell us about your request

We have a cluster where there are a lot of cron jobs which run every 5 minutes...

This means we have 5 nodes for our base workloads and every 5 minutes we get additional nodes for 2-3 minutes which are scaled down or consolidated with existing nodes.

This leads to a constant flow of nodes joining and leaving the cluster. It looks like the docker image pull and node initialization creates more network traffic fees than the cost reduction of not having running the instances all the time.

It would be great if we could configure some time consolidation period maybe together with ttlSecondsAfterEmpty which would only cleanup or consolidate nodes if the capacity was idling for x amount of time.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Creating a special provisioner is quite time consuming because all app deployments have to be changed to leverage it...

Are you currently working around this issue?

We think about putting cronjobs into a special provisioner which would not use consolidation but the ttlSecondsAfterEmpty feature.

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Implement more kubernetes-sigs/descheduler features in Karpenter

Is an existing page relevant?

https://karpenter.sh/v0.19.2/getting-started/migrating-from-cas/

What karpenter features are relevant?

I am wondering if there is a need to still run kubernetes-sigs/descheduler or if all features are handled by karpenter already.

How should the docs be improved?

Maybe if it is relevant it could be added to the migrating from cluster-autoscaler page?

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Settings should not dynamically update at runtime

Tell us about your request

Karpenter's settings.SettingsStore currently updates at runtime to avoid having to restart the controller on any settings changes. This tracking has become difficult to manage and creates some development burden on the controller developer to handle constant requeueing while a given feature flag isn't enabled so that the controller can properly begin to react when a feature is enabled.

Rather than handle the burden of having to deal with this constant requeueing, it would be simpler to just restart the container when the settings change, meaning that settings can be assumed to be consistent for the lifetime of the process.

This is a consistent pattern throughout the Kubernetes community, where changes to configuration that are not bound to be particularly frequent require a pod restart.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Take away burden from the controller developer to handle runtime changes in Karpenter's settings.

Are you currently working around this issue?

Yes, we are having to do a constant requeue inside of controllers that use feature-flags.

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Machine Disruption Gate

Tell us about your request
We already have a way to terminate the instance after X amount of time. It would be great to have a way of doing the same at a specified time as well. For example, say I want to terminate a node(s) at 2:00 AM UTC to save running costs.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
We are looking to have the ability to scale down nodes at certain times, with ASGs it is easy using the scheduled actions but since Karpenter is managed outside of an ASG (which is a good thing), we can't automatically scale them down to save costs.

Are you currently working around this issue?
I may end up writing a lambda function to do the same but it would be great to have this feature inbuilt

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Limitation of node number per provisioner

Tell us about your request

It would be nice to have the ability to limit the number of nodes created by certain provisioner. For example:

  limits:
    nodes: "6"

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Let's say our application spawns pods for some tasks, and each pod requires a separate node. Now it limits by the max size of the node group. All these nodes are Spot instances, and there is a problem that sometimes there are no instances of the current family in the region, so we use different instance families: m4, m5, m5a, r5, r5a, etc. These instances could have a different amount of CPU and mem.

It would be nice to have the ability to limit the number of nodes by their count, not by their resources. It is clear that we have max pods in our app, but it is a bit of a synthetic example.

Are you currently working around this issue?

We use limits.resources.cpu with approx number of CPUs, but it is not accurate and not transparent a little bit.

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Global resource limits

Tell us about your request
The ability to configure resource limits (CPU\GPU) on a global level that will be enforced\applied to all provisioners in the cluster.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
In certain use cases, I need to enable a mix of several CPU and GPU instance types that are configured in separate provisioners (because of different taints that are added to the nodes). Without a global limit for resources, I cannot limit the overall capacity of nodes created by Karpenter in the cluster.

Are you currently working around this issue?
No.

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Taint nodes before deletion

Tell us about your request

When removing nodes due to consolidation I would like to be able to apply a taint to the node before it is removed.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Reason for this is to be able to gracefully stop DaemonSet pods, see related issues below

I have consul agents running on nodes via DaemonSet, these agents join the cluster.
If they are just killed then they sit around in the cluster as failed, if the pod is given a stop signal then it will gracefully leave the cluster and then exit.

When a node is just deleted it leaves a bunch of hanging agents in my Consul cluster.
Applying a NoExecute taint prior to deletion will evict those pods.

System DaemonSets (e.g. Kube-proxy) tolerate all taints and so this won't evict those.

Are you currently working around this issue?

Without Karpenter nodes are generally only removed
a) Manually, in which case I manually taint the node with a noExecute taint
b) By the node-termination-handler which is configured to add a taint as well

With Karpenter... well the workaround is to manually clear out failed nodes from my Consul cluster or get this feature added!

Additional Context

aws/aws-node-termination-handler#273
kubernetes/kubernetes#75482

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Feat: make provisioner to allow setting annotation

Right now, it only accepts labels: https://github.com/aws/karpenter-core/blob/main/pkg/apis/provisioning/v1alpha5/provisioner.go#L30-L87

karpenter doesn't scaleout when using time-slicing for GPUs

Tell us about your request

Karpenter should respect time-slicing configuration when provisioning gpu instances

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

When you use time slicing and the NVIDIA/k8s-device-plugin, Karpenter can fail to provision new instances

As per the documentation here: https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing

To implement time-slicing you would configure:

version: v1
sharing:
timeSlicing:
renameByDefault:
failRequestsGreaterThanOne:
resources:
- name: < resource-name >
replicas: < num-replicas >

the outcome of this being that if you did $ kubectl describe node you’d see

Capacity:
nvidia.com/gpu: < num-replicas >

Karpenter respects that when scheduling pods, for example 6 pods which each request 4 GPU will be placed onto a node when num-replicas is 24.

However, when it comes to scaling, the following error occurs:

{"memory":"4Gi","nvidia.com/gpu":"4","pods":"1"},

because the deployment is requesting

      resources:
        limits:
          [nvidia.com/gpu:](http://nvidia.com/gpu:) '4'
        requests:
          memory: 4Gi
          [nvidia.com/gpu:](http://nvidia.com/gpu:) '4'

but the instance types in the Provisioner only have 1 GPU:

spec:
requirements:

key: node.kubernetes.io/instance-type
operator: In
values:
- g5.2xlarge
- g5.4xlarge

When evaluating which instance to provision, Karpenter should evaluate the number of GPUs specified in < num-replicas >, not the number of GPUs that instance type/size has, and therefore provision a new instance (in this example)

Are you currently working around this issue?

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

NewPDBLimits fails in 1.25+ k8s

If we try to use karpenter-core on k8s 1.25+ we encouter the following problem when a new node is provisioned:

2022-11-24T08:53:42.773Z	ERROR	controller	Reconciler error	{"controller": "inflightchecks", "controllerGroup": "", "controllerKind": "Node", "Node": {"name":"kazoo5-first-db-kazoo5-first-couchdb-kazoodb-9ef87341"}, "namespace": "", "name": "kazoo5-first-db-kazoo5-first-couchdb-kazoodb-9ef87341", "reconcileID": "2720caef-d6e2-4802-88b6-9ef9693a4e7c", "error": "no matches for kind \"PodDisruptionBudget\" in version \"policy/v1beta1\""}

Reason: https://kubernetes.io/blog/2022/04/07/upcoming-changes-in-kubernetes-1-24/#looking-ahead (v1beta1 has been removed)

It looks like the problem is in this place: https://github.com/aws/karpenter-core/blob/23060db5957f34c547dd8d28c8653482fa3a396a/pkg/controllers/deprovisioning/pdblimits.go#L40

Actually it is policyv1 "k8s.io/api/policy/v1beta1"

I would suggest to fetch stable v1 policy first and if there is no such kind, fallback to v1beta1.

Thanks.

Chore: Remove non-neutral resource types

Currently neutral tests rely on vendor specific resource names. See: https://github.com/aws/karpenter-core/pull/9/files

Cordon node with a do-not-evict pod when ttlSecondsUntilExpired is met

Tell us about your request

Regarding the do-not-evict annotation, it currrently prevents some nodes to be deprovisioned on our clusters.

I know, that's the goal and i'm fine with it, but would it be possible to cordon the node at least when the ttlSecondsUntilExpired is met ?

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

We provision expensive nodes as a failback for some workloads with a ttl of 3 hours, and we saw some still running with a 6 hours uptime.
The annotation was preventing the drain/deletion as cronjobs kept being scheduled on it, but a manual cordon fixed it within minutes.

Are you currently working around this issue?

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

bug: add tests and fix topology spread skew for 1.24+

From the 1.24 release docs:

The calculations for Pod topology spread skew now exclude nodes that don't match the node affinity/selector. This may lead to unschedulable pods if you previously had pods matching the spreading selector on those excluded nodes (not matching the node affinity/selector), especially when the topologyKey is not node-level. Revisit the node affinity and/or pod selector in the topology spread constraints to avoid this scenario.

Vertical resizer for Karpenter

Tell us about your request

My team just started using Karpenter in our clusters. One thing that was missing from the Karpenter deployment is the nanny that we used to have with Cluster Autoscaler. This is a bit inconvenient given that our clusters grow in time as more services are deployed on the clusters and we have to monitor the memory usage and manually bump up the resource requests. Do we already have something that we can use for Karpenter or is it on the roadmap?

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Stated in the request.

Are you currently working around this issue?

Manually bumping up the resources requests for now.

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

create a cross-namespace test

The fix for aws/karpenter-provider-aws#2709 was put in, but we don't have an E2E test that verifies it.

Better Scheduling Default Behavior

Been musing on the idea of building a scheduling template that enables users to set default scheduling rules for pods. We've seen this asked at the provisioner scope, but it doesn't make much sense because many provisioners are capable of scheduling the same pod. Alternatively, users may use a policy agent like https://kyverno.io/policies/karpenter/add-karpenter-nodeselector/add-karpenter-nodeselector/, but this requires another dependency.

A SchedulingTemplate would configure DefaultingWebhookConfigurations for kind Pod, and then the karpenter controller would reconcile these, handle admission requests, and inject the fields

A couple of open design questions:

Namespaced vs global
Fail open or closed

kind: SchedulingTemplate
spec:
  selector:
     my: app # could be global, or namespacable?
  template: # this is a subset of pod template spec
    metadata: 
      annotations: 
        karpenter.sh/do-not-evict: true # Use case 1: any pods the match this selector cannot be evicted
    spec: 
      topologySpreadConstraints: # Use case 2: Default to zonal spread
        - maxSkew: 1
          topologyKey: 'topology.kubernetes.io/zone'
          whenUnsatisfiable: ScheduleAnyway

Add a gracePeriod for the `do-not-disrupt` pod annotation

Tell us about your request
I'd like the ability to taint a node after a given period via ttlSecondsUntilTainted to allow running pods to finish before the node is terminated, this should still respect the ttlSecondsUntilExpired and ttlSecondsAfterEmpty.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
When running jobs such as CI/CD pipelines in K8s there are long running jobs which shouldn't be terminated due to the high cost of re-running them, by adding ttlSecondsUntilTainted we can have nodes that expire and are replaced without the cost of killing potentially long running jobs.

Are you currently working around this issue?
Not using Karpenter.

Additional context
n/a

Attachments
n/a

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

improve "X out of Y instance types were excluded because they would breach provisioner limits"

Tell us about your request

329 out of 550 instance types were excluded because they would breach provisioner limits

atm this is misleading and noisy since I have

  - key: "node.kubernetes.io/instance-type"
    operator: In
    values: ["m6i.xlarge"]

so it would be nice to apply these filters first and then log if any of the possible nodes got excluded, so I have an actionable "I need to check my type list" log message

and on top of that incompatible with provisioner is misleading in the opposite way, when all types are filtered out it does not say that, but makes it look like there is a label/taint mismatch

Are you currently working around this issue?
Filter out these misleading logs.

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Daemonset-driven consolidation

Version

Karpenter Version: v0.22.1

Kubernetes Version: v1.24.8

Hi,

I have set up Karpenter with the following cluster configuration:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: klss
  region: eu-central-1
  version: "1.24"
  tags:
    karpenter.sh/discovery: klss
managedNodeGroups:
  - instanceType: t3.small
    amiFamily: AmazonLinux2
    name: karpenter
    desiredCapacity: 2
    minSize: 2
    maxSize: 2
iam:
  withOIDC: true

This is the provisioner:

---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - "on-demand"
        - "spot"
    - key: "kubernetes.io/arch"
      operator: In
      values:
        - "arm64"
        - "amd64"
    - key: "topology.kubernetes.io/zone"
      operator: In
      values:
        - "eu-central-1a"
        - "eu-central-1b"
        - "eu-central-1c"
  limits:
[nodes.zip](https://github.com/aws/karpenter/files/10483452/nodes.zip)

    resources:
      cpu: 32
      memory: 64Gi
  providerRef:
    name: default
  consolidation:
    enabled: true
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: klss
  securityGroupSelector:
    karpenter.sh/discovery: klss

Karpenter has currently provisioned three spot instances. When installing Prometheus with Helm chart version 19.3.1, two of the five node exporters can't be scheduled. The message is: "0/5 nodes are available: 1 Too many pods. preemption: 0/5 nodes are available: 5 No preemption victims found for incoming pod.". The Karpenter controllers didn't output any log entries.

This is the values file for the chart:

prometheus:
  serviceAccounts:
    server:
      create: false
      name: "amp-iamproxy-ingest-service-account"
  server:
    remoteWrite:
      - url: https://aps-workspaces.eu-central-1.amazonaws.com/workspaces/xxxxxxxxxxxxxxxxxxxxx/api/v1/query
        sigv4:
          region: eu-central-1
        queue_config:
          max_samples_per_send: 1000
          max_shards: 200
          capacity: 2500
    persistentVolume:
      enabled: false

This is the live manifest of the DaemonSet of the Prometheus node exporter:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: '1'
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"metrics","app.kubernetes.io/instance":"prometheus","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"prometheus-node-exporter","app.kubernetes.io/part-of":"prometheus-node-exporter","app.kubernetes.io/version":"1.5.0","helm.sh/chart":"prometheus-node-exporter-4.8.1"},"name":"prometheus-prometheus-node-exporter","namespace":"prometheus"},"spec":{"selector":{"matchLabels":{"app.kubernetes.io/instance":"prometheus","app.kubernetes.io/name":"prometheus-node-exporter"}},"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"},"labels":{"app.kubernetes.io/component":"metrics","app.kubernetes.io/instance":"prometheus","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"prometheus-node-exporter","app.kubernetes.io/part-of":"prometheus-node-exporter","app.kubernetes.io/version":"1.5.0","helm.sh/chart":"prometheus-node-exporter-4.8.1"}},"spec":{"automountServiceAccountToken":false,"containers":[{"args":["--path.procfs=/host/proc","--path.sysfs=/host/sys","--path.rootfs=/host/root","--web.listen-address=[$(HOST_IP)]:9100"],"env":[{"name":"HOST_IP","value":"0.0.0.0"}],"image":"quay.io/prometheus/node-exporter:v1.5.0","imagePullPolicy":"IfNotPresent","livenessProbe":{"failureThreshold":3,"httpGet":{"httpHeaders":null,"path":"/","port":9100,"scheme":"HTTP"},"initialDelaySeconds":0,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1},"name":"node-exporter","ports":[{"containerPort":9100,"name":"metrics","protocol":"TCP"}],"readinessProbe":{"failureThreshold":3,"httpGet":{"httpHeaders":null,"path":"/","port":9100,"scheme":"HTTP"},"initialDelaySeconds":0,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1},"securityContext":{"allowPrivilegeEscalation":false},"volumeMounts":[{"mountPath":"/host/proc","name":"proc","readOnly":true},{"mountPath":"/host/sys","name":"sys","readOnly":true},{"mountPath":"/host/root","mountPropagation":"HostToContainer","name":"root","readOnly":true}]}],"hostNetwork":true,"hostPID":true,"securityContext":{"fsGroup":65534,"runAsGroup":65534,"runAsNonRoot":true,"runAsUser":65534},"serviceAccountName":"prometheus-prometheus-node-exporter","tolerations":[{"effect":"NoSchedule","operator":"Exists"}],"volumes":[{"hostPath":{"path":"/proc"},"name":"proc"},{"hostPath":{"path":"/sys"},"name":"sys"},{"hostPath":{"path":"/"},"name":"root"}]}},"updateStrategy":{"rollingUpdate":{"maxUnavailable":1},"type":"RollingUpdate"}}}
  creationTimestamp: '2023-01-23T19:32:19Z'
  generation: 1
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus-node-exporter
    app.kubernetes.io/part-of: prometheus-node-exporter
    app.kubernetes.io/version: 1.5.0
    helm.sh/chart: prometheus-node-exporter-4.8.1
  name: prometheus-prometheus-node-exporter
  namespace: prometheus
  resourceVersion: '1156021'
  uid: 3659924e-2902-4651-aa2a-1d20a1dc1ce7
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: prometheus
      app.kubernetes.io/name: prometheus-node-exporter
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: metrics
        app.kubernetes.io/instance: prometheus
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: prometheus-node-exporter
        app.kubernetes.io/part-of: prometheus-node-exporter
        app.kubernetes.io/version: 1.5.0
        helm.sh/chart: prometheus-node-exporter-4.8.1
    spec:
      automountServiceAccountToken: false
      containers:
        - args:
            - '--path.procfs=/host/proc'
            - '--path.sysfs=/host/sys'
            - '--path.rootfs=/host/root'
            - '--web.listen-address=[$(HOST_IP)]:9100'
          env:
            - name: HOST_IP
              value: 0.0.0.0
          image: 'quay.io/prometheus/node-exporter:v1.5.0'
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 9100
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: node-exporter
          ports:
            - containerPort: 9100
              hostPort: 9100
              name: metrics
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 9100
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources: {}
          securityContext:
            allowPrivilegeEscalation: false
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /host/proc
              name: proc
              readOnly: true
            - mountPath: /host/sys
              name: sys
              readOnly: true
            - mountPath: /host/root
              mountPropagation: HostToContainer
              name: root
              readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccount: prometheus-prometheus-node-exporter
      serviceAccountName: prometheus-prometheus-node-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
        - effect: NoSchedule
          operator: Exists
      volumes:
        - hostPath:
            path: /proc
            type: ''
          name: proc
        - hostPath:
            path: /sys
            type: ''
          name: sys
        - hostPath:
            path: /
            type: ''
          name: root
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 5
  desiredNumberScheduled: 5
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  numberUnavailable: 2
  observedGeneration: 1
  updatedNumberScheduled: 5

This is the live manifest of one of the pods that can't be scheduled:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
    kubernetes.io/psp: eks.privileged
  creationTimestamp: '2023-01-23T19:32:19Z'
  generateName: prometheus-prometheus-node-exporter-
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus-node-exporter
    app.kubernetes.io/part-of: prometheus-node-exporter
    app.kubernetes.io/version: 1.5.0
    controller-revision-hash: 7b4cd87594
    helm.sh/chart: prometheus-node-exporter-4.8.1
    pod-template-generation: '1'
  name: prometheus-prometheus-node-exporter-9c5s5
  namespace: prometheus
  ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: DaemonSet
      name: prometheus-prometheus-node-exporter
      uid: 3659924e-2902-4651-aa2a-1d20a1dc1ce7
  resourceVersion: '1155915'
  uid: 98b0cea4-68fe-47ea-83f1-231d5b5809ca
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchFields:
              - key: metadata.name
                operator: In
                values:
                  - ip-192-168-14-194.eu-central-1.compute.internal
  automountServiceAccountToken: false
  containers:
    - args:
        - '--path.procfs=/host/proc'
        - '--path.sysfs=/host/sys'
        - '--path.rootfs=/host/root'
        - '--web.listen-address=[$(HOST_IP)]:9100'
      env:
        - name: HOST_IP
          value: 0.0.0.0
      image: 'quay.io/prometheus/node-exporter:v1.5.0'
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /
          port: 9100
          scheme: HTTP
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      name: node-exporter
      ports:
        - containerPort: 9100
          hostPort: 9100
          name: metrics
          protocol: TCP
      readinessProbe:
        failureThreshold: 3
        httpGet:
          path: /
          port: 9100
          scheme: HTTP
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      resources: {}
      securityContext:
        allowPrivilegeEscalation: false
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /host/proc
          name: proc
          readOnly: true
        - mountPath: /host/sys
          name: sys
          readOnly: true
        - mountPath: /host/root
          mountPropagation: HostToContainer
          name: root
          readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  hostPID: true
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 65534
    runAsGroup: 65534
    runAsNonRoot: true
    runAsUser: 65534
  serviceAccount: prometheus-prometheus-node-exporter
  serviceAccountName: prometheus-prometheus-node-exporter
  terminationGracePeriodSeconds: 30
  tolerations:
    - effect: NoSchedule
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/disk-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/memory-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/pid-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/network-unavailable
      operator: Exists
  volumes:
    - hostPath:
        path: /proc
        type: ''
      name: proc
    - hostPath:
        path: /sys
        type: ''
      name: sys
    - hostPath:
        path: /
        type: ''
      name: root
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: '2023-01-23T19:32:19Z'
      message: >-
        0/5 nodes are available: 1 Too many pods. preemption: 0/5 nodes are
        available: 5 No preemption victims found for incoming pod.
      reason: Unschedulable
      status: 'False'
      type: PodScheduled
  phase: Pending
  qosClass: BestEffort

I also did a test with the node selector "karpenter.sh/capacity-type: on-demand". Then one of the spot instances is deleted, but no new instance is created. The DaemonSet also doesn't create any pods.

PR aws/karpenter-provider-aws#1155 should have fixed the issue of DaemonSets not being part of the scaling decision, but perhaps this is a special case? The node exporter wants a pod on each node because it wants to tap telemetry.

Best regards,

Werner.

Expected Behavior

An extra node to be provisioned.

Actual Behavior

No extra ode is provisioned while two DaemonSet pods can't be scheduled.

Steps to Reproduce the Problem

I did this when there were already three Karpenter nodes, but I think you can just install Prometheus because the nodes are not full.

Resource Specs and Logs

karpenter-6d57cdbbd6-dqgcj.log
karpenter-6d57cdbbd6-lsv9f.log

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Configuration parameter for webhook port

Hello! There is a parameter to configure a webhook port in all your docs - WEBHOOK_PORT. But in the code, it is just a PORT.

Node Repair

Tell us about your request
Allow a configurable expiration of NotReady nodes.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I am observing some behavior in my cluster where occasionally nodes fail to join the cluster, due to some transient error in the kubelet bootstrapping process. These nodes stay in NotReady status. Karpenter continues to assign pods to these nodes, but the k8s scheduler won't schedule to them, leaving pods in limbo for extended periods of time. I would like to be able to configure Karpenter with a TTL for nodes that failed to become Ready. The existing configuration spec.provider.ttlSecondsUntilExpiration doesn't really work for my use case because it will terminate healthy nodes.

Are you currently working around this issue?
Manually deleting stuck nodes.

Additional context
Not sure if this is useful context, but I observed this error on one such stuck node. From /var/log/userdata.log:

Job for sandbox-image.service failed because the control process exited with error code. See "systemctl status sandbox-image.service" and "journalctl -xe" for details.

and then systemctl status sandbox-image.service:

  sandbox-image.service - pull sandbox image defined in containerd config.toml
   Loaded: loaded (/etc/systemd/system/sandbox-image.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2022-06-28 18:47:42 UTC; 2h 9min ago
  Process: 4091 ExecStart=/etc/eks/containerd/pull-sandbox-image.sh (code=exited, status=2)
 Main PID: 4091 (code=exited, status=2)

From reading others issues it looks like this AMI script failed, possibly in the call to ECR: https://github.com/awslabs/amazon-eks-ami/blob/master/files/pull-sandbox-image.sh

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Mega Issue: Karpenter doesnt support custom resources requests/limit

Version

Karpenter: v0.10.1

Kubernetes: v1.20.15

Expected Behavior

Karpenter should be able to trigger an autoscale

Actual Behavior

Karpenter isnt able to trigger an autoscale

Steps to Reproduce the Problem

We're using Karpenter on EKS. We have pods that has custom resource requests/limits in their spec definition - smarter-devices/fuse: 1. Karpenter seems to not respecting this resource and fails to autoscale and the pod remains to be in pending state

Resource Specs and Logs

Provisioner spec

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  limits:
    resources:
      cpu: "100"
  provider:
    launchTemplate: xxxxx
    subnetSelector:
      xxxxx: xxxxx
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - m5.large
    - m5.2xlarge
    - m5.4xlarge
    - m5.8xlarge
    - m5.12xlarge
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  ttlSecondsAfterEmpty: 30
status:
  resources:
    cpu: "32"
    memory: 128830948Ki

pod spec

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fuse-test
  labels:
    app: fuse-test
spec:
  replicas: 1
  selector:
    matchLabels:
      name: fuse-test
  template:
    metadata:
      labels:
        name: fuse-test
    spec:
      containers:
      - name: fuse-test
        image: ubuntu:latest
        ports:
          - containerPort: 8080
            name: web
            protocol: TCP
        securityContext:
          capabilities:
            add:
              - SYS_ADMIN
        resources:
          limits:
            cpu: 32
            memory: 4Gi
            smarter-devices/fuse: 1  # Custom resource
          requests:
            cpu: 32
            memory: 2Gi
            smarter-devices/fuse: 1  # Custom resource
        env:
        - name: S3_BUCKET
          value: test-s3
        - name: S3_REGION
          value: eu-west-1

karpenter controller logs:

controller 2022-06-06T15:59:00.499Z ERROR controller no instance type satisfied resources {"cpu":"32","memory":"2Gi","pods":"1","smarter-devices/fuse":"1"} and requirements kubernetes.io/os In [linux], karpenter.sh/capacity-type In [on-demand], kubernetes.io/hostname In [hostname-placeholder-3403], node.kubernetes.io/instance-type In [m5.12xlarge m5.2xlarge m5.4xlarge m5.8xlarge m5.large], karpenter.sh/provisioner-name In [default], topology.kubernetes.io/zone In [eu-west-1a eu-west-1b], kubernetes.io/arch In [amd64];

Remove Webhooks in Favor of OpenAPI/CEL CRD Validation

Hi,

I'm trying to understand the benefits which the Karpenter webhooks provide.

Looking at the validating webhook code it seems that the same validation could be done via the CRD schema.

The defaulting webhook doesn't seem to do anything at all. In addition, you can add default values to the CRD schema.

Configure Deprovisioning Replacement Node Logic

Tell us about your request

Deprovisioning checks to see if a replacement node is needed given the pods scheduled to the node. If a replacement is needed, Karpenter will wait for the replacement node to become initialized and ready before beginning to deprovision the chosen node.

Karpenter may want to make the waiting condition configurable.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

If a user has particularly restrictive limits, Karpenter could fail to provision a replacement node, where the deprovisioning would fail and Karpenter would try deprovisioning on the cluster again later.

Are you currently working around this issue?

Not an issue, the replacement logic is focused on availability and is the safer option for running nodes. Making it configurable could empower users who care less about availability, and more so about the important respecting expiration or other requirements.

If users have non-restrictive limits, a user is less likely to run into deprovisioning issues with the current state of deprovisioning.

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Performance Regression Testing

Tell us about your request

We do not currently have any mechanism that prevents performance regressions in Karpenter (scheduling latency, etc). We should build out a suite of tests and measure/assert KPIs:

single pod ready
n pods ready (1000?)
topology/affinity and other complex scheduling rules
todo: more test cases

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Support for node problem detector

Tell us about your request

We used to run node problem detector in combination with drainio and cluster-autoscaler. Once a problem was detected drainio/cluster-autoscaler removed the node.

Dranio is not really maintained anymore and cluster-autoscaler is replaced by karpenter in our usecase.

It would be great if karpenter would also look for node conditions like

KernelDeadlock
ReadonlyFilesystem
FrequentKubeletRestart
FrequentDockerRestart
FrequentContainerdRestart

and drain the corresponding node. Maybe this list could be configured in the karpenter setup...

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

We would like to reduce complexity, because karpenter manages nodes anyway it seems to make sense to also include this feature.

Are you currently working around this issue?

Trying to switch to the DataDog drainio fork.

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Capacity Type Distribution

Some application deployments would like the benefits of using Spot capacity, but would also like somewhat of a stability guarantee for the application. I propose a capacity-type percentage distribution of the k8s Deployment resource. Since Capacity-Type is likely to be implemented at the cloud-provider level, this too would need to be at the cloud-provider layer.

For example:

apiVersion: apps/v1
kind: Deployment
metadata: 
  name: inflate
spec:
  replicas: 10
  template:
    metadata:
      labels:
        app: inflate
        node.k8s.aws/capacity-type-distribution/spot-percentage: 90
    spec:
      containers:
      - image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
        name: inflate
        resources:
          requests:
            cpu: "100m"

The above deployment spec would result in the deployment controller creating Pods for the 10 replicas. Karpenter would register a mutating admission webhook which would check if the pod's deployment spec has these labels and then check any current pods belonging to the deployment to determine which capacity-type label to apply. The pod resource after the admission webhook would look like this:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: inflate
    pod-template-hash: 8567cd588
  name: inflate-8567cd588-bjqzf
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: inflate-8567cd588
spec:
  containers:
  - image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
    name: inflate
    resources:
      requests:
        cpu: "100m"
  schedulerName: default-scheduler
  nodeSelector:
      node.k8s.aws/capacity-type: spot

^^ duplicated 8 more times, and then:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: inflate
    pod-template-hash: 4567dc765
  name: inflate-4567dc765-asdf
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: inflate-4567dc765
spec:
  containers:
  - image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
    name: inflate
    resources:
      requests:
        cpu: "100m"
  schedulerName: default-scheduler
  nodeSelector:
      node.k8s.aws/capacity-type: on-demand

report a metric that shows spot vs on-demand usage

Tell us about your request

Report a metric that shows spot vs on-demand usage. It might be helpful to have a few metrics for spot vs on-demand CPU and spot vs on-demand node count.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

I want to monitor the amount of spot vs on-demand capacity in my cluster.

Are you currently working around this issue?

More complex prometheus queries.

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Custom drain flow

Tell us about your request

Add a rollout flag when using drain. It will be used when consolidation and native termination handler (aws/karpenter-provider-aws#2546) will be ready.
The custom drain flow is like this:

Cordon the node
Do a rolling restart of the deployments that have pods running on the node.
Drain the node.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Currently when using consolidation feature or aws-node-termination-handler we can result of downtime or heavy performance degradation with the current implementation of kubectl drain

Current drain will terminate all workloads on a node and the scheduler will try to create those workloads on available nodes and if not any Karpenter will provision new node.
Even with PDB there is some level or degradation.

Are you currently working around this issue?

having a custom bash script that implements an alternative to kubectl drain

https://gist.github.com/juliohm1978/1f24f9259399e1e1edf092f1e2c7b089

Additional Context

kubectl drain leads to downtime even with a PodDisruptionBudget kubernetes/kubernetes#48307

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

support drain timeout

Tell us about your request
Sometimes nodes get stuck when a pod could not be drained, so we have to alert ourselves and then manually kill it, which is not ideal.

It would be nice to set an optional drainTimeout so we can ensure nodes always die after x hours/days.

Are you currently working around this issue?
Helper pod that kills pods on stuck nodes so they finish draining.

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Knative Webhooks not included in liveness checks

Version

Karpenter Version: HEAD

Kubernetes Version: v1.0.0

Expected Behavior

#142

Actual Behavior

Knative webhooks liveness checks aren't included in kubelet liveness probe

Steps to Reproduce the Problem

make apply

Resource Specs and Logs

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

`make test` should use `CGO_ENABLED=1` on mac+arm

make test
go test ./... \
		-race \
		--ginkgo.focus="" \
		-cover -coverprofile=coverage.out -outputdir=. -coverpkg=./...
go: -race requires cgo; enable cgo by setting CGO_ENABLED=1
make: *** [test] Error 2

... so automatically set it to avoid this ?

CGO_ENABLED=1 make test
go test ./... \
		-race \

Adding Dynamic Taints to Nodes based on Tolerations

Tell us about your request

Using this scheduling technique, it's possible to add a custom label to a node based on a corresponding key/value pair specified in the node-selector or node-affinity. This allows us to have nodes from a single provisioner that have differing custom labels based on the workloads that get scheduled on them, and ultimately, this allows for workload isolation at the node level.

But this is only true if every pod using that provisioner specifies that custom label in the node-selector or node-affinity. If a workload doesn't specify it, it can now be scheduled on any node launched by the provisioner and isolation is no longer a guarantee.

It seems that in addition to dynamic labelling, a mechanism to dynamically taint the nodes would help guarantee workload isolation at the node level. This would be critical for multi-tenant scenarios where it's one thing to say that isolation is 'possible', but with this feature, isolation would be inviolable.

@ellistarn had the suggestion of defining taints in the provisioner with an empty or '*' value and then specifying the corresponding tolerations in the workload.

Taint specified in the provisioner:

- key: company.com/team
  value: "*"
  effect: NoSchedule

Toleration:

- key: company.com/team
  value: datascience
  effect: NoSchedule

If the corresponding toleration is not specified in the workload, the understanding is that the node on which it will be scheduled never gets a taint. This allows both isolated and non-isolated workloads to use the same provisioner.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

In a multi-tenant scenario, we prefer to have workloads cleanly segregated by nodes. Currently this is achievable by creating a new provisioner where the clear expectation is that everyone using it is expected to either have the custom label specified in the node-selector or node-affinity, and this would be what determines the workload isolation because their workloads would always be scheduled on nodes with matching labels. But what would be more ideal is to be able to do this directly from the default provisioner where both isolated and non-isolated workloads are able to coexist with it being impossible for the non-isolated pods to be scheduled on the isolated nodes.

Me being greedy

The above would be more than cool enough already. But if it's possible to do away with the necessity of specifying the custom labels or taints on the provisioner, that would be the ultimate level of flexibility. Unlimited node groups controlled by nothing but the workloads. Perhaps something along these lines:

nodeSelector:
  customLabel/company.com/team: datascience

tolerations:
  - key: customLabel/company.com/team
    value: datascience
    effect: NoSchedule

The expectation being that when Karpenter sees 'customLabel/' in the nodeSelector, it understands that it needs to generate a corresponding node label. The same would apply for generating taints when it sees the 'customLabel/' prefix in the tolerations.

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Improve Rate Limiting and Jitter for Karpenter Provisioning Loop

Tell us about your request

Allow the Provisioner controller to pass through its own rate limiting and jitter policy so that it can ensure that backoff isn't too aggressive. Currently, corecontroller.NewSingletonManagedBy doesn't allow you to pass through any custom rate limiter.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

This is relatively straightforward pass-through.

Are you currently working around this issue?

No workaround, we just use the standard backoff retry that controller-runtime uses.

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Spot -> Spot Consolidation

Tell us about your request

originally discussed in aws/karpenter-provider-aws#1091, but when the conversation started to be fruitful "aws locked as resolved and limited conversation to collaborators"

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

using karpenter costs a lot of money after a scale up when nodes are left running behind

Are you currently working around this issue?

various approached discussed in #

Additional Context

No response

Attachments

No response

Community Note

please don't close this and keep conversation going.

Karpenter as default scheduler

Tell us about your request

Right now karpenter is simulating how the kube-scheduler would react in a given scenario. The simulation by karpenter and final decision by kube-scheduler is also not happening at the exact same time.

Maybe it would make sense to have karpenter as a replacement for kube-scheduler?

If it can simulate scheduling decision anyway, it should not be a big effort to replace it completely?

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Are you currently working around this issue?

Additional Context

Default scheduler could be changed like this:
https://stackoverflow.com/questions/63830665/how-to-change-default-kube-scheduler-in-kubernetes

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Documenting how to implement another cloud provider

Is an existing page relevant?
No pages for anything

What karpenter features are relevant?

How should the docs be improved?
Generic documentation about interfaces and what should be implemented in a generic way would help, instead of just giving an implementation for AWS with a lot of specific cases. Done in this way it is impossible to collaborate crafting the providers for other clouds. With this, I mean, with no documentation pages and no documentation comments in the code, how is it supposed to guess what is for Karpenter and what is just for AWS?

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Add cpu-cfs-quota support to kubeletConfiguration

Tell us about your request

Adds cpu-cfs-quota support into kubeletConfiguration

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ...
  kubeletConfiguration:
    cpuCFSQuota: true | false

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

I want to be able to disable CPU CFS quota enforcement for containers that specify CPU limits.

Are you currently working around this issue?

Currently, I'm solving it with the user data script.
Ref: https://aws.amazon.com/premiumsupport/knowledge-center/eks-worker-nodes-image-cache/ But I dislike having to maintain that file. It has caused problems.

Additional Context

In case this one is approved, I would like to implement it.

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Scheduler Conformance Tests

Tell us about your request
What do you want us to build?

I would be great to verify our scheduler conformance by running upstream scheduler conformance tests.

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Inflight checks use beta instance-type label instead of stable

Version

Karpenter Version: v0.22.0

Expected Behavior

I'm not sure but wouldn't it better to use stable label instead of beta?
Or use NormalizedLabels to translate deprecated labels to stable in inflight controller as well

Actual Behavior

https://github.com/aws/karpenter-core/blob/main/pkg/controllers/inflightchecks/nodeshape.go#L55

Steps to Reproduce the Problem

Not sure how it can be reproduced on AWS, but if we are talking about custom implementations, currently we should place beta label on node as well to make inflight checks work. Otherwise, instance type can't be found because there is no beta.kubernetes.io/instance-type label on nodes.

Resource Specs and Logs

No logs

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

CAPA / CAPI support and documentation questions

Hello! 👋

Getting right to it there is this doc: https://github.com/aws/karpenter/blob/main/designs/aws-launch-templates-options.md#capi-integration

This says that CAPI integration is discussed in a different doc. Can someone please point me to that doc? :) That would be awesome.

I'm trying to get integration with CAPA started with Karpenter and I was wondering what elements/objects/components CAPA could/should/shouldn't manage with Karpenter. Any help is much appreciated. Cheers!

Custom k8s scheduler support for Karpenter e.g., Apache YuniKorn, Volcano

Tell us about your request

Add Karpenter support to work with custom schedulers (e.g., Apache Yunikorn, Volcano)
As per my understanding, Karpenter works only with the default scheduler to schedule the pods. However, It's prevalent among the Data on Kubernetes community to use custom schedulers like Apache YuniKorn or Volcano for running Spark jobs on Amazon EKS.
With the requested feature, Karpenter is effectively used as Autoscaler for spinning up new nodes while YuniKorn and Volcano handle scheduling decisions.

Please correct me and provide some context if this feature is already supported.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Using Apache YuniKorn/Volcano is becoming a basic requirement for running batch workloads(e.g., Spark) on Kubernetes. These schedulers are more application aware unlike default scheduler and provides number of other useful features(e.g., resource queues, sorting the jobs) for running multi-tenant data workloads on Kubernetes(Amazon EKS).

At this moment we could only use cluster autoscaler with these custom schedulers but it would be beneficial to add Karpenter support to leverage performance optimised by Karpenter over Cluster Autoscaler.

Are you currently working around this issue?

No, We are using Cluster Autoscaler as an alternative option to work with custom schedulers.

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Mega Issue: Manual node provisioning

Tell us about your request
What do you want us to build?

I'm seeing a number of feature requests to launch nodes separately from pending pods. This issue is intended to broadly track this discussion.

Use Cases:

Create a System pool to run components like karpenter, loadbalancer, coredns, etc
Provision baseline capacity that never scales down
Manually preprovision a set of nodes before a large event

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Provisioner falls back to lower weight when near limit

Version

Karpenter Version: v0.20.0
Kubernetes Version: v1.24.0
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.7-eks-fb459a0", GitCommit:"c240013134c03a740781ffa1436ba2688b50b494", GitTreeState:"clean", BuildDate:"2022-10-24T20:36:26Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}

Expected Behavior

I have two providers , one with a smaller number of instances, and one much bigger
the 1st one has a weight of 50

i've just spun up 20 pods/nodes and noticed it was using both providers and picking instances that could fit in the 1st one.
trying to understand why that happened.

should the weight be bigger?

Actual Behavior

N/A

Steps to Reproduce the Problem

N/A

Resource Specs and Logs

sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.453Z	DEBUG	controller.provisioner	96 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.466Z	DEBUG	controller.provisioner	101 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.480Z	DEBUG	controller.provisioner	164 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.488Z	DEBUG	controller.provisioner	220 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.495Z	DEBUG	controller.provisioner	287 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.500Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.503Z	DEBUG	controller.provisioner	43 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.513Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.516Z	DEBUG	controller.provisioner	96 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.544Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.547Z	DEBUG	controller.provisioner	101 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.568Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.571Z	DEBUG	controller.provisioner	164 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.591Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.600Z	DEBUG	controller.provisioner	215 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.612Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.615Z	DEBUG	controller.provisioner	287 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.625Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.629Z	DEBUG	controller.provisioner	432 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.644Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.646Z	DEBUG	controller.provisioner	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"inflate"}}}	{"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-zhg2b"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.647Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.649Z	DEBUG	controller.provisioner	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"inflate"}}}	{"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-lxct6"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.650Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.651Z	DEBUG	controller.provisioner	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"inflate"}}}	{"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-gz7zp"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.652Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.654Z	DEBUG	controller.provisioner	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"inflate"}}}	{"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-qhq7w"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.655Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.656Z	DEBUG	controller.provisioner	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"inflate"}}}	{"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-nfn6p"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.657Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.659Z	DEBUG	controller.provisioner	relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"inflate"}}}	{"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-nc6rn"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.660Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.662Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.664Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.666Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.669Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.673Z	DEBUG	controller.provisioner	506 out of 599 instance types were excluded because they would breach provisioner limits	{"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "fernando-in-bull-9e58ab3c10185819", no instance type satisfied resources {"cpu":"3","pods":"1"} and requirements karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.k8s.aws/instance-generation In [6 7], karpenter.k8s.aws/instance-category NotIn [a t], karpenter.k8s.aws/instance-memory Exists <130000, karpenter.k8s.aws/instance-cpu Exists <17, kubernetes.io/arch In [amd64 arm64], kubernetes.io/os In [linux], karpenter.k8s.aws/instance-family NotIn [z1d], karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/provisioner-name In [fernando-in-bull-9e58ab3c10185819], topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c us-east-1d us-east-1f], karpenter.k8s.aws/instance-size NotIn [metal]; all available instance types exceed provisioner limits	{"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-zhg2b"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "fernando-in-bull-9e58ab3c10185819", no instance type satisfied resources {"cpu":"3","pods":"1"} and requirements karpenter.k8s.aws/instance-family NotIn [z1d], karpenter.k8s.aws/instance-memory Exists <130000, karpenter.k8s.aws/instance-category NotIn [a t], karpenter.k8s.aws/instance-generation In [6 7], karpenter.sh/provisioner-name In [fernando-in-bull-9e58ab3c10185819], topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c us-east-1d us-east-1f], karpenter.k8s.aws/instance-size NotIn [metal], karpenter.k8s.aws/instance-hypervisor In [nitro], kubernetes.io/os In [linux], karpenter.sh/capacity-type In [on-demand spot], kubernetes.io/arch In [amd64 arm64], karpenter.k8s.aws/instance-cpu Exists <17; all available instance types exceed provisioner limits	{"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-lxct6"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "fernando-in-bull-9e58ab3c10185819", no instance type satisfied resources {"cpu":"3","pods":"1"} and requirements topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c us-east-1d us-east-1f], karpenter.k8s.aws/instance-category NotIn [a t], karpenter.k8s.aws/instance-size NotIn [metal], karpenter.k8s.aws/instance-generation In [6 7], karpenter.k8s.aws/instance-cpu Exists <17, karpenter.k8s.aws/instance-family NotIn [z1d], karpenter.k8s.aws/instance-memory Exists <130000, karpenter.sh/capacity-type In [on-demand spot], karpenter.k8s.aws/instance-hypervisor In [nitro], kubernetes.io/os In [linux], kubernetes.io/arch In [amd64 arm64], karpenter.sh/provisioner-name In [fernando-in-bull-9e58ab3c10185819]; all available instance types exceed provisioner limits	{"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-gz7zp"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "fernando-in-bull-9e58ab3c10185819", no instance type satisfied resources {"cpu":"3","pods":"1"} and requirements kubernetes.io/os In [linux], karpenter.k8s.aws/instance-cpu Exists <17, karpenter.k8s.aws/instance-memory Exists <130000, topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c us-east-1d us-east-1f], karpenter.k8s.aws/instance-generation In [6 7], karpenter.k8s.aws/instance-family NotIn [z1d], kubernetes.io/arch In [amd64 arm64], karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/provisioner-name In [fernando-in-bull-9e58ab3c10185819], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.k8s.aws/instance-category NotIn [a t], karpenter.k8s.aws/instance-size NotIn [metal]; all available instance types exceed provisioner limits	{"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-qhq7w"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "fernando-in-bull-9e58ab3c10185819", no instance type satisfied resources {"cpu":"3","pods":"1"} and requirements kubernetes.io/os In [linux], karpenter.k8s.aws/instance-family NotIn [z1d], karpenter.k8s.aws/instance-size NotIn [metal], karpenter.k8s.aws/instance-cpu Exists <17, kubernetes.io/arch In [amd64 arm64], karpenter.sh/capacity-type In [on-demand spot], karpenter.k8s.aws/instance-generation In [6 7], karpenter.k8s.aws/instance-category NotIn [a t], karpenter.sh/provisioner-name In [fernando-in-bull-9e58ab3c10185819], karpenter.k8s.aws/instance-memory Exists <130000, topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c us-east-1d us-east-1f], karpenter.k8s.aws/instance-hypervisor In [nitro]; all available instance types exceed provisioner limits	{"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-nfn6p"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "fernando-in-bull-9e58ab3c10185819", no instance type satisfied resources {"cpu":"3","pods":"1"} and requirements karpenter.k8s.aws/instance-memory Exists <130000, kubernetes.io/arch In [amd64 arm64], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.sh/provisioner-name In [fernando-in-bull-9e58ab3c10185819], karpenter.k8s.aws/instance-category NotIn [a t], topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c us-east-1d us-east-1f], karpenter.k8s.aws/instance-size NotIn [metal], karpenter.k8s.aws/instance-generation In [6 7], karpenter.sh/capacity-type In [on-demand spot], karpenter.k8s.aws/instance-cpu Exists <17, karpenter.k8s.aws/instance-family NotIn [z1d], kubernetes.io/os In [linux]; all available instance types exceed provisioner limits	{"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-nc6rn"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z	INFO	controller.provisioner	found provisionable pod(s)	{"commit": "f60dacd", "pods": 18}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z	INFO	controller.provisioner	computed new node(s) to fit pod(s)	{"commit": "f60dacd", "nodes": 12, "pods": 12}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.676Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6in.xlarge, c6gn.xlarge, m6g.xlarge, m6i.xlarge, c6a.xlarge and 54 other(s)	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.678Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types m6g.xlarge, m6i.xlarge, m6id.xlarge, c6gd.xlarge, r6gd.xlarge and 41 other(s)	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.679Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6g.xlarge, m6i.xlarge, m6g.xlarge, m6id.xlarge, c7g.xlarge and 41 other(s)	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.680Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6g.xlarge, m6i.xlarge, m6g.xlarge, m6id.xlarge, c7g.xlarge and 41 other(s)	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.682Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6in.xlarge, m6g.xlarge, m6i.xlarge, m6in.xlarge, m6idn.xlarge and 54 other(s)	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.687Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6in.xlarge, m5.xlarge, m5n.xlarge, c6gn.xlarge, m6g.xlarge and 132 other(s)	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.690Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types m5n.xlarge, m5.xlarge, m6g.xlarge, m6i.xlarge, r5.xlarge and 119 other(s)	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.694Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6in.xlarge, c6gd.xlarge, c5a.xlarge, m5n.xlarge, c6gn.xlarge and 113 other(s)	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.698Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6in.xlarge, c6gd.xlarge, c5a.xlarge, m5n.xlarge, c6gn.xlarge and 113 other(s)	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.702Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6in.xlarge, m5.xlarge, m5n.xlarge, c6gn.xlarge, m6g.xlarge and 132 other(s)	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.705Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6g.xlarge, m5n.xlarge, m5.xlarge, m6i.xlarge, m6g.xlarge and 116 other(s)	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.706Z	INFO	controller.provisioner	launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types m5n.xlarge, m6g.xlarge, m5.xlarge, m6i.xlarge, r5.xlarge and 41 other(s)	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.949Z	DEBUG	controller.provisioner.cloudprovider	discovered new ami	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819", "ami": "ami-0607aea1f8780fc6c", "query": "/aws/service/bottlerocket/aws-k8s-1.24/arm64/latest/image_id"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.989Z	DEBUG	controller.provisioner.cloudprovider	discovered launch template	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819", "launch-template-name": "Karpenter-fernando-in-bull-11069147462930180006"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:41.139Z	DEBUG	controller.provisioner.cloudprovider	created launch template	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819", "launch-template-name": "Karpenter-fernando-in-bull-17861520481371219300", "launch-template-id": "lt-08017f24bc95d402a"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:41.297Z	DEBUG	controller.provisioner.cloudprovider	created launch template	{"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool", "launch-template-name": "Karpenter-fernando-in-bull-3445210251536975458", "launch-template-id": "lt-0d54ee3ae6743def9"}

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  creationTimestamp: "2022-12-23T11:58:09Z"
  generation: 1
  name: fernando-in-bull-9e58ab3c10185819
  resourceVersion: "44773"
  uid: 8caa2779-65ca-405d-b2a2-005c9a9eaab2
spec:
  consolidation:
    enabled: true
  limits:
    resources:
      cpu: "100"
  providerRef:
    name: fernando-in-bull-9e58ab3c10185819
  requirements:
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-east-1a
    - us-east-1b
    - us-east-1c
    - us-east-1d
    - us-east-1f
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
    - on-demand
  - key: karpenter.k8s.aws/instance-category
    operator: NotIn
    values:
    - a
    - t
  - key: karpenter.k8s.aws/instance-family
    operator: NotIn
    values:
    - z1d
  - key: karpenter.k8s.aws/instance-size
    operator: NotIn
    values:
    - metal
  - key: karpenter.k8s.aws/instance-hypervisor
    operator: In
    values:
    - nitro
  - key: karpenter.k8s.aws/instance-generation
    operator: In
    values:
    - "6"
    - "7"
  - key: karpenter.k8s.aws/instance-cpu
    operator: Lt
    values:
    - "17"
  - key: karpenter.k8s.aws/instance-memory
    operator: Lt
    values:
    - "130000"
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
    - arm64
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  ttlSecondsUntilExpired: 2592000
  weight: 50
status:
  resources:
    attachable-volumes-aws-ebs: "507"
    cpu: "58"
    ephemeral-storage: 1341467764Ki
    memory: 192902248Ki
    pods: "1089"
    vpc.amazonaws.com/pod-eni: "99"

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  creationTimestamp: "2022-12-23T11:58:09Z"
  generation: 1
  name: fernando-in-bull-9e58ab3c10185819-bigger-hw-pool
  resourceVersion: "44888"
  uid: 7c8d8ddf-9c0c-42db-81bc-a5e13cb5028a
spec:
  consolidation:
    enabled: true
  limits:
    resources:
      cpu: "100"
  providerRef:
    name: fernando-in-bull-9e58ab3c10185819-bigger-hw-pool
  requirements:
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-east-1a
    - us-east-1b
    - us-east-1c
    - us-east-1d
    - us-east-1f
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
    - on-demand
  - key: karpenter.k8s.aws/instance-category
    operator: NotIn
    values:
    - a
    - t
  - key: karpenter.k8s.aws/instance-family
    operator: NotIn
    values:
    - z1d
  - key: karpenter.k8s.aws/instance-size
    operator: NotIn
    values:
    - metal
  - key: karpenter.k8s.aws/instance-hypervisor
    operator: In
    values:
    - nitro
  - key: karpenter.k8s.aws/instance-generation
    operator: NotIn
    values:
    - "1"
    - "2"
  - key: karpenter.k8s.aws/instance-cpu
    operator: Lt
    values:
    - "17"
  - key: karpenter.k8s.aws/instance-memory
    operator: Lt
    values:
    - "130000"
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
    - arm64
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  ttlSecondsUntilExpired: 2592000
status:
  resources:
    attachable-volumes-aws-ebs: "312"
    cpu: "32"
    ephemeral-storage: 825518624Ki
    memory: 63705340Ki
    pods: "620"
    vpc.amazonaws.com/pod-eni: "90"

pod

kind: Deployment
apiVersion: apps/v1
metadata:
  name: inflate
  namespace: pause
  uid: c713003b-cb1c-48de-a495-c8b8e955321f
  resourceVersion: "45492"
  generation: 2
  creationTimestamp: "2022-12-23T14:27:17Z"
  labels:
    app: inflate
  annotations:
    deployment.kubernetes.io/revision: "1"
spec:
  replicas: 20
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: inflate
    spec:
      containers:
      - name: inflate
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
        resources:
          requests:
            cpu: "3"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 0
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      securityContext: {}
      schedulerName: default-scheduler
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: inflate
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: inflate
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
status:
  observedGeneration: 2
  replicas: 20
  updatedReplicas: 20
  readyReplicas: 20
  availableReplicas: 20
  conditions:
  - type: Progressing
    status: "True"
    lastUpdateTime: "2022-12-23T14:28:22Z"
    lastTransitionTime: "2022-12-23T14:27:17Z"
    reason: NewReplicaSetAvailable
    message: ReplicaSet "inflate-6886cd9c5f" has successfully progressed.
  - type: Available
    status: "True"
    lastUpdateTime: "2022-12-23T14:31:04Z"
    lastTransitionTime: "2022-12-23T14:31:04Z"
    reason: MinimumReplicasAvailable
    message: Deployment has minimum availability.

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Deprovisioning fails when node resources exactly equal to provisioner resources

Version

Karpenter Version: v0.22.1
Kubernetes Version: v1.23.0

Expected Behavior

A node using 16cpus should be provisionable as the sole instance in a provisioner with spec.resources.limits set to 16

Actual Behavior

2023-02-01T21:32:42.108Z	INFO	controller.deprovisioning	triggering termination for expired node after TTL	{"commit": "c4a4efd-dirty", "expirationTTL": "720h0m0s", "delay": "421h58m10.108235034s"}
2023-02-01T21:32:42.108Z	INFO	controller.deprovisioning	deprovisioning via expiration replace, terminating 1 nodes redacted/m6i.4xlarge/on-demand and replacing with on-demand node from types m6id.4xlarge, m6i.4xlarge	{"commit": "c4a4efd-dirty"}
2023-02-01T21:32:42.129Z	ERROR	controller.deprovisioning	processing cluster, deprovisioning nodes, launching replacement node, launching machine, cpu resource usage of 16 exceeds limit of 16	{"commit": "c4a4efd-dirty"}

Steps to Reproduce the Problem

Probably create a provisioner with cpu limits exactly equal to node size with a very low ttl and see if the node is deprovisioned after the TTL or not

Resource Specs and Logs

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: bottlerocket-smaller-amd64
spec:
  consolidation:
    enabled: true
  labels:
    architecture: amd64
    base-image: bottlerocket
    worker-group: karpenter-smaller-amd64
  limits:
    resources:
      cpu: "16"
  providerRef:
    name: bottlerocket-general-amd64
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - m6i.4xlarge
    - m6id.4xlarge
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  ttlSecondsUntilExpired: 2592000
  weight: 20
status:
  resources:
    attachable-volumes-aws-ebs: "39"
    cpu: "16"
    ephemeral-storage: 103189828Ki
    memory: 64795788Ki
    pods: "234"

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Chore: Migrate `events.Recorder` to use `Event` as an interface

Currently, Karpenter events.Recorder uses a separate method call and interface method for each event type that it offers such as

func NewRecorder(rec record.EventRecorder) Recorder {
	return &recorder{rec: rec}
}

func (r recorder) WaitingOnDeletionForConsolidation(node *v1.Node) {
	r.rec.Eventf(node, "Normal", "ConsolidateWaiting", "Waiting on deletion to continue consolidation")
}
func (r recorder) WaitingOnReadinessForConsolidation(node *v1.Node) {
	r.rec.Eventf(node, "Normal", "ConsolidateWaiting", "Waiting on readiness to continue consolidation")
}

Rather than relying on making a separate call for each new Event we create, we can decouple the "eventing" and allow other cloudprovider's to utilize the same recorder by creating an recorder.Event interface type. Something like

type Recorder struct {
   rec record.EventRecorder
}

type Event interface {
   InvolvedObject() client.Object
   Type() string
   Reason() string
   Message() string
}

func NewRecorder(rec record.EventRecorder) Recorder {
	return &recorder{rec: rec}
}

func (r *Recorder) Create(evt Event) {
   r.rec.Eventf(evt.InvolvedObject(), evt.Type(), evt.Reason(), evt.Message())
}

As long as cloudproviders implement this event interface, they can simply call out to the recorder to fire an event

This change also means, we will have to think of ways to decouple the Dedupe and Loadshedding recorders from the specific event types that are fired. Right now, they are closely coupled.