aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

Home Page: https://karpenter.sh

License: Apache License 2.0

Makefile 0.36% Go 74.13% Shell 8.30% SCSS 0.16% HTML 16.57% Python 0.14% Smarty 0.27% JavaScript 0.07%

karpenter-provider-aws's Introduction


Karpenter is an open-source node provisioning project built for Kubernetes. Karpenter improves the efficiency and cost of running workloads on Kubernetes clusters by:

  • Watching for pods that the Kubernetes scheduler has marked as unschedulable
  • Evaluating scheduling constraints (resource requests, node selectors, affinities, tolerations, and topology spread constraints) requested by the pods
  • Provisioning nodes that meet the requirements of the pods
  • Removing the nodes when the nodes are no longer needed

Come discuss Karpenter in the #karpenter channel in the Kubernetes Slack, or join the Karpenter working group's bi-weekly calls. If you want to contribute to the Karpenter project, please refer to the Karpenter docs.

Check out the Docs to learn more.

Talks

karpenter-provider-aws's People

Contributors

akestner, bbodenmiller, billrayburn, bwagner5, cameronsenese, chrisnegus, cjerad, dependabot[bot], dewjam, ellistarn, engedaam, eptiger, felix-zhe-huang, geoffcline, github-actions[bot], gliptak, jigisha620, jmdeal, jonathan-innis, mbevc1, mikesir87, njtran, prateekgogia, robertnorthard, rothgar, spring1843, stevehipwell, suket22, tuananh, tzneal


karpenter-provider-aws's Issues

[CapacityReservations] Memory is calculated incorrectly

30 pods x 30Gi should be 900Gi memory but isn't.

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-11-03T20:21:51Z"
    message: '0/5 nodes are available: 1 Insufficient cpu, 5 Insufficient memory.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
➜  karpenter-aws-demo git:(main) ✗ k get metricsproducers.autoscaling.karpenter.sh -oyaml
apiVersion: v1
items:
- apiVersion: autoscaling.karpenter.sh/v1alpha1
  kind: MetricsProducer
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"autoscaling.karpenter.sh/v1alpha1","kind":"MetricsProducer","metadata":{"annotations":{},"name":"demo","namespace":"default"},"spec":{"reservedCapacity":{"nodeSelector":{"eks.amazonaws.com/nodegroup":"default"}}}}
    creationTimestamp: "2020-11-03T18:29:41Z"
    generation: 1
    name: demo
    namespace: default
    resourceVersion: "41164"
    selfLink: /apis/autoscaling.karpenter.sh/v1alpha1/namespaces/default/metricsproducers/demo
    uid: b14f6248-febc-4f3d-9041-5361f7abe4d7
  spec:
    reservedCapacity:
      nodeSelector:
        eks.amazonaws.com/nodegroup: default
  status:
    conditions:
    - lastTransitionTime: "2020-11-03T18:29:41Z"
      status: "True"
      type: Active
    - lastTransitionTime: "2020-11-03T18:29:41Z"
      status: "True"
      type: Calculable
    - lastTransitionTime: "2020-11-03T18:29:41Z"
      status: "True"
      type: Ready
    reservedCapacity:
      cpu: 62%, 24050m/40
      memory: 4%, 6354Mi/161634452Ki
      pods: 14%, 41/290
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Remove dependency on Kustomize and Code Generation

The current config generation is fairly brittle and requires multiple workarounds in our makefile. It's trivial to manually create all k8s resources that we currently generate except for the CRDs. Many other projects simply hand-maintain CRD definitions after generating them a first time.

  1. Remove config generation from the normal build
  2. Make a hack script for config generation
  3. Check in the generated code (and hand-maintain it in the future)
  4. Stop using Kustomize (previously only necessary to merge generated config)

Nodes not correctly counted in ScalableNodeGroup's MNG implementation

We currently count the number of instances in the ASG, which leads to overcounting before the nodes have actually come up. This causes oscillation in the autoscaler, since it believes there are more nodes than there actually are when making scaling decisions.
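
A hedged sketch of counting only instances that are actually in service, rather than every instance attached to the ASG, using aws-sdk-go (the function name is illustrative, not Karpenter's actual implementation):

package nodegroup

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// inServiceCount returns how many instances in the ASG are InService, so that
// pending or terminating instances don't inflate the node count used for
// scaling decisions.
func inServiceCount(name string) (int, error) {
	client := autoscaling.New(session.Must(session.NewSession()))
	out, err := client.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(name)},
		MaxRecords:            aws.Int64(1),
	})
	if err != nil || len(out.AutoScalingGroups) == 0 {
		return 0, err
	}
	count := 0
	for _, instance := range out.AutoScalingGroups[0].Instances {
		if aws.StringValue(instance.LifecycleState) == autoscaling.LifecycleStateInService {
			count++
		}
	}
	return count, nil
}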

[Scalable Node Group] [EKS MNG] Crashloop on poorly formatted ARN

2020-10-05T20:38:02.679Z FATAL failed to instantiate ManagedNodeGroup: invalid arn arn:aws:eks:us-west-2:1234567890:node-group:microservices
github.com/ellistarn/karpenter/pkg/cloudprovider/nodegroup/aws.NewNodeGroup
github.com/ellistarn/karpenter/pkg/cloudprovider/nodegroup/aws/managednodegroup.go:50
github.com/ellistarn/karpenter/pkg/cloudprovider/nodegroup.(*Factory).For
github.com/ellistarn/karpenter/pkg/cloudprovider/nodegroup/factory.go:22
github.com/ellistarn/karpenter/pkg/controllers/scalablenodegroup/v1alpha1.(*Controller).Reconcile
github.com/ellistarn/karpenter/pkg/controllers/scalablenodegroup/v1alpha1/controller.go:68
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
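
One hedged option (not necessarily the fix that was shipped) is to validate the ARN up front with the aws-sdk-go arn helper and return an ordinary error, so the controller can surface it on the ScalableNodeGroup's status and retry instead of crashing the process:

package nodegroup

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws/arn"
)

// parseNodeGroupResource is illustrative: it rejects values that are not EKS
// node group ARNs (like the "node-group:" form in the log above) with an
// ordinary error instead of a fatal exit.
func parseNodeGroupResource(id string) (string, error) {
	parsed, err := arn.Parse(id)
	if err != nil {
		return "", fmt.Errorf("invalid node group arn %q: %w", id, err)
	}
	if parsed.Service != "eks" || !strings.HasPrefix(parsed.Resource, "nodegroup/") {
		return "", fmt.Errorf("%q is not an EKS node group arn", id)
	}
	return strings.TrimPrefix(parsed.Resource, "nodegroup/"), nil
}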

Explore logr

An ideal solution:

  • Delineates logs based on component (e.g. CloudProvider/Allocator/Deallocator)
  • Doesn't require a constant logger to be defined per package
  • Doesn't require dependency injection of a logger

Explore this from controller-runtime:

	// LoggerFrom returns a logger with predefined values from a context.Context.
	// The logger, when used with controllers, can be expected to contain basic information about the object
	// that's being reconciled like:
	// - `reconcilerGroup` and `reconcilerKind` coming from the For(...) object passed in when building a controller.
	// - `name` and `namespace` injected from the reconciliation request.
	//
	// This is meant to be used with the context supplied in a struct that satisfies the Reconciler interface.
	LoggerFrom = log.FromContext

	// LoggerInto takes a context and sets the logger as one of its keys.
	//
	// This is meant to be used in reconcilers to enrich the logger within a context with additional values.
	LoggerInto = log.IntoContext

	// SetLogger sets a concrete logging implementation for all deferred Loggers.
	SetLogger = log.SetLogger
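
A short sketch of what adopting this would look like in a reconciler (hedged; the WithName component is illustrative):

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

type Controller struct{}

func (c *Controller) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The logger from the context already carries name/namespace and the
	// reconciled kind, so no package-level logger or injected dependency is needed.
	logger := log.FromContext(ctx).WithName("allocator")
	logger.Info("reconciling", "request", req.String())
	return ctrl.Result{}, nil
}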

Improve errors

Our error story is a little messy right now.

  1. Errors tend to read like "and then foo and then bar and then baz", which is effectively a bad stack trace.
  2. ctrl (controller-runtime) doesn't interact very well with errors.Wrap() and prints useless stack traces.
  3. It would be great if errors were surfaced as status conditions, logs, and Kubernetes events (see the sketch after this list).
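
A hedged sketch of item 3 (not Karpenter's actual code): wrap the failure once with context and surface the same message as a Kubernetes event, alongside the status condition and log:

package controllers

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// surface reports a reconcile failure in one place: a wrapped error for the
// caller and logs, plus a warning event on the object being reconciled. The
// status condition would be set from the same wrapped error.
func surface(recorder record.EventRecorder, object runtime.Object, err error) error {
	wrapped := fmt.Errorf("reconciling scalable node group: %w", err)
	recorder.Event(object, corev1.EventTypeWarning, "ReconcileError", wrapped.Error())
	return wrapped
}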

Flakey CI

It's unclear what's causing this:

"message": "within stabilization window 8852233077/824643902784 seconds"

Causing test flakes that don't happen locally.

[scalablenodegroup] reconciler runs in a loop when a nodegroup is not active.

Seeing this error occur in a loop multiple times per second; this is similar to what I saw in the horizontal autoscaler. We are updating the status condition with the RequestID from the API error (RequestID: \"f280c139-1e37-4fa0-83bc-ffab4de9eed3\"). Since the RequestID is different on every attempt, each status update changes the object and triggers another reconcile, so the reconciler runs again multiple times per second until the underlying error resolves.

{"level":"error","ts":1605280988.646081,"msg":"Controller failed to reconcile kind: ScalableNodeGroup err: unable to set replicas for node group arn:aws:eks:us-east-2:674320443449:nodegroup/pgogia-dev/default/02babd44-f044-264a-9dc4-84dc3d865f8d, ResourceInUseException: Nodegroup cannot be updated as it is currently not in Active State\n{\n  RespMetadata: {\n    StatusCode: 409,\n    RequestID: \"f280c139-1e37-4fa0-83bc-ffab4de9eed3\"\n  },\n  ClusterName: \"pgogia-dev\",\n  Message_: \"Nodegroup cannot be updated as it is currently not in Active State\",\n  NodegroupName: \"default\"\n}","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:99"}
{"level":"error","ts":1605280988.7963583,"msg":"Controller failed to reconcile kind: ScalableNodeGroup err: unable to set replicas for node group arn:aws:eks:us-east-2:674320443449:nodegroup/pgogia-dev/default/02babd44-f044-264a-9dc4-84dc3d865f8d, ResourceInUseException: Nodegroup cannot be updated as it is currently not in Active State\n{\n  RespMetadata: {\n    StatusCode: 409,\n    RequestID: \"616bb320-ad94-4a32-92e2-43bdcf59bacc\"\n  },\n  ClusterName: \"pgogia-dev\",\n  Message_: \"Nodegroup cannot be updated as it is currently not in Active State\",\n  NodegroupName: \"default\"\n}","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:99"}

[Testing] Namespace torn after APIServer Shutdown

Investigate and resolve. Likely some goroutine ordering problem in local.go

2020-10-01T10:35:58.872-0700	ERROR	Failed to tear down namespace: Delete "http://127.0.0.1:61359/api/v1/namespaces/paintercotton": read tcp 127.0.0.1:61679->127.0.0.1:61359: read: connection reset by peer
github.com/ellistarn/karpenter/pkg/test/environment.(*Local).NewNamespace.func1
	/Users/etarn/workspaces/go/src/github.com/ellistarn/karpenter/pkg/test/environment/local.go:87

Support aarch64 EKS amis

Presently the AWS provider only supports x86_64 worker nodes - this issue is to add aarch64 support to the Provisioner.
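
A hedged sketch of one way the Provisioner could resolve an aarch64 AMI, using the public EKS-optimized AMI SSM parameter (the wrapper function is illustrative, not the actual Karpenter code):

package aws

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ssm"
)

// eksOptimizedAMI looks up the recommended EKS-optimized AMI for a Kubernetes
// version. amiType is "amazon-linux-2" for x86_64 worker nodes or
// "amazon-linux-2-arm64" for aarch64 worker nodes.
func eksOptimizedAMI(kubernetesVersion, amiType string) (string, error) {
	name := fmt.Sprintf("/aws/service/eks/optimized-ami/%s/%s/recommended/image_id", kubernetesVersion, amiType)
	client := ssm.New(session.Must(session.NewSession()))
	out, err := client.GetParameter(&ssm.GetParameterInput{Name: aws.String(name)})
	if err != nil {
		return "", err
	}
	return aws.StringValue(out.Parameter.Value), nil
}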

AWS autoscalinggroup should accept an ARN to identify the ASG

The idea is that ASGs could be specified either by name (like asg-name) or by ARN (like "arn:aws:autoscaling:region:123456789012:autoScalingGroup:uuid:autoScalingGroupName/asg-name"). Today only the asg-name form is accepted:

out, err := a.Client.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
	AutoScalingGroupNames: []*string{aws.String(a.ID)},
	MaxRecords:            aws.Int64(1),
})

The form should be detected automatically rather than requiring the user to say which one they're using: if the value doesn't parse as an ARN, assume it's a name. A sketch of this detection follows.
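
A minimal sketch of that detection, assuming the aws-sdk-go arn helper (the resource-segment parsing is illustrative):

package aws

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws/arn"
)

// groupName accepts either a bare Auto Scaling group name or a full ARN whose
// resource segment ends in "autoScalingGroupName/<asg-name>".
func groupName(id string) string {
	if !arn.IsARN(id) {
		return id // doesn't parse as an ARN, so assume it's a name
	}
	parsed, err := arn.Parse(id)
	if err != nil {
		return id
	}
	if i := strings.LastIndex(parsed.Resource, "autoScalingGroupName/"); i >= 0 {
		return parsed.Resource[i+len("autoScalingGroupName/"):]
	}
	return id
}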

Investigate Cloud Watch metrics integration

As a user, I should be able to scale a resource using a Cloudwatch metric.

We probably want to avoid adding a cloudwatch metric type directly into the Horizontal Autoscaler spec due to the cloud-provider-explosion problem.

Alternatively, we could enable https://github.com/awslabs/k8s-cloudwatch-adapter, but the metric adapter takes the sole "external metrics" slot, which prevents users from using other metrics providers like KEDA. This is detailed extensively in the design https://github.com/awslabs/karpenter/blob/main/docs/DESIGN.md#prometheus-vs-kubernetes-metrics-api.

Are there any other options?

GitHub Actions doesn't capture full error messages.

It may have something to do with the Ginkgo logger.

GitHub Actions:

go fmt ./...
golangci-lint run
go build -o bin/karpenter karpenter/main.go
go test ./... -v -cover
?   	github.com/ellistarn/karpenter/karpenter	[no test files]
?   	github.com/ellistarn/karpenter/pkg/apis	[no test files]
?   	github.com/ellistarn/karpenter/pkg/apis/autoscaling/v1alpha1	[no test files]
?   	github.com/ellistarn/karpenter/pkg/cloudprovider	[no test files]
=== RUN   TestUpdateAutoScalingGroupSuccess
--- PASS: TestUpdateAutoScalingGroupSuccess (0.00s)
=== RUN   TestUpdateManagedNodeGroupSuccess
--- PASS: TestUpdateManagedNodeGroupSuccess (0.00s)
PASS
coverage: 36.4% of statements
ok  	github.com/ellistarn/karpenter/pkg/cloudprovider/aws	0.165s	coverage: 36.4% of statements
?   	github.com/ellistarn/karpenter/pkg/controllers	[no test files]
=== RUN   TestAPIs
Running Suite: Horizontal Autoscaler Suite
==========================================
Random Seed: 1599870287
Will run 1 of 1 specs

FAIL	github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1	9.808s
=== RUN   TestProportionalGetDesiredReplicas
=== RUN   TestProportionalGetDesiredReplicas/ValueMetricType_normal_case
=== RUN   TestProportionalGetDesiredReplicas/ValueMetricType_does_not_scale_from_zero
=== RUN   TestProportionalGetDesiredReplicas/AverageValueMetricType_normal_case

Local:

go mod tidy
go mod download
go vet ./...
go fmt ./...
golangci-lint run
go test ./... -v -cover
?   	github.com/ellistarn/karpenter/karpenter	[no test files]
?   	github.com/ellistarn/karpenter/pkg/apis	[no test files]
?   	github.com/ellistarn/karpenter/pkg/apis/autoscaling/v1alpha1	[no test files]
?   	github.com/ellistarn/karpenter/pkg/cloudprovider	[no test files]
=== RUN   TestUpdateAutoScalingGroupSuccess
--- PASS: TestUpdateAutoScalingGroupSuccess (0.00s)
=== RUN   TestUpdateManagedNodeGroupSuccess
--- PASS: TestUpdateManagedNodeGroupSuccess (0.00s)
PASS
coverage: 36.4% of statements
ok  	github.com/ellistarn/karpenter/pkg/cloudprovider/aws	(cached)	coverage: 36.4% of statements
?   	github.com/ellistarn/karpenter/pkg/controllers	[no test files]
=== RUN   TestAPIs
Running Suite: Horizontal Autoscaler Suite
==========================================
Random Seed: 1599870852
Will run 1 of 1 specs

• Failure [0.021 seconds]
Controller
/Users/etarn/workspaces/go/src/github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/suite_test.go:80
  with an empty resource
  /Users/etarn/workspaces/go/src/github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/suite_test.go:81
    should should create and delete [It]
    /Users/etarn/workspaces/go/src/github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/suite_test.go:90

    Expected success, but got an error:
        <*errors.StatusError | 0xc000456320>: {
            ErrStatus: {
                TypeMeta: {Kind: "", APIVersion: ""},
                ListMeta: {
                    SelfLink: "",
                    ResourceVersion: "",
                    Continue: "",
                    RemainingItemCount: nil,
                },
                Status: "Failure",
                Message: "Internal error occurred: failed calling webhook \"vhorizontalautoscaler.kb.io\": the server could not find the requested resource",
                Reason: "InternalError",
                Details: {
                    Name: "",
                    Group: "",
                    Kind: "",
                    UID: "",
                    Causes: [
                        {
                            Type: "",
                            Message: "failed calling webhook \"vhorizontalautoscaler.kb.io\": the server could not find the requested resource",
                            Field: "",
                        },
                    ],
                    RetryAfterSeconds: 0,
                },
                Code: 500,
            },
        }
        Internal error occurred: failed calling webhook "vhorizontalautoscaler.kb.io": the server could not find the requested resource

    /Users/etarn/workspaces/go/src/github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/suite_test.go:91
------------------------------


Summarizing 1 Failure:

[Fail] Controller with an empty resource [It] should should create and delete
/Users/etarn/workspaces/go/src/github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/suite_test.go:91

Ran 1 of 1 Specs in 4.638 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 0 Skipped
--- FAIL: TestAPIs (4.64s)
FAIL
coverage: 14.3% of statements
FAIL	github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1	6.271s
=== RUN   TestProportionalGetDesiredReplicas
=== RUN   TestProportionalGetDesiredReplicas/ValueMetricType_normal_case
=== RUN   TestProportionalGetDesiredReplicas/ValueMetricType_does_not_scale_from_zero
=== RUN   TestProportionalGetDesiredReplicas/AverageValueMetricType_normal_case
=== RUN   TestProportionalGetDesiredReplicas/AverageValueMetricType_scales_to_zero
=== RUN   TestProportionalGetDesiredReplicas/AverageUtilization_normal_case
=== RUN   TestProportionalGetDesiredReplicas/AverageUtilization_does_not_scale_to_zero
=== RUN   TestProportionalGetDesiredReplicas/Unknown_metric_type_returns_replicas
--- PASS: TestProportionalGetDesiredReplicas (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/ValueMetricType_normal_case (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/ValueMetricType_does_not_scale_from_zero (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/AverageValueMetricType_normal_case (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/AverageValueMetricType_scales_to_zero (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/AverageUtilization_normal_case (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/AverageUtilization_does_not_scale_to_zero (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/Unknown_metric_type_returns_replicas (0.00s)
PASS
coverage: 88.9% of statements
ok  	github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/algorithms	(cached)	coverage: 88.9% of statements
?   	github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/autoscaler	[no test files]
?   	github.com/ellistarn/karpenter/pkg/controllers/metricsproducer/v1alpha1	[no test files]
?   	github.com/ellistarn/karpenter/pkg/controllers/scalablenodegroup/v1alpha1	[no test files]
?   	github.com/ellistarn/karpenter/pkg/metrics	[no test files]
?   	github.com/ellistarn/karpenter/pkg/metrics/clients	[no test files]
?   	github.com/ellistarn/karpenter/pkg/metrics/producers	[no test files]
?   	github.com/ellistarn/karpenter/pkg/test	[no test files]
?   	github.com/ellistarn/karpenter/pkg/utils	[no test files]
FAIL
make: *** [test] Error 1

Observed unusual scaling (up/down) activity in Karpenter even when pods are not changed

Created 3 EKS managed nodegroups: one on-demand and two spot. Tested with a sample nginx deployment. When the deployment is scaled, replicas are expected to spread across the two spot nodegroups using a custom kube-scheduler.

Karpenter initially scaled the instances in the 1st spot nodegroup to deploy the pods, and it keeps scaling the instances in this 1st nodegroup up and down even though the pods have not changed. Karpenter does not scale the 2nd nodegroup for some reason. Details are at https://github.com/jalawala/karpenter-aws-demo/tree/main/k8s_custom_scheduler

[Termination Handler] Explore graceful termination for EC2 Instances

As described in the Design, termination handlers must be layered independently on top of Karpenter's autoscaler. By design, the node termination handler should have no knowledge of autoscaling behavior or configuration, or even of what triggered the scale-down (e.g. manual, preemption, autoscaling).

Potential requirements/solutions include:

  • Protect instances that are being deleted/scaled down to respect poddisruptionbudgets
  • Build a Karpenter CRD to model lifecycle hooks?
  • Use some sort of CloudProvider model to hook into EC2 lifecycle hooks to protect instances (see the sketch after this list).
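
A hedged illustration of the last option: registering an EC2 Auto Scaling lifecycle hook so terminating instances pause in Terminating:Wait, giving a termination handler time to drain the node and respect PodDisruptionBudgets (the hook name and timeout are illustrative):

package aws

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// protectOnTerminate installs a termination lifecycle hook on the ASG. The
// termination handler would drain the node and then complete the action.
func protectOnTerminate(asgName string) error {
	client := autoscaling.New(session.Must(session.NewSession()))
	_, err := client.PutLifecycleHook(&autoscaling.PutLifecycleHookInput{
		AutoScalingGroupName: aws.String(asgName),
		LifecycleHookName:    aws.String("karpenter-drain"),
		LifecycleTransition:  aws.String("autoscaling:EC2_INSTANCE_TERMINATING"),
		HeartbeatTimeout:     aws.Int64(300), // seconds available to finish draining
		DefaultResult:        aws.String("CONTINUE"),
	})
	return err
}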

failed retrieving metric, request failed for query karpenter_reserved_capacity_cpu_utilization{name=\"demo\"}, Post \"http://prometheus-operated:9090/api/v1/query\": dial tcp: lookup prometheus-operated on 10.100.0.10:53: no such host

I have not configured Prometheus yet, but Karpenter is still querying the URL below and failing to collect the metrics. Can you please tell me where this configuration is made, and how do I change it?

failed retrieving metric, request failed for query karpenter_reserved_capacity_cpu_utilization{name="demo"}, Post "http://prometheus-operated:9090/api/v1/query\": dial tcp: lookup prometheus-operated on 10.100.0.10:53: no such host

Karpenter Docker image is not public

Since the plan is to do a developer preview release on Monday, there should be documentation to install, or maybe users can build and install it themselves. We already have install steps in the README, but they will fail because the Karpenter image is not public.

HA should fail gracefully if metric does not exist

{"level":"error","ts":1605916233.9806492,"msg":"Controller failed to reconcile kind: HorizontalAutoscaler err: failed retrieving metric, invalid response for query karpenter_reserved_capacity_cpu_utilization{name="capacity"}, expected instant vector and got vector: []","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:99"}

Code Generation is busted for Webhooks, RBAC, and ClientGen

We might've deleted some code markers which is resulting in odd behavior.
CRD and DeepCopy() generation continues to work normally.

#!/bin/bash
set -eu -o pipefail


# Generate API Deep Copy
controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./pkg/apis/..."
# Generate CRDs
controller-gen crd:trivialVersions=false paths="./pkg/apis/..." "output:crd:artifacts:config=config/crd/bases"

# TODO Fix Me, doesn't generate anything
controller-gen rbac:roleName=karpenter paths="./pkg/controllers/..." output:stdout

# TODO Fix Me, doesn't generate anything
controller-gen webhook paths="./pkg/controllers/..."

# TODO Fix Me, this is broken into above generators
# controller-gen \
#     object:headerFile="hack/boilerplate.go.txt" \
#     webhook \
#     crd:trivialVersions=false \
#     rbac:roleName=manager-role \
#     paths="./pkg/apis/..." \
#     "output:crd:artifacts:config=config/crd/bases"

# TODO Fix Me, creates empty clients
# bash -e $GOPATH/pkg/mod/k8s.io/[email protected]/generate-groups.sh \
#     all \
#     "github.com/ellistarn/karpenter/pkg/client" \
#     "github.com/ellistarn/karpenter/pkg/apis" \
#     "autoscaling:v1alpha1" \
#     --go-header-file ./hack/boilerplate.go.txt -v2

./hack/boilerplate.sh

Move Karpenter control plane image to public repository

The Karpenter control plane image (currently at v0.1.0) is only available via a private container repository.
To simplify access and serve a broader set of deployment scenarios, suggest moving it to a public ECR registry.

Current image location:
197575167141.dkr.ecr.us-west-2.amazonaws.com/karpenter:v0.1.0@sha256:b80ac089c17f15ac37c5f62780c9761e5725463f8a801cb4a4fb69af75c17949

Referenced by:
https://github.com/awslabs/karpenter/blob/main/releases/aws/v0.1.0.yaml

Modify installation procedure to handle karpenter namespace dependency

#183 resolves an issue whereby the command:

eksctl create iamserviceaccount --cluster ${CLUSTER_NAME} \
--name default \
--namespace karpenter \
--attach-policy-arn "arn:aws:iam::${AWS_ACCOUNT_ID}:policy/Karpenter" \
--override-existing-serviceaccounts \
--approve

fails when the karpenter namespace does not exist. This arises when the user performs cloud-provider-specific configuration prior to installing Karpenter.

Continuing with the Karpenter installation after applying the cloud provider config, the Helm chart installation will then fail because the karpenter namespace already exists in the cluster. Helm generates the following error:

Error: rendered manifests contain a resource that already exists.
Unable to continue with install: Namespace "karpenter" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm";
annotation validation error: missing key "meta.helm.sh/release-name": must be set to "karpenter"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "karpenter"

Suggested workaround:

  • Remove namespace creation from the karpenter Helm chart manifest
  • Update the Helm installation procedure to pass --create-namespace --namespace karpenter

Alternate (short-term) workaround:

  • Be prescriptive in instructing users to perform cloud provider configuration only after karpenter installation

Test environment doesn't shut down properly

For some reason, the last line never executes:

		defer ginkgo.GinkgoRecover()
		gomega.Expect(manager.Start(stop)).To(gomega.Succeed(), "Failed to stop Manager")
		// TODO this code isn't shutting down correctly.
		gomega.Expect(environment.Stop()).To(gomega.Succeed())
	}()

This results in many ghost etcd processes that must be cleaned up manually with pkill -f etcd.
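
A hedged sketch of one way to make the teardown deterministic (following the upstream kubebuilder/envtest scaffolding rather than Karpenter's actual local.go): stop the test environment from AfterSuite on the main Ginkgo goroutine instead of from inside the manager goroutine, so it runs even if that goroutine never reaches its last line.

package environment

import (
	"github.com/onsi/ginkgo"
	"github.com/onsi/gomega"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

func startAndRegisterTeardown(manager ctrl.Manager, testEnv *envtest.Environment, stop chan struct{}) {
	go func() {
		defer ginkgo.GinkgoRecover()
		gomega.Expect(manager.Start(stop)).To(gomega.Succeed(), "Failed to run Manager")
	}()

	ginkgo.AfterSuite(func() {
		close(stop) // unblocks manager.Start
		// Tearing down envtest here kills the etcd/apiserver processes even if
		// the manager goroutine is slow to exit, so no ghost etcd is left behind.
		gomega.Expect(testEnv.Stop()).To(gomega.Succeed())
	})
}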
