aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

Home Page: https://karpenter.sh

License: Apache License 2.0

Makefile 0.36% Go 74.13% Shell 8.30% SCSS 0.16% HTML 16.57% Python 0.14% Smarty 0.27% JavaScript 0.07%

karpenter-provider-aws's Introduction


Karpenter is an open-source node provisioning project built for Kubernetes. Karpenter improves the efficiency and cost of running workloads on Kubernetes clusters by:

  • Watching for pods that the Kubernetes scheduler has marked as unschedulable
  • Evaluating scheduling constraints (resource requests, node selectors, affinities, tolerations, and topology spread constraints) requested by the pods
  • Provisioning nodes that meet the requirements of the pods
  • Removing the nodes when the nodes are no longer needed

Come discuss Karpenter in the #karpenter channel in the Kubernetes Slack, or join the Karpenter working group's bi-weekly calls. If you want to contribute to the Karpenter project, please refer to the Karpenter docs.

Check out the Docs to learn more.

Talks

karpenter-provider-aws's People

Contributors

akestner, bbodenmiller, billrayburn, bwagner5, cameronsenese, chrisnegus, cjerad, dependabot[bot], dewjam, ellistarn, engedaam, eptiger, felix-zhe-huang, geoffcline, github-actions[bot], gliptak, jigisha620, jmdeal, jonathan-innis, mbevc1, mikesir87, njtran, prateekgogia, robertnorthard, rothgar, spring1843, stevehipwell, suket22, tuananh, tzneal


karpenter-provider-aws's Issues

[CapacityReservations] Memory is calculated incorrectly

30 pods x 30Gi should be 900Gi memory but isn't.

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-11-03T20:21:51Z"
    message: '0/5 nodes are available: 1 Insufficient cpu, 5 Insufficient memory.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
➜  karpenter-aws-demo git:(main) ✗ k get metricsproducers.autoscaling.karpenter.sh -oyaml
apiVersion: v1
items:
- apiVersion: autoscaling.karpenter.sh/v1alpha1
  kind: MetricsProducer
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"autoscaling.karpenter.sh/v1alpha1","kind":"MetricsProducer","metadata":{"annotations":{},"name":"demo","namespace":"default"},"spec":{"reservedCapacity":{"nodeSelector":{"eks.amazonaws.com/nodegroup":"default"}}}}
    creationTimestamp: "2020-11-03T18:29:41Z"
    generation: 1
    name: demo
    namespace: default
    resourceVersion: "41164"
    selfLink: /apis/autoscaling.karpenter.sh/v1alpha1/namespaces/default/metricsproducers/demo
    uid: b14f6248-febc-4f3d-9041-5361f7abe4d7
  spec:
    reservedCapacity:
      nodeSelector:
        eks.amazonaws.com/nodegroup: default
  status:
    conditions:
    - lastTransitionTime: "2020-11-03T18:29:41Z"
      status: "True"
      type: Active
    - lastTransitionTime: "2020-11-03T18:29:41Z"
      status: "True"
      type: Calculable
    - lastTransitionTime: "2020-11-03T18:29:41Z"
      status: "True"
      type: Ready
    reservedCapacity:
      cpu: 62%, 24050m/40
      memory: 4%, 6354Mi/161634452Ki
      pods: 14%, 41/290
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Remove dependency on Kustomize and Code Generation

The current config generation is fairly brittle and requires multiple workarounds in our makefile. It's trivial to manually create all k8s resources that we currently generate except for the CRDs. Many other projects simply hand-maintain CRD definitions after generating them a first time.

  1. Remove config generation from the normal build
  2. Make a hack script for config generation
  3. Check in the generated code (and hand-maintain it in the future)
  4. Stop using Kustomize (previously only necessary to merge generated config)

Nodes not correctly counted in ScalableNodeGroup's MNG implementation

We currently count the number of instances in the ASG, which leads to overcounting before the nodes have actually come up. This causes oscillation in the autoscaler, since it believes there are more nodes than there actually are when making scaling decisions.
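
A hedged sketch of counting only instances that are actually in service, rather than every instance attached to the ASG, using aws-sdk-go (the function name is illustrative, not Karpenter's actual implementation):

package nodegroup

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// inServiceCount returns how many instances in the ASG are InService, so that
// pending or terminating instances don't inflate the node count used for
// scaling decisions.
func inServiceCount(name string) (int, error) {
	client := autoscaling.New(session.Must(session.NewSession()))
	out, err := client.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(name)},
		MaxRecords:            aws.Int64(1),
	})
	if err != nil || len(out.AutoScalingGroups) == 0 {
		return 0, err
	}
	count := 0
	for _, instance := range out.AutoScalingGroups[0].Instances {
		if aws.StringValue(instance.LifecycleState) == autoscaling.LifecycleStateInService {
			count++
		}
	}
	return count, nil
}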

[Scalable Node Group] [EKS MNG] Crashloop on poorly formatted ARN

2020-10-05T20:38:02.679Z FATAL failed to instantiate ManagedNodeGroup: invalid arn arn:aws:eks:us-west-2:1234567890:node-group:microservices
github.com/ellistarn/karpenter/pkg/cloudprovider/nodegroup/aws.NewNodeGroup
github.com/ellistarn/karpenter/pkg/cloudprovider/nodegroup/aws/managednodegroup.go:50
github.com/ellistarn/karpenter/pkg/cloudprovider/nodegroup.(*Factory).For
github.com/ellistarn/karpenter/pkg/cloudprovider/nodegroup/factory.go:22
github.com/ellistarn/karpenter/pkg/controllers/scalablenodegroup/v1alpha1.(*Controller).Reconcile
github.com/ellistarn/karpenter/pkg/controllers/scalablenodegroup/v1alpha1/controller.go:68
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
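
One hedged option (not necessarily the fix that was shipped) is to validate the ARN up front with the aws-sdk-go arn helper and return an ordinary error, so the controller can surface it on the ScalableNodeGroup's status and retry instead of crashing the process:

package nodegroup

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws/arn"
)

// parseNodeGroupResource is illustrative: it rejects values that are not EKS
// node group ARNs (like the "node-group:" form in the log above) with an
// ordinary error instead of a fatal exit.
func parseNodeGroupResource(id string) (string, error) {
	parsed, err := arn.Parse(id)
	if err != nil {
		return "", fmt.Errorf("invalid node group arn %q: %w", id, err)
	}
	if parsed.Service != "eks" || !strings.HasPrefix(parsed.Resource, "nodegroup/") {
		return "", fmt.Errorf("%q is not an EKS node group arn", id)
	}
	return strings.TrimPrefix(parsed.Resource, "nodegroup/"), nil
}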

Explore logr

An ideal solution:

  • Delineates logs based on component (e.g. CloudProvider/Allocator/Deallocator)
  • Doesn't require a constant logger to be defined per package
  • Doesn't require dependency injection of a logger

Explore this from controller-runtime:

	// LoggerFrom returns a logger with predefined values from a context.Context.
	// The logger, when used with controllers, can be expected to contain basic information about the object
	// that's being reconciled like:
	// - `reconcilerGroup` and `reconcilerKind` coming from the For(...) object passed in when building a controller.
	// - `name` and `namespace` injected from the reconciliation request.
	//
	// This is meant to be used with the context supplied in a struct that satisfies the Reconciler interface.
	LoggerFrom = log.FromContext

	// LoggerInto takes a context and sets the logger as one of its keys.
	//
	// This is meant to be used in reconcilers to enrich the logger within a context with additional values.
	LoggerInto = log.IntoContext

	// SetLogger sets a concrete logging implementation for all deferred Loggers.
	SetLogger = log.SetLogger
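
A short sketch of what adopting this would look like in a reconciler (hedged; the WithName component is illustrative):

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

type Controller struct{}

func (c *Controller) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The logger from the context already carries name/namespace and the
	// reconciled kind, so no package-level logger or injected dependency is needed.
	logger := log.FromContext(ctx).WithName("allocator")
	logger.Info("reconciling", "request", req.String())
	return ctrl.Result{}, nil
}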

Improve errors

Our error story is a little messy right now.

  1. Errors tend to read like "and then foo and then bar and then baz", which is effectively a bad stack trace.
  2. ctrl (controller-runtime) doesn't interact very well with errors.Wrap() and prints useless stack traces.
  3. It would be great if errors were surfaced as status conditions, logs, and Kubernetes events (see the sketch after this list).
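
A hedged sketch of item 3 (not Karpenter's actual code): wrap the failure once with context and surface the same message as a Kubernetes event, alongside the status condition and log:

package controllers

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// surface reports a reconcile failure in one place: a wrapped error for the
// caller and logs, plus a warning event on the object being reconciled. The
// status condition would be set from the same wrapped error.
func surface(recorder record.EventRecorder, object runtime.Object, err error) error {
	wrapped := fmt.Errorf("reconciling scalable node group: %w", err)
	recorder.Event(object, corev1.EventTypeWarning, "ReconcileError", wrapped.Error())
	return wrapped
}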

Flakey CI

It's unclear what's causing this:

"message": "within stabilization window 8852233077/824643902784 seconds"

Causing test flakes that don't happen locally.

[scalablenodegroup] reconciler runs in a loop when a nodegroup is not active.

Seeing this error occur in a loop multiple times per second; this is similar to what I saw in the horizontal autoscaler. We are updating the status condition with the RequestID from the API error (RequestID: \"f280c139-1e37-4fa0-83bc-ffab4de9eed3\"). Since the RequestID is different on every attempt, each status update changes the object and triggers another reconcile, so the reconciler runs again multiple times per second until the underlying error resolves.

{"level":"error","ts":1605280988.646081,"msg":"Controller failed to reconcile kind: ScalableNodeGroup err: unable to set replicas for node group arn:aws:eks:us-east-2:674320443449:nodegroup/pgogia-dev/default/02babd44-f044-264a-9dc4-84dc3d865f8d, ResourceInUseException: Nodegroup cannot be updated as it is currently not in Active State\n{\n  RespMetadata: {\n    StatusCode: 409,\n    RequestID: \"f280c139-1e37-4fa0-83bc-ffab4de9eed3\"\n  },\n  ClusterName: \"pgogia-dev\",\n  Message_: \"Nodegroup cannot be updated as it is currently not in Active State\",\n  NodegroupName: \"default\"\n}","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:99"}
{"level":"error","ts":1605280988.7963583,"msg":"Controller failed to reconcile kind: ScalableNodeGroup err: unable to set replicas for node group arn:aws:eks:us-east-2:674320443449:nodegroup/pgogia-dev/default/02babd44-f044-264a-9dc4-84dc3d865f8d, ResourceInUseException: Nodegroup cannot be updated as it is currently not in Active State\n{\n  RespMetadata: {\n    StatusCode: 409,\n    RequestID: \"616bb320-ad94-4a32-92e2-43bdcf59bacc\"\n  },\n  ClusterName: \"pgogia-dev\",\n  Message_: \"Nodegroup cannot be updated as it is currently not in Active State\",\n  NodegroupName: \"default\"\n}","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:99"}

[Testing] Namespace torn after APIServer Shutdown

Investigate and resolve. Likely some goroutine ordering problem in local.go

2020-10-01T10:35:58.872-0700	ERROR	Failed to tear down namespace: Delete "http://127.0.0.1:61359/api/v1/namespaces/paintercotton": read tcp 127.0.0.1:61679->127.0.0.1:61359: read: connection reset by peer
github.com/ellistarn/karpenter/pkg/test/environment.(*Local).NewNamespace.func1
	/Users/etarn/workspaces/go/src/github.com/ellistarn/karpenter/pkg/test/environment/local.go:87

Support aarch64 EKS amis

Presently the AWS provider only supports x86_64 worker nodes - this issue is to add aarch64 support to the Provisioner.
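
A hedged sketch of one way the Provisioner could resolve an aarch64 AMI, using the public EKS-optimized AMI SSM parameter (the wrapper function is illustrative, not the actual Karpenter code):

package aws

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ssm"
)

// eksOptimizedAMI looks up the recommended EKS-optimized AMI for a Kubernetes
// version. amiType is "amazon-linux-2" for x86_64 worker nodes or
// "amazon-linux-2-arm64" for aarch64 worker nodes.
func eksOptimizedAMI(kubernetesVersion, amiType string) (string, error) {
	name := fmt.Sprintf("/aws/service/eks/optimized-ami/%s/%s/recommended/image_id", kubernetesVersion, amiType)
	client := ssm.New(session.Must(session.NewSession()))
	out, err := client.GetParameter(&ssm.GetParameterInput{Name: aws.String(name)})
	if err != nil {
		return "", err
	}
	return aws.StringValue(out.Parameter.Value), nil
}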

AWS autoscalinggroup should accept an ARN to identify the ASG

The idea is that ASGs could be specified either by name (like asg-name) or by ARN (like "arn:aws:autoscaling:region:123456789012:autoScalingGroup:uuid:autoScalingGroupName/asg-name"). Today only the asg-name form is accepted:

out, err := a.Client.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
	AutoScalingGroupNames: []*string{aws.String(a.ID)},
	MaxRecords:            aws.Int64(1),
})

The form should be detected automatically rather than requiring the user to say which one they're using: if the value doesn't parse as an ARN, assume it's a name. A sketch of this detection follows.
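
A minimal sketch of that detection, assuming the aws-sdk-go arn helper (the resource-segment parsing is illustrative):

package aws

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws/arn"
)

// groupName accepts either a bare Auto Scaling group name or a full ARN whose
// resource segment ends in "autoScalingGroupName/<asg-name>".
func groupName(id string) string {
	if !arn.IsARN(id) {
		return id // doesn't parse as an ARN, so assume it's a name
	}
	parsed, err := arn.Parse(id)
	if err != nil {
		return id
	}
	if i := strings.LastIndex(parsed.Resource, "autoScalingGroupName/"); i >= 0 {
		return parsed.Resource[i+len("autoScalingGroupName/"):]
	}
	return id
}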

Investigate Cloud Watch metrics integration

As a user, I should be able to scale a resource using a Cloudwatch metric.

We probably want to avoid adding a cloudwatch metric type directly into the Horizontal Autoscaler spec due to the cloud-provider-explosion problem.

Alternatively, we could enable https://github.com/awslabs/k8s-cloudwatch-adapter, but the metric adapter takes the sole "external metrics" slot, which prevents users from using other metrics providers like KEDA. This is detailed extensively in the design https://github.com/awslabs/karpenter/blob/main/docs/DESIGN.md#prometheus-vs-kubernetes-metrics-api.

Are there any other options?

GitHub Actions doesn't capture full error messages.

It may have something to do with the Ginkgo logger.

GitHub Actions:

go fmt ./...
golangci-lint run
go build -o bin/karpenter karpenter/main.go
go test ./... -v -cover
?   	github.com/ellistarn/karpenter/karpenter	[no test files]
?   	github.com/ellistarn/karpenter/pkg/apis	[no test files]
?   	github.com/ellistarn/karpenter/pkg/apis/autoscaling/v1alpha1	[no test files]
?   	github.com/ellistarn/karpenter/pkg/cloudprovider	[no test files]
=== RUN   TestUpdateAutoScalingGroupSuccess
--- PASS: TestUpdateAutoScalingGroupSuccess (0.00s)
=== RUN   TestUpdateManagedNodeGroupSuccess
--- PASS: TestUpdateManagedNodeGroupSuccess (0.00s)
PASS
coverage: 36.4% of statements
ok  	github.com/ellistarn/karpenter/pkg/cloudprovider/aws	0.165s	coverage: 36.4% of statements
?   	github.com/ellistarn/karpenter/pkg/controllers	[no test files]
=== RUN   TestAPIs
Running Suite: Horizontal Autoscaler Suite
==========================================
Random Seed: 1599870287
Will run 1 of 1 specs

FAIL	github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1	9.808s
=== RUN   TestProportionalGetDesiredReplicas
=== RUN   TestProportionalGetDesiredReplicas/ValueMetricType_normal_case
=== RUN   TestProportionalGetDesiredReplicas/ValueMetricType_does_not_scale_from_zero
=== RUN   TestProportionalGetDesiredReplicas/AverageValueMetricType_normal_case

Local:

go mod tidy
go mod download
go vet ./...
go fmt ./...
golangci-lint run
go test ./... -v -cover
?   	github.com/ellistarn/karpenter/karpenter	[no test files]
?   	github.com/ellistarn/karpenter/pkg/apis	[no test files]
?   	github.com/ellistarn/karpenter/pkg/apis/autoscaling/v1alpha1	[no test files]
?   	github.com/ellistarn/karpenter/pkg/cloudprovider	[no test files]
=== RUN   TestUpdateAutoScalingGroupSuccess
--- PASS: TestUpdateAutoScalingGroupSuccess (0.00s)
=== RUN   TestUpdateManagedNodeGroupSuccess
--- PASS: TestUpdateManagedNodeGroupSuccess (0.00s)
PASS
coverage: 36.4% of statements
ok  	github.com/ellistarn/karpenter/pkg/cloudprovider/aws	(cached)	coverage: 36.4% of statements
?   	github.com/ellistarn/karpenter/pkg/controllers	[no test files]
=== RUN   TestAPIs
Running Suite: Horizontal Autoscaler Suite
==========================================
Random Seed: 1599870852
Will run 1 of 1 specs

• Failure [0.021 seconds]
Controller
/Users/etarn/workspaces/go/src/github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/suite_test.go:80
  with an empty resource
  /Users/etarn/workspaces/go/src/github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/suite_test.go:81
    should should create and delete [It]
    /Users/etarn/workspaces/go/src/github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/suite_test.go:90

    Expected success, but got an error:
        <*errors.StatusError | 0xc000456320>: {
            ErrStatus: {
                TypeMeta: {Kind: "", APIVersion: ""},
                ListMeta: {
                    SelfLink: "",
                    ResourceVersion: "",
                    Continue: "",
                    RemainingItemCount: nil,
                },
                Status: "Failure",
                Message: "Internal error occurred: failed calling webhook \"vhorizontalautoscaler.kb.io\": the server could not find the requested resource",
                Reason: "InternalError",
                Details: {
                    Name: "",
                    Group: "",
                    Kind: "",
                    UID: "",
                    Causes: [
                        {
                            Type: "",
                            Message: "failed calling webhook \"vhorizontalautoscaler.kb.io\": the server could not find the requested resource",
                            Field: "",
                        },
                    ],
                    RetryAfterSeconds: 0,
                },
                Code: 500,
            },
        }
        Internal error occurred: failed calling webhook "vhorizontalautoscaler.kb.io": the server could not find the requested resource

    /Users/etarn/workspaces/go/src/github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/suite_test.go:91
------------------------------


Summarizing 1 Failure:

[Fail] Controller with an empty resource [It] should should create and delete
/Users/etarn/workspaces/go/src/github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/suite_test.go:91

Ran 1 of 1 Specs in 4.638 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 0 Skipped
--- FAIL: TestAPIs (4.64s)
FAIL
coverage: 14.3% of statements
FAIL	github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1	6.271s
=== RUN   TestProportionalGetDesiredReplicas
=== RUN   TestProportionalGetDesiredReplicas/ValueMetricType_normal_case
=== RUN   TestProportionalGetDesiredReplicas/ValueMetricType_does_not_scale_from_zero
=== RUN   TestProportionalGetDesiredReplicas/AverageValueMetricType_normal_case
=== RUN   TestProportionalGetDesiredReplicas/AverageValueMetricType_scales_to_zero
=== RUN   TestProportionalGetDesiredReplicas/AverageUtilization_normal_case
=== RUN   TestProportionalGetDesiredReplicas/AverageUtilization_does_not_scale_to_zero
=== RUN   TestProportionalGetDesiredReplicas/Unknown_metric_type_returns_replicas
--- PASS: TestProportionalGetDesiredReplicas (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/ValueMetricType_normal_case (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/ValueMetricType_does_not_scale_from_zero (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/AverageValueMetricType_normal_case (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/AverageValueMetricType_scales_to_zero (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/AverageUtilization_normal_case (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/AverageUtilization_does_not_scale_to_zero (0.00s)
    --- PASS: TestProportionalGetDesiredReplicas/Unknown_metric_type_returns_replicas (0.00s)
PASS
coverage: 88.9% of statements
ok  	github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/algorithms	(cached)	coverage: 88.9% of statements
?   	github.com/ellistarn/karpenter/pkg/controllers/horizontalautoscaler/v1alpha1/autoscaler	[no test files]
?   	github.com/ellistarn/karpenter/pkg/controllers/metricsproducer/v1alpha1	[no test files]
?   	github.com/ellistarn/karpenter/pkg/controllers/scalablenodegroup/v1alpha1	[no test files]
?   	github.com/ellistarn/karpenter/pkg/metrics	[no test files]
?   	github.com/ellistarn/karpenter/pkg/metrics/clients	[no test files]
?   	github.com/ellistarn/karpenter/pkg/metrics/producers	[no test files]
?   	github.com/ellistarn/karpenter/pkg/test	[no test files]
?   	github.com/ellistarn/karpenter/pkg/utils	[no test files]
FAIL
make: *** [test] Error 1

Observed unusual scaling (up/down) activity in Karpenter even when pods are not changed

Created 3 EKS managed nodegroups: one on-demand and two spot. Tested with a sample nginx deployment. When the deployment is scaled, replicas are expected to spread across the two spot nodegroups using a custom kube-scheduler.

Karpenter initially scaled the instances in the 1st spot nodegroup to deploy the pods, and it keeps scaling the instances in this 1st nodegroup up and down even though the pods have not changed. Karpenter does not scale the 2nd nodegroup for some reason. Details are at https://github.com/jalawala/karpenter-aws-demo/tree/main/k8s_custom_scheduler

[Termination Handler] Explore graceful termination for EC2 Instances

As described in the Design, termination handlers must be layered independently on top of Karpenter's autoscaler. By design, the node termination handler should have no knowledge of autoscaling behavior or configuration, or even of what triggered the scale-down (e.g. manual, preemption, autoscaling).

Potential requirements/solutions include:

  • Protect instances that are being deleted/scaled down to respect poddisruptionbudgets
  • Build a Karpenter CRD to model lifecycle hooks?
  • Use some sort of CloudProvider model to hook into EC2 lifecycle hooks to protect instances (see the sketch after this list).
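
A hedged illustration of the last option: registering an EC2 Auto Scaling lifecycle hook so terminating instances pause in Terminating:Wait, giving a termination handler time to drain the node and respect PodDisruptionBudgets (the hook name and timeout are illustrative):

package aws

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// protectOnTerminate installs a termination lifecycle hook on the ASG. The
// termination handler would drain the node and then complete the action.
func protectOnTerminate(asgName string) error {
	client := autoscaling.New(session.Must(session.NewSession()))
	_, err := client.PutLifecycleHook(&autoscaling.PutLifecycleHookInput{
		AutoScalingGroupName: aws.String(asgName),
		LifecycleHookName:    aws.String("karpenter-drain"),
		LifecycleTransition:  aws.String("autoscaling:EC2_INSTANCE_TERMINATING"),
		HeartbeatTimeout:     aws.Int64(300), // seconds available to finish draining
		DefaultResult:        aws.String("CONTINUE"),
	})
	return err
}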

failed retrieving metric, request failed for query karpenter_reserved_capacity_cpu_utilization{name=\"demo\"}, Post \"http://prometheus-operated:9090/api/v1/query\": dial tcp: lookup prometheus-operated on 10.100.0.10:53: no such host

I have not configured Prometheus yet, but Karpenter is still querying the URL below and failing to collect the metrics. Can you please tell me where this configuration is made, and how do I change it?

failed retrieving metric, request failed for query karpenter_reserved_capacity_cpu_utilization{name="demo"}, Post "http://prometheus-operated:9090/api/v1/query\": dial tcp: lookup prometheus-operated on 10.100.0.10:53: no such host

Karpenter Docker image is not public

Since the plan is to do a developer preview release on Monday, there should be documentation to install, or maybe users can build and install it themselves. We already have install steps in the README, but they will fail because the Karpenter image is not public.

HA should fail gracefully if metric does not exist

{"level":"error","ts":1605916233.9806492,"msg":"Controller failed to reconcile kind: HorizontalAutoscaler err: failed retrieving metric, invalid response for query karpenter_reserved_capacity_cpu_utilization{name="capacity"}, expected instant vector and got vector: []","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/wait.go:99"}

Code Generation is busted for Webhooks, RBAC, and ClientGen

We might've deleted some code markers which is resulting in odd behavior.
CRD and DeepCopy() generation continues to work normally.

#!/bin/bash
set -eu -o pipefail


# Generate API Deep Copy
controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./pkg/apis/..."
# Generate CRDs
controller-gen crd:trivialVersions=false paths="./pkg/apis/..." "output:crd:artifacts:config=config/crd/bases"

# TODO Fix Me, doesn't generate anything
controller-gen rbac:roleName=karpenter paths="./pkg/controllers/..." output:stdout

# TODO Fix Me, doesn't generate anything
controller-gen webhook paths="./pkg/controllers/..."

# TODO Fix Me, this is broken into above generators
# controller-gen \
#     object:headerFile="hack/boilerplate.go.txt" \
#     webhook \
#     crd:trivialVersions=false \
#     rbac:roleName=manager-role \
#     paths="./pkg/apis/..." \
#     "output:crd:artifacts:config=config/crd/bases"

# TODO Fix Me, creates empty clients
# bash -e $GOPATH/pkg/mod/k8s.io/[email protected]/generate-groups.sh \
#     all \
#     "github.com/ellistarn/karpenter/pkg/client" \
#     "github.com/ellistarn/karpenter/pkg/apis" \
#     "autoscaling:v1alpha1" \
#     --go-header-file ./hack/boilerplate.go.txt -v2

./hack/boilerplate.sh

Move Karpenter control plane image to public repository

The Karpenter control plane image (currently at v0.1.0) is only available via a private container repository.
To simplify access and serve a broader set of deployment scenarios, suggest moving it to a public ECR registry.

Current image location:
197575167141.dkr.ecr.us-west-2.amazonaws.com/karpenter:v0.1.0@sha256:b80ac089c17f15ac37c5f62780c9761e5725463f8a801cb4a4fb69af75c17949

Referenced by:
https://github.com/awslabs/karpenter/blob/main/releases/aws/v0.1.0.yaml

Modify installation procedure to handle karpenter namespace dependency

#183 resolves an issue whereby the command:

eksctl create iamserviceaccount --cluster ${CLUSTER_NAME} \
--name default \
--namespace karpenter \
--attach-policy-arn "arn:aws:iam::${AWS_ACCOUNT_ID}:policy/Karpenter" \
--override-existing-serviceaccounts \
--approve

fails when the karpenter namespace does not exist. This arises when the user performs cloud-provider-specific configuration prior to installing Karpenter.

Continuing with the Karpenter installation after applying the cloud provider config, the Helm chart installation will then fail because the karpenter namespace already exists in the cluster. Helm generates the following error:

Error: rendered manifests contain a resource that already exists.
Unable to continue with install: Namespace "karpenter" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm";
annotation validation error: missing key "meta.helm.sh/release-name": must be set to "karpenter"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "karpenter"

Suggested workaround:

  • Remove namespace creation from the karpenter Helm chart manifest
  • Update the Helm installation procedure to pass --create-namespace --namespace karpenter

Alternate (short-term) workaround:

  • Be prescriptive in instructing users to perform cloud provider configuration only after karpenter installation

Test environment doesn't shut down properly

For some reason, the last line never executes:

		defer ginkgo.GinkgoRecover()
		gomega.Expect(manager.Start(stop)).To(gomega.Succeed(), "Failed to stop Manager")
		// TODO this code isn't shutting down correctly.
		gomega.Expect(environment.Stop()).To(gomega.Succeed())
	}()

This results in many ghost etcd processes that must be cleaned up manually with pkill -f etcd.
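
A hedged sketch of one way to make the teardown deterministic (following the upstream kubebuilder/envtest scaffolding rather than Karpenter's actual local.go): stop the test environment from AfterSuite on the main Ginkgo goroutine instead of from inside the manager goroutine, so it runs even if that goroutine never reaches its last line.

package environment

import (
	"github.com/onsi/ginkgo"
	"github.com/onsi/gomega"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

func startAndRegisterTeardown(manager ctrl.Manager, testEnv *envtest.Environment, stop chan struct{}) {
	go func() {
		defer ginkgo.GinkgoRecover()
		gomega.Expect(manager.Start(stop)).To(gomega.Succeed(), "Failed to run Manager")
	}()

	ginkgo.AfterSuite(func() {
		close(stop) // unblocks manager.Start
		// Tearing down envtest here kills the etcd/apiserver processes even if
		// the manager goroutine is slow to exit, so no ghost etcd is left behind.
		gomega.Expect(testEnv.Stop()).To(gomega.Succeed())
	})
}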
