awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI

Home Page: https://awslabs.github.io/amazon-eks-ami/

License: MIT No Attribution

Makefile 2.77% Shell 54.55% PowerShell 4.32% Dockerfile 0.92% Python 0.53% Go 36.91%

amazon-eks-ami's Introduction

Amazon EKS AMI Build Specification

This repository contains resources and configuration scripts for building a custom Amazon EKS AMI with HashiCorp Packer. This is the same configuration that Amazon EKS uses to create the official Amazon EKS-optimized AMI.

Check out the 📖 documentation to learn more.

🚀 Getting started

If you are new to Amazon EKS, we recommend that you follow our Getting Started chapter in the Amazon EKS User Guide. If you already have a cluster, and you want to launch a node group with your new AMI, see Launching Amazon EKS Worker Nodes.

🔢 Pre-requisites

You must have Packer version 1.8.0 or later installed on your local system. For more information, see Installing Packer in the Packer documentation. You must also have AWS account credentials configured so that Packer can make calls to AWS API operations on your behalf. For more information, see Authentication in the Packer documentation.

👷 Building the AMI

A Makefile is provided to build the Amazon EKS Worker AMI, but it is just a small wrapper around invoking Packer directly. You can initiate the build process by running the following command in the root of this repository:

# build an AMI with the latest Kubernetes version and the default OS distro
make

# build an AMI with a specific Kubernetes version and the default OS distro
make k8s=1.29

# build an AMI with a specific Kubernetes version and a specific OS distro
make k8s=1.29 os_distro=al2023

# check default value and options in help doc
make help

The Makefile chooses a particular kubelet binary to use per Kubernetes version which you can view here.

Note: The default instance type used to build this AMI does not qualify for the AWS free tier. You are charged for any instances created when building this AMI.

🔒 Security

For security issues or concerns, please do not open an issue or pull request on GitHub. Please report any suspected or confirmed security issues to AWS Security: https://aws.amazon.com/security/vulnerability-reporting/

⚖️ License Summary

This sample code is made available under a modified MIT license. See the LICENSE file.

amazon-eks-ami's Issues

Newer Docker version

At the moment Docker version 17.06* is installed.

However the chown parameter for the COPY command is available since version 17.09.0.
See https://stackoverflow.com/a/44766666/948378

When will at least this version become available in this AMI?

How to pass arbitrary additional cmd line arguments to the custom eks-ami

How can I pass node labels to the existing custom eks-ami kubelet script so that we can later use the Kubernetes nodeSelector spec to assign certain pods to a specific set of EC2 nodes?

Add support for T3

Adding T3 Instances failed

Please see the following pull request:
Adds T3 Instance Types

When does EKS support R5?

When creating a cluster, EKS still does not seem to support the R5 instance type. I just want to confirm that here.

How to add two node labels to bootstrap arguments using --kubelet-extra-args

I am trying to do this:
ParameterKey: "BootstrapArguments", ParameterValue: "--kubelet-extra-args --node-labels=MyKey=MyValue,MyKey2=MyValue2"
But this is not working.
If I add only one key-value pair, the node label gets attached to the node.
What am I doing wrong?

ERROR: Redirection (301) without location on --2018-09-05 16:11:11-- https://s3.amazonaws.com/amazon-eks/1.10.3/2018-07-26/bin/linux/amd64/kubelet

Hitting the following when running make:

amazon-ebs: Downloading binaries from: s3://amazon-eks
amazon-ebs: --2018-09-05 16:11:11--  https://s3.amazonaws.com/amazon-eks/1.10.3/2018-07-26/bin/linux/amd64/kubelet
amazon-ebs: Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.114.156
amazon-ebs: Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.114.156|:443... connected.
amazon-ebs: HTTP request sent, awaiting response... 301 Moved Permanently
amazon-ebs: Location: unspecified
amazon-ebs: ERROR: Redirection (301) without location.

Cannot copy AMI amazon-eks-node (permission denied)

Hi folks,

I'm sorry if this is not the right place for posting this question/issue.

I need to copy the AMI id ami-0440e4f6b9713faf6 (amazon-eks-node-v24) to encrypt the copy using my own KMS master key but I get the following error. I'm trying to copy it inside the same region (us-east-1).

"You do not have permission to access the storage of this ami"

Is this something you can help me with?
My user has Admin permissions to do the copy and, in fact, I'm not having issues copying other AMIs, like the Amazon Linux 2 AMI.

Appreciate your help.
Thanks,

Amazon EKS-optimized AMI from parameter store

Is there any plan to make the Amazon EKS-optimized AMI ID available from Parameter Store, just as Amazon ECS does?
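
For readers landing on this question today: Amazon EKS does publish recommended AMI IDs as public Systems Manager parameters. The sketch below only builds the parameter name for the Amazon Linux 2 variant and prints the lookup command instead of running it, since a real call needs credentials and network access. The helper name `eks_ami_parameter_name`, the Kubernetes version, and the region are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Build the public SSM parameter name for the recommended Amazon Linux 2 EKS AMI.
# eks_ami_parameter_name is a helper made up for this sketch.
eks_ami_parameter_name() {
  local k8s_version="$1"
  printf '/aws/service/eks/optimized-ami/%s/amazon-linux-2/recommended/image_id\n' "$k8s_version"
}

name="$(eks_ami_parameter_name 1.29)"

# Printed rather than executed, so the sketch has no side effects.
echo aws ssm get-parameter --name "$name" --region us-west-2 \
  --query Parameter.Value --output text
```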

issue: aws cloudformation create-stack: Parameter validation failed

Hello,

The aws cloudformation create-stack command with multiple labels (--kubelet-extra-args --node-labels=nodesgroup=main,subnets=private) returns this message:
Parameter validation failed: Unknown parameter in Parameters[10]: "subnets", must be one of: ParameterKey, ParameterValue, UsePreviousValue, ResolvedValue
Command:

aws cloudformation create-stack \
  --stack-name ${EKS_NAME}-${EKS_NODE_GROUP01_NAME} \
  --template-body file://amazon-eks-nodegroup.yaml \
  --capabilities CAPABILITY_IAM \
  --parameters \
    ParameterKey=NodeInstanceType,ParameterValue=${EKS_NODE_TYPE} \
    ParameterKey=NodeImageId,ParameterValue=${EKS_WORKER_AMI} \
    ParameterKey=NodeGroupName,ParameterValue=${EKS_NODE_GROUP01_NAME} \
    ParameterKey=NodeAutoScalingGroupMinSize,ParameterValue=${EKS_NODE_GROUP01_MIN} \
    ParameterKey=NodeAutoScalingGroupMaxSize,ParameterValue=${EKS_NODE_GROUP01_MAX} \
    ParameterKey=ClusterControlPlaneSecurityGroup,ParameterValue=${EKS_SG_CONTROLPLANE} \
    ParameterKey=ClusterName,ParameterValue=${EKS_NAME} \
    ParameterKey=Subnets,ParameterValue=${EKS_PRIVATE_IDS//,/\\,} \
    ParameterKey=VpcId,ParameterValue=${EKS_VPC_ID} \
    ParameterKey=KeyName,ParameterValue=${AWS_KEY_PAIR_NAME} \
    ParameterKey=BootstrapArguments,ParameterValue="--kubelet-extra-args --node-labels=nodesgroup=main,subnets=private"

ParameterKey=BootstrapArguments,ParameterValue="--kubelet-extra-args --node-labels=nodesgroup=main" works without issues.

Reference: BootstrapArguments parameter

Best regards,
Vadim Zenin
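
The validation error above suggests that the CLI shorthand parser treated the comma inside the label list as a separator and read `subnets` as a new key. One way to sidestep shorthand parsing is to pass the parameters as JSON. The sketch below writes a JSON parameter file to a temporary location and prints the command instead of running it. The stack name and the temporary file are placeholders, and a real file would list the other template parameters as well. It addresses only the validation error, not how the value is later interpreted on the node.

```shell
#!/usr/bin/env bash
# Temporary file chosen for this sketch; any path readable by the CLI would do.
params_file="$(mktemp)"

# JSON input is not subject to the shorthand rule that splits on commas.
cat > "$params_file" <<'EOF'
[
  {
    "ParameterKey": "BootstrapArguments",
    "ParameterValue": "--kubelet-extra-args --node-labels=nodesgroup=main,subnets=private"
  }
]
EOF

# Printed rather than executed, so the sketch has no side effects.
echo aws cloudformation create-stack \
  --stack-name example-nodegroup \
  --template-body file://amazon-eks-nodegroup.yaml \
  --capabilities CAPABILITY_IAM \
  --parameters "file://${params_file}"
```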

Shipped AWS CLI doesn't know the EKS service

Hi,

The current eks-worker-v20 AMI (ami-dea4d5a1 in us-east-1, copied from ami-73a6e20b in us-west-2) ships version 1.15.31 of the AWS CLI.

EKS support was introduced in version 1.15.32, which means the first step each worker has to take is to teach its local AWS CLI what EKS is:

[root@ip-10-128-8-14 eks]# aws eks
Invalid choice: 'eks', maybe you meant:
  * es
[root@ip-10-128-8-14 eks]# aws configure add-model --service-model file://eks-2017-11-01.normal.json --service-name eks
[root@ip-10-128-8-14 eks]# aws eks
usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:
  aws help
  aws <command> help
  aws <command> <subcommand> help
aws: error: too few arguments

Please either update the shipped AWS CLI or move that step into the AMI build process.
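
A build step could guard against this with a version comparison. The sketch below is one way to do that, assuming GNU `sort` with the `-V` option is available. `version_ge` is a helper defined here, and the two version strings are the ones quoted in this issue.

```shell
#!/usr/bin/env bash
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

shipped="1.15.31"    # AWS CLI version reported for the AMI in this issue
required="1.15.32"   # first version with the eks command, per this issue

if version_ge "$shipped" "$required"; then
  echo "AWS CLI ${shipped} already knows the eks service"
else
  echo "AWS CLI ${shipped} is older than ${required}; update it during the AMI build"
fi
```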

Add support for R5

It seems that R5 instances are not currently supported, and they're not in the default list of EC2 instance types (similar to #27):

2018-09-12T14:33:35Z [INFO] DataStore has no available IP addresses
2018-09-12T14:33:35Z [INFO] Send AddNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: no available IP addresses
2018-09-12T14:33:35Z [ERROR] Failed to get eni IP limit due to unknown instance type r5.2xlarge
2018-09-12T14:33:35Z [INFO] Failed to retrieve ENI IP limit: vpc ip resource(eni ip limit): unknown instance type
2018-09-12T14:33:35Z [DEBUG] IP pool stats: total = 0, used = 0, c.currentMaxAddrsPerENI = 1, c.maxAddrsPerENI = 1
2018-09-12T14:33:35Z [DEBUG] Start increasing IP Pool size
2018-09-12T14:33:35Z [ERROR] Failed to get eni limit due to unknown instance type r5.2xlarge

(Found using troubleshooting guide at: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md)

EKS-Optimized AMI with GPU Support

Just a few questions around the AMI with GPU support:

Is this AMI built from this repo too? If not, where from?
Does the AMI with GPU support use /etc/eks/bootstrap.sh also?

Question on changing base AMI to Red Hat

This is more of a question than an issue, but if we were to change the base AMI to Red Hat:

A) Would EKS still function properly?
B) Would it be covered under AWS Enterprise Support?

Docker/Package Updates & Release Cadence

Is there a specific reason that the AMI doesn't upgrade all packages (minus any required at set versions for Kubernetes) when it is built? Happy to submit a PR for this if it's desired.

Also, is there a plan to release this AMI with the same release cadence as the base image or should we be doing it ourselves?

Tagging released AMIs in git

It would be helpful to know which version of the code here was used to create e.g. the eks-worker-v22 AMI.

Does Docker do log rotation?

According to https://kubernetes.io/docs/concepts/cluster-administration/logging/, log rotation for Docker is required to not fill up the disk. GCE has it by default if kube-up.sh is used, but I can't find anything rotating /var/log/*.log in this repository. Here's how it works for gce.

Am I correct that if I run Docker images on EKS for long enough, it will crash with full disks?
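
For reference, one common way to bound container log growth with Docker's json-file logging driver is to set `max-size` and `max-file` in the daemon configuration. The sketch below writes such a file to a scratch directory. On a real node the file would normally live at /etc/docker/daemon.json, and Docker would need a restart to pick it up. The `OUT_DIR` variable and the size and count values are illustrative assumptions, not settings taken from this repository.

```shell
#!/usr/bin/env bash
# OUT_DIR is a variable made up for this sketch; it defaults to a scratch directory
# so the example can run without root privileges.
out_dir="${OUT_DIR:-$(mktemp -d)}"

# Keep at most 10 rotated files of 10 MB each per container (illustrative values).
cat > "${out_dir}/daemon.json" <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "10"
  }
}
EOF

echo "wrote ${out_dir}/daemon.json"
```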

S3 URL not found when install-worker.sh downloads binaries

When running make with the default settings, install-worker.sh fails to download the kubelet, kubectl, and aws-iam-authenticator binaries from

https://s3-us-west-2.amazonaws.com/amazon-eks-worker/1.10.3/2018-07-26/bin/linux/amd64/

and throws the following error:

amazon-ebs: --2018-08-02 19:18:11--  https://s3-us-west-2.amazonaws.com/amazon-eks-worker/1.10.3/2018-07-26/bin/linux/amd64/kubelet
amazon-ebs: Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 54.231.176.164
amazon-ebs: Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|54.231.176.164|:443... connected.
amazon-ebs: HTTP request sent, awaiting response... 404 Not Found
amazon-ebs: 2018-08-02 19:18:11 ERROR 404: Not Found.

Is there a workaround for this?

Support for instance draining?

Hi all,

Is it within scope for this AMI to drain pods from a node during:

scaling events (i.e., listen to the ASG lifecycle events)
manually triggered reboot/terminations

For context, here's how kube-aws approaches this.

EXTERNAL-IP missing

Hi guys, I'm not sure if there's something I'm missing on the bootstrap.sh script, or it's just something that EKS doesn't support. My use case involves using NodePort based services for setup of the cluster (I'm migrating from GKE), and it seems that the public IP of my EKS worker nodes is not being registered with k8s. Here's a quick summary of my output:

kubectl get no -o wide

NAME                                           STATUS   ROLES    AGE   VERSION   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-122-113-212.us-west-2.compute.internal   Ready    <none>   22h   v1.10.3   <none>        Amazon Linux 2   4.14.62-70.117.amzn2.x86_64   docker://17.6.2

Going for -o yaml confirms that there is no external IP being picked up. However, I've also inspected the actual EC2 instance here and there is in fact a public IP available.

My userdata.sh script is also below:

#!/bin/bash -xe

/etc/eks/bootstrap.sh cluster_name --kubelet-extra-args '--node-labels=something=hello,somethingelse=bye --register-with-taints=taint1=true'

Is there something I am missing?

Metrics Server not working

According to https://aws.amazon.com/blogs/opensource/horizontal-pod-autoscaling-eks/, Metrics Server should work. In my case, it is failing. I'm probably missing something but I'm not sure what that is.

What I did:

# Create a cluster with eksctl

eksctl create cluster \
    -n devops25 \
    -r us-west-2 \
    --kubeconfig cluster/kubecfg-eks \
    --node-type t2.small \
    --nodes 3 \
    --nodes-max 9 \
    --nodes-min 3

export KUBECONFIG=$PWD/cluster/kubecfg-eks

# install tiller

kubectl create \
    -f https://raw.githubusercontent.com/vfarcic/k8s-specs/master/helm/tiller-rbac.yml \
    --record --save-config

helm init --service-account tiller

# Install Metrics Server

helm install stable/metrics-server \
    --name metrics-server \
    --version 2.0.2 \
    --namespace metrics

What I got:

kubectl -n metrics logs -l app=metrics-server

I0925 22:39:59.871109       1 serving.go:273] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
W0925 22:40:00.439451       1 authentication.go:166] cluster doesn't provide client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication to extension api-server won't work.
W0925 22:40:00.447678       1 authentication.go:210] cluster doesn't provide client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication to extension api-server won't work.
[restful] 2018/09/25 22:40:00 log.go:33: [restful/swagger] listing is available at https://:443/swaggerapi
[restful] 2018/09/25 22:40:00 log.go:33: [restful/swagger] https://:443/swaggerui/ is mapped to folder /swagger-ui/
I0925 22:40:00.503815       1 serve.go:96] Serving securely on [::]:443

kubectl top nodes

Error from server (NotFound): the server could not find the requested resource (get services http:heapster:)

kubectl version --output=yaml

clientVersion:
  buildDate: 2018-09-10T11:44:36Z
  compiler: gc
  gitCommit: a4529464e4629c21224b3d52edfe0ea91b072862
  gitTreeState: clean
  gitVersion: v1.11.3
  goVersion: go1.11
  major: "1"
  minor: "11"
  platform: darwin/amd64
serverVersion:
  buildDate: 2018-05-28T20:13:43Z
  compiler: gc
  gitCommit: 2bba0127d85d5a46ab4b778548be28623b32d0b0
  gitTreeState: clean
  gitVersion: v1.10.3
  goVersion: go1.9.3
  major: "1"
  minor: "10"
  platform: linux/amd64

Console shows that the cluster uses platform version eks.2.

Enable log persistence

Right now, to use node-problem-detector, for example, one has to run sudo mkdir -p /var/log/journal on a node created with the AMI. Running this enables log persistence and allows tools relying on such persistence (e.g. NPD) to work. I can work around this, of course, but having tools such as NPD just work with the AMI would be a reasonable expectation, imo.

There's a question in AWS forums about enabling it for Amazon Linux 2 AMI, but there doesn't seem to be any progress on it, so perhaps we can enable it at the EKS worker AMI level?

bootstrap script must Register External IPs and External Address

The bootstrap script only registers Internal IPs and addresses.

Add Support for SpotFleets

It would be great to be able to launch worker nodes as spot fleets. Any thoughts on the best way to get there? Seems like we'd want to generalize the NodeGroup section to include a type for SpotFleet. As for selecting which one to use, it'd seem ideal if one could specify multiple ASGs and they'd each be templated into the final configuration file.

--kubelet-extra-args to support more than one parameter

As far as I can understand, it doesn't seem to be possible to use --kubelet-extra-args to pass more than one extra parameter. I would like to use this parameter to set both labels and taints, but at the moment it only takes the first parameter after it.

ParameterKey=BootstrapArguments,ParameterValue="--kubelet-extra-args --register-with-taints=mytaintkey:mytaintvalueNoSchedule --node-labels=mykey=mylabel"

Does not apply the labels.
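
The behavior described here matches what happens when an option parser reads exactly one word after `--kubelet-extra-args` and the flags reach it unquoted. The sketch below shows the effect with a stand-in parser, `parse_bootstrap_args`, which is written for this example and is not the real bootstrap script. The taint and label values are also illustrative.

```shell
#!/usr/bin/env bash
# Stand-in parser for this sketch: it consumes ONE word after --kubelet-extra-args,
# the same shape as the quoted userdata example elsewhere on this page.
parse_bootstrap_args() {
  local extra=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --kubelet-extra-args)
        extra="$2"   # only the next word is consumed
        shift 2
        ;;
      *)
        shift        # everything else is ignored in this sketch
        ;;
    esac
  done
  printf '%s\n' "$extra"
}

# Unquoted: the shell splits the flags into separate words, so only the first is kept.
parse_bootstrap_args --kubelet-extra-args --register-with-taints=dedicated=test:NoSchedule --node-labels=mykey=mylabel

# Quoted: both flags travel together as a single word.
parse_bootstrap_args --kubelet-extra-args '--register-with-taints=dedicated=test:NoSchedule --node-labels=mykey=mylabel'
```

If the argument string is substituted into user data without quotes, wrapping the kubelet flags in single quotes inside the parameter value is one thing to try.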

Where are older versions of the AMI published?

I can't find them here or on https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html.

Node becomes NotReady

We are running EKS in Ireland and our nodes regularly become unhealthy.

It is not possible to SSH to the host, and pods are not reachable. We have experienced this with t2.xlarge, t2.small and t3.medium instances.

We could SSH to another node in the cluster and ping the NotReady node, but could not SSH to it from there either.

Graphs show memory usage going high at about the same time as the journalctl logs below. EBS I/O also goes high. The exact time is hard to pinpoint. I added logs with interesting 'failures' around the time that we think the node disappeared.

We are using the cluster for running tests, so pods are getting created and destroyed often.

We have not done anything described in #51 for log rotation.

Cluster Information:
CNI: Latest daemonset with image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.2.1
Region: eu-west-1

LOGS

Node AMI

AMI ID amazon-eks-node-v24 (ami-0c7a4976cb6fafd3a)

File system

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        1,9G     0  1,9G   0% /dev
tmpfs           1,9G     0  1,9G   0% /dev/shm
tmpfs           1,9G  2,3M  1,9G   1% /run
tmpfs           1,9G     0  1,9G   0% /sys/fs/cgroup
/dev/nvme0n1p1   64G   40G   25G  62% /
tmpfs           389M     0  389M   0% /run/user/1000

kubectl describe node

Name: ip-<secret>.eu-west-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.medium
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=eu-west-1
failure-domain.beta.kubernetes.io/zone=eu-west-1b
kubernetes.io/hostname=ip-<secret>.eu-west-1.compute.internal
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 30 Oct 2018 11:25:48 +0100
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk Unknown Wed, 31 Oct 2018 10:56:53 +0100 Wed, 31 Oct 2018 10:57:35 +0100 NodeStatusUnknown Kubelet stopped posting node status.
MemoryPressure Unknown Wed, 31 Oct 2018 10:56:53 +0100 Wed, 31 Oct 2018 10:57:35 +0100 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 31 Oct 2018 10:56:53 +0100 Wed, 31 Oct 2018 10:57:35 +0100 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure False Wed, 31 Oct 2018 10:56:53 +0100 Tue, 30 Oct 2018 11:25:46 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready Unknown Wed, 31 Oct 2018 10:56:53 +0100 Wed, 31 Oct 2018 10:57:35 +0100 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: <secret>
Hostname: ip-<secret>..eu-west-1.compute.internal
Capacity:
cpu: 2
ephemeral-storage: 67096556Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3980344Ki
pods: 17
Allocatable:
cpu: 2
ephemeral-storage: 61836185908
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3877944Ki
pods: 17
System Info:
Machine ID: asdf
System UUID: asdf
Boot ID: asdf
Kernel Version: 4.14.62-70.117.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.6.2
Kubelet Version: v1.10.3
Kube-Proxy Version: v1.10.3
ProviderID: aws:///eu-west-1b/i-<secret>
Non-terminated Pods: (14 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system aws-node-hshhg 10m (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-fkrb8 100m (5%) 0 (0%) 0 (0%) 0 (0%)
monitoring datadog-datadog-bk5bd 200m (10%) 200m (10%) 256Mi (6%) 256Mi (6%)
monitoring prometheus-node-exporter-4z2dg 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t1 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t2 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t3 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t4 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t5 250m (12%) 250m (12%) 500Mi (13%) 500Mi (13%)
t6 0 (0%) 0 (0%) 0 (0%) 0 (0%)
t7 250m (12%) 250m (12%) 500Mi (13%) 500Mi (13%)
t8 100m (5%) 0 (0%) 256Mi (6%) 0 (0%)
t9 250m (12%) 250m (12%) 500Mi (13%) 500Mi (13%)
t10 0 (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1160m (57%) 950m (47%)
memory 2012Mi (53%) 1756Mi (46%)
Events: <none>
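
As a side note on the "pods: 17" capacity reported above: with the VPC CNI, the pod limit is commonly derived as ENIs × (IPv4 addresses per ENI − 1) + 2. The sketch below works that out for a t3.medium, assuming 3 ENIs with 6 IPv4 addresses each (the ipamd log in this report also lists 6 addresses per ENI). `max_pods` is a helper defined here.

```shell
#!/usr/bin/env bash
# max pods = ENIs * (IPv4 addresses per ENI - 1) + 2
# One address per ENI is the ENI's own primary IP; the +2 covers host-network pods.
max_pods() {
  local enis="$1" ips_per_eni="$2"
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}

# t3.medium: assumed 3 ENIs with 6 IPv4 addresses each
max_pods 3 6   # prints 17
```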

journalctl logs around the time

okt 31 10:01:29 ip-<secret>.eu-west-1.compute.internal kernel: aws-k8s-agent: page allocation stalls for 10404ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
okt 31 10:01:30 ip-<secret>.eu-west-1.compute.internal kernel: aws-k8s-agent cpuset=1ef2c300b3981b045f3f2fcab050f674afead7e7c828362ec2d40ef82bf02441 mems_allowed=0
okt 31 10:01:31 ip-<secret>.eu-west-1.compute.internal kernel: CPU: 1 PID: 6267 Comm: aws-k8s-agent Not tainted 4.14.62-70.117.amzn2.x86_64 #1
okt 31 10:01:34 ip-<secret>.eu-west-1.compute.internal kernel: Hardware name: Amazon EC2 t3.medium/, BIOS 1.0 10/16/2017
okt 31 10:01:36 ip-<secret>.eu-west-1.compute.internal kernel: Call Trace:
okt 31 10:01:38 ip-<secret>.eu-west-1.compute.internal kernel: dump_stack+0x5c/0x82
okt 31 10:01:39 ip-<secret>.eu-west-1.compute.internal kernel: warn_alloc+0x114/0x1c0
okt 31 10:01:41 ip-<secret>.eu-west-1.compute.internal kernel: __alloc_pages_slowpath+0x831/0xe00
okt 31 10:01:42 ip-<secret>.eu-west-1.compute.internal kernel: ? get_page_from_freelist+0x371/0xba0
okt 31 10:01:45 ip-<secret>.eu-west-1.compute.internal kernel: __alloc_pages_nodemask+0x227/0x250
okt 31 10:01:46 ip-<secret>.eu-west-1.compute.internal kernel: filemap_fault+0x204/0x5f0
okt 31 10:01:47 ip-<secret>.eu-west-1.compute.internal kernel: __xfs_filemap_fault.constprop.8+0x49/0x120 [xfs]
okt 31 10:01:50 ip-<secret>.eu-west-1.compute.internal kernel: __do_fault+0x20/0x60
okt 31 10:01:52 ip-<secret>.eu-west-1.compute.internal kernel: handle_pte_fault+0x945/0xeb0
okt 31 10:01:55 ip-<secret>.eu-west-1.compute.internal kernel: __handle_mm_fault+0x431/0x540
okt 31 10:01:57 ip-<secret>.eu-west-1.compute.internal kernel: handle_mm_fault+0xaa/0x1e0
okt 31 10:02:00 ip-<secret>.eu-west-1.compute.internal kernel: __do_page_fault+0x23e/0x4c0
okt 31 10:02:02 ip-<secret>.eu-west-1.compute.internal kernel: ? async_page_fault+0x2f/0x50
okt 31 10:02:07 ip-<secret>.eu-west-1.compute.internal kernel: async_page_fault+0x45/0x50
okt 31 10:02:09 ip-<secret>.eu-west-1.compute.internal kernel: RIP: 0001:0x1f
okt 31 10:02:12 ip-<secret>.eu-west-1.compute.internal kernel: RSP: 0000:000000c420170f58 EFLAGS: 4d32dce245d7
okt 31 10:02:15 ip-<secret>.eu-west-1.compute.internal kernel: Mem-Info:
okt 31 10:02:16 ip-<secret>.eu-west-1.compute.internal kernel: active_anon:895836 inactive_anon:8314 isolated_anon:0
active_file:413 inactive_file:596 isolated_file:0
unevictable:0 dirty:1 writeback:0 unstable:0
slab_reclaimable:17241 slab_unreclaimable:26888
mapped:22510 shmem:28069 pagetables:7173 bounce:0
free:21650 free_pcp:12 free_cma:0
okt 31 10:02:17 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 active_anon:3583344kB inactive_anon:33256kB active_file:1652kB inactive_file:2384kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:90040kB dirty:4kB writeback:0kB shmem:112276kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 16384kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
okt 31 10:02:19 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 DMA free:15620kB min:268kB low:332kB high:396kB active_anon:288kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
okt 31 10:02:22 ip-<secret>.eu-west-1.compute.internal kernel: lowmem_reserve[]: 0 2951 3849 3849
okt 31 10:02:24 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 DMA32 free:54972kB min:51600kB low:64500kB high:77400kB active_anon:2799324kB inactive_anon:31696kB active_file:556kB inactive_file:816kB unevictable:0kB writepending:0kB present:3129320kB managed:3044324kB mlocked:0kB kernel_stack:7968kB pagetables:19844kB bounce:0kB free_pcp:148kB local_pcp:0kB free_cma:0kB
okt 31 10:02:27 ip-<secret>.eu-west-1.compute.internal kernel: lowmem_reserve[]: 0 0 898 898
okt 31 10:02:30 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 Normal free:15408kB min:15708kB low:19632kB high:23556kB active_anon:783732kB inactive_anon:1560kB active_file:884kB inactive_file:1392kB unevictable:0kB writepending:4kB present:987136kB managed:920112kB mlocked:0kB kernel_stack:4304kB pagetables:8848kB bounce:0kB free_pcp:420kB local_pcp:0kB free_cma:0kB
okt 31 10:02:32 ip-<secret>.eu-west-1.compute.internal kernel: lowmem_reserve[]: 0 0 0 0
okt 31 10:02:34 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 2*32kB (UM) 3*64kB (UM) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 2*1024kB (UM) 0*2048kB 3*4096kB (ME) = 15620kB
okt 31 10:02:36 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 DMA32: 1659*4kB (UME) 1496*8kB (UME) 1181*16kB (UME) 446*32kB (UME) 54*64kB (UME) 1*128kB (E) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55356kB
okt 31 10:02:37 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 Normal: 334*4kB (UMEH) 351*8kB (UMEH) 431*16kB (UMEH) 93*32kB (UMEH) 4*64kB (H) 2*128kB (H) 4*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15552kB
okt 31 10:02:38 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
okt 31 10:02:38 ip-<secret>.eu-west-1.compute.internal kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
okt 31 10:02:40 ip-<secret>.eu-west-1.compute.internal kernel: 28769 total pagecache pages
okt 31 10:02:40 ip-<secret>.eu-west-1.compute.internal kernel: 0 pages in swap cache
okt 31 10:02:42 ip-<secret>.eu-west-1.compute.internal kernel: Swap cache stats: add 0, delete 0, find 0/0
okt 31 10:02:43 ip-<secret>.eu-west-1.compute.internal kernel: Free swap = 0kB
okt 31 10:02:49 ip-<secret>.eu-west-1.compute.internal systemd-journal[26209]: Permanent journal is using 392.0M (max allowed 4.0G, trying to leave 4.0G free of 38.6G available → current limit 4.0G).
okt 31 10:02:49 ip-<secret>.eu-west-1.compute.internal kernel: Total swap = 0kB
okt 31 10:02:49 ip-<secret>.eu-west-1.compute.internal kernel: 1033112 pages RAM
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: 0 pages HighMem/MovableOnly
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: 38026 pages reserved
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: process-agent: page allocation stalls for 10580ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: process-agent cpuset=67b33ad9edc4663ce3e97ac968df4726a9beeff073706349383b1e9eabd93125 mems_allowed=0
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: CPU: 1 PID: 7452 Comm: process-agent Not tainted 4.14.62-70.117.amzn2.x86_64 #1
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: Hardware name: Amazon EC2 t3.medium/, BIOS 1.0 10/16/2017
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: Call Trace:
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: dump_stack+0x5c/0x82
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: warn_alloc+0x114/0x1c0
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: __alloc_pages_slowpath+0x831/0xe00
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: ? get_page_from_freelist+0x371/0xba0
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: __alloc_pages_nodemask+0x227/0x250
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: filemap_fault+0x204/0x5f0
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: __xfs_filemap_fault.constprop.8+0x49/0x120 [xfs]
okt 31 10:02:50 ip-<secret>.eu-west-1.compute.internal kernel: __do_fault+0x20/0x60
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: handle_pte_fault+0x945/0xeb0
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: ? __switch_to_asm+0x34/0x70
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: ? __switch_to_asm+0x40/0x70
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: __handle_mm_fault+0x431/0x540
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: handle_mm_fault+0xaa/0x1e0
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: __do_page_fault+0x23e/0x4c0
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: ? async_page_fault+0x2f/0x50
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: async_page_fault+0x45/0x50
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: RIP: 00f1:0x11
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: RSP: 0234:00007f768c4e5d38 EFLAGS: 00000000
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: Mem-Info:
okt 31 10:02:51 ip-<secret>.eu-west-1.compute.internal kernel: active_anon:895956 inactive_anon:8314 isolated_anon:0
active_file:15 inactive_file:24 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:17237 slab_unreclaimable:26796
mapped:21743 shmem:28069 pagetables:7196 bounce:0
free:21560 free_pcp:682 free_cma:0
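
To put the first Mem-Info block (10:02:16) in perspective, the page counts can be converted to kilobytes. The sketch below assumes the usual 4 kB page size on x86_64, which agrees with the "active_anon:3583344kB" figure in the per-node line that follows that block.

```shell
#!/usr/bin/env bash
page_kb=4                 # assumed x86_64 page size in kB
active_anon_pages=895836  # active_anon from the first Mem-Info block
total_pages=1033112       # from the "pages RAM" line

active_anon_kb=$(( active_anon_pages * page_kb ))
total_kb=$(( total_pages * page_kb ))
pct=$(( active_anon_pages * 100 / total_pages ))   # integer division, rounds down

echo "active_anon: ${active_anon_kb} kB of ${total_kb} kB (${pct}%)"
# prints: active_anon: 3583344 kB of 4132448 kB (86%)
```

With no swap configured ("Total swap = 0kB"), anonymous memory at this level leaves very little page cache, which is consistent with the page allocation stalls in the trace.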

plugin logs

2018-10-31T10:50:15Z [INFO] Starting CNI Plugin v1.2.1  ...
2018-10-31T10:50:15Z [INFO] Received CNI del request: ContainerID(56904923f2dfb96db21ddfb6d39f2429d641141f78511d07823bd315feaf4302) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=monitoring;K8S_POD_NAME=datadog-datadog-bk5bd;K8S_POD_INFRA_CONTAINER_ID=56904923f2dfb96db21ddfb6d39f2429d641141f78511d07823bd315feaf4302) Path(/opt/aws-cni/bin:/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2018-10-31T10:50:15Z [ERROR] Error received from DelNetwork grpc call for pod datadog-datadog-bk5bd namespace monitoring container 56904923f2dfb96db21ddfb6d39f2429d641141f78511d07823bd315feaf4302: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

ipamd.log

2018-10-31T10:05:43Z [DEBUG] Found ip addresses [10.0.1.72 10.0.1.208 10.0.1.36 10.0.1.86 10.0.1.219 10.0.1.63] on eni 02:af:21:3c:f9:4e
2018-10-31T10:05:44Z [DEBUG] Found eni mac address : 02:b3:1a:eb:c3:5e
2018-10-31T10:05:52Z [DEBUG] Using device number 0 for primary eni: eni-0f37efb5e4ebecf09
2018-10-31T10:05:52Z [DEBUG] Found eni: eni-0f37efb5e4ebecf09, mac 02:b3:1a:eb:c3:5e, device 0
2018-10-31T10:05:55Z [DEBUG] Found cidr 10.0.1.0/24 for eni 02:b3:1a:eb:c3:5e
2018-10-31T10:05:59Z [DEBUG] Found ip addresses [10.0.1.143 10.0.1.96 10.0.1.65 10.0.1.209 10.0.1.134 10.0.1.8] on eni 02:b3:1a:eb:c3:5e
2018-10-31T10:05:59Z [DEBUG] Reconcile existing ENI eni-0ce38d7ac411b07ab IP pool
2018-10-31T10:05:59Z [DEBUG] Reconcile and skip primary IP 10.0.1.117 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.53 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.53 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.102 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.102 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.120 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.120 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.42 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.42 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0ce38d7ac411b07ab)'s IPv4 address 10.0.1.59 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.59 on eni eni-0ce38d7ac411b07ab
2018-10-31T10:06:00Z [DEBUG] Reconcile existing ENI eni-0f1db76fd54b2e3f5 IP pool
2018-10-31T10:06:00Z [DEBUG] Reconcile and skip primary IP 10.0.1.72 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.208 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.208 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.36 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.36 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.86 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.86 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.219 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.219 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f1db76fd54b2e3f5)'s IPv4 address 10.0.1.63 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.63 on eni eni-0f1db76fd54b2e3f5
2018-10-31T10:06:00Z [DEBUG] Reconcile existing ENI eni-0f37efb5e4ebecf09 IP pool
2018-10-31T10:06:00Z [DEBUG] Reconcile and skip primary IP 10.0.1.143 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.96 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.96 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.65 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.65 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.209 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.209 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.134 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.134 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Adding ENI(eni-0f37efb5e4ebecf09)'s IPv4 address 10.0.1.8 to datastore
2018-10-31T10:06:00Z [DEBUG] IP Address Pool stats: total: 15, assigned: 12
2018-10-31T10:06:00Z [DEBUG] Reconciled IP 10.0.1.8 on eni eni-0f37efb5e4ebecf09
2018-10-31T10:06:00Z [DEBUG] Successfully Reconciled ENI/IP pool
2018-10-31T10:06:06Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:06Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:06Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:11Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:12Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:13Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:17Z [INFO]  Pods deleted on my node: t1
2018-10-31T10:06:18Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:19Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:19Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:25Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:25Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:26Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:33Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:06:34Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:06:35Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:06:35Z [DEBUG] Reconciling ENI/IP pool info...
2018-10-31T10:07:19Z [INFO]  Pods deleted on my node: t2
2018-10-31T10:09:29Z [INFO]  Pods deleted on my node: t3
2018-10-31T10:14:30Z [ERROR] Failed to retrieve interfaces data from instance metadata RequestError: send request failed
caused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-10-31T10:14:37Z [ERROR] ip pool reconcile: Failed to get attached eni infoget attached enis: failed to retrieve interfaces data: RequestError: send request failed
caused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-10-31T10:14:47Z [DEBUG] IP pool stats: total = 15, used = 12, c.currentMaxAddrsPerENI = 5, c.maxAddrsPerENI = 5
2018-10-31T10:14:49Z [DEBUG] Start increasing IP Pool size
2018-10-31T10:14:51Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3
2018-10-31T10:14:52Z [DEBUG] Reconciling ENI/IP pool info...
2018-10-31T10:14:59Z [INFO]  Pods deleted on my node: t4
2018-10-31T10:15:43Z [DEBUG] Total number of interfaces found: 3 
2018-10-31T10:15:44Z [DEBUG] Found eni mac address : 02:63:a9:60:fc:42

What does it replace exactly?

amazon-eks-ami/files/bootstrap.sh

Line 98 in eb0239f

sed -i s,CLUSTER_NAME,$CLUSTER_NAME,g /var/lib/kubelet/kubeconfig
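To see what that line does, here is a small sketch that runs the same substitution against a throwaway file instead of /var/lib/kubelet/kubeconfig. The file contents and the cluster name demo-cluster are made up for illustration; only the sed expression comes from the script. It assumes GNU sed, where -i takes no suffix argument.

```shell
# Hypothetical kubeconfig fragment containing the CLUSTER_NAME placeholder.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
users:
- name: kubelet
  user:
    exec:
      args:
        - "token"
        - "-i"
        - "CLUSTER_NAME"
EOF

CLUSTER_NAME=demo-cluster   # assumed value for this demo

# Same expression as in bootstrap.sh, quoted and pointed at the temp file.
# The comma is the sed delimiter, so every literal "CLUSTER_NAME" in the
# file is replaced in place by the value of $CLUSTER_NAME.
sed -i "s,CLUSTER_NAME,$CLUSTER_NAME,g" "$tmpfile"

result=$(cat "$tmpfile")
rm -f "$tmpfile"
printf '%s\n' "$result"
```

In other words, the placeholder string in the file is swapped for the real cluster name. Because the comma is the delimiter, a cluster name containing a comma would break the expression.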

Support --node-labels in kubelet.service with EC2 instance TAGs

I would like to use the parameter "--node-labels" in NodeLaunchConfig's UserData in the kubelet.service file with my EC2 instance TAGs to allow me to reserve some resources in the cluster.

https://amazon-eks.s3-us-west-2.amazonaws.com/1.10.3/2018-07-26/amazon-eks-nodegroup.yaml

Update Makefile default AMI to LTS AL2 AMI

Today, the Makefile references the AL2 Candidate AMI as the default. We should update that to reference the latest LTS AL2 release.

why no kubeadm command in shell

Could you please add the kubeadm command to the shell?

Question about MaxPods limits

Why does r4.16xlarge instance have a higher MaxPods limit than the x1.32xlarge instance?

Disk Corruption

Not sure if this is the proper place to open an issue as I am unclear where the issue is actually happening, but we are experiencing disk corruption on the worker nodes. We have tried v20, v23, and v24 and the root volumes showed corruption in as little as a couple hours and up to 4 days.

$ df -hT /dev/xvda1
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/xvda1     xfs    20G  -64T   65T    - /

What ends up happening is that we start to see these errors in the logs:

kubelet: W0919 14:54:08.962476    5823 image_gc_manager.go:285] available 70376773459968 is larger than capacity 21462233088

Then eventually the node runs out of disk space, fails instance status checks, and it gets replaced by the auto scaling group.

Any help would be appreciated.

Thanks.

/etc/logrotate.d/kube-proxy has incorrect ownership

/etc/logrotate.d/kube-proxy is owned by ec2-user:ec2-user, so logrotate ignores it.

/usr/sbin/logrotate -d /etc/logrotate.conf
reading config file /etc/logrotate.conf
[...]
Ignoring kube-proxy because the file owner is wrong (should be root).
[...]

Option Max pods doesn't take into account the actual number of interfaces

Problem:

The bootstrap script lets you specify the "use-max-pods" option.
These values are the maximum if and only if the instance has the maximum number of ENIs attached, which is not what the getting started tutorial sets up (a single interface).
This creates a bad user experience: pod scheduling errors when no more IPs are available on the actual ENI.

Version:

EKS K8S: 1.10
AMI v14

Reproduce:

Create an EKS cluster with one t3.micro worker node deployed with an Auto Scaling group.
Schedule a single-container pod (without any resource requests) with 4 replicas.
2 of them can't start, and you'll get an error on IP attachment.

Solution:
Different approaches:

Let the user set --use-max-pods 'false' and add another option: --use-max-pods-value (name can be better :) )
Automatically detect the number of interfaces and adapt the max pod accordingly
Add an option to the bootstrap script to specify the number of interfaces and adapt the max pod accordingly
(out of scope of this project) Enable Auto Scaling groups to manage multiple ENIs (currently, a workaround exists with a Lambda function)
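As a rough sketch of the "adapt the max pod accordingly" approaches, the values in eni-max-pods.txt follow the rule: number of ENIs * (IPv4 addresses per ENI - 1) + 2. The function name max_pods is hypothetical, and how the script would learn the actual ENI count is left open here.

```shell
# max_pods ENI_COUNT IPS_PER_ENI
# Prints ENI_COUNT * (IPS_PER_ENI - 1) + 2.
# The first IP on each ENI is not used for pods, and 2 is added for
# host-networking pods.
max_pods() {
  echo $(( $1 * ($2 - 1) + 2 ))
}

max_pods 2 2   # t3.micro with both ENIs attached -> 4
max_pods 1 2   # same instance with a single ENI  -> 3
```

With a single interface the usable limit drops below the value the bootstrap script passes to kubelet, which matches the scheduling errors described above.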

Tag repo on release

Could you please tag this repository on releases of new AMI versions?
I'm downloading content from the files directory to build my own images and would like to have repeatable builds.

Thanks.

[discussion] Default root size - 20GB

I have run an EKS cluster for one week and already hit a "no space" issue on the EKS workers.

I found in the code here that the default root size is 20GB.

https://github.com/awslabs/amazon-eks-ami/blob/master/eks-worker-al2.json#L22

      "volume_size": 20,

Should we increase it to 50GB or more?

I can update my own code to start with more space, but 20GB is too small, so I recommend changing it in this AMI directly.

Problem with image copy in `us-east-1` region

I am trying to encrypt the root volume for my production usage of EKS worker nodes.

But I have trouble with the AMI image-copy operation. I have filed two support cases, but they do not seem to be going as expected.

I suspect it is related to the build process, as the region specified there is us-west-2.

Validate CNI+Plugins downloads

It's probably best to validate the CNI and CNI Plugins downloads since they offer up the files anyway. This is done with the S3 downloads, might as well add the github ones.

Eg:

sudo sha512sum -c cni-amd64-${CNI_VERSION}.tgz.sha512
sudo sha512sum -c cni-plugins-amd64-${CNI_PLUGIN_VERSION}.tgz.sha512

Will submit a PR shortly.
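A minimal sketch of how such a check could be wrapped so that the build fails on a missing or mismatching checksum. The function name verify_sha512 is hypothetical, and it assumes the .sha512 file sits next to the download and uses the "HASH  FILENAME" format that sha512sum itself writes.

```shell
# verify_sha512 FILE
# Returns non-zero if FILE.sha512 is missing or the hash does not match.
verify_sha512() {
  file=$1
  if [ ! -f "$file.sha512" ]; then
    echo "missing checksum file: $file.sha512" >&2
    return 1
  fi
  # Run in the file's directory so the relative name inside the checksum
  # file resolves correctly.
  ( cd "$(dirname "$file")" && sha512sum -c "$(basename "$file").sha512" >/dev/null )
}

# Example use in an install script (file name taken from the snippet above):
# verify_sha512 "cni-amd64-${CNI_VERSION}.tgz" || exit 1
```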

Docker install fails on default packer script

The Packer script fails because the way to install Docker on Amazon Linux 2 has changed.

If you run the script as it is now, you get the following error. I have opened PR #2 to fix this, which also introduces nfs utils; if you want, I can remove that PR and create a new one with only the change to the installer.

    amazon-ebs: Complete!
    amazon-ebs: Loaded plugins: priorities, update-motd
    amazon-ebs: No package docker available.
    amazon-ebs: Error: Nothing to do
==> amazon-ebs: Terminating the source AWS instance...
==> amazon-ebs: Cleaning up any extra volumes...
==> amazon-ebs: No volumes to clean up, skipping
==> amazon-ebs: Deleting temporary security group...
==> amazon-ebs: Deleting temporary keypair...
Build 'amazon-ebs' errored: Script exited with non-zero exit status: 1

==> Some builds didn't complete successfully and had errors:
--> amazon-ebs: Script exited with non-zero exit status: 1

==> Builds finished but no artifacts were created.
make: *** [ami] Error 1

/etc/eks/bootstrap.sh fails ungracefully if instance type is unknown

The code

MAX_PODS=$(grep $INSTANCE_TYPE $MAX_PODS_FILE | awk '{print $2}')

will fail if /etc/eks/eni-max-pods.txt does not list the instance type, and exit immediately. This means the kubelet service is never configured properly and the instance is left in a half configured state.

While there is already an issue open to support additional instance types, it would be nice if /etc/eks/bootstrap.sh could at least warn users that an unknown type was used and that the worker is left in an inconsistent state. Right now the only way to determine this is to run through the script line by line to try and work out what is going on.
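One way the lookup could warn instead of failing silently is sketched below. The function name lookup_max_pods is hypothetical; the file format (instance type in the first column, limit in the second) is taken from the awk expression in the line quoted above.

```shell
# lookup_max_pods INSTANCE_TYPE MAX_PODS_FILE
# Prints the max-pods value for INSTANCE_TYPE, or warns on stderr and
# returns 1 when the type is not listed.
lookup_max_pods() {
  instance_type=$1
  max_pods_file=$2
  # Compare the first column exactly instead of using a regex match, so a
  # type name cannot accidentally match a different line.
  value=$(awk -v t="$instance_type" '$1 == t { print $2; exit }' "$max_pods_file")
  if [ -z "$value" ]; then
    echo "WARNING: instance type '$instance_type' not found in $max_pods_file" >&2
    return 1
  fi
  echo "$value"
}

# Example use in bootstrap.sh (what to do on failure is left to the caller):
# MAX_PODS=$(lookup_max_pods "$INSTANCE_TYPE" "$MAX_PODS_FILE") || exit 1
```

The caller can then decide whether to abort with a clear message or continue without a max-pods setting, rather than leaving the node half configured.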

Automatically create node labels.

Autoscaling is a problem when using node selector, affinity or anti-affinity.
Scaling a group up would not make more pods available, and replacing unhealthy instances could leave a pod with no valid node.
A feature allowing node labels to be created when new instances were created by autoscaling would be ideal.
One possibility is to use the existing tags resource, used in a similar way, and just duplicate what's there as labels for the node inside Kubernetes. Autoscaling group tags would appear as labels on newly registered nodes.
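A rough sketch of the tags-to-labels idea, assuming the tags arrive as tab-separated key/value lines. The function name tags_to_node_labels is hypothetical, and the sketch does not check that keys and values are valid Kubernetes label syntax.

```shell
# tags_to_node_labels
# Reads "KEY<TAB>VALUE" lines on stdin and prints "KEY=VALUE,KEY=VALUE",
# the format accepted by kubelet's --node-labels flag.
# Tags whose key starts with "aws:" are skipped because that prefix is
# reserved by AWS (for example aws:autoscaling:groupName).
tags_to_node_labels() {
  awk -F '\t' '
    $1 ~ /^aws:/ { next }
    NF >= 2 { printf "%s%s=%s", sep, $1, $2; sep = "," }
    END { if (sep != "") printf "\n" }
  '
}

# Possible source of the input, assuming $INSTANCE_ID is already known and
# the instance role is allowed to call ec2:DescribeTags:
# aws ec2 describe-tags --filters "Name=resource-id,Values=$INSTANCE_ID" \
#   --query 'Tags[].[Key,Value]' --output text | tags_to_node_labels
```

The resulting string could then be passed to kubelet as --node-labels from the instance user data.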

Kubelet cluster-dns parameter is set incorrectly

The IP address of the kube-dns service is picked according to the IP address of the node: https://github.com/awslabs/amazon-eks-ami/blob/v24/files/bootstrap.sh#L105

Normally this works fine, but on one of my clusters, the kube-dns was running on 172.20.0.10, and the kubelets were configured to point at 10.100.0.10. I'm not sure how kube-dns decides which IP address to use, but however that works, the kubelet should use the exact same logic to pick IP address to use for cluster-dns.
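To illustrate how a node-IP-based rule can disagree with the real service address, here is a sketch. The prefix rule shown is an assumption made for illustration, not a quote of the linked line, and the function name pick_cluster_dns is hypothetical.

```shell
# pick_cluster_dns NODE_IP
# Assumed rule: nodes with an address in 10.0.0.0/8 get 172.20.0.10,
# every other node gets 10.100.0.10.
pick_cluster_dns() {
  case $1 in
    10.*) echo 172.20.0.10 ;;
    *)    echo 10.100.0.10 ;;
  esac
}

pick_cluster_dns 10.0.1.72      # -> 172.20.0.10
pick_cluster_dns 192.168.4.20   # -> 10.100.0.10
```

Under a rule like this, a node outside 10.0.0.0/8 is always pointed at 10.100.0.10, even when the cluster's kube-dns service actually lives at 172.20.0.10. Reading the service's cluster IP from the API would avoid the guess.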

What is the AWS command that we can use to add a Kubernetes pod's IP address to an NLB, and how do we remove the EC2 instance's IP from the existing NLB?

AMI Generated by packer, kubelet report incorrect version

When using an AMI created with Packer, kubelet reports the following version:

[root@ip-172-31-5-73 ~]# kubelet --version
Kubernetes v0.0.0-master+$Format:%h$

It seems to be a cosmetic issue, but would be nice to have the right version on the binary.

amazon-eks-node-v24 Changelog?

Now that the latest EKS-optimized AMI is v24, can we update the changelog?

https://github.com/awslabs/amazon-eks-ami/blob/master/CHANGELOG.md

thanks

Encrypted EBS Option

Hi,

Does AWS have an encrypted Amazon EKS-optimized AMI, or do I have to create one? I searched the Marketplace and could not find one (region us-east-1). EKS is now HIPAA eligible, but EKS is not described in the AWS HIPAA white paper.

Many Thanks.

Indicate instance ID & Lifecycle

Can the next release add additional node labels by default?
This would remove the need for custom user data.

Instance ID
Lifecycle (spot/ondemand/scheduled)