
kinvolk-archives / lokomotive-kubernetes


Lokomotive is a 100% open-source Kubernetes distribution from the folks at Kinvolk

Home Page: https://kinvolk.io

License: MIT License

Languages: HCL 97.13%, Shell 1.76%, Smarty 0.90%, Makefile 0.21%

lokomotive-kubernetes's Introduction

IMPORTANT: Further development on Lokomotive is happening at https://github.com/kinvolk/lokomotive. The new repo brings together several repositories, adding a command-line tool, Lokomotive components, and improved documentation.

Lokomotive

Lokomotive is an open-source project by Kinvolk that distributes pure upstream Kubernetes.

Features

Modules

Lokomotive provides a Terraform module for each supported platform and operating system. Flatcar Container Linux is a mature and reliable choice.

Platform     Operating System          Terraform Module                      Status
AWS          Flatcar Container Linux   aws/flatcar-linux/kubernetes          stable
Azure        Flatcar Container Linux   azure/flatcar-linux/kubernetes        alpha
Bare-Metal   Flatcar Container Linux   bare-metal/flatcar-linux/kubernetes   stable
Packet       Flatcar Container Linux   packet/flatcar-linux/kubernetes       beta
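
Each module is referenced by the path shown in the table. For example, a Packet cluster would point its source at the packet/flatcar-linux/kubernetes module (the module name below is illustrative; pin ref to whatever release or branch you use):

module "packet-cluster" {
  source = "git::https://github.com/kinvolk/lokomotive-kubernetes//packet/flatcar-linux/kubernetes?ref=master"

  # ... platform-specific variables, see the AWS example below
}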

Documentation

Usage

Define a Kubernetes cluster by using the Terraform module for your chosen platform and operating system. Here's a minimal example.

module "aws-tempest" {
  source = "git::https://github.com/kinvolk/lokomotive-kubernetes//aws/flatcar-linux/kubernetes?ref=master"

  # AWS
  cluster_name = "yavin"
  dns_zone     = "example.com"
  dns_zone_id  = "Z3PAABBCFAKEC0"

  # configuration
  ssh_keys = [
    "ssh-rsa AAAAB3Nz...",
    "ssh-rsa AAAAB3Nz...",
  ]

  asset_dir          = "/home/user/.secrets/clusters/yavin"

  # optional
  worker_count = 2
  worker_type  = "t3.small"
}

Initialize modules, plan the changes to be made, and apply the changes.

$ terraform init
$ terraform plan
Plan: 64 to add, 0 to change, 0 to destroy.
$ terraform apply
Apply complete! Resources: 64 added, 0 changed, 0 destroyed.

In 4 to 8 minutes (varies by platform), the cluster will be ready. This AWS example creates a yavin.example.com DNS record that resolves to a network load balancer backed by the controller instances.

$ export KUBECONFIG=/home/user/.secrets/clusters/yavin/auth/kubeconfig
$ kubectl get nodes
NAME                                       ROLES              STATUS  AGE  VERSION
yavin-controller-0.c.example-com.internal  controller,master  Ready   6m   v1.14.1
yavin-worker-jrbf.c.example-com.internal   node               Ready   5m   v1.14.1
yavin-worker-mzdm.c.example-com.internal   node               Ready   5m   v1.14.1

List the pods.

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                      READY  STATUS    RESTARTS  AGE
kube-system   calico-node-1cs8z                         2/2    Running   0         6m
kube-system   calico-node-d1l5b                         2/2    Running   0         6m
kube-system   calico-node-sp9ps                         2/2    Running   0         6m
kube-system   coredns-1187388186-dkh3o                  1/1    Running   0         6m
kube-system   kube-apiserver-zppls                      1/1    Running   0         6m
kube-system   kube-controller-manager-3271970485-gh9kt  1/1    Running   0         6m
kube-system   kube-controller-manager-3271970485-h90v8  1/1    Running   1         6m
kube-system   kube-proxy-117v6                          1/1    Running   0         6m
kube-system   kube-proxy-9886n                          1/1    Running   0         6m
kube-system   kube-proxy-njn47                          1/1    Running   0         6m
kube-system   kube-scheduler-3895335239-5x87r           1/1    Running   0         6m
kube-system   kube-scheduler-3895335239-bzrrt           1/1    Running   1         6m
kube-system   pod-checkpointer-l6lrt                    1/1    Running   0         6m
kube-system   pod-checkpointer-l6lrt-controller-0       1/1    Running   0         6m

Try Flatcar Container Linux Edge

Flatcar Container Linux Edge is a Flatcar Container Linux channel that includes experimental bleeding-edge features.

To try it, add the following configuration option to the example above.

os_image = "flatcar-edge"

Help

Ask questions on the IRC #lokomotive-k8s channel on freenode.net.


lokomotive-kubernetes's Issues

packet: Consider removing tf variable setup_raid_*

Right now there are several variables: setup_raid_ssd_fs, setup_raid_ssd, setup_raid_hdd and setup_raid.

The code is messy and has already caused several regressions (#72 and 85ef5fa, to name two; I'm sure there are more).

We should reconsider why these variables were added and see whether we can remove some of them to simplify things. They also currently can't be used at the same time, as was expected, because of #73.

The goal of this issue is to double-check whether we actually need these variables and to remove all the unneeded code.
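
For context, a Packet worker pool today sets these flags roughly as in the sketch below (a minimal sketch: the module path matches the one used in the docs, but the pool name, values and inline comments are illustrative assumptions):

module "worker-pool-storage" {
  source = "git::https://github.com/kinvolk/lokomotive-kubernetes//packet/flatcar-linux/kubernetes/workers?ref=master"

  # ... pool name, count, machine type, etc.

  # RAID flags under discussion (semantics assumed for illustration):
  setup_raid        = false   # RAID 0 over all spare disks
  setup_raid_hdd    = true    # RAID 0 over spinning disks only
  setup_raid_ssd    = false   # RAID 0 over SSDs only
  setup_raid_ssd_fs = false   # also create a filesystem on the SSD array
}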

Conformance tests are failing on 1.17

It seems that conformance tests are failing, even with aggregation enabled (#76).

Plugin: e2e
Status: failed
Total: 4814
Passed: 278
Failed: 2
Skipped: 4534

Failed tests:
[sig-network] Services should be able to create a functioning NodePort service [Conformance]
[sig-network] Services should be able to change the type from ExternalName to NodePort [Conformance]

Plugin: systemd-logs
Status: passed
Total: 3
Passed: 3
Failed: 0
Skipped: 0
Sonobuoy Version: v0.17.1
MinimumKubeVersion: 1.15.0
MaximumKubeVersion: 1.17.99
GitSHA: efb9d68ffe37b4eb8e5465252696a7ad523a5ef4
API Version:  v1.17.0

Tested on 6f1576f on AWS.

Configuration:

module "aws-test-mateusz" {
  source = "../lokomotive-kubernetes/aws/flatcar-linux/kubernetes"

  providers = {
    aws      = "aws.default"
    local    = "local.default"
    null     = "null.default"
    template = "template.default"
    tls      = "tls.default"
  }

  cluster_name = "test-mateusz"
  dns_zone     = "<removed>"
  dns_zone_id  = "<removed>"

  ssh_keys  = ["<removed>"]
  asset_dir = "../cluster-assets"

  controller_count = "1"
  controller_type  = "t3.small"

  worker_count = "2"
  worker_type  = "t3.small"
  networking = "calico"
  network_mtu = ""
  enable_reporting = "false"

  os_image = "flatcar-stable"

  controller_clc_snippets = []
  worker_clc_snippets     = []

  enable_aggregation = "true"
}

bootkube-start: Upload failed

When creating a cluster on AWS I encountered this error:

module.aws-….aws_route53_record.apiserver: Creation complete after 44s (ID: …-…k8s.net._A)
module.aws-….null_resource.bootkube-start: Creating...
module.aws-….null_resource.bootkube-start: Provisioning with 'file'...
module.aws-….null_resource.bootkube-start: Still creating... (10s elapsed)
Error: Error applying plan:
1 error occurred:	
	* module.aws-….null_resource.bootkube-start: Upload failed: Process exited with status 1
Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

Linkerd installation broken - no apiserver client ca file

Good news and bad news: I believe one of your recent commits broke Linkerd (and probably similarly architected applications).

The good news is that I think I've narrowed it down to a very likely candidate:
3283579

And I also think I know what's missing:
kubectl -n kube-system get cm/extension-apiserver-authentication -ojsonpath='{.data.requestheader-client-ca-file}'

And here's a hint as to where or what needs to be fixed/added:
github.com/kubernetes-sigs/bootkube/issues/994

Hope this helps

Steps to reproduce:
start a fresh cluster

curl -sL https://run.linkerd.io/install | sh
export PATH=$PATH:$HOME/.linkerd2/bin
linkerd install | kubectl apply -f -
# Kill it after a couple of minutes, it'll never boot. Check
kubectl get pods -n linkerd
# and you'll see linkerd-tap crashing

packet: Consider using afterburn to simplify setup_raid_* flags

Right now we are using a bash script in the ignition configuration to create RAID arrays on Packet. This script is getting messier and doesn't feel like a clean way to implement this: https://github.com/kinvolk/lokomotive-kubernetes/blob/00e6d550c369035b4f0b57c7ab46d01875207c9d/packet/flatcar-linux/kubernetes/workers/cl/worker.yaml.tmpl#L128-L223

It's ~100 LOC and has already caused several bugs (just see the history of that file); before it gets out of control we should consider options to change this.

On other platforms we can probably just use ignition, as the disk layout is stable. On Packet, however, this is not the case: /dev/sda might be the smallest of 4 disks and rotational on one server, while on another identical server /dev/sda might be the biggest disk and even an SSD. Therefore, we couldn't use ignition, as it lacks the expressiveness to select disks by those criteria (smallest, SSD, HDD).

However, I think we might be able to simplify all of this and use ignition if we add more features to afterburn for the Packet provider.

I'm not familiar with afterburn, but it seems like a nice way to export metadata, and maybe (not sure it really makes sense) we can consider exporting some other variables on Packet, like:

  • AFTERBURN_PACKET_DISKS_SMALLEST: the smallest disk on that instance, detected by afterburn, or all the smallest disks so that the user can select one of them

  • AFTERBURN_PACKET_DISKS_SSD: the list of SSD disks in the server that are not in use (i.e. excluding the OS disk and disks used for something else), for example "/dev/sda /dev/sdd /dev/sdb"

  • AFTERBURN_PACKET_DISKS_HDD: the list of HDD disks in the server that are not in use (i.e. excluding the OS disk and disks used for something else), for example "/dev/sda /dev/sdd /dev/sdb"

If this is possible and these variables can be used in ignition to format the disks, we should be able to dramatically simplify the setup_raid_* flags we currently have.

Please note that this will not solve #73; it will just remove this increasingly complicated bash script before it gets out of control, probably reduce quite a few bugs there and, hopefully, implement this in a clean way.

This could also allow removing the experimental -s functionality from flatcar-install (added in flatcar/init@1a24532), as flatcar-install could then (if afterburn is enabled) be called with something like:

flatcar-install -d ${AFTERBURN_PACKET_DISKS_SMALLEST}

And, therefore, the -s functionality can be removed for Packet (the only platform where we are using it right now).

Use of Cluster API

Will you use the Cluster API for lifecycle management in an upcoming version?

Adding more control plane nodes fails on Packet

With the following error:

module.dns.aws_route53_record.dns-records[2]: Creation complete after 41s [id=Z1QEP94KB3YNR3_mateusz.example.com._A]
module.dns.aws_route53_record.dns-records[1]: Creation complete after 41s [id=Z1QEP94KB3YNR3_mateusz-etcd1.example.com._A]

Error: [ERR]: Error building changeset: InvalidChangeBatch: [Tried to create resource record set [name='mateusz-private.dev.example.com.', type='A'] but it already exists]
	status code: 400, request id: fee05404-59ba-485b-8d0d-bb755ec683de

  on .terraform/modules/dns/main.tf line 17, in resource "aws_route53_record" "dns-records":
  17: resource "aws_route53_record" "dns-records" {



CC @mauriciovasquezbernal

packet: Mounts inside /mnt/ can't be used by pods

Node-local storage was working when it was mounted on /mnt. However, we later changed that to support mounting HDD and SSD disks on /mnt/node-local-ssd-storage and /mnt/node-local-hdd-storage.

However, when doing so (#72), the mounts were not exposed to the pods, so the pods were just using the installation OS disk. The PR that fixed the issue has more details: #72

The quick fix was to mount on /mnt again, but we should investigate why mounts below it (e.g. /mnt/node-local-storage) are not exposed to the pods and can't be used by them.

The goal is to either understand why the mounts don't propagate and confirm that this can't be changed, or to make the changes needed for the mounts to be propagated.

However, if we find out the mounts can't be propagated, we can consider a potentially very nasty approach (only if we really have to; we should also look at #74 before doing this), which is the following:

We can consider exposing in the rkt container, as is now done with /mnt/, the directories /mnt/node-local-storage, /mnt/node-ssd-local-storage/, /mnt/node-hdd-local-storage, etc.

However, rkt can only use directories that exist as volumes[1], so we should either select them in the template based on the different setup_* flags (very messy), or always create all the directories and have rkt expose them all (more elegant, as that can all be part of the systemd unit, like the ExecStartPre=/bin/mkdir .. we are already doing).

[1]: I tried replacing /mnt with /mnt/node-local-storage in the kubelet systemd unit while that directory didn't exist, and the systemd unit failed to start with: Sep 27 11:03:25 rata-test-backend-worker-0 kubelet-wrapper[5459]: run: stat of host path /mnt/node-local-storage: stat /mnt/node-local-storage: no such file or directory

Packet "phone home" is failing

The packet-phone-home.service systemd unit is currently failing. This causes cluster bootstrap failures in some cases.

● packet-phone-home.service - Report Success to Packet
   Loaded: loaded (/etc/systemd/system/packet-phone-home.service; enabled; vendor preset: enabled)
   Active: inactive (dead) (Result: resources) since Thu 2019-06-20 19:08:51 UTC; 5min ago

Jun 20 19:08:51 johannes-test-controller-0 systemd[1]: packet-phone-home.service: Service RestartSec=2s expired, scheduling restart.
Jun 20 19:08:51 johannes-test-controller-0 systemd[1]: packet-phone-home.service: Scheduled restart job, restart counter is at 7.
Jun 20 19:08:51 johannes-test-controller-0 systemd[1]: Stopped Report Success to Packet.
Jun 20 19:08:51 johannes-test-controller-0 systemd[1]: Dependency failed for Report Success to Packet.
Jun 20 19:08:51 johannes-test-controller-0 systemd[1]: packet-phone-home.service: Job packet-phone-home.service/start failed with result 'dependency'.
core@johannes-test-controller-0 ~ $ cat /etc/systemd/system/packet-phone-home.service
[Unit]
Description=Report Success to Packet
ConditionFirstBoot=true
Requires=coreos-metadata.service
After=coreos-metadata.service

[Service]
EnvironmentFile=/run/metadata/flatcar
ExecStart=/usr/bin/curl --header "Content-Type: application/json" --request POST "${COREOS_PACKET_PHONE_HOME_URL}"
Restart=on-failure
RestartSec=2

[Install]
WantedBy=multi-user.target
core@johannes-test-controller-0 ~ $ systemctl status coreos-metadata.service | cat
● coreos-metadata.service - CoreOS Metadata Agent
   Loaded: loaded (/etc/systemd/system/coreos-metadata.service; enabled; vendor preset: enabled)
   Active: failed (Result: start-limit-hit) since Thu 2019-06-20 19:08:49 UTC; 10min ago
  Process: 1329 ExecStart=/usr/bin/coreos-metadata ${COREOS_METADATA_OPT_PROVIDER} --attributes=/run/metadata/coreos (code=exited, status=0/SUCCESS)
 Main PID: 1329 (code=exited, status=0/SUCCESS)

Jun 20 19:08:49 johannes-test-controller-0 systemd[1]: Starting CoreOS Metadata Agent...
Jun 20 19:08:49 johannes-test-controller-0 coreos-metadata[1329]: Jun 20 19:08:49.612 INFO Fetching http://metadata.packet.net/metadata: Attempt #1
Jun 20 19:08:49 johannes-test-controller-0 coreos-metadata[1329]: Jun 20 19:08:49.617 INFO Fetch successful
Jun 20 19:08:49 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Succeeded.
Jun 20 19:08:49 johannes-test-controller-0 systemd[1]: Started CoreOS Metadata Agent.
Jun 20 19:08:51 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Start request repeated too quickly.
Jun 20 19:08:51 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Failed with result 'start-limit-hit'.
Jun 20 19:08:51 johannes-test-controller-0 systemd[1]: Failed to start CoreOS Metadata Agent.
core@johannes-test-controller-0 ~ $ journalctl -u coreos-metadata.service | cat
-- Logs begin at Thu 2019-06-20 19:08:01 UTC, end at Thu 2019-06-20 19:21:24 UTC. --
Jun 20 19:08:29 johannes-test-controller-0 systemd[1]: Starting CoreOS Metadata Agent...
Jun 20 19:08:30 johannes-test-controller-0 coreos-metadata[921]: Jun 20 19:08:30.037 INFO Fetching http://metadata.packet.net/metadata: Attempt #1
Jun 20 19:08:30 johannes-test-controller-0 coreos-metadata[921]: Jun 20 19:08:30.050 INFO Fetch successful
Jun 20 19:08:30 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Succeeded.
Jun 20 19:08:30 johannes-test-controller-0 systemd[1]: Started CoreOS Metadata Agent.
Jun 20 19:08:32 johannes-test-controller-0 systemd[1]: Starting CoreOS Metadata Agent...
Jun 20 19:08:32 johannes-test-controller-0 coreos-metadata[1067]: Jun 20 19:08:32.266 INFO Fetching http://metadata.packet.net/metadata: Attempt #1
Jun 20 19:08:32 johannes-test-controller-0 coreos-metadata[1067]: Jun 20 19:08:32.267 INFO Failed to fetch: http://metadata.packet.net/metadata: failed to lookup address information: Name or service not known
Jun 20 19:08:33 johannes-test-controller-0 coreos-metadata[1067]: Jun 20 19:08:33.268 INFO Fetching http://metadata.packet.net/metadata: Attempt #2
Jun 20 19:08:33 johannes-test-controller-0 coreos-metadata[1067]: Jun 20 19:08:33.277 INFO Fetch successful
Jun 20 19:08:33 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Succeeded.
Jun 20 19:08:33 johannes-test-controller-0 systemd[1]: Started CoreOS Metadata Agent.
Jun 20 19:08:35 johannes-test-controller-0 systemd[1]: Starting CoreOS Metadata Agent...
Jun 20 19:08:35 johannes-test-controller-0 coreos-metadata[1076]: Jun 20 19:08:35.583 INFO Fetching http://metadata.packet.net/metadata: Attempt #1
Jun 20 19:08:40 johannes-test-controller-0 coreos-metadata[1076]: Jun 20 19:08:40.595 INFO Fetch successful
Jun 20 19:08:40 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Succeeded.
Jun 20 19:08:40 johannes-test-controller-0 systemd[1]: Started CoreOS Metadata Agent.
Jun 20 19:08:42 johannes-test-controller-0 systemd[1]: Starting CoreOS Metadata Agent...
Jun 20 19:08:42 johannes-test-controller-0 coreos-metadata[1105]: Jun 20 19:08:42.364 INFO Fetching http://metadata.packet.net/metadata: Attempt #1
Jun 20 19:08:42 johannes-test-controller-0 coreos-metadata[1105]: Jun 20 19:08:42.370 INFO Fetch successful
Jun 20 19:08:42 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Succeeded.
Jun 20 19:08:42 johannes-test-controller-0 systemd[1]: Started CoreOS Metadata Agent.
Jun 20 19:08:42 johannes-test-controller-0 systemd[1]: Starting CoreOS Metadata Agent...
Jun 20 19:08:42 johannes-test-controller-0 coreos-metadata[1157]: Jun 20 19:08:42.863 INFO Fetching http://metadata.packet.net/metadata: Attempt #1
Jun 20 19:08:42 johannes-test-controller-0 coreos-metadata[1157]: Jun 20 19:08:42.869 INFO Fetch successful
Jun 20 19:08:42 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Succeeded.
Jun 20 19:08:42 johannes-test-controller-0 systemd[1]: Started CoreOS Metadata Agent.
Jun 20 19:08:45 johannes-test-controller-0 systemd[1]: Starting CoreOS Metadata Agent...
Jun 20 19:08:45 johannes-test-controller-0 coreos-metadata[1302]: Jun 20 19:08:45.112 INFO Fetching http://metadata.packet.net/metadata: Attempt #1
Jun 20 19:08:45 johannes-test-controller-0 coreos-metadata[1302]: Jun 20 19:08:45.118 INFO Fetch successful
Jun 20 19:08:45 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Succeeded.
Jun 20 19:08:45 johannes-test-controller-0 systemd[1]: Started CoreOS Metadata Agent.
Jun 20 19:08:47 johannes-test-controller-0 systemd[1]: Starting CoreOS Metadata Agent...
Jun 20 19:08:47 johannes-test-controller-0 coreos-metadata[1315]: Jun 20 19:08:47.362 INFO Fetching http://metadata.packet.net/metadata: Attempt #1
Jun 20 19:08:47 johannes-test-controller-0 coreos-metadata[1315]: Jun 20 19:08:47.369 INFO Fetch successful
Jun 20 19:08:47 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Succeeded.
Jun 20 19:08:47 johannes-test-controller-0 systemd[1]: Started CoreOS Metadata Agent.
Jun 20 19:08:49 johannes-test-controller-0 systemd[1]: Starting CoreOS Metadata Agent...
Jun 20 19:08:49 johannes-test-controller-0 coreos-metadata[1329]: Jun 20 19:08:49.612 INFO Fetching http://metadata.packet.net/metadata: Attempt #1
Jun 20 19:08:49 johannes-test-controller-0 coreos-metadata[1329]: Jun 20 19:08:49.617 INFO Fetch successful
Jun 20 19:08:49 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Succeeded.
Jun 20 19:08:49 johannes-test-controller-0 systemd[1]: Started CoreOS Metadata Agent.
Jun 20 19:08:51 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Start request repeated too quickly.
Jun 20 19:08:51 johannes-test-controller-0 systemd[1]: coreos-metadata.service: Failed with result 'start-limit-hit'.
Jun 20 19:08:51 johannes-test-controller-0 systemd[1]: Failed to start CoreOS Metadata Agent.

Packet: Broken AWS provider due to version mismatch

When deploying a Packet cluster by following the docs, terraform init is currently failing with the following error:

Initializing modules...
- module.controller
  Getting source "git::https://github.com/kinvolk/lokomotive-kubernetes//packet/flatcar-linux/kubernetes?ref=master"
- module.worker-pool-1
  Getting source "git::https://github.com/kinvolk/lokomotive-kubernetes//packet/flatcar-linux/kubernetes/workers?ref=master"
- module.controller.bootkube
  Getting source "github.com/kinvolk/terraform-render-bootkube?ref=d07243a9e7f6084cfe08b708731a79c26146badb"

Initializing provider plugins...
- Checking for available provider plugins on https://releases.hashicorp.com...
- Downloading plugin for provider "template" (1.0.0)...
- Downloading plugin for provider "tls" (1.2.0)...
- Downloading plugin for provider "packet" (1.7.2)...

No provider "aws" plugins meet the constraint "~> 1.57,~> 2.8.0".

The version constraint is derived from the "version" argument within the
provider "aws" block in configuration. Child modules may also apply
provider version constraints. To view the provider versions requested by each
module in the current configuration, run "terraform providers".

To proceed, the version constraints for this provider must be relaxed by
either adjusting or removing the "version" argument in the provider blocks
throughout the configuration.

This is caused by the following conflicting provider versions:

https://github.com/kinvolk/lokomotive-kubernetes/blame/fc16b6175749328bf12552e1c1595b988d554326/docs/flatcar-linux/packet.md#L58

https://github.com/kinvolk/lokomotive-kubernetes/blob/fc16b6175749328bf12552e1c1595b988d554326/packet/flatcar-linux/kubernetes/require.tf#L32
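
A likely fix is to make the aws provider version pinned in the docs satisfy the constraint in the module's require.tf. A minimal sketch of the docs-side provider block, assuming the module's ~> 2.8.0 constraint is the one to keep (the alias matches the one used in the configs above; the region is illustrative):

provider "aws" {
  version = "~> 2.8.0"      # must satisfy the constraint in packet/flatcar-linux/kubernetes/require.tf
  alias   = "default"
  region  = "eu-central-1"  # illustrative
}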

docs/conformance are out of date

Issues:

  • Terraform 0.11 syntax (see the sketch after this list)

  • Providers are aliased and passed to the module, which has been deprecated

  • The worker_nodes_hostnames parameter does not exist anymore
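
For reference, if providers are still passed to the module at all, Terraform 0.12 drops the quoted-string syntax used in 0.11. A minimal sketch (module path and alias taken from the configuration shown earlier; the rest is illustrative):

provider "aws" {
  alias  = "default"
  region = "eu-central-1"  # illustrative
}

module "aws-test" {
  source = "../lokomotive-kubernetes/aws/flatcar-linux/kubernetes"

  providers = {
    aws = aws.default  # 0.12: a provider reference, not the 0.11 string "aws.default"
  }

  # ... remaining cluster variables
}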

[Packet] Automatic node_private_cidr

Maybe more of a question than an issue, but is there any reason why the node_private_cidr value needs to be set manually?

This part (from the example config)

This is different for each project on Packet and depends on the packet facility/region.
Check yours from the IPs & Networks tab from your Packet.net account.
If an IP block is not allocated yet, try provisioning an instance from the console in
that region. Packet will allocate a public IP CIDR

Because for whatever reason (the cluster was down for two weeks whilst I continued development) mine changed without the project being deleted, and it was not fun trying to figure out why only about 20% of intra-node connectivity was working (apart from why any connectivity was working at all) :)

The potential issue is of course that it's an array, but is it even possible to have multiple private IP ranges? Could [0] be a reasonable default?

Suggested change:

# This is different for each project on Packet (...) Packet will allocate a public IP CIDR.
  node_private_cidr = data.packet_ip_block_ranges.k8s.private_ipv4[0]
}

data "packet_ip_block_ranges" "k8s" {
  project_id = local.project_id
}

Permanent Terraform diff on Azure

When deploying on Azure, terraform plan always shows a diff on the worker scale set, which causes re-creation of all the workers:

Terraform will perform the following actions:

-/+ module.johannes-test.module.workers.azurerm_monitor_autoscale_setting.workers (new resource required)
      id:                                                                                              "/subscriptions/.../resourceGroups/johannes-test/providers/microsoft.insights/autoscalesettings/johannes-test-maintain-desired" => <computed> (forces new resource)
      enabled:                                                                                         "true" => "true"
      location:                                                                                        "westeurope" => "westeurope"
      name:                                                                                            "johannes-test-maintain-desired" => "johannes-test-maintain-desired"
      profile.#:                                                                                       "1" => "1"
      profile.0.capacity.#:                                                                            "1" => "1"
      profile.0.capacity.0.default:                                                                    "2" => "2"
      profile.0.capacity.0.maximum:                                                                    "2" => "2"
      profile.0.capacity.0.minimum:                                                                    "2" => "2"
      profile.0.name:                                                                                  "default" => "default"
      resource_group_name:                                                                             "johannes-test" => "johannes-test"
      tags.%:                                                                                          "0" => <computed>
      target_resource_id:                                                                              "/subscriptions/.../resourceGroups/johannes-test/providers/Microsoft.Compute/virtualMachineScaleSets/johannes-test-workers" => "${azurerm_virtual_machine_scale_set.workers.id}" (forces new resource)

-/+ module.johannes-test.module.workers.azurerm_virtual_machine_scale_set.workers (new resource required)
      id:                                                                                              "/subscriptions/.../resourceGroups/johannes-test/providers/Microsoft.Compute/virtualMachineScaleSets/johannes-test-workers" => <computed> (forces new resource)
      automatic_os_upgrade:                                                                            "false" => "false"
      eviction_policy:                                                                                 "" => "Delete" (forces new resource)
      identity.#:                                                                                      "0" => <computed>
...

Steps to reproduce

  1. Create an Azure cluster using the docs.
  2. After terraform apply completes, run terraform plan again. A diff is shown.
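
One possible mitigation, assuming the root of the diff is the eviction_policy attribute being recomputed on the scale set (which in turn forces the autoscale setting's target_resource_id to change), would be to ignore changes to it. A hedged sketch in Terraform 0.11 syntax, not verified against this module:

resource "azurerm_virtual_machine_scale_set" "workers" {
  # ... existing arguments unchanged

  lifecycle {
    # Ignore the attribute the provider reports as changed on every plan,
    # so plan stops proposing replacement of the scale set and its autoscale setting.
    ignore_changes = ["eviction_policy"]
  }
}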

kube-hunter CI job is flaky

It sometimes doesn't finish within 7 minutes for some reason, which makes the CI job fail. We should investigate that.

Maximal Ignition config length almost reached

There are not many lines left to be added before provisioning will fail because the 16 kB limit is hit.

Files could either be copied at a later stage or the size brought down through compression.

(packet?) Can't run terraform plan/apply

On master (2512344), I can't run anything Terraform-related, really. I disabled all modules except for worker/controller; that didn't help. Not much more I can think of to help fix this right now.

The following error messages appear:

Error: could not generate output checksum: lstat .terraform/modules/controller.bootkube/resources/charts/calico: no such file or directory

Error: could not generate output checksum: lstat .terraform/modules/controller.bootkube/resources/bootstrap-manifests: no such file or directory

Error: could not generate output checksum: lstat .terraform/modules/controller.bootkube/resources/charts/kubernetes: no such file or directory

Packet: Empty CLC causes silent failure

  clc_snippets = [
    "${file("./snippets/enable-ntpd")}"
  ]

The file exists, but is empty (oops). This causes none of the worker nodes to be created.

Example git patch output:

diff --git a/main.tf b/main.tf
index b31a2aa..e392b60 100644
--- a/main.tf
+++ b/main.tf
@@ -90,6 +90,7 @@ module "worker-pool-helium" {
   type         = "t1.small.x86"

   clc_snippets = [
+    "${file("./snippets/enable-ntpd")}",
(other working snippets)}"
   ]
diff --git a/snippets/enable-ntpd b/snippets/enable-ntpd
new file mode 100644

The patch shows an empty file, so I'm quite sure this is what caused it.
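
Until the module validates its input, one defensive sketch on the caller side is to drop empty strings before passing the list (compact() is a standard Terraform function; the snippet path is the one from this report):

  clc_snippets = compact([
    "${file("./snippets/enable-ntpd")}",
  ])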

packet: Investigate slowness when deploying big CLC snippets

The size of the CLC snippets severely affects the time a node needs to deploy: for instance, a snippet of 64 kB increases the deployment time by 5 minutes, and a snippet of 128 kB can cause the deployment to time out.

Preliminary investigation indicates that the issue is on Packet's side, not in Lokomotive nor Flatcar Container Linux.

[Packet] sig-storage-local-static-provisioner recently stopped working

At some point between my last known working configuration of Lokomotive (56acc13) and current master (2512344), sig-storage-local-static-provisioner stopped working.

  • the provisioner starts correctly, finds the block devices that I created in /mnt/disks and adds them to its cache
  • the application starts correctly and binds a persistent volume claim to a persistent volume when asked to do so
  • after a minute or so, it all times out with MountVolume.NewMounter initialization failed for volume "local-pv-7f70b248" : path "/mnt/local/vol1" does not exist

SSH into the host: the disk is there, the disk is accessible, permissions are correct (tried a recursive 777 just in case).

So provisioner, block device, detection of disks, Kubernetes pod security profiles, etc.: everything seems to work and thinks it should be working, except that something (I'm not 100% sure which component; CSI? kubelet?) can't access the disk in order to actually finish the mounting 'loop'.

(Probably relevant: the cleanup script that the provisioner runs after deleting a PVC, some variation of rm -rf path_that_was_mounted, does work, although I don't know whether that's executed on the host or in a container.)

Go back to 56acc13, everything works again.

The only hint I can find is regarding containerized kubelets, but when I checked the kubelet systemd unit file, /mnt looked like it was accessible, although the kubelet running under Docker didn't (rkt vs. Docker kubelet?).

Before I dive into this further, are you aware of any changes that could have caused this?

Edit: how to recreate: bind-mount a block device into /mnt/disks, helm install the static provisioner as per the docs, and try to claim it via a deployment. Let me know if you need some example config, although it's 99% what's in the installation guide.

azure: add support for storage_profile_data_disk for workers

storage_profile_data_disk allows attaching data disks to the Azure scale set we use for workers, and I think this is a good way of providing disks for OpenEBS on Azure.

There are two things which still need to be addressed. Once a new worker is provisioned, it requires the manual action of preparing the disk, which is sudo umount /mnt/resource && sudo sgdisk --zap-all /dev/sdb.

I also think that data disks should be optional. Since storage_profile_data_disk is a block, we should wait until the migration to Terraform 0.12 is done, so we can support optional blocks with the dynamic syntax. Terraform reference issue: hashicorp/terraform#7034. A sketch of what that could look like is below.
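
Once on 0.12, the optional block could look roughly like this sketch (the worker_data_disk_size variable and the disk settings are illustrative assumptions; only the storage_profile_data_disk block name comes from this issue):

variable "worker_data_disk_size" {
  description = "Size of the optional data disk in GB; 0 disables it."
  default     = 0
}

resource "azurerm_virtual_machine_scale_set" "workers" {
  # ... existing scale set arguments

  dynamic "storage_profile_data_disk" {
    # Render the block zero or one times, making the data disk optional.
    for_each = var.worker_data_disk_size > 0 ? [var.worker_data_disk_size] : []

    content {
      lun               = 0
      caching           = "ReadWrite"
      create_option     = "Empty"
      disk_size_gb      = storage_profile_data_disk.value
      managed_disk_type = "Standard_LRS"
    }
  }
}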

Lokomotive Disk type specific RAID Setup optimizations

Make the following optimizations in the RAID setup for worker nodes on Packet:

  • Use tr instead of awk

In https://github.com/kinvolk/lokomotive-kubernetes/blob/4b5174bfba533702382a55e186cd78030b721e2d/packet/flatcar-linux/kubernetes/workers/cl/install.yaml.tmpl#L72 & https://github.com/kinvolk/lokomotive-kubernetes/blob/4b5174bfba533702382a55e186cd78030b721e2d/packet/flatcar-linux/kubernetes/workers/cl/install.yaml.tmpl#L106

use tr \\n " ".

context: #48 (comment)

  • Merge functions create_disk_specific_data_raid and create_data_raid

Try to merge these two functions without making the code unreadable.

https://github.com/kinvolk/lokomotive-kubernetes/blob/4b5174bfba533702382a55e186cd78030b721e2d/packet/flatcar-linux/kubernetes/workers/cl/install.yaml.tmpl#L63 & https://github.com/kinvolk/lokomotive-kubernetes/blob/4b5174bfba533702382a55e186cd78030b721e2d/packet/flatcar-linux/kubernetes/workers/cl/install.yaml.tmpl#L91

context: #48 (comment)

  • Better method of sorting drives

The output of lsblk is currently sorted by disk size using the sort utility. Sort using lsblk's own flags instead.

Remove sort from https://github.com/kinvolk/lokomotive-kubernetes/blob/4b5174bfba533702382a55e186cd78030b721e2d/packet/flatcar-linux/kubernetes/workers/cl/install.yaml.tmpl#L54-L58 and https://github.com/kinvolk/lokomotive-kubernetes/blob/4b5174bfba533702382a55e186cd78030b721e2d/packet/flatcar-linux/kubernetes/workers/cl/install.yaml.tmpl#L69-L73 and https://github.com/kinvolk/lokomotive-kubernetes/blob/4b5174bfba533702382a55e186cd78030b721e2d/packet/flatcar-linux/kubernetes/workers/cl/install.yaml.tmpl#L101-L107

sort should be removed and something like this can be used instead:

lsblk -lnpd -o name,rota,size -I "$${major_numbers}" -x size

context: #48 (comment)
