
Cluster API Infrastructure Provider

License: Apache License 2.0

tinkerbell cluster-api kubernetes baremetal

cluster-api-provider-tinkerbell's Introduction

Cluster API Provider Tinkerbell


Kubernetes-native declarative infrastructure for Kubernetes clusters on Tinkerbell.

What is the Cluster API Provider Tinkerbell

The Cluster API brings declarative, Kubernetes-style APIs to Kubernetes cluster creation, configuration and management.

The API itself is shared across multiple cloud providers allowing for true hybrid deployments of Kubernetes, both on-premises and off.

Quick Start

See the Quick Start

Compatibility with Cluster API and Kubernetes Versions

This provider's versions are compatible with the following versions of Cluster API:

  • Tinkerbell Provider v1beta1 (v0.1): Cluster API v1beta1 (v1.0)

This provider's versions are able to install and manage the following versions of Kubernetes:

  • Tinkerbell Provider v1beta1 (v0.1): Kubernetes v1.19, v1.20, v1.21, v1.22

Each version of Cluster API for Tinkerbell will attempt to support all community-supported Kubernetes versions during its maintenance cycle; e.g., Cluster API for Tinkerbell v0.1 supports Kubernetes 1.19, 1.20, 1.21, 1.22, etc.

NOTE: As the versioning for this project is tied to the versioning of Cluster API, future modifications to this policy may be made to more closely align with other providers in the Cluster API ecosystem.

Kubernetes versions with published Images

Pre-built images are pushed to the GitHub Container Registry. We currently publish images for Ubuntu 18.04 and Ubuntu 20.04.

Current state

Currently, it is possible to bootstrap both single instance and multiple instance Control Plane workload clusters using hardware managed by Tinkerbell.

See docs/README.md for more information on setting up a development environment.

cluster-api-provider-tinkerbell's People

Contributors

abhinavmpandey08, ahreehong, chrisdoherty4, cpanato, cprivitere, dependabot[bot], detiber, gauravgahlot, gianarb, invidian, jacobweinstock, jasonyates, mergify[bot], micahhausler, mmlb, panktishah26, parauliya, pokearu, ptrivedi, raj-dharwadkar, tstromberg, tzneal

cluster-api-provider-tinkerbell's Issues

Implement retries for BMC interactions

BMCs are known to fail or act oddly. When BMC data is referenced by the Hardware resource, CAPT uses Rufio to power machines off/on and configure netboot. The Rufio Tasks/Jobs indicate whether they failed or succeeded. For increased resiliency we should consider implementing retries in CAPT for the Rufio interactions.
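
A minimal sketch of what bounded retries could look like, assuming the rufio v1alpha1 import path shown below and a hypothetical createJob closure that builds and submits the Job; the attempt counter would need to be persisted between reconciles (for example in an annotation on the TinkerbellMachine):

import (
	"context"
	"fmt"

	rufiov1 "github.com/tinkerbell/rufio/api/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const maxBMCJobAttempts = 3

// ensureBMCJobWithRetry recreates a failed Rufio Job up to maxBMCJobAttempts times.
func ensureBMCJobWithRetry(ctx context.Context, c client.Client, job *rufiov1.Job, attempt int, createJob func() error) error {
	if job == nil {
		// No job submitted yet; create the first attempt.
		return createJob()
	}

	if !job.HasCondition(rufiov1.JobFailed, rufiov1.ConditionTrue) {
		// The job is still running or has succeeded; nothing to retry.
		return nil
	}

	if attempt >= maxBMCJobAttempts {
		return fmt.Errorf("bmc job %s/%s failed after %d attempts", job.Namespace, job.Name, attempt)
	}

	// Delete the failed job and submit a fresh one.
	if err := c.Delete(ctx, job); err != nil {
		return fmt.Errorf("deleting failed bmc job: %w", err)
	}

	return createJob()
}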

Release 0.2

We have kubified all the Tinkerbell services (Tink APIs, Hegel, Boots, Rufio, etc.) and accommodated the changes related to those services in CAPT. Thus, we should cut a new release.

Support for node power management with cluster create and delete

When clusters are created or deleted with CAPT, the bare metal nodes have to be manually powered on and set to PXE boot for cluster creation, and manually powered off after cluster deletion to completely delete the cluster.

Expected Behaviour

  1. When the hardware CRDs are applied and those hardware CRDs are picked for a Machine, CAPT can perform an extra step to power on the nodes being made part of the cluster.
  2. During cluster deletion, once a hardware CRD is released (i.e. it is no longer tagged to a Machine), CAPT can set the next boot device to PXE and power off the nodes.

Current Behaviour

  1. Currently we manually power on nodes for cluster creation.
  2. During delete, the nodes are left powered on and only the resources (templates, workflows) are deleted. Thus the cluster is still reachable since the API server is running.

Possible Solution

An integration with a BMC power management service like pbnj would help automate power on/off for CAPT.

Move away from tools.go

Having updated Go to a recent version, the go run command can now be used to run specific versions of tooling. We can use this to ensure we get the same binary each time and so the tools.go is redundant.

Upgrade Go version to 1.19

The GitHub Actions workflows for building, testing, etc. use Go 1.17. Go's current stable release is 1.19.

Expected Behaviour

Current Behaviour

Some make targets fail when using Go 1.19 but not with 1.17. make lint, for example.

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS):

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:

  • Link to your project or a code example to reproduce issue:

Consider provisioning load balancer machine as part of cluster infrastructure

In theory, as part of the cluster controller, we could run a workflow for a single machine with e.g. HAProxy configured on it, which would act as a load balancer for the cluster. Such a machine would have to expose an API for adding and removing backend machines.

While not perfect, this would allow having multiple control plane machines and replacing them.

Remove boilerplate and associated validation code

Copyright boilerplate in source files isn't necessary as code is copyrighted by default. The license under /LICENSE is sufficient for the entire project.

Maintaining the boilerplate makes it difficult to run formatting tools, as they want to format the boilerplate in a particular way depending on how the comment is written.

Mergify config not matching github actions jobs

Expected Behaviour

Mergify should merge when the conditions are met, including the CI jobs defined in GitHub Actions.

Current Behaviour

Mergify fails to merge the PR, showing that check-success=validate is false even when all GitHub Actions jobs have passed.

Possible Solution

Based on https://docs.mergify.com/conditions/#about-status-checks, the configuration looks correct, in that the GitHub Actions job name is validate; however, I believe it probably needs to be modified to take into account that the job is a matrix job.

Some investigation will need to be done to determine if we need to specify check-success for all combinations of the matrix job or if there is a way to configure it to check for a prefix or regex matching the matrix jobs.

Same Hardware assigned to multiple TinkerbellMachines when maxUnavailable >= 2 in MachineDeployment

I set up a Cluster API environment with the Tinkerbell provider, plus a Tinkerbell stack on a single server, by following https://github.com/tinkerbell/cluster-api-provider-tinkerbell/tree/main/docs. It successfully provisioned a workload K8s cluster (3 control plane nodes + 3 worker nodes) where all servers are physical machines.
When testing a rolling restart of a MachineDeployment which contains 3 worker nodes, I set maxUnavailable: 3 (an extreme case, to test how refreshing (actually re-image + rejoin) nodes works in parallel). Then a single Hardware was assigned to two TinkerbellMachines and the rolling restart got stuck.

Expected Behaviour

The nodes managed by the MachineDeployment should be linked to different Hardware resources even when multiple nodes are being refreshed or restarted.

Current Behaviour

From the screenshot, the hardware n62-107-74 was linked to two TinkerbellMachines: capi-quickstart-worker-a-2kjwb and capi-quickstart-worker-a-h9f8z.
(screenshot omitted)

Possible Solution

Explained in the context below.

Steps to Reproduce (for bugs)

  1. Provision a MachineDeployment with multiple nodes (>= 2).
  2. Set the strategy of the MachineDeployment:
     strategy:
       rollingUpdate:
         maxSurge: 0
         maxUnavailable: 3 # Just set it to 3 for an easy repro.
       type: RollingUpdate
  3. Rolling-restart the MachineDeployment with clusterctl:
     clusterctl alpha rollout restart machinedeployment/capi-quickstart-worker-a
  4. All 3 Machines are deleted first (including the TinkerbellMachines and Workflows) and then enter the provisioning stage.
  5. It is then highly likely that a single Hardware gets linked to multiple TinkerbellMachines (2 or 3).

Context

This issue blocks nodes upgrading and restarting.
After digging into code, it looks like the owner labels on the Hardware got deleted twice (even 3 times).
Let's use the case above as an example; here are the labels on the Hardware:
(screenshot omitted)

It could happen like this:

  1. The 3 machines were being deleted.
  2. The reconcile of machine n62-107-74 calls the DeleteMachineWithDependencies() method, which deletes the ownership labels on the Hardware and then creates a PowerOff bmc job to shut down this (n62-107-74) machine.
  3. The reconcile of machine n62-107-78 might be faster and already complete its bmc job. It then calls ensureHardware() to select a Hardware for itself as well. It is possible for it to select Hardware n62-107-74, as the ownership labels of n62-107-74 were deleted in step 2. It adds the ownership labels for Hardware n62-107-74, e.g. v1alpha1.tinkerbell.org/ownerName=capi-quickstart-worker-a-h9f8z.
  4. The reconcile of machine n62-107-74 calls DeleteMachineWithDependencies() again to double-check whether the PowerOff bmc job has completed. Inside DeleteMachineWithDependencies() it deletes the ownership labels again, but those ownership labels were just created in step 3 by the reconcile of machine n62-107-78.
  5. The reconcile of machine n62-107-74 continues once the PowerOff bmc job completes, so it calls ensureHardware() to select a Hardware for itself, which may again select Hardware n62-107-74 as the ownership labels were deleted in step 4. So it sets the ownership label to v1alpha1.tinkerbell.org/ownerName=capi-quickstart-worker-a-2kjwb.
  6. This causes the two machines (n62-107-74 and n62-107-78) to be linked to the same Hardware (n62-107-74).

A possible solution is to move the deletion of the ownership labels to after the PowerOff bmc job check. Here is the code I used, which works well in my environment.

// DeleteMachineWithDependencies removes template and workflow objects associated with given machine.
func (scope *machineReconcileScope) DeleteMachineWithDependencies() error {
	scope.log.Info("Removing machine", "hardwareName", scope.tinkerbellMachine.Spec.HardwareName)

	// Fetch hardware for the machine.
	hardware := &tinkv1.Hardware{}
	if err := scope.getHardwareForMachine(hardware); err != nil {
		return err
	}

	// getOrCreateBMCPowerOffJob() is a method I wrote for the PowerOff job only.
	bmcJob, err := scope.getOrCreateBMCPowerOffJob(hardware)
	if err != nil {
		return err
	}

	// Surface a failed job instead of waiting on it forever.
	if bmcJob.HasCondition(rufiov1.JobFailed, rufiov1.ConditionTrue) {
		return fmt.Errorf("bmc job %s/%s failed", bmcJob.Namespace, bmcJob.Name) //nolint:goerr113
	}

	// Check the Job conditions to ensure the power off job is complete.
	if !bmcJob.HasCondition(rufiov1.JobCompleted, rufiov1.ConditionTrue) {
		scope.log.Info("Waiting for the power off BMCJob to complete",
			"Name", bmcJob.Name,
			"Namespace", bmcJob.Namespace,
		)

		return nil
	}

	// Only remove ownership labels here when BMC PowerOff job completes.
	if err := scope.removeDependencies(hardware); err != nil {
		return err
	}

	// The hardware BMCRef is nil.
	// Remove finalizers and let machine object delete.
	if hardware.Spec.BMCRef == nil {
		scope.log.Info("Hardware BMC reference not present; skipping hardware power off",
			"BMCRef", hardware.Spec.BMCRef, "Hardware", hardware.Name)
	}

	return scope.removeFinalizer()
}

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS):
    Debian 10

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:
I deployed the Tinkerbell stack as the sandbox, with docker-compose, on a server.

  • Link to your project or a code example to reproduce issue:
    N/A

Add Flatcar systemd-sysext how to

Similar to kubernetes-sigs/cluster-api-provider-openstack#1776, Flatcar images for Cluster API can be created using the Cluster API image builder (https://github.com/kubernetes-sigs/image-builder) or by leveraging the systemd-sysext implementation, where the booted image is the vanilla one and Ignition is used to add the systemd-sysext Kubernetes overlay.

This issue was created for tracking the PR(s) that add the documentation on how to use Flatcar image with CAPT.

Update documentation

Review and update documentation under /docs and the README.md to reflect the state of v0.4.

Document development workflow

It would be good to document how to develop this provider.

So far, I figured out that one needs to follow https://cluster-api.sigs.k8s.io/developer/tilt.html (with fixes added in #12).

I still haven't figured out how to put TINKERBELL_GRPC_AUTHORITY and TINKERBELL_CERT_URL into it; right now I just edit the secret which contains them and restart the pod manually.

The next thing to figure out is how to set up Tinkerbell for testing. @detiber pointed me to https://github.com/detiber/tink/blob/kindDev/README-TILT.md, so we can run Tinkerbell on the same K8s cluster as Cluster API, but I haven't figured that part out yet.

Also, we can consider merging those 2 workflows somehow, so tilt up just works automagically.

Note: cluster-api must be on branch release-0.3 right now.

WIP list of steps:

  1. curl -fsSL https://raw.githubusercontent.com/tilt-dev/tilt/master/scripts/install.sh | bash
  2. git clone https://github.com/kubernetes-sigs/cluster-api
  3. git clone git@github.com:tinkerbell/cluster-api-provider-tink.git
  4. cd cluster-api
  5. git checkout release-0.3
  6. Create the Tilt settings file:
     cat <<EOF > tilt-settings.json
     {
       "default_registry": "quay.io/<your username>",
       "provider_repos": ["../cluster-api-provider-tink"],
       "enable_providers": ["tinkerbell", "docker", "kubeadm-bootstrap", "kubeadm-control-plane"],
       "allowed_contexts": ["<your kubeconfig context to use>"]
     }
     EOF
  7. export TINKERBELL_GRPC_AUTHORITY=10.17.3.2:42113 TINKERBELL_CERT_URL=http://10.17.3.2:42114/cert
  8. tilt up

Make sure TINKERBELL_GRPC_AUTHORITY and TINKERBELL_CERT_URL point to the running Tinkerbell instance and that your Kubernetes cluster's pods can access it.

If all the commands above succeeded, the Tilt web UI should be all green and changes to the provider's codebase should propagate to the cluster.

Figure out Tinkerbell workflow for provisioning Ubuntu instance with cloud-init userdata

To be able to use the existing cloud-init configs generated by Cluster API, we should figure out how Tinkerbell can provision Ubuntu on worker nodes with specific cloud-init userdata.

This should allow easy bootstrapping of clusters.

Regarding the cloud-init userdata delivery to the provisioning process, I think for MVP we could store userdata base64 encoded in workflow template directly, the same as we do for installing Flatcar on Tinkerbell: https://github.com/kinvolk/tinkerbell.org/blob/invidian/flatcar/content/examples/flatcar-container-linux/_index.md#preparing-template.

While a bit hacky, this would allow delivering userdata without any middleware like Hegel, which should keep things simple at the start.
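
As a rough illustration (not existing CAPT code), encoding the CAPI bootstrap userdata for direct injection into a workflow template could be as simple as the following sketch; "cloud_init_b64" is an illustrative template variable name, not an established one:

import "encoding/base64"

// encodeUserData base64-encodes the CAPI-generated cloud-init userdata so it can be
// embedded directly in a Tinkerbell workflow template, instead of being served by a
// metadata service like Hegel.
func encodeUserData(bootstrapUserData []byte) map[string]string {
	return map[string]string{
		"cloud_init_b64": base64.StdEncoding.EncodeToString(bootstrapUserData),
	}
}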

I figured out how to install the Ubuntu cloud image on a machine and make it use cloud-init. Here are the steps:

  • Download some live Ubuntu image. In my case I used https://releases.ubuntu.com/20.04/ubuntu-20.04.1-live-server-amd64.iso
  • Boot this image on the machine which needs to be provisioned. In my case I used libvirt.
  • Run the following commands to install Ubuntu Cloud on the virtual machine disk:
    cat <<EOF > 90_dpkg.cfg
    datasource_list: [ None ]
    datasource:
      None:
        metadata:
          local-hostname: bar
        userdata_raw: |
          #cloud-config
    
          users:
            - name: root
              ssh_authorized_keys:
                - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC96Ge1knq4wq3EUSCzYUPsumdLIi1B7Z/lv478G8kCgrxH5OAncqrWIRP+hG1WThe+yi60vTd1jeBc3YzOtTqSFRDnpSNNx++Lvh0/KRN6d1advX8XnmPHMvCPBtJXoRjuQLh7CS8VrP+VA3QY1y8mRoX2710pShyf2MiVI5G0Vi32GBKWJ1MRvP9kGAeay50HJsvF12AGz0VJImVn+eqtVlgPvmF6K+NqgEMdz8660avi1RKrWhLPAym82Am6WEoWQdSx01OrVlGW6nBMys3/fMAVAvYlUwolfPIPB7tl7Vh/xMeyzX6QW7BEVeGRTdzaGACl8vgbZ4BPeQiMihOWmnKHw//BDmGLYWh+2oDdvfNNL+kgwTg4YgN4NL1Crk5rPCB2WHdiwTj5I8N+NLY7K1u0vq8Loe+cZciuRIN3495sL0m9eiXkKcknM9j4ApqDe+WMiv1gojB6o8yE0RV4CmYgJa9TFwDv9+ib+G1tDCsD+ULUn9g9ARtLaHtQX5XW+adAVkXtk+DEqWHdvfc8uv298aQ6UxexKcsF4uqj88c2gDiFjjrddqt3YScPV4d3KQzBuVIO6whbCzHnEUBqfkDhogrEronEC5+wRav5NsvntucDcKDIMpQCgHTIqSOKk8wog0a1iPrEFaHs21W2CsxGfDYuKiFfSF1JvpMwnw== [email protected]
    EOF
    
    cat <<EOF > run.sh
    #!/bin/bash
    
    set -euxo pipefail
    
    # Having kernel in /boot is required by virt-customize, so we need to install some kernel, as
    # Ubuntu server live image has no kernel in /boot.
    #
    # However, installing kernel requires writable /lib/modules and Ubuntu server live image has it
    # read-only, so we workaround it by creating tmpfs on top of it.
    mount -t tmpfs tmpfs /lib/modules
    
    apt update && apt install -y libguestfs-tools linux-image-generic
    
    wget https://cloud-images.ubuntu.com/focal/current/focal-server-cloudimg-amd64.img
    qemu-img convert -f qcow2 -O raw focal-server-cloudimg-amd64.img focal-server-cloudimg-amd64.raw
    rm focal-server-cloudimg-amd64.img
    dd bs=4M if=focal-server-cloudimg-amd64.raw of=/dev/vda conv=fdatasync  status=progress
    virt-customize -a /dev/vda --copy-in ./90_dpkg.cfg:/etc/cloud/cloud.cfg.d/ --root-password password:somepassword
    reboot
    EOF
    
    chmod +x run.sh
  • If needed, modify the disk path and the public SSH key in the files created above. In my case, my VM has a single disk, /dev/vda.
  • Run the script:
    ./run.sh
  • If everything went well, reboot the machine:
    reboot

Now your VM should boot Ubuntu Cloud with the provided user-data applied.

Figure out checking if given hardware ID is available to use

Right now we ask the user to select which hardware IDs should be used for nodes, but we could just as well pick any available hardware from the Tinkerbell server. In both cases, we need to check whether a given hardware is ready to run workflows. The easiest way of doing this is to see if there are any workflows associated with the given hardware.
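
A rough sketch of that check against the Kubernetes-backed Tinkerbell API, assuming the Workflow spec exposes a HardwareRef field naming the target Hardware (adjust to the actual tink API; import paths are indicative):

import (
	"context"
	"fmt"

	tinkv1 "github.com/tinkerbell/tink/api/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// hardwareIsAvailable reports whether no Workflow currently references the given Hardware.
func hardwareIsAvailable(ctx context.Context, c client.Client, hw *tinkv1.Hardware) (bool, error) {
	workflows := &tinkv1.WorkflowList{}
	if err := c.List(ctx, workflows, client.InNamespace(hw.Namespace)); err != nil {
		return false, fmt.Errorf("listing workflows: %w", err)
	}

	for _, wf := range workflows.Items {
		if wf.Spec.HardwareRef == hw.Name {
			// A workflow already targets this hardware, so it is not free to use.
			return false, nil
		}
	}

	return true, nil
}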

Possible race condition

I observed a possible race condition today. The scenario was this: a machine is sitting in HookOS, the Tink worker is running and connected to the Tink server, and the CAPT CRDs are created, initiating the provisioning of the first control plane node. The code flow gets to a point where it creates a template and a workflow, and then creates the Rufio power jobs to power the machine off, set the next boot device, and power the machine on. With this order of operations, in this scenario, the workflow started before the machine powered off.

Expected Behaviour

The machine should be powered off before a workflow is created.

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS):

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:

  • Link to your project or a code example to reproduce issue:

Update how network booting is disabled

Currently, CAPT tells Smee not to allow a machine to network boot after it has been provisioned. It does this by setting two values in the Hardware object: Hardware.Spec.Metadata.Instance.State = provisioned and Hardware.Spec.Metadata.State = in_use. The relevant CAPT code:

return hw.Spec.Metadata.State == inUse && hw.Spec.Metadata.Instance.State == provisioned

if err := scope.patchHardwareStates(hw, inUse, provisioned); err != nil {

hardware.Spec.Metadata.Instance.State = ""

This was the case because Smee was using these fields to gate network booting. This is no longer the case in Smee (since v0.10.0). Gating of network booting in Smee now occurs via Hardware.Spec.Interfaces[].Netboot.AllowPXE.

The effect of this is that when a machine reboots, if its firmware is set up to network boot first, the machine will be served network boot packets from Smee, boot into HookOS, and sit there indefinitely.

Expected Behaviour

A machine provisioned by CAPT should not network boot after a reboot (the Hardware should be configured to tell Smee not to netboot the machine).

Current Behaviour

Possible Solution

Update CAPT to set Hardware.Spec.Interfaces[].Netboot.AllowPXE = false after a machine is provisioned.
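
A hedged sketch of that change, assuming AllowPXE is a *bool on the interface's Netboot struct as in the tink v1alpha1 API (import paths are indicative):

import (
	"context"
	"fmt"

	tinkv1 "github.com/tinkerbell/tink/api/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// disableNetboot sets AllowPXE = false on every interface of the Hardware once the
// machine is provisioned, which is how Smee (>= v0.10.0) gates network booting.
func disableNetboot(ctx context.Context, c client.Client, hw *tinkv1.Hardware) error {
	patch := client.MergeFrom(hw.DeepCopy())
	allowPXE := false

	for i := range hw.Spec.Interfaces {
		if hw.Spec.Interfaces[i].Netboot == nil {
			continue
		}

		hw.Spec.Interfaces[i].Netboot.AllowPXE = &allowPXE
	}

	if err := c.Patch(ctx, hw, patch); err != nil {
		return fmt.Errorf("patching hardware %s: %w", hw.Name, err)
	}

	return nil
}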

Steps to Reproduce (for bugs)

  1. Provision a cluster with CAPT
  2. Set a machine's firmware to network boot
  3. Reboot the machine
  4. See that HookOS is loaded and sits indefinitely

Context

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS):

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:

  • Link to your project or a code example to reproduce issue:

Fix Go module name or repository name

Currently the Go module name is the following:

$ grep mod go.mod
module github.com/tinkerbell/cluster-api-provider-tinkerbell

and repository name is https://github.com/tinkerbell/cluster-api-provider-tink.

This breaks looking up the package documentation on https://go.dev/, as neither https://pkg.go.dev/github.com/tinkerbell/cluster-api-provider-tinkerbell nor https://pkg.go.dev/github.com/tinkerbell/cluster-api-provider-tink is a valid link.

We should either change module name and all imports or rename the repository.

I think this also affects go get'ing this repository.

The cluster creation phase for v0.1.0 Tech Preview

I want to use this issue to describe what I have in mind and how I think we should implement the lifecycle for the v0.1.0 Tech Preview of CAPT.

First, TinkerbellClusterSpec has two lists of hardware IDs coming from Tinkerbell: one for the control plane and one for the data plane.

Based on what is required, the machine controller will take an ID from the right list.

The first machine that CAPI creates is always a control plane, and we can follow the same implementation we made for CAPP during its first release. The first control plane's IP will become the Cluster Endpoint, just for simplicity. (We can think about an HA-like solution later.)

From there, everything else should work almost in the same way as any other provider implementation. We have to figure out how to correctly inject UserData into the metadata server (@mmlb, can you give us some tips here?).

To summarize the interaction with Tinkerbell in the Machine creation:

  1. Identify an available hardware ID from the list specified in the TinkerbellClusterSpec.
  2. If it is the first control plane, set the Cluster Endpoint to the hardware's IP.
  3. Edit the metadata for the selected hardware ID, storing the right UserData.
  4. Create a workflow from the specified template (templateName is part of the TinkerbellMachineSpec).
  5. Wait until the workflow is in a success state (this means that the OS is ready).
  6. From there the OS is running, it should be able to read the user data persisted at point 3, and everything should work!

Temporary avoid CCM

Technically, if Tinkerbell replaces a cloud provider, we should have a CCM for Tinkerbell. For now, though, we should try to label and annotate nodes with the ProviderID during machine creation if possible; even if it is not the right way, it is faster.
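
A minimal sketch of what setting the ProviderID on a workload cluster Node could look like; the tinkerbell:// format and the helper are illustrative, not an established convention:

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setProviderID writes a provider ID onto the workload cluster Node so CAPI can match
// it to the Machine, standing in for what a Tinkerbell CCM would normally do.
func setProviderID(ctx context.Context, workloadClient client.Client, nodeName, hardwareID string) error {
	node := &corev1.Node{}
	if err := workloadClient.Get(ctx, client.ObjectKey{Name: nodeName}, node); err != nil {
		return fmt.Errorf("getting node %s: %w", nodeName, err)
	}

	patch := client.MergeFrom(node.DeepCopy())
	node.Spec.ProviderID = fmt.Sprintf("tinkerbell://%s", hardwareID)

	if err := workloadClient.Patch(ctx, node, patch); err != nil {
		return fmt.Errorf("patching node %s: %w", nodeName, err)
	}

	return nil
}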

TinkerbellMachines cannot be deleted when the bound Hardware doesn't exist

If you delete a Hardware object that's used by a TinkerbellMachine object, the TinkerbellMachine controller errors out and never removes the finalizer, preventing the TinkerbellMachine from ever being deleted.

Expected Behaviour

A TinkerbellMachine object that's bound to a deleted Hardware object is still removed when deleted.

Current Behaviour

When reconciling a deleted TinkerbellMachine resource, CAPT retrieves the Hardware. If the Hardware fails to be retrieved, CAPT returns an error and the TinkerbellMachine never has its finalizer removed; it is therefore never deleted from Kubernetes and blocks deletion of all owning resources (e.g. the CAPI Machine, CAPI Cluster, etc.).

Possible Solution
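
One hedged option, reusing the scope helpers visible in the code sample earlier on this page (scope.log, scope.tinkerbellMachine, scope.removeFinalizer) and assuming the Hardware lookup surfaces a NotFound API error: when the Hardware is already gone, skip the Hardware-dependent cleanup and still remove the finalizer so deletion can proceed. For example:

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// handleMissingHardware is called with the error from fetching the bound Hardware while
// reconciling a deleted TinkerbellMachine. If the Hardware no longer exists, treat the
// cleanup as done so the finalizer can be removed and deletion can proceed.
func (scope *machineReconcileScope) handleMissingHardware(err error) error {
	if !apierrors.IsNotFound(err) {
		return err
	}

	scope.log.Info("Bound Hardware no longer exists; removing finalizer",
		"hardwareName", scope.tinkerbellMachine.Spec.HardwareName)

	return scope.removeFinalizer()
}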

Steps to Reproduce (for bugs)

Context

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS):

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:

  • Link to your project or a code example to reproduce issue:

Add Rufio to the quickstart

The quickstart doesn't document the Rufio CRDs required to enable CAPT to perform the BMC interactions.

Expected Behaviour

Add a section in the quickstart for creating Rufio CRDs.

Current Behaviour

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS):

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:

  • Link to your project or a code example to reproduce issue:

Update Rufio dependency

Rufio has had an API change since it was integrated and has released v0.1. This issue is to update CAPT in line with Rufio v0.1.

Enable Mergify for this repo

Expected Behaviour

Approved PR with ready-to-merge tag gets merged

Current Behaviour

Approved PR with ready-to-merge tag doesn't get merged

Manifests generated by "make release-manifests" have too long names

When trying to apply them, I was getting the following error:

Error: action failed after 10 attempts: failed to create provider object /v1, Kind=Service, cluster-api-provider-tink-system/cluster-api-provider-tinkerbell-controller-manager-metrics-service: Service "cluster-api-provider-tinkerbell-controller-manager-metrics-service" is invalid: metadata.name: Invalid value: "cluster-api-provider-tinkerbell-controller-manager-metrics-service": must be no more than 63 characters

Perhaps we should consistently change all manifests to use the cluster-api-provider-tink- prefix instead of cluster-api-provider-tinkerbell-.

v0.5.0 release

Tracking ticket for the v0.5.0 release.

Expected Behaviour

Do we want to include the new API changes before the v0.5.0 release? Are there enough changes between the top of tree and v0.4.0 to warrant a release before that? I believe there is one race-condition fix and a couple of version updates (CAPI and Rufio).

Current Behaviour

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS):

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:

  • Link to your project or a code example to reproduce issue:

v0.3

Discussion issue for 0.3

This release is focused on updating dependencies to use released versions and ensuring stability with the latest Tinkerbell stack.

Structure review and a bit of refactoring if required

Ciao! As we discussed yesterday, the boilerplate code comes from CAPP. But there are a few things that I would like to understand better. The whole code generation part (Go and swagger) is a huge unknown. Make works, but for me it is all magic.

Also, if you know that the folder structure is wrong or anything like that, please feel free to make it more in line with the CAPI way of doing things.

release process builds a broken infrastructure-components.yaml

The v0.5.0 release created and uploaded a broken infrastructure-components.yaml file to the GitHub release. There are numerous creationTimestamp: "null" entries in it. This caused the clusterctl init command to fail with:

Error: failed to get provider components for the "tinkerbell" provider: failed to set the TargetNamespace on the components: unable to convert unstructured object to apiextensions.k8s.io/v1, Kind=CustomResourceDefinition: parsing time "null" as "2006-01-02T15:04:05Z07:00": cannot parse "null" as "2006"

I have updated the infrastructure-components.yaml from the v0.5.0 release so that creationTimestamp: "null" is now creationTimestamp: null.

This needs to be fixed for future releases.

Expected Behaviour

Current Behaviour

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS):

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:

  • Link to your project or a code example to reproduce issue:

Default mirror setting missing?

Expected Behaviour

We expect a new machine in a cluster created with cluster-api-provider-tinkerbell to be able to download tools like oci2disk, writefile, and kexec directly from quay or a local repo.

Current Behaviour

The machines try to download the images from the default server (I assume this to be docker hub, but the error doesn't really state this). I feel like there's some option to set up that default server, but I cannot find the option in the Hardware CRDs or the Cluster CRDs either. I'm basically just following the getting started guide but running into this.

Possible Solution

The getting started guide seems to be missing an explanation on how to tell a server what default mirror server it should be using. This is quite probably a linuxkit setting or something, as the template in the code (assuming it's internal/templates/templates.go) does not seem to allow for overriding the server.

Steps to Reproduce (for bugs)

  1. Follow getting started up to "Apply the workload cluster" (https://github.com/tinkerbell/cluster-api-provider-tinkerbell/blob/main/docs/QUICK-START.md#apply-the-workload-cluster)
  2. Check the boots logs and see that it's failing because it's unable to pull the oci2disk image.
  3. This is after it has already downloaded the tink-worker from quay, so networking is working.

Context

I'm not sure how to continue here, how to set the default mirror host to quay.

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS): Linux

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details: Directly on hardware (Lenovo servers) from a k3d cluster on a mgmt server

  • Link to your project or a code example to reproduce issue:

Let me know if there's anything else you need to help me solve this!

Push images to quay.io

Tinkerbell projects push their Docker images to quay.io. CAPT doesn't currently have a repository in quay.io (admins to create). Once that is done, the release process needs tweaking to push the images accordingly.

"make release-manifests" no longer works

To run clusterctl config cluster locally, one must generate the infrastructure-components.yaml file, which is done by make release-manifests, but this command currently prints the following error:

$ make release-manifests
release version v0.0.0
KUSTOMIZE_ENABLE_ALPHA_COMMANDS=true /tmp/test/cluster-api-provider-tink/hack/tools/bin/kustomize-v3.8.8 cfg set config image-tag v0.0.0
Error: unable to find "Krmfile" in package "config"
make: *** [Makefile:364: release-version] Error 1

CC @cpanato

Refactor main function

To avoid the main function growing indefinitely, we should refactor it at an early stage to give it some structure and make it more readable.

Support for non-Linux local test environments

Currently the only way to test locally with VMs is to use the Tinkerbell sandbox setup with Vagrant/libvirt on Linux.

The Tinkerbell sandbox also supports Vagrant/VirtualBox, which would provide a more cross-platform local test environment, but due to the networking configuration it does not work with streaming the sigs.k8s.io/image-builder based images from ghcr.io, and the resulting images would fail because of the networking configuration even if they were written properly.

We need either a way for users to consume the published image-builder built images on Vagrant/VirtualBox or another way of running a local test environment on Windows and Mac.

Expected Behaviour

Mac/Windows users to have a way to run a local test environment using VMs

Current Behaviour

The only way to run a local test environment on VMs today is on Linux using Vagrant/Libvirt

Possible Solutions

  • Enable support for Tinkerbell sandbox Vagrant/VirtualBox environments by updating hook and sigs.k8s.io/image-builder to support the VirtualBox networking configuration by default
  • Possibly support the Virtualbox networking environment using a combination of:
    • Configuring the Tinkerbell hardware networking configuration differently to better support the VirtualBox networking config
    • Ensuring that hook properly initializes all defined networking devices listed in the hardware configuration
    • Have a cluster template that can be used with Vagrant/Virtualbox to properly configure the networking devices defined in Hardware and not just the first device.
      • This may also require some changes to the cloud-init configuration injected into the written image by the CAPT-generated workflow template; longer term, if this is needed, it should be addressed via proper cloud-init support for Tinkerbell
  • Provide an alternative setup for local VMs on Windows/Mac (preferably adding support to tinkerbell/sandbox instead of a one-off setup for this repo).

how to try capt on real hardware

I followed the quick-start guide to try CAPT, but nothing happens after applying the hardware and cluster.yaml. I can see the cloud-init user data written into the Hardware CR, but on the machine side nothing happens. It looks like something is missing for the real hardware use case, which is expected given the guide: I find nowhere to set the IPMI IP and credentials, so CAPT has no way to control the OS provisioning process. However, going through the issues it looks like CAPT supports OS provisioning; I just don't know how I should use this feature.

Expected Behaviour

I expect to be able to try CAPT on real hardware by following a guide and to create a cluster successfully.
What I hoped for is that it would provision a K8s cluster on bare metal that has nothing configured, except that network boot is set. Is this possible with CAPT?

Current Behaviour

Nothing happens except that the user data is written into the Hardware, and the Hardware does not have any state.

Possible Solution

A guide or some insight on this?

Steps to Reproduce (for bugs)

Context

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS):

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:

  • Link to your project or a code example to reproduce issue:
