
Roadmap


The official Tinkerbell roadmap that provides visibility into high level themes of work and features.

Anyone can raise a roadmap item for consideration. Raise an issue against this repository and it will be considered by the project maintainers. Ensure you include sufficient detail, making minimal assumptions about the reader's knowledge. After some discussion you may be asked to provide a design; maintainers will guide you as needed.

When the issue is processed it will appear in the roadmap project.

What is appropriate for the roadmap?

The roadmap exists for tracking and visibility purposes. Our process is designed to be lightweight without sacrificing the detail necessary to gain a high-level understanding of project work.

Items appropriate for the roadmap include anything that represents a significant change - often, but not always, spanning multiple Tinkerbell repositories - or that the community ought to be aware of when taking a snapshot of in-flight work. There are no hard rules about roadmap item size, so if in doubt, raise the issue and we can determine how best to track the work item later.


Roadmap Issues

Tinkerbell v1alpha2 API

The Custom Resource Definitions defined in the Tink repository were mapped directly from the old Postgres backend without much thought given to data modeling and organization.

We've identified duplicate and hard-to-understand fields on the CRDs. We would like to refactor the CRDs to better represent the data they contain.

Project
https://github.com/orgs/tinkerbell/projects/26

Support BMC actions as part of a workflow

Summary

Currently, running Rufio actions against hardware is not built into workflows. Users must create the jobs/tasks manually. Building this into workflows as an optionally enabled feature would be very valuable.

The initial thought is that this would be the responsibility of the tink controller. We could start with a single option: something that gets a machine into a network booting state. The user experience would probably be just a boolean, for example netboot: true. This would be an opt-in feature, and it would probably only run as an initial step, not something you would be able to specify during the course of a running workflow.
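To make the opt-in shape concrete, a sketch of how this might surface on a Workflow (the netboot field name and its placement are illustrative assumptions, not an agreed design):

```yaml
# Hypothetical opt-in netboot, handled by the tink controller as an initial step.
apiVersion: tinkerbell.org/v1alpha1
kind: Workflow
metadata:
  name: provision-node1
  namespace: tink
spec:
  templateRef: template1
  hardwareRef: node1
  netboot: true  # illustrative; would trigger BMC actions to network boot the machine
```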

Introduce `WorkflowSet` and `HardwareRuleSet` CRDs

Currently, Workflows must be created with a 1:1 mapping between Hardware and Workflow. This has been the case since the beginning, and Workflow creation is left up to the user. For large deployments this can be challenging. I propose we build on top of the existing Workflow object and add the capability for the Stack to do a 1:many mapping between Hardware and Workflows. This opens up many new possibilities, and even integration with automation capabilities.

The idea is that a user defines a WorkflowSet object and the Tink controller (or something else) uses it to create one or more Workflow objects. This significantly improves the user experience around large batch creation of Workflows.

Some of the technical details aren't fully formed yet. You'll see that in the comments below. I will update this issue as the details become more fully formed.

New CRDs

WorkflowSet

For each Hardware object, create a Workflow object if a matching one (exact match? Hardware ref already exists?) does not exist. Use the pause annotation to pause the creation of Workflow objects. Tink Worker matching: the Hardware object must provide a unique identifier. The namespace/name of the Hardware object is unique but might not be usable as the Tink Worker ID. It could be the "first" MAC address, or there could be a field in the Hardware object that defines the unique identifier. This identifier needs to be coordinated with the Tink Worker and Smee (Smee sets the ID in kernel parameters).

---
apiVersion: tinkerbell.org/v1alpha1
kind: WorkflowSet
metadata:
  annotations:
    tinkerbell.org/pause: "false"
  name: set1
  namespace: tink
spec:
  hardwareRuleSetRefs:
    - name: ruleset1
      namespace: tink
  templateRef:
    name: template1
    namespace: tink

HardwareRuleSet - CRD

The result of matching Hardware objects against the rule set will be a list of Hardware objects.

---
apiVersion: tinkerbell.org/v1alpha1
kind: HardwareRuleSet
metadata:
  name: ruleset1
  namespace: tink
spec:
  operation: AND # OR
  rules:
    - label: kubernetes.io/arch
      value: amd64
      type: string # int, bool, float
      matchExpression: "=="

K8s Operator for Tinkerbell Stack Management

The Tinkerbell stack is a set of containers that could be managed by a Kubernetes operator. Initial stack deployment is reasonably trivial, but it becomes more complex with stack upgrades. We have seen instances of users implementing their own logic to perform Tinkerbell stack management.

Support kubernetes secret for providing user-data to cloud-init

Context

Currently, when using Hegel, if we want to provide user-data to cloud-init we must pass it via the Hardware spec.
For example:

apiVersion: tinkerbell.org/v1alpha1
kind: Hardware
metadata:
  name: hw1
  namespace: tink-system
spec:
  userData: |
    #cloud-config
    ---
    user: <USERNAME>
    password: <PLAINTEXT_PASSWORD>
    chpasswd: {expire: False}
    ssh_pwauth: True

Hegel then serves the user-data on HEGEL_IP:HEGEL_PORT/2009-04-04/user-data and
meta-data on HEGEL_IP:HEGEL_PORT/2009-04-04/meta-data/

cloud-init can read this user-data and meta-data when its datasource is configured correctly.
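Because Hegel exposes EC2-style (2009-04-04) endpoints, the datasource configuration generally points cloud-init's Ec2 datasource at Hegel. A minimal sketch (the file path is illustrative; substitute the real Hegel address):

```yaml
# /etc/cloud/cloud.cfg.d/10_tinkerbell.cfg (illustrative path)
datasource:
  Ec2:
    metadata_urls: ["http://HEGEL_IP:HEGEL_PORT"]
    strict_id: false  # Hegel is not AWS, so skip instance identity checks
```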

This behavior works fine as long as the user-data does not contain any sensitive information, though it can still cause formatting issues with the user-data.

Proposal

If user-data contains sensitive data such as passwords or license keys, it might not be desirable to put it in the Hardware spec in plaintext, where it can be read by anyone with read access to the Hardware CR.

To help with this, we could move the user-data into a Kubernetes Secret object and reference that object in the Hardware spec.
Hegel can then use this Secret reference to pull the user-data.
New spec example:

apiVersion: tinkerbell.org/v1alpha1
kind: Hardware
metadata:
  name: hw1
  namespace: tink-system
spec:
  userDataRef:
    name: <SECRET_NAME>
    namespace: <SECRET_NAMESPACE>

This approach has a few benefits:

  • We avoid sensitive user-data in the Hardware spec.
  • Access to the Secret can be restricted to only the required users, e.g. cluster-admin and the hegel serviceaccount.
  • Secrets store data in base64-encoded form, which helps preserve user-data formatting when the Secret is created directly from a user-data file.
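A sketch of the referenced Secret, assuming Hegel would read a userData key (the key name is an assumption of this proposal, not an existing convention):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hw1-userdata       # matches userDataRef.name in the Hardware spec
  namespace: tink-system
type: Opaque
stringData:                # plaintext here; Kubernetes stores it base64 encoded
  userData: |
    #cloud-config
    user: <USERNAME>
    password: <PLAINTEXT_PASSWORD>
```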

Add support for pulling in and using Secrets

We need to build out the concept of pulling in and using secrets.

From conversation with Nathan:

“I think we’re going to handle it by having a privileged worker simply running totally separate from unprivileged.”

We will also need a “more formal architecture and option and implementation of how you do secrets management.”

Extra notes:

  1. An out of the box "tink secrets" container
  2. Integrating with k8s-secrets or Vault

Resource Validation

Tinkerbell's primary backend is Kubernetes, meaning it acts as the data source for Hardware, Workflows, and Templates. When these objects are submitted to the cluster they do not undergo any validation. This theme of work addresses general issues encountered by users when submitting data to Tinkerbell.

Project (Combined with CRD Refactor)
https://github.com/orgs/tinkerbell/projects/26

Related
tinkerbell/tink#532

Rearchitect `in_use` flag

The in_use flag found on the Hardware CRD in the Tink repository is overloaded: its meaning depends on the client reading or updating it. In Cluster API it indicates a machine has been provisioned, while in the Tinkerbell stack it has loose semantics that generally prevent DHCP from being served for that Hardware.

We want to rearchitect this flag - possibly removing it in favor of simpler solutions - to make the system state easier to understand.

Introduce Instance field in Hardware resource

I propose we introduce a new top-level field in the Hardware CR: instance. Instance data changes more frequently than the physical characteristics of a machine. It generally corresponds to information the current operating system on the Hardware uses to configure and identify itself, or information used in the provisioning process of the Hardware. This is in contrast to facts about the Hardware, which describe its physical characteristics and change very infrequently. These facts would move to a field named attributes. As part of this we should also move all state fields to the status part of the CRD.

Why

Currently the v1alpha1 API for Hardware holds facts, some instance data, and state about a machine. This means that when state is defined (CAPT uses this, for example), practices like GitOps for Hardware become tricky and sometimes infeasible. While state on a Hardware object, or any CR, is not inherently bad, it most often belongs in the "status" of an object, not in the "spec".

Examples

The following are examples of what could be in "spec" for each category.

Instance data

  • Userdata
  • operatingSystem
  • sshKeys
  • IPAM {vlans,}
  • DHCPEnabled
  • Netboot {enabled,ipxeScript,ipxeURL,osie,kernelParams,}*
  • tags
  • partitions
  • filesystems
  • logical volumes
  • bonding
  • routing {bgp peers, etc}
  • secrets
  • plan
  • facility
  • custom data (user defined key/value pairs)

Attribute data

  • Nics
  • Disks
  • Memory
  • CPU
  • GPU
  • TPU
  • PCI
  • Chassis
  • BIOS
  • Baseboard
  • Product
  • TPM
  • StorageControllers
  • PSU
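Putting the two categories together, a Hardware object under this proposal might look like the following sketch (all field names below are illustrative, not a finalized schema):

```yaml
apiVersion: tinkerbell.org/v1alpha1
kind: Hardware
metadata:
  name: hw1
  namespace: tink-system
spec:
  instance:          # frequently changing, OS- and provisioning-facing data
    userdata: |
      #cloud-config
    operatingSystem: ubuntu-22.04
    netboot:
      enabled: true
  attributes:        # slow-changing physical facts about the machine
    cpu:
      cores: 32
    memory: 128Gi
status: {}           # state fields move here, out of spec
```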

AuthN/AuthZ

Tinkerbell doesn't have strong AuthN/AuthZ support. This has been raised in tinkerbell/tink#507 with some ideas on how we could address it.

Conditional Tinkerbell Template Actions

Overview

Tinkerbell defines a Template object that contains Actions. Actions represent an activity that contributes to the provisioning of a machine (the primary Tinkerbell use-case). Actions are flexible as they are OCI images that can be developed and maintained by third parties, and this flexibility carries over to Templates.

Templates themselves, however, aren't particularly flexible. For use-cases such as CAPI/CAPT, where the same Template is used to provision the same kind of node (such as control plane nodes) but different actions are necessary depending on the hardware, Templates can be difficult to model.

Proposal

Provide control-flow capabilities in Templates that enable toggling of individual actions. This could work in a similar fashion to GitHub Actions' if statement.

# GitHub Actions example
jobs:
  job_name:
     if: EXPRESSION

The semantics of if are to run the job if the EXPRESSION evaluates to true.

We could create something similar for Tinkerbell Template actions (note: the historical concept of a 'task' is omitted here for simplicity, as it will be removed in future versions of Tinkerbell).

actions:
- name: "write-file"
  if: EXPRESSION

Rationale

Adding expression capabilities with if to Tinkerbell Templates adds complexity in the form of maintenance. It's non-trivial, and the Go standard library doesn't offer expression evaluation, so leveraging a third-party library for evaluating expressions would be ideal.

The particular CAPI/CAPT example used is quite specific to CAPI/CAPT. It's possible that a CAPT solution could be created that decouples Templates from CAPT TinkerbellMachineTemplate objects and alleviates that specific problem, as there's nothing inherently preventing a user of Tinkerbell core from creating a different Template for a specific kind of machine today.
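For comparison, Go's standard text/template can already express simple conditional rendering at template-render time. A minimal, self-contained sketch of toggling an action on a piece of hardware data (the Facts type and action names are invented for illustration; this is not the proposed if design, which would evaluate per-action expressions at workflow runtime):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Facts is a hypothetical stand-in for per-hardware data a Template could see.
type Facts struct{ Arch string }

// Render expands an actions list, toggling an action on the hardware arch.
func Render(f Facts) string {
	const tpl = `actions:
{{- if eq .Arch "amd64" }}
- name: "write-file-amd64"
{{- else }}
- name: "write-file-generic"
{{- end }}
`
	t := template.Must(template.New("actions").Parse(tpl))
	var buf bytes.Buffer
	if err := t.Execute(&buf, f); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	fmt.Print(Render(Facts{Arch: "amd64"}))
}
```

The trade-off is that template-time conditionals are resolved when the Template is rendered, whereas a workflow-level if could react to state observed during execution.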

Support RKE2 deployments

Overview

RKE2 is a Kubernetes distribution developed by Rancher that targets government and security-focused environments.

Several community members have expressed a desire to use RKE2 with some of them running into trouble. This ticket is to experiment and provide a clear path forward for users wanting to leverage RKE2.

Documentation Updates

Significant architectural changes have happened to Tinkerbell in the last 12 months; the documentation does not reflect these changes.

To minimize documentation effort we want the following:

  1. High level documentation that helps depict the Tinkerbell system should reside on the docs website.
  2. Known setups/working hardware/supported technologies should be documented on the website.
  3. Contributing guides, project structure, and lower-level architectural detail should exist in project repositories. We expect this kind of documentation to remain volatile for some time, and it's easier to keep it in lock-step with the code when it lives close to the code.

Project
https://github.com/orgs/tinkerbell/projects/17

Related
Document how to run Tinkerbell in production
Add documentation for passing cloud-init metadata
Hardware Documentation - customising hardware spec

Tinkerbell CLI

Use-cases:

  • Easy installation of the stack.
  • Self testing a deployment.
  • Generating API objects.

Automated testing for Playground deployments

Currently we don't have any automated testing for the Vagrant deployments of the Playground. We should write automated functional tests to validate the deployments as best we can.


Secure Boot

Add support across the Tinkerbell stack for secure boot.

Hardware Monitoring and Alerting

Instrument monitoring and alerting of hardware managed by Tinkerbell.

Redfish may provide APIs to achieve the behavior.

From the DMTF: "DMTF's Redfish® is a standard designed to deliver simple and secure management for converged, hybrid IT and the Software Defined Data Center (SDDC). Both human readable and machine capable, Redfish leverages common Internet and web services standards to expose information directly to the modern tool chain."

https://www.dmtf.org/standards/redfish

Auto enrollment of nodes

Overview

There have been various requests to auto enroll devices with some sort of MAC filtering. Auto enrollment could mean bringing a device online ready to process workflows, or it could mean defining a default workflow to be run on all devices that auto enroll.

It may be useful to think of running a default workflow as a feature configurable independently of auto enrolling a device. This would define auto enrollment as simply bringing a Tink Worker online on the device, and subsequently allow operators to define workflows manually as well as via an automated approach.

Migrate to a Single Tinkerbell Version

Tinkerbell is composed of several microservices. All Tinkerbell microservices are at semver major version 0, and we have historically made breaking changes in minor version increments. The volatile nature of major version 0 has made it difficult for users to know which versions of our services are compatible with each other.

This ticket is to track migration to a singular version that represents a set of known to work versions of the Tinkerbell microservices.

Integrate Rufio with the core Tinkerbell Stack

Disclaimer: this is more of a train of thought than a well thought out idea, but I'd like to discuss it.

Rufio is currently an 'optional' component. An orchestration component is responsible for arranging the Rufio Jobs and Tink Workflows (this is essentially what CAPT does for provisioning Kubernetes clusters).

Is there a use-case for tighter integration between Tink Core and Rufio by supplying fields on Workflows? Would it, for example, be useful to allow users to specify some sort of "boot strategy" on a Workflow that results in a netboot?

Assuming we worked out the details of a "boot strategy" the result would be a fully automated machine provisioning solution that can be executed with a single workflow instead of requiring an additional orchestration layer.

An example boot strategy could be "issue the series of commands to netboot every 60s until the first action transitions to 'running'".
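For illustration only, such a strategy might surface on the Workflow like this (the bootStrategy field and its values are hypothetical, matching the train-of-thought nature of this issue):

```yaml
apiVersion: tinkerbell.org/v1alpha1
kind: Workflow
metadata:
  name: provision-node1
spec:
  bootStrategy:
    type: netboot       # hypothetical: issue BMC commands via Rufio to netboot
    retryInterval: 60s  # retry until the first action transitions to 'running'
```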

Archive tinkerbell/hub and move actions to dedicated actions repository

The tinkerbell/hub repository contains Tinkerbell-supported actions that are commonly used in provisioning. The repository was originally written to publish actions to https://artifacthub.io. The other code in the repository is tooling used for generating and publishing the actions.

https://artifacthub.io explicitly advertises itself as a place to "Find, install and publish
Kubernetes packages".

  • Publishing to Artifact Hub feels slightly inappropriate given actions are not Kubernetes packages; instead we can publish to quay.io or ghcr.io.
  • The remaining tooling in the repository is around action generation. Given we rarely generate actions, and those produced by third parties seem unlikely to use this tooling, it doesn't seem worthwhile to maintain.
  • With the repository publishing to a new registry and most of the code removed (excluding the actions themselves), we can archive the hub repository in favor of a more easily identified repository name such as actions.

Support device restart or kexec as part of workflows

Summary

When provisioning devices, users inevitably need to restart the device (or kexec). To date, users achieve a restart through an action, which can lead to incorrect status reporting on Workflows.

When the restart action runs it instructs the kernel to perform a system restart. The restart process races against the action exiting and Tink Worker reporting the action as successful. In the case of kexec, we rarely - if ever - see the action transition to a success state. This generally leaves workflows to time out, which is misleading for users.

This issue tracks the introduction of restart/kexec to workflows as a built-in feature, removing the need to include an action. Other proposals are welcome.
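One hypothetical shape for the built-in feature would declare the restart on the Workflow itself, so Tink Worker can report success before the device goes down (the onSuccess field name and values are purely illustrative, not an agreed design):

```yaml
apiVersion: tinkerbell.org/v1alpha1
kind: Workflow
metadata:
  name: provision-node1
spec:
  onSuccess: reboot  # or "kexec"; executed after the final action reports success
```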
