tinkerbell / smee

DHCP and iPXE Server

Home Page: https://tinkerbell.org

License: Apache License 2.0

Dockerfile 0.28% Makefile 1.63% Go 95.14% Shell 2.87% Starlark 0.09%

ipxe dhcp tftp netboot tinkerbell pxe

smee's Introduction

Tinkerbell

License

Tinkerbell is licensed under the Apache License, Version 2.0. See LICENSE for the full license text. Some of the projects used by the Tinkerbell project may be governed by a different license; please refer to each project's specific license.

Tinkerbell is part of the CNCF Projects.

Community

The Tinkerbell community meets bi-weekly on Tuesday. The meeting details can be found here.

Community Resources:

What's Powering Tinkerbell?

The Tinkerbell stack consists of several microservices, and a gRPC API:

Tink

Tink is the short-hand name for the tink-server and tink-worker. tink-worker and tink-server communicate over gRPC, and are responsible for processing workflows. The CLI is the user-interactive piece for creating workflows and their building blocks, templates and hardware data.

Smee

Smee is Tinkerbell's DHCP server. It handles DHCP requests, hands out IPs, and serves up iPXE. It uses the Tinkerbell client to pull and push hardware data. It only responds to a predefined set of MAC addresses so it can be deployed in an existing network without interfering with existing DHCP infrastructure.

Hegel

Hegel is the metadata service used by Tinkerbell and OSIE. It collects data from both and transforms it into a JSON format to be consumed as metadata.

OSIE

OSIE is Tinkerbell's default in-memory installation environment for bare metal. It installs operating systems and handles deprovisioning.

Hook

Hook is the newly introduced alternative to OSIE. It's the next iteration of the in-memory installation environment to handle operating system installation and deprovisioning.

PBnJ

PBnJ is an optional microservice that can communicate with baseboard management controllers (BMCs) to control power and boot settings.

Building

Use make help. The most interesting targets are make all (or just make) and make images. make all builds all the binaries for your host OS and CPU to enable running directly. make images will build all the binaries for Linux/x86_64 and build docker images with them.

Configuring OpenTelemetry

Rather than adding a bunch of command line options or a config file, OpenTelemetry is configured via environment variables. The most relevant ones are below; for others, see https://github.com/equinix-labs/otel-init-go

Currently this is just for tracing; metrics support still needs to be discussed with the community.

Env Variable	Required	Default
`OTEL_EXPORTER_OTLP_ENDPOINT`	n	localhost
`OTEL_EXPORTER_OTLP_INSECURE`	n	false
`OTEL_LOG_LEVEL`	n	info

To work with a local opentelemetry-collector, try the following. For examples of how to set up the collector to relay to various services take a look at otel-cli

export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
export OTEL_EXPORTER_OTLP_INSECURE=true
./cmd/tink-server/tink-server <stuff>

Website

For complete documentation, please visit the Tinkerbell project hosted at tinkerbell.org.

smee's Issues

Github Actions produced container images are broken

The GitHub Actions based CI is putting the boots binary into the container without setting the execute bit, which breaks running the container with the default entrypoint.

~/go/src/github.com/tinkerbell/boots> docker pull quay.io/tinkerbell/boots:latest
latest: Pulling from tinkerbell/boots
Digest: sha256:e061d587b04e08fb75076b2293dbb1e30790cc4ccdf29c052c0baed51bd9eb7e
Status: Image is up to date for quay.io/tinkerbell/boots:latest
quay.io/tinkerbell/boots:latest
~/go/src/github.com/tinkerbell/boots> docker inspect  -f '{{.Config.Entrypoint}}' quay.io/tinkerbell/boots:latest
[/boots]
~/go/src/github.com/tinkerbell/boots> docker run -ti --entrypoint=sh quay.io/tinkerbell/boots:latest -c 'ls -l /boots'
-rw-r--r--    1 root     root      19849258 Oct 23 09:11 /boots

We should have a dependency license scanner

A proprietary dependency was introduced in #134 and we did not catch it. I'm pretty sure something like https://snyk.io/ or similar would have caught it. We should search/pick a tool/service that provides this and plug it into PR checks.

bou.ke/monkey is proprietary

I just came across the license for the bou.ke/monkey module (https://github.com/bouk/monkey/blob/master/LICENSE.md), which explicitly grants no permission for any use by anyone. As such, we are currently in violation and should remedy this.

Expected Behaviour

boots follows license requirements of dependencies

Current Behaviour

boots is in violation of bou.ke/monkey's license

Possible Solution

Remove all use of bou.ke/monkey code. I will temporarily comment out the tests that make use of monkey's code.

Reintroduce tests removed via #198

PR #198 removed some useful tests due to their use of a proprietary library. We should figure out a way to reintroduce the tests. By reintroduce, I mean the functionality of the tests and not necessarily the exact same implementation. In fact, it would be nice to avoid needing something like monkey patching, possibly by making use of standalone mode.

When an iPXE URL is a data URL, expand the URL into iPXE content

Expected Behaviour

When provided with an iPXE URL that is data-URI encoded, boots should expand this URI into the decoded content and pass it to iPXE. This would be equivalent to #!ipxe userdata, but frees the userdata metadata to be consumed as userdata (cloud-init, ignition, etc).

Current Behaviour

Data URIs are not handled.

Expected Behaviour

Ideally, the changes in #115 should be reworked to handle the following scenarios.

tink-server should have a mechanism to execute/resume a running workflow after reboot.
When a worker is rebooted via a workflow, either the workflow should be marked as completed,
or the worker should pick up a new workflow (if there is one defined for it).

Current Behaviour

Below are the stepwise states happening right now.

The worker does an iPXE boot.
The worker executes the workflow.
The worker reboots as part of the workflow.
After the reboot, the worker retries the PXE boot (because of #115) and fails.
A new workflow is created.
The worker still picks up the workflow that is in the running state and, again, fails.

Steps to Reproduce (for bugs)

Create a workflow using the below template.
Use any alpine image for executing reboot action.

version: "0.1"
name: try-install
global_timeout: 1800
tasks:
  - name: "hello world"
    worker: "{{.device_1}}"
    actions:
      - name: "hello_world"
        image: hello-world
        timeout: 60
      - name: "reboot"
        image: alpine-image
        command:
          - bash
          - -c
          - |
            echo 1 > /proc/sys/kernel/sysrq; echo b > /proc/sysrq-trigger

Observe steps as mentioned in current behaviour.

State of first workflow

docker exec -i deploy_tink-cli_1 tink workflow state 88241ed8-6221-11eb-af88-0242ac120005 
+----------------------+--------------------------------------+
| FIELD NAME           | VALUES                               |
+----------------------+--------------------------------------+
| Workflow ID          | 88241ed8-6221-11eb-af88-0242ac120005 |
| Workflow Progress    | 50%                                  |
| Current Task         | hello world                          |
| Current Action       | reboot                               |
| Current Worker       | 0eba0bf8-3772-4b4a-ab9f-6ebe93b90a95 |
| Current Action State | STATE_RUNNING                        |
+----------------------+--------------------------------------+

State of the worker after the reboot action:

Create a new workflow.
The worker still picks up the old workflow because of #115 and fails again.
Both workflows, new and old, are stuck in their respective states.

vagrant@provisioner:/vagrant/deploy/centos$ docker exec -i deploy_tink-cli_1 tink workflow state 88241ed8-6221-11eb-af88-0242ac120005 
+----------------------+--------------------------------------+
| FIELD NAME           | VALUES                               |
+----------------------+--------------------------------------+
| Workflow ID          | 88241ed8-6221-11eb-af88-0242ac120005 |
| Workflow Progress    | 50%                                  |
| Current Task         | hello world                          |
| Current Action       | reboot                               |
| Current Worker       | 0eba0bf8-3772-4b4a-ab9f-6ebe93b90a95 |
| Current Action State | STATE_RUNNING                        |
+----------------------+--------------------------------------+
vagrant@provisioner:/vagrant/deploy/centos$ docker exec -i deploy_tink-cli_1 tink workflow state 5894621e-6222-11eb-af88-0242ac120005
+----------------------+--------------------------------------+
| FIELD NAME           | VALUES                               |
+----------------------+--------------------------------------+
| Workflow ID          | 5894621e-6222-11eb-af88-0242ac120005 |
| Workflow Progress    | 0%                                   |
| Current Task         |                                      |
| Current Action       |                                      |
| Current Worker       |                                      |
| Current Action State | STATE_PENDING                        |
+----------------------+--------------------------------------+

Result: If Tinkerbell is used for OS installation and the worker is rebooted as part of the workflow, the worker will never come up.

Context

Creating a centos workflow.

Your Environment

Vagrant

efficient way to collect mac address

Tinkerbell needs to know the MAC address of the machine in advance.
Is there any way to collect the MAC address of the machine efficiently,
especially the MAC of a PCIe NIC?

Much appreciated.

Configure Alpine networking via `ip` kernel command line

Expected Behaviour

Remove need for configuring Alpine networking via dhcp.

Current Behaviour

A simple bounded retry loop with hard coded sleep times before, during and after calling udhcpc.

Possible Solution

Alpine's init supports ip configuration via the ip kernel command line.
We have not been using it because it bypassed the nic detection logic (which we probably should have just changed anyway..) and did not support dns configuration.
But all of that changes with tinkerbell/osie#72, so we should just make use of ip at some point.
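
As a sketch of what boots might append to the kernel command line, the code below builds an `ip=` argument using the field order documented for the Linux kernel (`ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>:<dns0-ip>:<dns1-ip>`). Which of these fields Alpine's init actually honours is an assumption that should be checked against the initramfs in use. The `ipParam` type is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// ipParam holds the values for a static "ip=" kernel argument.
// Hypothetical type; empty fields are rendered as empty positions.
type ipParam struct {
	Client, Gateway, Netmask, Hostname, Device, DNS0, DNS1 string
}

// String renders the argument with autoconfiguration switched off,
// so no DHCP client needs to run inside the installer.
func (p ipParam) String() string {
	fields := []string{
		p.Client,
		"", // server-ip: unused in this sketch
		p.Gateway,
		p.Netmask,
		p.Hostname,
		p.Device,
		"off", // autoconf
		p.DNS0,
		p.DNS1,
	}
	return "ip=" + strings.Join(fields, ":")
}

func main() {
	p := ipParam{
		Client:  "192.168.1.5",
		Gateway: "192.168.1.1",
		Netmask: "255.255.255.248",
		Device:  "eth0",
		DNS0:    "192.168.1.1",
	}
	fmt.Println(p)
}
```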

git-lfs instructions don't work on my OS

The git-lfs instructions don't work on my OS:

Are there further instructions and/or testing required? Is there an assumed base-OS which isn't being communicated in the README?

Tested the instructions on Ubuntu 18.04 and Arch Linux with the same error:

[alex@nuc boots]$ uname -a
Linux nuc 5.5.13-arch2-1 #1 SMP PREEMPT Mon, 30 Mar 2020 20:42:41 +0000 x86_64 GNU/Linux
[alex@nuc boots]$ git lfs install
git: 'lfs' is not a git command. See 'git --help'.

The most similar command is
	log
[alex@nuc boots]$ git lfs pull
git: 'lfs' is not a git command. See 'git --help'.

The most similar command is
	log
[alex@nuc boots]$ git version
git version 2.26.0
[alex@nuc boots]$

Re-enable http sanboot

When the iPXE build/config was reworked I got rid of sanboot. Turns out that breaks some EM usage and should be re-enabled.

Expected Behaviour

sanboot http://boot.ipxe.org/freedos/fdfullcd.iso should work.

Current Behaviour

sanboot command doesn't exist.

Unexport CreateFromIP

CreateFromIP is only used by tftp.go, and it's able to use it because it already makes use of a single-user helper function that does the host:port split. We should just do CreateFromRemoteAddr(c.String()) and avoid similar-looking calls. Then we can just drop tftpClientIP.
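
For reference, the host:port split mentioned above can be done with the standard library alone. This is a sketch with a hypothetical `clientIP` helper, not the code in tftp.go or the body of CreateFromRemoteAddr.

```go
package main

import (
	"fmt"
	"net"
)

// clientIP extracts the IP from a "host:port" remote address string.
func clientIP(remoteAddr string) (net.IP, error) {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return nil, fmt.Errorf("split %q: %w", remoteAddr, err)
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return nil, fmt.Errorf("invalid IP %q", host)
	}
	return ip, nil
}

func main() {
	for _, addr := range []string{"192.168.1.5:69", "[::1]:69", "192.168.1.5"} {
		ip, err := clientIP(addr)
		fmt.Println(addr, "->", ip, err)
	}
}
```

`net.SplitHostPort` also handles bracketed IPv6 literals, which a hand-written split on ":" would get wrong.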

Vagrant: Boots service fails to start after

Expected Behaviour

Current Behaviour

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

Operating System and version (e.g. Linux, Windows, MacOS):
How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:
Link to your project or a code example to reproduce issue:

Replace implicit boot from hdd with explicit via ipxe script

There are times when we want to force a PXE booting machine to instead boot from disk. Think of machines that have netboot as the default boot option. We want to do this to avoid waiting for netboot to time out before proceeding to boot from disk on each and every boot. We currently do this by sending a tftp file named /nonexiistent or /pxe-is-not-allowed (depending on if #149 is merged yet), but @splaspood mentioned that there's a way to boot from disk from within iPXE.

Looking at netboot.xyz's iPXE script we see:

:local
echo Booting from local disks ...
exit 0

We should do the same instead of the non-existent tftp file. Doing so will extend the PXE boot process a little bit, especially after #149 is merged. This should be negligible.

Uniform Standards: Maintained Repository

#50 Expected Behaviour

We believe this repository is Maintained and therefore needs the following files updated:

Current Behaviour

If you feel the repository should be experimental or end of life or that you'll need assistance to update these files, please let us know by filing an issue with https://github.com/packethost/standards.

Possible Solution

Our repositories should be the example from which adjacent, competing, projects look for inspiration.

Each repository should not look entirely different from other repositories in the ecosystem, having a different layout, a different testing model, or a different logging model, for example, without reason or recommendation from the subject matter experts from the community.

We should share our improvements with each ecosystem while seeking and respecting the feedback of these communities.

Whether or not strict guidelines have been provided for the project type, our repositories should ensure that the same components are offered across the board. How these components are provided may vary, based on the conventions of the project type. GitHub provides general guidance on this which they have integrated into their user experience.

Context

Packet maintains a number of public repositories that help customers to run various workloads on Packet. These repositories are in various states of completeness and quality, and being public, developers often find them and start using them. This creates problems:

Developers using low-quality repositories may infer that Packet generally provides a low quality experience.
Many of our repositories are put online with no formal communication with, or training for, customer success. This leads to a below average support experience when things do go wrong.
We spend a huge amount of time supporting users through various channels when with better upfront planning, documentation and testing much of this support work could be eliminated.

To that end, we propose three tiers of repositories: Private, Experimental, and Maintained.

As a resource and example of a maintained repository, we've created https://github.com/packethost/standards. This is also where you can file any requests for assistance or modification of scope.

Your Environment

https://github.com/tinkerbell/boots/

Include error return checking in golangci-lint results

Expected Behaviour

Linting prevents risky code from being introduced into the project.

Current Behaviour

The current golangci-lint configuration lets common problems build up because unchecked error returns are not reported. When overlooked, these problems lead to error-prone code.

Possible Solution

If I had to guess, this linter was disabled for main.go:48: defer sync().

An inline // nolint:errcheck could be used to call out awareness in this one situation, without disabling the linter elsewhere.

In drone configurations,

Increase the version of golangci-lint to the latest.
Remove -D errcheck from the calling arguments.

.drone.yml:    image: golangci/golangci-lint:v1.24.0
.drone.yml:      - golangci-lint run -v -D errcheck

The PR that introduces this for CI should include (or be preceded by) one that addresses the outstanding issues in this project, as reflected by the default golangci-lint reporting settings.

Context

$ golangci-lint run ./...
job/events_test.go:31:10: Error return value of `w.Write` is not checked (errcheck)
                w.Write([]byte(`{"id":"event-id"}`))
                       ^
job/hardware.go:57:9: Error return value of `w.Write` is not checked (errcheck)
        w.Write([]byte{})
               ^
job/http.go:55:30: Error return value of `(*encoding/json.Encoder).Encode` is not checked (errcheck)
                json.NewEncoder(buf).Encode(post_data)
                                           ^
job/http.go:66:9: Error return value of `w.Write` is not checked (errcheck)
        w.Write([]byte{})
               ^
job/job_test.go:52:9: Error return value of `j.setup` is not checked (errcheck)
        j.setup(d)
               ^
job/job_test.go:111:9: Error return value of `j.setup` is not checked (errcheck)
        j.setup(d)
               ^
job/job_test.go:143:9: Error return value of `j.setup` is not checked (errcheck)
        j.setup(d)
               ^
packet/client.go:155:15: Error return value of `io.Copy` is not checked (errcheck)
        defer io.Copy(ioutil.Discard, res.Body) // ensure all of the body is read so we can quickly reuse connection
                     ^
files/tarball/tarball.go:87:18: Error return value of `t.tw.WriteHeader` is not checked (errcheck)
        t.tw.WriteHeader(&t.hdr)
                        ^
installers/coreos/oem.go:52:15: Error return value of `f.WriteString` is not checked (errcheck)
        f.WriteString(cloudConfig)
                     ^
installers/coreos/oem.go:57:15: Error return value of `f.WriteString` is not checked (errcheck)
        f.WriteString(phoneHome)
                     ^
installers/coreos/oem.go:61:15: Error return value of `f.WriteString` is not checked (errcheck)
        f.WriteString(phoneHomeService)
                     ^
files/ignition/config.go:30:14: Error return value of `errors.Wrap` is not checked (errcheck)
                errors.Wrap(err, "writing ignition config")
                           ^
installers/vmware/kickstart_script_test.go:48:17: Error return value of `genKickstart` is not checked (errcheck)
                                genKickstart(m.Job(), &w)

misbehaving DHCP client can crash boots

Expected Behaviour

Boots shouldn't crash no matter what kind of traffic we throw at the DHCP server.

Current Behaviour

A dhcp client that sends a malformed packet can crash the service.

This example sends an invalid client GUID that causes a crash:

udhcpc -q -v PXEClient -x 0x5d:0000 -x 0x61:aaaaaaaa4a525bd43517df7f8b47

Possible Solution

Looks like it's in the dhcp4-go logging code. I did not dig into it.

Steps to Reproduce (for bugs)

Start boots standalone mode via docker-compose, then log into the client container with docker exec and run the offending udhcpc command.

# start boots in standalone mode
docker-compose up

# this should succeed
docker exec -ti boots_client_1 \
    busybox udhcpc -q -v PXEClient -x 0x5d:0000 -x 0x61:000000004a525bd43517df7f8b4799c18d

# this will crash boots
docker exec -ti boots_client_1 \
    busybox udhcpc -q -v PXEClient -x 0x5d:0000 -x 0x61:aaaaaaaa4a525bd43517df7f8b47

Context

I found this while coming up with udhcpc commands to test boots in docker-compose.

Your Environment

Operating System and version (e.g. Linux, Windows, MacOS):

Arch Linux

How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:

docker-compose

Link to your project or a code example to reproduce issue:

Everything you need is in main.

panic: required envvar is unset (ROLLBAR_TOKEN)

Successful build on Ubuntu 18.04 for aarch64 under WSL, but when I follow the instructions in the README I get this:

ed@iyengar:~/src/github.com/tinkerbell/boots$ ./boots
{"level":"panic","ts":1585923538.9905274,"caller":"rollbar/rollbar.go:20","msg":"required envvar is unset","service":"github.com/tinkerbell/boots","pkg":"log","envvar":"ROLLBAR_TOKEN"}
panic: required envvar is unset

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0x4000204000, 0x400016a300, 0x1, 0x2)
        /home/ed/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:229 +0x40c
go.uber.org/zap.(*SugaredLogger).log(0x40001b0018, 0x4, 0x641d77, 0x18, 0x0, 0x0, 0x0, 0x40000e9a78, 0x2, 0x2)
        /home/ed/go/pkg/mod/go.uber.org/[email protected]/sugar.go:234 +0xd4
go.uber.org/zap.(*SugaredLogger).Panicw(...)
        /home/ed/go/pkg/mod/go.uber.org/[email protected]/sugar.go:204
github.com/packethost/pkg/log/internal/rollbar.Setup(0x40001b0018, 0x64397e, 0x1b, 0x2)
        /home/ed/go/pkg/mod/github.com/packethost/[email protected]/log/internal/rollbar/rollbar.go:20 +0x3a8
github.com/packethost/pkg/log.configureLogger(0x40001a0120, 0x64397e, 0x1b, 0x633420, 0x4, 0x632e61, 0x3, 0x633eef, 0x5)
        /home/ed/go/pkg/mod/github.com/packethost/[email protected]/log/log.go:69 +0x1c8
github.com/packethost/pkg/log.Init(0x64397e, 0x1b, 0x0, 0x0, 0x0, 0x0, 0x400008b838, 0x0)
        /home/ed/go/pkg/mod/github.com/packethost/[email protected]/log/log.go:87 +0xd4
main.main()
        /home/ed/src/github.com/tinkerbell/boots/main.go:41 +0xb8

Fix lint failures

In https://cloud.drone.io/packethost/boots/4/1/2 there are a set of failures in the lint step, which block automated testing of the code itself.

Build is broken since update to Alpine 3.14

The automated builds are broken on arm:
https://github.com/tinkerbell/boots/runs/3198503141
https://github.com/tinkerbell/boots/runs/3242069576

Expected Behaviour

The builds of the master branch should work on all supported architectures.

Current Behaviour

The builds fail on arm.

Possible Solution

Downgrade to Alpine 3.13
See #185

Add support for UEFI HTTP Boot

Expected Behaviour

I expected to be able to use UEFI HTTP Boot to boot a machine.

Current Behaviour

Only PXE Boot is supported.

Possible Solution

I've got dnsmasq serving HTTP Boot at:

https://github.com/rgl/talos-vagrant/blob/6ce54f83e4066bc60415d697f5c4a7a6cd5ac561/provision-dnsmasq.sh#L53-L59

boots should not PXE if no workflow defined

If there is no active workflow for a worker, boots should not PXE boot a system. Right now, it will always PXE into the environment and noop, which is not a practical use case.

Boots panics I presume because hardware data is not as it expects

Expected Behaviour

The worker boots correctly and executes the workflow.

Current Behaviour

I am using the vagrant setup with this hardware:

$ docker exec -i deploy_tink-cli_1 tink  hardware  id 0eba0bf8-3772-4b4a-ab9f-6ebe93b90a94 | jq
{
  "id": "0eba0bf8-3772-4b4a-ab9f-6ebe93b90a94",
  "network": {
    "interfaces": [
      {
        "dhcp": {
          "arch": "x86_64",
          "ip": {
            "address": "192.168.1.5",
            "gateway": "192.168.1.1",
            "netmask": "255.255.255.248"
          },
          "mac": "08:00:27:00:00:01"
        },
        "netboot": {
          "allow_pxe": true,
          "allow_workflow": true
        }
      }
    ]
  }
}

Boots panics:

{"level":"info","ts":1596012796.4664223,"caller":"[email protected]/handler.go:105","msg":"","service":"github.com/tinkerbell/boots","pkg":"dhcp","pkg":"dhcp","event":"recv","mac":"08:00:27:00:00:01","via":"0.0.0.0","iface":"eth1","xid":"\"f8:75:02:6d\"","type":"DHCPDISCOVER","secs":16}
{"level":"info","ts":1596012796.4665282,"caller":"src/dhcp.go:71","msg":"parsed option82/circuitid","service":"github.com/tinkerbell/boots","pkg":"main","mac":"08:00:27:00:00:01","circuitID":""}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x9d2f31]

goroutine 41 [running]:
github.com/tinkerbell/boots/packet.DiscoveryTinkerbellV1.Hostname(...)
        /drone/src/packet/models_tinkerbell.go:92
github.com/tinkerbell/boots/job.(*Job).setup(0xc00001dab0, 0xd1ef40, 0xc0003b1360, 0xc00041e018, 0x4)
        /drone/src/job/job.go:160 +0x63c
github.com/tinkerbell/boots/job.CreateFromDHCP(0xc0002d16a0, 0x6, 0x10, 0xc00041e018, 0x4, 0x164, 0x0, 0x0, 0x0, 0x0, ...)
        /drone/src/job/job.go:63 +0x229
main.dhcpHandler.serveDHCP(0xc000152880, 0xd04620, 0xc000416960, 0xc0003b1800)
        /drone/src/dhcp.go:74 +0x859
main.dhcpHandler.ServeDHCP.func1()
        /drone/src/dhcp.go:45 +0x45
github.com/gammazero/workerpool.(*WorkerPool).dispatch.func1(0xc000152880, 0xc000404ae0)
        /go/pkg/mod/github.com/gammazero/[email protected]/workerpool.go:169 +0x27
created by github.com/gammazero/workerpool.(*WorkerPool).dispatch
        /go/pkg/mod/github.com/gammazero/[email protected]/workerpool.go:167 +0x4c2

Possible Solution

I presume the recent change made with user-defined metadata (#58) is involved here, because I am deliberately not sending any of it. Technically I should be able to do that because they are not an unstructured object.
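
Without knowing the exact line that dereferences nil, the general remedy is a nil-safe accessor. The sketch below uses hypothetical `hardware` and `instance` types; it is not the real DiscoveryTinkerbellV1 model, and it assumes the panic comes from an optional nested struct being absent.

```go
package main

import "fmt"

// instance and hardware are hypothetical stand-ins for the real models.
type instance struct {
	Hostname string
}

type hardware struct {
	Instance *instance // optional: absent when no metadata was supplied
}

// hostname returns the instance hostname and whether one was available.
// It is safe to call on a nil receiver or with a nil Instance.
func (h *hardware) hostname() (string, bool) {
	if h == nil || h.Instance == nil {
		return "", false
	}
	return h.Instance.Hostname, true
}

func main() {
	hw := &hardware{} // hardware record without metadata, as in this report
	if name, ok := hw.hostname(); ok {
		fmt.Println("hostname:", name)
	} else {
		fmt.Println("no instance data, continuing without a hostname")
	}
}
```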

Steps to Reproduce (for bugs)

Follow the vagrant workflow and create the hardware as I pasted here
Create a workflow and start a worker

Your Environment

Operating System and version (e.g. Linux, Windows, MacOS):

MacOS

How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:

Vagrant with Virtualbox

Remove binaries from the project

Expected Behaviour

Project binaries should be external artifacts, built through automation, not included in the git repository.

Current Behaviour

Binaries are included in the project and are expected to be compiled and updated by contributors. The binaries must then be manually verified through recompiling (with the same tools) and checksum verification. This slows our ability to accept new changes (#79).

Possible Solution

Similar to #4, the ipxe/bin artifacts should be published to GitHub if they are needed outside of a boots container. If a boots container image is sufficient, the binaries should be compiled at image build time, not trusted from the repository.

The LFS ipxe/bin artifacts should then be removed and the instructions (https://github.com/tinkerbell/boots#local-setup) updated accordingly.

Question on HTTP endpoints purpose

Hello, I am wondering what the use-cases are for Boots serving these HTTP endpoints?

I'm having trouble finding any docs or even code comments on their purpose. Thanks so much!

Make use of params command optional for phone-home

I am working through getting my physical servers working with Tinkerbell to start doing some experimentation. They fail, saying the “params” command is not found.

The servers are leased from a datacenter, so flashing or updating iPXE might not be possible or ideal.

Expected Behaviour

Boots

Current Behaviour

Errors out at params not found

Possible Solution

Maybe hardware option? Or use query string

Steps to Reproduce (for bugs)

Run iPXE without params built in
Try to boot

Context

My solution was just to remove the params here:
https://github.com/tinkerbell/boots/blob/master/ipxe/script.go#L34

Then just let it phone home even without params. It seems like using ?body=${body}&type=${body} might work. I’m also not sure the params are even needed: looking at the phone home code, it seems like they're mostly ignored?

Your Environment

Operating System and version (e.g. Linux, Windows, MacOS):
How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details: Tinkerbell itself runs in KVM on the network
Link to your project or a code example to reproduce issue:

Support for workflow mode

To aid in bootstrapping a worker and providing it with the right Kernel parameters to execute a workflow.

Need better way to override console configured for osie

Currently to override the default console for a given hardware you need to override the facility_code, such as "onprem console=ttyS0". Otherwise it will use hardcoded values based on the architecture and plan_slug.

Expected Behaviour

There should be a more straightforward way to override the console without having to resort to hacking the facility_code.

Current Behaviour

One needs to override the facility_code to add additional kernel parameters.

Possible Solution

Expose a well defined field as part of the hardware schema to override the default values for the console configuration (if not set).

Steps to Reproduce (for bugs)

Create a hardware definition for a machine that does not conform to the default serial configurations hardcoded in boots (such as a Macchiatobin, which is an arm64 board that uses ttyS0 for the console instead of ttyAMA0).

'panic: connect to cacher: lookup srv record, no such host' when used with MIRROR_HOST

See https://github.com/tinkerbell/boots/blob/9d6cded511d1e7678ec601a29aac4137f25ecb5a/env/mirror.go#L49 and following -

The function buildMirrorBaseURL offers three options for setting a mirror site:

set MIRROR_BASE_URL
set MIRROR_HOST
FacilityCode

If you use the FacilityCode option, it hard-codes a packet.net address. But if you set MIRROR_HOST you get further.

When I run

ROLLBAR_TOKEN=foo PACKET_ENV=bar PACKET_VERSION=bletch API_CONSUMER_TOKEN=baz API_AUTH_TOKEN=xyzzy MIRROR_HOST=127.0.0.1 FACILITY_CODE=ARB1 ./boots

I get this error

panic: connect to cacher: lookup srv record: lookup _http._tcp.cacher.ARB1.packet.net on 192.168.1.254:53: no such host

Add support for new hardware data model

Boots shall continue to support the existing model and shall also support the new model described below for tinkerbell.

The new data model:

{
   "id": "fde7c87c-d154-447e-9fce-7eb7bdec90c0",
   "dhcp": {
     "mac": "00:00:00:00:00:00", // MAC address for lookup
     "ip": "172.16.1.35/31",     // IP to hand out
     "hostname": "server001",    // hostname (optional, no default)
     "lease_time": 86400,        // expiration in secs (optional, default: 86400)
     "name_servers": [ .. ],     // DNS servers (optional, no default)
     "time_servers": [ .. ],     // NTP servers (optional, no default)
     "gateway": "192.168.1.1",   // gateway address (optional? no default)
     "arch": <string>,           // architecture of machine
     "uefi": <bool>
   },
   "netboot": {
     "allow_pxe": <bool>,
     "allow_workflow": <bool>,
     "ipxe": {
       "url": "http://<url>/menu.ipxe",
       "contents": "#!ipxe"
     },
     "bootstrapper": {
       "kernel": "http://<url-to>/kernel",
       "initrd": "http://<url-to>/initrd",
       "os": "http://<url-to>/osie"
     }
   },
   "network": [{
     "dhcp": {
       // any global "dhcp" settings can be overridden here
     },
     "netboot": {
       // any global "netboot" settings can be overridden here
     }
   }],
   "metadata": {
     "state": <string>,        // state of the hardware
     "bonding_mode": <int>,
     "manufacturer": {
       "id": <string>,
       "slug": <string>
     },
     "instance": { ... },
     "custom": {
       "preinstalled_operating_system_version": {...},
       "private_subnets": [<string>]
     },
     "facility": {
       "plan_slug": <string>,
       "plan_version_slug": <string>,
       "facility_code": <string>
     }
   }
}
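
As a rough illustration of how the "dhcp" and "netboot" sections above could be decoded, here is a minimal Go sketch. The type and function names (Hardware, DHCP, Netboot, parseHardware) are assumptions made for this example, not the types boots actually uses, and the "network" and "metadata" sections are left out for brevity. Note that the model as written above contains comments and placeholders, so real input would have to be plain JSON.

```go
package main

import "encoding/json"

// DHCP mirrors the "dhcp" object of the proposed model (illustrative only).
type DHCP struct {
	MAC         string   `json:"mac"`
	IP          string   `json:"ip"`
	Hostname    string   `json:"hostname,omitempty"`
	LeaseTime   int      `json:"lease_time,omitempty"` // seconds
	NameServers []string `json:"name_servers,omitempty"`
	TimeServers []string `json:"time_servers,omitempty"`
	Gateway     string   `json:"gateway,omitempty"`
	Arch        string   `json:"arch,omitempty"`
	UEFI        bool     `json:"uefi"`
}

// IPXE mirrors "netboot.ipxe".
type IPXE struct {
	URL      string `json:"url,omitempty"`
	Contents string `json:"contents,omitempty"`
}

// Bootstrapper mirrors "netboot.bootstrapper".
type Bootstrapper struct {
	Kernel string `json:"kernel,omitempty"`
	Initrd string `json:"initrd,omitempty"`
	OS     string `json:"os,omitempty"`
}

// Netboot mirrors the "netboot" object.
type Netboot struct {
	AllowPXE      bool         `json:"allow_pxe"`
	AllowWorkflow bool         `json:"allow_workflow"`
	IPXE          IPXE         `json:"ipxe"`
	Bootstrapper  Bootstrapper `json:"bootstrapper"`
}

// Hardware is the top-level record; "network" and "metadata" are omitted here.
type Hardware struct {
	ID      string  `json:"id"`
	DHCP    DHCP    `json:"dhcp"`
	Netboot Netboot `json:"netboot"`
}

// parseHardware decodes one hardware record from plain JSON.
func parseHardware(data []byte) (Hardware, error) {
	var hw Hardware
	err := json.Unmarshal(data, &hw)
	return hw, err
}
```

Calling parseHardware on a record with only "id" and "dhcp" set leaves the netboot fields at their zero values, which is one reason defaults (such as the lease time) need to be applied explicitly after decoding.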

Clarify conflicts with existing networking in README

The README should clarify if this component will "play nice" with existing networks, or whether it needs to be run in a sandboxed / isolated network.

My worry is that running it on my 192.168.0.0/24 may break all the IP allocations. That concern can be alleviated by stating in the README how this software is designed to be used.

For instance pixiecore by the same author as MetalLB says it uses a DHCP proxy. I'm not sure what path Plundr takes.

Baremetal auto enlist/discover

Currently, hardware can be added in only one way: collect the MAC address and assign an IP in advance.

Is there any plan to add a feature that lets hardware enlist automatically?

Move ipxe to a dedicated repository?

Currently ipxe is built in this repository, which makes the boots build slower and more complex.

What do you think about:

Moving the ipxe build to the boots-ipxe repository.
Building ipxe in GitHub Actions, inside a Docker container using buildx from an Ubuntu 20.04 image.
Publishing an artifact for each arch as a GitHub Release.
Modifying the boots build to use the GitHub Release.

This would make things simpler from the boots build perspective and for someone that is trying to understand and build boots.

feature/bug: expected behavior around short circuiting a netboot request

Currently, there is logic to short-circuit a netboot request based on some tink/cacher hardware data (see here). I looked through the Tink code base and didn't see any code paths where hardware data is updated as a workflow progresses. The tink worker sends status reports as it progresses (ref here), but tink server doesn't update hardware data in any way. Taking all this into account, it appears that the code here in Boots expects some external entity to update the hardware data in conjunction with a workflow's progress. This makes the Boots -> Tink server combo always netboot unless hardware data is manually updated. This, in my opinion, is not expected behavior. This was also raised in the Tinkerbell community Slack channel, here. This feels more like a feature request than a bug, but at a bare minimum it is an undocumented quirk that affects generally expected behavior.

CC @rothgar

Expected Behaviour

After a machine has been provisioned we should be able to boot from a local disk without changing the boot order.

Current Behaviour

See above.

Possible Solution

Write a workflow action that updates tink hardware data. This is just for the sake of giving any kind of workaround. I don't think this is a viable mid-long term solution.

boots should not panic on bad hardware data

I made a mistake in my hardware data definition and observed a fatal panic in boots, which crashed it completely. I don't think this is suitable behaviour; in my opinion it would be better to recover from the panic and log an error message, allowing boots to continue operating.
The panic was caused by the code in https://github.com/tinkerbell/boots/blob/master/job/job.go#L79 .
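
A minimal sketch of the recover-and-log idea, assuming each request is handled by a function that can return an error. The helper name safely is made up for this example and is not part of boots.

```go
package main

import "fmt"

// safely runs fn and converts a panic raised inside it into an ordinary
// error, so one bad hardware record cannot take the whole process down.
func safely(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return fn()
}
```

A handler would wrap its per-request work in safely and log the returned error. Keep in mind that recover only catches panics raised on the same goroutine, so each worker goroutine needs its own wrapper.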

Consider a new name for `boots`

I was trying to talk with @gauravgahlot about some stuff, and we had to be really careful about phrasing sentences to not get confused about if we were talking about a verb (to boot, it boots) or a noun (boots, the software).

Here are some excerpts:

okay, so boots first serves a regular ipxe file

the worker boots and uses whatever PXE support it has to netboot off of boots

boots notices the worker is booting with whatever PXE support it had, and says "okay, you need to run my own version of iPXE first. Here it is, netboot off of: undionly.kpxe" (https://github.com/tinkerbell/boots/blob/master/job/dhcp.go#L61; this is the "not isPacket" branch: https://github.com/tinkerbell/boots/blob/master/job/dhcp.go#L105-L114)

the worker says "yep, sure" and netboots off of undionly.kpxe and that again connects to boots

boots notices the worker is using boots's special iPXE and then goes through the same codepath, but should fall through to this case: https://github.com/tinkerbell/boots/blob/master/job/dhcp.go#L130-L131

and then consider this sentence, and whether or not you're 100% certain as to what kind of log file you're about to see:

I saved these boot logs sometime ago...

Maybe a new name could be connected to boots, like a type of boot Hegel might have worn: Hessian, Wellington, Blucher, and ankle-jack.

Anyway, it isn't too late. I like the levity in the name boots, but I think it comes at a high cost.

Investigate PXE issue with Lenovo ThinkSystem HR330A (Ampere eMAG)

Apparently, #216 may have broken PXE boot on this hardware. I'm not sure what BIOS version, however.

@mmlb mentioned this on chat:

... the last bunch of dhcps ... where the filename is under ipxe/ and nothing back from the nic/uefi. I was on console and the machine went almost straight to uefi shell.

#223 may fix this, but if we can confirm the problem we should probably just roll-back #216 and bring it back without the path changes.

Proposal 0016, Default Workflows

Expected Behaviour

It would be very helpful to have default workflows for Tinkerbell, meaning that when a currently unknown machine sends out a DHCPDISCOVER request, Boots will push the new hardware and create a workflow for the new machine from the default template.

Current Behaviour

Currently, you must manually push each new piece of hardware you expect and then manually create a workflow for it.

Possible Solution

on dhcpdiscover received:
- grab mac address from req
- create hardware with mac
- push hardware
- grab a template titled 'default'
- create workflow with new hw id and default template id

Context

Trying to provision multiple bare metal machines on a private network as soon as they are turned on

Your Environment

Operating System and version (e.g. Linux, Windows, MacOS):
Win10, Ubuntu
How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:
Running based on Local Setup with Vagrant tutorial

Make TFTP session timeout tunable (defaults to 3 seconds)

It would be nice to have the possibility to edit the TFTP session timeout for each worker server. Currently, if the worker server does not respond within 3-6 seconds, TFTP times out and the operating system installation ultimately fails.

Current Behaviour

CLIENT MAC ADDR: $MAC GUID: 00000000 0000 0000 0000 AC1F6BA50120
CLIENT IP: xxx.xxx.xxx.xxx MASK: yyy.yyy.yyy.yyy DHCP IP: zzz.zzz.zzz.zzz
GATEWAY IP: aaa.aaa.aaa.aaa
PXE-E32: TFTP open timeout
PXE-M0F: Exiting Intel Boot Agent.

Steps to Reproduce (for bugs)

In our case, because of the network infrastructure, the server is a bit slower to respond in that part of the world.

Context

Because the server takes longer to respond, TFTP's TTL expires and the session times out, ultimately failing the whole deployment process.

Your Environment

Provisioner running CentOS 7, deploying to another bare-metal server.

Add retries to iPXE script when fetching files

The iPXE file fetches can run into temporary network issues when downloading the kernel/initramfs files.
We should add some retry logic.

Expected Behaviour

Temporary network failures do not cause the iPXE boot to fail.

Current Behaviour

iPXE boot will fail if there's a network issue.

man page for boots

Provide a man page for boots, so that a sysadmin looking for help on command line options, flags, and other issues has a single place to go.

Suggest that the man page be written with markdown + pandoc, as described in this tutorial - no particular need to write raw troff. The file https://github.com/eddieantonio/license/blob/master/license.1.md is an example written this way.

Proposed Boots roadmap

The following is a proposed roadmap of work items for Boots.
They are not ordered in terms of priority.
I don't know if we can get a GitHub project created in this repo to track these.
If so, I can create all the tickets.

is git-lfs needed?

Is the only file we store with git-lfs ipxe/bin/snp-hua.efi? If so, why do we need git-lfs? If I'm not mistaken this file is only about 220KB. Seems unnecessary. Maybe, I'm missing something?

If this is not the only file we store with git-lfs, does anyone know the others? The .gitattributes only shows snp-hua.efi.

Produce a binary release, for multiple architectures

To aid in bootstrapping and testing, the project should publish a binary release (for at least two architectures) to allow users to get their hands on this code without needing at the very outset to have a full development environment.

Boots should pass Osie to registered hardware without workflow yet

Expected Behaviour

A registered hardware without a workflow registered should boot into Osie.

Current Behaviour

Boots logs:

{"level":"info","ts":1608652334.4974012,"caller":"boots/dhcp.go:76","msg":"retrieved job is empty","service":"github.com/tinkerbell/boots","pkg":"main","type":"DHCPDISCOVER","mac":"f4:4d:30:64:8e:0f","err":"discover from dhcp message: get hardware by mac from tink: rpc error: code = Unknown desc = unexpected end of JSON input","errVerbose":"rpc error: code = Unknown desc = unexpected end of JSON input\nget hardware by mac from tink\ngithub.com/tinkerbell/boots/packet.(*Client).DiscoverHardwareFromDHCP\n\t/home/runner/work/boots/boots/packet/endpoints.go:104\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP.func1\n\t/home/runner/work/boots/boots/job/fetch.go:16\ngithub.com/golang/groupcache/singleflight.(*Group).Do\n\t/home/runner/go/pkg/mod/github.com/golang/[email protected]/singleflight/singleflight.go:56\ngithub.com/tinkerbell/boots/job.discoverHardwareFromDHCP\n\t/home/runner/work/boots/boots/job/fetch.go:18\ngithub.com/tinkerbell/boots/job.CreateFromDHCP\n\t/home/runner/work/boots/boots/job/job.go:59\nmain.dhcpHandler.serveDHCP\n\t/home/runner/work/boots/boots/dhcp.go:74\nmain.dhcpHandler.ServeDHCP.func1\n\t/home/runner/work/boots/boots/dhcp.go:45\ngithub.com/gammazero/workerpool.startWorker\n\t/home/runner/go/pkg/mod/github.com/gammazero/[email protected]/workerpool.go:218\nruntime.goexit\n\t/opt/hostedtoolcache/go/1.15.5/x64/src/runtime/asm_386.s:1333\ndiscover from dhcp message"}

And it does not serve Osie via iPXE

Possible Solution

Serve Osie even if no workflow is registered.

Steps to Reproduce (for bugs)

Start tinkerbell with Vagrant or whatever
Register hardware
Start the hardware without any workflow

Context

In my home lab I do not persist an operating system on all the devices; I like to keep a few of them "ephemeral", and Osie is perfect for that. But I have to push a hello world workflow, otherwise they won't even boot.

Your Environment

Sandbox!

boots doesn't compile using `go build`: undefined: ipxe.MustAsset

[root@73a5a635-85b2-4a26-cd19-bd3e89bc9c36 boots]$ go build -v
github.com/tinkerbell/boots/tftp
# github.com/tinkerbell/boots/tftp

tftp/tftp.go:15:20: undefined: ipxe.MustAsset
tftp/tftp.go:16:20: undefined: ipxe.MustAsset
tftp/tftp.go:17:20: undefined: ipxe.MustAsset
tftp/tftp.go:18:20: undefined: ipxe.MustAsset

Move to GitHub Actions

tinkerbell/tink uses GitHub Actions already. I migrated it from the internal drone to GitHub Actions a few weeks ago.

We should move boots as well.

You can take inspiration from the tink repository BUT with Docker Action v2

The requirement is the same as the one we have for tinkerbell/tink: migrate the current checks to GitHub Actions, and push images to quay.io when a PR is merged to master.

Uniform Standards Request: Experimental Repository

Hello!

We believe this repository is Experimental and therefore needs the following files updated:

If you feel the repository should be maintained or end of life or that you'll need assistance to create these files, please let us know by filing an issue with https://github.com/packethost/standards.

Developers using low-quality repositories may infer that Packet generally provides a low quality experience.
Many of our repositories are put online with no formal communication with, or training for, customer success. This leads to a below average support experience when things do go wrong.
We spend a huge amount of time supporting users through various channels when with better upfront planning, documentation and testing much of this support work could be eliminated.

To that end, we propose three tiers of repositories: Private, Experimental, and Maintained.

As a resource and example of a maintained repository, we've created https://github.com/packethost/standards. This is also where you can file any requests for assistance or modification of scope.

The Goal

Our repositories should be the example from which adjacent, competing, projects look for inspiration.

We should share our improvements with each ecosystem while seeking and respecting the feedback of these communities.

Provide git-lfs instructions for Linux users

The git-lfs link only provides instructions for MacOS users.

I would like to suggest switching to this page which includes instructions for Linux users:

https://github.com/git-lfs/git-lfs/wiki/Installation

dhcp lease time issues

With the introduction of the new data model, boots no longer uses the environment variable for the default DHCP lease time, and the new model has no default of its own. As a result a lease time of 0 is handed out, which creates a flood of DHCP requests to the DHCP server because of the short lease time. We should default to 86400 if the user does not supply a value.
