
tinkerbell / pbnj

Service for interacting with BMCs

License: Apache License 2.0

Dockerfile 1.19% Go 93.12% Shell 2.17% Makefile 3.41% Ruby 0.11%
ipmitool bmc bmclib ipmi tinkerbell redfish bare-metal baremetal

pbnj's Introduction

PBNJ

For each commit and PR

Description

This service handles BMC interactions.

  • machine and BMC power on/off/reset
  • setting next boot device
  • user management
  • setting BMC network source

The gRPC PBnJ server listens by default on port 50051. This can be started with pbnj server. Use pbnj server --help for more runtime details.

Usage

Container

Build

make image

Run

# default gRPC port is 50051
make run-image

Local

Build

# builds the binary and puts it in ./bin/
make build

Run

# default gRPC port is 50051; does a `go run` of the code base
make run-server

Authorization

Documentation on enabling authorization can be found here.

Contributing

See the contributors guide here.

Website

For complete documentation, please visit the Tinkerbell project hosted at tinkerbell.org.

pbnj's Issues

Refactor `SetPower` to be test-able

The PowerSet method is over 100 lines of code, core to the functionality of PBnJ, and not tested. This is dangerous, as seen in #105. We should refactor SetPower and BootDeviceSet to be test-able.

Add AMT Support

Devices with Intel vPro processors have an out-of-band management solution built-in with Intel AMT. Adding support for AMT to PBnJ will enable BMC-like interactions to systems with these processors. This will allow full Tinkerbell management of hardware often used for small-scale cluster testing on hardware like Intel NUCs.

Expected Behaviour

PBnJ should communicate with AMT over the SOAP-based WS-Management interface to perform BMC interactions like power cycling and PXE booting the target device. Initial manual configuration to enable AMT is expected.

Current Behaviour

Currently PBnJ does not support AMT.

Possible Solution

Intel AMT has a SOAP-based WS-Management interface for interacting out-of-band with AMT systems. There are a few potential starting points below with code that communicates over this interface.

Resources:

Intel has released the Open Active Management Technology (Open AMT) Cloud Toolkit (docs, source), which provides a set of microservices and libraries for integrating AMT. The Remote Provisioning Client (RPC) may be the most helpful piece, as it's written in Go and interacts with AMT directly over the WS-MAN interface (docs, source).

There is also OpenWSMAN (site, wiki, source). This was originally developed and open-sourced by Intel and looks to be in C++.

An interesting (although dated) write-up on AMT from a discovery and security perspective can be found here: https://www.uberwall.org/bin/download/download/102/lacon12_intel_amt.pdf

Context

I'm building a small cluster with 3 Intel NUCs to demonstrate EKS Anywhere, which uses Tinkerbell and PBnJ.

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS):
    Tinkerbell services are running in a kind cluster on an AL2 VM on a Mac OSX laptop.

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:
    Tinkerbell services are incorporated into EKS Anywhere.

$ eksctl version
0.117.0-dev+98027776c.2022-11-04T12:59:49Z

$ eksctl anywhere version
v0.12.1
  • Hardware (x3)
    Intel NUCs Model: SBNUC11TNHv50L0
    32 GB memory: F4-2400C16D-32GRS
    500 GB SSD: MZ-77E500B-AM

  • Link to your project or a code example to reproduce issue:
    n/a

Background

Intel Active Management Technology (AMT) is part of Intel vPro. If you have a vPro processor, you have AMT.

Other Potential Resources

There is also the High Level API (HLAPI) written in C# (seems to be Windows-focused):
Intel AMT High-level API (HLAPI) overview
Intel AMT High-level API (HLAPI) docs

The docs here seem to indicate Linux is not really supported by the AMT SDK:
Intel AMT Implementation and Reference Guide (includes AMT SDK docs)
(Linux sample app no longer supported, Linux version required is RHEL 5.x)

Uniform Standards Request: Experimental Repository

Hello!

We believe this repository is Experimental and therefore needs the following files updated:

If you feel the repository should be maintained or end of life or that you'll need assistance to create these files, please let us know by filing an issue with https://github.com/packethost/standards.

Packet maintains a number of public repositories that help customers to run various workloads on Packet. These repositories are in various states of completeness and quality, and being public, developers often find them and start using them. This creates problems:

  • Developers using low-quality repositories may infer that Packet generally provides a low quality experience.
  • Many of our repositories are put online with no formal communication with, or training for, customer success. This leads to a below average support experience when things do go wrong.
  • We spend a huge amount of time supporting users through various channels when with better upfront planning, documentation and testing much of this support work could be eliminated.

To that end, we propose three tiers of repositories: Private, Experimental, and Maintained.

As a resource and example of a maintained repository, we've created https://github.com/packethost/standards. This is also where you can file any requests for assistance or modification of scope.

The Goal

Our repositories should be the example from which adjacent, competing, projects look for inspiration.

Each repository should not look entirely different from other repositories in the ecosystem, having a different layout, a different testing model, or a different logging model, for example, without reason or recommendation from the subject matter experts from the community.

We should share our improvements with each ecosystem while seeking and respecting the feedback of these communities.

Whether or not strict guidelines have been provided for the project type, our repositories should ensure that the same components are offered across the board. How these components are provided may vary, based on the conventions of the project type. GitHub provides general guidance on this which they have integrated into their user experience.

Env vars aren't validated

Env vars are not validated like cli flags are.

Expected Behaviour

Env vars should be validated just like cli flags are.

Current Behaviour

A bad env var that wouldn't pass the cli flag validation could cause unexpected behavior.

For example:

# cli flag validation
❯ go run main.go server --bmcTimeout 60 
Error: invalid argument "60" for "--bmcTimeout" flag: time: missing unit in duration "60"
Usage:
  pbnj server [flags]

Flags:
      --bmcTimeout duration        Timeout for BMC calls (default 15s)
      --enableAuthz                enable Authz middleware. Configure with configuration file details
      --enableHTTP                 enable the HTTP server
  -h, --help                       help for server
      --hsKey string               HS key
      --metricsListenAddr string   metrics server listen address (default ":8080")
      --port string                grpc server port (default "50051")
      --rsPubKey string            RS public key

Global Flags:
      --config string     config file (default is pbnj.yaml)
      --logLevel string   log level (default is info (default "info")

invalid argument "60" for "--bmcTimeout" flag: time: missing unit in duration "60"
exit status 1
# no validation for env var. The server starts up with the bmc timeout set to 0
PBNJ_BMCTIMEOUT=60 go run main.go server
{"level":"info","ts":1624559879.264104,"caller":"cmd/server.go:97","msg":"debugging","service":"github.com/tinkerbell/pbnj","timeout":0,"timeout_string":"0s"}
{"level":"info","ts":1624559879.2745972,"caller":"grpcsvr/server.go:129","msg":"starting PBnJ gRPC server","service":"github.com/tinkerbell/pbnj"}

Possible Solution

Include error return checking in golangci-lint results

Expected Behaviour

Linting prevents risky code from being introduced into the project.

Current Behaviour

The current golangci-lint configuration disables the errcheck linter, so unchecked error return values are not reported and can accumulate. When overlooked, these lead to error-prone code.

Possible Solution

If I had to guess, this linter was disabled for main.go:48: defer sync().

An inline // nolint:errcheck could be used to call out awareness in this one situation, without disabling the linter elsewhere.

In drone configurations,

  • Increase the version of golangci-lint to the latest.
  • Remove -D errcheck from the calling arguments.
.drone.yml:    image: golangci/golangci-lint:v1.24.0
.drone.yml:      - golangci-lint run -v -D errcheck

The PR that introduces this for CI should include (or be preceded by) one that addresses the outstanding issues in this project, as reported by the default golangci-lint settings.

Context

$ golangci-lint run ./...
api/bmc.go:29:10: Error return value of `c.Error` is not checked (errcheck)
		c.Error(err)
		       ^
api/boot.go:27:10: Error return value of `c.Error` is not checked (errcheck)
		c.Error(err)
		       ^
api/lan.go:29:10: Error return value of `c.Error` is not checked (errcheck)
		c.Error(err)
		       ^
main.go:48:12: Error return value of `sync` is not checked (errcheck)
	defer sync()
	          ^

Announcement: `main` to be set as the default branch

Making main the default branch

On the _th of June 2021, or as soon as we locate a person with the correct permissions, we will rename the master branch in this repository to main and make main the default branch. When this occurs, there is potential for some disruption. Please read the following to understand and prepare for the change.

Things you will need to do

After you rename a branch in a repository on GitHub, any collaborator with a local clone of the repository will need to update the clone.

From the local clone of the repository on a computer, run the following commands to update the name of the default branch.

$ git branch -m master main
$ git fetch origin
$ git branch -u origin/main main
$ git remote set-head origin -a

Optionally, run the following command to remove tracking references to the old branch name.

$ git remote prune origin

Things that will be done on the GitHub side

The following are things that you won’t have to worry about. They are here for transparency.

  1. We will rename the default branch
  2. We will merge a PR into main that will update all references to "master"

By default GitHub will automatically do the following when we change the default branch name:

  • Re-target any open pull requests
  • Update any draft releases based on the branch
  • Move any branch protection rules that explicitly reference the old name
  • Update the branch used to build GitHub Pages, if applicable
  • Show a notice to repository contributors, maintainers, and admins on the repository homepage with instructions to update local copies of the repository
  • Show a notice to contributors who git push to the old branch
  • Redirect web requests for the old branch name to the new branch name
  • Return a "Moved Permanently" response in API requests for the old branch name

Refs:
https://github.com/github/renaming

https://docs.github.com/en/github/administering-a-repository/managing-branches-in-your-repository/renaming-a-branch

https://docs.github.com/en/github/administering-a-repository/managing-branches-in-your-repository/renaming-a-branch#updating-a-local-clone-after-a-branch-name-changes

Integrate PBNJ into Tinkerbell k8s model

Currently PBNJ is a standalone service that performs power management operations. It would be beneficial to have a formal integration with the Tinkerbell stack, in line with the changes for the k8s resource model.

Expected Behaviour

When provisioning bare-metal nodes using Tinkerbell, the pbnj component would be responsible for the power and boot management of the nodes. The hardware CRD can be extended to contain the necessary BMC information, which pbnj may use to perform actions. This would help with powering on nodes, creating BMC users, and setting boot options. It also opens up the possibility of deprovisioning nodes, performing reboots and resets, and so on.

Current Behaviour

Manual intervention is required for powering up baremetal nodes and setting the boot order to net boot for Tinkerbell provisioning.

Initial Ideas

These are some rough ideas that can be discussed and expanded to a more formal proposal.

PBNJ as k8s Service

Currently PBNJ is a gRPC service. It can be run on the k8s cluster along with all the other Tinkerbell components (Boots, Hegel). The PBNJ service would have read access to the Hardware CRDs to fetch the BMC information and perform actions.

PBNJ as a k8s Controller

PBNJ can be redesigned to be a k8s controller. The controller could watch Workflow CRDs, pick up tasks tagged to it, and perform power management actions.

PBNJ as a Hub action

This idea is based on tink-worker. We could possibly have a long-running pbnj-worker on the same cluster as the Tinkerbell stack. The pbnj-worker could run hub actions, which use the PBNJ binary to perform power management tasks.

Tinkerbell Uniform Standards: Maintained Repository

Our repositories should be the example from which adjacent, competing, projects look for inspiration.

Each repository should not look entirely different from other repositories in the ecosystem, having a different layout, a different testing model, or a different logging model, for example, without reason or recommendation from the subject matter experts from the community.

We should share our improvements with each ecosystem while seeking and respecting the feedback of these communities.

Whether or not strict guidelines have been provided for the project type, our repositories should ensure that the same components are offered across the board. How these components are provided may vary, based on the conventions of the project type. GitHub provides general guidance on this which they have integrated into their user experience.

Expected Behaviour

We believe this repository is Maintained and therefore needs the following files updated:

If you feel the repository should be experimental or end of life or that you'll need assistance to update these files, please let us know by filing an issue with https://github.com/packethost/standards.

Current Behaviour

n/a

Possible Solution

n/a

Context

Packet maintains a number of public repositories that help customers to run various workloads on Packet. These repositories are in various states of completeness and quality, and being public, developers often find them and start using them. This creates problems:

  • Developers using low-quality repositories may infer that Packet generally provides a low quality experience.
  • Many of our repositories are put online with no formal communication with, or training for, customer success. This leads to a below average support experience when things do go wrong.
  • We spend a huge amount of time supporting users through various channels when with better upfront planning, documentation and testing much of this support work could be eliminated.

To that end, we propose three tiers of repositories: Private, Experimental, and Maintained.

As a resource and example of a maintained repository, we've created https://github.com/packethost/standards. This is also where you can file any requests for assistance or modification of scope.

Your Environment

Non-deterministic behavior in serialized contexts

This issue is based solely on observations and code study. As with any non-deterministic behavior, it can be difficult to pinpoint and correct so more eyes and validation would be useful.

PBnJ operates on requests asynchronously. When multiple serialized requests are sent to PBnJ, it accepts them but has no context linking them. Because requests are serviced asynchronously, it's left to the Go runtime to schedule the goroutines, which it does non-deterministically, resulting in non-deterministic behavior. For example, sending a boot device request followed by a power on request can result in machines starting but not with the expected boot device.

I don't think this is a bug; I think it's a design flaw. PBnJ would benefit from a deterministic API of sorts, removing the need for consumers to synchronize their requests.

Possible Solution

(1) Introduce synchronous APIs. This would ensure RPCs don't return until the action has actually been carried out, greatly improving the consumer experience.

(2) Offer an API that allows specifying multiple actions per request. This would allow PBnJ to operate on the actions asynchronously still but provide the context so the actions can be made serially.

(3) The most complicated. PBnJ could manage a queue per BMC. When a request is received and no queue is present for the endpoint create a new one and hold it in memory in an LRU cache, possibly with timeouts too (just to control the memory footprint more intentionally). Add the task to the queue and have a job management construct with N workers plucking from the various queues. You can't start the next item in the queue until the previous one for that queue has finished but you can pluck from other BMC queues. This hinges on some sort of static data for BMCs available in a request so you can lookup the queue. I suspect the IP address would be static enough given it makes little sense for BMCs to be dynamic (maybe people use a fancy dynamic DNS setup?). This is akin to a job management system.

Steps to Reproduce (for bugs)

It's a race condition, so it is hard to reproduce. Run boot device and power on requests in that order enough times and you'll probably see it.

Context

On EKS-A we observed, when beginning the provisioning process, that machines booted into disks despite being asked to PXE boot. The BMC capability offered by EKS-A ensures we can provision machines even if they have an existing image on the OS.

UpdateUser appears to create user if it doesn't exist

I'm still investigating but it appears that calls to UpdateUser rpc create the user if it doesn't exist.

Expected Behaviour

UpdateUser only updates existing users.

Current Behaviour

UpdateUser creates a new user if the user that is provided in the request does not already exist.

Possible Solution

We could refactor to check for user existence first.

latest pbnj image is broken and container not able to start up

The latest pbnj image cannot be deployed. The following error is seen:

Error: failed to start container "pbnj": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: "/pbnj": permission denied": unknown

image tag: docker pull quay.io/tinkerbell/pbnj:sha-77f35030

  Normal   Scheduled  2m15s                default-scheduler  Successfully assigned tinkerbell/pbnj-576db5c745-skhfj to sv16-ems-sitectl-node02
  Normal   Pulled     2m12s                kubelet            Successfully pulled image "quay.io/tinkerbell/pbnj:latest" in 993.664583ms
  Normal   Pulled     2m10s                kubelet            Successfully pulled image "quay.io/tinkerbell/pbnj:latest" in 1.001249224s
  Normal   Pulled     116s                 kubelet            Successfully pulled image "quay.io/tinkerbell/pbnj:latest" in 1.030160974s
  Normal   Pulling    83s (x4 over 2m13s)  kubelet            Pulling image "quay.io/tinkerbell/pbnj:latest"
  Normal   Created    82s (x4 over 2m12s)  kubelet            Created container pbnj
  Normal   Pulled     82s                  kubelet            Successfully pulled image "quay.io/tinkerbell/pbnj:latest" in 1.023346092s
  Warning  Failed     81s (x4 over 2m12s)  kubelet            Error: failed to start container "pbnj": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"/pbnj\": permission denied": unknown
  Warning  BackOff    43s (x9 over 2m7s)   kubelet            Back-off restarting failed container

Previous image works fine. tag: 46-859a322ed28904d9165d6a78faf19c76142a720a

Thanks
Xin

PBNJ should be able to handle any BMC commands

PBNJ is currently limited to a small subset of commands (power control, status, and boot options). As an abstraction mechanism to racadm and ipmitool, it would benefit operators to be able to call any BMC method with a passthru mechanism.

Expected Behaviour

To be able to call arbitrary racadm and ipmitool commands with PBNJ.

Current Behaviour

Only able to check power status, perform power cycles, and set next boot options.

Possible Solution

Allow the command that is needing to be run to be passed to PBNJ.

Context

There are many things that someone may want to perform on their BMCs - e.g. updating BIOS or system settings.

Your Environment

Equinix Metal

multi-arch support

Expected Behaviour

PBNJ should be available for use in ARM based environments.

Current Behaviour

PBNJ is only available for amd64. https://quay.io/repository/tinkerbell/pbnj?tab=tags does not show multiple Linux penguin logos per tag, which is how a multi-arch build would be indicated.

Context

tinkerbell/tink#226

master branch to main

Change the default branch to be main instead of master.

log request parameters

It would be helpful in troubleshooting issues if request parameters were logged.

Update bmc lib to v2

@nshalman noted BMC lib is releasing v2. We should think about updating to remain current. Non-urgent for the immediate future.

Want ability to send NMI

Expected Behaviour

A method to send a non-maskable interrupt to a device (server)

Current Behaviour

This is currently not possible

Possible Solution

An API call that will execute something like:

ipmitool <options> chassis power diag

Steps to Reproduce (for bugs)

N/A

Context

For some operating systems, sending an NMI will initiate a panic and crash dump. The crash dump can be analyzed post-mortem. This can sometimes be needed when the system is non-responsive to external input.

While we can initiate a reboot via the API, that does not allow for post-mortem debugging.

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS):
    SmartOS

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:
    Equinix Metal

  • Link to your project or a code example to reproduce issue:
    https://smartos.org/

Move to GitHub Action

tinkerbell/tink already uses GitHub Actions. I migrated it from the internal Drone to GitHub Actions a few weeks ago.

We should move PBNJ as well.

You can take inspiration from the tink repository, but use Docker Action v2.

The requirement is the same as the one we have for tinkerbell/tink: migrate the current checks to GitHub Actions and push images to quay.io when a PR is merged to master.
