Talos Linux System Extensions
I was not sure where to file this issue, but it seems like a general documentation issue, at least for installing the iscsi extension.
This page shows this yaml:
- op: add
path: /machine/install/extensions
value:
- image: ghcr.io/siderolabs/iscsi-tools:
However, this is not valid YAML (it does not parse), and if you simply remove the trailing colon, the upgrade also fails and ends up in a "locked" state, which I don't know how to get out of other than fully rebooting.
I changed the image to ghcr.io/siderolabs/iscsi-tools:latest, which also seemed to fail. Using ghcr.io/siderolabs/iscsi-tools:v0.1.1 worked:
NODE NAMESPACE TYPE ID VERSION NAME VERSION
10.45.101.1 runtime ExtensionStatus 000.ghcr.io-siderolabs-iscsi-tools-v0.1.1 1 iscsi-tools v0.1.1
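For anyone hitting the same thing, this is the complete patch that ended up working for me, with the tag pinned explicitly (v0.1.1 was simply the release available at the time; pin whatever tag matches your Talos version):

```yaml
# JSON-patch style machine config patch; the image tag must be pinned,
# since the docs' trailing-colon form does not parse.
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/siderolabs/iscsi-tools:v0.1.1
```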
I'm probably doing something wrong and would appreciate any help.
I'm running talos v1.5.4 on metal and trying to install ghcr.io/siderolabs/iscsi-tools:v0.1.4@sha256:180f0d17015ecd1c02c0f0f421aa9c5410970f33cd61bf43b1bc4c52e9310e69
Unfortunately getting this error
: user: warning: [2023-11-03T21:25:57.437836397Z]: Error: error validating extension "iscsi-tools": path "/usr/include/libopeniscsiusr.h" is not allowed in extensions
Any idea?
I would like to use DRBD 9.2.5 with Talos v1.5.3. Currently the DRBD 9.2.5 extension is only available for Talos v1.6.0-*. Would you please add the new DRBD version for Talos v1.5.3?
Or is the only way to make a separate system extension / modify the existing one?
Just asking in case I need to change the gvisor configuration, for example enabling the root fs overlay or changing the platform gvisor uses.
I'm currently trying to integrate an already existing file system into Talos and make it available to pods running on one node. As this pre-existing file system is based on btrfs, I'd like to add btrfs support to the Talos OS. As Talos aims to reduce attack surface, I think it makes the most sense to add support as an extension.
I also considered switching from btrfs to zfs, which is already supported by Talos and has a similar feature set. Unfortunately, dependent systems use btrfs send/receive features and cannot easily be migrated to other filesystems.
I am testing out piraeus-datastore using Talos. I was able to get it working wonderfully on v1.3.7, but after upgrading to v1.4.1 (and using drbd extension 9.2.2-v1.4.1), upon provisioning a PVC with more than one replica (necessitating communication between nodes), the "primary" node for the provisioned volume restarts without any kernel panic or console message.
I have attached a "talosctl dmesg" log in hopes that it will help troubleshoot.
Thank you!
elite6.log
Would be great if Sidero could provide an extension to enable Thunderbolt.
Thanks for the great work.
I know it’s a very niche thing, but for days I’ve been trying to get one of my nodes to output sound. Just wondering if ALSA is shipped with the base Talos kernel, and if not, whether this could be done via an extension. I know this is lowest priority and I understand that you have more important stuff to do. Nevertheless, thanks for this great and stable operating system.
The documentation only covers how to determine the right version, but nothing I can see on how to actually instruct the OS to install it.
I am attempting to install an extension (ecr-credential-provider) in a Talos deployment and cannot seem to get the machine config patch right. The docs for 1.6 say machine.install.extensions is deprecated, but there is no mention of the preferred method.
(I'm applying the config via the Terraform talos_machine_configuration_apply resource.)
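For what it's worth, my understanding (an assumption based on the 1.6 release notes, not something I've confirmed in the docs) is that the replacement for machine.install.extensions is to bake the extension into the installer image via Image Factory, with a schematic along these lines:

```yaml
# Image Factory schematic; assumes the extension is published as
# siderolabs/ecr-credential-provider -- adjust the name as needed.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/ecr-credential-provider
```

Posting the schematic to factory.talos.dev returns a schematic ID, and the resulting factory.talos.dev/installer/<schematic-id>:<talos-version> image is then used with talosctl upgrade --image ...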
I don't know whether #24 (comment) would contradict the philosophy of Talos, since everything is supposed to run off the SquashFS, but some users may want to edit the config files, and those config files are rather long to inline into YAML.
I'm not aware whether the user can supply the config at build time, or whether we could have consistent storage for /etc, for example backed by etcd.
I'm trying to automate Talos deployments via Terraform on my Proxmox cluster.
To be able to provision the VM's correctly QEMU GA has to be started on boot so the Terraform Proxmox provider can get the VM's IP addresses.
This seems possible now since v1.6.0 using early extension initialization feature on maintenance mode.
Unfortunately QEMU GA seems to not be working, unlike Xen GA. It gets stuck on the "Waiting for service "cri" to be registered" message.
I'm using an ISO bundled with QEMU GA by Image Factory with the following configuration.
Your image schematic ID is: f4524a31eb3f0cf8dc1449b8f43755d5ae4d35fc5c5b27eb79c8b8aca590bdfa
customization:
systemExtensions:
officialExtensions:
- siderolabs/amd-ucode
- siderolabs/intel-ucode
- siderolabs/qemu-guest-agent
QEMU GA is correctly listed by talosctl
❯ talosctl -n 10.120.0.35 get extensions -i
NODE NAMESPACE TYPE ID VERSION NAME VERSION
runtime ExtensionStatus 0 1 amd-ucode 20231111
runtime ExtensionStatus 1 1 intel-ucode 20231114
runtime ExtensionStatus 2 1 qemu-guest-agent 8.1.3
runtime ExtensionStatus 3 1 schematic f4524a31eb3f0cf8dc1449b8f43755d5ae4d35fc5c5b27eb79c8b8aca590bdfa
Thanks!
I pulled the extensions project and tried building the example extension, but an error occurs during the build.
make local-hello-world-service PLATFORM=linux/amd64 DEST=_out
make[1]: Entering directory `/opt/extensions'
[+] Building 3.0s (5/5) FINISHED docker:default
=> [internal] load build definition from Pkgfile 0.0s
=> => transferring dockerfile: 467B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> resolve image config for ghcr.io/siderolabs/bldr:v0.2.0-alpha.12 2.4s
=> CACHED docker-image://ghcr.io/siderolabs/bldr:v0.2.0-alpha.12@sha256:a97e2c870337b4677860887004415a20b0df94861567eab37495bd31ee7b248a 0.0s
=> load Pkgfile, pkg.yamls and vars.yamls 0.0s
=> => transferring dockerfile: 25.02kB 0.0s
Pkgfile:1
--------------------
1 | >>> # syntax = ghcr.io/siderolabs/bldr:v0.2.0-alpha.12
2 |
3 | format: v1alpha2
--------------------
ERROR: failed to solve: failed to solve LLB: requested experimental feature mergeop has been disabled on the build server: only enabled with containerd image store backend
make[1]: *** [target-hello-world-service] Error 1
make[1]: Leaving directory `/opt/extensions'
make: *** [local-hello-world-service] Error 2
Upgrade zfs to https://github.com/openzfs/zfs/releases/tag/zfs-2.2.2 or https://github.com/openzfs/zfs/releases/tag/zfs-2.1.14
They have an important fix for a data corruption bug; and 2.x now has support for Linux containers (e.g. overlayfs).
@smira I think this should be part of the current and next version of talos.
Hi,
I'm a new user of talos. The documentation is very complete and helped me a lot as a beginner. Thank you for all your hard work.
I'd like to use Longhorn on all my Talos nodes, and as you know, I need to install open-iscsi. I tried the following configuration but it didn't work well: the container ext-iscsid restarts forever.
debug: false
persist: true
machine:
type: controlplane
token: redacted
ca:
crt: redacted
key: redacted
certSANs:
- 127.0.0.1
kubelet:
image: ghcr.io/siderolabs/kubelet:v1.28.3
defaultRuntimeSeccompProfileEnabled: true
disableManifestsDirectory: true
network: {}
install:
disk: /dev/sda
image: ghcr.io/siderolabs/installer:v1.5.5
wipe: false
extensions:
- image: ghcr.io/siderolabs/iscsi-tools:v0.1.4
- image: ghcr.io/siderolabs/qemu-guest-agent:8.1.3
Here is the output:
❯ talosctl services ext-iscsid status -n 192.168.128.10
NODE 192.168.128.10
ID ext-iscsid
STATE Waiting
HEALTH ?
EVENTS [Waiting]: Error running Containerd(ext-iscsid), going to restart forever: task "ext-iscsid" failed: exit code 127 (2s ago)
[Running]: Started task ext-iscsid (PID 33404) for container ext-iscsid (2s ago)
[Waiting]: Error running Containerd(ext-iscsid), going to restart forever: task "ext-iscsid" failed: exit code 127 (7s ago)
[Running]: Started task ext-iscsid (PID 33339) for container ext-iscsid (7s ago)
[Preparing]: Creating service runner (7s ago)
[Preparing]: Running pre state (7s ago)
[Waiting]: Waiting for service "containerd" to be "up", service "cri" to be "up", service "ext-tgtd" to be "up", network (7s ago)
[Finished]: Service finished successfully (7s ago)
[Stopping]: Aborting restart sequence (7s ago)
[Waiting]: Error running Containerd(ext-iscsid), going to restart forever: task "ext-iscsid" failed: exit code 127 (11s ago)
[Running]: Started task ext-iscsid (PID 33273) for container ext-iscsid (11s ago)
I'm using Talos v1.5.5, my nodes are virtual machines on a Proxmox hypervisor.
Thank you for your help
It would be really nice to have a way to manage the physical nodes firmware using https://fwupd.org/.
For example, in fedora, this can be typically used as:
# add support for installing firmware/bios/uefi/intel-me updates.
dnf install -y fwupd-efi
# list the system devices that can be updated.
fwupdmgr get-devices
# list the available firmware updates.
fwupdmgr get-updates
# update the firmwares.
fwupdmgr update
# reboot to apply the update (required for bios/firmware/uefi/intel-me).
reboot
# check the result.
fwupdmgr get-updates
It seems that an error occurred during the nut client extension build.
https://ci.dev.talos-systems.io/siderolabs/extensions/258/1/2
I've added the extension to my Talos config and I can verify it's loaded; however, the kfd device is not present. lspci, when run from a busybox container, shows the amdgpu kernel module is loaded:
05:00.0 Class 0300: 1002:1638 amdgpu
Loading the ROCm k8s-device-plugin shows only the following output:
I0127 18:29:58.101908 1 main.go:305] AMD GPU device plugin for Kubernetes
I0127 18:29:58.101961 1 main.go:305] ./k8s-device-plugin version v1.25.2.7-0-g4503704
I0127 18:29:58.101965 1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I0127 18:29:58.101971 1 manager.go:42] Starting device plugin manager
I0127 18:29:58.101980 1 manager.go:46] Registering for system signal notifications
I0127 18:29:58.102101 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I0127 18:29:58.102184 1 manager.go:60] Starting Discovery on new plugins
I0127 18:29:58.102196 1 manager.go:66] Handling incoming signals
It appears to never receive the signal that a GPU is available.
I'm happy to try to diagnose anything further, supply logs, etc.
I'm working on creating an extension for the nvidia grid drivers (nvidia's open drivers don't support datacenter vgpus like the tesla line of cards). I've copied the nonfree-kmod-nvidia tree, in the hopes that they're similar enough that I can swap the linux installer and it would Just Work ™️, but either I did something wrong or I'm missing some critical step in building and pushing the extensions.
NOTE: one other thing that might complicate all this is that I'm running fairly old hardware that requires GOAMD64=v1. This is the process I use to have GitHub Actions build all the artifacts for me. It currently builds everything, but I'm fairly certain I only use the installer and talos images. I'm also currently on 1.5.1.
My steps so far:
- copy the nonfree-kmod-nvidia-pkg package and create the nonfree-kmod-nvidia-grid-pkg package
- copy the nonfree-kmod-nvidia extension and create the nonfree-kmod-nvidia-grid extension
- create the following patch:
# nvidia-vgpu.yaml
- op: add
path: /machine/install/extensions
value:
- image: ghcr.io/djeebus/talos/nonfree-kmod-nvidia-grid:535.54.03-v1.5.1
- op: add
path: /machine/kernel
value:
modules:
- name: nvidia
- name: nvidia_uvm
- name: nvidia_drm
- name: nvidia_modeset
- op: add
path: /machine/sysctls
value:
net.core.bpf_jit_harden: 1
apply the patch via:
talosctl \
--nodes $NODE \
patch mc \
--patch @nvidia-vgpu.yaml
trigger a reboot:
talosctl \
--nodes $NODE \
upgrade --image=ghcr.io/djeebus/talos/installer:v1.5.1
After all that, I get the following pair of messages in dmesg after a reboot:
$NODE: kern: notice: [2023-09-12T16:35:58.881790988Z]: Loading of module with unavailable key is rejected
$NODE: user: warning: [2023-09-12T16:35:58.887564988Z]: [talos] controller failed {"component": "controller-runtime", "controller": "runtime.KernelModuleSpecController", "error": "error loading module \x5c"nvidia\x5c": load nvidia failed: key was rejected by service"}
Any advice you could give would be very welcome, thanks!
https://github.com/kubernetes/cloud-provider-aws/tree/master/cmd/ecr-credential-provider
This is a kubelet-compatible credential provider helper, which generates short-lived tokens to authenticate against AWS ECR registries. AWS' Elastic Container Registry (ECR) only allows short-lived tokens, so hard-coding credentials directly in registryconfig.auth doesn't work.
https://kubernetes.io/docs/tasks/administer-cluster/kubelet-credential-provider/
Kubelet can be configured with the --image-credential-provider-bin-dir parameter, which sets the directory where the helper binary should be. The binary name (and other options) are then set via a kind: CredentialProviderConfig config, which is passed to kubelet with the --image-credential-provider-config parameter.
I think for the first step, providing the ecr-credential-provider binary should be enough, because everything else can be configured in the Talos config.
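For illustration, a minimal CredentialProviderConfig along the lines of the upstream kubelet docs (the matchImages pattern and cache duration here are just example values):

```yaml
# Passed to kubelet via --image-credential-provider-config; the
# ecr-credential-provider binary must live in the directory given
# by --image-credential-provider-bin-dir.
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
  - name: ecr-credential-provider
    matchImages:
      - "*.dkr.ecr.*.amazonaws.com"   # example: match ECR registries
    defaultCacheDuration: "12h"
    apiVersion: credentialprovider.kubelet.k8s.io/v1
```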
I'm not 100% sure if this is the right place to report the issue, feel free to point me in the right direction if not :)
I am running a 2-node cluster based on Talos 1.4.5, and I've enabled the nvidia container toolkit extension by following the official Talos tutorial (changing all mentions of Talos version v1.4.4 to v1.4.5, and using version v0.14.0 of the nvidia device plugin).
The nodes both contain a RTX 2080 GPU
All conventional CUDA-based container workloads I've tested run without any problems; nvidia-smi from inside a container also reports the proper GPU information. However, whenever I add graphics to the NVIDIA_DRIVER_CAPABILITIES env variable, container creation fails with the following status:
failed to create containerd task:
failed to create shim task:
OCI runtime create failed:
failed to create NVIDIA Container Runtime:
failed to construct OCI spec modifier:
failed to construct discoverer:
failed to create mounts discoverer:
failed to construct library locator:
error loading ldcache:
open /etc/ld.so.cache: no such file or directory: unknown
I am honestly at a loss here; if anyone has any input on why this might be happening, I would greatly appreciate the help.
Hi all,
I've deployed a cluster using Talos via the Terraform provider. Currently the cluster has only a single node. Now I am trying to set up the piraeus-operator according to this how-to: https://github.com/piraeusdatastore/piraeus-operator/blob/v2/docs/how-to/talos.md. I have already adapted the machine config patch to the following:
machine:
install:
extensions:
- image: ghcr.io/siderolabs/drbd:9.2.4-v1.4.6
kernel:
modules:
- name: drbd
parameters:
- usermode_helper=disabled
- name: drbd_transport_tcp
Executing the two commands for validation returned the following:
talosctl read /proc/modules
virtio_balloon 24576 - - Live 0xffffffffc02bf000
virtio_pci 24576 - - Live 0xffffffffc02a2000
virtio_pci_legacy_dev 16384 - - Live 0xffffffffc0292000
virtio_pci_modern_dev 16384 - - Live 0xffffffffc027a000
talosctl read /sys/module/drbd/parameters/usermode_helper
1 error occurred:
rpc error: code = Unknown desc = stat /sys/module/drbd/parameters/usermode_helper: no such file or directory
So it looks like something is wrong. I checked the logs, which contained the following two lines:
talosctl dmesg
REMOVED: user: warning: [2023-07-25T10:02:44.873296699Z]: [talos] apply config request: mode auto(no_reboot)
REMOVED: user: warning: [2023-07-25T10:03:13.119427699Z]: [talos] controller failed {"component": "controller-runtime", "controller": "runtime.KernelModuleSpecController", "error": "error loading module \x5c"drbd\x5c": module not found"}
error loading module \x5c"drbd\x5c": module not found
looks a bit strange to me. I've already checked my machine config patch, rewrote it by hand, reapplied the config, and rebooted the node. The error still comes up about every minute and the validation still fails.
Now I have no idea what else I can check or how to solve the issue.
talosctl version
:
Client:
Tag: v1.4.6
SHA: 8615b213
Built:
Go version: go1.20.5
OS/Arch: linux/amd64
Server:
NODE: REMOVED
Tag: v1.4.6
SHA: 8615b213
Built:
Go version: go1.20.5
OS/Arch: linux/amd64
Enabled: RBAC
I know it’s a very niche thing, but for days I’ve been trying to get one of my nodes to provision a Matter device. Bluetooth is needed during the commissioning process. It would be great if BLE could be provided via an extension. Nevertheless, thanks for this great and stable operating system.
Following #197, I'm also trying to run zfs-localpv, and after enabling the extension and enabling the module in the machine configuration, I have no /usr/local/sbin/zfs or zpool in my host directory. If I install zfs in my container from apt or apk, it works fine after binding /dev. But zfs-localpv expects zfs to be available on the host filesystem. What am I doing wrong that zfs is missing from my Talos filesystem?
❯ talosctl -n 192.168.178.93 get extensions
NODE NAMESPACE TYPE ID VERSION NAME VERSION
192.168.178.93 runtime ExtensionStatus 000.ghcr.io-siderolabs-qemu-guest-agent-8.0.2@sha256-79979c41a8acfc51b2651d6d44bf187b8e7164e6c649f4e39b9cb699fed8917b 1 qemu-guest-agent 8.0.2
192.168.178.93 runtime ExtensionStatus 001.ghcr.io-siderolabs-iscsi-tools-v0.1.4@sha256-58cbadb0a315d83e04b240de72c5ac584e9c63b690b2c4fcd629dd1566c3a7a1 1 iscsi-tools v0.1.4
192.168.178.93 runtime ExtensionStatus 002.ghcr.io-siderolabs-zfs-2.1.12-v1.5.4@sha256-df5674ed21a97c96ec4903ebdbd7318bb8ab5d18fbf10070139f84bf72a94d2d 1 zfs 2.1.12-v1.5.4
192.168.178.93 runtime ExtensionStatus modules.dep 1 modules.dep 6.1.58-talos
❯ talosctl -n 192.168.178.93 cat /proc/modules
zfs 3727360 - - Live 0xffffffffc0510000 (PO)
zunicode 335872 - - Live 0xffffffffc04bd000 (PO)
zzstd 581632 - - Live 0xffffffffc042e000 (O)
zlua 180224 - - Live 0xffffffffc0401000 (O)
zavl 16384 - - Live 0xffffffffc03f7000 (PO)
icp 307200 - - Live 0xffffffffc03ab000 (PO)
zcommon 86016 - - Live 0xffffffffc0395000 (PO)
znvpair 77824 - - Live 0xffffffffc0363000 (PO)
spl 90112 - - Live 0xffffffffc037e000 (O)
virtio_pci 24576 - - Live 0xffffffffc035c000
virtio_pci_legacy_dev 16384 - - Live 0xffffffffc0379000
virtio_pci_modern_dev 16384 - - Live 0xffffffffc0344000
❯ talosctl -n 192.168.178.93 list -l /usr/local/sbin
NODE MODE UID GID SIZE(B) LASTMOD NAME
192.168.178.93 drwxr-xr-x 0 0 278 Jan 20 2022 19:35 .
192.168.178.93 Lrwxrwxrwx 0 0 8 Jan 20 2022 19:35 brcm_iscsiuio -> iscsiuio
192.168.178.93 -rwxr-xr-x 0 0 5559 Jan 20 2022 19:35 iscsi-gen-initiatorname
192.168.178.93 -rwxr-xr-x 0 0 14240 Jan 20 2022 19:35 iscsi-iname
192.168.178.93 -rwxr-xr-x 0 0 5293 Jan 20 2022 19:35 iscsi_discovery
192.168.178.93 -rwxr-xr-x 0 0 222 Jan 20 2022 19:35 iscsi_fw_login
192.168.178.93 -rwxr-xr-x 0 0 9616 Jan 20 2022 19:35 iscsi_offload
192.168.178.93 -rwxr-xr-x 0 0 315184 Jan 20 2022 19:35 iscsiadm
192.168.178.93 -rwxr-xr-x 0 0 327928 Jan 20 2022 19:35 iscsid
192.168.178.93 -rwxr-xr-x 0 0 2376469 Jan 20 2022 19:35 iscsid-wrapper
192.168.178.93 -rwxr-xr-x 0 0 294432 Jan 20 2022 19:35 iscsistart
192.168.178.93 -rwxr-xr-x 0 0 146880 Jan 20 2022 19:35 iscsiuio
192.168.178.93 -rwxr-xr-x 0 0 61640 Jan 20 2022 19:35 tgtadm
192.168.178.93 -rwxr-xr-x 0 0 1219040 Jan 20 2022 19:35 tgtd
192.168.178.93 -rwxr-xr-x 0 0 59520 Jan 20 2022 19:35 tgtimg
There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.
Location: ^(?<major>\d+)\.(?<minor>\d+)\.?(?<patch>(?!9000$)\d+)?$
Error type: Invalid regular expression: ^(?\d+).(?\d+).?(?(?!9000$)\d+)?$
I'm trying to create a PV using an iSCSI target on my NAS. I'm using Talos v1.4.5 and iscsi-tools v0.1.4. I can confirm that iscsiadm and friends are present on the host under /usr/local/sbin. I've created a PV and a PVC. When a Pod using that PVC starts, it cannot mount the PV; the error message from kubelet says that it cannot find the binary in $PATH.
I've checked the k8s source code to see where that error is coming from and saw that it tries to use iscsiadm, so I assume this is the binary it cannot find. I've tried adding /usr/local/sbin to extraMounts for kubelet, but that did not help. What else can I try?
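For reference, this is the shape of the extraMounts patch I tried (a sketch only; it did not solve the problem for me, and the mount options are my guess):

```yaml
# Bind-mount the host's /usr/local/sbin into the kubelet container
# so kubelet can find iscsiadm; did not help in my case.
machine:
  kubelet:
    extraMounts:
      - destination: /usr/local/sbin
        type: bind
        source: /usr/local/sbin
        options:
          - bind
          - ro
```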
System:
Talos: v1.3.3
K8s: 1.26.1
GPU: RTX 2080S
Nvidia driver versions:
nvidia-open-gpu-kernel-modules:515.65.01-v1.3.3
nvidia-container-toolkit:515.65.01-v1.11.0
I followed the directions in the documentations for OSS drivers (including creating the runtime and verifying the drivers were successfully installed).
When I run the following command helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.11.0 --set=runtimeClassName=nvidia
(to enable the nvidia device plugin) I get the following in the logs on the node:
Open nvidia.ko is only ready for use on Data Center GPUs.
To force use of Open nvidia.ko on other GPUs, see the
'OpenRmEnableUnsupportedGpus' kernel module parameter described
in the README
GPU 0000:07:00.0: RmInitAdapter failed!
Additionally the pod fails to start in k8s.
I found this is due to the following: because I'm not running a datacenter GPU, I need to run:
modprobe nvidia NVreg_OpenRmEnableUnsupportedGpus=1
or, in an /etc/modprobe.d/ configuration file:
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
In fact the README says the following (it appears this is needed for support on many GPUs):
https://github.com/NVIDIA/open-gpu-kernel-modules#compatible-gpus
To enable use of the open kernel modules on GeForce and Workstation GPUs, set the "NVreg_OpenRmEnableUnsupportedGpus" nvidia.ko kernel module parameter to 1. For more details, see the NVIDIA GPU driver end user README here:
However, I'm not sure how to do this via the TalosCLI.
Would it be possible to support this feature in the future on the OSS drivers? Is there anyway to set that parameter on extension install or have it as a talos feature? I would be happy to implement it but may need a pointer in the right direction.
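Since Talos supports passing kernel module parameters in the machine config (the drbd examples elsewhere in this tracker use the same mechanism), something along these lines might work, though I have not verified it against the nvidia extension:

```yaml
# Untested sketch: set NVreg_OpenRmEnableUnsupportedGpus=1 on the open
# nvidia module via machine.kernel.modules parameters.
machine:
  kernel:
    modules:
      - name: nvidia
        parameters:
          - NVreg_OpenRmEnableUnsupportedGpus=1
```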
Hello!
My situation is almost identical to #33 - I've got a cluster of raspberry pi 4's and a Synology NAS that I'm trying to use as a storage backend for my PVCs, and I can't use NFS due to the limitations of file locking SQLite databases on network filesystems. I'm using the synology-csi driver to handle creation, deletion, and connection of iSCSI-backed LUNs on my NAS as PersistentVolumes in my cluster.
The initial setup went smoothly, with the driver able to create and delete PersistentVolumes and their backing LUNs with no issue on my NAS. I now suspect that it was accomplishing that via the Synology API and not via iSCSI, as I have not actually been able to mount the volume via iSCSI.
Here's my talosctl version
:
❯ talosctl version
Client:
Tag: v1.0.0
SHA: 80167fd2
Built:
Go version: go1.17.8
OS/Arch: darwin/amd64
Server:
NODE: 192.168.4.9
Tag: v1.1.0-alpha.1-65-gafb679586
SHA: afb67958
Built:
Go version: go1.18.2
OS/Arch: linux/arm64
Enabled: RBAC
Originally, this was the event trace when trying to mount a volume to a pod:
❯ kubectl describe pod pvc-inspector
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 86s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
Warning FailedScheduling 85s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
Normal Scheduled 83s default-scheduler Successfully assigned synology-csi/pvc-inspector to pi-sidero
Normal SuccessfulAttachVolume 83s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-1969eb78-dbb4-43e1-9659-6b6bdff9167b"
Warning FailedMount 10s (x8 over 76s) kubelet MountVolume.MountDevice failed for volume "pvc-1969eb78-dbb4-43e1-9659-6b6bdff9167b" : rpc error: code = Internal desc = rpc error: code = Internal desc = Failed to login with target iqn [iqn.2000-01.com.synology:<hostname>.pvc-1969eb78-dbb4-43e1-9659-6b6bdff9167b], err: chroot: can't execute '/usr/bin/env': No such file or directory
(exit status 127)
I dug around and determined it was due to iscsiadm being aliased to this code snippet (see the Dockerfile): https://github.com/SynologyOpenSource/synology-csi/blob/fc3359223fe51a13bcfa5a7cabbf59611bbeb901/chroot/chroot.sh#L4-L9
It looks like synology-csi just mounts the whole host filesystem into /host (source), and uses chroot to emulate calling iscsiadm from the host environment. Since there's no /usr/bin/env in the Talos environment, the command fails. I changed it to directly link to the iscsiadm binary contained in the iscsi-tools extension, like so:
exec chroot $DIR "/usr/local/sbin/$BIN" "$@"
That fixed the issue of not actually being able to run iscsiadm, but I'm now running into this error:
❯ kubectl describe pod pvc-inspector
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m19s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
Warning FailedScheduling 3m18s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
Warning FailedScheduling 117s default-scheduler 0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Normal Scheduled 11s default-scheduler Successfully assigned synology-csi/pvc-inspector to pi-sidero
Normal SuccessfulAttachVolume 11s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-4ad7b598-50a7-423e-84c5-d4f0c4edb083"
Warning FailedMount 1s (x2 over 2s) kubelet MountVolume.MountDevice failed for volume "pvc-4ad7b598-50a7-423e-84c5-d4f0c4edb083" : rpc error: code = Internal desc = rpc error: code = Internal desc = Failed to login with target iqn [iqn.2000-01.com.synology:<hostname>.pvc-4ad7b598-50a7-423e-84c5-d4f0c4edb083], err: iscsiadm: Could not make /etc/iscsi/ 30
iscsiadm: exiting due to idbm configuration error
(exit status 6)
From what I understand, the /etc/iscsi folder should be bound to /system/iscsi; see extensions/storage/iscsi-tools/iscsid.yaml, lines 41 to 47 at 7cf2843.
The iscsi-tools extension seems to be working for the most part:
- talosctl ls /usr/local/sbin returns a list of iscsi-related binaries
- talosctl get extensions lists the correct information for iscsi-tools
- iscsiadm --help works in the synology-csi container that mounts the host fs
However, the /system/iscsi directory is not being bound correctly to /etc/iscsi:
❯ talosctl ls system
NODE NAME
192.168.4.9 .
192.168.4.9 config
192.168.4.9 etc
192.168.4.9 iscsi
192.168.4.9 libexec
192.168.4.9 overlays
192.168.4.9 run
192.168.4.9 secrets
192.168.4.9 state
192.168.4.9 var
❯ talosctl ls etc
NODE NAME
192.168.4.9 .
192.168.4.9 ca-certificates
192.168.4.9 cni
192.168.4.9 containerd
192.168.4.9 cri
192.168.4.9 extensions.yaml
192.168.4.9 hosts
192.168.4.9 kubernetes
192.168.4.9 localtime
192.168.4.9 lvm
192.168.4.9 machine-id
192.168.4.9 os-release
192.168.4.9 pki
192.168.4.9 resolv.conf
192.168.4.9 ssl
192.168.4.9 ssl1.1
Is there something I need to be doing in the machineconfig to get this directory bound as expected?
192.168.251.103: user: warning: [2023-07-11T14:52:13.527141505Z]: 2023/07/11 14:52:14 running pre-flight checks
192.168.251.103: user: warning: [2023-07-11T14:52:13.531414505Z]: 2023/07/11 14:52:14 host Talos version: v1.4.6
192.168.251.103: user: warning: [2023-07-11T14:52:13.538455505Z]: 2023/07/11 14:52:14 host Kubernetes versions: kubelet: 1.26.4, kube-apiserver: 1.26.4, kube-scheduler: 1.26.4, kube-controller-manager: 1.26.4
192.168.251.103: user: warning: [2023-07-11T14:52:13.541381505Z]: 2023/07/11 14:52:14 all pre-flight checks successful
192.168.251.103: user: warning: [2023-07-11T14:52:13.542369505Z]: 2023/07/11 14:52:14 discovered system extensions:
192.168.251.103: user: warning: [2023-07-11T14:52:13.543243505Z]: 2023/07/11 14:52:14 NAME VERSION AUTHOR
192.168.251.103: user: warning: [2023-07-11T14:52:13.544191505Z]: 2023/07/11 14:52:14 iscsi-tools v0.1.4 Sidero Labs
192.168.251.103: user: warning: [2023-07-11T14:52:13.545380505Z]: 2023/07/11 14:52:14 qemu-guest-agent 8.0.2 Markus Reiter
192.168.251.103: user: warning: [2023-07-11T14:52:13.546809505Z]: 2023/07/11 14:52:14 validating system extensions
192.168.251.103: user: warning: [2023-07-11T14:52:13.547854505Z]: Error: error validating extension "qemu-guest-agent": version constraint >= v1.5.0 can't be satisfied with Talos version 1.4.6
192.168.251.103: user: warning: [2023-07-11T14:52:13.550655505Z]: Usage:
192.168.251.103: user: warning: [2023-07-11T14:52:13.551176505Z]: installer install [flags]
192.168.251.103: user: warning: [2023-07-11T14:52:13.551895505Z]:
192.168.251.103: user: warning: [2023-07-11T14:52:13.552153505Z]: Flags:
192.168.251.103: user: warning: [2023-07-11T14:52:13.552481505Z]: -h, --help help for install
192.168.251.103: user: warning: [2023-07-11T14:52:13.553194505Z]:
192.168.251.103: user: warning: [2023-07-11T14:52:13.553655505Z]: Global Flags:
192.168.251.103: user: warning: [2023-07-11T14:52:13.554332505Z]: --arch string The target architecture (default "amd64")
192.168.251.103: user: warning: [2023-07-11T14:52:13.556241505Z]: --board string The value of talos.board (default "none")
192.168.251.103: user: warning: [2023-07-11T14:52:13.558227505Z]: --bootloader Install a booloader to the specified disk (default true)
192.168.251.103: user: warning: [2023-07-11T14:52:13.560454505Z]: --config string The value of talos.config
192.168.251.103: user: warning: [2023-07-11T14:52:13.562169505Z]: --disk string The path to the disk to install to
192.168.251.103: user: warning: [2023-07-11T14:52:13.562178505Z]: --extra-kernel-arg stringArray Extra argument to pass to the kernel
192.168.251.103: user: warning: [2023-07-11T14:52:13.562196505Z]: --force Indicates that the install should forcefully format the partition
192.168.251.103: user: warning: [2023-07-11T14:52:13.562204505Z]: --meta metaValueSlice A key/value pair for META (default [])
192.168.251.103: user: warning: [2023-07-11T14:52:13.562210505Z]: --platform string The value of talos.platform
192.168.251.103: user: warning: [2023-07-11T14:52:13.562217505Z]: --upgrade Indicates that the install is being performed by an upgrade
192.168.251.103: user: warning: [2023-07-11T14:52:13.562224505Z]: --zero Indicates that the install should write zeros to the disk before installing
192.168.251.103: user: warning: [2023-07-11T14:52:13.562231505Z]:
192.168.251.103: user: warning: [2023-07-11T14:52:13.562238505Z]: error validating extension "qemu-guest-agent": version constraint >= v1.5.0 can't be satisfied with Talos version 1.4.6
Hi,
I wanted to test the Piraeus/Linstor CSI, which needs the drbd extension, on Talos v1.4.4 (upgraded this morning) on my test cluster. I patched a test node's machineconfig according to the piraeus documentation concerning Talos Linux (before patching the whole cluster's workers):
machine:
install:
extensions:
- image: ghcr.io/siderolabs/drbd:9.2.2-v1.4.4
kernel:
modules:
- name: drbd
parameters:
- usermode_helper=disabled
- name: drbd_transport_tcp
and got this error in dashboard's log:
On a talosctl list -n test_node /lib/modules/6.1.28-talos (the actual kernel used by my cluster VMs), there is no extras folder. But on a talosctl list -n test_node /lib/modules/6.1.27-talos (the previous kernel), the extras/drbd.ko and extras/drbd_transport_tcp.ko kernel modules are present.
Should I force the use of the 6.1.27 kernel, or wait for the next drbd extension version?
Sorry for my approximate English.
Best regards, and thank you for making Talos Linux.
Edit: I replaced the test node with a fresh node and applied the machineconfig; same behavior, the drbd modules get installed in /lib/modules/6.1.27-talos/extras instead of /lib/modules/6.1.28-talos/extras.
Some extensions, like open-iscsi and nvidia-toolkit, need multiple tools/programs packaged together, where each one might have a different version. Currently we tag some extensions with the versions of the different tools separated by a dash.
I propose adding support in the extension spec to define the version of each tool/program separately, with a separate versioning scheme for the extension itself.
version: v1alpha1
metadata:
  name: <extension name>
  # this keeps the current versioning scheme for the overall extension
  version: <version of the package the extension installs>-<version of the extensions repo (tracks with talos version)>
  tools:
    <foo>: <version>
    <bar>: <version>
  author: Andrew Rynhard
  description: |
    <detailed description of the extension/package>
  ## The compatibility section is "optional" but highly recommended to specify a Talos version that
  ## has been tested and known working for this extension.
  compatibility:
    talos:
      version: ">= v1.0.0"
This would also allow showing the versions of the tools shipped with an extension when running talosctl get extensions.
There is a problem installing the iscsi-tools extension.
Upgrade command
talosctl --talosconfig=./clusterconfig/talosconfig upgrade --nodes 192.168.x.x -p -i ghcr.io/siderolabs/installer:v1.2.7
Extension config
...
install:
  disk: /dev/sda
  image: ghcr.io/siderolabs/installer:v1.2.7
  extensions:
    - image: ghcr.io/siderolabs/iscsi-tools:v0.1.2
  bootloader: true
  wipe: false
...
Error Messages
[ 163.035975] 2022/12/03 08:15:14 discovered system extensions:
[ 163.036500] 2022/12/03 08:15:14 NAME VERSION AUTHOR
[ 163.036850] 2022/12/03 08:15:14 iscsi-tools v0.1.2 Sidero Labs
[ 163.037221] 2022/12/03 08:15:14 validating system extensions
[ 163.037569] Error: error validating extension "iscsi-tools": path "/etc/logrotate.d/iscsiuiolog" is not allowed in extensions
[ 163.038296] Usage:
[ 163.038433] installer install [flags]
[ 163.038675]
[ 163.038775] Flags:
[ 163.038911] -h, --help help for install
[ 163.039179]
[ 163.039279] Global Flags:
[ 163.039451] --arch string The target architecture (default "amd64")
[ 163.039962] --board string The value of talos.board (default "none")
[ 163.040456] --bootloader Install a booloader to the specified disk (default true)
[ 163.041022] --config string The value of talos.config
[ 163.041435] --disk string The path to the disk to install to
[ 163.041898] --extra-kernel-arg stringArray Extra argument to pass to the kernel
[ 163.042378] --force Indicates that the install should forcefully format the partition
[ 163.042986] --platform string The value of talos.platform
[ 163.043415] --upgrade Indicates that the install is being performed by an upgrade
[ 163.043984] --zero Indicates that the install should write zeros to the disk before installing
[ 163.044631]
[ 163.044791] error validating extension "iscsi-tools": path "/etc/logrotate.d/iscsiuiolog" is not allowed in extensions
Hey Sidero Labs community,
I am trying to build an extension for runu (part of kraftkit), a container runtime for unikernels. I want to use it to run unikernels on my Raspberry Pis.
Unfortunately, I am not able to get the extension to build.
I started by trying to cross-compile the runu binary on linux/amd64, but ran into issues due to the nsenter dependency, which is a CGO library from runc.
However, I was able to build runu natively on my Raspberry Pi 5, which is currently good enough for me:
$ GOARCH=arm64 CGO_ENABLED=1 DOCKER="" make runu
My problem now is that I can't get the extension to build.
You can find my current work-in-progress in my fork.
It builds fine for linux/amd64 on my laptop (also linux/amd64), but fails on my Pi 5 with the following error:
$ PLATFORM=linux/arm64 make target-runu
...
85.44 GOOS=linux \
85.44 GOARCH=arm64 \
85.44 go build \
85.44 -buildmode=pie \
85.44 -gcflags=all='' \
85.44 -ldflags='-s -w -X "kraftkit.sh/internal/version.version=-dirty" -X "kraftkit.sh/internal/version.commit=" -X "kraftkit.sh/internal/version.buildTime=Mon Feb 5 22:17:04 UTC 2024"' \
85.44 -o /go/src/github.com/unikraft/kraftkit/dist/runu \
85.44 /go/src/github.com/unikraft/kraftkit/cmd/runu
203.8 # kraftkit.sh/cmd/runu
203.8 /toolchain/go/pkg/tool/linux_arm64/link: running gcc failed: exit status 1
203.8 collect2: fatal error: cannot find 'ld'
203.8 compilation terminated.
203.8
203.9 make: *** [Makefile:128: runu] Error 1
I am not sure how to debug this issue. ld is installed in the build image (ghcr.io/siderolabs/base:v1.7.0-alpha.0-13-gf376a53), but gcc seems to be unable to find it during the build (as far as I can tell).
I hope I included all necessary information.
Any input on this issue would be appreciated, as I am a little out of my depth on this one.
P.S.: I asked the unikraft community if they could provide a precompiled version of runu for arm64, and they told me that they will look into it.
Talos v1.5.5
nonfree-kmod-nvidia:535.54.03-v1.5.5
nvidia-container-toolkit:535.54.03-v1.13.5
Hardware: NUC11PHKi7CAA (UEFI boot)
64g RAM
After performing the initial installation with the vanilla metal-amd64.iso, I apply the configuration with the nvidia extensions and kernel modules, all of which work well. But when I perform a reboot with the -m powercycle option, the machine won't boot, with the following error:
Discussing with frezbo, he mentioned this grub issue as a possible cause: https://bugs.launchpad.net/oem-priority/+bug/1842320
I performed other tests in proxmox running on the same hardware model by switching between UEFI and BIOS on VMs with the NVIDIA card in PCI passthrough but I was not able to reproduce the issue.
Machine | Boot Mode | Result
---|---|---
Proxmox VM | BIOS | Works
Proxmox VM | UEFI | Works
Physical | UEFI | grub OOM
Physical | BIOS | Not available for this hw
I'm using the iSCSI extension in my machine install section:
install:
  disk: /dev/sda
  image: ghcr.io/siderolabs/installer:v1.3.5
  bootloader: true
  wipe: true
  extensions:
    - image: ghcr.io/siderolabs/iscsi-tools:v0.1.4
This might actually be an issue specific to the iscsi-tools extension rather than Talos, but reporting it nonetheless.
For a new node this works great and as expected, but as soon as I restart the node I get this error:
spec: failed to generate spec: failed to mkdir "/usr/local/etc/iscsi": mkdir /usr/local/etc: read-only file system
I was expecting the node to behave the same after a restart; I suspect something is done during the initial install of the iscsi-tools extension that isn't recreated on a restart after installation. It's very frustrating having to re-roll nodes if they need a reboot.
Tag: v1.3.5
SHA: 03edf8c1
Built:
Go version: go1.19.6
OS/Arch: linux/amd64
Enabled: RBAC
Server Version: v1.26.1
I have a use case in which I need V4L modules in the kernel. I am using ffmpeg with a USB webcam attached to a nearby 3D printer and want to run an ffmpeg transcoding job from the webcam to both an Obico instance and an nginx-rtmp instance, for detecting print failures and accessing the stream remotely, respectively.
Given this is pretty niche, would Siderolabs accept a PR for a new extension for V4L similar to what is done for a few other kernel modules?
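If such an extension existed, loading the module would presumably follow the same pattern as the other kernel-module extensions in this thread. A hypothetical machine config patch (the extension image name and module name are assumptions, not a confirmed published image):

```yaml
machine:
  install:
    extensions:
      # hypothetical extension image name
      - image: ghcr.io/siderolabs/v4l-uvc-drivers:<talos version>
  kernel:
    modules:
      # assumed USB Video Class module such an extension would ship
      - name: uvcvideo
```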
This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.
These updates have all been created already. Click a checkbox below to force a retry/rebase of any.
git://git.kernel.org/pub/scm/libs/libcap/libcap.git, google/gvisor, https://github.com/qemu/qemu.git, https://gitlab.com/nvidia/container-toolkit/container-toolkit.git, https://gitlab.com/nvidia/container-toolkit/libnvidia-container.git, https://gitlab.gnome.org/GNOME/glib.git, kata-containers/kata-containers, nvidia/open-gpu-kernel-modules, tailscale/tailscale
https://github.com/qemu/qemu.git, nvidia/open-gpu-kernel-modules
.github/workflows/ci.yaml
kenchan0130/actions-system-info v1.3.0
actions/checkout v4
docker/setup-buildx-action v3
docker/login-action v3
actions/github-script v7
crazy-max/ghaction-github-release v2
kenchan0130/actions-system-info v1.3.0
actions/checkout v4
docker/setup-buildx-action v3
moby/buildkit v0.13.2
moby/buildkit v0.13.2
.github/workflows/slack-notify.yaml
slackapi/slack-github-action v1
.github/workflows/weekly.yaml
kenchan0130/actions-system-info v1.3.0
actions/checkout v4
docker/setup-buildx-action v3
moby/buildkit v0.13.2
examples/hello-world-service/src/go.mod
go 1.21
nvidia-gpu/nvidia-container-toolkit/nvidia-container-runtime-wrapper/go.mod
go 1.22
golang.org/x/sys v0.20.0
nvidia-gpu/nvidia-container-toolkit/nvidia-persistenced-wrapper/go.mod
go 1.22
golang.org/x/sys v0.20.0
storage/iscsi-tools/iscsid-wrapper/go.mod
go 1.22
golang.org/x/sys v0.20.0
container-runtime/vars.yaml
google/gvisor 20240325.0
containerd/stargz-snapshotter v0.15.1
kubernetes/cloud-provider-aws v1.30.0
containerd/runwasi v0.3.0
spinkube/containerd-shim-spin v0.14.1
kata-containers/kata-containers 3.3.0
firmware/vars.yaml
intel/Intel-Linux-Processor-Microcode-Data-Files 20240514
guest-agents/vars.yaml
https://github.com/qemu/qemu.git 8.2.3
https://gitlab.gnome.org/GNOME/glib.git 2.80.0
https://gitlab.com/xen-project/xen-guest-agent.git 0.4.0
siderolabs/talos-vmtoolsd v0.5.1
network/vars.yaml
tailscale/tailscale 1.64.2
nvidia-gpu/vars.yaml
nvidia/open-gpu-kernel-modules 535.129.03
https://gitlab.com/nvidia/container-toolkit/container-toolkit.git v1.14.6@5605d191332dcfeea802c4497360d60a65c7887e
https://gitlab.com/nvidia/container-toolkit/libnvidia-container.git v1.14.6@d2eb0afe86f0b643e33624ee64f065dd60e952d4
cgr.dev/chainguard/wolfi-base sha256:3d6dece13cdb5546cd03b20e14f9af354bc1a56ab5a7b47dca3e6c1557211fcf
https://sourceware.org/git/glibc.git 2.39
seccomp/libseccomp 2.5.5
git://git.kernel.org/pub/scm/libs/libcap/libcap.git 2.69
git://sourceware.org/git/elfutils.git 0.191
power/vars.yaml
networkupstools/nut 2.8.2
storage/iscsi-tools/vars.yaml
open-iscsi/open-iscsi 2.1.9
open-iscsi/open-isns 0.102
storage/vars.yaml
libfuse/libfuse 3.16.2
git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git 4.3
Pkgfile
git://linux-nfs.org/~steved/libtirpc 1-3-3
madler/zlib 1.3.1
Pkgfile
siderolabs/bldr v0.3.1
We don't need the Talos version to be part of the extension version if the extension doesn't contain kernel modules.
This will help with signing releases (fewer signatures to publish).
Hi,
Apologies, I don't really know how to debug this, but on my rock64, when using the DRBD extension, I seem to be missing the drbd module:
talosctl --talosconfig talosconfig -n mynode list /lib/modules/6.1.58-talos/extras
NODE NAME
1 error occurred:
rpc error: code = Unknown desc = lstat /lib/modules/6.1.58-talos/extras: no such file or directory
The same config deployed on x86 machines does have that directory populated with the .ko files as expected.
I tried using the tag and also the specific arm hash from here to be sure, but no luck when "upgrading" to the same version to rebuild the initramfs.
I can't access the display for that node; not sure which service log might explain why this is failing?
Thanks
We should take a look at creating an estargz extension for Talos if possible. Customers have a need for this type of image loading, and it could have real value for others pulling big images as well. An extension feels like a nice way to tackle this, as opposed to baking it into Talos, since I think it's just a matter of enabling some containerd plugins.
Hi team,
It looks like multipath-tools is not available in the iscsi-tools extension. I searched for multipath references everywhere, to no avail.
When mounting an iSCSI target using multipath (via democratic-csi), I get failed to discover multipath device. However, I can see that the two disk references are successfully mounted:
{"host":"kubeworker03","level":"info","message":"new request - driver: ControllerSynologyDriver method: NodeStageVolume call: {\"metadata\":{\"x-forwarded-host\":[\"localhost\"],\"user-agent\":[\"grpc-go/1.49.0\"]},\"request\":{\"publish_context\":{},\"secrets\":\"redacted\",\"volume_context\":{\"lun\":\"1\",\"node_attach_driver\":\"iscsi\",\"portal\":\"\",\"portals\":\"nas01-iscsi0.i,nas01-iscsi1.i\",\"provisioner_driver\":\"synology-iscsi\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1679431778041-8081-synology-iscsi\",\"interface\":\"\",\"iqn\":\"iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d\"},\"volume_id\":\"pvc-83c2d084-36e9-4b31-86f2-bca6df03770d\",\"staging_target_path\":\"/var/lib/kubelet/plugins/kubernetes.io/csi/synology-iscsi/4db6d710ff1e3fb3e3cd58f0a4fb57b0c962a6dd773ba289f292dbf2b48d2c61/globalmount\",\"volume_capability\":{\"access_mode\":{\"mode\":\"SINGLE_NODE_MULTI_WRITER\"},\"mount\":{\"mount_flags\":[],\"fs_type\":\"\",\"volume_mount_group\":\"\"},\"access_type\":\"mount\"}},\"cancelled\":false}","service":"democratic-csi","timestamp":"2023-03-21T21:31:11.735Z"}
executing iscsi command: iscsiadm -m node -T iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d -p nas01-iscsi0.i:3260 -o new
executing iscsi command: iscsiadm -m node -T iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d -p nas01-iscsi0.i:3260 -o update --name node.startup --value manual
executing iscsi command: iscsiadm -m node -T iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d -p nas01-iscsi0.i:3260 -l
executing iscsi command: iscsiadm -m session
executing iscsi command: iscsiadm -m session -r 5 --rescan
executing filesystem command: realpath /dev/disk/by-path/ip-nas01-iscsi0.i:3260-iscsi-iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d-lun-1
{"host":"kubeworker03","level":"info","message":"successfully logged into portal nas01-iscsi0.i:3260 and created device /dev/disk/by-path/ip-nas01-iscsi0.i:3260-iscsi-iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d-lun-1 with realpath /dev/sdf","service":"democratic-csi","timestamp":"2023-03-21T21:31:11.948Z"}
executing iscsi command: iscsiadm -m node -T iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d -p nas01-iscsi1.i:3260 -o new
executing iscsi command: iscsiadm -m node -T iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d -p nas01-iscsi1.i:3260 -o update --name node.startup --value manual
executing iscsi command: iscsiadm -m node -T iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d -p nas01-iscsi1.i:3260 -l
executing iscsi command: iscsiadm -m session
executing iscsi command: iscsiadm -m session -r 6 --rescan
executing filesystem command: realpath /dev/disk/by-path/ip-nas01-iscsi1.i:3260-iscsi-iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d-lun-1
{"host":"kubeworker03","level":"info","message":"successfully logged into portal nas01-iscsi1.i:3260 and created device /dev/disk/by-path/ip-nas01-iscsi1.i:3260-iscsi-iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d-lun-1 with realpath /dev/sdg","service":"democratic-csi","timestamp":"2023-03-21T21:31:12.196Z"}
executing filesystem command: sh -c for file in $(ls -la /dev/mapper/* | grep "\->" | grep -oP "\-> .+" | grep -oP " .+"); do echo $(F=$(echo $file | grep -oP "[a-z0-9-]+");echo $F":"$(ls "/sys/block/${F}/slaves/");); done;
executing filesystem command: sh -c for file in $(ls -la /dev/mapper/* | grep "\->" | grep -oP "\-> .+" | grep -oP " .+"); do echo $(F=$(echo $file | grep -oP "[a-z0-9-]+");echo $F":"$(ls "/sys/block/${F}/slaves/");); done;
{"host":"kubeworker03","level":"error","message":"handler error - driver: ControllerSynologyDriver method: NodeStageVolume error: {\"name\":\"GrpcError\",\"code\":2,\"message\":\"failed to discover multipath device\"}","service":"democratic-csi","timestamp":"2023-03-21T21:31:14.221Z"}
On the democratic-csi side, nsenter support for the multipath executable needs to be added as well (see the original PR for supporting iscsiadm here: democratic-csi/democratic-csi#225).
Thanks for taking this request into consideration :)
Cheers !
After upgrading my nvidia node to Talos 1.3.0 and recompiling and applying the nvidia proprietary driver, my nvidia-device-plugin deployment no longer starts, with the following error:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to generate new device filter program from existing programs: unable to create new device filters program: load program: invalid argument: last insn is not an exit or jmp
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0: unknown
My research on this error seems to point to cgroups v2 compatibility, and indeed reverting one of the nodes to cgroups v1 allowed the container to start.
libnvidia-container appears to support cgroups v2 since v1.8.0, and v1.8.1 brings a fix to cgroup detection (NVIDIA/libnvidia-container#158).
I have to admit this is getting a bit beyond my scope of knowledge, but seeing as there are references to permissions and capabilities, I figured it may be related, since Talos locks down a lot of things.
We might benefit from a better directory structure:
base/
  musl (under base/; better if we could use the one from pkgs?)
container-runtime/
firmware/
(note: bldr doesn't care about directory structure, it finds pkg.yaml by name)
Error message:
172.20.0.2: {"error":"failed to create containerd task: failed to create shim task: OCI runtime create failed: creating container: Rel: can't make podruntime/runtime relative to /: unknown","level":"error","msg":"RunPodSandbox for \u0026PodSandboxMetadata{Name:nginx-gvisor,Uid:540e0e84-7090-48e9-838b-8d7d0965e301,Namespace:default,Attempt:0,} failed, error","time":"2022-01-25T14:24:32.902977189Z"}
It seems to be related to the cgroups handling code, as podruntime/runtime is the cgroup of the cri process.
I traced the error back to: https://github.com/google/gvisor/blob/fa4e2fff8a842dcfe6518d8bf9cdbd94750780cc/runsc/cgroup/cgroup.go#L292
Error message starting a node and viewing logs with talosctl dmesg:
192.168.1.153: user: warning: [2022-05-06T03:40:35.396887679Z]: [talos] service[ext-iscsid](Preparing): Running pre state
192.168.1.153: user: warning: [2022-05-06T03:40:35.474317679Z]: [talos] service[ext-iscsid](Preparing): Creating service runner
192.168.1.153: user: warning: [2022-05-06T03:40:35.558298679Z]: [talos] service[ext-iscsid](Failed): Failed to create runner: mkdir /usr/local/etc/iscsi/iscsid.conf: not a directory
talosctl version:
Client:
Tag: v1.0.0
SHA: 80167fd2
Built:
Go version: go1.17.8
OS/Arch: darwin/amd64
Server:
NODE: 192.168.1.153
Tag: v1.0.3
SHA: 689c6e54
Built:
Go version: go1.17.7
OS/Arch: linux/amd64
Enabled: RBAC
Part of the configuration:
version: v1alpha1
machine:
  type: controlplane
  kubelet:
    image: ghcr.io/siderolabs/kubelet:v1.23.6
  install:
    extensions:
      - image: ghcr.io/siderolabs/intel-ucode:20220419
      - image: ghcr.io/siderolabs/iscsi-tools:v0.1.0
Currently blocked on kata-containers/kata-containers#927, as Talos is only cgroups v2.
The PR that introduced this makes it seem like it is the Intel graphics driver for the iGPU; however, it's listed under firmware rather than drivers. Is this required to use the iGPU in pods running on Talos, or does it update the actual microcode for the iGPU? It's unclear, and there's no mention anywhere of what most packages in this repo do.
I noticed my nodes infinitely looping, so I decided to check the logs. It seems ext-iscsid is unable to find libcrypto for some reason:
10.10.11.7: Error loading shared library libcrypto.so.3: No such file or directory (needed by /usr/local/sbin/iscsid)
10.10.11.7: Error loading shared library libcrypto.so.3: No such file or directory (needed by /usr/local/lib/libisns.so.0)
I installed this extension and tried to run a test pod with the estargz image:
apiVersion: v1
kind: Pod
metadata:
  name: nodejs
spec:
  containers:
    - name: nodejs-stargz
      image: ghcr.io/stargz-containers/node:17.8.0-esgz
      command: ["node"]
      args:
        - -e
        - |
          var http = require('http');
          http.createServer(function(req, res) {
            res.writeHead(200);
            res.end('Hello World!\n');
          }).listen(80);
      ports:
        - containerPort: 80
Getting this error:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "node": executable file not found in $PATH: unknown
I also see these errors in the ext-stargz-snapshotter service:
{"dir":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/68/fs","error":"specified path \"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/68/fs\" isn't a mountpoint","level":"debug","msg":"failed to unmount","time":"2023-10-06T15:35:52.587369547Z"}
{"error":null,"key":"k8s.io/91/extract-195307888--2O1 sha256:2414385fd51d34e07d564ec6041ee66de902424f028528bce52743d92b1bc875","level":"info","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/64/fs","msg":"[fusermount fusermount3] not installed; trying direct mount","parent":"sha256:a3926353a4b2389bed133fe4b9f8bdb8439529ba6a965b37ef0c1a7921043a00","time":"2023-10-06T15:34:17.197856631Z"}
It would be nice to have a way to limit the amount of memory ZFS can use for the ARC. The default is 1/2 of installed system memory (Reference 1).
I have a 32G RAM worker in which 12G is currently being used by the ARC, and it randomly causes processes within pods to get OOM-killed.
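One possible approach, assuming the ZFS extension loads zfs as a regular kernel module, would be to pass the standard zfs_arc_max module parameter through the machine config, mirroring how the drbd parameters are set elsewhere in this thread. An untested sketch (the value is in bytes, here an 8 GiB cap):

```yaml
machine:
  kernel:
    modules:
      - name: zfs
        parameters:
          # assumption: the module accepts this parameter at load time
          - zfs_arc_max=8589934592
```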