Talos Linux System Extensions
I was not sure where to file this issue, but it seems like a general documentation issue, at least for installing the iscsi extension.
This page shows this yaml:
- op: add
path: /machine/install/extensions
value:
- image: ghcr.io/siderolabs/iscsi-tools:
However, this is not valid YAML (it does not parse), and if you simply remove the trailing colon, the upgrade also fails and ends up in a "locked" state, which I don't know how to get out of other than fully rebooting.
I changed the image to ghcr.io/siderolabs/iscsi-tools:latest, which also seemed to fail. Using ghcr.io/siderolabs/iscsi-tools:v0.1.1 worked:
NODE NAMESPACE TYPE ID VERSION NAME VERSION
10.45.101.1 runtime ExtensionStatus 000.ghcr.io-siderolabs-iscsi-tools-v0.1.1 1 iscsi-tools v0.1.1
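For anyone hitting the same thing, this is the complete patch that ended up working for me, with the tag pinned explicitly (v0.1.1 was simply the release available at the time; pin whatever tag matches your Talos version):

```yaml
# JSON-patch style machine config patch; the image tag must be pinned,
# since the docs' trailing-colon form does not parse.
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/siderolabs/iscsi-tools:v0.1.1
```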
I'm probably doing something wrong and would appreciate any help.
I'm running talos v1.5.4 on metal and trying to install ghcr.io/siderolabs/iscsi-tools:v0.1.4@sha256:180f0d17015ecd1c02c0f0f421aa9c5410970f33cd61bf43b1bc4c52e9310e69
Unfortunately getting this error
: user: warning: [2023-11-03T21:25:57.437836397Z]: Error: error validating extension "iscsi-tools": path "/usr/include/libopeniscsiusr.h" is not allowed in extensions
Any idea?
I would like to use DRBD 9.2.5 with Talos v1.5.3. Currently the DRBD 9.2.5 extension is only available for Talos v1.6.0-*. Would you please add the new DRBD version for Talos v1.5.3?
Or is the only way to make a separate system extension / modify the existing one?
Just asking in case I need to change the gvisor configuration, for example enabling the root fs overlay or changing the platform gvisor uses.
I'm currently trying to integrate an already existing file system into Talos and make it available to pods running on one node. As this pre-existing file system is based on btrfs, I'd like to add btrfs support to the Talos OS. As Talos aims to reduce attack surface, I think it makes the most sense to add support as an extension.
I also considered switching from btrfs to zfs, which is already supported by Talos and has a similar feature set. Unfortunately, dependent systems use btrfs send/receive features and cannot easily be migrated to other filesystems.
I am testing out piraeus-datastore using Talos. I was able to get it working wonderfully on v1.3.7, but after upgrading to v1.4.1 (and using drbd extension 9.2.2-v1.4.1), upon provisioning a PVC with more than one replica (necessitating communication between nodes), the "primary" node for the provisioned volume restarts without any kernel panic or console message.
I have attached a "talosctl dmesg" log in hopes that it will help troubleshoot.
Thank you!
elite6.log
Would be great if Sidero could provide an extension to enable Thunderbolt.
Thanks for the great work.
I know it’s a very niche thing, but for days I’ve been trying to get one of my nodes to output sound. Just wondering if ALSA is shipped with the base Talos kernel, and if not, whether this could be done via an extension. I know this is lowest priority and I understand that you have more important stuff to do. Nevertheless, thanks for this great and stable operating system.
The documentation only covers how to determine the right version, but nothing I can see on how to actually instruct the OS to install it.
I am attempting to install an extension (ecr-credential-provider) in a Talos deployment and cannot seem to get the machine config patch right. The docs for 1.6 say machine.install.extensions is deprecated, but there is no mention of the preferred method.
(I'm applying the config via the Terraform talos_machine_configuration_apply resource.)
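For what it's worth, my understanding (an assumption based on the 1.6 release notes, not something I've confirmed in the docs) is that the replacement for machine.install.extensions is to bake the extension into the installer image via Image Factory, with a schematic along these lines:

```yaml
# Image Factory schematic; assumes the extension is published as
# siderolabs/ecr-credential-provider -- adjust the name as needed.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/ecr-credential-provider
```

Posting the schematic to factory.talos.dev returns a schematic ID, and the resulting factory.talos.dev/installer/<schematic-id>:<talos-version> image is then used with talosctl upgrade --image ...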
I don't know whether #24 (comment) would contradict the philosophy of Talos, since everything is supposed to run off the SquashFS, but some users may want to edit the config files, and those config files are rather long to inline into YAML.
I'm not aware whether the user can supply the config at build time, or whether we could have consistent storage for /etc, for example backed by etcd.
I'm trying to automate Talos deployments via Terraform on my Proxmox cluster.
To be able to provision the VM's correctly QEMU GA has to be started on boot so the Terraform Proxmox provider can get the VM's IP addresses.
This seems possible now since v1.6.0 using early extension initialization feature on maintenance mode.
Unfortunately QEMU GA seems to not be working, unlike Xen GA. It gets stuck on the "Waiting for service "cri" to be registered" message.
I'm using an ISO bundled with QEMU GA by Image Factory with the following configuration.
Your image schematic ID is: f4524a31eb3f0cf8dc1449b8f43755d5ae4d35fc5c5b27eb79c8b8aca590bdfa
customization:
systemExtensions:
officialExtensions:
- siderolabs/amd-ucode
- siderolabs/intel-ucode
- siderolabs/qemu-guest-agent
QEMU GA is correctly listed by talosctl
❯ talosctl -n 10.120.0.35 get extensions -i
NODE NAMESPACE TYPE ID VERSION NAME VERSION
runtime ExtensionStatus 0 1 amd-ucode 20231111
runtime ExtensionStatus 1 1 intel-ucode 20231114
runtime ExtensionStatus 2 1 qemu-guest-agent 8.1.3
runtime ExtensionStatus 3 1 schematic f4524a31eb3f0cf8dc1449b8f43755d5ae4d35fc5c5b27eb79c8b8aca590bdfa
Thanks!
I pulled the extensions project and tried building the example extension, but an error occurs during the build.
make local-hello-world-service PLATFORM=linux/amd64 DEST=_out
make[1]: Entering directory `/opt/extensions'
[+] Building 3.0s (5/5) FINISHED docker:default
=> [internal] load build definition from Pkgfile 0.0s
=> => transferring dockerfile: 467B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> resolve image config for ghcr.io/siderolabs/bldr:v0.2.0-alpha.12 2.4s
=> CACHED docker-image://ghcr.io/siderolabs/bldr:v0.2.0-alpha.12@sha256:a97e2c870337b4677860887004415a20b0df94861567eab37495bd31ee7b248a 0.0s
=> load Pkgfile, pkg.yamls and vars.yamls 0.0s
=> => transferring dockerfile: 25.02kB 0.0s
Pkgfile:1
--------------------
1 | >>> # syntax = ghcr.io/siderolabs/bldr:v0.2.0-alpha.12
2 |
3 | format: v1alpha2
--------------------
ERROR: failed to solve: failed to solve LLB: requested experimental feature mergeop has been disabled on the build server: only enabled with containerd image store backend
make[1]: *** [target-hello-world-service] Error 1
make[1]: Leaving directory `/opt/extensions'
make: *** [local-hello-world-service] Error 2
Upgrade zfs to https://github.com/openzfs/zfs/releases/tag/zfs-2.2.2 or https://github.com/openzfs/zfs/releases/tag/zfs-2.1.14
They have an important fix for a data corruption bug; and 2.x now has support for Linux containers (e.g. overlayfs).
@smira I think this should be part of the current and next version of talos.
Hi,
I'm a new user of talos. The documentation is very complete and helped me a lot as a beginner. Thank you for all your hard work.
I'd like to use Longhorn on all my Talos nodes, and as you know, I need to install open-iscsi. I tried the following configuration but it didn't work well: the container ext-iscsid restarts forever.
debug: false
persist: true
machine:
type: controlplane
token: redacted
ca:
crt: redacted
key: redacted
certSANs:
- 127.0.0.1
kubelet:
image: ghcr.io/siderolabs/kubelet:v1.28.3
defaultRuntimeSeccompProfileEnabled: true
disableManifestsDirectory: true
network: {}
install:
disk: /dev/sda
image: ghcr.io/siderolabs/installer:v1.5.5
wipe: false
extensions:
- image: ghcr.io/siderolabs/iscsi-tools:v0.1.4
- image: ghcr.io/siderolabs/qemu-guest-agent:8.1.3
Here is the output:
❯ talosctl services ext-iscsid status -n 192.168.128.10
NODE 192.168.128.10
ID ext-iscsid
STATE Waiting
HEALTH ?
EVENTS [Waiting]: Error running Containerd(ext-iscsid), going to restart forever: task "ext-iscsid" failed: exit code 127 (2s ago)
[Running]: Started task ext-iscsid (PID 33404) for container ext-iscsid (2s ago)
[Waiting]: Error running Containerd(ext-iscsid), going to restart forever: task "ext-iscsid" failed: exit code 127 (7s ago)
[Running]: Started task ext-iscsid (PID 33339) for container ext-iscsid (7s ago)
[Preparing]: Creating service runner (7s ago)
[Preparing]: Running pre state (7s ago)
[Waiting]: Waiting for service "containerd" to be "up", service "cri" to be "up", service "ext-tgtd" to be "up", network (7s ago)
[Finished]: Service finished successfully (7s ago)
[Stopping]: Aborting restart sequence (7s ago)
[Waiting]: Error running Containerd(ext-iscsid), going to restart forever: task "ext-iscsid" failed: exit code 127 (11s ago)
[Running]: Started task ext-iscsid (PID 33273) for container ext-iscsid (11s ago)
I'm using Talos v1.5.5, my nodes are virtual machines on a Proxmox hypervisor.
Thank you for your help
It would be really nice to have a way to manage the physical nodes firmware using https://fwupd.org/.
For example, in fedora, this can be typically used as:
# add support for installing firmware/bios/uefi/intel-me updates.
dnf install -y fwupd-efi
# list the system devices that can be updated.
fwupdmgr get-devices
# list the available firmware updates.
fwupdmgr get-updates
# update the firmwares.
fwupdmgr update
# reboot to apply the update (required for bios/firmware/uefi/intel-me).
reboot
# check the result.
fwupdmgr get-updates
It seems that an error occurred during the nut client extension build.
https://ci.dev.talos-systems.io/siderolabs/extensions/258/1/2
I've added the extension to my Talos config and I can verify it's loaded; however, the kfd device is not present. lspci, when run from a busybox container, shows the amdgpu kernel module is loaded:
05:00.0 Class 0300: 1002:1638 amdgpu
Loading the ROCm k8s-device-plugin shows only the following output:
I0127 18:29:58.101908 1 main.go:305] AMD GPU device plugin for Kubernetes
I0127 18:29:58.101961 1 main.go:305] ./k8s-device-plugin version v1.25.2.7-0-g4503704
I0127 18:29:58.101965 1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I0127 18:29:58.101971 1 manager.go:42] Starting device plugin manager
I0127 18:29:58.101980 1 manager.go:46] Registering for system signal notifications
I0127 18:29:58.102101 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I0127 18:29:58.102184 1 manager.go:60] Starting Discovery on new plugins
I0127 18:29:58.102196 1 manager.go:66] Handling incoming signals
It appears to never receive the signal that a GPU is available.
I'm happy to try to diagnose anything further, supply logs, etc.
I'm working on creating an extension for the nvidia grid drivers (nvidia's open drivers don't support datacenter vgpus like the tesla line of cards). I've copied the nonfree-kmod-nvidia tree, in the hopes that they're similar enough that I can swap the linux installer and it would Just Work ™️, but either I did something wrong or I'm missing some critical step in building and pushing the extensions.
NOTE: one other thing that might complicate all this is that I'm running fairly old hardware that requires GOAMD64=v1. This is the process I use to have GitHub Actions build all the artifacts for me. It currently builds everything, but I'm fairly certain I only use the installer and talos images. I'm also currently on 1.5.1.
My steps so far:
- copy the nonfree-kmod-nvidia-pkg package and create the nonfree-kmod-nvidia-grid-pkg package
- copy the nonfree-kmod-nvidia extension and create the nonfree-kmod-nvidia-grid extension
- create the following patch:
# nvidia-vgpu.yaml
- op: add
path: /machine/install/extensions
value:
- image: ghcr.io/djeebus/talos/nonfree-kmod-nvidia-grid:535.54.03-v1.5.1
- op: add
path: /machine/kernel
value:
modules:
- name: nvidia
- name: nvidia_uvm
- name: nvidia_drm
- name: nvidia_modeset
- op: add
path: /machine/sysctls
value:
net.core.bpf_jit_harden: 1
apply the patch via:
talosctl \
--nodes $NODE \
patch mc \
--patch @nvidia-vgpu.yaml
trigger a reboot:
talosctl \
--nodes $NODE \
upgrade --image=ghcr.io/djeebus/talos/installer:v1.5.1
After all that, I get the following pair of messages in dmesg after a reboot:
$NODE: kern: notice: [2023-09-12T16:35:58.881790988Z]: Loading of module with unavailable key is rejected
$NODE: user: warning: [2023-09-12T16:35:58.887564988Z]: [talos] controller failed {"component": "controller-runtime", "controller": "runtime.KernelModuleSpecController", "error": "error loading module \x5c"nvidia\x5c": load nvidia failed: key was rejected by service"}
Any advice you could give would be very welcome, thanks!
https://github.com/kubernetes/cloud-provider-aws/tree/master/cmd/ecr-credential-provider
This is a kubelet-compatible credential provider helper, which generates short-lived tokens to authenticate against AWS ECR registries. AWS' Elastic Container Registry (ECR) only allows short-lived tokens, so hard-coding credentials directly in registryconfig.auth doesn't work.
https://kubernetes.io/docs/tasks/administer-cluster/kubelet-credential-provider/
Kubelet can be configured with the --image-credential-provider-bin-dir parameter, which sets the directory where the helper binary should be. The binary name (and other options) are then set via a kind: CredentialProviderConfig config, which is passed to kubelet with the --image-credential-provider-config parameter.
I think for the first step, providing the ecr-credential-provider binary should be enough, because everything else can be configured in the Talos config.
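For illustration, a minimal CredentialProviderConfig along the lines of the upstream kubelet docs (the matchImages pattern and cache duration here are just example values):

```yaml
# Passed to kubelet via --image-credential-provider-config; the
# ecr-credential-provider binary must live in the directory given
# by --image-credential-provider-bin-dir.
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
  - name: ecr-credential-provider
    matchImages:
      - "*.dkr.ecr.*.amazonaws.com"   # example: match ECR registries
    defaultCacheDuration: "12h"
    apiVersion: credentialprovider.kubelet.k8s.io/v1
```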
I'm not 100% sure if this is the right place to report the issue, feel free to point me in the right direction if not :)
I am running a 2-node cluster based on Talos 1.4.5, and I've enabled the nvidia container toolkit extension by following the official Talos tutorial (changing all mentions of Talos version v1.4.4 to v1.4.5, and using version v0.14.0 of the nvidia device plugin).
The nodes both contain a RTX 2080 GPU
All conventional CUDA-based container workloads I've tested run without any problems; nvidia-smi from inside a container also reports the proper GPU information. However, whenever I add graphics to the NVIDIA_DRIVER_CAPABILITIES env variable, container creation fails with the following status:
failed to create containerd task:
failed to create shim task:
OCI runtime create failed:
failed to create NVIDIA Container Runtime:
failed to construct OCI spec modifier:
failed to construct discoverer:
failed to create mounts discoverer:
failed to construct library locator:
error loading ldcache:
open /etc/ld.so.cache: no such file or directory: unknown
I am honestly at a loss here; if anyone has any input on why this might be happening, I would greatly appreciate the help.
Hi all,
I've deployed a cluster using Talos via the Terraform provider. Currently the cluster has only a single node. Now I am trying to set up the piraeus-operator according to this how-to: https://github.com/piraeusdatastore/piraeus-operator/blob/v2/docs/how-to/talos.md. I have already adapted the machine config patch to the following:
machine:
install:
extensions:
- image: ghcr.io/siderolabs/drbd:9.2.4-v1.4.6
kernel:
modules:
- name: drbd
parameters:
- usermode_helper=disabled
- name: drbd_transport_tcp
Executing the two commands for validation returned the following:
talosctl read /proc/modules
virtio_balloon 24576 - - Live 0xffffffffc02bf000
virtio_pci 24576 - - Live 0xffffffffc02a2000
virtio_pci_legacy_dev 16384 - - Live 0xffffffffc0292000
virtio_pci_modern_dev 16384 - - Live 0xffffffffc027a000
talosctl read /sys/module/drbd/parameters/usermode_helper
1 error occurred:
rpc error: code = Unknown desc = stat /sys/module/drbd/parameters/usermode_helper: no such file or directory
So it looks like something is wrong. I checked the logs, which contained the following two lines:
talosctl dmesg
REMOVED: user: warning: [2023-07-25T10:02:44.873296699Z]: [talos] apply config request: mode auto(no_reboot)
REMOVED: user: warning: [2023-07-25T10:03:13.119427699Z]: [talos] controller failed {"component": "controller-runtime", "controller": "runtime.KernelModuleSpecController", "error": "error loading module \x5c"drbd\x5c": module not found"}
error loading module \x5c"drbd\x5c": module not found
looks a bit strange to me. I've already checked my machine config patch, rewrote it by hand, reapplied the config, and rebooted the node. The error still comes up about every minute and the validation still fails.
Now I have no idea what else I can check or how to solve the issue.
talosctl version
:
Client:
Tag: v1.4.6
SHA: 8615b213
Built:
Go version: go1.20.5
OS/Arch: linux/amd64
Server:
NODE: REMOVED
Tag: v1.4.6
SHA: 8615b213
Built:
Go version: go1.20.5
OS/Arch: linux/amd64
Enabled: RBAC
I know it’s a very niche thing, but for days I’ve been trying to get one of my nodes to provision a Matter device. Bluetooth is needed during the commissioning process. It would be great if BLE could be provided via an extension. Nevertheless, thanks for this great and stable operating system.
Following #197, I'm also trying to run zfs-localpv, and after enabling the extension and enabling the module in the machine configuration, I have no /usr/local/sbin/zfs or zpool in my host directory. If I install zfs in my container from apt or apk, it works fine after binding /dev. But zfs-localpv expects zfs to be available on the host filesystem. What am I doing wrong that zfs is missing from my Talos filesystem?
❯ talosctl -n 192.168.178.93 get extensions
NODE NAMESPACE TYPE ID VERSION NAME VERSION
192.168.178.93 runtime ExtensionStatus 000.ghcr.io-siderolabs-qemu-guest-agent-8.0.2@sha256-79979c41a8acfc51b2651d6d44bf187b8e7164e6c649f4e39b9cb699fed8917b 1 qemu-guest-agent 8.0.2
192.168.178.93 runtime ExtensionStatus 001.ghcr.io-siderolabs-iscsi-tools-v0.1.4@sha256-58cbadb0a315d83e04b240de72c5ac584e9c63b690b2c4fcd629dd1566c3a7a1 1 iscsi-tools v0.1.4
192.168.178.93 runtime ExtensionStatus 002.ghcr.io-siderolabs-zfs-2.1.12-v1.5.4@sha256-df5674ed21a97c96ec4903ebdbd7318bb8ab5d18fbf10070139f84bf72a94d2d 1 zfs 2.1.12-v1.5.4
192.168.178.93 runtime ExtensionStatus modules.dep 1 modules.dep 6.1.58-talos
❯ talosctl -n 192.168.178.93 cat /proc/modules
zfs 3727360 - - Live 0xffffffffc0510000 (PO)
zunicode 335872 - - Live 0xffffffffc04bd000 (PO)
zzstd 581632 - - Live 0xffffffffc042e000 (O)
zlua 180224 - - Live 0xffffffffc0401000 (O)
zavl 16384 - - Live 0xffffffffc03f7000 (PO)
icp 307200 - - Live 0xffffffffc03ab000 (PO)
zcommon 86016 - - Live 0xffffffffc0395000 (PO)
znvpair 77824 - - Live 0xffffffffc0363000 (PO)
spl 90112 - - Live 0xffffffffc037e000 (O)
virtio_pci 24576 - - Live 0xffffffffc035c000
virtio_pci_legacy_dev 16384 - - Live 0xffffffffc0379000
virtio_pci_modern_dev 16384 - - Live 0xffffffffc0344000
❯ talosctl -n 192.168.178.93 list -l /usr/local/sbin
NODE MODE UID GID SIZE(B) LASTMOD NAME
192.168.178.93 drwxr-xr-x 0 0 278 Jan 20 2022 19:35 .
192.168.178.93 Lrwxrwxrwx 0 0 8 Jan 20 2022 19:35 brcm_iscsiuio -> iscsiuio
192.168.178.93 -rwxr-xr-x 0 0 5559 Jan 20 2022 19:35 iscsi-gen-initiatorname
192.168.178.93 -rwxr-xr-x 0 0 14240 Jan 20 2022 19:35 iscsi-iname
192.168.178.93 -rwxr-xr-x 0 0 5293 Jan 20 2022 19:35 iscsi_discovery
192.168.178.93 -rwxr-xr-x 0 0 222 Jan 20 2022 19:35 iscsi_fw_login
192.168.178.93 -rwxr-xr-x 0 0 9616 Jan 20 2022 19:35 iscsi_offload
192.168.178.93 -rwxr-xr-x 0 0 315184 Jan 20 2022 19:35 iscsiadm
192.168.178.93 -rwxr-xr-x 0 0 327928 Jan 20 2022 19:35 iscsid
192.168.178.93 -rwxr-xr-x 0 0 2376469 Jan 20 2022 19:35 iscsid-wrapper
192.168.178.93 -rwxr-xr-x 0 0 294432 Jan 20 2022 19:35 iscsistart
192.168.178.93 -rwxr-xr-x 0 0 146880 Jan 20 2022 19:35 iscsiuio
192.168.178.93 -rwxr-xr-x 0 0 61640 Jan 20 2022 19:35 tgtadm
192.168.178.93 -rwxr-xr-x 0 0 1219040 Jan 20 2022 19:35 tgtd
192.168.178.93 -rwxr-xr-x 0 0 59520 Jan 20 2022 19:35 tgtimg
There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.
Location: ^(?<major>\d+)\.(?<minor>\d+)\.?(?<patch>(?!9000$)\d+)?$
Error type: Invalid regular expression: ^(?\d+).(?\d+).?(?(?!9000$)\d+)?$
I'm trying to create a PV using an iSCSI target on my NAS. I'm using Talos v1.4.5 and iscsi-tools v0.1.4. I can confirm that iscsiadm and friends are present on the host under /usr/local/sbin. I've created a PV and a PVC. When a Pod using that PVC starts, it cannot mount the PV; the error message from kubelet says that it cannot find the binary in $PATH.
I've checked the k8s source code to see where that error is coming from and saw that it tries to use iscsiadm, so I assume this is the binary it cannot find. I've tried adding /usr/local/sbin to extraMounts for kubelet, but that did not help. What else can I try?
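For reference, this is the shape of the extraMounts patch I tried (a sketch only; it did not solve the problem for me, and the mount options are my guess):

```yaml
# Bind-mount the host's /usr/local/sbin into the kubelet container
# so kubelet can find iscsiadm; did not help in my case.
machine:
  kubelet:
    extraMounts:
      - destination: /usr/local/sbin
        type: bind
        source: /usr/local/sbin
        options:
          - bind
          - ro
```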
System:
Talos: v1.3.3
K8s: 1.26.1
GPU: RTX 2080S
Nvidia driver versions:
nvidia-open-gpu-kernel-modules:515.65.01-v1.3.3
nvidia-container-toolkit:515.65.01-v1.11.0
I followed the directions in the documentations for OSS drivers (including creating the runtime and verifying the drivers were successfully installed).
When I run the following command helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.11.0 --set=runtimeClassName=nvidia
(to enable the nvidia device plugin) I get the following in the logs on the node:
Open nvidia.ko is only ready for use on Data Center GPUs.
To force use of Open nvidia.ko on other GPUs, see the
'OpenRmEnableUnsupportedGpus' kernel module parameter described
in the README
GPU 0000:07:00.0: RmInitAdapter failed!
Additionally the pod fails to start in k8s.
I found this is due to the following: because I'm not running a datacenter GPU, I need to run:
modprobe nvidia NVreg_OpenRmEnableUnsupportedGpus=1
or, in an /etc/modprobe.d/ configuration file:
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
In fact the README says the following (it appears this is needed for support on many GPUs):
https://github.com/NVIDIA/open-gpu-kernel-modules#compatible-gpus
To enable use of the open kernel modules on GeForce and Workstation GPUs, set the "NVreg_OpenRmEnableUnsupportedGpus" nvidia.ko kernel module parameter to 1. For more details, see the NVIDIA GPU driver end user README here:
However, I'm not sure how to do this via the TalosCLI.
Would it be possible to support this feature in the future on the OSS drivers? Is there anyway to set that parameter on extension install or have it as a talos feature? I would be happy to implement it but may need a pointer in the right direction.
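Since Talos supports passing kernel module parameters in the machine config (the drbd examples elsewhere in this tracker use the same mechanism), something along these lines might work, though I have not verified it against the nvidia extension:

```yaml
# Untested sketch: set NVreg_OpenRmEnableUnsupportedGpus=1 on the open
# nvidia module via machine.kernel.modules parameters.
machine:
  kernel:
    modules:
      - name: nvidia
        parameters:
          - NVreg_OpenRmEnableUnsupportedGpus=1
```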
Hello!
My situation is almost identical to #33 - I've got a cluster of raspberry pi 4's and a Synology NAS that I'm trying to use as a storage backend for my PVCs, and I can't use NFS due to the limitations of file locking SQLite databases on network filesystems. I'm using the synology-csi driver to handle creation, deletion, and connection of iSCSI-backed LUNs on my NAS as PersistentVolumes in my cluster.
The initial setup went smoothly, with the driver able to create and delete PersistentVolumes and their backing LUNs with no issue on my NAS. I now suspect that it was accomplishing that via the Synology API and not via iSCSI, as I have not actually been able to mount the volume via iSCSI.
Here's my talosctl version
:
❯ talosctl version
Client:
Tag: v1.0.0
SHA: 80167fd2
Built:
Go version: go1.17.8
OS/Arch: darwin/amd64
Server:
NODE: 192.168.4.9
Tag: v1.1.0-alpha.1-65-gafb679586
SHA: afb67958
Built:
Go version: go1.18.2
OS/Arch: linux/arm64
Enabled: RBAC
Originally, this was the event trace when trying to mount a volume to a pod:
❯ kubectl describe pod pvc-inspector
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 86s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
Warning FailedScheduling 85s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
Normal Scheduled 83s default-scheduler Successfully assigned synology-csi/pvc-inspector to pi-sidero
Normal SuccessfulAttachVolume 83s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-1969eb78-dbb4-43e1-9659-6b6bdff9167b"
Warning FailedMount 10s (x8 over 76s) kubelet MountVolume.MountDevice failed for volume "pvc-1969eb78-dbb4-43e1-9659-6b6bdff9167b" : rpc error: code = Internal desc = rpc error: code = Internal desc = Failed to login with target iqn [iqn.2000-01.com.synology:<hostname>.pvc-1969eb78-dbb4-43e1-9659-6b6bdff9167b], err: chroot: can't execute '/usr/bin/env': No such file or directory
(exit status 127)
I dug around and determined it was due to iscsiadm being aliased to this code snippet (see the Dockerfile): https://github.com/SynologyOpenSource/synology-csi/blob/fc3359223fe51a13bcfa5a7cabbf59611bbeb901/chroot/chroot.sh#L4-L9
It looks like synology-csi just mounts the whole host filesystem into /host (source), and uses chroot to emulate calling iscsiadm from the host environment. Since there's no /usr/bin/env in the Talos environment, the command fails. I changed it to directly link to the iscsiadm binary contained in the iscsi-tools extension, like so:
exec chroot $DIR "/usr/local/sbin/$BIN" "$@"
That fixed the issue of not actually being able to run iscsiadm, but I'm now running into this error:
❯ kubectl describe pod pvc-inspector
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m19s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
Warning FailedScheduling 3m18s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
Warning FailedScheduling 117s default-scheduler 0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Normal Scheduled 11s default-scheduler Successfully assigned synology-csi/pvc-inspector to pi-sidero
Normal SuccessfulAttachVolume 11s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-4ad7b598-50a7-423e-84c5-d4f0c4edb083"
Warning FailedMount 1s (x2 over 2s) kubelet MountVolume.MountDevice failed for volume "pvc-4ad7b598-50a7-423e-84c5-d4f0c4edb083" : rpc error: code = Internal desc = rpc error: code = Internal desc = Failed to login with target iqn [iqn.2000-01.com.synology:<hostname>.pvc-4ad7b598-50a7-423e-84c5-d4f0c4edb083], err: iscsiadm: Could not make /etc/iscsi/ 30
iscsiadm: exiting due to idbm configuration error
(exit status 6)
From what I understand, the /etc/iscsi folder should be bound to /system/iscsi; see extensions/storage/iscsi-tools/iscsid.yaml, lines 41 to 47 at 7cf2843.
The iscsi-tools extension seems to be working for the most part:
- talosctl ls /usr/local/sbin returns a list of iscsi-related binaries
- talosctl get extensions lists the correct information for iscsi-tools
- iscsiadm --help works in the synology-csi container that mounts the host fs
However, the /system/iscsi directory is not being bound correctly to /etc/iscsi:
❯ talosctl ls system
NODE NAME
192.168.4.9 .
192.168.4.9 config
192.168.4.9 etc
192.168.4.9 iscsi
192.168.4.9 libexec
192.168.4.9 overlays
192.168.4.9 run
192.168.4.9 secrets
192.168.4.9 state
192.168.4.9 var
❯ talosctl ls etc
NODE NAME
192.168.4.9 .
192.168.4.9 ca-certificates
192.168.4.9 cni
192.168.4.9 containerd
192.168.4.9 cri
192.168.4.9 extensions.yaml
192.168.4.9 hosts
192.168.4.9 kubernetes
192.168.4.9 localtime
192.168.4.9 lvm
192.168.4.9 machine-id
192.168.4.9 os-release
192.168.4.9 pki
192.168.4.9 resolv.conf
192.168.4.9 ssl
192.168.4.9 ssl1.1
Is there something I need to be doing in the machineconfig to get this directory bound as expected?
192.168.251.103: user: warning: [2023-07-11T14:52:13.527141505Z]: 2023/07/11 14:52:14 running pre-flight checks
192.168.251.103: user: warning: [2023-07-11T14:52:13.531414505Z]: 2023/07/11 14:52:14 host Talos version: v1.4.6
192.168.251.103: user: warning: [2023-07-11T14:52:13.538455505Z]: 2023/07/11 14:52:14 host Kubernetes versions: kubelet: 1.26.4, kube-apiserver: 1.26.4, kube-scheduler: 1.26.4, kube-controller-manager: 1.26.4
192.168.251.103: user: warning: [2023-07-11T14:52:13.541381505Z]: 2023/07/11 14:52:14 all pre-flight checks successful
192.168.251.103: user: warning: [2023-07-11T14:52:13.542369505Z]: 2023/07/11 14:52:14 discovered system extensions:
192.168.251.103: user: warning: [2023-07-11T14:52:13.543243505Z]: 2023/07/11 14:52:14 NAME VERSION AUTHOR
192.168.251.103: user: warning: [2023-07-11T14:52:13.544191505Z]: 2023/07/11 14:52:14 iscsi-tools v0.1.4 Sidero Labs
192.168.251.103: user: warning: [2023-07-11T14:52:13.545380505Z]: 2023/07/11 14:52:14 qemu-guest-agent 8.0.2 Markus Reiter
192.168.251.103: user: warning: [2023-07-11T14:52:13.546809505Z]: 2023/07/11 14:52:14 validating system extensions
192.168.251.103: user: warning: [2023-07-11T14:52:13.547854505Z]: Error: error validating extension "qemu-guest-agent": version constraint >= v1.5.0 can't be satisfied with Talos version 1.4.6
192.168.251.103: user: warning: [2023-07-11T14:52:13.550655505Z]: Usage:
192.168.251.103: user: warning: [2023-07-11T14:52:13.551176505Z]: installer install [flags]
192.168.251.103: user: warning: [2023-07-11T14:52:13.551895505Z]:
192.168.251.103: user: warning: [2023-07-11T14:52:13.552153505Z]: Flags:
192.168.251.103: user: warning: [2023-07-11T14:52:13.552481505Z]: -h, --help help for install
192.168.251.103: user: warning: [2023-07-11T14:52:13.553194505Z]:
192.168.251.103: user: warning: [2023-07-11T14:52:13.553655505Z]: Global Flags:
192.168.251.103: user: warning: [2023-07-11T14:52:13.554332505Z]: --arch string The target architecture (default "amd64")
192.168.251.103: user: warning: [2023-07-11T14:52:13.556241505Z]: --board string The value of talos.board (default "none")
192.168.251.103: user: warning: [2023-07-11T14:52:13.558227505Z]: --bootloader Install a booloader to the specified disk (default true)
192.168.251.103: user: warning: [2023-07-11T14:52:13.560454505Z]: --config string The value of talos.config
192.168.251.103: user: warning: [2023-07-11T14:52:13.562169505Z]: --disk string The path to the disk to install to
192.168.251.103: user: warning: [2023-07-11T14:52:13.562178505Z]: --extra-kernel-arg stringArray Extra argument to pass to the kernel
192.168.251.103: user: warning: [2023-07-11T14:52:13.562196505Z]: --force Indicates that the install should forcefully format the partition
192.168.251.103: user: warning: [2023-07-11T14:52:13.562204505Z]: --meta metaValueSlice A key/value pair for META (default [])
192.168.251.103: user: warning: [2023-07-11T14:52:13.562210505Z]: --platform string The value of talos.platform
192.168.251.103: user: warning: [2023-07-11T14:52:13.562217505Z]: --upgrade Indicates that the install is being performed by an upgrade
192.168.251.103: user: warning: [2023-07-11T14:52:13.562224505Z]: --zero Indicates that the install should write zeros to the disk before installing
192.168.251.103: user: warning: [2023-07-11T14:52:13.562231505Z]:
192.168.251.103: user: warning: [2023-07-11T14:52:13.562238505Z]: error validating extension "qemu-guest-agent": version constraint >= v1.5.0 can't be satisfied with Talos version 1.4.6
Hi,
I wanted to test the Piraeus/Linstor CSI, which needs the drbd extension, on Talos v1.4.4 (upgraded this morning) on my test cluster. I patched a test node's machineconfig according to the piraeus documentation concerning Talos Linux (before patching the whole cluster's workers):
machine:
install:
extensions:
- image: ghcr.io/siderolabs/drbd:9.2.2-v1.4.4
kernel:
modules:
- name: drbd
parameters:
- usermode_helper=disabled
- name: drbd_transport_tcp
and got this error in dashboard's log:
On a talosctl list -n test_node /lib/modules/6.1.28-talos (the actual kernel used by my cluster VMs), there is no extras folder. But on a talosctl list -n test_node /lib/modules/6.1.27-talos (the previous kernel), the extras/drbd.ko and extras/drbd_transport_tcp.ko kernel modules are present.
Should I force the use of the 6.1.27 kernel, or wait for the next drbd extension version?
Sorry for my approximate English.
Best regards, and thank you for making Talos Linux.
Edit: I replaced the test node with a fresh node and applied the machineconfig; same behavior, the drbd modules get installed in /lib/modules/6.1.27-talos/extras instead of /lib/modules/6.1.28-talos/extras.
Some extensions, like open-iscsi and nvidia-toolkit, need multiple tools/programs packaged together, where each one might have a different version. Currently we tag some extensions with the versions of the different tools separated by a dash.
I propose adding support in the extension spec to define the version of each tool/program separately, with a separate versioning scheme for the extension itself.
version: v1alpha1
metadata:
  name: <extension name>
  # this keeps the current versioning scheme for the overall extension
  version: <version of the package the extension installs>-<version of the extensions repo (tracks with talos version)>
  tools:
    <foo>: <version>
    <bar>: <version>
  author: Andrew Rynhard
  description: |
    <detailed description of the extension/package>
  ## The compatibility section is "optional" but highly recommended to specify a Talos version that
  ## has been tested and known working for this extension.
  compatibility:
    talos:
      version: ">= v1.0.0"
This would also allow showing the versions of the tools shipped with an extension when running talosctl get extensions.
There is a problem installing the iscsi-tools extension.
Upgrade command
talosctl --talosconfig=./clusterconfig/talosconfig upgrade --nodes 192.168.x.x -p -i ghcr.io/siderolabs/installer:v1.2.7
Extension config
...
install:
  disk: /dev/sda
  image: ghcr.io/siderolabs/installer:v1.2.7
  extensions:
    - image: ghcr.io/siderolabs/iscsi-tools:v0.1.2
  bootloader: true
  wipe: false
...
Error Messages
[ 163.035975] 2022/12/03 08:15:14 discovered system extensions:
[ 163.036500] 2022/12/03 08:15:14 NAME VERSION AUTHOR
[ 163.036850] 2022/12/03 08:15:14 iscsi-tools v0.1.2 Sidero Labs
[ 163.037221] 2022/12/03 08:15:14 validating system extensions
[ 163.037569] Error: error validating extension "iscsi-tools": path "/etc/logrotate.d/iscsiuiolog" is not allowed in extensions
[ 163.038296] Usage:
[ 163.038433] installer install [flags]
[ 163.038675]
[ 163.038775] Flags:
[ 163.038911] -h, --help help for install
[ 163.039179]
[ 163.039279] Global Flags:
[ 163.039451] --arch string The target architecture (default "amd64")
[ 163.039962] --board string The value of talos.board (default "none")
[ 163.040456] --bootloader Install a booloader to the specified disk (default true)
[ 163.041022] --config string The value of talos.config
[ 163.041435] --disk string The path to the disk to install to
[ 163.041898] --extra-kernel-arg stringArray Extra argument to pass to the kernel
[ 163.042378] --force Indicates that the install should forcefully format the partition
[ 163.042986] --platform string The value of talos.platform
[ 163.043415] --upgrade Indicates that the install is being performed by an upgrade
[ 163.043984] --zero Indicates that the install should write zeros to the disk before installing
[ 163.044631]
[ 163.044791] error validating extension "iscsi-tools": path "/etc/logrotate.d/iscsiuiolog" is not allowed in extensions
Hey Sidero Labs community,
I am trying to build an extension for runu (part of kraftkit), a container runtime for unikernels. I want to use it to run unikernels on my Raspberry Pis.
Unfortunately, I am not able to get the extension to build.
I started by trying to cross-compile the runu binary on linux/amd64, but ran into issues due to the nsenter dependency, which is a CGO library from runc.
However, I was able to build runu natively on my Raspberry Pi 5, which is currently good enough for me:
$ GOARCH=arm64 CGO_ENABLED=1 DOCKER="" make runu
My problem now is that I can't get the extension to build.
You can find my current work-in-progress in my fork.
It builds fine for linux/amd64 on my laptop (also linux/amd64), but fails on my Pi 5 with the following error:
$ PLATFORM=linux/arm64 make target-runu
...
85.44 GOOS=linux \
85.44 GOARCH=arm64 \
85.44 go build \
85.44 -buildmode=pie \
85.44 -gcflags=all='' \
85.44 -ldflags='-s -w -X "kraftkit.sh/internal/version.version=-dirty" -X "kraftkit.sh/internal/version.commit=" -X "kraftkit.sh/internal/version.buildTime=Mon Feb 5 22:17:04 UTC 2024"' \
85.44 -o /go/src/github.com/unikraft/kraftkit/dist/runu \
85.44 /go/src/github.com/unikraft/kraftkit/cmd/runu
203.8 # kraftkit.sh/cmd/runu
203.8 /toolchain/go/pkg/tool/linux_arm64/link: running gcc failed: exit status 1
203.8 collect2: fatal error: cannot find 'ld'
203.8 compilation terminated.
203.8
203.9 make: *** [Makefile:128: runu] Error 1
I am not sure how to debug this issue. ld is installed in the build image (ghcr.io/siderolabs/base:v1.7.0-alpha.0-13-gf376a53), but gcc seems to be unable to find it during the build (as far as I can tell).
I hope I included all necessary information.
Any input on this issue would be appreciated, as I am a little out of my depth on this one.
P.S.: I asked the unikraft community if they could provide a precompiled version of runu for arm64, and they told me that they will look into it.
Talos v1.5.5
nonfree-kmod-nvidia:535.54.03-v1.5.5
nvidia-container-toolkit:535.54.03-v1.13.5
Hardware: NUC11PHKi7CAA (UEFI boot)
64g RAM
After performing the initial installation with the vanilla metal-amd64.iso, I apply the configuration with the nvidia extensions and kernel modules, all of which work well. But when I perform a reboot with the -m powercycle option, the machine won't boot, with the following error:
Discussing with frezbo, he mentioned this grub issue as a possible cause: https://bugs.launchpad.net/oem-priority/+bug/1842320
I performed other tests in proxmox running on the same hardware model by switching between UEFI and BIOS on VMs with the NVIDIA card in PCI passthrough but I was not able to reproduce the issue.
Machine | Boot Mode | Result
---|---|---
Proxmox VM | BIOS | Works
Proxmox VM | UEFI | Works
Physical | UEFI | grub OOM
Physical | BIOS | Not available for this hw
I'm using the iSCSI extension in my machine install section:
install:
  disk: /dev/sda
  image: ghcr.io/siderolabs/installer:v1.3.5
  bootloader: true
  wipe: true
  extensions:
    - image: ghcr.io/siderolabs/iscsi-tools:v0.1.4
This might actually be an issue specific to the iscsi-tools extension rather than Talos, but reporting it nonetheless.
For a new node this works great and as expected, but as soon as I restart the node I get this error:
spec: failed to generate spec: failed to mkdir "/usr/local/etc/iscsi": mkdir /usr/local/etc: read-only file system
I was expecting the node to behave the same after a restart; I suspect something is done during the initial install of the iscsi-tools extension that isn't recreated on a restart after installation. It's very frustrating having to re-roll nodes if they need a reboot.
Tag: v1.3.5
SHA: 03edf8c1
Built:
Go version: go1.19.6
OS/Arch: linux/amd64
Enabled: RBAC
Server Version: v1.26.1
I have a use case in which I need V4L modules in the kernel. I am using ffmpeg with a USB webcam attached to a nearby 3D printer and want to run an ffmpeg transcoding job from the webcam to both an Obico instance and an nginx-rtmp instance, for detecting print failures and accessing the stream remotely, respectively.
Given this is pretty niche, would Siderolabs accept a PR for a new extension for V4L similar to what is done for a few other kernel modules?
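If such an extension existed, loading the module would presumably follow the same pattern as the other kernel-module extensions in this thread. A hypothetical machine config patch (the extension image name and module name are assumptions, not a confirmed published image):

```yaml
machine:
  install:
    extensions:
      # hypothetical extension image name
      - image: ghcr.io/siderolabs/v4l-uvc-drivers:<talos version>
  kernel:
    modules:
      # assumed USB Video Class module such an extension would ship
      - name: uvcvideo
```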
This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.
These updates have all been created already. Click a checkbox below to force a retry/rebase of any.
git://git.kernel.org/pub/scm/libs/libcap/libcap.git, google/gvisor, https://github.com/qemu/qemu.git, https://gitlab.com/nvidia/container-toolkit/container-toolkit.git, https://gitlab.com/nvidia/container-toolkit/libnvidia-container.git, https://gitlab.gnome.org/GNOME/glib.git, kata-containers/kata-containers, nvidia/open-gpu-kernel-modules, tailscale/tailscale
https://github.com/qemu/qemu.git, nvidia/open-gpu-kernel-modules
.github/workflows/ci.yaml
kenchan0130/actions-system-info v1.3.0
actions/checkout v4
docker/setup-buildx-action v3
docker/login-action v3
actions/github-script v7
crazy-max/ghaction-github-release v2
kenchan0130/actions-system-info v1.3.0
actions/checkout v4
docker/setup-buildx-action v3
moby/buildkit v0.13.2
moby/buildkit v0.13.2
.github/workflows/slack-notify.yaml
slackapi/slack-github-action v1
.github/workflows/weekly.yaml
kenchan0130/actions-system-info v1.3.0
actions/checkout v4
docker/setup-buildx-action v3
moby/buildkit v0.13.2
examples/hello-world-service/src/go.mod
go 1.21
nvidia-gpu/nvidia-container-toolkit/nvidia-container-runtime-wrapper/go.mod
go 1.22
golang.org/x/sys v0.20.0
nvidia-gpu/nvidia-container-toolkit/nvidia-persistenced-wrapper/go.mod
go 1.22
golang.org/x/sys v0.20.0
storage/iscsi-tools/iscsid-wrapper/go.mod
go 1.22
golang.org/x/sys v0.20.0
container-runtime/vars.yaml
google/gvisor 20240325.0
containerd/stargz-snapshotter v0.15.1
kubernetes/cloud-provider-aws v1.30.0
containerd/runwasi v0.3.0
spinkube/containerd-shim-spin v0.14.1
kata-containers/kata-containers 3.3.0
firmware/vars.yaml
intel/Intel-Linux-Processor-Microcode-Data-Files 20240514
guest-agents/vars.yaml
https://github.com/qemu/qemu.git 8.2.3
https://gitlab.gnome.org/GNOME/glib.git 2.80.0
https://gitlab.com/xen-project/xen-guest-agent.git 0.4.0
siderolabs/talos-vmtoolsd v0.5.1
network/vars.yaml
tailscale/tailscale 1.64.2
nvidia-gpu/vars.yaml
nvidia/open-gpu-kernel-modules 535.129.03
https://gitlab.com/nvidia/container-toolkit/container-toolkit.git v1.14.6@5605d191332dcfeea802c4497360d60a65c7887e
https://gitlab.com/nvidia/container-toolkit/libnvidia-container.git v1.14.6@d2eb0afe86f0b643e33624ee64f065dd60e952d4
cgr.dev/chainguard/wolfi-base sha256:3d6dece13cdb5546cd03b20e14f9af354bc1a56ab5a7b47dca3e6c1557211fcf
https://sourceware.org/git/glibc.git 2.39
seccomp/libseccomp 2.5.5
git://git.kernel.org/pub/scm/libs/libcap/libcap.git 2.69
git://sourceware.org/git/elfutils.git 0.191
power/vars.yaml
networkupstools/nut 2.8.2
storage/iscsi-tools/vars.yaml
open-iscsi/open-iscsi 2.1.9
open-iscsi/open-isns 0.102
storage/vars.yaml
libfuse/libfuse 3.16.2
git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git 4.3
Pkgfile
git://linux-nfs.org/~steved/libtirpc 1-3-3
madler/zlib 1.3.1
Pkgfile
siderolabs/bldr v0.3.1
We don't need the Talos version to be part of the extension version if the extension doesn't contain kernel modules.
This will help with signing releases (fewer signatures to publish).
Hi,
Apologies, I don't really know how to debug this, but on my rock64, when using the DRBD extension, I seem to be missing the drbd module:
talosctl --talosconfig talosconfig -n mynode list /lib/modules/6.1.58-talos/extras
NODE NAME
1 error occurred:
rpc error: code = Unknown desc = lstat /lib/modules/6.1.58-talos/extras: no such file or directory
The same config deployed on x86 machines does have that directory populated with the .ko files as expected.
I tried using the tag and also the specific arm hash from here to be sure, but no luck when "upgrading" to the same version to rebuild the initramfs.
I can't access the display for that node; not sure which service log might explain why this is failing?
Thanks
We should take a look at creating an estargz extension for Talos if possible. Customers have a need for this type of image loading, and it could have real value for others pulling big images as well. An extension feels like a nice way to tackle this, as opposed to baking it into Talos, since I think it's just a matter of enabling some containerd plugins.
Hi team,
It looks like multipath-tools is not available in the iscsi-tools extension. I searched for multipath references everywhere, to no avail.
When mounting an iSCSI target using multipath (via democratic-csi), I get failed to discover multipath device. However, I can see that the two disk references are successfully mounted:
{"host":"kubeworker03","level":"info","message":"new request - driver: ControllerSynologyDriver method: NodeStageVolume call: {\"metadata\":{\"x-forwarded-host\":[\"localhost\"],\"user-agent\":[\"grpc-go/1.49.0\"]},\"request\":{\"publish_context\":{},\"secrets\":\"redacted\",\"volume_context\":{\"lun\":\"1\",\"node_attach_driver\":\"iscsi\",\"portal\":\"\",\"portals\":\"nas01-iscsi0.i,nas01-iscsi1.i\",\"provisioner_driver\":\"synology-iscsi\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1679431778041-8081-synology-iscsi\",\"interface\":\"\",\"iqn\":\"iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d\"},\"volume_id\":\"pvc-83c2d084-36e9-4b31-86f2-bca6df03770d\",\"staging_target_path\":\"/var/lib/kubelet/plugins/kubernetes.io/csi/synology-iscsi/4db6d710ff1e3fb3e3cd58f0a4fb57b0c962a6dd773ba289f292dbf2b48d2c61/globalmount\",\"volume_capability\":{\"access_mode\":{\"mode\":\"SINGLE_NODE_MULTI_WRITER\"},\"mount\":{\"mount_flags\":[],\"fs_type\":\"\",\"volume_mount_group\":\"\"},\"access_type\":\"mount\"}},\"cancelled\":false}","service":"democratic-csi","timestamp":"2023-03-21T21:31:11.735Z"}
executing iscsi command: iscsiadm -m node -T iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d -p nas01-iscsi0.i:3260 -o new
executing iscsi command: iscsiadm -m node -T iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d -p nas01-iscsi0.i:3260 -o update --name node.startup --value manual
executing iscsi command: iscsiadm -m node -T iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d -p nas01-iscsi0.i:3260 -l
executing iscsi command: iscsiadm -m session
executing iscsi command: iscsiadm -m session -r 5 --rescan
executing filesystem command: realpath /dev/disk/by-path/ip-nas01-iscsi0.i:3260-iscsi-iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d-lun-1
{"host":"kubeworker03","level":"info","message":"successfully logged into portal nas01-iscsi0.i:3260 and created device /dev/disk/by-path/ip-nas01-iscsi0.i:3260-iscsi-iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d-lun-1 with realpath /dev/sdf","service":"democratic-csi","timestamp":"2023-03-21T21:31:11.948Z"}
executing iscsi command: iscsiadm -m node -T iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d -p nas01-iscsi1.i:3260 -o new
executing iscsi command: iscsiadm -m node -T iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d -p nas01-iscsi1.i:3260 -o update --name node.startup --value manual
executing iscsi command: iscsiadm -m node -T iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d -p nas01-iscsi1.i:3260 -l
executing iscsi command: iscsiadm -m session
executing iscsi command: iscsiadm -m session -r 6 --rescan
executing filesystem command: realpath /dev/disk/by-path/ip-nas01-iscsi1.i:3260-iscsi-iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d-lun-1
{"host":"kubeworker03","level":"info","message":"successfully logged into portal nas01-iscsi1.i:3260 and created device /dev/disk/by-path/ip-nas01-iscsi1.i:3260-iscsi-iqn.2000-01.com.synology:csi.k8s-pvc-83c2d084-36e9-4b31-86f2-bca6df03770d-lun-1 with realpath /dev/sdg","service":"democratic-csi","timestamp":"2023-03-21T21:31:12.196Z"}
executing filesystem command: sh -c for file in $(ls -la /dev/mapper/* | grep "\->" | grep -oP "\-> .+" | grep -oP " .+"); do echo $(F=$(echo $file | grep -oP "[a-z0-9-]+");echo $F":"$(ls "/sys/block/${F}/slaves/");); done;
executing filesystem command: sh -c for file in $(ls -la /dev/mapper/* | grep "\->" | grep -oP "\-> .+" | grep -oP " .+"); do echo $(F=$(echo $file | grep -oP "[a-z0-9-]+");echo $F":"$(ls "/sys/block/${F}/slaves/");); done;
{"host":"kubeworker03","level":"error","message":"handler error - driver: ControllerSynologyDriver method: NodeStageVolume error: {\"name\":\"GrpcError\",\"code\":2,\"message\":\"failed to discover multipath device\"}","service":"democratic-csi","timestamp":"2023-03-21T21:31:14.221Z"}
On the democratic-csi side, nsenter support for the multipath executable needs to be added as well (see the original PR for supporting iscsiadm here: democratic-csi/democratic-csi#225).
Thanks for taking this request into consideration :)
Cheers !
After upgrading my nvidia node to Talos 1.3.0 and recompiling and applying the nvidia proprietary driver, my nvidia-device-plugin deployment no longer starts, with the following error:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to generate new device filter program from existing programs: unable to create new device filters program: load program: invalid argument: last insn is not an exit or jmp
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0: unknown
My research on this error seems to point to cgroups v2 compatibility, and indeed reverting one of the nodes to cgroups v1 allowed the container to start.
libnvidia-container appears to support cgroups v2 since v1.8.0, and v1.8.1 brings a fix to cgroup detection (NVIDIA/libnvidia-container#158).
I have to admit this is getting a bit beyond my scope of knowledge, but seeing as there are references to permissions and capabilities, I figured it may be related, since Talos locks down a lot of things.
We might benefit from a better directory structure:
base/
  musl (under base/; better if we could use the one from pkgs?)
container-runtime/
firmware/
(note: bldr doesn't care about directory structure, it finds pkg.yaml by name)
Error message:
172.20.0.2: {"error":"failed to create containerd task: failed to create shim task: OCI runtime create failed: creating container: Rel: can't make podruntime/runtime relative to /: unknown","level":"error","msg":"RunPodSandbox for \u0026PodSandboxMetadata{Name:nginx-gvisor,Uid:540e0e84-7090-48e9-838b-8d7d0965e301,Namespace:default,Attempt:0,} failed, error","time":"2022-01-25T14:24:32.902977189Z"}
It seems to be related to the cgroups handling code, as podruntime/runtime is the cgroup of the cri process.
I traced the error back to: https://github.com/google/gvisor/blob/fa4e2fff8a842dcfe6518d8bf9cdbd94750780cc/runsc/cgroup/cgroup.go#L292
Error message starting a node and viewing logs with talosctl dmesg:
192.168.1.153: user: warning: [2022-05-06T03:40:35.396887679Z]: [talos] service[ext-iscsid](Preparing): Running pre state
192.168.1.153: user: warning: [2022-05-06T03:40:35.474317679Z]: [talos] service[ext-iscsid](Preparing): Creating service runner
192.168.1.153: user: warning: [2022-05-06T03:40:35.558298679Z]: [talos] service[ext-iscsid](Failed): Failed to create runner: mkdir /usr/local/etc/iscsi/iscsid.conf: not a directory
talosctl version:
Client:
Tag: v1.0.0
SHA: 80167fd2
Built:
Go version: go1.17.8
OS/Arch: darwin/amd64
Server:
NODE: 192.168.1.153
Tag: v1.0.3
SHA: 689c6e54
Built:
Go version: go1.17.7
OS/Arch: linux/amd64
Enabled: RBAC
Part of the configuration:
version: v1alpha1
machine:
  type: controlplane
  kubelet:
    image: ghcr.io/siderolabs/kubelet:v1.23.6
  install:
    extensions:
      - image: ghcr.io/siderolabs/intel-ucode:20220419
      - image: ghcr.io/siderolabs/iscsi-tools:v0.1.0
Currently blocked on kata-containers/kata-containers#927, as Talos is only cgroups v2.
The PR that introduced this makes it seem like it is the Intel graphics driver for the iGPU; however, it's listed under firmware rather than drivers. Is this required to use the iGPU in pods running on Talos, or does it update the actual microcode for the iGPU? It's unclear, and there's no mention anywhere of what most packages in this repo do.
I noticed my nodes infinitely looping, so I decided to check the logs. It seems ext-iscsid is unable to find libcrypto for some reason:
10.10.11.7: Error loading shared library libcrypto.so.3: No such file or directory (needed by /usr/local/sbin/iscsid)
10.10.11.7: Error loading shared library libcrypto.so.3: No such file or directory (needed by /usr/local/lib/libisns.so.0)
I installed this extension and tried to run a test pod with the estargz image:
apiVersion: v1
kind: Pod
metadata:
  name: nodejs
spec:
  containers:
    - name: nodejs-stargz
      image: ghcr.io/stargz-containers/node:17.8.0-esgz
      command: ["node"]
      args:
        - -e
        - |
          var http = require('http');
          http.createServer(function(req, res) {
            res.writeHead(200);
            res.end('Hello World!\n');
          }).listen(80);
      ports:
        - containerPort: 80
Getting this error:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "node": executable file not found in $PATH: unknown
I also see these errors in the ext-stargz-snapshotter service:
{"dir":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/68/fs","error":"specified path \"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/68/fs\" isn't a mountpoint","level":"debug","msg":"failed to unmount","time":"2023-10-06T15:35:52.587369547Z"}
{"error":null,"key":"k8s.io/91/extract-195307888--2O1 sha256:2414385fd51d34e07d564ec6041ee66de902424f028528bce52743d92b1bc875","level":"info","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/64/fs","msg":"[fusermount fusermount3] not installed; trying direct mount","parent":"sha256:a3926353a4b2389bed133fe4b9f8bdb8439529ba6a965b37ef0c1a7921043a00","time":"2023-10-06T15:34:17.197856631Z"}
It would be nice to have a way to limit the amount of memory ZFS can use for the ARC. The default is 1/2 of installed system memory (Reference 1).
I have a 32G RAM worker in which 12G is currently being used by the ARC, and it randomly causes processes within pods to get OOM-killed.
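One possible approach, assuming the ZFS extension loads zfs as a regular kernel module, would be to pass the standard zfs_arc_max module parameter through the machine config, mirroring how the drbd parameters are set elsewhere in this thread. An untested sketch (the value is in bytes, here an 8 GiB cap):

```yaml
machine:
  kernel:
    modules:
      - name: zfs
        parameters:
          # assumption: the module accepts this parameter at load time
          - zfs_arc_max=8589934592
```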