
ironic-image's Introduction

Metal3 Ironic Container

This repo contains the files needed to build the Ironic images used by Metal3.

Build Status

CLOMonitor OpenSSF Scorecard Ubuntu daily main build status CentOS daily main build status

Description

When the repo is updated, builds are automatically triggered on https://quay.io/repository/metal3-io/ironic/.

This repo supports the creation of multiple containers needed when provisioning baremetal nodes with Ironic. Eventually there will be separate images for each container, but currently separate containers can share this same image with specific entry points.

The following entry points are provided:

  • runironic - Starts the ironic-conductor and ironic-api processes to manage the provisioning of baremetal nodes. Details on Ironic can be found at https://docs.openstack.org/ironic/latest/. This is the default entry point used by the Dockerfile.
  • rundnsmasq - Runs the dnsmasq DHCP server to provide addresses and initiate PXE boot of baremetal nodes. This includes a lightweight TFTP server. Details on dnsmasq can be found at http://www.thekelleys.org.uk/dnsmasq/doc.html.
  • runhttpd - Starts the Apache web server to provide images via http for PXE boot and for deployment of the final images.
  • runlogwatch - Waits for host provisioning ramdisk logs to appear, prints their contents, and deletes the files.

All of the containers must share a common mount point or data store. Ironic requires the files for both the TFTP server and the HTTP server to be stored in the same partition. This common store must include the following images under <shared store>/html/images:

  • ironic-python-agent.kernel
  • ironic-python-agent.initramfs
  • the final image to be deployed onto the node, in qcow2 format
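Laid out as a directory tree (the final image name is illustrative):

<shared store>/
└── html/
    └── images/
        ├── ironic-python-agent.kernel
        ├── ironic-python-agent.initramfs
        └── my-final-image.qcow2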

The following environment variables can be passed in to customize run-time functionality (an example invocation follows the list):

  • PROVISIONING_MACS - a comma-separated list of MAC addresses of the master nodes (used to determine the PROVISIONING_INTERFACE)
  • PROVISIONING_INTERFACE - interface to use for ironic, dnsmasq(dhcpd) and httpd (default provisioning; calculated automatically if PROVISIONING_MACS is provided)
  • PROVISIONING_IP - the specific IP to use (instead of calculating it based on the PROVISIONING_INTERFACE)
  • DNSMASQ_EXCEPT_INTERFACE - interfaces to exclude when providing DHCP address (default lo)
  • HTTP_PORT - port used by http server (default 80)
  • HTTPD_SERVE_NODE_IMAGES - used by runhttpd script, controls access to the /shared/html/images directory via the default virtual host (HTTP_PORT). (default true)
  • DHCP_RANGE - dhcp range to use for provisioning (default 172.22.0.10-172.22.0.100)
  • DHCP_HOSTS - a ; separated list of dhcp-host entries, e.g. known MAC addresses like 00:20:e0:3b:13:af;00:20:e0:3b:14:af (empty by default). For more details on dhcp-host see the man page.
  • DHCP_IGNORE - a set of tags identifying hosts for which DHCP leases should not be allocated, e.g. tag:!known to ignore any unknown hosts (empty by default)
  • MARIADB_PASSWORD - The database password
  • OS_<section>__<name>=<value> - This format (note the double underscore) can be used to set arbitrary Ironic config options
  • IRONIC_RAMDISK_SSH_KEY - A public key to allow ssh access to nodes running IPA, takes the format "ssh-rsa AAAAB3....."
  • IRONIC_KERNEL_PARAMS - This parameter can be used to add additional kernel parameters to nodes running IPA
  • GATEWAY_IP - gateway IP address to use for ironic dnsmasq(dhcpd)
  • DNS_IP - DNS IP address to use for ironic dnsmasq(dhcpd)
  • IRONIC_IPA_COLLECTORS - Use a custom set of collectors to be run on inspection. (default default,logs)
  • HTTPD_ENABLE_SENDFILE - Whether to activate the EnableSendfile apache directive for httpd (default false)
  • IRONIC_CONDUCTOR_HOST - Host name of the current conductor (only makes sense to change for a multinode setup). Defaults to the IP address used for provisioning.
  • IRONIC_EXTERNAL_IP - Optional external IP if Ironic is not accessible on PROVISIONING_IP.
  • IRONIC_EXTERNAL_CALLBACK_URL - Override Ironic's external callback URL. Defaults to use IRONIC_EXTERNAL_IP if available.
  • IRONIC_EXTERNAL_HTTP_URL - Override Ironic's external http URL. Defaults to use IRONIC_EXTERNAL_IP if available.
  • IRONIC_ENABLE_VLAN_INTERFACES - Which VLAN interfaces to enable on the agent start-up. Can be a list of interfaces or a special value all. Defaults to all.
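For example, a minimal invocation of the dnsmasq container might look like this (the image tag, shared volume path, and entry-point script location are illustrative and may differ in your build):

podman run -d --net host --name dnsmasq \
    -v /opt/metal3/shared:/shared \
    -e PROVISIONING_INTERFACE=provisioning \
    -e DHCP_RANGE=172.22.0.10-172.22.0.100 \
    --entrypoint /bin/rundnsmasq \
    quay.io/metal3-io/ironic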

The ironic configuration can be overridden by various environment variables. The following can serve as an example:

  • OS_CONDUCTOR__DEPLOY_CALLBACK_TIMEOUT=4800 - timeout (seconds) to wait for a callback from a deploy ramdisk
  • OS_CONDUCTOR__INSPECT_TIMEOUT=1800 - timeout (seconds) for waiting for node inspection
  • OS_CONDUCTOR__CLEAN_CALLBACK_TIMEOUT=1800 - timeout (seconds) to wait for a callback from the ramdisk doing the cleaning
  • OS_PXE__BOOT_RETRY_TIMEOUT=1200 - timeout (seconds) to enable boot retries.
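At start-up, each OS_<section>__<name> variable should correspond to an option in the matching section of ironic.conf; for instance, the first example above maps to this snippet (sketch):

[conductor]
deploy_callback_timeout = 4800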

Build Ironic Image from RPMs

The ironic image is built using RPMs for system software and source code for ironic-specific software and libraries. It is also possible to build it entirely from RDO project RPMs by setting the INSTALL_TYPE argument to rpm at build time; for example:

podman build -t ironic-image -f Dockerfile --build-arg INSTALL_TYPE=rpm

Custom source for ironic software

When building the ironic image from source, it is also possible to specify a different source for ironic, ironic-lib or the sushy library using the build arguments IRONIC_SOURCE, IRONIC_LIB_SOURCE, and SUSHY_SOURCE. The accepted formats are gerrit refs, like refs/changes/89/860689/2, commit hashes, like a1fe6cb41e6f0a1ed0a43ba5e17745714f206f1f, repo tags or branches, or a local directory that needs to be under the sources/ directory in the container context. An example of a full command installing ironic from a gerrit patch is:

podman build -t ironic-image -f Dockerfile --build-arg INSTALL_TYPE=source \
    --build-arg IRONIC_SOURCE="refs/changes/89/860689/2"

An example using the local directory sources/ironic:

podman build -t ironic-image -f Dockerfile --build-arg INSTALL_TYPE=source \
    --build-arg IRONIC_SOURCE="ironic"

It is also possible to specify an upper-constraints file using the UPPER_CONSTRAINTS_FILE argument. By default this is the upper-constraints.txt file found in the container context; you can modify that file's content while keeping the default name, or specify an entirely different file as long as it is in the container context.
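For example (the constraints file name is illustrative; the file must exist in the container context):

podman build -t ironic-image -f Dockerfile --build-arg INSTALL_TYPE=source \
    --build-arg UPPER_CONSTRAINTS_FILE=my-upper-constraints.txt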

Apply project patches to the images during build

When building the image, it is possible to apply patches from one or more upstream projects using the PATCH_LIST argument, for example:

podman build -t ironic-image -f Dockerfile --build-arg \
    PATCH_LIST=my-patch-list

The PATCH_LIST argument is the path to a plain text file under the image context containing references to upstream patches for the ironic projects. Each line of the file has the form project_dir refspec (git_host), where (a sample file follows the list):

  • project_dir is the last part of the project URL including the organization; for example, for ironic it is openstack/ironic
  • refspec is the gerrit refspec of the patch to test; for example, to apply the patch at https://review.opendev.org/c/openstack/ironic/+/800084, the refspec is refs/changes/84/800084/22. Using multiple refspecs is convenient when testing patches that depend on each other, either in the same project or in different projects.
  • git_host (optional) is the git host from which the project will be cloned. If unset, https://opendev.org is used.
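A sample patch list file (the second line's refspec is hypothetical, shown only to illustrate the optional git_host field):

openstack/ironic refs/changes/84/800084/22
openstack/ironic-python-agent refs/changes/11/900011/1 https://review.opendev.org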

ironic-image's People

Contributors

adetalhouet, bfournie, derekhiggins, dhellmann, dtantsur, elfosardo, fmuyassarov, furkatgofurov7, iamfive, imain, iurygregory, juliakreger, kashifest, lentzi90, maelk, mahnoorasghar, mboukhalfa, metal3-io-bot, mquhuy, namnx228, rhjanders, rozzii, russellb, shibapuppy, stbenjam, tuminoid, vrutkovs, yprokule, yselkowitz, zaneb


ironic-image's Issues

Basic auth on json-rpc broken with IPv6

There is a bug in Ironic with json-rpc when specifying an IPv6 address for the [DEFAULT].host config option for ironic-conductor.

Since #201 we explicitly set this option to $IRONIC_IP except when the json-rpc auth type is noauth (in which case it's forced to be localhost).

Until the bug is fixed, basic auth on json-rpc is unusable with IPv6.
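An illustrative ironic.conf fragment (the address is an example) of the configuration that triggers the bug:

[DEFAULT]
host = 2001:db8::2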

Missing idrac-redfish interface for raid

The baremetal operator fails to register a server using the idrac-virtualmedia scheme, with:

 Client-side error: Could not find the following interface in the ‘ironic.hardware.interfaces.raid’ entrypoint: idrac-redfish. Valid interfaces are [‘agent’, ‘fake’, ‘ibmc’, ‘idrac-wsman’, ‘ilo5’, ‘irmc’, ‘no-raid’].

The reason is the conflict between
https://github.com/metal3-io/baremetal-operator/blob/41350ed511b7e2d90df81a9ab4afd56819a8b2d0/pkg/hardwareutils/bmc/idrac_virtualmedia.go#L88
and

enabled_raid_interfaces = no-raid,irmc,agent,fake,ibmc,idrac-wsman,ilo5

Adding idrac-redfish to the line solved the problem (only tested on a system without a RAID configuration).
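In other words, the fix tested here amounts to:

enabled_raid_interfaces = no-raid,irmc,agent,fake,ibmc,idrac-wsman,ilo5,idrac-redfish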

Minimize iptables rules in ironic images

We need some iptables rules set in the ironic images while we run them on the bootstrap node, outside of the cluster. We need to be careful that the containers do not force setting those rules when run inside the cluster, however, because that triggers issues like #82. See also #83 for more discussion.

Error setting up bootloader. Error UTF-16 stream does not start with BOM: UnicodeError: UTF-16 stream does not start with BOM

I'm still trying to determine what changed to trigger this; any help debugging this issue would be much appreciated.
I logged into a host during ironic-inspection and ran efibootmgr successfully, but did not see which CSV file it is tripping on.

Feb 18 05:45:15 localhost.localdomain ironic-python-agent[2090]: 2022-02-18 05:45:15.259 2090 INFO ironic_lib.disk_utils [-] Disk metadata on /dev/nvme24n1 successfully destroyed for node
Feb 18 05:45:15 localhost.localdomain ironic-python-agent[2090]: 2022-02-18 05:45:15.269 2090 INFO ironic_python_agent.extensions.standby [-] Writing image with command: qemu-img convert -t directsync -S 0 -O host_device -W /tmp/ubuntu-2004-kube-v1.20.10.a-efi.qcow2 /dev/nvme24n1
Feb 18 05:45:37 localhost.localdomain ironic-python-agent[2090]: 2022-02-18 05:45:37.306 2090 INFO ironic_python_agent.extensions.standby [-] Image /tmp/ubuntu-2004-kube-v1.20.10.a-efi.qcow2 written to device /dev/nvme24n1 in 27.08680534362793 seconds
Feb 18 06:09:54 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:09:54.511 2085 INFO root [-] Configdrive for node 83b642eb-3205-4d77-8132-03e931da3b9d successfully copied onto partition /dev/nvme24n1p5

Feb 18 06:10:06 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:02.610 2085 INFO root [-] Asynchronous command install_bootloader started execution
Feb 18 06:10:26 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:26.585 2085 INFO ironic_lib.utils [-] Root device found! The device "{'name': '/dev/nvme24n1', 'model': 'Micron_2300_MTFDHBA512TDV', 'size': 512110190592, 'rotational': False, 'wwn': 'eui.000000000000000100a075212d5a81d8', 'serial': '21092D5A81D8', 'vendor': None, 'wwn_with_extension': 'eui.000000000000000100a075212d5a81d8', 'wwn_vendor_extension': None, 'hctl': None, 'by_path': '/dev/disk/by-path/pci-0000:4b:00.0-nvme-1'}" matches the root device hints {'name': 's== /dev/nvme24n1'}

Feb 18 06:10:26 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:26.585 2085 INFO root [-] Picked root device /dev/nvme24n1 for node 83b642eb-3205-4d77-8132-03e931da3b9d based on root device hints {'name': 's== /dev/nvme24n1'}

Feb 18 06:10:32 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:28.128 2085 WARNING ironic_python_agent.utils [-] Couldn't re-read the partition table on device /dev/nvme24n1: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

Feb 18 06:10:32 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:32.136 2085 DEBUG ironic_lib.utils [-] Command stdout is: "Model: NVMe Device (nvme)
                                                                 Disk /dev/nvme24n1: 512GB
                                                                 Sector size (logical/physical): 512B/512B
                                                                 Partition Table: gpt
                                                                 Disk Flags:

                                                                 Number  Start   End     Size    File system  Name                  Flags
                                                                  1      17.4kB  1018kB  1000kB                                     bios_grub
                                                                  2      1018kB  201MB   200MB   fat32        EFI System Partition  boot, esp
                                                                  3      201MB   713MB   512MB   ext3
                                                                  4      713MB   21.5GB  20.8GB  ext4
                                                                  5      512GB   512GB   67.1MB


Feb 18 06:10:32 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:32.149 2085 DEBUG ironic_lib.utils [-] Command stdout is: "BYT;
                                                                 /dev/nvme24n1:488386MiB:nvme:512:512:gpt:NVMe Device:;
                                                                 1:0.02MiB:0.97MiB:0.95MiB:::bios_grub;
                                                                 2:0.97MiB:192MiB:191MiB:fat32:EFI System Partition:boot, esp;
                                                                 3:192MiB:680MiB:488MiB:ext3::;
                                                                 4:680MiB:20480MiB:19800MiB:ext4::;
                                                                 5:488322MiB:488386MiB:64.0MiB:::;

Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.144 2085 DEBUG ironic_lib.utils [-] Command stdout is: "BootCurrent: 0003
                                                                 Timeout: 1 seconds
                                                                 BootOrder: 0007,0003,0004,0005,0006,0001
                                                                 Boot0001* UEFI: Built-in EFI Shell        VenMedia(5023b95c-db26-429b-a648-bd47664c8012)..BO
                                                                 Boot0003* (B69/D0/F0) UEFI PXE: IPv4 Supermicro 10GBASE-T Ethernet Controller(MAC:3cecef74610a)        PciRoot(0x1)/Pci(0x3,0x1)/Pci(0x0,0x0)/MAC(3cecef74610a,1)/IPv4(0.0.0.00.0.0.0,0,0)..BO
                                                                 Boot0004* (B69/D0/F1) UEFI PXE: IPv4 Supermicro 10GBASE-T Ethernet Controller(MAC:3cecef74610b)        PciRoot(0x1)/Pci(0x3,0x1)/Pci(0x0,0x1)/MAC(3cecef74610b,1)/IPv4(0.0.0.00.0.0.0,0,0)..BO
                                                                 Boot0005* (B69/D0/F0) UEFI PXE: IPv6 Supermicro 10GBASE-T Ethernet Controller(MAC:3cecef74610a)        PciRoot(0x1)/Pci(0x3,0x1)/Pci(0x0,0x0)/MAC(3cecef74610a,1)/IPv6([::]:<->[::]:,0,0)..BO
                                                                 Boot0006* (B69/D0/F1) UEFI PXE: IPv6 Supermicro 10GBASE-T Ethernet Controller(MAC:3cecef74610b)        PciRoot(0x1)/Pci(0x3,0x1)/Pci(0x0,0x1)/MAC(3cecef74610b,1)/IPv6([::]:<->[::]:,0,0)..BO
                                                                 Boot0007* ubuntu        HD(2,GPT,601b431f-bebe-49b6-afe2-04312ef10c8c,0x7c4,0x5f5e2)/File(\EFI\UBUNTU\SHIMX64.EFI)..BO
                                                                 " _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:99
Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.145 2085 DEBUG ironic_lib.utils [-] Command stderr is: "" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:100
Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.145 2085 DEBUG ironic_python_agent.efi_utils [-] A CSV file has been identified as a bootloader hint. File: EFI/ubuntu/BOOTX64.CSV _run_efibootmgr /usr/lib/python3.6/site-packages/ironic_python_agent/efi_utils.py:297
Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.150 2085 DEBUG ironic_python_agent.efi_utils [-] Executing _manage_uefi clean-up. manage_uefi /usr/lib/python3.6/site-packages/ironic_python_agent/efi_utils.py:157
Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.151 2085 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): umount /tmp/tmp9xbaofp6/boot/efi execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:384
Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.181 2085 DEBUG oslo_concurrency.processutils [-] CMD "umount /tmp/tmp9xbaofp6/boot/efi" returned: 0 in 0.031s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.182 2085 DEBUG ironic_lib.utils [-] Command stdout is: "" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:99
Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.182 2085 DEBUG ironic_lib.utils [-] Command stderr is: "" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:100
Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.182 2085 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): sync execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:384
Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.186 2085 DEBUG oslo_concurrency.processutils [-] CMD "sync" returned: 0 in 0.004s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.186 2085 DEBUG ironic_lib.utils [-] Command stdout is: "" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:99
Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.186 2085 DEBUG ironic_lib.utils [-] Command stderr is: "" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:100
Feb 18 06:10:40 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.187 2085 ERROR ironic_python_agent.extensions.image [-] Error setting up bootloader. Error UTF-16 stream does not start with BOM: UnicodeError: UTF-16 stream does not start with BOM
Feb 18 06:10:44 localhost.localdomain ironic-python-agent[2085]: 2022-02-18 06:10:40.188 2085 ERROR root [-] Command failed: install_bootloader, error: UTF-16 stream does not start with BOM: UnicodeError: UTF-16 stream does not start with BOM
                                                                 2022-02-18 06:10:40.188 2085 ERROR root Traceback (most recent call last):
                                                                 2022-02-18 06:10:40.188 2085 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 174, in run
                                                                 2022-02-18 06:10:40.188 2085 ERROR root     result = self.execute_method(**self.command_params)
                                                                 2022-02-18 06:10:40.188 2085 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py", line 782, in install_bootloader
                                                                 2022-02-18 06:10:40.188 2085 ERROR root     if _efi_boot_setup(device, efi_system_part_uuid, target_boot_mode):
                                                                 2022-02-18 06:10:40.188 2085 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py", line 660, in _efi_boot_setup
                                                                 2022-02-18 06:10:40.188 2085 ERROR root     device, efi_system_part_uuid=efi_system_part_uuid)
                                                                 2022-02-18 06:10:40.188 2085 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/efi_utils.py", line 148, in manage_uefi
                                                                 2022-02-18 06:10:40.188 2085 ERROR root     efi_partition_mount_point)
                                                                 2022-02-18 06:10:40.188 2085 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/efi_utils.py", line 301, in _run_efibootmgr
                                                                 2022-02-18 06:10:40.188 2085 ERROR root     contents = str(csv.read())
                                                                 2022-02-18 06:10:40.188 2085 ERROR root   File "/usr/lib64/python3.6/codecs.py", line 321, in decode
                                                                 2022-02-18 06:10:40.188 2085 ERROR root     (result, consumed) = self._buffer_decode(data, self.errors, final)
                                                                 2022-02-18 06:10:40.188 2085 ERROR root   File "/usr/lib64/python3.6/encodings/utf_16.py", line 67, in _buffer_decode
                                                                 2022-02-18 06:10:40.188 2085 ERROR root     raise UnicodeError("UTF-16 stream does not start with BOM")
                                                                 2022-02-18 06:10:40.188 2085 ERROR root UnicodeError: UTF-16 stream does not start with BOM

isolinux.bin file not found in image

When enrolling nodes with virtualmedia, they fail with an error saying the ISO could not be created because the isolinux.bin file does not exist.
I logged into the ironic-conductor pod, and /usr/lib/syslinux/isolinux.bin is indeed not there. Moreover, when I installed the syslinux package on ironic-conductor, the file was placed at /usr/share/syslinux/isolinux.bin. Something may have changed in the base image.
A possible solution is to install the syslinux package in the ironic-image and create a soft link at /usr/lib/syslinux/isolinux.bin pointing to the actual file.
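A sketch of that fix as build-time commands (paths as reported above; the package manager depends on the base image):

dnf install -y syslinux
mkdir -p /usr/lib/syslinux
ln -s /usr/share/syslinux/isolinux.bin /usr/lib/syslinux/isolinux.bin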

Proposal: Extract configuration generation to init-container

I think I have come up with a more generic way to solve #468 and improve the user experience of the ironic-image.

The idea is to move all the scripts for generating configuration into their own container. This would run as an init-container, just like ironic-ipa-downloader. It would generate all configuration files, set up TLS and auth, and of course do the wait_for_interface_or_ip.

Doing it this way has multiple benefits:

  1. It is easy to opt out, since it is a separate container. Just like we can choose to run keepalived or not we could choose to use this init-container for convenience or provide configuration in some other way if we want.
  2. It becomes more obvious to the user when initialization is done. If the init-container gets stuck on wait_for_interface_or_ip it is immediately clear that the issue is during initialization, compared to current "pod not ready" symptoms.
  3. When the configuration is done before starting the actual container/component, it is possible to slim down each of these much more than we do today. The httpd and dnsmasq containers do not need python or ironic installed, and ironic does not need bash, etc.

I realize this may not be easy to achieve, but I do think it would be a good way to enable advanced use cases and provide greater flexibility in how we deploy ironic. What do you think?
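A rough pod-spec fragment of the idea (all names here are hypothetical):

initContainers:
- name: ironic-config-init            # generates config, TLS, auth; waits for the interface/IP
  image: quay.io/metal3-io/ironic
  command: ["/bin/configure-ironic"]  # hypothetical config-generation entry point
  volumeMounts:
  - name: ironic-config
    mountPath: /etc/ironic
containers:
- name: ironic                        # could then be a much slimmer image
  image: quay.io/metal3-io/ironic
  volumeMounts:
  - name: ironic-config
    mountPath: /etc/ironic
volumes:
- name: ironic-config
  emptyDir: {}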

CI is failing as Ironic fails to start

Ironic fails to start, and after the BMH is applied, it doesn't even proceed to registration.

kubectl get bmh -A
NAMESPACE   NAME     STATUS   PROVISIONING STATUS   CONSUMER   BMC                                                                                         HARDWARE PROFILE   ONLINE   ERROR
metal3      node-0                                             ipmi://192.168.111.1:6230                                                                                      false    
metal3      node-1                                             redfish+http://192.168.111.1:8000/redfish/v1/Systems/a5717dc1-48b0-46b0-b6b1-c90c7d4a742b                      false    
metal3      node-2                                             ipmi://192.168.111.1:6232                                                                                      false    
metal3      node-3                                             redfish+http://192.168.111.1:8000/redfish/v1/Systems/a4906598-e454-4811-8987-fa2fd48d63cc                      false    

The ironic log is complaining about the following:

Failed to register hardware types. For hardware type 'fake-hardware', no default value found for bios interface.: ironic.common.exception.NoValidDefaultForInterface: For hardware type 'fake-hardware', no default value found for bios interface.

This started in the CI after PR #162 went in. Disabling it makes the CI pass:

kubectl get bmh -A
NAMESPACE   NAME     STATUS   PROVISIONING STATUS   CONSUMER   BMC                                                                                         HARDWARE PROFILE   ONLINE   ERROR
metal3      node-0   OK       ready                            ipmi://192.168.111.1:6230                                                                   unknown            false    
metal3      node-1   OK       ready                            redfish+http://192.168.111.1:8000/redfish/v1/Systems/a5717dc1-48b0-46b0-b6b1-c90c7d4a742b   unknown            false    
metal3      node-2   OK       ready                            ipmi://192.168.111.1:6232                                                                   unknown            false    
metal3      node-3   OK       ready                            redfish+http://192.168.111.1:8000/redfish/v1/Systems/a4906598-e454-4811-8987-fa2fd48d63cc   unknown            false 

Enabling idrac needs further investigation.

Variable unset in dnsmasq configuration for ipv4

IRONIC_IP in dnsmasq.conf.ipv4 is not replaced with an IP when starting dnsmasq. This does not seem to prevent dnsmasq from starting properly, but sed should probably be run in the rundnsmasq script to replace it.
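A sketch of the kind of substitution the rundnsmasq script could perform (the config path is taken from this report; the exact IP derivation may differ):

IRONIC_IP=$(ip -br addr show scope global up dev "$PROVISIONING_INTERFACE" | awk '{print $3; exit}' | cut -d/ -f1)
sed -i "s/IRONIC_IP/${IRONIC_IP}/g" /etc/dnsmasq.conf.ipv4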

Feature request: Environment variable to change IPA collectors

The Ironic-Python-Agent has the ability to run different sets of collectors, including custom collectors which are provided by Hardware Managers. It would be helpful to be able to change this setting.

The value is sent as the kernel param ipa-inspection-collectors, and it seems the current value is hard-coded as default,extra-hardware,logs. Perhaps it could be made customizable through an environment variable.
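This has since been addressed by the IRONIC_IPA_COLLECTORS variable documented above; conceptually, the value ends up on the IPA kernel command line as, for example:

ipa-inspection-collectors=default,logs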

Nova power notification warning in log

This is cosmetic, but I see this in logs:

2023-11-14T16:22:22.129348983+00:00 stderr F 2023-11-14 16:22:22.126 1 WARNING ironic.common.nova [None req-1393f16f-2b67-47ff-9173-2592fe139b74 - - - - - -] Could not connect to Nova to send a power notification, please check configuration. An auth plugin is required to determine endpoint URL: keystoneauth1.exceptions.auth_plugins.MissingAuthPlugin: An auth plugin is required to determine endpoint URL

I think we can save some CPU cycles and disable that by adding:

[nova]
send_power_notifications = false
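Given the OS_<section>__<name> convention described earlier, the same setting could presumably also be injected via the environment:

OS_NOVA__SEND_POWER_NOTIFICATIONS=false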

ProvisioningError: blkid returns with exit code 2

I am using the default image format (raw) for provisioning baremetal nodes and am facing the issue below during provisioning.

Normal ProvisioningError  102s  metal3-baremetal-controller Image provisioning failed: Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node e5933f03-493d-4c75-a6b9-f823ac7b2284 : Error performing deploy_step write_image: Command execution failed: Unexpected error while running command.
Command: blkid /dev/vda --match-tag UUID --match-tag PARTUUID
Exit code: 2
Stdout: ''
Stderr: ''.

Ramdisk logs from the ironic-log-watch container are attached here.
lsblk returns exit code 0, and there are no errors related to the GPT partition in the logs:

3da26541-9307-436f-9a67-5d423770befb_metal3~node-4_030e0040-e14f-4f38-9930-1f74ef32f61a_2021-08-30-05-51-42.tar.gz: Aug 30 05:51:39 node-4 ironic-python-agent[562]: 2021-08-30 05:51:39.212 562 DEBUG oslo_concurrency.processutils [-] CMD "lsblk -Pbia -oKNAME,MODEL,SIZE,ROTA,TYPE,UUID,PARTUUID" returned: 0 in 0.022s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
-rw-r--r-- 0/0       74 2021-08-30 06:01 lsblk
3e88d346-cceb-4e57-9508-9552e3aa5962_metal3~node-3_cleaning_2021-08-30-05-03-24.tar.gz: Aug 30 05:03:18 node-3 ironic-python-agent[562]: 2021-08-30 05:03:18.816 562 DEBUG oslo_concurrency.processutils [-] CMD "sgdisk -Z /dev/vda" returned: 0 in 1.020s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
3e88d346-cceb-4e57-9508-9552e3aa5962_metal3~node-3_cleaning_2021-08-30-05-03-24.tar.gz: Aug 30 05:03:18 node-3 ironic-python-agent[562]: 2021-08-30 05:03:18.822 562 DEBUG ironic_lib.utils [-] Command stdout is: "Creating new GPT entries.
3e88d346-cceb-4e57-9508-9552e3aa5962_metal3~node-3_cleaning_2021-08-30-05-03-24.tar.gz:                                                  GPT data structures destroyed! You may now partition the disk using fdisk or

conductor logs ,

2021-08-30 05:51:41.360 1 DEBUG ironic.drivers.modules.agent_client [-] Status of agent commands for node 3da26541-9307-436f-9a67-5d423770befb: get_clean_steps: result "{'clean_steps': {'GenericHardwareManager': [{'step': 'erase_devices', 'priority': 10, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'erase_devices_metadata', 'priority': 99, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'erase_pstore', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'delete_configuration', 'priority': 0, 'interface': 'raid', 'reboot_requested': False, 'abortable': True}, {'step': 'create_configuration', 'priority': 0, 'interface': 'raid', 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_cpu', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_disk', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_memory', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_network', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}]}, 'hardware_manager_version': {'generic_hardware_manager': '1.1'}}", error "None"; execute_clean_step: result "{'clean_result': None, 'clean_step': {'step': 'erase_devices_metadata', 'priority': 10, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True, 'requires_ramdisk': True}}", error "None"; collect_system_logs: result "{'system_logs': '<...>'}", error "None"; get_deploy_steps: result "{'deploy_steps': {'GenericHardwareManager': [{'step': 'erase_devices_metadata', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False}, {'step': 'apply_configuration', 'priority': 0, 'interface': 'raid', 'reboot_requested': False, 'argsinfo': {'raid_config': {'description': 'The RAID configuration to apply.', 'required': True}, 'delete_existing': {'description': "Setting this to 'True' indicates to delete existing RAID configuration prior to creating the new configuration. Default value is 'True'.", 'required': False}}}, {'step': 'write_image', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False}, {'step': 'inject_files', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'argsinfo': {'files': {'description': "Files to inject, a list of file structures with keys: 'path' (path to the file), 'partition' (partition specifier), 'content' (base64 encoded string), 'mode' (new file mode) and 'dirmode' (mode for the leaf directory, if created). Merged with the values from node.properties[inject_files].", 'required': False}, 'verify_ca': {'description': 'Whether to verify TLS certificates. Global agent options are used by default.', 'required': False}}}]}, 'hardware_manager_version': {'generic_hardware_manager': '1.1'}}", error "None"; execute_deploy_step: result "None", error "{'type': 'DeploymentError', 'code': 500, 'message': 'Deploy step failed', 'details': "Error performing deploy_step write_image: Command execution failed: Unexpected error while running command.\nCommand: blkid /dev/vda --match-tag UUID --match-tag PARTUUID\nExit code: 2\nStdout: ''\nStderr: ''"}" get_commands_status /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:347
2021-08-30 05:51:41.360 1 DEBUG ironic.drivers.modules.agent_base [-] deploy command status for node 3da26541-9307-436f-9a67-5d423770befb on step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'}: {'id': '9588406e-4c96-41f2-a632-52134bc921e8', 'command_name': 'execute_deploy_step', 'command_status': 'FAILED', 'command_error': {'type': 'DeploymentError', 'code': 500, 'message': 'Deploy step failed', 'details': "Error performing deploy_step write_image: Command execution failed: Unexpected error while running command.\nCommand: blkid /dev/vda --match-tag UUID --match-tag PARTUUID\nExit code: 2\nStdout: ''\nStderr: ''"}, 'command_result': None} process_next_step /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_base.py:1056
Command: blkid /dev/vda --match-tag UUID --match-tag PARTUUID

Tried with the qcow2 format as well; facing the same issue with that, too.
Storage is /dev/vda, LocalGB: 12.
Let me know if you need more information to debug.
Thanks!

Ironic does not support ilo5

I want to provision a baremetal server (HPE) using its iLO 5 BMC.
When I apply the related BareMetalHost on my cluster, an error occurs in the Ironic pod:
No valid host was found. Reason: No conductor service registered which supports driver ilo5 for conductor group.

Thanks for your help.

Mariadb fails to start on Minikube

MariaDB fails to start on the latest Minikube:

Installing MariaDB/MySQL system tables in '/var/lib/mysql' ...
/usr/bin/mysql_install_db: line 469:    35 Killed                  "$mysqld_bootstrap" $defaults $defaults_group_suffix "$mysqld_opt" --bootstrap $silent_startup "--basedir=$basedir" "--datadir=$ldata" --log-warnings=0 --enforce-storage-engine="" "--plugin-dir=${plugindir}" $args --max_allowed_packet=8M --net_buffer_length=16K

Installation of system tables failed!  Examine the logs in
/var/log/mariadb/mariadb.log or /var/lib/mysql for more information.

The log files / folders in /shared, /var/log and /var/lib/mysql are empty.
The container starts properly when not in Minikube (tested on CentOS and Ubuntu).

Split up separate components into separate images

Currently everything is bundled into one image. This is usually not good practice, as it brings quite a few problems with it, like longer pull times and not being able to do granular updates. It also adds complexity to the image build.
I would see a few components that should probably be their own images:

  • MariaDB: unsure if just using official images isn't the better option here (or bitnami charts if running on k8s)
  • webserver
  • ironic
  • dns/dhcp

Please correct me if I'm wrong in my assumptions.

Allow timeout customization on Ironic

If the BMC is slow to respond, Ironic will fail the current task (deploying, cleaning...).
For example, in a Redfish machine, Ironic would send a power off call, but the machine will take more than 30 seconds to report back a PowerOff state.

It seems that right now the timeout is 30 seconds.
Customizable timeouts are necessary, or at the very least the default should be increased beyond 30 seconds.
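Assuming the relevant upstream option is [conductor]power_state_change_timeout (verify the option name for your Ironic release), the OS_<section>__<name> mechanism described earlier could be used to raise it:

OS_CONDUCTOR__POWER_STATE_CHANGE_TIMEOUT=120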

2020-03-01 16:16:14.658 27 ERROR ironic.conductor.utils [req-653b5aea-7d75-4e0b-83a6-3dcddb0a250c - - - - -] Timed out after 30 secs waiting for power off on node 10fa3981-5b8b-4dd0-b0dc-064d38c99cab.: oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 29.91 seconds
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager [req-653b5aea-7d75-4e0b-83a6-3dcddb0a250c - - - - -] Failed to tear down from cleaning for node 10fa3981-5b8b-4dd0-b0dc-064d38c99cab, reason: Failed to set node power state to power off.: ironic.common.exception.PowerStateFailure: Failed to set node power state to power off.
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager Traceback (most recent call last):
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/conductor/utils.py", line 149, in node_wait_for_power_state
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     return timer.start(initial_delay=1, timeout=retry_timeout).wait()
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     result = hub.switch()
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     return self.greenlet.switch()
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/oslo_service/loopingcall.py", line 154, in _run_loop
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     idle = idle_for_func(result, self._elapsed(watch))
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/oslo_service/loopingcall.py", line 351, in _idle_for
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     % self._error_time)
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 29.91 seconds
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager 
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager During handling of the above exception, another exception occurred:
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager 
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager Traceback (most recent call last):
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/conductor/manager.py", line 1487, in _do_next_clean_step
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     task.driver.deploy.tear_down_cleaning(task)
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic_lib/metrics.py", line 60, in wrapped
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     result = f(*args, **kwargs)
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/agent.py", line 707, in tear_down_cleaning
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     task, manage_boot=CONF.agent.manage_agent_boot)
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/deploy_utils.py", line 967, in tear_down_inband_cleaning
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     manager_utils.node_power_action(task, states.POWER_OFF)
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     return f(*args, **kwargs)
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/conductor/utils.py", line 306, in node_power_action
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     fields.NotificationStatus.ERROR, new_state)
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     self.force_reraise()
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     six.reraise(self.type_, self.value, self.tb)
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/six.py", line 703, in reraise
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     raise value
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/conductor/utils.py", line 288, in node_power_action
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     task.driver.power.set_power_state(task, new_state, timeout=timeout)
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     return f(*args, **kwargs)
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/redfish/power.py", line 120, in set_power_state
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     timeout=timeout)
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager   File "/usr/lib/python3.6/site-packages/ironic/conductor/utils.py", line 155, in node_wait_for_power_state
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager     raise exception.PowerStateFailure(pstate=new_state)
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager ironic.common.exception.PowerStateFailure: Failed to set node power state to power off.
2020-03-01 16:16:14.679 27 ERROR ironic.conductor.manager 
2020-03-01 16:16:14.701 27 DEBUG ironic.common.states [req-653b5aea-7d75-4e0b-83a6-3dcddb0a250c - - - - -] Exiting old state 'cleaning' in response to event 'fail' on_exit /usr/lib/python3.6/site-packages/ironic/common/states.py:294
2020-03-01 16:16:14.701 27 DEBUG ironic.common.states [req-653b5aea-7d75-4e0b-83a6-3dcddb0a250c - - - - -] Entering new state 'clean failed' in response to event 'fail' on_enter /usr/lib/python3.6/site-packages/ironic/common/states.py:300

Ironic image cache cleaning removes some of the images

We have noticed that when we do a capi/metal3 upgrade, PXE boot sometimes fails, and in the dnsmasq logs we see:

dnsmasq-tftp: error 8 User aborted the transfer received from 172.18.0.70
dnsmasq-tftp: failed sending /shared/tftpboot/snponly.efi to 172.18.0.70
dnsmasq-tftp: sent /shared/tftpboot/snponly.efi to 172.18.0.70

Investigating from the dnsmasq container, we noticed that the following images are actually not available in /shared/tftpboot:
undionly.kpxe snponly.efi ipxe.efi

We have been checking the ironic pods, and we noticed this log from the ironic conductor:
2021-11-16 13:00:22.823 1 DEBUG ironic.drivers.modules.image_cache [req-0575d941-7dde-41bf-ab45-6623dd947852 - - - - -] Starting clean up for master image cache /shared/tftpboot clean_up /usr/lib/python3.6/site-packages/ironic/drivers/modules/image_cache.py:193

Investigating further, we see there is automatic image cache cleaning done by Ironic that seems to clean the /shared/tftpboot/ images, which causes the issue.
https://github.com/openstack/ironic/blob/44c214dcedb41cd6ab24f62ca89ef8714e4ceb9b/ironic/drivers/modules/image_cache.py#L184

Another important note is that this issue happens very randomly, so we are not sure what exactly triggers it. Is there any workaround for this issue?

Ironic listening on all interfaces, not only my_ip

We pass a PROVISIONING_INTERFACE so that we can bind the ironic services only to a specific internal network, but it seems like the my_ip configuration isn't enough: netstat shows the service (and inspector) listening on all interfaces:

$ sudo netstat -taupen | grep 6385
tcp        0      0 0.0.0.0:6385            0.0.0.0:*               LISTEN      0          1822447    12064/python2       
tcp        0      0 127.0.0.1:55478         127.0.0.1:6385          TIME_WAIT   0          0          -                   
$ sudo podman exec -it ironic cat /etc/ironic/ironic.conf | grep my_ip
my_ip = 172.22.0.2

Am I missing something, or do we need some additional configuration to ensure binding to the expected interface?
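For reference, in Ironic my_ip does not control the listen address; the API bind address is a separate option that defaults to 0.0.0.0 (sketch, option name per upstream Ironic):

[DEFAULT]
my_ip = 172.22.0.2

[api]
host_ip = 172.22.0.2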

IP fetched with prefix when configuring ironic

After this change: e4df693, when running

[root@ci-test-vm-20200507050740 /]# ip -br add show scope global up dev "provisioning" | awk '{print $3; exit}'
172.22.0.1/24

The prefix should be stripped; leaving it in prevents ironic from starting properly.
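A minimal sketch of stripping the prefix from that output:

ip -br add show scope global up dev "provisioning" | awk '{print $3; exit}' | cut -d/ -f1
# -> 172.22.0.1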

Missing console logging makes BMH provisioning fail

After upgrading to v0.4.1, the BMH no longer logs syslog to the console during the different phases.
It seems to be related to this: #160
For some odd reason, this has the effect on my baremetal nodes (Huawei ibmc-managed, UEFI and Legacy boot mode) that they fail in random phases like clean wait or inspection wait.
I am not able to provision the servers at all, as they all get stuck in clean wait.
After help from the community Slack (https://kubernetes.slack.com/archives/CHD49TLE7/p1606294233210600),
I was able to get syslog-to-console re-enabled in the ironic-image.

What I did to get it back into a working state:
I added ssh debugging in order to get into the image and run some debugging steps to figure out what was wrong:
#226

Then I rebuilt the ironic-image with the required parameters included in inspector.ipxe.j2 (I have this file thanks to the work in #226) and ironic.conf.j2, and enabled systemd.journald.forward_to_console=yes.

With console logging back, I am now able to get the BMH to behave as expected, from inspection to full deployment.

At least for me, having the possibility to switch console logging on via an env variable would work better than having to hack the image on every new release.
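For instance, with the IRONIC_KERNEL_PARAMS variable documented earlier, this could presumably be toggled without rebuilding the image:

IRONIC_KERNEL_PARAMS="systemd.journald.forward_to_console=yes"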

Agent driver requires agent_url in driver_internal_info

I got the error messages below on the ironic container and couldn't deploy the BareMetalHost.

  • Error Messages ( $ kubectl logs ironic-79cdf49594-cd4qh -n metal3 -c ironic )
    2020-01-10 07:05:03.736 34 ERROR ironic.conductor.utils [req-1c1e975c-f9ee-4149-bfbf-212e898c53fc - - - - -] Node a2a01673-cefa-4c16-8f7e-7f4cfcd26d36 failed deploy step {u'priority': 100, u'interface': u'deploy', u'step': u'deploy', u'argsinfo': None}. Error: Agent driver requires agent_url in driver_internal_info: IronicException: Agent driver requires agent_url in driver_internal_info
    2020-01-10 07:05:03.922 34 ERROR ironic.conductor.task_manager [req-1c1e975c-f9ee-4149-bfbf-212e898c53fc - - - - -] Node a2a01673-cefa-4c16-8f7e-7f4cfcd26d36 moved to provision state "deploy failed" from state "deploying"; target provision state is "active": IronicException: Agent driver requires agent_url in driver_internal_info

  • Baremetalhost CR

    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHost
    metadata:
      name: baremetalhost01
      namespace: metal3
    spec:
      bmc:
        address: ipmi://10.1.1.17:623
        credentialsName: baremetalhost01-secret
      image:
        checksum: http://10.1.19.14:8080/images/my-image02.qcow2.md5sum
        url: http://10.1.19.14:8080/images/my-image02.qcow2
      online: true

Does anyone have ideas for solving this issue?

Machine partition creation issue when a qcow2 user image is supplied

We used to use a raw user OS image format (.img) and it was working fine. After switching to the qcow2 format, it seems Ironic Python Agent is not happy when attempting to create partitions:

2021-05-17 21:01:02.337 2479 ERROR ironic_lib.disk_utils [-] Failed to fix GPT partition on disk /dev/sda for node None. Error: Unexpected error while running command.
Command: blkid /dev/sda --probe
Exit code: 2
Stdout: ''
Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
2021-05-17 21:01:02.337 2479 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): blkid /dev/sda --match-tag UUID --match-tag PARTUUID execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:384
2021-05-17 21:01:02.341 2479 DEBUG oslo_concurrency.processutils [-] CMD "blkid /dev/sda --match-tag UUID --match-tag PARTUUID" returned: 2 in 0.004s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
2021-05-17 21:01:02.341 2479 DEBUG oslo_concurrency.processutils [-] 'blkid /dev/sda --match-tag UUID --match-tag PARTUUID' failed. Not Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:474
2021-05-17 21:01:02.342 2479 ERROR root [-] Command failed: prepare_image, error: Unexpected error while running command.
Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
Exit code: 2
Stdout: ''
Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
Exit code: 2
Stdout: ''
Stderr: ''
2021-05-17 21:01:02.342 2479 ERROR root Traceback (most recent call last):
2021-05-17 21:01:02.342 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 174, in run
2021-05-17 21:01:02.342 2479 ERROR root     result = self.execute_method(**self.command_params)
2021-05-17 21:01:02.342 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/standby.py", line 704, in prepare_image
2021-05-17 21:01:02.342 2479 ERROR root     self._stream_raw_image_onto_device(image_info, stream_to)
2021-05-17 21:01:02.342 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/standby.py", line 605, in _stream_raw_image_onto_device
2021-05-17 21:01:02.342 2479 ERROR root     root_uuid = disk_utils.block_uuid(device)
2021-05-17 21:01:02.342 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_lib/disk_utils.py", line 511, in block_uuid
2021-05-17 21:01:02.342 2479 ERROR root     info = get_device_information(dev, fields=['UUID', 'PARTUUID'])
2021-05-17 21:01:02.342 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_lib/disk_utils.py", line 207, in get_device_information
2021-05-17 21:01:02.342 2479 ERROR root     use_standard_locale=True, run_as_root=True)
2021-05-17 21:01:02.342 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_lib/utils.py", line 97, in execute
2021-05-17 21:01:02.342 2479 ERROR root     result = processutils.execute(*cmd, **kwargs)
2021-05-17 21:01:02.342 2479 ERROR root   File "/usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py", line 441, in execute
2021-05-17 21:01:02.342 2479 ERROR root     cmd=sanitized_cmd)
2021-05-17 21:01:02.342 2479 ERROR root oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
2021-05-17 21:01:02.342 2479 ERROR root Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
2021-05-17 21:01:02.342 2479 ERROR root Exit code: 2
2021-05-17 21:01:02.342 2479 ERROR root Stdout: ''
2021-05-17 21:01:02.342 2479 ERROR root Stderr: ''
2021-05-17 21:01:02.342 2479 ERROR root
2021-05-17 21:01:02.676 2479 DEBUG ironic_python_agent.ironic_api_client [-] Heartbeat: announcing callback URL https://21.0.12.32:9999, API version is 1.68 heartbeat /usr/lib/python3.6/site-packages/ironic_python_agent/ironic_api_client.py:162
2021-05-17 21:01:02.718 2479 INFO ironic_python_agent.agent [-] heartbeat successful
2021-05-17 21:01:02.718 2479 INFO ironic_python_agent.agent [-] sleeping before next heartbeat, interval: 133.739396330338
2021-05-17 21:01:02.718 2479 ERROR root [-] Unexpected error dispatching write_image to manager <ironic_python_agent.hardware.GenericHardwareManager object at 0x7f102a092470>: Command execution failed: Unexpected error while running command.
Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
Exit code: 2
Stdout: ''
Stderr: '': ironic_python_agent.errors.CommandExecutionError: Command execution failed: Unexpected error while running command.
Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
Exit code: 2
Stdout: ''
Stderr: ''
2021-05-17 21:01:02.718 2479 ERROR root Traceback (most recent call last):
2021-05-17 21:01:02.718 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/hardware.py", line 2454, in dispatch_to_managers
2021-05-17 21:01:02.718 2479 ERROR root     return getattr(manager, method)(*args, **kwargs)
2021-05-17 21:01:02.718 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/hardware.py", line 2324, in write_image
2021-05-17 21:01:02.718 2479 ERROR root     return cmd.wait()
2021-05-17 21:01:02.718 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 89, in wait
2021-05-17 21:01:02.718 2479 ERROR root     raise self.command_error
2021-05-17 21:01:02.718 2479 ERROR root ironic_python_agent.errors.CommandExecutionError: Command execution failed: Unexpected error while running command.
2021-05-17 21:01:02.718 2479 ERROR root Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
2021-05-17 21:01:02.718 2479 ERROR root Exit code: 2
2021-05-17 21:01:02.718 2479 ERROR root Stdout: ''
2021-05-17 21:01:02.718 2479 ERROR root Stderr: ''
2021-05-17 21:01:02.718 2479 ERROR root
2021-05-17 21:01:02.719 2479 ERROR root [-] Error performing deploy_step write_image: Command execution failed: Unexpected error while running command.
Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
Exit code: 2
Stdout: ''
Stderr: '': ironic_python_agent.errors.CommandExecutionError: Command execution failed: Unexpected error while running command.
Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
Exit code: 2
Stdout: ''
Stderr: ''
2021-05-17 21:01:02.719 2479 ERROR root Traceback (most recent call last):
2021-05-17 21:01:02.719 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/deploy.py", line 77, in execute_deploy_step
2021-05-17 21:01:02.719 2479 ERROR root     **kwargs)
2021-05-17 21:01:02.719 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/hardware.py", line 2454, in dispatch_to_managers
2021-05-17 21:01:02.719 2479 ERROR root     return getattr(manager, method)(*args, **kwargs)
2021-05-17 21:01:02.719 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/hardware.py", line 2324, in write_image
2021-05-17 21:01:02.719 2479 ERROR root     return cmd.wait()
2021-05-17 21:01:02.719 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 89, in wait
2021-05-17 21:01:02.719 2479 ERROR root     raise self.command_error
2021-05-17 21:01:02.719 2479 ERROR root ironic_python_agent.errors.CommandExecutionError: Command execution failed: Unexpected error while running command.
2021-05-17 21:01:02.719 2479 ERROR root Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
2021-05-17 21:01:02.719 2479 ERROR root Exit code: 2
2021-05-17 21:01:02.719 2479 ERROR root Stdout: ''
2021-05-17 21:01:02.719 2479 ERROR root Stderr: ''
2021-05-17 21:01:02.719 2479 ERROR root
2021-05-17 21:01:02.739 2479 INFO eventlet.wsgi.server [-] ::ffff:21.0.12.28 "GET /v1/commands/ HTTP/1.1" status: 200  len: 249714 time: 0.0017476
2021-05-17 21:01:03.051 2479 ERROR root [-] Command failed: execute_deploy_step, error: Deploy step failed: Error performing deploy_step write_image: Command execution failed: Unexpected error while running command.
Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
Exit code: 2
Stdout: ''
Stderr: '': ironic_python_agent.errors.DeploymentError: Deploy step failed: Error performing deploy_step write_image: Command execution failed: Unexpected error while running command.
Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
Exit code: 2
Stdout: ''
Stderr: ''
2021-05-17 21:01:03.051 2479 ERROR root Traceback (most recent call last):
2021-05-17 21:01:03.051 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/deploy.py", line 77, in execute_deploy_step
2021-05-17 21:01:03.051 2479 ERROR root     **kwargs)
2021-05-17 21:01:03.051 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/hardware.py", line 2454, in dispatch_to_managers
2021-05-17 21:01:03.051 2479 ERROR root     return getattr(manager, method)(*args, **kwargs)
2021-05-17 21:01:03.051 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/hardware.py", line 2324, in write_image
2021-05-17 21:01:03.051 2479 ERROR root     return cmd.wait()
2021-05-17 21:01:03.051 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 89, in wait
2021-05-17 21:01:03.051 2479 ERROR root     raise self.command_error
2021-05-17 21:01:03.051 2479 ERROR root ironic_python_agent.errors.CommandExecutionError: Command execution failed: Unexpected error while running command.
2021-05-17 21:01:03.051 2479 ERROR root Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
2021-05-17 21:01:03.051 2479 ERROR root Exit code: 2
2021-05-17 21:01:03.051 2479 ERROR root Stdout: ''
2021-05-17 21:01:03.051 2479 ERROR root Stderr: ''
2021-05-17 21:01:03.051 2479 ERROR root
2021-05-17 21:01:03.051 2479 ERROR root During handling of the above exception, another exception occurred:
2021-05-17 21:01:03.051 2479 ERROR root
2021-05-17 21:01:03.051 2479 ERROR root Traceback (most recent call last):
2021-05-17 21:01:03.051 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 174, in run
2021-05-17 21:01:03.051 2479 ERROR root     result = self.execute_method(**self.command_params)
2021-05-17 21:01:03.051 2479 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/deploy.py", line 82, in execute_deploy_step
2021-05-17 21:01:03.051 2479 ERROR root     raise errors.DeploymentError(msg)
2021-05-17 21:01:03.051 2479 ERROR root ironic_python_agent.errors.DeploymentError: Deploy step failed: Error performing deploy_step write_image: Command execution failed: Unexpected error while running command.
2021-05-17 21:01:03.051 2479 ERROR root Command: blkid /dev/sda --match-tag UUID --match-tag PARTUUID
2021-05-17 21:01:03.051 2479 ERROR root Exit code: 2
2021-05-17 21:01:03.051 2479 ERROR root Stdout: ''
2021-05-17 21:01:03.051 2479 ERROR root Stderr: ''
2021-05-17 21:01:03.051 2479 ERROR root
2021-05-17 21:01:03.052 2479 DEBUG ironic_python_agent.ironic_api_client [-] Heartbeat: announcing callback URL https://21.0.12.32:9999, API version is 1.68 heartbeat /usr/lib/python3.6/site-packages/ironic_python_agent/ironic_api_client.py:162
2021-05-17 21:01:03.088 2479 INFO ironic_python_agent.agent [-] heartbeat successful
2021-05-17 21:01:03.089 2479 INFO ironic_python_agent.agent [-] sleeping before next heartbeat, interval: 123.50160155794512

Is there any baremetal-operator or Ironic API option that should be set during provisioning to use the qcow2 image?

Introducing ipxe security hardening options

The motivation:
I am currently working on implementing basic_auth for the user image download process of IPA (https://bugs.launchpad.net/ironic-python-agent/+bug/2021947). In environments that use iPXE to boot IPA, passing the basic auth credentials securely requires HTTPS, and our ironic-image does not provide out-of-the-box options to help users enable HTTPS for iPXE. IMO this is an important missing feature, not just when basic_auth credentials are used but in general.

The core plan is to:

  • In the same way as is already done for Ironic, Inspector and vmedia, there would be options to specify a certificate for use during iPXE booting; this means additional httpd config to handle certs for HTTPS iPXE booting (same as we already have for vmedia)
  • Chain loading of the self-built iPXE firmware would be the default, even when dnsmasq receives a BOOTP request with a valid iPXE flag
  • There would be an additional script, similar to the current run scripts, that builds the iPXE firmware on the fly (it could inject certs, embedded scripts and setting changes); this could then be used as an entry point for a new optional init container of Metal3's Ironic deployments

Additional changes:

  • IMO the "images" directory should be one level up, next to directories like tftpboot; there is no point mixing it with the Ironic conductor's root HTTP directory, so I would like to change that too. I suspect this change will require a bit of extra httpd config, but not much
  • The default httpd config is currently edited with sed, and I'd like to use a Jinja template instead, to provide a uniform templating process for all config files that the ironic-image-based containers touch (see the sketch below)
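For illustration, a minimal sketch of the sed-to-Jinja idea; the template path and variable names here are hypothetical, not existing ironic-image files:

# httpd-vhost.conf.j2 (hypothetical template)
Listen {{ http_port }}
<VirtualHost *:{{ http_port }}>
    ServerName {{ ironic_url_host }}
    DocumentRoot /shared/html
</VirtualHost>

rendered at container startup with something like (assumes python3-jinja2 is installed in the image):

python3 -c 'import os, sys, jinja2; print(jinja2.Template(sys.stdin.read()).render(
    http_port=os.environ.get("HTTP_PORT", "80"),
    ironic_url_host=os.environ["IRONIC_URL_HOST"]))' \
    < httpd-vhost.conf.j2 > /etc/httpd/conf.d/vhost.conf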

IP conflicts on dnsmasq crash PXE firmware

Hi!

We have encountered a problem with conflicting IPs on dnsmasq on Dell hardware. It is well explained on a bug in Ironic's bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1301659

$> kubectl logs metal3-ironic-57bc4b8958-qqtp8 -n dc254-cluster-capi -c ironic-dnsmasq | grep 172.18.0.43  
dnsmasq-dhcp: 1453030933 DHCPOFFER(provisionnet) 172.18.0.43 98:a4:04:22:2e:a0 
dnsmasq-dhcp: 1375794672 DHCPOFFER(provisionnet) 172.18.0.43 e4:43:4b:c1:4a:78 
dnsmasq-dhcp: 1453030933 DHCPOFFER(provisionnet) 172.18.0.43 98:a4:04:22:2e:a0 
dnsmasq-dhcp: 1453030933 DHCPREQUEST(provisionnet) 172.18.0.43 98:a4:04:22:2e:a0 
dnsmasq-dhcp: 1453030933 DHCPACK(provisionnet) 172.18.0.43 98:a4:04:22:2e:a0 
dnsmasq-dhcp: 1375794672 DHCPREQUEST(provisionnet) 172.18.0.43 e4:43:4b:c1:4a:78 
dnsmasq-dhcp: 1375794672 DHCPNAK(provisionnet) 172.18.0.43 e4:43:4b:c1:4a:78 address in use

The moment that NAK gets to the machine, PXE stops and the machine moves to boot from HDD. (Thanks Dell for your good DHCP implementation)

The workaround described there is to have dnsmasq serve sequential IPs (the dhcp-sequential-ip option).
Could you add this workaround to the Metal³ Ironic image?
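For reference, this maps to a single dnsmasq configuration directive (a sketch; where exactly it lands in the generated dnsmasq.conf is up to the rundnsmasq script):

dhcp-sequential-ip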

Thanks a lot!

IPA can not fetch a token when "online" is false in BMHs creation

This issue has already been solved and is aimed to log what happened and how it was solved.

The symptoms

After changing the way we create the BMH resources, setting the "online" flag to false in the spec, all Metal3 deployments started failing with:

Suspicious activity detected for node a6aa1d09-30a9-42f8-8b5a-f32226c1ccd0 when attempting to heartbeat. Heartbeat request has been rejected as the version of ironic-python-agent indicated in the heartbeat operation should support agent token functionality

The provisioning was starting, but all nodes were staying in "clean wait" until timeout.

The error flow

IPA authentication using tokens is now a feature in Ironic, and is required if both IPA and Ironic support it (see the Ironic docs). Whenever it boots up, IPA queries Ironic through a lookup, and Ironic answers with a token that IPA must use afterwards in all communications with Ironic. Once generated, the token is only regenerated in some very specific cases.

The core problem was that after introspection, IPA was doing a lookup towards Ironic, requesting a token. However, since the online flag was false, BMO was turning off the node through ironic right after. When the node was booted again for deployment, IPA was requesting a token, but Ironic was refusing to give it since the previous request had been served, and the current one was unexpected. All subsequent operations were then failing with the given error message.

The root cause

There were actually two problems:

  1. Ironic was not removing the token when turning off the node. Whenever ironic causes the node to reboot, it should wipe the token as a new token is expected when the node reboots. This was fixed by @dtantsur in https://review.opendev.org/#/c/739964
  2. Ironic and IPA were misconfigured, leading to an unnecessary lookup after the introspection that was consuming a token. We will detail this below

The impact of fast track

Fast track is a feature of Ironic used to limit the number of reboots of the hardware. The idea is that after introspection, the node keeps running until deployment, so it does not have to reboot. This was broken in Metal3 quite a while back, so we decided to turn it off. However, we only disabled fast track in the Ironic configuration, missing some other elements that would have needed to change alongside it. We only changed deploy/fast_track to false. From the documentation:

# Whether to allow deployment agents to perform lookup,
# heartbeat operations during initial states of a machine
# lifecycle and by-pass the normal setup procedures for a
# ramdisk. This feature also enables power operations which
# are part of deployment processes to be bypassed if the
# ramdisk has performed a heartbeat operation using the
# fast_track_timeout setting.

So this flag pretty much only controlled whether Ironic would (re)boot the node before deploying.

Other hidden fast track parameters

In addition, until recently, fast track was not working properly because IPA was missing some configuration. This was fixed in @dtantsur's commit. However, setting that option (the ipa-api-url parameter passed to the ramdisk) means configuring IPA to behave as fast track, causing it to do a lookup after introspection. So if not using fast track, IPA should not be given this parameter, which disables its fast track behaviour. This way, it stopped consuming the token after introspection. The node would then be powered off, and when booting for deployment, it would be able to query a token properly.

Finally, a last configuration option had been missed: inspector/power_off. This controls whether the ramdisk is powered off after introspection. That should be the proper way to power off the node, rather than BMO forcing a shutdown through Ironic.

So by setting deploy/fast_track to false, setting inspector/power_off to true, and not giving the ipa-api-url parameter to the ramdisk, fast track was finally effectively disabled. This was done in the ironic-image fix from @dtantsur. A sketch of the combined settings follows below.
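Summarised as a conceptual sketch (section/option notation as used above; the exact files these land in depend on how ironic-image generates its configs):

deploy/fast_track = false     # Ironic does not bypass reboots and power operations between phases
inspector/power_off = true    # the ramdisk is powered off after introspection
# and do not pass ipa-api-url on the IPA ramdisk kernel command line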

When it comes together

The issue was triggered when we changed the "Online" flag of the BMHs because it caused them to be powered off in a way that was not expected by Ironic, that was then not wiping the token when powering off the node, because that operation was triggered by BMO. Since IPA had already done a lookup, at the next boot, it was failing to get a new token, causing the deployment to stall and timeout.

Conclusion

It is because of a bug in Ironic and a misconfiguration in ironic-image that this could happen. The fast track was not properly disabled in Metal3, allowing for this to happen. This bug has already been corrected, the configuration is now properly done, either fully disabling fast track or enabling it, but not having half way baked in the code. The fixes in Ironic now needs to go through review, be merged and make it to the RDO packages that we are using before we will be able to close this bug, but the configuration changes should solve the problem. Re-enabling fast track is our next item.

Issue in enabling ilo hardware type

Hi

I am working on enabling the ilo hardware type, and I am facing an issue when adding ilo to enabled_hardware_types (/etc/ironic/ironic.conf). This is the error I am getting:

ERROR:

/usr/lib/python3.6/site-packages/oslo_serialization/jsonutils.py:180: UserWarning: Cannot convert <ironic.drivers.modules.ipxe.iPXEBoot object at 0x7fbca2dd5e10> to primitive, will raise ValueError instead of warning in version 3.0
"instead of warning in version 3.0" % (value,))
ERROR oslo_service.service [req-4243a8de-19fc-4688-936d-61312ec4091b - - - - -] Error starting thread.: ironic.common.exception.IncompatibleInterface: boot interface implementation '<ironic.drivers.modules.ipxe.iPXEBoot object at 0x7fbca2dd5e10>' is not supported by hardware type IloHardware.
ERROR oslo_service.service Traceback (most recent call last):
ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/oslo_service/service.py", line 807, in run_service
ERROR oslo_service.service service.start()
ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/ironic/common/rpc_service.py", line 61, in start
ERROR oslo_service.service self.manager.init_host(admin_context)
ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/ironic/conductor/base_manager.py", line 166, in init_host
ERROR oslo_service.service self._register_and_validate_hardware_interfaces(hardware_types)
ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/ironic/conductor/base_manager.py", line 355, in _register_and_validate_hardware_interfaces
ERROR oslo_service.service ht, interface_type, driver_name=ht_name)
ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/ironic/common/driver_factory.py", line 140, in default_interface
ERROR oslo_service.service get_interface(hw_type, interface_type, impl_name)
ERROR oslo_service.service File "/usr/lib/python3.6/site-packages/ironic/common/driver_factory.py", line 111, in get_interface
ERROR oslo_service.service hardware_type=hw_type.__class__.__name__)
ERROR oslo_service.service ironic.common.exception.IncompatibleInterface: boot interface implementation '<ironic.drivers.modules.ipxe.iPXEBoot object at 0x7fbca2dd5e10>' is not supported by hardware type IloHardware.
ERROR oslo_service.service
INFO oslo.service.wsgi [-] Stopping WSGI server.

This is because the ilo hardware type does not support the pxe or ipxe boot interfaces, while ironic.conf defines default_boot_interface (which is ipxe). I tried removing default_boot_interface and running the Ironic service, and it ran successfully.

Why do we set this (default_boot_interface)? Can we remove it to support the ilo hardware type?
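For illustration, one possible configuration (a sketch; the interface names come from the Ironic driver docs and should be checked against the drivers enabled in the image):

[DEFAULT]
enabled_hardware_types = ipmi,ilo
enabled_boot_interfaces = ipxe,ilo-virtual-media
# With default_boot_interface omitted, each hardware type falls back to its
# own highest-priority supported boot interface.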

Allow overriding/specifying IRONIC_IP/IRONIC_URL_HOST

It is currently possible to set either the PROVISIONING_INTERFACE or PROVISIONING_IP. If PROVISIONING_IP is set, this IP is assumed to be associated with some interface and Ironic will wait until it can see it there. If PROVISIONING_INTERFACE is set, we check what IP is associated with it and assume that this is the IRONIC_IP and IRONIC_URL_HOST.

Ref:

wait_for_interface_or_ip()
{
    # If $PROVISIONING_IP is specified, then we wait for that to become available on an interface, otherwise we look at $PROVISIONING_INTERFACE for an IP
    if [[ -n "$PROVISIONING_IP" ]]; then
        # Convert the address using ipcalc which strips out the subnet. For IPv6 addresses, this will give the short-form address
        IRONIC_IP="$(ipcalc "${PROVISIONING_IP}" | grep "^Address:" | awk '{print $2}')"
        export IRONIC_IP
        until grep -F " ${IRONIC_IP}/" <(ip -br addr show); do
            echo "Waiting for ${IRONIC_IP} to be configured on an interface"
            sleep 1
        done
    else
        until [[ -n "$IRONIC_IP" ]]; do
            echo "Waiting for ${PROVISIONING_INTERFACE} interface to be configured"
            IRONIC_IP="$(ip -br add show scope global up dev "${PROVISIONING_INTERFACE}" | awk '{print $3}' | sed -e 's%/.*%%' | head -n 1)"
            export IRONIC_IP
            sleep 1
        done
    fi
    # If the IP contains a colon, then it's an IPv6 address, and the HTTP
    # host needs surrounding with brackets
    if [[ "$IRONIC_IP" =~ .*:.* ]]; then
        export IPV=6
        export IRONIC_URL_HOST="[$IRONIC_IP]"
    else
        export IPV=4
        export IRONIC_URL_HOST="$IRONIC_IP"
    fi
}

I would like to expose Ironic through a Service of type LoadBalancer instead of using host network. When doing this, the load balancer IP will not be directly associated with any interface in the container. This means that I cannot set the PROVISIONING_IP since Ironic would then wait indefinitely to see this IP on some interface (which will never happen). Instead I set the PROVISIONING_INTERFACE. This works great and I can reach both Ironic and Inspector when curling the load balancer IP (e.g. 192.168.222.200).

However, due to the snippet above, Ironic and Inspector will be configured to try to reach each other using the cluster network IP of the Pod (e.g. 10.244.0.13). This IP is volatile and not something that would be in the certificate when using TLS, so communication breaks down.

What can we do about this? Is it something that would go away together with the Inspector anyway? Maybe not worth doing anything at this point then. Or should we make it possible to override the IRONIC_IP and/or IRONIC_URL_HOST?
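One possible direction (a sketch; IRONIC_URL_HOST_OVERRIDE is a hypothetical variable, not an existing ironic-image option):

wait_for_interface_or_ip()
{
    # Hypothetical escape hatch: trust an operator-provided host (e.g. a
    # LoadBalancer IP) instead of deriving it from a local interface
    if [[ -n "${IRONIC_URL_HOST_OVERRIDE:-}" ]]; then
        export IRONIC_URL_HOST="$IRONIC_URL_HOST_OVERRIDE"
        return
    fi
    # ... existing interface/IP detection as above ...
}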

If this sounds interesting and you would like to play with it, try this:

  1. Clone https://github.com/lentzi90/playground/tree/ironic-loadbalancer#metal3 (use the branch ironic-loadbalancer)
  2. Run ./Metal3/dev-setup.sh
  3. Wait for all pods to be up
  4. Curl the APIs:
    1. curl https://192.168.222.200:5050 -k
    2. curl https://192.168.222.200:6385 -k
  5. Try creating a BMH to see the inspection error: NUM_BMH=1 ./Metal3/create-bmhs.sh

We still have logs in the filesystem.

We shouldn't be logging to the filesystem. We still have logs going into the shared volume. Eventually this will just balloon and cause problems. All logs should go to stdout, or else we need to implement a log rotation scheme.
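For httpd specifically, pointing the logs at the container's stdio is a small config change (a sketch using standard Apache directives; where this lands in the image's httpd config is an open question):

ErrorLog /dev/stderr
CustomLog /dev/stdout combined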

unable to build base image

on 4c9583eab21ef80fca48ab95eec28dfac9524f99

#14 18.49 Complete!
#14 18.63 6400+0 records in
#14 18.63 6400+0 records out
#14 18.63 6553600 bytes (6.6 MB, 6.2 MiB) copied, 0.0200331 s, 327 MB/s
#14 18.64 mkfs.fat 4.1 (2017-01-24)
#14 18.64 Error converting to codepage 850 Invalid argument
#14 18.64 Cannot initialize '::'
#14 18.64 Error converting to codepage 850 Invalid argument
#14 18.64 Cannot initialize '::'
#14 18.64 Error converting to codepage 850 Invalid argument
#14 18.64 Cannot initialize '::'
#14 18.64 Bad target ::EFI/BOOT
#14 18.64 Error converting to codepage 850 Invalid argument
#14 18.64 Cannot initialize '::'
#14 18.64 Bad target ::EFI/BOOT
#14 18.64 Error converting to codepage 850 Invalid argument
#14 18.64 Cannot initialize '::'
------
executor failed running [/bin/sh -c prepare-efi.sh centos]: exit code: 1

ironic.common.exception.InvalidMAC: Expected a MAC address but received (WWN)

I'm using Openshift 4.8.12 with baremetal-operator installed. After creating new kind: BareMetalHost to provision new node on Lenovo ThinkSystem SR630 with QLogic PCI adapter (16Gb FC Dual-port HBA) installed I observe few errors in metal3-ironic-conductor container:

2021-10-14 21:08:00.545 1 DEBUG sushy.connector [req-51e762af-4d11-4c4c-b72c-d134aea17358 ironic-user - - - -] HTTP response for GET https://10.zz.zz.zz/redfish/v1/Systems/1/EthernetInterfaces/NIC4: status code: 200 _op /usr/lib/python3.6/site-packages/sushy/connector.py:184
2021-10-14 21:08:00.546 1 DEBUG sushy.resources.base [req-51e762af-4d11-4c4c-b72c-d134aea17358 ironic-user - - - -] Received representation of EthernetInterface /redfish/v1/Systems/1/EthernetInterfaces/NIC4: {'_oem_vendors': None, 'description': 'External Network Interface', 'identity': 'NIC4', 'links': {'oem_vendors': None}, 'mac_address': '21:00:F4:E9:XX:XX:XX:XX', 'name': 'External Ethernet Interface', 'permanent_mac_address': '21:00:F4:E9:XX:XX:XX:XX', 'speed_mbps': 8000, 'status': {'health': 'ok', 'health_rollup': None, 'state': 'enabled'}} refresh /usr/lib/python3.6/site-packages/sushy/resources/base.py:634
2021-10-14 21:08:00.555 1 WARNING ironic.drivers.modules.inspect_utils [req-51e762af-4d11-4c4c-b72c-d134aea17358 ironic-user - - - -] Port already exists for MAC address 7C:D3:0A:XX:XX:XX for node 55eebd5c-340f-42ea-990c-44b348210466: ironic.common.exception.MACAlreadyExists: A port with MAC address 7c:d3:0a:xx:xx:xx already exists.
2021-10-14 21:08:00.568 1 WARNING ironic.drivers.modules.inspect_utils [req-51e762af-4d11-4c4c-b72c-d134aea17358 ironic-user - - - -] Port already exists for MAC address 7C:D3:0A:XX:XX:XX for node 55eebd5c-340f-42ea-990c-44b348210466: ironic.common.exception.MACAlreadyExists: A port with MAC address 7c:d3:0a:xx:xx:xx already exists.
2021-10-14 21:08:00.569 1 DEBUG ironic.common.states [req-51e762af-4d11-4c4c-b72c-d134aea17358 ironic-user - - - -] Exiting old state 'inspecting' in response to event 'fail' on_exit /usr/lib/python3.6/site-packages/ironic/common/states.py:295
2021-10-14 21:08:00.569 1 DEBUG ironic.common.states [req-51e762af-4d11-4c4c-b72c-d134aea17358 ironic-user - - - -] Entering new state 'inspect failed' in response to event 'fail' on_enter /usr/lib/python3.6/site-packages/ironic/common/states.py:301
2021-10-14 21:08:00.582 1 ERROR ironic.conductor.task_manager [req-51e762af-4d11-4c4c-b72c-d134aea17358 ironic-user - - - -] Node 55eebd5c-340f-42ea-990c-44b348210466 moved to provision state "inspect failed" from state "inspecting"; target provision state is "manageable": ironic.common.exception.InvalidMAC: Expected a MAC address but received 21:00:F4:E9:XX:XX:XX:XX.
2021-10-14 21:08:00.583 1 ERROR ironic.conductor.manager [req-51e762af-4d11-4c4c-b72c-d134aea17358 ironic-user - - - -] Failed to inspect node 55eebd5c-340f-42ea-990c-44b348210466: Expected a MAC address but received 21:00:F4:E9:XX:XX:XX:XX.: ironic.common.exception.InvalidMAC: Expected a MAC address but received 21:00:F4:E9:XX:XX:XX:XX.
2021-10-14 21:08:00.594 1 DEBUG ironic.conductor.task_manager [req-51e762af-4d11-4c4c-b72c-d134aea17358 ironic-user - - - -] Successfully released exclusive lock for hardware inspection on node 55eebd5c-340f-42ea-990c-44b348210466 (lock was held 2.32 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:447

As you can see, after this error the provisioning state changes to "inspect failed". Since the QLogic adapter is an FC HBA, it reports WWNs rather than MAC addresses. How should this be handled?

Mariadb doesn't read conf from /etc/my.conf

The correct location for the MariaDB configuration file is either /etc/my.cnf or any .cnf file under /etc/my.cnf.d/.
In our specific case, the configuration is located in /etc/my.cnf.d/mariadb-server.cnf, and this is what we should use.

Healthcheck is too aggressive

It appears the healthcheck script timeout of 10 seconds is too aggressive. I'm not sure which part is failing. Bumping it up to 30 seconds might be more reasonable.

ipmitool PXE bug

I have a clean metal3-dev-env installation on baremetal. The installation was successfully completed.

When I applied provision_cluster and then provision_controlplane i got endless "Provisioning" state in "kubectl get baremetalhosts".

"openstack baremetal node list" showed a Clean Failed error (HTTP 400). That's why I decided to go watch Ironic logs. Ironic logs give me this:

2020-06-18 18:19:31.140 48 ERROR ironic.drivers.modules.ipmitool [req-132349a1-8a40-45a3-a81d-bc1b07ba1419 - - - - -] IPMI Error while attempting "ipmitool -I lanplus -H 192.168.111.1 -L ADMINISTRATOR -p 6231 -U admin -R 1 -N 1 -f /tmp/tmpl5c2z1m2 chassis bootdev pxe" for node 46de4129-2a16-422d-bf5f-99914907a170. Error: Unexpected error while running command.
Command: ipmitool -I lanplus -H 192.168.111.1 -L ADMINISTRATOR -p 6231 -U admin -R 1 -N 1 -f /tmp/tmpl5c2z1m2 chassis bootdev pxe
Exit code: 1
Stdout: ''
Stderr: 'Error setting Chassis Boot Parameter 5\n': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

When I tried to run this command manually I got the same error. The cause is the "-R 1" argument: if I change it to "-R 2", everything works fine. But how can I change it inside the code?

Thanks.
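For what it's worth, the retry arguments are not hard-coded in ironic-image: ipmitool's -R/-N values are computed by Ironic from its [ipmi] configuration. A sketch of the knobs to experiment with (option names taken from the Ironic docs; verify them against the Ironic release in your image):

[ipmi]
# minimum number of seconds between IPMI commands; feeds the computed -N value
min_command_interval = 5
# overall retry time budget, which together with the interval determines -R
command_retry_timeout = 60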

Consider using more threads to improve build speed

Currently this line

make bin/undionly.kpxe bin-x86_64-efi/ipxe.efi bin-x86_64-efi/snponly.efi

uses a single thread for compilation, which is far too slow. Please consider using a reasonable number of threads, like 'make -j4' (see the sketch below).
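For example, scaling to the available cores:

make -j"$(nproc)" bin/undionly.kpxe bin-x86_64-efi/ipxe.efi bin-x86_64-efi/snponly.efi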

runmariadb seems to fail with MariaDB 10.3.28

The latest CentOS 8 build updates MariaDB to 10.3.28, and it seems the current runmariadb script doesn't work well with it.
I'm seeing the mariadb pod crash after the update. After pinning the mariadb-server version to 10.3.27, the issue is resolved.

This is the log of failing mariadb pod: https://pastebin.com/EAUpevfx

Few containers fail to start: iptables v1.4.21: can't initialize iptables table `filter'

A few ironic* containers fail to start as part of the BMO pod:

$ oc get po -n openshift-machine-api | grep baremet
metal3-baremetal-operator-74fdb86688-pw6c4    4/8     CrashLoopBackOff    152        3h5m
$ oc logs po/metal3-baremetal-operator-74fdb86688-pw6c4 -n openshift-machine-api -c ironic-dnsmasq
iptables v1.4.21: can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.
$ oc logs po/metal3-baremetal-operator-74fdb86688-pw6c4 -n openshift-machine-api -c ironic-api
iptables v1.4.21: can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.
$ oc logs po/metal3-baremetal-operator-74fdb86688-pw6c4 -n openshift-machine-api -c ironic-httpd
iptables v1.4.21: can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.

Support building for multiple architectures

User Story

As a developer/user/operator I would like to use pre-built ironic images on multiple architectures because it is convenient (compared to building everything yourself).

Detailed Description
We currently only publish amd64 container images, but there are other popular architectures.
I suggest we look at how CAPI/CAPM3 has structured the Makefile with a docker-build-all target for building all supported architectures. It will probably be a bit different for Ironic though, since it is not written in Go.

Once we can build these images, it should be fairly trivial to also make CI build and publish them together with a multi-arch manifest for easy consumption.
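For example, once the Dockerfile builds cleanly on each architecture, publishing could look something like this (a sketch using docker buildx; the tag and platform list are illustrative):

docker buildx build \
    --platform linux/amd64,linux/arm64 \
    --tag quay.io/metal3-io/ironic:latest \
    --push .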

/kind feature

Ironic container is not very kubernetes friendly

Hey guys!

So in the process of trying to use this in a pod with the baremetal operator, I noticed a few things which should probably be changed, or at least talked about, with respect to running this container in kubernetes. My intention here is just to record these so we can work with them.

  • We can't bind mount anything in, so I've got a PR to download the images at startup. Open to suggestions here.
  • We shouldn't really be running 4 processes in one container. It makes debugging much harder.
  • We shouldn't be logging to /var/log/. We can't access that from the host or anywhere else, and when the container dies it's not possible to see the issues in the logs. Ideally we would have 1 process per container logging to stdout so we can get logs from kubernetes. Even just having the 4 processes log to stdout would be an improvement for debugging.
  • httpd is listening on port 80? Hmm, tricky one to work with. We have to run this with net=host so we can do DHCP/PXE, which means it interferes with any service on the host (currently hitting this in testing with minishift).
  • The healthcheck should be implemented using the kubernetes/docker/podman healthcheck function.

Sorry to just point out a bunch of stuff.. like I say just trying to get it all summarised.

It would be great to have health checks compatible with kubernetes.

The current arrangement of having the container exit on failure is not a standard method of performing health checks in kubernetes (or other container systems).

In kubernetes you would usually use a livenessProbe which can specify a command. This is essentially exec'd every n seconds and the result is checked. Multiple failures can result in the container being restarted. This is especially nice because the failures are logged and you can still get logs of a failing container. Also for debug purposes you can disable the healthcheck and still exec into the container.

see eg: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
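For illustration, a minimal sketch of such a probe (the exec command and the Ironic API port here are assumptions about the deployment, not an existing manifest):

livenessProbe:
  exec:
    command: ["curl", "-sSf", "http://127.0.0.1:6385"]
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3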

I thought podman does not natively support these, but it seems I was wrong. There is some information on how to do this here:

https://developers.redhat.com/blog/2019/04/18/monitoring-container-vitality-and-availability-with-podman/

Thanks everyone! These containers are getting better all the time!

Flask host and port are static in exporter conf

The change #101 re-introduced the prometheus-exporter app in the ironic image.
This app is based on Flask, and in the original configuration its host and port are statically defined in the startup script runexporterapp.sh.
These two parameters should be made customizable and controllable from outside the container.
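For example, the script could read them from the environment with defaults (a sketch; EXPORTER_HOST, EXPORTER_PORT and the default port are hypothetical names, while FLASK_RUN_HOST/FLASK_RUN_PORT are the standard variables honoured by flask run):

# in runexporterapp.sh
export FLASK_RUN_HOST="${EXPORTER_HOST:-0.0.0.0}"
export FLASK_RUN_PORT="${EXPORTER_PORT:-9608}"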

dhcp-sequential-ip issues

In order to prevent collisions in case of bursts, we enabled dhcp-sequential-ip for dnsmasq. However, this turned out to be even more problematic in bigger-scale deployments.

In a 30-node deployment, when the BMHs are all created at the same time, or the deployment is triggered (in CAPI all workers start to provision at the same time), we see many issues where dnsmasq does not answer a DISCOVER from a node and prints "no address available" in the logs. After several retries the node usually gets an IP, but some PXE firmware does not retry enough and instead falls back to booting from HDD, failing the introspection.

In addition, when booting from HDD, if the client already had an IP (for example .250), it will request it directly upon reboot; dnsmasq will ACK it, but then keeps allocating addresses from there (.251, .252, .253, etc.).

Overall I think we should remove this option. It seemed to fix a collision issue, but actually caused more problems in "big" scale deployments.

Security scan reports high level vulnerabilities in ironic and ironic-inspector images

I am working on checking image-level security issues and found that the ironic and ironic-inspector images are not passing security tests; the scanning tool in our CI/CD reported more than 5 high-level security issues. I think all of the reported security issues should be fixed with priority.

Below are the details:

  1. Security vulnerability issues reported for images below

Every container runs 'configure-ironic.sh': Required inputs unclear

The containers are split up now, but I'm not sure which configuration options are important to which containers. I'm configuring them all for now, but things like the HTTP port etc. may not be required by all of them.

I guess this is really just a docs request; we should update the README.
