metal-stack / mini-lab
a small, virtual setup to locally run the metal-stack
License: MIT License
For testing, I was trying to reboot individual machines instead of rebooting the whole mini-lab. First I ran make delete-machine0x, after which the machine goes into the Planned Reboot state, and then I ran make reboot-machine0x. After running the latter, the machines either hang in the PXE Booting status (with 💀 appearing after some time) or stay in the Planned Reboot state.
OS: Ubuntu 20.04
Vagrant : 2.2.9
Docker:
Server: Docker Engine - Community
Engine:
Version: 19.03.13
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:01:20 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.9
GitCommit: ea765aba0d05254012b0b9e595e995c09186427f
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
Steps to reproduce:
Result:
Status: Downloaded newer image for mikefarah/yq:latest
Error: unknown command "/bin/sh" for "yq"
Run 'yq --help' for usage.
make: *** [Makefile:131: env] Error 1
Cause:
env.sh runs docker run "mikefarah/yq", which pulls the latest image. For some time now, latest has been version 4.x, which incorporates changes that break compatibility, see https://mikefarah.gitbook.io/yq/upgrading-from-v3:
So we should simply use mikefarah/yq:3 instead.
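A minimal sketch of what the pinned invocation in env.sh could look like; the query path and file name here are illustrative, not the actual env.sh contents:

```shell
# Pin the yq image to the v3 tag instead of implicitly pulling :latest (now v4).
# `yq r` is v3 syntax; v4 removed it in favor of `yq eval`, which is why the
# unpinned image fails with: unknown command "/bin/sh" for "yq".
docker run --rm -i mikefarah/yq:3 yq r - 'metal-stack.version' < release.yaml
```

Pinning the tag also protects against the next major release breaking the target again.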
https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.0.0.
This requires that control-plane clusters are updated to k8s >= 1.19, which is already the case for the mini-lab.
❯ host test.0.0.0.0.nip.io
❯ host test.1.0.0.0.nip.io
test.1.0.0.0.nip.io has address 1.0.0.0
Since we now resolve all versions from the release vector, the env target in the Makefile, which defines the version of metalctl, does not work anymore.
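A sketch of how the metalctl version could instead be resolved from the release vector itself; the YAML key path and file name are assumptions, not the real layout:

```shell
# Read metalctl's version straight from the downloaded release vector,
# making the broken `env` target in the Makefile unnecessary.
METALCTL_VERSION=$(docker run --rm -i mikefarah/yq:3 \
  yq r - 'binaries.metal-stack.metalctl.version' < release.yaml)
echo "metalctl version: ${METALCTL_VERSION}"
```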
Some time after the first start of the mini-lab, I fail to restart it. I get a very similar error, but at different stages (so far I have seen errors at deploy-partition | TASK [metal-roles/partition/roles/docker-on-cumulus : ensure dependencies are installed], deploy-partition | TASK [ansible-common/roles/systemd-docker-service : pre-pull docker image], and deploy-partition | TASK [metal-roles/partition/roles/metal-core : wait for metal-core to listen on port]). Here is the last error I got:
deploy-control-plane | fatal: [localhost]: FAILED! => changed=true
deploy-control-plane | cmd:
deploy-control-plane | - helm
deploy-control-plane | - upgrade
deploy-control-plane | - --install
deploy-control-plane | - --namespace
deploy-control-plane | - metal-control-plane
deploy-control-plane | - --debug
deploy-control-plane | - --set
deploy-control-plane | - helm_chart.config_hash=7fc19e1bc1a3ee41f622c3de7bc98ee33756844e
deploy-control-plane | - -f
deploy-control-plane | - metal-values.j2
deploy-control-plane | - --repo
deploy-control-plane | - https://helm.metal-stack.io
deploy-control-plane | - --version
deploy-control-plane | - 0.2.1
deploy-control-plane | - --wait
deploy-control-plane | - --timeout
deploy-control-plane | - 600s
deploy-control-plane | - metal-control-plane
deploy-control-plane | - metal-control-plane
deploy-control-plane | delta: '0:10:02.713685'
deploy-control-plane | end: '2020-12-09 08:47:29.432729'
deploy-control-plane | msg: non-zero return code
deploy-control-plane | rc: 1
deploy-control-plane | start: '2020-12-09 08:37:26.719044'
deploy-control-plane | stderr: |-
deploy-control-plane | history.go:53: [debug] getting history for release metal-control-plane
deploy-control-plane | install.go:172: [debug] Original chart version: "0.2.1"
deploy-control-plane | install.go:189: [debug] CHART PATH: /root/.cache/helm/repository/metal-control-plane-0.2.1.tgz
deploy-control-plane |
deploy-control-plane | client.go:255: [debug] Starting delete for "metal-api-initdb" Job
deploy-control-plane | client.go:284: [debug] jobs.batch "metal-api-initdb" not found
deploy-control-plane | client.go:109: [debug] creating 1 resource(s)
deploy-control-plane | client.go:464: [debug] Watching for changes to Job metal-api-initdb with timeout of 10m0s
deploy-control-plane | client.go:492: [debug] Add/Modify event for metal-api-initdb: ADDED
deploy-control-plane | client.go:531: [debug] metal-api-initdb: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
deploy-control-plane | client.go:492: [debug] Add/Modify event for metal-api-initdb: MODIFIED
deploy-control-plane | client.go:531: [debug] metal-api-initdb: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
deploy-control-plane | Error: failed pre-install: timed out waiting for the condition
deploy-control-plane | helm.go:81: [debug] failed pre-install: timed out waiting for the condition
deploy-control-plane | stderr_lines: <omitted>
deploy-control-plane | stdout: Release "metal-control-plane" does not exist. Installing it now.
deploy-control-plane | stdout_lines: <omitted>
deploy-control-plane |
deploy-control-plane | PLAY RECAP *********************************************************************
deploy-control-plane | localhost : ok=24 changed=11 unreachable=0 failed=1 skipped=8 rescued=0 ignored=0
I'm using the mini-lab on the master branch with only one change: metal_stack_release_version is set to develop. The only thing that reliably helps is pruning everything (networks, build cache, containers, images) from Docker.
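For reference, the pruning workaround described above can be done in one command; note that this wipes all unused local Docker state, not just mini-lab resources:

```shell
# Remove all stopped containers, unused networks, all unused images (not only
# dangling ones) and the build cache. Add --volumes to also drop named volumes.
docker system prune -a -f
```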
OS: Ubuntu 20.04
Vagrant : 2.2.9
Docker:
Server: Docker Engine - Community
Engine:
Version: 19.03.13
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:01:20 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.9
GitCommit: ea765aba0d05254012b0b9e595e995c09186427f
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
cc @Gerrit91, @LimKianAn
We keep forgetting to update the OS images, and the latest images may be unstable, so it would be best to always use the most recently released images for the mini-lab.
These roles were placed into metal-roles, but they are only used in the mini-lab.
Possibly we should create the requirements.yml dynamically from the values in the release vector, as otherwise we could forget to update them. There are already quite a lot of new versions available.
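A sketch of what the dynamic generation could look like, assuming the release vector exposes the role versions under a key like ansible-roles (all key paths below are illustrative):

```shell
# Generate requirements.yml from the release vector so the pinned role
# versions can never drift from the release.
for role in ansible-common metal-ansible-modules metal-roles; do
  version=$(docker run --rm -i mikefarah/yq:3 \
    yq r - "ansible-roles.${role}.version" < release.yaml)
  printf -- "- src: https://github.com/metal-stack/%s\n  version: %s\n" \
    "$role" "$version"
done > requirements.yml
```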
Hi,
I normally use only make control-plane to start up the control plane locally in a cluster. Sadly, this does not work if the IP address 192.168.121.1 does not exist. This IP is created implicitly when Vagrant spins up some machines that are not needed by the control plane (and which I don't want to spin up only to obtain this IP).
My workaround is to run
sudo ip a add 192.168.121.1/24 dev eno1 label eno1:1
where eno1 is the main interface on my machine. It would be great to do this automatically somewhere in the whole machinery.
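A sketch of how this could be automated as a small pre-flight step before make control-plane; the script, variable names, and interface default are all illustrative, not existing mini-lab code:

```shell
#!/usr/bin/env sh
# Create the libvirt gateway address if vagrant has not brought it up yet.
GW_IP="192.168.121.1"
IFACE="${1:-eno1}"   # main interface of the host, passed as first argument

if ! ip -4 addr show | grep -q "inet ${GW_IP}/"; then
  sudo ip addr add "${GW_IP}/24" dev "${IFACE}" label "${IFACE}:1"
fi
```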
The repo key of Cumulus 3.x is not valid anymore, and therefore updates/installs fail. This also breaks the integration test in https://github.com/metal-stack/releases
Solution:
cumulus@switch:~$ wget http://repo3.cumulusnetworks.com/public-key/repo3-2023-key
cumulus@switch:~$ sudo apt-key add repo3-2023-key
cumulus@switch:~$ sudo -E apt-get update
Should be possible after #153.
containerlab added the new kind "Generic VM". Maybe we can remove our custom code for the machines and SONiC.
This would give us much higher test coverage, as the ipmi_sim from OpenIPMI also seems to be pretty much feature-complete.
First steps for trying it out would be:
-device ipmi-bmc-sim,id=bmc0
-chardev socket,id=ipmi0,host=localhost,port=9002,reconnect=10
-device ipmi-bmc-extern,id=bmc1,chardev=ipmi0
-device isa-ipmi-kcs,bmc=bmc1
connect ipmi_sim to this device and talk to it with ipmitool
If this works out, we can think about where to deploy the metal-bmc to connect the system to the metal-stack. With this, we could start integration tests for go-hal and also refactor go-hal so that we have a working default implementation for the IPMI protocol (wider hardware support).
References:
Additional information:
Containerlab can be used to define our network topology as a YAML file (comparable to docker-compose) and to use docker images / VM images as nodes.
The VM part is mostly implemented in https://github.com/plajjan/vrnetlab, where Cumulus support would need to be added.
Instead of killing the process, we can maybe do something like Ctrl-a + x: https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg03212.html
Relates to #52.
To speed up deployment of the lab, it could make sense to build our own leaf Cumulus image that already contains some of the runtime dependencies, e.g. docker.
Update documentation for containerlab based mini-lab.
OS: Ubuntu 20.04.1 LTS
Vagrant: 2.2.9
Docker: 19.03.8
Docker-Compose: 1.27.3, build 4092ae5d
I had a problem running the example from the README. When running make, I get the following error, although the script finishes successfully:
deploy-partition | fatal: [leaf01]: UNREACHABLE! => changed=false
deploy-partition | msg: 'Failed to connect to the host via ssh: ssh: Could not resolve hostname leaf01: Name or service not known'
deploy-partition | unreachable: true
deploy-partition | fatal: [leaf02]: UNREACHABLE! => changed=false
deploy-partition | msg: 'Failed to connect to the host via ssh: ssh: Could not resolve hostname leaf02: Name or service not known'
deploy-partition | unreachable: true
deploy-partition |
deploy-partition | PLAY RECAP *********************************************************************
deploy-partition | leaf01 : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
deploy-partition | leaf02 : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
vagrant up
Bringing machine 'leaf02' up with 'libvirt' provider...
Bringing machine 'leaf01' up with 'libvirt' provider...
==> leaf02: Checking if box 'CumulusCommunity/cumulus-vx' version '3.7.13' is up to date...
==> leaf01: Checking if box 'CumulusCommunity/cumulus-vx' version '3.7.13' is up to date...
==> leaf02: Creating image (snapshot of base box volume).
==> leaf01: Creating image (snapshot of base box volume).
==> leaf02: Creating domain with the following settings...
==> leaf01: Creating domain with the following settings...
==> leaf02: -- Name: metalleaf02
==> leaf02: -- Domain type: kvm
==> leaf01: -- Name: metalleaf01
==> leaf01: -- Domain type: kvm
==> leaf02: -- Cpus: 1
==> leaf02: -- Feature: acpi
==> leaf01: -- Cpus: 1
==> leaf01: -- Feature: acpi
==> leaf02: -- Feature: apic
==> leaf02: -- Feature: pae
==> leaf01: -- Feature: apic
==> leaf01: -- Feature: pae
==> leaf02: -- Memory: 512M
==> leaf01: -- Memory: 512M
==> leaf02: -- Management MAC:
==> leaf01: -- Management MAC:
==> leaf01: -- Loader:
==> leaf02: -- Loader:
==> leaf01: -- Nvram:
==> leaf01: -- Base box: CumulusCommunity/cumulus-vx
==> leaf02: -- Nvram:
==> leaf02: -- Base box: CumulusCommunity/cumulus-vx
==> leaf01: -- Storage pool: default
==> leaf01: -- Image: /var/lib/libvirt/images/metalleaf01.img (6G)
==> leaf02: -- Storage pool: default
==> leaf02: -- Image: /var/lib/libvirt/images/metalleaf02.img (6G)
==> leaf01: -- Volume Cache: default
==> leaf02: -- Volume Cache: default
==> leaf01: -- Kernel:
==> leaf02: -- Kernel:
==> leaf01: -- Initrd:
==> leaf02: -- Initrd:
==> leaf01: -- Graphics Type: vnc
==> leaf01: -- Graphics Port: -1
==> leaf02: -- Graphics Type: vnc
==> leaf02: -- Graphics Port: -1
==> leaf01: -- Graphics IP: 127.0.0.1
==> leaf02: -- Graphics IP: 127.0.0.1
==> leaf01: -- Graphics Password: Not defined
==> leaf02: -- Graphics Password: Not defined
==> leaf01: -- Video Type: cirrus
==> leaf02: -- Video Type: cirrus
==> leaf01: -- Video VRAM: 9216
==> leaf02: -- Video VRAM: 9216
==> leaf01: -- Sound Type:
==> leaf01: -- Keymap: de
==> leaf02: -- Sound Type:
==> leaf01: -- TPM Path:
==> leaf02: -- Keymap: de
==> leaf02: -- TPM Path:
==> leaf01: -- INPUT: type=mouse, bus=ps2
==> leaf01: -- RNG device model: random
==> leaf02: -- INPUT: type=mouse, bus=ps2
==> leaf02: -- RNG device model: random
==> leaf01: Creating shared folders metadata...
==> leaf02: Creating shared folders metadata...
==> leaf01: Starting domain.
==> leaf02: Starting domain.
==> leaf01: Waiting for domain to get an IP address...
==> leaf02: Waiting for domain to get an IP address...
==> leaf01: Waiting for SSH to become available...
==> leaf02: Waiting for SSH to become available...
leaf01:
leaf01: Vagrant insecure key detected. Vagrant will automatically replace
leaf01: this with a newly generated keypair for better security.
leaf02:
leaf02: Vagrant insecure key detected. Vagrant will automatically replace
leaf02: this with a newly generated keypair for better security.
leaf02:
leaf02: Inserting generated public key within guest...
leaf01:
leaf01: Inserting generated public key within guest...
leaf02: Removing insecure key from the guest if it's present...
leaf01: Removing insecure key from the guest if it's present...
leaf01: Key inserted! Disconnecting and reconnecting using new SSH key...
leaf02: Key inserted! Disconnecting and reconnecting using new SSH key...
==> leaf01: Setting hostname...
==> leaf02: Setting hostname...
==> leaf01: Running provisioner: shell...
==> leaf02: Running provisioner: shell...
leaf01: Running: /tmp/vagrant-shell20201024-51781-7ivrnw.sh
leaf02: Running: /tmp/vagrant-shell20201024-51781-e8hvaf.sh
leaf01: #################################
leaf01: Running Switch Post Config (config_switch.sh)
leaf01: #################################
leaf02: #################################
leaf02: Running Switch Post Config (config_switch.sh)
leaf02: #################################
leaf01: #################################
leaf01: Finished
leaf01: #################################
leaf02: #################################
leaf02: Finished
leaf02: #################################
==> leaf01: Running provisioner: shell...
==> leaf02: Running provisioner: shell...
leaf01: Running: /tmp/vagrant-shell20201024-51781-otw21i.sh
leaf02: Running: /tmp/vagrant-shell20201024-51781-h3jegd.sh
leaf01: #### UDEV Rules (/etc/udev/rules.d/70-persistent-net.rules) ####
leaf01: INFO: Adding UDEV Rule: Vagrant interface = eth0
leaf01: INFO: Adding UDEV Rule: 44:38:39:00:00:1a --> swp1
leaf01: INFO: Adding UDEV Rule: 44:38:39:00:00:18 --> swp2
leaf01: ACTION=="add", SUBSYSTEM=="net", ATTR{ifindex}=="2", NAME="eth0", SUBSYSTEMS=="pci"
leaf01: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="44:38:39:00:00:1a", NAME="swp1", SUBSYSTEMS=="pci"
leaf01: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="44:38:39:00:00:18", NAME="swp2", SUBSYSTEMS=="pci"
==> leaf01: Running provisioner: shell...
leaf02: #### UDEV Rules (/etc/udev/rules.d/70-persistent-net.rules) ####
leaf02: INFO: Adding UDEV Rule: Vagrant interface = eth0
leaf02: INFO: Adding UDEV Rule: 44:38:39:00:00:04 --> swp1
leaf02: INFO: Adding UDEV Rule: 44:38:39:00:00:19 --> swp2
leaf02: ACTION=="add", SUBSYSTEM=="net", ATTR{ifindex}=="2", NAME="eth0", SUBSYSTEMS=="pci"
leaf02: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="44:38:39:00:00:04", NAME="swp1", SUBSYSTEMS=="pci"
leaf02: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="44:38:39:00:00:19", NAME="swp2", SUBSYSTEMS=="pci"
==> leaf02: Running provisioner: shell...
leaf01: Running: /tmp/vagrant-shell20201024-51781-7bdyq2.sh
leaf02: Running: /tmp/vagrant-shell20201024-51781-32eax6.sh
leaf01: ### RUNNING CUMULUS EXTRA CONFIG ###
leaf01: INFO: Detected a 3.x Based Release (3.7.13)
leaf01: ### Disabling default remap on Cumulus VX...
leaf01: INFO: Detected Cumulus Linux v3.7.13 Release
leaf01: ### Fixing ONIE DHCP to avoid Vagrant Interface ###
leaf01: Note: Installing from ONIE will undo these changes.
leaf02: ### RUNNING CUMULUS EXTRA CONFIG ###
leaf02: INFO: Detected a 3.x Based Release (3.7.13)
leaf02: ### Disabling default remap on Cumulus VX...
leaf02: INFO: Detected Cumulus Linux v3.7.13 Release
leaf02: ### Fixing ONIE DHCP to avoid Vagrant Interface ###
leaf02: Note: Installing from ONIE will undo these changes.
leaf01: ### Giving Vagrant User Ability to Run NCLU Commands ###
leaf02: ### Giving Vagrant User Ability to Run NCLU Commands ###
leaf01: Adding user `vagrant' to group `netedit' ...
leaf02: Adding user `vagrant' to group `netedit' ...
leaf01: Adding user vagrant to group netedit
leaf02: Adding user vagrant to group netedit
leaf02: Done.
leaf01: Done.
leaf02: Adding user `vagrant' to group `netshow' ...
leaf02: Adding user vagrant to group netshow
leaf01: Adding user `vagrant' to group `netshow' ...
leaf01: Adding user vagrant to group netshow
leaf01: Done.
leaf01: ### Disabling ZTP service...
leaf02: Done.
leaf02: ### Disabling ZTP service...
leaf01: Removed symlink /etc/systemd/system/multi-user.target.wants/ztp.service.
leaf02: Removed symlink /etc/systemd/system/multi-user.target.wants/ztp.service.
leaf01: ### Resetting ZTP to work next boot...
leaf02: ### Resetting ZTP to work next boot...
leaf01: Created symlink from /etc/systemd/system/multi-user.target.wants/ztp.service to /lib/systemd/system/ztp.service.
leaf02: Created symlink from /etc/systemd/system/multi-user.target.wants/ztp.service to /lib/systemd/system/ztp.service.
leaf01: ### DONE ###
leaf02: ### DONE ###
./env.sh
docker-compose up --remove-orphans --force-recreate control-plane partition && vagrant up machine01 machine02
Recreating deploy-partition ... done
Recreating deploy-control-plane ... done
Attaching to deploy-partition, deploy-control-plane
deploy-control-plane |
deploy-control-plane | PLAY [provide requirements.yaml] ***********************************************
deploy-partition |
deploy-partition | PLAY [provide requirements.yaml] ***********************************************
deploy-control-plane |
deploy-control-plane | TASK [download release vector] *************************************************
deploy-partition |
deploy-partition | TASK [download release vector] *************************************************
deploy-partition | ok: [localhost]
deploy-control-plane | ok: [localhost]
deploy-partition |
deploy-partition | TASK [write requirements.yaml from release vector] *****************************
deploy-control-plane |
deploy-control-plane | TASK [write requirements.yaml from release vector] *****************************
deploy-control-plane | ok: [localhost]
deploy-partition | ok: [localhost]
deploy-control-plane |
deploy-control-plane | PLAY RECAP *********************************************************************
deploy-control-plane | localhost : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
deploy-control-plane |
deploy-partition |
deploy-partition | PLAY RECAP *********************************************************************
deploy-partition | localhost : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
deploy-partition |
deploy-partition | - extracting ansible-common to /root/.ansible/roles/ansible-common
deploy-partition | - ansible-common (v0.5.5) was installed successfully
deploy-control-plane | - extracting ansible-common to /root/.ansible/roles/ansible-common
deploy-control-plane | - ansible-common (v0.5.5) was installed successfully
deploy-partition | - extracting metal-ansible-modules to /root/.ansible/roles/metal-ansible-modules
deploy-partition | - metal-ansible-modules (v0.1.1) was installed successfully
deploy-control-plane | - extracting metal-ansible-modules to /root/.ansible/roles/metal-ansible-modules
deploy-control-plane | - metal-ansible-modules (v0.1.1) was installed successfully
deploy-control-plane | - extracting metal-roles to /root/.ansible/roles/metal-roles
deploy-control-plane | - metal-roles (v0.3.3) was installed successfully
deploy-partition | - extracting metal-roles to /root/.ansible/roles/metal-roles
deploy-partition | - metal-roles (v0.3.3) was installed successfully
deploy-control-plane |
deploy-control-plane | PLAY [deploy control plane] ****************************************************
deploy-control-plane |
deploy-control-plane | TASK [ingress-controller : Apply mandatory nginx-ingress definition] ***********
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [ingress-controller : Deploy nginx-ingress service] ***********************
deploy-partition | [WARNING]: * Failed to parse /root/.ansible/roles/ansible-
deploy-partition | common/inventory/vagrant/vagrant.py with script plugin: Inventory script
deploy-partition | (/root/.ansible/roles/ansible-common/inventory/vagrant/vagrant.py) had an
deploy-partition | execution error: Traceback (most recent call last): File
deploy-partition | "/root/.ansible/roles/ansible-common/inventory/vagrant/vagrant.py", line 452,
deploy-partition | in <module> main() File "/root/.ansible/roles/ansible-
deploy-partition | common/inventory/vagrant/vagrant.py", line 447, in main hosts, meta_vars =
deploy-partition | list_running_hosts() File "/root/.ansible/roles/ansible-
deploy-partition | common/inventory/vagrant/vagrant.py", line 414, in list_running_hosts _,
deploy-partition | host, key, value = line.split(',')[:4] ValueError: not enough values to unpack
deploy-partition | (expected 4, got 1)
deploy-partition | [WARNING]: * Failed to parse /root/.ansible/roles/ansible-
deploy-partition | common/inventory/vagrant/vagrant.py with ini plugin:
deploy-partition | /root/.ansible/roles/ansible-common/inventory/vagrant/vagrant.py:6: Expected
deploy-partition | key=value host variable assignment, got: re
deploy-partition | [WARNING]: Unable to parse /root/.ansible/roles/ansible-
deploy-partition | common/inventory/vagrant/vagrant.py as an inventory source
deploy-partition | [WARNING]: Unable to parse /root/.ansible/roles/ansible-
deploy-partition | common/inventory/vagrant as an inventory source
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/prepare : Create namespace for metal stack] ***
deploy-partition |
deploy-partition | PLAY [pre-deployment checks] ***************************************************
deploy-partition |
deploy-partition | TASK [get vagrant version] *****************************************************
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/nsq : Gather release versions] ***********
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/nsq : Check mandatory variables for this role are set] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/nsq : Deploy nsq] ************************
deploy-partition | changed: [localhost]
deploy-partition |
deploy-partition | TASK [check vagrant version] ***************************************************
deploy-partition | skipping: [localhost]
deploy-partition |
deploy-partition | PLAY [deploy leaves and docker] ************************************************
deploy-partition |
deploy-partition | TASK [Gathering Facts] *********************************************************
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/nsq : Set services for patching ingress controller service exposal] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/nsq : Patch tcp-services in ingress controller] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/nsq : Expose tcp services in ingress controller] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/metal-db : Gather release versions] ******
deploy-control-plane | skipping: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/metal-db : Check mandatory variables for this role are set] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [Deploy metal db] *********************************************************
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/rethinkdb-backup-restore : Gather release versions] ***
deploy-control-plane | skipping: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/rethinkdb-backup-restore : Check mandatory variables for this role are set] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/rethinkdb-backup-restore : Check mandatory variables for this role are set] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/rethinkdb-backup-restore : Deploy rethinkdb (backup-restore)] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/ipam-db : Gather release versions] *******
deploy-control-plane | skipping: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/ipam-db : Check mandatory variables for this role are set] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [Deploy ipam db] **********************************************************
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/postgres-backup-restore : Gather release versions] ***
deploy-control-plane | skipping: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/postgres-backup-restore : Check mandatory variables for this role are set] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/postgres-backup-restore : Deploy postgres (backup-restore)] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/masterdata-db : Gather release versions] ***
deploy-control-plane | skipping: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/masterdata-db : Check mandatory variables for this role are set] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [Deploy masterdata db] ****************************************************
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/postgres-backup-restore : Gather release versions] ***
deploy-control-plane | skipping: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/postgres-backup-restore : Check mandatory variables for this role are set] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/postgres-backup-restore : Deploy postgres (backup-restore)] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/metal : Gather release versions] *********
deploy-control-plane | skipping: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/metal : Check mandatory variables for this role are set] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [Deploy metal control plane] **********************************************
deploy-control-plane |
deploy-control-plane | TASK [ansible-common/roles/helm-chart : Create folder for charts and values] ***
deploy-control-plane | changed: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [ansible-common/roles/helm-chart : Copy over custom helm charts] **********
deploy-control-plane | changed: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [ansible-common/roles/helm-chart : Template helm value file] **************
deploy-control-plane | changed: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [ansible-common/roles/helm-chart : Calculate hash of configuration] *******
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [ansible-common/roles/helm-chart : Deploy helm chart (metal-control-plane)] ***
deploy-partition | fatal: [leaf02]: UNREACHABLE! => changed=false
deploy-partition | msg: 'Failed to connect to the host via ssh: ssh: connect to host leaf02 port 22: No route to host'
deploy-partition | unreachable: true
deploy-partition | fatal: [leaf01]: UNREACHABLE! => changed=false
deploy-partition | msg: 'Failed to connect to the host via ssh: ssh: connect to host leaf01 port 22: No route to host'
deploy-partition | unreachable: true
deploy-partition |
deploy-partition | PLAY RECAP *********************************************************************
deploy-partition | leaf01 : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
deploy-partition | leaf02 : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
deploy-partition | localhost : ok=1 changed=1 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
deploy-partition |
deploy-partition exited with code 4
deploy-control-plane | changed: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/metal : Set services for patching ingress controller service exposal] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/metal : Patch tcp-services in ingress controller] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/metal : Patch udp-services in ingress controller] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/metal : Expose tcp services in ingress controller] ***
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | TASK [metal-roles/control-plane/roles/metal : Wait until api is available] *****
deploy-control-plane | ok: [localhost]
deploy-control-plane |
deploy-control-plane | PLAY RECAP *********************************************************************
deploy-control-plane | localhost : ok=30 changed=4 unreachable=0 failed=0 skipped=7 rescued=0 ignored=0
deploy-control-plane |
deploy-control-plane exited with code 0
Bringing machine 'machine01' up with 'libvirt' provider...
Bringing machine 'machine02' up with 'libvirt' provider...
==> machine01: Creating domain with the following settings...
==> machine02: Creating domain with the following settings...
==> machine02: -- Name: metalmachine02
==> machine01: -- Name: metalmachine01
==> machine02: -- Forced UUID: 2294c949-88f6-5390-8154-fa53d93a3313
==> machine02: -- Domain type: kvm
==> machine01: -- Forced UUID: e0ab02d2-27cd-5a5e-8efc-080ba80cf258
==> machine01: -- Domain type: kvm
==> machine02: -- Cpus: 1
==> machine02: -- Feature: acpi
==> machine01: -- Cpus: 1
==> machine02: -- Feature: apic
==> machine02: -- Feature: pae
==> machine01: -- Feature: acpi
==> machine01: -- Feature: apic
==> machine02: -- Memory: 1536M
==> machine02: -- Management MAC:
==> machine01: -- Feature: pae
==> machine02: -- Loader: /usr/share/OVMF/OVMF_CODE.fd
==> machine02: -- Nvram:
==> machine01: -- Memory: 1536M
==> machine01: -- Management MAC:
==> machine02: -- Storage pool: default
==> machine01: -- Loader: /usr/share/OVMF/OVMF_CODE.fd
==> machine01: -- Nvram:
==> machine02: -- Image: (G)
==> machine01: -- Storage pool: default
==> machine01: -- Image: (G)
==> machine02: -- Volume Cache: default
==> machine02: -- Kernel:
==> machine01: -- Volume Cache: default
==> machine02: -- Initrd:
==> machine01: -- Kernel:
==> machine02: -- Graphics Type: vnc
==> machine02: -- Graphics Port: -1
==> machine01: -- Initrd:
==> machine01: -- Graphics Type: vnc
==> machine02: -- Graphics IP: 127.0.0.1
==> machine01: -- Graphics Port: -1
==> machine02: -- Graphics Password: Not defined
==> machine01: -- Graphics IP: 127.0.0.1
==> machine02: -- Video Type: cirrus
==> machine01: -- Graphics Password: Not defined
==> machine01: -- Video Type: cirrus
==> machine02: -- Video VRAM: 9216
==> machine01: -- Video VRAM: 9216
==> machine02: -- Sound Type:
==> machine01: -- Sound Type:
==> machine02: -- Keymap: de
==> machine01: -- Keymap: de
==> machine02: -- TPM Path:
==> machine01: -- TPM Path:
==> machine02: -- Boot device: network
==> machine01: -- Boot device: network
==> machine02: -- Boot device: hd
==> machine02: -- Disks: sda(qcow2,6000M)
==> machine02: -- Disk(sda): /var/lib/libvirt/images/metalmachine02-sda.qcow2
==> machine01: -- Boot device: hd
==> machine01: -- Disks: sda(qcow2,6000M)
==> machine02: -- INPUT: type=mouse, bus=ps2
==> machine02: -- RNG device model: random
==> machine01: -- Disk(sda): /var/lib/libvirt/images/metalmachine01-sda.qcow2
==> machine01: -- INPUT: type=mouse, bus=ps2
==> machine01: -- RNG device model: random
==> machine02: Starting domain.
==> machine01: Starting domain.
After waiting for some time, vagrant global-status returns:
id name provider state directory
-------------------------------------------------------------------------
4da85f4 leaf01 libvirt running /home/greesha/Data/Projects/mini-lab
45d4ab1 leaf02 libvirt running /home/greesha/Data/Projects/mini-lab
12f0ebf machine02 libvirt running /home/greesha/Data/Projects/mini-lab
1d95c76 machine01 libvirt running /home/greesha/Data/Projects/mini-lab
So the machines and switches are running, but docker-compose run metalctl machine ls
returns an empty list of machines. I would appreciate any help with this.
Currently, the provisioned machines in the mini-lab have no internet connection, but we will need one if we want to take the mini-lab to the next level: an integration with k8s orchestrators like Gardener or Cluster-API.
These are the TODOs for that:
- configuration in /etc/network/interfaces.d/ and parameters for metal-core: ADDITIONAL_BRIDGE_PORTS, ADDITIONAL_BRIDGE_VIDS
- internet access either via eth0 of the leaf switches OR via the eth0 IPs of the leaf switches on the host system
For integration testing purposes it could be interesting to add another deployment flavor to the mini-lab where we also spin up VMs for spines, exits, mgmt-servers, etc.
This can probably not be called "mini" anymore, but there is a need for something like this. It is useful for integration testing more sophisticated network scenarios, and it accelerates moving the partition deployment roles to metal-roles so that adopters can use them for bootstrapping their partitions.
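The metal-core parameters from the TODO list above could be wired up roughly like this; the port name and VLAN id are hypothetical example values, not taken from the repo:

```shell
# Hypothetical metal-core environment for an additional internet bridge;
# swp3 and 4000 are made-up examples of a bridge port and VLAN id.
export ADDITIONAL_BRIDGE_PORTS=swp3
export ADDITIONAL_BRIDGE_VIDS=4000
```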
Since Dec 2023, ignite is officially deprecated, so it is time to search for an alternative.
Weaveworks officially forwards to Flintlock,
but maybe pure cloud-hypervisor is also possible.
Sometimes the mini-lab starts up correctly and everything works nicely but there are situations that lead to this error:
reconfiguration failed {"app": "metal-core", "error": "could not build switcher config: no vlan mapping could be determined for vxlan interface vniInternet", "errorVerbose": "no vlan mapping could be determined for vxlan interface vniInternet\ncould not build switcher config\ngithub.com/metal-stack/metal-core/internal/event.(*eventHandler).reconfigureSwitch\n\t/work/internal/event/reconfigureSwitch.go:65\ngithub.com/metal-stack/metal-core/internal/event.(*eventHandler).ReconfigureSwitch\n\t/work/internal/event/reconfigureSwitch.go:28\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
The error can be seen via metalctl switch ls -o wide and causes networking to not work as expected.
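The error means metal-core could not find a VLAN mapped to the vxlan interface vniInternet when building the switcher config. A minimal sketch of the kind of lookup involved, using a made-up interface-to-VLAN table (not real switch output):

```shell
# Sketch of the vxlan-to-vlan lookup that fails in the error above.
# The mapping table is hypothetical sample data, not real Cumulus output.
mapping='vniInternet 1001
vni104001 1002'

vlan_for() {
  # print the VLAN mapped to the given vxlan interface, empty if unmapped
  printf '%s\n' "$mapping" | awk -v ifc="$1" '$1 == ifc { print $2 }'
}

vlan_for vniInternet
```

On a real leaf switch you would inspect the actual bridge VLAN state instead of a static table; an empty result for vniInternet corresponds to the "no vlan mapping could be determined" condition.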
I tried 5 times and the mini-lab fails to come up. What is missing?
deploy-partition | fatal: [leaf02]: FAILED! => changed=false
deploy-partition | elapsed: 300
deploy-partition | msg: metal-core did not come up
deploy-partition | fatal: [leaf01]: FAILED! => changed=false
deploy-partition | elapsed: 300
deploy-partition | msg: metal-core did not come up
deploy-partition |
deploy-partition | PLAY RECAP *********************************************************************
deploy-partition | leaf01 : ok=79 changed=51 unreachable=0 failed=1 skipped=5 rescued=0 ignored=0
deploy-partition | leaf02 : ok=63 changed=45 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
Sometimes make up stops proceeding after setting up the first group of wires, though sometimes it proceeds without any issues. I cannot detect any pattern. When I run make down && make, only 1 in 5 calls succeeds. Waiting several minutes doesn't solve this either.
Aborting the process prints ^Cmake: *** [Makefile:80: partition-bake] Error 130.
I already had a quick chat with @Gerrit91 on this. Here is some requested information:
❯ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9a951775b316 ghcr.io/metal-stack/mini-lab-vms:latest "/mini-lab/vms_entry…" 3 minutes ago Up 3 minutes vms
3b254f2982cf grigoriymikh/sandbox:latest "/usr/local/bin/igni…" 3 minutes ago Up 3 minutes ignite-eb8de119eecdaa65
dc03782d709c kindest/node:v1.24.0 "/usr/local/bin/entr…" 3 minutes ago Up 3 minutes 0.0.0.0:4150->4150/tcp, 0.0.0.0:4161->4161/tcp, 0.0.0.0:4443->4443/tcp, 0.0.0.0:6443->6443/tcp, 0.0.0.0:8080->8080/tcp, 0.0.0.0:50051->50051/tcp metal-control-plane-control-plane
❯ make ssh-leaf01
ssh -o StrictHostKeyChecking=no -o "PubkeyAcceptedKeyTypes +ssh-rsa" -i files/ssh/id_rsa root@leaf01
ssh: Could not resolve hostname leaf01: Temporary failure in name resolution
make: *** [Makefile:142: ssh-leaf01] Error 255
❯ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 yubihill
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
As we have already started to reduce the dependency stack, I think it would also make sense to wrap docker-compose inside a docker container.
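One way to do that is to call docker-compose through the official docker/compose image with the Docker socket mounted. A sketch under those assumptions; the wrapper only echoes the command it would run, so the shape can be inspected without Docker installed:

```shell
# Sketch: run docker-compose from a container instead of a host installation.
# Image tag and mounts are assumptions; echo stands in for actual execution.
compose() {
  echo docker run --rm \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v "$PWD:$PWD" -w "$PWD" \
    docker/compose:1.29.2 "$@"
}

compose up -d
```

Mounting the current directory at the same path and setting it as the workdir keeps relative paths in docker-compose.yaml working inside the container.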
In particular, configuring the network is much easier than with qemu: https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/docs/networking.md
A simple API is also available: https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/docs/api.md
curl --unix-socket /tmp/cloud-hypervisor.sock -i \
-X PUT 'http://localhost/api/v1/vm.create' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"cpus":{"boot_vcpus": 4, "max_vcpus": 4},
"kernel":{"path":"/opt/clh/kernel/vmlinux-virtio-fs-virtio-iommu"},
"cmdline":{"args":"console=ttyS0 console=hvc0 root=/dev/vda1 rw"},
"disks":[{"path":"/opt/clh/images/focal-server-cloudimg-amd64.raw"}],
"rng":{"src":"/dev/urandom"},
"net":[{"ip":"192.168.10.10", "mask":"255.255.255.0", "mac":"12:34:56:78:90:01"}]
}'
Hi,
today I tried the tutorial from the README.md. After several cleanups and restarts I did not get it to work. Every time when deploying metal-core, I got the following error:
deploy-partition | TASK [ansible-common/roles/systemd-docker-service : start service metal-core] ***
deploy-partition | changed: [leaf01]
deploy-partition | changed: [leaf02]
deploy-partition |
deploy-partition | TASK [ansible-common/roles/systemd-docker-service : ensure service is started] ***
deploy-partition | ok: [leaf02]
deploy-partition | ok: [leaf01]
deploy-partition |
deploy-partition | TASK [metal-roles/partition/roles/metal-core : wait for metal-core to listen on port] ***
deploy-partition | fatal: [leaf01]: FAILED! => changed=false
deploy-partition | elapsed: 300
deploy-partition | msg: metal-core did not come up
deploy-partition | fatal: [leaf02]: FAILED! => changed=false
deploy-partition | elapsed: 300
deploy-partition | msg: metal-core did not come up
deploy-partition |
deploy-partition | PLAY RECAP *********************************************************************
deploy-partition | leaf01 : ok=65 changed=47 unreachable=0 failed=1 skipped=5 rescued=0 ignored=0
deploy-partition | leaf02 : ok=59 changed=43 unreachable=0 failed=1 skipped=5 rescued=0 ignored=0
deploy-partition |
deploy-partition exited with code 2
docker exec vms /mini-lab/manage_vms.py --names machine01,machine02 create
Formatting '/machine01.img', fmt=qcow2 size=5368709120 cluster_size=65536 lazy_refcounts=off refcount_bits=16
Formatting '/machine02.img', fmt=qcow2 size=5368709120 cluster_size=65536 lazy_refcounts=off refcount_bits=16
QEMU 4.2.1 monitor - type 'help' for more information
(qemu) qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.svm [bit 2]
qemu-system-x86_64 -name machine01 -uuid e0ab02d2-27cd-5a5e-8efc-080ba80cf258 -m 2G -boot n -drive if=virtio,format=qcow2,file=/machine01.img -drive if=pflash,format=raw,readonly,file=/usr/share/OVMF/OVMF_CODE.fd -drive if=pflash,format=raw,file=/usr/share/OVMF/OVMF_VARS.fd -serial telnet:127.0.0.1:4000,server,nowait -enable-kvm -nographic -net nic,model=virtio,macaddr=aa:c1:ab:87:4e:82 -net nic,model=virtio,macaddr=aa:c1:ab:c1:29:2c -net tap,fd=30 30<>/dev/tap2 -net tap,fd=40 40<>/dev/tap3 &
qemu-system-x86_64 -name machine02 -uuid 2294c949-88f6-5390-8154-fa53d93a3313 -m 2G -boot n -drive if=virtio,format=qcow2,file=/machine02.img -drive if=pflash,format=raw,readonly,file=/usr/share/OVMF/OVMF_CODE.fd -drive if=pflash,format=raw,file=/usr/share/OVMF/OVMF_VARS.fd -serial telnet:127.0.0.1:4001,server,nowait -enable-kvm -nographic -net nic,model=virtio,macaddr=aa:c1:ab:90:3a:db -net nic,model=virtio,macaddr=aa:c1:ab:46:52:e4 -net tap,fd=50 50<>/dev/tap4 -net tap,fd=60 60<>/dev/tap5 &
QEMU 4.2.1 monitor - type 'help' for more information
(qemu) qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.svm [bit 2]
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null root@leaf01 -i files/ssh/id_rsa 'systemctl restart metal-core'
Warning: Permanently added 'leaf01,172.17.0.4' (ECDSA) to the list of known hosts.
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null root@leaf02 -i files/ssh/id_rsa 'systemctl restart metal-core'
Warning: Permanently added 'leaf02,172.17.0.3' (ECDSA) to the list of known hosts.
The error tells me that the host does not support a requested feature. I have found similar issues in other virtualization software like podman (see containers/podman#11479).
Is there something I missed during the configuration of my machine or software?
Hopefully you can help me out here.
Best regards, Julian
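The warning about CPUID.80000001H:ECX.svm usually indicates that the environment running QEMU does not expose hardware virtualization (nested virtualization, when the mini-lab itself runs in a VM). A small illustrative helper for checking the relevant CPU flags; the function name and sample inputs are made up, and on a real host you would feed it /proc/cpuinfo:

```shell
# Report whether a cpuinfo dump advertises hardware virtualization
# extensions: vmx for Intel, svm for AMD.
has_virt_flags() {
  printf '%s\n' "$1" | grep -Eq '\b(vmx|svm)\b' && echo yes || echo no
}

# on a real machine: has_virt_flags "$(cat /proc/cpuinfo)"
has_virt_flags 'flags : fpu vmx sse sse2'
```

If this reports no inside the environment where the machines are started, the QEMU warning above is expected and nested KVM guests will not work at full speed.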