
Ansible automation to have a KUBErnetes cluster INITialized as soon as possible...

Home Page: https://www.kubeinit.org

License: Apache License 2.0

kubernetes automation okd rke k8s cdk eks

kubeinit's Introduction

The KUBErnetes INITiator


What is KubeInit?

KubeInit provides Ansible playbooks and roles for the deployment and configuration of multiple Kubernetes distributions. KubeInit's mission is to provide a fully automated way to deploy a curated list of prescribed architectures with a single command.

Documentation

KubeInit's documentation is hosted in this same repository.

Periodic jobs status

A set of predefined scenarios is tested on a weekly basis; the results of those executions are presented in the periodic job execution page.

KubeInit supported scenarios

K8s distribution: OKD (K8S, RKE, and EKS in testing)

Driver: Libvirt

OS: CentOS/Fedora, Debian/Ubuntu

Requirements

  • A freshly deployed server with enough RAM and disk space (120GB of RAM and 300GB of disk) running CentOS 8 (it should also work on Fedora/Debian/Ubuntu hosts).
  • Adjust the inventory file to suit your needs.
  • By default the first hypervisor node is called nyctea (as defined in the inventory). If you changed it, replace it with the hostname you specified. You can also keep the inventory names as aliases for your own hostnames using ~/.ssh/config (described in more detail below).
  • Passwordless root SSH access to the hypervisors using keys.
  • Podman installed on the machine where you run ansible-playbook.

Check if nyctea is reachable via passwordless root access

If you need to set up SSH aliases for nyctea, tyto, strix, or any other hypervisor hosts that you have added or that are mentioned in the inventory, you can create a file named config in ~/.ssh with contents like this:

echo "Host nyctea" >> ~/.ssh/config
echo "  Hostname actual_hostname" >> ~/.ssh/config

For example, if you have a deployed server called server.mysite.local that you can already SSH into as root, you can create a ~/.ssh/config with these contents:

Host nyctea
  Hostname server.mysite.local

Now you should be able to access your Ansible host like this:

ssh root@nyctea

If it fails, check whether you have an SSH key, generate one if you don't, and copy it to the host:

if [ ! -f ~/.ssh/id_rsa ]; then
  ssh-keygen
fi
ssh-copy-id -i ~/.ssh/id_rsa.pub root@nyctea

How to run

There are two ways of launching KubeInit: directly using the ansible-playbook command, or by running it inside a container.

Directly executing the deployment playbook

The following example will deploy an OKD 4.8 cluster with a 3-node control plane and 1 worker node, in a single command and in approximately 30 minutes.

# Install the requirements assuming python3/pip3 is installed
pip3 install \
        --upgrade \
        pip \
        shyaml \
        ansible \
        netaddr

# Get the project's source code
git clone https://github.com/Kubeinit/kubeinit.git
cd kubeinit

# Install the Ansible collection requirements
ansible-galaxy collection install --force --requirements-file kubeinit/requirements.yml

# Build and install the collection
rm -rf ~/.ansible/collections/ansible_collections/kubeinit/kubeinit
ansible-galaxy collection build kubeinit --verbose --force --output-path releases/
ansible-galaxy collection install --force --force-with-deps releases/kubeinit-kubeinit-`cat kubeinit/galaxy.yml | shyaml get-value version`.tar.gz

# Run the playbook
ansible-playbook \
    -v --user root \
    -e kubeinit_spec=okd-libvirt-3-1-1 \
    -e hypervisor_hosts_spec='[{"ansible_host":"nyctea"},{"ansible_host":"tyto"}]' \
    ./kubeinit/playbook.yml

After provisioning any of the scenarios, you should have your environment ready to go. To connect to the nodes from the hypervisor, use the IP addresses from the inventory files.

Running the deployment command from a container

The whole process is explained in the HowTo's. The following commands build a container image with the project inside it, and then launch the container, executing the ansible-playbook command with all the standard ansible-playbook parameters.

KubeInit is built and installed when deploying from a container, as those steps are included in the Dockerfile; there is no need to build and install the collection locally if it is used through a container.

Note: When running the deployment from a container, nyctea cannot be 127.0.0.1; it needs to be the hypervisor's IP address. Also, when running the deployment as a user other than root, the mounted SSH keys need to be updated accordingly.
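
For example, the ~/.ssh/config that gets mounted into the container can point the alias directly at the hypervisor's address (the IP below is only an example):

Host nyctea
  Hostname 192.168.100.10
  User root
  IdentityFile ~/.ssh/id_rsa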

Running from the GIT repository

Note: This won't work on ARM.

git clone https://github.com/Kubeinit/kubeinit.git
cd kubeinit
podman build -t kubeinit/kubeinit .

podman run --rm -it \
    -v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z \
    -v ~/.ssh/id_rsa.pub:/root/.ssh/id_rsa.pub:z \
    -v ~/.ssh/config:/root/.ssh/config:z \
    kubeinit/kubeinit \
        -v --user root \
        -e kubeinit_spec=okd-libvirt-3-1-1 \
        -i ./kubeinit/inventory.yml \
        ./kubeinit/playbook.yml

Running from a release

Install [jq](https://stedolan.github.io/jq/)

# Get latest release tag name
TAG=$(curl --silent "https://api.github.com/repos/kubeinit/kubeinit/releases/latest" | jq -r .tag_name)
podman run --rm -it \
    -v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z \
    -v ~/.ssh/id_rsa.pub:/root/.ssh/id_rsa.pub:z \
    -v ~/.ssh/config:/root/.ssh/config:z \
    quay.io/kubeinit/kubeinit:$TAG \
        -v --user root \
        -e kubeinit_spec=okd-libvirt-3-1-1 \
        -i ./kubeinit/inventory.yml \
        ./kubeinit/playbook.yml

HowTo's and presentations

Supporters

Docker Google Cloud Platform

kubeinit's People

Contributors

0xbboyd, bilal-io, ccamacho, cgoguyer, gmarcy, iqre8, jbadiapa, jeffabailey, jlarriba, kubeinit-bot, nadenf, raghavendra-talur, rajpratik71, rmlandvreugd, sean-m-sullivan


kubeinit's Issues

Drive-by feedback

Hi all,

just a super-quick drive-by feedback as I spent some time today trying kubeinit (on centos 8 hypervisor).
The first error I got was:
TASK [../../roles/kubeinit_okd : update packages] ******************************************************************
fatal: [hypervisor-01]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host", "unreachable": true}

The reason for that is that the code has many sections like the following:

- name: Deploy a CentOS based guest
  block:
    - name: wait for {{ kubeinit_deployment_node_name }} to boot
      ansible.builtin.wait_for:
        port: 22
        host: "{{ hostvars[kubeinit_deployment_node_name].ansible_host }}"
        search_regex: OpenSSH
        delay: 10

These only check that OpenSSH is listening on port 22; they do not guarantee that Ansible is able to log in. I had to add the following to make it past this issue:

--- a/kubeinit/roles/kubeinit_libvirt/tasks/20_check_nodes_up.yml
+++ b/kubeinit/roles/kubeinit_libvirt/tasks/20_check_nodes_up.yml
@@ -31,6 +31,13 @@
         # We can not resolve  by name from the hypervisors
         # ssh-keyscan -H {{ kubeinit_deployment_node_name }} >> ~/.ssh/known_hosts
         # ssh-keyscan -H {{ kubeinit_deployment_node_name }}.{{ kubeinit_inventory_cluster_domain }} >> ~/.ssh/known_hosts
+
+    - name: wait for {{ kubeinit_deployment_node_name }} to be accessible via ssh
+      ansible.builtin.wait_for_connection:
+        delay: 10
+        timeout: 600
+      delegate_to: "{{ hostvars[kubeinit_deployment_node_name].ansible_host }}"
+
   delegate_to: "{{ groups['hypervisor_nodes'][0] }}"
   tags:
     - provision_libvirt

Then I had a similar error but in a different spot (kubeinit/roles/kubeinit_okd/tasks/10_configure_service_nodes.yml).
Transforming the wait_for into wait_for_connection made it work for me.

On the next run it failed with:

TASK [../../roles/kubeinit_okd : verify that master nodes are ok] **********************************************************************************************************************************************************************
failed: [hypervisor-01] (item=okd-service-01) => {"ansible_loop_var": "cluster_node", "cluster_node": "okd-service-01", "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host", "unreachable": true}
fatal: [hypervisor-01]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_loop_var": "cluster_node", "cluster_node": "okd-service-01", "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host", "unreachable": true}]}

I was going to try and debug this as well but then docker.io started ratelimiting me and I ran out of time :)

TL;DR: the wait_for tasks that check for OpenSSH should be changed to wait_for_connection to avoid races.

Add all nodes to /etc/hosts

In addition to the hypervisor, it would be nice to see all the nodes added to /etc/hosts, e.g.

10.0.0.1 eks-master
10.0.0.6 eks-worker

This would allow you to easily SSH into the right host without consulting the inventory.
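
A minimal sketch of such a task, assuming a group that contains every cluster node (the group name all_nodes and the blockinfile approach are assumptions, not the current role code):

- name: Add all cluster nodes to /etc/hosts
  ansible.builtin.blockinfile:
    path: /etc/hosts
    marker: "# {mark} KUBEINIT CLUSTER NODES"
    block: |
      {% for node in groups['all_nodes'] %}
      {{ hostvars[node].ansible_host }} {{ node }}
      {% endfor %}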

Missing packages when installing on Ubuntu 20+

In Ubuntu 20.04 and newer, Python 2 has been removed.

This causes the KubeInit install to fail due to the following missing packages listed in kubeinit/roles/kubeinit_libvirt/defaults/main.yml:

python-pip, python-libvirt, python-lxml
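
A possible fix, assuming the role keeps a Debian/Ubuntu dependency list in that defaults file (the variable name below is an assumption), would be to switch to the Python 3 package names:

# kubeinit/roles/kubeinit_libvirt/defaults/main.yml (sketch, variable name assumed)
kubeinit_libvirt_hypervisor_dependencies_debian:
  - python3-pip
  - python3-libvirt
  - python3-lxml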

timed out waiting for ping module test: Failed to connect to the host via ssh: ssh: connect to host 10.0.0.100 port 22: Operation timed out

Hey there. Great project. I've been making progress, but recently ran into this error

TASK [../../roles/kubeinit_libvirt : wait for okd-service-01 to boot] *********************************************************
fatal: [hypervisor-01 -> 10.0.0.100]: FAILED! => {"changed": false, "elapsed": 611, "msg": "timed out waiting for ping module test: Failed to connect to the host via ssh: ssh: connect to host 10.0.0.100 port 22: Operation timed out"}

But, from what I can tell, the device is reachable? Here is from my hypervisor (CentOS 8).

[root@prealpha tmp]# virsh list --all
 Id   Name             State
--------------------------------
 2    okd-service-01   running

[root@prealpha tmp]# ping 10.0.0.100
PING 10.0.0.100 (10.0.0.100) 56(84) bytes of data.
64 bytes from 10.0.0.100: icmp_seq=1 ttl=64 time=0.737 ms
64 bytes from 10.0.0.100: icmp_seq=2 ttl=64 time=0.456 ms
64 bytes from 10.0.0.100: icmp_seq=3 ttl=64 time=0.409 ms
^C
--- 10.0.0.100 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 35ms
rtt min/avg/max/mdev = 0.409/0.534/0.737/0.144 ms

For what it's worth, I had another, network-related issue to fix: I had to modify my /etc/resolv.conf and change my nameserver IP. This was due to an old misconfiguration of pfSense (my DHCP server). I've tried fixing that config, but I honestly don't think it's related to the SSH/ping issue.

Any ideas? I suspect maybe I misconfigured my inventory too? Specifically, here are some changes I made:

# CentOS can access the internet via ens192. There is no eth1, but ens192 comes out of the box with my CentOS 8 install.
kubeinit_inventory_network_bridge_external_dev=ens192
# This is the public IP of the pfSense firewall in front of it
kubeinit_inventory_network_bridge_external_ip=xxx.yyy.zz.xx
# This is the LAN IP of the pfSense firewall, on ens192
kubeinit_inventory_network_bridge_external_gateway=192.168.255.1

I also made a few name / domain changes, but those are the relevant ones I believe.

Logs

Lastly, here's a longer snippet of my logs from running the playbook, in case it helps:

TASK [../../roles/kubeinit_libvirt : Create VM definition for the service nodes] **********************************************
changed: [hypervisor-01 -> 207.216.46.92] => {"changed": true, "cmd": "virt-install    --connect qemu:///system    --name=okd-service-01    --memory memory=12288    --cpuset=auto    --vcpus=8,maxvcpus=16    --os-type=linux    --os-variant=rhel8.0    --autostart                            --network network=kimgtnet0,mac=52:54:00:47:94:58,model=virtio                          --graphics none    --noautoconsole    --import    --disk /var/lib/libvirt/images/okd-service-01.qcow2,format=qcow2,bus=virtio\n", "delta": "0:00:04.939186", "end": "2021-02-08 19:15:03.984983", "rc": 0, "start": "2021-02-08 19:14:59.045797", "stderr": "", "stderr_lines": [], "stdout": "\nStarting install...\nDomain creation completed.", "stdout_lines": ["", "Starting install...", "Domain creation completed."]}

TASK [../../roles/kubeinit_libvirt : Create VM definition for the service nodes] **********************************************
skipping: [hypervisor-01] => {"changed": false, "skip_reason": "Conditional result was False"}

TASK [Check that the service node is up and running] **************************************************************************
[WARNING]: The loop variable 'cluster_role_item' is already in use. You should set the `loop_var` value in the `loop_control`
option for the task to something else to avoid variable collisions and unexpected behavior.

TASK [../../roles/kubeinit_libvirt : wait for okd-service-01 to boot] *********************************************************
fatal: [hypervisor-01 -> 10.0.0.100]: FAILED! => {"changed": false, "elapsed": 611, "msg": "timed out waiting for ping module test: Failed to connect to the host via ssh: ssh: connect to host 10.0.0.100 port 22: Operation timed out"}

PLAY RECAP ********************************************************************************************************************
hypervisor-01              : ok=84   changed=18   unreachable=0    failed=1    skipped=23   rescued=0    ignored=3   

Also, here's journalctl -f -u libvirtd. I see some references to SELinux; might that be something?

Feb 08 19:12:35 prealpha.openshift.aot-technologies.com systemd[1]: libvirtd.service: Succeeded.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: libvirtd.service: Found left-over process 2013 (dnsmasq) in control group while starting unit. Ignoring.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: libvirtd.service: Found left-over process 2014 (dnsmasq) in control group while starting unit. Ignoring.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: libvirtd.service: Found left-over process 157585 (dnsmasq) in control group while starting unit. Ignoring.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: libvirtd.service: Found left-over process 157586 (dnsmasq) in control group while starting unit. Ignoring.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: Starting Virtualization daemon...
Feb 08 19:12:38 prealpha.openshift.aot-technologies.com systemd[1]: Started Virtualization daemon.
Feb 08 19:12:39 prealpha.openshift.aot-technologies.com dnsmasq[157585]: read /etc/hosts - 3 addresses
Feb 08 19:12:39 prealpha.openshift.aot-technologies.com dnsmasq[2013]: read /etc/hosts - 3 addresses
Feb 08 19:12:39 prealpha.openshift.aot-technologies.com dnsmasq[2013]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Feb 08 19:12:39 prealpha.openshift.aot-technologies.com dnsmasq[157585]: read /var/lib/libvirt/dnsmasq/kimgtnet0.addnhosts - 0 addresses
Feb 08 19:12:39 prealpha.openshift.aot-technologies.com dnsmasq-dhcp[2013]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Feb 08 19:12:39 prealpha.openshift.aot-technologies.com dnsmasq-dhcp[157585]: read /var/lib/libvirt/dnsmasq/kimgtnet0.hostsfile
Feb 08 19:13:39 prealpha.openshift.aot-technologies.com dnsmasq[157585]: exiting on receipt of SIGTERM
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq[167683]: listening on kimgtbr0(#11): 10.0.0.254
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq[167690]: started, version 2.79 cachesize 150
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq[167690]: compile time options: IPv6 GNU-getopt DBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth DNSSEC loop-detect inotify
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq-dhcp[167690]: DHCP, IP range 10.0.0.1 -- 10.0.0.253, lease time 1h
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq-dhcp[167690]: DHCP, sockets bound exclusively to interface kimgtbr0
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq[167690]: using nameserver 10.0.0.100#53
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq[167690]: read /etc/hosts - 3 addresses
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq[167690]: read /var/lib/libvirt/dnsmasq/kimgtnet0.addnhosts - 0 addresses
Feb 08 19:13:49 prealpha.openshift.aot-technologies.com dnsmasq-dhcp[167690]: read /var/lib/libvirt/dnsmasq/kimgtnet0.hostsfile
Feb 08 19:14:23 prealpha.openshift.aot-technologies.com libvirtd[163819]: libvirt version: 6.0.0, package: 28.module_el8.3.0+555+a55c8938 (CentOS Buildsys <[email protected]>, 2020-11-04-01:04:00, )
Feb 08 19:14:23 prealpha.openshift.aot-technologies.com libvirtd[163819]: hostname: prealpha.openshift.aot-technologies.com
Feb 08 19:14:23 prealpha.openshift.aot-technologies.com libvirtd[163819]: Domain id=1 name='guestfs-r2s6b7ck88qymrqe' uuid=6fba960a-255d-4baa-9c24-c506801ae5b2 is tainted: custom-argv
Feb 08 19:14:23 prealpha.openshift.aot-technologies.com libvirtd[163819]: Domain id=1 name='guestfs-r2s6b7ck88qymrqe' uuid=6fba960a-255d-4baa-9c24-c506801ae5b2 is tainted: host-cpu
Feb 08 19:14:27 prealpha.openshift.aot-technologies.com libvirtd[163819]: missing device in NIC_RX_FILTER_CHANGED event
Feb 08 19:14:56 prealpha.openshift.aot-technologies.com libvirtd[163819]: 2021-02-09 00:14:56.588+0000: 168939: info : libvirt version: 6.0.0, package: 28.module_el8.3.0+555+a55c8938 (CentOS Buildsys <[email protected]>, 2020-11-04-01:04:00, )
Feb 08 19:14:56 prealpha.openshift.aot-technologies.com libvirtd[163819]: 2021-02-09 00:14:56.588+0000: 168939: info : hostname: prealpha.openshift.aot-technologies.com
Feb 08 19:14:56 prealpha.openshift.aot-technologies.com libvirtd[163819]: 2021-02-09 00:14:56.588+0000: 168939: warning : virSecuritySELinuxRestoreFileLabel:1503 : cannot lookup default selinux label for /tmp/libguestfsO8wacC/console.sock
Feb 08 19:14:56 prealpha.openshift.aot-technologies.com libvirtd[163819]: 2021-02-09 00:14:56.588+0000: 168939: warning : virSecuritySELinuxRestoreFileLabel:1503 : cannot lookup default selinux label for /tmp/libguestfsO8wacC/guestfsd.sock

Thanks!

Latest CentOS upgrade has a bug in Libvirt

Error: Activating the networks fails at the beginning of the installer with

TASK [../../roles/kubeinit_libvirt : Activate KubeInit networks] *****************************************************************************************************************************

failed: [hypervisor-01] (item={'name': 'kimgtnet0', 'net': '10.0.0.0', 'cidr': 24, 'gateway': '10.0.0.254', 'netmask': '255.255.255.0', 'start': '10.0.0.1', 'end': '10.0.0.253', 'bridge': 'kimgtbr0', 'template': 'cluster-net.xml.j2', 'type': 'internal', 'main': True, 'enabled': True}) => {"ansible_loop_var": "item", "changed": false, "item": {"bridge": "kimgtbr0", "cidr": 24, "enabled": true, "end": "10.0.0.253", "gateway": "10.0.0.254", "main": true, "name": "kimgtnet0", "net": "10.0.0.0", "netmask": "255.255.255.0", "start": "10.0.0.1", "template": "cluster-net.xml.j2", "type": "internal"}, "msg": "COMMAND_FAILED: '/usr/sbin/iptables -w10 -w --table filter --insert LIBVIRT_INP --in-interface kimgtbr0 --protocol tcp --destination-port 67 --jump ACCEPT' failed: iptables: No chain/target/match by that name.\n"}

Workaround:

Restart the libvirt daemon before deploying.

systemctl restart libvirtd

Impacts the CI.
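
If you prefer to apply the workaround from a playbook rather than by hand, a minimal pre-task sketch run against the hypervisors could look like this:

- name: Work around the libvirt/iptables chain issue by restarting libvirtd
  ansible.builtin.systemd:
    name: libvirtd
    state: restarted
  become: true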

Failing on container creation for registry

Setting up a completely new and vanilla environment with this. I end up with this error on CentOS 8; whether I run it directly or from Podman, the same error appears. Wondering what I'm doing wrong, but I thought a GitHub issue might point me in the right direction :-)

TASK [../../roles/kubeinit_registry : Create container to serve the registry] *********************************************************
fatal: [hypervisor-01 -> 10.0.0.100]: FAILED! => {"changed": false, "msg": "Can't create container kubeinit-registry", "stderr": "Error: unknown flag: --detach\n", "stderr_lines": ["Error: unknown flag: --detach"], "stdout": "", "stdout_lines": []}

Continue provisioning cluster after fixing bug

Hi @ccamacho
Hi @ccamacho
Currently, if the user has any issue during the deployment, they need to fix something and retry. KubeInit's current behavior is to destroy the whole cluster and restart everything from scratch. It would be great if KubeInit had an option that allowed the user to continue provisioning the cluster after fixing the bug.

Fixed Mac address for external_service_interface

I have attempted to add a macaddress variable to kubeinit_libvirt_external_service_interface and then use it in the ifcfg-eth1.j2 template like this: MACADDR:<MAC-address> but it did not work. I have not tried HWADDR:<MAC-address> instead. What do you think would be the best solution? And will you accept a PR for it?

Thank you!
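
For reference, RHEL-style ifcfg files take MACADDR=<MAC> (to set the MAC on the interface) or HWADDR=<MAC> (to bind the config to the NIC with that hardware address), using an equals sign rather than a colon. A sketch of the template change, with an assumed variable layout, might be:

# ifcfg-eth1.j2 (sketch; the macaddress attribute is the assumed new variable)
DEVICE=eth1
ONBOOT=yes
{% if kubeinit_libvirt_external_service_interface.macaddress is defined %}
MACADDR={{ kubeinit_libvirt_external_service_interface.macaddress }}
{% endif %}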

Make sure there is enough free space

Hi,
My test system has only one filesystem. / has enough space; however, KubeInit tells me: "msg": "It seems there is not enough disk space (Required: 302.5 Total: 0)"

TASK [../../roles/kubeinit_validations : Make sure there is enough free space] ***************************************************************************************************************************************************************
fatal: [hypervisor-01]: FAILED! => {
"assertion": "kubeinit_validations_libvirt_free_space.stdout[:-1]|int > kubeinit_validations_libvirt_disk_usage|float * 1.1",
"changed": false,
"evaluated_to": false,
"msg": "It seems there is not enough disk space (Required: 302.5 Total: 0)"
}

PLAY RECAP ***********************************************************************************************************************************************************************************************************************************
hypervisor-01 : ok=4 changed=1 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0

[root@lab08 kubeinit]# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_seyyah-fedora_root 901G 92G 809G 11% /
[root@las08 kubeinit]#

Could kubeinit check / if libvirt does not have a separate filesystem?
Thanks.
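
For reference, the available space can be read from whichever filesystem actually backs the libvirt images directory, e.g. (assuming the default /var/lib/libvirt/images path):

# Available space, in GB, on the filesystem that holds the libvirt images
df -BG --output=avail /var/lib/libvirt/images | tail -n 1 | tr -d ' G'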

Get all the networks step fails

Getting the following error message:

TASK [../../roles/kubeinit_libvirt : Get all the networks] ***********************************************************
fatal: [hypervisor-01 -> harana]: FAILED! => {"changed": false, "msg": "missing required arguments: name"}

Setting an empty name for list networks seems to fix it.

With versions:
Ansible 2.10.5
Ubuntu 21.04
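
A sketch of the workaround, assuming the failing task uses community.libvirt.virt_net (with collection version 1.0.0 the module insists on a name even for list_nets, so passing an empty one avoids the error; upgrading the collection, as noted in a later issue, is the cleaner fix):

- name: Get all the networks
  community.libvirt.virt_net:
    command: list_nets
    name: ''
  register: libvirt_networks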

Install failing after centos 8 current updates

Failing to Install

Seems the install file is missing

internal error: child reported (status=125): unable to stat: /var/lib/libvirt/boot/virtinst-sueu1m4i-fedora-coreos-32.20200715.3.0-live-kernel-x86_64: No such file or directory

I traced it to this

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-Troubleshooting-Common_libvirt_errors_and_troubleshooting#sect-Migration_fails_with_Unable_to_allow_access_for_disk_path_No_such_file_or_directory

Podman container dictates hypervisor environment

Issue

The Podman container is Debian based. It seems the container environment dictates the hypervisor environment, so when the hypervisor is CentOS, you get errors.

Current workaround:

Run Ansible on the same OS as your hypervisors. Only run the Podman container if your hypervisors are Debian based.

Update community.libvirt.virt_net

Version 1.0.0 in the requirements file causes the list_nets command step to fail with a "name required" exception.

Using version 1.0.1 resolves this.

Potential hardcoding of subnet

In these two files:

kubeinit/kubeinit/roles/kubeinit_nexus/tasks/main.yml
kubeinit/kubeinit/roles/kubeinit_submariner/templates/iptables_hypervisor_nat_config.j2

The 10.0.* subnet is hardcoded and not picked up from the inventory file.
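
A sketch of how the template could take the value from the inventory instead; the variable names below are assumptions, not the actual inventory keys:

# iptables_hypervisor_nat_config.j2 (sketch; variable names assumed)
-A POSTROUTING -s {{ kubeinit_inventory_network_net }}/{{ kubeinit_inventory_network_cidr }} -j MASQUERADE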

okd-service-01 "module_stdout": "/bin/sh: /usr/bin/python3: No such file or directory

Hi! I'm getting this error and I'm not sure how to solve it.

TASK [Configure the cluster service node] ************************************************************************************************************

TASK [../../roles/kubeinit_okd : update packages] ****************************************************************************************************
failed: [hypervisor-01 -> 10.0.0.100] (item=okd-service-01) => {"ansible_loop_var": "item", "changed": false, "item": "okd-service-01", "module_stderr": "Shared connection to 10.0.0.100 closed.\r\n", "module_stdout": "/bin/sh: /usr/bin/python3: No such file or directory\r\n", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}

PLAY RECAP *******************************************************************************************************************************************
hypervisor-01              : ok=58   changed=18   unreachable=0    failed=1    skipped=16   rescued=0    ignored=4   

In this attempt I changed the task "install services requirements" at the top of kubeinit/roles/kubeinit_okd/tasks/10_configure_service_nodes.yml and added python3 to kubeinit_okd_service_dependencies, but it didn't work.

TASK [Configure the cluster service node] ************************************************************************************************************

TASK [../../roles/kubeinit_okd : install services requirements] **************************************************************************************
failed: [hypervisor-01 -> 10.0.0.100] (item=okd-service-01) => {"ansible_loop_var": "item", "changed": false, "item": "okd-service-01", "module_stderr": "Shared connection to 10.0.0.100 closed.\r\n", "module_stdout": "/bin/sh: /usr/bin/python3: No such file or directory\r\n", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}
root@okd-service-01 ~]# sudo rpm -q python3
package python3 is not installed
[root@okd-service-01 ~]# cat /etc/centos-release
CentOS Linux release 8.2.2004 (Core) 
[root@okd-service-01 ~]# 
user@user-desktop:~/git_projects/kubeinit$ rgrep kubeinit_okd_service_dependencies -A5 |head -n5
kubeinit/roles/kubeinit_okd/defaults/main.yml:kubeinit_okd_service_dependencies:
kubeinit/roles/kubeinit_okd/defaults/main.yml-  - python3
kubeinit/roles/kubeinit_okd/defaults/main.yml-  - haproxy
kubeinit/roles/kubeinit_okd/defaults/main.yml-  - httpd
kubeinit/roles/kubeinit_okd/defaults/main.yml-  - bind


Any ideas ?
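
One way around a guest image that ships without Python is to bootstrap it with the raw module before any regular module runs; a minimal sketch (the delegation target is illustrative):

- name: Bootstrap python3 on the service node before using Ansible modules
  ansible.builtin.raw: dnf install -y python3
  delegate_to: "{{ hostvars[kubeinit_deployment_node_name].ansible_host }}"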

NFS dynamic provisioning

Issue on deploying okd-service-01 DNS-error?

Hello, thanks for this project and everyone working on it!

I have a problem when deploying (either via Podman or directly with Ansible); I get this message:


TASK [Configure the cluster service node] **************************************

TASK [../../roles/kubeinit_okd : update packages] ******************************

fatal: [hypervisor-01 -> 10.0.0.100]: FAILED! => {"changed": false, "msg": "Failed to download metadata for repo 'appstream': Cannot prepare internal mirrorlist: Curl error (6): Couldn't resolve host name for http://mirrorlist.centos.org/?release=8&arch=x86_64&repo=AppStream&infra=genclo [Could not resolve host: mirrorlist.centos.org]", "rc": 1, "results": []}

PLAY RECAP *********************************************************************
hypervisor-01              : ok=86   changed=20   unreachable=0    failed=1    skipped=22   rescued=0    ignored=3  

When SSHing into the virtual machine (okd-service-01), I cannot do any yum installations, and curl google.com fails as well with the same message.

I CAN ping 8.8.8.8, and the issue does not change when changing the resolv.conf file.

Could it be a problem that I'm executing this deployment from the hypervisor node itself?

I tried it with the release branch as well as the current master branch, and my host is a CentOS 8 machine.

Issue with rhel8.0 not existing in the dictionary.

@git4liluo this is the correct place for the issue

My hypervisor is Ubuntu. The following task failed. It might not be an issue, but I would appreciate your input/advice.

TASK [../../roles/kubeinit_libvirt : Create VM definition for the service nodes] *********************************************************************************
failed: [hypervisor-01] (item=okd-service-01) => {"ansible_loop_var": "item", "changed": false, "cmd": "virt-install --connect qemu:///system --name=okd-service-01 --memory memory=12288 --cpuset=auto --vcpus=8,maxvcpus=16 --os-type=linux --os-variant=rhel8.0 --autostart --network network=kimgtnet0,mac=52:54:00:f2:46:a7,model=virtio --graphics none --noautoconsole --import --disk /var/lib/libvirt/images/okd-service-01.qcow2,format=qcow2,bus=virtio\n", "delta": "0:00:00.791032", "end": "2020-09-29 13:45:49.480230", "item": "okd-service-01", "msg": "non-zero return code", "rc": 1, "start": "2020-09-29 13:45:48.689198", "stderr": "ERROR Error validating install location: Distro 'rhel8.0' does not exist in our dictionary", "stderr_lines": ["ERROR Error validating install location: Distro 'rhel8.0' does not exist in our dictionary"], "stdout": "", "stdout_lines": []}
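
For what it's worth, the list of accepted variants comes from the hypervisor's osinfo database, so you can check whether rhel8.0 is known there and, if not, update osinfo-db or fall back to a generic variant:

# List the OS variants libosinfo knows about on this hypervisor
osinfo-query os | grep -i rhel8
# If rhel8.0 is missing, updating the osinfo-db package usually adds it;
# otherwise --os-variant=generic lets the VM definition proceed.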

Autoscaling worker nodes

This is an advanced scenario where worker nodes are automatically spun up based on the workload and the currently available bare-metal resources. It's just an idea. I'm not sure how we can support this scenario (maybe an agent is needed to monitor the cluster and add/remove nodes).

Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host

Sometimes, if there are leftovers in the known_hosts file on the hypervisors or on the machine where you are running the ansible-playbook command, the deployment might fail with:

Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host

The solution is to clean those known_hosts files on the hypervisors and on the deployment workstation.

This cleaning is done by the Ansible code itself, but sometimes some entries are left behind.
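
If you hit this, removing the stale entries by hand is enough; for example (node name and IP taken from the default OKD scenario):

# On the hypervisors and on the deployment workstation
ssh-keygen -R okd-service-01
ssh-keygen -R 10.0.0.100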

Use existing network

Hi!

Is there a way to make the OKD deployment use an existing network, instead of creating a virtual network? Keeping in mind that I also want to expose the service externally.

Meanwhile, I will try to edit the files myself to achieve that.
Thank you!

Cleanup fails if nodes are undefined

It is possible to get into a state where one or more of the nodes are shut-off.
These nodes need to be undefined and not destroyed.

In this file:

kubeinit/kubeinit/roles/kubeinit_libvirt/tasks/10_cleanup.yml

It tries to destroy the node and then undefine it. But if a node is shut off, the destroy step will fail and undefine won't get a chance to run. One option is to just ignore errors in the destroy step.
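
A sketch of that option, assuming the cleanup loops over a list of node names (the loop variable here is illustrative):

- name: Destroy the cluster VMs (tolerate nodes that are already shut off)
  community.libvirt.virt:
    name: "{{ item }}"
    command: destroy
  loop: "{{ cluster_nodes }}"
  ignore_errors: true

- name: Undefine the cluster VMs
  community.libvirt.virt:
    name: "{{ item }}"
    command: undefine
  loop: "{{ cluster_nodes }}"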

Potential hardcoding of cluster name

In this file:
kubeinit/kubeinit/roles/kubeinit_bind/templates/create-external-ingress.sh.j2

The cluster name and domain name are hardcoded:

podman pod create --name kubeinit-ingress-pod --dns ${KUBEINIT_INGRESS_IP} --dns 8.8.8.8 --dns-search okdcluster.kubeinit.local
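
A sketch of how the template could build the search domain from variables instead; kubeinit_inventory_cluster_name is an assumed variable name, while kubeinit_inventory_cluster_domain appears elsewhere in the roles:

# create-external-ingress.sh.j2 (sketch; cluster name variable assumed)
podman pod create --name kubeinit-ingress-pod --dns ${KUBEINIT_INGRESS_IP} --dns 8.8.8.8 --dns-search {{ kubeinit_inventory_cluster_name }}.{{ kubeinit_inventory_cluster_domain }}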

Multiple clusters on the same bare metal?

I used a bare-metal machine as hypervisor-01, then successfully created a cluster (starting with one master, one worker, one service, and one bootstrap node) using the tool. I actually want to create MANY such SIMPLE clusters on the same bare metal. So my question: how can I create a second cluster on the same bare-metal machine? I think there should be only one hypervisor per bare-metal machine. BTW, after the cluster is deployed, is there any way I can reach/manage it from somewhere else? How can a created cluster have a web interface?

FAILED hypervisor-01 ../../roles/kubeinit_registry : Generate the htpasswd entry

FAILED hypervisor-01 ../../roles/kubeinit_registry : Generate the htpasswd entry

Exception

Traceback (most recent call last):   File "/tmp/ansible_community.general.htpasswd_payload_82ug34gu/ansible_community.general.htpasswd_payload.zip/ansible_collections/community/general/plugins/modules/htpasswd.py", line 104, in <module> ModuleNotFoundError: No module named 'passlib'

Msg

Failed to import the required Python library (passlib) on rke-service-01's Python /usr/bin/python3. Please read the module documentation and install it in the appropriate location. If the required library is installed, but Ansible is using the wrong Python interpreter, please consult the documentation on ansible_python_interpreter

Solution

On the machine where you trigger the Ansible command, execute:

sudo pip3 install passlib

Task failed in role apply nfs security policy to nfs user

I am getting the following error when deploying.

TASK [../../roles/kubeinit_nfs : add security context constraint for nfs provisioner] **************************************************************
fatal: [hypervisor-01 -> 10.0.0.100]: FAILED! => {"changed": false, "cmd": "cat << EOF > ~/nfs_scc.yaml\napiVersion: security.openshift.io/v1\nkind: SecurityContextConstraints\nmetadata:\n  name: nfs-provisioner\nallowHostDirVolumePlugin: true\nallowHostIPC: false\nallowHostNetwork: false\nallowHostPID: false\nallowHostPorts: false\nallowPrivilegedContainer: false\nallowedCapabilities:\n- DAC_READ_SEARCH\n- SYS_RESOURCE\ndefaultAddCapabilities: null\nfsGroup:\n  type: MustRunAs\npriority: null\nreadOnlyRootFilesystem: false\nrequiredDropCapabilities:\n- KILL\n- MKNOD\n- SYS_CHROOT\nrunAsUser:\n  type: RunAsAny\nseLinuxContext:\n  type: MustRunAs\nsupplementalGroups:\n  type: RunAsAny\nvolumes:\n- configMap\n- downwardAPI\n- emptyDir\n- hostPath\n- nfs\n- persistentVolumeClaim\n- secret\nEOF\nexport KUBECONFIG=~/.kube/config\nkubectl apply -f ~/nfs_scc.yaml\n", "delta": "0:00:04.823758", "end": "2021-02-07 06:43:29.773731", "msg": "non-zero return code", "rc": 1, "start": "2021-02-07 06:43:24.949973", "stderr": "error: unable to recognize \"/root/nfs_scc.yaml\": no matches for kind \"SecurityContextConstraints\" in version \"security.openshift.io/v1\"", "stderr_lines": ["error: unable to recognize \"/root/nfs_scc.yaml\": no matches for kind \"SecurityContextConstraints\" in version \"security.openshift.io/v1\""], "stdout": "", "stdout_lines": []}                                                                                                                 

PLAY RECAP *****************************************************************************************************************************************
hypervisor-01              : ok=222  changed=121  unreachable=0    failed=1    skipped=81   rescued=0    ignored=3   

Do you know why?

Failed to install any thoughts?

Hi,
I have built a 40-core CentOS 8 server with 500GB storage and 377GB RAM and managed to get part of the way through the install via Ansible, but it failed. I wonder if you could help out. Here are the errors.
Version of Centos: CentOS Linux release 8.3.2011

Thx

TASK [../../roles/kubeinit_libvirt : Destroy deployment networks] ************************************************************************************
failed: [hypervisor-01] (item={u'bridge': u'kimgtbr0', u'end': u'10.0.0.253', u'name': u'kimgtnet0', u'enabled': True, u'start': u'10.0.0.1', u'netmask': u'255.255.255.0', u'template': u'cluster-net.xml.j2', u'cidr': 24, u'net
': u'10.0.0.0', u'main': True, u'type': u'internal', u'gateway': u'10.0.0.254'}) => {"ansible_loop_var": "item", "changed": false, "item": {"bridge": "kimgtbr0", "cidr": 24, "enabled": true, "end": "10.0.0.253", "gateway": "10
.0.0.254", "main": true, "name": "kimgtnet0", "net": "10.0.0.0", "netmask": "255.255.255.0", "start": "10.0.0.1", "template": "cluster-net.xml.j2", "type": "internal"}, "msg": "Failed to connect socket to '/var/run/libvirt/lib
virt-sock': No such file or directory"}
...ignoring

TASK [../../roles/kubeinit_libvirt : Undefine deployment networks] ***********************************************************************************
failed: [hypervisor-01] (item={u'bridge': u'kimgtbr0', u'end': u'10.0.0.253', u'name': u'kimgtnet0', u'enabled': True, u'start': u'10.0.0.1', u'netmask': u'255.255.255.0', u'template': u'cluster-net.xml.j2', u'cidr': 24, u'net
': u'10.0.0.0', u'main': True, u'type': u'internal', u'gateway': u'10.0.0.254'}) => {"ansible_loop_var": "item", "changed": false, "item": {"bridge": "kimgtbr0", "cidr": 24, "enabled": true, "end": "10.0.0.253", "gateway": "10
.0.0.254", "main": true, "name": "kimgtnet0", "net": "10.0.0.0", "netmask": "255.255.255.0", "start": "10.0.0.1", "template": "cluster-net.xml.j2", "type": "internal"}, "msg": "Failed to connect socket to '/var/run/libvirt/lib
virt-sock': No such file or directory"}
...ignoring

TASK [../../roles/kubeinit_libvirt : Destroy default network] ****************************************************************************************
fatal: [hypervisor-01]: FAILED! => {"changed": false, "msg": "Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory"}
...ignoring

TASK [../../roles/kubeinit_libvirt : Undefine default network] ***************************************************************************************
fatal: [hypervisor-01]: FAILED! => {"changed": false, "msg": "Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory"}
...ignoring

ASK [../../roles/kubeinit_libvirt : define KubeInit networks] ***************************************************************************************
failed: [hypervisor-01] (item={u'bridge': u'kimgtbr0', u'end': u'10.0.0.253', u'name': u'kimgtnet0', u'enabled': True, u'start': u'10.0.0.1', u'netmask': u'255.255.255.0', u'template': u'cluster-net.xml.j2', u'cidr': 24, u'net
': u'10.0.0.0', u'main': True, u'type': u'internal', u'gateway': u'10.0.0.254'}) => {"ansible_loop_var": "item", "changed": false, "item": {"bridge": "kimgtbr0", "cidr": 24, "enabled": true, "end": "10.0.0.253", "gateway": "10
.0.0.254", "main": true, "name": "kimgtnet0", "net": "10.0.0.0", "netmask": "255.255.255.0", "start": "10.0.0.1", "template": "cluster-net.xml.j2", "type": "internal"}, "msg": "Failed to connect socket to '/var/run/libvirt/lib
virt-sock': No such file or directory"}

PLAY RECAP *******************************************************************************************************************************************
hypervisor-01              : ok=50   changed=7    unreachable=0    failed=1    skipped=10   rescued=0    ignored=7   

Multiple hypervisor nodes

Currently, KubeInit supports deploying in HA mode (3 master nodes), but only on a single hypervisor server. It's not really HA if the hypervisor server goes down. A better option would be for the master nodes to run on different hypervisor servers.

[10_configure.yml] - 'dict object' has no attribute 'ansible_facts'

TASK [../../roles/kubeinit_registry : Setting Docker facts about the container that will run the registry] ********************************************************************************************************
task path: /home/danielyeap/ansible/kubeinit/kubeinit/roles/kubeinit_registry/tasks/10_configure.yml:244
fatal: [hypervisor-01]: FAILED! => {
"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'ansible_facts'\n\nThe error appears to be in '/home/danielyeap/ansible/kubeinit/kubeinit/roles/kubeinit_registry/tasks/10_configure.yml': line 244, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Setting Docker facts about the container that will run the registry\n ^ here\n"
}

===============

TASK [../../roles/kubeinit_registry : debug] **********************************************************************************************************************************************************************
task path: /home/danielyeap/ansible/kubeinit/kubeinit/roles/kubeinit_registry/tasks/10_configure.yml:240
ok: [hypervisor-01 -> 10.0.0.100] => {
"registry_docker_container_info": {
"changed": true,
"container": {
"AppArmorProfile": "",
"Args": [
"/etc/docker/registry/config.yml"
],
"Config": {
"AttachStderr": false,
"AttachStdin": false,
"AttachStdout": false,
"Cmd": [
"/etc/docker/registry/config.yml"
],
"Domainname": "",
"Entrypoint": [
"/entrypoint.sh"
],
"Env": [
"REGISTRY_AUTH=htpasswd",
"REGISTRY_AUTH_HTPASSWD_REALM=Registry",
"REGISTRY_HTTP_SECRET=ALongRandomSecretForRegistry",
"REGISTRY_AUTH_HTPASSWD_PATH=auth/htpasswd",
"REGISTRY_HTTP_TLS_CERTIFICATE=certs/domain.crt",
"REGISTRY_HTTP_TLS_KEY=certs/domain.key",
"REGISTRY_COMPATIBILITY_SCHEMA1_ENABLED=true",
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"ExposedPorts": {
"5000/tcp": {}
},
"Hostname": "6f2a5b61a503",
"Image": "docker.io/library/registry:2",
"Labels": {},
"OnBuild": null,
"OpenStdin": false,
"StdinOnce": false,
"Tty": false,
"User": "",
"Volumes": {
"/var/lib/registry": {}
},
"WorkingDir": ""
},
"Created": "2021-02-05T09:04:12.281946781Z",
"Driver": "overlay2",
"ExecIDs": null,
"GraphDriver": {
"Data": {
"LowerDir": "/var/lib/docker/overlay2/ced18f790db72ba967cec3696c405376a9e40bd02e7258a0f6b7ddb04363f772-init/diff:/var/lib/docker/overlay2/c2f7e0160242e599b0a5d6914f346fef28e4e530727ab0f7a8cf60335731cb8d/diff:/var/lib/docker/overlay2/d2612f20e5d9c7cc524c59ef83a21710bbf7f7debefb410294228d731e693ca7/diff:/var/lib/docker/overlay2/d3bf4b65f7cea7cffafeb645a291cb7b563d5fdb2a82612ae9645c35a592623f/diff:/var/lib/docker/overlay2/ebd652c8211b52ec3591fc2e95b10506f9109bbef8816a1adb1bcab9ea036ceb/diff:/var/lib/docker/overlay2/503e8675b9491750dedfaab5292962bb6b63bccf4a5827063ce317a3c09d1b3c/diff",
"MergedDir": "/var/lib/docker/overlay2/ced18f790db72ba967cec3696c405376a9e40bd02e7258a0f6b7ddb04363f772/merged",
"UpperDir": "/var/lib/docker/overlay2/ced18f790db72ba967cec3696c405376a9e40bd02e7258a0f6b7ddb04363f772/diff",
"WorkDir": "/var/lib/docker/overlay2/ced18f790db72ba967cec3696c405376a9e40bd02e7258a0f6b7ddb04363f772/work"
},
"Name": "overlay2"
},
"HostConfig": {
"AutoRemove": false,
"Binds": [
"/var/kubeinit/local_registry/data:/var/lib/registry:z",
"/var/kubeinit/local_registry/auth:/auth:z",
"/var/kubeinit/local_registry/certs:/certs:z"
],
"BlkioDeviceReadBps": null,
"BlkioDeviceReadIOps": null,
"BlkioDeviceWriteBps": null,
"BlkioDeviceWriteIOps": null,
"BlkioWeight": 0,
"BlkioWeightDevice": null,
"CapAdd": null,
"CapDrop": null,
"Cgroup": "",
"CgroupParent": "",
"CgroupnsMode": "host",
"ConsoleSize": [
0,
0
],
"ContainerIDFile": "",
"CpuCount": 0,
"CpuPercent": 0,
"CpuPeriod": 0,
"CpuQuota": 0,
"CpuRealtimePeriod": 0,
"CpuRealtimeRuntime": 0,
"CpuShares": 0,
"CpusetCpus": "",
"CpusetMems": "",
"DeviceCgroupRules": null,
"DeviceRequests": null,
"Devices": null,
"Dns": null,
"DnsOptions": null,
"DnsSearch": null,
"ExtraHosts": null,
"GroupAdd": null,
"IOMaximumBandwidth": 0,
"IOMaximumIOps": 0,
"Init": false,
"IpcMode": "private",
"Isolation": "",
"KernelMemory": 0,
"KernelMemoryTCP": 0,
"Links": null,
"LogConfig": {
"Config": {},
"Type": "json-file"
},
"MaskedPaths": [
"/proc/asound",
"/proc/acpi",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/proc/scsi",
"/sys/firmware"
],
"Memory": 0,
"MemoryReservation": 0,
"MemorySwap": 0,
"MemorySwappiness": null,
"NanoCpus": 0,
"NetworkMode": "default",
"OomKillDisable": false,
"OomScoreAdj": 0,
"PidMode": "host",
"PidsLimit": null,
"PortBindings": {
"5000/tcp": [
{
"HostIp": "0.0.0.0",
"HostPort": "5000"
}
]
},
"Privileged": false,
"PublishAllPorts": false,
"ReadonlyPaths": [
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
],
"ReadonlyRootfs": false,
"RestartPolicy": {
"MaximumRetryCount": 0,
"Name": ""
},
"Runtime": "runc",
"SecurityOpt": [
"label=disable"
],
"ShmSize": 67108864,
"UTSMode": "",
"Ulimits": null,
"UsernsMode": "",
"VolumeDriver": "",
"VolumesFrom": null
},
"HostnamePath": "/var/lib/docker/containers/6f2a5b61a503b63b3442b7c72bb485dce2dc65b169810a46b0b6cc12b44c3159/hostname",
"HostsPath": "/var/lib/docker/containers/6f2a5b61a503b63b3442b7c72bb485dce2dc65b169810a46b0b6cc12b44c3159/hosts",
"Id": "6f2a5b61a503b63b3442b7c72bb485dce2dc65b169810a46b0b6cc12b44c3159",
"Image": "sha256:678dfa38fcfa349ccbdb1b6d52ac113ace67d5746794b36dfbad9dd96a9d1c43",
"LogPath": "/var/lib/docker/containers/6f2a5b61a503b63b3442b7c72bb485dce2dc65b169810a46b0b6cc12b44c3159/6f2a5b61a503b63b3442b7c72bb485dce2dc65b169810a46b0b6cc12b44c3159-json.log",
"MountLabel": "",
"Mounts": [
{
"Destination": "/var/lib/registry",
"Mode": "z",
"Propagation": "rprivate",
"RW": true,
"Source": "/var/kubeinit/local_registry/data",
"Type": "bind"
},
{
"Destination": "/auth",
"Mode": "z",
"Propagation": "rprivate",
"RW": true,
"Source": "/var/kubeinit/local_registry/auth",
"Type": "bind"
},
{
"Destination": "/certs",
"Mode": "z",
"Propagation": "rprivate",
"RW": true,
"Source": "/var/kubeinit/local_registry/certs",
"Type": "bind"
}
],
"Name": "/kubeinit-registry",
"NetworkSettings": {
"Bridge": "",
"EndpointID": "b1ae8d7016e444c0c2583d6bb66c99c122fc487facbd6cdf2b50428695ab5149",
"Gateway": "172.17.0.1",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"HairpinMode": false,
"IPAddress": "172.17.0.2",
"IPPrefixLen": 16,
"IPv6Gateway": "",
"LinkLocalIPv6Address": "",
"LinkLocalIPv6PrefixLen": 0,
"MacAddress": "02:42:ac:11:00:02",
"Networks": {
"bridge": {
"Aliases": null,
"DriverOpts": null,
"EndpointID": "b1ae8d7016e444c0c2583d6bb66c99c122fc487facbd6cdf2b50428695ab5149",
"Gateway": "172.17.0.1",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"IPAMConfig": null,
"IPAddress": "172.17.0.2",
"IPPrefixLen": 16,
"IPv6Gateway": "",
"Links": null,
"MacAddress": "02:42:ac:11:00:02",
"NetworkID": "16e3cafffe012c6f4324771570c8d6a247fd118f7579e50ec338da9ec9b9cbb2"
}
},
"Ports": {
"5000/tcp": [
{
"HostIp": "0.0.0.0",
"HostPort": "5000"
}
]
},
"SandboxID": "6b480a0c8ce1c8a81a422ff654582a183f339c542dff4ece2a2544cd03362c2f",
"SandboxKey": "/var/run/docker/netns/6b480a0c8ce1",
"SecondaryIPAddresses": null,
"SecondaryIPv6Addresses": null
},
"Path": "/entrypoint.sh",
"Platform": "linux",
"ProcessLabel": "",
"ResolvConfPath": "/var/lib/docker/containers/6f2a5b61a503b63b3442b7c72bb485dce2dc65b169810a46b0b6cc12b44c3159/resolv.conf",
"RestartCount": 0,
"State": {
"Dead": false,
"Error": "",
"ExitCode": 0,
"FinishedAt": "0001-01-01T00:00:00Z",
"OOMKilled": false,
"Paused": false,
"Pid": 30997,
"Restarting": false,
"Running": true,
"StartedAt": "2021-02-05T09:04:13.054022522Z",
"Status": "running"
}
},
"deprecations": [
{
"msg": "The container_default_behavior option will change its default value from "compatibility" to "no_defaults" in community.docker 2.0.0. To remove this warning, please specify an explicit value for it now",
"version": "2.0.0"
}
],
"failed": false
}
}

=============

What is wrong with the task? Can you please help as I am not that familiar with Ansible yet?

Thanks.

Cleanup fails requiring password

Trying to run the clean command as documented here and getting the following error message:

TASK [../../roles/kubeinit_prepare : gather network facts] ***********************************************************************************************************************************************************************************
fatal: [hypervisor-01]: FAILED! => {
	"ansible_facts": {},
	"changed": false,
	"failed_modules": {
		"ansible.legacy.setup": {
			"failed": true,
			"module_stderr": "sudo: a password is required\n",
			"module_stdout": "",
			"msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
			"rc": 1
		}
	},
	"msg": "The following modules failed to execute: ansible.legacy.setup\n"
}

Inventory being used is available here.

And I have no issues completing most of the install steps and logging in to my hypervisor without an SSH password.

OKD UI not opening

Let me thank you first for making this available.

I can do oc get nodes; however, when using the UI, Firefox errors out with "Secure Connection Failed".
[Screenshot: 2020-08-24 at 4:22 PM]

Any help around this is appreciated.

Centos service fails due to libgomp/gcc incompatibility

When trying to install a service the following error occurs:

TASK [../../roles/kubeinit_eks : install services requirements] ******************************************************************************************************************************************************************************
fatal: [localhost -> 11.0.0.100]: FAILED! => {
	"changed": false,
	"failures": [],
	"msg": "Depsolve Error occured: \n Problem: cannot install the best candidate for the job\n  - nothing provides libgcc >= 8.5.0-1.el8 needed by gcc-8.5.0-1.el8.x86_64\n  - nothing provides libgomp = 8.5.0-1.el8 needed by gcc-8.5.0-1.el8.x86_64",
	"rc": 1,
	"results": []
}

Inventory being used is available here.

Contact via a Slack or similar channel?

Sorry for writing here as I do not think this is the best place. I wonder if you could direct me to a slack channel or similar, where I can quickly ask a couple of implementation questions? I am ready to install this but am a little unsure of one or two of the inventory options. Thx

Christopher

ERROR! this task 'ansible.builtin.shell' has extra params, which is only allowed in the following modules: set_fact, ansible.windows.win_shell, meta, raw, script, group_by, add_host, win_shell, import_tasks, include, command, include_role, shell, win_command, include_vars, import_role, include_tasks, ansible.windows.win_command

How to reproduce:

time ansible-playbook     --user root     -v -i ./hosts/k8s/inventory     --become     --become-user root     ./playbooks/k8s.yml

Error:

No config file found; using defaults

ERROR! this task 'ansible.builtin.shell' has extra params, which is only allowed in the following modules: set_fact, ansible.windows.win_shell, meta, raw, script, group_by, add_host, win_shell, import_tasks, include, command, include_role, shell, win_command, include_vars, import_role, include_tasks, ansible.windows.win_command

The error appears to be in '/home/ccamacho/Dev/kubeinit/playbooks/k8s.yml': line 31, column 7, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:
  tasks:
    - name: Create a public key in the hypervisor hosts
      ^ here

Root cause: ansible/ansible#71817 and ansible/ansible#72458

How to fix: Update your Ansible (2.9 or 2.10) to the latest minor version.
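
For example, when Ansible was installed with pip:

pip3 install --upgrade ansible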

TODO: Allow the host machine to be Debian/Ubuntu/Centos8/Fedora - The server itself needs to be rhel/centos/fedora?

hi

As far as I understand, the KubeInit setup is KVM-based, and I'm aware that the OKD nodes need to be CoreOS/CentOS/Fedora/RHEL. But it should be possible to use KubeInit on a server with another base OS (with minor changes), right?

I have a Debian/KVM server and want to test OKD there. I don't think I need to change much besides "kubeinit_provision_hypervisor_dependencies" (as some package names possibly differ) and the OS check?

thanx.ivo

Documentation is missing the fact that nodes should not be renamed

In the inventory file I renamed my worker nodes to:

[worker_nodes]
eks-core ansible_host=10.0.0.6 mac=52:54:00:39:22:52 interfaceid=9edc6913-d6f9-4091-ad11-1138d1caacb1 target=hypervisor-01 type=virtual
eks-task1 ansible_host=10.0.0.7 mac=52:54:00:14:61:67 interfaceid=73139f80-1564-40e4-ba9f-5b0941e780da target=hypervisor-01 type=virtual
eks-task2 ansible_host=10.0.0.8 mac=52:54:00:34:34:47 interfaceid=3584b702-4b70-44be-be6d-0995277b8a6d target=hypervisor-01 type=virtual

This is a problem since KubeInit needs "worker" in the name for a node to be recognised.

It would be good to mention this fact in the documentation and within the inventory.
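
For reference, a naming scheme that keeps the worker nodes recognisable would look like this (reusing the addresses from the snippet above):

[worker_nodes]
eks-worker-1 ansible_host=10.0.0.6 mac=52:54:00:39:22:52 interfaceid=9edc6913-d6f9-4091-ad11-1138d1caacb1 target=hypervisor-01 type=virtual
eks-worker-2 ansible_host=10.0.0.7 mac=52:54:00:14:61:67 interfaceid=73139f80-1564-40e4-ba9f-5b0941e780da target=hypervisor-01 type=virtual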

Conflicts with existing KVM and DNS

I had a first try and found out that it conflicts with my pre-existing virbr1.
I also already had named.service running, and the playbook tries to change it.
Finally, I am running Fedora 32, which requires yet another change to the playbook.

Ease of use seems to be there only when starting with a fresh install of the host.
I put this aside until I have more time.
Not a simple 30 minutes in my case.

Registry is not accessible with multiple service nodes

Using the following inventory:

https://gist.github.com/nadenf/73d7cc843c8887c8b5309268fed64c84

I get the following error:

TASK [../../roles/kubeinit_registry : Check if the registry is up and running] ***************************************************************************************************************************************************************
fatal: [localhost -> 11.0.0.101]: FAILED! => 
{
	"changed": false,
	"cmd": "set -o pipefail\nset -e\ncurl -v --silent --user registryusername:registrypassword https://eks-service1.eks.kubeinit.local:5000/v2/_catalog --stderr - | grep '\\{\"repositories\":'\n# curl -v --silent --user registryusername:registrypassword https://eks-service1.eks.kubeinit.local:5000/v2/openshift/tags/list\n",
	"delta": "0:00:00.060488",
	"end": "2021-05-29 12:49:41.650796",
	"msg": "non-zero return code",
	"rc": 1,
	"start": "2021-05-29 12:49:41.590308",
	"stderr": "",
	"stderr_lines": [],
	"stdout": "",
	"stdout_lines": []
}

And nslookup on eks-service1.eks.kubeinit.local indicates that the domain can't be found.

error verifying master nodes ok

I am attempting to load this onto my Supermicro. It keeps failing after the retries.

TASK [../../roles/okd : wait for master nodes to start SSH] *************************************************
task path: /root/kubeinit/kubeinit/roles/okd/tasks/configure_cluster_nodes.yml:54
ESTABLISH SSH CONNECTION FOR USER: root
SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ControlPath=/root/.ansible/cp/72dc686b8e localhost '/bin/sh -c '"'"'echo ~root && sleep 0'"'"''
(0, b'/root\n', b'')
ESTABLISH SSH CONNECTION FOR USER: root
SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ControlPath=/root/.ansible/cp/72dc686b8e localhost '/bin/sh -c '"'"'( umask 77 && mkdir -p "echo /root/.ansible/tmp"&& mkdir /root/.ansible/tmp/ansible-tmp-1598204770.388947-123039-117095049795577 && echo ansible-tmp-1598204770.388947-123039-117095049795577="echo /root/.ansible/tmp/ansible-tmp-1598204770.388947-123039-117095049795577" ) && sleep 0'"'"''
(0, b'ansible-tmp-1598204770.388947-123039-117095049795577=/root/.ansible/tmp/ansible-tmp-1598204770.388947-123039-117095049795577\n', b'')
Using module file /usr/lib/python3.6/site-packages/ansible/modules/utilities/logic/wait_for.py
PUT /root/.ansible/tmp/ansible-local-109879iagsm4yb/tmpfd6100c9 TO /root/.ansible/tmp/ansible-tmp-1598204770.388947-123039-117095049795577/AnsiballZ_wait_for.py
SSH: EXEC sftp -b - -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ControlPath=/root/.ansible/cp/72dc686b8e '[localhost]'
(0, b'sftp> put /root/.ansible/tmp/ansible-local-109879iagsm4yb/tmpfd6100c9 /root/.ansible/tmp/ansible-tmp-1598204770.388947-123039-117095049795577/AnsiballZ_wait_for.py\n', b'')
ESTABLISH SSH CONNECTION FOR USER: root
SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ControlPath=/root/.ansible/cp/72dc686b8e localhost '/bin/sh -c '"'"'chmod u+x /root/.ansible/tmp/ansible-tmp-1598204770.388947-123039-117095049795577/ /root/.ansible/tmp/ansible-tmp-1598204770.388947-123039-117095049795577/AnsiballZ_wait_for.py && sleep 0'"'"''
(0, b'', b'')
ESTABLISH SSH CONNECTION FOR USER: root
SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ControlPath=/root/.ansible/cp/72dc686b8e -tt localhost '/bin/sh -c '"'"'/usr/libexec/platform-python /root/.ansible/tmp/ansible-tmp-1598204770.388947-123039-117095049795577/AnsiballZ_wait_for.py && sleep 0'"'"''

TASK [../../roles/okd : verify that master nodes are ok] *********************************************************
[WARNING]: conditional statements should not include jinja2 templating delimiters such as {{ }} or {% %}. Found:
cmd_res.stdout_lines | list | count == groups['okd_{{ kubeinit_deployment_role }}_nodes'] | count
FAILED - RETRYING: verify that master nodes are ok (60 retries left).
FAILED - RETRYING: verify that master nodes are ok (59 retries left).

Document / display information on how to use cluster

So after the cluster has been set up, it would be useful to have the following:

  1. URL and CA Cert for the API Server
  2. Generated Kubeconfig
  3. Sample commands e.g. kubectl get po

It could be either dynamically generated and output from Ansible or simply documented.
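
For example, something along these lines could be documented or printed at the end of the run (the paths assume the service-node layout used elsewhere in this README):

# On the service node (okd-service-01 in the default OKD scenario)
export KUBECONFIG=~/.kube/config
kubectl get nodes
kubectl get po --all-namespaces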
