
hetzner-ocp4's Issues

This should also work on qemu that's shipped with RHEL

You can install CoreOS using the CoreOS ISO by extracting the following files (a hedged extraction sketch follows the list):

  • efiboot.img
  • initramfs.img
  • vmlinuz
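
A hedged extraction sketch (rhcos-installer.iso is a placeholder filename, and the paths inside the ISO can differ between releases, so check the mounted tree first):

mkdir -p /mnt/iso /tmp/coreos
mount -o loop,ro rhcos-installer.iso /mnt/iso
cp /mnt/iso/images/efiboot.img /tmp/coreos/
cp /mnt/iso/isolinux/vmlinuz /mnt/iso/isolinux/initramfs.img /tmp/coreos/
umount /mnt/iso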

Then create a .treeinfo:

cat /tmp/coreos/.treeinfo 
[general]
arch = x86_64
family = Fedora
platforms = x86_64
version = 29
[images-x86_64]
initrd = initramfs.img
kernel = vmlinuz

Then run virt-install with the --location argument.

#!/bin/bash
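# Kernel arguments for the RHCOS installer: enable install mode, target disk,
# metal image URL, ignition config URL, and DHCP networking.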

args='coreos.inst=yes '
args+='coreos.inst.install_dev=vda '
args+='coreos.inst.image_url=http://172.24.24.3:8080/pub/rhcos-42.80.20190828.2-metal-bios.raw.gz '
args+='coreos.inst.ignition_url=http://172.24.24.3:8080/pub/bootstrap.ign '
args+='ip=dhcp '
args+='rd.neednet=1'

virt-install --location /tmp/coreos --extra-args="${args}" --network network=ocp4 --name ocp4-compute-1 --memory 8192 --disk /var/lib/libvirt/images/ocp4-compute-1.qcow2

exit $?

installation fails - unreachable api

Hi,
The installer fails while waiting for the API:

"stderr_lines": [
        "level=debug msg=\"OpenShift Installer v4.2.0-201908282219-dirty\"",
        "level=debug msg=\"Built from commit 4f3e73a0143ba36229f42e8b65b6e65342bb826b\"",
        "level=info msg=\"Waiting up to 30m0s for the Kubernetes API at https://api.ocp4.sanc.ch:6443...\"",
        "level=debug msg=\"Still waiting for the Kubernetes API: Get https://api.ocp4.xxx:6443/version?timeout=32s: EOF\"",

On the bootstrap node I see

Sep 23 05:06:35 bootstrap podman[2318]: 2019-09-23 05:06:35.924008516 +0000 UTC m=+0.687375687 container attach 545491a0fa8e2c6e9275e2228547512d77015022f1912dd2d8025a729cb7e0ec (image=quay.io/openshift-release-dev/ocp-release-nightly@sha256:d48a15ea564293934eb188e6eb8737e56903453d50bc70830cdac2641fb63acc, name=elegant_knuth)
Sep 23 05:06:37 bootstrap bootkube.sh[722]: Starting etcd certificate signer...
Sep 23 05:06:37 bootstrap bootkube.sh[722]: Error: name etcd-signer is in use: container already exists
Sep 23 05:06:37 bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=125/n/a
Sep 23 05:06:37 bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Sep 23 05:06:42 bootstrap systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Sep 23 05:06:42 bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 93.
Sep 23 05:06:42 bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Sep 23 05:06:42 bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Sep 23 05:06:43 bootstrap podman[2587]: 2019-09-23 05:06:43.572904432 +0000 UTC m=+0.372897314 container create 001f86c9ba3065693b1abda46f4594aec1909cfe01e80d6adc5528057a0af7e2 (image=quay.io/openshift-release-dev/ocp-release-nightly@sha256:d48a15ea564293934eb188e6eb8737e56903453d50bc70830cdac2641fb63acc, name=quizzical_rhodes)

and

Sep 22 19:11:01 bootstrap bootkube.sh[19735]: Waiting for etcd cluster...
Sep 22 19:11:09 bootstrap podman[22919]: 2019-09-22 19:11:09.259737221 +0000 UTC m=+7.442582895 image pull
Sep 22 19:11:09 bootstrap podman[22919]: 2019-09-22 19:11:09.580244772 +0000 UTC m=+7.763090429 container create b00e7ef09f5b9a778c2ed1b0fcc58bc5403f1876cde241c0ca39a>
Sep 22 19:11:09 bootstrap podman[22919]: 2019-09-22 19:11:09.907017175 +0000 UTC m=+8.089862853 container init b00e7ef09f5b9a778c2ed1b0fcc58bc5403f1876cde241c0ca39a8b>
Sep 22 19:11:10 bootstrap podman[22919]: 2019-09-22 19:11:10.038039098 +0000 UTC m=+8.220884731 container start b00e7ef09f5b9a778c2ed1b0fcc58bc5403f1876cde241c0ca39a8>
Sep 22 19:11:10 bootstrap podman[22919]: 2019-09-22 19:11:10.038132193 +0000 UTC m=+8.220977884 container attach b00e7ef09f5b9a778c2ed1b0fcc58bc5403f1876cde241c0ca39a>
Sep 22 19:21:09 bootstrap bootkube.sh[19735]: https://etcd-2.ocp4.xxx:2379 is unhealthy: failed to connect: dial tcp 192.168.50.12:2379: connect: connection refus>
Sep 22 19:21:09 bootstrap bootkube.sh[19735]: Error: unhealthy cluster
Sep 22 19:21:10 bootstrap bootkube.sh[19735]: etcdctl failed. Retrying in 5 seconds...

Any hints on where to start debugging would be highly appreciated.
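
A hedged starting point for debugging (the hostname is a placeholder; the container and unit names come from the logs above) is to SSH to the bootstrap node and watch bootkube directly:

ssh core@bootstrap.ocp4.example.com
journalctl -b -f -u bootkube.service   # follow the restart loop live
sudo podman ps -a | grep etcd-signer   # locate the stale signer container
sudo podman rm etcd-signer             # remove it so bootkube can recreate it

Whether etcd itself then becomes healthy is a separate question, but this clears the "container already exists" loop.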

Include user list in cluster.yml

cluster.yml should contain a list of users to be added to the htpasswd secret for htpasswd-based auth. One of those users should also be bound to the cluster-admin role.
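
A hedged sketch, reusing the shape of the sample cluster.yml quoted later on this page (the hashes are placeholders):

auth_htpasswd:
  - admin:$apr1$placeholder-hash
  - developer:$apr1$placeholder-hash

cluster_role_bindings:
  - cluster_role: cluster-admin
    name: admin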

pure-ansible: Rewrite post-terraform.sh in ansible

openshift-install --dir=/root/{{ cluster_name }}-install wait-for bootstrap-complete --log-level debug
virsh shutdown bootstrap
sleep 120
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
# apiserver certs are not yet working.

#oc create secret tls letsencrypt-api-certs    --cert={{ playbook_dir }}/../certificate/{{ cluster_name }}.{{ public_domain }}/fullchain.crt --key={{ playbook_dir }}/../certificate/{{ cluster_name }}.{{ public_domain }}/cert.key -n openshift-config
#oc patch apiserver cluster --type=merge -p '{"spec":{"servingCerts": {"namedCertificates":[{"names": ["api.{{ cluster_name }}.{{ public_domain }}"], "servingCertificate": {"name": "letsencrypt-api-certs"}}]}}}'
# Install certificate
oc create secret tls letsencrypt-router-certs --cert={{ playbook_dir }}/../certificate/{{ cluster_name }}.{{ public_domain }}/fullchain.crt --key={{ playbook_dir }}/../certificate/{{ cluster_name }}.{{ public_domain }}/cert.key -n openshift-ingress
oc patch ingresscontroller default -n openshift-ingress-operator --type=merge --patch='{"spec": { "defaultCertificate": { "name": "letsencrypt-router-certs" }}}'

openshift-install --dir=/root/{{ cluster_name }}-install wait-for install-complete --log-level debug
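
A hedged sketch of the first steps as Ansible tasks (the modules are stock Ansible; the task layout is illustrative, not the project's final form):

- name: Wait for bootstrap to complete
  command: >
    openshift-install --dir=/root/{{ cluster_name }}-install
    wait-for bootstrap-complete --log-level debug

- name: Shut down the bootstrap VM
  virt:
    name: bootstrap
    command: shutdown

- name: Add emptyDir storage to the image registry
  command: >
    oc patch configs.imageregistry.operator.openshift.io cluster
    --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'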

CoreOS images for v4.2 remote location changed

I've experienced some issues installing the current version (4.2).

I discovered that the remote location on openshift.com changed (ansible/roles/openshift-4-cluster/defaults/main.yml):

-coreos_download_url: "https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest/rhcos-{{ coreos_version }}-qemu.qcow2"
+coreos_download_url: "https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/4.2.0-rc.5/rhcos-{{ coreos_version }}-qemu.qcow2"

The /latest remote subdir has now been bumped to the 4.3 version, and the qemu image disappeared from it.

We should always point to the latest stable to avoid these issues.

I'm going to create a PR.

ansible playbook fails to start

Issue

The command ansible-playbook ./ansible/setup.yml reports the following error

[root@ocp4 hetzner-ocp4]# ansible-playbook ./ansible/setup.yml
 [WARNING]: Could not match supplied host pattern, ignoring: all

 [WARNING]: provided hosts list is empty, only localhost is available

ERROR! no action detected in task. This often indicates a misspelled module name, or incorrect module path.

The error appears to have been in '/root/temp/hetzner-ocp4/ansible/roles/openshift-4-loadbalancer/tasks/create.yml': line 25, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:


- name: Collect services facts
  ^ here

exception type: <class 'ansible.errors.AnsibleParserError'>
exception: no action detected in task. This often indicates a misspelled module name, or incorrect module path.


Additional info

OS: CentOS 7
Ansible version: 2.4.2.0
cluster.yml:

---
cluster_name: ocp4
public_domain: example.com
dns_provider: [route53|cloudflare|gcp]
letsencrypt_account_email: [email protected]
# Depending on the dns provider:
# CloudFlare
cloudflare_account_email: [email protected]
cloudflare_account_api_token: 9348234sdsd894.....
cloudflare_zone: example.com
# Route53
aws_access_key: key
aws_secret_key: secret
aws_zone: example.com
# GCP
gcp_project: project-name
gcp_managed_zone_name: 'zone-name'
gcp_managed_zone_domain: 'example.com.'
gcp_serviceaccount_file: ../gcp_service_account.json

auth_htpasswd:
  - admin:$ttttttttt//
  - local:$ttttttttt//

storage_nfs: false # Default is false

auth_redhatsso:
  client_id: "xxxxx.apps.googleusercontent.com"
  client_secret: "xxxxxxx"

cluster_role_bindings:
  - cluster_role: sudoers
    name: [email protected]
  - cluster_role: cluster-admin
    name: admin


image_pull_secret: |-
  ttttttttt

Disk size too small

The default disk size is a little bit too small:

[core@master-0 ~]$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    252:0    0   16G  0 disk
├─vda1 252:1    0    1M  0 part
├─vda2 252:2    0    1G  0 part /boot
└─vda3 252:3    0   15G  0 part /sysroot

command: "qemu-img convert -O qcow2 -o size=10G {{ coreos_image_location }} /var/lib/libvirt/images/{{ vm_instance_name }}.qcow2"

Please add a variable in ./ansible/roles/openshift-4-loadbalancer/defaults/main.yml and use it.
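
A hedged sketch of what that could look like (the variable name is illustrative; the command mirrors the existing task):

# defaults/main.yml
coreos_root_disk_size: 32G

# in the qemu-img task
command: "qemu-img convert -O qcow2 -o size={{ coreos_root_disk_size }} {{ coreos_image_location }} /var/lib/libvirt/images/{{ vm_instance_name }}.qcow2"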

default router running on master nodes and not on the worker nodes load-balanced by haproxy

I added two labels to the worker nodes to mark them as infra:

node-role.kubernetes.io/infra: ""
infra: infra

by editing the node:
$ oc edit nodes worker-0.ocp4.dwojciec.com
and adding the two new labels inside:

  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    infra: infra
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: worker-0.ocp4.dwojciec.com
    kubernetes.io/os: linux
    node-role.kubernetes.io/infra: ""
    node-role.kubernetes.io/worker: ""
    node.openshift.io/os_id: rhcos
  name: worker-0.ocp4.dwojciec.com
  resourceVersion: "26912"

See the result:

[root@CentOS-76-64-minimal haproxy]# oc get nodes --show-labels
NAME                         STATUS   ROLES           AGE   VERSION             LABELS
master-0.ocp4.dwojciec.com   Ready    master,worker   69m   v1.14.0+44b46b52b   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-0.ocp4.dwojciec.com,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
master-1.ocp4.dwojciec.com   Ready    master,worker   69m   v1.14.0+44b46b52b   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-1.ocp4.dwojciec.com,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
master-2.ocp4.dwojciec.com   Ready    master,worker   70m   v1.14.0+44b46b52b   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-2.ocp4.dwojciec.com,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
worker-0.ocp4.dwojciec.com   Ready    infra,worker    70m   v1.14.0+44b46b52b   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,infra=infra,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker-0.ocp4.dwojciec.com,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
worker-1.ocp4.dwojciec.com   Ready    infra,worker    70m   v1.14.0+44b46b52b   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,infra=infra,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker-1.ocp4.dwojciec.com,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
worker-2.ocp4.dwojciec.com   Ready    infra,worker    70m   v1.14.0+44b46b52b   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,infra=infra,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker-2.ocp4.dwojciec.com,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos

I assigned the infra label to the ingresscontroller:
$ oc edit ingresscontroller default -n openshift-ingress-operator
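
The relevant part of that edit, as a hedged sketch (nodePlacement is the IngressController field for pinning the router; the selector mirrors the labels added above):

spec:
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""
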
When done, I deleted the router deployment so it would be recreated:
$ oc delete deployment router-default -n openshift-ingress
and then checked whether the default router is now running on the worker nodes:

oc get pod -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP               NODE                         NOMINATED NODE   READINESS GATES
router-default-56d656f6b7-8fgzq   1/1     Running   0          24m   192.168.222.34   worker-0.ocp4.dwojciec.com   <none>           <none>
router-default-56d656f6b7-h49l4   1/1     Running   0          24m   192.168.222.35   worker-1.ocp4.dwojciec.com   <none>           <none>

make cluster start at boot

At the moment the cluster stays down if the host reboots. Let's fix that in one of two ways (runnable versions of the pseudocode):

for vm in $(virsh list --all --name); do
  virsh autostart "$vm"
done

wrapped in an Ansible task over the list of servers,

or, another way, just create the autostart symlinks directly (note: the link lives in the autostart directory and points at the domain XML, not the other way round):

for vm in $(virsh list --all --name); do
  ln -s /etc/libvirt/qemu/${vm}.xml /etc/libvirt/qemu/autostart/${vm}.xml
done

both with ansible, naturally.
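
A hedged Ansible sketch using the virt module's autostart flag (the VM list variable is illustrative):

- name: Autostart all cluster VMs
  virt:
    name: "{{ item }}"
    autostart: yes
  loop: "{{ vm_names }}"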

SELinux prevents bootstrap VM from starting

When running on RHEL 7.7 with SELinux in Enforcing mode, the bootstrap VM does not start.

TASK [openshift-4-cluster : Start VirtualMachine iot-bootstrap] ***************************************************************************************************************************************************
Tuesday 01 October 2019  14:34:13 +0300 (0:00:00.404)       0:17:06.550 ******* 
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: libvirtError: internal error: qemu unexpectedly closed the monitor: 2019-10-01T11:34:15.354797Z qemu-kvm: -fw_cfg name=opt/com.coreos/config,file=/var/lib/libvirt/images/iot-bootstrap.ign: can't load /var/lib/libvirt/images/iot-bootstrap.ign
fatal: [localhost]: FAILED! => {"changed": false, "msg": "internal error: qemu unexpectedly closed the monitor: 2019-10-01T11:34:15.354797Z qemu-kvm: -fw_cfg name=opt/com.coreos/config,file=/var/lib/libvirt/images/iot-bootstrap.ign: can't load /var/lib/libvirt/images/iot-bootstrap.ign"}
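
A hedged way to confirm SELinux is the blocker (standard audit tooling, nothing project-specific):

ausearch -m avc -ts recent   # a denial mentioning the .ign file confirms it
setenforce 0                 # temporary test only: retry the playbook in permissive mode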

Destination directory /root/terraform does not exist

TASK [ign : Create small ign for bootstrap] *********************************************************************************************************************************************
task path: /root/hetzner-ocp4/ansible/roles/ign/tasks/main.yml:25
fatal: [localhost -> localhost]: FAILED! => {"changed": false, "checksum": "0b3017f31dea301f097f544cd0a9b47ca00bea51", "msg": "Destination directory /root/terraform does not exist"}
        to retry, use: --limit @/root/hetzner-ocp4/ansible/03-prepare-install.retry

Adding OpenShift Container Storage 4 (rook & ceph)

We have to add OpenShift Container Storage 4 (rook & ceph) because we need storage for the image registry and for applications too.

Working branch: ocs_issue#31

Current status

You can install OCS upstream on OCP4, quick'n'dirty:

git clone git@github.com:RedHat-EMEA-SSA-Team/hetzner-ocp4.git
cd hetzner-ocp4
git branch ocs_issue#31 origin/ocs_issue#31
git checkout ocs_issue#31
# Create cluster.yml
vi cluster.yml
./ansible/02-create-cluster.yml
export KUBECONFIG=....
./deploy-ocs.sh

ToDo

  • Metrics do not work: even after creating a ServiceMonitor, the metrics do not show up in the cluster Prometheus.

pure-ansible: Create & Use an ssa cluster configuration operator

Replace the ansible post-installation with an operator

Be careful with the SSL cert and lookup('file', ...), because by default it strips the trailing newline, which matters here:

- name: Check certificates exist
  stat:
    path: "{{ ign_certificates_path }}/fullchain.crt"
  register: crt
- name: Check ssl key exist
  stat:
    path: "{{ ign_certificates_path }}/cert.key"
  register: key

- name: Create openshift-ingress config
  block:
    - name: Create openshift router certs secret
      copy:
        content: |
          apiVersion: v1
          kind: Secret
          data:
            tls.crt: {{  lookup('file',ign_certificates_path + '/fullchain.crt', rstrip=false) | b64encode }}
            tls.key: {{  lookup('file',ign_certificates_path + '/cert.key', rstrip=false)  | b64encode }}
          metadata:
            name: letsencrypt-router-certs
            namespace: openshift-ingress
          type: kubernetes.io/tls
        dest: "{{ ign_openshift_install_dir }}/openshift/99_openshift-ingress-letsencrypt-router-certs-secret.yaml"

  when: crt.stat.exists and key.stat.exists

MCO not accessible via https://api-int.....:22623/config/worker

oc debug node/compute-0
Starting pod/compute-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.51.13
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# curl -i -k https://api-int.demo.openshift.pub:22623/config/worker
curl: (7) Failed to connect to api-int.demo.openshift.pub port 22623: Connection refused
sh-4.4# nslookup api-int.demo.openshift.pub
Server:		192.168.51.1
Address:	192.168.51.1#53

Name:	api-int.demo.openshift.pub
Address: 192.168.51.1
sh-4.4# ping 192.168.51.1
PING 192.168.51.1 (192.168.51.1) 56(84) bytes of data.
64 bytes from 192.168.51.1: icmp_seq=1 ttl=64 time=0.113 ms
64 bytes from 192.168.51.1: icmp_seq=2 ttl=64 time=0.118 ms
^C
--- 192.168.51.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 55ms
rtt min/avg/max/mdev = 0.113/0.115/0.118/0.011 ms
sh-4.4# curl -vvv -i -k https://192.168.51.1:22623/config/worker
*   Trying 192.168.51.1...
* TCP_NODELAY set
* connect to 192.168.51.1 port 22623 failed: Connection refused
* Failed to connect to 192.168.51.1 port 22623: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 192.168.51.1 port 22623: Connection refused

From the host it worked:

root@homer:~ $ curl -vvv -I -k https://192.168.51.1:22623/config/worker
* About to connect() to 192.168.51.1 port 22623 (#0)
*   Trying 192.168.51.1...
* Connected to 192.168.51.1 (192.168.51.1) port 22623 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
* Server certificate:
* 	subject: CN=api-int.demo.openshift.pub
* 	start date: Oct 17 11:05:22 2019 GMT
* 	expire date: Oct 14 11:05:25 2029 GMT
* 	common name: api-int.demo.openshift.pub
* 	issuer: CN=root-ca,OU=openshift
> HEAD /config/worker HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 192.168.51.1:22623
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Content-Length: 91607
Content-Length: 91607
< Content-Type: application/json
Content-Type: application/json
< Date: Tue, 29 Oct 2019 15:47:52 GMT
Date: Tue, 29 Oct 2019 15:47:52 GMT

<
* Connection #0 to host 192.168.51.1 left intact

bootstrap vm fails to download ocp-release-nightly image from quay

Install fails at TASK [openshift-4-cluster : Waiting bootstrap to complete]

journalctl on the bootstrap node shows the following error:

Sep 17 09:25:21 bootstrap release-image-download.sh[1016]: Error: error pulling image "quay.io/openshift-release-dev/ocp-release-nightly@sha256:d48a15ea564293934eb188e6eb8737e56903453d50bc70830cdac2641fb63acc": unable to pull quay.io/openshift-release-dev/ocp-release-nightly@sha256:d48a15ea564293934eb188e6eb8737e56903453d50bc70830cdac2641fb63acc: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-release-nightly@sha256:d48a15ea564293934eb188e6eb8737e56903453d50bc70830cdac2641fb63acc: pinging docker registry returned: Get https://quay.io/v2/: dial tcp 23.23.73.73:443: i/o timeout

Curling the quay API fails as well on the bootstrap VM:

[root@bootstrap ~]# curl -v https://quay.io/v2/

*   Trying 54.225.213.19...
* TCP_NODELAY set
* connect to 54.225.213.19 port 443 failed: Connection timed out
*   Trying 54.243.184.178...
* TCP_NODELAY set
* After 85578ms connect time, move on!
* connect to 54.243.184.178 port 443 failed: Connection timed out
*   Trying 23.23.73.73...
* TCP_NODELAY set
* After 42789ms connect time, move on!
* connect to 23.23.73.73 port 443 failed: Connection timed out
*   Trying 23.23.187.164...
* TCP_NODELAY set
* After 21394ms connect time, move on!
* connect to 23.23.187.164 port 443 failed: Connection timed out
*   Trying 54.243.157.21...
* TCP_NODELAY set
* After 10696ms connect time, move on!
* connect to 54.243.157.21 port 443 failed: Connection timed out
*   Trying 54.225.149.151...
* TCP_NODELAY set
* After 5347ms connect time, move on!
* connect to 54.225.149.151 port 443 failed: Connection timed out
* Failed to connect to quay.io port 443: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to quay.io port 443: Connection timed out

Curling the quay API works fine from the Hetzner root server and from a local laptop.

[wait for ansible 2.10 release on RHEL] Rename cloudflare_account_api_token variable to avoid HTTP 400 errors.

The variable cloudflare_account_api_token should be renamed, since the user has to provide a global key and not an API token. The current naming invites misunderstanding.

Steps to reproduce

Obtain an API token in Cloudflare and assign it to cloudflare_account_api_token.
Run the playbook.

Expected error message

The following message appears in the Ansible logs:

API bad request; Status: 400; Method: GET: Call: /zones?name=rhocplab.com; Error details: code: 6003, error: Invalid request headers; code: 6103, error: Invalid format for X-Auth-Key header;

Causes

In the Cloudflare API, tokens are consumed as a bearer token, while global keys are consumed via the X-Auth-Key header.

The Ansible module cloudflare_dns, despite calling the related parameter account_api_token, sends the value in an X-Auth-Key header (hence the error above). To match that header, the user has to provide a global key.

Resolution

To help users avoid this pitfall, I suggest renaming our variable cloudflare_account_api_token to cloudflare_account_global_key.

CentOS 8 support

I'm collecting here the items that need to be fixed for CentOS 8, which is now available from Hetzner. Once done, I'll hopefully close this with a PR.

fatal: [localhost]: FAILED! => {"changed": false, "failures": ["No package python-lxml available.", "No package python-boto available.", "No package python2-openshift available."], "msg": ["Failed to install some of the specified packages"], "rc": 1, "results": []}

-> Changed to pip, works
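
A hedged sketch of the pip-based replacement (package names mirror the yum ones that failed):

- name: Install Python dependencies via pip
  pip:
    name:
      - lxml
      - boto
      - openshift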

fatal: [localhost]: FAILED! => {"changed": false, "failures": ["No package centos-release-openstack-stein available."], "msg": "Failed to install some of the specified packages", "rc": 1, "results": []}

-> removed, probably useless in CentOS 8

fatal: [localhost]: FAILED! => {"changed": false, "cmd": "yum-config-manager -q --disable centos-ceph-nautilus centos-nfs-ganesha28 centos-openstack-stein", "msg": "[Errno 2] No such file or directory: 'yum-config-manager': 'yum-config-manager'", "rc": 2}

-> removed, probably useless in CentOS 8

Installation proceeds, let's see.

The roles should live in separate github repos

I find the accompanying roles to be very useful for many other use cases and think they should be broken out into their own repos.

The project can include an ansible.cfg that points at the ansible/roles directory for role installation (sketch after the install command below), plus a requirements.yml, for example:

- src: https://github.com/flyemsafe/swygue-redhat-subscription.git
  version: master

Then the roles can be installed with:

ansible-galaxy install --force -r requirements.yml
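
And a hedged sketch of the matching ansible.cfg (roles_path is a standard setting; the path mirrors the proposal):

[defaults]
roles_path = ./ansible/roles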

Question concerning DNS provider

Question

What can I do to install the OCP4 cluster on Hetzner if I don't have a registered domain or an account with one of the currently supported DNS providers: AWS Route53, Cloudflare, or GCP DNS?

Task Install bind failed because of use_backend: yum

TASK [bind : Install bind] **************************************************************************************************************************************************************
task path: /root/hetzner-ocp4/ansible/roles/bind/tasks/main.yml:17
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Unsupported parameters for (yum) module: use_backend Supported parameters include: allow_downgrade, bugfix, conf_file, disable_gpg_check, disable_plugin, disablerepo, enable_plugin, enablerepo, exclude, install_repoquery, installroot, list, name, security, skip_broken, state, update_cache, update_only, validate_certs"}
        to retry, use: --limit @/root/hetzner-ocp4/ansible/03-prepare-install.retry
root@a:~ $ ansible --version
ansible 2.6.18
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jun 11 2019, 14:33:56) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
root@a:~ $ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.7 (Maipo)
root@a:~ $

Is use_backend: yum necessary on CentOS or with newer Ansible versions? (The yum module only gained the use_backend parameter in Ansible 2.7, which would explain the failure on 2.6.18.)

Image Registry not ready after fresh 4.3 installation

When installing 4.3 as a bare-metal installation, the image registry is marked offline because no object storage is available. The fix is described in the OCP docs; it would be nice to automate it as well.

Link to docs:
https://docs.openshift.com/container-platform/4.3/registry/configuring-registry-storage/configuring-registry-storage-baremetal.html#configuring-registry-storage-baremetal

Manual fix:
Run "oc edit configs.imageregistry.operator.openshift.io" and change "managementState: Removed" to "managementState: Managed".

CoreOS hosts unreachable with SSH

[root@hack02]# ssh bootstrap
ssh: connect to host bootstrap port 22: No route to host
[root@hack02]# ssh bootstrap.ocp42.ocp.ninja
ssh: connect to host bootstrap.ocp42.ocp.ninja port 22: No route to host
[root@hack02]# nslookup bootstrap
Server: 127.0.0.1
Address: 127.0.0.1#53

Name: bootstrap.ocp42.ocp.ninja
Address: 192.168.222.30

[root@hack02]#

Cloudflare missing account email for Letsencrypt

I found an issue in the Letsencrypt playbook: the cloudflare account email is missing, so the playbook fails with an error.

The Letsencrypt role looks for the variable le_cloudflare_account_email.

So the best fix is to define it in ansible/roles/openshift-4-cluster/tasks/create.yml:
le_cloudflare_account_email: "{{ cloudflare_account_email }}"

I'll make a PR for that.

Use the full path to openshift-install, not just the command name

Please use /opt/openshift-install-{{ openshift_version }}/openshift-install instead of a bare openshift-install, to make sure the right version is in use!

ansible/roles/openshift-4-cluster/tasks/create-ignition.yml:  command: "openshift-install --dir={{ openshift_install_dir }} create ignition-configs"
ansible/roles/openshift-4-cluster/tasks/download-openshift-artifacts.yml:    src: "https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/{{ openshift_version }}/openshift-install-linux-{{ openshift_version }}.tar.gz"
ansible/roles/openshift-4-cluster/tasks/download-openshift-artifacts.yml:    dest: "/opt/openshift-install-{{ openshift_version }}/"
ansible/roles/openshift-4-cluster/tasks/download-openshift-artifacts.yml:    creates: "/opt/openshift-install-{{ openshift_version }}/openshift-install"
ansible/roles/openshift-4-cluster/tasks/download-openshift-artifacts.yml:    "/usr/local/bin/openshift-install": "/opt/openshift-install-{{ openshift_version }}/openshift-install"
ansible/roles/openshift-4-cluster/tasks/post-install.yml:  command: "openshift-install wait-for bootstrap-complete --dir {{ openshift_install_dir }} --log-level debug"
ansible/roles/openshift-4-cluster/tasks/post-install.yml:  command: "openshift-install wait-for install-complete --dir {{ openshift_install_dir }}"
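
A hedged sketch of the change for create-ignition.yml (the same pattern applies to the two post-install.yml tasks above):

-  command: "openshift-install --dir={{ openshift_install_dir }} create ignition-configs"
+  command: "/opt/openshift-install-{{ openshift_version }}/openshift-install --dir={{ openshift_install_dir }} create ignition-configs"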

Use Quay.io as source for image registry mirror

The disconnected installation uses the docker registry image from docker.io. That may be blocked and cannot be used in some cases. Quay.io is usually whitelisted, so change the registry image to point to the quay.io/redhat-emea-ssa-team/registry image repo.

Can't install on new Hetzner server

Hi

I got an additional Hetzner root server, but this time the installer does not work for whatever reason. The only difference between the servers is the interface names.

TASK [openshift-4-cluster : Add emptyDir storage to registry] ********************************************************************************************************************************************************************************
Tuesday 24 September 2019  22:28:09 +0200 (0:00:00.644)       2:00:47.563 *****
: ["oc", "patch", "configs.imageregistry.operator.openshift.io", "cluster", "--type", "merge", "--patch", "{\"spec\":{\"storage\":{\"emptyDir\":{}}}}", "--config", "/root/hetzner-ocp4/ansible/../ocp4/auth/kubeconfig"], "delta": "0:00:48.827473", "end": "2019-09-24 23:26:00.512679", "msg": "non-zero return code", "rc": 1, "start": "2019-09-24 23:25:11.685206", "stderr": "Error from server (NotFound): configs.imageregistry.operator.openshift.io \"cluster\" not found", "stderr_lines": ["Error from server (NotFound): configs.imageregistry.operator.openshift.io \"cluster\" not found"], "stdout": "", "stdout_lines": []}

The API at least seems to work and is reachable, but it returns an error for the bootstrap-roles:

https://api.ocp4.sanc.ch:6443/healthz

[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/kube-apiserver-requestheader-reload ok
[+]poststarthook/kube-apiserver-clientCA-reload ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-discovery-available ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[-]poststarthook/rbac/bootstrap-roles failed: reason withheld
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/ca-registration ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/openshift.io-clientCA-reload ok
[+]poststarthook/openshift.io-requestheader-reload ok
[+]poststarthook/quota.openshift.io-clusterquotamapping ok
[+]poststarthook/openshift.io-kubernetes-informers-synched ok
[+]poststarthook/openshift.io-startkubeinformers ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-wait-for-first-sync ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
healthz check failed

This might look like a network error:

From: sdn-controller-k26q4_openshift-sdn_sdn-controller-839c5cf81e1ef067eb60b6b4e2d3d79466e70bd79b9b98bf7e2b57d7820a9855.log

/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: EOF
2019-09-25T04:22:52.964727739+00:00 stderr F E0925 04:22:52.964676       1 leaderelection.go:306] error retrieving resource lock openshift-sdn/openshift-network-controller: Get https://api-int.ocp4.sanc.ch:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: EOF
2019-09-25T04:23:11.572064660+00:00 stderr F E0925 04:23:11.572023       1 leaderelection.go:306] error retrieving resource lock openshift-sdn/openshift-network-controller: Get https://api-int.ocp4.sanc.ch:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: EOF
2019-09-25T04:29:44.826459167+00:00 stderr F E0925 04:29:44.826373       1 leaderelection.go:306] error retrieving resource lock openshift-sdn/openshift-network-controller: Get https://api-int.ocp4.sanc.ch:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: http2: server sent GOAWAY and closed the connection; LastStreamID=5, ErrCode=NO_ERROR, debug=""
