
hetzner-ocp4's Issues

This should also work on qemu that's shipped with RHEL

You can install CoreOS using the CoreOS ISO by extracting the following files (a hedged extraction sketch follows the list):

  • efiboot.img
  • initramfs.img
  • vmlinuz
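
A hedged extraction sketch (rhcos-installer.iso is a placeholder filename, and the paths inside the ISO can differ between releases, so check the mounted tree first):

mkdir -p /mnt/iso /tmp/coreos
mount -o loop,ro rhcos-installer.iso /mnt/iso
cp /mnt/iso/images/efiboot.img /tmp/coreos/
cp /mnt/iso/isolinux/vmlinuz /mnt/iso/isolinux/initramfs.img /tmp/coreos/
umount /mnt/iso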

Then create a .treeinfo:

cat /tmp/coreos/.treeinfo 
[general]
arch = x86_64
family = Fedora
platforms = x86_64
version = 29
[images-x86_64]
initrd = initramfs.img
kernel = vmlinuz

Then run virt-install with the --location argument.

#!/bin/bash
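# Kernel arguments for the RHCOS installer: enable install mode, target disk,
# metal image URL, ignition config URL, and DHCP networking.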

args='coreos.inst=yes '
args+='coreos.inst.install_dev=vda '
args+='coreos.inst.image_url=http://172.24.24.3:8080/pub/rhcos-42.80.20190828.2-metal-bios.raw.gz '
args+='coreos.inst.ignition_url=http://172.24.24.3:8080/pub/bootstrap.ign '
args+='ip=dhcp '
args+='rd.neednet=1'

virt-install --location /tmp/coreos --extra-args="${args}" --network network=ocp4 --name ocp4-compute-1 --memory 8192 --disk /var/lib/libvirt/images/ocp4-compute-1.qcow2

exit $?

installation fails - unreachable api

Hi,
The installer fails while waiting for the API:

"stderr_lines": [
        "level=debug msg=\"OpenShift Installer v4.2.0-201908282219-dirty\"",
        "level=debug msg=\"Built from commit 4f3e73a0143ba36229f42e8b65b6e65342bb826b\"",
        "level=info msg=\"Waiting up to 30m0s for the Kubernetes API at https://api.ocp4.sanc.ch:6443...\"",
        "level=debug msg=\"Still waiting for the Kubernetes API: Get https://api.ocp4.xxx:6443/version?timeout=32s: EOF\"",

On the bootstrap node I see

Sep 23 05:06:35 bootstrap podman[2318]: 2019-09-23 05:06:35.924008516 +0000 UTC m=+0.687375687 container attach 545491a0fa8e2c6e9275e2228547512d77015022f1912dd2d8025a729cb7e0ec (image=quay.io/openshift-release-dev/ocp-release-nightly@sha256:d48a15ea564293934eb188e6eb8737e56903453d50bc70830cdac2641fb63acc, name=elegant_knuth)
Sep 23 05:06:37 bootstrap bootkube.sh[722]: Starting etcd certificate signer...
Sep 23 05:06:37 bootstrap bootkube.sh[722]: Error: name etcd-signer is in use: container already exists
Sep 23 05:06:37 bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=125/n/a
Sep 23 05:06:37 bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Sep 23 05:06:42 bootstrap systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Sep 23 05:06:42 bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 93.
Sep 23 05:06:42 bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Sep 23 05:06:42 bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Sep 23 05:06:43 bootstrap podman[2587]: 2019-09-23 05:06:43.572904432 +0000 UTC m=+0.372897314 container create 001f86c9ba3065693b1abda46f4594aec1909cfe01e80d6adc5528057a0af7e2 (image=quay.io/openshift-release-dev/ocp-release-nightly@sha256:d48a15ea564293934eb188e6eb8737e56903453d50bc70830cdac2641fb63acc, name=quizzical_rhodes)

and

Sep 22 19:11:01 bootstrap bootkube.sh[19735]: Waiting for etcd cluster...
Sep 22 19:11:09 bootstrap podman[22919]: 2019-09-22 19:11:09.259737221 +0000 UTC m=+7.442582895 image pull
Sep 22 19:11:09 bootstrap podman[22919]: 2019-09-22 19:11:09.580244772 +0000 UTC m=+7.763090429 container create b00e7ef09f5b9a778c2ed1b0fcc58bc5403f1876cde241c0ca39a>
Sep 22 19:11:09 bootstrap podman[22919]: 2019-09-22 19:11:09.907017175 +0000 UTC m=+8.089862853 container init b00e7ef09f5b9a778c2ed1b0fcc58bc5403f1876cde241c0ca39a8b>
Sep 22 19:11:10 bootstrap podman[22919]: 2019-09-22 19:11:10.038039098 +0000 UTC m=+8.220884731 container start b00e7ef09f5b9a778c2ed1b0fcc58bc5403f1876cde241c0ca39a8>
Sep 22 19:11:10 bootstrap podman[22919]: 2019-09-22 19:11:10.038132193 +0000 UTC m=+8.220977884 container attach b00e7ef09f5b9a778c2ed1b0fcc58bc5403f1876cde241c0ca39a>
Sep 22 19:21:09 bootstrap bootkube.sh[19735]: https://etcd-2.ocp4.xxx:2379 is unhealthy: failed to connect: dial tcp 192.168.50.12:2379: connect: connection refus>
Sep 22 19:21:09 bootstrap bootkube.sh[19735]: Error: unhealthy cluster
Sep 22 19:21:10 bootstrap bootkube.sh[19735]: etcdctl failed. Retrying in 5 seconds...

Any hints on where to start debugging would be highly appreciated.
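
A hedged starting point for debugging (the hostname is a placeholder; the container and unit names come from the logs above) is to SSH to the bootstrap node and watch bootkube directly:

ssh core@bootstrap.ocp4.example.com
journalctl -b -f -u bootkube.service   # follow the restart loop live
sudo podman ps -a | grep etcd-signer   # locate the stale signer container
sudo podman rm etcd-signer             # remove it so bootkube can recreate it

Whether etcd itself then becomes healthy is a separate question, but this clears the "container already exists" loop.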

Include user list in cluster.yml

cluster.yml should contain a list of users to be added to the htpasswd secret for htpasswd-based auth. One of those users should also be bound to the cluster-admin role.
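
A hedged sketch, reusing the shape of the sample cluster.yml quoted later on this page (the hashes are placeholders):

auth_htpasswd:
  - admin:$apr1$placeholder-hash
  - developer:$apr1$placeholder-hash

cluster_role_bindings:
  - cluster_role: cluster-admin
    name: admin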

pure-ansible: Rewrite post-terraform.sh in ansible

openshift-install --dir=/root/{{ cluster_name }}-install wait-for bootstrap-complete --log-level debug
virsh shutdown bootstrap
sleep 120
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
# apiserver certs are not yet working.

#oc create secret tls letsencrypt-api-certs    --cert={{ playbook_dir }}/../certificate/{{ cluster_name }}.{{ public_domain }}/fullchain.crt --key={{ playbook_dir }}/../certificate/{{ cluster_name }}.{{ public_domain }}/cert.key -n openshift-config
#oc patch apiserver cluster --type=merge -p '{"spec":{"servingCerts": {"namedCertificates":[{"names": ["api.{{ cluster_name }}.{{ public_domain }}"], "servingCertificate": {"name": "letsencrypt-api-certs"}}]}}}'
# Install certificate
oc create secret tls letsencrypt-router-certs --cert={{ playbook_dir }}/../certificate/{{ cluster_name }}.{{ public_domain }}/fullchain.crt --key={{ playbook_dir }}/../certificate/{{ cluster_name }}.{{ public_domain }}/cert.key -n openshift-ingress
oc patch ingresscontroller default -n openshift-ingress-operator --type=merge --patch='{"spec": { "defaultCertificate": { "name": "letsencrypt-router-certs" }}}'

openshift-install --dir=/root/{{ cluster_name }}-install wait-for install-complete --log-level debug
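
A hedged sketch of the first steps as Ansible tasks (the modules are stock Ansible; the task layout is illustrative, not the project's final form):

- name: Wait for bootstrap to complete
  command: >
    openshift-install --dir=/root/{{ cluster_name }}-install
    wait-for bootstrap-complete --log-level debug

- name: Shut down the bootstrap VM
  virt:
    name: bootstrap
    command: shutdown

- name: Add emptyDir storage to the image registry
  command: >
    oc patch configs.imageregistry.operator.openshift.io cluster
    --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'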

CoreOS images for v4.2 remote location changed

I've experienced some issues installing the current version (4.2).

I discovered that the remote location on openshift.com changed (ansible/roles/openshift-4-cluster/defaults/main.yml):

-coreos_download_url: "https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest/rhcos-{{ coreos_version }}-qemu.qcow2"
+coreos_download_url: "https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/4.2.0-rc.5/rhcos-{{ coreos_version }}-qemu.qcow2"

The /latest remote subdir has now been bumped to the 4.3 version, and the qemu image disappeared from it.

We should always point to the latest stable to avoid these issues.

I'm going to create a PR.

ansible playbook fails to start

Issue

The command ansible-playbook ./ansible/setup.yml reports the following error

[root@ocp4 hetzner-ocp4]# ansible-playbook ./ansible/setup.yml
 [WARNING]: Could not match supplied host pattern, ignoring: all

 [WARNING]: provided hosts list is empty, only localhost is available

ERROR! no action detected in task. This often indicates a misspelled module name, or incorrect module path.

The error appears to have been in '/root/temp/hetzner-ocp4/ansible/roles/openshift-4-loadbalancer/tasks/create.yml': line 25, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:


- name: Collect services facts
  ^ here

exception type: <class 'ansible.errors.AnsibleParserError'>
exception: no action detected in task. This often indicates a misspelled module name, or incorrect module path.


Additional info

OS: CentOS 7
Ansible version: 2.4.2.0
cluster.yml:

---
cluster_name: ocp4
public_domain: example.com
dns_provider: [route53|cloudflare|gcp]
letsencrypt_account_email: [email protected]
# Depending on the dns provider:
# CloudFlare
cloudflare_account_email: [email protected]
cloudflare_account_api_token: 9348234sdsd894.....
cloudflare_zone: example.com
# Route53
aws_access_key: key
aws_secret_key: secret
aws_zone: example.com
# GCP
gcp_project: project-name
gcp_managed_zone_name: 'zone-name'
gcp_managed_zone_domain: 'example.com.'
gcp_serviceaccount_file: ../gcp_service_account.json

auth_htpasswd:
  - admin:$ttttttttt//
  - local:$ttttttttt//

storage_nfs: false # Default is false

auth_redhatsso:
  client_id: "xxxxx.apps.googleusercontent.com"
  client_secret: "xxxxxxx"

cluster_role_bindings:
  - cluster_role: sudoers
    name: [email protected]
  - cluster_role: cluster-admin
    name: admin


image_pull_secret: |-
  ttttttttt

Disk size too small

The default disk size is a little bit too small:

[core@master-0 ~]$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    252:0    0   16G  0 disk
├─vda1 252:1    0    1M  0 part
├─vda2 252:2    0    1G  0 part /boot
└─vda3 252:3    0   15G  0 part /sysroot

command: "qemu-img convert -O qcow2 -o size=10G {{ coreos_image_location }} /var/lib/libvirt/images/{{ vm_instance_name }}.qcow2"

Please add a variable in ./ansible/roles/openshift-4-loadbalancer/defaults/main.yml and use it.
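
A hedged sketch of what that could look like (the variable name is illustrative; the command mirrors the existing task):

# defaults/main.yml
coreos_root_disk_size: 32G

# in the qemu-img task
command: "qemu-img convert -O qcow2 -o size={{ coreos_root_disk_size }} {{ coreos_image_location }} /var/lib/libvirt/images/{{ vm_instance_name }}.qcow2"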

default router running on master nodes and not on the worker nodes load-balanced by haproxy

I added two labels to the worker nodes to mark them as infra:

node-role.kubernetes.io/infra: ""
infra: infra

by editing the node:
$ oc edit nodes worker-0.ocp4.dwojciec.com
and adding the two new labels inside:

  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    infra: infra
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: worker-0.ocp4.dwojciec.com
    kubernetes.io/os: linux
    node-role.kubernetes.io/infra: ""
    node-role.kubernetes.io/worker: ""
    node.openshift.io/os_id: rhcos
  name: worker-0.ocp4.dwojciec.com
  resourceVersion: "26912"

See the result:

[root@CentOS-76-64-minimal haproxy]# oc get nodes --show-labels
NAME                         STATUS   ROLES           AGE   VERSION             LABELS
master-0.ocp4.dwojciec.com   Ready    master,worker   69m   v1.14.0+44b46b52b   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-0.ocp4.dwojciec.com,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
master-1.ocp4.dwojciec.com   Ready    master,worker   69m   v1.14.0+44b46b52b   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-1.ocp4.dwojciec.com,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
master-2.ocp4.dwojciec.com   Ready    master,worker   70m   v1.14.0+44b46b52b   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-2.ocp4.dwojciec.com,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
worker-0.ocp4.dwojciec.com   Ready    infra,worker    70m   v1.14.0+44b46b52b   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,infra=infra,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker-0.ocp4.dwojciec.com,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
worker-1.ocp4.dwojciec.com   Ready    infra,worker    70m   v1.14.0+44b46b52b   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,infra=infra,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker-1.ocp4.dwojciec.com,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
worker-2.ocp4.dwojciec.com   Ready    infra,worker    70m   v1.14.0+44b46b52b   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,infra=infra,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker-2.ocp4.dwojciec.com,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos

I assigned the infra label to the ingresscontroller:
$ oc edit ingresscontroller default -n openshift-ingress-operator
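
The relevant part of that edit, as a hedged sketch (nodePlacement is the IngressController field for pinning the router; the selector mirrors the labels added above):

spec:
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""
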
When done, I deleted the router deployment so it would be recreated:
$ oc delete deployment router-default -n openshift-ingress
and then checked whether the default router is now running on the worker nodes:

oc get pod -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP               NODE                         NOMINATED NODE   READINESS GATES
router-default-56d656f6b7-8fgzq   1/1     Running   0          24m   192.168.222.34   worker-0.ocp4.dwojciec.com   <none>           <none>
router-default-56d656f6b7-h49l4   1/1     Running   0          24m   192.168.222.35   worker-1.ocp4.dwojciec.com   <none>           <none>

make cluster start at boot

At the moment the cluster stays down if the host reboots. Let's fix that in one of two ways (runnable versions of the pseudocode):

for vm in $(virsh list --all --name); do
  virsh autostart "$vm"
done

wrapped in an Ansible task over the list of servers,

or, another way, just create the autostart symlinks directly (note: the link lives in the autostart directory and points at the domain XML, not the other way round):

for vm in $(virsh list --all --name); do
  ln -s /etc/libvirt/qemu/${vm}.xml /etc/libvirt/qemu/autostart/${vm}.xml
done

both with ansible, naturally.
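
A hedged Ansible sketch using the virt module's autostart flag (the VM list variable is illustrative):

- name: Autostart all cluster VMs
  virt:
    name: "{{ item }}"
    autostart: yes
  loop: "{{ vm_names }}"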

SELinux prevents bootstrap VM from starting

When running on RHEL 7.7 with SELinux in Enforcing mode, the bootstrap VM does not start.

TASK [openshift-4-cluster : Start VirtualMachine iot-bootstrap] ***************************************************************************************************************************************************
Tuesday 01 October 2019  14:34:13 +0300 (0:00:00.404)       0:17:06.550 ******* 
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: libvirtError: internal error: qemu unexpectedly closed the monitor: 2019-10-01T11:34:15.354797Z qemu-kvm: -fw_cfg name=opt/com.coreos/config,file=/var/lib/libvirt/images/iot-bootstrap.ign: can't load /var/lib/libvirt/images/iot-bootstrap.ign
fatal: [localhost]: FAILED! => {"changed": false, "msg": "internal error: qemu unexpectedly closed the monitor: 2019-10-01T11:34:15.354797Z qemu-kvm: -fw_cfg name=opt/com.coreos/config,file=/var/lib/libvirt/images/iot-bootstrap.ign: can't load /var/lib/libvirt/images/iot-bootstrap.ign"}
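
A hedged way to confirm SELinux is the blocker (standard audit tooling, nothing project-specific):

ausearch -m avc -ts recent   # a denial mentioning the .ign file confirms it
setenforce 0                 # temporary test only: retry the playbook in permissive mode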

Destination directory /root/terraform does not exist

TASK [ign : Create small ign for bootstrap] *********************************************************************************************************************************************
task path: /root/hetzner-ocp4/ansible/roles/ign/tasks/main.yml:25
fatal: [localhost -> localhost]: FAILED! => {"changed": false, "checksum": "0b3017f31dea301f097f544cd0a9b47ca00bea51", "msg": "Destination directory /root/terraform does not exist"}
        to retry, use: --limit @/root/hetzner-ocp4/ansible/03-prepare-install.retry

Adding OpenShift Container Storage 4 (rook & ceph)

We have to add OpenShift Container Storage 4 (rook & ceph) because we need storage for the image registry and for applications too.

Working branch: ocs_issue#31

Current status

You can install OCS upstream on OCP4, quick'n'dirty:

git clone git@github.com:RedHat-EMEA-SSA-Team/hetzner-ocp4.git
cd hetzner-ocp4
git branch ocs_issue#31 origin/ocs_issue#31
git checkout ocs_issue#31
# Create cluster.yml
vi cluster.yml
./ansible/02-create-cluster.yml
export KUBECONFIG=....
./deploy-ocs.sh

ToDo

  • Metrics do not work: even after creating a ServiceMonitor, the metrics do not show up in the cluster Prometheus.

pure-ansible: Create & Use an ssa cluster configuration operator

Replace the ansible post-installation with an operator

Be careful with the SSL cert and lookup('file', ...), because by default it strips the trailing newline, which matters here:

- name: Check certificates exist
  stat:
    path: "{{ ign_certificates_path }}/fullchain.crt"
  register: crt
- name: Check ssl key exist
  stat:
    path: "{{ ign_certificates_path }}/cert.key"
  register: key

- name: Create openshift-ingress config
  block:
    - name: Create openshift router certs secret
      copy:
        content: |
          apiVersion: v1
          kind: Secret
          data:
            tls.crt: {{  lookup('file',ign_certificates_path + '/fullchain.crt', rstrip=false) | b64encode }}
            tls.key: {{  lookup('file',ign_certificates_path + '/cert.key', rstrip=false)  | b64encode }}
          metadata:
            name: letsencrypt-router-certs
            namespace: openshift-ingress
          type: kubernetes.io/tls
        dest: "{{ ign_openshift_install_dir }}/openshift/99_openshift-ingress-letsencrypt-router-certs-secret.yaml"

  when: crt.stat.exists and key.stat.exists

MCO not accessible via https://api-int.....:22623/config/worker

oc debug node/compute-0
Starting pod/compute-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.51.13
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# curl -i -k https://api-int.demo.openshift.pub:22623/config/worker
curl: (7) Failed to connect to api-int.demo.openshift.pub port 22623: Connection refused
sh-4.4# nslookup api-int.demo.openshift.pub
Server:		192.168.51.1
Address:	192.168.51.1#53

Name:	api-int.demo.openshift.pub
Address: 192.168.51.1
sh-4.4# ping 192.168.51.1
PING 192.168.51.1 (192.168.51.1) 56(84) bytes of data.
64 bytes from 192.168.51.1: icmp_seq=1 ttl=64 time=0.113 ms
64 bytes from 192.168.51.1: icmp_seq=2 ttl=64 time=0.118 ms
^C
--- 192.168.51.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 55ms
rtt min/avg/max/mdev = 0.113/0.115/0.118/0.011 ms
sh-4.4# curl -vvv -i -k https://192.168.51.1:22623/config/worker
*   Trying 192.168.51.1...
* TCP_NODELAY set
* connect to 192.168.51.1 port 22623 failed: Connection refused
* Failed to connect to 192.168.51.1 port 22623: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 192.168.51.1 port 22623: Connection refused

From the host it worked:

root@homer:~ $ curl -vvv -I -k https://192.168.51.1:22623/config/worker
* About to connect() to 192.168.51.1 port 22623 (#0)
*   Trying 192.168.51.1...
* Connected to 192.168.51.1 (192.168.51.1) port 22623 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
* Server certificate:
* 	subject: CN=api-int.demo.openshift.pub
* 	start date: Oct 17 11:05:22 2019 GMT
* 	expire date: Oct 14 11:05:25 2029 GMT
* 	common name: api-int.demo.openshift.pub
* 	issuer: CN=root-ca,OU=openshift
> HEAD /config/worker HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 192.168.51.1:22623
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Content-Length: 91607
Content-Length: 91607
< Content-Type: application/json
Content-Type: application/json
< Date: Tue, 29 Oct 2019 15:47:52 GMT
Date: Tue, 29 Oct 2019 15:47:52 GMT

<
* Connection #0 to host 192.168.51.1 left intact

bootstrap vm fails to download ocp-release-nightly image from quay

Install fails at TASK [openshift-4-cluster : Waiting bootstrap to complete]

journalctl on the bootstrap node shows the following error:

Sep 17 09:25:21 bootstrap release-image-download.sh[1016]: Error: error pulling image "quay.io/openshift-release-dev/ocp-release-nightly@sha256:d48a15ea564293934eb188e6eb8737e56903453d50bc70830cdac2641fb63acc": unable to pull quay.io/openshift-release-dev/ocp-release-nightly@sha256:d48a15ea564293934eb188e6eb8737e56903453d50bc70830cdac2641fb63acc: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-release-nightly@sha256:d48a15ea564293934eb188e6eb8737e56903453d50bc70830cdac2641fb63acc: pinging docker registry returned: Get https://quay.io/v2/: dial tcp 23.23.73.73:443: i/o timeout

Curling the quay API fails as well on the bootstrap VM:

[root@bootstrap ~]# curl -v https://quay.io/v2/

*   Trying 54.225.213.19...
* TCP_NODELAY set
* connect to 54.225.213.19 port 443 failed: Connection timed out
*   Trying 54.243.184.178...
* TCP_NODELAY set
* After 85578ms connect time, move on!
* connect to 54.243.184.178 port 443 failed: Connection timed out
*   Trying 23.23.73.73...
* TCP_NODELAY set
* After 42789ms connect time, move on!
* connect to 23.23.73.73 port 443 failed: Connection timed out
*   Trying 23.23.187.164...
* TCP_NODELAY set
* After 21394ms connect time, move on!
* connect to 23.23.187.164 port 443 failed: Connection timed out
*   Trying 54.243.157.21...
* TCP_NODELAY set
* After 10696ms connect time, move on!
* connect to 54.243.157.21 port 443 failed: Connection timed out
*   Trying 54.225.149.151...
* TCP_NODELAY set
* After 5347ms connect time, move on!
* connect to 54.225.149.151 port 443 failed: Connection timed out
* Failed to connect to quay.io port 443: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to quay.io port 443: Connection timed out

Curling the quay API works fine from the Hetzner root server and from a local laptop.

[wait for ansible 2.10 release on RHEL] Rename cloudflare_account_api_token variable to avoid HTTP 400 errors.

The variable cloudflare_account_api_token should be renamed, since the user has to provide a global key and not an API token. The current naming invites misunderstanding.

Steps to reproduce

Obtain an API token in Cloudflare and assign it to cloudflare_account_api_token.
Run the playbook.

Expected error message

The following message appears in the Ansible logs:

API bad request; Status: 400; Method: GET: Call: /zones?name=rhocplab.com; Error details: code: 6003, error: Invalid request headers; code: 6103, error: Invalid format for X-Auth-Key header;

Causes

In the Cloudflare API, tokens are consumed as a bearer token, while global keys are consumed via the X-Auth-Key header.

The Ansible module cloudflare_dns, despite calling the related parameter account_api_token, sends the value in an X-Auth-Key header (hence the error above). To match that header, the user has to provide a global key.

Resolution

To help users avoid this pitfall, I suggest renaming our variable cloudflare_account_api_token to cloudflare_account_global_key.

CentOS 8 support

I'm collecting here the items that need to be fixed for CentOS 8, which is now available from Hetzner. Once done, I'll hopefully close this with a PR.

fatal: [localhost]: FAILED! => {"changed": false, "failures": ["No package python-lxml available.", "No package python-boto available.", "No package python2-openshift available."], "msg": ["Failed to install some of the specified packages"], "rc": 1, "results": []}

-> Changed to pip, works
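
A hedged sketch of the pip-based replacement (package names mirror the yum ones that failed):

- name: Install Python dependencies via pip
  pip:
    name:
      - lxml
      - boto
      - openshift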

fatal: [localhost]: FAILED! => {"changed": false, "failures": ["No package centos-release-openstack-stein available."], "msg": "Failed to install some of the specified packages", "rc": 1, "results": []}

-> removed, probably useless in CentOS 8

fatal: [localhost]: FAILED! => {"changed": false, "cmd": "yum-config-manager -q --disable centos-ceph-nautilus centos-nfs-ganesha28 centos-openstack-stein", "msg": "[Errno 2] No such file or directory: 'yum-config-manager': 'yum-config-manager'", "rc": 2}

-> removed, probably useless in CentOS 8

Installation proceeds, let's see.

The roles should live in separate github repos

I find the accompanying roles to be very useful for many other use cases and think they should be broken out into their own repos.

The project can include an ansible.cfg that points at the ansible/roles directory for role installation (sketch after the install command below), plus a requirements.yml, for example:

- src: https://github.com/flyemsafe/swygue-redhat-subscription.git
  version: master

Then the roles can be installed with:

ansible-galaxy install --force -r requirements.yml
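
And a hedged sketch of the matching ansible.cfg (roles_path is a standard setting; the path mirrors the proposal):

[defaults]
roles_path = ./ansible/roles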

Question concerning DNS provider

Question

What can I do to install the OCP4 cluster on Hetzner if I don't have a registered domain or an account with one of the currently supported DNS providers: AWS Route53, Cloudflare, or GCP DNS?

Task Install bind failed because of use_backend: yum

TASK [bind : Install bind] **************************************************************************************************************************************************************
task path: /root/hetzner-ocp4/ansible/roles/bind/tasks/main.yml:17
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Unsupported parameters for (yum) module: use_backend Supported parameters include: allow_downgrade, bugfix, conf_file, disable_gpg_check, disable_plugin, disablerepo, enable_plugin, enablerepo, exclude, install_repoquery, installroot, list, name, security, skip_broken, state, update_cache, update_only, validate_certs"}
        to retry, use: --limit @/root/hetzner-ocp4/ansible/03-prepare-install.retry
root@a:~ $ ansible --version
ansible 2.6.18
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jun 11 2019, 14:33:56) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
root@a:~ $ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.7 (Maipo)
root@a:~ $

Is use_backend: yum necessary on CentOS or with newer Ansible versions? (The yum module only gained the use_backend parameter in Ansible 2.7, which would explain the failure on 2.6.18.)

Image Registry not ready after fresh 4.3 installation

When installing 4.3 as a bare-metal installation, the image registry is marked offline because no object storage is available. The fix is described in the OCP docs; it would be nice to automate it as well.

Link to docs:
https://docs.openshift.com/container-platform/4.3/registry/configuring-registry-storage/configuring-registry-storage-baremetal.html#configuring-registry-storage-baremetal

Manual fix:
Run "oc edit configs.imageregistry.operator.openshift.io" and change "managementState: Removed" to "managementState: Managed".

CoreOS hosts unreachable with SSH

[root@hack02]# ssh bootstrap
ssh: connect to host bootstrap port 22: No route to host
[root@hack02]# ssh bootstrap.ocp42.ocp.ninja
ssh: connect to host bootstrap.ocp42.ocp.ninja port 22: No route to host
[root@hack02]# nslookup bootstrap
Server: 127.0.0.1
Address: 127.0.0.1#53

Name: bootstrap.ocp42.ocp.ninja
Address: 192.168.222.30

[root@hack02]#

Cloudflare missing account email for Letsencrypt

I found an issue in the Letsencrypt playbook: the cloudflare account email is missing, so the playbook fails with an error.

The Letsencrypt role looks for the variable le_cloudflare_account_email.

So the best fix is to define it in ansible/roles/openshift-4-cluster/tasks/create.yml:
le_cloudflare_account_email: "{{ cloudflare_account_email }}"

I'll make a PR for that.

Use the full path to openshift-install, not just the command name

Please use /opt/openshift-install-{{ openshift_version }}/openshift-install instead of a bare openshift-install, to make sure the right version is in use!

ansible/roles/openshift-4-cluster/tasks/create-ignition.yml:  command: "openshift-install --dir={{ openshift_install_dir }} create ignition-configs"
ansible/roles/openshift-4-cluster/tasks/download-openshift-artifacts.yml:    src: "https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/{{ openshift_version }}/openshift-install-linux-{{ openshift_version }}.tar.gz"
ansible/roles/openshift-4-cluster/tasks/download-openshift-artifacts.yml:    dest: "/opt/openshift-install-{{ openshift_version }}/"
ansible/roles/openshift-4-cluster/tasks/download-openshift-artifacts.yml:    creates: "/opt/openshift-install-{{ openshift_version }}/openshift-install"
ansible/roles/openshift-4-cluster/tasks/download-openshift-artifacts.yml:    "/usr/local/bin/openshift-install": "/opt/openshift-install-{{ openshift_version }}/openshift-install"
ansible/roles/openshift-4-cluster/tasks/post-install.yml:  command: "openshift-install wait-for bootstrap-complete --dir {{ openshift_install_dir }} --log-level debug"
ansible/roles/openshift-4-cluster/tasks/post-install.yml:  command: "openshift-install wait-for install-complete --dir {{ openshift_install_dir }}"
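
A hedged sketch of the change for create-ignition.yml (the same pattern applies to the two post-install.yml tasks above):

-  command: "openshift-install --dir={{ openshift_install_dir }} create ignition-configs"
+  command: "/opt/openshift-install-{{ openshift_version }}/openshift-install --dir={{ openshift_install_dir }} create ignition-configs"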

Use Quay.io as source for image registry mirror

The disconnected installation uses the docker registry image from docker.io. That may be blocked and cannot be used in some cases. Quay.io is usually whitelisted, so change the registry image to point to the quay.io/redhat-emea-ssa-team/registry image repo.

Can't install on new Hetzner server

Hi

I got an additional Hetzner root server, but this time the installer does not work for whatever reason. The only difference between the servers is the interface names.

TASK [openshift-4-cluster : Add emptyDir storage to registry] ********************************************************************************************************************************************************************************
Tuesday 24 September 2019  22:28:09 +0200 (0:00:00.644)       2:00:47.563 *****
: ["oc", "patch", "configs.imageregistry.operator.openshift.io", "cluster", "--type", "merge", "--patch", "{\"spec\":{\"storage\":{\"emptyDir\":{}}}}", "--config", "/root/hetzner-ocp4/ansible/../ocp4/auth/kubeconfig"], "delta": "0:00:48.827473", "end": "2019-09-24 23:26:00.512679", "msg": "non-zero return code", "rc": 1, "start": "2019-09-24 23:25:11.685206", "stderr": "Error from server (NotFound): configs.imageregistry.operator.openshift.io \"cluster\" not found", "stderr_lines": ["Error from server (NotFound): configs.imageregistry.operator.openshift.io \"cluster\" not found"], "stdout": "", "stdout_lines": []}

The API at least seems to work and is reachable, but it returns an error for the bootstrap-roles:

https://api.ocp4.sanc.ch:6443/healthz

[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/kube-apiserver-requestheader-reload ok
[+]poststarthook/kube-apiserver-clientCA-reload ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-discovery-available ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[-]poststarthook/rbac/bootstrap-roles failed: reason withheld
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/ca-registration ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/openshift.io-clientCA-reload ok
[+]poststarthook/openshift.io-requestheader-reload ok
[+]poststarthook/quota.openshift.io-clusterquotamapping ok
[+]poststarthook/openshift.io-kubernetes-informers-synched ok
[+]poststarthook/openshift.io-startkubeinformers ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-wait-for-first-sync ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
healthz check failed

This might look like a network error:

From: sdn-controller-k26q4_openshift-sdn_sdn-controller-839c5cf81e1ef067eb60b6b4e2d3d79466e70bd79b9b98bf7e2b57d7820a9855.log

/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: EOF
2019-09-25T04:22:52.964727739+00:00 stderr F E0925 04:22:52.964676       1 leaderelection.go:306] error retrieving resource lock openshift-sdn/openshift-network-controller: Get https://api-int.ocp4.sanc.ch:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: EOF
2019-09-25T04:23:11.572064660+00:00 stderr F E0925 04:23:11.572023       1 leaderelection.go:306] error retrieving resource lock openshift-sdn/openshift-network-controller: Get https://api-int.ocp4.sanc.ch:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: EOF
2019-09-25T04:29:44.826459167+00:00 stderr F E0925 04:29:44.826373       1 leaderelection.go:306] error retrieving resource lock openshift-sdn/openshift-network-controller: Get https://api-int.ocp4.sanc.ch:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: http2: server sent GOAWAY and closed the connection; LastStreamID=5, ErrCode=NO_ERROR, debug=""
