raballew / okd-the-hard-way

Bootstrap an OKD cluster the hard way on user-provisioned infrastructure in a disconnected environment. No scripts.

License: MIT License

bare-metal disconnected kubernetes libvirt okd openshift upi

okd-the-hard-way's People

Contributors

mexok, raballew

okd-the-hard-way's Issues

Lab: Use

The lab docs/22-usage.md needs to be extended to explain how to onboard a tenant in an automated and standardized way.

Containerize services to ensure reproducibility

As of now, the services machine must be configured manually to provide the functionality required to bootstrap and run a cluster. This process is error-prone and makes the setup hard to reproduce. A better approach would be to run each service in its own container.

Improve modification of libvirt iptables

Currently, libvirt creates iptables rules on the host system, which are then modified to simulate a disconnected environment: only the services machine is allowed to connect to the internet, and all other traffic to and from the remaining nodes is dropped. Because using iptables is not recommended, it would be helpful to disable the automatic creation of those rules and replace them with rules configured through nmcli, to standardize usage across the labs.

Use IPv6 on hypervisor and overlay network

The current implementation uses IPv4 only, on both the hypervisor and the OKD overlay network. Even though we are not facing any IP address shortage, switching to IPv6 should be done for academic purposes, as it is increasingly used in the real world. The idea is to define one or multiple IPv6 subnets using libvirt and use a flat network approach to make pods or services directly accessible.

Compatibility with ceph and metallb also needs to be verified. Additionally, many services need to be reworked because they are currently configured for IPv4.

Improve mirror-registry.service after reboot

The mirror registry service works fine on machines that are already running, but if the VM is rebooted the unit fails due to a missing dependency, leaving the mirror registry offline. This behaviour needs to be improved so that the services node can be rebooted without manual intervention afterwards.

Use enterprise grade registry

As of now, the Docker registry container image is used to host the mirror registry. While the basic functionality is the same, an enterprise-quality registry usually offers support for building, securing and serving container images, which the Docker registry does not cover. As an effort to move the lab environment closer to a real-world production environment, add a procedure that describes how to deploy a registry that offers enterprise-grade functionality for proof-of-concept (non-production) purposes.

So instead of using oc adm release mirror to provide the resources for a release, one could simply configure a pull-through or mirror registry that automatically stays in sync.

Lab: Maintain

After providing an automated way to install the cluster, there are usually tasks that need to be performed frequently. Those tasks should be automated to reduce friction.

Describe how to do the following in an automated way:

  • delete node
  • add node
  • update cluster
  • update operators
  • update bastion
  • update certificates
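
As one small building block for the "add node" task, the sketch below filters the table printed by oc get csr for certificate signing requests that are still pending, so that they can be approved in one step. The function name pending_csrs is hypothetical, and the sketch assumes that CONDITION is the last column of the output.

```shell
#!/bin/sh
# Print the names of certificate signing requests that are still pending.
# Reads the table printed by `oc get csr` on stdin.
# Assumption: CONDITION is the last column and reads "Pending" for open requests.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Possible usage on a lab cluster (xargs -r is a GNU extension that skips
# the command when there is no input):
#   oc get csr | pending_csrs | xargs -r oc adm certificate approve
```

Approving every pending request without inspecting it is only reasonable in a lab; a production procedure would check the requestor first.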

Authentication cluster operator stuck at progressing

During the bootstrap process the authentication cluster operator fails to move from progressing to available because the .well-known/oauth-authorization-server endpoint is not reachable.

oc get clusteroperator authentication -o yaml

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
  creationTimestamp: "2020-09-02T07:56:05Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:exclude.release.openshift.io/internal-openshift-hosted: {}
      f:spec: {}
      f:status:
        .: {}
        f:extension: {}
        f:versions: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-09-02T07:56:05Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
        f:relatedObjects: {}
    manager: authentication-operator
    operation: Update
    time: "2020-09-02T08:07:26Z"
  name: authentication
  resourceVersion: "18233"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/authentication
  uid: e85a7dd9-f8f8-4fda-bc0a-d3dce0e61323
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-09-02T08:07:23Z"
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-09-02T08:07:26Z"
    message: 'Progressing: got ''404 Not Found'' status while trying to GET the OAuth
      well-known https://192.168.200.31:6443/.well-known/oauth-authorization-server
      endpoint data'
    reason: _WellKnownNotReady
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-09-02T08:07:26Z"
    status: "False"
    type: Available
  - lastTransitionTime: "2020-09-02T07:58:50Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: authentications
  - group: config.openshift.io
    name: cluster
    resource: authentications
  - group: config.openshift.io
    name: cluster
    resource: infrastructures
  - group: config.openshift.io
    name: cluster
    resource: oauths
  - group: route.openshift.io
    name: oauth-openshift
    namespace: openshift-authentication
    resource: routes
  - group: ""
    name: oauth-openshift
    namespace: openshift-authentication
    resource: services
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-authentication
    resource: namespaces
  - group: ""
    name: openshift-authentication-operator
    resource: namespaces
  - group: ""
    name: openshift-ingress
    resource: namespaces

As a result, other cluster operators remain in a progressing or degraded state.

oc get clusteroperators

NAME                                       VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                             False       True          False      127m
cloud-credential                           4.5.0-0.okd-2020-08-12-020541   True        False         False      138m
cluster-autoscaler                         4.5.0-0.okd-2020-08-12-020541   True        False         False      130m
config-operator                            4.5.0-0.okd-2020-08-12-020541   True        False         False      130m
console                                    4.5.0-0.okd-2020-08-12-020541   False       True          True       127m
csi-snapshot-controller                    4.5.0-0.okd-2020-08-12-020541   True        False         False      73m
dns                                        4.5.0-0.okd-2020-08-12-020541   True        False         False      134m
etcd                                       4.5.0-0.okd-2020-08-12-020541   True        True          True       134m
image-registry                             4.5.0-0.okd-2020-08-12-020541   True        False         False      131m
ingress                                    4.5.0-0.okd-2020-08-12-020541   True        False         False      73m
insights                                   4.5.0-0.okd-2020-08-12-020541   True        False         False      131m
kube-apiserver                             4.5.0-0.okd-2020-08-12-020541   True        True          True       133m
kube-controller-manager                    4.5.0-0.okd-2020-08-12-020541   True        False         False      133m
kube-scheduler                             4.5.0-0.okd-2020-08-12-020541   True        False         False      133m
kube-storage-version-migrator              4.5.0-0.okd-2020-08-12-020541   True        False         False      74m
machine-api                                4.5.0-0.okd-2020-08-12-020541   True        False         False      131m
machine-approver                           4.5.0-0.okd-2020-08-12-020541   True        False         False      134m
machine-config                             4.5.0-0.okd-2020-08-12-020541   True        False         False      133m
marketplace                                4.5.0-0.okd-2020-08-12-020541   True        False         False      130m
monitoring                                 4.5.0-0.okd-2020-08-12-020541   True        False         False      121m
network                                    4.5.0-0.okd-2020-08-12-020541   True        False         False      135m
node-tuning                                4.5.0-0.okd-2020-08-12-020541   True        False         False      135m
openshift-apiserver                        4.5.0-0.okd-2020-08-12-020541   True        False         False      131m
openshift-controller-manager               4.5.0-0.okd-2020-08-12-020541   True        False         False      131m
openshift-samples                          4.5.0-0.okd-2020-08-12-020541   True        False         False      130m
operator-lifecycle-manager                 4.5.0-0.okd-2020-08-12-020541   True        False         False      134m
operator-lifecycle-manager-catalog         4.5.0-0.okd-2020-08-12-020541   True        False         False      134m
operator-lifecycle-manager-packageserver   4.5.0-0.okd-2020-08-12-020541   True        False         False      131m
service-ca                                 4.5.0-0.okd-2020-08-12-020541   True        False         False      135m
storage                                    4.5.0-0.okd-2020-08-12-020541   True        False         False      131m
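
To pick out the affected operators from a table like the one above, a small filter can help. The following is a minimal sketch; the function name unhealthy_operators is hypothetical. It addresses the status columns from the right, because the VERSION column can be empty (as in the authentication row), which shifts field numbers counted from the left.

```shell
#!/bin/sh
# Print operators that are unavailable, still progressing, or degraded.
# Reads the table printed by `oc get clusteroperators` on stdin.
# Columns are counted from the right: SINCE is last, DEGRADED is before it, etc.
unhealthy_operators() {
  awk 'NR > 1 && NF >= 5 {
    available = $(NF - 3); progressing = $(NF - 2); degraded = $(NF - 1)
    if (available != "True" || progressing != "False" || degraded != "False") {
      print $1
    }
  }'
}

# Possible usage (assumes a logged-in oc client):
#   oc get clusteroperators | unhealthy_operators
```

Applied to the output above, this would list authentication, console, etcd and kube-apiserver.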

The endpoint is reachable from the oc client's machine:

curl -X GET https://192.168.200.31:6443/.well-known/oauth-authorization-server -k

{
  "paths": [
    "/apis",
    "/apis/",
    "/apis/apiextensions.k8s.io",
    "/apis/apiextensions.k8s.io/v1",
    "/apis/apiextensions.k8s.io/v1beta1",
    "/healthz",
    "/healthz/etcd",
    "/healthz/log",
    "/healthz/ping",
    "/healthz/poststarthook/crd-informer-synced",
    "/healthz/poststarthook/generic-apiserver-start-informers",
    "/healthz/poststarthook/priority-and-fairness-config-consumer",
    "/healthz/poststarthook/start-apiextensions-controllers",
    "/healthz/poststarthook/start-apiextensions-informers",
    "/livez",
    "/livez/etcd",
    "/livez/log",
    "/livez/ping",
    "/livez/poststarthook/crd-informer-synced",
    "/livez/poststarthook/generic-apiserver-start-informers",
    "/livez/poststarthook/priority-and-fairness-config-consumer",
    "/livez/poststarthook/start-apiextensions-controllers",
    "/livez/poststarthook/start-apiextensions-informers",
    "/metrics",
    "/openapi/v2",
    "/readyz",
    "/readyz/etcd",
    "/readyz/log",
    "/readyz/ping",
    "/readyz/poststarthook/crd-informer-synced",
    "/readyz/poststarthook/generic-apiserver-start-informers",
    "/readyz/poststarthook/priority-and-fairness-config-consumer",
    "/readyz/poststarthook/start-apiextensions-controllers",
    "/readyz/poststarthook/start-apiextensions-informers",
    "/readyz/shutdown",
    "/version"
  ]
}
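
The reply above lists API paths rather than OAuth settings, so being reachable does not necessarily mean that the metadata is being served. The sketch below checks the body for an "issuer" key. It assumes that a populated endpoint returns authorization server metadata as described in RFC 8414, where "issuer" is a required field; the function name is_oauth_metadata is hypothetical.

```shell
#!/bin/sh
# Succeed when the JSON body on stdin looks like OAuth server metadata.
# Assumption: real metadata contains an "issuer" key (required by RFC 8414),
# while a body that only lists "paths" is not the expected document.
is_oauth_metadata() {
  grep -q '"issuer"'
}

# Possible usage with the address from this issue:
#   curl -ks https://192.168.200.31:6443/.well-known/oauth-authorization-server \
#     | is_oauth_metadata && echo "metadata served" || echo "no OAuth metadata"
#
# The HTTP status code alone can be printed with:
#   curl -ks -o /dev/null -w '%{http_code}\n' \
#     https://192.168.200.31:6443/.well-known/oauth-authorization-server
```

Comparing the status code with the 404 reported by the authentication operator may show whether the client and the operator see the same response.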

Enable FIPS mode

By default, FIPS mode is not enabled. If FIPS mode is enabled, the Fedora CoreOS (FCOS) machines that OKD runs on bypass the default Kubernetes cryptography suite and use the cryptography modules that are provided with FCOS instead. This setup is common for production workloads. It seems that currently the FCOS installation fails if FIPS is enabled. Having this feature enabled would move the configuration a step closer to what many real-world setups use.

Verify functionality after hypervisor reboot

Rebooting up to two nodes of each node type is possible without any major issues. However, functionality after rebooting everything at once, e.g. when the hypervisor must be updated, has not been verified.

Smoke test

Add a smoke test to verify that the following works as expected:

  • dynamic storage provisioning for each storage class
  • load balancer service types
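
A possible starting point for the two checks is sketched below. It only inspects the tables printed by oc get pvc and oc get svc; creating one test claim per storage class and a test service is left out. Both function names are hypothetical, and the column positions are assumptions based on the default table output.

```shell
#!/bin/sh
# Print persistent volume claims that are not bound.
# Reads `oc get pvc` output on stdin; STATUS is assumed to be column 2.
unbound_claims() {
  awk 'NR > 1 && $2 != "Bound" { print $1 }'
}

# Print LoadBalancer services that have not received an external address.
# Reads `oc get svc` output on stdin; TYPE is assumed to be column 2 and
# EXTERNAL-IP column 4, shown as "<pending>" until an address is assigned.
pending_load_balancers() {
  awk 'NR > 1 && $2 == "LoadBalancer" && $4 == "<pending>" { print $1 }'
}

# Possible usage in a test namespace (the name smoke-test is an assumption):
#   oc get pvc -n smoke-test | unbound_claims
#   oc get svc -n smoke-test | pending_load_balancers
```

The smoke test would pass when both commands print nothing after a reasonable waiting period.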

Lab: Automated Installation

The section docs/20-deploy.md should cover the following content:

  • Introduction to tools such as Ansible and Terraform
  • Definition of the scope: Basic install without configurations other than Argo
  • Step-by-step instructions on how to set up the cluster automatically on libvirt
  • kustomize + ArgoCD
