sassoftware / viya4-iac-k8s Goto Github PK

This project contains Terraform scripts to provision cloud infrastructure resources, when using vSphere, and Ansible to apply the needed elements of a Kubernetes cluster that are required to deploy SAS Viya platform product offerings.

License: Apache License 2.0

Dockerfile 4.62% Shell 32.14% HCL 59.46% Jinja 3.78%

iac kubernetes oss terraform sas-viya

viya4-iac-k8s's Introduction

SAS Viya 4 Infrastructure as Code (IaC) for Open Source Kubernetes

Release Notes

A problem with the implementation of the default storage class and its usage of an NFS server as its backing store has been addressed with this issue.

Overview

This project helps you to automate the cluster-provisioning phase of SAS Viya platform deployment. It contains Terraform scripts to provision cloud infrastructure resources for VMware, and it contains Ansible files to apply the elements of a Kubernetes cluster that are required to deploy SAS Viya 4 product offerings. Here is a list of resources that this project can create:

An open source Kubernetes cluster with the following components:

Container Runtime Interface (CRI): containerd

Container Network Interface (CNI): Calico

Cluster-level virtual IP address (VIP): kube-vip

Cluster load balancer: kube-vip or MetalLB

Nodes with required labels and taints

Infrastructure to deploy the SAS Viya CAS server in SMP or MPP mode

To learn about all phases and options of the SAS Viya platform deployment process, see Getting Started with SAS Viya and Open Source Kubernetes in SAS® Viya® Platform Operations.

Once the resources are provisioned, use the viya4-deployment project to deploy SAS Viya platform in your cloud environment. For more information about SAS Viya platform requirements and documentation for the deployment process, refer to SAS Viya Platform Operations.

This project supports infrastructure that is built on physical machines ("bare metal" machines or Linux VMs) or on VMware vSphere or vCenter machines. If you need to create a cluster in AWS, Microsoft Azure, or GCP, use the appropriate SAS Viya IaC repository to perform the associated tasks.

Prerequisites

Use of these tools requires operational knowledge of the following technologies:

Systems
Networking
Bash
Terraform
Docker
Ansible
Helm
kube-vip
MetalLB
Kubernetes

Machines

The tools in this repository can create systems as needed only if you are running on VMware vSphere or vCenter. If you are not using vSphere or vCenter, you must supply your own machines (either VMs or physical machines).

Regardless of which method you choose, the machines in your deployment must meet the minimal requirements listed below:

Machines in your target environment are running Ubuntu Linux LTS 20.04 or 22.04
Machines have a default user account with password-less sudo capabilities
At least 3 machines for the control plane nodes in your cluster
At least 6 machines for the application nodes in your cluster
1 machine to serve as a jump server
1 machine to serve as an NFS server
(Optional) At least 1 machine to host a PostgreSQL server (for the SAS Infrastructure Data Server component) if you plan to use an external database server with your cluster.

You can instead use the internal PostgreSQL server, which is deployed by default on a node in the cluster.

NOTE: Remember that these machines are not managed by a provider or by automated tooling. The nodes that you add here dictate the capacity of the cluster. If you need to increase or decrease the number of nodes in the cluster, you must perform the task manually. There is NO AUTOSCALING with this setup.

VMware vSphere

Deployment with vSphere requires a Linux image that can be used as the basis for your machines. This image requires the following minimal settings:

Ubuntu Linux LTS 20.04 or 22.04 minimal installation
2 CPUs
4 GB of memory
8 GB disk, thin provisioned
Root file system mounted at /dev/sd2

NOTE: These items are all automatically adjusted to suit each individual deployment. These values are only the minimum starting point. They will be changed as components are created.

Physical or Virtual Machines

In addition to supporting VMware, this project also works with existing physical or virtual machines. You will need root access to these machines, and you will need to pass this along, following the sample inventory and ansible-vars.yaml files that are provided in this repository.

Networking

The following items are required to support the systems that are created in your environment:

A network that is routable by all the target machines
A static or assignable IP address for each target machine
At least 3 floating IP addresses for the following components:
- The Kubernetes cluster virtual IP address
- The load balancer IP address
- A CIDR block or range of IP addresses for additional load balancers. These are used when exposing user interfaces for various SAS product offerings.

A more comprehensive description of these items and their requirements can be found in the Requirements document.

Technical Prerequisites

This project supports the following options for running the scripts in this repository to automate cluster provisioning:

Running the bash oss-k8s.sh script on your local machine
Using a Docker container to run the oss-k8s.sh script

For more information, see Docker Usage. Using Docker to run the Terraform and Ansible scripts is recommended.

Script Requirements

View the Dependencies Documentation to see the required software that needs to installed in order to run the SAS Viya IaC tools here on your local system

Docker Requirements

If you are using the predefined dockerfile in this project in order to run the script, you need only an instance of Docker.

Getting Started

When you have prepared your environment with the prerequisites, you are ready to obtain and customize the Terraform scripts that will set up your Kubernetes cluster.

Clone This Project

Run the following commands from a terminal session:

# clone this repo
git clone https://github.com/sassoftware/viya4-iac-k8s

# move to the project directory
cd viya4-iac-k8s

Customize Input Values

vSphere/vCenter Machines

Terraform scripts require variable definitions as input. Review the variables files and modify default values to meet your requirements. Create a file named terraform.tfvars in order to customize the input variable values that are documented in the CONFIG-VARS.md file.

To get started, you can copy one of the example variable definition files that are provided in the ./examples folder. For more information about the variables that are declared in each file, refer to the CONFIG-VARS.md file.

You have the option to specify variable definitions that are not included in terraform.tfvars or to use a variable definition file other than terraform.tfvars. See Advanced Terraform Usage for more information.

SAS Viya IaC Configuration Files

In order to use this repository, modify the inventory file to provide information about the machine targets for the SAS Viya platform deployment.

Modify the ansible-vars.yaml file to customize the configuration settings for your environment.

Create and Manage Cluster Resources

Create and manage the required cluster resources for your SAS Viya 4 deployment. Perform one of the following steps, based on whether you are using Docker:

Run the oss-k8s.sh script directly on your workstation
Start the Docker container (recommended)

Contributing

We welcome your contributions! See CONTRIBUTING.md for details on how to submit contributions to this project.

License

This project is licensed under the Apache 2.0 License.

Additional Resources

viya4-iac-k8s's People

Contributors

Stargazers

Watchers

Forkers

awsmith0216 korsml chago1991 coen72 supear markha73 fiolbs rabem00 frasep mik-now ghas-results han-tun jefftang2016 primayuda michellefekula

viya4-iac-k8s's Issues

how to push images to Container Registry (harbor)

hello,
i used this IAC to setup pare metal cluster.
the thing related to CR is not clear, what will be the port ,URL, username and password. that should be passed to viya4-Deployment.
viya4-deployment is talking about tfstate but it's not created here and not used

Node labelling assumes that the node names are the short hostname

In my environment the kubernetes nodes are represented by their FQDN, rather than short hostname:

# kubectl get nodes
NAME                        STATUS   ROLES           AGE   VERSION
adsmit-oc1-m1.foo.bar.com   Ready    control-plane   79m   v1.24.3
adsmit-oc1-n1.foo.bar.com   Ready    <none>          78m   v1.24.3
adsmit-oc1-n2.foo.bar.com   Ready    <none>          78m   v1.24.3
adsmit-oc1-n3.foo.bar.com   Ready    <none>          78m   v1.24.3
adsmit-oc1-n4.foo.bar.com   Ready    <none>          78m   v1.24.3
adsmit-oc1-n5.foo.bar.com   Ready    <none>          78m   v1.24.3

However, the roles/kubernetes/node/labels_taints/tasks/labels.yaml file uses the ansible_hostname magic variable, which seems to only resolve to the short hostname. This causes an error similar to:

failed: [adsmit-oc1-m1.foo.bar.com] (item=launcher.sas.com/prepullImage=sas-programming-environment) => {"ansible_loop_var": "label", "changed": true, "cmd": "kubectl label nodes adsmit-oc1-m1 launcher.sas.com/prepullImage=sas-programming-environment --overwrite \n", "delta": "0:00:00.122628", "end": "2023-03-16 14:11:51.555860", "label": "launcher.sas.com/prepullImage=sas-programming-environment", "msg": "non-zero return code", "rc": 1, "start": "2023-03-16 14:11:51.433232", "stderr": "Error from server (NotFound): nodes \"adsmit-oc1-m1\" not found", "stderr_lines": ["Error from server (NotFound): nodes \"adsmit-oc1-m1\" not found"], "stdout": "", "stdout_lines": []}

Seems like the task could be improved to first determine if the node names are the short hostname or FQDN, and then use that name during the labelling.

(IAC-570) vSphere - Missing IPs in Terraform output when using DHCP for NFS & Jump

Terraform Version

1.1.9 # From Docker Container

Terraform Variable File

..truncated
# Jump server
#
#   Suggested server specs shown below.
#
create_jump    = true # Creation flag
jump_num_cpu   = 4    # 4 CPUs
jump_ram       = 8092 # 8 GB
jump_disk_size = 100  # 100 GB
jump_ip        = ""   # Assigned values for static IPs

# NFS server
#
#   Suggested server specs shown below.
#
create_nfs    = true  # Creation flag
nfs_num_cpu   = 8     # 8 CPUs
nfs_ram       = 16384 # 16 GB
nfs_disk_size = 500   # 500 GB
nfs_ip        = ""    # Assigned values for static IPs

Steps to Reproduce

docker run --rm -it --group-add root --user $(id -u):$(id -g) --env-file /home/$MYUSER/.vsphere/.vsphere_creds.env --volume $(pwd):/workspace viya4-iac-k8s:local install vsphere

Expected Behavior

In the Terraform output I expect nfs_public_ip & jump_public_ip to be populated with values

Actual Behavior

Those fields are not populated, so we run into an error when the ansible playbook executes

TASK [kubernetes/storage/nfs-subdir-external-provisioner : Setting up default storage for cluster using nfs-subdir-external-provisioner] *******************************************************************************************************************************************************************************
Monday 16 May 2022  16:33:49 +0000 (0:00:00.568)       0:04:34.018 ************ 
fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: {'nfs': {'server': '{{ nfs_ip }}', 'path': '/srv/nfs/kubernetes/sc/default', 'mountOptions': ['noatime', 'nodiratime', 'rsize=262144', 'wsize=262144']}, 'storageClass': {'archiveOnDelete': 'false', 'de
faultClass': 'true', 'name': 'default'}}: 'nfs_ip' is undefined\n\nThe error appears to be in '/viya4-iac-k8s/roles/kubernetes/storage/nfs-subdir-external-provisioner/tasks/main.yaml': line 9, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:
\n\n#\n- name: Setting up default storage for cluster using nfs-subdir-external-provisioner\n  ^ here\n"}

Additional Context

I tried removing nfs_ip & jump_ip from .tfvars so they they would instead default to null in hopes that was the proper way to do this instead Terraform fails with the following:

│ Call to function "templatefile" failed: ./templates/ansible/ansible-vars.yaml.tmpl:45,14-21: Invalid function argument; Invalid value for "value" parameter: argument must not be null., and 1 other diagnostic(s).

References

-->

Update docs to include more info on Cluster VIP and Cloud Provider

Clarification on kube_vip and kubernets_vip items in the terraform and ansible files. There are new concepts for folks not used to managing a full cluster from scratch. Need more doc and explanations on these items.

documentation : node pools "misc disks" settings not documented and/or explained.

In the CONFIG-VARS.md file the "misc disks" option that allow to provision additional empty disks for the local storage provisioner does not appear in the Node pool options table and are not documented.

feat/doc (IAC-1270) the documentation says we don't create PG DB during infra setup but the code creates the SharedServices DB.

From https://github.com/sassoftware/viya4-iac-k8s/blob/main/docs/CONFIG-VARS.md :

PostgreSQL Servers

When setting up external database servers, you must provide information about those servers in the postgres_servers variable block. Each entry in the variable block represents a single database server.

This code only configures database servers. No databases are created during the infrastructure setup.

However, the task "https://raw.githubusercontent.com/sassoftware/viya4-iac-k8s/main/roles/kubernetes/database/postgres/create_databases/tasks/main.yaml" creates the SharedServices DB when the postgres_server_name is "default".

Failure during installation of nfs-subdir-external-provisioner

Hi,

We are trying to install this to our bare-metal servers and executing the oss-k8s.sh script through a docker container.
Works well until it comes to the nfs-subdir-external-provisioner installation task, where it fails.

Short error:

fatal: [localhost]: FAILED! => {"changed": false, "command": "/usr/local/bin/helm --version=4.0.16 --repo=https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/ upgrade -i --reset-values --wait --create-namespace -f=/tmp/tmpszu05z7m.yml nfs-subdir-external-provisioner-default nfs-subdir-external-provisioner", "msg": "Failure when executing Helm command. Exited 1.\nstdout: Release \"nfs-subdir-external-provisioner-default\" does not exist. Installing it now.\n\nstderr: Error: timed out waiting for the condition\n", "stderr": "Error: timed out waiting for the condition\n", "stderr_lines": ["Error: timed out waiting for the condition"], "stdout": "Release \"nfs-subdir-external-provisioner-default\" does not exist. Installing it now.\n", "stdout_lines": ["Release \"nfs-subdir-external-provisioner-default\" does not exist. Installing it now."]}

Nothing happens, sits there and waits and times out after about 5 minutes.

This didn't provide so much information, so we run the whole thing with "-vvvvv" and got the following:

TASK [kubernetes/storage/nfs-subdir-external-provisioner : Setting up default storage for cluster using nfs-subdir-external-provisioner] ******************************************************************
task path: /viya4-iac-k8s/roles/kubernetes/storage/nfs-subdir-external-provisioner/tasks/main.yaml:9
Friday 26 August 2022  12:16:12 +0000 (0:00:01.391)       0:06:55.428 *********
<127.0.0.1> ESTABLISH LOCAL CONNECTION FOR USER: root
<127.0.0.1> EXEC /bin/sh -c '( umask 77 && mkdir -p "` echo $HOME/.ansible/tmp `"&& mkdir "` echo $HOME/.ansible/tmp/ansible-tmp-1661516172.2773957-6455-25144315223314 `" && echo ansible-tmp-1661516172.2773957-6455-25144315223314="` echo $HOME/.ansible/tmp/ansible-tmp-1661516172.2773957-6455-25144315223314 `" ) && sleep 0'
Using module file /viya4-iac-k8s/.ansible/collections/ansible_collections/kubernetes/core/plugins/modules/helm.py
<127.0.0.1> PUT /viya4-iac-k8s/.ansible/tmp/ansible-local-802z26hs1fm/tmpjaxv60vk TO /viya4-iac-k8s/.ansible/tmp/ansible-tmp-1661516172.2773957-6455-25144315223314/AnsiballZ_helm.py
<127.0.0.1> EXEC /bin/sh -c 'chmod u+x /viya4-iac-k8s/.ansible/tmp/ansible-tmp-1661516172.2773957-6455-25144315223314/ /viya4-iac-k8s/.ansible/tmp/ansible-tmp-1661516172.2773957-6455-25144315223314/AnsiballZ_helm.py && sleep 0'
<127.0.0.1> EXEC /bin/sh -c '/usr/bin/python3 /viya4-iac-k8s/.ansible/tmp/ansible-tmp-1661516172.2773957-6455-25144315223314/AnsiballZ_helm.py && sleep 0'
<127.0.0.1> EXEC /bin/sh -c 'rm -f -r /viya4-iac-k8s/.ansible/tmp/ansible-tmp-1661516172.2773957-6455-25144315223314/ > /dev/null 2>&1 && sleep 0'
fatal: [localhost]: FAILED! => {
    "changed": false,
    "command": "/usr/local/bin/helm --version=4.0.16 --repo=https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/ upgrade -i --reset-values --wait --create-namespace -f=/tmp/tmpfze9iz5i.yml nfs-subdir-external-provisioner-default nfs-subdir-external-provisioner",
    "invocation": {
        "module_args": {
            "api_key": null,
            "atomic": false,
            "binary_path": null,
            "ca_cert": null,
            "chart_ref": "nfs-subdir-external-provisioner",
            "chart_repo_url": "https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/",
            "chart_version": "4.0.16",
            "context": null,
            "create_namespace": true,
            "disable_hook": false,
            "force": false,
            "history_max": null,
            "host": null,
            "kubeconfig": "/workspace/ucr-oss-kubeconfig.conf",
            "name": "nfs-subdir-external-provisioner-default",
            "namespace": "kube-system",
            "purge": true,
            "release_name": "nfs-subdir-external-provisioner-default",
            "release_namespace": "kube-system",
            "release_state": "present",
            "release_values": {
                "nfs": {
                    "mountOptions": [
                        "noatime",
                        "nodiratime",
                        "rsize=262144",
                        "wsize=262144"
                    ],
                    "path": "/srv/nfs/kubernetes/sc/default",
                    "server": "xx.xx.xx.xx"
                },
                "storageClass": {
                    "archiveOnDelete": "false",
                    "defaultClass": "true",
                    "name": "default"
                }
            },
            "replace": false,
            "skip_crds": false,
            "timeout": null,
            "update_repo_cache": false,
            "validate_certs": true,
            "values": {
                "nfs": {
                    "mountOptions": [
                        "noatime",
                        "nodiratime",
                        "rsize=262144",
                        "wsize=262144"
                    ],
                    "path": "/srv/nfs/kubernetes/sc/default",
                    "server": "xx.xx.xx.xx"
                },
                "storageClass": {
                    "archiveOnDelete": "false",
                    "defaultClass": "true",
                    "name": "default"
                }
            },
            "values_files": [],
            "wait": true,
            "wait_timeout": null
        }
    },
    "msg": "Failure when executing Helm command. Exited 1.\nstdout: Release \"nfs-subdir-external-provisioner-default\" does not exist. Installing it now.\n\nstderr: Error: timed out waiting for the condition\n",
    "stderr": "Error: timed out waiting for the condition\n",
    "stderr_lines": [
        "Error: timed out waiting for the condition"
    ],
    "stdout": "Release \"nfs-subdir-external-provisioner-default\" does not exist. Installing it now.\n",
    "stdout_lines": [
        "Release \"nfs-subdir-external-provisioner-default\" does not exist. Installing it now."
    ]
}

The helm chart is available and can be installed, but the feeling is that the timeout is happening in the kubernetes cluster level.

Any help is appreciated!

feat: (IAC-1269) The ansible reboot should only happen on the K8s nodes and K8s control plane nodes

In one of the playbook (roles/kubernetes/common/tasks/main.yaml) there is a task to reboot all the nodes, in order to force the OS to pick up some low level system changes (like the grub configuration change).

But if you were using the jumphost provisioned machine as your ansible controller, then you get automatically disconnected in the middle of the playbook execution (since ansible force the machine from where your run the the playbook to reboot). And the rest of the playbook tasks can not be executed.

The ansible.reboot should only happen on the K8s nodes and K8s control plan nodes, I don't see why NFS or Jump server nodes should be rebooted.
If we don't reboot the NFS and jump servers, it leaves the possibility to co-locate them on a node and use this node as the ansible controller to run the tool (instead of having to provide yet another "bastion" machine - outside of the provisionned hosts - to run the playbook).

Update node labelling to work with pattern matching rather than a simple string search

Currently file roles/kubernetes/node/labels_taints/tasks/main.yaml does a simple string find to determine if the node name matches the key value. It would be nice to enhance this so that we could have the capability to match on a regex pattern as well. This would allow users to set a key such as ".+" to match on all node names.

fix: (IAC-1246) Declare "cluster_node_pool_mode" in variables.tf to remove warning message

Using the 'minimal' example file, I get the following warning message after running the apply:

Warning: Value for undeclared variable

The root module does not declare a variable named "cluster_node_pool_mode"
but a value was found in file "terraform.tfvars".
If you meant to use this value, add a "variable" block to the configuration.

Seems like this variable should either be declared in variables.tf, or removed from the example and output files.

docs: (IAC-1267) export variables in .bare_metal_creds.env

Here is how I created my own .bare_metal_creds.env

# generate a env file with ansible credentials 
bash -c "cat << EOF > ~/.bare_metal_creds.env
export ANSIBLE_USER=cloud-user
export ANSIBLE_PASSWORD=lnxsas
EOF"
chmod 600 ~/.bare_metal_creds.env

if I'm not using the "export" commands, the script keeps asking me the ansible credentials.
I would suggest that we add them in the sample.
Thanks

feat: (IAC-1334) "kube-system/kube-vip-cloud-provider" pod is not running which prevents the external IP address allocation.

Hello
When I deploy with the bare-metal mode, the playbook execution is successful but the kube-system/kube-vip-cloud-provider pod is not running (Error then crash loop).
I have the same issue with IaC 3.5.0/k8S 1.27.6 (kube-vip 0.5.5) and IAC 3.7.0/1.27.9 (kube-vip 0.5.5 and 0.5.7 tested).

panic: version string "" doesn't match expected regular expression: "^v(\d+\.\d+\.\d+)"                                                                                                                            │
│                                                                                                                                                                                                                    │
│ goroutine 1 [running]:                                                                                                                                                                                             │
│ k8s.io/component-base/metrics.parseVersion({{0x0, 0x0}, {0x0, 0x0}, {0x1f44b17, 0x0}, {0x1c97daf, 0xb}, {0x0, 0x0}, ...})                                                                                          │
│     /go/pkg/mod/k8s.io/[email protected]/metrics/version_parser.go:47 +0x274                                                                                                                                  │
│ k8s.io/component-base/metrics.newKubeRegistry({{0x0, 0x0}, {0x0, 0x0}, {0x1f44b17, 0x0}, {0x1c97daf, 0xb}, {0x0, 0x0}, ...})                                                                                       │
│     /go/pkg/mod/k8s.io/[email protected]/metrics/registry.go:320 +0x119                                                                                                                                       │
│ k8s.io/component-base/metrics.NewKubeRegistry()                                                                                                                                                                    │
│     /go/pkg/mod/k8s.io/[email protected]/metrics/registry.go:335 +0x78                                                                                                                                        │
│ k8s.io/component-base/metrics/legacyregistry.init()                                                                                                                                                                │
│     /go/pkg/mod/k8s.io/[email protected]/metrics/legacyregistry/registry.go:29 +0x1d                                                                                                                          │
│ Stream closed EOF for kube-system/kube-vip-cloud-provider-578d9b7bf7-z6t4f (kube-vip-cloud-provider)

Then my ingress service external IP allocation is <pending> but I suppose it is a consequence of the issue with the kube-vip cloud provider.
Any help would be very appreciated.
Thanks !

PS : See bellow my ansible-vars.yaml file :

# Ansible items
ansible_user     : "cloud-user"
#ansible_password : "lnxsas"

# VM items
vm_os   : "ubuntu" # Choices : [ubuntu|rhel] - Ubuntu 20.04 LTS / RHEL ???
vm_arch : "amd64"  # Choices : [amd64] - 64-bit OS / ???

# System items
enable_cgroup_v2    : true     # TODO - If needed hookup or remove flag
system_ssh_keys_dir : "~/.ssh" # Directory holding public keys to be used on each system

# Generic items
prefix : "GEL-k8s"
deployment_type: "bare_metal" # Values are: [bare_metal|vsphere]

# Kubernetes - Common
#
# TODO: kubernetes_upgrade_allowed needs to be implemented to either
#       add or remove locks on the kubeadm, kubelet, kubectl packages
#
kubernetes_cluster_name    : "{{ prefix }}-oss" # NOTE: only change the prefix value above
#kubernetes_version         : "1.23.8" 
#kubernetes_version         : "1.24.10"
#kubernetes_version          : "1.25.8"
#kubernetes_version          : "1.26.6" https://kubernetes.io/releases/
kubernetes_version          : "1.27.6"

kubernetes_upgrade_allowed : true
kubernetes_arch            : "{{ vm_arch }}"
kubernetes_cni             : "calico"        # Choices : [calico]
kubernetes_cni_version     : "3.24.4"
kubernetes_cri             : "containerd"    # Choices : [containerd|docker|cri-o] NOTE: cri-o is not currently functional
kubernetes_service_subnet  : "10.42.0.0/16" # default values 
kubernetes_pod_subnet      : "10.43.0.0/16" # default values

# Kubernetes - VIP : https://kube-vip.io
# 
# Useful links:
#
#   VIP IP : https://kube-vip.chipzoller.dev/docs/installation/static/
#   VIP Cloud Provider IP Range : https://kube-vip.chipzoller.dev/docs/usage/cloud-provider/#the-kube-vip-cloud-provider-configmap
#
kubernetes_loadbalancer             : "kube_vip"
kubernetes_vip_version              : "0.5.5"
# we need to create static VIPs (eth0) - needs to run some commands to create/find the VIP IP in the network + register in DNS
# mandatory even for 1 control plan node
kubernetes_vip_interface            : "eth0"
kubernetes_vip_ip                   : "10.96.18.1" # for RACE EXNET pick a value in the "10.96.18.0+" unused range 
kubernetes_vip_fqdn                 : "osk-api-stud0.gelenable.sas.com" # DNS alias associated to the K8s CP VIP (names)
kubernetes_loadbalancer_addresses :
  - "range-global: 10.96.18.2-10.96.18.4" # IP range  for services type that require the LB IP access, range-<namespace>

# Kubernetes - Control Plane
control_plane_ssh_key_name : "cp_ssh"

# Labels/Taints , we associate label and taints to the K8s nodes 
# Note : here "hostname" command is used behind the scene. It does not necessarily correspond to the names used in the inventory

## Labels
node_labels:
  sasnode02:
    - kubernetes.azure.com/mode=system
  sasnode03:
    - kubernetes.azure.com/mode=system
  sasnode04:
    - kubernetes.azure.com/mode=system
  sasnode05:
    - workload.sas.com/class=cas
  sasnode06:
    - workload.sas.com/class=stateful
  sasnode07:
    - workload.sas.com/class=stateless
  sasnode08:
    - launcher.sas.com/prepullImage=sas-programming-environment
    - workload.sas.com/class=compute

## Taints
node_taints:
  sasnode05:
    - workload.sas.com/class=cas:NoSchedule

# Jump Server
jump_ip : rext03-0200.race.sas.com

# NFS Server
nfs_ip  : rext03-0175.race.sas.com

fix: (IAC-1247) OS Specific variables created without guardrails or OS validation

Found this variable: kubernetes_cri_deb_rev located here and being used here.

The code should have an OS guard for Ubuntu and the variable used for revisions should be just tagged as such and not identified as deb anything.

Here is the mocked up task. The when statement is written so both must be true

- name: set containerd.io package debian revision if not specified
  set_fact:
    kubernetes_cri_deb_rev: "-*"
  when:
     - kubernetes_cri_version | regex_search("^(\d+\.)(\d+\.)(\d+)$")
     - ansible_distribution == "Ubuntu" and (ansible_distribution_version == "20.04" or ansible_distribution_version == "22.04")
  tags:
    - install
    - update

failure in the kubeadm init task with missing optional cgroups: hugetlb

Hi
I faced this failure in the kubeadm init task this morning.
Not sure if it is reproductible/systematic or not yet.

fatal: [sasnode02]: FAILED! => {"changed": true, "cmd": ["kubeadm", "init", "--config", "/etc/kubernetes/kubeadm-config.yaml", "--upload-certs"], "delta": "0:04:20.783361", "end": "2022-08-26 07:35:26.018951", "msg": "non-zero return code", "rc": 1, "start": "2022-08-26 07:31:05.235590", "stderr": "\t[WARNING SystemVerification]: missing optional cgroups: hugetlb
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["\t[WARNING SystemVerification]: missing optional cgroups: hugetlb", "error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.23.8
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder \"/etc/kubernetes/pki\"
[certs] Generating \"ca\" certificate and key
[certs] Generating \"apiserver\" certificate and key
[certs] apiserver serving cert is signed for DNS names [kube-vip-11.gelk8s.demo.sas.com kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost sasnode02] and IPs [10.42.0.1 10.96.14.171 10.96.18.11]
[certs] Generating \"apiserver-kubelet-client\" certificate and key
[certs] Generating \"front-proxy-ca\" certificate and key
[certs] Generating \"front-proxy-client\" certificate and key
[certs] Generating \"etcd/ca\" certificate and key
[certs] Generating \"etcd/server\" certificate and key
[certs] etcd/server serving cert is signed for DNS names [localhost sasnode02] and IPs [10.96.14.171 127.0.0.1 ::1]
[certs] Generating \"etcd/peer\" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [localhost sasnode02] and IPs [10.96.14.171 127.0.0.1 ::1]
[certs] Generating \"etcd/healthcheck-client\" certificate and key
[certs] Generating \"apiserver-etcd-client\" certificate and key
[certs] Generating \"sa\" key and public key
[kubeconfig] Using kubeconfig folder \"/etc/kubernetes\"
[kubeconfig] Writing \"admin.conf\" kubeconfig file
[kubeconfig] Writing \"kubelet.conf\" kubeconfig file
[kubeconfig] Writing \"controller-manager.conf\" kubeconfig file
[kubeconfig] Writing \"scheduler.conf\" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"
[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder \"/etc/kubernetes/manifests\"
[control-plane] Creating static Pod manifest for \"kube-apiserver\"
[control-plane] Creating static Pod manifest for \"kube-controller-manager\"
[control-plane] Creating static Pod manifest for \"kube-scheduler\"
[etcd] Creating static Pod manifest for local etcd in \"/etc/kubernetes/manifests\"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory \"/etc/kubernetes/manifests\". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.

\tUnfortunately, an error has occurred:
\t\ttimed out waiting for the condition

\tThis error is likely caused by:
\t\t- The kubelet is not running
\t\t- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

\tIf you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
\t\t- 'systemctl status kubelet'
\t\t- 'journalctl -xeu kubelet'

\tAdditionally, a control plane component may have crashed or exited when started by the container runtime.
\tTo troubleshoot, list all containers using your preferred container runtimes CLI.

\tHere is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
\t\t- 'crictl --runtime-endpoint /run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
\t\tOnce you have found the failing container, you can inspect its logs with:
\t\t- 'crictl --runtime-endpoint /run/containerd/containerd.sock logs CONTAINERID'", "stdout_lines": ["[init] Using Kubernetes version: v1.23.8", "[preflight] Running pre-flight checks", "[preflight] Pulling images required for setting up a Kubernetes cluster", "[preflight] This might take a minute or two, depending on the speed of your internet connection", "[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'", "[certs] Using certificateDir folder \"/etc/kubernetes/pki\"", "[certs] Generating \"ca\" certificate and key", "[certs] Generating \"apiserver\" certificate and key", "[certs] apiserver serving cert is signed for DNS names [kube-vip-11.gelk8s.demo.sas.com kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost sasnode02] and IPs [10.42.0.1 10.96.14.171 10.96.18.11]", "[certs] Generating \"apiserver-kubelet-client\" certificate and key", "[certs] Generating \"front-proxy-ca\" certificate and key", "[certs] Generating \"front-proxy-client\" certificate and key", "[certs] Generating \"etcd/ca\" certificate and key", "[certs] Generating \"etcd/server\" certificate and key", "[certs] etcd/server serving cert is signed for DNS names [localhost sasnode02] and IPs [10.96.14.171 127.0.0.1 ::1]", "[certs] Generating \"etcd/peer\" certificate and key", "[certs] etcd/peer serving cert is signed for DNS names [localhost sasnode02] and IPs [10.96.14.171 127.0.0.1 ::1]", "[certs] Generating \"etcd/healthcheck-client\" certificate and key", "[certs] Generating \"apiserver-etcd-client\" certificate and key", "[certs] Generating \"sa\" key and public key", "[kubeconfig] Using kubeconfig folder \"/etc/kubernetes\"", "[kubeconfig] Writing \"admin.conf\" kubeconfig file", "[kubeconfig] Writing \"kubelet.conf\" kubeconfig file", "[kubeconfig] Writing \"controller-manager.conf\" kubeconfig file", "[kubeconfig] Writing \"scheduler.conf\" kubeconfig file", "[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"", "[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"", "[kubelet-start] Starting the kubelet", "[control-plane] Using manifest folder \"/etc/kubernetes/manifests\"", "[control-plane] Creating static Pod manifest for \"kube-apiserver\"", "[control-plane] Creating static Pod manifest for \"kube-controller-manager\"", "[control-plane] Creating static Pod manifest for \"kube-scheduler\"", "[etcd] Creating static Pod manifest for local etcd in \"/etc/kubernetes/manifests\"", "[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory \"/etc/kubernetes/manifests\". This can take up to 4m0s", "[kubelet-check] Initial timeout of 40s passed.", "", "\tUnfortunately, an error has occurred:", "\t\ttimed out waiting for the condition", "", "\tThis error is likely caused by:", "\t\t- The kubelet is not running", "\t\t- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)", "", "\tIf you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:", "\t\t- 'systemctl status kubelet'", "\t\t- 'journalctl -xeu kubelet'", "", "\tAdditionally, a control plane component may have crashed or exited when started by the container runtime.", "\tTo troubleshoot, list all containers using your preferred container runtimes CLI.", "", "\tHere is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:", "\t\t- 'crictl --runtime-endpoint /run/containerd/containerd.sock ps -a | grep kube | grep -v pause'", "\t\tOnce you have found the failing container, you can inspect its logs with:", "\t\t- 'crictl --runtime-endpoint /run/containerd/containerd.sock logs CONTAINERID'"]}

bare metal deployment fail because terraform command is not found

bare metal deployments does not requires Terraform.

However when running the playbook with "bare_metal" as the deployment model value, I get this failure :

TASK [kubernetes/sas-iac-buildinfo : Register IAC Tooling information] ********************************
Wednesday 06 July 2022  09:21:16 +0000 (0:00:00.222)       0:02:29.728 ********
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "cd \"/home/cloud-user/viya4-iac-k8s\"\n\"/home/cloud-user/viya4-iac-k8s/files/tools/iac_tooling_version.sh\"\n", "delta": "0:00:00.006851", "end": "2022-07-06 09:21:16.880517", "msg": "non-zero return code", "rc": 127, "start": "2022-07-06 09:21:16.873666", "stderr": "/home/cloud-user/viya4-iac-k8s/files/tools/iac_tooling_version.sh: line 17: terraform:command not found", "stderr_lines": ["/home/cloud-user/viya4-iac-k8s/files/tools/iac_tooling_version.sh: line 17: terraform: command not found"], "stdout": "", "stdout_lines": []}`

the called script assume TF is installed. It should not, since TF is not required in this case.

Pod connectivity issue

Terraform Version Details

I'm creating a bare-metal environment on Ubuntu 22.04. I ran the IaC setup and install, and it finished successfully.
However, I initially noticed that DNS resolution is not working for pods (all but pods running on the first control plane node, where core-dns is also running).

This is from a (helper) pod running in the first control plane node, where core-dns is also running:

This is from the helper pod running in all other nodes:

I then tried to reach the core-dns IP from the nodes.
I can reach it from the node where it's running:

But I cannot reach it from any other nodes:

Terraform Variable File Details

No response

Ansible Variable File Details

ansible-vars-suppressed.yaml.txt
inventory-suppressed.txt

Steps to Reproduce

Create VMs according to the requirements. I'm using Openstack and - Ubuntu 22.04.4 LTS (GNU/Linux 5.15.0-106-generic x86_64) images.
Run IaC setup phase
Run IaC install phase
Test pod networking connectivity

Expected Behavior

Pod networking should work. Pods should be able to talk with each other.

Actual Behavior

Pod networking is broken. Pods cannot talk with each other.

Additional Context

I have this environment available if that would make troubleshooting easier.

References

No response

Code of Conduct

I agree to follow this project's Code of Conduct

IaC didn't update postgresql.conf

Hi,

I found that the IaC didn't update the max_connections and max_prepared_transactions on the postgresql.conf.

See below my config:

[test_k8s_default_pgsql]
sasnode03
[test_k8s_default_pgsql:vars]
postgres_server_name = "default"
postgres_ip="10.96.14.146"
postgres_server_version="12"
postgres_server_ssl="off"
postgres_administrator_login="postgres"
postgres_administrator_password="admin123"
postgres_system_setting_max_prepared_transactions="1024"
postgres_system_setting_max_connections="1024"

[postgres:children]
test_k8s_default_pgsql

After the IaC created the DB:

ps -efl | grep postgresql
0 S postgres 98448 1 0 80 0 - 53888 - 09:20 ? 00:00:00 /usr/lib/postgresql/12/bin/postgres -D /var/lib/postgresql/12/main -c config_file=/etc/postgresql/12/main/postgresql.conf

egrep 'max_connections|max_prepared_transactions' /etc/postgresql/12/main/postgresql.conf
max_connections = 100 # (change requires restart)
#max_prepared_transactions = 0 # zero disables the feature

Caution: it is not advisable to set max_prepared_transactions nonzero unless

Regards

Doc improvement: There is no run.sh script

In the main README page we have :

Run the run.sh script directly on your workstation

However there is no run.sh script in the whole project. I suggest we remove it or replace it with oss-k8s.sh

(IAC-716) Doc improvement : provide inventory sample without postgres Host groups

In order to disable the external postgres I had to remove and comment specific lines in the default/example inventory file.
People with limited knowledge of ansible might struggle to do it.
It would be nice so provide an inventory sample without postgres Host groups as it corresponds to deployment with internal postgres (which is likely the most common choice).
Thanks

kubernetes/toolbox

Hello,

the role kubernetes/toolbox (Step "Apply Google Cloud public signing key") stops with Error.
The URL https://packages.cloud.google.com/apt/doc/apt-key.gpg return an http 500 error.

BR
Markus

sample-inventory is not correct

Hi,

Could you please add the postgres_server_name paramètre on the examples/bare-metal/sample-inventory or fixe the roles/systems/postgres/tasks/main.yaml?
Because if it's not set the IaC raise the following error:

fatal: [10.96.14.146]: FAILED! => {"msg": "The conditional check 'postgres_server_name != "default"' failed. The error was: error while evaluating conditional (postgres_server_name != "default"): 'postgres_server_name' is undefined\n\nThe error appears to be in '/home/cloud-user/viya4-iac-k8s/roles/systems/postgres/tasks/main.yaml': line 201, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Create admin postgres user\n ^ here\n"}

Best regards

Cloud-init support for VMware infrastructure

Is your feature request related to a problem? Please describe.

A very nice feature is the possibility to provision the VMs starting with a cloud-image, like Ubuntu Cloud Image, stored in a Content Library and using the Cloud-Init to setup user and ssh authorized keys.

Describe the solution you'd like

Terraform provider for VMware Vsphere supports Cloud-Init and Content Library. A small Terraform code change will be required to support Cloud-Init.
The user used by the Ansible playbooks can be created with the Cloud-Init by the "terraform apply".
In this way there is no need to have the user already present into the VM template.

Describe alternatives you've considered

No response

Additional context

No response

Code of Conduct

I agree to follow this project's Code of Conduct

Remove references to delphi.unx.sas.com

In roles/systems/common/tasks/main.yaml and roles/kubernetes/common/tasks/main.yaml, certificate CA files are pulled from delphi.unx.sas.com. This is an internal SAS URL and should not be referenced. I don't believe the SAS CA is needed for anything in this project, so the whole "Adding SAS certs" task could likely be removed.

kubeadm fails if swap is enabled on the nodes

I had a failure of the kubadm init task because the swap was enabled on my K8s nodes.

Failed to run kubelet" err="failed to run Kubelet: running with swap on is not supported, please disable swap>

After running the code below (inspired from there) to disable the swap on the K8s nodes, I was able to have a sucessful kubeadm init task.

bash -c "cat << EOF > ~/viya4-iac-k8s/disable-swap.yaml
---
- hosts: k8s
  tasks:
  - name: Disable SWAP since kubernetes can't work with swap enabled (1/2)
    shell: |
      swapoff -a
  - name: Disable SWAP in fstab since kubernetes can't work with swap enabled (2/2)
    replace:
      path: /etc/fstab
      regexp: '^([^#].*?\sswap\s+sw\s+.*)$'
      replace: '# \1'
EOF"
cd ~/viya4-iac-k8s
ansible-playbook -i inventory disable-swap.yaml -b

We might want to add this task in the playbook.

Problem metrics-server error: metrics not available yet

I have an issue with the deployment of a bare-metal cluster using oss-k8s.sh. The deployment completes successfully, but when I run 'kubectl top nodes,' I encounter the error message 'error: metrics not available yet.'

Can you help me figure out what the issue might be? This could potentially cause problems with the deployment of Pods using HPA."

$ kubectl top nodes
error: metrics not available yet

$ kubectl get all -l app.kubernetes.io/name=metrics-server -n kube-system
NAME                                  READY   STATUS    RESTARTS   AGE
pod/metrics-server-84b8898677-mbjqt   1/1     Running   0          26m
pod/metrics-server-84b8898677-n8m5n   1/1     Running   0          26m
pod/metrics-server-84b8898677-zt6gr   1/1     Running   0          26m

NAME                     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/metrics-server   ClusterIP   10.96.71.122   <none>        443/TCP   26m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/metrics-server   3/3     3            3           26m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/metrics-server-84b8898677   3         3         3       26m
$

$ kubectl get nodes -o wide
NAME                  STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8s-sasnode-cp        Ready    control-plane   37m   v1.26.6   192.168.31.30   <none>        Ubuntu 20.04.6 LTS   5.4.0-164-generic   containerd://1.6.20
k8s-sasnode-system    Ready    <none>          37m   v1.26.6   192.168.31.35   <none>        Ubuntu 20.04.6 LTS   5.4.0-164-generic   containerd://1.6.20
k8s-sasnode-worker1   Ready    <none>          37m   v1.26.6   192.168.31.31   <none>        Ubuntu 20.04.6 LTS   5.4.0-164-generic   containerd://1.6.20
k8s-sasnode-worker2   Ready    <none>          37m   v1.26.6   192.168.31.32   <none>        Ubuntu 20.04.6 LTS   5.4.0-164-generic   containerd://1.6.20
k8s-sasnode-worker3   Ready    <none>          37m   v1.26.6   192.168.31.33   <none>        Ubuntu 20.04.6 LTS   5.4.0-164-generic   containerd://1.6.20
k8s-sasnode-worker4   Ready    <none>          37m   v1.26.6   192.168.31.34   <none>        Ubuntu 20.04.6 LTS   5.4.0-164-generic   containerd://1.6.20

: deploy

PLAY RECAP ***********************************************************************************************************************************************************************************************
k8s-sasnode-cp             : ok=53   changed=28   unreachable=0    failed=0    skipped=8    rescued=0    ignored=0
k8s-sasnode-nfs            : ok=15   changed=4    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0
k8s-sasnode-system         : ok=46   changed=20   unreachable=0    failed=0    skipped=6    rescued=0    ignored=0
k8s-sasnode-worker1        : ok=48   changed=21   unreachable=0    failed=0    skipped=5    rescued=0    ignored=0
k8s-sasnode-worker2        : ok=48   changed=21   unreachable=0    failed=0    skipped=5    rescued=0    ignored=0
k8s-sasnode-worker3        : ok=48   changed=21   unreachable=0    failed=0    skipped=5    rescued=0    ignored=0
k8s-sasnode-worker4        : ok=48   changed=21   unreachable=0    failed=0    skipped=5    rescued=0    ignored=0
localhost                  : ok=7    changed=6    unreachable=0    failed=0    skipped=4    rescued=0    ignored=0

Playbook run took 0 days, 0 hours, 10 minutes, 44 seconds
miércoles 18 octubre 2023  17:23:59 +0000 (0:00:00.099)       0:10:44.693 *****
===============================================================================
kubernetes/common : Reboot machines to enable added items like cgroup for cri, etc. ------------------------------------------------------------------------------------------------------------- 345.42s
kubernetes/control_plane/init/primary : Run kubeadm init ----------------------------------------------------------------------------------------------------------------------------------------- 46.35s
kubernetes/metrics/metrics-server : Deploy metrics-server ---------------------------------------------------------------------------------------------------------------------------------------- 37.51s
kubernetes/toolbox : Update apt package index and install kubelet, kubeadm, kubectl -------------------------------------------------------------------------------------------------------------- 34.38s
kubernetes/common : Update OS -------------------------------------------------------------------------------------------------------------------------------------------------------------------- 21.28s
kubernetes/node/init : Join compute nodes to the cluster ----------------------------------------------------------------------------------------------------------------------------------------- 17.11s
kubernetes/storage/nfs-subdir-external-provisioner : Setting up default storage for the cluster using nfs-subdir-external-provisioner ------------------------------------------------------------ 14.81s
kubernetes/cri/containerd : Installing containerd.io --------------------------------------------------------------------------------------------------------------------------------------------- 12.84s
kubernetes/cri/containerd : Uninstall old Docker/Containerd versions ----------------------------------------------------------------------------------------------------------------------------- 12.37s
kubernetes/storage/sig-storage-local-static-provisioner : Cloning sig-storage-local-static-provisioner ------------------------------------------------------------------------------------------- 10.71s
kubernetes/common : Update GRUB ------------------------------------------------------------------------------------------------------------------------------------------------------------------- 9.60s
kubernetes/storage/sig-storage-local-static-provisioner : Setting up local storage for the cluster using sig-storage-local-static-provisioner ----------------------------------------------------- 4.38s
kubernetes/toolbox : Download crictl -------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.73s
Gathering Facts ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 3.66s
kubernetes/common : Execute helm installation script ---------------------------------------------------------------------------------------------------------------------------------------------- 3.66s
kubernetes/common : Update limits to support SAS software ----------------------------------------------------------------------------------------------------------------------------------------- 3.44s
kubernetes/cni/calico : Install Operator ---------------------------------------------------------------------------------------------------------------------------------------------------------- 3.31s
kubernetes/toolbox : Install crictl --------------------------------------------------------------------------------------------------------------------------------------------------------------- 2.82s
kubernetes/node/baseline : Install nfs-common for nfs-subdir-external-provisioner ----------------------------------------------------------------------------------------------------------------- 2.66s
kubernetes/common : Install required packages for every machine ----------------------------------------------------------------------------------------------------------------------------------- 2.11s

small typo in README

Hi there
It is a very minor issue but I think there is a typo in this sentence : "At least 1 machine to host a PostgreSQL server (for the SAS Infrastructure Data Server component) if you are plan on using an external database server with your cluster"

feat: (IAC-1245) add a requirement in the README : all nodes should be at the same time.

For a successful kadmin init step, all the machines in the collection must have the same date and time.
While not explicitly listed this requirement is very important… if one of the node is 2 minutes behind the others, the installation will not succeed. When trying to connect to your API server you’ll see this kind of error message :

Unable to connect to the server: x509: certificate has expired or is not yet valid: current time 2022-09-05T09:11:23Z is before 2022-09-05T09:12:15Z

It might seem obvious that all servers are at the same time, however in a VMWare environment, on boot the VM picks up the time from the BIOS (which would be from the VMWare hosts) and real life experience taught us, that the time could drift on specific VMWare hosts…resulting in different date/time on the collection’s machines.

kube-vip configMap failed to be applied (task: kubernetes/loadbalancer/kube_vip)

When creating the cluster on bare metal with Ansible, it fails while trying to apply the kube-vip ConfigMap.

The template here does not specify the properties range-development: as this behavior was changed in this commit

# Reference Link: https://kube-vip.io/docs/usage/cloud-provider/#the-kube-vip-cloud-provider-configmap
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevip
  namespace: kube-system
data:
{% for address in kubernetes_loadbalancer_addresses %}
  {{ address }}
{% endfor %}

This is the error:

TASK [kubernetes/loadbalancer/kube_vip : Apply kube-vip Cloud Provider configMap] ***********************************************************************************
Wednesday 04 October 2023 09:49:27 +0000 (0:00:00.553) 0:01:58.174 *****
fatal: [control]: FAILED! => {"changed": true, "cmd": "kubectl apply -f /tmp/kube-vip-cm.yaml\n", "delta": "0:00:00.282581", "end": "2023-10-04 09:49:27.960276", "msg": "non-zero return code", "rc": 1, "start": "2023-10-04 09:49:27.677695", "stderr": "Error from server (BadRequest): error when creating "/tmp/kube-vip-cm.yaml": ConfigMap in version "v1" cannot be handled as a ConfigMap: json: cannot unmarshal string into Go struct field ConfigMap.data of type map[string]string", "stderr_lines": ["Error from server (BadRequest): error when creating "/tmp/kube-vip-cm.yaml": ConfigMap in version "v1" cannot be handled as a ConfigMap: json: cannot unmarshal string into Go struct field ConfigMap.data of type map[string]string"], "stdout": "", "stdout_lines": []}

If /tmp/kube-vip-cm.yaml is updated with range-development: the ConfigMap is created correctly

Problem: unable to create vms in vsphere/vcenter environment

Terraform Version Details

{
"terraform_version": ""1.7.4"",
"terraform_revision": "null",
"terraform_outdated": "true",
"provider_selections": "{}"
}

Terraform Variable File Details

vSphere

vsphere_server = "vcenter.mydomain.com" # Name of the vSphere server
vsphere_datacenter = "vSAN Datacenter" # Name of the vSphere data center
vsphere_datastore = "vsanDatastore" # Name of the vSphere data store to use for the VMs
vsphere_resource_pool = "default-rp" # Name of the vSphere resource pool to use for the VMs
vsphere_folder = "/SAS-Viya-4" # Name of the vSphere folder to store the vms
vsphere_template = "mytemplate" # Name of the VM template to clone to create VMs for the cluster
vsphere_network = "VM Network" # Name of the network to to use for the VMs

Ansible Variable File Details

No response

Steps to Reproduce

docker run --rm -it  --group-add root   --user $(id -u):$(id -g)   --env-file $HOME/.vsphere_creds.env --volume $(pwd):/workspace   viya4-iac-k8s apply setup

Expected Behavior

VMs created successfully

Actual Behavior

Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.postgresql["default"].vsphere_virtual_machine.server,
│ on modules/server/main.tf line 19, in resource "vsphere_virtual_machine" "server":
│ 19: resource "vsphere_virtual_machine" "server" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.nfs.vsphere_virtual_machine.static[0],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.system["system"].vsphere_virtual_machine.static[0],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.system["system"].vsphere_virtual_machine.static[1],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.node["cas"].vsphere_virtual_machine.static[1],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.node["cas"].vsphere_virtual_machine.static[0],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.node["cas"].vsphere_virtual_machine.static[2],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.node["compute"].vsphere_virtual_machine.static[0],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.node["stateless"].vsphere_virtual_machine.static[1],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.node["stateful"].vsphere_virtual_machine.static[0],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.node["stateless"].vsphere_virtual_machine.static[0],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.jump.vsphere_virtual_machine.static[0],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.control_plane["control_plane"].vsphere_virtual_machine.static[0],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.control_plane["control_plane"].vsphere_virtual_machine.static[1],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵
╷
│ Error: this virtual machine requires a client CDROM device to deliver vApp properties
│
│ with module.control_plane["control_plane"].vsphere_virtual_machine.static[2],
│ on modules/vm/main.tf line 24, in resource "vsphere_virtual_machine" "static":
│ 24: resource "vsphere_virtual_machine" "static" {
│
╵

Additional Context

No response

References

No response

Code of Conduct

I agree to follow this project's Code of Conduct

(IAC-1042) Issue with the url used to download the kubernetes-xenial APT repo public key.

Hi there

The following issue has been reported (bareos mode with Ubuntu 22.04).

I was able to reproduce the problem and it appears to be a known issue with the GPG key used for the kubernetes APT repository, see Ubuntu kubernetes-xenial public key is not available: NO_PUBKEY B53DC80D13EDEF05

One solution seems to be to change the URL from where the GPG key is pulled.

I was able to recover with the following steps (where I pull the GPG key from https://dl.k8s.io instead of https://packages.cloud.google.com) :

# uninstall OSS
./oss-k8s.sh uninstall

# Remove the old apt GPG key
ansible -i ~/viya4-iac-k8s/inventory k8s -m shell -a "rm -Rf /usr/share/keyrings/kubernetes-archive-keyring.gpg" -b

# Download the apt GPG from an alternative location
ansible -i ~/viya4-iac-k8s/inventory k8s -m shell -a "curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://dl.k8s.io/apt/doc/apt-key.gpg" -b

# Re-install OSS
./oss-k8s.sh install

I don't know if it is a transient issue or if there are better ways to fix the issue, but the playbook might be currently broken and require a urgent fix to use the alternative URL.

Remove or comment the kubernetes_vip_interface line in the ansible-vars.yaml sample ?

The kubernetes_vip_interface variable is no longer listed in the CONFIG-VARS.md file for vsphere, however it is still listed in the Ansible ansible-vars.yaml file in the same page and also in the sample.

docs: (IAC-1244) jmespath requirement not listed

I'm using the tool for a bare metal deployment using the ./oss-k8s.sh script (not the docker container) and I get the following failure because jmespath is not installed.

TASK [kubernetes/sas-iac-buildinfo : Create the sas-iac-buildinfo ConfigMap manifest file] *************************************************************
Friday 08 July 2022  07:27:35 +0000 (0:00:00.372)       0:00:00.982 ***********
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleError: You need to install "jmespath" prior to running json_query filter
fatal: [localhost]: FAILED! => {"changed": false, "msg": "AnsibleError: You need to install \"jmespath\" prior to running json_query filter"}`

I see it is part of the requirements.txt file and installed when building the docker image, but it should be part of the documented requirements when not using the docker image.

(IAC-590) Add the capability to create a local storage class for use with the SAS Information Catalog and VI products

DNS resolution fails. Tried both kube-vip and mitalLB

Terraform Version Details

No response

Terraform Variable File Details

No response

Ansible Variable File Details

No response

Steps to Reproduce

Deploy env using this project on vpshere. used kubevip and also redeployed using metallb but with same issue. deployed sas or ngnix and see same issue.

1 dispatcher.go:217] Failed calling webhook, failing closed validate.nginx.ingress.kubernetes.io: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1/ingresses?timeout=10s": Unknown Host

Expected Behavior

works as expected

Actual Behavior

Additional Context

No response

References

No response

Code of Conduct

I agree to follow this project's Code of Conduct

sassoftware / viya4-iac-k8s Goto Github PK

viya4-iac-k8s's Introduction

SAS Viya 4 Infrastructure as Code (IaC) for Open Source Kubernetes

Release Notes

Overview

Prerequisites

Machines

VMware vSphere

Physical or Virtual Machines

Networking

Technical Prerequisites

Script Requirements

Docker Requirements

Getting Started

Clone This Project

Customize Input Values

vSphere/vCenter Machines

SAS Viya IaC Configuration Files

Create and Manage Cluster Resources

Contributing

License

Additional Resources

viya4-iac-k8s's People

Contributors

Stargazers

Watchers

Forkers

viya4-iac-k8s's Issues

Terraform Version

Terraform Variable File

Steps to Reproduce

Expected Behavior

Actual Behavior

Additional Context

References

Terraform Version Details

Terraform Variable File Details

Ansible Variable File Details

Steps to Reproduce

Expected Behavior

Actual Behavior

Additional Context

References

Code of Conduct

Caution: it is not advisable to set max_prepared_transactions nonzero unless

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Code of Conduct

Terraform Version Details

Terraform Variable File Details

vSphere

Ansible Variable File Details

Steps to Reproduce

Expected Behavior

Actual Behavior

Additional Context

References

Code of Conduct

Terraform Version Details

Terraform Variable File Details

Ansible Variable File Details

Steps to Reproduce

Expected Behavior

Actual Behavior

Additional Context

References

Code of Conduct

Recommend Projects

Recommend Topics

Recommend Org