sassoftware / viya4-iac-gcp Goto Github PK

This project contains Terraform configuration files to provision infrastructure components required to deploy SAS Viya platform products on Google Cloud

License: Apache License 2.0

Shell 2.84% HCL 95.34% Dockerfile 1.82%

terrafrom sas-viya cloud-resources iac gke google-gcp

viya4-iac-gcp's Introduction

SAS Viya 4 Infrastructure as Code (IaC) for Google Cloud

Overview

This project contains Terraform scripts to provision Google Cloud infrastructure resources required to deploy SAS Viya 4 platform products. Here is a list of resources this project can create -

VPC Network and Network Firewalls

Managed Google Kubernetes Engine (GKE) cluster

System and User GKE Node pools with required Labels and Taints

Infrastructure to deploy SAS Viya platform CAS in SMP or MPP mode

Shared Storage options for SAS Viya platform - Google Filestore (ha) or NFS Server (standard)

Google Cloud SQL for PostgreSQL instance, optional

Once the cloud resources are provisioned, see the viya4-deployment repository to deploy SAS Viya 4 platform products. If you need more information on the SAS Viya 4 platform products refer to the official SAS® Viya® platform Operations documentation for more details.

Prerequisites

Operational knowledge of

Required

Access to a Google Cloud "Project" with these API Services enabled.
A Google Cloud Service Account.
Terraform or Docker
- Terraform
  - Terraform - v1.8.5
  - kubectl - v1.29.7
  - jq - v1.6
  - gcloud CLI - (optional - useful as an alternative to the Google Cloud Platform Portal) - v479.0.0
  - gke-gcloud-auth-plugin - (optional - only for provider based Kubernetes configuration files) - >= v1.26
- Docker
  - Docker

Getting Started

Clone this project

Run these commands in a Terminal session:

# clone this repository
git clone https://github.com/sassoftware/viya4-iac-gcp

# move to directory
cd viya4-iac-gcp

Authenticating Terraform to access Google Cloud

See Terraform Google Cloud Authentication for details.

Customize Input Values

Create a file named terraform.tfvars to customize any input variable value documented in the CONFIG-VARS.md file. For starters, you can copy one of the provided example variable definition files in ./examples folder. For more details on the variables declared refer to the CONFIG-VARS.md file.

NOTE: You will need to update the cidr_blocks in the variables.tf file to allow traffic from your current network. Without these rules, access to the cluster will only be allowed via the Google Cloud Console.

When using a variable definition file other than terraform.tfvars, see Advanced Terraform Usage for additional command options.

Creating and Managing the Cloud Resources

Create and manage the Google Cloud resources by either

using Terraform directly on your workstation, or
using a Docker container.

Troubleshooting

See troubleshooting page.

Contributing

We welcome your contributions! Please read CONTRIBUTING.md for details on how to submit contributions to this project.

License

This project is licensed under the Apache 2.0 License.

Additional Resources

Google Cloud

Google Cloud CLI - https://cloud.google.com/sdk/gcloud
Terraform on Google Cloud - https://cloud.google.com/docs/terraform
Terraform and Google Cloud Service Accounts - https://medium.com/@gmusumeci/how-to-create-a-service-account-for-terraform-in-gcp-google-cloud-platform-f75a0cf918d1
GKE intro - https://cloud.google.com/kubernetes-engine

Terraform

Google Provider - https://www.terraform.io/docs/providers/google/index.html
Google GKE - https://www.terraform.io/docs/providers/google/r/container_cluster

viya4-iac-gcp's People

Contributors

Stargazers

Watchers

Forkers

rippmn paataugrekhelidze dzhan85 porped76 carus11 anupkumarnaik indirection rajeshkumarkarnati azaadshatru balimidi24 jebdeb01 fiolbs arjun-github2075 ghas-results pranayinfo crbueno

viya4-iac-gcp's Issues

VPC requirements

We'd like to know if in the BYON approch we can provide a "Shared VPC" or not, have you ever tested it?

thanks

auto-scaling not working when initial node number is zero

While working with the minimal sample file, I noted that while the generic node pool scaled from 0 to 5 nodes. Zero was the default in the sample for terraform.tfvars, https://github.com/sassoftware/viya4-iac-gcp/blob/main/examples/sample-input-minimal.tfvars

However CAS never scaled from it's default of 0 and it's stuck in pending. Manually scaled up 4 nodes for MPP. JK and I tested auto-scaling in GCP last week and we watched CAS scale. However the default value of the cas node pool in other sample files is 1. Perhaps that makes a difference in the success and failure of auto-scaling in GCP?

Errors when creating GCP Cluster

Hello,

I'm facing issues when creating cluster on GCP. Is it normal?

Error: Post "https://IP_Address/api/v1/namespaces/kube-system/configmaps": dial tcp IP_Address:443: i/o timeout
│
│ with kubernetes_config_map.sas_iac_buildinfo[0],
│ on main.tf line 46, in resource "kubernetes_config_map" "sas_iac_buildinfo":
│ 46: resource "kubernetes_config_map" "sas_iac_buildinfo" {
│
╵
╷
│ Error: Post "https://IP_Address/api/v1/namespaces/kube-system/serviceaccounts": dial tcp IP_Address:443: i/o timeout
│
│ with module.kubeconfig.kubernetes_service_account.kubernetes_sa[0],
│ on modules/kubeconfig/main.tf line 27, in resource "kubernetes_service_account" "kubernetes_sa":
│ 27: resource "kubernetes_service_account" "kubernetes_sa" {
│
╵
╷
│ Error: Post "https://IP_Address/apis/rbac.authorization.k8s.io/v1/clusterrolebindings": dial tcp IP_Address:443: i/o timeout
│
│ with module.kubeconfig.kubernetes_cluster_role_binding.kubernetes_crb[0],
│ on modules/kubeconfig/main.tf line 36, in resource "kubernetes_cluster_role_binding" "kubernetes_crb":
│ 36: resource "kubernetes_cluster_role_binding" "kubernetes_crb" {

Any idea?
Hossein

Update notes that keeps refer the create_postgres variable

in the notes for database_subnet_cidr we still refer to the create_postgres : "Only used when create_postgres=true"
However my understanding is that this variable is now deprecated. Maybe it could be replaced with "Only used with external postgres"
thanks

Error out early if Kubernetes version doesn't match expected format

In variables.tf is:

variable "kubernetes_channel" {
  default = "UNSPECIFIED"
}

However, this doesn't seem to be honored. Terraform apply fails with:
Error: error creating NodePool: googleapi: Error 400: Auto_upgrade and auto_repair cannot be false when release_channel REGULAR is set., badRequest

Terraform state indicates channel is indeed REGULAR:
terraform.tfstate: "channel": "REGULAR"

GKE VM disks seem to stay after a Terraform destroy

I've seen that a lot of disks stay even after a Terraform destroy. Is it expected? Is there an option that I missed during the destroy operation?
I start one env a day and delete it at the end of the day. Recently I had to clean hundreds of disks in Compute Engine => Disks

"Error: Invalid count argument" After failure applying

I pulled the latest code and ran a terraform apply, which failed with:

Error: error creating NodePool: googleapi: Error 400: Auto_upgrade and auto_repair cannot be false when release_channel REGULAR is set., badRequest

Error: Request "Create IAM Members roles/logging.logWriter serviceAccount:[email protected] for \"project \\\"rdorgasub5\\\"\"" returned error: Error retrieving IAM policy for project "rdorgasub5": googleapi: Error 403: The caller does not have permission, forbidden

I've not looked into those errors yet, but attempting to do another apply, with no changes to anything, proceeds to fail with:

Error: Invalid count argument

  on .terraform/modules/gke.gcloud_delete_default_kube_dns_configmap/main.tf line 63, in resource "null_resource" "module_depends_on":
  63:   count = length(var.module_depends_on) > 0 ? 1 : 0

The "count" value depends on resource attributes that cannot be determined
until apply, so Terraform cannot predict how many instances will be created.
To work around this, use the -target argument to first apply only the
resources that the count depends on.

exceeded quota: gke-resource-quotas when deploying on a GKE cluster created with the iac-gcp tool

Hello
I'd like to report an issue that I'm seeing when using the viya4-deployment container to deploy viya in a GKE cluster and I see a task failure that seems related to the quotas.
Note that the same issue happens when I'm doing a manual deployment. (I created the cluster with the iac tool).

TASK [vdm : Deploy Manifest] *************************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "kubectl --kubeconfig /config/kubeconfig apply-n viya4gcp --selector=\"sas.com/admin=namespace\" -f /data/raphpoumarv4fa-gke/viya4gcp/site.yaml --prune\n", "delta": "0:01:55.948360", "end": "2021-05-07 07:51:51.920042", "msg": "non-zero return code"          , "rc": 1, "start": "2021-05-07 07:49:55.971682", "stderr": "Error from server (Forbidden): error when  creating \"/data/raphpoumarv4fa-gke/viya4gcp/site.yaml\": ingresses.networking.k8s.io \"sas-model-publish\" is forbidden: exceeded quota: gke-resource-quotas, requested: count/ingresses.networking.k8s.io          =1, used: count/ingresses.networking.k8s.io=100, limited: count/ingresses.networking.k8s.io=100\nError           from server (Forbidden): error when creating \"/data/raphpoumarv4fa-gke/viya4gcp/site.yaml\": ingress          es.networking.k8s.io \"sas-model-repository\" is forbidden: exceeded quota: gke-resource-quotas, reque          sted: count/ingresses.networking.k8s.io=1, used: count/ingresses.networking.k8s.io=100, limited: count          /ingresses.networking.k8s.io=100\nError from server (Forbidden): error when creating \"/data/raphpouma          rv4fa-gke/viya4gcp/site.yaml\": ingresses.networking.k8s.io \"sas-model-studio-app\" is forbidden: exceeded quota: gke-resource-quotas, requested: count/ingresses.networking.k8s.io=1, used: count/ingresse          s.networking.k8s.io=100, limited: count/ingresses.networking.k8s.io=100\nError from server (Forbidden)          : error when creating \"/data/raphpoumarv4fa-gke/viya4gcp/site.yaml\": ingresses.networking.k8s.io \"s          as-natural-language-conversations\" is forbidden: exceeded quota: gke-resource-quotas, requested: coun          t/ingresses.networking.k8s.io=1, used: count/ingresses.networking.k8s.io=100, limited: count/ingresses          .networking.k8s.io=100\nError from server (Forbidden): error when creating \"/data/raphpoumarv4fa-gke/          viya4gcp/site.yaml\": ingresses.networking.k8s.io \"sas-natural-language-generation\" is forbidden: ex          ceeded quota: gke-resource-quotas, requested: count/ingresses.networking.k8s.io=1, used: count/ingress          es.networking.k8s.io=100, limited: count/ingresses.networking.k8s.io=100\nError from server (Forbidden          ): error when creating \"/data/raphpoumarv4fa-gke/viya4gcp/site.yaml\": ingresses.networking.k8s.io \"          sas-natural-language-understanding\" is forbidden: exceeded quota: gke-resource-quotas, requested: cou          nt/ingresses.networking.k8s.io=1, used: count/ingresses.networking.k8s.io=100,

It seems to be a transient issue as it disappears when I run the docker viya4-deployment --tags "baseline,viya,install" a second time.
Thanks
Raphael

Add in Viya 4 reference section

Add in the following information:

Once the cloud resources are provisioned, see the viya4-deployment repo to deploy SAS Viya 4 products. If you need more information on the SAS Viya 4 products refer to the official SAS® Viya® 4 Operations documentation for more details.

With links etc. like we've have in the Azure repo.

node pool autoscaling not working as expected

I have provisioned my CAS node pool like this :

  node_pools = {
    cas = {
        "vm_type"      = "n1-highmem-16"
        "os_disk_size" = 200
        "min_nodes"    = 1
        "max_nodes"    = 5
        "node_taints"  = ["workload.sas.com/class=cas:NoSchedule"]
        "node_labels" = {
        "workload.sas.com/class" = "cas"
        }
        "local_ssd_count" = 0
    },

When I run the Terraform apply, I get 1 CAS node (as expected).
But later on, when I run the Viya deployment (configured with CAS MPP and 3 workers), all my CAS Worker pods remain pending because they can't find a suitable node.
I was expecting to see more CAS nodes being automatically provisioned by the autoscaler to host my CAS worker pods (that is what happens with AKS).
However in the K8s event log I can see that the pending CAS pods did NOT trigger the autoscaler:
"pod didn't trigger scale-up: 4 max cluster cpu, memory limit reached, 2 max node group size reached"

It seems to be related to the general "auto-provisionning" settings as explained there :
When I disabled the "Node auto-provisioning" settings in the GCP web console, suddenly additional CAS nodes were provisioned and all my CAS pods were successfully allocated.

Looking at the main.tf code I found this line :

cluster_autoscaling = { "enabled": true, "max_cpu_cores": 1, "max_memory_gb": 1, "min_cpu_cores": 1, "min_memory_gb": 1 }
which I think corresponds to the "Node auto-provisioning" settings.
I think we should either change the values or document/expose this Cluster settings as the current settings seems to contradict what is expected from the Node Pools min and max settings.

Thanks

Istio

Hello,

This customer is looking to implement Istio (Managed Anthos Service Mesh for GKE) for ingress and load balancing.
As per documentation, Istio is not supported.
Is there a way to deploy Viya on a cluster that uses Istio for ingress and load balancing instead of NGINX?

Thx a lot!

kubernetes_channel selection not being honored

I deleted my git code and pulled brand new code from https://github.com/sassoftware/viya4-iac-gcp.git this morning. I cd'd into that directory, created a new terraform.tfvars and ran terraform. I cannot seem to change K8s version to any 1.19 release. I supplied the following values in terraform.tfvars:

kubernetes_version = "1.19"
kubernetes_channel = "RAPID"

From @thpang "...did find an issue on the GCP side of things. I'll create an issue. We don't currently honor the channel selection so you may end up with a version 1.18 of k8s. Looking into this now."

Warning: "default_secret_name" is no longer applicable for Kubernetes v1.24.0 and above

Hello
When I run the IAC tool for GCP with kubernetes_version= "1.24.7-gke.900", I get the following warning after the terraform apply command :

module.kubeconfig.local_file.kubeconfig: Creation complete after 0s [id=15fa29454743e2d3ebf2c63a258abc831d2d9f1d]
╷
│ Warning: "default_secret_name" is no longer applicable for Kubernetes v1.24.0 and above
│
│   with module.kubeconfig.kubernetes_service_account.kubernetes_sa[0],
│   on modules/kubeconfig/main.tf line 68, in resource "kubernetes_service_account" "kubernetes_sa":
│   68: resource "kubernetes_service_account" "kubernetes_sa" {
│
│ Starting from version 1.24.0 Kubernetes does not automatically generate a
│ token for service accounts, in this case, "default_secret_name" will be
│ empty
╵

Could you please confirm that the IaC tool is supported with Kubernetes 1.24 ?
Is there anything I could do to avoid this message ?
thanks

Error 403: Permission iam.serviceAccounts.create is required and Error waiting for Create Service Networking Connection

Hi there
Everything was working earlier this week but when I git cloned the latest code, now I get these error message when I run the Terraform apply:

Error: Error creating service account: googleapi: Error 403: Permission iam.serviceAccounts.create is required to perform this operation on project projects/sas-gelsandbox., forbidden

Error: Error waiting for Create Service Networking Connection: Error code 3, message: An IP range in the peer network (192.168.0.0/24) overlaps with an IP range in the local network (192.168.0.0/23).

See below my tfvars file:
Am I missing something ?
Thanks

# !NOTE! - These are only a subset of variables.tf provided for sample.
# Customize this file to add any variables from 'variables.tf' that you want
# to change their default values.

# ****************  REQUIRED VARIABLES  ****************
# These required variables' values MUST be provided by the User
prefix                  = "raphpoumarv4"
location                = "us-east1-b" # e.g., "us-east1-b""
project                 = "sas-gelsandbox"
service_account_keyfile = "/home/cloud-user/.viya4-tf-gcp-service-account.json"
ssh_public_key          = "~/.ssh/id_rsa.pub"
#
# ****************  REQUIRED VARIABLES  ****************

# Source address ranges to allow client admin access to the cloud resources
default_public_access_cidrs = ["x.x.x.x/16"] # e.g., ["x.x.x.x/32"]

# add labels to the created resources
tags = { "resourceowner" = "raphpoumar" , project_name = "sasviya4gcp", environment = "dev", gel_project = "deployviya4gcp" } # e.g., { "key1" = "value1", "key2" = "value2" }

# Postgres config
create_postgres                  = true # set this to "false" when using internal Crunchy Postgres
postgres_ssl_enforcement_enabled = false
postgres_administrator_password  = "mySup3rS3cretPassw0rd"

# GKE config
default_nodepool_node_count = 2
default_nodepool_vm_type    = "e2-standard-8"

# Node Pools config
node_pools = {
cas = {
    "vm_type"      = "n1-highmem-16"
    "os_disk_size" = 200
    "min_nodes"    = 1
    "max_nodes"    = 5
    "node_taints"  = ["workload.sas.com/class=cas:NoSchedule"]
    "node_labels" = {
    "workload.sas.com/class" = "cas"
    }
    "local_ssd_count" = 0
},
compute = {
    "vm_type"      = "n1-highmem-16"
    "os_disk_size" = 200
    "min_nodes"    = 1
    "max_nodes"    = 1
    "node_taints"  = ["workload.sas.com/class=compute:NoSchedule"]
    "node_labels" = {
    "workload.sas.com/class"        = "compute"
    "launcher.sas.com/prepullImage" = "sas-programming-environment"
    }
    "local_ssd_count" = 0
},
connect = {
    "vm_type"      = "n1-highmem-16"
    "os_disk_size" = 200
    "min_nodes"    = 1
    "max_nodes"    = 1
    "node_taints"  = ["workload.sas.com/class=connect:NoSchedule"]
    "node_labels" = {
    "workload.sas.com/class"        = "connect"
    "launcher.sas.com/prepullImage" = "sas-programming-environment"
    }
    "local_ssd_count" = 0
},
stateless = {
    "vm_type"      = "e2-standard-16"
    "os_disk_size" = 200
    "min_nodes"    = 1
    "max_nodes"    = 2
    "node_taints"  = ["workload.sas.com/class=stateless:NoSchedule"]
    "node_labels" = {
    "workload.sas.com/class" = "stateless"
    }
    "local_ssd_count" = 0
},
stateful = {
    "vm_type"      = "e2-standard-8"
    "os_disk_size" = 200
    "min_nodes"    = 1
    "max_nodes"    = 3
    "node_taints"  = ["workload.sas.com/class=stateful:NoSchedule"]
    "node_labels" = {
    "workload.sas.com/class" = "stateful"
    }
    "local_ssd_count" = 0
}
}

# Jump Box
create_jump_public_ip = true
jump_vm_admin         = "jumpuser"

# Storage for SAS Viya CAS/Compute
storage_type = "ha"

Add public IP or hostname for GCP external PG

I'm creating a cluster using docker-dope.cyber.sas.com/viya4-iac-gcp:0.2.0.
The external PG has private IP only.

Other external PGs created by IAC on AKE or EKS have public IP or hostname.

enable_cluster_autoscaling does not disable node auto-provisionning

When I set the enable_cluster_autoscaling to false (with release 0.3.0)

# Configuring Node auto-provisioning
enable_cluster_autoscaling = false

it does not seem to disable the node auto-provisionning : I see gke-nap nodes being created when I run the deployment and in the GKE console I can see these parameters:

Thanks

feat: (IAC-1111) HTTP/HTTPS proxies instead of Cloud NAT

In the documentation (https://github.com/sassoftware/viya4-iac-gcp/blob/main/docs/CONFIG-VARS.md#use-existing), there is a reference table:

Name	Description	Type	Default	Notes
nat_address_name	Name of existing IP address for existing Cloud NAT	string	null	If not given, a Cloud NAT and associated external IP will be created

NAT and associated external IP mandatory for deployment of Viya 4 2021.1?
Even in the case when clients are inside same VPC as K8s cluster?
if yes, which is the reason?
if not, how can we eliminate the action of “If not given, a Cloud NAT and associated external IP will be created”, avoided?
The plan is to use inside of Cloud NAT, hence the questions above.
Looking at https://github.com/sassoftware/viya4-iac-gcp/blob/main/network.tf we saw:

module "nat_address" {
  count        = length(var.nat_address_name) == 0 ? 1 : 0
  source       = "terraform-google-modules/address/google"
  version      = "3.0.0"
  project_id   = var.project
  region       = local.region
  address_type = "EXTERNAL"
  names = [
    "${var.prefix}-nat-address"
  ]
}

module "cloud_nat" {
  count         = length(var.nat_address_name) == 0 ? 1 : 0
  source        = "terraform-google-modules/cloud-nat/google"
  version       = "2.0.0"
  project_id    = var.project
  name          = "${var.prefix}-cloud-nat"
  region        = local.region
  create_router = true
  router        = "${var.prefix}-router"
  network       = module.vpc.network_self_link
  nat_ips       = module.nat_address.0.self_links
}

Are any changes required there?

Error: namespaces "cloud-sql-proxy" not found

The TF plan command was successful but when running the TF apply command with external postgres, I get this error:

module.postgresql.kubernetes_deployment.sql_proxy_deployment[0]: Still creating... [10m0s elapsed]

Error: namespaces "cloud-sql-proxy" not found

  on modules/postgresql/main.tf line 93, in resource "kubernetes_secret" "cloudsql-instance-credentials":
  93: resource "kubernetes_secret" "cloudsql-instance-credentials" {



Error: Waiting for rollout to finish: 0 of 2 updated replicas are available...

  on modules/postgresql/main.tf line 104, in resource "kubernetes_deployment" "sql_proxy_deployment":
 104: resource "kubernetes_deployment" "sql_proxy_deployment" {



Error: namespaces "cloud-sql-proxy" not found

  on modules/postgresql/main.tf line 153, in resource "kubernetes_service" "sql_proxy_service":
 153: resource "kubernetes_service" "sql_proxy_service" {

When I look at the cluster workload, I see that there is indeed an issue with the sql-proxy deployment.

Looking at the pod status, it seems like there was an error to mount a secret.
I'm deleting everything now and will retry.

(IAC-875) Revert location value from local.location -> local.zone

Terraform Version Details

{
  "terraform_version": "\"1.3.7\"",
  "terraform_revision": "null",
  "terraform_outdated": "false",
  "provider_selections": "{}"
}

Terraform Variable File Details

No response

Steps to Reproduce

When storage type is set to "ha" there is an issue creating the jump server.

Expected Behavior

Jump server should be created without any issues

Actual Behavior

Jump server failure. I don't have the exact issue as the person reporting did not provide and working off of a teams conversation.

Additional Context

No response

References

https://github.com/sassoftware/viya4-iac-gcp/pull/144/files#diff-dc46acf24afd63ef8c556b77c126ccc6e578bc87e3aa09a931f33d9bf2532fbbR68

Code of Conduct

I agree to follow this project's Code of Conduct

Reference to the new KubernetesVersion.md in the CONFIG-VARS.md file.

Shouldn't the new doc on GCP version and channel selections https://github.com/sassoftware/viya4-iac-gcp/blob/main/docs/user/KubernetesVersions.md be referenced somewhere in the doc? I expected to find it under the following location,
https://github.com/sassoftware/viya4-iac-gcp/blob/main/docs/CONFIG-VARS.md#general

fix network module output for byo scenario

When specifying values for the "subnet_names" input variable, this error is thrown on terraform plan/apply:

on modules/network/outputs.tf line 11, in output "subnets":
11: gke : var.create_subnets ? element(coalescelist(google_compute_subnetwork.gke_subnet,[" "]),0) : data.google_compute_subnetwork.gke_subnet.0
|----------------
| data.google_compute_subnetwork.gke_subnet[0] is object with 11 attributes
| google_compute_subnetwork.gke_subnet is empty tuple
| var.create_subnets is false

The true and false result expressions must have consistent types. The given
expressions are string and object, respectively.

all nodepools were created under us-east1-b even though I set to use us-east1-d

I set location = us-east1-d in my terraform.tfvars file.

When a cluster was created, jump server was created under us-east1-d but all others were created under us-east1-b.

Here is comment from Thomas:

the code is working as designed right now. The GKE compute modules or nodes are placed in the first zone which seems to be calculated from the region determined from your location so if you set the location variable with the following value:

location = us-east1-d

The following will be set internally to the region used in the kuberentes cluster location to:

zone = us-east1-b

Since that is the first zone from the region us-east1

issue while upgrading k8s from 1.21.x to 1.22.x

Terraform Version Details

No response

Terraform Variable File Details

No response

Steps to Reproduce

I created a cluster using this:

kubernetes_channel = "REGULAR"

K8S version used was v1.21.6-gke.1500.

I wanted to upgrade K8S version to 1.22.x.
So, I updated tfvars with these and applied:

kubernetes_version = "1.22.4-gke.1501"
kubernetes_channel = "RAPID"

Applying tfvars was done successfully but the K8S version of my cluster wasn't updated.

Expected Behavior

I expected that my cluster got updated for 1.22.x.

Actual Behavior

No changes. Your infrastructure matches the configuration.
Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
Outputs:
cluster_endpoint =
cluster_name = "jkgke2-gke"
...

Additional Context

No response

References

No response

Code of Conduct

I agree to follow this project's Code of Conduct

Rename "enable_cluster_autoscaling" vars into "enable_nodes_autoprovisioning"

According to the Google doc:
Node auto-provisioning automatically manages a set of node pools on the user's behalf. Without node auto-provisioning, GKE considers starting new nodes only from the set of user created node pools. With node auto-provisioning, new node pools can be created and deleted automatically.

If enable_cluster_autoscaling is set to false, then "standard" autoscaling will happen but based on the node pool autoscaling.

It is not a major issue but the name of the variable is wrong (confusing at best).
The variable drives the "node auto provisioning" feature which is different from the node pools autoscaling (determined by the min and max nodes set for the node pool).

So I would suggest to rename the variable enable_cluster_autoscaling var into enable_nodes_autoprovisioning as well as the 2 related variables : cluster_autoscaling_max_cpu_cores and cluster_autoscaling_max_memory_gb to keep the variables consistent.

Thanks

Modify postgres version list description

The description for postgres_server_version in https://github.com/sassoftware/viya4-iac-gcp/blob/main/docs/CONFIG-VARS.md#postgres is:

Valid values are 9.6, 10, 11, and 12

The description for postgres_server_version in https://github.com/sassoftware/viya4-iac-gcp/blob/main/variables.tf is:

The version of PostgreSQL to use. Valid values are 9.6, 10, 11, and 12.

The information in both of these descriptions is accurate, however Viya does not support anything lower than v11. To help avoid confusion or user error, I suggest only v11 and v12 are mentioned.

Issue Pulling Images From Container Registry

Terraform Version Details

From the Dockerfile:

ARG TERRAFORM_VERSION=1.0.0
ARG GCP_CLI_VERSION=342.0.0
FROM hashicorp/terraform:$TERRAFORM_VERSION as terraform

FROM google/cloud-sdk:$GCP_CLI_VERSION
ARG KUBECTL_VERSION=1.19.9

Terraform Variable File Details

N/A

Steps to Reproduce

Create an Artifact Registry and populate it with a copy of a SAS Viya4 deployment
Run viya4-iac-gcp to stand up infra and viya4-deployment for ingress/etc
Deploy Viya4 using the Deployment Operator

Expected Behavior

Kubernetes Nodes can pull container images from the Artifact Registry during the Viya deployment process

Actual Behavior

Pods hang with errors such as:

Warning Failed 13m (x4 over 14m) kubelet Failed to pull image "gcr.io/viya4-poc/sasviyav4-9cl4s-0-lts-2021-1-20210813-1628876294130/viya-4-x64_oci_linux_2-docker/sas-orchestration:1.41.6-20210513.1620868672630": rpc error: code = Unknown desc = Error response from daemon: pull access denied for gcr.io/viya4-poc/sasviyav4-9cl4s-0-lts-2021-1-20210813-1628876294130/viya-4-x64_oci_linux_2-docker/sas-orchestration, repository does not exist or may require 'docker login': denied: Permission denied for "1.41.6-20210513.1620868672630" from request "/v2/viya4-poc/sasviyav4-9cl4s-0-lts-2021-1-20210813-1628876294130/viya-4-x64_oci_linux_2-docker/sas-orchestration/manifests/1.41.6-20210513.1620868672630"

Additional Context

During the viya4-iac-deploy, a Service Account is created with a name similar to [email protected], which is assigned to the Kubernetes Nodes. The SA is given the following roles:

Logs Writer
Monitoring Metric Writer
Monitoring Viewer
Stackdriver Resource Metadata Writer

I'm thinking it should also have Storage Object Viewer, in order to be able to pull containers from a GCP registry? (At least this seems to be one way to work past the issue...)

Thanks for the help! -Michael

References

No response

Code of Conduct

I agree to follow this project's Code of Conduct

clarify byo subnet variable syntax

The examples given for specifying existing in docs/CONFIG-VARS.md and examples/sample-input-byo.tfvars need to clarify that the "name" of the secondory ip ranges is requred, not the cidr range.

gcloud auth login Authorization Error with pinned gcloud version

Hi there

When I try to use the IaC docker image to login into my GCP project, with the code below:

docker container  run -it --rm \
        -v $HOME/.config/gcloud/:/root/.config/gcloud/ \
        --entrypoint gcloud \
        viya4-iac-gcp \
        auth login

It generates the authentication URL for me, but when I click on it, then I get this error:

The problem was reported there : https://issuetracker.google.com/issues/246996424?pli=1
From what I read there, I tried to install the current version of the gcloud CLI (405.0.0) and it generates a working URL for me, so the issue seems to be linked to the gcloud version used in the docker image.

I'm wondering if the IaC can work with the gcloud version that is currently pinned to "342.0.0" in the Docker file.

ARG GCP_CLI_VERSION=342.0.0

Internal LB instead of external

Is the Terraform project creating an external Load Balancer?
It is possible to convert it as an internal Load Balancer, if Viya clients(browsers etc...) are inside the same VPC as the K8s cluster?

cloud-init scripts on GCP are not working correctly for /volumes/pvs directory

I created a cluster using docker-dope.cyber.sas.com/viya4-iac-gcp:1.0.0.
I got a problem When I tried to install baseline.
I keep getting this error:

TASK [nfs-subdir-external-provisioner : Deploy nfs-subdir-external-provisioner] ***
fatal: [localhost]: FAILED! => {"changed": false, "command": "/usr/local/bin/helm --kubeconfig /config/kubeconfig --namespace=nfs-client --version=4.0.8 --repo=https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/ upgrade -i --reset-values --wait --create-namespace -f=/tmp/tmpj9av9ozd.yml nfs-subdir-external-provisioner nfs-subdir-external-provisioner", "msg": "Failure when executing Helm command. Exited 1.\nstdout: Release "nfs-subdir-external-provisioner" does not exist. Installing it now.\n\nstderr: WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /config/kubeconfig\nWARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /config/kubeconfig\nError: timed out waiting for the condition\n", "stderr": "WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /config/kubeconfig\nWARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /config/kubeconfig\nError: timed out waiting for the condition\n", "stderr_lines": ["WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /config/kubeconfig", "WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /config/kubeconfig", "Error: timed out waiting for the condition"], "stdout": "Release "nfs-subdir-external-provisioner" does not exist. Installing it now.\n", "stdout_lines": ["Release "nfs-subdir-external-provisioner" does not exist. Installing it now."]}

IaC 3.2.0 fails to build the cluster with K8s 1.23.12

When using the latest IaC version (3.2.0) with K8s 1.23.12-gke.100, I kept having the same failure (after 2 attempts at different times) :

module.gke.google_container_node_pool.pools["compute"]: Still creating... [26m51s elapsed]
module.gke.google_container_node_pool.pools["cas"]: Still creating... [26m51s elapsed]
╷
│ Error: error creating NodePool: googleapi: Error 400: Service account "tf-gke-raphpoumar-v4-g-i0ne@sas-gelsandbox.iam.gserviceaccount.com" does not exist., badRequest
│
│   with module.gke.google_container_node_pool.pools["compute"],
│   on .terraform/modules/gke/modules/private-cluster/cluster.tf line 207, in resource "google_container_node_pool" "pools":
│  207: resource "google_container_node_pool" "pools" {
│
╵
╷
│ Error: error creating NodePool: googleapi: Error 400: Service account "tf-gke-raphpoumar-v4-g-i0ne@sas-gelsandbox.iam.gserviceaccount.com" does not exist., badRequest
│
│   with module.gke.google_container_node_pool.pools["cas"],
│   on .terraform/modules/gke/modules/private-cluster/cluster.tf line 207, in resource "google_container_node_pool" "pools":
│  207: resource "google_container_node_pool" "pools" {
│
╵
╷
│ Error: error creating NodePool: googleapi: Error 400: Service account "tf-gke-raphpoumar-v4-g-i0ne@sas-gelsandbox.iam.gserviceaccount.com" does not exist., badRequest
│
│   with module.gke.google_container_node_pool.pools["default"],
│   on .terraform/modules/gke/modules/private-cluster/cluster.tf line 207, in resource "google_container_node_pool" "pools":
│  207: resource "google_container_node_pool" "pools" {
│
╵
╷
│ Error: Error waiting for creating GKE NodePool: All cluster resources were brought up, but: only 0 nodes out of 2 have registered; cluster may be unhealthy.
│
│   with module.gke.google_container_node_pool.pools["stateless"],
│   on .terraform/modules/gke/modules/private-cluster/cluster.tf line 207, in resource "google_container_node_pool" "pools":
│  207: resource "google_container_node_pool" "pools" {

The first time only the CAS node pool was created, the second time, only the stateless node pool was created.
I'm pretty sure it was working fine with the previous IaC version.

I was able to have a successful build with IaC 3.2.0 changing the K8s version to 1.23.8-gke.1900 (which was listed in the tested version for the latest IaC codebase version), but I thought I'll report the problem.

Thanks

azure label for system node pool appearing in GCP main.tf file

This line should probably be updated/removed

"node_labels" = merge(var.tags, var.default_nodepool_labels,{"kubernetes.azure.com/mode"="system"})

it does not seem to hurt but I checked and my GKE system nodes get this "azure" label.

Node labels of the default nodepool

The node labels of the default nodepool are generated as follows:

"node_labels" = merge(var.tags, var.default_nodepool_labels,{"kubernetes.azure.com/mode"="system"})

it may be appropriate to set the value to something GCP related.

IAC for google should allow HA option (Google Fileshare) to specify tier

Looking to document the filestore_tier variable and it's values for the google_filestore_instance resource being used.

Split terraform and docker instructions into separate docs

allow authentication through VM Service Accounts

By default, the google terraform provider automatically picks up any Service Accounts that are associated with GCP VMs.
We currently always require a persisted JSON file with the Service Account information.
We should not block the look-through to the associated Service Accounts, if available.

This change is needed in support of Private Clusters (#97)

Update node image type

The IaC scripts hardcode the node image to COS (container optimized OS) with Docker. Google has standardized on Container-Optimized OS with containerd

https://cloud.google.com/kubernetes-engine/docs/concepts/using-containerd

Can we change accordingly ? Most cloud providers have abandoned Docker anyway in favour of containerd

External PostgreSQL cannot be deleted by terraform when a database has been created externally

If a user, for example pgadmin, created a database and one tried to delete the external postgres instance with terraform it will fail with the following error

Error: Error, failed to deleteuser pgadmin in instance thpang-test-std-pgsql-b5cf6c33:

Looking for a way to clean this up in the terraform code, but a short-term solutions seem to be simply running terraform destroy a second time.

Private cluster creation

Cluster location type: "Regional" insted "Zonal"

I set in my terraform.tfvars file.
location = "europe-central2-c"

I assumed a single-zone cluster would be created instead regional.

After create i have:

Can I change the availability type?

Error: googleapi: Error 400: Master version "1.18.16-gke.1200" is unsupported., badRequest

Hi there
I hit this error, this morning after the terraform apply:
Error: googleapi: Error 400: Master version "1.18.16-gke.1200" is unsupported., badRequest

It looks like our default version is not supported any longer.
https://cloud.google.com/kubernetes-engine/docs/release-notes#March_29_2021
I was able to deploy with 1.18.16-gke.2100.

Instances where google_compute_subnetwork generates an output error

Found that in some instances the output for subnets can still produce this problem:

Error: Invalid index

on modules/network/outputs.tf line 11, in output "subnets":
11: gke : var.create_subnets ? google_compute_subnetwork.gke_subnet.0 : data.google_compute_subnetwork.gke_subnet.0
|----------------
| google_compute_subnetwork.gke_subnet is empty tuple

The given key does not identify an element in this collection value.

Error: Invalid index

on modules/network/outputs.tf line 12, in output "subnets":
12: misc : var.create_subnets ? google_compute_subnetwork.misc_subnet.0 : data.google_compute_subnetwork.misc_subnet.0
|----------------
| google_compute_subnetwork.misc_subnet is empty tuple

The given key does not identify an element in this collection value.

GKE cluster going straight to "Repairing mode"

Hi there
Each time I use the viya4-iac-gcp,I'm using the sample-input var file and I end up with a "not ready" GKE cluster going straight to "Repairing mode".
I think I might be hitting this : https://maumau.medium.com/why-is-your-gke-cluster-so-slow-443f7a4b6af9
Did anyone face the same issue ?
My guess is that anyone using the default sample-input var files would get the same issue.
Thanks

sassoftware / viya4-iac-gcp Goto Github PK

viya4-iac-gcp's Introduction

SAS Viya 4 Infrastructure as Code (IaC) for Google Cloud

Overview

Prerequisites

Required

Terraform

Docker

Getting Started

Clone this project

Authenticating Terraform to access Google Cloud

Customize Input Values

Creating and Managing the Cloud Resources

Troubleshooting

Contributing

License

Additional Resources

Google Cloud

Terraform

viya4-iac-gcp's People

Contributors

Stargazers

Watchers

Forkers

viya4-iac-gcp's Issues

Terraform Version Details

Terraform Variable File Details

Steps to Reproduce

Expected Behavior

Actual Behavior

Additional Context

References

Code of Conduct

Terraform Version Details

Terraform Variable File Details

Steps to Reproduce

Expected Behavior

Actual Behavior

Additional Context

References

Code of Conduct

Terraform Version Details

Terraform Variable File Details

Steps to Reproduce

Expected Behavior

Actual Behavior

Additional Context

References

Code of Conduct

Recommend Projects

Recommend Topics

Recommend Org