
content-k8s-provisioner's Introduction

Description

This repository contains the core automation required to create, update and destroy Kubernetes clusters for the UPP (delivery, publishing) and PAC platforms. The setup uses the kube-aws tool to manage Kubernetes infrastructure on AWS. Ansible and bash scripts integrate kube-aws with the additional tasks, and Docker packages the provisioner setup.

Prerequisites

To use the provisioner, the following tools need to be available:

  • Docker, used to build and run the provisioner image
  • The AWS CLI, used in the restore and decommissioning steps
  • kubectl, used to access clusters and verify restores and updates
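
A quick way to confirm these tools are present locally (exact versions are not pinned here; the provisioner image pins its own tool versions):

    docker --version
    aws --version
    kubectl version --client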

Build the provisioner Docker image

The Docker image packages the provisioner setup with pinned tool versions. To use the provisioner, first build the image locally:

docker build -t k8s-provisioner:local .

Provision a new cluster

To provision a cluster for a specific platform or cluster type (e.g. a PAC, publishing, delivery or test cluster), follow the steps below:

  1. Build the provisioner docker image locally.
  2. Create an empty provisioning directory and, inside it, a directory named credentials (the docker run below mounts $(pwd)/credentials, so stay in the provisioning directory):
    PROV_DIR=provision-cluster-$(date +%F-%H-%M)
    cd $HOME ; mkdir -p $PROV_DIR/credentials ; cd $PROV_DIR
  3. Create AWS credentials and set the environment variables as described in the appropriate LastPass note (an example export block is shown after this list):
    • For a generic test cluster: "UPP - k8s Cluster Provisioning env variables". Use the test credentials and set the cluster environment to prov.
    • For a UPP cluster: "UPP - k8s Cluster Provisioning env variables"
    • For a PAC cluster: "PAC - k8s Cluster Provisioning env variables"
  4. Run the docker container that will provision the cluster in AWS:
    docker run \
        -v $(pwd)/credentials:/ansible/credentials \
        -e "AWS_REGION=$AWS_REGION" \
        -e "AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID" \
        -e "AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
        -e "CLUSTER_NAME=$CLUSTER_NAME" \
        -e "CLUSTER_ENVIRONMENT=$CLUSTER_ENVIRONMENT" \
        -e "ENVIRONMENT_TYPE=$ENVIRONMENT_TYPE" \
        -e "OIDC_ISSUER_URL=$OIDC_ISSUER_URL" \
        -e "PLATFORM=$PLATFORM" \
        -e "VAULT_PASS=$VAULT_PASS" \
        k8s-provisioner:local /bin/bash provision.sh
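
For reference, the variables passed to the container above need to be exported in the shell first. A minimal sketch, where every value is a placeholder to be filled in from the relevant LastPass note:

    export AWS_REGION=<eu-west-1 or us-east-1>
    export AWS_ACCESS_KEY_ID=<access key id from the note>
    export AWS_SECRET_ACCESS_KEY=<secret access key from the note>
    export CLUSTER_NAME=<cluster name>
    export CLUSTER_ENVIRONMENT=<cluster environment, e.g. prov for a test cluster>
    export ENVIRONMENT_TYPE=<environment type from the note>
    export OIDC_ISSUER_URL=<OIDC issuer URL from the note>
    export PLATFORM=<platform from the note>
    export VAULT_PASS=<ansible-vault password from the note>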

At this point you have provisioned a Kubernetes cluster with nothing running on it. The next step is to integrate the cluster and add the services that run on it. Proceed only if you have to create a completely new environment.

  1. If this is a test/team cluster that needs to be shut down overnight, put the ec2Powercycle tag on the cluster's auto scaling groups. You can do this from the AWS console (or with the AWS CLI; see the sketch after this list):
    1. Log in to the cluster's account
    2. Go to EC2 -> Auto Scaling Groups
    3. Select the ASG, go to Tags, and add the ec2Powercycle tag. Example value:
    { "start": "30 6 * * 1-5", "stop": "30 18 * * 1-5", "desired": 2, "min": 2 }
    
    Make sure the desired and min values are in sync with the ASG's current values.
  2. IMPORTANT: Upload the zip found at credentials/${CLUSTER_NAME}.zip, containing the TLS assets, to the LastPass note from earlier. These initial credentials are vital for subsequent updates of the cluster.
  3. Add the new environment to the auth setup following the instructions here
  4. Add the new environment to the Jenkins pipeline. Instructions can be found here.
  5. Make sure you have defined the credentials for the new cluster in Jenkins. See the previous step.
  6. For UPP clusters: create or amend the app-configs for the upp-global-configs repository. Release and deploy a new version of this app to the new environment using this Jenkins Job.
  7. Deploy all the necessary apps in the new cluster. This can be done in two ways:
    1. The slower, fire-and-forget way: synchronize the cluster with an already existing cluster using this Jenkins Job.
    2. The quicker way, which requires some more manual steps: restore the config from an S3 backup of another cluster.
  8. For the ASG controlling the dedicated Thanos node, set the ec2Powercycle tag so that it is in sync with the Thanos compactor job. See Thanos Compactor for details. Follow the same steps as above and look for the workers with Wt in the name. If this is prod, you should also change the environment tag to t, so that the ec2Powercycle tag is taken into account.
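
The ec2Powercycle tag from steps 1 and 8 can also be applied with the AWS CLI instead of the console. A minimal sketch, where the ASG name is a placeholder and the schedule is the example value above:

    aws autoscaling create-or-update-tags --tags '[{
        "ResourceId": "<asg-name>",
        "ResourceType": "auto-scaling-group",
        "Key": "ec2Powercycle",
        "Value": "{ \"start\": \"30 6 * * 1-5\", \"stop\": \"30 18 * * 1-5\", \"desired\": 2, \"min\": 2 }",
        "PropagateAtLaunch": true
    }]'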

Update a cluster

IMPORTANT: Before updating the cluster, make sure you put the initial credentials (certificates and keys) that were used when the cluster was first provisioned into the credentials folder. Failing to do this will damage the cluster.
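
A minimal sketch of putting those initial credentials in place from the zip stored in LastPass (the zip follows the credentials/${CLUSTER_NAME}.zip naming convention from the provisioning steps; the download path is an assumption, and you may need to adjust the target if the archive already contains a credentials/ directory):

    # Unpack the initial TLS assets saved at provisioning time into ./credentials
    mkdir -p credentials
    unzip ~/Downloads/${CLUSTER_NAME}.zip -d credentials/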

  1. Build the provisioner docker image locally.
  2. Create AWS credentials and set the environment variables as described in the appropriate LastPass note:
    • For UPP cluster: "UPP - k8s Cluster Provisioning env variables"
    • For PAC cluster: "PAC - k8s Cluster Provisioning env variables"
  3. Run the provisioner docker container:
    docker run \
        -v $(pwd)/credentials:/ansible/credentials \
        -e "AWS_REGION=$AWS_REGION" \
        -e "AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID" \
        -e "AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
        -e "CLUSTER_NAME=$CLUSTER_NAME" \
        -e "CLUSTER_ENVIRONMENT=$CLUSTER_ENVIRONMENT" \
        -e "ENVIRONMENT_TYPE=$ENVIRONMENT_TYPE" \
        -e "OIDC_ISSUER_URL=$OIDC_ISSUER_URL" \
        -e "PLATFORM=$PLATFORM" \
        -e "VAULT_PASS=$VAULT_PASS" \
        k8s-provisioner:local /bin/bash update.sh

Decommission a cluster

  1. Build the provisioner docker image locally.
  2. Create AWS credentials and set the environment variables as described in the appropriate LastPass note:
    • For UPP cluster: "UPP - k8s Cluster Provisioning env variables"
    • For PAC cluster: "PAC - k8s Cluster Provisioning env variables"
  3. Run the provisioner docker container:
    docker run \
        -e "AWS_REGION=$AWS_REGION" \
        -e "AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID" \
        -e "AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
        -e "CLUSTER_NAME=$CLUSTER_NAME" \
        -e "CLUSTER_ENVIRONMENT=$CLUSTER_ENVIRONMENT" \
        -e "ENVIRONMENT_TYPE=$ENVIRONMENT_TYPE" \
        -e "PLATFORM=$PLATFORM" \
        -e "VAULT_PASS=$VAULT_PASS" \
        k8s-provisioner:local /bin/bash decom.sh
    Wait for the cluster resources to be removed. Watch the CloudFormation stack deletion in the AWS console (or from the command line; see the sketch after this list).
  4. Delete the leftover AWS resources associated with the cluster: go to the AWS console -> Resource Groups -> Tag Editor, find all resources carrying the k8s cluster tag and delete them.
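
A sketch of doing both checks with the AWS CLI instead of the console. The stack name and the tag key are assumptions (kube-aws typically names the CloudFormation stack after the cluster and tags resources with kubernetes.io/cluster/<cluster-name>), so verify them in your account first:

    # Block until the CloudFormation stack has been deleted (stack name assumed to match the cluster name)
    aws cloudformation wait stack-delete-complete --stack-name "$CLUSTER_NAME"

    # List any leftover resources still carrying the cluster tag (tag key is an assumption)
    aws resourcegroupstaggingapi get-resources \
        --tag-filters "Key=kubernetes.io/cluster/${CLUSTER_NAME}" \
        --query 'ResourceTagMappingList[].ResourceARN'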

Access a cluster

Existing cluster environments can be accessed via kubectl-login or a backup token. Read the content-k8s-auth-setup README to set up login.

Restore Kubernetes Configuration

The kube-resources-autosave on Kubernetes is a pod that takes snapshots of all the Kubernetes resources and uploads them to an S3 bucket at 24-hour intervals. The backups are stored in timestamped folders in the S3 bucket, using the following format:

s3://<s3-bucket-name>/kube-aws/clusters/<cluster-name>/backup/<backup_timestamped_folder>

To restore the k8s cluster state, do the following:

  1. Before restoring, make sure you have deployed the following apps using this Jenkins job, so that they don't get overwritten by the restore:

    • upp-global-configs
    • kafka-bridges
  2. Clone this repository

  3. Determine the S3 bucket name where the backup of the source cluster resides. Choose one of the exports below:

    • When the source cluster is a test (team) cluster in the EU region. The AWS account is Content Test:
      export RESTORE_BUCKET=k8s-provisioner-test-eu
    • When the source cluster is a test (team) cluster in the US region. The AWS account is Content Test:
      export RESTORE_BUCKET=k8s-provisioner-test-us
    • When the source cluster is staging or prod in the EU region. The AWS account is Content Prod:
      export RESTORE_BUCKET=k8s-provisioner-prod-eu
    • When the source cluster is staging or prod in the US region. The AWS account is Content Prod:
      export RESTORE_BUCKET=k8s-provisioner-prod-us
  4. Set the AWS credentials for the AWS account where the source cluster resides, based on the previous choice. They are stored in LastPass.

    • For PAC Cluster "PAC - k8s Cluster Provisioning env variables"
    • For UPP Cluster "UPP - k8s Cluster Provisioning env variables"
    export AWS_ACCESS_KEY_ID=
    export AWS_SECRET_ACCESS_KEY=
    # This should be either eu-west-1 or us-east-1, depending on the cluster's region.
    export AWS_DEFAULT_REGION=
  5. Determine the backup folder that should be used for the restore. Use the AWS CLI for this:

    # Check that the source cluster is in the chosen bucket
    aws s3 ls --human-readable s3://$RESTORE_BUCKET/kube-aws/clusters/
    
    # Determine the backup S3 folder that should be used for restore. This should be a recent folder.
    aws s3 ls --page-size 100 --human-readable s3://$RESTORE_BUCKET/kube-aws/clusters/<source_cluster>/backup/ | sort | tail -n 7
  6. Make sure you are connected to the cluster that you are restoring the config to. Test that you are connected to the correct cluster:

    kubectl cluster-info
  7. Run the following commands from the root of this repository to restore the kube-system and default namespaces:

    ./sh/restore.sh s3://$RESTORE_BUCKET/kube-aws/clusters/<source_cluster>/backup/<source_backup_folder> kube-system
    ./sh/restore.sh s3://$RESTORE_BUCKET/kube-aws/clusters/<source_cluster>/backup/<source_backup_folder> default
  8. To get the cluster green after an S3 restore, some further manual steps are required for mongo, kafka and varnish. The steps are detailed here

Rotate TLS assets of a cluster

  1. Build the provisioner docker image locally.
  2. Create AWS credentials and set the environment variables as described in the appropriate LastPass note:
    • For UPP cluster: "UPP - k8s Cluster Provisioning env variables"
    • For PAC cluster: "PAC - k8s Cluster Provisioning env variables"
  3. Run the provisioner docker container:
    docker run \
        -v $(pwd)/credentials:/ansible/credentials \
        -e "AWS_REGION=$AWS_REGION" \
        -e "AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID" \
        -e "AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
        -e "CLUSTER_NAME=$CLUSTER_NAME" \
        -e "CLUSTER_ENVIRONMENT=$CLUSTER_ENVIRONMENT" \
        -e "ENVIRONMENT_TYPE=$ENVIRONMENT_TYPE" \
        -e "OIDC_ISSUER_URL=$OIDC_ISSUER_URL" \
        -e "PLATFORM=$PLATFORM" \
        -e "VAULT_PASS=$VAULT_PASS" \
        -e "KEEP_CA=n" \
        k8s-provisioner:local /bin/bash rotate-tls.sh

After rotating the TLS assets, there are some important manual steps that should be done:

  1. Validate that login using the backup token works, using the new token from the provisioner output; the kubectl-login config describes how to check this, and a hedged kubectl sketch is shown after this list. If this validation fails, something is wrong: check the Troubleshooting section.
  2. Update the backup token access in the LP note kubectl-login config. You can find the new token value in the provisioner output: "backup-access token value is: ....."
  3. Update the token used by Jenkins to access the K8s cluster. The credential has the id ft.k8s-auth.${full-cluster-name}.token. Look it up here and update it with the token from the provisioner output: "jenkins token value is: ......"
  4. Validate that Jenkins still has access to the cluster by deploying an existing helm chart version onto the cluster through the Jenkins job.
  5. Update the TLS assets used by Jenkins for cluster updates. The credential has the id ft.k8s-provision.${full-cluster-name}.credentials. Look it up here and update the zip with the one created in the credentials folder with the name ${full-cluster-name}.zip.
  6. Validate that this update worked by triggering a cluster update on the cluster using the Jenkins job. It should finish quickly, as it doesn't have anything to do. If it takes a long time and actually performs an update, check that you did the previous step and try again.
  7. Validate that the normal flow of login through DEX works.
  8. Update the TLS assets in the LP note UPP - k8s Cluster Provisioning env variables or PAC - k8s Cluster Provisioning env variables
  9. Validate that logs are getting into Splunk after the rotation from this environment.
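
A minimal sketch of the backup-token check from step 1, assuming the API endpoint follows the https://<full-cluster-name>-api.ft.com convention used in the Troubleshooting section below (the full login flow is documented in the kubectl-login config):

    # Use the backup token from the provisioner output directly against the API server
    export BACKUP_TOKEN=<backup-access token value from the provisioner output>
    kubectl --server=https://<full-cluster-name>-api.ft.com \
            --token="$BACKUP_TOKEN" \
            --insecure-skip-tls-verify=true \
            get nodes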

Troubleshooting

Here are the situations encountered so far when the rotation did not complete successfully:

After rotation, login using the normal flow does not work

Possible problems:

  1. Dex may not have started yet. Wait for 5 minutes, then give it another go. Alternatively, try the newly generated backup token that you updated in the LP note kubectl-login config.
  2. The state of the etcd cluster is not consistent between the nodes. Here is how to check whether this is the case and how to recover:
    1. First, you need to connect to the cluster. Create a kubeconfig file that uses the newly created certificates to log in to the cluster. It may look like this:
      apiVersion: v1
      clusters:
      - cluster:
          server: https://{{full-cluster-name}}-api.ft.com
          insecure-skip-tls-verify: true
        name: prov-test
      contexts:
      - context:
          cluster: prov-test
          namespace: kube-system
          user: cert
        name: prov-test-cert-ctx
      
      current-context: prov-test-cert-ctx
      
      kind: Config
      preferences: {}
      users:
      - name: cert
        user:
          client-certificate: credentials/admin.pem
          client-key: credentials/admin-key.pem
      
    2. Set the KUBECONFIG environment variable in the shell to point to this newly created kubeconfig file.
    3. Run kubectl get secret multiple times. If it returns different values each time, or there are duplicates among the secrets, the state of the etcd nodes is out of sync.
    4. Check which etcd node is the leader and terminate the instances that are not (see the sketch after this list):
      1. SSH to one of the etcd nodes using the jumpbox and port forwarding
      2. Run etcdctl member list
      3. The leader is indicated in the output of that command.
      4. From the AWS console, terminate the two etcd instances in the cluster that are not the leader.
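
For reference, a sketch of the etcdctl leader check. With an etcd v2 client the leader is marked with isLeader=true in the member list output; with a v3 client the equivalent information is in the IS LEADER column of endpoint status. The output below is illustrative, not taken from a real cluster:

    # etcd v2 client: the leader is flagged with isLeader=true
    etcdctl member list
    # 8e9e05c52164694d: name=etcd0 peerURLs=https://... clientURLs=https://... isLeader=true

    # etcd v3 client alternative
    ETCDCTL_API=3 etcdctl endpoint status --cluster -w table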

Upgrade kube-aws

When upgrading the Kubernetes version, it is wise to do it with the latest kube-aws version, since kube-aws may already have moved closer to the version you need. Here are some guidelines on how to do it:

  1. Read all the changelogs involved (kube-aws, kubernetes) to spot any show-stoppers.
  2. Generate the plain kube-aws artifacts with the new version of kube-aws
    1. Open a terminal
    2. Create a new folder and go into it: mkdir kube-aws-upgrade; cd kube-aws-upgrade
    3. Download the kube-aws version that you want to upgrade to and put it in this new folder
    4. Init the cluster.yaml of kube-aws using some dummy values:
      ./kube-aws init \
        --cluster-name=kube-aws-up \
        --external-dns-name=kube-aws-up \
        --region=eu-west-1 \
        --key-name="dum" \
        --kms-key-arn="arn:aws:kms:eu-west-1:99999999999:key/99999999-9999" \
        --no-record-set \
        --s3-uri s3://k8s-provisioner-test-eu \
        --availability-zone=eu-west-1a
    5. Render the stack:
      ./kube-aws render
  3. At this point, kube-aws should have created 2 folders: stack-templates & userdata
  4. Carefully update the file ansible/templates/cluster-template.yaml.j2, adapting it to the changes from kube-aws-upgrade/cluster.yaml. One way to do this is to merge the two files with a tool such as IntelliJ IDEA.
  5. Carefully update the contents of the files from ansible/stack-templates/, adapting them to the changes from kube-aws-upgrade/stack-templates. Before doing this, check the Git history of the folder for manual changes to the files, as those need to be kept. Use the same merging technique (or the diff sketch after this list).
  6. Carefully update the contents of the files from ansible/userdata/, adapting them to the changes from kube-aws-upgrade/userdata. Before doing this, check the Git history of the folder for manual changes to the files, as those need to be kept. Use the same merging technique.
  7. Compare the contents of the credentials folder with an older credentials folder, for example the one for upp-prod-delivery-eu. You can find the old ones in the LP note UPP - k8s Cluster Provisioning env variables. Between upgrades it is common for new files to appear in this folder. If this is the case, check that these files are generated during cluster upgrades and recreate the credentials zips kept in the same LP note.
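
If an IDE merge tool is not to hand, a plain recursive diff of the freshly rendered artifacts against the ones kept in this repository is a reasonable first pass (the paths assume the kube-aws-upgrade folder created above sits next to this repository's ansible/ directory):

    # Spot upstream changes before merging them into the templated files
    diff -ru ansible/stack-templates kube-aws-upgrade/stack-templates
    diff -ru ansible/userdata kube-aws-upgrade/userdata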

The update part should now be done. Next, we need to validate that it really works.

Validate that the update works

It is advisable to go through the following steps for a full validation:

  1. Create a new branch in the k8s-cli-utils repo & update the kube-aws version.
  2. Update the Dockerfile of the provisioner to extend from the new version of the k8s-cli-utils Docker image & build the new provisioner Docker image
  3. Test provisioning a new, simple cluster. Use CLUSTER_ENVIRONMENT=prov when provisioning. Validate that everything works (nodes, kube-system namespace pods; a kubectl sketch for this check is shown after this list)
  4. Test that decommissioning still works. Decommission this new cluster and check that AWS resources were deleted.
  5. Test upgrading a simple cluster to the new version
    1. Provision a new cluster using the master version of the provisioner. Use the same CLUSTER_ENVIRONMENT=prov
    2. Update the cluster with the new version of the provisioner
    3. Validate that after the upgrade everything works (nodes, kube-system namespace pods)
  6. Test upgrading a replica of a delivery cluster
    1. Provision a new cluster using the master version of the provisioner. Use CLUSTER_ENVIRONMENT=delivery
    2. Go through the restore procedure. Use as source cluster the upp-uptst-delivery-eu cluster
    3. Get the cluster as green as possible
    4. Update the cluster with the new version of the provisioner
    5. Validate that everything is ok & the cluster is still green after the update
    6. Add the environment to the Jenkins pipeline.
    7. Validate that Jenkins can deploy to the updated cluster. You can trigger a Diff & Sync job to update from prod.
  7. Don't forget to decommission the cluster after all these validations.
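
A minimal sketch of the "nodes and kube-system pods" check referenced in steps 3 and 5, assuming kubectl is already pointed at the cluster under test:

    # All nodes should report Ready
    kubectl get nodes

    # All kube-system pods should be Running (or Completed, for jobs)
    kubectl get pods -n kube-system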

After all these validations succeed, you are ready to update the dev clusters.


content-k8s-provisioner's Issues

Pass Konstructor key as env variable

The only key that is stored in the ansible vault is the konstructor dns key.
When a key needs rotation, it is changed in the ansible vault file and a PR is submitted.

To avoid the hassle of submitting PRs, let's pass the key as an env variable so that the only place the key has to be updated is the provisioner LastPass note.

Making this change in the provisioner would also involve making changes to the Jenkins pipeline.

Update SQS access to content-container-apps user based on environment

We have a content-container-apps user for Prod and Staging:

https://github.com/Financial-Times/upp-provisioners/blob/master/upp-concept-publishing-provisioner/cloudformation/sns-sqs-cf-multi-region.yml#L57

The permission on the SQS policy needs to be set based on the environment:

Prod: content-container-apps
Staging: content-container-apps-staging

https://trello.com/c/nf3IIPuU/122-provisioner-update-sqs-access-to-content-container-apps-user-based-on-environment

Generate proper TLS keys for our k8s clusters

From the kube-aws docs:

PRODUCTION NOTE: the TLS keys and certificates generated by kube-aws should not be used to deploy a production Kubernetes cluster. Each component certificate is only valid for 90 days, while the CA is valid for 365 days. If deploying a production Kubernetes cluster, consider establishing PKI independently of this tool first. Read more below.

https://kubernetes-incubator.github.io/kube-aws/getting-started/step-2-render.html#render-cluster-assets


https://trello.com/c/qq0XofH8/16-generate-proper-tls-keys-for-our-k8s-clusters
