
terraform-aws-vault's Introduction

HashiCorp Vault on AWS

This code spins up a HashiCorp Vault cluster:

  • Spread across availability zones.
  • With automatic unsealing.
  • With nodes that automatically join the cluster.
  • With a load balancer.
  • With an optional bastion host.
  • With 3 or 5 nodes, depending on the number of availability zones.
  • Either creating a VPC or using an existing one.

Overview


    \0/        +--------------+
     | ------> | loadbalancer |
    / \        +--------------+
    OPS               | :8200/tcp
                      V
+---------+    +------------+
| bastion | -> | instance 0 |+
+---------+    +------------+|+
     ^          +------------+|
     |           +------------+
    \0/
     |
    / \
    DEV             

The most important variables are:

  • vault_name - default: "unset".
  • vault_aws_certificate_arn - The AWS certificate ARN that can be installed on the load balancer.
  • vault_aws_key_name - The key to use to login. (Conflicts with vault_keyfile_path. Use either, not both.)

More variables can be found in variables.tf.
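A minimal module call using these variables might look like the following sketch. The registry source, certificate ARN and key pair name are placeholders, not values taken from this repository.

module "vault" {
  source = "robertdebock/vault/aws" # assumed registry source

  vault_name                = "example"
  vault_aws_certificate_arn = "arn:aws:acm:eu-west-1:111111111111:certificate/example" # placeholder
  vault_aws_key_name        = "my-existing-keypair"                                    # placeholder
}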

Deployment

After spinning up a Vault cluster for the first time, log in to one of the Vault cluster members and initialize Vault:

vault operator init

This generates recovery keys and a root token; keep them safe and secure.

You must turn on auto-cleanup of dead raft peers in order to remove dead nodes and keep a majority of the Vault nodes healthy during scaling activities.

vault login ROOT_TOKEN
vault operator raft autopilot set-config \
  -min-quorum=3 \
  -cleanup-dead-servers=true \
  -dead-server-last-contact-threshold=120

NOTE: Some values in the example above are deployment-specific; check the output of the module, which shows the exact command to run.

Network

You can omit vault_aws_vpc_id; in that case, this module creates all required network resources.

If you do specify a vault_aws_vpc_id, the following resources need to exist already (a sketch follows this list):

  • aws_vpc
  • aws_internet_gateway
  • aws_route_table
  • aws_route
  • aws_subnet
  • aws_route_table_association
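As an illustration only (resource names and CIDR ranges below are examples, not requirements of this module), an existing network could be built like this; the VPC ID is then passed via vault_aws_vpc_id, and the subnet IDs presumably via vault_private_subnet_ids / vault_public_subnet_ids:

data "aws_availability_zones" "default" {
  state = "available"
}

resource "aws_vpc" "vault" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_internet_gateway" "vault" {
  vpc_id = aws_vpc.vault.id
}

resource "aws_route_table" "vault" {
  vpc_id = aws_vpc.vault.id
}

resource "aws_route" "internet" {
  route_table_id         = aws_route_table.vault.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.vault.id
}

# One subnet per availability zone.
resource "aws_subnet" "vault" {
  count             = 3
  vpc_id            = aws_vpc.vault.id
  cidr_block        = cidrsubnet(aws_vpc.vault.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.default.names[count.index]
}

resource "aws_route_table_association" "vault" {
  count          = 3
  subnet_id      = aws_subnet.vault[count.index].id
  route_table_id = aws_route_table.vault.id
}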

Backup & restore

To create a backup, log in to a Vault node, use vault login and run:

vault operator raft snapshot save /vault/data/raft/snapshots/raft-$(date +'%d-%m-%Y-%H:%M').snap

To restore a snapshot, run:

vault operator raft snapshot restore FILE

Logging, monitoring and alerting

Logging, monitoring and alerting are currently available with AWS-native tools by setting the vault_enable_cloudwatch boolean to true; see also examples/cloudwatch.
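As a sketch, enabling it in a module call looks like this (the registry source is an assumption; see examples/cloudwatch for the full example):

module "vault" {
  source = "robertdebock/vault/aws" # assumed registry source

  vault_name              = "example"
  vault_enable_cloudwatch = true
}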

Known bug: terraform destroy will not clean up CloudWatch alarms; the Lambda function that is meant to clean up the alarms is destroyed before it has a chance to do so.

  • TODO: decide where to send the alerts (mail, mobile, Slack?).

By default the following logs and metrics are being collected by the Cloudwatch agent:

  • Metrics:
    • memory_used_percent
    • disk_used_percent - /opt/vault
    • disk_used_percent - /
  • Logs:
    • /var/log/vault/vault.log
    • /var/log/cloud-init-output.log

By default the following alarms are created:

  • Alarms:
    • at 80% - memory_used_percent
    • at 80% - disk_used_percent - /opt/vault
    • at 80% - disk_used_percent - /

Metrics dashboard preview:
Cloudwatch dashboard preview.

Logging dashboard preview: Cloudwatch logging preview.

Cost

To understand the cost for this service, you can use cost.modules.tf:

terraform apply
terraform state pull | curl -s -X POST -H "Content-Type: application/json" -d @- https://cost.modules.tf/

Here is a table relating vault_size to a monthly price. (Date: Feb 2022)

| Size (vault_size) | Monthly price x86_64 ($) | Monthly price arm64 ($) |
| --- | --- | --- |
| custom | Varies: 223.34 * | Varies: +- 193.00 ** |
| development | 50.98 | vault_size != custom *** |
| minimum | 257.47 | vault_size != custom *** |
| small | 488.59 | vault_size != custom *** |
| large | 950.83 | vault_size != custom *** |
| maximum | 1875.31 | vault_size != custom *** |

When vault_size is set to custom, these parameters determine the price (a sketch follows this list):

  • vault_volume_iops
  • vault_volume_size
  • vault_volume_type
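For example, a custom-size call could look like this sketch; the values are illustrative, and the settings the price above is based on live in examples/custom:

module "vault" {
  source = "robertdebock/vault/aws" # assumed registry source

  vault_name        = "example"
  vault_size        = "custom"
  vault_volume_type = "gp3" # illustrative
  vault_volume_size = 100   # illustrative, in GB
  vault_volume_iops = 3000  # illustrative
}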

(*) The price for vault_size = "custom" in the table above is based on the settings in examples/custom.
(**) The cost analysis tool does not support Graviton, so the price was analysed manually.
(***) The Graviton types can only be used when vault_size is set to custom.

terraform-aws-vault's People

Contributors

joe-rua, repping, robert-de-bock, robertdebock


terraform-aws-vault's Issues

Validate production readiness

Taken from Production Hardening.

Tasks:

  • End-to-End TLS
  • Single Tenancy
  • Firewall Traffic
  • Disable SSH (RDP)
  • Enable memory locking (mlock)
  • Disable Swap
  • Don't Run as root
  • Turn core dumps off
  • Immutable Upgrades
  • Good Root Token management

Inconsistent variable names

The variable names are not consistent; some have vault_ in front, some use enabled_* or *_enabled. I propose a form that results in a more predictable style:

  • Start with vault_*.
  • If a list or map is used, use the plural, ending in vault_*s.
  • If a boolean is used, use vault_VERB_FEATURE, where VERB is something like create, use, or enable.
  • If directly referring to an (AWS) service, use vault_aws_SERVICE, for example vault_aws_key_name.
  • When referring to a file, use vault_*_path.
  • Use vault_*_id, vault_*_name, vault_*_seconds to indicate what is expected (or _ids and _names for lists or maps).
| Old variable name | New variable name |
| --- | --- |
| name | vault_name |
| vault_version | - |
| key_name | vault_aws_key_name |
| key_filename | vault_keyfile_path |
| size | vault_size |
| volume_type | vault_volume_type |
| volume_size | vault_volume_size |
| volume_iops | vault_volume_iops |
| amount | vault_node_amount |
| vault_ui | vault_enable_ui |
| allowed_cidr_blocks | vault_allowed_cidr_blocks |
| vpc_id | vault_aws_vpc_id |
| bastion_host | vault_create_bastionhost |
| vpc_cidr_block_start | vault_cidr_block_start (replace this clunky variable later) |
| tags | vault_tags |
| max_instance_lifetime | vault_asg_instance_lifetime |
| certificate_arn | vault_aws_certificate_arn |
| log_level | vault_loglevel |
| default_lease_ttl | vault_default_lease_time |
| max_lease_ttl | vault_max_lease_time |
| vault_path | vault_data_path |
| private_subnet_ids | vault_private_subnet_ids |
| public_subnet_ids | vault_public_subnet_ids |
| vault_type | - |
| vault_license | - |
| api_addr | vault_api_addr |
| allowed_cidr_blocks_replication | vault_replication_allowed_cidr_blocks |
| cooldown | vault_asg_cooldown_seconds |
| vault_ca_cert | vault_ca_cert_path |
| vault_ca_key | vault_ca_key_path |
| vault_replication | vault_allow_replication |
| telemetry | vault_enable_telemetry |
| prometheus_retention_time | vault_prometheus_retention_time |
| prometheus_disable_hostname | vault_prometheus_disable_hostname |
| telemetry_unauthenticated_metrics_access | vault_enable_telemetry_unauthenticated_metrics_access |
| aws_kms_key_id | vault_aws_kms_key_id |
| warmup | vault_asg_warmup_seconds |
| api_port | vault_api_port |
| replication_port | vault_replication_port |
| vault_aws_s3_snapshots_bucket | vault_aws_s3_snapshot_bucket_name |
| aws_lb_internal | vault_aws_lb_availability (internal or external) |
| extra_security_group_ids | vault_extra_security_group_ids |
| advanced_monitoing | (remove this variable) |
| audit_device | vault_create_audit_device |
| audit_device_size | vault_audit_device_size |
| audit_device_path | vault_audit_device_path |
| allow_ssh | vault_allow_ssh |
| minimum_memory | vault_asg_minimum_required_memory |
| minimum_vcpus | vault_asg_minimum_required_vcpus |
| cpu_manufacturer | vault_asg_cpu_manufacturer |
| cloudwatch_monitoring | vault_enable_cloudwatch |

Give the above suggestion a good think before implementing.

The templatefile now places a local file; this is not required.

Maybe something like this is simpler:

resource "aws_launch_configuration" "default" {
  iam_instance_profile = aws_iam_instance_profile.default.name
  image_id             = data.aws_ami.default.id
  instance_type        = local.instance_type
  key_name             = local.key_name
  name_prefix          = "${var.name}-"
  security_groups = [aws_security_group.private.id, aws_security_group.public.id]
  spot_price      = var.size == "development" ? var.spot_price : null
  user_data       = templatefile("${path.module}/user_data_vault.sh.tpl",
    {
      api_addr                       = local.api_addr
      default_lease_ttl              = var.default_lease_ttl
      instance_name                  = local.instance_name
      kms_key_id                     = local.aws_kms_key_id
      log_level                      = var.log_level
      max_lease_ttl                  = var.max_lease_ttl
      name                           = var.name
      prometheus_disable_hostname    = var.prometheus_disable_hostname
      prometheus_retention_time      = var.prometheus_retention_time
      random_string                  = random_string.default.result
      region                         = var.region
      telemetry                      = var.telemetry
      unauthenticated_metrics_access = var.telemetry_unauthenticated_metrics_access
      vault_ca_cert                  = file(var.vault_ca_cert)
      vault_ca_key                   = file(var.vault_ca_key)
      vault_path                     = var.vault_path
      vault_ui                       = var.vault_ui
      vault_version                  = var.vault_version
      vault_package                  = local.vault_package
      vault_license                  = try(var.vault_license, null)
      warmup                         = var.warmup
    }
  )
  root_block_device {
    encrypted   = true
    iops        = local.volume_iops
    volume_size = local.volume_size
    volume_type = local.volume_type
  }
  lifecycle {
    create_before_destroy = true
  }
}

Encrypt S3

Currently the bucket that stores the CloudWatch and logrotate scripts is not encrypted.

Encrypting it helps to improve overall security.
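A minimal sketch of default encryption on such a bucket, assuming a hypothetical bucket resource named aws_s3_bucket.scripts:

resource "aws_s3_bucket_server_side_encryption_configuration" "scripts" {
  bucket = aws_s3_bucket.scripts.id # hypothetical bucket resource

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}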

Bastion could be replaced by `terraform-aws-instance`.

To reduce the size of this module, the bastion host could be replaced by the [terraform-aws-instance](https://registry.terraform.io/modules/robertdebock/instance/aws/latest) module.

Benefits:

  • A specialised module for an instance.
  • Less code in this module.

Drawbacks:

  • An extra dependency.
  • More complex to integrate two modules rather than one.

Before working on this issue, please analyse the value.

Feature: Add CloudWatch

It would be nice to allow users to enable CloudWatch.

When enabled: (likely a variable: cloudwatch_enabled (bool))

  • Install, configure and start the CloudWatch agent.
  • (nice to have) Have Terraform create a Dashboard for Vault.
  • (nice to have) Add alerting. (Interesting to look at memory usage)
  • (nice to have) Send logs to AWS. (logs: journalctl and/or audit logs.)

Bug: Using a `vault_data_path` other than the default "/opt/vault" causes provisioning to fail.

The Vault RPM uses /opt/vault as its default installation path, which matches the default value of the Terraform variable vault_data_path, so provisioning completes successfully in that case.

When using a non-default vault_data_path, a folder called data also needs to be created in that location. This was not a problem before because the RPM created it; with a non-default vault_data_path it has to be created during provisioning.

My suggestion is to add the following code to the userdata script:

# Create and configure the Vault data folder when it is different from the default path created by the rpm.
if [ "${vault_data_path}" != "/opt/vault" ] ; then
  mkdir ${vault_data_path}/data
  chown vault:vault ${vault_data_path}/data
  chmod 755 ${vault_data_path}/data
fi

Vault commands slow when using the load balancer address.

Vault commands take some 5 to 15 seconds to complete. They eventually work, but the delay is annoying.

Any vault command results in:

uname({sysname="Linux", nodename="ip-172-16-127-35.eu-west-1.compute.internal", ...}) = 0
epoll_pwait(4, [], 128, 0, NULL, 138566752) = 0
futex(0x4000094550, FUTEX_WAKE_PRIVATE, 1) = 1

NOTE: The nodename is set to ip-172-16-127-35.eu-west-1.compute.internal. That's not correct; it should be the external address (api_addr).

From here on a loop starts (likely causing the delay):

--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1797, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 1

The api_addr is set correctly in /etc/vault.d/vault.hcl:

api_addr          = "https://dev.robertdebock.nl:8200"

Maybe the node_id causes this behaviour:

storage "raft" {
  path    = "/opt/vault/data"
  node_id = "ip-172-16-127-35.eu-west-1.compute.internal"

Allow running a custom script on Bastion as well.

The vault nodes have an option to run a custom script. That script is typically used to install a backup or monitoring agent.

The bastion host can't run this now. Please add an option to run a custom script on the bastion host as well.

Make aws_key_pair variable.

Some users have a key-pair and would like to refer to the identifier of that key, rather than specifying a key-pair file.
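A possible shape for this, sketched with the module's variable names but otherwise assumed (not the current implementation):

variable "vault_aws_key_name" {
  description = "Name of an existing key pair; leave empty to create one."
  type        = string
  default     = ""
}

variable "vault_keyfile_path" {
  description = "Path to a public key file, used when no key pair name is given."
  type        = string
  default     = ""
}

# Create a key pair from the public key file only when no existing key pair
# name was supplied.
resource "aws_key_pair" "default" {
  count      = var.vault_aws_key_name == "" ? 1 : 0
  key_name   = "vault-deploy" # illustrative name
  public_key = file(var.vault_keyfile_path)
}

locals {
  # Prefer the user-supplied key pair name, fall back to the created one.
  key_name = coalesce(var.vault_aws_key_name, one(aws_key_pair.default[*].key_name))
}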

Random is nice, but `name_prefix` is better.

Many resources use the random provider to create some uniqueness. Some resources have name_prefix, which offers the same functionality natively; see the sketch after the task list.

  • Make a list of all resources: grep '^resource' * | cut -d: -f2 | awk '{ print $2 }'| sort | uniq | sed 's/"//g'
  • Check which resources support name_prefix.
  • Remove the random part and use name_prefix.
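A hypothetical example of the target pattern; AWS appends its own unique suffix to the prefix, so the random suffix is no longer needed:

resource "aws_iam_role" "vault" {
  name_prefix = "vault-"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}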

AWS KMS key

Currently, the KMS key is generated within the module.

When you now delete a Vault deployment, the KMS key is also destroyed.

This means a saved snapshot can't be used, because the decryption key does not exist anymore.

Options to overcome this (a sketch of the first option follows this list):

  • Let the user of the module bring their own key.
  • Prevent deleting KMS keys.
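A sketch of the first option, using the module's vault_aws_kms_key_id variable name but otherwise assumed:

variable "vault_aws_kms_key_id" {
  description = "Optional existing KMS key ID for auto-unseal; leave empty to create one."
  type        = string
  default     = ""
}

# Only create a key when the caller did not bring their own.
resource "aws_kms_key" "default" {
  count       = var.vault_aws_kms_key_id == "" ? 1 : 0
  description = "Vault auto-unseal key"
}

locals {
  # Prefer the user-supplied key, fall back to the generated one.
  aws_kms_key_id = coalesce(var.vault_aws_kms_key_id, one(aws_kms_key.default[*].key_id))
}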

Implement a naming convention for `name` parameters in resources.

Resources have a name parameter, and its format currently differs across files.

  • Think of a naming convention, maybe: "vault-${var.vault_name}-${random_string.default.result}".
  • Find all name parameters. (Not fully working, but a good start: grep -E 'name +=' *.tf.)
  • Implement the naming convention (see the sketch below).
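A sketch of the convention kept in a single local so every resource's name parameter can reference it; random_string.default and var.vault_name are assumed to exist in the module already:

locals {
  # The proposed naming convention, reused by every resource name.
  name = "vault-${var.vault_name}-${random_string.default.result}"
}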

A load balancer cannot be attached to multiple subnets in the same Availability Zone

Error: error creating application Load Balancer: InvalidConfigurationRequest: A load balancer cannot be attached to multiple subnets in the same Availability Zone
│ status code: 400, request id: 9831def4-be92-4315-a399-ddeb1b8f9dde

│ with module.vault.aws_lb.default,
│ on .terraform/modules/vault/main.tf line 270, in resource "aws_lb" "default":
│ 270: resource "aws_lb" "default" {

Deploying Vault as OSS with ASG health checks based on LB causes standby nodes to be replaced continuously

The problem was encountered when deploying the "Cloudwatch" example with:

  • var.replication == false
  • var.vault_enable_metrics == true
  • var.vault_enable_metrics_unauthenticated_access == false

Problem / Symptoms

  • Vault standby nodes (HTTP return code 429) are marked unhealthy and continuously replaced by the ASG even though they are behaving as expected. From the load balancer's point of view they should be marked unhealthy so they do not receive traffic, but from Vault's point of view these nodes are perfectly healthy standby cluster nodes behaving as expected.

Deploying in an existing VPC fails

When calling the module with an existing VPC:

  vpc_id = "vpc-123abc"
  allowed_cidr_blocks  = [ "10.1.3.0/24", "10.1.4.0/24", "10.1.5.0/24"] 

An error occurs:

Error: Incorrect attribute value type
│ 
│   on .terraform/modules/vault/main.tf line 279, in resource "aws_lb" "default":
│  279:   subnets            = local.aws_subnet_ids
│     ├────────────────
│     │ local.aws_subnet_ids is "vpc-123abc"
│ 
│ Inappropriate value for attribute "subnets": set of string required.
╵
╷
│ Error: Invalid function argument
│ 
│   on .terraform/modules/vault/main.tf line 328, in resource "aws_autoscaling_group" "default":
│  328:   vpc_zone_identifier   = tolist(local.aws_subnet_ids)
│     ├────────────────
│     │ local.aws_subnet_ids is "vpc-123abc"
│ 
│ Invalid value for "v" parameter: cannot convert string to list of any single type.
╵
╷
│ Error: Invalid function argument
│ 
│   on .terraform/modules/vault/main.tf line 388, in resource "aws_instance" "bastion":
│  388:   subnet_id                   = tolist(local.aws_subnet_ids)[0]
│     ├────────────────
│     │ local.aws_subnet_ids is "vpc-123abc"
│ 
│ Invalid value for "v" parameter: cannot convert string to list of any single type. 

License not picked up by Vault

When running vault operator init, an error is returned.

# vault operator init
Error initializing: Error making API request.

URL: PUT https://172.16.1.14:8200/v1/sys/init
Code: 400. Errors:

* cannot initialize storage without an autoloaded license

I've tried:

  • To export VAULT_LICENSE=xyz.
  • To source /etc/vault.d/vault.env.
  • To set license_path to point to a file containing the license.
  • To run VAULT_LICENSE=xyz vault operator init.

Details:

vault status
Key                      Value
---                      -----
Recovery Seal Type       awskms
Initialized              false
Sealed                   true
Total Recovery Shares    0
Threshold                0
Unseal Progress          0/0
Unseal Nonce             n/a
Version                  1.9.3+ent
Storage Type             raft
HA Enabled               true

Move heredoc (`<<EOF` ... `EOF`) to different files.

There are heredoc markers in some .tf files (cloudwatch.tf, iam.tf and outputs.tf; see grep -Ril EOF *.tf).

  • Verify whether using heredoc is preferable to reading files (file(...) or templatefile(...)).

If heredoc is not preferred:

  • Identify the files containing heredoc.
  • Move the content into separate files (for example something.json).
  • Use the file() or templatefile() function to read the file (see the sketch below).
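A sketch of the target pattern, with an illustrative resource and file path:

resource "aws_iam_policy" "vault" {
  name   = "vault-example"                                 # illustrative
  policy = file("${path.module}/files/vault-policy.json")  # illustrative path
}

For content that needs variables, templatefile("${path.module}/files/vault-policy.json.tpl", { name = var.vault_name }) works the same way.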
