
lucretius / vault_raft_snapshot_agent


⛔️ DEPRECATED ⛔️ An agent which provides periodic snapshotting capabilities of Vault's Raft backend

License: MIT License

Go 97.45% Dockerfile 2.55%
archived deprecated obsolete

vault_raft_snapshot_agent's Introduction


⛔️ This project is no longer being supported as I have not needed to use Vault for some time. If you wish to continue to use this project, please consider forking it or check out one of the existing forks (see this discussion) ⛔️

Raft Snapshot Agent

Raft Snapshot Agent is a Go binary that is meant to run alongside every member of a Vault cluster and take periodic snapshots of the Raft database, writing them to the desired location. Its configuration is meant to roughly parallel that of the Consul Snapshot Agent, so many of the same configuration properties you see there will be present here.

"High Availability" explained

It works in an "HA" way as follows:

  1. Each running daemon checks the IP address of the machine it's running on.
  2. If this IP address matches that of the leader node, it will be responsible for performing snapshotting.
  3. The other binaries simply continue checking, on each snapshot interval, to see if they have become the leader.

In this way, the daemon will always run on the leader Raft node.
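The check itself is conceptually simple. The following is a rough Go sketch of the idea, not the project's exact implementation; it assumes the Vault HTTP API's /v1/sys/leader endpoint and derives the local IP from the machine's preferred outbound interface:

package main

import (
	"encoding/json"
	"fmt"
	"net"
	"net/http"
	"net/url"
)

// leaderResponse mirrors the fields of interest returned by /v1/sys/leader.
// IsSelf is included because it comes up in the issues below as an alternative
// way to answer the same question.
type leaderResponse struct {
	IsSelf               bool   `json:"is_self"`
	LeaderClusterAddress string `json:"leader_cluster_address"`
}

// localIP returns this machine's preferred outbound IP by "dialing" a public
// address over UDP; no packets are actually sent.
func localIP() (string, error) {
	conn, err := net.Dial("udp", "8.8.8.8:80")
	if err != nil {
		return "", err
	}
	defer conn.Close()
	return conn.LocalAddr().(*net.UDPAddr).IP.String(), nil
}

// isLeader reports whether this host appears to be the current Raft leader of
// the Vault cluster reachable at addr (e.g. "https://127.0.0.1:8200").
func isLeader(addr string) (bool, error) {
	resp, err := http.Get(addr + "/v1/sys/leader")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var lr leaderResponse
	if err := json.NewDecoder(resp.Body).Decode(&lr); err != nil {
		return false, err
	}

	leaderURL, err := url.Parse(lr.LeaderClusterAddress)
	if err != nil {
		return false, err
	}
	ip, err := localIP()
	if err != nil {
		return false, err
	}
	return leaderURL.Hostname() == ip, nil
}

func main() {
	leader, err := isLeader("https://127.0.0.1:8200")
	if err != nil {
		fmt.Println("leader check failed:", err)
		return
	}
	if leader {
		fmt.Println("on the leader node, taking a snapshot")
	} else {
		fmt.Println("Not running on leader node, skipping.")
	}
}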

Another way to do this, which would allow us to run the snapshot agent anywhere, is to simply have the daemons form their own Raft cluster, but this approach seemed much more cumbersome.

Running

The recommended way of running this daemon is using systemctl, since it handles restarts and failure scenarios quite well. To learn more about systemctl, check out this article. To begin, create the following file at /etc/systemd/system/snapshot.service:

[Unit]
Description="An Open Source Snapshot Service for Raft"
Documentation=https://github.com/Lucretius/vault_raft_snapshot_agent/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/etc/vault.d/snapshot.json

[Service]
Type=simple
User=vault
Group=vault
ExecStart=/usr/local/bin/vault_raft_snapshot_agent
ExecReload=/usr/local/bin/vault_raft_snapshot_agent
KillMode=process
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Your configuration is assumed to exist at /etc/vault.d/snapshot.json and the actual daemon binary at /usr/local/bin/vault_raft_snapshot_agent.

Then just run:

sudo systemctl enable snapshot
sudo systemctl start snapshot

If your configuration is correct and Vault is running on the same host as the agent, you will see one of the following messages, depending on whether the daemon is running on the leader's host:

Not running on leader node, skipping.
Successfully created <type> snapshot to <location>
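
To follow these messages as the agent runs, you can tail the unit's journal (assuming the unit name snapshot from the service file above):

sudo journalctl -u snapshot -f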

Configuration

addr The address of the Vault cluster. This is used to check the Vault cluster leader IP, as well as generate snapshots. Defaults to "https://127.0.0.1:8200".

retain The number of backups to retain.

frequency How often to run the snapshot agent. Examples: 30s, 1h. See https://golang.org/pkg/time/#ParseDuration for a full list of valid time units.
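
For reference, here is a minimal sketch of a snapshot.json combining the general options above with the AppRole and local storage settings documented below; all values are placeholders:

{
  "addr": "https://127.0.0.1:8200",
  "retain": 7,
  "frequency": "1h",
  "role_id": "<your-role-id>",
  "secret_id": "<your-secret-id>",
  "local_storage": {
    "path": "/etc/raft/snapshots"
  }
}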

Default authentication mode

role_id Specifies the role_id used to call the Vault API. See the authentication steps below.

secret_id Specifies the secret_id used to call the Vault API.

approle Specifies the approle name used to login. Defaults to "approle".

Kubernetes authentication mode

In case the agent is running under Kubernetes, we can use Vault's Kubernetes auth as described below. Read more on the Kubernetes auth mode.

vault_auth_method Set this to "k8s"; otherwise, AppRole authentication is used.

vault_auth_role Specifies the Vault Kubernetes auth role.

vault_auth_path Specifies the Vault Kubernetes auth path.
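
As a sketch, the Kubernetes-related portion of snapshot.json might look like the following (role and path values are placeholders; note that an issue further below reports that the keys are actually named k8s_auth_role and k8s_auth_path in config.go):

{
  "addr": "https://vault.example.com:8200",
  "vault_auth_method": "k8s",
  "vault_auth_role": "snapshot",
  "vault_auth_path": "kubernetes"
}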

Storage options

Note that if you specify more than one storage option, all options will be written to. For example, specifying local_storage and aws_storage will write to both locations.

local_storage - Object for writing to a file on disk.

aws_storage - Object for writing to an S3 bucket (supports AWS S3 as well as S3-compatible storage).

google_storage - Object for writing to GCS.

azure_storage - Object for writing to Azure.

Local Storage

path - Fully qualified path, not including the file name, for where the snapshot should be written, e.g. /etc/raft/snapshots

AWS Storage

access_key_id - It is recommended to use the standard AWS_ACCESS_KEY_ID env var, but it's possible to specify this in the config

secret_access_key - It is recommended to use the standard SECRET_ACCESS_KEY env var, but it's possible to specify this in the config

s3_endpoint - S3-compatible storage endpoint (e.g. http://127.0.0.1:9000)

s3_force_path_style - Needed if your S3-compatible storage supports only path-style addressing, or if you would like to use S3's FIPS endpoint

s3_region - S3 region, as required for programmatic interaction with AWS

s3_bucket - Bucket to store snapshots in (required for AWS writes to work)

s3_key_prefix - Prefix to store S3 snapshots under. Defaults to raft_snapshots

s3_server_side_encryption - Encryption is off by default. Set to true to turn on AWS' AES256 encryption. AWS KMS keys are not currently supported.

s3_static_snapshot_name - Use a single, static key for S3 snapshots as opposed to autogenerated timestamp-based ones. Unless S3 versioning is used, this means there will only ever be a single point-in-time snapshot stored in S3.
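
Putting these together, an aws_storage block might look like the following sketch (assuming the S3 options nest under aws_storage as described above; bucket, region, and endpoint values are placeholders, and credentials come from the environment variables recommended above):

{
  "aws_storage": {
    "s3_region": "us-east-1",
    "s3_bucket": "my-vault-snapshots",
    "s3_key_prefix": "raft_snapshots",
    "s3_endpoint": "http://127.0.0.1:9000",
    "s3_force_path_style": true,
    "s3_server_side_encryption": true
  }
}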

Google Storage

bucket - The Google Storage Bucket to write to. Auth is expected to be default machine credentials.

Azure Storage

account_name - The account name of the storage account

account_key - The account key of the storage account

container_name - The name of the blob container to write to
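
A corresponding azure_storage sketch, with placeholder values:

{
  "azure_storage": {
    "account_name": "myvaultbackups",
    "account_key": "<storage-account-key>",
    "container_name": "vault-snapshots"
  }
}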

Authentication

Default authentication mode

You must do some quick initial setup before you can use the Snapshot Agent. This involves the following:

First, vault login with an admin user. Then create the following policy with vault policy write snapshot ./my_policies/snapshot_policy.hcl, where snapshot_policy.hcl is:

path "/sys/storage/raft/snapshot"
{
  capabilities = ["read"]
}

Then run:

vault write auth/approle/role/snapshot token_policies="snapshot"
vault read auth/approle/role/snapshot/role-id
vault write -f auth/approle/role/snapshot/secret-id

Copy the role and secret IDs and place them into the snapshot configuration file. The snapshot agent will use them to request client tokens so that it can interact with your Vault cluster. The above policy is the minimum required to generate snapshots. The snapshot agent will automatically renew the token before it expires.

The AppRole allows the snapshot agent to automatically rotate tokens to avoid long-lived credentials.

To learn more about AppRoles and why this project chose to use them, see the Vault docs

Kubernetes authentication mode

To enable Kubernetes authentication mode, follow these steps from the Vault docs.
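
In outline, that setup looks roughly like the following sketch; the role name, policy, service account, and namespace are placeholders, and the exact arguments to vault write auth/kubernetes/config depend on your cluster:

# Enable the Kubernetes auth method
vault auth enable kubernetes

# Point Vault at the Kubernetes API (arguments vary by cluster setup)
vault write auth/kubernetes/config \
    kubernetes_host="https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT"

# Create a role bound to the agent's service account, attached to the snapshot policy
vault write auth/kubernetes/role/snapshot \
    bound_service_account_names=vault-snapshot-agent \
    bound_service_account_namespaces=vault \
    policies=snapshot \
    ttl=1h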

vault_raft_snapshot_agent's People

Contributors

kirrmann, kyrklund, louisnitag, luanphantiki, lucretius, nehrman


vault_raft_snapshot_agent's Issues

Not performing snapshots when running on the leader

Hello,

We reference the DNS name of the node, with all nodes using the same TLS certificate with SAN entries for each node. When running the script on the leader (and any node) it determines that it is not the leader, and doesn't run the snapshot. All of the nodes are behind a global traffic manager so that the cluster runs in different data centers for HA. The leader refers to the FQDN of the node, however the node itself is just configured with its own hostname. Are there any changes that I would need to make to the nodes for the script to work (or conversely any changes in the script to make it work with my configuration)?

Kubernetes

Hi,

Can we run this agent inside a Kubernetes cluster?

Intermittent 0 byte snapshots with S3 storage

Many thanks for creating this project.

Everything is working well, except that when using S3 bucket storage I am seeing 0 byte snapshots intermittently. Versioning and encryption are enabled on the bucket. I have tried disabling these, but the agent continues to upload 0 byte snapshots.


any way to skip tls certificate verify?

was able to get server configured and starting but it's refusing connection at my addr value.

since i'm taking this value straight from vault config (and can connect with it via clients), i'm guessing that it's because i'm using a private cert that tls doesn't like.

in the vault client or curl, you can disable tls certificate verification (-tls-skip-verify in vault cli).

I don't see an option in the config file to do something similar. Is there an undocumented way to achieve the same? Or can I somehow modify this in Go with an environment variable?

appreciate any guidance and thanks for all your hard work.

Compilation fails on go 1.15

By default, the Go version on ubuntu-latest is v1.15, and compilation fails because os.ReadFile only exists since Go v1.16.
The Dockerfile is already using Go 1.16.

I suggest adding actions to the workflow to use Go 1.16.

Every instance is executing the upload

Hello, thank you for this tool.
I've already installed in my instances and the backups are working properly. There's one minor issue, every instance is executing the snapshot upload, instead of only the leader. I'm using Vault inside containers, could it be related?

➜ vault operator raft list-peers
Node                            Address                              State       Voter
----                            -------                              -----       -----
vault-internal-1.example.com    vault-internal-1.example.com:8201    follower    true
vault-internal-2.example.com    vault-internal-2.example.com:8201    leader      true
vault-internal-3.example.com    vault-internal-3.example.com:8201    follower    true

snapshot agent service does not create snapshots

I have enabled and configured the snapshot agent as described in the instructions, but when the service is started, I get the following message:

"Not running on leader node, skipping"

This is a one-node cluster and it is indeed the leader:

vault operator raft list-peers
Node      Address           State     Voter
----      -------           -----     -----
node_1    127.0.0.1:8201    leader    true

Any idea how this can be fixed? I do not know how to get more verbose output and I am stuck with debugging.

Create a fork?

It's been quite a while since @Lucretius was active on GitHub and the last commit to this repository was back in October 2021.

Personally, I still find this project to be really useful and use it along with Vault quite often. What does the community think about creating a fork to maintain and improve the project?

k8s auth documentation error

In the README.md documentation of the Kubernetes authentication mode it is specified:

vault_auth_role Specifies vault k8s auth role
vault_auth_path Specifies vault k8s auth path

This seems to be a mistake, as the configuration parameters need to be called k8s_auth_role and k8s_auth_path. See config.go.

Feature-Request: structured logging / log-format json

Unfortunately go is not a language I am experienced in, but I love your project and wanted to drop the idea of adding a flag to enable structured logging.

In my setup your snapshot agent logs to a file thanks to systemd

ExecStart=/usr/local/bin/vault_raft_snapshot_agent
ExecReload=/usr/local/bin/vault_raft_snapshot_agent
StandardOutput=append:/var/log/vault_snapshots.log

Filebeat harvests the file and sends the data to Elasticsearch, which is a data source in my Grafana.
Backups are run hourly, so if there is no successful backup in the last hour according to the logs, grafana triggers an alert.

Different frequency for each storage backend

Hi,

I'd like to be able to write locally multiple times a day, but only push to AWS once a day.
Also retention would be managed differently.
I browsed the configuration documentation, and it does not seem to be possible, but would like to confirm.

Thanks.

Support encryption

It would be nice if the snapshots could be encrypted before being stored, for example with a GPG key.

Determining leader

Hello and thanks for creating this project.

We tried using the agent in our setup, but could not get it to create snapshots even though the agent was on the leader node.
After some digging I found out how you determine if the agent is running on the leader node by checking against 8.8.8.8 with the function getInstanceIP()
The issue we faced is that we don't use IPs to reference our nodes, but instead use DNS registered names (for TLS to work)

My question is:
Is there a reason you get the node IP and compare it to LeaderClusterAddress rather than just checking the IsSelf value?
This of course assumes you query the node's local api_addr.

Thanks again
// Jonathan

Majority of aws snaps fail to upload, none delete.

I'm running the service as described here, using an on-prem S3 service (not MinIO) as the AWS endpoint. I was seeing tons of failed snapshots, and it appears to be in the AWS s3manager SDK Upload portion. I was initially using a 30s interval, thought that might be too frequent, backed it off to 30m, and it is still failing 9/10 times to upload.
I also see no logs related to failed deletion even though my retain is set to 1440 and there are currently 33000+ in there after a month.

Logs collected with: sudo journalctl -u vault-snapshot.service

2021/04/12 08:49:52 Failed to generate aws snapshot to : MultipartUpload: upload multipart failed
Apr 12 08:49:52 host.local vault_raft_snapshot_agent[114297]: upload id: 22478197652
Apr 12 08:49:52 host.local vault_raft_snapshot_agent[114297]: caused by: InternalError:
Apr 12 08:49:52 host.local vault_raft_snapshot_agent[114297]: status code: 500, request id: , host id:
