
rancher-reaper's Introduction

Rancher AWS Terminated Host Reaper

Note: This service deletes hosts from Rancher if they are terminated in AWS. All care but no responsibility taken. Validate it in a test environment first using the "dry run" setting.

Overview

This is a Docker service which automatically deletes hosts from Rancher if they have been terminated in AWS.

If you have set up an autoscaled fleet of Cattle hosts that scales up and down automatically, you've probably noticed that Rancher does not automatically deactivate and delete the terminated hosts. Besides cluttering the Rancher UI and API, this has a practical cost: containers without health checks that were running on a terminated host may not be rescheduled onto healthy hosts. You must then manually delete the terminated hosts in Rancher to force it to reschedule these containers.

Although somewhat annoying, this is really the correct behaviour by Rancher. It has no way to determine whether the host has really been terminated or whether it has just lost contact with the agent on that host (say, due to a network partition).

To work around this problem, this container periodically checks the Rancher API for hosts in the "reconnecting" state. For each of these hosts it tries to find the corresponding instance in AWS. If the instance exists and is in the "terminated" state, the service deactivates and deletes the host in Rancher.
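The decision described above can be sketched in a few lines of Ruby. This is a minimal illustration, not the service's actual code: the `Host` struct, `hosts_to_reap`, `reap`, and the two callables are hypothetical names, and the AWS lookup and Rancher deletion are passed in so the logic can be read (and exercised) without network access.

```ruby
# Hypothetical stand-in for a Rancher host record.
Host = Struct.new(:name, :state, :labels)

# Returns the hosts that qualify for removal.
# aws_state_for is a callable taking (instance_id, availability_zone) and
# returning the EC2 state name as a string, or nil if it cannot be determined.
def hosts_to_reap(hosts, aws_state_for)
  hosts.select do |host|
    next false unless host.state == 'reconnecting'
    id = host.labels['aws.instance_id']
    az = host.labels['aws.availability_zone']
    next false if id.nil? || az.nil? # unlabelled hosts are skipped
    aws_state_for.call(id, az) == 'terminated'
  end
end

# delete_host is a callable that deactivates and deletes one host in Rancher.
def reap(hosts, aws_state_for, delete_host, dry_run: false)
  hosts_to_reap(hosts, aws_state_for).each do |host|
    if dry_run
      puts "Dry run: would delete #{host.name}"
    else
      delete_host.call(host)
    end
  end
end
```

Hosts that are still reachable, that are missing either label, or whose instance is not reported as terminated are left alone, which matches the cautious behaviour the paragraph describes.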

Running

Labelling Hosts

To determine whether a Rancher host has been terminated in AWS, this service needs to find the corresponding AWS instance in the AWS API. This turns out to be quite difficult using only the information that is available in the Rancher API, so this service presently requires you to label your Rancher hosts with the following labels:

  • aws.instance_id - the AWS instance ID, e.g. "i-8b92d524"
  • aws.availability_zone - the availability zone in which the instance resides, e.g. "us-west-1a"

The easiest way to do this is to look these values up from the AWS metadata service when starting the Rancher agent. For example:

$ sudo docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock \
    -e CATTLE_HOST_LABELS="aws.instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)&aws.availability_zone=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)" \
    rancher/agent:v1.0.2 http://<rancher-server>/v1/scripts/<registrationToken>

The names of these labels can be configured through environment variables. See the configuration documentation below for more details.

Running the Service

You will generally run one instance of this container in each Rancher environment. The following Rancher config should provide you with a good starting point:

docker-compose.yml:

rancher-reaper:
  image: ampedandwired/rancher-reaper:latest
  tty: true
  environment:
    AWS_ACCESS_KEY_ID: ${AccessKeyId}
    AWS_SECRET_ACCESS_KEY: ${SecretAccessKey}
  labels:
    io.rancher.container.create_agent: 'true'
    io.rancher.container.agent.role: environment

rancher-compose.yml:

rancher-reaper:
  scale: 1
  health_check:
    port: 3000
    interval: 2000
    unhealthy_threshold: 3
    strategy: recreate
    response_timeout: 2000
    request_line: GET / HTTP/1.0
    healthy_threshold: 2

This container requires the following environment variables to be set:

  • CATTLE_URL
  • CATTLE_ACCESS_KEY
  • CATTLE_SECRET_KEY
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY

The easiest way to set the CATTLE_* variables is to set up your container with a service account by applying the following labels:

io.rancher.container.create_agent: 'true'
io.rancher.container.agent.role: environment

The AWS_* variables should contain AWS API keys that have ec2:DescribeInstances and ec2:DescribeRegions permissions on all resources. Use this IAM policy as a guide:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeRegions"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

It is also possible to run a single global instance of this service that acts on hosts in all environments by using a global API key. If you take this approach, you must set the CATTLE_* environment variables manually, because the "service account" approach described above creates single-environment keys only.

Configuration

The following environment variables can be used to control the behaviour of this service:

  • REAPER_INTERVAL_SECS - the interval in seconds between checking host status (default: 30). If set to "-1" the container will run in "one-shot" mode, in which it will check hosts once and then shut down. This is useful if you're running the container using an external scheduler such as Rancher cron.
  • REAPER_DRY_RUN - If set to "true", this service will simply log what it would do without actually doing it (default: false)
  • REAPER_INSTANCE_ID_LABEL_NAME - The name of the Rancher host label that holds the AWS instance ID. Defaults to aws.instance_id.
  • REAPER_AVAILABILITY_ZONE_LABEL_NAME - The name of the Rancher host label that holds the AWS availability zone. Defaults to aws.availability_zone.
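As an illustration of how these settings and their defaults fit together, here is a minimal Ruby sketch. The `ReaperConfig` struct and `load_reaper_config` helper are hypothetical names introduced for this example and are not taken from the service's source; only the variable names and defaults come from the list above.

```ruby
# Hypothetical container for the documented settings.
ReaperConfig = Struct.new(:interval_secs, :dry_run,
                          :instance_id_label, :availability_zone_label) do
  # An interval of -1 means "check once and exit".
  def one_shot?
    interval_secs == -1
  end
end

# Reads settings from an ENV-like object (ENV itself or a plain Hash),
# falling back to the documented defaults.
def load_reaper_config(env = ENV)
  ReaperConfig.new(
    Integer(env.fetch('REAPER_INTERVAL_SECS', '30')),
    env.fetch('REAPER_DRY_RUN', 'false') == 'true',
    env.fetch('REAPER_INSTANCE_ID_LABEL_NAME', 'aws.instance_id'),
    env.fetch('REAPER_AVAILABILITY_ZONE_LABEL_NAME', 'aws.availability_zone')
  )
end
```

Passing a Hash instead of `ENV` makes it easy to try out combinations, for example `load_reaper_config('REAPER_INTERVAL_SECS' => '-1').one_shot?`.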

Developing

Suggestions and pull requests welcome at the GitHub repo.

To run locally in Docker, set the environment variables listed above and run:

$ docker-compose up

Or without Docker (local Ruby 2.x installation required):

$ bundle install
$ bundle exec thin -R lib/config.ru start

rancher-reaper's People

Contributors

ampedandwired, denverj, jhmartin

rancher-reaper's Issues

eu-west-2 (London) availability zones being treated as invalid

Getting this issue when removing hosts in the London region:

06/04/2017 16:23:26I, [2017-04-06T15:23:26.799212 #8]  INFO -- : Reaping terminated AWS hosts...
06/04/2017 16:23:26W, [2017-04-06T15:23:26.872705 #8]  WARN -- : Host ip-10-5-2-55.eu-west-2.compute.internal is labelled with an invalid availability zone: eu-west-2b
06/04/2017 16:23:26I, [2017-04-06T15:23:26.872826 #8]  INFO -- : Host ip-10-5-2-55.eu-west-2.compute.internal is not labelled correctly with AWS instance ID and region - skipping
06/04/2017 16:23:26W, [2017-04-06T15:23:26.891732 #8]  WARN -- : Host ip-10-5-1-128.eu-west-2.compute.internal is labelled with an invalid availability zone: eu-west-2a
06/04/2017 16:23:26I, [2017-04-06T15:23:26.891819 #8]  INFO -- : Host ip-10-5-1-128.eu-west-2.compute.internal is not labelled correctly with AWS instance ID and region - skipping
06/04/2017 16:23:26W, [2017-04-06T15:23:26.919266 #8]  WARN -- : Host ip-10-5-2-192.eu-west-2.compute.internal is labelled with an invalid availability zone: eu-west-2b
06/04/2017 16:23:26I, [2017-04-06T15:23:26.919367 #8]  INFO -- : Host ip-10-5-2-192.eu-west-2.compute.internal is not labelled correctly with AWS instance ID and region - skipping

aws_sdk bug prevents rancher-reaper from removing hosts

It appears there is an issue with the AWS SDK which prevents Rancher Reaper from reaping hosts that have been terminated for more than 1 hour.

aws/aws-sdk-ruby#1449

As far as I can tell, this isn't really a flaw in rancher-reaper, other than perhaps to catch and more cleanly handle the NoMethodError.

This is what shows up in my logs:

I, [2017-03-17T01:46:00.182630 #7]  INFO -- : Reaping terminated AWS hosts...
E, [2017-03-17T01:46:00.511135 #7] ERROR -- : undefined method `[]' for nil:NilClass (NoMethodError)
/usr/local/bundle/gems/aws-sdk-resources-2.6.3/lib/aws-sdk-resources/resource.rb:223:in `block in add_data_attribute'
/usr/src/app/reaper.rb:79:in `host_terminated?'
/usr/src/app/reaper.rb:45:in `block in reap_hosts'
/usr/src/app/rancher_api.rb:20:in `yield'
/usr/src/app/rancher_api.rb:20:in `block (3 levels) in get_all'
/usr/src/app/rancher_api.rb:20:in `each'
/usr/src/app/rancher_api.rb:20:in `block (2 levels) in get_all'
/usr/src/app/rancher_api.rb:17:in `loop'
/usr/src/app/rancher_api.rb:17:in `block in get_all'
/usr/src/app/reaper.rb:44:in `each'
/usr/src/app/reaper.rb:44:in `each'
/usr/src/app/reaper.rb:44:in `reap_hosts'
/usr/src/app/reaper.rb:24:in `run'
config.ru:24:in `block (2 levels) in <main>'

Otherwise, the solution will probably require building a new container once the upstream bug is resolved.

Make labels configurable?

I'm looking at trying this but already have labels for hosts that don't match the code. Is there some way to override these easily or make them configurable? Maybe an environment variable?

Missing region error while reaping terminated host

I have Rancher set up in the eu-west-1b availability zone.

The hosts have labels set up as suggested in the documentation:
aws.availability_zone=eu-west-1b
aws.instance_id=i-63ba5cf5

When I terminate a host, this error shows up:

11/11/2016 10:09:36I, [2016-11-11T08:09:36.505244 #5] INFO -- : Reaping terminated AWS hosts...
11/11/2016 10:09:36E, [2016-11-11T08:09:36.545251 #5] ERROR -- : missing region; use :region option or export region name to ENV['AWS_REGION'] (Aws::Errors::MissingRegionError)
11/11/2016 10:09:36/usr/local/bundle/gems/aws-sdk-core-2.6.3/lib/aws-sdk-core/plugins/regional_endpoint.rb:34:in `after_initialize'
11/11/2016 10:09:36/usr/local/bundle/gems/aws-sdk-core-2.6.3/lib/seahorse/client/base.rb:84:in `block in after_initialize'
11/11/2016 10:09:36/usr/local/bundle/gems/aws-sdk-core-2.6.3/lib/seahorse/client/base.rb:83:in `each'
11/11/2016 10:09:36/usr/local/bundle/gems/aws-sdk-core-2.6.3/lib/seahorse/client/base.rb:83:in `after_initialize'
11/11/2016 10:09:36/usr/local/bundle/gems/aws-sdk-core-2.6.3/lib/seahorse/client/base.rb:21:in `initialize'
11/11/2016 10:09:36/usr/local/bundle/gems/aws-sdk-core-2.6.3/lib/seahorse/client/base.rb:105:in `new'
11/11/2016 10:09:36/usr/src/app/reaper.rb:119:in `valid_regions'
11/11/2016 10:09:36/usr/src/app/reaper.rb:109:in `region'
11/11/2016 10:09:36/usr/src/app/reaper.rb:97:in `has_aws_tags?'
11/11/2016 10:09:36/usr/src/app/reaper.rb:71:in `host_terminated?'
11/11/2016 10:09:36/usr/src/app/reaper.rb:45:in `block in reap_hosts'
11/11/2016 10:09:36/usr/src/app/rancher_api.rb:20:in `yield'
11/11/2016 10:09:36/usr/src/app/rancher_api.rb:20:in `block (3 levels) in get_all'
11/11/2016 10:09:36/usr/src/app/rancher_api.rb:20:in `each'
11/11/2016 10:09:36/usr/src/app/rancher_api.rb:20:in `block (2 levels) in get_all'
11/11/2016 10:09:36/usr/src/app/rancher_api.rb:17:in `loop'
11/11/2016 10:09:36/usr/src/app/rancher_api.rb:17:in `block in get_all'
11/11/2016 10:09:36/usr/src/app/reaper.rb:44:in `each'
11/11/2016 10:09:36/usr/src/app/reaper.rb:44:in `each'
11/11/2016 10:09:36/usr/src/app/reaper.rb:44:in `reap_hosts'
11/11/2016 10:09:36/usr/src/app/reaper.rb:24:in `run'
11/11/2016 10:09:36config.ru:24:in `block (2 levels) in <main>'

ERROR undefined method `request_uri'

I get the following errors when trying to run the reaper.

docker run -e CATTLE_URL=<redacted> -e CATTLE_ACCESS_KEY=<redacted> -e CATTLE_SECRET_KEY=<redacted> -e AWS_ACCESS_KEY_ID=<redacted> -e AWS_SECRET_ACCESS_KEY=<redacted> -e REAPER_DRY_RUN=true -e REAPER_INTERVAL_SECS=-1 -e REAPER_INSTANCE_ID_LABEL_NAME=instanceid --rm ampedandwired/rancher-reaper:latest
Unable to find image 'ampedandwired/rancher-reaper:latest' locally
latest: Pulling from ampedandwired/rancher-reaper
90f4dba627d6: Pull complete 
98c1a7514ba6: Pull complete 
e970acb20b34: Pull complete 
5a1603643434: Pull complete 
505ec6100fdd: Pull complete 
ccd0db5bc7b3: Pull complete 
0353f8ee33b8: Pull complete 
5e4129a6295c: Pull complete 
340db6589be4: Pull complete 
09933c1412db: Pull complete 
5d147c81f520: Pull complete 
Digest: sha256:ce5e47c655aa5f9594cf638d245d18fe327e53b7604fc920587fa8dd2bfdd048
Status: Downloaded newer image for ampedandwired/rancher-reaper:latest
/usr/local/bundle/gems/thin-1.7.0/lib/thin/server.rb:107: warning: constant ::Fixnum is deprecated
I, [2018-05-17T22:43:50.830220 #7]  INFO -- : Rancher AWS host reaper started
I, [2018-05-17T22:43:50.830272 #7]  INFO -- : Reaping terminated AWS hosts...
W, [2018-05-17T22:43:50.830307 #7]  WARN -- : *** Dry run - no changes will be applied
E, [2018-05-17T22:43:50.830462 #7] ERROR -- : undefined method `request_uri' for #<URI::Generic:0x0055df8aa7b178> (NoMethodError)
/usr/src/app/rancher_api.rb:58:in `make_api_request'
/usr/src/app/rancher_api.rb:12:in `get'
/usr/src/app/rancher_api.rb:18:in `block (2 levels) in get_all'
/usr/src/app/rancher_api.rb:17:in `loop'
/usr/src/app/rancher_api.rb:17:in `block in get_all'
/usr/src/app/reaper.rb:52:in `each'
/usr/src/app/reaper.rb:52:in `each'
/usr/src/app/reaper.rb:52:in `reap_hosts'
/usr/src/app/reaper.rb:32:in `run'
config.ru:30:in `block (2 levels) in <main>'
I, [2018-05-17T22:43:50.831671 #7]  INFO -- : Rancher AWS host reaper exited

Is this a user error or is the container not up-to-date? Thanks for any help with this.
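One possible cause, which the issue itself does not confirm, is a CATTLE_URL value without an `http://` or `https://` scheme. Ruby's standard `URI.parse` returns a `URI::Generic` for such a string, and `request_uri` is only defined on `URI::HTTP` and its subclasses. The sketch below shows a check that would surface this earlier with a clearer message; `parse_rancher_url` is a hypothetical helper and the example hostname is made up.

```ruby
require 'uri'

# Hypothetical validation helper, not part of the service.
# Returns a URI::HTTP (or URI::HTTPS) or raises ArgumentError.
def parse_rancher_url(value)
  uri = URI.parse(value.to_s)
  unless uri.is_a?(URI::HTTP) && uri.host
    raise ArgumentError,
          "CATTLE_URL must be an absolute http(s) URL, got #{value.inspect}"
  end
  uri
end
```

With this check, `parse_rancher_url('https://rancher.example.com/v1')` yields a URI that responds to `request_uri`, while `parse_rancher_url('rancher.example.com/v1')` fails immediately with an `ArgumentError` instead of a `NoMethodError` deep inside the request code.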

https CATTLE_URL not working?

Hi,

Great work you have done here. When trying to deploy the container on one of our environments, we ran into the issue that our Rancher server is set up on HTTPS only.

You get a stack trace like the one below. Opening the HTTP port (as a workaround) fixes that, but it does not fit our security policy. Unfortunately my Ruby skills are not that great, otherwise I would have tried to fix it myself.

Thanks,

E, [2016-09-30T11:31:10.430732 #5] ERROR -- : 784: unexpected token at '
30-9-2016 13:31:10<title>400 Bad Request</title>
Bad Request
Your browser sent a request that this server could not understand.
30-9-2016 13:31:10Reason: You're speaking plain HTTP to an SSL-enabled server port.
30-9-2016 13:31:10 Instead use the HTTPS scheme to access this URL, please.
30-9-2016 13:31:10Apache/2.4.7 (Ubuntu) Server at rancher.domain Port 443
30-9-2016 13:31:10' (JSON::ParserError)
30-9-2016 13:31:10/usr/local/lib/ruby/2.3.0/json/common.rb:156:in `parse'
30-9-2016 13:31:10/usr/local/lib/ruby/2.3.0/json/common.rb:156:in `parse'
30-9-2016 13:31:10/usr/src/app/rancher_api.rb:17:in `get'
30-9-2016 13:31:10/usr/src/app/rancher_api.rb:23:in `block (2 levels) in get_all'
30-9-2016 13:31:10/usr/src/app/rancher_api.rb:22:in `loop'
30-9-2016 13:31:10/usr/src/app/rancher_api.rb:22:in `block in get_all'
30-9-2016 13:31:10/usr/src/app/reaper.rb:37:in `each'
30-9-2016 13:31:10/usr/src/app/reaper.rb:37:in `each'
30-9-2016 13:31:10/usr/src/app/reaper.rb:37:in `reap_hosts'
30-9-2016 13:31:10/usr/src/app/reaper.rb:21:in `run'
30-9-2016 13:31:10config.ru:24:in `block (2 levels) in <main>'
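The "speaking plain HTTP to an SSL-enabled server port" message suggests the client connected to port 443 without enabling TLS. With Ruby's standard `Net::HTTP`, TLS has to be switched on explicitly. The following is a minimal sketch of deriving that setting from the URL scheme; `http_client_for` is a hypothetical helper, and this is not necessarily how the service's `rancher_api.rb` is (or was) written.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds an unconnected Net::HTTP client that uses
# TLS whenever the URL scheme is https.
def http_client_for(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http
end
```

Because `uri.port` defaults to 443 for `https` URLs and 80 for `http` URLs, the same helper covers both setups without extra configuration.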

RancherOS compatibility

Labeling RancherOS nodes with dynamic values like instance-id at startup is tricky since the Rancher integration is not a shell command. Would it be possible to have a 'global service' Docker container that updates a node with the appropriate labels?

Doesn't handle malformed aws.availability_zone labels

Great PoC! Nicely documented and easy to get running.

I think you need to either derive the Region from the AWS Availability Zone or pass the region as an entirely separate label. Either way, the AWS CLI (and SDKs respectively) need to use the region only. You could gsub out the last letter from the AZ name, maybe.

10/25/2016 10:26:26 PMI, [2016-10-26T02:26:26.509312 #5]  INFO -- : Reaping terminated AWS hosts...
10/25/2016 10:26:26 PMW, [2016-10-26T02:26:26.509420 #5]  WARN -- : *** Dry run - no changes will be applied
10/25/2016 10:26:26 PME, [2016-10-26T02:26:26.856604 #5] ERROR -- : :region option must a region name, not an availability zone name; try `eu-central-1' instead of `eu-central-1a' (ArgumentError)
10/25/2016 10:26:26 PM/usr/local/bundle/gems/aws-sdk-core-2.6.3/lib/aws-sdk-core/plugins/ec2_region_validation.rb:11:in `after_initialize'
10/25/2016 10:26:26 PM/usr/local/bundle/gems/aws-sdk-core-2.6.3/lib/seahorse/client/base.rb:84:in `block in after_initialize'
10/25/2016 10:26:26 PM/usr/local/bundle/gems/aws-sdk-core-2.6.3/lib/seahorse/client/base.rb:83:in `each'
10/25/2016 10:26:26 PM/usr/local/bundle/gems/aws-sdk-core-2.6.3/lib/seahorse/client/base.rb:83:in `after_initialize'
10/25/2016 10:26:26 PM/usr/local/bundle/gems/aws-sdk-core-2.6.3/lib/seahorse/client/base.rb:21:in `initialize'
10/25/2016 10:26:26 PM/usr/local/bundle/gems/aws-sdk-core-2.6.3/lib/seahorse/client/base.rb:105:in `new'
10/25/2016 10:26:26 PM/usr/src/app/reaper.rb:66:in `host_terminated?'
10/25/2016 10:26:26 PM/usr/src/app/reaper.rb:38:in `block in reap_hosts'
10/25/2016 10:26:26 PM/usr/src/app/rancher_api.rb:20:in `yield'
10/25/2016 10:26:26 PM/usr/src/app/rancher_api.rb:20:in `block (3 levels) in get_all'
10/25/2016 10:26:26 PM/usr/src/app/rancher_api.rb:20:in `each'
10/25/2016 10:26:26 PM/usr/src/app/rancher_api.rb:20:in `block (2 levels) in get_all'
10/25/2016 10:26:26 PM/usr/src/app/rancher_api.rb:17:in `loop'
10/25/2016 10:26:26 PM/usr/src/app/rancher_api.rb:17:in `block in get_all'
10/25/2016 10:26:26 PM/usr/src/app/reaper.rb:37:in `each'
10/25/2016 10:26:26 PM/usr/src/app/reaper.rb:37:in `each'
10/25/2016 10:26:26 PM/usr/src/app/reaper.rb:37:in `reap_hosts'
10/25/2016 10:26:26 PM/usr/src/app/reaper.rb:21:in `run'
10/25/2016 10:26:26 PMconfig.ru:24:in `block (2 levels) in <main>'
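The first option suggested above, deriving the region from the availability zone, could look roughly like the following. This is a sketch under the assumption that labels use standard zone names such as "eu-central-1a"; `region_from_availability_zone` is a hypothetical helper, and anything that does not match that shape is reported as `nil` rather than guessed at.

```ruby
# Hypothetical helper: derives the region by dropping the trailing zone
# letter from a standard availability zone name. Returns nil for input
# that does not look like "<region><letter>".
def region_from_availability_zone(az)
  match = /\A([a-z]{2}(?:-[a-z]+)+-\d+)[a-z]\z/.match(az.to_s)
  match && match[1]
end
```

Returning `nil` for malformed labels lets the caller log a warning and skip the host instead of passing an invalid value to the AWS SDK.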

Incorrect handling of nonexistent instances

When initially launching the reaper with several 'reconnecting' nodes that have since aged out of AWS:

10/31/2016 10:21:15 AMI, [2016-10-31T17:21:15.104529 #5]  INFO -- : Reaping terminated AWS hosts...
10/31/2016 10:21:15 AMW, [2016-10-31T17:21:15.104599 #5]  WARN -- : *** Dry run - no changes will be applied
10/31/2016 10:21:15 AME, [2016-10-31T17:21:15.171469 #5] ERROR -- : undefined method `[]' for nil:NilClass (NoMethodError)
10/31/2016 10:21:15 AM/usr/local/bundle/gems/aws-sdk-resources-2.6.3/lib/aws-sdk-resources/resource.rb:223:in `block in add_data_attribute'
10/31/2016 10:21:15 AM/usr/src/app/reaper.rb:75:in `host_terminated?'
10/31/2016 10:21:15 AM/usr/src/app/reaper.rb:45:in `block in reap_hosts'
10/31/2016 10:21:15 AM/usr/src/app/rancher_api.rb:20:in `yield'
10/31/2016 10:21:15 AM/usr/src/app/rancher_api.rb:20:in `block (3 levels) in get_all'
10/31/2016 10:21:15 AM/usr/src/app/rancher_api.rb:20:in `each'
10/31/2016 10:21:15 AM/usr/src/app/rancher_api.rb:20:in `block (2 levels) in get_all'
10/31/2016 10:21:15 AM/usr/src/app/rancher_api.rb:17:in `loop'
10/31/2016 10:21:15 AM/usr/src/app/rancher_api.rb:17:in `block in get_all'
10/31/2016 10:21:15 AM/usr/src/app/reaper.rb:44:in `each'
10/31/2016 10:21:15 AM/usr/src/app/reaper.rb:44:in `each'
10/31/2016 10:21:15 AM/usr/src/app/reaper.rb:44:in `reap_hosts'
10/31/2016 10:21:15 AM/usr/src/app/reaper.rb:24:in `run'
10/31/2016 10:21:15 AMconfig.ru:24:in `block (2 levels) in <main>'

It looks like the API is not triggering the nonexistent instance catch block and is instead triggering a NilClass error.
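One defensive way to handle this case is to treat both "no data returned" and the `NoMethodError` seen in the trace as "instance not found", and let the caller decide what to do with that result. The sketch below is illustrative only: `instance_status` and the `lookup` callable are hypothetical and stand in for whatever AWS SDK call the service makes.

```ruby
# Hypothetical wrapper. `lookup` is any callable that returns the instance
# state name (for example "terminated"), or nil when AWS no longer knows
# about the instance.
def instance_status(lookup, instance_id)
  state = lookup.call(instance_id)
  state.nil? ? :not_found : state.to_sym
rescue NoMethodError
  # Mirrors the failure in the trace above, where reading an attribute of
  # an instance with no data raised instead of returning nil.
  :not_found
end
```

Rescuing `NoMethodError` this broadly can hide unrelated bugs in the lookup, so a real fix would more likely narrow the rescue to the specific SDK call or check for the instance's existence first.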
