Dynamically manage your drone.io cluster of agents.

Powerful dedicated build machines are costly. If they are busy for only a fraction of their uptime, sizing the agent pool dynamically to match the work queue can cut costs.

For many libre software projects that need only a few hours of compute time per day, this makes it possible to use high-powered build machines for a fraction of the cost of a dedicated machine (e.g. $20/month instead of $200/month).

Drone Overview

Drone server is a libre continuous integration application, which farms out jobs to docker agents when it detects events such as pull requests or new commits.

Drone agents spawn additional (project-defined) docker containers which run the jobs they receive, and report their status back to the main drone server.

An agent can be started on any docker-enabled device with a command like:

docker run -d \
  -e DRONE_SERVER=wss://ci.fommil.com/ws/broker \
  -e DRONE_SECRET=<redacted> \
  -e DOCKER_MAX_PROCS=1 \
  -e DRONE_TIMEOUT=30m \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --restart=always \
  --name=drone-agent \
  drone/drone:0.5 agent

Objectives

This is a daemon application that demonstrates the functional programming style in Scala (this is more important to the authors than quick fixes or features) and manages a (paid-for) resource pool of agents:

if the work queue is not empty, agents should be purchased until sufficient worker resource is available to deplete the queue
no cost should be incurred when the queue is empty for extended periods (or if the application gets into a bad state!)

Inputs, Constraints and Thoughts

We can poll the drone server's REST API (http://readme.drone.io/api/) to get the work queue and infer expected timings of jobs. A websockets API is also available but is known to be unreliable.

A SIGTERM will stop an agent from accepting new jobs but will finish existing jobs (this would be a good drone feature).

It's possible to use the Amazon ECS service; see http://docs.datadoghq.com/integrations/ecs/ for how to set up an appropriate container (a container lives in a Task, Tasks run in Instances, and Instances are grouped in a Cluster).

A reasonable instance for libre projects is a c4.xlarge (4 CPUs, 7.5GB RAM), costing ~$0.06/hour at the spot price (usage is rounded up to the hour). It is possible to set up budgets and alerts on Amazon, which is a good safety net in case of bugs. Their machines are pretty fast: ensime-server takes around 11 mins on 4 CPUs x 7.5GB.

Another option is to use the Google Container service. I managed to get a kubernetes Deployment working with this file (http://stackoverflow.com/questions/42273332). On pricing, they charge similarly to Amazon, with a cheaper preemptible option, and usage is also rounded up hourly (AFAICT). Their machines are marginally slower than the EC2 ones, e.g. an ensime-server build takes 13 mins on 4 CPUs x 8GB. Bumping to 8 CPUs x 8GB gets it to 11 mins, on par with the lower-spec Amazon machine (probably not worth the extra cost).

DigitalOcean have a really nice API, but they are pretty expensive ($0.11 / hour for the same spec machine as Amazon) and their CPUs are super slow: an ensime-server build takes 25 mins.
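To compare providers on the figures quoted above, here is a rough best-case sketch. It assumes that both providers bill whole hours and that builds run back to back on a single machine; the `ProviderCost` and `Provider` names are made up for illustration. Google is left out because no price is quoted for it.

```scala
object ProviderCost {
  // centsPerHour and buildMinutes are the figures quoted in the text above.
  final case class Provider(name: String, centsPerHour: Int, buildMinutes: Int) {
    // Whole ensime-server builds that fit into one billed hour.
    def buildsPerHour: Int = 60 / buildMinutes

    // Best-case cost of one build, i.e. when the billed hour is fully used.
    def centsPerBuild: Double = centsPerHour.toDouble / buildsPerHour
  }

  val amazon       = Provider("Amazon c4.xlarge (spot)", 6, 11)
  val digitalOcean = Provider("DigitalOcean", 11, 25)
}
```

Under these assumptions a fully used hour fits five builds on Amazon and two on DigitalOcean, so the gap per build is wider than the hourly prices alone suggest.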

There are many ways to communicate with Amazon programmatically (see http://docs.aws.amazon.com/AmazonECS/latest/APIReference/Welcome.html). We'd ideally prefer a pure JVM solution (e.g. REST) that does not rely on proprietary binaries.

In an ideal world we would respond to work queue events like so: when a job is submitted, spawn an agent that accepts one job and then shuts down.

However, the APIs available to us do not allow us to do this. Notably:

there is no way to spawn an agent that accepts one job and then exits (this would be a good drone feature)
telling an agent not to accept any more jobs cannot be done remotely
Amazon charges are rounded up to the hour, so starting an Instance for 1 minute costs 1 hour
an ECS Task cannot be started or accepted unless there is a running Instance
worth repeating: Amazon charges per hour of Instance (not per hour of a Task)
if an agent is killed before it finishes, it is not rescheduled (this would be a good drone feature)
it might be possible to bid for cheaper CPU resource by block booking (to confirm)
it might be possible to extend a block-booked period of time
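The hourly rounding in the list above dominates the cost of short jobs. A minimal sketch of the arithmetic, assuming the ~$0.06/hour spot price quoted earlier and whole-hour billing per Instance (the `Billing` object and its method names are hypothetical):

```scala
object Billing {
  // Assumed spot price in cents per billed hour (the ~$0.06/hour figure above).
  val centsPerHour: Long = 6

  // Usage is rounded up to whole hours, so 1 minute is billed as 1 hour.
  def billedHours(minutes: Long): Long =
    if (minutes <= 0) 0 else (minutes + 59) / 60

  // Cost in cents of keeping one Instance alive for the given number of minutes.
  def costCents(minutes: Long): Long = billedHours(minutes) * centsPerHour
}
```

Here `Billing.costCents(1)` and `Billing.costCents(60)` both come to 6 cents, while `Billing.costCents(61)` is 12 cents, so it pays to pack as many jobs as possible into each purchased hour.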

A possible algorithm may be:

receive input from AWS and drone that mutates a forgetful state containing our knowledge about what we have asked AWS/drone and what they have told us
if the queue is non-empty and the resource list is empty and we have not requested more resources, then bid for an hour of Instance
when an Instance appears, start a Task within it
when the hour is almost up (timing TBD), make a go/no-go decision to renew for another hour (i.e. is the work queue empty and are there no currently running jobs?). If no-go, then tell the agents not to accept any more work and schedule the Task/Instance to die.
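The steps above can be sketched as a pure decision function over a snapshot of what we believe about drone and AWS. This is an illustration only: the `World` fields, the `Action` names and the 55-minute threshold are all assumptions (the real timing is still TBD).

```scala
object Scheduler {
  // Snapshot of our current (possibly stale) knowledge.
  final case class World(
    queued: Int,         // jobs waiting in the drone work queue
    running: Int,        // jobs currently being built
    instances: Int,      // Instances we believe are alive
    agents: Int,         // Tasks (agents) started inside those Instances
    pendingBids: Int,    // bids made but not yet fulfilled
    minutesIntoHour: Int // age of the current billed hour
  )

  sealed trait Action
  case object BidForInstance extends Action
  case object StartTask      extends Action
  case object Renew          extends Action
  case object DrainAndStop   extends Action
  case object Wait           extends Action

  // Hypothetical: how late in the hour the go/no-go decision is made.
  val decisionMinute = 55

  def decide(w: World): Action =
    if (w.instances == 0) {
      // Only bid when there is work and no bid is already outstanding.
      if (w.queued > 0 && w.pendingBids == 0) BidForInstance else Wait
    } else if (w.agents < w.instances) StartTask
    else if (w.minutesIntoHour >= decisionMinute) {
      // No-go only when there is nothing queued and nothing running.
      if (w.queued == 0 && w.running == 0) DrainAndStop else Renew
    } else Wait
}
```

Keeping the decision free of side effects means it can be tested without touching AWS or drone, and the actions it returns can be carried out by whatever interpreter is in use.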

This business logic is to be fully implemented using the free monad pattern (i.e. no business logic in the implementation details) and ideally in such a way that another (non-Amazon) interpreter could be written.
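As a rough illustration of the free monad pattern, the sketch below defines a tiny `Free` inline so that it is self-contained; a real implementation would more likely take `Free` from a library such as scalaz or cats. The instruction set (`QueueLength`, `InstanceCount`, `BidForInstance`), the `Interpreter` trait and the `Fake` interpreter are hypothetical names, not the project's actual algebra, and `run` is not stack-safe.

```scala
object FreeSketch {
  // A minimal free monad: a program is a value describing steps to perform.
  sealed trait Free[F[_], A] {
    def flatMap[B](f: A => Free[F, B]): Free[F, B] = FlatMap[F, A, B](this, f)
    def map[B](f: A => B): Free[F, B] = flatMap(a => Pure[F, B](f(a)))
  }
  final case class Pure[F[_], A](a: A) extends Free[F, A]
  final case class Suspend[F[_], A](fa: F[A]) extends Free[F, A]
  final case class FlatMap[F[_], A, B](fa: Free[F, A], f: A => Free[F, B]) extends Free[F, B]

  // Hypothetical instruction set. Case classes (not case objects) are used so
  // that pattern matching refines the result type in the interpreter.
  sealed trait Op[A]
  final case class QueueLength()    extends Op[Int]
  final case class InstanceCount()  extends Op[Int]
  final case class BidForInstance() extends Op[Unit]

  def lift[A](op: Op[A]): Free[Op, A] = Suspend[Op, A](op)
  def pure[A](a: A): Free[Op, A]      = Pure[Op, A](a)

  // Business logic only: bid when there is work and no Instance. Returns
  // whether a bid was made. Nothing here knows about AWS or drone.
  val step: Free[Op, Boolean] =
    for {
      queued    <- lift(QueueLength())
      instances <- lift(InstanceCount())
      needed     = queued > 0 && instances == 0
      _         <- if (needed) lift(BidForInstance()) else pure(())
    } yield needed

  // An interpreter gives meaning to each instruction.
  trait Interpreter { def apply[A](op: Op[A]): A }

  // Simple recursive runner (not stack-safe, fine for a sketch).
  def run[A](program: Free[Op, A], interp: Interpreter): A = program match {
    case Pure(a)        => a
    case Suspend(op)    => interp(op)
    case FlatMap(fa, f) => run(f(run(fa, interp)), interp)
  }

  // A fake interpreter for tests; a production one would call drone and AWS.
  final class Fake(queued: Int, instances: Int) extends Interpreter {
    var bids = 0
    def apply[A](op: Op[A]): A = op match {
      case QueueLength()    => queued
      case InstanceCount()  => instances
      case BidForInstance() => bids += 1
    }
  }
}
```

Running `FreeSketch.run(FreeSketch.step, new FreeSketch.Fake(queued = 3, instances = 0))` records one bid in the fake. Swapping `Fake` for an interpreter that talks to a different (non-Amazon) provider would leave `step` unchanged.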

yilinwei / drone-dynamic-agents
