
marathon-appcop's Introduction

AppCop

Marathon AppCop - Marathon applications law enforcement.

In large Mesos deployments there can be thousands of applications running and deploying every day. Some of them end up broken, forgotten, or unmaintained, which puts pressure on the cluster in numerous ways.

To address that, AppCop clears broken application deployments from Marathon.

How it works

AppCop takes information about application failures from the Marathon event stream and scales the failing applications down.

Scoring Mechanism

Based on Marathon events (TASK_KILL, TASK_FAIL, TASK_FINISHED), AppCop builds a score registry with an entry for each application that emits events. Each event increments the application's score, so the score keeps rising as long as failure-related events keep coming. When an application passes the threshold, AppCop forcefully scales it down by one instance and puts the appcop label in the app definition. After that, the score for this application is reset. When only one instance is left and the score passes the threshold, the application is suspended. Scores are also reset periodically.
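The scoring loop described above can be sketched roughly as follows. This is a minimal illustration, not AppCop's actual code: the `Registry` type, its method names, and the choice to treat "passes the threshold" as "greater than or equal to" are all assumptions made for the example.

```go
package main

import "fmt"

// Action is the decision this sketch takes for an application.
type Action string

const (
	None      Action = "none"
	ScaleDown Action = "scaleDown"
	Suspend   Action = "suspend"
)

// Registry is a hypothetical score registry keyed by application ID.
type Registry struct {
	threshold int
	scores    map[string]int
}

func NewRegistry(threshold int) *Registry {
	return &Registry{threshold: threshold, scores: make(map[string]int)}
}

// Record increments the score for the event types named in the README.
func (r *Registry) Record(appID, event string) {
	switch event {
	case "TASK_KILL", "TASK_FAIL", "TASK_FINISHED":
		r.scores[appID]++
	}
}

// Evaluate decides what to do with an app and resets its score once the
// threshold is reached (assumption: reaching the threshold counts as passing it).
func (r *Registry) Evaluate(appID string, instances int) Action {
	if r.scores[appID] < r.threshold {
		return None
	}
	r.scores[appID] = 0
	if instances <= 1 {
		return Suspend
	}
	return ScaleDown
}

// Reset clears every score, as the periodic reset would.
func (r *Registry) Reset() {
	r.scores = make(map[string]int)
}

func main() {
	r := NewRegistry(3)
	for i := 0; i < 3; i++ {
		r.Record("com.example.exampleapp", "TASK_FAIL")
	}
	fmt.Println(r.Evaluate("com.example.exampleapp", 2)) // threshold reached with 2 instances
	fmt.Println(r.Evaluate("com.example.exampleapp", 2)) // score was reset by the previous call
}
```

In this sketch the threshold corresponds to the scale-down-score option, and `Evaluate` would be called once per evaluate-interval.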

Garbage Collection

AppCop periodically fetches applications and groups from Marathon. When an application has been suspended, or a group has been empty, for a long (configurable) time, it is deleted.

Metrics

AppCop provides a set of standard system metrics as well as application-based metrics.

Metric Types

System Metrics - AppCop-specific telemetry (e.g. queue size, event delays). They are published under metrics-prefix followed by metrics-system-sub-prefix.

Applications Metrics - application telemetry calculated from events provided by Marathon (such as task_killed and task_finished counters). They are published under metrics-prefix followed by metrics-app-sub-prefix.

Note the appid-prefix config option: if set, the matching string is removed from the application ID when metrics are published. For example, assuming

appid-prefix = com.example.
appID = com.example.exampleapp

your application's metrics will be placed under:

{prefix}.{metrics-app-sub-prefix}.exampleapp
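As a rough illustration of the naming rule above, the following sketch builds the metric path by stripping appid-prefix from the application ID. The function name `metricName` and the sample prefix value `stats` are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// metricName joins the metrics prefix, the application sub-prefix, and the
// application ID with appIDPrefix removed from its start (if present).
func metricName(prefix, appSubPrefix, appIDPrefix, appID string) string {
	name := strings.TrimPrefix(appID, appIDPrefix)
	return strings.Join([]string{prefix, appSubPrefix, name}, ".")
}

func main() {
	fmt.Println(metricName("stats", "applications", "com.example.", "com.example.exampleapp"))
}
```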

Installation

Installing from source code

To simply compile and run the source code:

go run main.go [options]

To run the tests:

make test

To build the binary:

make build

To build the deb package:

make pack

Check the dist/ directory for the result.

Setting up AppCop

AppCop should be installed on all Marathon masters. The event subscription should be set to localhost to reduce network traffic. Refer to the Options section for details.

Marathon Labels

AppCop uses Marathon labels to communicate its actions and to tune execution logic.

Used labels:

| Name | Possible values | r/w | Description |
| --- | --- | --- | --- |
| appcop | suspend, scaleDown | w | Every time AppCop scales down or suspends an application, it puts the appropriate label in the app definition |
| APP_IMMUNITY | false, true | r | When AppCop encounters this label in an app definition, it treats the app as immune to all penalties (excused from all criminal acts on the cluster). Use this feature wisely: if applied too often, it defeats the whole purpose of using AppCop |

r - the label is read from the app definition and not altered; w - the label is written by AppCop.
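A minimal sketch of how these two labels might be handled is shown below. The helper names `isImmune` and `withAppCopLabel` are hypothetical, and treating only the exact string `true` as immunity is an assumption.

```go
package main

import "fmt"

// isImmune reports whether the app definition carries APP_IMMUNITY=true.
// Reading a missing key (or a nil map) yields "", so unlabeled apps are not immune.
func isImmune(labels map[string]string) bool {
	return labels["APP_IMMUNITY"] == "true"
}

// withAppCopLabel returns a copy of labels with the appcop label set to the
// action taken ("scaleDown" or "suspend"); the input map is left untouched.
func withAppCopLabel(labels map[string]string, action string) map[string]string {
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	out["appcop"] = action
	return out
}

func main() {
	labels := map[string]string{"team": "example"}
	fmt.Println(isImmune(labels))
	fmt.Println(withAppCopLabel(labels, "scaleDown"))
}
```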

Options

| Argument | Default | Description |
| --- | --- | --- |
| config-file | | Path to a JSON file to read configuration from. Note: overrides options set earlier on the command line |
| event-stream-location | /v2/events | Get events from this stream |
| my-leader | marathon-dev | My leader. When the Marathon /v2/leader endpoint returns the same string as this one, subscribe to the event stream and launch jobs |
| events-queue-size | 1000 | Size of the events queue |
| listen | :4444 | Accept connections at this address |
| log-file | | Save logs to a file (e.g. /var/log/appcop.log). If empty, logs are written to STDERR |
| log-format | text | Log format: JSON, text |
| log-level | info | Log level: panic, fatal, error, warn, info or debug |
| marathon-location | example.com:8080 | Marathon URL |
| marathon-password | | Marathon password for basic auth |
| marathon-protocol | http | Marathon protocol (http or https) |
| marathon-ssl-verify | true | Verify certificates when connecting via SSL |
| marathon-timeout | 30s | Time limit for requests made by the Marathon HTTP client. A timeout of zero means no timeout |
| appid-prefix | | Prefix common to all fully qualified application IDs. This prefix is removed from application IDs in metrics (see Metric Types) |
| marathon-username | | Marathon username for basic auth |
| scale-down-score | 30 | Score at which an application is scaled down by one instance |
| scale-limit | 2 | How many scale-down actions to commit in one scaling-down iteration |
| update-interval | 2s | Interval for updating app scores |
| reset-interval | 1d | How often collected scores are reset |
| evaluate-interval | 30s | How often collected scores are compared against scale-down-score |
| metrics-interval | 30s | Metrics reporting interval |
| metrics-location | | Graphite URL (used when metrics-target is set to graphite) |
| metrics-prefix | default | Metrics prefix (default is resolved to .<app_name>) |
| metrics-system-sub-prefix | appcop-internal | System-specific metrics. Appended to metrics-prefix |
| metrics-app-sub-prefix | applications | Application-specific metrics. Appended to metrics-prefix |
| metrics-target | stdout | Metrics destination: stdout or graphite (empty string disables metrics) |
| workers-pool-size | 10 | Number of concurrent workers processing events |
| mgc-enabled | true | Enable Marathon garbage collection; old suspended applications will be deleted |
| mgc-max-suspend-time | 7 days | How long an application should be suspended before it is deleted |
| mgc-interval | 8 hours | Marathon GC interval |
| mgc-appcop-only | true | Delete only applications suspended by AppCop |
| dry-run | false | Perform a trial run with no changes made to Marathon |
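The my-leader option implies a comparison against the Marathon /v2/leader response. The sketch below shows one way that check might be written, assuming the endpoint returns a JSON object with a `leader` field. The function name `isLeader` and the sample host value are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// leaderResponse mirrors the JSON body assumed to come back from /v2/leader.
type leaderResponse struct {
	Leader string `json:"leader"`
}

// isLeader parses a /v2/leader response body and reports whether the current
// leader matches the configured my-leader value exactly.
func isLeader(body []byte, myLeader string) (bool, error) {
	var resp leaderResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, fmt.Errorf("decoding leader response: %w", err)
	}
	return resp.Leader == myLeader, nil
}

func main() {
	body := []byte(`{"leader": "marathon-dev:8080"}`)
	leading, err := isLeader(body, "marathon-dev:8080")
	if err != nil {
		panic(err)
	}
	fmt.Println(leading)
}
```

Only an instance for which this check succeeds would subscribe to the event stream and launch its jobs.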

Endpoints

| Endpoint | Description |
| --- | --- |
| /health | Health check - returns OK |


marathon-appcop's Issues

Releases on Travis

Currently we use our own manual scripts and a Docker release project.
We should consider moving this to Travis.

Leader metric

AppCop should provide a .leader metric indicating that it is leading (subscribed to the event stream and polling events).

Empty group collection, recursive.

Right now AppCop collects groups only from the root directory; if an empty group is nested, it is not collected.

To address that, we need to implement recursive group traversal and remove the last empty group (leaf node).
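The proposed traversal could be sketched as below. The `Group` type is a simplified stand-in for a Marathon group, not the real API structure, and `emptyLeaves` is a hypothetical helper that only finds candidates without deleting anything.

```go
package main

import "fmt"

// Group is a simplified, hypothetical view of a Marathon group.
type Group struct {
	ID     string
	Apps   []string
	Groups []*Group
}

// emptyLeaves walks the tree below g and returns the IDs of groups that have
// neither apps nor subgroups. The root itself is never reported.
func emptyLeaves(g *Group) []string {
	var out []string
	for _, child := range g.Groups {
		if len(child.Apps) == 0 && len(child.Groups) == 0 {
			out = append(out, child.ID)
			continue
		}
		out = append(out, emptyLeaves(child)...)
	}
	return out
}

func main() {
	root := &Group{ID: "/", Groups: []*Group{
		{ID: "/a", Apps: []string{"/a/app"}},
		{ID: "/b", Groups: []*Group{
			{ID: "/b/c"},
			{ID: "/b/d", Apps: []string{"/b/d/app"}},
		}},
	}}
	fmt.Println(emptyLeaves(root))
}
```

Because only leaves are reported, a parent that becomes empty after its last child is removed would be picked up on a later garbage collection pass.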

Application restarts metrics

AppCop should provide a metric for each application status change (TASK_RUNNING, TASK_STAGING, TASK_KILLED) in the form of Graphite metrics, based on events.

Configuration intervals more human readable.

Right now intervals are provided as integer nanoseconds. We need to allow the user to specify them as formatted strings in order to make configuration more readable and less error-prone.

Instead of:
"Score": { "EvaluateInterval": 1800000000000 }

Should be:
"Score": { "EvaluateInterval": "30m" }

Reverse implementation of APP_IMMUNITY

I was looking at the possibility of having the reverse of APP_IMMUNITY: AppCop would only check apps that have a certain flag set. Instead of setting APP_IMMUNITY on a lot of apps, only selected apps would be monitored for event frequency.

Use case:
If some of the apps in the cluster are very critical and need to be up and running at any cost, having AppCop scale down these containers will cause serious availability issues for the entire set of services.

Immunity support

AppCop should honor the immunity of chosen applications by excusing them from most Marathon criminal laws.

Immunity should be set per app, as the Marathon label APP_IMMUNITY=true. When the label is set to true, AppCop will not count the score for this application.

notification support

Each time AppCop takes an action, it should push a notification message stating which app the action was performed on.

We should use JSON and send an HTTP POST to a specified endpoint, where a message broker (e.g. Hermes) can pick it up.

Example message:

{ 
  "timestamp": "1970-01-01T00:00:00+00:00",
  "appID": "com.example.testApp",
  "Action": "scaleDown"
}

Notifications should be disabled by default via config.

appcop is moving from ogier/pflag

At this point AppCop uses ogier/pflag for parsing command-line flags, and that library is considered abandoned. We should consider another solution.

Cobra / Viper has a big community and is considered easy to use.

Dry Run mode

We could implement a dry-run flag. When enabled, AppCop will not make any changes and will only do a dry run of the scaling and suspend operations.

Dry run should continue to log and send metrics.
