ho-cto / sre-monitoring-as-code

Monitoring-as-Code (MaC) is a jsonnet mixin implementation of SLIs/SLO/Error Budgets using the open-source monitoring and alerting eco-system of Prometheus and Grafana.

Home Page: https://ho-cto.github.io/sre-monitoring-as-code/

License: MIT License


sre-monitoring-as-code's Introduction


sre-monitoring-as-code

SRE Monitoring-as-Code (MaC) is a Jsonnet Mixin implementation of SLIs/SLO/Error Budgets using the open-source monitoring and alerting eco-system of Prometheus and Grafana. Our documentation is available to view online.

About the framework

Monitoring Mixins bundle up SLI configuration, alerting, Grafana dashboards, and runbooks into a single package. Engineers commit a monitoring definition file, which triggers the packaging of Prometheus rules and Grafana dashboards and injects them into the monitoring tools. This eases the engineers' burden of writing alerting rules, manually drawing up Grafana dashboards, and writing runbooks.

  • Monitoring Mixins [1] are lightweight, flexible configuration packages that don't mandate specific labels or expressions. You can configure and overwrite everything.
  • Mixins use a data templating language called Jsonnet, which is the only templating language with fully supported libraries for Grafana and Prometheus.
  • jsonnet-bundler is used for package management: once you have a Monitoring Mixin package, you need to install it, keep track of its versions and update it.
  • SRE MaC will be open-sourced and live on the UKHomeOffice GitHub, and can be integrated with any platform which supports pulling containers from GitHub.
  • SLI/SLO/Error Budget configurations match Google SRE [2] industry patterns.

Repository structure

Directory             Description
.githooks/            Contains the client-side pre-commit and pre-push git hooks which form part of our engineering workflow.
.github/              Contains the GitHub Action workflows and associated config.
docs/                 Contains the technical documentation for Monitoring-as-Code using Tech Docs Template and Middleman.
example-apps/         Contains example apps to showcase how custom metrics can be shown within the MaC framework.
local/                Contains a docker-compose implementation of Prometheus, Thanos, Grafana and Alertmanager. The purpose of this project is to test Monitoring-as-Code locally with your application.
monitoring-as-code/   Contains the Jsonnet mixin implementation of SLIs/SLO/Error Budgets for Prometheus and Grafana.
security/             Contains the GitLeaks secret scan configuration.

Installation and usage information is provided in a Readme within each of the directories.

Resources

  1. Prometheus - Monitoring Mixins
  2. Google SRE - Implementing SLOs
  3. Google SRE - Setting SLOs: a step-by-step guide
  4. Liz Fong-Jones - Adopting SRE and Error Budgets
  5. GDS - Run a Service Level Indicator workshop

Licence

Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.

The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.

sre-monitoring-as-code's People

Contributors

amandeepsinghho, anjeleepadayachyho, arifulhaqueho, bailey-96, dependabot[bot], finlaymccormick, finlaymccormickho, georgeowusuho, humayunalamho, irenemolnarho, laurahiles, mahrufiqbalho, michaelpearsonho, peterjakemanho, samiwelthomasho, tombaileyho


sre-monitoring-as-code's Issues

Refactor run-mixin.sh

Make a variety of small changes to run-mixin:

  • make output path flag required
  • make changes to improve POSIX shell compatibility
  • rename temporary-mixin directory to _input and make it temporary
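
The three changes listed above could look roughly like the following in POSIX sh. This is only a sketch: the `-o` flag and the function names `parse_args`, `make_input_dir` and `main` are hypothetical and are not taken from the real run-mixin.sh. Note that `mktemp -d` is widely available but is not itself specified by POSIX.

```shell
#!/bin/sh
# Sketch only: the -o flag and all function names are hypothetical,
# not copied from the real run-mixin.sh.

# Parse flags with POSIX getopts; -o (output path) is required.
parse_args() {
    output_path=""
    OPTIND=1
    while getopts "o:" opt; do
        case "$opt" in
            o) output_path="$OPTARG" ;;
            *) return 2 ;;
        esac
    done
    if [ -z "$output_path" ]; then
        echo "error: -o <output path> is required" >&2
        return 2
    fi
}

# Create a throwaway _input directory and print its path.
# mktemp -d is common on Linux and macOS but is not part of POSIX.
make_input_dir() {
    tmp_root=$(mktemp -d) || return 1
    mkdir "$tmp_root/_input" || return 1
    printf '%s\n' "$tmp_root/_input"
}

main() {
    parse_args "$@" || return 2
    input_dir=$(make_input_dir) || return 1
    # Remove the temporary tree when the script exits.
    trap 'rm -rf "${input_dir%/_input}"' EXIT
    echo "input: $input_dir, output: $output_path"
}
```

A real script would end with `main "$@"`. Keeping the parsing and directory setup in separate functions makes each change easy to check on its own.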

Change process for MaC

The team need to build the change process for the MaC framework (release notes, release cycle, etc.) and build it into the existing process.

Add repo skeleton structure

Thinking of the following structure: -

|-- sre-monitoring-as-code (top level repository)
| |-- components (directory for grouping components)
| | |-- monaco-pg (directory for component a)
| | |-- monaco-dt (directory for component b)
| |-- local (directory for local environment)
| |-- documentation (directory for documentation)

import - no match locally or in the Jsonnet library paths

couldn't open import "grafonnet/grafana.libsonnet": no match locally or in the Jsonnet library paths

local grafana = import 'grafonnet/grafana.libsonnet';

Import the Grafana library (grafonnet) from GitHub rather than locally using relative paths.
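
An error like this usually means the directory that contains `grafonnet/` is not on the Jsonnet library path. The common remedy is to pass that directory with `jsonnet -J <dir>`, for example the `vendor` directory that jsonnet-bundler populates by default. The helper below is a simplified diagnostic for finding which candidate directory holds the import. `find_import_dir` is a hypothetical name, and the function only approximates Jsonnet's own search.

```shell
#!/bin/sh
# Simplified diagnostic, not part of Jsonnet: prints the first directory
# (from the arguments after the import path) in which the import exists.
# find_import_dir is a hypothetical helper name.
find_import_dir() {
    import_path=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$import_path" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    echo "no match for $import_path in: $*" >&2
    return 1
}
```

For example, `find_import_dir grafonnet/grafana.libsonnet . vendor` prints the directory to pass to `-J`, or fails with a message if none of the candidates contains the file.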

Links to internal confluence architecture information

In "Interpret MaC outputs", search for "Most of the labels are outlined in Designing Service Alerting". There are a couple of sections that have placeholder NOTE text. This information is currently in internal documentation; we need to decide where it is kept and how best to present it in the external docs.

remove redundant .gitkeep files

.gitkeep files were being used to retain empty directories on GitHub. These directories have now been populated, so the files need to be removed from local/ and monitoring-as-code/.

Published docs not rendering stylesheets

Changing the visibility of this repository to public has resulted in a switch to the fixed URL https://ho-cto.github.io/sre-monitoring-as-code/. The introduction of a docpath to our site has rendered the CSS stylesheet unusable. A change is required to the tech docs template config file to set the host and service_link as follows: -

host: https://ho-cto.github.io/sre-monitoring-as-code
service_link: /sre-monitoring-as-code

SLAs for SAS SRE

The team need to define some SLAs for SAS SRE support and backlog.

MVP docs

Document MaC onboarding flow.

SLI Set up

How do teams know what error percentage to set for their SLIs? There should be some guidance, because a 99.99% target doesn't mean that you are measuring the performance of your service realistically or cost-effectively. Not all services need that level of accuracy (unless they are critical services): keeping a service at that level may cost more in resources, or there may be known issues or dependencies that prevent the service from performing at that level until they are resolved.

Business strategies and priorities change continually, so SLIs also need to be reviewed regularly once they are set up. Initially, collate data for a month and review it. Once you are confident that the SLIs are at the correct threshold, review them every 6 months to keep them aligned with business priorities.
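
To make the cost of each extra "nine" concrete, the error budget for a target can be worked out over a review window. The sketch below assumes a time-based budget (minutes of allowed unavailability) and a 30-day window to match the one-month review above. `error_budget_minutes` is a hypothetical helper and is not part of MaC.

```shell
#!/bin/sh
# Allowed "bad" minutes for an SLO target (in percent) over a window in days.
# error_budget_minutes is a hypothetical helper, not part of MaC.
# Assumes a time-based budget; a request-based budget would use request counts.
error_budget_minutes() {
    awk -v target="$1" -v days="$2" \
        'BEGIN { printf "%.2f\n", (100 - target) / 100 * days * 24 * 60 }'
}

error_budget_minutes 99.99 30   # prints 4.32
error_budget_minutes 99.9 30    # prints 43.20
error_budget_minutes 99 30      # prints 432.00
```

Moving from 99.9% to 99.99% shrinks the monthly budget from about 43 minutes to about 4, which is the kind of trade-off teams need guidance on.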

Image in Overview > How MaC works

The image in Overview > How MaC works shows the workflow in more detail than we currently cover in the Overview chapter.

  1. Highlight the three overview steps in the Overview chapter: (1) Define SLIs, (2) Implement MaC on a local environment, (3) Distribute MaC on your environment.
  2. Move the image from Overview to Get started > Distribute, as the detail in the diagram is about the steps to distribute MaC.

SLI menu/metric libraries terminology

It would be good to use consistent terminology for the SLI menu/metric libraries:

In Overview > How MaC Works: change "MaC contains configuration libraries..." to "MaC contains metric libraries (Service-level indicator menu)..." with a link to Service-level indicator menu.
