Hi, thank you for making Sloth; it looks like a great tool for setting up SLOs! I hope people contribute to the project.
I'm trying to do a small POC, but I can't get it to work.
I currently have kube-prometheus-stack configured in my minikube cluster, and as a test I want to measure SLOs for the Grafana service.
I tried to set up an SLO for it following the getting-started guide, without success. I searched around and found this thread: #125.
Grafana ServiceMonitor definition.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus-stack
    meta.helm.sh/release-namespace: kube-prometheus-stack
  creationTimestamp: "2021-09-18T16:02:38Z"
  generation: 1
  labels:
    app: kube-prometheus-stack-grafana
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 18.0.10
    chart: kube-prometheus-stack-18.0.10
    heritage: Helm
    release: kube-prometheus-stack
  name: kube-prometheus-stack-grafana
  namespace: kube-prometheus-stack
  resourceVersion: "19538"
  uid: 30e38e75-d49c-45ea-a7e4-6df0e1e1b1a8
spec:
  endpoints:
  - path: /metrics
    port: service
  namespaceSelector:
    matchNames:
    - kube-prometheus-stack
  selector:
    matchLabels:
      app.kubernetes.io/instance: kube-prometheus-stack
      app.kubernetes.io/name: grafana
```
Prometheus definition.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus-stack
    meta.helm.sh/release-namespace: kube-prometheus-stack
  creationTimestamp: "2021-09-18T16:02:37Z"
  generation: 1
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 18.0.10
    chart: kube-prometheus-stack-18.0.10
    heritage: Helm
    release: kube-prometheus-stack
  name: kube-prometheus-stack-prometheus
  namespace: kube-prometheus-stack
  resourceVersion: "19456"
  uid: f496e3ac-11c9-407c-b79c-24c320d9d160
spec:
  alerting:
    alertmanagers: []
  enableAdminAPI: false
  externalUrl: http://kube-prometheus-stack-prometheus.kube-prometheus-stack:9090
  image: quay.io/prometheus/prometheus:v2.28.1
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  paused: false
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      release: kube-prometheus-stack
  portName: web
  probeNamespaceSelector: {}
  probeSelector:
    matchLabels:
      release: kube-prometheus-stack
  replicas: 1
  retention: 10d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      app: kube-prometheus-stack
      release: kube-prometheus-stack
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: kube-prometheus-stack-prometheus
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      release: kube-prometheus-stack
  shards: 1
  version: v2.28.1
```
My SLO example.
```yaml
# This example shows the same example as getting-started.yml but using Sloth Kubernetes CRD.
# It will generate the Prometheus rules in a Kubernetes prometheus-operator PrometheusRules CRD.
#
# `sloth generate -i ./examples/k8s-getting-started.yml`
#
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: sloth-slo-grafana
  namespace: kube-prometheus-stack
spec:
  service: "kube-prometheus-stack-grafana"
  labels:
    app: "kube-prometheus-stack"
    release: "kube-prometheus-stack"
    owner: "myteam"
    repo: "myorg/myservice"
    tier: "2"
  slos:
    - name: "requests-availability"
      objective: 99.9
      description: "Common SLO based on availability for HTTP request responses."
      sli:
        events:
          errorQuery: sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[{{.window}}]))
          totalQuery: sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[{{.window}}]))
      alerting:
        name: MyServiceHighErrorRate
        labels:
          category: "availability"
        annotations:
          summary: "High error rate on 'myservice' requests responses"
        pageAlert:
          labels:
            severity: pageteam
            routing_key: myteam
        ticketAlert:
          labels:
            severity: "slack"
            slack_channel: "#alerts-myteam"
```
```
$ kubectl get slo
NAME                SERVICE                         DESIRED SLOS   READY SLOS   GEN OK   GEN AGE   AGE
sloth-slo-grafana   kube-prometheus-stack-grafana   1              1            true     36s       10m
```
But when I generate the PrometheusRule with the sloth binary and apply it, the rules don't appear in Prometheus.
```
$ sloth generate -i slo.yaml > slo-rules.yaml
INFO[0000] SLI plugins loaded  plugins=0 svc=storage.FileSLIPlugin version=v0.6.0-66-g3f0d37f
INFO[0000] Generating from Kubernetes Prometheus spec  version=v0.6.0-66-g3f0d37f
INFO[0000] Multiwindow-multiburn alerts generated  out=- slo=kube-prometheus-stack-grafana-requests-availability svc=generate.prometheus.Service version=v0.6.0-66-g3f0d37f
INFO[0000] SLI recording rules generated  out=- rules=8 slo=kube-prometheus-stack-grafana-requests-availability svc=generate.prometheus.Service version=v0.6.0-66-g3f0d37f
INFO[0000] Metadata recording rules generated  out=- rules=7 slo=kube-prometheus-stack-grafana-requests-availability svc=generate.prometheus.Service version=v0.6.0-66-g3f0d37f
INFO[0000] SLO alert rules generated  out=- rules=2 slo=kube-prometheus-stack-grafana-requests-availability svc=generate.prometheus.Service version=v0.6.0-66-g3f0d37f
```
```yaml
---
# Code generated by Sloth (v0.6.0-66-g3f0d37f): https://github.com/slok/sloth.
# DO NOT EDIT.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    app.kubernetes.io/component: SLO
    app.kubernetes.io/managed-by: sloth
  name: sloth-slo-grafana
  namespace: kube-prometheus-stack
spec:
  groups:
  - name: sloth-slo-sli-recordings-kube-prometheus-stack-grafana-requests-availability
    rules:
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[5m])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[5m])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 5m
        tier: "2"
      record: slo:sli_error:ratio_rate5m
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[30m])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[30m])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 30m
        tier: "2"
      record: slo:sli_error:ratio_rate30m
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[1h])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[1h])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 1h
        tier: "2"
      record: slo:sli_error:ratio_rate1h
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[2h])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[2h])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 2h
        tier: "2"
      record: slo:sli_error:ratio_rate2h
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[6h])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[6h])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 6h
        tier: "2"
      record: slo:sli_error:ratio_rate6h
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[1d])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[1d])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 1d
        tier: "2"
      record: slo:sli_error:ratio_rate1d
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[3d])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[3d])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 3d
        tier: "2"
      record: slo:sli_error:ratio_rate3d
    - expr: |
        sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}[30d])
        / ignoring (sloth_window)
        count_over_time(slo:sli_error:ratio_rate5m{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}[30d])
      labels:
        sloth_window: 30d
      record: slo:sli_error:ratio_rate30d
  - name: sloth-slo-meta-recordings-kube-prometheus-stack-grafana-requests-availability
    rules:
    - expr: vector(0.9990000000000001)
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        tier: "2"
      record: slo:objective:ratio
    - expr: vector(1-0.9990000000000001)
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        tier: "2"
      record: slo:error_budget:ratio
    - expr: vector(30)
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        tier: "2"
      record: slo:time_period:days
    - expr: |
        slo:sli_error:ratio_rate5m{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}
        / on(sloth_id, sloth_slo, sloth_service) group_left
        slo:error_budget:ratio{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        tier: "2"
      record: slo:current_burn_rate:ratio
    - expr: |
        slo:sli_error:ratio_rate30d{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}
        / on(sloth_id, sloth_slo, sloth_service) group_left
        slo:error_budget:ratio{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        tier: "2"
      record: slo:period_burn_rate:ratio
    - expr: 1 - slo:period_burn_rate:ratio{sloth_id="kube-prometheus-stack-grafana-requests-availability",
        sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        tier: "2"
      record: slo:period_error_budget_remaining:ratio
    - expr: vector(1)
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_mode: cli-gen-k8s
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_spec: sloth.slok.dev/v1
        sloth_version: v0.6.0-66-g3f0d37f
        tier: "2"
      record: sloth_slo_info
  - name: sloth-slo-alerts-kube-prometheus-stack-grafana-requests-availability
    rules:
    - alert: MyServiceHighErrorRate
      annotations:
        summary: High error rate on 'myservice' requests responses
        title: (page) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget
          burn rate is too fast.
      expr: |
        (
            (slo:sli_error:ratio_rate5m{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (14.4 * 0.0009999999999999432))
            and ignoring (sloth_window)
            (slo:sli_error:ratio_rate1h{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (14.4 * 0.0009999999999999432))
        )
        or ignoring (sloth_window)
        (
            (slo:sli_error:ratio_rate30m{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (6 * 0.0009999999999999432))
            and ignoring (sloth_window)
            (slo:sli_error:ratio_rate6h{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (6 * 0.0009999999999999432))
        )
      labels:
        category: availability
        routing_key: myteam
        severity: pageteam
        sloth_severity: page
    - alert: MyServiceHighErrorRate
      annotations:
        summary: High error rate on 'myservice' requests responses
        title: (ticket) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error
          budget burn rate is too fast.
      expr: |
        (
            (slo:sli_error:ratio_rate2h{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (3 * 0.0009999999999999432))
            and ignoring (sloth_window)
            (slo:sli_error:ratio_rate1d{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (3 * 0.0009999999999999432))
        )
        or ignoring (sloth_window)
        (
            (slo:sli_error:ratio_rate6h{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (1 * 0.0009999999999999432))
            and ignoring (sloth_window)
            (slo:sli_error:ratio_rate3d{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (1 * 0.0009999999999999432))
        )
      labels:
        category: availability
        severity: slack
        slack_channel: '#alerts-myteam'
        sloth_severity: ticket
```
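
One thing I noticed while writing this up (just a guess, maybe someone can confirm): the Prometheus spec above selects rule objects with `ruleSelector.matchLabels` of `app: kube-prometheus-stack` and `release: kube-prometheus-stack`, but the generated PrometheusRule only carries `app.kubernetes.io/component` and `app.kubernetes.io/managed-by` in its metadata labels (my SLO spec's `labels` end up on the recording rules themselves, not on the PrometheusRule object). If that's the problem, a minimal sketch of the metadata with the selector's labels added would be:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sloth-slo-grafana
  namespace: kube-prometheus-stack
  labels:
    app.kubernetes.io/component: SLO
    app.kubernetes.io/managed-by: sloth
    # Assumed fix: labels matching the Prometheus ruleSelector above,
    # so prometheus-operator picks this object up.
    app: kube-prometheus-stack
    release: kube-prometheus-stack
```

Is that the expected way to make the operator pick up Sloth's output, or am I missing something?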