slok / sloth

🦥 Easy and simple Prometheus SLO (service level objectives) generator

Home Page: https://sloth.dev

License: Apache License 2.0

Makefile 1.04% Go 95.65% Dockerfile 0.82% Shell 2.16% Smarty 0.32%
slo service-level-objective service-level sli sla prometheus monitoring oncall observability alerts

sloth's Introduction

Sloth

Introduction

Meet the easiest way to generate SLOs for Prometheus.

Sloth generates understandable, uniform, and reliable Prometheus SLOs for any kind of service, using a simple SLO spec that results in multiple metrics and multi-window, multi-burn-rate alerts.

https://sloth.dev

Features

  • Simple, maintainable and understandable SLO spec.
  • Reliable SLO metrics and alerts.
  • Based on the Google SLO implementation and the multi-window, multi-burn-rate alerting framework.
  • Autogenerates Prometheus SLI recording rules in different time windows.
  • Autogenerates Prometheus SLO metadata rules.
  • Autogenerates Prometheus SLO multi-window, multi-burn-rate alert rules (page and warning).
  • SLO spec validation (including validate command for Gitops and CI).
  • Customization of labels, disabling different types of alerts, etc.
  • A single way (uniform) of creating SLOs across all different services and teams.
  • Automatic Grafana dashboard to see the state of all your SLOs.
  • Single binary and easy to use CLI.
  • Kubernetes (Prometheus-operator) support.
  • Kubernetes Controller/operator mode with CRDs.
  • Support for different SLI types.
  • Support for SLI plugins.
  • A library with common SLI plugins.
  • OpenSLO support.
  • Safe SLO period windows for 30 and 28 days by default.
  • Customizable SLO period windows for advanced use cases.

Small Sloth SLO dashboard

Getting started

Release the Sloth!

sloth generate -i ./examples/getting-started.yml

where ./examples/getting-started.yml contains:
version: "prometheus/v1"
service: "myservice"
labels:
  owner: "myteam"
  repo: "myorg/myservice"
  tier: "2"
slos:
  # We allow failing (5xx and 429) 1 request every 1000 requests (99.9%).
  - name: "requests-availability"
    objective: 99.9
    description: "Common SLO based on availability for HTTP request responses."
    labels:
      category: availability
    sli:
      events:
        error_query: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
    alerting:
      name: "MyServiceHighErrorRate"
      labels:
        category: "availability"
      annotations:
        # Overwrite default Sloth SLO alert summary on ticket and page alerts.
        summary: "High error rate on 'myservice' requests responses"
      page_alert:
        labels:
          severity: "pageteam"
          routing_key: "myteam"
      ticket_alert:
        labels:
          severity: "slack"
          slack_channel: "#alerts-myteam"

This would be the result you would obtain from the above spec example; an abridged sketch of the generated rules is shown below.
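For orientation, here is an abridged sketch of what the generated rules look like, following the same shape as the Kubernetes-generated output shown further down this page; exact windows, labels, and values depend on the Sloth version and its defaults.

groups:
  - name: sloth-slo-sli-recordings-myservice-requests-availability
    rules:
      # SLI error ratio over a 5m window; Sloth emits one of these per time window.
      - record: slo:sli_error:ratio_rate5m
        expr: |
          (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[5m])))
          /
          (sum(rate(http_request_duration_seconds_count{job="myservice"}[5m])))
        labels:
          owner: myteam
          sloth_id: myservice-requests-availability
          sloth_service: myservice
          sloth_slo: requests-availability
          sloth_window: 5m
      # ...plus equivalent rules for the longer windows (30m, 1h, 2h, 6h, 1d, 3d, 30d),
      # the SLO metadata recording rules, and the page/ticket multiwindow-multiburn alerts.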

Documentation

Check the docs to know more about the usage, examples, and other handy features!

SLI plugins

Looking for common SLI plugins? Check this repository; if you are looking for the SLI plugin docs, check this instead.

Development and Contributing

Check CONTRIBUTING.md.

sloth's People

Contributors

chrisfraun, commixon, dependabot[bot], jclegras, jesusvazquez, limess, martinkubrak, matiasmct, muhammetozekli, nlamirault, paulfantom, skburgart, slok, taivo123, typositoire, xairos


sloth's Issues

Helm chart is non-functional without CRDs

The sloth-0.4.1 chart gets CrashLoopBackOff-ed pod(s) with the following log:

error: "kubernetes-controller" command failed: check for PrometheusServiceLevel CRD failed: could not list: the server could not find the requested resource (get prometheusservicelevels.sloth.slok.dev)%

It seems the chart archive needs a crds/ folder with the CRD manifest(s).

Sloth Created metrics don't account namespace creating PrometheusRule errors

When creating two different PrometheusServiceLevel definitions in different namespaces, the metrics created by Sloth don't account for the namespace, causing PrometheusRule evaluation errors.

Example

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  labels:
  name: api-slos
  namespace: team_a
spec:
  service: api
  slos:
  - alerting:
      name: SLOApiErrorRate
      pageAlert:
        disable: false
        labels:
          severity: critical
      ticketAlert:
        disable: false
        labels:
          severity: warning
    description: 99.999% of requests must be served with <500 status code.
    labels:
      team: team_a
    name: 99999_http_request_lt_500
    objective: 99.999
    sli:
      events:
        errorQuery: sum(increase(istio_requests_total{destination_workload_namespace="team_a",
          reporter="destination", destination_workload="api",
          response_code=~"5.*"}[{{ .window }}]))
        totalQuery: sum(increase(istio_requests_total{destination_workload_namespace="team_a",
          reporter="destination", destination_workload="api"}[{{
          .window }}]))
---
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  labels:
  name: api-slos
  namespace: team_b
spec:
  service: api
  slos:
  - alerting:
      name: SLOApiErrorRate
      pageAlert:
        disable: false
        labels:
          severity: critical
      ticketAlert:
        disable: false
        labels:
          severity: warning
    description: 99.999% of requests must be served with <500 status code.
    labels:
      team: team_a
    name: 99999_http_request_lt_500
    objective: 99.999
    sli:
      events:
        errorQuery: sum(increase(istio_requests_total{destination_workload_namespace="team_b",
          reporter="destination", destination_workload="api",
          response_code=~"5.*"}[{{ .window }}]))
        totalQuery: sum(increase(istio_requests_total{destination_workload_namespace="team_b",
          reporter="destination", destination_workload="api"}[{{
          .window }}]))

When Sloth creates the PrometheusRules for these services, both add metrics with the same labels, causing PrometheusRule evaluation errors.

Currently we're working around this by appending the namespace to the service in the PrometheusServiceLevel; however, it seems a better option would be to have Sloth generate metrics that include the namespace as well.
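For illustration, a minimal sketch of the workaround described above: the service name is renamed per namespace so the generated sloth_service/sloth_id labels no longer collide (the extra namespace label is just an optional convenience, not a Sloth behavior change).

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-slos
  namespace: team_a
spec:
  # Workaround: encode the namespace into the service name so sloth_service
  # and sloth_id differ between team_a and team_b.
  service: api-team-a
  labels:
    # Optional: also carry the namespace as a plain label on the generated rules.
    namespace: team_a
  slos:
    - name: 99999_http_request_lt_500
      # ...rest of the SLO as above...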

Make helm chart public

We would like to use helm repo add to run sloth

e.g. something like helm repo add sloth https://slok.github.io/helm-charts

In order to support it, the repository structure needs to be changed to fit Helm's expectations.

variable grouping

Is it possible to make aggregations by (var), for example for server availability, with hostname as the grouping key?

So we can calculate the SLO per server? I have updated the queries; the boards don't cope with it well, but I'm curious whether this can work conceptually. I can update the boards to handle grouping vars.
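For illustration only (this assumes the source metric carries a hostname label and that the dashboards are adapted to the extra dimension), the grouped SLI queries might look like:

sli:
  events:
    # Group both numerator and denominator by hostname so each server gets its own SLI series.
    error_query: sum by (hostname) (rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
    total_query: sum by (hostname) (rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))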

no matches for kind "PrometheusServiceLevel" in version "sloth.slok.dev/v1"

Hi,
I tried to deploy Sloth on my Kubernetes cluster with the deployment below, but then got the following error:

INFO[0000] SLI plugins loaded                            plugins=0 svc=storage.FileSLIPlugin version=v0.8.0 window=30d
INFO[0000] Loading Kubernetes configuration...           version=v0.8.0
error: "kubernetes-controller" command failed: check for PrometheusServiceLevel CRD failed: could not list: the server could not find the requested resource (get prometheusservicelevels.sloth.slok.dev)

Then I tried to deploy a PrometheusServiceLevel (see below) and got this:
error: unable to recognize "sloth-k8s-getting-started.yml": no matches for kind "PrometheusServiceLevel" in version "sloth.slok.dev/v1"

What am I doing wrong?

deployment:
https://github.com/slok/sloth/blob/main/deploy/kubernetes/raw/sloth.yaml

PrometheusServiceLevel deployment:
https://github.com/slok/sloth/blob/main/examples/k8s-multifile.yml

Windows Binary Operation of Sloth

Hi @slok,

I really like Sloth, especially the use of OpenSLO and other providers.

I am having a few challenges making it work on Windows, and I am hoping it's just missing documentation.

A) I assume this error is with the SLI plugins rather than the actual spec. It's difficult to understand the error, and I had to lint the sample YAML a few times.

[screenshot]

B) I then followed the instructions in the SLI plugins repo https://github.com/slok/sloth-common-sli-plugins

[screenshot]

C) I suspect that I need a Go runtime on the machine to be able to use the plugins. Is this assumption correct?

D) Is there any way of bundling the SLI plugins into the binary?

Once again, really great tool, and I love the syntax.

Quiet services and NaN

I'm wondering what can be done about the scenario where the service has periods of receiving no (or few) requests so NaN values start to creep in.

This is because, for example

(sum(rate(http_server_requests_seconds_count{deployment="sloexample"}[1h])))

evaluates to 0 so you get NaN when you divide by it to get error ratios.

Normally this is fine - until you come to error budgets. My remaining error budget is not "undefined" or NaN just because my service went quiet for a period of time.
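One possible mitigation, sketched here as an assumption rather than a Sloth feature: keep the denominator from ever reaching zero with PromQL's clamp_min, so an idle service evaluates to an error ratio of 0 instead of NaN (the status filter on the error query is illustrative).

sli:
  events:
    error_query: sum(rate(http_server_requests_seconds_count{deployment="sloexample",status=~"5.."}[{{.window}}]))
    # clamp_min keeps the denominator at >= 1, so 0 errors / quiet traffic yields ~0 instead of NaN.
    total_query: clamp_min(sum(rate(http_server_requests_seconds_count{deployment="sloexample"}[{{.window}}])), 1)

Note that this deliberately biases the ratio downward when traffic is very low, so whether it is acceptable depends on how quiet periods should count against the error budget.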

Move from dockerhub to ghcr

Would be better to have the images on Github container registry:

  • Avoid docker hub public images rate limits.
  • Better insight into Sloth usage (pulls per version, arch...).
  • Better auth control.
  • Faster within GitHub Actions usage (people use Sloth in GH Actions CI workflows for SLO validation).
  • Closer to the source code, which enables better discovery.

Tasks

  • Migrate all dockerhub images to ghcr (skopeo sync --src docker --dest docker --all slok/sloth ghcr.io/slok).
  • Set gh actions to upload to ghcr: #199
  • Upgrade Helm charts, kustomize and raw manifests: #200
  • Update documentation pointing to ghcr: slok/sloth-website#2
  • Add a note on the readme: #201

Allow 'success_query' on SLO

I'm trying to create a latency-based SLO where I want to define an objective of '80% of requests return within 1000ms' (for example).

The exporter uses a bucket of 1,000ms, which allows me to use le="1" in the query, but I can't work out a clean way to define the equivalent of:

...
spec:
  - name: "service-latency"
    objective: 80
    sli:
      events:
        success_query: sum(rate(service_duration_seconds_bucket{le="1"}[{{.window}}]))
        total_query: sum(rate(service_duration_seconds_bucket{le="+Inf"}[{{.window}}]))

As I can't use success_query, I have to do something odd like this, where I take the total over all requests and subtract the group faster than my goal to obtain the 'out of range' values.

        error_query: |
          sum(rate(service_duration_seconds_bucket{le="+Inf"}[{{.window}}]))
          -
          sum(rate(service_duration_seconds_bucket{le="1"}[{{.window}}]))
        total_query: sum(rate(service_duration_seconds_bucket{le="+Inf"}[{{.window}}]))

Prometheus rule created but not loaded on prometheus operator

Hello,
I am not sure if this is a bug or if I am missing something in the setup process.

I am trying to set up Sloth on my cluster based on this approach: https://github.com/slok/sloth#kubernetes-controller-prometheus-operator.
I successfully set up the Sloth controller, which runs without any errors.
I added a sample SLO for a running service. The respective PrometheusRule was created successfully, but the rule doesn't get picked up by the Prometheus operator. Should I do it manually? If so, why not follow the CLI approach instead?

Thanks in advance for any answer.

Remove sloth_window from alert expression

Below is a graph of the alert expression for an ongoing alert we had:

[screenshot]

It appears that the sloth_window label is part of the generated alert, and that the label changes value depending on whether the quick or slow alerting conditions are triggered. This can cause a lot of noise with resolving and re-firing alerts. In the example above, essentially the same alert fired about 6 times over an hour.

I would propose to change the alerting expression a bit to actively drop the sloth_window label, to avoid excessive firings of the same alert. A proposal for a new alerting expression would be something like

(
  max(
    slo:sli_error:ratio_rate5m{sloth_id="a",sloth_service="b",sloth_slo="c"} > (13.44 * 0.01)
  ) without (sloth_window) and
  max(
    slo:sli_error:ratio_rate1h{sloth_id="a",sloth_service="b",sloth_slo="c"} > (13.44 * 0.01)
  ) without (sloth_window)
) or
(
  max(
     slo:sli_error:ratio_rate30m{sloth_id="a",sloth_service="b",sloth_slo="c"} > (3.5 * 0.01)
  ) without (sloth_window) and
  max(
     slo:sli_error:ratio_rate6h{sloth_id="a",sloth_service="b",sloth_slo="c"} > (3.5 * 0.01)
  ) without (sloth_window)
)

In the expression, max should never aggregate multiple metrics. But even if it did it would take the highest error ratio, which seems like a good pessimistic approach.

I would be happy to try to contribute a PR if you think the change makes sense.

how to use it on windows dev box

Hi,
I'm new to Prometheus and Grafana. I'd like to know what steps are required to use this tool. I have read the documentation, but I still have no clue how to use it.
A little background on my application: I'm using an AWS-hosted Prometheus server, and Grafana is currently running on my local Windows dev box.
K8s is running on AWS EKS.

Thanks,
Heng

Support for different SLO time windows

👋 Hi there!

First and foremost, thanks for open sourcing this; it is cool stuff that I might end up using at work.

Do you have any plans for adding support for time windows other than 30 days?

I was taking a look at the code and I see it is hardcoded in https://github.com/slok/sloth/blob/main/internal/prometheus/spec.go#L63

I'm not sure if this is just a matter of adding support for this in the API spec or if there's more to it than that.

Thanos Ruler and duplicate series

Thanks for open sourcing such a valuable tool, it's been really easy to extend and get up and running!

I'm running into an issue using sloth with prometheus and thanos. Prometheus is installed and managed by the prometheus-operator and my thanos ruler is created using the ThanosRuler CRD. The retention time for my prometheus metrics on the prometheus instance is intentionally low at 6h since the thanos sidecar exports to s3 periodically.

I suspect this was causing my 30d burn metrics to be skewed since prometheus cannot query itself for 30d in the past due to the 6h retention. I have tried installing the SLO as a PrometheusRule that is instead evaluated by Thanos Ruler since it has access to the other thanos components to query for 30+ days in the past.

I'm now getting the following error when evaluating the slo:current_burn_rate:ratio metric:

Error executing query: found duplicate series for the match group {sloth_id="sloth-sample-http-availability-sloth-sample", sloth_service="sloth-sample", sloth_slo="http-availability-sloth-sample"} on the right hand-side of the operation: [{__name__="slo:error_budget:ratio", sloth_id="sloth-sample-http-availability-sloth-sample", sloth_service="sloth-sample", sloth_slo="http-availability-sloth-sample"}, {__name__="slo:error_budget:ratio", prometheus="prometheus/prometheus-operator-prometheus", sloth_id="sloth-sample-http-availability-sloth-sample", sloth_service="sloth-sample", sloth_slo="http-availability-sloth-sample"}];many-to-many matching not allowed: matching labels must be unique on one side

Running one of the metric queries directly I see two sets of data returned, which explains the error:

slo:sli_error:ratio_rate5m{sloth_id="sloth-sample-http-availability-sloth-sample",sloth_service="sloth-sample",sloth_slo="http-availability-sloth-sample"}
slo:sli_error:ratio_rate5m{prometheus="prometheus/prometheus-operator-prometheus", sloth_id="sloth-sample-http-availability-sloth-sample", sloth_service="sloth-sample", sloth_slo="http-availability-sloth-sample", sloth_window="5m"}
slo:sli_error:ratio_rate5m{sloth_id="sloth-sample-http-availability-sloth-sample", sloth_service="sloth-sample", sloth_slo="http-availability-sloth-sample", sloth_window="5m"}

Sure enough if I add an additional label filter to the slo:current_burn_rate:ratio metric, prometheus!="prometheus/prometheus-operator-prometheus", it evaluates just fine.

This is due to Thanos Ruler and Prometheus both loading and evaluating the PrometheusRule, which I wasn't expecting. Is there any way to add a label filter to some of the queries or a better way to run sloth with thanos?

multi file rules not loaded completely

Hey, this is more related to Prometheus than to Sloth itself, but since I'm using it (thank you) and it's a feature of Sloth, I decided to go ahead.

I'm using the multiple-specs-in-a-single-file feature, which works as intended, and I get a file with multiple YAML documents (rooted at groups:) within a single file. However, when using promtool or running Prometheus with such a file, only the first group is loaded/detected.

I was a bit surprised, tbh; is this expected?

support for victoriaMetrics

Hi @slok, is there any plan to support VictoriaMetrics? The architecture is almost similar (happy to contribute 😄).

I tried out Sloth with Prometheus and it seemingly worked; for our prod setup we're using VictoriaMetrics, and at the moment Sloth only supports Prometheus.

Allow to add custom labels to a SLO

Hi, thanks for creating this awesome repo. I'm wondering: is there a way to add a custom label to a specific SLO?

[screenshot]

Or can you please make Sloth generate the 'description'?
[screenshot]

I need to show the details of an SLO on the dashboard. For example, for the LatencyP90 SLO, I want to show that we're setting the latency threshold at '4s', and I want to show this '4s' on the dashboard rather than just the objective of 90%.

I love your work @slok
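For reference, per-SLO labels are already part of the spec shown in the getting-started example near the top of this page; a minimal excerpt, with an extra illustrative label (latency_target is a made-up name) that surfaces the threshold on dashboards:

slos:
  - name: "latency-p90"
    objective: 90
    labels:
      category: latency
      # Illustrative label carrying the threshold, so dashboards can display "4s"
      # alongside the 90% objective.
      latency_target: "4s"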

error: "generate" command failed: invalid spec, could not load with any of the supported spec types

I get the error in the issue title when running the generate command for an OpenSLO formatted file. I'm using the Windows executable for v0.7.0.

I'm using this example file as input.

I run this command (from here):

./sloth-windows-amd64.exe generate -i ./examples/openslo-getting-started.yml -o /tmp/openslo-getting-started.yml

and get this error:

INFO[0000] SLI plugins loaded                            plugins=0 svc=storage.FileSLIPlugin version=v0.7.0
error: "generate" command failed: invalid spec, could not load with any of the supported spec types

So it looks like I'm missing a plugin, but I don't see any OpenSLO plugins in the plugin repo. Do we need a plugin for OpenSLO? Also, the function of the plugins isn't clear to me - "can be referenced as an SLI on the SLO specs and will return a raw SLI type" - what does this mean exactly?

helm linting error on variables

Hi,
I use a k8s template to generate the PrometheusRules; unfortunately, the generated YAML file fails at linting:

[ERROR] templates/: parse error at (monitoring/templates/prometheusSlothRules.yaml:183): undefined variable "$labels"

with the line:

title: (page) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget

because Helm is treating the {{ }} as Helm variables. Should we escape this one? Or is there a way to work around it?
Thanks!
Best,
Hieu
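One common workaround, sketched here for a generic Helm chart rather than anything Sloth-specific, is to wrap the Prometheus template in a Go-template string literal so Helm renders it verbatim instead of trying to resolve $labels:

# In the Helm template, emit the {{ }} markers literally via a string literal action:
title: '{{ "(page) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget" }}'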

Query SLO monthly

Hi Xabier,

This is not an issue.

I saw that the Sloth dashboard has a panel showing the remaining error budget for the current month.
[screenshot]

Is there a way to query this metric in a monthly manner? Something like: assume the current month is February, and I set Grafana's From and To back to 30 Nov and 30 Jan; I'd then get a dashboard of the remaining error budget reset to 100% at the beginning of each month.

My actual goal is: I need to know my service's SLO compliance throughout the months, to see in which months it used all the error budget.

Loki rules fail validation

The Loki Ruler supports PrometheusRule resources, but if I try to pass a Loki-format rule to Sloth, it fails with the error:

INFO[0000] Generating from Kubernetes Prometheus spec    version=sloth-helm-chart-0.4.0-13-g6341f64 window=30d
error: "generate" command failed: could not generate Kubernetes format rules: could not generate prometheus rules: invalid SLO group: Key: 'SLOGroup.SLOs[0].SLI.Events.ErrorQuery' Error:Field validation for 'ErrorQuery' failed on the 'prom_expr' tag
Key: 'SLOGroup.SLOs[0].SLI.Events.TotalQuery' Error:Field validation for 'TotalQuery' failed on the 'prom_expr' tag%  

The relevant section of an example that fails:

      sli:
        events:
          errorQuery: |
            sum(rate(
              {job="nginx-ingress"}
                | json
                | status >= 500
                | __error__ != "JSONParserErr"
            [{{.window}}]))
          totalQuery: |
            sum(rate(
              {job="nginx-ingress"}
                | json
                | __error__ != "JSONParserErr"
            [{{.window}}]))

Would it be possible to make the validator pass these rules, and/or add an option to disable the validation step?

Improving Sloth SLOs dashboard

Hi Xabier,
first of all, many thanks for the Sloth SLOs sample dashboard (https://grafana.com/grafana/dashboards/14348)! We have been using it for a while. :-)

I noticed that the color coding and ranges for Remaining error budget (month) are not correct.
It starts in red if there are no values, it is yellow if there are no errors, and it is green if the budget is below 40%.
Furthermore, I suppose negative values should be cut off, since empty is empty.

My suggestions:

          "description": "This month remaining error budget, starts the 1st of the month and ends  28th-31st (not rolling window)",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "mappings": [],
              "max": 1,
              "min": 0,
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "grey",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 0
                  },
                  {
                    "color": "orange",
                    "value": 0.01
                  },
                  {
                    "color": "light-yellow",
                    "value": 0.2
                  },
                  {
                    "color": "green",
                    "value": 0.8
                  }
                ]
              },
              "unit": "percentunit"
            },
            "overrides": []
          },

and likewise for "A rolling window of the total period (30d) error budget remaining".

Furthermore, cutting off negative budget (this also occurs twice):

        "expr": "1-clamp_max(sum_over_time( ... ) , 1)",

Thanks and regards
Wolfgang

Make OptimizedSLIRecordGenerator optional

In https://github.com/slok/sloth/blob/main/internal/prometheus/recording_rules.go#L53 the SLI for the SLO period is always calculated using an optimized calculation method. When there isn't uniform load on a service, the result of the optimized method can differ quite a lot from a calculation using errors divided by total. An example can be seen below, where traffic (green) is very uneven, and the optimized calculation (yellow) underestimates the error rate many times over compared to the regular (blue) calculation.

[screenshot]

This comes from the optimized calculation assuming that each 5m slice is equally important for the overall SLO. You can even see that while the blue line stays static in periods with no traffic, the yellow error rate slowly decreases.

Granted, there isn't any broad consensus on how SLOs should be calculated; it is a topic with passionate debate. One example discussing the fairness of using ratio-based SLOs can be found in https://grafana.com/blog/2019/11/27/kubecon-recap-how-to-include-latency-in-slo-based-alerting/

“ISPs love to tell you they have 99.9% uptime. What does that mean? This implies it’s time-based, but all night I’m asleep, I am not using my internet connection. And even if there’s a downtime, I won’t notice and my ISP will tell me they were up all the time. Not using the service is like free uptime for the ISP. Then I have a 10-minute, super important video conference, and my internet connection goes down. That’s for me like a full outage. And they say, yeah, 10 minutes per month, that’s three nines, it’s fine.”

A better alternative: a request-based SLA that says, “During each month we’ll serve 99% of requests successfully.”

Would there be any interest in making the OptimizedSLIRecordGenerator optional? With some input on how a user could control this, I'd be happy to try to create a pull request.
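To make the difference concrete, here is a sketch of the two 30d calculations: the first is the optimized form the generated rules on this page already use, and the second is the plain errors-divided-by-total alternative the issue argues for (metric names follow the getting-started example).

# Optimized: average of the 5m error ratios over 30d; every 5m slice weighs the same.
- record: slo:sli_error:ratio_rate30d
  expr: |
    sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability"}[30d])
    / ignoring (sloth_window)
    count_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability"}[30d])
  labels:
    sloth_window: 30d

# Plain alternative: total errors divided by total requests over 30d; request-weighted.
- record: slo:sli_error:ratio_rate30d
  expr: |
    sum(increase(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[30d]))
    /
    sum(increase(http_request_duration_seconds_count{job="myservice"}[30d]))
  labels:
    sloth_window: 30d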

New recording rule is created when adding a label to the existing PrometheusServiceLevel

Issue

Sloth generates a new recording rule when adding a label to the existing PrometheusServiceLevel spec.

Repro

  • Create PrometheusServiceLevel and deploy
  • Verify recording rule is created.
  • Add a new label to PrometheusServiceLevel spec and deploy.

Expected

  • Existing recording rule is amended.

Actual

  • A new recording rule is created with the new label.

[screenshot]

OpenSLO

Great tool, thanks! Can you share your thoughts around supporting OpenSLO please?

Improve UX for SLO's with more than 2 9's

Thanks for the awesome code! I had to do the equivalent of this a couple years ago with nothing but sed!

If I have a service with two objectives of 99.9 and another objective of 99.99, I get numbers like these:

$ sloth generate -i slo.yml 2>/dev/null | grep -Eo '0\.[0-9]{4}[0-9]+'
0.9990000000000001
0.9990000000000001
0.0009999999999999432
0.0009999999999999432
0.0009999999999999432
0.0009999999999999432
0.0009999999999999432
0.0009999999999999432
0.0009999999999999432
0.0009999999999999432
0.9990000000000001
0.9990000000000001
0.0009999999999999432
0.0009999999999999432
0.0009999999999999432
0.0009999999999999432
0.0009999999999999432
0.0009999999999999432
0.0009999999999999432
0.0009999999999999432
0.9998999999999999
0.9998999999999999
0.00010000000000005117
0.00010000000000005117
0.00010000000000005117
0.00010000000000005117
0.00010000000000005117
0.00010000000000005117
0.00010000000000005117
0.00010000000000005117

... where what I probably want is

0.9990000000000001 -> 0.999
0.0009999999999999432 -> 0.001
0.9998999999999999 -> 0.9999
0.00010000000000005117 -> 0.0001

I think the implementation might be something like "don't use floats".

Related to this issue: the grafana board reports a 99.99 objective as 100%, which is dismaying to people who only see the dashboard and not the actual values in the promql queries.

PrometheusRule didn't work

Hi, thank you for making Sloth; it seems like a beautiful tool for setting up SLOs! I hope people contribute to the project.

I'm trying to do a little POC, but I can't get it to work.

Currently I have kube-prometheus-stack configured in my Minikube cluster, and I want to measure SLOs of the Grafana service just as a test.

The grafana_http_request_duration_seconds_count metric appears in Prometheus.

[screenshot]

So I tried to set an SLO for this using the getting-started example, unsuccessfully; I searched about that and found this thread: #125.

I applied these changes but I can't see metrics yet. What am I doing wrong?

Servicemonitor Grafana:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus-stack
    meta.helm.sh/release-namespace: kube-prometheus-stack
  creationTimestamp: "2021-09-18T16:02:38Z"
  generation: 1
  labels:
    app: kube-prometheus-stack-grafana
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 18.0.10
    chart: kube-prometheus-stack-18.0.10
    heritage: Helm
    release: kube-prometheus-stack
  name: kube-prometheus-stack-grafana
  namespace: kube-prometheus-stack
  resourceVersion: "19538"
  uid: 30e38e75-d49c-45ea-a7e4-6df0e1e1b1a8
spec:
  endpoints:
  - path: /metrics
    port: service
  namespaceSelector:
    matchNames:
    - kube-prometheus-stack
  selector:
    matchLabels:
      app.kubernetes.io/instance: kube-prometheus-stack
      app.kubernetes.io/name: grafana

Prometheus definition.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus-stack
    meta.helm.sh/release-namespace: kube-prometheus-stack
  creationTimestamp: "2021-09-18T16:02:37Z"
  generation: 1
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 18.0.10
    chart: kube-prometheus-stack-18.0.10
    heritage: Helm
    release: kube-prometheus-stack
  name: kube-prometheus-stack-prometheus
  namespace: kube-prometheus-stack
  resourceVersion: "19456"
  uid: f496e3ac-11c9-407c-b79c-24c320d9d160
spec:
  alerting:
    alertmanagers: []
  enableAdminAPI: false
  externalUrl: http://kube-prometheus-stack-prometheus.kube-prometheus-stack:9090
  image: quay.io/prometheus/prometheus:v2.28.1
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  paused: false
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      release: kube-prometheus-stack
  portName: web
  probeNamespaceSelector: {}
  probeSelector:
    matchLabels:
      release: kube-prometheus-stack
  replicas: 1
  retention: 10d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      app: kube-prometheus-stack
      release: kube-prometheus-stack
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: kube-prometheus-stack-prometheus
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      release: kube-prometheus-stack
  shards: 1
  version: v2.28.1

My SLO example.

# This example shows the same example as getting-started.yml but using Sloth Kubernetes CRD.
# It will generate the Prometheus rules in a Kubernetes prometheus-operator PrometheusRules CRD.
#
# `sloth generate -i ./examples/k8s-getting-started.yml`
#
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: sloth-slo-grafana
  namespace: kube-prometheus-stack
spec:
  service: "kube-prometheus-stack-grafana"
  labels:
    app: "kube-prometheus-stack"
    release: "kube-prometheus-stack"
    owner: "myteam"
    repo: "myorg/myservice"
    tier: "2"
  slos:
    - name: "requests-availability"
      objective: 99.9
      description: "Common SLO based on availability for HTTP request responses."
      sli:
        events:
          errorQuery: sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[{{.window}}]))
          totalQuery: sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[{{.window}}]))
      alerting:
        name: MyServiceHighErrorRate
        labels:
          category: "availability"
        annotations:
          summary: "High error rate on 'myservice' requests responses"
        pageAlert:
          labels:
            severity: pageteam
            routing_key: myteam
        ticketAlert:
          labels:
            severity: "slack"
            slack_channel: "#alerts-myteam"

It works:

kubectl get slo
NAME                SERVICE                         DESIRED SLOS   READY SLOS   GEN OK   GEN AGE   AGE
sloth-slo-grafana   kube-prometheus-stack-grafana   1              1            true     36s       10m

But when I generate the PrometheusRules using the Sloth binary and apply them, the rules don't appear in Prometheus.

sloth generate -i slo.yaml > slo-rules.yaml
INFO[0000] SLI plugins loaded                            plugins=0 svc=storage.FileSLIPlugin version=v0.6.0-66-g3f0d37f
INFO[0000] Generating from Kubernetes Prometheus spec    version=v0.6.0-66-g3f0d37f
INFO[0000] Multiwindow-multiburn alerts generated        out=- slo=kube-prometheus-stack-grafana-requests-availability svc=generate.prometheus.Service version=v0.6.0-66-g3f0d37f
INFO[0000] SLI recording rules generated                 out=- rules=8 slo=kube-prometheus-stack-grafana-requests-availability svc=generate.prometheus.Service version=v0.6.0-66-g3f0d37f
INFO[0000] Metadata recording rules generated            out=- rules=7 slo=kube-prometheus-stack-grafana-requests-availability svc=generate.prometheus.Service version=v0.6.0-66-g3f0d37f
INFO[0000] SLO alert rules generated                     out=- rules=2 slo=kube-prometheus-stack-grafana-requests-availability svc=generate.prometheus.Service version=v0.6.0-66-g3f0d37f

This is the output of the generated file:

---
# Code generated by Sloth (v0.6.0-66-g3f0d37f): https://github.com/slok/sloth.
# DO NOT EDIT.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    app.kubernetes.io/component: SLO
    app.kubernetes.io/managed-by: sloth
  name: sloth-slo-grafana
  namespace: kube-prometheus-stack
spec:
  groups:
  - name: sloth-slo-sli-recordings-kube-prometheus-stack-grafana-requests-availability
    rules:
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[5m])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[5m])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 5m
        tier: "2"
      record: slo:sli_error:ratio_rate5m
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[30m])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[30m])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 30m
        tier: "2"
      record: slo:sli_error:ratio_rate30m
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[1h])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[1h])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 1h
        tier: "2"
      record: slo:sli_error:ratio_rate1h
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[2h])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[2h])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 2h
        tier: "2"
      record: slo:sli_error:ratio_rate2h
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[6h])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[6h])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 6h
        tier: "2"
      record: slo:sli_error:ratio_rate6h
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[1d])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[1d])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 1d
        tier: "2"
      record: slo:sli_error:ratio_rate1d
    - expr: |
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana",status_code=~"(5..|4..)"}[3d])))
        /
        (sum(rate(grafana_http_request_duration_seconds_count{job="kube-prometheus-stack-grafana"}[3d])))
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_window: 3d
        tier: "2"
      record: slo:sli_error:ratio_rate3d
    - expr: |
        sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}[30d])
        / ignoring (sloth_window)
        count_over_time(slo:sli_error:ratio_rate5m{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}[30d])
      labels:
        sloth_window: 30d
      record: slo:sli_error:ratio_rate30d
  - name: sloth-slo-meta-recordings-kube-prometheus-stack-grafana-requests-availability
    rules:
    - expr: vector(0.9990000000000001)
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        tier: "2"
      record: slo:objective:ratio
    - expr: vector(1-0.9990000000000001)
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        tier: "2"
      record: slo:error_budget:ratio
    - expr: vector(30)
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        tier: "2"
      record: slo:time_period:days
    - expr: |
        slo:sli_error:ratio_rate5m{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}
        / on(sloth_id, sloth_slo, sloth_service) group_left
        slo:error_budget:ratio{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        tier: "2"
      record: slo:current_burn_rate:ratio
    - expr: |
        slo:sli_error:ratio_rate30d{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}
        / on(sloth_id, sloth_slo, sloth_service) group_left
        slo:error_budget:ratio{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        tier: "2"
      record: slo:period_burn_rate:ratio
    - expr: 1 - slo:period_burn_rate:ratio{sloth_id="kube-prometheus-stack-grafana-requests-availability",
        sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"}
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        tier: "2"
      record: slo:period_error_budget_remaining:ratio
    - expr: vector(1)
      labels:
        app: kube-prometheus-stack
        owner: myteam
        release: kube-prometheus-stack
        repo: myorg/myservice
        sloth_id: kube-prometheus-stack-grafana-requests-availability
        sloth_mode: cli-gen-k8s
        sloth_service: kube-prometheus-stack-grafana
        sloth_slo: requests-availability
        sloth_spec: sloth.slok.dev/v1
        sloth_version: v0.6.0-66-g3f0d37f
        tier: "2"
      record: sloth_slo_info
  - name: sloth-slo-alerts-kube-prometheus-stack-grafana-requests-availability
    rules:
    - alert: MyServiceHighErrorRate
      annotations:
        summary: High error rate on 'myservice' requests responses
        title: (page) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget
          burn rate is too fast.
      expr: |
        (
            (slo:sli_error:ratio_rate5m{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (14.4 * 0.0009999999999999432))
            and ignoring (sloth_window)
            (slo:sli_error:ratio_rate1h{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (14.4 * 0.0009999999999999432))
        )
        or ignoring (sloth_window)
        (
            (slo:sli_error:ratio_rate30m{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (6 * 0.0009999999999999432))
            and ignoring (sloth_window)
            (slo:sli_error:ratio_rate6h{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (6 * 0.0009999999999999432))
        )
      labels:
        category: availability
        routing_key: myteam
        severity: pageteam
        sloth_severity: page
    - alert: MyServiceHighErrorRate
      annotations:
        summary: High error rate on 'myservice' requests responses
        title: (ticket) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error
          budget burn rate is too fast.
      expr: |
        (
            (slo:sli_error:ratio_rate2h{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (3 * 0.0009999999999999432))
            and ignoring (sloth_window)
            (slo:sli_error:ratio_rate1d{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (3 * 0.0009999999999999432))
        )
        or ignoring (sloth_window)
        (
            (slo:sli_error:ratio_rate6h{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (1 * 0.0009999999999999432))
            and ignoring (sloth_window)
            (slo:sli_error:ratio_rate3d{sloth_id="kube-prometheus-stack-grafana-requests-availability", sloth_service="kube-prometheus-stack-grafana", sloth_slo="requests-availability"} > (1 * 0.0009999999999999432))
        )
      labels:
        category: availability
        severity: slack
        slack_channel: '#alerts-myteam'
        sloth_severity: ticket

I hope you can help me, thanks!
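One thing worth double-checking here, offered as an assumption based on the Prometheus definition above rather than a confirmed diagnosis: the prometheus-operator only picks up PrometheusRule objects whose metadata labels match the Prometheus CR's ruleSelector, which in this setup requires app: kube-prometheus-stack and release: kube-prometheus-stack, while the generated rule only carries the app.kubernetes.io/* labels. Adding the selector labels to the generated object's metadata would look roughly like:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sloth-slo-grafana
  namespace: kube-prometheus-stack
  labels:
    app.kubernetes.io/component: SLO
    app.kubernetes.io/managed-by: sloth
    # Labels required by the ruleSelector in the Prometheus definition above:
    app: kube-prometheus-stack
    release: kube-prometheus-stack
spec:
  groups:
    # ...generated rule groups as above...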

Generate multiple rules from multiple files

As part of our CI/CD chain, we want to be able to run sloth against a number of files.

This works for validation, but not for the generation of alerts/rules:

[mmw@localhost adtech-argus]$ sloth validate -i src/sloth/specs/
INFO[0000] SLI plugins loaded                            plugins=0 svc=storage.FileSLIPlugin version=v0.9.0 window=30d
INFO[0000] SLO period windows loaded                     svc=alert.WindowsRepo version=v0.9.0 window=30d windows=2
INFO[0000] Validation succeeded                          slo-specs=2 version=v0.9.0 window=30d


[mmw@localhost adtech-argus]$ sloth generate -i src/sloth/specs/
error: "generate" command failed: could not read SLOs spec file data: read src/sloth/specs/: is a directory

If we can't generate the files in batches, then we'll need to iterate over the files; however, as our plan is to use the Docker container for this, it would add significant overhead to our pipeline.

Sloth as library for generation prometheus rules

I use Sloth as a generator of Prometheus rules from my own specification.

MySpec -> SlothSpec -> PrometheusRule

The generation code lives under internal/, so I can't import it in my project. To generate Prometheus rules I have to generate a YAML Sloth spec and run the binary.

It would be cool if the code from the internal directory moved to pkg, so that Service.Generate became available for import.

sli queries status code description in spec

It's pretty common to exclude some status codes from the queries that SLIs are based on. According to the Google video courses on Coursera, it is recommended to describe why status codes are excluded.

Please provide a description field for them, maybe like this?

slos:
  # We allow failing (5xx and 429) 1 request every 1000 requests (99.9%).
  - name: "requests-availability"
    sli:
      events:
        error_query: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
        error_query_clarifications: |
          - 429 -- excluded because of the written TOS on the API-limiting service
          - 503 -- retried by balancer

Alerting for k8s

Hi all! I have a problem with Sloth today. For some reason I can't disable alerting.

When I put in the following part, I still get the expression generated and an active alert:

  alerting:
    name: http-requests-availability
    page_alert:
      disable: true
    ticket_alert:
      disable: true

When I try removing the name field, I get this error:

error: "generate" command failed: could not generate Kubernetes format rules: could not generate prometheus rules: invalid SLO group: Key: 'SLOGroup.SLOs[0].PageAlertMeta.Name' Error:Field validation for 'Name' failed on the 'required_if_enabled' tag
Key: 'SLOGroup.SLOs[0].TicketAlertMeta.Name' Error:Field validation for 'Name' failed on the 'required_if_enabled' tag%

I'm using apiVersion: sloth.slok.dev/v1 for my chart.
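For reference, the PrometheusServiceLevel (sloth.slok.dev/v1) examples elsewhere on this page use camelCase alerting keys (pageAlert/ticketAlert with disable), unlike the snake_case page_alert/ticket_alert of the raw prometheus/v1 spec; a sketch of the disable block in the CRD shape:

alerting:
  name: http-requests-availability
  pageAlert:
    disable: true
  ticketAlert:
    disable: true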

promtool complains about duplicate rules

Hi,

I created a PrometheusServiceLevel with two SLIs.
Checking the generated PrometheusRule CR with promtool, it complains about a duplicate recording rule.

> promtool check rules example.yaml
Checking example.yaml
1 duplicate rule(s) found.
Metric: slo:sli_error:ratio_rate30d
Label(s):
        sloth_window: 30d
Might cause inconsistency while recording expressions.
  SUCCESS: 34 rules found

Example of one generated recording rule:

    - expr: |
        sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="example-ingress-error-rate", sloth_service="example", sloth_slo="ingress-error-rate"}[30d])
        / ignoring (sloth_window)
        count_over_time(slo:sli_error:ratio_rate5m{sloth_id="example-ingress-error-rate", sloth_service="example", sloth_slo="ingress-error-rate"}[30d])
      labels:
        sloth_window: 30d
        # sloth_slo: ingress-error-rate  # <- adding this fixes promtool's complaint
      record: slo:sli_error:ratio_rate30d

The problem can be bypassed by adding the label sloth_slo: ingress-error-rate to the rule explicitly.
Would you accept a PR for this change?

Multiple services per yaml file

Is it possible to generate multiple services in one .yaml file?
In Kubernetes YAMLs it is common to have multiple definitions in one file, separated by ---. This doesn't seem to work here atm.

Example:

# Source: slothgen/templates/sloth.yaml
version: "prometheus/v1"
service: "service1"
labels:
  owner: "itsme"
slos:
  - name: "myslo1"
    objective: 99.9
    description:" SLO"
    sli:
      events:
        error_query: sum(rate(http_server_requests_seconds{uri=~".*/XXX/.*",application="blablabla", status=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(http_server_requests_seconds{uri=~".*/XXX/.*", namespace="blablabla", application="occonnect-digen-apigateway"}[{{.window}}]))
    alerting:
      name: "alert"
      pageAlert:
        disabled: true
      ticketAlert:
        disabled: true
---
# Source: slothgen/templates/sloth.yaml
version: "prometheus/v1"
service: "myservice2"
labels:
  owner: "itsme"
slos:
  - name: "myslo1"
    objective: 99.9
    description: "SLO"
    sli:
      events:
        error_query: sum(rate(http_server_requests_seconds{uri=~".*/YYY/.*", namespace="blablabla", status=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(http_server_requests_seconds{uri=~".*/YYY/.*", namespace="blablabla"}[{{.window}}]))
    alerting:
      name: "alert"
      pageAlert:
        disabled: true
      ticketAlert:
        disabled: true

This would allow us to generate the Sloth YAMLs with Helm (or similar) and produce large numbers of definitions like that.

No data coming for prometheus recording rule slo:period_error_budget_remaining:ratio

Hi @slok
thanks for building an awesome tool for tracking SLOs. I am doing a POC with this tool and facing an issue with the metric "slo:period_error_budget_remaining:ratio"; it's always returning empty data.

[screenshot]

I assume this is due to slo:period_burn_rate:ratio: it refers to the recording rule slo:sli_error:ratio_rate30d using the filter {sloth_id="monitoring-prometheus-adapter-uptime", sloth_service="monitoring", sloth_slo="prometheus-adapter-uptime"},
but slo:sli_error:ratio_rate30d is defined without any such labels, only sloth_window.

  - record: slo:period_error_budget_remaining:ratio
    expr: 1 - slo:period_burn_rate:ratio{sloth_id="monitoring-prometheus-adapter-uptime",
      sloth_service="monitoring", sloth_slo="prometheus-adapter-uptime"}
    labels:
      category: prometheus-adapter
      owner: obs-platform
      sloth_id: monitoring-prometheus-adapter-uptime
      sloth_service: monitoring
      sloth_slo: prometheus-adapter-uptime
  - record: slo:period_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate30d{sloth_id="monitoring-prometheus-adapter-uptime", sloth_service="monitoring", sloth_slo="prometheus-adapter-uptime"}
      / on(sloth_id, sloth_slo, sloth_service) group_left
      slo:error_budget:ratio{sloth_id="monitoring-prometheus-adapter-uptime", sloth_service="monitoring", sloth_slo="prometheus-adapter-uptime"}
    labels:
      category: prometheus-adapter
      owner: obs-platform
      sloth_id: monitoring-prometheus-adapter-uptime
      sloth_service: monitoring
      sloth_slo: prometheus-adapter-uptime
  - record: slo:sli_error:ratio_rate30d
    expr: |
      sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="monitoring-prometheus-adapter-uptime", sloth_service="monitoring", sloth_slo="prometheus-adapter-uptime"}[30d])
      / ignoring (sloth_window)
      count_over_time(slo:sli_error:ratio_rate5m{sloth_id="monitoring-prometheus-adapter-uptime", sloth_service="monitoring", sloth_slo="prometheus-adapter-uptime"}[30d])
    labels:
      sloth_window: 30d

Here are my full Sloth rule file and the corresponding generated rules file:
slothrule.txt
generated.txt

Can you please check and advise what could be the issue? Thank you!

Consider supporting the bad metrics from OpenSLO

Recently OpenSLO added support for defining bad/total metrics for an SLO. See also this commit in the OpenSLO repo:
OpenSLO/OpenSLO@0225725

I was curious what's involved in adding support for this to Sloth. Do you think this is achievable for someone who doesn't have any Go programming experience (me)?

Deploying rollup rules

We're very interested in using sloth for our services, but we need to be able to split up the recording rules so that we have some part of the rules running on Prometheus instances in multiple service clusters, while the global SLO data is summed up by a Thanos Rule instance.

Is this currently possible? Is there work being done for this? Would this be a feature we could contribute toward?

Spec Format Varies

The spec for an SLO in a Kubernetes resource varies from the spec fed directly to the binary. By unifying these, users could easily test against the binary to ensure that the output works as they expect and then insert it into the custom resource. Would it be possible to do that unification?
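For context, both shapes already appear on this page; a minimal side-by-side sketch of the same SLI in each format:

# Standalone spec (prometheus/v1), snake_case keys:
version: "prometheus/v1"
service: "myservice"
slos:
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))

# Kubernetes CRD (sloth.slok.dev/v1), camelCase keys under spec:
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: myservice-slos
spec:
  service: "myservice"
  slos:
    - name: "requests-availability"
      objective: 99.9
      sli:
        events:
          errorQuery: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
          totalQuery: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))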

Create K8s events on SLO generation errors

We're running multi-tenant clusters, where tenants get their own namespaces (using HNC). Tenants can create SLOs using Sloth, but we don't give them access to the Sloth controller itself - and thus, they cannot see its logs.

It seems beneficial to expose SLO validation/generation errors as Events in the respective namespaces, which would allow users to view errors for their own SLOs.

e.g. to expose the following:

WARN[22070] item requeued due to processing error: could not generate SLOs: invalid SLO group: Key: 'SLOGroup.SLOs[0].SLI.Events.ErrorQuery' Error:Field validation for 'ErrorQuery' failed on the 'template_vars' tag

as an event (I'm unsure if this would require a change to kooper).

I'd be happy to take a crack at contributing this enhancement if it makes sense to you!

Multi-arch images

Would it be possible to publish multi-arch images? I am mostly interested in the ability to specify one image string that underneath would translate to the correct image for AMD64 or ARM64.

For example prometheus-operator is publishing such images with the following script: https://github.com/prometheus-operator/prometheus-operator/blob/master/scripts/push-docker-image.sh. Or even better one can be found in kube-rbac-proxy repo: https://github.com/brancz/kube-rbac-proxy/blob/master/scripts/publish.sh

Deploying CRD using terraform causes the error Forbidden attribute key in "manifest" value

An error occurs when attempting to install the file sloth.slok.dev_prometheusservicelevels.yaml using terraform and the kubernetes provider (kubernetes_manifest resource). Reproduced using hashicorp/kubernetes v2.7.1 and Terraform v1.0.11

 Error: Forbidden attribute key in "manifest" value

    with kubernetes_manifest.sloth_crd,
    on sloth.tf line 16, in resource "kubernetes_manifest" "sloth_crd":
    16: resource "kubernetes_manifest" "sloth_crd" {

   'status' attribute key is not allowed in manifest configuration

example terraform:

resource "kubernetes_manifest" "sloth_crd" {
  provider = kubernetes

  manifest = yamldecode(file("sloth.slok.dev_prometheusservicelevels.yaml"))
}

Using the sloth CRD to install the sloth operator via terraform generates an error due to the unexpected .status element at the root level.

As mentioned in the kubernetes-alpha provider repo here "..the status field seems to be a read-only field from the user perspective. It makes little sense to pass this field to the API server ..in Terraform we use the exact configuration supplied by the user and instead validate before applying and report an error."

The workaround is to remove the .status properties (and sub-elements) from the CRD yaml file.

Support Loki and LogQL alongside Prometheus

Hello!

We're ingesting a significant amount of data via logging, and it would be great to be able to create SLIs and SLOs based on LogQL syntax for Loki as well as PromQL.

It all seems to work apart from the point at which Sloth tries to validate that the LogQL is valid PromQL, at which point it falls apart rapidly!

Is there any way Sloth could try and validate against https://github.com/grafana/loki/blob/main/pkg/logql/syntax/parser.go as well as against the Prom parser?
