
draganm / missing-container-metrics


Prometheus exporter for container metrics cAdvisor won't give you

License: MIT License

Go 95.86% Dockerfile 3.41% Shell 0.73%
kubernetes metrics exposed-metrics containerd

missing-container-metrics's Introduction

Missing Container Metrics - metrics cadvisor won't give you


STATUS: stable, maintained

cadvisor is great, but it is missing a few important metrics that every serious DevOps person wants to know about. This is a secondary process that exports the missing Prometheus metrics:

  • OOM-kill
  • number of container restarts
  • last exit code

This was motivated by hunting down OOM kills in a large Kubernetes cluster. It is possible for a container to keep running even after an OOM kill, for example if only a sub-process was affected. Without this metric, it becomes much more difficult to find the root cause of such issues.

True story: after this was deployed, a recurring OOM kill in Fluentd was quickly discovered on one of the nodes. It turned out that the resource limits were set too low to process the logs on that node. Logs were not being forwarded because the Fluentd worker process kept being OOM-killed and then restarted by the main process. A fix was deployed 10 minutes later.

Supported Container Runtimes

  • Docker
  • Containerd

Kubernetes 1.20 deprecated the Docker container runtime, so support for Containerd was added in version 0.21.0 of missing-container-metrics. Between them, the two runtimes should cover most common use cases (EKS, GKE, K3s, DigitalOcean Kubernetes, ...).

Deployment

Kubernetes

The easiest way to install missing-container-metrics in your Kubernetes cluster is to use our Helm chart.
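
For illustration only, an install could look like the following; the chart repository URL and chart name here are placeholders, not values taken from the project's documentation:

$ helm repo add missing-container-metrics https://example.github.io/missing-container-metrics   # placeholder repository URL
$ helm install missing-container-metrics missing-container-metrics/missing-container-metrics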

Docker

$ docker run -d -p 3001:3001 -v /var/run/docker.sock:/var/run/docker.sock dmilhdef/missing-container-metrics:v0.14.0

Usage

The exporter exposes metrics about Docker/Containerd containers. Every metric carries the labels described in the Labels section below.

Exposed Metrics

Each of these metrics is published with the labels from the next section.

container_restarts (counter)

Number of restarts of the container.

container_ooms (counter)

Number of OOM kills for the container. This covers an OOM kill of any process in the container's cgroup.

container_last_exit_code (gauge)

Last exit code of the container.

Labels

docker_container_id

Full id of the container.

container_short_id

First 6 bytes of the Docker container id.

container_id

Container id represented in the same format as in the metrics of Kubernetes pods, prefixed with docker:// or containerd:// depending on the container runtime. This enables easy joins in Prometheus with the kube_pod_container_info metric.

name

Name of the container.

image_id

Image id represented in the same format as in the metrics of Kubernetes pods. This enables easy joins in Prometheus with the kube_pod_container_info metric.

pod

If the io.kubernetes.pod.name label is set on the container, its value is set as the pod label of the metric.

namespace

If the io.kubernetes.pod.namespace label is set on the container, its value is set as the namespace label of the metric.

Together with pod, this label is useful in the context of Kubernetes deployments to determine the namespace and pod that the container belongs to. It can be seen as a shortcut that avoids joining with the kube_pod_container_info metric to determine those values.
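
As a rough PromQL sketch (assuming kube-state-metrics is running and exposes kube_pod_container_info), such a join on container_id could look like this:

container_ooms
  * on (container_id) group_left (container)
  kube_pod_container_info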

Contributing

Contributions are welcome; send your issues and PRs to this repo.

License

MIT - Copyright Dragan Milic and contributors


missing-container-metrics's Issues

Add toleration to run on all nodes (even tainted ones)

We segment our node groups with taints, and by default missing-container-metrics does not install on any nodes with taints. Adding the toleration

tolerations:
        - operator: Exists

would ensure the DaemonSet is running on all nodes. Alternatively, allowing tolerations to be set in the Helm chart would give deployers the ability to manage this.
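
For illustration, a hedged sketch of where such a toleration sits in a DaemonSet's pod spec (the surrounding structure is standard Kubernetes, not taken from this project's chart):

  spec:
    template:
      spec:
        tolerations:
          - operator: Exists   # matches every taint, so the pod is scheduled on all nodes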

container xxx not found in namespace k8s.io

Hi, I'm seeing a lot of these logs in the output of the missing-container-metrics pods. They do expose some Go metrics, but not container_restarts, container_ooms or container_last_exit_code. Am I doing something wrong?
Thanks in advance!

{"level":"warn","ts":1638978244.646691,"caller":"containerd/event_handler.go:108","msg":"while getting container info","version":"v0.21.0","error":"container \"f8e217dcebd318d4f6f37c8601d80bddc1cb40edb7a3262bff2df729a7a892e0\" in namespace \"k8s.io\": not found","errorVerbose":"not found
github.com/containerd/containerd/errdefs.init
    /go/pkg/mod/github.com/containerd/[email protected]/errdefs/errors.go:45
    runtime.doInit
    /usr/local/go/src/runtime/proc.go:6265
    runtime.doInit
    /usr/local/go/src/runtime/proc.go:6242
    runtime.doInit
    /usr/local/go/src/runtime/proc.go:6242
    runtime.doInit
    /usr/local/go/src/runtime/proc.go:6242
    runtime.doInit
    /usr/local/go/src/runtime/proc.go:6242
    runtime.main
    /usr/local/go/src/runtime/proc.go:208
    runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1371
    container \"f8e217dcebd318d4f6f37c8601d80bddc1cb40edb7a3262bff2df729a7a892e0\" in namespace \"k8s.io\"
    github.com/containerd/containerd/errdefs.FromGRPC
    /go/pkg/mod/github.com/containerd/[email protected]/errdefs/grpc.go:107
    github.com/containerd/containerd.(*remoteContainers).Get
    /go/pkg/mod/github.com/containerd/[email protected]/containerstore.go:50
    github.com/draganm/missing-container-metrics/containerd.HandleContainerd.func1
    /missing-container-metrics/containerd/handle_containerd.go:29
    github.com/draganm/missing-container-metrics/containerd.(*eventHandler).getOrCreateContainer
    /missing-container-metrics/containerd/event_handler.go:106
    github.com/draganm/missing-container-metrics/containerd.(*eventHandler).handle
    /missing-container-metrics/containerd/event_handler.go:156
    github.com/draganm/missing-container-metrics/containerd.HandleContainerd
    /missing-container-metrics/containerd/handle_containerd.go:62
    main.main.func1.2
    /missing-container-metrics/main.go:64
    runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1371","container_id":"f8e217dcebd318d4f6f37c8601d80bddc1cb40edb7a3262bff2df729a7a892e0"}

Would like to add a ServiceMonitor template option to the Helm chart

Hello Draganm,

Thanks a lot for the project.
I didn't find the GitHub source of your Helm chart.

Is it possible to add an optional serviceMonitor section that would allow generating:

  • service
  • servicemonitor

Example:

  serviceMonitor:
    enabled: true
    port: 3001
    scrapeInterval: 15s

Prometheus operator support in the Helm Chart

I am a big fan of Prometheus Operator. It would be great if it was possible to add a ServiceMonitor resource to the Helm Chart to configure scraping of this service as well as maybe a PrometheusRule resource so alerts could be configured all in the same deployment.

container_ooms time series are exported indefinitely even when the container doesn't exist anymore

appVersion: 0.21.0

Let's say we have a pod "pod1" with a main container in it. There is one occurrence of an OOM kill in the main container of pod1, and after the pod is restarted it runs successfully.

We get these container_ooms time series exported:
container_ooms{pod="pod1", container_id="old_container_id_oom_killed"} 1
container_ooms{pod="pod1", container_id="new_container_id_successfully_running"} 0

The problem is that the container with container_id="old_container_id_oom_killed" doesn't exist anymore, but the time series for this container keeps being exported indefinitely with a value of 1.

When we define an alert that triggers on container_ooms > 0, it fires indefinitely even though the pod was successfully restarted and the main container is already running successfully.

When I delete pod "pod1", all time series for this specific pod are removed after 5 minutes (which is expected).

Expected behavior:
When the pod is running successfully and the container with the container_id in which the OOM kill occurred doesn't exist anymore, the time series for this container_id should be removed after 5 minutes (the exporter should NOT export the metric for this non-existent container anymore).
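
For illustration, the alert described above keeps firing for as long as the stale series with value 1 is exported:

container_ooms > 0

A hedged alternative (the 10m window is arbitrary) fires only when the counter actually increased recently, so it stops once the OOM-killed container is gone:

increase(container_ooms[10m]) > 0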

ARM support

Hello,

Thanks for providing this exporter.
Do you have any plans to support an ARM (Raspberry Pi) build?

Best

ARM Support

Hello,

Would you please help provide multi-arch images? There was an earlier request for this and @cablespaghetti had provided it in a fork. I would rather not use a fork if possible.

OOM Counter Incrementing Incorrectly

Hi!

Thank you for the project!

There seems to be one weird bug. Here is the description:

  1. I've installed MCM on a kube cluster v1.21.2 with the docker runtime
  2. Port-forward one MCM container to check the metrics
    k port-forward monitoring-missingcm-h257l 3001:3001
  3. Connect to some container located on the same node as the MCM pod and trigger an OOM event with the help of the stress command:
stress --vm 1 --vm-bytes 3024M
stress: info: [389] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [389] (415) <-- worker 390 got signal 9
stress: WARN: [389] (417) now reaping child worker processes
stress: FAIL: [389] (451) failed run completed in 2s

Please note that the above command has to be run several times to reproduce the issue.

  4. Check the container_ooms metrics for the above container

Expected result: the container_ooms counter should have exactly the same value as the number of times the stress command was executed.

Actual result: container_ooms is greater than the number of times the stress command was executed. I got the value 13 even though I ran the command only 3 times.

Additional info:
I've checked docker events on the node while reproducing the issue; the number of OOM events matches the number of stress runs.
I also checked /var/log/messages on the node; as expected, the number of OOM log entries matches the number of stress runs.

Any idea what could be wrong here?

Ideal time for prometheus scrape_interval

There are a lot of metrics generated. Do you know what interval should be set to avoid Prometheus crashing while pulling this many metrics? Thanks
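
For reference, the scrape interval is set per job in the Prometheus configuration; a minimal sketch follows, in which the job name, the 30s interval and the target are illustrative placeholders rather than recommendations from this project:

scrape_configs:
  - job_name: missing-container-metrics    # illustrative job name
    scrape_interval: 30s                   # illustrative value, tune to cluster size
    static_configs:
      - targets: ['node-1:3001']           # placeholder; one target per node running the exporter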

Does anything change with cgroupv2?

Hello,

With many major Linux distributions now providing cgroup v2, I wanted to ask whether there are any changes you might have to make to missing-container-metrics to get it working with cgroup v2?

Use without Prometheus

Would it be possible to use the metrics without Prometheus?
For example, exposing them on an API, or plugging them into kube-state-metrics?
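
For what it's worth, the exporter already serves plain-text metrics over HTTP on port 3001 (see the docker run example above), so they can be read without a Prometheus server, for example:

$ curl -s localhost:3001/metrics | grep container_ooms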

False positives in `container_ooms` metric

Hello!

We are observing what looks like a flood of container_ooms on many containers which, on inspection, mostly show no restarts or OOMs.

This seems to be caused by a single container receiving an OOM which triggers an eviction

eviction_manager.go:367] "Eviction manager: pods ranked for eviction[...list of pods...]" 

which is associated with many logs of

containerd[1320]: "TaskOOM event &TaskOOM{ContainerID:dd7ba888c8eb849fbffeb5da01bb8cf1034264757d3dff5d7d7815d4c4671200,XXX_unrecognized:[],}"
...

Guessing this could be related to containerd/containerd#7102

From https://pkg.go.dev/github.com/containerd/containerd/api/events?utm_source=godoc#TaskOOM it doesn't look like there's a way to distinguish between false and true TaskOOM events.

Adding Kubernetes Labels

Thank you for this project! I'm evaluating it now on our kubernetes cluster. It is working and I'm seeing data being collected. I'm interested specifically in container_ooms.

Is it possible to add Kubernetes pod labels to the metrics? Or how do you query the data to show the number of container_ooms for a particular Kubernetes deployment? I'm looking at pod and I have something like pod="web-puma-7c84796c86-6blt7". I would like to create a dashboard that shows the number of container_ooms for the web-puma deployment.

In case it matters, I'm using Datadog to view the metrics. The Prometheus labels are converted to Datadog tags, and by viewing localhost:3001/metrics, I don't think I'm missing any data provided by missing-container-metrics. The closest thing that works is grouping by image_id but that changes when we build a new image.
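
For illustration, one hedged way to express this in PromQL is to aggregate over the pods whose names carry the deployment's prefix; the regular expression relies on the default pod naming scheme and is an assumption:

sum by (namespace) (container_ooms{pod=~"web-puma-.*"})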

container_restarts does not increment with containerd

container_restarts does not increment with containerd. The same test executed in a cluster with docker works fine.

After looking at the code, I see there is no call to the start() method in the containerd handler, so unless I am missing something, it will never increment the aforementioned counter.

P.S. For testing, I am running this (as suggested here):

</dev/zero head -c 1000m | tail

I am running version 0.21.0.

OpenShift compatibility

The package was installed on OpenShift 4 with DOCKER & CONTAINERD set to false.
Pods are running, but no metrics are shown besides the exporter health metrics.
