kubernetes-monitoring / kubernetes-mixin
A set of Grafana dashboards and Prometheus alerts for Kubernetes.
License: Apache License 2.0
I frequently get questions about what the "namespace" namespace is in the "K8s / Compute Resources / Cluster" dashboard. It displays the non-pod cgroup statistics, so we should either rename it to "non-pod containers" or not show it at all. I'm slightly leaning towards the latter because drill-down is not possible anyway. What do you think?
In our case we have an indexing task which, when started, does a lot of initial writing.
But that's just initial churn, and I'm unsure whether linear prediction is the right thing to do in that case.
I'm not sure we can go for more sophisticated sampling; otherwise I'd suggest making the sampling time frame configurable.
@metalmatze any ideas on that one?
Hi,
I found the following GitHub issue and noticed a corresponding issue hadn't been opened here; I am running into the same problem: prometheus-operator/prometheus-operator#1977
kubernetes_build_info
kubernetes_build_info{buildDate="2018-12-06T01:35:29Z",compiler="gc",endpoint="https-metrics",gitCommit="753b2dbc622f5cc417845f0ff8a77f539a4213ea",gitTreeState="clean",gitVersion="v1.11.5",goVersion="go1.10.3",instance="10.12.11.142:10250",job="kubelet",major="1",minor="11",namespace="kube-system",node="ip-10-86-10-142.us-west-2.compute.internal",platform="linux/amd64",service="prometheus-operator-kubelet"} 1
kubernetes_build_info{buildDate="2018-12-06T01:35:29Z",compiler="gc",endpoint="https-metrics",gitCommit="753b2dbc622f5cc417845f0ff8a77f539a4213ea",gitTreeState="clean",gitVersion="v1.11.5",goVersion="go1.10.3",instance="10.12.13.100:10250",job="kubelet",major="1",minor="11",namespace="kube-system",node="ip-10-86-12-100.us-west-2.compute.internal",platform="linux/amd64",service="prometheus-operator-kubelet"} 1
kubernetes_build_info{buildDate="2018-12-06T01:35:29Z",compiler="gc",endpoint="https-metrics",gitCommit="753b2dbc622f5cc417845f0ff8a77f539a4213ea",gitTreeState="clean",gitVersion="v1.11.5",goVersion="go1.10.3",instance="10.12.12.127:10250",job="kubelet",major="1",minor="11",namespace="kube-system",node="ip-10-86-14-127.us-west-2.compute.internal",platform="linux/amd64",service="prometheus-operator-kubelet"} 1
kubernetes_build_info{buildDate="2018-12-06T23:13:14Z",compiler="gc",endpoint="https",gitCommit="6bad6d9c768dc0864dab48a11653aa53b5a47043",gitTreeState="clean",gitVersion="v1.11.5-eks-6bad6d",goVersion="go1.10.3",instance="10.12.55.98:443",job="apiserver",major="1",minor="11+",namespace="default",platform="linux/amd64",service="kubernetes"} 1
kubernetes_build_info{buildDate="2018-12-06T23:13:14Z",compiler="gc",endpoint="https",gitCommit="6bad6d9c768dc0864dab48a11653aa53b5a47043",gitTreeState="clean",gitVersion="v1.11.5-eks-6bad6d",goVersion="go1.10.3",instance="10.12.55.131:443",job="apiserver",major="1",minor="11+",namespace="default",platform="linux/amd64",service="kubernetes"} 1
More clearly, the gitVersions don't match:
sum by (gitVersion) (kubernetes_build_info)
{gitVersion="v1.11.5"} 3
{gitVersion="v1.11.5-eks-6bad6d"} 2
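Purely as a sketch for spotting this situation, an expression like the following returns a result only when more than one distinct gitVersion is present. Note that on EKS the kubelet and apiserver legitimately report different version strings ("v1.11.5" vs "v1.11.5-eks-6bad6d"), so they would need normalizing first, e.g. with label_replace:

```promql
count(count by (gitVersion) (kubernetes_build_info)) > 1
```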
node_exporter renamed several metric names in v0.16.0. See https://github.com/prometheus/node_exporter/blob/master/CHANGELOG.md#0160--2018-05-15
I've compiled a list of metric names in use by this project and have provided what they should be changed to:
node_cpu -> node_cpu_seconds_total
node_memory_MemTotal -> node_memory_MemTotal_bytes
node_memory_Buffers -> node_memory_Buffers_bytes
node_memory_Cached -> node_memory_Cached_bytes
node_memory_MemFree -> node_memory_MemFree_bytes
node_disk_bytes_read -> node_disk_read_bytes_total
node_disk_bytes_written -> node_disk_written_bytes_total
node_filesystem_size -> node_filesystem_size_bytes
node_filesystem_avail -> node_filesystem_avail_bytes
node_network_receive_bytes -> node_network_receive_bytes_total
node_network_transmit_bytes -> node_network_transmit_bytes_total
node_network_receive_drop -> node_network_receive_drop_total
node_network_transmit_drop -> node_network_transmit_drop_total
node_boot_time -> node_boot_time_seconds
node_disk_io_time_ms -> node_disk_io_time_seconds_total
node_disk_io_time_weighted -> node_disk_io_time_weighted_seconds_total
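As an illustration of the required change (my own sketch, reusing the :node_cpu_utilisation:avg1m rule that appears later in this document; job="node-exporter" is hardcoded here for illustration, where the mixin uses %(nodeExporterSelector)s):

```yaml
# node_exporter < 0.16.0
- record: ':node_cpu_utilisation:avg1m'
  expr: 1 - avg(rate(node_cpu{job="node-exporter",mode="idle"}[1m]))
# node_exporter >= 0.16.0
- record: ':node_cpu_utilisation:avg1m'
  expr: 1 - avg(rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m]))
```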
In our setup, we have two clusters: each has node-exporter (NE) deployed and scraped by the Prometheus running in that cluster. Metrics from the remote cluster are imported into our Prometheus, which attaches a cluster label to all imported metrics. In both clusters, kube-state-metrics (KSM) is deployed.
This leaves us with a situation where our Prometheus has KSM+NE metrics about our own nodes but only KSM metrics about the foreign nodes. That creates a discrepancy between the "nodes as seen by KSM" and the "nodes as seen by NE". As a result, the rules break because Prometheus gets confused about the grouping labels.
For our use case, we fixed this by restricting the two "base" recording rules, :kube_pod_info_node_count: and node_namespace_pod:kube_pod_info:, to only count KSM metrics without a cluster label (directly inside the generated YAML, for testing purposes):
- name: node.rules
rules:
- - expr: sum(min(kube_pod_info) by (node))
+ - expr: sum(min(kube_pod_info{cluster=""}) by (node))
record: ':kube_pod_info_node_count:'
- expr: |
- max(label_replace(kube_pod_info{job="kube-state-metrics"}, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod)
+ max(label_replace(kube_pod_info{cluster="",job="kube-state-metrics"}, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod)
record: 'node_namespace_pod:kube_pod_info:'
- expr: |
count by (node) (sum by (node, cpu) (
This seems to have fixed the problem. I was wondering if we can/should submit a PR to introduce a new config variable to the mixins to allow people to customize the selection of KSM metrics.
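One possible shape for such a knob, sketched in jsonnet (the field name and default are assumptions for illustration, not a statement about the mixin's existing API):

```jsonnet
{
  _config+:: {
    // Hypothetical: a selector applied to every kube-state-metrics query,
    // which multi-cluster setups could extend to restrict rules to local
    // series, e.g. 'job="kube-state-metrics",cluster=""'.
    kubeStateMetricsSelector: 'job="kube-state-metrics"',
  },
}
```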
Some recording rules are missing; we should either add them or remove the dashboards entirely. WDYT @tomwilkie @brancz?
# I am on a fresh git clone
$ git reset --hard
HEAD is now at 297f40f Merge pull request #21 from richerve/fix/remove-POD-resources-pod
$ git status
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
# I remove leftovers from my previous experiments...
$ rm -rf dashboards_out
# I do not understand why this fails...
$ make dashboards_out
jsonnet -J vendor -m dashboards_out lib/dashboards.jsonnet
RUNTIME ERROR: Field does not exist: grafanaDashboards
object <anonymous>
lib/dashboards.jsonnet:1:20-66 thunk <dashboards>
lib/dashboards.jsonnet:5:32-41 thunk <o>
std.jsonnet:955:28
std.jsonnet:955:9-36 function <anonymous>
lib/dashboards.jsonnet:5:15-42 thunk <a>
lib/dashboards.jsonnet:(3:1)-(6:1) function <anonymous>
lib/dashboards.jsonnet:(3:1)-(6:1)
make: *** [dashboards_out] Error 1
# ... but this seems to fix it.
$ jb install
Cloning into 'vendor/.tmp/jsonnetpkg-grafonnet-master635056242'...
remote: Counting objects: 961, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 961 (delta 0), reused 0 (delta 0), pack-reused 958
Receiving objects: 100% (961/961), 300.41 KiB | 886.00 KiB/s, done.
Resolving deltas: 100% (544/544), done.
Already on 'master'
Your branch is up to date with 'origin/master'.
# Works without any errors now
$ make dashboards_out
make: 'dashboards_out' is up to date.
# But nothing gets generated, why?
$ ls -la dashboards_out/
total 0
drwxr-xr-x 2 lvlcek staff 64 Jun 13 12:37 .
drwxr-xr-x 17 lvlcek staff 544 Jun 13 12:37 ..
Automatic runbook_url generation runs as part of the mixin without any condition. So when vendoring kubernetes-mixin, my own alerts are also affected by this.
All graphs besides Disk Utilisation are empty. It looks like the queries using group_left are all returning no results.
Example:
record: node:node_cpu_utilisation:avg1m
expr: 1
- avg by(node) (rate(node_cpu{job="node-exporter",mode="idle"}[1m])
* on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:)
In rules.libsonnet we declare that CPU utilisation is the fraction of time not spent idle, as per below.
// CPU utilisation is % CPU is not idle.
record: ':node_cpu_utilisation:avg1m',
expr: |||
1 - avg(rate(node_cpu{%(nodeExporterSelector)s,mode="idle"}[1m]))
||| % $._config,
},
There are two initial aspects I would like to clarify for the purposes of accurately measuring CPU time as a utilisation metric.
We include the kubernetes-mixin for monitoring in the kube-prometheus stack, and a common point of frustration is that all alerts are always shipped, even on managed Kubernetes clusters like GKE or AKS. On those clusters it is often not possible to retrieve the metrics necessary to monitor the control plane components.
While it would be possible to hand-pick or filter alerts, my feeling is that splitting the alerts into two groups could also be beneficial in a world where a single Prometheus server is not sufficient to monitor an entire cluster, or in multi-tenant Kubernetes environments. In these scenarios we see people assign a Prometheus server per tenant (typically made up of one or more namespaces), and the responsibility of that tenant is not to monitor the Kubernetes cluster itself but primarily the workload.
This would not be a breaking change, as the entrypoint (i.e. the .libsonnet file imported by people) for the alerting rules would stay the same.
The alert KubeAPIErrorsHigh jumps to 100% quite often for us, even though the error count is quite small. There seems to be a label mismatch between 5xx and 2xx return codes (compare screenshot), and the alert does not ignore the code label. Thus, every time an error happens, this jumps to 100% for us:
In this example the alert jumped to 100% errors, even though the actual error percentage was below 1%.
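A sketch of an expression that sidesteps the mismatch by aggregating both sides before dividing, so the code label (and any other per-series labels) cannot cause a mismatch; the metric name matches this era of the apiserver, and the threshold is illustrative:

```promql
sum(rate(apiserver_request_count{job="apiserver",code=~"5.."}[5m]))
  /
sum(rate(apiserver_request_count{job="apiserver"}[5m])) * 100 > 5
```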
CPU Usage doesn't work, and if 'fixed' it just shows a monotonically increasing line.
The left Y axis in the Memory Usage graph is wrong: it's 'short' instead of 'bytes'.
The above assumes that the manifests provided in 'prometheus-operator/contrib/kube-prometheus/manifests' are in sync with the jsonnet here.
Hi! Is there a plan to expose the kube_pod_container_resource_requests metric? This would allow monitoring requests other than CPU and memory; in particular, I need GPU requests.
(I'm looking here: https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/pod-metrics.md)
Also, for nodes I'd like to have access to kube_node_status_allocatable and kube_node_status_capacity, to also see the GPUs. (https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/node-metrics.md)
Thanks!
The kube-prometheus stack used to have useful dashboards for the Kubernetes control plane. It would be nice to re-introduce those. The jsonnet definitions of those can be found last in this commit of the kubernetes-grafana package.
Hello, as mentioned in the Kubernetes Slack #monitoring-mixin channel, I don't see anything equivalent to:
I've never encountered ksonnet before, so I'm not sure I can translate that job file in a timely fashion. I'm also not sure whether it should be added to this existing file or whether it warrants a completely separate file for jobs. I would appreciate any guidance or suggestions.
groups:
- name: job.rules
  rules:
  - alert: CronJobRunning
    expr: time() - kube_cronjob_next_schedule_time > 3600
    for: 1h
    labels:
      severity: warning
    annotations:
      description: CronJob {{$labels.namespace}}/{{$labels.cronjob}} is taking more than 1h to complete
      summary: CronJob didn't finish after 1h
  - alert: JobCompletion
    expr: kube_job_spec_completions - kube_job_status_succeeded > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} is taking more than 1h to complete
      summary: Job {{$labels.job}} didn't finish after 1h
  - alert: JobFailed
    expr: kube_job_status_failed > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} failed to complete
      summary: Job failed
When jb vendors kubernetes-mixin, it includes the .circleci directory, which contains a copy of the jsonnet binary. It'd be nice not to have that binary around when vendoring this repo.
With the next release of Prometheus, promtool will have the ability to unit test alerts.
We should also write tests at least for the most complex ones and start running these tests in our CI.
What do you think about integrating the tests into the alert jsonnet objects as well?
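For reference, a minimal sketch of the promtool unit-test format, exercising the simple JobFailed alert quoted earlier in this document (file names and label values are illustrative):

```yaml
# test.yaml -- run with: promtool test rules test.yaml
rule_files:
  - alerts.yaml            # contains the JobFailed alert
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'kube_job_status_failed{namespace="default",job="backup"}'
        values: '1+0x120'  # failing for the whole two-hour window
    alert_rule_test:
      - eval_time: 90m     # past the alert's 1h "for" duration
        alertname: JobFailed
        exp_alerts:
          - exp_labels:
              severity: warning
              namespace: default
              job: backup
```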
The kubelet_volume_stats_used_bytes metric exposed by the kubelet will always carry the namespace="kube-system" label. We inject the prefixedNamespaceSelector into KubePersistentVolumeFullInFourDays to restrict the scope of the alert by Kubernetes namespace. prefixedNamespaceSelector uses the namespace label key; instead, we should use the exported_namespace label key.
What are your thoughts?
KubePersistentVolumeFullInFourDays:
{
alert: 'KubePersistentVolumeFullInFourDays',
expr: |||
(
kubelet_volume_stats_used_bytes{%(prefixedNamespaceSelector)s%(kubeletSelector)s}
/
kubelet_volume_stats_capacity_bytes{%(prefixedNamespaceSelector)s%(kubeletSelector)s}
) > 0.85
and
predict_linear(kubelet_volume_stats_available_bytes{%(prefixedNamespaceSelector)s%(kubeletSelector)s}[%(volumeFullPredictionSampleTime)s], 4 * 24 * 3600) < 0
||| % $._config,
'for': '5m',
labels: {
severity: 'critical',
},
annotations: {
message: 'Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value }} bytes are available.',
},
},
Hi,
I am trying to set up Prometheus alerting using this, but I found that Prometheus doesn't have the metric up{job="kube-controller-manager"}. As I understand it, this requires a scrape config in Prometheus for kube-controller-manager.
So my question is: do I need any special setup on Prometheus? Right now I use the default setup from the Prometheus Helm chart https://github.com/helm/charts/blob/8ac01b83b5fd992600928d1e2ff159f69b64b484/stable/prometheus/values.yaml#L810.
Thanks
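up{job="kube-controller-manager"} only exists if Prometheus actually has a scrape job by that name, and the default Helm chart values don't define one. A rough sketch of what such a job could look like (the port, discovery role, and relabeling are assumptions for a kubeadm cluster of this era, where the controller manager serves metrics on 10252):

```yaml
scrape_configs:
  - job_name: kube-controller-manager
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only the controller-manager pods in kube-system (pod names are an assumption).
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
        regex: kube-system;kube-controller-manager.*
        action: keep
      # Point the scrape at the insecure metrics port.
      - source_labels: [__address__]
        regex: ([^:]+)(?::\d+)?
        replacement: $1:10252
        target_label: __address__
```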
I was running promtool over the generated rule files and ran into the following. This is likely a bug in jsonnet's yaml generation, but do you have any thoughts?
Note the quotes (') around the record: value.
$ cat /tmp/mapping.rules.yaml
groups:
- name: "node.rules"
rules:
- record: ':kube_pod_info_node_count:'
expr: sum(min by(node) (kube_pod_info))
$ promtool check rules /tmp/mapping.rules.yaml
Checking /tmp/mapping.rules.yaml
SUCCESS: 1 rules found
$ cat /tmp/mapping-bad.rules.yaml
groups:
- name: "node.rules"
rules:
- record: :kube_pod_info_node_count:
expr: sum(min by(node) (kube_pod_info))
$ promtool check rules /tmp/mapping-bad.rules.yaml
Checking /tmp/mapping-bad.rules.yaml
FAILED:
yaml: line 3: mapping values are not allowed in this context
level=warn ts=2018-11-09T12:24:37.99045291Z caller=manager.go:343 component="rule manager" group=kubernetes msg="Evaluating rule failed" rule="record: namespace_name:kube_pod_container_resource_requests_memory_bytes:sum\nexpr: sum by(namespace, label_name) (sum by(namespace, pod) (kube_pod_container_resource_requests_memory_bytes{job=\"kube-state-metrics\"})\n * on(namespace, pod) group_left(label_name) label_replace(kube_pod_labels{job=\"kube-state-metrics\"},\n \"pod_name\", \"$1\", \"pod\", \"(.*)\"))\n" err="many-to-many matching not allowed: matching labels must be unique on one side"
I got this warning in prometheus version v2.3.2
I've changed the expressions of kube_pod_container_resource_requests_memory_bytes, kube_pod_container_resource_requests_cpu_cores and node_num_cpu, using ignoring instead of on.
This is the code:
- record: node:node_num_cpu:sum
expr: count by (node) (sum by (node, cpu) (node_cpu_seconds_total{job="node-exporter"} * ignoring (namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))
- record: "namespace_name:kube_pod_container_resource_requests_memory_bytes:sum"
expr: sum by (namespace, label_name) (sum(kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"}) by (namespace, pod) * ignoring (namespace, pod) group_left(label_name) label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)"))
- record: "namespace_name:kube_pod_container_resource_requests_cpu_cores:sum"
expr: sum by (namespace, label_name) (sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} and on(pod) kube_pod_status_scheduled{condition="true"}) by (namespace, pod) * ignoring (namespace, pod) group_left(label_name) label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)"))
I know that the ignoring operator removes the labels listed inside the brackets. The warning is gone now, but I'm not sure the rules are actually working correctly; I'm still trying to test these alerts.
Can someone validate this for me?
➜ prometheus git:(master) ✗ ks registry add kausal https://github.com/kausalco/public
➜ prometheus git:(master) ✗ ks pkg install kausal/prometheus-ksonnet
ERROR GET https://api.github.com/repos/kausalco/public/contents//prometheus-ksonnet/parts.yaml?ref=6c037aa65f54edadbdcebd6fc0a2ecf167f19109: 404 Not Found []
➜ prometheus git:(master) ✗ ks version
ksonnet version: 0.8.0
jsonnet version: v0.9.5
client-go version: v1.6.8-beta.0+$Format:%h$
We have two use cases for scoping alerts to certain namespaces:
There is a Prometheus server that collects all cluster-wide metrics from kubelets, cAdvisor, and kube-state-metrics for a cluster that has multiple independent users/tenants. For the SREs responsible for offering the cluster as a service, this adds significant cognitive overhead throughout the entire pipeline (Prometheus alerts page, alerts fired against Alertmanager, list of alerts in Alertmanager), when all they care about are the alerts for the cluster components and infrastructure.
Different users/tenants may have different configurations of the "application" specific alerts of this repository.
@metalmatze @tomwilkie Do you think this is something we should optionally allow defining? I think the default behavior should continue to be what we have today.
Some alert annotations use the $value of the query as is. Instead, one could pipe a value in bytes through humanize1024.
E.g. KubePersistentVolumeUsageCritical:
annotations: {
message: 'The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ printf "%0.0f" $value }}% free.',
},
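For the byte-valued message of KubePersistentVolumeFullInFourDays quoted earlier in this document, the change would simply pipe the value through the Prometheus template function:

```jsonnet
annotations: {
  message: 'Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value | humanize1024 }}B are available.',
},
```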
Originated from https://bugzilla.redhat.com/show_bug.cgi?id=1634299.
I followed the steps in README.md to create an application and default application. I then installed prometheus-ksonnet by running jb install github.com/kausalco/public/prometheus-ksonnet and made the suggested changes to main.jsonnet, but when I run ks apply default I get this error:
ERROR find objects: C:\tmp\use-monitoring\vendor/prometheus-ksonnet/lib/nginx.libsonnet:26:21 Text block not terminated with |||
'nginx.conf': |||
I'm using the latest release of ksonnet I found on GitHub (0.12.0).
I'm running this from a Windows machine.
The query is currently showing the average of iowait, irq, nice, softirq, steal, system and user instead of the sum.
Also, the y-axis is set to "percent (0.0-1.0)", but the data points are in the 0-100 range.
The recording rule for namespace_name:kube_pod_container_resource_requests_cpu_cores:sum appears to include requests for Pods which aren't running (e.g. those which have been Evicted, Completed, etc.). This means the KubeCPUOvercommit alert fires artificially.
In one of our clusters it indicates we've requested 5x the amount of CPU available, yet we're still perfectly within this and still able to schedule workloads successfully.
The sum for this rule should only consider requests for pods which are actually holding onto resources.
This probably isn't perfect but it does produce a more reasonable result for the amount of requested CPU.
sum(kube_pod_container_resource_requests_cpu_cores) by(pod)
and on(pod)
(kube_pod_status_scheduled{condition="true"})
The next version of kube-state-metrics should have a metric called kube_pod_container_status_last_terminated_reason. See kubernetes/kube-state-metrics#535.
Can an OOM alert based on this metric be set up in this mixin?
What did you do?
Deployed prometheus-operator including the built-in Grafana instance and the example dashboards.
What did you expect to see?
Useful memory metrics to monitor, debug and tune pod memory usage.
What did you see instead? Under which circumstances?
Pod memory is reported including caches, which can go up and down with available system memory and is not useful at all for the mentioned purposes.
The pod memory graph should clearly separate between memory that the pod uses and memory that can be reclaimed at any time. This is particularly relevant, as there is also a metric in the dashboard that counts use versus limit. This metric will be mostly useless for deployment tuning, unless cache memory is subtracted first.
It would be best if the graph and/or the memory quota table below clearly separated total usage from cache, perhaps with a stacked graph. At the very least, it should be made clear that the shown container_memory_usage_bytes metric also includes container_memory_cache, similar to how the Linux command free reports buffer/cache and freely available memory separately.
It was also suggested that container_memory_rss be used, but I think usage - cache is what's relevant for limits.
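A sketch of a panel query along those lines (cAdvisor metric names as exposed via the kubelet in this era; the selector is illustrative):

```promql
sum by (pod_name) (
    container_memory_usage_bytes{container_name!=""}
  - container_memory_cache{container_name!=""}
)
```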
Environment
v2.5.0
Have all the dashboards use the local browser default timezone - not UTC.
Resolves the comments on the following issue about setting the timezone for the grafana web ui: https://github.com/coreos/prometheus-operator/issues/1807
The following alerts are being proposed at generic storage level
PVCHardLimitNearingFull - warning (80%), critical (90%)
maps to requests.storage (Across all persistent volume claims, the sum of storage requests cannot exceed this value) storage resource quota as specified in https://kubernetes.io/docs/concepts/policy/resource-quotas/#enabling-resource-quota.
As this is cluster-wide, having an alert to indicate when PVC storage requests (capacity) are running out is important, so that action (e.g. adding capacity, reclaiming space, etc.) can be taken. Filling up to maximum capacity is usually a bad idea, as it can lead to undesirable situations such as performance degradation or instability. If the underlying storage is on AWS or any public cloud provider, remediation usually means expanding the underlying volume (and possibly restarting some instances), reclaiming space, or some other data-offload technique. The same is true for Gluster (OCS) and Ceph. For on-prem storage subsystems, this may mean ordering additional disks to support the expansion, plus a procurement process (which may or may not apply in the public cloud). Note: a single OCP cluster typically involves multiple storage subsystems, so this could mean expansion in one or more of them.
80% utilization is meant as an early warning to start taking action to prevent severe issues
90% utilization is much more severe/critical requiring more immediate action by the admin/operator.
The alert name "PVCHardLimitNearingFull" is suggested with the words "hard limit" because "requests" is confusing to users, and the CPU and memory quota terminology differs from the persistent storage quota terminology (though the ephemeral storage quota terminology seems more aligned with the CPU and memory quota terminology).
StorageClass.PVCHardLimitNearingFull - warning (80%), critical (90%)
maps to .storageclass.storage.k8s.io/requests.storage (Across all persistent volume claims associated with the storage-class-name, the sum of storage requests cannot exceed this value) storage resource quota as specified in https://kubernetes.io/docs/concepts/policy/resource-quotas/#enabling-resource-quota
This is similar to the requests.storage and differs in that this is in-context of a storage class. As this is tied to a single storage provisioner, which basically is either underlying storage on a public cloud provider or storage subsystem (e.g. Gluster/OCS, Ceph, AWS EBS, etc.), once again, the admin is having to take action:
expand the storage (which may or may not be a disruptive operation), go through a procurement process (if applicable)
figuring out ways to offload the existing storage (reclamation, archiving, deleting data, migrate to something that's bigger, etc.). If data is getting offloaded, once again, the admin has to communicate with the users to let them know or have the users take action.
StorageClass.PVCCountNearingFull - warning (80%), critical (90%)
maps to .storageclass.storage.k8s.io/persistentvolumeclaims (Across all persistent volume claims associated with the storage-class-name, the total number of persistent volume claims that can exist in the namespace) storage resource quota as specified in https://kubernetes.io/docs/concepts/policy/resource-quotas/#enabling-resource-quota
This is less worrying but still relevant, as it refers to the count of PVCs: if the quota is exhausted, users will be unable to create new claims.
The 80% threshold is just a warning to the admin/operator to either increase the allotted number, look into reclamation (if not automatic), or ask users to remove unneeded PVCs.
90% just means it's more urgent, with a higher likelihood that the developer/consumer will experience issues requesting storage if it is not addressed.
Namespace.PVCCountNearingFull - warning (80%), critical (90%).
This maps to persistentvolumeclaims
Namespace.EphemeralStorageLimitNearingFull - warning (80%), critical (90%)
This maps to limits.ephemeral-storage
NodeDiskRunningFull
This should apply to any node (not just the Device of the node-exporter Namespace/Pod) and predict when it will be full.
This relates to filesystems (see https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/prometheus-rules.yaml), though the alert label says it’s about a disk (which I found confusing).
For filesystems, utilisation beyond 90% is usually not good, but the suggestion is to keep the threshold at 85%, as in the existing alert, since this should kick in after kubelet garbage collection, which triggers at around 80-85% by default (per https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/#container-collection).
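Since kube-state-metrics already exposes quota usage and limits via kube_resourcequota, a sketch of the proposed PVCHardLimitNearingFull warning could look like this (thresholds per the proposal above; the for duration and severity are assumptions):

```yaml
- alert: PVCHardLimitNearingFull
  expr: |
    kube_resourcequota{resource="requests.storage",type="used"}
      / ignoring(type)
    kube_resourcequota{resource="requests.storage",type="hard"}
      > 0.80
  for: 15m
  labels:
    severity: warning
```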
For calculating utilisation as a function of available over total we do:
record: ':node_memory_utilisation:',
expr: |||
1 -
sum(node_memory_MemFree{%(nodeExporterSelector)s} + node_memory_Cached{%(nodeExporterSelector)s} + node_memory_Buffers{%(nodeExporterSelector)s})
/
sum(node_memory_MemTotal{%(nodeExporterSelector)s})
||| % $._config,
},
I spoke to SuperQ about measuring memory, and according to him, "MemFree + Cached + Buffers is a somewhat obsolete set of metrics; there was a post about that somewhere, buried on LKML."
The recommendation is to just use MemFree. Making a note here to remind me to dig out the LKML post and make a PR if necessary.
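For reference, a sketch of the rule with that recommendation applied (post-0.16.0 metric names; the node-exporter selector is hardcoded here where the mixin uses %(nodeExporterSelector)s):

```yaml
- record: ':node_memory_utilisation:'
  expr: |
    1 -
    sum(node_memory_MemFree_bytes{job="node-exporter"})
    /
    sum(node_memory_MemTotal_bytes{job="node-exporter"})
```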
Hi,
Does anyone have experience debugging KubeAPILatencyHigh?
I have a Kubernetes cluster which alerts KubeAPILatencyHigh almost constantly, but I don't know how to debug it. Can anyone share some experience?
Here is some information about the cluster:
bootstrap tool: kubeadm v1.11.3
master: 1 x (4 cpus, 8G ram, 80G ssd) virtual machine
nodes: 20 x (64 cpus, 256G ram, 2t ssd)
network: flannel v0.10.0
dns: coredns x 3 replica, no autoscale.
total pods: ~300
example cpu usage of master:
$ mpstat 1 5
Linux 4.4.0-138-generic 11/26/2018 _x86_64_ (4 CPU)
04:52:11 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
04:52:12 PM all 3.05 0.00 2.03 0.00 0.00 0.76 0.00 0.00 0.00 94.16
04:52:13 PM all 5.08 0.00 1.78 0.25 0.00 0.25 0.00 0.00 0.00 92.64
04:52:14 PM all 2.56 0.00 2.05 0.26 0.00 0.26 0.00 0.00 0.00 94.88
04:52:15 PM all 3.87 0.00 2.58 0.00 0.00 0.26 0.00 0.00 0.00 93.30
04:52:16 PM all 4.85 0.26 3.32 0.26 0.00 0.26 0.26 0.00 0.00 90.82
Average: all 3.88 0.05 2.35 0.15 0.00 0.36 0.05 0.00 0.00 93.16
example kube-api log:
I1127 00:24:51.038592 1 trace.go:76] Trace[38993630]: "Get /api/v1/namespaces/kube-system/endpoints/kube-controller-manager" (started: 2018-11-27 00:24:50.293798654 +0000 UTC m=+1142201.756391166) (total time: 744.706806ms):
Trace[38993630]: [744.609062ms] [744.602896ms] About to write a response
I1127 00:24:51.039953 1 trace.go:76] Trace[1687150485]: "Get /api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx" (started: 2018-11-27 00:24:50.386678665 +0000 UTC m=+1142201.849271164) (total time: 653.229158ms):
Trace[1687150485]: [653.135664ms] [653.13091ms] About to write a response
I1127 00:24:58.694129 1 trace.go:76] Trace[1818362417]: "Get /api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx" (started: 2018-11-27 00:24:54.426466538 +0000 UTC m=+1142205.889059018) (total time: 4.267603386s):
Trace[1818362417]: [4.267485115s] [4.267480859s] About to write a response
I1127 00:24:58.699821 1 trace.go:76] Trace[341574261]: "Get /api/v1/namespaces/default" (started: 2018-11-27 00:24:57.166795151 +0000 UTC m=+1142208.629387685) (total time: 1.53298777s):
Trace[341574261]: [1.532910089s] [1.532906249s] About to write a response
I1127 00:24:58.700395 1 trace.go:76] Trace[2008873298]: "Get /api/v1/namespaces/ingress-nginx/secrets/nginx-ingress-serviceaccount-token-22th5" (started: 2018-11-27 00:24:57.434586023 +0000 UTC m=+1142208.897178525) (total time: 1.265775612s):
Trace[2008873298]: [1.265702198s] [1.265695779s] About to write a response
I1127 00:24:58.700869 1 trace.go:76] Trace[1072363034]: "Get /api/v1/namespaces/kube-system/secrets/cronjob-controller-token-fh2qf" (started: 2018-11-27 00:24:57.14580432 +0000 UTC m=+1142208.608396862) (total time: 1.555031827s):
Trace[1072363034]: [1.554907629s] [1.554903089s] About to write a response
I1127 00:24:58.701159 1 trace.go:76] Trace[336371808]: "Get /api/v1/namespaces/kube-system/endpoints/kube-scheduler" (started: 2018-11-27 00:24:56.254123782 +0000 UTC m=+1142207.716716427) (total time: 2.446951069s):
Trace[336371808]: [2.446725016s] [2.446719549s] About to write a response
I1127 00:24:58.701375 1 trace.go:76] Trace[1790922014]: "Get /api/v1/namespaces/ingress-nginx/secrets/nginx-ingress-serviceaccount-token-22th5" (started: 2018-11-27 00:24:56.827729065 +0000 UTC m=+1142208.290321563) (total time: 1.873614851s):
Trace[1790922014]: [1.873510708s] [1.873503473s] About to write a response
I1127 00:24:58.701774 1 trace.go:76] Trace[691272094]: "Get /api/v1/namespaces/kube-system/endpoints/kube-controller-manager" (started: 2018-11-27 00:24:55.06580382 +0000 UTC m=+1142206.528396298) (total time: 3.635896447s):
Trace[691272094]: [3.635808899s] [3.635802985s] About to write a response
I1127 00:24:58.701874 1 trace.go:76] Trace[818045190]: "Get /api/v1/namespaces/ingress-nginx/secrets/nginx-ingress-serviceaccount-token-22th5" (started: 2018-11-27 00:24:56.474983542 +0000 UTC m=+1142207.937576130) (total time: 2.226805588s):
Trace[818045190]: [2.226744453s] [2.226735053s] About to write a response
I1127 00:24:58.702327 1 trace.go:76] Trace[1614989265]: "Get /api/v1/namespaces/ingress-nginx/secrets/nginx-ingress-serviceaccount-token-22th5" (started: 2018-11-27 00:24:56.149572743 +0000 UTC m=+1142207.612165251) (total time: 2.552722949s):
Trace[1614989265]: [2.552662992s] [2.552657858s] About to write a response
I1127 00:24:58.702737 1 trace.go:76] Trace[1447455717]: "Get /api/v1/namespaces/ingress-nginx/secrets/nginx-ingress-serviceaccount-token-22th5" (started: 2018-11-27 00:24:55.372165695 +0000 UTC m=+1142206.834758232) (total time: 3.33054224s):
Trace[1447455717]: [3.330453031s] [3.330444801s] About to write a response
I1127 00:24:58.703571 1 trace.go:76] Trace[1945409431]: "Get /api/v1/namespaces/ingress-nginx/secrets/nginx-ingress-serviceaccount-token-22th5" (started: 2018-11-27 00:24:55.283078173 +0000 UTC m=+1142206.745670715) (total time: 3.420351935s):
Trace[1945409431]: [3.420292916s] [3.420286689s] About to write a response
I1127 00:24:58.703958 1 trace.go:76] Trace[1226849696]: "Get /api/v1/namespaces/kube-system/secrets/node-problem-detector-token-jppnn" (started: 2018-11-27 00:24:54.577116058 +0000 UTC m=+1142206.039708599) (total time: 4.126802508s):
Trace[1226849696]: [4.126740738s] [4.126734225s] About to write a response
I1127 00:24:58.721163 1 trace.go:76] Trace[994305666]: "GuaranteedUpdate etcd3: *core.Node" (started: 2018-11-27 00:24:55.094594747 +0000 UTC m=+1142206.557187342) (total time: 3.626534954s):
Trace[994305666]: [3.626430436s] [3.624376327s] Transaction committed
I1127 00:24:58.724937 1 trace.go:76] Trace[970948525]: "GuaranteedUpdate etcd3: *core.Node" (started: 2018-11-27 00:24:57.045435379 +0000 UTC m=+1142208.508027980) (total time: 1.679259404s):
Trace[970948525]: [1.67896677s] [1.676837608s] Transaction committed
I1127 00:24:58.725494 1 trace.go:76] Trace[1332645750]: "GuaranteedUpdate etcd3: *core.Node" (started: 2018-11-27 00:24:57.076805959 +0000 UTC m=+1142208.539398517) (total time: 1.648639757s):
Trace[1332645750]: [1.648314409s] [1.646355104s] Transaction committed
I1127 00:24:58.727250 1 trace.go:76] Trace[1944527049]: "GuaranteedUpdate etcd3: *core.Node" (started: 2018-11-27 00:24:57.151655982 +0000 UTC m=+1142208.614248610) (total time: 1.575550502s):
Trace[1944527049]: [1.575378497s] [1.572909147s] Transaction committed
I1127 00:24:58.727857 1 trace.go:76] Trace[283895130]: "GuaranteedUpdate etcd3: *core.Node" (started: 2018-11-27 00:24:57.097471349 +0000 UTC m=+1142208.560063960) (total time: 1.630274405s):
Trace[283895130]: [1.630169627s] [1.628395238s] Transaction committed
We are seeing inconsistent firing of the Kubernetes API latency alert. This is because the latency of list requests inherently depends on the number of items returned by the request.
I would propose that we either ignore "list" requests altogether for latency purposes, as we already do for other verbs:
Or at least treat them separately, with different thresholds.
Let me know what you think.
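For the first option, a minimal sketch of what excluding list requests could look like; the alert name, record name, and threshold here are assumptions modeled on the mixin's existing verb exclusions, not the actual rule:

```yaml
# Hypothetical sketch: extend the existing verb exclusion to also cover "LIST",
# so list latency (which scales with result size) no longer trips the alert.
- alert: KubeAPILatencyHigh
  expr: |
    cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"} > 1
  for: 10m
  labels:
    severity: warning
```

The alternative of a separate, more lenient alert for `verb="LIST"` would reuse the same expression with the exclusion inverted and a higher threshold.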
We have a use case for a namespace in Kubernetes where we submit a large number of Jobs to our cluster without expecting immediate scheduling.
We've tested these workloads with a recently deployed kube-prometheus stack and found that KubeCPUOvercommit fires whenever we submit a large batch of jobs at once. I believe there are two issues here:
Here's the definition of the recording rule that KubeCPUOvercommit relies on.
- expr: |
sum by (namespace, label_name) (
sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} and on(pod) kube_pod_status_scheduled{condition="true"}) by (namespace, pod)
* on (namespace, pod) group_left(label_name)
label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
)
record: namespace_name:kube_pod_container_resource_requests_cpu_cores:sum
For the first issue, I believe kube_pod_container_resource_requests_cpu_cores
will continue showing resource requests, and kube_pod_status_scheduled{condition="true"}
will have a value of 1,
for any pod associated with a finished job. Perhaps we can join on another metric like kube_pod_status_phase{phase=~"Running|Pending"} instead.
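One possible shape for that fix, as a sketch of the recording rule quoted above; whether kube_pod_status_phase carries the right labels for this join in every setup is an assumption:

```yaml
# Sketch: count only the requests of pods that are currently Running or Pending,
# so pods belonging to completed Jobs no longer contribute to the overcommit sum.
- record: namespace_name:kube_pod_container_resource_requests_cpu_cores:sum
  expr: |
    sum by (namespace, label_name) (
      sum by (namespace, pod) (
        kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
        and on(pod) kube_pod_status_phase{phase=~"Running|Pending"} == 1
      )
      * on (namespace, pod) group_left(label_name)
      label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
    )
```

The `== 1` filter matters because kube_pod_status_phase exposes a series per phase, with value 1 only for the pod's current phase.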
Adding new alerts is nice and simple by extending the prometheus_alerts+::
structure, but as it is a list, adding labels or annotations to existing alerts isn't.
For example, wanting to modify the message annotation to link to playbooks or similar. Speaking to Tom, if this were a dictionary rather than a list, this would become possible.
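A rough sketch of what the dictionary shape could enable; the alert name and key layout here are illustrative, not the mixin's actual structure:

```jsonnet
// Hypothetical: if alerts were keyed by name instead of stored in a list,
// downstream users could patch individual alerts with the usual jsonnet merge.
{
  prometheus_alerts+:: {
    KubePodCrashLooping+: {
      annotations+: {
        message: 'Pod is crash looping, see https://example.com/playbooks/crashloop',
      },
    },
  },
}
```

With a list, the same customization requires mapping over every element and matching on the alert name, which is considerably clumsier.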
It is often confusing for users when there are alerts about Kubernetes workloads (Deployments, DaemonSets, StatefulSets, etc.) that at first sight seem to come from the kube-state-metrics target. We should probably drop any labels that identify kube-state-metrics and keep only the actual contextual information, like the object name and namespace.
My hunch is that this would need to be configurable. I understand that, for example, in the Kausal ksonnet-prometheus package this would be the instance
label, but in most other setups out there (such as the default Prometheus configuration from the Prometheus repo and the Prometheus Operator) these will be labels naming the respective Kubernetes resource (pod/service/namespace/etc.). It's also reasonable to let people do this however they like.
Can you please help me write a PromQL query to trigger an alert when a node is in an unschedulable state? Thanks in advance!
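A minimal sketch, assuming kube-state-metrics is scraped (it exposes kube_node_spec_unschedulable, which is 1 while a node is cordoned); the alert name, duration, and severity are illustrative:

```yaml
# Sketch: fire when a node has been marked unschedulable (cordoned) for a while.
- alert: KubeNodeUnschedulable
  expr: kube_node_spec_unschedulable{job="kube-state-metrics"} == 1
  for: 15m
  labels:
    severity: warning
  annotations:
    message: 'Node {{ $labels.node }} has been unschedulable for more than 15 minutes.'
```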
I suggest only triggering the KubePersistentVolumeFullInFourDays
alert when disk usage is above 85%.
Reasoning: after the alert has fired and an engineer has fixed the issue, the alert should resolve immediately.
Similar thing done by @brancz in prometheus-operator/prometheus-operator#1857:
diff --git a/contrib/kube-prometheus/jsonnet/kube-prometheus/alerts/node.libsonnet b/contrib/kube-prometheus/jsonnet/kube-prometheus/alerts/node.libsonnet
index 5c24f09f..27039f4e 100644
--- a/contrib/kube-prometheus/jsonnet/kube-prometheus/alerts/node.libsonnet
+++ b/contrib/kube-prometheus/jsonnet/kube-prometheus/alerts/node.libsonnet
@@ -7,11 +7,10 @@
{
alert: 'NodeDiskRunningFull',
annotations: {
- description: 'device {{$labels.device}} on node {{$labels.instance}} is running full within the next 24 hours (mounted at {{$labels.mountpoint}})',
- summary: 'Node disk is running full within 24 hours',
+ message: 'Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} is running full within the next 24 hours.',
},
expr: |||
- predict_linear(node_filesystem_free{%(nodeExporterSelector)s,mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[6h], 3600 * 24) < 0 and on(instance) up{%(nodeExporterSelector)s}
+ (node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0)
||| % $._config,
'for': '30m',
labels: {
@@ -21,11 +20,10 @@
{
alert: 'NodeDiskRunningFull',
annotations: {
- description: 'device {{$labels.device}} on node {{$labels.instance}} is running full within the next 2 hours (mounted at {{$labels.mountpoint}})',
- summary: 'Node disk is running full within 2 hours',
+ message: 'Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} is running full within the next 2 hours.',
},
expr: |||
- predict_linear(node_filesystem_free{%(nodeExporterSelector)s,mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[30m], 3600 * 2) < 0 and on(instance) up{%(nodeExporterSelector)s}
+ (node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) < 0)
||| % $._config,
'for': '10m',
labels: {
Originating from https://bugzilla.redhat.com/show_bug.cgi?id=1632762.
In prometheus-operator/prometheus-operator#1544, @jalberto reported that the Kubernetes dashboards of the Grafana Kubernetes plugin are of high quality and would add a lot of value if added to the stack. I briefly looked at some of the dashboards and I think there are some elements that would certainly be valuable to transfer into the Kubernetes monitoring mixin.
My personal opinion on the Grafana Kubernetes plugin is that it does too much, as I practically have to give it a certificate with cluster-admin rights in my Kubernetes cluster, which isn't necessary with this monitoring mixin. Nevertheless, the dashboards seem useful.
The sum(container* ...)
rules are duplicates of data provided by cAdvisor within the kubelet, but they are reported under the same record names, albeit with different labels.
The label selectors in the rules in the default rules file collect both the node-exporter (I think?) records and the kubelet cAdvisor records. This results in values that are exactly double the real ones.
I think the solution here is to just use the service="kubelet"
and container_name!=""
label selectors; then there is no need for a sum().
Originally posted here:
prometheus-operator/prometheus-operator#2302
What did you do?
Installed the Prometheus chart and friends via Helm in a K8s cluster created by kubeadm 1.11.
What did you expect to see?
Correct values aggregated by the rules:
| Rule | State | Error | Last Evaluation | Evaluation Time |
| -- | -- | -- | -- | -- |
| record: pod_name:container_memory_usage_bytes:sum expr: sum by(pod_name) (container_memory_usage_bytes{container_name!="POD",pod_name!=""}) | OK | | 16.737s ago | 17.95ms |
| record: pod_name:container_spec_cpu_shares:sum expr: sum by(pod_name) (container_spec_cpu_shares{container_name!="POD",pod_name!=""}) | OK | | 16.719s ago | 14.89ms |
| record: pod_name:container_cpu_usage:sum expr: sum by(pod_name) (rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m])) | OK | | 16.704s ago | 19.75ms |
| record: pod_name:container_fs_usage_bytes:sum expr: sum by(pod_name) (container_fs_usage_bytes{container_name!="POD",pod_name!=""}) | | | | |
If the rules were changed to use only the kubelet output, a sum()
would not be necessary. This would require setting {service="kubelet", container_name!=""}.
What did you see instead? Under which circumstances? : In addition to node-exporter (I think?) exporting data under these record names, the kubelet also reports data under these names, albeit with different labels. The kubelet reports the exact sum of all containers in the Pod, so the above rules report a value that is exactly double the actual one.
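Applied to the first recording rule quoted above, the suggested kubelet-only selector might look like this; whether `service="kubelet"` is the right selector depends on how the kubelet target is labelled in each setup, so treat it as an assumption:

```yaml
# Sketch: restrict the rule to kubelet/cAdvisor series and drop the pause
# container, so the node-exporter duplicates no longer double the result.
- record: pod_name:container_memory_usage_bytes:sum
  expr: |
    sum by (pod_name) (
      container_memory_usage_bytes{service="kubelet",container_name!="POD",container_name!="",pod_name!=""}
    )
```

Whether the sum() can be dropped entirely depends on whether the pod-level cgroup series from the kubelet is used instead of summing the per-container series.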
Environment
Prometheus Operator version:
Image ID: docker-pullable://quay.io/coreos/prometheus-operator@sha256:faa9f8a9045092b9fe311016eb3888e2c2c824eb2b4029400f188a765b97648a
Kubernetes version information:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.4", GitCommit:"f49fa022dbe63faafd0da106ef7e05a29721d3f1", GitTreeState:"clean", BuildDate:"2018-12-14T07:10:00Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:08:34Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes cluster kind:
kubeadm on bare metal
Manifests:
not relevant
We are seeing this alert fire and resolve multiple times per day:
I believe this is partially because we're creating a bunch of short-running jobs/pods, which creates a bunch of new series in Prometheus, but I also think this alert might be a little too sensitive in trying to predict 4 days of growth from 1 hour of data (especially using a simple linear model).
If I base the prediction on the last 24h, we wouldn't get any alerts:
I'm not sure how the alert would behave in the first 24 hours with the above example.
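For comparison, a sketch of what the alert could look like with a 24h lookback combined with a usage floor, based on the kubelet_volume_stats metrics; the 85% usage threshold (i.e. less than 15% available) and durations are illustrative:

```yaml
# Sketch: only predict four days ahead from a 24h window, and only once the
# volume is already more than 85% full, so fixed volumes resolve immediately.
- alert: KubePersistentVolumeFullInFourDays
  expr: |
    (
      kubelet_volume_stats_available_bytes{job="kubelet"}
        / kubelet_volume_stats_capacity_bytes{job="kubelet"}
    ) < 0.15
    and
    predict_linear(kubelet_volume_stats_available_bytes{job="kubelet"}[24h], 4 * 24 * 3600) < 0
  for: 5m
  labels:
    severity: critical
```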
E.g. dashboards like "K8s / Compute Resources / Cluster" use the query sum(irate(container_cpu_usage_seconds_total[1m])) by (namespace).
That creates a hidden dependency: the scrape job can't run less frequently than every 1m.
I think the range should be configurable.
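One way this could be made configurable, sketched in jsonnet; the `grafanaIntervalVar` field name is made up for illustration and is not part of the mixin's actual `_config`:

```jsonnet
// Hypothetical sketch: thread the rate window through _config instead of
// hard-coding "1m" in each dashboard query.
{
  _config+:: {
    grafanaIntervalVar: '4m',  // should be at least ~4x the scrape interval
  },
  query::
    'sum(irate(container_cpu_usage_seconds_total[%(grafanaIntervalVar)s])) by (namespace)'
    % $._config,
}
```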
Hi,
since the CPUThrottlingHigh
alert got added, it has been firing in my cluster for a lot of pods. As most of the affected pods are not even at their assigned CPU limit, I assume the expression for the alert is wrong (either a miscalculation or, what seems more likely, container_cpu_cfs_throttled_periods_total
includes different types of throttling).
This needs further investigation to be sure where this comes from, but as it stands the alert is not useful. (With about 250 pods running I observe >100 alerts at a 25% threshold and ~20 alerts at a 50% threshold.)
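To see where the numbers come from, this is, as far as I understand it, the throttled-period ratio the alert evaluates; it can be run ad hoc to inspect per-container throttling independently of the alert threshold:

```promql
# Fraction of CFS periods in which each container was throttled over 5m.
sum by (namespace, pod_name, container_name) (
  increase(container_cpu_cfs_throttled_periods_total{container_name!=""}[5m])
)
/
sum by (namespace, pod_name, container_name) (
  increase(container_cpu_cfs_periods_total[5m])
)
```

Note that a container can show a high throttled-period ratio while its average usage sits well below its limit, because throttling is enforced per CFS period (100ms by default), which may explain the surprising alerts.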
This is a followup to prometheus-operator/prometheus-operator#485.
Persistent volumes can now be monitored with the new metrics mentioned by @gabreal; a dashboard for these persistent volumes would be great as one of the default out-of-the-box dashboards.
This dashboard does pretty much the right thing, and is a good stop-gap measure: https://grafana.com/dashboards/6739.