dotdc / grafana-dashboards-kubernetes

A set of modern Grafana dashboards for Kubernetes.

License: Apache License 2.0

dashboard dashboards grafana grafana-dashboard grafana-dashboards grafana-prometheus hacktoberfest kube-state-metrics kubernetes kubernetes-monitoring monitoring monitoring-dashboard node-exporter o11y observability prometheus prometheus-metrics prometheus-node-exporter

grafana-dashboards-kubernetes's Introduction

Hi there! 👋

My name is David, I'm a Site Reliability Engineer working remotely from the south of France.
I'm currently focused on Observability, Reliability and Security aspects of Kubernetes clusters running on several cloud platforms.

About me:

โค๏ธ Open Source
โœ๏ธ Blogger
๐Ÿ’ป Running Arch Linux
๐ŸŽฏ Always learning something new
๐Ÿ’พ Enjoy old adventure games
๐Ÿ‘ถ Dad
๐Ÿง—โ€โ™‚๏ธ Rock climber
๐Ÿบ IPA & Wine
๐Ÿค“ Fun fact: started typing before 2yo!

Where to find me:

Mastodon, Twitter, Medium, LinkedIn, GitHub

grafana-dashboards-kubernetes's People

Contributors

0xbilko, alexintech, chewie, clementnuss, cmergenthaler, danic-git, dotdc, elmariofredo, fcecagno, felipewnp, ffppmm, geekofalltrades, hoangphuocbk, jcpunk, jkroepke, k1rk, kongfei605, marcofranssen, miracle2k, prasadkris, rcattin, reefland, superq, tlemarchand, uhthomas, vladimir-babichev, william-lp


grafana-dashboards-kubernetes's Issues

[bug] "FS - Device Errors" query in Nodes dashboard is not scoped

Describe the bug

In k8s-views-nodes.json, the "FS - Device Errors" query is sum(node_filesystem_device_error) by (mountpoint), which aggregates data from the entire datasource.

How to reproduce?

No response

Expected behavior

{instance="$instance"} should be added to the query.
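A minimal sketch of the scoped query, combining the existing expression with the suggested $instance selector:

sum(node_filesystem_device_error{instance="$instance"}) by (mountpoint)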

Additional context

No response

Some metrics are missing.

Beautiful dashboards. Some of the panels show no data, and I've seen this before (Kubernetes LENS). Reviewing the JSON queries, they reference attributes or keys that are not included in the cAdvisor metrics that I have. For example, your Global dashboard:

grafana_missing_metrics

When I look at the CPU Utilization by namespace panel and inspect the JSON query, it is based on container_cpu_usage_seconds_total. When I look in my Prometheus, the series do not have an image= label; here is a random one that was at the top of the query:

container_cpu_usage_seconds_total{cpu="total", endpoint="https-metrics", id="/kubepods/besteffort/pod03202a32-75a1-4a64-8692-1e73fd26eca3", instance="192.168.10.217:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="democratic-csi", node="k3s03", pod="democratic-csi-nfs-node-sqxp9", service="kube-prometheus-stack-kubelet"}

I'm using K3s based on Kubernetes 1.23 on bare metal with containerd, no Docker runtime. I have no idea if this is a containerd, kubelet, or cAdvisor issue, or just expected as part of life when you don't use the Docker runtime.

If you have any suggestions, they would be much appreciated.

deployment view

Describe the enhancement you'd like

Currently there are views for:

  • global
  • namespace
  • nodes
  • pods

It would be nice to have a view that would show the status of the deployments (number of replicas, ...).

Additional context

No response

https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/trivy

Describe the bug

I am using the dashboard (grafana-dashboards-kubernetes/dashboards/trivy), but I am not getting any values for 'CVE vulnerabilities in All namespace(s)' and 'Other vulnerabilities in All namespace(s)'. I have enabled OPERATOR_METRICS_VULN_ID_ENABLED=true in my Trivy deployment, and I am using the latest versions of the Trivy operator and Prometheus. Could you please help?

How to reproduce?

1. Install the latest trivy-operator and try to use the Grafana dashboard.

Expected behavior

The dashboard should show CVE values.

Additional context

No response

[bug] created_by variable is not refreshed on Time Range Change

Describe the bug

Hi,
on the "Kubernetes / Views / Namespaces" dashboard there is a "created_by" variable that is filled ONLY when the dashboard loads. If I change the time range to yesterday, the pods created then are not shown. The only thing that needs to change is the variable's refresh property from 1 to 2:

        "refresh": 1, // Bug
        "refresh": 2, // Correct Value

Regards Philipp

How to reproduce?

Always

Expected behavior

created_by should be "refilled" on every Time Range Change

Additional context

No response

[bug] broken panels on k8s-views-nodes in specific cases

Describe the bug

The k8s-views-nodes.json dashboard will have many broken panels in specific Kubernetes setups.
This is currently the case on OKE.

Apparently, this happens when the node label from kube_node_info doesn't match the nodename label from node_uname_info.

Here are some extracted metrics from a broken setup where the labels differ.

TL;DR: node="k8s-wrk-002" and nodename="kind-kube-prometheus-stack-worker2".

kube_node_info:

{
    __name__="kube_node_info",
    container="kube-state-metrics",
    container_runtime_version="containerd://1.6.19-46-g941215f49",
    endpoint="http", 
    instance="10.27.3.148:8080", 
    internal_ip="172.18.0.2", 
    job="kube-state-metrics", 
    kernel_version="6.2.12-arch1-1", 
    kubelet_version="v1.26.3", 
    kubeproxy_version="v1.26.3", 
    namespace="monitoring",
    node="k8s-wrk-002",
    os_image="Ubuntu 22.04.2 LTS",
    pod="kube-prometheus-stack-kube-state-metrics-6df68756d8-zvd58",
    pod_cidr="10.27.1.0/24",
    provider_id="kind://docker/kind-kube-prometheus-stack/kind-kube-prometheus-stack-worker2", 
    service="kube-prometheus-stack-kube-state-metrics", 
    system_uuid="8422f117-6154-45bd-97c0-e3dec80a3f60"
}

node_uname_info:

{
    __name__="node_uname_info", 
    container="node-exporter", 
    domainname="(none)", 
    endpoint="http-metrics", 
    instance="172.18.0.2:9100", 
    job="node-exporter", 
    machine="x86_64", 
    namespace="monitoring", 
    nodename="kind-kube-prometheus-stack-worker2", 
    pod="kube-prometheus-stack-prometheus-node-exporter-qvn22", 
    release="6.2.12-arch1-1", 
    service="kube-prometheus-stack-prometheus-node-exporter", 
    sysname="Linux", 
    version="#1 SMP PREEMPT_DYNAMIC Thu, 20 Apr 2023 16:11:55 +0000"
}

This issue will continue the discussion started in #41

@fcecagno @Chewie

How to reproduce?

You can use https://github.com/dotdc/kind-lab, which will create a kind cluster with renamed nodes.

# Create the kind cluster
./start.sh

# Export configuration
export KUBECONFIG="$(pwd)/kind-kubeconfig.yml"

# Expose Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80

Open http://localhost:3000

login: admin
password: prom-operator

Open broken dashboard:

http://localhost:3000/d/k8s_views_nodes/kubernetes-views-nodes?orgId=1&refresh=30s

Expected behavior

The dashboard should work with a relabel_configs like the one suggested by @Chewie.
The solution should be described in https://github.com/dotdc/grafana-dashboards-kubernetes#known-issues

Additional context

No response

[bug] Global Network Utilization

Describe the bug

On my simple test cluster, I have no issues with the Global Network Utilization panel, but on my production cluster, which does cluster and host networking, the numbers are crazy:

image

No way I have sustained rates like that. I think this is related to the metric:

sum(rate(container_network_receive_bytes_total[$__rate_interval]))

If I look at rate(container_network_receive_bytes_total[30s]), I get:

{id="/", interface="cni0", job="kubernetes-cadvisor"} | 2041725438.15131
{id="/", interface="enp1s0", job="kubernetes-cadvisor"} | 4821605692.45648
{id="/", interface="flannel.1", job="kubernetes-cadvisor"} | 337125370.2678834

I'm not sure what to actually look at here. I tried sum(rate(node_network_receive_bytes_total[$__rate_interval])) and I get a reasonable traffic graph:

image

This is 5 nodes, pretty much at idle. Showing I/O by instance:

image

Here is BTOP+ on k3s01 after running for a bit; it lines up very well with the data above:
image
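One way to avoid counting the same traffic on both the overlay and the physical interfaces could be to exclude the overlay devices from the sum. A rough sketch, assuming cni0 and flannel.1 are the overlay interfaces in this setup:

sum(rate(container_network_receive_bytes_total{interface!~"cni0|flannel.*|lo"}[$__rate_interval]))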

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

[bug] should use last non-null value, rather than mean

Describe the bug

3/4 of these gauges use the mean rather than the last non-null value. This can cause strangeness like incorrect reporting of current CPU requests and limits. They should also be consistent.

Current:

image

Last *:

image

How to reproduce?

  1. Observe the global view
  2. Change some cpu requests and limits
  3. Observe incorrect reporting of cpu requests and limits

Expected behavior

Should probably use "Last *" rather than "Mean" for calculating the value.

Additional context

No response

[bug] node dashboard only shows latest instance

Describe the bug

Some panels are using node to filter, and others are using a hidden instance variable ( label_values(node_uname_info{nodename=~"(?i:($node))"}, instance)). If a node changes its IP, then some panels will look normal and others will be missing data.

image

How to reproduce?

  1. Collect node metrics.
  2. Change IP of node.
  3. Observe the node dashboard has some unaffected panels, and others which only show the latest 'instance'.

Expected behavior

It should probably show all instances of a node.

Additional context

No response

[bug] Trivy Dashboard Templating Failed to upgrade legacy queries Datasource prometheus was not found

Describe the bug

The trivy dashboard breaks since this commit 4b52d9c on our clusters.

How to reproduce?

No response

Expected behavior

The dashboard continues to work like on the commit before. Other dashboards don't seem to have this issue.

Additional context

Is there any chance it is because of the missing cluster label on trivy metrics? Should we configure a specific setting to include this cluster label on the trivy operator?

[bug] default resolution is too low

Describe the bug

The default resolution of 30s is too low and renders some dashboards with "No Data". This is likely because I'm using Grafana Mimir, as opposed to a standard Prometheus install.

image

How to reproduce?

  1. Collect metrics with Grafana Mimir.
  2. Load the dashboard.

Expected behavior

Changing the resolution from 30s to 1m shows the data as expected.

image

Additional context

No response

[bug] exclude iowait, steal, idle from CPU usage

Describe the bug

Based on

The CPU modes idle, iowait, steal should be excluded from the CPU utilization.
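A minimal sketch of what a per-instance utilization expression could look like with those modes excluded (the exact panel expression and labels used in the dashboards may differ):

sum(rate(node_cpu_seconds_total{mode!~"idle|iowait|steal"}[$__rate_interval])) by (instance) / count(node_cpu_seconds_total{mode="idle"}) by (instance)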

How to reproduce?

No response

Expected behavior

No response

Additional context

Per the iostat man page:

%idle
Show the percentage of time that the CPU or CPUs were idle and the
system did not have an outstanding disk I/O request.

%iowait
Show the percentage of time that the CPU or CPUs were idle during
which the system had an outstanding disk I/O request.

%steal
Show the percentage of time spent in involuntary wait by the
virtual CPU or CPUs while the hypervisor was servicing another
virtual processor.

Running pods panel in Global dashboard

Currently, the "Running Pods" panel uses the expression sum(kube_pod_container_info), which sums containers rather than pods. I believe the metric kube_pod_info would be the best fit for this panel.

Should be updated here:

"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"expr": "sum(kube_pod_container_info)",
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"title": "Running Pods",
"type": "stat"

P.S. Thank you for the dashboards, they look awesome!
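Following up on the expression above: if the intent is to count only pods that are actually running, a hedged alternative (assuming kube-state-metrics' per-phase metric kube_pod_status_phase is available) would be:

sum(kube_pod_status_phase{phase="Running"})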

Publish tag to make update automation possible

Describe the enhancement you'd like

As a Renovate user (but this applies to all similar tools), I would like to leverage our system to automatically upgrade our dashboards.

Currently, we have no way to be automatically notified about an update or a change in this project. A solution based on git tags could do the job perfectly.

Tags don't have to be semantic or logical; a simple tag every month would be a perfectly valid solution.

Additional context

… nothing specific. Let me know if you have any questions.

PS: Your dashboards are really amazing, thank you for this work!

[bug] All dashboards with the cluster variable are broken in VictoriaMetrics

Describe the bug

Popup message in grafana when opening dashboards:

Templating
Failed to upgrade legacy queries Datasource prometheus was not found

The previous version was working fine.

How to reproduce?

Install VictoriaMetrics as the Prometheus datasource and try to open the namespace dashboard.

Expected behavior

Dashboards work correctly.

Additional context

No response

Total pod RAM request usage & Total pod RAM limit usage gauges are showing wrong values

Describe the bug

First of all, I want to thank you for your effort in creating these amazing Grafana dashboards for K8s. I have deployed the Prometheus Helm chart stack and passed the dashboard provider values to values.yaml, and everything went smoothly except for one issue I am facing in /kubernetes/view/pods: the total pod RAM request usage and total pod RAM limit usage gauges are showing wrong values, as you can see in the screenshots below. I wonder if someone can help me fix it.

image

image

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

[bug] some panels not displaying correctly on white background

Describe the bug

Hi,

I'm using these dashboards with Grafana's light theme (easier on my eyes), and some panels are not displaying properly. For example:

image

This can be fixed by setting the panel's color mode to None instead of Value:

Screenshot 2022-09-30 at 07 13 22

How to reproduce?

  1. turn on light mode for Grafana.
  2. check the Kubernetes / Views / Global panel

Expected behavior

The text/values should be readable even with the light theme.

Additional context

No response

View Pods Dashboard Feature Requests / Issues

RAM Usage Request Gauge
My understanding of requests is that actual usage should closely match them. Being at 90% of the request is not a bad condition; it's a good one. I think GREEN should be +/- 20% of the request value, the next 20% on either side yellow, and the rest red, since being significantly under or over the request is not ideal. As it is now, if you estimate the request perfectly, it shows RED like an error condition, which is not the case. Only the LIMIT gauge should behave like this (as you get OOM killed).

image
I think that is wrong; being stable at 90% of the request should get me a gold star :)

I'm not sure if CPU Request needs that as well. If so, maybe its GREEN range should be wider?


Resource by container
Could you add the actual usage for CPU and Memory between the Request/Limit values for each? That would help show where actual usage falls between the two.
image


I think "CPU Usage by container" and "Memory Usage by container" should be renamed to "by pod": if you select a pod with multiple containers, you do not get a graph with multiple plot lines, which you would expect if it were truly by container.


NOTE: I played with adding resource requests and limits as plot lines for CPU Usage by Container and Memory Usage by Container, and it looks good for pods with a single container. But once I selected a pod with multiple containers, and thus multiple requests/limits, it became a confusing mess. I don't have the Grafana skills to isolate them properly, but maybe you have some ideas to make that work right.

Question: How should I export dashboard JSON?

Hi,

I'm preparing #79 and I'm having some trouble exporting the JSON file from a Grafana instance.

If I import a dashboard and export it again without any modifications, I get a lot of changes:

For example, this commit does not contain any functional changes, yet shows a lot of changes at the JSON level: jkroepke@706315b

That's how I export the JSON:

image

What is the recommended way? If the mentioned approach is the correct one, would it be possible to import and export all dashboards to keep my PR as clean as possible? Otherwise, I end up with tons of unrelated changes.

[bug] Failed to display node metrics

Describe the bug

This is the way variables are configured on k8s-views-nodes.json:

...
node = label_values(kube_node_info, node)
instance = label_values(node_uname_info{nodename=~"(?i:($node))"}, instance)

In OKE, kube_node_info looks like this:

{__name__="kube_node_info", container="kube-state-metrics", container_runtime_version="cri-o://1.25.1-111.el7", endpoint="http", instance="10.244.0.40:8080", internal_ip="10.0.107.39", job="kube-state-metrics", kernel_version="5.4.17-2136.314.6.2.el7uek.x86_64", kubelet_version="v1.25.4", kubeproxy_version="v1.25.4", namespace="monitoring", node="10.0.107.39", os_image="Oracle Linux Server 7.9", pod="monitoring-kube-state-metrics-6fcd4d745c-txg2k", pod_cidr="10.244.1.0/25", provider_id="ocid1.instance.oc1.sa-saopaulo-1.xxx", service="monitoring-kube-state-metrics", system_uuid="d6462364-95bf-4122-a3ab-xxx"}

And node_uname_info looks like this:

node_uname_info{container="node-exporter", domainname="(none)", endpoint="http-metrics", instance="10.0.107.39:9100", job="node-exporter", machine="x86_64", namespace="monitoring", nodename="oke-cq2bxmvtqca-nsdfwre7l3a-seqv6owhq3a-0", pod="monitoring-prometheus-node-exporter-n6pzv", release="5.4.17-2136.314.6.2.el7uek.x86_64", service="monitoring-prometheus-node-exporter", sysname="Linux", version="#2 SMP Fri Dec 9 17:35:27 PST 2022"}

For this example, node=10.0.107.39, but when I query node_uname_info{nodename=~"(?i:($node))"}, it doesn't return anything, because nodename doesn't match the internal IP address of the node.
As a result, no node metrics are displayed.

How to reproduce?

No response

Expected behavior

No response

Additional context

Modifying the filter https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-views-nodes.json#L3747-L3772 to use node_uname_info{instance="$node:9100"} fixes the issue.
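A sketch of what the adjusted instance variable could look like with that approach, assuming node-exporter listens on port 9100 as in the example above:

instance = label_values(node_uname_info{instance="$node:9100"}, instance)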

[bug] CPU dashboard can report negative values

Describe the bug

image

How to reproduce?

I don't know

Expected behavior

The dashboard should not produce negative CPU usage values.

Additional context

I adjusted some resource limits, which caused some pods to restart.

[bug] Wrong query on the Network - Bandwidth panel

Describe the bug

On the Kubernetes / Views / Pods dashboard, the Network - Bandwidth panel uses the wrong query for Transmitted.

It is

- sum(rate(container_network_receive_bytes_total{namespace="$namespace", pod="$pod"}[$__rate_interval]))

Should be

- sum(rate(container_network_transmit_bytes_total{namespace="$namespace", pod="$pod"}[$__rate_interval]))

How to reproduce?

No response

Expected behavior

No response

Additional context

https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-views-pods.json#L1417

[enhancement] Windows support

Describe the enhancement you'd like

I have some clusters with Windows nodes enabled. I would like to ask if I can add Windows support, or do you think it is out of scope here?

Unlike kubernetes-mixin, which has separate dashboards, I would like to add the Windows queries into the existing ones. That's possible by using queries with OR, e.g.:

sum(container_memory_working_set_bytes{cluster="$cluster",namespace=~"$namespace", image!="", pod=~"${created_by}.*"}) by (pod)
OR
<WINDOWS Query>

Additional context

Since I'm running multi-OS hybrid clusters, I would like to submit PRs for Windows pods here. I'm not expecting the maintainers to provide support for Windows. Before starting to work on this, I would like to know whether it would be accepted.

[bug] Trivy Operator Dashboard: The Prometheus data source variable is not used everywhere

Describe the bug

There are panels in the Trivy Operator dashboard which do not properly use the Prometheus data source variable.

How to reproduce?

  1. Import the dashboard
  2. Change between Prometheus data sources in the global variable filter
  3. See that the "Vulnerability count per image and severity in $namespace namespace" panel does not pick up the Prometheus data source correctly

Expected behavior

The global Prometheus data source variable should be applied to all panels.

Additional context

Here are the places I spotted where the Prometheus data source variable is not used:

https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-addons-trivy-operator.json#L785
https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-addons-trivy-operator.json#L882

[bug] node dashboard shows no values

Describe the bug

We've deployed kube-prometheus-stack via flux:

flux get helmreleases -n monitoring 
NAME                    REVISION        SUSPENDED       READY   MESSAGE                                                                                                        
kube-prometheus-stack   58.2.2          False           True    Helm upgrade succeeded for release monitoring/kube-prometheus-stack.v6 with chart [email protected]
loki-stack              2.10.2          False           True    Helm install succeeded for release monitoring/loki-stack.v1 with chart [email protected]                   

The Grafana dashboards have been installed with Helm values as described. However, we're not able to see any metrics for the node dashboard despite changing the Helm values:

# File: kube-prometheus-stack-values.yaml
prometheus-node-exporter:
  prometheus:
    monitor:
      relabelings:
      - action: replace
        sourceLabels: [__meta_kubernetes_pod_node_name]
        targetLabel: nodename

How to reproduce?

No response

Expected behavior

No response

Additional context

kubectl get po -n monitoring 
NAME                                                       READY   STATUS    RESTARTS   AGE
kube-prometheus-stack-grafana-75c985bc44-5g7sm             3/3     Running   0          7m34s
kube-prometheus-stack-kube-state-metrics-c4dbc548d-l5tcl   1/1     Running   0          17m
kube-prometheus-stack-operator-7846887766-98vvj            1/1     Running   0          17m
kube-prometheus-stack-prometheus-node-exporter-5x97x       1/1     Running   0          17m
kube-prometheus-stack-prometheus-node-exporter-97dbf       1/1     Running   0          17m
kube-prometheus-stack-prometheus-node-exporter-hz4zf       1/1     Running   0          17m
loki-stack-0                                               1/1     Running   0          17m
loki-stack-promtail-bc95r                                  1/1     Running   0          17m
loki-stack-promtail-fpnh9                                  1/1     Running   0          17m
loki-stack-promtail-z64hg                                  1/1     Running   0          17m
prometheus-kube-prometheus-stack-prometheus-0              2/2     Running   0          17m

[enhancement] Add support for monitoring node runtime & system resource usage

Describe the enhancement you'd like

I'd like the nodes dashboard to show the runtime and system resource usage, as exported by the kubelet.

Additional context

This requires that the cAdvisor metrics for cgroup slices aren't being dropped. For this to work with kube-prometheus-stack, the kubelet ServiceMonitor's cAdvisorMetricRelabelings value needs to be overridden to keep the required series.

[bug] Fix node_* metrics on k8s-views-global.json

Describe the bug

Currently, there is no job label selector in k8s-views-global.json.

History:

  • A hardcoded job label was set to node-exporter in #36 for node_* metrics
  • It was later removed in #49 to work with Grafana Agent

Adding a job variable for node_* metrics should fix the issue.
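A minimal sketch of the idea, with an illustrative variable query and one node_* expression (the actual implementation lives in #110 and may differ):

job = label_values(node_uname_info, job)
sum(rate(node_cpu_seconds_total{job=~"$job", mode!="idle"}[$__rate_interval]))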

@uhthomas @tlemarchand Can you both try the version in #110 to make sure it works on your side?

Metrics missing in K8s Environment

I've opened a new issue because this one is not in a k3s environment but in k8s.

I see some metrics missing, probably because my installation could be incomplete.
I've deployed the k8s cluster with two master and three worker nodes. Grafana and Prometheus are deployed with "almost" the default settings.

i5Js@nanoserver:~/K3s/K8s/grafana/grafana-dashboards-kubernetes/dashboards$ k get svc -n grafana
NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
grafana   ClusterIP   <ip>   <none>        80/TCP    18h
i5Js@nanoserver:~/K3s/K8s/grafana/grafana-dashboards-kubernetes/dashboards$ k get svc -n prometheus
NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
prometheus-alertmanager         ClusterIP   <ip>     <none>        80/TCP     21h
prometheus-kube-state-metrics   ClusterIP   <ip>     <none>        8080/TCP   21h
prometheus-node-exporter        ClusterIP   <ip>     <none>        9100/TCP   21h
prometheus-pushgateway          ClusterIP   <ip>    <none>        9091/TCP   21h
prometheus-server               ClusterIP   <ip>    <none>        80/TCP     21h

I've created the datasource using the prometheus-server IP, and some of the metrics work and some don't:

Screenshot 2022-07-02 at 10 08 38

Screenshot 2022-07-02 at 10 10 16

I'm completely sure those issues are caused by my environment, since I can see that your dashboards work fine, but can you help me troubleshoot?

Thanks,

Issues with node_cpu_seconds_total

I tested the latest changes, and it's still not right...

The "CPU Utilization by Node" panel with "expr": "avg by (node) (1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval]))" yields:

image

It seems to be the total of all nodes? It is not picking up the individual nodes; it should look like this:
image

The "CPU Utilization by namespace" panel is still dark and still uses the old metric: "expr": "sum(rate(container_cpu_usage_seconds_total{image!=\"\"}[$__rate_interval])) by (namespace)". I tried something like the above, "avg by (namespace) (1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval]))", but that is not right either; I only got one namespace listed:

image

Both Memory Utilization panels, based on container_memory_working_set_bytes, are still dark when I use your unmodified files.

[bug] kubernetes-views-namespaces: created_by_name doesn't work

Describe the bug

The created_by_name variable doesn't work as expected and doesn't offer a drill down:
image

I found out that this is caused by the variable selector container!="" (source link):
image

The kube_pod_info metric doesn't contain the container label (official docs).

When I remove the container!="" selector, the variable starts working again.

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

[enhancement] cluster variable support

Thanks for very nice dashboards.

One thing missing is maybe a "cluster" variable. When you have multiple clusters, it is useful to limit the scope to a single cluster: a multi-select variable accepting All, with queries adding cluster=~"$cluster".
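A rough sketch of what this could look like, assuming the scraped metrics carry a cluster label (the variable query and metric used here are illustrative):

cluster = label_values(kube_node_info, cluster)
sum(kube_pod_info{cluster=~"$cluster"})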

[bug] `kube-prometheus-stack` installation steps broken

This worked for me in the past, but I am building a new k3s cluster and I can't install it with the previous documentation: https://github.com/dotdc/grafana-dashboards-kubernetes#install-with-helm-values.

The error I get is a little specific to me since I am using Terraform:

│ Error: unable to build kubernetes objects from release manifest: unable to decode "": json: cannot unmarshal number into Go struct field ObjectMeta.metadata.labels of type string
│
│   with module.monitoring.helm_release.prometheus-stack,
│   on ../modules/monitoring/main.tf line 2, in resource "helm_release" "prometheus-stack":
│    2: resource "helm_release" "prometheus-stack" {

I tried to read https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml, but it looks like a few things have changed.

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

[bug] suggest lower cardinality variables for the pod dashboard

Describe the bug

When in a cluster with a lot of churn on pods, the high-cardinality pod metrics cause queries to fail due to the large number of series returned. For instance, I doubled the max returned label sets in VictoriaMetrics to 60k and queries still fail when trying to use the pod dashboard:

2024-04-22T18:17:33.527Z	warn	VictoriaMetrics/app/vmselect/main.go:231	error in "/api/v1/series?start=1713806220&end=1713809880&match%5B%5D=%7B__name__%3D%22kube_pod_info%22%7D": cannot fetch time series for "filters=[{__name__=\"kube_pod_info\"}], timeRange=[2024-04-22T17:17:00Z..2024-04-22T18:18:00Z]": cannot find metric names: error when searching for metricIDs in the current indexdb: the number of matching timeseries exceeds 60000; either narrow down the search or increase -search.max* command-line flag values at vmselect; see https://docs.victoriametrics.com/#resource-usage-limits

How to reproduce?

Have a cluster with a lot of pods being created...

Expected behavior

No response

Additional context

I have a fix suggestion that seems to work fine for me. It involves changing the namespace and job queries to not query "all pods" for labels. Like this:

namespace: label_values(kube_namespace_created{cluster="$cluster"},namespace)
job: label_values(kube_pod_info{namespace="$namespace", cluster="$cluster"},job)

[bug] Namespace dashboard shows double resource usage

Describe the bug

The cumulative resource usage in the namespace seems to be 1.25 CPU and 2.5Gi (I changed the two graphs to stack), but it appears as 2.5 CPU and 5Gi respectively.

image

I imagine the queries need the label selector image!="".

How to reproduce?

N/A

Expected behavior

N/A

Additional context

N/A

[bug] CoreDNS Dashboard No Data

Describe the bug

Hi, and thanks for the good set of dashboards, @dotdc!

I'm having some trouble with the CoreDNS dashboard.

Several graphs and statuses don't show any data, displaying the "No Data" placeholder.

I've noticed that the filter for CoreDNS is a job and not a pod.

At least in my EKS cluster, CoreDNS is a daemonset and not a job.

image

Is there something I could do or change?

Thanks =D !
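A quick way to check which job label the CoreDNS metrics actually carry in a given cluster (assuming coredns_dns_requests_total is being scraped; older CoreDNS versions use different metric names):

count by (job) (coredns_dns_requests_total)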

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

[bug] Node metrics names on AWS EKS nodes mismatch

Describe the bug

The metrics for kube_node_info & node_uname_info produce different names for nodes, resulting in the Node dashboard not working.

Eg:

node_uname_info:

  • nodename="ip-10-10-11-100.ec2.internal"

kube_node_info

  • node="ip-10-10-10-110.us-east-2.compute.internal"

Node exporter version: 1.3.1
Kube state metrics version: 2.5.0

I acknowledge this is not a bug in the dashboard itself but rather in the naming standards of the different metric exporters.

However, I just wanted to know if other AWS EKS users are experiencing the same issue before I start manually editing the dashboards in an attempt to get them working.

Thanks

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

[bug] difficult to understand what is missing

Describe the bug

This would have been more appropriate as a "help request", since I don't think I'm dealing with a bug here.

My problem is that several panels in Kubernetes / Views / Pods are showing "no data".

Investigating this, and having read https://github.com/dotdc/grafana-dashboards-kubernetes?tab=readme-ov-file#known-issues, I could not fully understand my issue.

For example, this is one panel's query:

sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$pod", image!="", cluster="$cluster"}[$__rate_interval])) / sum(kube_pod_container_resource_requests{namespace="$namespace", pod=~"$pod", resource="cpu", job=~"$job", cluster="$cluster"})

This gives me "no data", but if I remove the condition image!="", then it starts working:
image

Why do I have that image!="" condition? Should it be "filled" somehow?

How to reproduce?

I am using dashboards installed via the Helm chart dashboard providers, as described in https://github.com/dotdc/grafana-dashboards-kubernetes?tab=readme-ov-file#install-with-helm-values.

I am using Grafana Helm chart version 8.3.2.

Expected behavior

I need to know how/when the panel's query SHOULD be filled, especially cluster="$cluster" and image!="". I can correctly see all my pods/namespaces in the respective dropdown menus, but not "cluster":
image

Additional context

These dashboards are the best I could find for a K8s environment.

Great job!

[bug] incorrect node count with grafana agent

Describe the bug

The metric up{job="node-exporter"} does not exist with Grafana Agent, so the total number of nodes is reported as 0.

image

How to reproduce?

  1. Use Grafana Agent
  2. Load dashboard

Expected behavior

Should show the total number of nodes (5 in this case).
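A hedged alternative that doesn't depend on the up series (assuming kube-state-metrics is available) would be to count nodes from kube_node_info instead:

count(kube_node_info)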

Additional context

No response

[bug] Dashboard kubernetes-views-pods shows unexpected values for memory requests / limits

Describe the bug

First of all: amazing dashboards...Thanks a ton :)

The panel "Resources by container" in the "kubernetes-views-pods" dashboard uses the metrics:
kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", unit="byte"}
kube_pod_container_resource_usage{namespace="$namespace", pod="$pod", unit="byte"}

Unfortunately, this leads to unexpected values, as the "resource" label in these metrics can have the values "memory" and "ephemeral_storage", and the panel counts them together.

How to reproduce?

No response

Expected behavior

The metrics should probably be:
kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", unit="byte", resource="memory"}
kube_pod_container_resource_usage{namespace="$namespace", pod="$pod", unit="byte", resource="memory"}

Additional context

No response
