
Comments (23)

brancz commented on August 27, 2024

There were actually some changes with regard to logging in the latest release; the glog library was not properly configured. Can you upgrade to the latest v0.4.1?

matthiasr commented on August 27, 2024

We came up with those independently of #200 and didn't backport them. I'm working on it; it's an easy change.

andrewhowdencom commented on August 27, 2024

Depending on your hosts, you could use something like strace to introspect that process from the host NS (I'm like, 99% sure all processes are visible from the host NS)

feelobot commented on August 27, 2024

You could also use csysdig, but I agree there should be logs.

brancz commented on August 27, 2024

kube-state-metrics OOMing is likely due to the size of your cluster. It builds an in-memory cache of all objects in Kubernetes, so the larger your Kubernetes cluster, the larger your kube-state-metrics memory limit should be. Unfortunately we have not benchmarked this extensively enough to come up with a formula for how much memory a given number of objects needs, so I recommend running it on a large node without a limit, observing the memory usage, and then setting the limit with a reasonable margin.
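For example, one way to do that observation (assuming metrics-server or heapster is installed; namespace and labels here are illustrative):

$ kubectl -n monitoring top pod -l app=kube-state-metrics

Alternatively, since Prometheus is typically already scraping cAdvisor in these setups, watch container_memory_usage_bytes for the pod over a day or two and set the limit comfortably above the peak.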

I'm not sure I understand what kind of logs you are expecting when an application OOMs and is killed by the supervisor.

Which version are you using? Because there should be at least some logs.

Resisty commented on August 27, 2024

Hi @brancz,

We're running image: gcr.io/google_containers/kube-state-metrics:v0.3.0. As for "what kind of logs you are expecting": literally anything.

andyxning commented on August 27, 2024

@brancz

It builds an in-memory cache of all objects in Kubernetes, so the larger your Kubernetes cluster, the larger your kube-state-metrics memory limit should be.

Does this mainly refer to the client-go cache?

brancz commented on August 27, 2024

@andyxning yes, the informers/informer-framework more specifically.
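For anyone following along, the pattern is roughly the standard client-go informer setup. A sketch with current client-go (illustrative, not the exact kube-state-metrics code):

package main

import (
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config; every object returned by the list/watch ends up in
	// the informer's in-memory store, which is why memory grows with the
	// number of objects in the cluster.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(clientset, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, podInformer.HasSynced)

	// The full pod list is now held in memory by the informer's store.
	fmt.Printf("pods cached in memory: %d\n", len(podInformer.GetStore().List()))
}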

brancz commented on August 27, 2024

Does that solve your issue, @Resisty?

gianrubio commented on August 27, 2024

Same issue here, v0.4.1

My deployment file comes from prometheus-operator

$ dmesg
....
[1164660.539073] Memory cgroup out of memory: Kill process 3957 (kube-state-metr) score 2256 or sacrifice child
[1164660.545571] Killed process 3957 (kube-state-metr) total-vm:78820kB, anon-rss:49484kB, file-rss:17916kB, shmem-rss:0kB
[1164961.936488] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[1164968.003317] kube-state-metr invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=999
[1164968.010199] kube-state-metr cpuset=c7524ef49ac3935401240a79c28b0a6df741d15eedb1d318ef185c283901d842 mems_allowed=0
[1164968.016745] CPU: 3 PID: 9143 Comm: kube-state-metr Not tainted 4.9.9-coreos-r1 #1
[1164968.019067] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[1164968.019067]  ffffb97e440c3c50 ffffffffb431a933 ffffb97e440c3d78 ffff9615c1ec8000
[1164968.019067]  ffffb97e440c3cc8 ffffffffb420209e 0000000000000000 00000000000003e7
[1164968.019067]  ffff96154b4f6800 ffff96154b4f6800 0000000000000000 0000000000000001
[1164968.019067] Call Trace:
[1164968.019067]  [<ffffffffb431a933>] dump_stack+0x63/0x90
[1164968.019067]  [<ffffffffb420209e>] dump_header+0x7d/0x203
[1164968.019067]  [<ffffffffb418474c>] oom_kill_process+0x21c/0x3f0
[1164968.019067]  [<ffffffffb4184c1d>] out_of_memory+0x11d/0x4b0
[1164968.019067]  [<ffffffffb41f685b>] mem_cgroup_out_of_memory+0x4b/0x80
[1164968.019067]  [<ffffffffb41fc6d9>] mem_cgroup_oom_synchronize+0x2f9/0x320
[1164968.019067]  [<ffffffffb41f7390>] ? high_work_func+0x20/0x20
[1164968.019067]  [<ffffffffb4184fe6>] pagefault_out_of_memory+0x36/0x80
[1164968.019067]  [<ffffffffb40682bc>] mm_fault_error+0x8c/0x190
[1164968.019067]  [<ffffffffb4068b6f>] __do_page_fault+0x44f/0x4b0
[1164968.019067]  [<ffffffffb4068bf2>] do_page_fault+0x22/0x30
[1164968.019067]  [<ffffffffb45cfdb8>] page_fault+0x28/0x30
[1164968.079458] Task in /docker/c7524ef49ac3935401240a79c28b0a6df741d15eedb1d318ef185c283901d842 killed as a result of limit of /docker/c7524ef49ac3935401240a79c28b0a6df741d15eedb1d318ef185c283901d842
[1164968.090099] memory: usage 51200kB, limit 51200kB, failcnt 42
[1164968.093438] memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
[1164968.098076] kmem: usage 720kB, limit 9007199254740988kB, failcnt 0
$ kubectl get pods -n monitoring -o wide -l   app=kube-state-metrics
NAME                                  READY     STATUS             RESTARTS   AGE       IP              NODE
kube-state-metrics-1039847806-tzp7t   0/1       CrashLoopBackOff   15         59m       *    ip-*.eu-west-1.compute.internal
kube-state-metrics-1039847806-v3sn9   0/1       CrashLoopBackOff   15         59m
$ kubectl logs -n monitoring  -f kube-state-metrics-1039847806-tzp7t -p -f
I0410 14:16:05.550772       1 main.go:139] Using default collectors
I0410 14:16:05.551113       1 main.go:186] service account token present: true
I0410 14:16:05.551124       1 main.go:187] service host: https://**:443
I0410 14:16:05.551606       1 main.go:213] Testing communication with server
I0410 14:16:05.740980       1 main.go:218] Communication with server successful
I0410 14:16:05.741145       1 main.go:263] Active collectors: pods,nodes,resourcequotas,replicasets,daemonsets,deployments
I0410 14:16:05.741157       1 main.go:227] Starting metrics server: :8080

How did I fix it?

By changing the memory limit from 50Mi to 100Mi.
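For anyone else hitting this: with the deployment name assumed from the pod names above, the equivalent one-liner would be something like

$ kubectl -n monitoring set resources deployment kube-state-metrics -c kube-state-metrics --limits=memory=100Mi --requests=memory=100Mi

though since the deployment file comes from prometheus-operator's manifests, the change really belongs in that file so it survives a redeploy.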

chlunde commented on August 27, 2024

@brancz I also had to increase the memory to 80 MiB on a 6-node OpenShift cluster. Perhaps the memory settings in the deployment should be bumped?

- --memory=30Mi
- --extra-memory=2Mi

memory: 30Mi
requests:
  cpu: 100m
  memory: 30Mi

which does not match README.md:

Resource usage changes with the size of the cluster. As a general rule, you should allocate

  • 200MiB memory
  • 0.1 cores

For clusters of more than 100 nodes, allocate at least

  • 2MiB memory per node

Also, having the default settings this tight will be an issue when more object types are added without the requirements being updated.
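For example, by the README's own rule a 200-node cluster should get at least 200 × 2MiB = 400MiB, more than ten times the 30Mi default above.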

brancz commented on August 27, 2024

I'm ok with bumping the request and limit. What do you think would be an appropriate start value then?

matthiasr commented on August 27, 2024

The ones we recommend in the README?

brancz commented on August 27, 2024

Yes, I don't recall why we didn't do that in the first place.

smparkes commented on August 27, 2024

We're running on a pretty small cluster (50 pods), and kube-state-metrics is OOMing if I ask it to collect pods; it goes all the way up to 2G. It's fine if I only collect services (trying others). (This is with 0.5.0 and 1.0.1.)

Sort of wonder how in blazes I would even begin to try to trace this ...

smparkes commented on August 27, 2024

Hrm ... quick follow-up: I did dump the goroutine list when the process was at low and high mem. The number of goroutines appears to be growing. But we've also been tracking anomalous latency in the API server (trying to get GOOG to look at that since we don't run it (GKE).) Maybe overlapping requests because of delays?

brancz commented on August 27, 2024

We don't today, but it's probably time to add pprof endpoints so we can do proper profiling to see what's happening.
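For anyone who wants to try this before it lands, a sketch of what such endpoints could look like, using the standard library net/http/pprof handlers (the port and mux here are illustrative, not the actual kube-state-metrics wiring):

package main

import (
	"log"
	"net/http"
	"net/http/pprof"
)

func main() {
	mux := http.NewServeMux()
	// The existing /metrics handler would be registered on this mux as well.

	// Standard library pprof handlers for heap, goroutine, CPU profiles, etc.
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

	log.Fatal(http.ListenAndServe(":8080", mux))
}

With that in place, something like go tool pprof http://<pod-ip>:8080/debug/pprof/heap then gives a heap profile of the running process.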

andyxning commented on August 27, 2024

kube-state-metrics is OOMing if I ask it to collect pods; it goes all the way up to 2G. It's fine if I only collect services (trying others). (This is with 0.5.0 and 1.0.1.)

@smparkes you mean that

  • with only collecting 50 pods, the memory usage will grow to 2G?
  • you have tried with both 0.5.0 and 1.0.1?
  • what is the number of service objects?
  • what is your scrape interval configured in Prometheus?

Maybe overlapping requests because of delays?

This problem could be related to client-go, since kube-state-metrics itself stores nothing in memory.

I did dump the goroutine list when the process was at low and high mem. The number of goroutines appears to be growing.

Did you dump the goroutine list with pprof?
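(If pprof handlers like the ones sketched above are mounted on the metrics port, a full dump can be pulled with

$ curl 'http://localhost:8080/debug/pprof/goroutine?debug=2'

where debug=2 prints the complete stack for every goroutine.)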

andyxning commented on August 27, 2024

Agreed with @brancz, we need to add pprof for debugging.

smparkes commented on August 27, 2024

We resolved the issue.
50 live pods. Thousands of errored job pods. (We're still not quite sure how those all accumulated / didn't get gc'd.)
So the delay of the API might not have been the root cause (though I still wonder whether the service does concurrent requests if some requests take longer than the polling interval ... not clear if the service should be hardened against that.)
It would be a "nice to have" to log info on the progress of collection to help debug things like this, but the root cause was us ... and us not having monitoring on this, which is ironic :-)
Thanks guys!
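(For reference, that kind of build-up is exactly the sort of thing kube-state-metrics itself can alert on once it is running, e.g. with a Prometheus expression like sum(kube_pod_status_phase{phase="Failed"}) > 100, or checked ad hoc with

$ kubectl get pods --all-namespaces --field-selector=status.phase=Failed | wc -l

assuming a kubectl and apiserver recent enough to support field selectors on pod status.)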

andyxning commented on August 27, 2024

@brancz Does Prometheus scrape synchronously, i.e., will it wait for the previous scrape to finish before starting the next one?

@caesarxuchao Does client-go sync with the apiserver synchronously, i.e., will it wait for the previous sync to finish before starting the next one?

andyxning commented on August 27, 2024

@smparkes Actually the scrape logic is very simple and it works synchronously. IMO, adding a log line about how many resource objects are analyzed would to some degree make debugging easier.

I also have make a PR about adding pprof to

andyxning commented on August 27, 2024

@smparkes Added a PR about logging the number of collected resource objects. Ref #254.
