
Comments (23)

brancz commented on August 27, 2024

There were actually some changes with regard to logging in the latest release; the glog library was not properly configured. Can you upgrade to the latest v0.4.1?

matthiasr commented on August 27, 2024

We came up with those independently of #200 and didn't backport them. I'm working on it; it's an easy change.

andrewhowdencom commented on August 27, 2024

Depending on your hosts, you could use something like strace to introspect that process from the host NS (I'm like, 99% sure all processes are visible from the host NS)

feelobot commented on August 27, 2024

You could also use csysdig, but I agree there should be logs.

brancz commented on August 27, 2024

kube-state-metrics OOMing is likely due to the size of your cluster. It builds an in-memory cache of all objects in Kubernetes, so the larger your Kubernetes cluster, the larger your kube-state-metrics memory limit should be. Unfortunately we have not benchmarked this extensively enough to come up with a formula for how much memory a given number of objects needs, so I recommend running it on a large node without a limit, observing the memory usage, and then setting the limit with a reasonable margin.
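For example, one way to do that observation (assuming metrics-server or heapster is installed; namespace and labels here are illustrative):

$ kubectl -n monitoring top pod -l app=kube-state-metrics

Alternatively, since Prometheus is typically already scraping cAdvisor in these setups, watch container_memory_usage_bytes for the pod over a day or two and set the limit comfortably above the peak.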

I'm not sure I understand what kind of logs you are expecting when an application OOMs and is killed by the supervisor.

Which version are you using? Because there should be at least some logs.

Resisty commented on August 27, 2024

Hi @brancz,

We're running image: gcr.io/google_containers/kube-state-metrics:v0.3.0. As for "what kind of logs you are expecting": literally anything.

andyxning commented on August 27, 2024

@brancz

It builds an in-memory cache of all objects in Kubernetes, so the larger your Kubernetes cluster, the larger your kube-state-metrics memory limit should be.

Does this mainly refer to the client-go cache?

brancz commented on August 27, 2024

@andyxning yes, the informers/informer-framework more specifically.
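For anyone following along, the pattern is roughly the standard client-go informer setup. A sketch with current client-go (illustrative, not the exact kube-state-metrics code):

package main

import (
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config; every object returned by the list/watch ends up in
	// the informer's in-memory store, which is why memory grows with the
	// number of objects in the cluster.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(clientset, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, podInformer.HasSynced)

	// The full pod list is now held in memory by the informer's store.
	fmt.Printf("pods cached in memory: %d\n", len(podInformer.GetStore().List()))
}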

brancz commented on August 27, 2024

Does that solve your issue, @Resisty?

gianrubio commented on August 27, 2024

Same issue here, v0.4.1

My deployment file comes from prometheus-operator

$ dmesg
....
[1164660.539073] Memory cgroup out of memory: Kill process 3957 (kube-state-metr) score 2256 or sacrifice child
[1164660.545571] Killed process 3957 (kube-state-metr) total-vm:78820kB, anon-rss:49484kB, file-rss:17916kB, shmem-rss:0kB
[1164961.936488] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[1164968.003317] kube-state-metr invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=999
[1164968.010199] kube-state-metr cpuset=c7524ef49ac3935401240a79c28b0a6df741d15eedb1d318ef185c283901d842 mems_allowed=0
[1164968.016745] CPU: 3 PID: 9143 Comm: kube-state-metr Not tainted 4.9.9-coreos-r1 #1
[1164968.019067] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[1164968.019067]  ffffb97e440c3c50 ffffffffb431a933 ffffb97e440c3d78 ffff9615c1ec8000
[1164968.019067]  ffffb97e440c3cc8 ffffffffb420209e 0000000000000000 00000000000003e7
[1164968.019067]  ffff96154b4f6800 ffff96154b4f6800 0000000000000000 0000000000000001
[1164968.019067] Call Trace:
[1164968.019067]  [<ffffffffb431a933>] dump_stack+0x63/0x90
[1164968.019067]  [<ffffffffb420209e>] dump_header+0x7d/0x203
[1164968.019067]  [<ffffffffb418474c>] oom_kill_process+0x21c/0x3f0
[1164968.019067]  [<ffffffffb4184c1d>] out_of_memory+0x11d/0x4b0
[1164968.019067]  [<ffffffffb41f685b>] mem_cgroup_out_of_memory+0x4b/0x80
[1164968.019067]  [<ffffffffb41fc6d9>] mem_cgroup_oom_synchronize+0x2f9/0x320
[1164968.019067]  [<ffffffffb41f7390>] ? high_work_func+0x20/0x20
[1164968.019067]  [<ffffffffb4184fe6>] pagefault_out_of_memory+0x36/0x80
[1164968.019067]  [<ffffffffb40682bc>] mm_fault_error+0x8c/0x190
[1164968.019067]  [<ffffffffb4068b6f>] __do_page_fault+0x44f/0x4b0
[1164968.019067]  [<ffffffffb4068bf2>] do_page_fault+0x22/0x30
[1164968.019067]  [<ffffffffb45cfdb8>] page_fault+0x28/0x30
[1164968.079458] Task in /docker/c7524ef49ac3935401240a79c28b0a6df741d15eedb1d318ef185c283901d842 killed as a result of limit of /docker/c7524ef49ac3935401240a79c28b0a6df741d15eedb1d318ef185c283901d842
[1164968.090099] memory: usage 51200kB, limit 51200kB, failcnt 42
[1164968.093438] memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
[1164968.098076] kmem: usage 720kB, limit 9007199254740988kB, failcnt 0
$ kubectl get pods -n monitoring -o wide -l   app=kube-state-metrics
NAME                                  READY     STATUS             RESTARTS   AGE       IP              NODE
kube-state-metrics-1039847806-tzp7t   0/1       CrashLoopBackOff   15         59m       *    ip-*.eu-west-1.compute.internal
kube-state-metrics-1039847806-v3sn9   0/1       CrashLoopBackOff   15         59m
$ kubectl logs -n monitoring  -f kube-state-metrics-1039847806-tzp7t -p -f
I0410 14:16:05.550772       1 main.go:139] Using default collectors
I0410 14:16:05.551113       1 main.go:186] service account token present: true
I0410 14:16:05.551124       1 main.go:187] service host: https://**:443
I0410 14:16:05.551606       1 main.go:213] Testing communication with server
I0410 14:16:05.740980       1 main.go:218] Communication with server successful
I0410 14:16:05.741145       1 main.go:263] Active collectors: pods,nodes,resourcequotas,replicasets,daemonsets,deployments
I0410 14:16:05.741157       1 main.go:227] Starting metrics server: :8080

How did I fix it?

By changing the memory limit from 50Mi to 100Mi.
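For anyone else hitting this: with the deployment name assumed from the pod names above, the equivalent one-liner would be something like

$ kubectl -n monitoring set resources deployment kube-state-metrics -c kube-state-metrics --limits=memory=100Mi --requests=memory=100Mi

though since the deployment file comes from prometheus-operator's manifests, the change really belongs in that file so it survives a redeploy.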

chlunde commented on August 27, 2024

@brancz I also had to increase the memory to 80 MiB on a 6-node OpenShift cluster. Perhaps the memory settings in the deployment should be bumped?

- --memory=30Mi
- --extra-memory=2Mi

memory: 30Mi
requests:
  cpu: 100m
  memory: 30Mi

which does not match README.md:

Resource usage changes with the size of the cluster. As a general rule, you should allocate

  • 200MiB memory
  • 0.1 cores

For clusters of more than 100 nodes, allocate at least

  • 2MiB memory per node

Also, having the default settings this tight will be an issue when more object types are added without the requirements being updated.
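For example, by the README's own rule a 200-node cluster should get at least 200 × 2MiB = 400MiB, more than ten times the 30Mi default above.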

brancz commented on August 27, 2024

I'm ok with bumping the request and limit. What do you think would be an appropriate start value then?

matthiasr commented on August 27, 2024

The ones we recommend in the README?

brancz commented on August 27, 2024

Yes, I don't recall why we didn't do that in the first place.

smparkes commented on August 27, 2024

We're running on a pretty small cluster (50 pods), and kube-state-metrics is OOMing if I ask it to collect pods; it goes all the way up to 2G. It's fine if I only collect services (trying others). (This is with 0.5.0 and 1.0.1.)

Sort of wonder how in blazes I would even begin to try to trace this ...

smparkes commented on August 27, 2024

Hrm ... quick follow-up: I did dump the goroutine list when the process was at low and high mem. The number of goroutines appears to be growing. But we've also been tracking anomalous latency in the API server (trying to get GOOG to look at that since we don't run it (GKE).) Maybe overlapping requests because of delays?

brancz commented on August 27, 2024

We don't today, but it's probably time to add pprof endpoints so we can do proper profiling to see what's happening.
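For anyone who wants to try this before it lands, a sketch of what such endpoints could look like, using the standard library net/http/pprof handlers (the port and mux here are illustrative, not the actual kube-state-metrics wiring):

package main

import (
	"log"
	"net/http"
	"net/http/pprof"
)

func main() {
	mux := http.NewServeMux()
	// The existing /metrics handler would be registered on this mux as well.

	// Standard library pprof handlers for heap, goroutine, CPU profiles, etc.
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

	log.Fatal(http.ListenAndServe(":8080", mux))
}

With that in place, something like go tool pprof http://<pod-ip>:8080/debug/pprof/heap then gives a heap profile of the running process.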

andyxning commented on August 27, 2024

kube-state-metrics is OOMing if I ask it to collect pods; it goes all the way up to 2G. It's fine if I only collect services (trying others). (This is with 0.5.0 and 1.0.1.)

@smparkes you mean that

  • with only collecting 50 pods, the memory usage will grow to 2G?
  • you have tried with both 0.5.0 and 1.0.1?
  • what is the number of service objects?
  • what is your scrape interval configured in Prometheus?

Maybe overlapping requests because of delays?

This problem could be related to client-go, since kube-state-metrics itself stores nothing in memory.

I did dump the goroutine list when the process was at low and high mem. The number of goroutines appears to be growing.

Did you dump the goroutine list with pprof?
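(If pprof handlers like the ones sketched above are mounted on the metrics port, a full dump can be pulled with

$ curl 'http://localhost:8080/debug/pprof/goroutine?debug=2'

where debug=2 prints the complete stack for every goroutine.)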

andyxning commented on August 27, 2024

Agreed with @brancz, we need to add pprof for debugging.

smparkes commented on August 27, 2024

We resolved the issue.
50 live pods. Thousands of errored job pods. (We're still not quite sure how those all accumulated / didn't get gc'd.)
So the delay of the API might not have been the root cause (though I still wonder whether the service does concurrent requests if some requests take longer than the polling interval ... not clear if the service should be hardened against that.)
It would be a "nice to have" to log info on the progress of collection to help debug things like this, but the root cause was us ... and us not having monitoring on this, which is ironic :-)
Thanks guys!
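(For reference, that kind of build-up is exactly the sort of thing kube-state-metrics itself can alert on once it is running, e.g. with a Prometheus expression like sum(kube_pod_status_phase{phase="Failed"}) > 100, or checked ad hoc with

$ kubectl get pods --all-namespaces --field-selector=status.phase=Failed | wc -l

assuming a kubectl and apiserver recent enough to support field selectors on pod status.)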

andyxning commented on August 27, 2024

@brancz Does Prometheus scrape synchronously, i.e., will it wait for the previous scrape to finish before starting the next one?

@caesarxuchao Does client-go sync with the apiserver synchronously, i.e., will it wait for the previous sync to finish before starting the next one?

andyxning commented on August 27, 2024

@smparkes Actually the scrape logic is very simple and it works synchronously. IMO, adding a log line about how many resource objects are analyzed would to some degree make debugging easier.

I also have make a PR about adding pprof to

andyxning commented on August 27, 2024

@smparkes Added a PR about logging the number of collected resource objects. Ref #254.
