
Comments (36)

loburm avatar loburm commented on August 27, 2024 6

Sorry that it took so much time; running the scalability test on a 1000-node cluster was a bit tricky.

I have written all numbers down in the doc: https://docs.google.com/document/d/1hm5XrM9dYYY085yOnmMDXu074E4RxjM7R5FS4-WOflo/edit?usp=sharing
If you want, I can add additional screenshots from the Kubernetes dashboard to the doc (except for the 1000-node cluster, where heapster just died).

from kube-state-metrics.

loburm avatar loburm commented on August 27, 2024 3

Yesterday I finished testing kube-state-metrics on 100- and 500-node clusters. Today I'm trying to run it on a 1000-node cluster, but I'm having some small problems with the density test. Based on the first numbers, I can say that memory, CPU, and latency depend on the number of nodes almost linearly.

I'll prepare a small report soon and share it with you.


matthiasr avatar matthiasr commented on August 27, 2024 1

let's do RCs


brancz avatar brancz commented on August 27, 2024 1

@loburm has now published the image on GCR: gcr.io/google_containers/kube-state-metrics:v1.0.0-rc.1


andyxning avatar andyxning commented on August 27, 2024 1

#200 has added support for a deployment manifest that scales with cluster size using the pod nanny, so I'm closing this now.
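For context, the approach from #200 pairs kube-state-metrics with an addon-resizer ("pod nanny") sidecar that grows the container's resource requests with the node count. A rough sketch of what such a manifest looks like (image tags and the base/per-node resource numbers here are illustrative, not the exact values from the PR):

```yaml
# Sketch: kube-state-metrics with an addon-resizer ("pod nanny") sidecar.
# The nanny watches cluster size and resizes the main container's requests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      containers:
      - name: kube-state-metrics
        image: gcr.io/google_containers/kube-state-metrics:v1.0.0
        ports:
        - containerPort: 8080
      - name: addon-resizer
        image: gcr.io/google_containers/addon-resizer:1.7
        command:
        - /pod_nanny
        - --container=kube-state-metrics
        - --deployment=kube-state-metrics
        - --cpu=100m          # base CPU request
        - --extra-cpu=1m      # added per node
        - --memory=100Mi      # base memory request
        - --extra-memory=2Mi  # added per node
```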


piosz avatar piosz commented on August 27, 2024

@kubernetes/api-reviewers are there any guidelines/requirements for graduating to GA a feature which isn't part of the Kubernetes API?

cc @wojtek-t @gmarek re: scalability testing


smarterclayton avatar smarterclayton commented on August 27, 2024

I think we'd probably recommend documenting what the compatibility expectations of that feature are going forward (in a doc in that repo), making sure there is a process for API changes reasonably consistent with those goals, and then making sure the features repo contains an issue for the graduated feature.


piosz avatar piosz commented on August 27, 2024

@loburm will help with scalability tests


brancz avatar brancz commented on August 27, 2024

I'm now back from vacation and can help coordinate. @loburm and @piosz let me know how/when you want to tackle this.


loburm avatar loburm commented on August 27, 2024

@brancz for the scalability testing we need some scenario to test against. I think we should concentrate mostly on testing metrics related to nodes and pods (I assume the other parts consume a significantly smaller amount of resources). How many nodes and pods should be present in the test scenario?


brancz avatar brancz commented on August 27, 2024

@loburm I'm completely new to the load tests, so I suggest starting with whatever seems reasonable to you. My thoughts are the same as yours: the pod metrics are expected to increase linearly with the number of objects, so focusing on those and on nodes sounds perfect for our load scenarios.

Testing at the recommended upper bound of pods/nodes in a single cluster would be best to see if we can actually handle it, but I'm not sure that's reasonable given that we have never performed load tests before.


piosz avatar piosz commented on August 27, 2024

We had a chat offline and we will try to test the following scenarios:

  1. 100-node cluster, 30 pods/node
  2. O(1000)-node cluster, 30 pods/node

@loburm will verify:

  1. what the approximate resource usage is in both cases
  2. what the average request latency is
  3. whether there are any other obvious issues


brancz avatar brancz commented on August 27, 2024

Great, thanks a lot @piosz and @loburm!


matthiasr avatar matthiasr commented on August 27, 2024


brancz avatar brancz commented on August 27, 2024

Thanks for the heads-up @matthiasr! Yes, that's one of the bottlenecks I can see happening. We may have to start thinking about sharding strategies for kube-state-metrics.


piosz avatar piosz commented on August 27, 2024

Do you think it would be possible to have some kind of pagination support for the /metrics handler?


andyxning avatar andyxning commented on August 27, 2024

How about we split the scrape endpoints according to the different collectors, i.e., pod state metrics would be available at /metrics/pods.

Or, how about we support both /metrics to fetch all metrics and /metrics/* for each collector.

Cons: this will make the Prometheus configuration more complicated.
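To illustrate the configuration cost: with per-collector endpoints, every path would need its own scrape job on the Prometheus side. The /metrics/* paths below are hypothetical (matching the suggestion above), while metrics_path is a standard Prometheus scrape option:

```yaml
# Hypothetical Prometheus config if kube-state-metrics exposed one
# endpoint per collector; each new endpoint means another scrape job.
scrape_configs:
  - job_name: ksm-pods
    metrics_path: /metrics/pods     # hypothetical per-collector path
    static_configs:
      - targets: ['kube-state-metrics:8080']
  - job_name: ksm-nodes
    metrics_path: /metrics/nodes    # hypothetical per-collector path
    static_configs:
      - targets: ['kube-state-metrics:8080']
```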


brancz avatar brancz commented on August 27, 2024

What @andyxning is suggesting is certainly possible, but is likely to just postpone the problem. @piosz I'm not aware of any precedent for that, and paging within the same instance of kube-state-metrics would also just postpone the problem, as I imagine the memory consumption is also very large in the cases where response timeouts are hit.


loburm avatar loburm commented on August 27, 2024

@andyxning I think that would add unnecessary complexity, and as I understand it the common convention is to expose metrics on /metrics. We can achieve the same by creating multiple instances of kube-state-metrics, each responsible for one or more collectors (we have a special flag for this).

And let me first perform some tests; once we have some real numbers, we can start thinking about possible issues and how they can be solved.
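The flag loburm refers to is the collector-selection flag (`--collectors` in the v1.x CLI). Sharding by collector would look roughly like this, with each instance run as its own Deployment/Service and scraped separately (container specs only; the surrounding Deployments are omitted, and the collector split chosen here is just an example):

```yaml
# Sketch: shard kube-state-metrics by collector instead of by endpoint,
# using the --collectors flag to limit what each instance exposes.
- name: ksm-pods
  image: gcr.io/google_containers/kube-state-metrics:v1.0.0
  args: ["--collectors=pods"]
- name: ksm-cluster-objects
  image: gcr.io/google_containers/kube-state-metrics:v1.0.0
  args: ["--collectors=nodes,deployments,services"]
```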


brancz avatar brancz commented on August 27, 2024

Completely agree with @loburm, measure first.


matthiasr avatar matthiasr commented on August 27, 2024

yeah, I didn't mean this as "needs immediate changes", but it would be good to measure and monitor for regressions. Our cluster is fairly sizable, and the response is big, but not unmanageable. For now I'd just like to have a rough idea of what to expect as we grow the cluster more :) Even "if your cluster has >10k pods, raise the scrape timeout to at least 20s" is something to work with.
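That rule of thumb maps directly onto Prometheus's per-job scrape settings; for example (job name and target are placeholders, and scrape_timeout must not exceed scrape_interval):

```yaml
# Raising the per-job scrape timeout for very large clusters,
# per the ">10k pods" rule of thumb above.
scrape_configs:
  - job_name: kube-state-metrics
    scrape_interval: 30s   # must be >= scrape_timeout
    scrape_timeout: 20s
    static_configs:
      - targets: ['kube-state-metrics:8080']
```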


andyxning avatar andyxning commented on August 27, 2024

Agreed with @loburm. Btw, it still requires adding more configuration to Prometheus for a single cluster. :)


brancz avatar brancz commented on August 27, 2024

@loburm any updates on how the scalability tests are coming along?


fabxc avatar fabxc commented on August 27, 2024

Thank you very much @loburm. Overall I see no concerns around scalability. In fact, we are quite surprised the memory usage stays that low. That should make us good to go for 1.0 soon.


andyxning avatar andyxning commented on August 27, 2024

@loburm

Empty cluster - cluster without pods (only a system one present).
Loaded - 30 pods per node in average.
After request - cpu and memory usage during metrics fetching.

I am curious about the three stages. Can you please explain them in more detail? :)

  • only a system one present.
    • only one system pod?
  • what is the difference between "Loaded" and "After request"?


matthiasr avatar matthiasr commented on August 27, 2024


loburm avatar loburm commented on August 27, 2024

@andyxning the empty cluster has only the pods that belong to the kube-system namespace and are created by the scalability test at the beginning:

  • dashboard
  • heapster
  • grafana, influxdb
  • node problem detector
  • kube-proxy
  • kube-dns
  • fluentd

On average that's around 4-5 pods per node at the beginning, so in the end we really have 34-35 pods per node.

"Loaded" was measured once the cluster had stabilized after all pods were created. "After request" was measured after fetching metrics from "/metrics"; it really increases memory usage and gives a short peak in CPU usage.
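The "after request" spike is easy to reproduce by hand: fetch /metrics once and watch CPU/memory. A minimal sketch of timing such a scrape and counting the exposed samples (the URL assumes kube-state-metrics' default port 8080; adjust for your setup):

```python
import time
import urllib.request

def count_samples(metrics_text: str) -> int:
    """Count exposed samples: non-empty lines that are not # HELP / # TYPE comments."""
    return sum(
        1
        for line in metrics_text.splitlines()
        if line and not line.startswith("#")
    )

def scrape(url: str = "http://localhost:8080/metrics"):
    """Fetch /metrics once and return (latency_seconds, sample_count)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read().decode("utf-8")
    return time.monotonic() - start, count_samples(body)

# Usage: latency, n = scrape(); print(f"{n} samples in {latency:.2f}s")
```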


andyxning avatar andyxning commented on August 27, 2024

@loburm Got it. Thanks for the detailed explanation.


piosz avatar piosz commented on August 27, 2024

Thanks @loburm. It seems that, from a scalability point of view, kube-state-metrics is ready for 1.0.
@matthiasr could you please add what you wrote to the documentation?


brancz avatar brancz commented on August 27, 2024

@piosz I'm preparing everything for the release and am hopeful that I can publish it this week. We'll have to fix #192 first, though; I'm already on it. Do you think we should first cut RCs, or release 1.0 straight away?


brancz avatar brancz commented on August 27, 2024

rc.1 is out: I published quay.io/coreos/kube-state-metrics:v1.0.0-rc.1 for testing, and @loburm will publish the image on gcr.io within the next half hour.


WIZARD-CXY avatar WIZARD-CXY commented on August 27, 2024

@fabxc what about "provide deployment manifest that scales with cluster size using pod nanny"? Do you need any help working on it? I think I can help, based on the resource recommendation in #196.


smarterclayton avatar smarterclayton commented on August 27, 2024

Some additional metrics from a reasonably large production cluster (on 1.0.0 plus the fix for the owner NPE):

  • 175 nodes
  • 2k namespaces (most of them have roughly one to two services, deployments, and pods each)
  • 170k samples scraped by ksm (on top of 800k base samples scraped)
  • ksm uses 400m of memory and 0.07 core
  • scraping these samples added 500m to a 4.5GB-heap Prometheus and 0.03 core (on top of 0.25 core steady state)
  • 2.7k pod series (for the various kube_pod_*)
  • the rate of change of pods on this cluster is between 2-3 pods per minute
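Back of the envelope from those numbers: at an assumed average of ~150 bytes per exposition line (an assumption for illustration, not something measured in that cluster), the ksm scrape payload is on the order of tens of megabytes per scrape:

```python
# Rough scrape-payload estimate from the numbers above.
ksm_samples = 170_000     # samples scraped from ksm per scrape (from the report)
bytes_per_line = 150      # assumed average size of one exposition line
payload_mb = ksm_samples * bytes_per_line / 1_000_000
print(f"~{payload_mb:.1f} MB of exposition text per scrape")  # → ~25.5 MB
```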


WIZARD-CXY avatar WIZARD-CXY commented on August 27, 2024

@smarterclayton good to know


andyxning avatar andyxning commented on August 27, 2024

@brancz @fabxc Since 1.0 has been released, should we just close this tracking issue?


WIZARD-CXY avatar WIZARD-CXY commented on August 27, 2024

OK, I confirm that the last unchecked item, scaling with cluster size using pod nanny, is done.


