
Comments (36)

loburm avatar loburm commented on August 27, 2024 6

Sorry that it took so much time; running the scalability test on a 1000-node cluster was a bit tricky.

I have written all numbers down in the doc: https://docs.google.com/document/d/1hm5XrM9dYYY085yOnmMDXu074E4RxjM7R5FS4-WOflo/edit?usp=sharing
If you want, I can add additional screenshots from the Kubernetes dashboard to the doc (except for the 1000-node cluster, where heapster just died).

from kube-state-metrics.

loburm avatar loburm commented on August 27, 2024 3

Yesterday I finished testing kube-state-metrics on 100- and 500-node clusters. Today I'm trying to run it on a 1000-node cluster, but I'm having some small problems with the density test. Based on the first numbers, I can say that memory, CPU, and latency depend on the number of nodes almost linearly.

I'll prepare a small report soon and share it with you.


matthiasr avatar matthiasr commented on August 27, 2024 1

let's do RCs


brancz avatar brancz commented on August 27, 2024 1

@loburm has now published the image on GCR: gcr.io/google_containers/kube-state-metrics:v1.0.0-rc.1


andyxning avatar andyxning commented on August 27, 2024 1

#200 has added support for a deployment manifest that scales with cluster size using the pod nanny, so I'm closing this now.
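For context, the approach from #200 pairs kube-state-metrics with an addon-resizer ("pod nanny") sidecar that grows the container's resource requests with the node count. A rough sketch of what such a manifest looks like (image tags and the base/per-node resource numbers here are illustrative, not the exact values from the PR):

```yaml
# Sketch: kube-state-metrics with an addon-resizer ("pod nanny") sidecar.
# The nanny watches cluster size and resizes the main container's requests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      containers:
      - name: kube-state-metrics
        image: gcr.io/google_containers/kube-state-metrics:v1.0.0
        ports:
        - containerPort: 8080
      - name: addon-resizer
        image: gcr.io/google_containers/addon-resizer:1.7
        command:
        - /pod_nanny
        - --container=kube-state-metrics
        - --deployment=kube-state-metrics
        - --cpu=100m          # base CPU request
        - --extra-cpu=1m      # added per node
        - --memory=100Mi      # base memory request
        - --extra-memory=2Mi  # added per node
```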


piosz avatar piosz commented on August 27, 2024

@kubernetes/api-reviewers are there any guidelines/requirements for graduating to GA a feature which isn't part of the Kubernetes API?

cc @wojtek-t @gmarek re: scalability testing


smarterclayton avatar smarterclayton commented on August 27, 2024

I think we'd probably recommend documenting what the compatibility expectations of that feature are going forward (in a doc in that repo), making sure there is a process for API changes reasonably consistent with those goals, and then making sure the features repo contains an issue for the graduated feature.


piosz avatar piosz commented on August 27, 2024

@loburm will help with scalability tests


brancz avatar brancz commented on August 27, 2024

I'm now back from vacation and can help coordinate. @loburm and @piosz let me know how/when you want to tackle this.


loburm avatar loburm commented on August 27, 2024

@brancz for the scalability testing we need some scenario to test against. I think we should concentrate mostly on testing metrics related to nodes and pods (I assume the other parts consume a significantly smaller amount of resources). How many nodes and pods should be present in the test scenario?


brancz avatar brancz commented on August 27, 2024

@loburm I'm completely new to the load tests, so I suggest starting with whatever seems reasonable to you. My thoughts are the same as yours: the pod metrics are expected to increase linearly with the number of objects, so focusing on those and on nodes sounds perfect for our load scenarios.

Testing at the recommended upper bound of pods/nodes in a single cluster would be best to see if we can actually handle it, but I'm not sure that's reasonable given that we have never performed load tests before.


piosz avatar piosz commented on August 27, 2024

We had a chat offline and we will try to test the following scenarios:

  1. 100-node cluster, 30 pods/node
  2. O(1000)-node cluster, 30 pods/node

@loburm will verify:

  1. what the approximate resource usage is in both cases
  2. what the average request latency is
  3. whether there are any other obvious issues


brancz avatar brancz commented on August 27, 2024

Great, thanks a lot @piosz and @loburm!


matthiasr avatar matthiasr commented on August 27, 2024


brancz avatar brancz commented on August 27, 2024

Thanks for the heads-up @matthiasr! Yes, that's one of the bottlenecks I can see happening. We may have to start thinking about sharding strategies for kube-state-metrics.


piosz avatar piosz commented on August 27, 2024

Do you think it would be possible to have some kind of pagination support for the /metrics handler?


andyxning avatar andyxning commented on August 27, 2024

How about we split the scrape endpoints according to the different collectors, i.e., pod state metrics would be available at /metrics/pods.

Or, how about we support both /metrics to fetch all metrics and /metrics/* for each collector.

Cons: this will make the Prometheus configuration more complicated.
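To illustrate the configuration cost: with per-collector endpoints, every path would need its own scrape job on the Prometheus side. The /metrics/* paths below are hypothetical (matching the suggestion above), while metrics_path is a standard Prometheus scrape option:

```yaml
# Hypothetical Prometheus config if kube-state-metrics exposed one
# endpoint per collector; each new endpoint means another scrape job.
scrape_configs:
  - job_name: ksm-pods
    metrics_path: /metrics/pods     # hypothetical per-collector path
    static_configs:
      - targets: ['kube-state-metrics:8080']
  - job_name: ksm-nodes
    metrics_path: /metrics/nodes    # hypothetical per-collector path
    static_configs:
      - targets: ['kube-state-metrics:8080']
```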


brancz avatar brancz commented on August 27, 2024

What @andyxning is suggesting is certainly possible, but is likely to just postpone the problem. @piosz I'm not aware of any precedent for that, and paging within the same instance of kube-state-metrics would also just postpone the problem, as I imagine the memory consumption is also very large in the cases where response timeouts are hit.


loburm avatar loburm commented on August 27, 2024

@andyxning I think that would add unnecessary complexity, and as I understand it the common convention is to expose metrics on /metrics. We can achieve the same by creating multiple instances of kube-state-metrics, each responsible for one or more collectors (we have a special flag for this).

And let me first perform some tests; once we have some real numbers, we can start thinking about possible issues and how they can be solved.
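The flag loburm refers to is the collector-selection flag (`--collectors` in the v1.x CLI). Sharding by collector would look roughly like this, with each instance run as its own Deployment/Service and scraped separately (container specs only; the surrounding Deployments are omitted, and the collector split chosen here is just an example):

```yaml
# Sketch: shard kube-state-metrics by collector instead of by endpoint,
# using the --collectors flag to limit what each instance exposes.
- name: ksm-pods
  image: gcr.io/google_containers/kube-state-metrics:v1.0.0
  args: ["--collectors=pods"]
- name: ksm-cluster-objects
  image: gcr.io/google_containers/kube-state-metrics:v1.0.0
  args: ["--collectors=nodes,deployments,services"]
```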


brancz avatar brancz commented on August 27, 2024

Completely agree with @loburm, measure first.


matthiasr avatar matthiasr commented on August 27, 2024

yeah, I didn't mean this as "needs immediate changes", but it would be good to measure and monitor for regressions. Our cluster is fairly sizable, and the response is big, but not unmanageable. For now I'd just like to have a rough idea of what to expect as we grow the cluster more :) Even "if your cluster has >10k pods, raise the scrape timeout to at least 20s" is something to work with.
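That rule of thumb maps directly onto Prometheus's per-job scrape settings; for example (job name and target are placeholders, and scrape_timeout must not exceed scrape_interval):

```yaml
# Raising the per-job scrape timeout for very large clusters,
# per the ">10k pods" rule of thumb above.
scrape_configs:
  - job_name: kube-state-metrics
    scrape_interval: 30s   # must be >= scrape_timeout
    scrape_timeout: 20s
    static_configs:
      - targets: ['kube-state-metrics:8080']
```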


andyxning avatar andyxning commented on August 27, 2024

Agreed with @loburm. Btw, it still requires adding more configuration to Prometheus for a single cluster. :)


brancz avatar brancz commented on August 27, 2024

@loburm any updates on how the scalability tests are coming along?


fabxc avatar fabxc commented on August 27, 2024

Thank you very much @loburm. Overall I see no concerns around scalability. In fact, we are quite surprised the memory usage stays that low. That should make us good to go for 1.0 soon.


andyxning avatar andyxning commented on August 27, 2024

@loburm

Empty cluster - cluster without pods (only a system one present).
Loaded - 30 pods per node in average.
After request - cpu and memory usage during metrics fetching.

I am curious about the three stages. Can you please explain them in more detail? :)

  • only a system one present.
    • only one system pod?
  • what is the difference between "Loaded" and "After request"?


matthiasr avatar matthiasr commented on August 27, 2024


loburm avatar loburm commented on August 27, 2024

@andyxning the empty cluster has only the pods that belong to the kube-system namespace and are created by the scalability test at the beginning:

  • dashboard
  • heapster
  • grafana, influxdb
  • node problem detector
  • kube-proxy
  • kube-dns
  • fluentd

On average that's around 4-5 pods per node at the beginning, so in the end we really have 34-35 pods per node.

"Loaded" was measured once the cluster had stabilized after all pods were created. "After request" was measured after fetching metrics from "/metrics"; it really increases memory usage and gives a short peak in CPU usage.
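The "after request" spike is easy to reproduce by hand: fetch /metrics once and watch CPU/memory. A minimal sketch of timing such a scrape and counting the exposed samples (the URL assumes kube-state-metrics' default port 8080; adjust for your setup):

```python
import time
import urllib.request

def count_samples(metrics_text: str) -> int:
    """Count exposed samples: non-empty lines that are not # HELP / # TYPE comments."""
    return sum(
        1
        for line in metrics_text.splitlines()
        if line and not line.startswith("#")
    )

def scrape(url: str = "http://localhost:8080/metrics"):
    """Fetch /metrics once and return (latency_seconds, sample_count)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read().decode("utf-8")
    return time.monotonic() - start, count_samples(body)

# Usage: latency, n = scrape(); print(f"{n} samples in {latency:.2f}s")
```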


andyxning avatar andyxning commented on August 27, 2024

@loburm Got it. Thanks for the detailed explanation.


piosz avatar piosz commented on August 27, 2024

Thanks @loburm. It seems that, from a scalability point of view, kube-state-metrics is ready for 1.0.
@matthiasr could you please add what you wrote to the documentation?


brancz avatar brancz commented on August 27, 2024

@piosz I'm preparing everything for the release and am hopeful that I can publish it this week. We'll have to fix #192 first, though; I'm already on it. Do you think we should first cut RCs, or release 1.0 straight away?


brancz avatar brancz commented on August 27, 2024

rc.1 is out: I published quay.io/coreos/kube-state-metrics:v1.0.0-rc.1 for testing, and @loburm will publish the image on gcr.io within the next half hour.


WIZARD-CXY avatar WIZARD-CXY commented on August 27, 2024

@fabxc what about "provide deployment manifest that scales with cluster size using pod nanny"? Do you need any help working on it? I think I can help, based on the resource recommendation in #196.


smarterclayton avatar smarterclayton commented on August 27, 2024

Some additional metrics from a reasonably large production cluster (on 1.0.0 plus the fix for the owner NPE):

  • 175 nodes
  • 2k namespaces (most of them have roughly one to two services, deployments, and pods each)
  • 170k samples scraped by ksm (on top of 800k base samples scraped)
  • ksm uses 400m of memory and 0.07 core
  • scraping these samples added 500m to a 4.5GB-heap Prometheus and 0.03 core (on top of 0.25 core steady state)
  • 2.7k pod series (for the various kube_pod_*)
  • the rate of change of pods on this cluster is between 2-3 pods per minute
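Back of the envelope from those numbers: at an assumed average of ~150 bytes per exposition line (an assumption for illustration, not something measured in that cluster), the ksm scrape payload is on the order of tens of megabytes per scrape:

```python
# Rough scrape-payload estimate from the numbers above.
ksm_samples = 170_000     # samples scraped from ksm per scrape (from the report)
bytes_per_line = 150      # assumed average size of one exposition line
payload_mb = ksm_samples * bytes_per_line / 1_000_000
print(f"~{payload_mb:.1f} MB of exposition text per scrape")  # → ~25.5 MB
```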


WIZARD-CXY avatar WIZARD-CXY commented on August 27, 2024

@smarterclayton good to know


andyxning avatar andyxning commented on August 27, 2024

@brancz @fabxc Since 1.0 has been released, should we just close this tracking issue?


WIZARD-CXY avatar WIZARD-CXY commented on August 27, 2024

OK, I confirm that the last unchecked item, scaling with cluster size using pod nanny, is done.


