Comments (15)

cwlbraa commented on August 15, 2024

> We'd also like to have a post-1.0 discussion about who should own scaling tests and which teams should be running them, since we've spent as much time debugging other components as we have our own 😂

We have been relying on the fact that other folks have been scale testing and we deeply appreciate the work you've done. I empathize that it's super frustrating catching bugs that you're not equipped or empowered to fix.

We'd love to work with you to get these errors fixed and help unblock you, synchronously or asynchronously. A raw "500" from the CLI side is not enough for us to act on, though; we'd need some logs from Cloud Controller.

from capi-k8s-release.

cf-gitbot commented on August 15, 2024

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/171706511

The labels on this github issue will be updated when the story is started.

tcdowney commented on August 15, 2024

Another thing that would help us get better numbers is to include the response time in the nginx access logs. It didn't seem to be present when we checked, so I'm guessing it is explicitly added for the BOSH-generated nginx config.
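
For illustration, once the request time is in the access log, per-endpoint latency can be summarized with a short script. This is only a sketch. It assumes a hypothetical log layout in which nginx's `$request_time` variable (seconds, millisecond resolution) is the last field on each line, which may not match what the BOSH-generated config emits. The sample lines are made up.

```python
import re
from collections import defaultdict
from statistics import median

# Assumed (hypothetical) layout: ... "METHOD /path HTTP/1.1" status bytes request_time
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ (?P<request_time>\d+\.\d+)$'
)


def summarize(lines):
    """Group request times (seconds) by path, ignoring the query string.

    Returns {path: (count, median_seconds, max_seconds)}.
    """
    times = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        path = match.group("path").split("?", 1)[0]
        times[path].append(float(match.group("request_time")))
    return {p: (len(v), median(v), max(v)) for p, v in times.items()}


# Made-up sample lines in the assumed layout.
SAMPLE = [
    '10.0.0.1 - - [01/Jan/2020:00:00:00 +0000] "GET /v3/apps?page=1 HTTP/1.1" 200 512 0.250',
    '10.0.0.1 - - [01/Jan/2020:00:00:01 +0000] "GET /v3/apps?page=2 HTTP/1.1" 200 512 0.750',
    '10.0.0.1 - - [01/Jan/2020:00:00:02 +0000] "GET /v2/routes HTTP/1.1" 500 64 20.125',
]

if __name__ == "__main__":
    for path, (count, med, worst) in sorted(summarize(SAMPLE).items()):
        print(f"{path}: n={count} median={med:.3f}s max={worst:.3f}s")
```

Grouping by path rather than by full URL keeps paginated requests to the same endpoint in one bucket, which is usually what matters when comparing endpoints.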

piyalibanerjee commented on August 15, 2024

Thanks for the suggestion, @tcdowney! We will add that nginx response time property to capi-k8s-release so we can get those numbers as we reproduce the error -- we'll get back to you with our findings.

We have a couple questions for you and @XanderStrike:

  1. Do you have the performance environment (or a script to build it) where you found this issue so we can do some testing on our own? Currently we are testing it in a cf-for-k8s environment where we deployed 1000 apps.
  2. To reproduce this issue, is it critical for the apps to be started? We used the no-start flag and pushed 1000 apps (so the apps were each assigned an external route, as described in the github issue). Our findings so far:
  • cf apps did not time out.
  • cf v3-apps took much longer to execute than cf apps and sometimes timed out. We will investigate this further.
  • cf app <APP_NAME> and cf7 app <APP_NAME> both return the results pretty fast, so we couldn't reproduce the performance issue you discovered (may be related to us not starting the deployed apps?).
  • cf routes, as you observed, did take ~20 seconds. We have plans to cross-team with you all (Route CRD stories) which would improve performance for routes.

EDIT (from @jspawar): we observed all of the above on a small cluster, not at all the same size as the cluster you originally used. We will try again with a cluster of similar spec.
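
Timings like the ones above could be collected repeatably with a small wrapper such as the following. It is only a sketch: the `time_command` helper, the 60-second timeout, and the placeholder app name are illustrative assumptions, and it presumes the `cf` CLI is on the PATH and already targeted at the environment under test.

```python
import subprocess
import time


def time_command(argv, timeout_s=60.0):
    """Run argv and return (elapsed_seconds, outcome).

    outcome is one of 'ok', 'failed', 'timeout', or 'missing'.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        outcome = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        outcome = "timeout"  # the child process is killed by subprocess.run
    except FileNotFoundError:
        outcome = "missing"  # executable not found on PATH
    return time.monotonic() - start, outcome


# Commands discussed in this thread; "my-app" is a placeholder app name.
COMMANDS = [
    ["cf", "apps"],
    ["cf", "v3-apps"],
    ["cf", "app", "my-app"],
    ["cf", "routes"],
]

if __name__ == "__main__":
    for argv in COMMANDS:
        elapsed, outcome = time_command(argv)
        print(f"{' '.join(argv)}: {outcome} after {elapsed:.1f}s")
```

Recording an explicit outcome alongside the elapsed time separates "slow but succeeded" from "hung until the timeout", which is the distinction this thread keeps running into.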

XanderStrike commented on August 15, 2024

Thank you for taking a look at the issue!

I'll take your questions in order:

  1. We do! I've spruced it up for you to see here. It takes about 90 minutes to get it created and pushed, but I also have an environment up and available that we can look at if you like. Reach out on Slack.
  2. I'm not sure if it's critical for capi's purposes, but in our tests we do have the apps started. Our chief concern in doing these tests is istio control plane latency (time from cf push to route availability) so for us it is essential that they be running.
  • I just reproduced our issues with cf apps and cf v3-apps timing out, using the script/environment above with 6.5. With cf7, cf apps seems to hang forever, with the same behavior as cf v3-apps.
  • I was unable to reproduce the timeout with cf app <appname> with 6.5. Seems to be taking about 3-5 seconds now. It does still take a long time (30-60 seconds) with cf7 though.

Let me know if you have any more questions and feel free to reach out to me (or the team!)!

cwlbraa commented on August 15, 2024

Have y'all tried running these tests with the apps spread out across spaces or even with a more realistic ratio of instances per app? We're creating bugs to work out the problems you've found with having many apps in one space, but I'm not sure that tells us very much about how a realistic environment might fail at scale.
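
As a rough illustration of what "spread out across spaces" could mean for a test plan, the sketch below assigns app names to spaces round-robin. The `plan_apps` helper, the `app-N` and `space-N` names, and the even split are all assumptions made for the example, not how the existing scale-test script works.

```python
from itertools import cycle


def plan_apps(app_count, space_names):
    """Assign app names to spaces round-robin; returns {space: [app, ...]}."""
    if not space_names:
        raise ValueError("need at least one space")
    plan = {space: [] for space in space_names}
    for index, space in zip(range(app_count), cycle(space_names)):
        plan[space].append(f"app-{index}")
    return plan


if __name__ == "__main__":
    # Hypothetical layout: 1,000 apps over 10 spaces instead of 1 space.
    spaces = [f"space-{i}" for i in range(10)]
    for space, apps in plan_apps(1000, spaces).items():
        print(f"{space}: {len(apps)} apps")
```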

piyalibanerjee commented on August 15, 2024

Hi @XanderStrike! We made a story in the CLI team's backlog, which we'll cross-team pair with them on, to mitigate the cf7 apps performance issues. We also filed a GitHub issue (which will become a bug in our backlog) so we can solve the 504 Gateway Timeout error we are seeing with cf6 apps in a cf-for-k8s env with many apps deployed in a single space. We'll likely need to collaborate with the eirini team and/or you to fix it.

cwlbraa commented on August 15, 2024

This probably needs to be revisited.

  1. We ultimately decided not to support capi-k8s-release+cf6.
  2. VAT did some work to make v3/apps faster shortly after this issue was created and discussed.
  3. It's possible this is still slow due to eirini instance reporter performance.

njbennett commented on August 15, 2024

@cwlbraa When you say "revisited," what's the next action here? Are we requesting that @XanderStrike retest? Or is there more work on our side? This issue is in the "accepted" state, so what's necessary to finish this out?

cwlbraa commented on August 15, 2024

@emalm pinged me out-of-band about this or something similar a few minutes ago. @XanderStrike and @astrieanna have revisited their scale tests and apparently they are, in fact, still having issues.

tcdowney commented on August 15, 2024

cc @keshav-pivotal

XanderStrike commented on August 15, 2024

To give an unofficial, off-the-cuff status update about cf-for-k8s scaling: we're still struggling with either capi or eirini at this scale, and we get a lot of errors like this:

Unexpected Response
Response Code: 500
Request ID:    2ff6d9ec-4e96-4a1e-bf17-5eb98f8dd1f0::01e36d07-393a-41d2-9439-bf4a2de2bdb1
Code: 0, Title: , Detail: {
  "errors": [
    {
      "title": "UnknownError",
      "detail": "An unknown error occurred.",
      "code": 10001
    }
  ]
}
FAILED

This prevents us from reaching our 1.0 goal of 1,000 apps and 2,000 routes because we often have many of these failures before we can even start testing networking components.

However, we've deprioritized this work and paused scale testing entirely because we're confident networking components can scale as well as or better than the rest of the platform, so we haven't spent much time looking into why these errors happen. We'd also like to have a post-1.0 discussion about who should own scaling tests and which teams should be running them, since we've spent as much time debugging other components as we have our own 😂
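
Since a raw 500 is hard to act on without matching server-side logs, one low-effort step is to pull the Request ID out of each failure so it can be searched for later, assuming the server logs record the same ID. A rough sketch, assuming the CLI output looks like the sample above; the `parse_failure` helper is hypothetical and the parsing is deliberately naive:

```python
import json
import re

# The failure output quoted in this comment, used as sample input.
SAMPLE_OUTPUT = """Unexpected Response
Response Code: 500
Request ID:    2ff6d9ec-4e96-4a1e-bf17-5eb98f8dd1f0::01e36d07-393a-41d2-9439-bf4a2de2bdb1
Code: 0, Title: , Detail: {
  "errors": [
    {
      "title": "UnknownError",
      "detail": "An unknown error occurred.",
      "code": 10001
    }
  ]
}
FAILED
"""


def parse_failure(output):
    """Extract the response code, request ID, and error list from CLI failure text."""
    code = re.search(r"^Response Code:\s*(\d+)", output, re.MULTILINE)
    request_id = re.search(r"^Request ID:\s*(\S+)", output, re.MULTILINE)
    # Assumption: the JSON body starts after "Detail: " and ends at the last "}".
    start = output.find("Detail: {")
    end = output.rfind("}")
    errors = []
    if start != -1 and end > start:
        body = output[start + len("Detail: "):end + 1]
        try:
            errors = json.loads(body).get("errors", [])
        except json.JSONDecodeError:
            errors = []  # leave empty if the body is not valid JSON
    return {
        "response_code": int(code.group(1)) if code else None,
        "request_id": request_id.group(1) if request_id else None,
        "errors": errors,
    }


if __name__ == "__main__":
    info = parse_failure(SAMPLE_OUTPUT)
    print(info["response_code"], info["request_id"])
    for error in info["errors"]:
        print(error["code"], error["title"])
```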

KesavanKing commented on August 15, 2024

@cwlbraa From our scale tests, all the issues and relevant logs related to the 500 and 503 errors are documented in #67, and the latency issue is in #70.

jspawar commented on August 15, 2024

Re: cf apps taking too long, we think we might have addressed some of that with these changes we just merged in: cloudfoundry/cloud_controller_ng#2123

heycait commented on August 15, 2024

Closing this out due to staleness. If there are more performance concerns, please open a new issue.
