Comments (11)

KesavanKing commented on September 17, 2024

@XanderStrike Thanks for sharing your tests.

  1. We push applications via manifest. We use diego-stress-test, with small changes, to push buildpack-based applications. We have containerised it so that it is easy for you to run. Steps to run:

         docker run -it pavanvasisht/dst:1.0.0
         export MAX_IN_FLIGHT=
         export NUMBER_OF_BATCHES=
         export NUM_OF_SPACE=
         export CF_USERNAME=admin
         export APP_DOMAIN=
         export CF_PASSWORD=admin
         export SYSTEM_DOMAIN=
         ./cedar.sh

  2. We faced this error when the number of apps was in the 100s and in the 300s. We never tried the deletion case.
  3. We usually have 30 nodes of 8cpu_30gb machines.

from capi-k8s-release.

cwlbraa commented on September 17, 2024

yep, at least on develop! please reopen this with new logs if it's still broken when y'all run it through your pipelines, we also improved the logging.

cf-gitbot commented on September 17, 2024

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/173583970

The labels on this github issue will be updated when the story is started.

KesavanKing commented on September 17, 2024

Verified with cf-k8s-networking team
cloudfoundry/cf-k8s-networking#40 (comment)

jamespollard8 commented on September 17, 2024

from the same issue (but with additional context) as filed by @XanderStrike in cloudfoundry/capi-release#176

error:
Failed to create/update/delete Route resource with guid '...' on Kubernetes

Issue

When creating many apps, we see some fail to create the Route resource.

Context

A scale test on a GCP cluster with 100 nodes. In order to start the test we need to quickly get to 1000 apps so that we can test steady state performance of networking components. Pushing 1000 apps at once is a little ridiculous, so we push 10 at a time which seems a little more reasonable.

Steps to Reproduce

Run the following:

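    # Push 1000 apps as 100 batches of 10 parallel pushes.
    for n in {0..99}
    do
      for i in {0..9}
      do
        name="bin-$((n * 10 + i))"
        echo "$name"
        cf push "$name" -o cfrouting/httpbin8080 -m 256M -k 256M -i 2 &
      done
      wait  # let the whole batch finish before starting the next one
    done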

Expected result

All apps successfully push.

Current result

Anywhere from 0-5 of those apps come back with Failed to create/update/delete Route resource....

Observe the errors in the "pave-cf-for-scale-tests" step in this pipeline.
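
The loop in the reproduction steps backgrounds every push and then calls a bare `wait`, which discards the individual exit statuses, so failed apps have to be spotted in the output by eye. A minimal sketch of a variant that waits on each PID and reports which pushes failed is below. The names `push_app`, `run_batches`, and `LOG_DIR` are hypothetical helpers introduced for illustration; they are not part of the original test scripts.

```shell
# Sketch only: push apps in batches and report which pushes exited non-zero.
# push_app, run_batches and LOG_DIR are illustrative names (assumptions).

push_app() {
  # Same push command as in the reproduction steps.
  cf push "$1" -o cfrouting/httpbin8080 -m 256M -k 256M -i 2
}

# run_batches BATCHES SIZE
# Variables are global (no `local`) to stay close to plain POSIX sh.
run_batches() {
  batches=$1
  size=$2
  log_dir=${LOG_DIR:-.}
  failed=""
  n=0
  while [ "$n" -lt "$batches" ]; do
    jobs_list=""
    i=0
    while [ "$i" -lt "$size" ]; do
      name="bin-$((n * size + i))"
      # One log file per app so the error text survives the parallel run.
      push_app "$name" >"$log_dir/push-$name.log" 2>&1 &
      jobs_list="$jobs_list $!:$name"
      i=$((i + 1))
    done
    # Wait on each PID individually so its exit status is not lost.
    for job in $jobs_list; do
      pid=${job%%:*}
      name=${job#*:}
      if ! wait "$pid"; then
        failed="$failed $name"
      fi
    done
    n=$((n + 1))
  done
  # Report the count first, then one failed app name per line.
  set -- $failed
  echo "failed pushes: $#"
  for name in "$@"; do
    echo "$name"
  done
}
```

Calling `run_batches 100 10` would correspond to the original 100 batches of 10, with the output of each push kept in `push-<name>.log` for later inspection.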

Possible Fix

Looking at capi's logs, we're seeing this is because of a 422 error. This could mean a few things: the API server could be overloaded (unlikely), or there could be some semantic error with the resource. Unfortunately, the kube apiserver doesn't have any useful info in its logs.

Based on our research, we believe the kube apiserver puts any sort of validation errors in the response body for requests that fail with this error code, but we don't have access to that response body and it's not currently logged by cloud_controller.

We have two recommendations:

  1. Log some part of the request body in the case of errors so we can get something more specific than the status code
  2. Introduce retries into the Kubernetes::RouteCrdClient
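
To make the retry idea in the second recommendation concrete, here is a rough client-side sketch in shell. It is not the retry logic of `Kubernetes::RouteCrdClient` (which lives in cloud_controller's Ruby code); `retry` is a hypothetical helper, and it retries on any non-zero exit status with a doubling delay, which is cruder than retrying only on specific API errors.

```shell
# Sketch only: retry a command with exponential backoff.
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# `retry` is an illustrative helper name, not part of cf or CAPI.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    status=0
    "$@" || status=$?          # capture the exit status without tripping `set -e`
    if [ "$status" -eq 0 ]; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts (exit $status): $*" >&2
      return "$status"
    fi
    echo "attempt $attempt failed (exit $status); retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))       # double the wait before the next attempt
    attempt=$((attempt + 1))
  done
}
```

In the push loop this could be used as `retry 3 2 cf push "$name" -o cfrouting/httpbin8080 -m 256M -k 256M -i 2`. Retrying blindly can hide the underlying cause, so logging the failing response remains the more informative of the two recommendations.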

cwlbraa commented on September 17, 2024

k8s docs about 409 Conflict errors

cwlbraa commented on September 17, 2024

We had a hard time reproducing this exactly, but were able to see some of the inadequacies around logging when we have Kubeclient errors and are gonna work to log more of the data when stuff like this comes up (e.response & e.message are being thrown away).

We're a bit hesitant to introduce retries of any error here until we better understand what's causing the conflict or we're able to reproduce on our side.

In the process of trying to repro, we had a couple questions:

  1. Y'all don't have a manifest.yml, right? We were able to repro this accidentally using catnip and server-side manifest application but it looks like your stress tests don't use that.
  2. Did y'all reproduce this without pushing 1000 apps? Does it reproduce if we parallel push and then delete in batches of 10?
  3. If we were to try to run that exact same script and load, how big of a node pool would we need?

cc @XanderStrike @jamespollard8
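
For question 2, one possible reading of "parallel push and then delete in batches of 10" is sketched below. It assumes the deletes are also run in parallel and that the same image and sizes as the original loop are used; `push_delete_batch` and the `CF` override are hypothetical names added so the script can be dry-run with `CF=echo`.

```shell
# Sketch only: push one batch in parallel, then delete the same batch.
# CF and push_delete_batch are illustrative names (assumptions).
CF=${CF:-cf}   # set CF=echo to print the commands instead of running them

# push_delete_batch START SIZE
push_delete_batch() {
  start=$1
  size=$2
  i=0
  while [ "$i" -lt "$size" ]; do
    $CF push "bin-$((start + i))" -o cfrouting/httpbin8080 -m 256M -k 256M -i 2 &
    i=$((i + 1))
  done
  wait   # all pushes in the batch finish before any delete starts
  i=0
  while [ "$i" -lt "$size" ]; do
    # -f skips the confirmation prompt, -r also deletes the mapped routes
    $CF delete "bin-$((start + i))" -f -r &
    i=$((i + 1))
  done
  wait
}
```

Running `push_delete_batch 0 10` in a loop keeps the number of live apps near zero, which would show whether the route errors depend on the total app count or only on push concurrency.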

XanderStrike commented on September 17, 2024

  1. We're pushing docker images so no. The code in the original issue is actually exactly what we do.
  2. It consistently happens somewhere on the way to 1000 routes. In this run it happened to the 8th app we tried to push, but in others it gets through a few dozen before it happens. It seems to consistently happen in the first 100 or so though.
  3. We use 100 n1-standard-8 nodes on GCP. We also attempted to scale up capi.

The error happens many, many times each run and when we originally created the issue I was able to repro it locally by running the for loop in the original issue. If you look at the pave-cf-for-scale-tests in our scale-test job you'll be able to see the error on any one of the recent runs that made it to that step.

We've brought down our scale test cluster and have paused scale testing since we're hard blocked on this issue, but if you like I can spin it up again.

cwlbraa commented on September 17, 2024

@XanderStrike @KesavanKing we added some retry logic and were able to see it alleviating this problem. Would love it if y'all could run your tests again!

KauzClay commented on September 17, 2024

@cwlbraa nice, is this available in cf-for-k8s now?

KesavanKing commented on September 17, 2024

@cwlbraa Thanks for the PR. We will run the tests and let you know.
