Comments (11)
@XanderStrike Thanks for sharing your tests.
- We push applications via manifest. We use diego-stress-test with small changes to push buildpack-based applications. We have containerised it so that it's easy for you to run.
Steps to run
docker run -it pavanvasisht/dst:1.0.0
export MAX_IN_FLIGHT=
export NUMBER_OF_BATCHES=
export NUM_OF_SPACE=
export CF_USERNAME=admin
export APP_DOMAIN=
export CF_PASSWORD=admin
export SYSTEM_DOMAIN=
./cedar.sh
- We usually have 30 nodes of 8cpu_30gb machines.
from capi-k8s-release.
Yep, at least on develop! Please reopen this with new logs if it's still broken when y'all run it through your pipelines; we also improved the logging.
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/173583970
The labels on this github issue will be updated when the story is started.
Verified with cf-k8s-networking team
cloudfoundry/cf-k8s-networking#40 (comment)
This is the same issue (but with additional context) as filed by @XanderStrike in cloudfoundry/capi-release#176:
error:
Failed to create/update/delete Route resource with guid '...' on Kubernetes
Issue
When creating many apps, we see some fail to create the Route resource.
Context
We're running a scale test on a GCP cluster with 100 nodes. In order to start the test we need to get to 1000 apps quickly so that we can test the steady-state performance of the networking components. Pushing 1000 apps at once is a little ridiculous, so we push 10 at a time, which seems a little more reasonable.
Steps to Reproduce
Run the following:
for n in {0..99}; do
  for i in {0..9}; do
    name="bin-$((n * 10 + i))"
    echo $name
    cf push $name -o cfrouting/httpbin8080 -m 256M -k 256M -i 2 &
  done
  wait
done
Expected result
All apps successfully push.
Current result
Anywhere from 0-5 of those apps come back with
Failed to create/update/delete Route resource....
Observe the errors in the "pave-cf-for-scale-tests" step in this pipeline.
Possible Fix
Looking at CAPI's logs, we see this is because of a 422 error. This could mean a few things: the API server could be overloaded (unlikely), or there could be some semantic error with the resource. Unfortunately the kube-apiserver doesn't have any useful info in its logs.
Based on our research, we believe the kube-apiserver puts any validation errors in the response body for requests that fail with this status code, but we don't have access to that response body and it's not currently logged by cloud_controller.
We have two recommendations:
- Log some part of the response body in the case of errors so we can get something more specific than the status code
- Introduce retries into the Kubernetes::RouteCrdClient
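For the second recommendation, a retry wrapper could look roughly like this. This is only a sketch, not the actual Kubernetes::RouteCrdClient code: the `with_retries` helper and the set of retryable status codes are assumptions, and we're relying on Kubeclient errors exposing the HTTP status via `#error_code` (which the gem does).

```ruby
# Sketch: retry transient Kubernetes API errors with exponential backoff.
# The helper name and retryable-code list below are illustrative, not the
# real Kubernetes::RouteCrdClient interface.
MAX_ATTEMPTS = 3
RETRYABLE_CODES = [409, 422, 429, 503].freeze

def with_retries(max_attempts: MAX_ATTEMPTS)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError => e
    code = e.respond_to?(:error_code) ? e.error_code : nil
    # re-raise immediately if the error isn't retryable or we're out of tries
    raise unless RETRYABLE_CODES.include?(code) && attempts < max_attempts
    sleep(0.05 * 2**attempts) # 0.1s, 0.2s, ... backoff between attempts
    retry
  end
end
```

Wrapping the route create/update/delete calls this way would paper over transient 409/422s while still surfacing persistent failures.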
k8s docs about 409 Conflict errors
We had a hard time reproducing this exactly, but we were able to see some of the inadequacies in our logging when we hit Kubeclient errors, and we're going to work on logging more of the data when issues like this come up (e.response and e.message are currently being thrown away).
We're a bit hesitant to introduce retries for any error here until we better understand what's causing the conflict, or until we're able to reproduce it on our side.
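For reference, a sketch of what surfacing those discarded fields could look like. The `log_kube_error` helper and the log line shape are hypothetical, but the `#error_code`, `#message`, and `#response` accessors are the real Kubeclient error fields mentioned above.

```ruby
require 'logger'

# Sketch: log the status code, message, and response body from a Kubeclient
# error instead of discarding them. The helper name and log format are
# illustrative, not cloud_controller's actual logging code.
def log_kube_error(logger, error, route_guid)
  details = {
    route_guid: route_guid,
    status: error.respond_to?(:error_code) ? error.error_code : nil,
    message: error.message,
    # the apiserver puts validation details for 422s in the response body
    body: error.respond_to?(:response) ? error.response : nil,
  }
  logger.error("failed to sync Route resource: #{details}")
  details
end
```

With the body logged, a 422 caused by a semantic validation error would be distinguishable from an overloaded API server.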
In the process of trying to repro, we had a couple questions:
- Y'all don't have a manifest.yml, right? We were able to repro this accidentally using catnip and server-side manifest application, but it looks like your stress tests don't use that.
- Did y'all reproduce this without pushing 1000 apps? Does it reproduce if we push in parallel and then delete in batches of 10?
- If we were to try to run that exact same script and load, how big of a node pool would we need?
cc @XanderStrike @jamespollard8
- We're pushing Docker images, so no. The code in the original issue is exactly what we do.
- It consistently happens somewhere on the way to 1000 routes. In this run it happened to the 8th app we tried to push, but in others it gets through a few dozen before it happens. It seems to consistently happen in the first 100 or so though.
- We use 100 n1-standard-8 nodes on GCP; we also attempted to scale up CAPI.
The error happens many, many times each run, and when we originally created the issue I was able to repro it locally by running the for loop in the original issue. If you look at the pave-cf-for-scale-tests step in our scale-test job, you'll see the error in any of the recent runs that made it to that step.
We've brought down our scale-test cluster and paused scale testing since we're hard-blocked on this issue, but if you like I can spin it up again.
@XanderStrike @KesavanKing we added some retry logic and were able to see it alleviating this problem. Would love it if y'all could run your tests again!
@cwlbraa nice, is this available in cf-for-k8s now?
@cwlbraa Thanks for the PR. We will run the tests and let you know.