Comments (11)
@XanderStrike Thanks for sharing your tests.
- We push applications via manifest. We use diego-stress-test with small changes to push buildpack-based applications. We have containerised it so that it's easy for you to run.
Steps to run
docker run -it pavanvasisht/dst:1.0.0
export MAX_IN_FLIGHT=
export NUMBER_OF_BATCHES=
export NUM_OF_SPACE=
export CF_USERNAME=admin
export APP_DOMAIN=
export CF_PASSWORD=admin
export SYSTEM_DOMAIN=
./cedar.sh
- We usually have 30 nodes of 8cpu_30gb machines.
from capi-k8s-release.
Yep, at least on develop! Please reopen this with new logs if it's still broken when y'all run it through your pipelines; we also improved the logging.
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/173583970
The labels on this github issue will be updated when the story is started.
Verified with cf-k8s-networking team
cloudfoundry/cf-k8s-networking#40 (comment)
This is the same issue (but with additional context) as filed by @XanderStrike in cloudfoundry/capi-release#176:
error:
Failed to create/update/delete Route resource with guid '...' on Kubernetes
Issue
When creating many apps, we see some fail to create the Route resource.
Context
We're running a scale test on a GCP cluster with 100 nodes. In order to start the test we need to get to 1000 apps quickly so that we can test the steady-state performance of the networking components. Pushing 1000 apps at once is a little ridiculous, so we push 10 at a time, which seems a little more reasonable.
Steps to Reproduce
Run the following:
for n in {0..99}; do
  for i in {0..9}; do
    name="bin-$((n * 10 + i))"
    echo $name
    cf push $name -o cfrouting/httpbin8080 -m 256M -k 256M -i 2 &
  done
  wait
done
Expected result
All apps successfully push.
Current result
Anywhere from 0-5 of those apps come back with
Failed to create/update/delete Route resource....
Observe the errors in the "pave-cf-for-scale-tests" step in this pipeline.
Possible Fix
Looking at CAPI's logs, we see this is because of a 422 error. This could mean a few things: the API server could be overloaded (unlikely), or there could be some semantic error with the resource. Unfortunately the kube-apiserver doesn't have any useful info in its logs.
Based on our research, we believe the kube-apiserver puts any validation errors in the response body for requests that fail with this status code, but we don't have access to that response body and it's not currently logged by cloud_controller.
We have two recommendations:
- Log some part of the response body in the case of errors so we can get something more specific than the status code
- Introduce retries into the Kubernetes::RouteCrdClient
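For the second recommendation, a retry wrapper could look roughly like this. This is only a sketch, not the actual Kubernetes::RouteCrdClient code: the `with_retries` helper and the set of retryable status codes are assumptions, and we're relying on Kubeclient errors exposing the HTTP status via `#error_code` (which the gem does).

```ruby
# Sketch: retry transient Kubernetes API errors with exponential backoff.
# The helper name and retryable-code list below are illustrative, not the
# real Kubernetes::RouteCrdClient interface.
MAX_ATTEMPTS = 3
RETRYABLE_CODES = [409, 422, 429, 503].freeze

def with_retries(max_attempts: MAX_ATTEMPTS)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError => e
    code = e.respond_to?(:error_code) ? e.error_code : nil
    # re-raise immediately if the error isn't retryable or we're out of tries
    raise unless RETRYABLE_CODES.include?(code) && attempts < max_attempts
    sleep(0.05 * 2**attempts) # 0.1s, 0.2s, ... backoff between attempts
    retry
  end
end
```

Wrapping the route create/update/delete calls this way would paper over transient 409/422s while still surfacing persistent failures.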
k8s docs about 409 Conflict errors
We had a hard time reproducing this exactly, but we were able to see some of the inadequacies in our logging when we hit Kubeclient errors, and we're going to work on logging more of the data when issues like this come up (e.response and e.message are currently being thrown away).
We're a bit hesitant to introduce retries for any error here until we better understand what's causing the conflict, or until we're able to reproduce it on our side.
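For reference, a sketch of what surfacing those discarded fields could look like. The `log_kube_error` helper and the log line shape are hypothetical, but the `#error_code`, `#message`, and `#response` accessors are the real Kubeclient error fields mentioned above.

```ruby
require 'logger'

# Sketch: log the status code, message, and response body from a Kubeclient
# error instead of discarding them. The helper name and log format are
# illustrative, not cloud_controller's actual logging code.
def log_kube_error(logger, error, route_guid)
  details = {
    route_guid: route_guid,
    status: error.respond_to?(:error_code) ? error.error_code : nil,
    message: error.message,
    # the apiserver puts validation details for 422s in the response body
    body: error.respond_to?(:response) ? error.response : nil,
  }
  logger.error("failed to sync Route resource: #{details}")
  details
end
```

With the body logged, a 422 caused by a semantic validation error would be distinguishable from an overloaded API server.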
In the process of trying to repro, we had a couple questions:
- Y'all don't have a manifest.yml, right? We were able to repro this accidentally using catnip and server-side manifest application, but it looks like your stress tests don't use that.
- Did y'all reproduce this without pushing 1000 apps? Does it reproduce if we push in parallel and then delete in batches of 10?
- If we were to try to run that exact same script and load, how big of a node pool would we need?
cc @XanderStrike @jamespollard8
- We're pushing Docker images, so no. The code in the original issue is exactly what we do.
- It consistently happens somewhere on the way to 1000 routes. In this run it happened to the 8th app we tried to push, but in others it gets through a few dozen before it happens. It seems to consistently happen in the first 100 or so though.
- We use 100 n1-standard-8 nodes on GCP; we also attempted to scale up CAPI.
The error happens many, many times each run, and when we originally created the issue I was able to repro it locally by running the for loop in the original issue. If you look at the pave-cf-for-scale-tests step in our scale-test job, you'll see the error in any of the recent runs that made it to that step.
We've brought down our scale-test cluster and paused scale testing since we're hard-blocked on this issue, but if you like I can spin it up again.
@XanderStrike @KesavanKing we added some retry logic and were able to see it alleviating this problem. Would love it if y'all could run your tests again!
@cwlbraa nice, is this available in cf-for-k8s now?
@cwlbraa Thanks for the PR. We will run the tests and let you know.