openshift-qe / ocp-qe-perfscale-ci
OpenShift QE PerfScale CI
License: Apache License 2.0
Looks like another issue we hit because of an http_proxy/https_proxy failure or missing Python packages.
Error reported:
10-20 15:44:58.629 2022-10-20 15:44:58,403 [ERROR] Failed to get the metrics: HTTPSConnectionPool(host='prometheus-k8s-openshift-monitoring.apps.scaleci12-20928.qe.devcluster.openshift.com', port=443): Max retries exceeded with url: /api/v1/query?query=ALERTS%7Balertname%3D%22etcdHighNumberOfLeaderChanges%22%2C+severity%3D%22warning%22%7D (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')))
See failures for a private cluster:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/scale-nightly-regression/486/
Noticed that it failed for a non-private cluster too:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/scale-nightly-regression/491/
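A minimal workaround sketch, assuming the failure comes from the cluster-wide proxy intercepting the Prometheus route (the route and service-account names are the standard openshift-monitoring ones, but the NO_PROXY approach itself is an assumption, not the pipeline's current logic):
# Exclude the Prometheus route from the proxy before the metrics script runs.
PROM_ROUTE=$(oc get route prometheus-k8s -n openshift-monitoring -o jsonpath='{.spec.host}')
export NO_PROXY="${NO_PROXY},${PROM_ROUTE}"
export no_proxy="${no_proxy},${PROM_ROUTE}"
# Quick connectivity check with an in-cluster bearer token
# (oc create token needs 4.11+; older releases had oc sa get-token).
TOKEN=$(oc create token prometheus-k8s -n openshift-monitoring 2>/dev/null \
  || oc sa get-token prometheus-k8s -n openshift-monitoring)
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${PROM_ROUTE}/api/v1/query" --data-urlencode 'query=up' | head -c 200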
30 minutes of waiting time is not enough for large clusters (e.g. 50 nodes, each needing ~4 minutes to reboot on Azure, adds up to more than 180 minutes).
Suggestion: reduce the wait to 15 minutes but add additional verification.
Store the name of the node that has 'NotReady|SchedulingDisabled' status, and in the next iteration, if the node is different, reset wait_num.
This way there is no time limit for large clusters during reboots, but if something goes wrong, the next step will be executed much earlier. A sketch of the proposed loop is below.
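A minimal sketch of that check, assuming a 60-second poll interval (wait_num comes from the suggestion above; everything else is illustrative):
wait_num=0
last_bad_node=""
while [ "$wait_num" -lt 15 ]; do
  bad_node=$(oc get nodes --no-headers | grep -E 'NotReady|SchedulingDisabled' | head -1 | awk '{print $1}')
  [ -z "$bad_node" ] && break                # all nodes Ready, stop waiting
  if [ "$bad_node" != "$last_bad_node" ]; then
    wait_num=0                               # a different node is rebooting: progress, reset the counter
    last_bad_node="$bad_node"
  fi
  wait_num=$((wait_num + 1))
  sleep 60                                   # 15 iterations ~= the proposed 15-minute window
done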
On GCP I scaled workers to 3 and installed INFRA_WORKLOAD_INSTALL, then scaled the cluster again to 120 nodes.
All machinesets were scaled, including the infra and workload ones:
oc get machinesets -A
NAMESPACE NAME DESIRED CURRENT READY AVAILABLE AGE
openshift-machine-api infra-qili-gcp-kn95ma 15 15 15 15 50m
openshift-machine-api infra-qili-gcp-kn95mb 15 15 11 11 50m
openshift-machine-api infra-qili-gcp-kn95mc 15 15 1 1 50m
openshift-machine-api qili-gcp-kn95m-worker-a 15 15 3 3 5h51m
openshift-machine-api qili-gcp-kn95m-worker-b 15 15 5h51m
openshift-machine-api qili-gcp-kn95m-worker-c 15 15 5h51m
openshift-machine-api qili-gcp-kn95m-worker-f 15 15 5h51m
openshift-machine-api workload-qili-gcp-kn95m 15 15 1 1 50m
#147 fixed this issue and the fix worked on Azure. However, I found there are no infra or workload role labels on the GCP machinesets:
% oc get --no-headers machinesets -A --show-labels
openshift-machine-api infra-qili-gcp-kn95ma 1 1 1 1 147m machine.openshift.io/cluster-api-cluster=qili-gcp-kn95m
openshift-machine-api infra-qili-gcp-kn95mb 1 1 1 1 147m machine.openshift.io/cluster-api-cluster=qili-gcp-kn95m
openshift-machine-api infra-qili-gcp-kn95mc 1 1 1 1 147m machine.openshift.io/cluster-api-cluster=qili-gcp-kn95m
openshift-machine-api qili-gcp-kn95m-worker-a 15 15 8 8 7h29m machine.openshift.io/cluster-api-cluster=qili-gcp-kn95m
openshift-machine-api qili-gcp-kn95m-worker-b 15 15 7h29m machine.openshift.io/cluster-api-cluster=qili-gcp-kn95m
openshift-machine-api qili-gcp-kn95m-worker-c 15 15 7h29m machine.openshift.io/cluster-api-cluster=qili-gcp-kn95m
openshift-machine-api qili-gcp-kn95m-worker-f 15 15 7h29m machine.openshift.io/cluster-api-cluster=qili-gcp-kn95m
openshift-machine-api workload-qili-gcp-kn95m 1 1 1 1 147m machine.openshift.io/cluster-api-cluster=qili-gcp-kn95m
So the selector from #147 still matches all machinesets on GCP:
oc get --no-headers machinesets -A -l machine.openshift.io/cluster-api-machine-role!=infra,machine.openshift.io/cluster-api-machine-role!=workload | awk '{print $2}'
infra-qili-gcp-kn95ma
infra-qili-gcp-kn95mb
infra-qili-gcp-kn95mc
qili-gcp-kn95m-worker-a
qili-gcp-kn95m-worker-b
qili-gcp-kn95m-worker-c
qili-gcp-kn95m-worker-f
workload-qili-gcp-kn95m
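A hypothetical fix sketch: add the missing role labels on GCP so the #147 selector can exclude the infra and workload machinesets (the machineset names are taken from the output above and will differ per cluster):
for ms in infra-qili-gcp-kn95ma infra-qili-gcp-kn95mb infra-qili-gcp-kn95mc; do
  oc label machineset "$ms" -n openshift-machine-api \
    machine.openshift.io/cluster-api-machine-role=infra --overwrite
done
oc label machineset workload-qili-gcp-kn95m -n openshift-machine-api \
  machine.openshift.io/cluster-api-machine-role=workload --overwrite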
If the pod latency is not within 5s, the output passed to write-scale-ci-results contains parentheses, which currently breaks the script. We need to escape those characters.
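A minimal escaping sketch (the RESULT_MSG value is a made-up example; the sed substitution is the point):
RESULT_MSG='podLatency: P99 7.2s (exceeds 5s threshold)'   # hypothetical example string
ESCAPED_MSG=$(printf '%s' "$RESULT_MSG" | sed 's/[()]/\\&/g')
echo "$ESCAPED_MSG"   # -> podLatency: P99 7.2s \(exceeds 5s threshold\)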
I've seen this happen many times when I attempt to scale with the job.
Example:
06-23 15:11:28.146 service/dittybopper created
06-23 15:11:28.399 route.route.openshift.io/dittybopper created
06-23 15:11:28.399 Warning: would violate PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (containers "dittybopper", "dittybopper-syncer" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "dittybopper", "dittybopper-syncer" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or containers "dittybopper", "dittybopper-syncer" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "dittybopper", "dittybopper-syncer" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
06-23 15:11:28.399 deployment.apps/dittybopper created
06-23 15:11:28.655 configmap/sc-ocp-prom created
06-23 15:11:28.655 configmap/sc-grafana-config created
06-23 15:11:28.655
06-23 15:11:28.655 Waiting for dittybopper deployment to be available...
06-23 15:12:36.285 error: timed out waiting for the condition on deployments/dittybopper
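When the wait times out like this, dumping the deployment state would help with debugging (a sketch; the dittybopper namespace is an assumption based on its default deploy):
oc describe deployment/dittybopper -n dittybopper
oc get pods -n dittybopper -o wide
oc get events -n dittybopper --sort-by=.lastTimestamp | tail -20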
From this job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/cluster-post-config/432/console, I saw vsphere is not recognized, so it must be using the default monitoring-config.yaml, which has a volumeClaimTemplate configured. The monitoring pods therefore tried to bind a PVC that can never be provisioned, because we don't pass OPENSHIFT_PROMETHEUS_STORAGE_CLASS and OPENSHIFT_ALERTMANAGER_STORAGE_CLASS to the template as env vars for vsphere.
06-13 10:49:26.304 ++ find /home/jenkins/ws/workspace/h-pipeline_cluster-post-config_5/flexy-artifacts/workdir/install-dir/
06-13 10:49:26.304 ++ grep vsphere -c
06-13 10:49:26.304 + [[ 0 > 0 ]]
06-13 10:49:26.304 + envsubst
06-13 10:49:26.304 + oc apply -f -
06-13 10:49:29.561 configmap/cluster-monitoring-config created
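A sketch of a more robust platform check: the infrastructure API reports the platform directly, so grepping the install artifacts isn't needed (the config file names below are assumptions):
PLATFORM=$(oc get infrastructure cluster -o jsonpath='{.status.platformStatus.type}')
if [ "$PLATFORM" = "VSphere" ]; then
  # apply a monitoring config without the volumeClaimTemplate / storage class
  oc apply -f monitoring-config-no-pvc.yaml
else
  envsubst < monitoring-config.yaml | oc apply -f -
fi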
This issue caused monitoring to fail to move to the infra machinesets.
NAME READY AGE
statefulset.apps/alertmanager-main 0/2 147m
statefulset.apps/prometheus-k8s 0/2 147m
Describing the two statefulsets:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 11m (x39 over 148m) statefulset-controller create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: failed to create PVC alertmanager-main-db-alertmanager-main-0: PersistentVolumeClaim "alertmanager-main-db-alertmanager-main-0" is invalid: spec.resources[storage]: Invalid value: "0": must be greater than zero
Warning FailedCreate 118s (x41 over 148m) statefulset-controller create Claim alertmanager-main-db-alertmanager-main-0 for Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: PersistentVolumeClaim "alertmanager-main-db-alertmanager-main-0" is invalid: spec.resources[storage]: Invalid value: "0": must be greater than zero
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 13m (x39 over 149m) statefulset-controller create Pod prometheus-k8s-0 in StatefulSet prometheus-k8s failed error: failed to create PVC prometheus-k8s-db-prometheus-k8s-0: PersistentVolumeClaim "prometheus-k8s-db-prometheus-k8s-0" is invalid: spec.resources[storage]: Invalid value: "0": must be greater than zero
Warning FailedCreate 3m10s (x41 over 149m) statefulset-controller create Claim prometheus-k8s-db-prometheus-k8s-0 for Pod prometheus-k8s-0 in StatefulSet prometheus-k8s failed error: PersistentVolumeClaim "prometheus-k8s-db-prometheus-k8s-0" is invalid: spec.resources[storage]: Invalid value: "0": must be greater than zero
Some teams using our workloads might have their own dittybopper installation on their cluster, so we need to make the dittybopper installation optional.
If we do install dittybopper, we need to be able to pass the GitHub URL and branch we want to install from, for more flexibility.
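A sketch of how the gating and parameters could look (the variable names, default repo URL, and deploy script path are assumptions):
INSTALL_DITTYBOPPER=${INSTALL_DITTYBOPPER:-true}
DITTYBOPPER_REPO=${DITTYBOPPER_REPO:-https://github.com/cloud-bulldozer/performance-dashboards.git}
DITTYBOPPER_BRANCH=${DITTYBOPPER_BRANCH:-master}
if [ "$INSTALL_DITTYBOPPER" = "true" ]; then
  git clone --branch "$DITTYBOPPER_BRANCH" "$DITTYBOPPER_REPO" performance-dashboards
  cd performance-dashboards/dittybopper && ./deploy.sh && cd -
fi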
To run the netobserv-perf automation on non-AWS clouds (or on 4.12 where the gp2 storageClass no longer exists), we should let the user pass the storageClassName for the LokiStack CRD. It can default to gp2.
cc: @nathan-weinberg
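A sketch of passing the parameter through (the variable name, CR name, and namespace are assumptions; spec.storageClassName should be the field the LokiStack CRD takes, but verify against the installed CRD):
LOKISTACK_STORAGE_CLASS=${LOKISTACK_STORAGE_CLASS:-gp2}   # default matches current behavior
oc patch lokistack lokistack -n netobserv --type merge \
  -p "{\"spec\":{\"storageClassName\":\"${LOKISTACK_STORAGE_CLASS}\"}}"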
During the EUS upgrade, the script properly patches the machine config pools to update, but it doesn't wait for all the machines to actually update. There should be a wait step after the MCP patches.
06-09 13:37:13.513 Kube-apiserver is done progressing
06-09 13:37:13.513 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Abnormal co details~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
06-09 13:37:13.513
06-09 13:37:13.513
06-09 13:37:15.463 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
06-09 13:37:15.464
06-09 13:37:15.464
06-09 13:37:15.464 post check passed without err.
06-09 13:37:15.464
06-09 13:37:16.385 output machineconfigpool.machineconfiguration.openshift.io/worker patched
06-09 13:37:16.385
06-09 13:37:28.561 [Pipeline] }
06-09 13:37:28.565 [Pipeline] // script
06-09 13:37:28.569 [Pipeline] script
06-09 13:37:28.571 [Pipeline] {
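A minimal sketch of such a wait step (the 90-minute timeout is an assumption):
oc wait mcp/worker --for=condition=Updated=True --timeout=90m
# or poll the counts until they converge:
oc get mcp worker -o jsonpath='{.status.updatedMachineCount}/{.status.machineCount}'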
Scale-up only supports clusters with 3 machinesets (ocp-qe-perfscale-ci/Jenkinsfile, line 58 in ff917f0). An Azure cluster, for example, can have a single machineset:
% oc get --no-headers machinesets -A
openshift-machine-api qili-48-zaure-rqzx4-worker-northcentralus 34 34 33 33 4h27m
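A sketch of distributing the desired total across however many worker machinesets exist, instead of assuming exactly three (WORKER_COUNT and the even-split policy are assumptions):
TOTAL=${WORKER_COUNT:-120}
mapfile -t SETS < <(oc get machinesets -n openshift-machine-api --no-headers | grep worker | awk '{print $1}')
N=${#SETS[@]}
for i in "${!SETS[@]}"; do
  # give the first TOTAL%N machinesets one extra replica so the sum is exact
  replicas=$(( TOTAL / N + (i < TOTAL % N ? 1 : 0) ))
  oc scale machineset "${SETS[$i]}" -n openshift-machine-api --replicas="$replicas"
done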
This is similar to #61, but to perform a health check after cluster creation and before tests are executed.
Similar to the upgrade CI, we need to be able to run lots of jobs and log issues without the cluster being around.
It would be helpful to print logs, and maybe a must-gather in certain cases, to be able to properly open bugs.
Some thoughts:
After a failed cluster-workers-scaling job, there is no additional information about the not-ready machinesets or nodes. We need to add the following (a gather sketch follows below):
oc describe machineset
and
oc describe node
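A diagnostics-gathering sketch along those lines (the desired-vs-ready comparison and the must-gather step are assumptions, not the pipeline's actual logic):
# describe machinesets whose ready count doesn't match desired
for ms in $(oc get machinesets -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.replicas}{" "}{.status.readyReplicas}{"\n"}{end}' \
    | awk '$2 != $3 {print $1}'); do
  oc describe machineset "$ms" -n openshift-machine-api
done
# describe nodes that are not plainly Ready
for node in $(oc get nodes --no-headers | grep -Ev ' Ready ' | awk '{print $1}'); do
  oc describe node "$node"
done
oc adm must-gather --dest-dir=./must-gather   # optional, for attaching to bugs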
Currently an error gets thrown when the install fails and destroy is called automatically. It seems the build number is not being found properly:
02-22 15:24:15.611 java.lang.NumberFormatException: For input string: ""
02-22 15:24:15.611 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
02-22 15:24:15.611 at java.lang.Integer.parseInt(Integer.java:592)
02-22 15:24:15.611 at java.lang.Integer.parseInt(Integer.java:615)
02-22 15:24:15.611 at hudson.plugins.copyartifact.SpecificBuildSelector.getBuild(SpecificBuildSelector.java:70)
02-22 15:24:15.611 at hudson.plugins.copyartifact.CopyArtifact.perform(CopyArtifact.java:454)
02-22 15:24:15.611 at jenkins.tasks.SimpleBuildStep.perform(SimpleBuildStep.java:123)
02-22 15:24:15.611 at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:100)
02-22 15:24:15.611 at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:70)
02-22 15:24:15.611 at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
02-22 15:24:15.611 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
02-22 15:24:15.611 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
02-22 15:24:15.611 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
02-22 15:24:15.611 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
02-22 15:24:15.611 at java.lang.Thread.run(Thread.java:750)
02-22 15:24:15.622 Finished: FAILURE
Flow: when my cluster name is, e.g., skordas, the first machinesets in the list are the infra ones:
NAMESPACE NAME DESIRED CURRENT READY AVAILABLE AGE
openshift-machine-api infra-us-east-2a 3 0 22m
openshift-machine-api infra-us-east-2b 0 0 22m
openshift-machine-api infra-us-east-2c 0 0 22m
openshift-machine-api skordas-511b-fjt5x-worker-us-east-2a 0 40 40 40 3h40m
openshift-machine-api skordas-511b-fjt5x-worker-us-east-2b 0 40 40 40 3h40m
openshift-machine-api skordas-511b-fjt5x-worker-us-east-2c 0 40 40 40 3h40m
openshift-machine-api workload-us-east-2a 0 1 1 1 22m
so running a build to scale down to 3 nodes sets 3 replicas on the infra machineset, and the rest go to 0.
Currently we write a timestamp to the Scale CI/Upgrade sheet and the cron regression output, but we do not specify which timezone the tests ran in. This can be confusing at times.
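A minimal fix sketch: pin the recorded timestamps to UTC and say so in the string (the exact format is an assumption):
date -u '+%Y-%m-%d %H:%M:%S UTC'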
@mffiedler It looks like the network-perf tests were rewritten to all use the run.sh script with different env variables setting which test to run.
See run in jenkins for failure....e2e-benchmarking-multibranch-pipeline/job/network-perf-pod-network-test/154/console
Have you run/seen this? I am not super familiar with these tests, so I wanted to check with you whether the below sounds correct.
I'm guessing the following:
WORKLOAD=pod2pod is for branch network-perf-pod-network-test
WORKLOAD=hostnet is for network-perf-hostnetwork-network-test
WORKLOAD=pod2svc is for network-perf-serviceip-network-test
I see a couple more options in the documentation around setting network policy to true. Should we add that as a parameter for each of the tests, to cover each of the scenarios listed?
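If the guesses above are right, the invocations would look roughly like this (the NETWORK_POLICY flag name is an assumption taken from the docs mentioned, not verified):
WORKLOAD=pod2pod ./run.sh        # network-perf-pod-network-test
WORKLOAD=hostnet ./run.sh        # network-perf-hostnetwork-network-test
WORKLOAD=pod2svc ./run.sh        # network-perf-serviceip-network-test
WORKLOAD=pod2pod NETWORK_POLICY=true ./run.sh   # network-policy variant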
Links to independent spreadsheets in the Scale CI/Upgrade Results spreadsheet tend to take a format similar to the following:
https://docs.google.com/spreadsheets/d/14y9JA__itZptyC5w7nFBX6d8IRZb37Zz9I1kEZckQkQ ***************
The link itself is actually correct - I think this might stem from how the result is being parsed from the Jenkins log.
In the "generate jobs in gsheet" script get_periodic_jobs.py, rows for both "rosa" and "rosa_hcp" appear to be the same.
Those that are "rosa_hcp" should probably have "Cloud Type" = "rosa_hcp", or something similar.
There is no longer a storage-perf folder under workloads in e2e-benchmarking; we need to validate that it was not just moved. If it was removed, we should remove the storage-perf branch:
..../scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/storage-perf/22/console
https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads
I've seen several cases of the kube-burner job creating a new Gsheet for runs even when there is no data populated in them. Not an urgent issue, but it seems wasteful.
Example case: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/kube-burner/586
The spreadsheet: https://docs.google.com/spreadsheets/d/16A-DNhYSuTmr_QnjbW2gEkd8T8W1_rwF4Mwha8ZYZMk/edit?usp=sharing
When a 120-node cluster is loaded with projects, scaling it down to 3 nodes with the same number of pods can hit cluster maximums.
My proposal is to move benchmark-cleaner before cluster-workers-scaling.
Some users might want a specific number of namespaces and a set number of pods per namespace. It would be helpful to add an option to pass your own custom kube-burner config file.
This is already set up in e2e-benchmarking; we just need to add the possibility into Jenkins:
https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner#launching-custom-workloads
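Per the e2e-benchmarking docs linked above, launching a custom workload could be wired up roughly like this (the exact variable names should be checked against that README; treat these as assumptions):
export WORKLOAD=custom
export CONFIG_FILE=./my-kube-burner-config.yaml   # user-supplied config, hypothetical path
./run.sh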
We want to add an extra check after certain benchmarks are run, to verify the cluster is in a decent state.
In many of the runs I've seen recently, the benchmark finishes but some of the nodes go NotReady. In this case I feel we should fail the test, but currently it passes.
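A minimal health-gate sketch (failing on any node that isn't Ready; the exit convention is an assumption):
if oc get nodes --no-headers | grep -Eq 'NotReady|SchedulingDisabled'; then
  echo "Cluster unhealthy after benchmark; failing the job."
  oc get nodes
  exit 1
fi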
% oc get machineset -A
NAMESPACE NAME DESIRED CURRENT READY AVAILABLE AGE
openshift-machine-api infra-northcentralus2 1 1 3h5m
openshift-machine-api infra-northcentralus3 1 1 3h5m
openshift-machine-api infra-qili-preserve-az0516-sr44j1 1 1 3h5m
openshift-machine-api qili-preserve-az0516-sr44j-worker-northcentralus 3 3 3 3 3h55m
openshift-machine-api workload-qili-preserve-az0516-sr44j 1 1 3h5m
% oc get machines -A | grep infra
openshift-machine-api infra-northcentralus2-82z2h Failed 3h8m
openshift-machine-api infra-northcentralus3-2klrt Failed 3h8m
openshift-machine-api infra-qili-preserve-az0516-sr44j1-lbhxq Failed 3h8m
Describing the machine shows creation failed with "Please make sure that the referenced resource exists, and that both resources are in the same region":
Error Message: failed to reconcile machine "infra-northcentralus2-82z2h": network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidResourceReference" Message="Resource /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/qili-preserve-az0516-sr44j-rg/providers/Microsoft.Network/virtualNetworks/qili-preserve-az0516-sr44j-vnet/subnets/qili-preserve-az0516-sr44j-worker-subnet referenced by resource /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/qili-preserve-az0516-sr44j-rg/providers/Microsoft.Network/networkInterfaces/infra-northcentralus2-82z2h-nic was not found. Please make sure that the referenced resource exists, and that both resources are in the same region." Details=[]
Checking the infra machineset yaml, the location is centralus:
% oc get machinesets/infra-northcentralus2 -n openshift-machine-api -o yaml
...
spec:
replicas: 1
selector:
matchLabels:
machine.openshift.io/cluster-api-cluster: qili-preserve-az0516-sr44j
machine.openshift.io/cluster-api-machineset: infra-northcentralus2
template:
metadata:
labels:
machine.openshift.io/cluster-api-cluster: qili-preserve-az0516-sr44j
machine.openshift.io/cluster-api-machine-role: infra
machine.openshift.io/cluster-api-machine-type: infra
machine.openshift.io/cluster-api-machineset: infra-northcentralus2
spec:
lifecycleHooks: {}
metadata:
labels:
node-role.kubernetes.io/infra: ""
providerSpec:
value:
apiVersion: azureproviderconfig.openshift.io/v1beta1
credentialsSecret:
name: azure-cloud-credentials
namespace: openshift-machine-api
image:
offer: ""
publisher: ""
resourceID: /resourceGroups/qili-preserve-az0516-sr44j-rg/providers/Microsoft.Compute/images/qili-preserve-az0516-sr44j
sku: ""
version: ""
kind: AzureMachineProviderSpec
location: centralus
managedIdentity: qili-preserve-az0516-sr44j-identity
metadata:
creationTimestamp: null
osDisk:
diskSettings: {}
diskSizeGB: 128
managedDisk:
storageAccountType: Premium_LRS
osType: Linux
publicIP: false
resourceGroup: qili-preserve-az0516-sr44j-rg
subnet: qili-preserve-az0516-sr44j-worker-subnet
userDataSecret:
name: worker-user-data
vmSize: Standard_D48s_v3
vnet: qili-preserve-az0516-sr44j-vnet
zone: "2"
Checking the code: ocp-qe-perfscale-ci/Jenkinsfile, line 297 in 8f3eb87.
But the worker machineset is actually in 'northcentralus':
% oc get machineset/qili-preserve-az0516-sr44j-worker-northcentralus -n openshift-machine-api -o yaml
...
spec:
replicas: 3
selector:
matchLabels:
machine.openshift.io/cluster-api-cluster: qili-preserve-az0516-sr44j
machine.openshift.io/cluster-api-machineset: qili-preserve-az0516-sr44j-worker-northcentralus
template:
metadata:
labels:
machine.openshift.io/cluster-api-cluster: qili-preserve-az0516-sr44j
machine.openshift.io/cluster-api-machine-role: worker
machine.openshift.io/cluster-api-machine-type: worker
machine.openshift.io/cluster-api-machineset: qili-preserve-az0516-sr44j-worker-northcentralus
spec:
lifecycleHooks: {}
metadata: {}
providerSpec:
value:
acceleratedNetworking: true
apiVersion: machine.openshift.io/v1beta1
credentialsSecret:
name: azure-cloud-credentials
namespace: openshift-machine-api
image:
offer: ""
publisher: ""
resourceID: /resourceGroups/qili-preserve-az0516-sr44j-rg/providers/Microsoft.Compute/images/qili-preserve-az0516-sr44j-gen2
sku: ""
version: ""
kind: AzureMachineProviderSpec
location: northcentralus
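A fix sketch: derive the location from an existing worker machineset instead of hard-coding it, so the infra and workload machinesets land in the right region (variable names are assumptions):
WORKER_MS=$(oc get machinesets -n openshift-machine-api --no-headers | grep worker | head -1 | awk '{print $1}')
export LOCATION=$(oc get machineset "$WORKER_MS" -n openshift-machine-api \
  -o jsonpath='{.spec.template.spec.providerSpec.value.location}')
echo "Using location: $LOCATION"   # would print northcentralus for the cluster above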
While trying to add infra and workload nodes to my GCP cluster, I am hitting an issue: the network name does not include the trailing random suffix that my cluster name has, so the lookup finds nothing.
On this line:
export NETWORK_NAME=$(gcloud compute networks list | grep $CLUSTER_NAME | awk '{print $1}')
Cluster name: <shortened_cluster_name>-5snsb
The only network found is the one below, so the variable is not being set properly:
<shortened_cluster_name>-network CUSTOM REGIONAL
I'm working on some sort of workaround. I'm not sure if it's a length thing, but I have a second cluster whose cluster name and network name match, and it works fine.
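A tolerant-lookup sketch: try the cluster's infra ID first, then fall back to the cluster name with the random suffix stripped (the fallback policy is an assumption):
INFRA_ID=$(oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}')
SHORT_NAME=${CLUSTER_NAME%-*}   # drop the trailing random suffix, e.g. -5snsb
export NETWORK_NAME=$(gcloud compute networks list --format='value(name)' \
  | grep -E "^(${INFRA_ID}|${SHORT_NAME})" | head -1)
echo "NETWORK_NAME=${NETWORK_NAME}"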
We need to be able to add workload and infra nodes to clusters created on the Alibaba and IBM cloud types.
We will need to add new yaml files for infra and workload nodes in cluster-post-config.
We will also need to add a call from the cluster-workers-scaling branch for each of these, to pass the proper parameters.