cloud-bulldozer / e2e-benchmarking
Performance Tests for end Platforms
License: Apache License 2.0
When deploying cluster logging for the first time, it sometimes results in an error about the missing ClusterLogging and ClusterLogForwarder kinds.
Mon 29 Mar 2021 10:49:00 AM UTC: Checking if oc client is installed
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.10 True False 74d Cluster version is 4.6.10
Mon 29 Mar 2021 10:49:00 AM UTC: Deteting openshift-logging/openshift-operators-redhat namespaces if exists
Mon 29 Mar 2021 10:49:01 AM UTC: Installing the necessary objects for setting up elastic and logging operators and creating a cluster logging instance
Mon 29 Mar 2021 10:49:01 AM UTC: Creating cluster logging with custom elasticsearch backend
namespace/openshift-operators-redhat created
namespace/openshift-logging created
operatorgroup.operators.coreos.com/openshift-operators-redhat created
operatorgroup.operators.coreos.com/cluster-logging created
subscription.operators.coreos.com/cluster-logging created
unable to recognize "STDIN": no matches for kind "ClusterLogging" in version "logging.openshift.io/v1"
unable to recognize "STDIN": no matches for kind "ClusterLogForwarder" in version "logging.openshift.io/v1"
I presume this is due to applying the CRs too quickly after applying the CRDs.
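One way to avoid this race would be to wait until the CRD is established before applying the CRs. A minimal sketch, assuming `oc wait` is available and that the CRD name below is the one the operator registers:

```shell
# Retry a command until it succeeds or the attempt limit is reached.
retry_until() {
  local attempts=$1; shift
  local delay=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Against a real cluster (CRD name is an assumption), e.g.:
#   retry_until 30 5 oc wait --for=condition=Established \
#     crd/clusterloggings.logging.openshift.io --timeout=5s
```

The same helper could gate the ClusterLogForwarder apply as well, so both CRs only go in once their CRDs are accepted by the API server.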
Some of the links used in the e2e tests and common files still point to ripsaw in their GitHub URLs. While this isn't hurting anything from a functional standpoint, we should update the stale links for readability.
When a test such as uperf's host network test times out, I expect the script to exit with a failure code; instead it returns 0, indicating a successful test run.
Example:
[2021-12-01, 05:25:28 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: BENCHMARK UUID STATE
[2021-12-01, 05:25:28 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 Not Assigned Yet
[2021-12-01, 05:25:37 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 18a763ca-0e2f-4a97-a845-b9b6713dd317
[2021-12-01, 05:25:39 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 18a763ca-0e2f-4a97-a845-b9b6713dd317 Building
[2021-12-01, 05:25:44 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 18a763ca-0e2f-4a97-a845-b9b6713dd317 Starting Servers
[2021-12-01, 05:26:27 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 18a763ca-0e2f-4a97-a845-b9b6713dd317 Starting Clients
[2021-12-01, 05:26:38 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 18a763ca-0e2f-4a97-a845-b9b6713dd317 Waiting for Clients
[2021-12-01, 05:27:00 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 18a763ca-0e2f-4a97-a845-b9b6713dd317 Clients Running
[2021-12-01, 05:27:08 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 18a763ca-0e2f-4a97-a845-b9b6713dd317 Set Running
[2021-12-01, 07:14:43 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 18a763ca-0e2f-4a97-a845-b9b6713dd317 Run Next Set
[2021-12-01, 07:14:50 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 18a763ca-0e2f-4a97-a845-b9b6713dd317 Running
[2021-12-01, 07:14:59 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 18a763ca-0e2f-4a97-a845-b9b6713dd317 Cleanup
[2021-12-01, 07:15:11 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 18a763ca-0e2f-4a97-a845-b9b6713dd317 Complete
[2021-12-01, 07:15:12 EST] {subprocess.py:89} INFO - ripsaw-cli:ripsaw.clients.k8s:INFO :: uperf-benchmark-hostnet-network-1 with uuid 18a763ca-0e2f-4a97-a845-b9b6713dd317 has reached the desired state Complete
In the example above, the gap between "Set Running" and "Run Next Set" is roughly 2 hours, which is the default timeout interval of the job.
This leaves us with CI jobs that complete and "pass" even though they should return an error code.
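A polling wrapper along these lines could propagate the timeout as a failure instead of exiting 0 (the function name and the `oc` invocation in the comment are assumptions about how the benchmark state is queried, not the repo's actual code):

```shell
# Poll a state-reporting command until it prints "Complete", "Failed",
# or the timeout elapses; return non-zero on failure or timeout so the
# caller (and CI) sees a failing exit code.
wait_for_state() {
  local timeout=$1 interval=$2; shift 2
  local elapsed=0 state
  while ((elapsed < timeout)); do
    state=$("$@")
    if [[ $state == "Complete" ]]; then
      return 0
    elif [[ $state == "Failed" ]]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "Benchmark timed out after ${timeout}s" >&2
  return 1
}

# Against a real cluster, something like:
#   wait_for_state 7200 30 oc get benchmark uperf-benchmark-hostnet-network \
#     -n benchmark-operator -o jsonpath='{.status.state}' || exit 1
```

The key point is the trailing `|| exit 1`: whatever mechanism detects the timeout has to surface it as the script's exit code.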
After the router test execution finished OK, uploading to ES failed with a timeout, invalidating all the tests:
[2021-10-25 21:16:38,830] {subprocess.py:78} INFO - Mon Oct 25 21:16:38 UTC 2021 Testing all routes before triggering the workload
[2021-10-25 21:24:07,367] {subprocess.py:78} INFO - Mon Oct 25 21:24:07 UTC 2021 Generating config for termination http with 1 clients 0 keep alive requests and path /1024.html
[2021-10-25 21:24:08,348] {subprocess.py:78} INFO - Mon Oct 25 21:24:08 UTC 2021 Copying mb config http-scale-http.json to pod http-scale-client-5795dcd5cf-nd4w8
[2021-10-25 21:24:10,000] {subprocess.py:78} INFO - Mon Oct 25 21:24:10 UTC 2021 Executing sample 1/2 using termination http with 1 clients and 0 keepalive requests
[2021-10-25 21:24:10,283] {subprocess.py:78} INFO - Unable to use a TTY - input is not a terminal or the right kind of file
[2021-10-25 21:25:29,192] {subprocess.py:78} INFO - Executing 'mb -i /tmp/http-scale-http.json -d 60 -o /tmp/results.csv'
[2021-10-25 21:25:29,192] {subprocess.py:78} INFO - Workload finished, results:
[2021-10-25 21:25:29,192] {subprocess.py:78} INFO - {
[2021-10-25 21:25:29,192] {subprocess.py:78} INFO - "termination": "http",
[2021-10-25 21:25:29,192] {subprocess.py:78} INFO - "test_type": "http",
[2021-10-25 21:25:29,192] {subprocess.py:78} INFO - "uuid": "fa9cff5b-ec25-4669-a93c-de06b3806aa3",
[2021-10-25 21:25:29,192] {subprocess.py:78} INFO - "requests_per_second": 94798,
[2021-10-25 21:25:29,192] {subprocess.py:78} INFO - "avg_latency": 5259,
[2021-10-25 21:25:29,192] {subprocess.py:78} INFO - "latency_95pctl": 7364,
[2021-10-25 21:25:29,193] {subprocess.py:78} INFO - "latency_99pctl": 9336,
[2021-10-25 21:25:29,193] {subprocess.py:78} INFO - "host_network": "true",
[2021-10-25 21:25:29,193] {subprocess.py:78} INFO - "sample": "1",
[2021-10-25 21:25:29,193] {subprocess.py:78} INFO - "runtime": 60,
[2021-10-25 21:25:29,193] {subprocess.py:78} INFO - "routes": 500,
[2021-10-25 21:25:29,193] {subprocess.py:78} INFO - "conn_per_targetroute": 1,
[2021-10-25 21:25:29,193] {subprocess.py:78} INFO - "keepalive": 0,
[2021-10-25 21:25:29,193] {subprocess.py:78} INFO - "tls_reuse": true,
[2021-10-25 21:25:29,193] {subprocess.py:78} INFO - "number_of_routers": "2",
[2021-10-25 21:25:29,193] {subprocess.py:78} INFO - "200": 5687916
[2021-10-25 21:25:29,193] {subprocess.py:78} INFO - }
[2021-10-25 21:25:29,193] {subprocess.py:78} INFO - Indexing documents in router-test-results
[2021-10-25 21:25:29,239] {subprocess.py:78} INFO - Mon Oct 25 21:25:29 UTC 2021 Sleeping for 60s before next test
[2021-10-25 21:26:29,241] {subprocess.py:78} INFO - Mon Oct 25 21:26:29 UTC 2021 Executing sample 2/2 using termination http with 1 clients and 0 keepalive requests
[2021-10-25 21:26:29,612] {subprocess.py:78} INFO - Unable to use a TTY - input is not a terminal or the right kind of file
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - Executing 'mb -i /tmp/http-scale-http.json -d 60 -o /tmp/results.csv'
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - Workload finished, results:
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - {
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - "termination": "http",
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - "test_type": "http",
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - "uuid": "fa9cff5b-ec25-4669-a93c-de06b3806aa3",
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - "requests_per_second": 96280,
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - "avg_latency": 5172,
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - "latency_95pctl": 7214,
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - "latency_99pctl": 9203,
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - "host_network": "true",
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - "sample": "2",
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - "runtime": 60,
[2021-10-25 21:27:48,844] {subprocess.py:78} INFO - "routes": 500,
[2021-10-25 21:27:48,845] {subprocess.py:78} INFO - "conn_per_targetroute": 1,
[2021-10-25 21:27:48,845] {subprocess.py:78} INFO - "keepalive": 0,
[2021-10-25 21:27:48,845] {subprocess.py:78} INFO - "tls_reuse": true,
[2021-10-25 21:27:48,845] {subprocess.py:78} INFO - "number_of_routers": "2",
[2021-10-25 21:27:48,845] {subprocess.py:78} INFO - "200": 5776842
[2021-10-25 21:27:48,845] {subprocess.py:78} INFO - }
[2021-10-25 21:27:48,845] {subprocess.py:78} INFO - Indexing documents in router-test-results
[2021-10-25 21:27:48,889] {subprocess.py:78} INFO - Mon Oct 25 21:27:48 UTC 2021 Sleeping for 60s before next test
[2021-10-25 21:28:48,892] {subprocess.py:78} INFO - Mon Oct 25 21:28:48 UTC 2021 Generating config for termination http with 1 clients 1 keep alive requests and path /1024.html
[2021-10-25 21:28:49,569] {subprocess.py:78} INFO - Mon Oct 25 21:28:49 UTC 2021 Copying mb config http-scale-http.json to pod http-scale-client-5795dcd5cf-nd4w8
[2021-10-25 21:28:51,231] {subprocess.py:78} INFO - Mon Oct 25 21:28:51 UTC 2021 Executing sample 1/2 using termination http with 1 clients and 1 keepalive requests
[2021-10-25 21:28:51,514] {subprocess.py:78} INFO - Unable to use a TTY - input is not a terminal or the right kind of file
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - Executing 'mb -i /tmp/http-scale-http.json -d 60 -o /tmp/results.csv'
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - Workload finished, results:
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - {
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - "termination": "http",
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - "test_type": "http",
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - "uuid": "fa9cff5b-ec25-4669-a93c-de06b3806aa3",
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - "requests_per_second": 7520,
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - "avg_latency": 66310,
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - "latency_95pctl": 112284,
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - "latency_99pctl": 147764,
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - "host_network": "true",
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - "sample": "1",
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - "runtime": 60,
[2021-10-25 21:29:54,914] {subprocess.py:78} INFO - "routes": 500,
[2021-10-25 21:29:54,915] {subprocess.py:78} INFO - "conn_per_targetroute": 1,
[2021-10-25 21:29:54,915] {subprocess.py:78} INFO - "keepalive": 1,
[2021-10-25 21:29:54,915] {subprocess.py:78} INFO - "tls_reuse": true,
[2021-10-25 21:29:54,915] {subprocess.py:78} INFO - "number_of_routers": "2",
[2021-10-25 21:29:54,915] {subprocess.py:78} INFO - "200": 451232
[2021-10-25 21:29:54,915] {subprocess.py:78} INFO - }
[2021-10-25 21:29:54,915] {subprocess.py:78} INFO - Indexing documents in router-test-results
[2021-10-25 21:29:54,958] {subprocess.py:78} INFO - Mon Oct 25 21:29:54 UTC 2021 Sleeping for 60s before next test
[2021-10-25 21:30:54,960] {subprocess.py:78} INFO - Mon Oct 25 21:30:54 UTC 2021 Executing sample 2/2 using termination http with 1 clients and 1 keepalive requests
[2021-10-25 21:30:55,269] {subprocess.py:78} INFO - Unable to use a TTY - input is not a terminal or the right kind of file
[2021-10-25 21:31:58,903] {subprocess.py:78} INFO - Executing 'mb -i /tmp/http-scale-http.json -d 60 -o /tmp/results.csv'
[2021-10-25 21:31:58,903] {subprocess.py:78} INFO - Workload finished, results:
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - {
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "termination": "http",
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "test_type": "http",
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "uuid": "fa9cff5b-ec25-4669-a93c-de06b3806aa3",
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "requests_per_second": 8729,
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "avg_latency": 57239,
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "latency_95pctl": 93251,
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "latency_99pctl": 123224,
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "host_network": "true",
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "sample": "2",
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "runtime": 60,
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "routes": 500,
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "conn_per_targetroute": 1,
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "keepalive": 1,
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "tls_reuse": true,
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "number_of_routers": "2",
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - "200": 523795
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - }
[2021-10-25 21:31:58,904] {subprocess.py:78} INFO - Indexing documents in router-test-results
[2021-10-25 21:31:58,945] {subprocess.py:78} INFO - Mon Oct 25 21:31:58 UTC 2021 Sleeping for 60s before next test
[2021-10-25 21:32:58,948] {subprocess.py:78} INFO - Mon Oct 25 21:32:58 UTC 2021 Generating config for termination http with 1 clients 50 keep alive requests and path /1024.html
[2021-10-25 21:32:59,583] {subprocess.py:78} INFO - Mon Oct 25 21:32:59 UTC 2021 Copying mb config http-scale-http.json to pod http-scale-client-5795dcd5cf-nd4w8
[2021-10-25 21:33:01,223] {subprocess.py:78} INFO - Mon Oct 25 21:33:01 UTC 2021 Executing sample 1/2 using termination http with 1 clients and 50 keepalive requests
[2021-10-25 21:33:01,514] {subprocess.py:78} INFO - Unable to use a TTY - input is not a terminal or the right kind of file
[2021-10-25 21:34:16,725] {subprocess.py:78} INFO - Executing 'mb -i /tmp/http-scale-http.json -d 60 -o /tmp/results.csv'
[2021-10-25 21:34:16,725] {subprocess.py:78} INFO - Workload finished, results:
[2021-10-25 21:34:16,725] {subprocess.py:78} INFO - {
[2021-10-25 21:34:16,725] {subprocess.py:78} INFO - "termination": "http",
[2021-10-25 21:34:16,725] {subprocess.py:78} INFO - "test_type": "http",
[2021-10-25 21:34:16,725] {subprocess.py:78} INFO - "uuid": "fa9cff5b-ec25-4669-a93c-de06b3806aa3",
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "requests_per_second": 75027,
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "avg_latency": 6622,
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "latency_95pctl": 11104,
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "latency_99pctl": 15192,
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "host_network": "true",
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "sample": "1",
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "runtime": 60,
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "routes": 500,
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "conn_per_targetroute": 1,
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "keepalive": 50,
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "tls_reuse": true,
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "number_of_routers": "2",
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - "200": 4501656
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - }
[2021-10-25 21:34:16,726] {subprocess.py:78} INFO - Indexing documents in router-test-results
[2021-10-25 21:34:16,772] {subprocess.py:78} INFO - Mon Oct 25 21:34:16 UTC 2021 Sleeping for 60s before next test
[2021-10-25 21:35:16,774] {subprocess.py:78} INFO - Mon Oct 25 21:35:16 UTC 2021 Executing sample 2/2 using termination http with 1 clients and 50 keepalive requests
[2021-10-25 21:35:18,405] {subprocess.py:78} INFO - Unable to use a TTY - input is not a terminal or the right kind of file
[2021-10-25 21:36:34,282] {subprocess.py:78} INFO - Executing 'mb -i /tmp/http-scale-http.json -d 60 -o /tmp/results.csv'
[2021-10-25 21:36:34,282] {subprocess.py:78} INFO - Workload finished, results:
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - {
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "termination": "http",
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "test_type": "http",
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "uuid": "fa9cff5b-ec25-4669-a93c-de06b3806aa3",
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "requests_per_second": 75339,
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "avg_latency": 6589,
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "latency_95pctl": 11073,
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "latency_99pctl": 15216,
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "host_network": "true",
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "sample": "2",
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "runtime": 60,
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "routes": 500,
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "conn_per_targetroute": 1,
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "keepalive": 50,
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "tls_reuse": true,
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "number_of_routers": "2",
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - "200": 4520354
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - }
[2021-10-25 21:36:34,283] {subprocess.py:78} INFO - Indexing documents in router-test-results
[2021-10-25 21:36:34,329] {subprocess.py:78} INFO - Mon Oct 25 21:36:34 UTC 2021 Sleeping for 60s before next test
[2021-10-25 21:37:34,332] {subprocess.py:78} INFO - Mon Oct 25 21:37:34 UTC 2021 Generating config for termination http with 20 clients 0 keep alive requests and path /1024.html
[2021-10-25 21:37:34,959] {subprocess.py:78} INFO - Mon Oct 25 21:37:34 UTC 2021 Copying mb config http-scale-http.json to pod http-scale-client-5795dcd5cf-nd4w8
[2021-10-25 21:37:36,657] {subprocess.py:78} INFO - Mon Oct 25 21:37:36 UTC 2021 Executing sample 1/2 using termination http with 20 clients and 0 keepalive requests
[2021-10-25 21:37:36,941] {subprocess.py:78} INFO - Unable to use a TTY - input is not a terminal or the right kind of file
[2021-10-25 21:38:47,378] {subprocess.py:78} INFO - Executing 'mb -i /tmp/http-scale-http.json -d 60 -o /tmp/results.csv'
[2021-10-25 21:38:47,378] {subprocess.py:78} INFO - Workload finished, results:
[2021-10-25 21:38:47,378] {subprocess.py:78} INFO - {
[2021-10-25 21:38:47,378] {subprocess.py:78} INFO - "termination": "http",
[2021-10-25 21:38:47,378] {subprocess.py:78} INFO - "test_type": "http",
[2021-10-25 21:38:47,378] {subprocess.py:78} INFO - "uuid": "fa9cff5b-ec25-4669-a93c-de06b3806aa3",
[2021-10-25 21:38:47,378] {subprocess.py:78} INFO - "requests_per_second": 11818,
[2021-10-25 21:38:47,378] {subprocess.py:78} INFO - "avg_latency": 62146082886,
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "latency_95pctl": 2138190,
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "latency_99pctl": 5072470,
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "host_network": "true",
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "sample": "1",
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "runtime": 60,
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "routes": 500,
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "conn_per_targetroute": 20,
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "keepalive": 0,
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "tls_reuse": true,
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "number_of_routers": "2",
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "0": 80286,
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "200": 709087,
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - "504": 1
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - }
[2021-10-25 21:38:47,379] {subprocess.py:78} INFO - Indexing documents in router-test-results
[2021-10-25 21:38:47,418] {subprocess.py:78} INFO - Mon Oct 25 21:38:47 UTC 2021 Sleeping for 60s before next test
[2021-10-25 21:39:47,420] {subprocess.py:78} INFO - Mon Oct 25 21:39:47 UTC 2021 Executing sample 2/2 using termination http with 20 clients and 0 keepalive requests
[2021-10-25 21:39:47,718] {subprocess.py:78} INFO - Unable to use a TTY - input is not a terminal or the right kind of file
[2021-10-25 21:41:05,990] {subprocess.py:78} INFO - Executing 'mb -i /tmp/http-scale-http.json -d 60 -o /tmp/results.csv'
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - Workload finished, results:
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - {
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "termination": "http",
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "test_type": "http",
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "uuid": "fa9cff5b-ec25-4669-a93c-de06b3806aa3",
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "requests_per_second": 32381,
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "avg_latency": 1293305100412,
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "latency_95pctl": 668422,
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "latency_99pctl": 3339297,
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "host_network": "true",
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "sample": "2",
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "runtime": 60,
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "routes": 500,
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "conn_per_targetroute": 20,
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "keepalive": 0,
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "tls_reuse": true,
[2021-10-25 21:41:05,991] {subprocess.py:78} INFO - "number_of_routers": "2",
[2021-10-25 21:41:05,992] {subprocess.py:78} INFO - "0": 80059,
[2021-10-25 21:41:05,992] {subprocess.py:78} INFO - "200": 1942910,
[2021-10-25 21:41:05,992] {subprocess.py:78} INFO - "408": 1
[2021-10-25 21:41:05,992] {subprocess.py:78} INFO - }
[2021-10-25 21:41:05,992] {subprocess.py:78} INFO - Indexing documents in router-test-results
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - Traceback (most recent call last):
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - six.raise_from(e, None)
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - File "<string>", line 3, in raise_from
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - httplib_response = conn.getresponse()
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - response.begin()
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - File "/usr/lib64/python3.6/http/client.py", line 307, in begin
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - version, status, reason = self._read_status()
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - File "/usr/lib64/python3.6/http/client.py", line 268, in _read_status
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - File "/usr/lib64/python3.6/socket.py", line 586, in readinto
[2021-10-25 21:41:05,993] {subprocess.py:78} INFO - return self._sock.recv_into(b)
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - socket.timeout: timed out
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO -
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - During handling of the above exception, another exception occurred:
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO -
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - Traceback (most recent call last):
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - File "/usr/local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 250, in perform_request
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - method, url, body, retries=Retry(False), headers=request_headers, **kw
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - _stacktrace=sys.exc_info()[2])
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - File "/usr/lib/python3.6/site-packages/urllib3/util/retry.py", line 344, in increment
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - raise six.reraise(type(error), error, _stacktrace)
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - File "/usr/lib/python3.6/site-packages/urllib3/packages/six.py", line 693, in reraise
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - raise value
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - chunked=chunked)
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 386, in _make_request
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
[2021-10-25 21:41:05,994] {subprocess.py:78} INFO - urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='perf-results-elastic.apps.keith-cluster.perfscale.devcluster.openshift.com', port=80): Read timed out. (read timeout=10)
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO -
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - During handling of the above exception, another exception occurred:
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO -
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - Traceback (most recent call last):
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - File "/workload/workload.py", line 92, in <module>
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - exit(main())
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - File "/workload/workload.py", line 88, in main
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - index_result(payload)
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - File "/workload/workload.py", line 23, in index_result
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - es.index(index=es_index, body=payload)
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - File "/usr/local/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 152, in _wrapped
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - return func(*args, params=params, headers=headers, **kwargs)
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - File "/usr/local/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 402, in index
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - body=body,
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - File "/usr/local/lib/python3.6/site-packages/elasticsearch/transport.py", line 415, in perform_request
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - raise e
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - File "/usr/local/lib/python3.6/site-packages/elasticsearch/transport.py", line 388, in perform_request
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - timeout=timeout,
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - File "/usr/local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 261, in perform_request
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - raise ConnectionTimeout("TIMEOUT", str(e), e)
[2021-10-25 21:41:05,995] {subprocess.py:78} INFO - elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='perf-results-elastic.apps.keith-cluster.perfscale.devcluster.openshift.com', port=80): Read timed out. (read timeout=10))
[2021-10-25 21:41:06,029] {subprocess.py:78} INFO - command terminated with exit code 1
[2021-10-25 21:41:06,032] {subprocess.py:78} INFO - fa9cff5b-ec25-4669-a93c-de06b3806aa3
[2021-10-25 21:41:06,033] {subprocess.py:82} INFO - Command exited with return code 1
[2021-10-25 21:41:06,055] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/bash.py", line 176, in execute
raise AirflowException('Bash command failed. The command returned a non-zero exit code.')
airflow.exceptions.AirflowException: Bash command failed. The command returned a non-zero exit code.
[2021-10-25 21:41:06,057] {taskinstance.py:1505} INFO - Marking task as UP_FOR_RETRY. dag_id=4.8_rosa_default, task_id=router, execution_date=20211025T080342, start_date=20211025T211213, end_date=20211025T214106
[2021-10-25 21:41:06,090] {local_task_job.py:151} INFO - Task exited with return code 1
[2021-10-25 21:41:06,115] {local_task_job.py:261} INFO - 0 downstream tasks scheduled from follow-on schedule check
The uperf result doc that gets created in Google does not explicitly say which test it is for. When searching through previous tests, this makes it very difficult to quickly determine which test was which. Simply adding the test name to the email subject or spreadsheet title would be extremely helpful.
For example, instead of
Subject: Uperf-Test-Results-2021-03-02-16.41.14 - Invitation to edit
use
Subject: Uperf-Test-Results-2021-03-02-16.41.14 - pod-to-pod - Invitation to edit
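Building the suffixed title could be as simple as interpolating the test name into the existing subject string. A minimal sketch, where `TEST_NAME` and the timestamp format are illustrative, not the repo's actual variables:

```shell
# Append the workload/test name to the generated sheet title so it is
# identifiable in search; falls back to "pod-to-pod" for this example.
TEST_NAME=${TEST_NAME:-pod-to-pod}
SUBJECT="Uperf-Test-Results-$(date +%Y-%m-%d-%H.%M.%S) - ${TEST_NAME}"
echo "$SUBJECT"
```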
In clusters accessible only via a proxy, the e2e-benchmark tests using the ripsaw CLI fail even though oc works. The client needs to be updated to configure kubernetes.client with the proxy. Something like this, I think:
import os
from kubernetes import client, config

config.load_kube_config(config_file=kubeconfig_path)
proxy_url = os.getenv('http_proxy')
if proxy_url:
    # Propagate the proxy to the default client configuration
    configuration = client.Configuration.get_default_copy()
    configuration.proxy = proxy_url
    client.Configuration.set_default(configuration)
self.api_client = client.ApiClient()
I will try to put up a PR if I can figure out how to (easily) test ripsaw CLI changes locally.
The code assumes the latency is always in usec, but it could be in ms or sec. Earlier the latency was reported in ms, but now it is in us.
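Normalizing to a single unit before indexing would avoid this ambiguity. A minimal sketch (the function name is illustrative; the real fix would need to read the unit the benchmark actually reports):

```shell
# Convert an integer latency value with a unit suffix to microseconds.
to_usec() {
  local value=$1 unit=$2
  case $unit in
    us|usec) echo "$value" ;;
    ms|msec) echo $((value * 1000)) ;;
    s|sec)   echo $((value * 1000000)) ;;
    *) echo "unknown latency unit: $unit" >&2; return 1 ;;
  esac
}
```

For example, `to_usec 5 ms` yields `5000`, so documents indexed from older ms-based runs and newer us-based runs stay comparable.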
router-perf-v2 failed because the endpoints cannot be reached.
Mon Dec 13 09:28:58 AM UTC 2021 Testing all routes before triggering the workload
curl: (52) Empty reply from server
Manually accessing the endpoints also failed:
% oc get route -n http-scale-http --no-headers -o custom-columns="route:.spec.host" | grep 499
http-perf-499-http-scale-http.apps.qili-aws-ovn.qe.devcluster.openshift.com
% oc get pod -n http-scale-http | grep 499
http-perf-499 1/1 Running 0 57m
% curl --retry 3 --connect-timeout 5 -sSk http://http-perf-499-http-scale-http.apps.qili-aws-ovn.qe.devcluster.openshift.com
curl: (52) Empty reply from server
Output of kube-burner
version
kube-burner-0.9.1
Describe the bug
When running with oc client versions above 1.19, errors like #178 appear.
To Reproduce
Expected behavior
The test should run without error
Screenshots or output
An error seen in #178
Error: unknown flag: --type
See 'kubectl set --help' for usage.
Additional context
Note: This is a proposal and subject to change
Currently this repo is mainly a collection of bash scripts that orchestrate templating out Kubernetes manifests for benchmark-operator CRs. While this has worked for us in the past, I think we're getting to a point where we have to look at how this project should mature and how we can make it more consumable and reliable than it is currently.
Current Problems:
I think most of these problems can be remedied by turning this project into a CLI tool for which we release versioned artifacts. IMO this would best be done by moving this project to Python and publishing pip packages for versions that pass CI.
An ideal usage of this package could be something like this:
pip install openshift-benchmarks
# install benchmark-operator
openshift-benchmarks install-operator -n my-ripsaw
# show all workloads
openshift-benchmarks workloads list
# shows global configs like es and other commands
openshift-benchmarks -h
# shows uperf specific config
openshift-benchmarks uperf -h
# run uperf workload
openshift-benchmarks uperf run --networkpolicy true
# uninstall operator
openshift-benchmarks destroy-operator
It would also be nice to use this as a library in other Python code if possible. This would be great for using it in Airflow but isn't a hard requirement.
I've noticed that some of the logic from this old PR to the benchmark-operator, cloud-bulldozer/benchmark-operator#437, could be used as a starting point.
The etcd-perf test runs fio to capture the latency/fsync metrics on the disk. Currently the pod gets scheduled on one of the worker nodes, which might not use the same disk as the master nodes; we need to make sure the test pod is scheduled on one of the master nodes so that it hits the storage used by etcd for reads and writes.
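A minimal sketch of the scheduling constraints that would achieve this, assuming a plain pod spec is being templated (the pod name and fio image here are illustrative, not the real benchmark CR):

```shell
# Hedged sketch: pin the fio pod to a master node and tolerate the
# master taint. Pod name and image are illustrative placeholders.
cat > etcd-perf-pod.yml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: etcd-perf
spec:
  nodeSelector:
    node-role.kubernetes.io/master: ""
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  containers:
  - name: fio
    image: quay.io/example/fio:latest
EOF
echo "wrote $(wc -l < etcd-perf-pod.yml) lines"
```

The toleration is required in addition to the nodeSelector, since master nodes normally carry a NoSchedule taint.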
We are blindly removing the openshift-operators-redhat namespace during install/cleanup. This is bad practice, as it may be used by other RH operators and could cause future problems. We should remove this namespace more selectively or, even better, only remove what we ourselves added.
Line reference:
Failed execution of router on airflow:
[2021-11-18, 04:04:55 CET] {subprocess.py:89} INFO - touchstone_compare --database elasticsearch -url http://elastic:62cuyJA229jfFl604nUC54TV@perf-results-elastic.apps.keith-cluster.perfscale.devcluster.openshift.com:80 -u 73571683-b560-4093-a63e-bee5ead321a0 --config /home/airflow/workspace/e2e-benchmarking/workloads/router-perf-v2/mb-touchstone.json -o csv --output-file /home/airflow/workspace/e2e-benchmarking/workloads/router-perf-v2/ingress-performance.csv --rc 0
[2021-11-18, 04:04:58 CET] {subprocess.py:89} INFO - Thu Nov 18 03:04:58 UTC 2021 Installing requirements to generate spreadsheet
[2021-11-18, 04:05:03 CET] {subprocess.py:89} INFO - WARNING: You are using pip version 21.1.1; however, version 21.3.1 is available.
[2021-11-18, 04:05:03 CET] {subprocess.py:89} INFO - You should consider upgrading via the '/tmp/tmp.FcwXikxljr/bin/python -m pip install --upgrade pip' command.
[2021-11-18, 04:05:03 CET] {subprocess.py:89} INFO - ../../utils/common.sh: line 81: ./csv_gen.py: No such file or directory
[2021-11-18, 04:05:03 CET] {subprocess.py:89} INFO - 73571683-b560-4093-a63e-bee5ead321a0
[2021-11-18, 04:05:03 CET] {subprocess.py:93} INFO - Command exited with return code 127
https://github.com/cloud-bulldozer/e2e-benchmarking/blob/master/utils/common.sh#L81
The current pod-density version creates a specified number of sleep pods in a namespace; we also need a heavy version, similar to node-density-heavy, that creates heavy applications. We can leverage node-density-heavy today by configuring the node and pod counts, but it also creates services, which is a limitation for this test: there can't be more than 5000 services per namespace due to the ARG_MAX limitation on the host, and pod-density would need to create >= 25000 pods to validate and push the cluster maximums.
I think we could print some of the workload logs (not the benchmark-operator ones, but those from the actual pods of the workload) when a benchmark fails. We're currently blind when that happens, which sometimes means re-running it manually and hence wasted time.
A good place to add this feature could be the run_benchmark function:
e2e-benchmarking/utils/benchmark-operator.sh
Lines 53 to 59 in fc86aa5
While running https://github.com/cloud-bulldozer/e2e-benchmarking/blob/master/workloads/kube-burner/run_maxservices_test_fromgit.sh and https://github.com/cloud-bulldozer/e2e-benchmarking/blob/master/workloads/kube-burner/run_maxnamespaces_test_fromgit.sh, job iterations are set to 0 even after setting the value of TEST_JOB_ITERATIONS in the script files.
I was able to trace the issue to a missing export keyword before TEST_JOB_ITERATIONS. After including the export keyword in the script file, the issue is resolved.
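The failure mode is easy to reproduce in isolation: a plain shell assignment is invisible to child processes (such as kube-burner), while an exported one is inherited. A minimal sketch:

```shell
# Without export, the variable stays local to the current shell; a
# child process (a subordinate bash here, standing in for kube-burner)
# only sees the fallback default of 0.
TEST_JOB_ITERATIONS=500
unexported=$(bash -c 'echo "${TEST_JOB_ITERATIONS:-0}"')

# With export, the child process inherits the value.
export TEST_JOB_ITERATIONS=500
exported=$(bash -c 'echo "${TEST_JOB_ITERATIONS:-0}"')

echo "without export: $unexported, with export: $exported"
# → without export: 0, with export: 500
```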
Sometimes, when large numbers of iterations of the cluster-density test are run, {{rand 4}} or {{rand 10}} produces a purely numeric string and the ConfigMap cannot be created properly. See the output below.
09-02 01:18:39.880 time="2021-09-02 05:18:39" level=error msg="Error creating object: ConfigMap in version "v1" cannot be handled as a ConfigMap: v1.ConfigMap.Data: ReadString: expects " or n, but found 9, error found in #10 byte of ...|:{"key1":9764,"key2"|..., bigger context ...|{"apiVersion":"v1","data":{"key1":9764,"key2":"sQxmxGVmZV"},"kind":"ConfigMap","metad|..."
09-02 01:18:39.880 time="2021-09-02 05:18:39" level=error msg="Retrying object creation"
09-02 01:18:40.529 time="2021-09-02 05:18:40" level=error msg="Error creating object: ConfigMap in version "v1" cannot be handled as a ConfigMap: v1.ConfigMap.Data: ReadString: expects " or n, but found 9, error found in #10 byte of ...|:{"key1":9764,"key2"|..., bigger context ...|{"apiVersion":"v1","data":{"key1":9764,"key2":"sQxmxGVmZV"},"kind":"ConfigMap","metad|..."
09-02 01:18:40.529 time="2021-09-02 05:18:40" level=error msg="Retrying object creation"
09-02 01:18:43.869 time="2021-09-02 05:18:43" level=error msg="Error creating object: ConfigMap in version "v1" cannot be handled as a ConfigMap: v1.ConfigMap.Data: ReadString: expects " or n, but found 9, error found in #10 byte of ...|:{"key1":9764,"key2"|..., bigger context ...|{"apiVersion":"v1","data":{"key1":9764,"key2":"sQxmxGVmZV"},"kind":"ConfigMap","metad|..."
I think a quick fix for this is to add quotes around "{{rand 4}}" so that if the value isn't a string it is passed as one and the ConfigMap can be created.
Testing out a fix in: https://github.com/paigerube14/e2e-benchmarking/tree/quotes
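The shape of the fix, sketched on a hypothetical kube-burner ConfigMap template (the real template lives in the cluster-density config), is to quote the rand calls so the rendered value is always a YAML string:

```shell
# Hedged sketch: quoting {{rand 4}} so an all-digit result is still
# rendered as a string. The template below is illustrative only.
cat > configmap.yml.tmpl <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap-{{.Replica}}
data:
  key1: "{{rand 4}}"
  key2: "{{rand 10}}"
EOF
# Both data values are now quoted in the template.
grep -c '"{{rand' configmap.yml.tmpl
```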
OpenShift needs a standardized tool that can be used to synthesize actual workloads for the cluster based on the node count.
So for example I would expect something like this as the user experience.
./benchmark --workers 3
This would run what OCP considers the high end of a workload for 3 nodes. The expectation is that, provided the hardware requirements for the cluster are met, the test will pass. If the test fails, the cluster-admin should be able to go to the console and review the alerts on the cluster; those alerts should give the admin direct resolution steps for the performance problems, e.g. disks too slow.
With the recent enhancement - cloud-bulldozer/benchmark-operator#323 - we will be able to enable Cerberus by passing a URL to Ripsaw. The workloads in plow need to be modified to enable it and act accordingly.
The current upgrade tests at https://github.com/cloud-bulldozer/e2e-benchmarking/blob/master/workloads/upgrade-perf/run_upgrade_fromgit.sh seem too simplistic. I would like to see this improved based on the recent upgrade testing we did for a customer. It is important to load up the cluster, and here are some suggested improvements to the script.
Some sample YAMLs:
apiVersion: apps/v1
kind: Deployment
metadata:
name: sampleapp
spec:
replicas: 300
selector:
matchLabels:
app: sample
template:
metadata:
labels:
app: sample
spec:
containers:
- name: app
image: quay.io/smalleni/sampleapp:latest
readinessProbe:
httpGet:
path: /
port: 8080
initialDelaySeconds: 3
ports:
- containerPort: 8080
protocol: TCP
resources:
requests:
cpu: "1"
limits:
cpu: "1"
nodeSelector:
app: "true"
apiVersion: v1
kind: Service
metadata:
name: samplesvc
spec:
selector:
app: sample
ports:
- port: 80
targetPort: 8080
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
name: except
spec:
podSelector:
matchLabels:
app: sample
ingress:
- from:
- ipBlock:
cidr: 10.128.0.0/14
except:
- "10.130.36.0/23"
- "10.130.12.0/23"
- "10.128.18.0/23"
- "10.131.10.0/23"
- "10.131.22.0/23"
- "10.128.24.0/23"
- "10.128.14.0/23"
e2e-benchmarking/workloads/network-perf/common.sh
Lines 235 to 242 in de3a11a
[2021-10-15 09:14:34,406] {subprocess.py:78} INFO - + sleep 60
[2021-10-15 09:15:34,407] {subprocess.py:78} INFO - + for i in $(seq 1 $_timeout)
[2021-10-15 09:15:34,408] {subprocess.py:78} INFO - ++ oc get nodes --no-headers -l 'node-role.kubernetes.io/worker,node-role.kubernetes.io/master!=,node-role.kubernetes.io/infra!=,node-role.kubernetes.io/workload!=' --ignore-not-found
[2021-10-15 09:15:34,408] {subprocess.py:78} INFO - ++ grep -v NAME
[2021-10-15 09:15:34,408] {subprocess.py:78} INFO - ++ wc -l
[2021-10-15 09:15:34,880] {subprocess.py:78} INFO - + current_workers=18
[2021-10-15 09:15:34,880] {subprocess.py:78} INFO - + echo 'Current worker count: 18'
[2021-10-15 09:15:34,880] {subprocess.py:78} INFO - Current worker count: 18
[2021-10-15 09:15:34,880] {subprocess.py:78} INFO - + echo 'Desired worker count: 3'
[2021-10-15 09:15:34,880] {subprocess.py:78} INFO - Desired worker count: 3
[2021-10-15 09:15:34,880] {subprocess.py:78} INFO - + oc describe -n benchmark-operator benchmarks/scale
[2021-10-15 09:15:34,880] {subprocess.py:78} INFO - + grep State
[2021-10-15 09:15:34,880] {subprocess.py:78} INFO - + grep Complete
[2021-10-15 09:15:35,337] {subprocess.py:78} INFO - State: Complete
[2021-10-15 09:15:35,337] {subprocess.py:78} INFO - + '[' 0 -eq 0 ']'
[2021-10-15 09:15:35,337] {subprocess.py:78} INFO - + '[' 18 -eq 3 ']'
[2021-10-15 09:15:35,337] {subprocess.py:78} INFO - + echo 'Scaling completed but desired worker count is not equal to current worker count!'
[2021-10-15 09:15:35,338] {subprocess.py:78} INFO - Scaling completed but desired worker count is not equal to current worker count!
[2021-10-15 09:15:35,338] {subprocess.py:78} INFO - + break
[2021-10-15 09:15:35,338] {subprocess.py:78} INFO - + '[' 1 == 1 ']'
[2021-10-15 09:15:35,338] {subprocess.py:78} INFO - + echo 'Scaling failed'
[2021-10-15 09:15:35,338] {subprocess.py:78} INFO - Scaling failed
[2021-10-15 09:15:35,338] {subprocess.py:78} INFO - + exit 1
When running the router-v2 test by hand with comparisons enabled, the following error is seen from touchstone:
+ touchstone_compare --database elasticsearch -url https://MY_ES_SERVER -u null -o yaml --config config/mb.json --tolerancy-rules tolerancy-configs/mb.yaml
+ grep -v ERROR
+ tee compare_output_9.yaml
2021-09-09 15:45:12,746 - touchstone - ERROR - Error: Issue capturing results from elasticsearch using config {'filter': {'test_type': 'http'}, 'buckets': ['routes', 'conn_per_targetroute', 'keepalive'], 'aggregations': {'requests_per_second': ['avg'], 'latency_95pctl': ['avg']}}
2021-09-09 15:45:12,763 - touchstone - ERROR - Error: Issue capturing results from elasticsearch using config {'filter': {'test_type': 'edge'}, 'buckets': ['routes', 'conn_per_targetroute', 'keepalive'], 'aggregations': {'requests_per_second': ['avg'], 'latency_95pctl': ['avg']}}
2021-09-09 15:45:12,780 - touchstone - ERROR - Error: Issue capturing results from elasticsearch using config {'filter': {'test_type': 'passthrough'}, 'buckets': ['routes', 'conn_per_targetroute', 'keepalive'], 'aggregations': {'requests_per_second': ['avg'], 'latency_95pctl': ['avg']}}
2021-09-09 15:45:12,796 - touchstone - ERROR - Error: Issue capturing results from elasticsearch using config {'filter': {'test_type': 'reencrypt'}, 'buckets': ['routes', 'conn_per_targetroute', 'keepalive'], 'aggregations': {'requests_per_second': ['avg'], 'latency_95pctl': ['avg']}}
2021-09-09 15:45:12,813 - touchstone - ERROR - Error: Issue capturing results from elasticsearch using config {'filter': {'test_type': 'mix'}, 'buckets': ['routes', 'conn_per_targetroute', 'keepalive'], 'aggregations': {'requests_per_second': ['avg'], 'latency_95pctl': ['avg']}}
2021-09-09 15:45:12,813 - touchstone - ERROR - Key test_type key not found in current dict level: []
{}
Looking at the touchstone_compare line, the UUID passed is null. However, at the top of the output you can see the UUID is set:
09-09-2021T12:23:01 Small scale scenario detected: #workers < 24
09-09-2021T12:23:01 Deploying benchmark infrastructure
time="2021-09-09 12:23:01" level=info msg="Setting log level to info"
time="2021-09-09 12:23:01" level=info msg="🔥 Starting kube-burner with UUID 5f69ab80-8a40-4b00-80a6-54ee61e47018"
Example env.sh file sourced
# General
export KUBECONFIG=/root/gcp/gcp_kube
export UUID=$(uuidgen)
# ES configuration
export ES_SERVER=MY_ES_SERVER
export ES_INDEX=${ES_INDEX:-router-test-results}
export ES_SERVER_BASELINE=MY_ES_BASELINE
# Gold comparison
COMPARE_WITH_GOLD=true
ES_GOLD=${ES_GOLD:-${ES_SERVER}}
GOLD_SDN=${GOLD_SDN:-openshiftsdn}
GOLD_OCP_VERSION=4.8
# Environment setup
NUM_NODES=$(oc get node -l node-role.kubernetes.io/worker --no-headers | grep -cw Ready)
ENGINE=${ENGINE:-podman}
KUBE_BURNER_RELEASE_URL=${KUBE_BURNER_RELEASE_URL:-https://github.com/cloud-bulldozer/kube-burner/releases/download/v0.11/kube-burner-0.11-Linux-x86_64.tar.gz}
KUBE_BURNER_IMAGE=quay.io/cloud-bulldozer/kube-burner:latest
TERMINATIONS=${TERMINATIONS:-"http edge passthrough reencrypt mix"}
INFRA_TEMPLATE=http-perf.yml.tmpl
INFRA_CONFIG=http-perf.yml
export SERVICE_TYPE=${SERVICE_TYPE:-NodePort}
export NUMBER_OF_ROUTERS=${NUMBER_OF_ROUTERS:-2}
export HOST_NETWORK=${HOST_NETWORK:-true}
export NODE_SELECTOR=${NODE_SELECTOR:-'{node-role.kubernetes.io/workload: }'}
# Benchmark configuration
RUNTIME=${RUNTIME:-60}
TLS_REUSE=${TLS_REUSE:-true}
URL_PATH=${URL_PATH:-/1024.html}
SAMPLES=${SAMPLES:-2}
QUIET_PERIOD=${QUIET_PERIOD:-60s}
KEEPALIVE_REQUESTS=${KEEPALIVE_REQUESTS:-"0 1 50"}
# Comparison and csv generation
THROUGHPUT_TOLERANCE=${THROUGHPUT_TOLERANCE:-5}
LATENCY_TOLERANCE=${LATENCY_TOLERANCE:-5}
PREFIX=${PREFIX:-$(oc get clusterversion version -o jsonpath="{.status.desired.version}")}
LARGE_SCALE_THRESHOLD=${LARGE_SCALE_THRESHOLD:-24}
METADATA_COLLECTION=${METADATA_COLLECTION:-true}
SMALL_SCALE_BASELINE_UUID=29d520a2-039a-4a1e-b139-83fe2e63fda1
LARGE_SCALE_BASELINE_UUID=9df8255d-2038-42ed-869d-f748f671da07
GSHEET_KEY_LOCATION=/root/gcp/gsheet
EMAIL_ID_FOR_RESULTS_SHEET="[email protected]"
Hi.
The README says that I need to install Python requirements, but there is no requirements.txt file.
Do I still need to install requirements?
[2021-11-29, 22:56:37 EST] {subprocess.py:89} INFO - Google Spreadsheet link -> https://docs.google.com/spreadsheets/d/1whYNQ1tjYoQdYGGGSod2-O1XQbAO5EK7GhnIsmvICqw
[2021-11-29, 22:56:37 EST] {subprocess.py:89} INFO - Tue Nov 30 03:56:37 UTC 2021 Removing touchstone
[2021-11-29, 22:56:37 EST] {subprocess.py:89} INFO - ../../utils/compare.sh: line 14: deactivate: command not found
When testing some clusters we may not have easy access to a kubeconfig (think some managed services). Allowing the option to provide a login API address, user, and password would increase the usability of the e2e testing framework. Without this ability we will be unable to add these platforms to our pipeline testing.
We currently index the scale-up and upgrade timings to Elasticsearch; it might be useful to also display them on stdout at the end of the job run for cases where 1) we want to quickly take a look at how long it took, or 2) we don't use Elasticsearch - it defaults to a public instance, but the user might not know about it.
Thoughts?
The wait_for_benchmark function in the kube-burner common.sh does not have any timeout, so it could run forever. We have an environment variable that gets set at the top of the file (JOB_TIMEOUT) but never actually use it in the script.
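A sketch of how JOB_TIMEOUT could be wired in; benchmark_complete is a stand-in for the real oc status check inside wait_for_benchmark:

```shell
# Hedged sketch: bound the polling loop with the already-defined
# JOB_TIMEOUT variable instead of waiting forever.
benchmark_complete() { false; }  # stand-in for the oc benchmark status check

wait_for_benchmark() {
  local timeout=${JOB_TIMEOUT:-7200} elapsed=0 interval=1
  until benchmark_complete; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Benchmark timed out after ${timeout}s"
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# With the stub condition never completing, a 2s timeout fires and the
# non-zero exit code reaches the caller.
JOB_TIMEOUT=2 wait_for_benchmark || echo "failure propagated to caller"
```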
While running uperf service IP testing on 100-node clusters across AWS, Azure and GCP, we saw a consistent pattern of errors like the following:
18:25:41 2021-05-06T18:25:28Z - INFO - MainProcess - wrapper_factory: identified uperf as the benchmark wrapper
18:25:41 2021-05-06T18:25:28Z - INFO - MainProcess - trigger_uperf: Starting sample 1 out of 3
18:25:41 2021-05-06T18:25:28Z - ERROR - MainProcess - trigger_uperf: UPerf failed to execute, trying one more time..
18:25:41 2021-05-06T18:25:28Z - ERROR - MainProcess - trigger_uperf: stdout: Error getting SSL CTX:1
18:25:41 Allocating shared memory of size 156624 bytes
18:25:41 Error connecting to 172.30.46.82
18:25:41
18:25:41 ** TCP: Cannot connect to 172.30.46.82:20000 Connection refused
18:25:41 2021-05-06T18:25:28Z - ERROR - MainProcess - trigger_uperf: stderr:
18:25:41 2021-05-06T18:25:28Z - CRITICAL - MainProcess - trigger_uperf: UPerf failed to execute a second time, stopping...
18:25:41 2021-05-06T18:25:28Z - CRITICAL - MainProcess - trigger_uperf: stdout: Error getting SSL CTX:1
18:25:41 Allocating shared memory of size 156624 bytes
18:25:41 Error connecting to 172.30.46.82
18:25:41
18:25:41 ** TCP: Cannot connect to 172.30.46.82:20000 Connection refused
Talking to @jtaleric, @dry923 and @mohit-sheth, they suggested this is a potential bug and to file it as an issue so it can be addressed.
The same error is seen with METADATA_COLLECTION=false or METADATA_COLLECTION=true; no other parameters are being passed except ES_SERVER, which is our own server and works well.
CC : @mffiedler
Today, we pick two worker nodes to pin the uperf client and server pods to.
We should look at pinning the server/client pods either on worker nodes in the same availability zone or across different availability zones, so that we can get more consistent results.
Initially, we can start with the same availability zone.
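The selection logic could look like the following sketch: group workers by their topology.kubernetes.io/zone label and pick the first two that share a zone. The "zone node" list here is illustrative; in the script it would come from an oc get nodes jsonpath over the zone label.

```shell
# Hedged sketch: find two workers in the same availability zone.
# Illustrative input in "zone node" format.
nodes="us-east-1a worker-0
us-east-1b worker-1
us-east-1a worker-2"

# Sort by zone, then emit the first pair of nodes sharing a zone.
pair=$(echo "$nodes" | sort |
  awk '$1 == prev {print first, $2; exit} {prev = $1; first = $2}')
echo "$pair"  # → worker-0 worker-2
```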
During an e2e test we assume the kubectl, oc, etc. packages are at a suitable version. This can lead to errors and time wasted debugging and fixing. We should add a common step to ensure we are at the correct package versions for the current implementation.
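A minimal pre-flight sketch: the helper compares dotted version strings, while the commented oc invocation and the minimum version are assumptions for illustration.

```shell
# Hedged sketch: a shared helper the common setup step could use to
# verify client tooling before a run.
version_ge() {
  # true when $1 >= $2, comparing dotted version strings with sort -V
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# In common.sh this could gate the run, e.g. (illustrative):
#   current=$(oc version --client | awk '/Client Version/ {print $3}')
#   version_ge "$current" "4.6.0" || { echo "oc too old"; exit 1; }

version_ge "1.20.0" "1.19.0" && echo "1.20.0 >= 1.19.0"
version_ge "1.18.0" "1.19.0" || echo "1.18.0 < 1.19.0"
```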
Clusters might not have the workload node, for example the cluster used by the e2e CI. We need to expose the parameters which enables/disables running the test orchestrator from the workload node:
export WORKLOAD_JOB_NODE_SELECTOR=<true/false>
export WORKLOAD_JOB_TAINT=<true/false>
Also, it looks like the comparison runs even when COMPARE is set to false; it should run only when COMPARE=true.
When running the network test (uperf) on a GCP cluster, the following error is hit:
2021-09-15T14:33:25Z - INFO - MainProcess - process: Collecting 3 samples of command ['uperf', '-v', '-a', '-R', '-i', '1', '-m', '/tmp/uperf-test/uperf-stream-tcp-16384-16384-1']
2021-09-15T14:33:26Z - WARNING - MainProcess - process: Got bad return code from command: ['uperf', '-v', '-a', '-R', '-i', '1', '-m', '/tmp/uperf-test/uperf-stream-tcp-16384-16384-1'].
2021-09-15T14:33:28Z - WARNING - MainProcess - process: Got bad return code from command: ['uperf', '-v', '-a', '-R', '-i', '1', '-m', '/tmp/uperf-test/uperf-stream-tcp-16384-16384-1'].
2021-09-15T14:33:29Z - WARNING - MainProcess - process: Got bad return code from command: ['uperf', '-v', '-a', '-R', '-i', '1', '-m', '/tmp/uperf-test/uperf-stream-tcp-16384-16384-1'].
2021-09-15T14:33:29Z - CRITICAL - MainProcess - process: After 3 attempts, unable to run command: ['uperf', '-v', '-a', '-R', '-i', '1', '-m', '/tmp/uperf-test/uperf-stream-tcp-16384-16384-1']
2021-09-15T14:33:29Z - WARNING - MainProcess - process: Sample 1 has failed state for command ['uperf', '-v', '-a', '-R', '-i', '1', '-m', '/tmp/uperf-test/uperf-stream-tcp-16384-16384-1']
2021-09-15T14:33:29Z - CRITICAL - MainProcess - uperf: Uperf failed to run! Got results: ProcessSample(expected_rc=0, success=False, attempts=3, timeout=None, failed=[ProcessRun(rc=-11, stdout='Error getting SSL CTX:1\nAllocating shared memory of size 156624 bytes\nCompleted handshake phase 1\nStarting handshake phase 2\nHandshake phase 2 with 10.0.128.3\n Done preprocessing accepts\n Sent handshake header\n Sending workorder\n Sent workorder\n Sent transaction\n Sent flowop\n Sent transaction\n Sent flowop\n Sent transaction\n Sent flowop\nTX worklist success Sent workorder\nHandshake phase 2 with 10.0.128.3 done\nCompleted handshake phase 2\nStarting 1 threads running profile:stream-tcp-16384-16384-1 ... 0.00 seconds\n', stderr='', time_seconds=1.363361, hit_timeout=False), ProcessRun(rc=-11, stdout='Error getting SSL CTX:1\nAllocating shared memory of size 156624 bytes\nCompleted handshake phase 1\nStarting handshake phase 2\nHandshake phase 2 with 10.0.128.3\n Done preprocessing accepts\n Sent handshake header\n Sending workorder\n Sent workorder\n Sent transaction\n Sent flowop\n Sent transaction\n Sent flowop\n Sent transaction\n Sent flowop\nTX worklist success Sent workorder\nHandshake phase 2 with 10.0.128.3 done\nCompleted handshake phase 2\nStarting 1 threads running profile:stream-tcp-16384-16384-1 ... 0.00 seconds\n', stderr='', time_seconds=1.401737, hit_timeout=False), ProcessRun(rc=-11, stdout='Error getting SSL CTX:1\nAllocating shared memory of size 156624 bytes\nCompleted handshake phase 1\nStarting handshake phase 2\nHandshake phase 2 with 10.0.128.3\n Done preprocessing accepts\n Sent handshake header\n Sending workorder\n Sent workorder\n Sent transaction\n Sent flowop\n Sent transaction\n Sent flowop\n Sent transaction\n Sent flowop\nTX worklist success Sent workorder\nHandshake phase 2 with 10.0.128.3 done\nCompleted handshake phase 2\nStarting 1 threads running profile:stream-tcp-16384-16384-1 ... 
0.00 seconds\n', stderr='', time_seconds=1.323687, hit_timeout=False)], successful=None)
OpenShift cluster upgrades use the channels defined in Cincinnati - https://github.com/openshift/cincinnati-graph-data/tree/master/channels - to determine whether any upgrades are available. The channel is set to stable-4.x or candidate-4.x, and neither of them tracks the nightly OCP builds, since those are not GA.
Problem:
The target cluster has a single worker node and only one router. With NUMBER_OF_ROUTERS=1 and NODE_SELECTOR={node-role.kubernetes.io/worker: }, router-perf-v2 can't work.
Error logs:
08-23 16:30:30.977 23-08-2021T08:30:30 Scaling number of routers to 1
08-23 16:30:31.243 deployment.apps/router-default scaled
08-23 16:30:31.526 Waiting for deployment "router-default" rollout to finish: 1 old replicas are pending termination...
08-23 16:40:40.639 error: deployment "router-default" exceeded its progress deadline
Analysis of the cause:
After running this line of code, a new replica set, router-default-d9888dff8, is created to roll the change out to a new pod.
% oc get pods -n openshift-ingress
NAME READY STATUS RESTARTS AGE
router-default-5844bb8f66-jhxph 1/1 Running 0 3h27m
router-default-d9888dff8-pb4kg 0/1 Pending 0 11m
But because of the anti-affinity rule, the new pod cannot be scheduled.
% oc describe pod router-default-d9888dff8-pb4kg -n openshift-ingress
...
Controlled By: ReplicaSet/router-default-d9888dff8
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 18s default-scheduler 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules.
Then when the following code is run
e2e-benchmarking/workloads/router-perf-v2/common.sh
Lines 63 to 64 in 9da00a2
the following error happens:
% oc scale --replicas=1 -n openshift-ingress deploy/router-default
deployment.apps/router-default scaled
% oc rollout status -n openshift-ingress deploy/router-default
error: deployment "router-default" exceeded its progress deadline
The scale-up tries to operate on the replica set router-default-d9888dff8, which is not READY.
% oc describe -n openshift-ingress deploy/router-default
...
OldReplicaSets: router-default-5844bb8f66 (1/1 replicas created), router-default-d9888dff8 (1/1 replicas created)
...
Normal ScalingReplicaSet 22m (x5 over 128m) deployment-controller Scaled up replica set router-default-d9888dff8 to 1
% oc get rs -n openshift-ingress
NAME DESIRED CURRENT READY AGE
router-default-5844bb8f66 1 1 1 3h44m
router-default-d9888dff8 1 1 0 134m
Proposal:
To make router-perf-v2 work on a single-worker-node cluster, one proposal is to add logic so that when NUMBER_OF_ROUTERS is set to -1, the tune_liveness_probe and enable_ingress_operator functions are disabled.
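The proposal amounts to a guard like the following sketch; the two functions here are echo stand-ins for the real ones in router-perf-v2/common.sh:

```shell
# Hedged sketch of the proposed guard. The functions are placeholders
# for the real implementations.
tune_liveness_probe() { echo "tuning liveness probes"; }
enable_ingress_operator() { echo "re-enabling ingress operator"; }

NUMBER_OF_ROUTERS=${NUMBER_OF_ROUTERS:--1}
if [ "$NUMBER_OF_ROUTERS" -ne -1 ]; then
  tune_liveness_probe
  enable_ingress_operator
else
  echo "NUMBER_OF_ROUTERS=-1: leaving the router deployment untouched"
fi
```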
Following cloud-bulldozer/benchmark-operator#315, which renames a few metadata-related parameters in the CR (e.g. metadata_privileged is now metadata.privileged), the current scripts will also need to be updated.
The router test has been periodically failing with the following error:
...
Indexing documents in router-test-results
10-09-2021T18:56:46 Sleeping for 6s before next test
10-09-2021T18:56:52 Generating config for termination passthrough with 200 clients 0 keep alive requests and path /1024.html
10-09-2021T18:56:53 Copying mb config http-scale-passthrough.json to pod http-scale-client-6fc5db9645-9lpld
10-09-2021T18:56:55 Executing sample 1/1 using termination passthrough with 200 clients and 0 keepalive requests
Executing 'mb -i /tmp/http-scale-passthrough.json -d 1 -o /tmp/results.csv'
Traceback (most recent call last):
File "/usr/lib64/python3.6/subprocess.py", line 425, in run
stdout, stderr = process.communicate(input, timeout=timeout)
File "/usr/lib64/python3.6/subprocess.py", line 863, in communicate
stdout, stderr = self._communicate(input, endtime, timeout)
File "/usr/lib64/python3.6/subprocess.py", line 1535, in _communicate
self._check_timeout(endtime, orig_timeout)
File "/usr/lib64/python3.6/subprocess.py", line 891, in _check_timeout
raise TimeoutExpired(self.args, orig_timeout)
subprocess.TimeoutExpired: Command 'mb -i /tmp/http-scale-passthrough.json -d 1 -o /tmp/results.csv' timed out after 5 seconds
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/workload/workload.py", line 92, in <module>
exit(main())
File "/workload/workload.py", line 66, in main
result_codes, p95_latency, p99_latency, avg_latency = run_mb(args.mb_config, args.runtime, args.output)
File "/workload/workload.py", line 35, in run_mb
timeout=int(runtime) * 5)
File "/usr/lib64/python3.6/subprocess.py", line 430, in run
stderr=stderr)
subprocess.TimeoutExpired: Command 'mb -i /tmp/http-scale-passthrough.json -d 1 -o /tmp/results.csv' timed out after 5 seconds
command terminated with exit code 1
Example env.sh
# General
export KUBECONFIG=/root/gcp/gcp_kube_3
export UUID=$(uuidgen)
# ES configuration
export ES_SERVER=ES_SERVER
export ES_INDEX=${ES_INDEX:-router-test-results}
export ES_SERVER_BASELINE=ES_SERVER_BASELINE
# Gold comparison
COMPARE_WITH_GOLD="false"
ES_GOLD=${ES_GOLD:-${ES_SERVER}}
GOLD_SDN=${GOLD_SDN:-openshiftsdn}
GOLD_OCP_VERSION=4.8
# Environment setup
NUM_NODES=$(oc get node -l node-role.kubernetes.io/worker --no-headers | grep -cw Ready)
ENGINE=${ENGINE:-podman}
KUBE_BURNER_RELEASE_URL=${KUBE_BURNER_RELEASE_URL:-https://github.com/cloud-bulldozer/kube-burner/releases/download/v0.11/kube-burner-0.11-Linux-x86_64.tar.gz}
KUBE_BURNER_IMAGE=quay.io/cloud-bulldozer/kube-burner:latest
TERMINATIONS=${TERMINATIONS:-"http edge passthrough reencrypt mix"}
INFRA_TEMPLATE=http-perf.yml.tmpl
INFRA_CONFIG=http-perf.yml
export SERVICE_TYPE=${SERVICE_TYPE:-NodePort}
export NUMBER_OF_ROUTERS=${NUMBER_OF_ROUTERS:-1}
#export NUMBER_OF_ROUTERS=${NUMBER_OF_ROUTERS:-2}
export HOST_NETWORK=${HOST_NETWORK:-true}
export NODE_SELECTOR=${NODE_SELECTOR:-'{node-role.kubernetes.io/workload: }'}
# Benchmark configuration
#RUNTIME=${RUNTIME:-60}
#TLS_REUSE=${TLS_REUSE:-true}
#URL_PATH=${URL_PATH:-/1024.html}
#SAMPLES=${SAMPLES:-2}
#QUIET_PERIOD=${QUIET_PERIOD:-60s}
#KEEPALIVE_REQUESTS=${KEEPALIVE_REQUESTS:-"0 1 50"}
# Benchmark configuration
RUNTIME=${RUNTIME:-1}
TLS_REUSE=${TLS_REUSE:-true}
URL_PATH=${URL_PATH:-/1024.html}
SAMPLES=${SAMPLES:-1}
QUIET_PERIOD=${QUIET_PERIOD:-6s}
KEEPALIVE_REQUESTS=${KEEPALIVE_REQUESTS:-"0"}
# Comparison and csv generation
THROUGHPUT_TOLERANCE=${THROUGHPUT_TOLERANCE:-5}
LATENCY_TOLERANCE=${LATENCY_TOLERANCE:-5}
PREFIX=${PREFIX:-$(oc get clusterversion version -o jsonpath="{.status.desired.version}")}
LARGE_SCALE_THRESHOLD=${LARGE_SCALE_THRESHOLD:-24}
METADATA_COLLECTION=${METADATA_COLLECTION:-true}
SMALL_SCALE_BASELINE_UUID="29d520a2-039a-4a1e-b139-83fe2e63fda1"
LARGE_SCALE_BASELINE_UUID="9df8255d-2038-42ed-869d-f748f671da07"
GSHEET_KEY_LOCATION=/root/gcp/gsheet
EMAIL_ID_FOR_RESULTS_SHEET="[email protected]"
We need to add support in the workloads for passing the location of the kubeconfig used to access the cluster, where not already supported, and document the option for all the workloads, given that we now support running multiple clusters from the same jump host, meaning the kubeconfig will not be in the default location - $HOME/.kube/config.
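The minimal shape of this, as a sketch each workload script could share near the top, is an environment override with the conventional fallback:

```shell
# Hedged sketch: honor an externally supplied kubeconfig, falling back
# to the default location only when KUBECONFIG is unset.
export KUBECONFIG=${KUBECONFIG:-$HOME/.kube/config}
echo "Using kubeconfig: $KUBECONFIG"
```

A caller running several clusters from one jump host would then simply set KUBECONFIG per cluster before invoking the workload script.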
Currently the upgrade test only captures the pass/fail status based on the cluster being able to upgrade or not. We need to capture metrics including the ones we monitor manually to determine the cluster stability and index them long term similar to cluster density runs to be able to analyze the state of the cluster during/after the upgrade. This is especially useful in CI runs.
The kube-burner binary can be leveraged to run just the indexing: given a metrics profile, it can capture the metrics and index them in ES, to eventually be visualized in Grafana. We can start with the same aggregated metrics profile that the cluster-density test uses.
We need a CI since these scripts are becoming more critical.
Hi @rsevilla87, my team is planning to run these in a Jenkins agent that runs as a pod on OpenShift. I have not previously tried to run docker/podman within an OpenShift pod. Do you happen to know a way to do so?
I believe this script was designed with the assumption that it would be run from a jump host machine in the scale lab, not from a container/pod.
Alternatively, I was thinking of using the kube-burner binary directly within the Jenkins agent pod. Would you be able to accept that as a proposed change if it works for us?
Thanks.
When running the router-v2 test with oc versions above 1.19, the following error is thrown from this line of code.
Common.sh file:
log "Adding workload.py to the client pod"
oc set volumes -n http-scale-client deploy/http-scale-client --type=configmap --mount-path=/workload --configmap-name=workload --add
Error:
Error: unknown flag: --type
See 'kubectl set --help' for usage.
OC version that failed:
# ./oc_1_20 version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0+2f3101c", GitCommit:"2f3101cb663d0cb102ccb9730b63753604f6d29b", GitTreeState:"clean", BuildDate:"2021-02-26T13:55:24Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
OC version that worked:
# oc version
Client Version: 4.6.0-202103060018.p0-aaa9ca3
Server Version: 4.6.21
Kubernetes Version: v1.19.0+2f3101c
workload and csv_gen use some uncommon pre-reqs. Provide and document an easy way to install them to avoid a nasty surprise with csv_gen failing after a long run.
When running the router v2 test with comparisons enabled, a warning is presented after all the tests have run stating that rsync is not available in the container. It is unclear whether this is actually a problem or not.
example error:
Indexing documents in router-test-results
10-09-2021T18:07:55 Sleeping for 6s before next test
10-09-2021T18:08:01 Enabling cluster version and ingress operators
deployment.apps/cluster-version-operator scaled
deployment.apps/ingress-operator scaled
WARNING: cannot use rsync: rsync not available in container
results.csv
10-09-2021T18:08:05 delete tuned profile for node labeled with node-role.kubernetes.io/workload
tuned.tuned.openshift.io "openshift-ingress-performance" deleted
10-09-2021T18:08:06 Deleting infrastructure
@mohit-sheth Should we use http://github.com/openshift-scale/workloads instead?
Error logs:
[2021-11-01, 20:59:19 EDT] {subprocess.py:89} INFO - touchstone_compare --database elasticsearch -url http://elastic:62cuyJA229jfFl604nUC54TV@perf-results-elastic.apps.keith-cluster.perfscale.devcluster.openshift.com:80 -u cee41274-53ea-45f2-adb9-5c9002695df9 --config /home/airflow/workspace/e2e-benchmarking/workloads/router-perf-v2/mb-touchstone.json -o csv --tolerancy-rules /home/airflow/workspace/e2e-benchmarking/workloads/router-perf-v2/mb-tolerancy-rules.yaml --output-file /home/airflow/workspace/e2e-benchmarking/workloads/router-perf-v2/ingress-performance.csv --rc 0
[2021-11-01, 20:59:19 EDT] {subprocess.py:89} INFO - 2021-11-01, 20:59:19 EDT - touchstone - CRITICAL - At least two uuids are required when tolerancy-rules flag is passed
The latest PRs merged into e2e and touchstone are causing this behaviour.
#244
cloud-bulldozer/benchmark-comparison#54
The problem lies in this if-else block in env.sh: tolerance_rules is being set even when it is not explicitly set by the user, because a default is picked up (e2e-benchmarking/utils/compare.sh, line 32 in e37b211).
A possible solution is getting rid of this else block.
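One possible shape for that fix (a sketch of the idea only, not the actual compare.sh code; variable names are illustrative) is to append `--tolerancy-rules` only when more than one UUID is being compared, since touchstone rejects the flag otherwise:

```shell
#!/bin/sh
# Sketch: only pass --tolerancy-rules to touchstone_compare when there
# are actually two or more UUIDs to compare against each other.
# Variable names are illustrative, not necessarily those in compare.sh.
build_flags() {
  uuids="$1" tolerancy_rules="$2"
  flags="-u $uuids"
  case "$uuids" in
    *,*)  # comma-separated list => more than one UUID
      [ -n "$tolerancy_rules" ] && flags="$flags --tolerancy-rules $tolerancy_rules" ;;
  esac
  echo "$flags"
}
```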
I was using the following benchmark definition in Airflow for comparison:
{
"name": "host_network",
"workload": "network-perf",
"command": "./run_hostnetwork_network_test_fromgit.sh test_cloud",
"env": {
"COMPARE": "true",
"COMPARE_WITH_GOLD": "false",
"BASELINE_CLOUD_NAME": "aws",
"BASELINE_HOSTNET_UUID": "1057b072-ae18-5584-9937-bfec75f407e2",
"EMAIL_ID_FOR_RESULTS_SHEET": "[email protected]",
"GSHEET_KEY_LOCATION": "/tmp/key.json"
}
},
When doing so, I am not getting a correct comparison, because the if statement causes the flow to go into the else part, where es_server_baseline is not respected. Ref: e2e-benchmarking/utils/touchstone-compare/run_compare.sh, lines 22 to 27 in 64574a0.
I am not familiar with comparing against the GOLD results, and @mohit-sheth asked me to use a baseline instead. I believe I am setting the variables correctly, but the actual comparison is not being executed correctly.
touchstone_compare --database elasticsearch -url 'https://search-ocp-qe<redacted>.us-east-1.es.amazonaws.com:443' -u 1057b072-ae18-5584-9937-bfec75f407e2 -o yaml --config config/uperf.json --tolerancy-rules tolerancy-configs/uperf.yaml
Although the uperf script passed two UUIDs, `1057b072-ae18-5584-9937-bfec75f407e2,6bb5d96e-c483-56ee-8859-586dc31cc547`, only one was used.
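If the else branch ignores es_server_baseline, one possible shape of the fix (a sketch only; the variable names follow the issue, and whether touchstone accepts multiple `-url` values this way is an assumption) is to choose the server list based on how many UUIDs are being compared:

```shell
#!/bin/sh
# Sketch: when the baseline UUID lives in a different Elasticsearch
# instance, pass one server per UUID instead of reusing ES_SERVER for
# both. Whether touchstone_compare accepts a space-separated list of
# servers in this form is an assumption.
pick_servers() {
  es_server="$1" es_server_baseline="$2" num_uuids="$3"
  if [ "$num_uuids" -gt 1 ] && [ -n "$es_server_baseline" ]; then
    echo "$es_server_baseline $es_server"
  else
    echo "$es_server"
  fi
}
```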
Seems like there's an error while deleting the resources created by some of the kube-burner benchmarks.
12:54:03 Tue 25 Jan 2022 11:54:03 AM UTC Removing node-density=enabled label from worker nodes
12:54:03 node/ip-10-0-133-30.us-west-2.compute.internal labeled
12:54:03 namespace "15602773-ce4c-478f-9d4e-a91dc2d6a111" deleted
12:54:32 error: You must provide one or more resources by argument or filename.
12:54:32 Example resource specifications include:
12:54:32 '-f rsrc.yaml'
12:54:32 '--filename=rsrc.json'
12:54:32 '<resource> <name>'
12:54:32 '<resource>'
cc: @amitsagtani97
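A defensive pattern that avoids the "You must provide one or more resources" error (a sketch of the general idea, not the project's actual cleanup code) is to check whether the resource list is empty before handing it to `oc delete`:

```shell
#!/bin/sh
# Sketch: only call `oc delete` when there is actually something left to
# delete; passing an empty list makes oc exit with the error seen above.
safe_delete() {
  resources="$1"   # e.g. output of: oc get ns -l <benchmark-label> -o name
  if [ -z "$resources" ]; then
    echo "Nothing left to delete, skipping"
    return 0
  fi
  for r in $resources; do
    oc delete "$r"
  done
}
```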