
airflow-kubernetes's Introduction

Openshift Nightlies

This repo defines the Airflow DAGs used to run our nightly performance builds for stable and future releases of Openshift.

Overview

  • dags/openshift_nightlies - Contains all Airflow code for Tasks as well as release configurations
  • images - Contains all custom images used in the Airflow DAGs
  • charts - Helm Charts for the PerfScale Stack (includes Airflow, an EFK Stack for logging, Elastic/Kibana Cluster for results, and an instance of the perf-dashboard)
  • scripts - Install/Uninstall scripts for the PerfScale Stack

Installation Methods

All of these methods require you to fork this repo into your own user/organization. Please DO NOT attempt to install this by simply cloning the upstream repo.

Developer Playground

These instances should be used only for development or ad-hoc performance runs, not for long-term running of DAGs. These resources may be cleaned up at any time.

To install Airflow in a developer playground setting (i.e. in our baremetal cluster):

# all commands are run at the root of your git repo
# install the airflow stack and have it point to your fork of the dag code.
# $PASSWORD refers to the password you would like to secure your airflow instance with.
./scripts/playground/build.sh -p $PASSWORD

Ad-hoc Performance Tests

Playgrounds are the recommended method if you want to run ad-hoc performance tests against Openshift clusters. In your own playground you can easily change install/benchmark configs by editing the relevant JSON files under the config directory. For instance, to pin a DAG to a specific version, update one of the install config files and add these variables (note: this only works for aws/azure/gcp/cloud at the moment):

{
    "openshift_client_location": "foo",
    "openshift_install_binary_url": "bar"
}

Once pushed to your fork, this change applies to all variants using that install config. This works for any of the install configuration fields.
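
For illustration, a hypothetical pinned config might look like the following; the URLs are made up and are not real release artifacts:

{
    "openshift_client_location": "https://example.com/ocp/4.10.0/openshift-client-linux.tar.gz",
    "openshift_install_binary_url": "https://example.com/ocp/4.10.0/openshift-install-linux.tar.gz"
}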

Cleaning up the Playground

To uninstall the stack, you can run ./scripts/playground/cleanup.sh.


Tenant

The tenant method is almost identical to the playground except that it doesn't use your branch name in the Airflow install. This means you can have only one running tenant installation per git user. It is most useful for teams that want their own long-running Airflow instance that is not tied to the production git repo. To install Airflow as a tenant, run the following commands from the fork you wish to use as your tenant repo.

# all commands are run at the root of your git repo
# install the airflow stack and have it point to your fork of the dag code.
# $PASSWORD refers to the password you would like to secure your airflow instance with.
./scripts/tenant/create.sh -p $PASSWORD

Uninstalling

To remove the tenant from the cluster, you can run ./scripts/tenant/destroy.sh from the tenant fork.


airflow-kubernetes's People

Contributors

afcollins, amitsagtani97, chaitanyaenr, chentex, dry923, gavin-stackrox, harshith-umesh, jtaleric, kedark3, krishvoor, marokko44, masco, mkarg75, mohit-sheth, morenod, mukrishn, radez, rsevilla87, smalleni, troy0820, venkataanil


airflow-kubernetes's Issues

Upgrade tests are trying to upgrade cluster to same version

Considering that 4.10 is the latest version in the CI:

Expected behavior -
Upgrade an Openshift 4.9 stable release to the latest 4.10 nightly version.

Current behavior -
The latest Openshift 4.10 nightly is upgraded to the same 4.10 version, so no upgrade actually takes place and the test always passes with no metrics.

Solutions -

  • At the last step after cleanup, deploy a cluster with the previous stable version and then upgrade it to the latest nightly (see the sketch after this list).
    OR
  • Disable the upgrade test in the most recent Openshift version.
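
For the first option, a minimal sketch of deriving the previous stable minor from the version under test (the helper name and version format are assumptions):

# minimal sketch; assumes versions look like "major.minor"
def previous_minor(version: str) -> str:
    major, minor = version.split(".")[:2]
    return f"{major}.{int(minor) - 1}"

assert previous_minor("4.10") == "4.9"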

Hypershift CI checklist

Description

Airflow to build a new Management and Hosted cluster on ROSA.

TODO:

  • install fresh management and hosted cluster
  • ROSA based clusters
  • configurable hostedcluster and types
  • Add OSD support for Hypershift management cluster
  • #159
  • e2e benchmarking script and include custom metric template
  • #140
  • Destroy Hosted cluster
  • Delete created S3 storage bucket
  • Finally delete management cluster
  • #141
  • to enable them on Hosted cluster for observability
  • #160
  • FW rules for AWS to run network workload on Hosted cluster
  • #175
  • custom forks for OCM ROSA Hypershift CLIs
  • Benchmarks
  • Optional workload exec on Management cluster
  • #139
  • CI to take an existing management cluster as input and load hosted clusters onto it

[Bug] Grafana-agent must run in infrastructure nodes

The platform connector (AKA grafana-agent) consumes quite a lot of memory, sometimes leading to memory pressure and pod evictions on worker nodes (I've seen instances consuming up to 5 GiB). It must run on infrastructure nodes instead.
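
A minimal sketch of the scheduling change, assuming grafana-agent runs as a Deployment (the namespace and name used here are placeholders) and that infra nodes carry the standard role label and taint:

# hedged sketch: pin grafana-agent to infra nodes (namespace/name are assumptions)
oc -n grafana-agent patch deployment/grafana-agent --type merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-role.kubernetes.io/infra":""},"tolerations":[{"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}]}}}}'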

tox fails on pendulum.tz

Seeing the following error lately:

.tox/py38-unit/lib/python3.8/site-packages/airflow/settings.py:51: in <module>
    TIMEZONE = pendulum.tz.timezone("UTC")
E   TypeError: 'module' object is not callable
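
One plausible fix, assuming an incompatible pendulum release resolved into the tox environment; the exact pin below is illustrative and should match the Airflow version actually installed:

# sketch: pin pendulum to a release whose pendulum.tz.timezone is callable
pip install 'pendulum==2.1.2'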

4.16 AWS Failed Install: oc: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by oc)

Below is a new failure from today (May 9th) when trying to install a cluster in a playground.

[2024-05-09, 20:09:57 UTC] {subprocess.py:93} INFO - TASK [post-install : Get cluster name] *****************************************
[2024-05-09, 20:09:57 UTC] {subprocess.py:93} INFO - task path: /home/airflow/workspace/scale-ci-deploy/OCP-4.X/roles/post-install/tasks/main.yml:11
[2024-05-09, 20:09:57 UTC] {subprocess.py:93} INFO - Thursday 09 May 2024  20:09:57 +0000 (0:00:02.981)       0:48:46.599 **********
[2024-05-09, 20:10:00 UTC] {subprocess.py:93} INFO - fatal: [...]: FAILED! => {"changed": true, "cmd": "oc get machineset -n openshift-machine-api -o=go-template='{{(index (index .items 0).metadata.labels  \"machine.openshift.io/cluster-api-cluster\" )}}'\n", "delta": "0:00:00.003887", "end": "2024-05-09 20:10:00.135487", "msg": "non-zero return code", "rc": 1, "start": "2024-05-09 20:10:00.131600", "stderr": "oc: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by oc)\noc: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by oc)\noc: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by oc)", "stderr_lines": ["oc: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by oc)", "oc: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by oc)", "oc: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by oc)"], "stdout": "", "stdout_lines": []}

4.16 install state files not cleaned up

Starting with the 4.16 version tested today, there are additional local install state files that must be cleaned up.

Installer error message occurs on retry:

time="2024-07-09T20:18:28Z" level=fatal msg="failed to fetch Cluster: failed to load asset \"Cluster\": local infrastructure provisioning artifacts already exist. There may already be a running cluster"                                                                                             

I found the culprit in the installer code:

$INSTALL_DIR/.clusterapi_output/envtest.kubeconfig
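
A minimal cleanup sketch based on the path above (the retry flow may need other state removed as well):

# hedged sketch: clear the leftover cluster-api install state before retrying
rm -rf "$INSTALL_DIR/.clusterapi_output"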

OCM dag failure: environment_new.txt file missing

The environment_new.txt file is missing even though it is copied (DAG log: http://airflow.apps.sailplane.perf.lab.eng.rdu2.redhat.com/log?dag_id=ocm&task_id=api-load&execution_date=2022-08-26T00%3A00%3A00%2B00%3A00).

[2022-09-01, 20:00:25 EDT] {subprocess.py:92} INFO - cat: /tmp/environment_new.txt: No such file or directory
[2022-09-01, 20:00:25 EDT] {subprocess.py:92} INFO - cat /tmp/environment_new.txt
[2022-09-01, 20:00:25 EDT] {subprocess.py:92} INFO - cat: /tmp/environment_new.txt: No such file or directory
[2022-09-01, 20:00:25 EDT] {subprocess.py:92} INFO - Creating aws key with admin user for OCM testing

It looks like https://github.com/cloud-bulldozer/airflow-kubernetes/blob/master/dags/nocp/scripts/run_ocm_benchmark.sh#L27 is removing all the files in the /tmp folder.

This happened earlier when the DAG was run on its automatic schedule (http://airflow.apps.sailplane.perf.lab.eng.rdu2.redhat.com/log?dag_id=ocm&task_id=api-load&execution_date=2022-08-19T00%3A00%3A00%2B00%3A00).

However, it did not happen when the DAG was triggered manually (http://airflow.apps.sailplane.perf.lab.eng.rdu2.redhat.com/log?dag_id=ocm&task_id=api-load&execution_date=2022-08-26T05%3A45%3A17.720087%2B00%3A00).
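
One way to avoid the collision, sketched under the assumption that the script can own a private scratch directory instead of cleaning all of /tmp (paths are illustrative):

# hedged sketch: give each run its own scratch dir instead of wiping /tmp
WORKDIR=$(mktemp -d)
cp /tmp/environment.txt "$WORKDIR/environment_new.txt"
# ... run the benchmark against files in $WORKDIR ...
rm -rf "$WORKDIR"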

tox failing for all PRs

We have been seeing the same tox errors since Oct 10 (even for merged patches): https://github.com/cloud-bulldozer/airflow-kubernetes/pull/259/checks

ERROR: py38-unit: InvocationError for command /home/runner/work/airflow-kubernetes/airflow-kubernetes/dags/.tox/py38-unit/bin/python -m pip install --exists-action w '/home/runner/work/airflow-kubernetes/airflow-kubernetes/dags/.tox/.tmp/package/1/openshift-dags-0.0.1.zip[tests]' (exited with code 2)
Error: Process completed with exit code 1.

All PRs in review are also failing with the same error https://github.com/cloud-bulldozer/airflow-kubernetes/actions/runs/3534349130/jobs/5931112165

Can someone have a look at it?

Missing gold data on Uperf generated reports

Sheets generated by uperf tests do not contain gold data because the container executing the script does not have the bc command:

                      'image': 'quay.io/cloud-bulldozer/airflow-ansible:2.1.3',

[2021-10-13 13:27:20,933] {subprocess.py:78} INFO - Wed Oct 13 13:27:20 UTC 2021: Platform is found to be : aws
[2021-10-13 13:27:20,934] {subprocess.py:78} INFO - Wed Oct 13 13:27:20 UTC 2021: Colocating uperf pods in different AZs
[2021-10-13 13:27:22,041] {subprocess.py:78} INFO - ./common.sh: line 256: bc: command not found

[2021-10-13 14:08:08,765] {subprocess.py:78} INFO - 2021-10-13 14:08:08,765 - touchstone - ERROR - Error: Issue capturing results from elasticsearch using config {'filter': {'test_type.keyword': 'stream'}, 'exclude': {'norm_ops': 0}, 'buckets': ['protocol.keyword', 'message_size', 'num_threads'], 'aggregations': {'norm_byte': ['max', 'avg']}}
[2021-10-13 14:08:09,947] {subprocess.py:78} INFO - 2021-10-13 14:08:09,946 - touchstone - ERROR - Error: Issue capturing results from elasticsearch using config {'filter': {'test_type.keyword': 'rr'}, 'exclude': {'norm_ops': 0}, 'buckets': ['protocol.keyword', 'message_size', 'num_threads'], 'aggregations': {'norm_ops': ['max', 'avg'], 'norm_ltcy': [{'percentiles': {'percents': [99]}}, 'avg']}}
[2021-10-13 14:08:11,103] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 49884899.47777778}
[2021-10-13 14:08:11,103] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 600367559.1111112}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 621306037.9666667}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 22005970.48888889}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 333951522.1333333}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 2192307359.288889}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 408.632876586914}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 432.7005767822265}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,104 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 395.9165335083008}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,104 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 394.42046813964816}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,104 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 384.8480773925781}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,104 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 608.5277099609375}
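
A one-line image fix sketch, assuming (unverified) that the quay.io/cloud-bulldozer/airflow-ansible image is dnf-based:

# hedged sketch: install bc in the image build so common.sh can compute gold data
dnf install -y bc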

rosa-sts-ovn dags must wait for nodes to update before running benchmarks

node-density fails because p99 latency is > 5s. The latency is so high because worker nodes are still being updated when the test starts.

We must ensure the worker MCP is completely updated (READY == UPDATED).
We can speed up this process by setting maxUnavailable=100%.

It would also be a plus to pause all MCPs after the cluster and all day2 operations are complete, so any platform maintenance will not disrupt the tests.
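
A minimal sketch of these steps, assuming the stock worker pool name:

# wait for the worker MachineConfigPool to finish updating before benchmarks
oc wait --for=condition=Updated mcp/worker --timeout=90m

# optional: let all workers update at once to speed up the rollout
oc patch mcp/worker --type merge -p '{"spec":{"maxUnavailable":"100%"}}'

# optional: pause the pool once day2 operations are complete so platform
# maintenance cannot disrupt the tests
oc patch mcp/worker --type merge -p '{"spec":{"paused":true}}'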

Recent failure log: msg="❗ P99 Ready latency (11.52s) higher than configured threshold: 5s

I thought we could increase or remove the pod-threshold from the test, but that would give us bad data and increase confusion when looking at results. It is best to fix this during install/postinstall tasks up front.

[BUG] State from DAGs with failed tasks shouldn't be success

We're marking DAGs with failed tasks as succeeded. We should evaluate the task results in order to set the final status of the DAG correctly.

This misbehavior leads to confusion and a lot of human intervention to double-check the actual status of tasks within a given DAG, and it's becoming unsustainable due to the large number of DAGs.

Add support for kube-burner's skip object validation

It doesn't appear that configuring kube-burner's error_on_verify or verify_objects is possible through a Manifest without forking and changing other repos.

As a benchmark runner, I would like to be able to set the values all from within Airflow without changes to other repos.
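
A sketch of what the requested knob could look like in a benchmark config JSON under the config directory; the field names mirror the kube-burner options named above but are hypothetical here, since nothing currently plumbs them through:

{
    "benchmark": "kube-burner",
    "verify_objects": false,
    "error_on_verify": false
}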

[Proposal] Create RipsawOperator

With the ever-growing number of scripts in e2e-benchmarking, and their ever-growing complexity, it's become clear that at some point we have to look at building a more robust tool to create and monitor the benchmark CRs as it pertains to our nightly CI.

Note: "Operator" here is an Airflow Operator, not a Kubernetes Operator.

This proposal details the initial design for a "RipsawOperator" that could be developed natively within this repo. This operator would handle the installation of the benchmark-operator and also handle the lifecycle of the benchmark crs and report necessary data back to Airflow to understand the benchmark state and results.

Some of the benefits of this would be:

  • Native orchestration of benchmarks within Airflow will be much more stable than calling to bash scripts with varying environment variables and different execution logic
  • Easier to inject global values into the benchmark CRs such as Elasticsearch URLs.
  • The benchmarks could be declaratively defined as raw CRs within this repo, meaning adding a new benchmark would be as simple as dropping in the actual benchmark CR you want to run
  • Custom operators have a lot more control over their own lifecycle and state than default operators
  • Future enhancements would be possible such as direct links to results within Airflow
  • Easier to develop as you can test new benchmarks natively within Airflow in a developer environment

The biggest drawback is that there wouldn't be a way to run ad-hoc benchmarks like we can today with the e2e-benchmarking scripts. However, I think this could be remedied by packaging the operator's core logic as a Python CLI (i.e. ripsaw-cli) so the same logic could be used in a non-Airflow environment.

This is really just a place to start a conversation about this. I'm not 100% sold on this idea but I want to start looking at alternatives to the e2e repo so feel free to propose completely different ideas than this one :)
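
To make the idea concrete, here is a rough sketch of what such an operator could look like. The CR group/version/plural follow benchmark-operator's Benchmark CRD, but the status handling is an assumption, since the exact fields reported depend on the operator:

import time

from airflow.models.baseoperator import BaseOperator
from kubernetes import client, config

class RipsawOperator(BaseOperator):
    """Create a Benchmark CR and poll it to completion (rough sketch)."""

    def __init__(self, benchmark_manifest: dict, namespace: str = "benchmark-operator", **kwargs):
        super().__init__(**kwargs)
        self.benchmark_manifest = benchmark_manifest
        self.namespace = namespace

    def execute(self, context):
        config.load_incluster_config()  # assumes Airflow workers run in-cluster
        api = client.CustomObjectsApi()
        api.create_namespaced_custom_object(
            group="ripsaw.cloudbulldozer.io",
            version="v1alpha1",
            namespace=self.namespace,
            plural="benchmarks",
            body=self.benchmark_manifest,
        )
        name = self.benchmark_manifest["metadata"]["name"]
        while True:
            benchmark = api.get_namespaced_custom_object(
                group="ripsaw.cloudbulldozer.io",
                version="v1alpha1",
                namespace=self.namespace,
                plural="benchmarks",
                name=name,
            )
            # status fields are an assumption; adjust to what the operator reports
            state = benchmark.get("status", {}).get("state")
            if state == "Complete":
                return state
            if state == "Failed":
                raise RuntimeError(f"benchmark {name} failed")
            time.sleep(30)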

ocm ci: append "\" before shell command output while parsing it

OCM CI is failing with the errors below:

[2022-11-09, 05:00:23 EST] {subprocess.py:92} INFO - Clean-up existing OSD access keys..
[2022-11-09, 05:00:23 EST] {subprocess.py:92} INFO - -bash: line 20: conditional binary operator expected
[2022-11-09, 05:00:23 EST] {subprocess.py:92} INFO - -bash: line 20: syntax error near `2'
[2022-11-09, 05:00:23 EST] {subprocess.py:92} INFO - -bash: line 20: `if [[  -eq 2 ]]; then'
[2022-11-09, 05:00:23 EST] {subprocess.py:96} INFO - Command exited with return code 2

PR #255 recently introduced some code for identifying the last active AWS key, and it is failing in the CI run. Inside run_ocm_benchmark.sh, the script ssh-es into the OCM CI jumphost and executes this code, so we need to add an additional "\" before parsing/using the command output, i.e. change:

Current (command substitutions expand locally before the script reaches the jumphost):

AWS_KEY=$(aws iam list-access-keys --user-name OsdCcsAdmin --output text --query 'AccessKeyMetadata[*].AccessKeyId')
LEN_AWS_KEY=`echo $AWS_KEY | wc -w`
if [[ ${LEN_AWS_KEY} -eq 2 ]]; then
    aws iam delete-access-key --user-name OsdCcsAdmin --access-key-id `printf ${AWS_KEY[0]}`

Fixed (substitutions escaped so they expand on the jumphost):

AWS_KEY=\$(/usr/bin/aws iam list-access-keys --user-name OsdCcsAdmin --output text --query 'AccessKeyMetadata[*].AccessKeyId')
LEN_AWS_KEY=\$(echo \$AWS_KEY | wc -w)
if [[ \${LEN_AWS_KEY} -eq 2 ]]; then
    /usr/bin/aws iam delete-access-key --user-name OsdCcsAdmin --access-key-id \$(printf \${AWS_KEY[0]})
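
For context, a short sketch of why the escaping matters when the script body is sent over ssh (the host name and quoting here are illustrative):

# inside a double-quoted ssh command, $(...) and ${...} expand on the local
# host before ssh runs; prefixing them with \ defers expansion to the jumphost
ssh user@jumphost "
  AWS_KEY=\$(aws iam list-access-keys --user-name OsdCcsAdmin --output text --query 'AccessKeyMetadata[*].AccessKeyId')
  LEN_AWS_KEY=\$(echo \$AWS_KEY | wc -w)
"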

[RFE] Deletion of Hypershift Hosted Clusters should be parallel

Deletion of a Hypershift Hosted Cluster can take 10+ minutes, and we are currently doing this serially. That means that when I am testing more than 6 HCs, cleanup can take over an hour. We should investigate doing this in parallel instead to greatly trim down the cleanup time.
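
A minimal sketch of the parallel form, assuming the ROSA CLI and a shell array of hosted cluster names:

# hedged sketch: fire the deletions concurrently, then wait for all of them
for hc in "${HOSTED_CLUSTERS[@]}"; do
    rosa delete cluster --cluster "$hc" --yes &
done
wait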

ROSA Auth issue

If the ROSA kubeconfig is generated too quickly, it ends up expiring in the middle of our performance benchmarks. To avoid this, it is recommended we sleep for about 2 minutes prior to generating the kubeconfig.
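
In script form the recommendation is simply (placement in the install flow is an assumption):

# hedged sketch: give ROSA auth time to settle before generating the kubeconfig
sleep 120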
