
airflow-kubernetes's Introduction

Openshift Nightlies

This repo defines the Airflow DAGs used to run our nightly performance builds for stable and future releases of Openshift.

Overview

  • dags/openshift_nightlies - Contains all Airflow code for Tasks as well as release configurations
  • images - Contains all custom images used in the Airflow DAGs
  • charts - Helm Charts for the PerfScale Stack (includes Airflow, an EFK Stack for logging, Elastic/Kibana Cluster for results, and an instance of the perf-dashboard)
  • scripts - Install/Uninstall scripts for the PerfScale Stack

Installation Methods

All of these methods require you to fork this repo into your own user/organization. Please DO NOT attempt to install this by simply cloning the upstream repo.

Developer Playground

These instances should be used only for development or ad-hoc performance runs, not for long-term running of DAGs. These resources may be cleaned up at any time.

To install Airflow in a developer playground setting (i.e. in our baremetal cluster):

# all commands are run at the root of your git repo
# install the airflow stack and have it point to your fork of the dag code.
# $PASSWORD refers to the password you would like to secure your airflow instance with.
./scripts/playground/build.sh -p $PASSWORD

Ad-hoc Performance Tests

Playgrounds are the recommended method if you want to run ad-hoc performance tests against Openshift clusters. In your own playground you can easily change install/benchmark configs by editing the relevant JSON files under the config directory. For instance, to pin a DAG to a specific version, update one of the install config files and add these variables (note: this only works for aws/azure/gcp/cloud at the moment):

{
    "openshift_client_location": "foo",
    "openshift_install_binary_url": "bar"
}

Once pushed to your fork, this change applies to all variants using that install config. This works for any of the install configuration fields.
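
For illustration, a hypothetical pinned config might look like the following; the URLs are made up and are not real release artifacts:

{
    "openshift_client_location": "https://example.com/ocp/4.10.0/openshift-client-linux.tar.gz",
    "openshift_install_binary_url": "https://example.com/ocp/4.10.0/openshift-install-linux.tar.gz"
}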

Cleaning up the Playground

To uninstall the stack, you can run ./scripts/playground/cleanup.sh.


Tenant

The tenant method is almost identical to the playground except that it doesn't use your branch name in the Airflow install. This means you can have only one running tenant installation per git user. It is most useful for teams that want their own long-running Airflow instance that is not tied to the production git repo. To install Airflow as a tenant, run the following commands from the fork you wish to use as your tenant repo.

# all commands are run at the root of your git repo
# install the airflow stack and have it point to your fork of the dag code.
# $PASSWORD refers to the password you would like to secure your airflow instance with.
./scripts/tenant/create.sh -p $PASSWORD

Uninstalling

To remove the tenant from the cluster, you can run ./scripts/tenant/destroy.sh from the tenant fork.


airflow-kubernetes's People

Contributors

afcollins, amitsagtani97, chaitanyaenr, chentex, dry923, gavin-stackrox, harshith-umesh, jtaleric, kedark3, krishvoor, marokko44, masco, mkarg75, mohit-sheth, morenod, mukrishn, radez, rsevilla87, smalleni, troy0820, venkataanil


airflow-kubernetes's Issues

Upgrade tests are trying to upgrade cluster to same version

Considering that 4.10 is the latest version in the CI:

Expected behavior -
Upgrade an Openshift 4.9 stable release to the latest 4.10 nightly version.

Current behavior -
The latest Openshift 4.10 nightly is upgraded to the same 4.10 version, so no upgrade actually takes place and the test always passes with no metrics.

Solutions -

  • At the last step after cleanup, deploy a cluster with the previous stable version and then upgrade it to the latest nightly (see the sketch after this list).
    OR
  • Disable the upgrade test in the most recent Openshift version.
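
For the first option, a minimal sketch of deriving the previous stable minor from the version under test (the helper name and version format are assumptions):

# minimal sketch; assumes versions look like "major.minor"
def previous_minor(version: str) -> str:
    major, minor = version.split(".")[:2]
    return f"{major}.{int(minor) - 1}"

assert previous_minor("4.10") == "4.9"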

Hypershift CI checklist

Description

Airflow to build a new Management and Hosted cluster on ROSA.

TODO:

  • install fresh management and hosted cluster
  • ROSA based clusters
  • configurable hostedcluster and types
  • Add OSD support for Hypershift management cluster
  • #159
  • e2e benchmarking script and include custom metric template
  • #140
  • Destroy Hosted cluster
  • Delete created S3 storage bucket
  • Finally delete management cluster
  • #141
  • to enable them on Hosted cluster for observability
  • #160
  • FW rules for AWS to run network workload on Hosted cluster
  • #175
  • custom forks for OCM ROSA Hypershift CLIs
  • Benchmarks
  • Optional workload exec on Management cluster
  • #139
  • CI to take an existing management cluster as input and load hosted clusters onto it

[Bug] Grafana-agent must run in infrastructure nodes

The platform connector (AKA grafana-agent) consumes quite a lot of memory, sometimes leading to memory pressure and pod evictions on worker nodes (I've seen instances consuming up to 5 GiB). It must run on infrastructure nodes instead.
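
A minimal sketch of the scheduling change, assuming grafana-agent runs as a Deployment (the namespace and name used here are placeholders) and that infra nodes carry the standard role label and taint:

# hedged sketch: pin grafana-agent to infra nodes (namespace/name are assumptions)
oc -n grafana-agent patch deployment/grafana-agent --type merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-role.kubernetes.io/infra":""},"tolerations":[{"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}]}}}}'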

tox fails on pendulum.tz

Seeing the following error lately:

.tox/py38-unit/lib/python3.8/site-packages/airflow/settings.py:51: in <module>
    TIMEZONE = pendulum.tz.timezone("UTC")
E   TypeError: 'module' object is not callable
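
One plausible fix, assuming an incompatible pendulum release resolved into the tox environment; the exact pin below is illustrative and should match the Airflow version actually installed:

# sketch: pin pendulum to a release whose pendulum.tz.timezone is callable
pip install 'pendulum==2.1.2'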

4.16 AWS Failed Install: oc: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by oc)

Below is a new failure from today (May 9th) when trying to install a cluster in a playground.

[2024-05-09, 20:09:57 UTC] {subprocess.py:93} INFO - TASK [post-install : Get cluster name] *****************************************
[2024-05-09, 20:09:57 UTC] {subprocess.py:93} INFO - task path: /home/airflow/workspace/scale-ci-deploy/OCP-4.X/roles/post-install/tasks/main.yml:11
[2024-05-09, 20:09:57 UTC] {subprocess.py:93} INFO - Thursday 09 May 2024  20:09:57 +0000 (0:00:02.981)       0:48:46.599 **********
[2024-05-09, 20:10:00 UTC] {subprocess.py:93} INFO - fatal: [...]: FAILED! => {"changed": true, "cmd": "oc get machineset -n openshift-machine-api -o=go-template='{{(index (index .items 0).metadata.labels  \"machine.openshift.io/cluster-api-cluster\" )}}'\n", "delta": "0:00:00.003887", "end": "2024-05-09 20:10:00.135487", "msg": "non-zero return code", "rc": 1, "start": "2024-05-09 20:10:00.131600", "stderr": "oc: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by oc)\noc: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by oc)\noc: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by oc)", "stderr_lines": ["oc: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by oc)", "oc: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by oc)", "oc: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by oc)"], "stdout": "", "stdout_lines": []}

4.16 install state files not cleaned up

Starting with the 4.16 version tested today, there are additional local install state files that must be cleaned up.

Installer error message occurs on retry:

time="2024-07-09T20:18:28Z" level=fatal msg="failed to fetch Cluster: failed to load asset \"Cluster\": local infrastructure provisioning artifacts already exist. There may already be a running cluster"                                                                                             

I found the culprit in the installer code:

$INSTALL_DIR/.clusterapi_output/envtest.kubeconfig
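
A minimal cleanup sketch based on the path above (the retry flow may need other state removed as well):

# hedged sketch: clear the leftover cluster-api install state before retrying
rm -rf "$INSTALL_DIR/.clusterapi_output"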

OCM dag failure: environment_new.txt file missing

The environment_new.txt file is missing even though it is copied (DAG log: http://airflow.apps.sailplane.perf.lab.eng.rdu2.redhat.com/log?dag_id=ocm&task_id=api-load&execution_date=2022-08-26T00%3A00%3A00%2B00%3A00).

[2022-09-01, 20:00:25 EDT] {subprocess.py:92} INFO - cat: /tmp/environment_new.txt: No such file or directory
[2022-09-01, 20:00:25 EDT] {subprocess.py:92} INFO - cat /tmp/environment_new.txt
[2022-09-01, 20:00:25 EDT] {subprocess.py:92} INFO - cat: /tmp/environment_new.txt: No such file or directory
[2022-09-01, 20:00:25 EDT] {subprocess.py:92} INFO - Creating aws key with admin user for OCM testing

It looks like https://github.com/cloud-bulldozer/airflow-kubernetes/blob/master/dags/nocp/scripts/run_ocm_benchmark.sh#L27 is removing all the files in the /tmp folder.

This happened earlier when the DAG was run on its automatic schedule (http://airflow.apps.sailplane.perf.lab.eng.rdu2.redhat.com/log?dag_id=ocm&task_id=api-load&execution_date=2022-08-19T00%3A00%3A00%2B00%3A00).

However, it did not happen when the DAG was triggered manually (http://airflow.apps.sailplane.perf.lab.eng.rdu2.redhat.com/log?dag_id=ocm&task_id=api-load&execution_date=2022-08-26T05%3A45%3A17.720087%2B00%3A00).
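
One way to avoid the collision, sketched under the assumption that the script can own a private scratch directory instead of cleaning all of /tmp (paths are illustrative):

# hedged sketch: give each run its own scratch dir instead of wiping /tmp
WORKDIR=$(mktemp -d)
cp /tmp/environment.txt "$WORKDIR/environment_new.txt"
# ... run the benchmark against files in $WORKDIR ...
rm -rf "$WORKDIR"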

tox failing for all PRs

We have been seeing the same tox errors since Oct 10 (even for merged patches): https://github.com/cloud-bulldozer/airflow-kubernetes/pull/259/checks

ERROR: py38-unit: InvocationError for command /home/runner/work/airflow-kubernetes/airflow-kubernetes/dags/.tox/py38-unit/bin/python -m pip install --exists-action w '/home/runner/work/airflow-kubernetes/airflow-kubernetes/dags/.tox/.tmp/package/1/openshift-dags-0.0.1.zip[tests]' (exited with code 2)
Error: Process completed with exit code 1.

All PRs in review are also failing with the same error https://github.com/cloud-bulldozer/airflow-kubernetes/actions/runs/3534349130/jobs/5931112165

Can someone have a look at it?

Missing gold data on Uperf generated reports

Sheets generated by uperf tests do not contain gold data because the container executing the script does not have the bc command:

                      'image': 'quay.io/cloud-bulldozer/airflow-ansible:2.1.3',

[2021-10-13 13:27:20,933] {subprocess.py:78} INFO - Wed Oct 13 13:27:20 UTC 2021: Platform is found to be : aws
[2021-10-13 13:27:20,934] {subprocess.py:78} INFO - Wed Oct 13 13:27:20 UTC 2021: Colocating uperf pods in different AZs
[2021-10-13 13:27:22,041] {subprocess.py:78} INFO - ./common.sh: line 256: bc: command not found

[2021-10-13 14:08:08,765] {subprocess.py:78} INFO - 2021-10-13 14:08:08,765 - touchstone - ERROR - Error: Issue capturing results from elasticsearch using config {'filter': {'test_type.keyword': 'stream'}, 'exclude': {'norm_ops': 0}, 'buckets': ['protocol.keyword', 'message_size', 'num_threads'], 'aggregations': {'norm_byte': ['max', 'avg']}}
[2021-10-13 14:08:09,947] {subprocess.py:78} INFO - 2021-10-13 14:08:09,946 - touchstone - ERROR - Error: Issue capturing results from elasticsearch using config {'filter': {'test_type.keyword': 'rr'}, 'exclude': {'norm_ops': 0}, 'buckets': ['protocol.keyword', 'message_size', 'num_threads'], 'aggregations': {'norm_ops': ['max', 'avg'], 'norm_ltcy': [{'percentiles': {'percents': [99]}}, 'avg']}}
[2021-10-13 14:08:11,103] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 49884899.47777778}
[2021-10-13 14:08:11,103] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 600367559.1111112}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 621306037.9666667}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 22005970.48888889}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 333951522.1333333}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 2192307359.288889}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 408.632876586914}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,103 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 432.7005767822265}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,104 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 395.9165335083008}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,104 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 394.42046813964816}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,104 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 384.8480773925781}
[2021-10-13 14:08:11,104] {subprocess.py:78} INFO - 2021-10-13 14:08:11,104 - touchstone - ERROR - Missing UUID in input dict: {'4c9e26de-2805-581d-9434-1d24788783b7': 608.5277099609375}
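
A one-line image fix sketch, assuming (unverified) that the quay.io/cloud-bulldozer/airflow-ansible image is dnf-based:

# hedged sketch: install bc in the image build so common.sh can compute gold data
dnf install -y bc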

rosa-sts-ovn dags must wait for nodes to update before running benchmarks

node-density fails because p99 latency is > 5s. The latency is so high because worker nodes are still being updated when the test starts.

We must ensure the worker MCP is completely updated (READY == UPDATED).
We can speed up this process by setting maxUnavailable=100%.

It would also be a plus to pause all MCPs after the cluster and all day2 operations are complete, so any platform maintenance will not disrupt the tests.
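
A minimal sketch of these steps, assuming the stock worker pool name:

# wait for the worker MachineConfigPool to finish updating before benchmarks
oc wait --for=condition=Updated mcp/worker --timeout=90m

# optional: let all workers update at once to speed up the rollout
oc patch mcp/worker --type merge -p '{"spec":{"maxUnavailable":"100%"}}'

# optional: pause the pool once day2 operations are complete so platform
# maintenance cannot disrupt the tests
oc patch mcp/worker --type merge -p '{"spec":{"paused":true}}'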

Recent failure log: msg="❗ P99 Ready latency (11.52s) higher than configured threshold: 5s

I thought we could increase or remove the pod-threshold from the test, but that would give us bad data and increase confusion when looking at results. It is best to fix this during install/postinstall tasks up front.

[BUG] State from DAGs with failed tasks shouldn't be success

We're marking DAGs with failed tasks as succeeded. We should evaluate the task results in order to set the final status of the DAG correctly.

This misbehavior leads to confusion and a lot of human intervention to double-check the actual status of tasks within a given DAG, and it's becoming unsustainable due to the large number of DAGs.

Add support for kube-burner's skip object validation

It doesn't appear that configuring kube-burner's error_on_verify or verify_objects is possible through a Manifest without forking and changing other repos.

As a benchmark runner, I would like to be able to set the values all from within Airflow without changes to other repos.
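
A sketch of what the requested knob could look like in a benchmark config JSON under the config directory; the field names mirror the kube-burner options named above but are hypothetical here, since nothing currently plumbs them through:

{
    "benchmark": "kube-burner",
    "verify_objects": false,
    "error_on_verify": false
}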

[Proposal] Create RipsawOperator

With the ever-growing number of scripts in e2e-benchmarking, and their ever-growing complexity, it's become clear that at some point we have to look at building a more robust tool to create and monitor the benchmark CRs as it pertains to our nightly CI.

Note: "Operator" here is an Airflow Operator, not a Kubernetes Operator.

This proposal details the initial design for a "RipsawOperator" that could be developed natively within this repo. This operator would handle the installation of the benchmark-operator and also handle the lifecycle of the benchmark crs and report necessary data back to Airflow to understand the benchmark state and results.

Some of the benefits of this would be:

  • Native orchestration of benchmarks within Airflow will be much more stable than calling to bash scripts with varying environment variables and different execution logic
  • Easier to inject global values into the benchmark CRs such as Elasticsearch URLs.
  • The benchmarks could be declaratively defined as raw CRs within this repo, meaning adding a new benchmark would be as simple as dropping in the actual benchmark CR you want to run
  • Custom operators have a lot more control over their own lifecycle and state than default operators
  • Future enhancements would be possible such as direct links to results within Airflow
  • Easier to develop as you can test new benchmarks natively within Airflow in a developer environment

The biggest drawback is that there wouldn't be a way to run ad-hoc benchmarks like we can today with the e2e-benchmarking scripts. However, I think this could be remedied by packaging the operator's core logic as a Python CLI (i.e. ripsaw-cli) so the same logic could be used in a non-Airflow environment.

This is really just a place to start a conversation about this. I'm not 100% sold on this idea but I want to start looking at alternatives to the e2e repo so feel free to propose completely different ideas than this one :)
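
To make the idea concrete, here is a rough sketch of what such an operator could look like. The CR group/version/plural follow benchmark-operator's Benchmark CRD, but the status handling is an assumption, since the exact fields reported depend on the operator:

import time

from airflow.models.baseoperator import BaseOperator
from kubernetes import client, config

class RipsawOperator(BaseOperator):
    """Create a Benchmark CR and poll it to completion (rough sketch)."""

    def __init__(self, benchmark_manifest: dict, namespace: str = "benchmark-operator", **kwargs):
        super().__init__(**kwargs)
        self.benchmark_manifest = benchmark_manifest
        self.namespace = namespace

    def execute(self, context):
        config.load_incluster_config()  # assumes Airflow workers run in-cluster
        api = client.CustomObjectsApi()
        api.create_namespaced_custom_object(
            group="ripsaw.cloudbulldozer.io",
            version="v1alpha1",
            namespace=self.namespace,
            plural="benchmarks",
            body=self.benchmark_manifest,
        )
        name = self.benchmark_manifest["metadata"]["name"]
        while True:
            benchmark = api.get_namespaced_custom_object(
                group="ripsaw.cloudbulldozer.io",
                version="v1alpha1",
                namespace=self.namespace,
                plural="benchmarks",
                name=name,
            )
            # status fields are an assumption; adjust to what the operator reports
            state = benchmark.get("status", {}).get("state")
            if state == "Complete":
                return state
            if state == "Failed":
                raise RuntimeError(f"benchmark {name} failed")
            time.sleep(30)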

ocm ci: append "\" before shell command output while parsing it

OCM CI is failing with the errors below:

[2022-11-09, 05:00:23 EST] {subprocess.py:92} INFO - Clean-up existing OSD access keys..
[2022-11-09, 05:00:23 EST] {subprocess.py:92} INFO - -bash: line 20: conditional binary operator expected
[2022-11-09, 05:00:23 EST] {subprocess.py:92} INFO - -bash: line 20: syntax error near `2'
[2022-11-09, 05:00:23 EST] {subprocess.py:92} INFO - -bash: line 20: `if [[  -eq 2 ]]; then'
[2022-11-09, 05:00:23 EST] {subprocess.py:96} INFO - Command exited with return code 2

PR #255 recently introduced some code for identifying the last active AWS key, and it is failing in the CI run. Inside run_ocm_benchmark.sh, the script ssh-es into the OCM CI jumphost and executes this code, so we need to add an additional "\" before parsing/using the command output, i.e. change:

Current (command substitutions expand locally before the script reaches the jumphost):

AWS_KEY=$(aws iam list-access-keys --user-name OsdCcsAdmin --output text --query 'AccessKeyMetadata[*].AccessKeyId')
LEN_AWS_KEY=`echo $AWS_KEY | wc -w`
if [[ ${LEN_AWS_KEY} -eq 2 ]]; then
    aws iam delete-access-key --user-name OsdCcsAdmin --access-key-id `printf ${AWS_KEY[0]}`

Fixed (substitutions escaped so they expand on the jumphost):

AWS_KEY=\$(/usr/bin/aws iam list-access-keys --user-name OsdCcsAdmin --output text --query 'AccessKeyMetadata[*].AccessKeyId')
LEN_AWS_KEY=\$(echo \$AWS_KEY | wc -w)
if [[ \${LEN_AWS_KEY} -eq 2 ]]; then
    /usr/bin/aws iam delete-access-key --user-name OsdCcsAdmin --access-key-id \$(printf \${AWS_KEY[0]})
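
For context, a short sketch of why the escaping matters when the script body is sent over ssh (the host name and quoting here are illustrative):

# inside a double-quoted ssh command, $(...) and ${...} expand on the local
# host before ssh runs; prefixing them with \ defers expansion to the jumphost
ssh user@jumphost "
  AWS_KEY=\$(aws iam list-access-keys --user-name OsdCcsAdmin --output text --query 'AccessKeyMetadata[*].AccessKeyId')
  LEN_AWS_KEY=\$(echo \$AWS_KEY | wc -w)
"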

[RFE] Deletion of Hypershift Hosted Clusters should be parallel

Deletion of a Hypershift Hosted Cluster can take 10+ minutes, and we are currently doing this serially. That means that when I am testing more than 6 HCs, cleanup can take over an hour. We should investigate doing this in parallel instead to greatly trim down the cleanup time.
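
A minimal sketch of the parallel form, assuming the ROSA CLI and a shell array of hosted cluster names:

# hedged sketch: fire the deletions concurrently, then wait for all of them
for hc in "${HOSTED_CLUSTERS[@]}"; do
    rosa delete cluster --cluster "$hc" --yes &
done
wait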

ROSA Auth issue

If the ROSA kubeconfig is generated too quickly, it ends up expiring in the middle of our performance benchmarks. To avoid this, it is recommended we sleep for about 2 minutes prior to generating the kubeconfig.
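
In script form the recommendation is simply (placement in the install flow is an assumption):

# hedged sketch: give ROSA auth time to settle before generating the kubeconfig
sleep 120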
