metal3-io / project-infra

Metal3 testing infrastructure configuration

Home Page: https://prow.apps.test.metal3.io

License: Apache License 2.0

Languages: Shell 96.02%, Python 2.85%, Jinja 0.58%, Makefile 0.38%, Dockerfile 0.17%
Topics: jenkins, prow

project-infra's People

Contributors

adilghaffardev, derekhiggins, dhellmann, digambar15, dtantsur, elfosardo, fmuyassarov, furkatgofurov7, honza, huutomerkki, jaakko-os, jan-est, kashifest, lentzi90, macaptain, maelk, mboukhalfa, metal3-io-bot, mikkosest, mquhuy, namnx228, nymanrobin, peppi-lotta, rozzii, russellb, smoshiur1237, stbenjam, sunnatillo, tuminoid, wgslr


project-infra's Issues

Improve Prow cluster management

Current Situation

Currently there are no clear instructions on when or how to update the Prow cluster, besides a small note in the Prow README ("Apply the changes and then create a PR with the changes."). This can lead to the configuration in the repository and the live cluster diverging, for example when two people work on the cluster at the same time and overwrite each other's work. We also saw this recently during image bumps, where the lack of a clear process left one PR hanging while main diverged from the live cluster:

  1. PR was merged without applying: #777
  2. PR was on hold waiting for someone to apply: #802

Potential Solution

It would be beneficial to have a single process for all updates, supported by some automation.
One idea is to apply changes automatically, though a bad change could then break the automation itself. Another approach is to diff the live cluster against a PR and only allow the merge once the PR's changes are present in the cluster, or to run a periodic job that alerts when main and the live cluster differ.
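The periodic drift check could be sketched as below, assuming the cluster manifests live in a `prow/manifests/` directory (a hypothetical path) and the job has `kubectl` access to the live cluster:

```shell
# Sketch: report whether the live cluster has drifted from the checked-out manifests.
# `kubectl diff` exits 0 when live objects match the files and non-zero on a
# difference (or error), which this sketch treats uniformly as drift to alert on.
check_drift() {
  if kubectl diff -f "$1" > /tmp/prow-drift.txt 2>&1; then
    echo "in-sync"
  else
    echo "drift"
  fi
}

# Periodic job would run e.g.: check_drift prow/manifests/ and alert on "drift".
```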

Migration to dynamic worker workflow

We decided to use dynamic Jenkins workers. This issue tracks the progress of the move.

Done:
✅ clusterctl upgrade tests
✅ feature tests
✅ dev env integration tests
✅ e2e basic
✅ e2e_integration tests
✅ k8s_upgrade tests
✅ ephemeral tests
✅ bmo e2e
✅ fullstack build
✅ Nordix clone

Note! We are also splitting the pipelines for dev_env and e2e_tests; the feature tests already have a separate pipeline.

Replace travis CI usage with prow jobs

We have Travis CI running some jobs against baremetal-operator and cluster-api-provider-baremetal. We should be able to replace those with Prow jobs.

Fix the Centos CI openstack image building pipeline

When trying to build the ci-image with the Jenkins pipeline, there is a conflict between the RPM packages nbdkit and selinux-policy-targeted. The same error was seen in the CentOS tests when setting up metal3-dev-env: #738.

[2024-05-07T05:29:39.094Z] Error: 
[2024-05-07T05:29:39.094Z]  Problem 1: package nbdkit-1.38.0-1.el9.x86_64 from appstream requires (nbdkit-selinux if selinux-policy-targeted), but none of the providers can be installed
[2024-05-07T05:29:39.094Z]   - cannot install the best update candidate for package selinux-policy-targeted-38.1.35-2.el9.noarch
[2024-05-07T05:29:39.094Z]   - cannot install the best update candidate for package nbdkit-1.36.2-1.el9.x86_64
[2024-05-07T05:29:39.094Z]  Problem 2: package nbdkit-1.38.0-1.el9.x86_64 from appstream requires (nbdkit-selinux if selinux-policy-targeted), but none of the providers can be installed
[2024-05-07T05:29:39.094Z]   - problem with installed package selinux-policy-targeted-38.1.35-2.el9.noarch
[2024-05-07T05:29:39.094Z]   - problem with installed package nbdkit-1.36.2-1.el9.x86_64
[2024-05-07T05:29:39.094Z]   - package selinux-policy-targeted-38.1.35-2.el9.noarch from @System requires selinux-policy = 38.1.35-2.el9, but none of the providers can be installed
[2024-05-07T05:29:39.094Z]   - package selinux-policy-targeted-38.1.35-2.el9.noarch from baseos requires selinux-policy = 38.1.35-2.el9, but none of the providers can be installed
[2024-05-07T05:29:39.094Z]   - package nbdkit-1.36.2-1.el9.x86_64 from @System requires nbdkit-basic-filters(x86-64) = 1.36.2-1.el9, but none of the providers can be installed
[2024-05-07T05:29:39.094Z]   - package nbdkit-1.36.2-1.el9.x86_64 from appstream requires nbdkit-basic-filters(x86-64) = 1.36.2-1.el9, but none of the providers can be installed
[2024-05-07T05:29:39.094Z]   - cannot install both selinux-policy-38.1.36-1.el9.noarch from baseos and selinux-policy-38.1.35-2.el9.noarch from @System
[2024-05-07T05:29:39.095Z]   - cannot install both selinux-policy-38.1.36-1.el9.noarch from baseos and selinux-policy-38.1.35-2.el9.noarch from baseos
[2024-05-07T05:29:39.095Z]   - cannot install both nbdkit-basic-filters-1.38.0-1.el9.x86_64 from appstream and nbdkit-basic-filters-1.36.2-1.el9.x86_64 from @System
[2024-05-07T05:29:39.095Z]   - cannot install both nbdkit-basic-filters-1.38.0-1.el9.x86_64 from appstream and nbdkit-basic-filters-1.36.2-1.el9.x86_64 from appstream
[2024-05-07T05:29:39.095Z]   - cannot install the best update candidate for package selinux-policy-38.1.35-2.el9.noarch
[2024-05-07T05:29:39.095Z]   - cannot install the best update candidate for package nbdkit-basic-filters-1.36.2-1.el9.x86_64

A workaround was applied to metal3-dev-env to get the tests passing, but the issue should ultimately be fixed in the image build, and/or we should monitor for an upstream fix to the RPM package conflict. Once that is done, the workaround can be reverted to make sure we are using the latest packages.

Discussion on Docker rate limit

Recently, we have started experiencing Docker rate limit issues:
toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit make: *** [Makefile:4: install_requirements] Error 1

in our CI runs and in the integration test runs in pull requests. Possible mitigations, as discussed in the Metal3 community meeting, could be:
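The options from the meeting are not recorded here. As one common mitigation, offered purely as an assumption, pulls can be authenticated (raising the per-account limit) or routed through a pull-through cache; a sketch of the mirror configuration (the mirror URL is illustrative):

```shell
# Option 1: authenticate pulls so the higher per-account limit applies
# (credentials assumed to exist as CI secrets):
#   echo "$DOCKERHUB_TOKEN" | docker login -u "$DOCKERHUB_USER" --password-stdin

# Option 2: point the Docker daemon at a pull-through registry cache.
# Written to a temp dir here; in practice this is /etc/docker/daemon.json.
conf_dir="$(mktemp -d)"
cat > "$conf_dir/daemon.json" <<'EOF'
{
  "registry-mirrors": ["https://mirror.example.org"]
}
EOF
```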

BML: Cleanup /tmp/manifests

We don't clean /tmp/manifests. Normally this is not a problem, since we start from a fresh VM every time, but on the BML it builds up.
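A minimal cleanup step could look like the following; the path comes from the issue, but the 7-day retention period is an assumption to adjust to the BML job's actual cadence:

```shell
# Sketch: prune collected manifests older than 7 days so the BML host
# does not fill up (retention period is an assumption).
clean_manifests() {
  local dir="${1:-/tmp/manifests}"
  # -mindepth 1 keeps the directory itself; -mtime +7 selects files >7 days old
  find "$dir" -mindepth 1 -type f -mtime +7 -delete
}
```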

Broken link to test failure details

When prow leaves a comment on a PR about a test failure, the "Details" link just links back to the PR itself instead of to something more useful. I suspect this is missing configuration of a base URL somewhere.

Example: #14
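If the root cause is indeed a missing base URL, Prow's `plank` section has a job URL prefix setting; a hedged sketch (whether this is the actual missing piece, and the exact URL, are assumptions):

```yaml
plank:
  job_url_prefix_config:
    "*": https://prow.apps.test.metal3.io/view/
```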

Improve retention policy and management of node and ci images

Current Situation

Currently, the retention policy saves the last 5 images, but this isn't very safe. The process to take a new image is manual and lacks visibility into which image is currently in use. You have to be a Jenkins Admin to see and change the image for CI, which means someone with triggering rights to the build can start it without knowing they might erase the actively used image from OpenStack.

This problem also applies to node images. However, everyone currently has visibility into both Artifactory and the dev-env code, allowing them to see what image is used and understand how a new trigger will affect it.

What needs to be fixed

To address these issues, we need to ensure that the actively used image is never deleted. We also need a way to ensure that if the active image is changed, the new image will work properly through some testing. Additionally, any changes to an image build should be testable in the PR before merging.

  • Make the active image separate from the candidate images
  • Add a promotion process for changing the active image
  • Add tests for new image when changes happen to the DiB image workflow
  • Ensure that every file change affecting the DiB workflow triggers tests, and that a merge triggers a new build

Potential solution

A potential solution could involve having the active image with a separate naming convention from the candidate images. For promotion, there would be a pipeline that takes a candidate image as input, runs tests on it, and if the tests pass, automatically changes the active image to the candidate.
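The promotion step at the end of that pipeline could be sketched with the OpenStack CLI; the function and image names below are illustrative, not the real convention:

```shell
# Sketch: after the candidate passes the tests, take over the "active" name.
# `openstack image set --name` renames the image in place; renaming the old
# active image aside first (not shown) would guarantee it is never deleted
# by the retention job.
promote_image() {
  local candidate_id="$1" active_name="$2"
  openstack image set --name "$active_name" "$candidate_id"
}

# Usage (hypothetical values): promote_image "$CANDIDATE_ID" metal3-ci-ubuntu-active
```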

Note: Jenkins also offers an Artifactory plugin that supports promotion logic out of the box, which could be investigated. It is unclear whether the same exists for the OpenStack plugin.

By implementing these changes, we can increase the reliability and safety of our image retention process, improve coordination among team members triggering builds, and reduce the risk of active image overwriting and build failures. Testing new images before they become active will ensure they are reliable and functional, providing a smoother and more predictable CI/CD process.

Separate node image building and testing

The issue

Looking at the node image building pipeline, the build time is extremely long, ~1 hour per distribution. However, when the Jenkins build step is examined more closely, it actually contains both the building and the testing of the image. To make the process more transparent and errors easier to spot, the test and build steps should be separated.

However, the testing is currently part of the same script as the building, so a small refactoring is needed before a separate step can be added in Jenkins.

Jenkins job periodic_node_image_building

Set up RBAC so all metal3-io members can view test pods

Right now only a small group has access to the CI cluster itself. It would be nice to allow access to anyone in the metal3-io github org, at least with read-only access to the test-pods namespace to view the Pods for test jobs and to inspect their logs directly.
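A read-only Role/RoleBinding pair for this could look like the sketch below; how GitHub org membership maps to a cluster group is an open assumption here (`metal3-io` as the group name is illustrative):

```yaml
# Sketch: read-only access to Pods and their logs in the test-pods namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: test-pods-reader
  namespace: test-pods
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metal3-io-members-read
  namespace: test-pods
subjects:
- kind: Group
  name: metal3-io   # assumption: group mapped from the GitHub org via the auth layer
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: test-pods-reader
  apiGroup: rbac.authorization.k8s.io
```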

Add yaml linter job to CI

As of now, project-infra only has the check-prow-config test as default and mandatory; other integration tests can be triggered on a per-need basis. We should add a YAML linter job to CI that catches unexpected formatting, blank spaces, etc. in YAML files being changed or added.
The reason is that there have been multiple cases where inappropriate YAML (mostly extra blank spaces) was merged unnoticed into the Prow config files, resulting in unformatted and hard-to-read Kubernetes resource YAML definitions being created in the Prow cluster.
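The job could be as simple as linting only the YAML files a PR touches; a sketch, assuming a yamllint installation and that PRs target main:

```shell
# Sketch: list the YAML files changed relative to the target branch.
changed_yaml() {
  git diff --name-only "origin/${1:-main}..." | grep -E '\.ya?ml$' || true
}

# The CI step would then run:
#   changed_yaml main | xargs --no-run-if-empty yamllint
```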

All Centos tests are failing

Currently everything running on CentOS images is failing: there is a problem with the base image, whose RPM packages conflict with newer packages. As a result, everything involving an upgrade in the tests, such as sudo dnf upgrade -y, fails.
https://jenkins.nordix.org/blue/organizations/jenkins/metal3-centos-e2e-integration-test-main/detail/metal3-centos-e2e-integration-test-main/150/pipeline/

The error is the following:

[2024-05-07T05:29:39.094Z] Error: 
[2024-05-07T05:29:39.094Z]  Problem 1: package nbdkit-1.38.0-1.el9.x86_64 from appstream requires (nbdkit-selinux if selinux-policy-targeted), but none of the providers can be installed
[2024-05-07T05:29:39.094Z]   - cannot install the best update candidate for package selinux-policy-targeted-38.1.35-2.el9.noarch
[2024-05-07T05:29:39.094Z]   - cannot install the best update candidate for package nbdkit-1.36.2-1.el9.x86_64
[2024-05-07T05:29:39.094Z]  Problem 2: package nbdkit-1.38.0-1.el9.x86_64 from appstream requires (nbdkit-selinux if selinux-policy-targeted), but none of the providers can be installed
[2024-05-07T05:29:39.094Z]   - problem with installed package selinux-policy-targeted-38.1.35-2.el9.noarch
[2024-05-07T05:29:39.094Z]   - problem with installed package nbdkit-1.36.2-1.el9.x86_64
[2024-05-07T05:29:39.094Z]   - package selinux-policy-targeted-38.1.35-2.el9.noarch from @System requires selinux-policy = 38.1.35-2.el9, but none of the providers can be installed
[2024-05-07T05:29:39.094Z]   - package selinux-policy-targeted-38.1.35-2.el9.noarch from baseos requires selinux-policy = 38.1.35-2.el9, but none of the providers can be installed
[2024-05-07T05:29:39.094Z]   - package nbdkit-1.36.2-1.el9.x86_64 from @System requires nbdkit-basic-filters(x86-64) = 1.36.2-1.el9, but none of the providers can be installed
[2024-05-07T05:29:39.094Z]   - package nbdkit-1.36.2-1.el9.x86_64 from appstream requires nbdkit-basic-filters(x86-64) = 1.36.2-1.el9, but none of the providers can be installed
[2024-05-07T05:29:39.094Z]   - cannot install both selinux-policy-38.1.36-1.el9.noarch from baseos and selinux-policy-38.1.35-2.el9.noarch from @System
[2024-05-07T05:29:39.095Z]   - cannot install both selinux-policy-38.1.36-1.el9.noarch from baseos and selinux-policy-38.1.35-2.el9.noarch from baseos
[2024-05-07T05:29:39.095Z]   - cannot install both nbdkit-basic-filters-1.38.0-1.el9.x86_64 from appstream and nbdkit-basic-filters-1.36.2-1.el9.x86_64 from @System
[2024-05-07T05:29:39.095Z]   - cannot install both nbdkit-basic-filters-1.38.0-1.el9.x86_64 from appstream and nbdkit-basic-filters-1.36.2-1.el9.x86_64 from appstream
[2024-05-07T05:29:39.095Z]   - cannot install the best update candidate for package selinux-policy-38.1.35-2.el9.noarch
[2024-05-07T05:29:39.095Z]   - cannot install the best update candidate for package nbdkit-basic-filters-1.36.2-1.el9.x86_64

Sync labels across repos

@stbenjam proposed using label_sync to sync labels across repos here: #12

My first try with it didn't work, so I proposed a revert (#19). This issue is a reminder to come back and try to make it work later.

Discrepancy between Jenkins pipelines and Prow's Config

There is a discrepancy between the Jenkins pipeline configs and the Prow config. The root cause was the decision to keep only the latest upgrade job for each branch; the Prow config was never updated to reflect this.

Furthermore, none of the configs for branch release-1.7 match the config that actually exists. We should check all the Prow configs and make them reflect the tests in Jenkins.

General state of cluster is not fetched for logs

We are missing all the key indicators of where to begin debugging, like kubectl get pods -A. We fetch individual logs and describes from all containers, but without a top-level listing we have no clue where to look.

Add at least:

  • kubectl get pods -A
  • other key listings/statuses
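The listings above could be captured by a small helper that runs before the per-container collection; a sketch, assuming `kubectl` access (the BareMetalHost listing is an extra guess for a Metal3 cluster):

```shell
# Sketch: grab the top-level cluster state into an output directory.
collect_overview() {
  local out="${1:-logs}"
  mkdir -p "$out"
  kubectl get pods -A -o wide > "$out/pods.txt" || true
  kubectl get nodes -o wide > "$out/nodes.txt" || true
  kubectl get events -A --sort-by=.lastTimestamp > "$out/events.txt" || true
  # Metal3-specific state; assumes the BareMetalHost CRD is installed.
  kubectl get bmh -A > "$out/baremetalhosts.txt" 2>&1 || true
}
```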

Collect logs from initContainers

Currently we only collect logs from the "normal" containers in each Pod (i.e. the ones defined in spec.containers). It would be good to also get the logs for init containers since it can sometimes happen that they get stuck.

Make the run_fetch_logs.sh script collect logs from spec.initContainers also.
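The loop that run_fetch_logs.sh would need could be sketched as follows (function name and output layout are illustrative):

```shell
# Sketch: fetch logs for every init container of every pod in a namespace.
fetch_init_logs() {
  local ns="$1" out="$2"
  mkdir -p "$out"
  for pod in $(kubectl get pods -n "$ns" -o jsonpath='{.items[*].metadata.name}'); do
    # List init containers only; pods without any are skipped naturally.
    for c in $(kubectl get pod -n "$ns" "$pod" -o jsonpath='{.spec.initContainers[*].name}'); do
      kubectl logs -n "$ns" "$pod" -c "$c" > "$out/${pod}-init-${c}.log" || true
    done
  done
}
```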

Merge parameters DISTRIBUTION and TARGET_NODE_OS

These parameters are always set to either ubuntu or centos, the only difference being that TARGET_NODE_OS is capitalized and DISTRIBUTION is not.

TARGET_NODE_OS=Ubuntu
DISTRIBUTION=ubuntu

We never mix them so that we have Ubuntu and centos, which means that we could drop one of them. This would simplify the pipelines and reduce the number of variables. Note that TARGET_NODE_OS is translated to IMAGE_OS in the pipeline env.
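If one variable were dropped, the capitalized form could simply be derived from the lower-case one; a sketch using the variable names from this issue:

```shell
# Derive TARGET_NODE_OS from DISTRIBUTION by capitalizing the first letter,
# so only one parameter needs to be set on the job.
DISTRIBUTION="ubuntu"
TARGET_NODE_OS="$(printf '%.1s' "$DISTRIBUTION" | tr '[:lower:]' '[:upper:]')${DISTRIBUTION#?}"
echo "$TARGET_NODE_OS"   # -> Ubuntu
```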

Rebase all PRs in Jenkins jobs

In order to prevent issues of old branches not including some fix commits, and to test the commit as if merged, the CI should rebase all commits on top of the target branch before running the tests.
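The rebase step could be sketched as a small helper run before the tests; the function name and the assumption that the remote is `origin` are illustrative:

```shell
# Sketch: rebase the PR's commits on top of the target branch so the tested
# tree matches what a merge would produce. A clean failure here surfaces
# rebase conflicts early instead of testing a stale base.
rebase_onto_target() {
  local target="${1:-main}"
  git fetch origin "$target" &&
  git rebase "origin/$target"
}
```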

Migration to prow jenkins operator

Now that we have moved most of the JJB jobs and pipelines to be triggered from Prow, this issue reports the current status and lists the next TODOs:

Done:

🟢 job_capm3_e2e_basic_tests.yml
🟢 job_capm3_e2e_clusterctl_upgrade_tests_prow.yml
🟢 job_capm3_e2e_feature_tests_prow.yml
🟢 job_capm3_e2e_integration_tests_prow.yml
🟢 job_capm3_e2e_k8s_upgrade_tests_prow.yml
🟢 job_capm3_periodic_e2e_clusterctl_upgrade_tests_prow.yml
🟢 job_capm3_periodic_e2e_feature_tests_prow.yml
🟢 job_capm3_periodic_e2e_integration_tests_prow.yml
🟢 job_capm3_periodic_e2e_k8s_upgrade_tests_prow.yml
🟢 job_periodic_clean.yml
🟢 job_capm3_periodic_e2e_ephemeral_tests.yml
🟢 job_capm3_periodic_integration_tests.yml
🟢 job_dev_env_integration_tests.yml
🟢 job_integration_tests.yml
🟢 job_bml_integration_tests.yml
🟢 job_bml_periodic_integration_tests.yml

In progress:

🟡 job_fullstack_building_test.yml
🟡 job_fullstack_project-infra_building_test.yml
🟡 job_periodic_fullstack_building.yml

Does not require any changes:

⚪ job_artifact_cleanup.yml
⚪ job_ci_image_building.yml
⚪ job_container_image_building.yaml
⚪ job_openstack_node_image_building.yml
⚪ job_update_nordix_repos.yml

Deleted:

🔴 job_docker_image_building.yml https://gerrit.nordix.org/c/infra/cicd/+/21248
🔴 job_metal3_dev_tools_integration_test.yml https://gerrit.nordix.org/c/infra/cicd/+/21020
🔴 job_openstack_image_building.yml https://gerrit.nordix.org/c/infra/cicd/+/21102
🔴 job_ironic_image_build_test.yml https://gerrit.nordix.org/c/infra/cicd/+/21423

Update gh required checks by Admin

🟢 CAPM3 https://github.com/metal3-io/cluster-api-provider-metal3
🟢 IPAM https://github.com/metal3-io/ip-address-manager
🟢 BMO https://github.com/metal3-io/baremetal-operator
🟢 DEV_ENV https://github.com/metal3-io/metal3-dev-env
🟢 Project-infra https://github.com/metal3-io/project-infra
🟢 Ironic-image https://github.com/metal3-io/ironic-image
🟢 Mariadb-image https://github.com/metal3-io/mariadb-image
🟢 ironic-ipa-downloader https://github.com/metal3-io/ironic-ipa-downloader

To do

  • Remove the Ubuntu required test from the BMO config
  • BMO release-0.6 tests are not set at all, apart from the BMO e2e test
  • For the mariadb-image repo, add CentOS tests in project-infra

Enable milestone and milestoneapplier plugins for Prow

Enable milestone and milestoneapplier plugins for Prow to automatically assign a milestone to closed PRs, and allow milestone setting for project members in case they cannot do it via the menu.

We should grab the config for this from CAPI, with the notable difference that we have many repos with different milestones in them, and some repos have no milestones at all.

https://prow.k8s.io/plugins

Rotate the dev key in DiB image building workflow

When building an image with DiB, it accepts an environment variable called DIB_DEV_USER_AUTHORIZED_KEYS. This takes a file path, and the file is copied into the created image as the authorized_keys file for the user defined by the DIB_DEV_USER_USERNAME environment variable.

To rotate the current key, a new ed25519 key should be generated and added to the authorized_keys file so it is accepted for login. Once the new key is validated to be working, the old key should be rotated out and removed.
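The rotation could be sketched as below; the key file name and comment are illustrative, and the temp directory stands in for wherever the CI secret actually lives:

```shell
# Sketch: generate the replacement ed25519 key and stage the transition.
keydir="$(mktemp -d)"
ssh-keygen -q -t ed25519 -N "" -f "$keydir/metal3_dev_key" -C "metal3-ci dev key rotation"

# Transition phase: append the new public key so BOTH keys are accepted; the
# resulting file is what DIB_DEV_USER_AUTHORIZED_KEYS would point at.
cat "$keydir/metal3_dev_key.pub" >> "$keydir/authorized_keys"

# After validating login with the new key, delete the old key's line
# from authorized_keys to complete the rotation.
```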
