kubernetes / test-infra

Test infrastructure for the Kubernetes project.

License: Apache License 2.0

Shell 10.12% Python 25.34% CSS 0.25% Makefile 1.51% HTML 1.00% JavaScript 2.01% Go 52.33% Dockerfile 1.38% TypeScript 0.54% Jsonnet 2.72% HCL 0.13% Jinja 0.35% Smarty 2.33%

k8s-sig-testing

test-infra's Issues

Present build step failures in junit files?

e.g. resource leaks would be handy in a junit file: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce/17595/consoleFull#-167188570730f50738-9bd5-4d49-9dda-6ce3409acae5

Federation e2e failure: wrong ci build version

from kubernetes-e2e-gce-federation job logs

++ gsutil cat gs://kubernetes-release/ci/latest.txt
+ build_version=v1.3.0-beta.0
+ echo 'Using published version ci/v1.3.0-beta.0 (from ci/latest)'
+ fetch_tars_from_gcs ci v1.3.0-beta.0
+ local -r bucket=ci
+ local -r build_version=v1.3.0-beta.0

This is not the build_version that kubernetes-federation-build is pushing, which naturally causes the downstream kubernetes-e2e-gce-federation job to pull the wrong tarballs and fail.

I don't yet understand why this issue took so long to pop up, as the federation stuff has been merged for weeks and this started happening a few days ago.

Switch over to using stored service account credentials

Rather than the default service account.

Dockerize PR Jenkins

It's causing a lot of pain over in kubernetes/kubernetes#26028

404 on getting dockerized-e2e-runner.sh in all kubernetes builds

++ curl -fsS --retry 3 https://raw.githubusercontent.com/kubernetes/kubernetes/test-infra/jenkins/dockerized-e2e-runner.sh
curl: (22) The requested URL returned error: 404

The correct link should be https://raw.githubusercontent.com/kubernetes/test-infra/master/jenkins/dockerized-e2e-runner.sh(?)
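A minimal sketch of why the first URL returns a 404, assuming the usual raw.githubusercontent.com layout of `/<org>/<repo>/<ref>/<path>`. The helper name `split_raw_url` is hypothetical and only illustrates how the two URLs are interpreted:

```python
from urllib.parse import urlsplit


def split_raw_url(url):
    """Split a raw.githubusercontent.com URL into (org, repo, ref, path).

    Assumes the ref (branch or tag) contains no slash.
    """
    parts = urlsplit(url).path.lstrip("/").split("/", 3)
    if len(parts) != 4:
        raise ValueError("expected /<org>/<repo>/<ref>/<path>: %s" % url)
    return tuple(parts)


broken = ("https://raw.githubusercontent.com/kubernetes/kubernetes/"
          "test-infra/jenkins/dockerized-e2e-runner.sh")
proposed = ("https://raw.githubusercontent.com/kubernetes/test-infra/"
            "master/jenkins/dockerized-e2e-runner.sh")

# The broken URL treats "test-infra" as a ref inside kubernetes/kubernetes.
print(split_raw_url(broken))
# The proposed URL points at the master ref of kubernetes/test-infra.
print(split_raw_url(proposed))
```

Read this way, the failing request asks for a `test-infra` ref in the kubernetes/kubernetes repository, which would explain the 404 if no such ref exists.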

/cc @k8s-oncall

metadata cache needs to start automatically on reboot

Otherwise, when you restart Jenkins VMs, builds will start failing everywhere.
(Ask how I know.)

store junit data in BigQuery and/or sqlite to make it really easy to query and generate reports

I think this would be very useful for investigation/exploration.

F-Secure lists ci-test.k8s.io and submit-queue.k8s.io as harmful

Example:

😢

@ixdy

investigate docker-in-docker brokenness with kubekins-test and docker 1.11.1

As part of the Jenkins VM rebuild today, some nodes were upgraded to docker 1.11.1, instead of 1.9.1, as we'd been using before.

It seems that this causes problems for docker-in-docker in our kubekins-test image:

Verifying ./hack/../hack/verify-api-reference-docs.sh
Note: This assumes that swagger spec has been updated. Please run hack/update-swagger-spec.sh to ensure that.
Generating api reference docs at /go/src/k8s.io/kubernetes/_output/generated_html
Reading swagger spec from: /var/lib/jenkins/workspace/kubernetes-pull-test-unit-integration@2/api/swagger-spec/
docker: error while loading shared libraries: libltdl.so.7: cannot open shared object file: No such file or directory
!!! Error in ./hack/update-api-reference-docs.sh:71
  'docker run ${user_flags} --rm -v "${TMP_IN_HOST}":/output:z -v "${SWAGGER_PATH}":/swagger-source:z gcr.io/google_containers/gen-swagger-docs:v5 "${SWAGGER_JSON_NAME}" "${REGISTER_FILE_URL}"' exited with status 127
Call stack:
  1: ./hack/update-api-reference-docs.sh:71 main(...)
Exiting with status 1
!!! Error in ./hack/../hack/verify-api-reference-docs.sh:34
  '"./hack/update-api-reference-docs.sh" "${OUTPUT_DIR}"' exited with status 1
Call stack:
  1: ./hack/../hack/verify-api-reference-docs.sh:34 main(...)
Exiting with status 1
FAILED   ./hack/../hack/verify-api-reference-docs.sh    1s

@bprashanth

move federated test result config somewhere more prominent and make everything use it

jenkins/test-history/buckets.json is sort-of the source of truth for which buckets we care about, except that there is also configuration in gubernator/main.py, jenkins/test-history/gen_json.py, the submit queue, and testgrid. (And maybe other places, who knows.)

It'd be nice if we moved the configuration somewhere more prominent (maybe even top-level?) and then got all of our tooling using it.

It should also be well-documented.

(It'd be a good idea to add owners for each of the various builds at that time, too.)

Run kubernetes-e2e-gce-federation as part of tests that merge bot runs on each PR

This is to catch PRs that break federation tests.

cc @kubernetes/sig-testing @ixdy @kubernetes/sig-cluster-federation

First time tests run on a project it cannot ssh to nodes

Per @spxtr, we need to run gcloud compute config-ssh before running tests. Otherwise, some raw ssh tests will fail the first time we run tests on a node, until we later run gcloud compute ssh against that node to collect logs.

somehow indicate which JUnit file a test failure came from

Motivating example: unit/integration test runs like https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/27600/kubernetes-pull-test-unit-integration/32233 have multiple JUnit files, with associated "verbose output" text files that can help debug test failures.

Rather than searching through each file, it'd be nice to know which one to go to for the verbose output. (Maybe even link directly to that file if it exists? May be getting too specific though.)

Cross-link gubernator pages

Pages should be discoverable through browsing.

/ to /pr
/pr/1345 to /pr/user
/build/$PR_LOGS/... to /pr/user
/pr/user to /pr/123? 
    Currently links to github directly, but we have 
    a better way to visualize the test results.

Jenkins PR job details link url is wrong

The Details link for PR e2e tests looks like this:

https://console.cloud.google.com/storage/browser/kubernetes-jenkins/pr-logs/pull/$%7BghprbPullId%7D/kubernetes-pull-build-test-e2e-gce/$%7BBUILD_NUMBER%7D/

It looks like the Jenkins parameters (both ghprbPullId and BUILD_NUMBER appear as literal, URL-encoded ${...} placeholders) are not getting substituted in properly.
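To make the symptom concrete, here is a small sketch that decodes the link and lists the placeholders that were never expanded. The function names and the build number used in the example are hypothetical; only the URL and the parameter names come from the report above.

```python
import re
from string import Template
from urllib.parse import unquote

DETAILS_URL = (
    "https://console.cloud.google.com/storage/browser/kubernetes-jenkins/"
    "pr-logs/pull/$%7BghprbPullId%7D/kubernetes-pull-build-test-e2e-gce/"
    "$%7BBUILD_NUMBER%7D/"
)


def unexpanded_params(url):
    """Return the names of ${...} placeholders left in a URL-encoded link."""
    return re.findall(r"\$\{(\w+)\}", unquote(url))


def render_details_url(url, params):
    """Decode the link and substitute the placeholders from params."""
    return Template(unquote(url)).substitute(params)


print(unexpanded_params(DETAILS_URL))
# Illustrative values only: the PR number is from the reference PR,
# the build number is made up.
print(render_details_url(DETAILS_URL, {"ghprbPullId": "26754", "BUILD_NUMBER": "1"}))
```

A check like `unexpanded_params` could in principle run against generated status links to catch this class of bug before the link is posted.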

Reference PR: kubernetes/kubernetes#26754

/cc @spxtr @fejta @ixdy

kubelet-gce-e2e-ci doesn't upload JUnit test results

@pwittrock

agent-ctl.sh should specify --project and --zone for all gcloud calls

Not sure how it even worked to start with.

Gubernator should link jobs to testgrid

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce/20010/ shows a link up at the top to kubernetes-e2e-gce... unfortunately it links to the storage browser rather than http://testgrid.k8s.io/google-gce

We should make it link to testgrid instead.

test-history/gen_json script should not depend on accessing Jenkins

It would be useful if, like the munger, the test-history scripts depended solely on GCS buckets as input. This would allow federating tests results on the dashboard, not just in PR statuses, via the familiar GCS bucket format.

Right now, accessing the jenkins server is used to list job names, their builds (with status), and the timestamp. These can be replaced with, respectively, a config file listing job name -> gcs path mappings, reading build numbers from the bucket, and parsing started.json + finished.json.
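The replacement described above could look roughly like the following sketch. It is not the gen_json.py implementation: the function names are hypothetical, and the `timestamp` and `result` fields are assumptions about what started.json and finished.json contain. Listing and downloading the GCS objects is left out so the example stays self-contained.

```python
import json


def build_numbers(object_names, prefix):
    """Extract numeric build directories from object names under prefix."""
    builds = set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        first = name[len(prefix):].split("/", 1)[0]
        if first.isdigit():
            builds.add(int(first))
    return sorted(builds)


def summarize_build(job, build, started_json, finished_json=None):
    """Combine started.json and finished.json into one build record.

    Assumes both files carry a "timestamp" (seconds) and that
    finished.json carries a "result" string; a missing finished.json is
    reported as UNFINISHED.
    """
    started = json.loads(started_json)
    finished = json.loads(finished_json) if finished_json else {}
    start_ts = started.get("timestamp")
    end_ts = finished.get("timestamp")
    duration = None
    if start_ts is not None and end_ts is not None:
        duration = end_ts - start_ts
    return {
        "job": job,
        "build": build,
        "started": start_ts,
        "result": finished.get("result", "UNFINISHED"),
        "duration": duration,
    }


# Hypothetical object listing for one job prefix.
names = [
    "logs/kubernetes-e2e-gce/9/finished.json",
    "logs/kubernetes-e2e-gce/10/started.json",
    "logs/kubernetes-e2e-gce/latest-build.txt",
]
print(build_numbers(names, "logs/kubernetes-e2e-gce/"))
print(summarize_build("kubernetes-e2e-gce", 10,
                      '{"timestamp": 100}',
                      '{"timestamp": 160, "result": "SUCCESS"}'))
```

The job name to GCS path mapping from the config file would supply the `prefix` argument for each job.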

Auto-file issues for all broken tests

We've had tests and even entire test suites broken for days, weeks, even months and nobody noticed. @lavalamp suggested that we could auto-file issues for all broken tests, as we do for flaky tests. That seems like a good idea to me.

Move heapster/cadvisor resources into their own project

I believe these have some special VMs, etc. Let's move them to their own project.

24-hour test history is misleading when entire test suite fails

http://storage.googleapis.com/kubernetes-test-history/static/index.html

shows 78 tests passed, 0 failed, 0 unstable, 0 broken for this suite:

78  0   kubernetes-e2e-gke-1.2-1.3-upgrade-cluster-new 2    2   0   0

but only 2 tests.

In fact, the whole suite has been failing for weeks. The only thing working has been gcloud installation.

https://k8s-testgrid.appspot.com/google-upgrade#gke-1.2-1.3-upgrade-cluster-new&width=3

add Jenkins metadata to GCE VMs

When trying to clean up old VMs or other resources, I'm often left wondering "where did this even come from?".

We could probably add metadata describing the Jenkins job and build number that spawned the VM, as well as the PR# on PR Jenkins. There's even an add-instance-metadata function in cluster/gce/util.sh we can use.
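As a rough illustration, the metadata could be derived from the environment variables Jenkins sets for a build. The metadata key names below (`jenkins-job`, `jenkins-build`, `pull-request`) are hypothetical, and the sketch only builds a `KEY=VALUE,...` string in the form accepted by gcloud's `--metadata` flag; it does not call add-instance-metadata or gcloud itself.

```python
import os

# Hypothetical metadata keys mapped to the Jenkins variables that fill them.
# ghprbPullId is only set on PR Jenkins.
METADATA_SOURCES = {
    "jenkins-job": "JOB_NAME",
    "jenkins-build": "BUILD_NUMBER",
    "pull-request": "ghprbPullId",
}


def jenkins_metadata(env):
    """Collect the metadata entries whose source variable is set."""
    return {key: env[var] for key, var in METADATA_SOURCES.items() if env.get(var)}


def metadata_flag(metadata):
    """Format entries as a --metadata=KEY=VALUE,... argument."""
    pairs = ",".join("%s=%s" % (k, v) for k, v in sorted(metadata.items()))
    return "--metadata=" + pairs


if __name__ == "__main__":
    print(metadata_flag(jenkins_metadata(os.environ)))
```

With such labels in place, a cleanup script could trace a leaked VM back to the job and build that created it.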

Clusters are not turned down in case of failure.

This may be OK for small clusters that are running all the time, but is a pain for huge clusters running once a week. We should, at least, have a flag that will allow for cluster deletion on failure.
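One way to picture the requested flag is the sketch below. The function and parameter names (`run_with_teardown`, `delete_on_failure`) are hypothetical and stand in for whatever the runner scripts would actually use; `up`, `test`, and `down` are placeholders for the cluster bring-up, e2e run, and teardown steps.

```python
def run_with_teardown(up, test, down, delete_on_failure=True):
    """Bring a cluster up, run tests, and tear it down.

    On success the cluster is always deleted. On failure it is deleted
    only when delete_on_failure is set, so small always-on clusters can
    still be kept around for debugging.
    """
    up()
    succeeded = False
    try:
        test()
        succeeded = True
    finally:
        if succeeded or delete_on_failure:
            down()
```

The test failure still propagates to the caller, so the job is marked as failed either way; only the cleanup behavior changes.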

cc @wojtek-t @fejta @ixdy

Running queue-health containers as part of submit-queue to enable running against multiple repositories

As part of kubernetes-retired/contrib#1304, we are trying to get the submit-queue to run against other kubernetes org repositories.
The chart and history file generation would need to run per submit-queue instance.

Put heapster/cadvisor jobs under source control

A combination of

https://github.com/kubernetes/test-infra/blob/master/jenkins/job-configs/kubernetes-jenkins-pull/kubernetes-pull.yaml

and not sure.. maybe https://github.com/kubernetes/test-infra/blob/master/jenkins/job-configs/kubernetes-jenkins/node-e2e.yaml?

Don't leave 9 "ok to test?" messages on every PR

I think each PR builder job leaves a comment saying this. We should not do this, it looks ridiculous...

Cannot merge PR

My PR #105 cannot be merged because of some problems with the CLA. Although I work at Google, the bot added the CLA:NO label, and manually modifying the labels didn't make my PR mergeable.

CC @gmarek

Federation e2e tests failing: pulling ci tarball from wrong bucket.

From kubernetes-e2e-gce-federation logs:

+ local -r bucket=kubernetes-release-dev
++ gsutil cat gs://kubernetes-release-dev/ci/latest.txt
+ build_version=v1.4.0-alpha.0.1035+d30fd0cb0c23ab
+ echo 'Using published version kubernetes-release-dev/v1.4.0-alpha.0.1035+d30fd0cb0c23ab (from ci/latest)'
+ fetch_tars_from_gcs gs://kubernetes-release-dev/ci v1.4.0-alpha.0.1035+d30fd0cb0c23ab
+ local -r gspath=gs://kubernetes-release-dev/ci
+ local -r build_version=v1.4.0-alpha.0.1035+d30fd0cb0c23ab
+ echo 'Pulling binaries from GCS; using server version gs://kubernetes-release-dev/ci/v1.4.0-alpha.0.1035+d30fd0cb0c23ab.'
+ gsutil -mq cp gs://kubernetes-release-dev/ci/v1.4.0-alpha.0.1035+d30fd0cb0c23ab/kubernetes.tar.gz gs://kubernetes-release-dev/ci/v1.4.0-alpha.0.1035+d30fd0cb0c23ab/kubernetes-test.tar.gz .
Using published version kubernetes-release-dev/v1.4.0-alpha.0.1035+d30fd0cb0c23ab (from ci/latest)
Pulling binaries from GCS; using server version gs://kubernetes-release-dev/ci/v1.4.0-alpha.0.1035+d30fd0cb0c23ab.

It has pulled a tarball published by kubernetes-build, not kubernetes-federation-build.

That later causes this error:

FATAL: tagfile /workspace/kubernetes/hack/e2e-internal/../../cluster/../cluster/gce/../../cluster/gce/../../cluster/../federation/manifests/federated-image.tag does not exist. Make sure that you have run build/push-federation-images.sh

I've fixed this error once before ( #146 ) by having kubernetes-federation-build and kubernetes-e2e-gce-federation use an entirely separate ci bucket, so something must have changed since that last PR was merged.

/cc @quinton-hoole @nikhiljindal @ixdy @spxtr

cluster logs not collected from dockerized e2e on timeout

The kubernetes tarball is extracted inside the container in dockerized e2e, which gives us kubernetes/cluster/log-dump.sh. On timeout, we try to call log-dump.sh, but do so outside the container, so it's no longer available.

We should probably move the timeout handling inside the dockerized e2e container.

List of changes in a given build no longer visible

I may be mistaken, but I don't see a list of changes that went into a given build. Now I see "No changes." even in the weekly enormous-cluster runs (and I have a hard time believing that there were no changes in the last week).

cc @fejta @ixdy @spxtr @wojtek-t

nslookup for Jenkins executors

We had some e2e tests failing due to nslookup not being available to the e2e framework.

Install bind-utils package (or equivalent) on the Jenkins worker machines and/or within the e2e running container image?

/cc @nikhiljindal @mml

ref kubernetes/kubernetes#28030

Add a build job for kops Docker images

I'd like to add a build job to pump out kops builds, so I can start using it for AWS bring-up on Jenkins as well. I recently pushed a PR to that repo to build an easy container for kops (just to avoid figuring out exactly how to package/release it just yet), but then we need to figure out how to push builds somewhere. This isn't hard, but right now gcr.io/google-containers is locked down, so a build job can't actually push there.

So here's a suggested route, putting up an issue since about half this stuff isn't code approvals:

Create a kubekins-image-builder@kubernetes-jenkins.iam.gserviceaccount.com service account.
Give kubekins-image-builder@kubernetes-jenkins.iam.gserviceaccount.com rights just to push to the gcr.io bucket for the kubernetes-jenkins project itself, i.e. gcr.io/kubernetes-jenkins
Use that in a new job to build/push kops.

I did consider a couple of alternate routes:

Giving kubekins-image-builder@kubernetes-jenkins.iam.gserviceaccount.com rights to google-containers. Rejected because this gives anyone with https://github.com/kubernetes/test-infra or Jenkins access an easy way to trash a production bucket.
Creating another project. I'm mostly indifferent to naming, so if someone wants CI docker pushes to go somewhere else, find a project name that's not taken and we can work on that.

cc @kubernetes/test-infra-maintainers @justinsb

Submit queue chart doesn't handle submit queue crashes gracefully

Cause: kubernetes-retired/contrib#1275

For some reason the chart thinks ~60 PRs merged.

I suspect a network blip caused both this anomaly and the submit queue crash. Maybe the poll script needs to record the difference between "initializing" and "unreachable".
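A sketch of that distinction, under heavy assumptions: `fetch` is a hypothetical callable that returns the parsed status of the submit queue or raises `OSError` on a network failure, and the `initialized` field is invented for illustration rather than taken from the real status endpoint.

```python
def classify_poll(fetch):
    """Classify one poll of the submit queue as ok, initializing, or unreachable."""
    try:
        status = fetch()
    except OSError:
        # Connection refused, timeout, DNS failure, and similar errors.
        return "unreachable"
    if not status.get("initialized", False):
        return "initializing"
    return "ok"


def record(history, state, merge_count):
    """Append a sample, keeping the merge count only for healthy polls."""
    history.append((state, merge_count if state == "ok" else None))
    return history
```

Keeping the state next to each sample would let the chart skip gaps caused by a crash or a network blip instead of interpreting them as a burst of merges.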

Move PR e2e tests into their own project

We only want jenkins infra running in kubernetes-jenkins-pull. Let's start these e2e clusters in a separate project.

Record/display cluster vital statistics at a glance for each run

Request: After a cluster has been brought up, record:

the actual cluster version the cluster thinks it's running (not the version we attempted to launch, these can be sometimes different if there's a bug/misconfiguration in a GKE test, for instance)
the docker version
the kernel uname string of the nodes
... etc.

and be able to show those at a glance. I suspect a lot of this could be done with log post-processing, but some of it is difficult to find at all.

cc @cjcullen

Put k8s.io redirection under source control

Let's address #195 by putting the redirector under source control.

e2e-runner.sh no longer compatible with v1.2 and v1.3.4 hack/e2e.go

Those versions of hack/e2e.go do not recognize the --dump flag.

Gubernator should also show passed and skipped tests

It can be helpful for determining whether a run passed because all tests actually passed or if many tests were skipped.

This can probably just be a smaller list at the bottom underneath any failures.

Set up CI running builds and unit/integration tests on OS X

We seem to break the build scripts on OS X not infrequently. It'd be nice to have some sort of CI to detect this before developers do.

[gubernator] FR: expand skipped lines

Feature request: expand skipped lines in gubernator logs.

E.g.

stderr: fatal: reference is not a tree: e5c3111e8dcb432df435dab96d7a19641adf0562

    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1719)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$500(CliGitAPIImpl.java:63)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$9.execute(CliGitAPIImpl.java:1984)
... skipping 9 lines ...
    at java.lang.Thread.run(Thread.java:745)
[xUnit] [INFO] - Starting to record.

Make the ... skipping 9 lines ... clickable, and expand the lines in place (e.g. unhide a hidden block).

/cc @mnshaw @rmmh

Is there a way to quickly push new config to Jenkins?

Logs from Kubemark tests are no longer uploaded to GCS buckets on failure.

Which makes debugging of regressions way harder.

cc @fejta @ixdy @spxtr @wojtek-t

Move pr-test.k8s.io over to gubernator

rather than a direct gcs link

Put node e2e tests under source control

A combination of

https://github.com/kubernetes/test-infra/blob/master/jenkins/job-configs/kubernetes-jenkins-pull/kubernetes-pull.yaml

and

https://github.com/kubernetes/test-infra/blob/master/jenkins/job-configs/kubernetes-jenkins/node-e2e.yaml

metadata cache server curl check doesn't work

The curl check in the metadata cache control script doesn't work, as curl will fail over to the real metadata server:

$ curl -v http://metadata.google.internal/computeMetadata/v1/instance/network-interfaces/0/ip
* About to connect() to metadata.google.internal port 80 (#0)
*   Trying 10.240.0.2...
* Connection refused
*   Trying 10.240.0.2...
* Connection refused
*   Trying 10.240.0.2...
* Connection refused
*   Trying 10.240.0.2...
* Connection refused
*   Trying 10.240.0.2...
* Connection refused
*   Trying 10.240.0.2...
* Connection refused
*   Trying 10.240.0.2...
* Connection refused
*   Trying 10.240.0.2...
* Connection refused
*   Trying 10.240.0.2...
* Connection refused
*   Trying 10.240.0.2...
* Connection refused
*   Trying 169.254.169.254...
* connected
* Connected to metadata.google.internal (169.254.169.254) port 80 (#0)
> GET /computeMetadata/v1/instance/network-interfaces/0/ip HTTP/1.1
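A health check that cannot fail over has to connect to one specific address instead of the hostname. The sketch below assumes the cache listens on the first address in the trace (10.240.0.2 here) on port 80; the function name `port_open` is hypothetical, and this is not the check used by the control script.

```python
import socket


def port_open(ip, port, timeout=1.0):
    """Return True only if a TCP connection to exactly ip:port succeeds.

    Unlike resolving metadata.google.internal, this never falls back to
    another address such as the real metadata server.
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Address taken from the trace above; assumed to be the cache.
    print(port_open("10.240.0.2", 80))
```

This only confirms that something accepts connections on that address; a stricter check would also issue a metadata request against it.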

federation e2e gce automated tests on Jenkins fail consistently with token auth attempt failed with status: 403 Forbidden

@nikhiljindal I think you know about this, but just so we don't lose track of it, here's an issue to track it.

See kubernetes/kubernetes#31655 (comment) for an example...

@k8s-bot federation gce e2e test this

+++ [0829 21:48:48] Pushing gcr.io/k8s-jkns-pr-bldr-e2e-gce-fdrtn/hyperkube:v1.4.0-alpha.3.197_ef82f394a9e1ba
-> GCR repository detected. Using gcloud
The push refers to a repository [gcr.io/k8s-jkns-pr-bldr-e2e-gce-fdrtn/hyperkube](len: 1)
6864c6906300: Preparing
Post https://gcr.io/v2/k8s-jkns-pr-bldr-e2e-gce-fdrtn/hyperkube/blobs/uploads/: token auth attempt for registry: https://gcr.io/v2/token?account=oauth2accesstoken&scope=repository%3Ak8s-jkns-pr-bldr-e2e-gce-fdrtn%2Fhyperkube%3Apush%2Cpull&service=gcr.io request failed with status: 403 Forbidden
!!! Error in ./build/../build/../federation/cluster/common.sh:321
'gcloud docker push "${docker_image_tag}"' exited with status 1
Call stack:
1: ./build/../build/../federation/cluster/common.sh:321 push-federation-images(...)
2: ./build/push-federation-images.sh:29 main(...)
Exiting with status 1
Build step 'Execute shell' marked build as failure

there should be a command to rerun GKE smoke tests only

Moving here from kubernetes-retired/contrib#1322

cc @fejta @ixdy

For gke-large suite we have logs only from nodes from one node pool

In large clusters, we currently have more than one MIG. In particular, in GKE 2000-node clusters:
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-large-cluster
we have 2 node pools.

However, we have logs from nodes only from a single one. This is really painful for debugging...

@kubernetes/test-infra-maintainers

We need a make release job

kubernetes/kubernetes#30384 exposed the lack of a make release job. We need one of those, post-submit if necessary. @fejta @spxtr @ixdy
