kubernetes / test-infra
Test infrastructure for the Kubernetes project.
License: Apache License 2.0
We had some e2e tests failing due to nslookup not being available to the e2e framework.
Install bind-utils package (or equivalent) on the Jenkins worker machines and/or within the e2e running container image?
\cc @nikhiljindal @mml
rather than a direct gcs link
It's causing a lot of pain over in kubernetes/kubernetes#26028
It would be useful if, like the munger, the test-history scripts depended solely on GCS buckets as input. This would allow federating test results on the dashboard, not just in PR statuses, via the familiar GCS bucket format.
Right now, accessing the jenkins server is used to list job names, their builds (with status), and the timestamp. These can be replaced with, respectively, a config file listing job name -> gcs path mappings, reading build numbers from the bucket, and parsing started.json + finished.json.
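A rough sketch of what the GCS-only input could look like once the two metadata files for a build have been downloaded. The "timestamp" and "result" field names, the JOB_PATHS mapping, and the "RUNNING" fallback for a build with no finished.json are assumptions for illustration, not the actual test-history code.

```python
import json

# Hypothetical job name -> GCS path mapping; in practice this would be
# loaded from the config file described above.
JOB_PATHS = {
    "kubernetes-e2e-gce": "gs://kubernetes-jenkins/logs/kubernetes-e2e-gce",
}


def summarize_build(job, build, started_json, finished_json=None):
    """Combine started.json and finished.json for one build into one record."""
    started = json.loads(started_json)
    # A build that is still running has no finished.json yet.
    finished = json.loads(finished_json) if finished_json else {}
    start_ts = started.get("timestamp")
    end_ts = finished.get("timestamp")
    elapsed = None
    if start_ts is not None and end_ts is not None:
        elapsed = end_ts - start_ts
    return {
        "job": job,
        "build": build,
        "path": "%s/%s" % (JOB_PATHS[job], build),
        "started": start_ts,
        "elapsed": elapsed,
        # Assumed convention: no finished.json means the build is in progress.
        "result": finished.get("result", "RUNNING"),
    }
```

Listing the build numbers themselves would still need a bucket listing call, which is left out here.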
When trying to clean up old VMs or other resources, I'm often left wondering "where did this even come from?".
We could probably add metadata describing the Jenkins job and build number that spawned the VM, as well as the PR# on PR Jenkins. There's even an add-instance-metadata function in cluster/gce/util.sh we can use.
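As a sketch of the provenance data this could carry: the snippet below builds a key=value list from the Jenkins environment. JOB_NAME and BUILD_NUMBER are standard Jenkins variables and ghprbPullId is what the PR builder sets; the metadata key names (jenkins-job, jenkins-build, pull-request) are made up for illustration.

```python
import os


def provenance_metadata(env=None):
    """Build a key=value,... string describing which Jenkins build made a VM."""
    env = os.environ if env is None else env
    # (metadata key, environment variable) pairs; the metadata keys are
    # hypothetical names chosen for this sketch.
    sources = [
        ("jenkins-job", "JOB_NAME"),
        ("jenkins-build", "BUILD_NUMBER"),
        ("pull-request", "ghprbPullId"),
    ]
    # Skip anything that is unset or empty, e.g. ghprbPullId on non-PR jobs.
    pairs = ["%s=%s" % (key, env[var]) for key, var in sources if env.get(var)]
    return ",".join(pairs)
```

The resulting string is in the comma-separated form that gcloud's --metadata flag accepts, so it could presumably be handed to whatever call ends up tagging the instance.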
from kubernetes-e2e-gce-federation job logs
++ gsutil cat gs://kubernetes-release/ci/latest.txt
+ build_version=v1.3.0-beta.0
+ echo 'Using published version ci/v1.3.0-beta.0 (from ci/latest)'
+ fetch_tars_from_gcs ci v1.3.0-beta.0
+ local -r bucket=ci
+ local -r build_version=v1.3.0-beta.0
This is not the build_version that kubernetes-federation-build is pushing, which naturally causes the downstream kubernetes-e2e-gce-federation job to pull the wrong tarballs and fail.
I don't yet understand why this issue took so long to pop up, as the federation stuff has been merged for weeks and this started happening a few days ago.
This is to catch PRs that break federation tests.
cc @kubernetes/sig-testing @ixdy @kubernetes/sig-cluster-federation
Those versions of hack/e2e.go do not recognize the --dump flag.
We've had tests and even entire test suites broken for days, weeks, even months and nobody noticed. @lavalamp suggested that we could auto-file issues for all broken tests, as we do for flaky tests. That seems like a good idea to me.
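One possible way to pick out "broken" (as opposed to flaky) tests before filing anything, assuming per-test pass/fail history is already available. The history shape and the min_runs threshold are hypothetical, not something the existing tooling defines.

```python
def consistently_failing(history, min_runs=5):
    """Return tests whose most recent `min_runs` results were all failures.

    `history` maps a test name to a list of booleans, oldest first,
    where True means the test passed in that run.
    """
    broken = []
    for test, results in sorted(history.items()):
        recent = results[-min_runs:]
        # Require a full window so brand-new tests are not reported early.
        if len(recent) == min_runs and not any(recent):
            broken.append(test)
    return broken
```

A flaky test with even one recent pass falls outside this list, so it would stay with the existing flake-filing path.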
It can be helpful for determining whether a run passed because all tests actually passed or if many tests were skipped.
This can probably just be a smaller list at the bottom underneath any failures.
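A minimal sketch of pulling the skipped tests out of a JUnit file so they can be listed separately. It assumes the usual JUnit XML layout, where a skipped testcase carries a skipped child element and a failing one carries failure or error.

```python
import xml.etree.ElementTree as ET


def junit_counts(xml_text):
    """Return (passed, failed, skipped_names) for one JUnit XML document."""
    root = ET.fromstring(xml_text)
    passed, failed, skipped = 0, 0, []
    # iter() also works when the root is <testsuites> wrapping several suites.
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            skipped.append(case.get("name"))
        elif case.find("failure") is not None or case.find("error") is not None:
            failed += 1
        else:
            passed += 1
    return passed, failed, skipped
```

A run with zero failures but a long skipped list is then easy to tell apart from a run where everything actually executed.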
In large clusters, we currently have more than one MIG. In particular, in GKE 2000-node clusters:
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-large-cluster
we have 2 node pools.
However, we have logs from nodes only from a single one. This is really painful for debugging...
@kubernetes/test-infra-maintainers
We only want jenkins infra running in kubernetes-jenkins-pull. Let's start these e2e clusters in a separate project.
Moving here from kubernetes-retired/contrib#1322
+++ [0829 21:48:48] Pushing gcr.io/k8s-jkns-pr-bldr-e2e-gce-fdrtn/hyperkube:v1.4.0-alpha.3.197_ef82f394a9e1ba
-> GCR repository detected. Using gcloud
@nikhiljindal I think you know about this, but just so we don't lose track of it, here's an issue to track it.
See kubernetes/kubernetes#31655 (comment) for an example...
@k8s-bot federation gce e2e test this
The push refers to a repository [gcr.io/k8s-jkns-pr-bldr-e2e-gce-fdrtn/hyperkube](len: 1)
6864c6906300: Preparing
Post https://gcr.io/v2/k8s-jkns-pr-bldr-e2e-gce-fdrtn/hyperkube/blobs/uploads/: token auth attempt for registry: https://gcr.io/v2/token?account=oauth2accesstoken&scope=repository%3Ak8s-jkns-pr-bldr-e2e-gce-fdrtn%2Fhyperkube%3Apush%2Cpull&service=gcr.io request failed with status: 403 Forbidden
!!! Error in ./build/../build/../federation/cluster/common.sh:321
'gcloud docker push "${docker_image_tag}"' exited with status 1
Call stack:
1: ./build/../build/../federation/cluster/common.sh:321 push-federation-images(...)
2: ./build/push-federation-images.sh:29 main(...)
Exiting with status 1
Build step 'Execute shell' marked build as failure
The curl check in the metadata cache control script doesn't work, as curl will fail over to the real metadata server:
$ curl -v http://metadata.google.internal/computeMetadata/v1/instance/network-interfaces/0/ip
* About to connect() to metadata.google.internal port 80 (#0)
* Trying 10.240.0.2...
* Connection refused
* Trying 10.240.0.2...
* Connection refused
* Trying 10.240.0.2...
* Connection refused
* Trying 10.240.0.2...
* Connection refused
* Trying 10.240.0.2...
* Connection refused
* Trying 10.240.0.2...
* Connection refused
* Trying 10.240.0.2...
* Connection refused
* Trying 10.240.0.2...
* Connection refused
* Trying 10.240.0.2...
* Connection refused
* Trying 10.240.0.2...
* Connection refused
* Trying 169.254.169.254...
* connected
* Connected to metadata.google.internal (169.254.169.254) port 80 (#0)
> GET /computeMetadata/v1/instance/network-interfaces/0/ip HTTP/1.1
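One way to avoid the failover would be to test one specific address instead of the hostname, so a refused connection is reported as a failure rather than retried elsewhere. This is only a sketch: it assumes the first address in the trace above is the cache, and it checks only that something accepts TCP connections there, not that the responses are correct.

```python
import socket


def cache_is_listening(address, port=80, timeout=2.0):
    """Return True only if something accepts TCP connections at this address.

    Connecting to a literal IP (not metadata.google.internal) means there
    is no second address to fall back to.
    """
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

The control script could then call something like cache_is_listening("10.240.0.2") and treat False as "cache is down".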
Otherwise, when you restart Jenkins VMs, builds will start failing everywhere.
(Ask how I know.)
Rather than the default service account.
I think each PR builder job leaves a comment saying this. We should not do this, it looks ridiculous...
kubernetes/kubernetes#30384 exposed the lack of a make release job. We need one of those, post-submit if necessary. @fejta @spxtr @ixdy
Per @spxtr we need to run gcloud compute config-ssh before running tests. Otherwise some raw ssh tests will fail the first time we run tests on this node, until we later run gcloud compute ssh to that node to collect logs.
We seem to break the build scripts on OS X not infrequently. It'd be nice to have some sort of CI to detect this before developers do.
Pages should be discoverable through browsing.
/ to /pr
/pr/1345 to /pr/user
/build/$PR_LOGS/... to /pr/user
/pr/user to /pr/123?
Currently links to github directly, but we have a better way to visualize the test results.
As part of kubernetes-retired/contrib#1304, we are trying to get the submit-queue to run against other kubernetes org repositories.
The chart and history file generation would need to be run per submit-queue instance.
++ curl -fsS --retry 3 https://raw.githubusercontent.com/kubernetes/kubernetes/test-infra/jenkins/dockerized-e2e-runner.sh
curl: (22) The requested URL returned error: 404
The correct link should be https://raw.githubusercontent.com/kubernetes/test-infra/master/jenkins/dockerized-e2e-runner.sh
(?)
/cc @k8s-oncall
From kubernetes-e2e-gce-federation logs:
+ local -r bucket=kubernetes-release-dev
++ gsutil cat gs://kubernetes-release-dev/ci/latest.txt
+ build_version=v1.4.0-alpha.0.1035+d30fd0cb0c23ab
+ echo 'Using published version kubernetes-release-dev/v1.4.0-alpha.0.1035+d30fd0cb0c23ab (from ci/latest)'
+ fetch_tars_from_gcs gs://kubernetes-release-dev/ci v1.4.0-alpha.0.1035+d30fd0cb0c23ab
+ local -r gspath=gs://kubernetes-release-dev/ci
+ local -r build_version=v1.4.0-alpha.0.1035+d30fd0cb0c23ab
+ echo 'Pulling binaries from GCS; using server version gs://kubernetes-release-dev/ci/v1.4.0-alpha.0.1035+d30fd0cb0c23ab.'
+ gsutil -mq cp gs://kubernetes-release-dev/ci/v1.4.0-alpha.0.1035+d30fd0cb0c23ab/kubernetes.tar.gz gs://kubernetes-release-dev/ci/v1.4.0-alpha.0.1035+d30fd0cb0c23ab/kubernetes-test.tar.gz .
Using published version kubernetes-release-dev/v1.4.0-alpha.0.1035+d30fd0cb0c23ab (from ci/latest)
Pulling binaries from GCS; using server version gs://kubernetes-release-dev/ci/v1.4.0-alpha.0.1035+d30fd0cb0c23ab.
It has pulled a tarball published by kubernetes-build, not kubernetes-federation-build.
That later causes this error:
FATAL: tagfile /workspace/kubernetes/hack/e2e-internal/../../cluster/../cluster/gce/../../cluster/gce/../../cluster/../federation/manifests/federated-image.tag does not exist. Make sure that you have run build/push-federation-images.sh
I've fixed this error once before ( #146 ) by having kubernetes-federation-build and kubernetes-e2e-gce-federation use an entirely separate ci bucket, so something must have changed since that last PR was merged.
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce/20010/ shows a link up at the top to kubernetes-e2e-gce... unfortunately it links to the storage browser rather than http://testgrid.k8s.io/google-gce
We should make it link to testgrid instead.
I think this would be very useful for investigation/exploration.
http://storage.googleapis.com/kubernetes-test-history/static/index.html
shows 78 tests passed, 0 failed, 0 unstable, 0 broken for this suite:
78 0 kubernetes-e2e-gke-1.2-1.3-upgrade-cluster-new 2 2 0 0
but only 2 tests.
In fact, the whole suite has been failing for weeks. The only thing working has been gcloud installation.
https://k8s-testgrid.appspot.com/google-upgrade#gke-1.2-1.3-upgrade-cluster-new&width=3
The kubernetes tarball is extracted inside the container in dockerized e2e, which gives us kubernetes/cluster/log-dump.sh. On timeout, we try to call log-dump.sh, but do so outside the container, so it's no longer available.
We should probably move the timeout handling inside the dockerized e2e container.
As part of the Jenkins VM rebuild today, some nodes were upgraded to docker 1.11.1, instead of 1.9.1, as we'd been using before.
It seems that this causes problems for docker-in-docker in our kubekins-test image:
Verifying ./hack/../hack/verify-api-reference-docs.sh
Note: This assumes that swagger spec has been updated. Please run hack/update-swagger-spec.sh to ensure that.
Generating api reference docs at /go/src/k8s.io/kubernetes/_output/generated_html
Reading swagger spec from: /var/lib/jenkins/workspace/kubernetes-pull-test-unit-integration@2/api/swagger-spec/
docker: error while loading shared libraries: libltdl.so.7: cannot open shared object file: No such file or directory
!!! Error in ./hack/update-api-reference-docs.sh:71
'docker run ${user_flags} --rm -v "${TMP_IN_HOST}":/output:z -v "${SWAGGER_PATH}":/swagger-source:z gcr.io/google_containers/gen-swagger-docs:v5 "${SWAGGER_JSON_NAME}" "${REGISTER_FILE_URL}"' exited with status 127
Call stack:
1: ./hack/update-api-reference-docs.sh:71 main(...)
Exiting with status 1
!!! Error in ./hack/../hack/verify-api-reference-docs.sh:34
'"./hack/update-api-reference-docs.sh" "${OUTPUT_DIR}"' exited with status 1
Call stack:
1: ./hack/../hack/verify-api-reference-docs.sh:34 main(...)
Exiting with status 1
FAILED ./hack/../hack/verify-api-reference-docs.sh 1s
Motivating example: unit/integration test runs like https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/27600/kubernetes-pull-test-unit-integration/32233 have multiple JUnit files, with associated "verbose output" text files that can help debug test failures.
Rather than searching through each file, it'd be nice to know which one to go to for the verbose output. (Maybe even link directly to that file if it exists? May be getting too specific though.)
Not sure how it even worked to start with.
jenkins/test-history/buckets.json is sort-of the source of truth for which buckets we care about, except that there is also configuration in gubernator/main.py, jenkins/test-history/gen_json.py, the submit queue, and testgrid. (And maybe other places, who knows.)
It'd be nice if we moved the configuration somewhere more prominent (maybe even top-level?) and then got all of our tooling using it.
It should also be well-documented.
(It'd be a good idea to add owners for each of the various builds at that time, too.)
Cause: kubernetes-retired/contrib#1275
For some reason the chart thinks ~60 PRs merged.
I suspect a network blip caused both this anomaly and the submit queue crash. Maybe the poll script needs to record the difference between "initializing" and "unreachable".
Feature request: expand skipped lines in gubernator logs.
E.g.
stderr: fatal: reference is not a tree: e5c3111e8dcb432df435dab96d7a19641adf0562
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1719)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$500(CliGitAPIImpl.java:63)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$9.execute(CliGitAPIImpl.java:1984)
... skipping 9 lines ...
at java.lang.Thread.run(Thread.java:745)
[xUnit] [INFO] - Starting to record.
Make the ... skipping 9 lines ... clickable, and expand the lines in place (e.g. unhide a hidden block).
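A sketch of one way to keep the skipped lines around so they can be expanded later: split the log into shown and hidden segments instead of discarding the hidden ones. The function names, the is_interesting callback, and the context parameter are illustrative and not taken from gubernator.

```python
def segment_log(lines, is_interesting, context=1):
    """Split a log into ('shown', [...]) and ('hidden', [...]) segments."""
    keep = set()
    for i, line in enumerate(lines):
        if is_interesting(line):
            # Keep the matching line plus `context` lines on each side.
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    segments = []
    for i, line in enumerate(lines):
        kind = "shown" if i in keep else "hidden"
        if segments and segments[-1][0] == kind:
            segments[-1][1].append(line)
        else:
            segments.append((kind, [line]))
    return segments


def render_text(segments):
    """Plain-text view; an HTML view could emit hidden lines in a toggled block."""
    out = []
    for kind, block in segments:
        if kind == "shown":
            out.extend(block)
        else:
            out.append("... skipping %d lines ..." % len(block))
    return out
```

Because each hidden segment still holds its lines, the page could render them inside a collapsed element and make the "skipping" marker the toggle.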
I'd like to add a build job to pump out kops builds, so I can start using it for AWS bring-up on Jenkins as well. I recently pushed a PR to that repo to build an easy container for kops (just to avoid figuring out exactly how to package/release it just yet), but then we need to figure out how to push builds somewhere. This isn't hard, but right now gcr.io/google-containers is locked down, so a build job can't actually push there.
So here's a suggested route, putting up an issue since about half this stuff isn't code approvals:
kubekins-image-builder@kubernetes-jenkins.iam.gserviceaccount.com service account.
kubekins-image-builder@kubernetes-jenkins.iam.gserviceaccount.com rights just to push to the gcr.io bucket for the kubernetes-jenkins project itself, i.e. gcr.io/kubernetes-jenkins
kops.
I did consider a couple of alternate routes:
kubekins-image-builder@kubernetes-jenkins.iam.gserviceaccount.com rights to google-containers. Rejected because this gives anyone with https://github.com/kubernetes/test-infra or Jenkins access an easy way to trash a production bucket.
cc @kubernetes/test-infra-maintainers @justinsb
e.g. resource leaks would be handy in a junit file: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce/17595/consoleFull#-167188570730f50738-9bd5-4d49-9dda-6ce3409acae5
Let's address #195 by putting the redirector under source control.
Request: After a cluster has been brought up, record:
and be able to show those at a glance. I suspect a lot of this could be done with log post-processing, but some of it is difficult to find at all.
cc @cjcullen
The Details link for PR e2e tests looks like this:
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/pr-logs/pull/$%7BghprbPullId%7D/kubernetes-pull-build-test-e2e-gce/$%7BBUILD_NUMBER%7D/
It looks like the BUILD_NUMBER Jenkins parameter is not getting substituted in properly.
Reference PR: kubernetes/kubernetes#26754
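For anyone squinting at that link: %7B and %7D are the percent-encoded braces, so the URL contains the literal, unexpanded placeholders. The sketch below decodes it and shows the expansion that was presumably intended; the fill helper is hypothetical and only illustrates the substitution, not how Jenkins performs it.

```python
from string import Template
from urllib.parse import unquote

broken = ("https://console.cloud.google.com/storage/browser/kubernetes-jenkins/"
          "pr-logs/pull/$%7BghprbPullId%7D/kubernetes-pull-build-test-e2e-gce/"
          "$%7BBUILD_NUMBER%7D/")

# Percent-decoding recovers the ${...} placeholders that never got expanded.
template = unquote(broken)


def fill(template_text, pull_id, build_number):
    """Expand the placeholders the way the job was presumably meant to."""
    return Template(template_text).substitute(
        ghprbPullId=pull_id, BUILD_NUMBER=build_number)
```

Decoding shows that ghprbPullId is left unexpanded in this link as well, not only BUILD_NUMBER.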
I believe these have some special vms, etc. Let's move them to their own project.