k8s-topgun has been failing since Friday, September 27th. No Concourse code changed between the last passing build and the builds that followed it, all of which have failed.
GKE cluster topgun runs on: cluster-1
There are two types of errors we're seeing.
1. Port forwarding fails
For a variety of tests we keep seeing port-forwarding fail.
First example, where the pod is not found:
STEP: Creating the web proxy
STEP: [1569964314.490616322] running: kubectl port-forward --namespace=topgun-swi-80011428 service/topgun-swi-80011428-web :8080
error: error upgrading connection: unable to upgrade connection: pod not found ("topgun-swi-80011428-web-7d4c8fb55b-6kk56_topgun-swi-80011428")
STEP: [1569964314.650369883] running: helm delete --purge topgun-swi-80011428
release "topgun-swi-80011428" deleted
STEP: [1569964315.270037413] running: kubectl delete namespace topgun-swi-80011428 --wait=false
namespace "topgun-swi-80011428" deleted
• Failure [75.169 seconds]
Scaling web instances
/tmp/build/7f688b19/concourse/topgun/k8s/web_scaling_test.go:12
succeeds [It]
/tmp/build/7f688b19/concourse/topgun/k8s/web_scaling_test.go:22
No future change is possible. Bailing out early after 0.155s.
Got stuck at:
Waiting for:
Forwarding
Second example, broken pipe:
running wget -O- topgun-dp-10084543-web:8080/api/v1/info
Handling connection for 45551
initializing
failed
Handling connection for 45551
E1001 21:07:06.913259 7232 portforward.go:372] error copying from remote stream to local connection: readfrom tcp4 127.0.0.1:45551->127.0.0.1:51636: write tcp4 127.0.0.1:45551->127.0.0.1:51636: write: broken pipe
STEP: [1569964026.936306000] running: helm delete --purge topgun-dp-10084543
release "topgun-dp-10084543" deleted
STEP: [1569964027.552978992] running: kubectl delete namespace topgun-dp-10084543 --wait=false
namespace "topgun-dp-10084543" deleted
• Failure [75.527 seconds]
DNS Resolution
/tmp/build/7f688b19/concourse/topgun/k8s/dns_proxy_test.go:14
different proxy settings
/tmp/build/7f688b19/gopath/pkg/mod/github.com/onsi/[email protected]/extensions/table/table.go:92
Proxy Disabled, with short service name [It]
/tmp/build/7f688b19/gopath/pkg/mod/github.com/onsi/[email protected]/extensions/table/table_entry.go:46
Expected
<int>: 1
to be zero-valued
Third example, where port-forwarding breaks down in the middle of the test:
STEP: [1569963914.924532890] running: /tmp/gexec_artifacts279253227/g904034638/fly -t concourse-topgun-k8s-4 execute -c tasks/dns-proxy-task.yml -v url=topgun-dp-11517743-web.topgun-dp-11517743.svc.cluster.local:8080/api/v1/info
Handling connection for 38901
uploading topgun done.8KiB/s))
executing build 1 at http://127.0.0.1:38901/builds/1
initializing
fetching busybox@sha256:dd97a3fe6d721c5cf03abac0f50e2848dc583f7c4e41bf39102ceb42edfd1808
7c9d20b9b6cd [======================================] 742.9KiB/742.9KiB
running wget -O- topgun-dp-11517743-web.topgun-dp-11517743.svc.cluster.local:8080/api/v1/info
Connecting to topgun-dp-11517743-web.topgun-dp-11517743.svc.cluster.local:8080 (10.11.241.215:8080)
wget: can't connect to remote host (10.11.241.215): Connection refused
failed
Handling connection for 38901
STEP: [1569963925.802273273] running: helm delete --purge topgun-dp-11517743
release "topgun-dp-11517743" deleted
STEP: [1569963926.385312796] running: kubectl delete namespace topgun-dp-11517743 --wait=false
namespace "topgun-dp-11517743" deleted
• Failure [61.513 seconds]
DNS Resolution
/tmp/build/7f688b19/concourse/topgun/k8s/dns_proxy_test.go:14
different proxy settings
/tmp/build/7f688b19/gopath/pkg/mod/github.com/onsi/[email protected]/extensions/table/table.go:92
Proxy Enabled, with full service name [It]
/tmp/build/7f688b19/gopath/pkg/mod/github.com/onsi/[email protected]/extensions/table/table_entry.go:46
Expected
<int>: 1
to be zero-valued
/tmp/build/7f688b19/concourse/topgun/k8s/dns_proxy_test.go:85
Fourth example, failing in the middle of a test:
STEP: [1569920401.788959742] running: /tmp/gexec_artifacts116775232/g015069087/fly -t concourse-topgun-k8s-4 login -c http://127.0.0.1:39607 -u test -p test
logging in to team 'main'
Handling connection for 39607
Handling connection for 39607
target saved
STEP: [1569920401.954967737] running: /tmp/gexec_artifacts116775232/g015069087/fly -t concourse-topgun-k8s-4 workers --json
Handling connection for 39607
[]
STEP: [1569920411.992353678] running: /tmp/gexec_artifacts116775232/g015069087/fly -t concourse-topgun-k8s-4 workers --json
Handling connection for 39607
E1001 09:00:12.060585 10831 portforward.go:400] an error occurred forwarding 39607 -> 8080: error forwarding port 8080 to pod 2102f8dd698b355c7f0cc17193d227187229e3af46245ee35dae98bd3a77c0f4, uid : exit status 1: 2019/10/01 09:00:12 socat[174757] E connect(5, AF=2 127.0.0.1:8080, 16): Connection refused
could not reach the Concourse server called concourse-topgun-k8s-4:
Get http://127.0.0.1:39607/api/v1/info: EOF
is the targeted Concourse running? better go catch it lol
STEP: [1569920412.065531969] running: helm delete --purge topgun-bd-overlay-55649115
release "topgun-bd-overlay-55649115" deleted
STEP: [1569920412.527161837] running: kubectl delete namespace topgun-bd-overlay-55649115 --wait=false
namespace "topgun-bd-overlay-55649115" deleted
• Failure [42.312 seconds]
baggageclaim drivers
/tmp/build/7f688b19/concourse/topgun/k8s/baggageclaim_drivers_test.go:11
GKE
/tmp/build/7f688b19/concourse/topgun/k8s/k8s_suite_test.go:302
ubuntu image
/tmp/build/7f688b19/concourse/topgun/k8s/baggageclaim_drivers_test.go:36
overlay
/tmp/build/7f688b19/concourse/topgun/k8s/baggageclaim_drivers_test.go:46
works [It]
/tmp/build/7f688b19/concourse/topgun/k8s/baggageclaim_drivers_test.go:47
Expected
<int>: 1
to be zero-valued
/tmp/build/7f688b19/concourse/topgun/fly.go:100
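To see how often each of these failure modes shows up across builds, the log output can be scanned for the error strings quoted in the four examples above. The following is only a rough sketch: the regular expressions are copied from those excerpts, while the category names (pod-not-found, broken-pipe, forward-refused, in-cluster-refused) and the function names are hypothetical labels made up for this illustration, not anything topgun or kubectl defines.

```python
import re
from collections import Counter

# Error signatures taken from the log excerpts above.
# The category names on the left are hypothetical labels for this sketch.
SIGNATURES = [
    ("pod-not-found", re.compile(r"unable to upgrade connection: pod not found")),
    ("broken-pipe", re.compile(r"portforward\.go:\d+\].*broken pipe")),
    ("forward-refused", re.compile(r"portforward\.go:\d+\].*Connection refused")),
    ("in-cluster-refused", re.compile(r"wget: can't connect to remote host .*Connection refused")),
]


def classify_line(line):
    """Return the first matching category name for a log line, or None."""
    for name, pattern in SIGNATURES:
        if pattern.search(line):
            return name
    return None


def tally_failures(log_text):
    """Count how many lines of a build log match each known signature."""
    counts = Counter()
    for line in log_text.splitlines():
        name = classify_line(line)
        if name is not None:
            counts[name] += 1
    return counts
```

Running tally_failures over the saved output of each failed build would give a per-build count, which might help tell whether one failure mode dominates. Lines that match none of the signatures are ignored, so any new error text would have to be added to the list by hand.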
2. Baggageclaim driver test fails
This one shows up in most of the failing builds, but not all of them. When this specific test in topgun fails, it's always with this error, never the port-forwarding error. It happens for both the cos and ubuntu node selectors.
Doesn't happen in:
The test is
• Failure [302.169 seconds]
baggageclaim drivers
/tmp/build/7f688b19/concourse/topgun/k8s/baggageclaim_drivers_test.go:11
GKE
/tmp/build/7f688b19/concourse/topgun/k8s/k8s_suite_test.go:302
cos image
/tmp/build/7f688b19/concourse/topgun/k8s/baggageclaim_drivers_test.go:30
btrfs
/tmp/build/7f688b19/concourse/topgun/k8s/baggageclaim_drivers_test.go:72
fails [It]
/tmp/build/7f688b19/concourse/topgun/k8s/baggageclaim_drivers_test.go:73
Expected
<int>: 1
to equal
<int>: 0
/tmp/build/7f688b19/concourse/topgun/k8s/k8s_suite_test.go:171
You can reproduce the error with this helm install:
helm upgrade --install --force --wait \
  --namespace topgun-btrfs-fails \
  --set=web.livenessProbe.failureThreshold=3 \
  --set=web.livenessProbe.initialDelaySeconds=3 \
  --set=web.livenessProbe.periodSeconds=3 \
  --set=web.livenessProbe.timeoutSeconds=3 \
  --set=concourse.web.kubernetes.keepNamespaces=false \
  --set=postgresql.persistence.enabled=false \
  --set=image=concourse/concourse-rc \
  --set=imageTag=latest \
  --set=imageDigest=sha256:b61fbd930eefab9300744ca83e05a2f7bcb954ec341c996a7009014a50c8a374 \
  --set=concourse.web.kubernetes.enabled=false \
  --set=concourse.worker.baggageclaim.driver=btrfs \
  --set=worker.replicas=1 \
  --set=worker.nodeSelector.nodeImage=cos \
  topgun-btrfs-fails /Users/pivotal/workspace/charts/stable/concourse
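The worker pod fails to come up and goes into a crash loop because baggageclaim cannot set up the btrfs volume. The image file is already formatted as btrfs, but mounting it fails with "unknown filesystem type 'btrfs'". kubectl logs shows the following: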
{"timestamp":"2019-10-01T21:15:25.924431822Z","level":"error","source":"baggageclaim","message":"baggageclaim.fs.run-command.failed","data":{"args":["bash","-e","-x","-c","\n\t\tif [ ! -e $IMAGE_PATH ] || [ \"$(stat --printf=\"%s\" $IMAGE_PATH)\" != \"$SIZE_IN_BYTES\" ]; then\n\t\t\ttouch $IMAGE_PATH\n\t\t\ttruncate -s ${SIZE_IN_BYTES} $IMAGE_PATH\n\t\tfi\n\n\t\tlo=\"$(losetup -j $IMAGE_PATH | cut -d':' -f1)\"\n\t\tif [ -z \"$lo\" ]; then\n\t\t\tlo=\"$(losetup -f --show $IMAGE_PATH)\"\n\t\tfi\n\n\t\tif ! file $IMAGE_PATH | grep BTRFS; then\n\t\t\tmkfs.btrfs --nodiscard $IMAGE_PATH\n\t\tfi\n\n\t\tmkdir -p $MOUNT_PATH\n\n\t\tif ! mountpoint -q $MOUNT_PATH; then\n\t\t\tmount -t btrfs $lo $MOUNT_PATH\n\t\tfi\n\t"],"command":"/bin/bash","env":["PATH=/usr/local/concourse/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","MOUNT_PATH=/concourse-work-dir/volumes","IMAGE_PATH=/concourse-work-dir/volumes.img","SIZE_IN_BYTES=10266165248"],"error":"exit status 32","session":"3.1","stderr":"+ '[' '!' -e /concourse-work-dir/volumes.img ']'\n++ stat --printf=%s /concourse-work-dir/volumes.img\n+ '[' 10266165248 '!=' 10266165248 ']'\n++ losetup -j /concourse-work-dir/volumes.img\n++ cut -d: -f1\n+ lo=/dev/loop4\n+ '[' -z /dev/loop4 ']'\n+ file /concourse-work-dir/volumes.img\n+ grep BTRFS\n+ mkdir -p /concourse-work-dir/volumes\n+ mountpoint -q /concourse-work-dir/volumes\n+ mount -t btrfs /dev/loop4 /concourse-work-dir/volumes\nmount: /concourse-work-dir/volumes: unknown filesystem type 'btrfs'.\n","stdout":"/concourse-work-dir/volumes.img: BTRFS Filesystem sectorsize 4096, nodesize 16384, leafsize 16384, UUID=226829f7-4ca4-41ed-8377-794b6095ef94, 114688/10266165248 bytes used, 1 devices\n"}}
{"timestamp":"2019-10-01T21:15:25.925165479Z","level":"error","source":"baggageclaim","message":"baggageclaim.failed-to-set-up-driver","data":{"error":"failed to create btrfs filesystem: exit status 32"}}
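The interesting part of the first log line is buried at the end of the stderr field, after the bash -x trace. As a rough sketch of how to pull it out, the snippet below decodes one JSON log line and keeps only the stderr lines that are not trace output. The field names (message, data.error, data.stderr) come from the log lines above. The function name and SAMPLE_LINE are made up for this illustration, and SAMPLE_LINE is a trimmed copy of the first log line with args, env, stdout and most of the trace omitted.

```python
import json

# Trimmed copy of the first baggageclaim log line above; args, env, stdout
# and most of the `bash -x` trace are omitted for brevity.
SAMPLE_LINE = r"""{"timestamp":"2019-10-01T21:15:25.924431822Z","level":"error","source":"baggageclaim","message":"baggageclaim.fs.run-command.failed","data":{"error":"exit status 32","session":"3.1","stderr":"+ mountpoint -q /concourse-work-dir/volumes\n+ mount -t btrfs /dev/loop4 /concourse-work-dir/volumes\nmount: /concourse-work-dir/volumes: unknown filesystem type 'btrfs'.\n"}}"""


def summarize_error(log_line):
    """Decode one JSON log line and separate real stderr output from the trace."""
    entry = json.loads(log_line)
    data = entry.get("data", {})
    stderr_lines = [l for l in data.get("stderr", "").splitlines() if l.strip()]
    # Lines starting with '+' are the `bash -x` trace; the rest is command output.
    output = [l for l in stderr_lines if not l.startswith("+")]
    return {
        "message": entry.get("message"),
        "error": data.get("error"),
        "stderr": output,
    }


summary = summarize_error(SAMPLE_LINE)
for line in summary["stderr"]:
    print(line)
```

On the full log line this should leave just the mount error, which points at the node not recognizing the btrfs filesystem type, not at the image file or the loop device setup.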