kubernetes-sigs / apiserver-network-proxy
License: Apache License 2.0
I'm constantly seeing this message (from the proxy server logs when using http-connect), even when running the proxy server/agent locally:
Received error on connection read unix konnectivity-server.socket->@: use of closed network connection.
It doesn't have any effect on the proxy/data transfer as far as the client is concerned, but it suggests there is some cleanup we may not be doing.
/cc @cheftako
/cc @caesarxuchao
It seems to occur on almost every request.
To repro, start 4 processes in the terminal:
python -m SimpleHTTPServer 8001
./bin/proxy-server --uds-name=konnectivity-server.socket --mode=http-connect --cluster-cert=certs/agent/issued/proxy-master.crt --cluster-key=certs/agent/private/proxy-master.key --server-port=0
./bin/proxy-agent --ca-cert=certs/agent/issued/ca.crt --agent-cert=certs/agent/issued/proxy-agent.crt --agent-key=certs/agent/private/proxy-agent.key
./bin/proxy-test-client --proxy-uds=konnectivity-server.socket --proxy-host= --proxy-port=0 --mode=http-connect --request-port=8001 --request-host=localhost
The proxy should be able to close connections properly, whether the close is expected or unexpected.
Currently, the agent-to-server communication is based on gRPC streams. There are cases where an agent's egress has to go through an egress proxy, and in many cases the egress proxy does not support the gRPC protocol directly.
Is there a recommended way to solve this problem? For example, has anyone tried tunneling the gRPC HTTP/2 traffic through an HTTP CONNECT based proxy? Is that a supported model?
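For illustration only (not a supported configuration of this repo), here is a minimal sketch of the HTTP CONNECT idea in Go, using made-up addresses: give gRPC a custom dialer that establishes the CONNECT tunnel first, then lets the HTTP/2 traffic flow over the raw connection.

package main

import (
	"bufio"
	"context"
	"fmt"
	"net"
	"net/http"

	"google.golang.org/grpc"
)

// connectDialer returns a gRPC dialer that first opens an HTTP CONNECT
// tunnel through proxyAddr, then hands the raw connection to gRPC.
func connectDialer(proxyAddr string) func(context.Context, string) (net.Conn, error) {
	return func(ctx context.Context, target string) (net.Conn, error) {
		var d net.Dialer
		conn, err := d.DialContext(ctx, "tcp", proxyAddr)
		if err != nil {
			return nil, err
		}
		// Ask the egress proxy to open a raw tunnel to the gRPC server.
		fmt.Fprintf(conn, "CONNECT %s HTTP/1.1\r\nHost: %s\r\n\r\n", target, target)
		resp, err := http.ReadResponse(bufio.NewReader(conn), nil)
		if err != nil {
			conn.Close()
			return nil, err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			conn.Close()
			return nil, fmt.Errorf("proxy CONNECT failed: %s", resp.Status)
		}
		// From here on, gRPC's HTTP/2 (and TLS) runs over the tunnel.
		return conn, nil
	}
}

func main() {
	// "egress-proxy:3128" and "proxy-server:8091" are made-up addresses.
	conn, err := grpc.Dial("proxy-server:8091",
		grpc.WithInsecure(), // a real deployment would use TLS credentials
		grpc.WithContextDialer(connectDialer("egress-proxy:3128")))
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}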
For example, trace logs could be emitted at klog.V(4).
One of our goals here is to allow SSH tunnels to be removed from the KAS. If we support SSH tunnels here, it would allow a smoother migration path for existing SSH tunnel users.
We use a mutex to protect the backend stream in proxy server.
Same in the proxy agent.
Need to investigate whether there is a more efficient implementation with channels. Note that both cases are N producers and 1 consumer. It's tricky to stop the pipeline, because (1) a channel should be closed by the producer, not the consumer, and (2) on the other hand, it's difficult to let one of the N producers close the channel.
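For reference, a stripped-down version of the current mutex pattern (stream and packet types reduced to stubs): gRPC streams are not safe for concurrent Send calls, so the N producers serialize through a mutex.

package server

import "sync"

// Packet and stream are stand-ins for the generated proto/agent types.
type Packet struct{}

type stream interface {
	Send(*Packet) error
}

// backend serializes the N producers onto one gRPC stream: gRPC forbids
// concurrent Send calls on the same stream, hence the mutex.
type backend struct {
	mu sync.Mutex
	s  stream
}

func (b *backend) Send(p *Packet) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.s.Send(p)
}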
We have many versions of some modules in our go.sum. This file should be trimmed down; since this is a very early repository, we should ideally have only one version of each module.
The failure message emitted by the kube-apiserver is: "error dialing backend: EOF".
With instrumentation, it shows it failed at this line: https://github.com/kubernetes/kubernetes/blob/b5b675491b69b5d48bf112a896bc739e500c7275/staging/src/k8s.io/apimachinery/pkg/util/proxy/dial.go#L85
The TLS handshake received the "EOF" error.
At this line, the tunnel to the kubelet has already been established, and the kube-apiserver is trying to do the TLS handshake with the kubelet over the tunnel; that handshake fails.
This doesn't happen if the proxy runs in the gRPC mode.
I don't know if it's related to #80.
The Kubernetes standard for command-line flags is kebab-case, not camelCase. As such, the agent, client, and proxy should all be switched to use kebab-case.
When an agent disconnects #125 closes all client side connections that use the corresponding agent. However, PendingDial requests may still be in flight and have not been added to the list of clients yet. We should either fail them or retry with a different agent instead of letting the client hit its dial timeout.
Original context from @cheftako:
Most of the time I would expect pending dial to be empty. However, if there is something in there, there is a chance its request went out via this backend. If so, we will never get the response, and that also needs to be dealt with.
The issue is that we do not record in the pending data structure which backend it used, so we cannot tell whether anything on the pending list would be affected by a given backend breaking. We also need to work out how to deal with it. One option would be to just fail, which is probably easiest. However, as the connection has not yet been established, we should be able to switch to a different backend.
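One possible shape for the fix, with hypothetical names: record the backend's agent ID in each pending-dial entry, so that when a backend breaks we can immediately fail (or re-dial) everything that was in flight on it.

package server

import (
	"errors"
	"sync"
)

var errBackendClosed = errors.New("backend closed before dial completed")

// pendingDial is hypothetical; the real structure does not yet record
// which backend the DIAL_REQ went out on.
type pendingDial struct {
	agentID string     // backend carrying the in-flight DIAL_REQ
	errCh   chan error // delivers the dial result to the frontend
}

type pendingDials struct {
	mu   sync.Mutex
	byID map[int64]*pendingDial
}

// onBackendClosed fails every dial that was in flight on the broken
// backend instead of letting the client hit its dial timeout. This is
// also the point where we could re-send the DIAL_REQ via another backend.
func (p *pendingDials) onBackendClosed(agentID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for id, pd := range p.byID {
		if pd.agentID == agentID {
			pd.errCh <- errBackendClosed
			delete(p.byID, id)
		}
	}
}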
Fedora Rawhide, with latest stable Golang:
Testing in: /builddir/build/BUILD/apiserver-network-proxy-0.0.10/_build/src
PATH: /builddir/build/BUILD/apiserver-network-proxy-0.0.10/_build/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/sbin
GOPATH: /builddir/build/BUILD/apiserver-network-proxy-0.0.10/_build:/usr/share/gocode
GO111MODULE: off
command: go test -buildmode pie -compiler gc -ldflags "-X sigs.k8s.io/apiserver-network-proxy/version=0.0.10 -extldflags '-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld '"
testing: sigs.k8s.io/apiserver-network-proxy
sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client
# sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client [sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client.test]
./client_test.go:43:17: not enough arguments in call to tunnel.serve
have ()
want (*grpc.ClientConn)
./client_test.go:73:17: not enough arguments in call to tunnel.serve
have ()
want (*grpc.ClientConn)
./client_test.go:130:17: not enough arguments in call to tunnel.serve
have ()
want (*grpc.ClientConn)
FAIL sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client [build failed]
E.g., divide it into frontends and backends, and maybe factor out a common connection manager for the backend/frontend managers.
Now that the client is tagged, should the root go.mod switch to using that version?
What Happened
I was testing out service account authentication with authentication-audience, and for some reason it was not working (the connection was closed on the client side). I was using the following flags:
- --cluster-ca-cert=ca.crt
- --cluster-cert=konnectivity-server.crt
- --cluster-key=konnectivity-server.key
- --agent-namespace=namespace
- --agent-service-account=konnectivity-agent
- --kubeconfig=kubeconfig
- --authentication-audience=system:konnectivity-server
- --mode=http-connect
As I was switching between SA auth mode and certificates, the problem was that the --cluster-ca-cert flag was set at the same time as the SA auth flags (authentication-audience, agent-namespace, ...), which probably led to the konnectivity server running in certificate auth mode.
What I expect
If I set agent-service-account, then cluster cert validation should not be used, or at the least, flag validation should fail with a clear error telling me to set the right flags.
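A check along these lines (hypothetical function; flag names from above) would have caught my mistake at startup:

package options

import "errors"

// validateAuthFlags is hypothetical: certificate auth and service account
// auth are mutually exclusive, so fail fast when both sets of flags are set.
func validateAuthFlags(clusterCACert, agentServiceAccount string) error {
	if clusterCACert != "" && agentServiceAccount != "" {
		return errors.New("--cluster-ca-cert cannot be combined with --agent-service-account; choose certificate auth or service account auth")
	}
	return nil
}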
Make sure we handle cases like the proxy-agent coming up before the proxy-server. The agent should also reconnect if the connection is broken. We should have tests for this.
We split the proxy and client code. A README should be added documenting how to use the client in other modules.
This includes the generated proto files in proto/ and the generated mocks in proto/agent/mock/.
k/k has similar scripts for this purpose, e.g., verify-codegen.sh
This was observed with kubectl cp on a large binary file; I attempted to run kubectl cp on a 57M file.
To reproduce, start a kubernetes cluster with network proxy enabled
KUBE_ENABLE_EGRESS_VIA_KONNECTIVITY_SERVICE=true ./cluster/kube-up.sh
SSH onto the master and try copying the kubectl binary to a random pod, e.g.:
kubectl apply -f https://k8s.io/examples/application/shell-demo.yaml
kubectl cp $(which kubectl) shell-demo:kubectl
The konnectivity-server.log file shows data being transferred, but the transfer stops partway through and hangs.
EDIT: This can be reproduced without kubernetes
make build && make gen && make certs
cp $(which kubectl) .
HTTP Connect mode fails:
./bin/proxy-server --uds-name=konnectivity-server.socket --mode=http-connect --cluster-cert=certs/agent/issued/proxy-master.crt --cluster-key=certs/agent/private/proxy-master.key --server-port=0
./bin/proxy-agent --ca-cert=certs/agent/issued/ca.crt --agent-cert=certs/agent/issued/proxy-agent.crt --agent-key=certs/agent/private/proxy-agent.key
python -m SimpleHTTPServer 8001
./bin/proxy-test-client --proxy-uds=konnectivity-server.socket --proxy-host= --proxy-port=0 --mode=http-connect --request-port=8001 --request-host=localhost --request-path=/kubectl
GRPC mode also fails
./bin/proxy-server --uds-name=konnectivity-server.socket --mode=grpc --cluster-cert=certs/agent/issued/proxy-master.crt --cluster-key=certs/agent/private/proxy-master.key --server-port=0
./bin/proxy-agent --ca-cert=certs/agent/issued/ca.crt --agent-cert=certs/agent/issued/proxy-agent.crt --agent-key=certs/agent/private/proxy-agent.key
python -m SimpleHTTPServer 8001
./bin/proxy-test-client --proxy-uds=konnectivity-server.socket --mode=grpc --proxy-host= --proxy-port=0 --request-port=8001 --request-host=localhost --request-path=/kubectl
Please see #1 (comment)
We should not need to run 3 proxy agents on each node for an HA cluster. However, we do want each proxy agent to connect to each of the proxy servers.
When running in gRPC mode, client -> proxy UDS connections are not closed after the (client -> proxy server -> agent -> destination) connection is closed. After a CLOSE_RSP, it seems that we only remove the frontends, but never call Close on the underlying gRPC stream between the client and proxy. This causes resource leaks.
Should be an easy fix though.
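Something like the following, with made-up names for the server's frontend bookkeeping, is roughly the shape of the fix:

package server

import (
	"net"
	"sync"
)

type frontends struct {
	mu    sync.Mutex
	conns map[int64]net.Conn // client -> proxy UDS connections
}

// onCloseRsp removes the frontend for connID and, crucially, closes the
// underlying connection so the UDS stream does not leak.
func (f *frontends) onCloseRsp(connID int64) {
	f.mu.Lock()
	conn, ok := f.conns[connID]
	delete(f.conns, connID)
	f.mu.Unlock()
	if ok {
		conn.Close() // the step that is currently missing
	}
}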
After just a couple of operations on the master, the number of opened streams can get quite high. This does not happen in http-connect mode.
jying@kubernetes-master ~ $ netstat | grep konnectivity-server.socket | wc -l
204
/cc @caesarxuchao
make test currently only recursively executes tests in the network proxy module. We should fix it to execute tests in the konnectivity-client submodule as well.
In the proposal for this repo, this sentence is present under "Non-Goals":
(The proxy can be extended to do this. However that is left to the User if they want that behavior)
Any suggestions as to how to extend this to get this behavior?
Thank you
Currently, we create a new grpcTunnel for every client connection to the proxy server (from the same client). This has performance implications, since we could have many concurrent connections all using different tunnels, with each tunnel creating only one stream to the proxy server. We should investigate reusing a single tunnel and multiplexing all streams through it instead of creating a new tunnel for each connection.
/cc @caesarxuchao
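A rough sketch of what reuse could look like, assuming a hypothetical multi-use tunnel constructor (today's tunnel is single-use, so this is not how the current API behaves): cache one tunnel per proxy-server address and multiplex all connections through it.

package client

import (
	"context"
	"net"
	"sync"
)

// Tunnel is a stand-in for the konnectivity-client tunnel interface.
type Tunnel interface {
	DialContext(ctx context.Context, protocol, address string) (net.Conn, error)
}

var (
	mu      sync.Mutex
	tunnels = map[string]Tunnel{} // one shared tunnel per proxy server
)

// sharedTunnel returns the cached tunnel for addr, creating it on first
// use. newTunnel is a hypothetical constructor for a multi-use tunnel.
func sharedTunnel(ctx context.Context, addr string, newTunnel func(context.Context, string) (Tunnel, error)) (Tunnel, error) {
	mu.Lock()
	defer mu.Unlock()
	if t, ok := tunnels[addr]; ok {
		return t, nil
	}
	t, err := newTunnel(ctx, addr)
	if err != nil {
		return nil, err
	}
	tunnels[addr] = t
	return t, nil
}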
https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/apiserver-network-proxy-push-images
Requesting access to cloudbuild logs to debug further (kubernetes/k8s.io#1056)
/cc @cheftako
Currently the proxy-server attempts to forward all connection requests from the client to the proxy-agent. It would be useful to allow the proxy server to have a setting where it puts the traffic on a local ethernet connection directly. This would allow us to firewall off the KAS so it could ONLY connect to the proxy-server(s). The relevant proxy-server could then place traffic locally for things like connecting to the etcd server.
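A minimal sketch of the proposed setting, with hypothetical names (viaAgent stands in for the existing agent-forwarding path):

package server

import "net"

// dialBackend is a hypothetical sketch of the proposed setting: when
// localEgress is enabled, place the traffic on the local network directly
// (e.g., for the KAS -> etcd path) instead of forwarding to an agent.
func dialBackend(localEgress bool, address string, viaAgent func(string) (net.Conn, error)) (net.Conn, error) {
	if localEgress {
		return net.Dial("tcp", address)
	}
	return viaAgent(address) // existing agent-forwarding behavior
}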
K8s has transitioned from dep to go modules. It would be good if we did as well.
Currently the proxy server will send all connection requests to a single agent. While all requests for an established connection should go through the same agent on which it was established, connection requests should be balanced across available agents.
Currently we have no process for generating Open Source official builds.
We have promoted gcr.io/google-containers but these are not general OSS official builds.
It would be nice to support a multi arch build.
Could get inspiration from kubernetes/node-problem-detector#336
When the proxy server restarts, if the UDS file already exists, trying to listen on the socket fails with "bind: address already in use". The server should delete the file if it exists and then listen on the socket.
cc @Jefftree
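A minimal sketch of the proposed behavior (the helper name is made up):

package server

import (
	"net"
	"os"
)

// listenUDS removes a stale socket file left over from a previous run
// before binding; "not exist" is ignored, any other error is surfaced.
func listenUDS(udsName string) (net.Listener, error) {
	if err := os.Remove(udsName); err != nil && !os.IsNotExist(err) {
		return nil, err
	}
	return net.Listen("unix", udsName)
}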
2020/03/26 16:16:41 http: panic serving @: runtime error: invalid memory address or nil pointer dereference
goroutine 1266 [running]:
net/http.(*conn).serve.func1(0xc0003420a0)
/usr/local/go/src/net/http/server.go:1767 +0x139
panic(0x133c6a0, 0x206ac00)
/usr/local/go/src/runtime/panic.go:679 +0x1b2
sigs.k8s.io/apiserver-network-proxy/pkg/server.(*Tunnel).ServeHTTP(0xc00000e800, 0x16be0c0, 0xc0004a81c0, 0xc0003f8100)
/go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/tunnel.go:81 +0x663
net/http.serverHandler.ServeHTTP(0xc0001c01c0, 0x16be0c0, 0xc0004a81c0, 0xc0003f8100)
/usr/local/go/src/net/http/server.go:2802 +0xa4
net/http.(*conn).serve(0xc0003420a0, 0x16c1f40, 0xc0000a8480)
/usr/local/go/src/net/http/server.go:1890 +0x875
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2927 +0x38e
Our release process is a bit different from normal since we have a multi-module Go repository. We should document the release procedure for a new version (tags) as well as the release image pipeline to k8s.gcr.io.
/cc @caesarxuchao
We have this pattern in many places in the repo:
apiserver-network-proxy/cmd/proxy/main.go
Lines 135 to 137 in 8503646
We shouldn't ignore other kinds of error.
Walter mentioned that maybe the code is like this because os.Stat returns an error if the file is a symlink; I checked, and that is not the case on my Ubuntu machine. Anyway, even if there is a kind of error that we want to ignore, we should whitelist it.
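A sketch of what the whitelisting could look like, with a hypothetical helper:

package util

import (
	"fmt"
	"os"
)

// statIfExists whitelists only the "does not exist" error instead of
// swallowing every error os.Stat can return.
func statIfExists(path string) (os.FileInfo, error) {
	fi, err := os.Stat(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil // an absent file is expected; anything else is not
		}
		return nil, fmt.Errorf("stat %s: %v", path, err)
	}
	return fi, nil
}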
The above tests only run on the master branch, so we only get failure notifications after a PR has merged. If a PR could cause the build-and-push procedure to fail, we should catch that before it is merged. I think we should at least run docker build on all PRs as part of the CI process.
/cc @Sh4d1
In case of a dial failure, the agent sets the error:
But the proxy server never checks for the error:
I haven't thought through what the outcome would be.
There seems to be an issue with the tag konnectivity-client/v0.0.5. In fact, the _GIT_TAG given to the prow job building the image is v20200218-konnectivity-client/v0.0.5-3-gef0d890, which fails the docker build since it's not a valid tag.
If we want to name a tag with the component name as well, I guess it should be done in a docker-tag-friendly way like konnectivity-client-v0.0.5.
/cc @Jefftree
Although it is documented that singleUseGRPCTunnels should only be used once, it is technically still possible to call the Dial function multiple times. This causes certain race conditions and goroutine leaks.
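One possible mitigation, sketched with hypothetical names: guard Dial with an atomic flag so a second call fails fast instead of racing.

package client

import (
	"errors"
	"net"
	"sync/atomic"
)

// guardedTunnel wraps the single-use tunnel's dial logic; a second Dial
// fails fast instead of racing with the first.
type guardedTunnel struct {
	used int32 // flipped atomically by the first Dial
	dial func(protocol, address string) (net.Conn, error)
}

func (t *guardedTunnel) Dial(protocol, address string) (net.Conn, error) {
	if !atomic.CompareAndSwapInt32(&t.used, 0, 1) {
		return nil, errors.New("tunnel already used; create a new one")
	}
	return t.dial(protocol, address)
}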
We are not running the proxy server in hostNetwork, so we can't have a liveness probe, since the admin port only listens on localhost.
Create a Dockerfile and a Makefile target to build the Docker image.
Currently our CI only supports the grpc mode. We need to figure out a way to continuously test both modes.
agent.pb.go, generated by protoc -I . proto/agent/agent.proto --go_out=plugins=grpc:${GOPATH}/src, is placed at $GOPATH/src/sigs.k8s.io/apiserver-network-proxy/proto/agent. But cat hack/go-license-header.txt proto/agent/agent.pb.go > proto/agent/agent.licensed.go tries to read it from the current directory, which doesn't work outside the $GOPATH.
make gen requires golang/mock, which is not mentioned in the README.
I noticed there are many such warnings in the log. It's printed here:
The connection is closed here, after receiving EOF from the client connection:
It seems this usually was a premature close: there was data the backend still wanted to send to the client, hence the many "DATA send to client stream error" warnings.
I would expect this to cause major problems, but it didn't. Why?
When a client doesn't send a CloseRequest to the proxy server, due to reasons like a reset tunnel or a timeout, the proxy server should still recognize this and close the corresponding remote connection.
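One way to do that, sketched with made-up names: track last activity per connection and periodically reap connections that have gone idle, closing the corresponding remote connection.

package server

import (
	"sync"
	"time"
)

// connState tracks per-connection activity so the server can reap
// connections whose client vanished without sending a CloseRequest.
type connState struct {
	lastActivity time.Time
	closeRemote  func() // closes the corresponding remote connection
}

type reaper struct {
	mu    sync.Mutex
	conns map[int64]*connState
}

// reap closes remote connections that have been idle longer than timeout;
// it would be run periodically from a goroutine.
func (r *reaper) reap(timeout time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, c := range r.conns {
		if time.Since(c.lastActivity) > timeout {
			c.closeRemote()
			delete(r.conns, id)
		}
	}
}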
Hello,
First of all, I would like to thank you, the apiserver-network-proxy team, for your amazing work :)
I use Konnectivity in version 0.0.10 with Kubernetes 1.18.5. Most of the time everything works flawlessly, but when I restart the agent's pods, konnectivity-server doesn't switch traffic to the new connections:
I0710 13:11:42.881321 1 server.go:293] >>> DATA sent to backend
I0710 13:11:42.952402 1 server.go:498] <<< Received 372 bytes of DATA from agentID d3c1f7ef-4730-4300-964e-2e382af9e484, connID 1
I0710 13:11:42.961352 1 server.go:278] >>> Received 42 bytes of DATA(id=1)
I0710 13:11:42.961521 1 server.go:293] >>> DATA sent to backend
I0710 13:11:47.641175 1 server.go:418] Connect request from agent 3adda35e-94ab-46ae-bd6c-96a8197ff651
I0710 13:11:47.641222 1 backend_manager.go:99] register Backend &{0xc00050e180} for agentID 3adda35e-94ab-46ae-bd6c-96a8197ff651
W0710 13:11:47.909837 1 server.go:451] stream read error: rpc error: code = Canceled desc = context canceled
I0710 13:11:47.909895 1 backend_manager.go:119] remove Backend &{0xc000222480} for agentID 1bf0e057-b131-45b9-bbcb-ab752e9dfef4
I0710 13:11:47.909940 1 server.go:531] <<< Close backend &{0xc000222480} of agent 1bf0e057-b131-45b9-bbcb-ab752e9dfef4
W0710 13:11:51.898482 1 server.go:451] stream read error: rpc error: code = Canceled desc = context canceled
I0710 13:11:51.898535 1 backend_manager.go:119] remove Backend &{0xc0002223c0} for agentID d3c1f7ef-4730-4300-964e-2e382af9e484
I0710 13:11:51.898592 1 server.go:531] <<< Close backend &{0xc0002223c0} of agent d3c1f7ef-4730-4300-964e-2e382af9e484
I0710 13:11:55.825758 1 server.go:278] >>> Received 53 bytes of DATA(id=3)
W0710 13:11:55.825884 1 server.go:291] >>> DATA to Backend failed: rpc error: code = Unavailable desc = transport is closing
I0710 13:11:55.825914 1 server.go:293] >>> DATA sent to backend
I0710 13:11:55.849556 1 server.go:278] >>> Received 53 bytes of DATA(id=3)
W0710 13:11:55.849659 1 server.go:291] >>> DATA to Backend failed: rpc error: code = Unavailable desc = transport is closing
I0710 13:11:55.849691 1 server.go:293] >>> DATA sent to backend
I0710 13:11:56.097642 1 server.go:418] Connect request from agent 8fe505d3-11a8-4f63-9882-8a963ab16273
I0710 13:11:56.097671 1 backend_manager.go:99] register Backend &{0xc00050e240} for agentID 8fe505d3-11a8-4f63-9882-8a963ab16273
I0710 13:11:57.588222 1 server.go:418] Connect request from agent bf37dd3c-8736-484a-b9e6-6806b4b9338a
I0710 13:11:57.588257 1 backend_manager.go:99] register Backend &{0xc00072c300} for agentID bf37dd3c-8736-484a-b9e6-6806b4b9338a
I0710 13:12:00.827480 1 server.go:278] >>> Received 42 bytes of DATA(id=3)
W0710 13:12:00.827593 1 server.go:291] >>> DATA to Backend failed: rpc error: code = Unavailable desc = transport is closing
I0710 13:12:00.827625 1 server.go:293] >>> DATA sent to backend
I0710 13:12:00.850981 1 server.go:278] >>> Received 42 bytes of DATA(id=3)
W0710 13:12:00.851056 1 server.go:291] >>> DATA to Backend failed: rpc error: code = Unavailable desc = transport is closing
I0710 13:12:00.851075 1 server.go:293] >>> DATA sent to backend
I0710 13:12:00.894264 1 server.go:278] >>> Received 53 bytes of DATA(id=3)
W0710 13:12:00.894345 1 server.go:291] >>> DATA to Backend failed: rpc error: code = Unavailable desc = transport is closing
I0710 13:12:00.894376 1 server.go:293] >>> DATA sent to backend
I0710 13:12:05.953939 1 server.go:278] >>> Received 42 bytes of DATA(id=3)
W0710 13:12:05.954051 1 server.go:291] >>> DATA to Backend failed: rpc error: code = Unavailable desc = transport is closing
I0710 13:12:05.954077 1 server.go:293] >>> DATA sent to backend
After konnectivity-server restart, everything is going back to normal.
In "cmd/", we call split the proxy server and agent binaries as "cmd/proxy" and "cmd/agent". We should split the "pkg/" in the similar ways.
The /healthz response controls whether kubelet restarts a pod. If no proxy agent has connected to the proxy server for a long time, then maybe it's a server problem, and restarting the server might solve it. Thus it may be worth setting healthz to "not ok" if there has been no backend connection for a long time.
This is all hypothetical so far. If we see evidence in production, then we will come back and add this healthiness condition.
Ref #102 (comment).
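If we do add it, a minimal sketch of the condition (all names hypothetical) could look like:

package server

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// backendHealth reports unhealthy once no backend has been connected for
// longer than maxGap, causing kubelet to restart the pod.
type backendHealth struct {
	mu          sync.Mutex
	count       int       // currently connected backends
	lastBackend time.Time // when the last backend disconnected
	maxGap      time.Duration
}

func (h *backendHealth) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.mu.Lock()
	healthy := h.count > 0 || time.Since(h.lastBackend) < h.maxGap
	h.mu.Unlock()
	if healthy {
		fmt.Fprintln(w, "ok")
		return
	}
	http.Error(w, "no backend connection", http.StatusServiceUnavailable)
}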