kubernetes-sigs / apiserver-network-proxy
License: Apache License 2.0
I'm constantly seeing this message (from the proxy server logs when using http-connect), even when running the proxy server/agent locally:
Received error on connection read unix konnectivity-server.socket->@: use of closed network connection.
It doesn't have any effect on the proxy/data transfer as far as the client is concerned, but it suggests there is some cleanup we may not be doing.
/cc @cheftako
/cc @caesarxuchao
It seems to occur on almost every request.
To repro, start 4 processes in the terminal:
python -m SimpleHTTPServer 8001
./bin/proxy-server --uds-name=konnectivity-server.socket --mode=http-connect --cluster-cert=certs/agent/issued/proxy-master.crt --cluster-key=certs/agent/private/proxy-master.key --server-port=0
./bin/proxy-agent --ca-cert=certs/agent/issued/ca.crt --agent-cert=certs/agent/issued/proxy-agent.crt --agent-key=certs/agent/private/proxy-agent.key
./bin/proxy-test-client --proxy-uds=konnectivity-server.socket --proxy-host= --proxy-port=0 --mode=http-connect --request-port=8001 --request-host=localhost
The proxy should be able to close connections properly, whether the close is expected or unexpected.
Currently, the agent-to-server communication is based on gRPC streams. There are cases where an agent's egress has to go through an egress proxy, and in many cases the egress proxy does not support the gRPC protocol directly.
Is there a recommended way to solve this problem? For example, has anyone tried tunneling the gRPC HTTP/2 traffic through an HTTP CONNECT based proxy? Is that a supported model?
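For illustration only (not a supported configuration of this repo), here is a minimal sketch of the HTTP CONNECT idea in Go, using made-up addresses: give gRPC a custom dialer that establishes the CONNECT tunnel first, then lets the HTTP/2 traffic flow over the raw connection.

package main

import (
	"bufio"
	"context"
	"fmt"
	"net"
	"net/http"

	"google.golang.org/grpc"
)

// connectDialer returns a gRPC dialer that first opens an HTTP CONNECT
// tunnel through proxyAddr, then hands the raw connection to gRPC.
func connectDialer(proxyAddr string) func(context.Context, string) (net.Conn, error) {
	return func(ctx context.Context, target string) (net.Conn, error) {
		var d net.Dialer
		conn, err := d.DialContext(ctx, "tcp", proxyAddr)
		if err != nil {
			return nil, err
		}
		// Ask the egress proxy to open a raw tunnel to the gRPC server.
		fmt.Fprintf(conn, "CONNECT %s HTTP/1.1\r\nHost: %s\r\n\r\n", target, target)
		resp, err := http.ReadResponse(bufio.NewReader(conn), nil)
		if err != nil {
			conn.Close()
			return nil, err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			conn.Close()
			return nil, fmt.Errorf("proxy CONNECT failed: %s", resp.Status)
		}
		// From here on, gRPC's HTTP/2 (and TLS) runs over the tunnel.
		return conn, nil
	}
}

func main() {
	// "egress-proxy:3128" and "proxy-server:8091" are made-up addresses.
	conn, err := grpc.Dial("proxy-server:8091",
		grpc.WithInsecure(), // a real deployment would use TLS credentials
		grpc.WithContextDialer(connectDialer("egress-proxy:3128")))
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}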
For example, trace logs could be emitted at klog.V(4).
One of our goals here is to allow SSH tunnels to be removed from the KAS. If we support SSH tunnels here, it would allow a smoother migration path for existing SSH tunnel users.
We use a mutex to protect the backend stream in proxy server.
Same in the proxy agent.
Need to investigate whether there is a more efficient implementation with channels. Note that both cases are N producers and 1 consumer. It's tricky to stop the pipeline, because (1) a channel should be closed by the producer, not the consumer, and (2) on the other hand, it's difficult to let one of the N producers close the channel.
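For reference, a stripped-down version of the current mutex pattern (stream and packet types reduced to stubs): gRPC streams are not safe for concurrent Send calls, so the N producers serialize through a mutex.

package server

import "sync"

// Packet and stream are stand-ins for the generated proto/agent types.
type Packet struct{}

type stream interface {
	Send(*Packet) error
}

// backend serializes the N producers onto one gRPC stream: gRPC forbids
// concurrent Send calls on the same stream, hence the mutex.
type backend struct {
	mu sync.Mutex
	s  stream
}

func (b *backend) Send(p *Packet) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.s.Send(p)
}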
We have many versions of some modules in our go.sum. This file should be trimmed down; since this is a very early repository, we should ideally have only one version of each module.
The failure message emitted by the kube-apiserver is: "error dialing backend: EOF".
With instrumentation, it shows it failed at this line: https://github.com/kubernetes/kubernetes/blob/b5b675491b69b5d48bf112a896bc739e500c7275/staging/src/k8s.io/apimachinery/pkg/util/proxy/dial.go#L85
The TLS handshake received the "EOF" error.
At this line, the tunnel to the kubelet has already been established, and the kube-apiserver is trying to do the TLS handshake with the kubelet over the tunnel; that handshake fails.
This doesn't happen if the proxy runs in the gRPC mode.
I don't know if it's related to #80.
The Kubernetes standard for command-line flags is kebab-case, not camelCase. As such, the agent, client, and proxy should all be switched to use kebab-case.
When an agent disconnects #125 closes all client side connections that use the corresponding agent. However, PendingDial requests may still be in flight and have not been added to the list of clients yet. We should either fail them or retry with a different agent instead of letting the client hit its dial timeout.
Original context from @cheftako:
Most of the time I would expect pending dial to be empty. However, if there is something in there, there is a chance its request went out via this backend. If so, we will never get the response, and that also needs to be dealt with.
The issue is that we do not record in the pending data structure which backend it used, so we cannot tell whether anything on the pending list would be affected by a given backend breaking. We also need to work out how to deal with it. One option would be to just fail, which is probably easiest. However, as the connection has not yet been established, we should be able to switch to a different backend.
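One possible shape for the fix, with hypothetical names: record the backend's agent ID in each pending-dial entry, so that when a backend breaks we can immediately fail (or re-dial) everything that was in flight on it.

package server

import (
	"errors"
	"sync"
)

var errBackendClosed = errors.New("backend closed before dial completed")

// pendingDial is hypothetical; the real structure does not yet record
// which backend the DIAL_REQ went out on.
type pendingDial struct {
	agentID string     // backend carrying the in-flight DIAL_REQ
	errCh   chan error // delivers the dial result to the frontend
}

type pendingDials struct {
	mu   sync.Mutex
	byID map[int64]*pendingDial
}

// onBackendClosed fails every dial that was in flight on the broken
// backend instead of letting the client hit its dial timeout. This is
// also the point where we could re-send the DIAL_REQ via another backend.
func (p *pendingDials) onBackendClosed(agentID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for id, pd := range p.byID {
		if pd.agentID == agentID {
			pd.errCh <- errBackendClosed
			delete(p.byID, id)
		}
	}
}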
Fedora Rawhide, with latest stable Golang:
Testing in: /builddir/build/BUILD/apiserver-network-proxy-0.0.10/_build/src
PATH: /builddir/build/BUILD/apiserver-network-proxy-0.0.10/_build/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/sbin
GOPATH: /builddir/build/BUILD/apiserver-network-proxy-0.0.10/_build:/usr/share/gocode
GO111MODULE: off
command: go test -buildmode pie -compiler gc -ldflags "-X sigs.k8s.io/apiserver-network-proxy/version=0.0.10 -extldflags '-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld '"
testing: sigs.k8s.io/apiserver-network-proxy
sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client
# sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client [sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client.test]
./client_test.go:43:17: not enough arguments in call to tunnel.serve
have ()
want (*grpc.ClientConn)
./client_test.go:73:17: not enough arguments in call to tunnel.serve
have ()
want (*grpc.ClientConn)
./client_test.go:130:17: not enough arguments in call to tunnel.serve
have ()
want (*grpc.ClientConn)
FAIL sigs.k8s.io/apiserver-network-proxy/konnectivity-client/pkg/client [build failed]
E.g., divide it into frontends and backends, and maybe factor out a common connection manager for the backend/frontend managers.
Now that the client is tagged, should the root go.mod switch to using that version?
What Happened
I was testing out service account authentication with authentication-audience, and for some reason it was not working (the connection was closed on the client side). I was using the following flags:
- --cluster-ca-cert=ca.crt
- --cluster-cert=konnectivity-server.crt
- --cluster-key=konnectivity-server.key
- --agent-namespace=namespace
- --agent-service-account=konnectivity-agent
- --kubeconfig=kubeconfig
- --authentication-audience=system:konnectivity-server
- --mode=http-connect
As I was switching between SA auth mode and certificates, the problem was that the --cluster-ca-cert flag was set at the same time as the SA auth flags (authentication-audience, agent-namespace, ...), which probably led to the konnectivity server running in certificate auth mode.
What I expect
If I set agent-service-account, then cluster cert validation should not be used, or at the least, flag validation should fail with a clear error telling me to set the right flags.
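A check along these lines (hypothetical function; flag names from above) would have caught my mistake at startup:

package options

import "errors"

// validateAuthFlags is hypothetical: certificate auth and service account
// auth are mutually exclusive, so fail fast when both sets of flags are set.
func validateAuthFlags(clusterCACert, agentServiceAccount string) error {
	if clusterCACert != "" && agentServiceAccount != "" {
		return errors.New("--cluster-ca-cert cannot be combined with --agent-service-account; choose certificate auth or service account auth")
	}
	return nil
}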
Make sure we handle cases like the proxy-agent coming up before the proxy-server. The agent should also reconnect if the connection is broken. We should have tests for this.
We split the proxy and client code. A README should be added documenting how to use the client in other modules.
This includes the generated proto files in proto/ and the generated mocks in proto/agent/mock/.
k/k has similar scripts for this purpose, e.g., verify-codegen.sh
This was observed with kubectl cp on a large binary file; I attempted to run kubectl cp on a 57M file.
To reproduce, start a kubernetes cluster with network proxy enabled
KUBE_ENABLE_EGRESS_VIA_KONNECTIVITY_SERVICE=true ./cluster/kube-up.sh
SSH onto the master and try copying the kubectl binary to a random pod, e.g.:
kubectl apply -f https://k8s.io/examples/application/shell-demo.yaml
kubectl cp $(which kubectl) shell-demo:kubectl
The konnectivity-server.log file shows data being transferred, but the transfer stops partway through and hangs.
EDIT: This can be reproduced without kubernetes
make build && make gen && make certs
cp $(which kubectl) .
HTTP Connect mode fails:
./bin/proxy-server --uds-name=konnectivity-server.socket --mode=http-connect --cluster-cert=certs/agent/issued/proxy-master.crt --cluster-key=certs/agent/private/proxy-master.key --server-port=0
./bin/proxy-agent --ca-cert=certs/agent/issued/ca.crt --agent-cert=certs/agent/issued/proxy-agent.crt --agent-key=certs/agent/private/proxy-agent.key
python -m SimpleHTTPServer 8001
./bin/proxy-test-client --proxy-uds=konnectivity-server.socket --proxy-host= --proxy-port=0 --mode=http-connect --request-port=8001 --request-host=localhost --request-path=/kubectl
GRPC mode also fails
./bin/proxy-server --uds-name=konnectivity-server.socket --mode=grpc --cluster-cert=certs/agent/issued/proxy-master.crt --cluster-key=certs/agent/private/proxy-master.key --server-port=0
./bin/proxy-agent --ca-cert=certs/agent/issued/ca.crt --agent-cert=certs/agent/issued/proxy-agent.crt --agent-key=certs/agent/private/proxy-agent.key
python -m SimpleHTTPServer 8001
./bin/proxy-test-client --proxy-uds=konnectivity-server.socket --mode=grpc --proxy-host= --proxy-port=0 --request-port=8001 --request-host=localhost --request-path=/kubectl
Please see #1 (comment)
We should not need to run 3 proxy agents on each node for an HA cluster. However, we do want each proxy agent to connect to each of the proxy servers.
When running in gRPC mode, client -> proxy UDS connections are not closed after the (client -> proxy server -> agent -> destination) connection is closed. After a CLOSE_RSP, it seems that we only remove the frontends, but never call Close on the underlying gRPC stream between the client and proxy. This causes resource leaks.
Should be an easy fix though.
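Something like the following, with made-up names for the server's frontend bookkeeping, is roughly the shape of the fix:

package server

import (
	"net"
	"sync"
)

type frontends struct {
	mu    sync.Mutex
	conns map[int64]net.Conn // client -> proxy UDS connections
}

// onCloseRsp removes the frontend for connID and, crucially, closes the
// underlying connection so the UDS stream does not leak.
func (f *frontends) onCloseRsp(connID int64) {
	f.mu.Lock()
	conn, ok := f.conns[connID]
	delete(f.conns, connID)
	f.mu.Unlock()
	if ok {
		conn.Close() // the step that is currently missing
	}
}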
After just a couple of operations on the master, the number of opened streams can get quite high. This does not happen in http-connect mode.
jying@kubernetes-master ~ $ netstat | grep konnectivity-server.socket | wc -l
204
/cc @caesarxuchao
make test currently only recursively executes tests in the network proxy module. We should fix it to execute tests in the konnectivity-client submodule as well.
In the proposal for this repo, this sentence is present under "Non-Goals":
(The proxy can be extended to do this. However that is left to the User if they want that behavior)
Any suggestions as to how to extend this to get this behavior?
Thank you
Currently, we create a new grpcTunnel for every client connection to the proxy server (from the same client). This has performance implications, since we could have many concurrent connections all using different tunnels, with each tunnel creating only one stream to the proxy server. We should investigate reusing a single tunnel and multiplexing all streams through it instead of creating a new tunnel for each connection.
/cc @caesarxuchao
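A rough sketch of what reuse could look like, assuming a hypothetical multi-use tunnel constructor (today's tunnel is single-use, so this is not how the current API behaves): cache one tunnel per proxy-server address and multiplex all connections through it.

package client

import (
	"context"
	"net"
	"sync"
)

// Tunnel is a stand-in for the konnectivity-client tunnel interface.
type Tunnel interface {
	DialContext(ctx context.Context, protocol, address string) (net.Conn, error)
}

var (
	mu      sync.Mutex
	tunnels = map[string]Tunnel{} // one shared tunnel per proxy server
)

// sharedTunnel returns the cached tunnel for addr, creating it on first
// use. newTunnel is a hypothetical constructor for a multi-use tunnel.
func sharedTunnel(ctx context.Context, addr string, newTunnel func(context.Context, string) (Tunnel, error)) (Tunnel, error) {
	mu.Lock()
	defer mu.Unlock()
	if t, ok := tunnels[addr]; ok {
		return t, nil
	}
	t, err := newTunnel(ctx, addr)
	if err != nil {
		return nil, err
	}
	tunnels[addr] = t
	return t, nil
}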
https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/apiserver-network-proxy-push-images
Requesting access to cloudbuild logs to debug further (kubernetes/k8s.io#1056)
/cc @cheftako
Currently the proxy-server attempts to forward all connection requests from the client to the proxy-agent. It would be useful to allow the proxy server to have a setting where it puts the traffic on a local ethernet connection directly. This would allow us to firewall off the KAS so it could ONLY connect to the proxy-server(s). The relevant proxy-server could then place traffic locally for things like connecting to the etcd server.
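A minimal sketch of the proposed setting, with hypothetical names (viaAgent stands in for the existing agent-forwarding path):

package server

import "net"

// dialBackend is a hypothetical sketch of the proposed setting: when
// localEgress is enabled, place the traffic on the local network directly
// (e.g., for the KAS -> etcd path) instead of forwarding to an agent.
func dialBackend(localEgress bool, address string, viaAgent func(string) (net.Conn, error)) (net.Conn, error) {
	if localEgress {
		return net.Dial("tcp", address)
	}
	return viaAgent(address) // existing agent-forwarding behavior
}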
K8s has transitioned from dep to go modules. It would be good if we did as well.
Currently the proxy server will send all connection requests to a single agent. While all requests for an established connection should go through the same agent on which it was established, connection requests should be balanced across available agents.
Currently we have no process for generating Open Source official builds.
We have promoted gcr.io/google-containers but these are not general OSS official builds.
It would be nice to support a multi arch build.
Could get inspiration from kubernetes/node-problem-detector#336
When the proxy server restarts, if the UDS file already exists, trying to listen on the socket fails with "bind: address already in use". The server should delete the file if it exists and then listen on the socket.
cc @Jefftree
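A minimal sketch of the proposed behavior (the helper name is made up):

package server

import (
	"net"
	"os"
)

// listenUDS removes a stale socket file left over from a previous run
// before binding; "not exist" is ignored, any other error is surfaced.
func listenUDS(udsName string) (net.Listener, error) {
	if err := os.Remove(udsName); err != nil && !os.IsNotExist(err) {
		return nil, err
	}
	return net.Listen("unix", udsName)
}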
2020/03/26 16:16:41 http: panic serving @: runtime error: invalid memory address or nil pointer dereference
goroutine 1266 [running]:
net/http.(*conn).serve.func1(0xc0003420a0)
/usr/local/go/src/net/http/server.go:1767 +0x139
panic(0x133c6a0, 0x206ac00)
/usr/local/go/src/runtime/panic.go:679 +0x1b2
sigs.k8s.io/apiserver-network-proxy/pkg/server.(*Tunnel).ServeHTTP(0xc00000e800, 0x16be0c0, 0xc0004a81c0, 0xc0003f8100)
/go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/tunnel.go:81 +0x663
net/http.serverHandler.ServeHTTP(0xc0001c01c0, 0x16be0c0, 0xc0004a81c0, 0xc0003f8100)
/usr/local/go/src/net/http/server.go:2802 +0xa4
net/http.(*conn).serve(0xc0003420a0, 0x16c1f40, 0xc0000a8480)
/usr/local/go/src/net/http/server.go:1890 +0x875
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2927 +0x38e
Our release process is a bit different from normal since we have a multi-module Go repository. We should document the release procedure for a new version (tags) as well as the release image pipeline to k8s.gcr.io.
/cc @caesarxuchao
We have this pattern in many places in the repo:
apiserver-network-proxy/cmd/proxy/main.go
Lines 135 to 137 in 8503646
We shouldn't ignore other kinds of error.
Walter mentioned that maybe the code is like this because os.Stat returns an error if the file is a symlink; I checked, and that is not the case on my Ubuntu machine. Anyway, even if there is a kind of error that we want to ignore, we should whitelist it.
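A sketch of what the whitelisting could look like, with a hypothetical helper:

package util

import (
	"fmt"
	"os"
)

// statIfExists whitelists only the "does not exist" error instead of
// swallowing every error os.Stat can return.
func statIfExists(path string) (os.FileInfo, error) {
	fi, err := os.Stat(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil // an absent file is expected; anything else is not
		}
		return nil, fmt.Errorf("stat %s: %v", path, err)
	}
	return fi, nil
}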
The above tests only run on the master branch, so we only get failure notifications after a PR has merged. If a PR could cause the build-and-push procedure to fail, we should catch that before it is merged. I think we should at least run docker build on all PRs as part of the CI process.
/cc @Sh4d1
In case of a dial failure, the agent sets the error:
But the proxy server never checks for the error:
I haven't thought through what the outcome would be.
There seems to be an issue with the tag konnectivity-client/v0.0.5. In fact, the _GIT_TAG given to the prow job building the image is v20200218-konnectivity-client/v0.0.5-3-gef0d890, which fails the docker build since it's not a valid tag.
If we want to name a tag with the component name as well, I guess it should be done in a docker-tag-friendly way like konnectivity-client-v0.0.5.
/cc @Jefftree
Although it is documented that singleUseGRPCTunnels should only be used once, it is technically still possible to call the Dial function multiple times. This causes certain race conditions and goroutine leaks.
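One possible mitigation, sketched with hypothetical names: guard Dial with an atomic flag so a second call fails fast instead of racing.

package client

import (
	"errors"
	"net"
	"sync/atomic"
)

// guardedTunnel wraps the single-use tunnel's dial logic; a second Dial
// fails fast instead of racing with the first.
type guardedTunnel struct {
	used int32 // flipped atomically by the first Dial
	dial func(protocol, address string) (net.Conn, error)
}

func (t *guardedTunnel) Dial(protocol, address string) (net.Conn, error) {
	if !atomic.CompareAndSwapInt32(&t.used, 0, 1) {
		return nil, errors.New("tunnel already used; create a new one")
	}
	return t.dial(protocol, address)
}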
We are not running the proxy server in hostNetwork, so we can't have a liveness probe, since the admin port only listens on localhost.
Create a Dockerfile and a Makefile target to build the Docker image.
Currently our CI only supports the grpc mode. We need to figure out a way to continuously test both modes.
agent.pb.go, generated by protoc -I . proto/agent/agent.proto --go_out=plugins=grpc:${GOPATH}/src, is placed at $GOPATH/src/sigs.k8s.io/apiserver-network-proxy/proto/agent. But cat hack/go-license-header.txt proto/agent/agent.pb.go > proto/agent/agent.licensed.go tries to read it from the current directory, which doesn't work outside the $GOPATH.
make gen requires golang/mock, which is not mentioned in the README.
I noticed there are many such warnings in the log. It's printed here:
The connection is closed here, after receiving EOF from the client connection:
It seems this usually was a premature close: there was data the backend still wanted to send to the client, hence the many "DATA send to client stream error" warnings.
I would expect this to cause major problems, but it didn't. Why?
When a client doesn't send a CloseRequest to the proxy server, due to reasons like a reset tunnel or a timeout, the proxy server should still recognize this and close the corresponding remote connection.
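One way to do that, sketched with made-up names: track last activity per connection and periodically reap connections that have gone idle, closing the corresponding remote connection.

package server

import (
	"sync"
	"time"
)

// connState tracks per-connection activity so the server can reap
// connections whose client vanished without sending a CloseRequest.
type connState struct {
	lastActivity time.Time
	closeRemote  func() // closes the corresponding remote connection
}

type reaper struct {
	mu    sync.Mutex
	conns map[int64]*connState
}

// reap closes remote connections that have been idle longer than timeout;
// it would be run periodically from a goroutine.
func (r *reaper) reap(timeout time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, c := range r.conns {
		if time.Since(c.lastActivity) > timeout {
			c.closeRemote()
			delete(r.conns, id)
		}
	}
}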
Hello,
First of all, I would like to thank you, the apiserver-network-proxy team, for your amazing work :)
I use Konnectivity in version 0.0.10 with Kubernetes 1.18.5. Most of the time everything works flawlessly, but when I restart the agent's pods, konnectivity-server doesn't switch traffic to the new connections:
I0710 13:11:42.881321 1 server.go:293] >>> DATA sent to backend
I0710 13:11:42.952402 1 server.go:498] <<< Received 372 bytes of DATA from agentID d3c1f7ef-4730-4300-964e-2e382af9e484, connID 1
I0710 13:11:42.961352 1 server.go:278] >>> Received 42 bytes of DATA(id=1)
I0710 13:11:42.961521 1 server.go:293] >>> DATA sent to backend
I0710 13:11:47.641175 1 server.go:418] Connect request from agent 3adda35e-94ab-46ae-bd6c-96a8197ff651
I0710 13:11:47.641222 1 backend_manager.go:99] register Backend &{0xc00050e180} for agentID 3adda35e-94ab-46ae-bd6c-96a8197ff651
W0710 13:11:47.909837 1 server.go:451] stream read error: rpc error: code = Canceled desc = context canceled
I0710 13:11:47.909895 1 backend_manager.go:119] remove Backend &{0xc000222480} for agentID 1bf0e057-b131-45b9-bbcb-ab752e9dfef4
I0710 13:11:47.909940 1 server.go:531] <<< Close backend &{0xc000222480} of agent 1bf0e057-b131-45b9-bbcb-ab752e9dfef4
W0710 13:11:51.898482 1 server.go:451] stream read error: rpc error: code = Canceled desc = context canceled
I0710 13:11:51.898535 1 backend_manager.go:119] remove Backend &{0xc0002223c0} for agentID d3c1f7ef-4730-4300-964e-2e382af9e484
I0710 13:11:51.898592 1 server.go:531] <<< Close backend &{0xc0002223c0} of agent d3c1f7ef-4730-4300-964e-2e382af9e484
I0710 13:11:55.825758 1 server.go:278] >>> Received 53 bytes of DATA(id=3)
W0710 13:11:55.825884 1 server.go:291] >>> DATA to Backend failed: rpc error: code = Unavailable desc = transport is closing
I0710 13:11:55.825914 1 server.go:293] >>> DATA sent to backend
I0710 13:11:55.849556 1 server.go:278] >>> Received 53 bytes of DATA(id=3)
W0710 13:11:55.849659 1 server.go:291] >>> DATA to Backend failed: rpc error: code = Unavailable desc = transport is closing
I0710 13:11:55.849691 1 server.go:293] >>> DATA sent to backend
I0710 13:11:56.097642 1 server.go:418] Connect request from agent 8fe505d3-11a8-4f63-9882-8a963ab16273
I0710 13:11:56.097671 1 backend_manager.go:99] register Backend &{0xc00050e240} for agentID 8fe505d3-11a8-4f63-9882-8a963ab16273
I0710 13:11:57.588222 1 server.go:418] Connect request from agent bf37dd3c-8736-484a-b9e6-6806b4b9338a
I0710 13:11:57.588257 1 backend_manager.go:99] register Backend &{0xc00072c300} for agentID bf37dd3c-8736-484a-b9e6-6806b4b9338a
I0710 13:12:00.827480 1 server.go:278] >>> Received 42 bytes of DATA(id=3)
W0710 13:12:00.827593 1 server.go:291] >>> DATA to Backend failed: rpc error: code = Unavailable desc = transport is closing
I0710 13:12:00.827625 1 server.go:293] >>> DATA sent to backend
I0710 13:12:00.850981 1 server.go:278] >>> Received 42 bytes of DATA(id=3)
W0710 13:12:00.851056 1 server.go:291] >>> DATA to Backend failed: rpc error: code = Unavailable desc = transport is closing
I0710 13:12:00.851075 1 server.go:293] >>> DATA sent to backend
I0710 13:12:00.894264 1 server.go:278] >>> Received 53 bytes of DATA(id=3)
W0710 13:12:00.894345 1 server.go:291] >>> DATA to Backend failed: rpc error: code = Unavailable desc = transport is closing
I0710 13:12:00.894376 1 server.go:293] >>> DATA sent to backend
I0710 13:12:05.953939 1 server.go:278] >>> Received 42 bytes of DATA(id=3)
W0710 13:12:05.954051 1 server.go:291] >>> DATA to Backend failed: rpc error: code = Unavailable desc = transport is closing
I0710 13:12:05.954077 1 server.go:293] >>> DATA sent to backend
After konnectivity-server restart, everything is going back to normal.
In "cmd/", we call split the proxy server and agent binaries as "cmd/proxy" and "cmd/agent". We should split the "pkg/" in the similar ways.
The /healthz response controls whether kubelet restarts a pod. If no proxy agent has connected to the proxy server for a long time, then maybe it's a server problem, and restarting the server might solve it. Thus it may be worth setting healthz to "not ok" if there has been no backend connection for a long time.
This is all hypothetical so far. If we see evidence in production, then we will come back and add this healthiness condition.
Ref #102 (comment).
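If we do add it, a minimal sketch of the condition (all names hypothetical) could look like:

package server

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// backendHealth reports unhealthy once no backend has been connected for
// longer than maxGap, causing kubelet to restart the pod.
type backendHealth struct {
	mu          sync.Mutex
	count       int       // currently connected backends
	lastBackend time.Time // when the last backend disconnected
	maxGap      time.Duration
}

func (h *backendHealth) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.mu.Lock()
	healthy := h.count > 0 || time.Since(h.lastBackend) < h.maxGap
	h.mu.Unlock()
	if healthy {
		fmt.Fprintln(w, "ok")
		return
	}
	http.Error(w, "no backend connection", http.StatusServiceUnavailable)
}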