
Comments (9)

kyessenov commented on July 22, 2024

For the client-side mode, we need to trap packets originating in the pod that are not sent by the proxy. Using "owner" has some limitations: https://www.frozentux.net/iptables-tutorial/iptables-tutorial.html#OWNERMATCH
Since "owner" is not 100% robust, we may still end up trapping packets that originate from envoy itself (e.g. ICMP packets or anything else the match misses). We need to make sure we're not creating an infinite loop with this rule @enricoschiattarella.

from pilot.

kyessenov commented on July 22, 2024

We should stop capturing pod-local traffic with iptables. If one container wants to talk to another container in the pod, there is no need to insert envoy in the middle.

Moreover, we should provide a way for other sidecars to talk to Kubernetes without us getting in the way. Perhaps we could treat the kube-system namespace as special?


rshriram commented on July 22, 2024


ayj commented on July 22, 2024

A few options exist for redirecting client traffic to external services, noted below. Options (2) and (3) both require extra runtime privileges: option (2) requires CAP_NET_ADMIN in the proxy itself, whereas option (3) requires privileges to update net_cls. Furthermore, option (3) appears to expose node-level cgroup state to the container; I don't completely understand the implications, but it seems like a bad idea.

Note that each of the options can be amended with "! -s 127.0.0.1/32" to avoid redirecting local-to-local traffic to the proxy. Similarly, we could avoid proxy interception for specific destinations (e.g. k8s) with a similar destination filter.
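For example, a sketch of a REDIRECT rule carrying that exclusion clause (the port value is hypothetical):

```shell
# Sketch only: the "! -s 127.0.0.1/32" clause keeps local-to-local
# traffic out of the proxy. PORT is an illustrative redirect port.
PORT=15001
RULE="-A OUTPUT -p tcp ! -s 127.0.0.1/32 -j REDIRECT --to-ports $PORT"
# Emit rather than apply, so this can run unprivileged.
echo "iptables -t nat $RULE"
```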

Option 1 - UID

-A OUTPUT -p tcp -m owner ! --uid-owner $UID -j REDIRECT --to-ports $PORT

  • pro: No envoy change required
  • con: Requires coordinating the UID between the proxy and the init-container. Istio may not necessarily have control over the UID (e.g. it may be set by docker).
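A minimal init-container sketch of option (1); the UID and port values are illustrative, not the project's actual defaults:

```shell
# Sketch of option (1): let the proxy user's own traffic through,
# redirect everything else. PROXY_UID/PROXY_PORT are illustrative.
PROXY_UID=1337
PROXY_PORT=15001

# Emit the rules instead of applying them, so the sketch runs unprivileged.
uid_rules() {
  echo "iptables -t nat -A OUTPUT -p tcp -m owner --uid-owner $PROXY_UID -j RETURN"
  echo "iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-ports $PROXY_PORT"
}
uid_rules
```

The RETURN rule for the proxy's UID must precede the catch-all REDIRECT, otherwise the proxy's own outbound connections would loop back into it.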

Option 2 - SO_MARK

-A OUTPUT -p tcp -m mark ! --mark $MARK -j REDIRECT --to-ports $PORT

  • pro: The $MARK value can be configured in the pod spec (e.g. configmap, env var) and used both by the init-container to program iptables and by the proxy agent to create envoy config with the proper SO_MARK value per upstream cluster.
  • con: Requires adding SO_MARK support to envoy, perhaps configured per upstream cluster?
  • con: Requires the proxy to run with CAP_NET_ADMIN to set SO_MARK on upstream sockets.
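A sketch of the two halves of option (2); the mark and port values are hypothetical, and the setsockopt call shown in the comment is what drives the CAP_NET_ADMIN requirement:

```shell
# Sketch of option (2). MARK/PORT are illustrative values.
MARK=0x1
PORT=15001

# iptables half: packets without the proxy's mark get redirected.
MARK_RULE="iptables -t nat -A OUTPUT -p tcp -m mark ! --mark $MARK -j REDIRECT --to-ports $PORT"
echo "$MARK_RULE"

# Proxy half (not shell): every upstream socket must be marked, roughly
#   setsockopt(fd, SOL_SOCKET, SO_MARK, &mark, sizeof(mark));
# and per socket(7), setting SO_MARK is what needs CAP_NET_ADMIN.
```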

Option 3 - Network classifier cgroups (net_cls)

-A OUTPUT -p tcp -m cgroup ! --cgroup $GROUP -j REDIRECT --to-ports $PORT

  • pro: No envoy change required
  • con: Requires newer version of iptables (e.g. 1.6.0)
  • con: Requires re-mounting /sys/fs/cgroup/net_cls as read/write to create new net_cls groups. This requires additional privileges (CAP_NET_ADMIN?) and seems to expose all of the node's net_cls cgroups to the pod. Furthermore, changes to net_cls (e.g. adding a new group for the proxy) persist on the node across pod restarts.
  • con: The proxy agent needs to update the /sys/fs/cgroup/net_cls/<group>/tasks file with the envoy proxy PID whenever the proxy crashes, restarts, etc.
  • note: I wasn't able to get this method to work, but it's possible I mixed up the configuration someplace. I will try again tomorrow.
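For reference, a sketch of the intended net_cls setup (paths and classid are illustrative, and as the note above says, the match could not be made to fire in practice):

```shell
# Sketch of option (3). Commands are emitted rather than executed,
# since they need root and a writable net_cls hierarchy. Values are
# illustrative.
CLASSID=0x00100001
PORT=15001
netcls_setup() {
  echo "mkdir -p /sys/fs/cgroup/net_cls/proxy"
  echo "echo $CLASSID > /sys/fs/cgroup/net_cls/proxy/net_cls.classid"
  echo "echo \$PROXY_PID > /sys/fs/cgroup/net_cls/proxy/tasks"
  echo "iptables -t nat -A OUTPUT -p tcp -m cgroup ! --cgroup $CLASSID -j REDIRECT --to-ports $PORT"
}
netcls_setup
```

The tasks file is the piece that has to be rewritten whenever the proxy restarts, which is the last con listed above.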

Option 4 - Explicit iptable rule per-service

  • pro: very explicit.
  • con: lots of iptables rules needed to cover all client and server services
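A sketch of option (4), one REDIRECT rule per service VIP; the addresses and port are made up, and a real list would have to track the service registry as services come and go:

```shell
# Sketch of option (4): one explicit rule per service VIP.
# VIPS/PORT are illustrative values.
PORT=15001
VIPS="10.0.0.10 10.0.0.11 10.0.0.12"
per_service_rules() {
  for vip in $VIPS; do
    echo "iptables -t nat -A OUTPUT -p tcp -d $vip/32 -j REDIRECT --to-ports $PORT"
  done
}
per_service_rules
```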


rshriram commented on July 22, 2024


kyessenov commented on July 22, 2024

Excellent analysis!
I agree with @rshriram that we should not grant additional capabilities to the proxy.
I was hoping that the cgroup method would work better, given that the proxy and the app are already isolated into containers, but it seems we still need privileges to operate at the node level. Maybe we should explore this option for the per-node model.

When I was looking at the "PID owner" method in the netfilter/iptables documentation, I remember seeing a warning about reliability of recovering PIDs from packets. I should try to find it again, but the documentation is scarce.

There are three items we should do to improve our current IP tables method:

  • skip loopback app-to-app packet capture (with the elimination clause)
  • treat the kube-system namespace in a special way. The cluster addon namespace holds things like fluentd and ingress controllers, and perhaps the manager itself.
  • investigate having a separate envoy listener for explicit envoy calls


ayj commented on July 22, 2024

When I was looking at the "PID owner" method in the netfilter/iptables documentation,
I remember seeing a warning about reliability of recovering PIDs from packets.

Do you mean UID owner? As far as I can tell, only the --uid-owner and --gid-owner options are supported by iptables now. It looks like --pid-owner may have been removed due to its racy nature with process restarts.

treat kube-system namespace in a special way. Cluster addon namespace
holds things like fluentd and ingress controllers, and perhaps, manager itself.

I'm interpreting this to mean we need to bypass the pod-level proxy for outbound traffic destined for the kube-system namespace? If so, I believe we would need to individually opt out (via iptables) of each destination address in kube-system, e.g. watch the API for pods, services, etc. in kube-system and add/remove elimination rules.
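A sketch of what those elimination rules might look like; the IPs here are made up, and a real agent would watch the Kubernetes API and keep the list in sync:

```shell
# Sketch: one RETURN ("elimination") rule per kube-system destination IP.
# KUBE_SYSTEM_IPS is illustrative; a real agent would watch the API for
# pods/services in kube-system and add/remove rules as they change.
KUBE_SYSTEM_IPS="10.0.0.1 10.0.0.53"
elimination_rules() {
  for ip in $KUBE_SYSTEM_IPS; do
    echo "iptables -t nat -A OUTPUT -p tcp -d $ip/32 -j RETURN"
  done
}
elimination_rules
```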

investigate having a separate envoy listener for explicit envoy calls

This should be doable now with envoyproxy/envoy/pull/377 and the app-to-app loopback clause noted above. Let me verify that this works with the current iptables recipe.


ayj commented on July 22, 2024

Update on option (3). Fortunately, it looks like bind-mounting /sys/fs/cgroup/net_cls/ into the proxy container works instead of remounting, and each pod gets its own view of the cgroups. Unfortunately, this still requires privileges (the pod fails to start otherwise), and the iptables "-m cgroup ! --cgroup" rule doesn't seem to work in the end anyway. I didn't find much on this specific use case, and what I did find seemed to recommend treating cgroups as read-only from within the container/pod.


ayj commented on July 22, 2024

The long-term solution for this problem is being tracked in #57. The short-term solution (and the alternatives noted above) are now documented in #78.

