
Comments (9)

kyessenov commented on July 22, 2024

For the client-side mode, we need to trap packets originating in the pod that are not sent by the proxy. Using "owner" has some limitations: https://www.frozentux.net/iptables-tutorial/iptables-tutorial.html#OWNERMATCH
Since "owner" is not 100% robust, we may still end up trapping packets that originate from envoy itself (e.g. ICMP packets or anything else the match misses). We need to make sure we're not creating an infinite loop with this rule @enricoschiattarella.

from pilot.

kyessenov commented on July 22, 2024

We should stop capturing pod-local traffic with iptables. If one container wants to talk to another container in the pod, there is no need to insert envoy in the middle.

Moreover, we should provide a way for other sidecars to talk to Kubernetes without us getting in the way. Perhaps we could treat the kube-system namespace as special?


rshriram commented on July 22, 2024


ayj commented on July 22, 2024

A few options exist for redirecting client traffic to external services, noted below. Options (2) and (3) both require extra runtime privileges: option (2) requires CAP_NET_ADMIN in the proxy itself, whereas option (3) requires privileges to update net_cls. Furthermore, option (3) appears to expose node-level cgroup state to the container; I don't completely understand the implications, but it seems like a bad idea.

Note that each of the options can be amended with "! -s 127.0.0.1/32" to avoid redirecting local-to-local traffic to the proxy. Similarly, we could avoid proxy interception for specific destinations (e.g. k8s) with a similar destination filter.
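For example, a sketch of a REDIRECT rule carrying that exclusion clause (the port value is hypothetical):

```shell
# Sketch only: the "! -s 127.0.0.1/32" clause keeps local-to-local
# traffic out of the proxy. PORT is an illustrative redirect port.
PORT=15001
RULE="-A OUTPUT -p tcp ! -s 127.0.0.1/32 -j REDIRECT --to-ports $PORT"
# Emit rather than apply, so this can run unprivileged.
echo "iptables -t nat $RULE"
```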

Option 1 - UID

-A OUTPUT -p tcp -m owner ! --uid-owner $UID -j REDIRECT --to-ports $PORT

  • pro: No envoy change required
  • con: Requires coordinating the UID between the proxy and the init-container. Istio may not necessarily have control over the UID (e.g. it may be set by docker).
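A minimal init-container sketch of option (1); the UID and port values are illustrative, not the project's actual defaults:

```shell
# Sketch of option (1): let the proxy user's own traffic through,
# redirect everything else. PROXY_UID/PROXY_PORT are illustrative.
PROXY_UID=1337
PROXY_PORT=15001

# Emit the rules instead of applying them, so the sketch runs unprivileged.
uid_rules() {
  echo "iptables -t nat -A OUTPUT -p tcp -m owner --uid-owner $PROXY_UID -j RETURN"
  echo "iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-ports $PROXY_PORT"
}
uid_rules
```

The RETURN rule for the proxy's UID must precede the catch-all REDIRECT, otherwise the proxy's own outbound connections would loop back into it.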

Option 2 - SO_MARK

-A OUTPUT -p tcp -m mark ! --mark $MARK -j REDIRECT --to-ports $PORT

  • pro: The $MARK value can be configured in the pod spec (e.g. configmap, env var) and used both by the init-container to program iptables and by the proxy agent to create envoy config with the proper SO_MARK value per upstream cluster.
  • con: Requires adding SO_MARK support to envoy, perhaps configured per upstream cluster?
  • con: Requires the proxy to run with CAP_NET_ADMIN to set SO_MARK on upstream sockets.
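A sketch of the two halves of option (2); the mark and port values are hypothetical, and the setsockopt call shown in the comment is what drives the CAP_NET_ADMIN requirement:

```shell
# Sketch of option (2). MARK/PORT are illustrative values.
MARK=0x1
PORT=15001

# iptables half: packets without the proxy's mark get redirected.
MARK_RULE="iptables -t nat -A OUTPUT -p tcp -m mark ! --mark $MARK -j REDIRECT --to-ports $PORT"
echo "$MARK_RULE"

# Proxy half (not shell): every upstream socket must be marked, roughly
#   setsockopt(fd, SOL_SOCKET, SO_MARK, &mark, sizeof(mark));
# and per socket(7), setting SO_MARK is what needs CAP_NET_ADMIN.
```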

Option 3 - Network classifier cgroups (net_cls)

-A OUTPUT -p tcp -m cgroup ! --cgroup $GROUP -j REDIRECT --to-ports $PORT

  • pro: No envoy change required
  • con: Requires newer version of iptables (e.g. 1.6.0)
  • con: Requires re-mounting /sys/fs/cgroup/net_cls as read/write to create new net_cls groups. This requires additional privileges (CAP_NET_ADMIN?) and seems to expose all of the node's net_cls cgroups to the pod. Furthermore, changes to net_cls (e.g. adding a new group for the proxy) persist on the node across pod restarts.
  • con: The proxy agent needs to update the /sys/fs/cgroup/net_cls/<group>/tasks file with the envoy proxy PID whenever the proxy crashes, restarts, etc.
  • note: I wasn't able to get this method to work, but it's possible I mixed up the configuration someplace. I will try again tomorrow.
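For reference, a sketch of the intended net_cls setup (paths and classid are illustrative, and as the note above says, the match could not be made to fire in practice):

```shell
# Sketch of option (3). Commands are emitted rather than executed,
# since they need root and a writable net_cls hierarchy. Values are
# illustrative.
CLASSID=0x00100001
PORT=15001
netcls_setup() {
  echo "mkdir -p /sys/fs/cgroup/net_cls/proxy"
  echo "echo $CLASSID > /sys/fs/cgroup/net_cls/proxy/net_cls.classid"
  echo "echo \$PROXY_PID > /sys/fs/cgroup/net_cls/proxy/tasks"
  echo "iptables -t nat -A OUTPUT -p tcp -m cgroup ! --cgroup $CLASSID -j REDIRECT --to-ports $PORT"
}
netcls_setup
```

The tasks file is the piece that has to be rewritten whenever the proxy restarts, which is the last con listed above.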

Option 4 - Explicit iptable rule per-service

  • pro: very explicit.
  • con: lots of iptables rules needed to cover all client and server services
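A sketch of option (4), one REDIRECT rule per service VIP; the addresses and port are made up, and a real list would have to track the service registry as services come and go:

```shell
# Sketch of option (4): one explicit rule per service VIP.
# VIPS/PORT are illustrative values.
PORT=15001
VIPS="10.0.0.10 10.0.0.11 10.0.0.12"
per_service_rules() {
  for vip in $VIPS; do
    echo "iptables -t nat -A OUTPUT -p tcp -d $vip/32 -j REDIRECT --to-ports $PORT"
  done
}
per_service_rules
```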


rshriram commented on July 22, 2024


kyessenov commented on July 22, 2024

Excellent analysis!
I agree with @rshriram that we should not grant additional capabilities to the proxy.
I was hoping that the cgroup method would work better, given that the proxy and the app are already isolated into containers, but it seems we still need privileges to operate at the node level. Maybe we should explore this option for the per-node model.

When I was looking at the "PID owner" method in the netfilter/iptables documentation, I remember seeing a warning about reliability of recovering PIDs from packets. I should try to find it again, but the documentation is scarce.

There are three items we should do to improve our current IP tables method:

  • skip loopback app-to-app packet capture (with the elimination clause)
  • treat the kube-system namespace in a special way. The cluster addon namespace holds things like fluentd and ingress controllers, and perhaps the manager itself.
  • investigate having a separate envoy listener for explicit envoy calls


ayj commented on July 22, 2024

When I was looking at the "PID owner" method in the netfilter/iptables documentation,
I remember seeing a warning about reliability of recovering PIDs from packets.

Do you mean UID owner? As far as I can tell, only the --uid-owner and --gid-owner options are supported by iptables now. It looks like --pid-owner may have been removed due to its racy nature with process restarts.

treat kube-system namespace in a special way. Cluster addon namespace
holds things like fluentd and ingress controllers, and perhaps, manager itself.

I'm interpreting this to mean we need to bypass the pod-level proxy for outbound traffic destined for the kube-system namespace? If so, I believe we would need to individually opt out (via iptables) of each destination address in kube-system, e.g. watch the API for pods, services, etc. in kube-system and add/remove elimination rules.
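A sketch of what those elimination rules might look like; the IPs here are made up, and a real agent would watch the Kubernetes API and keep the list in sync:

```shell
# Sketch: one RETURN ("elimination") rule per kube-system destination IP.
# KUBE_SYSTEM_IPS is illustrative; a real agent would watch the API for
# pods/services in kube-system and add/remove rules as they change.
KUBE_SYSTEM_IPS="10.0.0.1 10.0.0.53"
elimination_rules() {
  for ip in $KUBE_SYSTEM_IPS; do
    echo "iptables -t nat -A OUTPUT -p tcp -d $ip/32 -j RETURN"
  done
}
elimination_rules
```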

investigate having a separate envoy listener for explicit envoy calls

This should be doable now with envoyproxy/envoy/pull/377 and the app-to-app loopback clause noted above. Let me verify that this works with the current iptables recipe.


ayj commented on July 22, 2024

Update on option (3). Fortunately, it looks like bind-mounting /sys/fs/cgroup/net_cls/ into the proxy container works instead of remounting, and each pod gets its own view of the cgroups. Unfortunately, this still requires privileges (the pod fails to start otherwise), and the iptables "-m cgroup ! --cgroup" rule doesn't seem to work in the end anyway. I didn't find much on this specific use case, and what I did find seemed to recommend treating cgroups as read-only from within the container/pod.


ayj commented on July 22, 2024

The long-term solution for this problem is being tracked in #57. The short-term solution (and the alternatives noted above) are now documented in #78.

