Comments (13)

sypakine commented on June 20, 2024

Understood.

I'm almost certain that https://sourcegraph.com/github.com/cilium/cilium/-/commit/c7d547f330cc497bad0d6a1d0f7f3785953d9a98?visible=1 will fix it, as this is where we are getting the drop. This is inferred from the missing LB4 slot lookup debug messages in the log output above.
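
For reference, the debug line in question is the "Service backend slot lookup: slot=..., dport=..." message that shows up in the healthy trace later in this thread. A quick way to check for it while reproducing (a sketch reusing the monitor invocation from below; it assumes datapath debug / DebugLB is enabled, and the agent pod name is a placeholder):

# with DebugLB enabled, check whether the datapath ever re-selects a backend
# slot for the affected flow -- the message never appears in the failing case
kubectl exec -n kube-system <cilium-agent-pod> -c cilium-agent -- \
  cilium monitor -Dv | grep "Service backend slot lookup"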

I'll continue attempting to reproduce the issue and confirm the hypothesis above.

julianwiedmann commented on June 20, 2024

👋 interesting find!

Client: 1.12.10 c0f319ad22 2024-02-06T18:46:58+00:00 go version go1.20.14 linux/amd64

Could we please discuss this in the context of a supported release? Ideally with a repro on latest main, as the relevant code has changed a bit recently. It probably makes sense to peek at

backend = lb4_lookup_backend(ctx, backend_id);

and see how it matches your theory / observed symptoms.

sypakine commented on June 20, 2024

I managed to reproduce the issue on 1.15.1 -- the key was to use an LB service with externalTrafficPolicy: Local.

REPO_ROOT=$PWD
KUBEPROXY_MODE="none" make kind && \
make kind-image

helm upgrade -i cilium ./install/kubernetes/cilium \
  --wait \
  --namespace kube-system \
  --set k8sServiceHost="kind-control-plane" \
  --set k8sServicePort="6443" \
  --set debug.enabled=true \
  --set pprof.enabled=true \
  --set enableIPv4Masquerade=false \
  --set enableIPv6Masquerade=false \
  --set hostFirewall.enabled=false \
  --set socketLB.hostNamespaceOnly=true \
  --set kubeProxyReplacement=true \
  --set nodeinit.enabled=true \
  --set ipam.mode=kubernetes \
  --set ipv4.enabled=true \
  --set ipv4NativeRoutingCIDR=10.244.0.0/16 \
  --set ipv6.enabled=false \
  --set image.override="localhost:5000/cilium/cilium-dev:local" \
  --set image.pullPolicy=Never \
  --set operator.image.override="localhost:5000/cilium/operator-generic:local" \
  --set operator.image.pullPolicy=Never \
  --set operator.image.suffix="" \
  --set securityContext.privileged=true \
  --set gatewayAPI.enabled=false \
  --set socketLB.enabled=false \
  --set bpf.hostLegacyRouting=true \
  --set endpointRoutes.enabled=true \
  --set localRedirectPolicy=true

kind export kubeconfig --name kind
alias k=kubectl

k apply -f - <<EOF
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "default"
spec:
  blocks:
  - cidr: "172.19.250.0/24"
EOF

k apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: server
spec:
  selector:
    app: server
  ports:
  - port: 80
    name: http
  externalTrafficPolicy: Local
  type: LoadBalancer
EOF


time while ! [[ "$( kubectl get svc server -ojsonpath='{.status.loadBalancer.ingress[].ip}' )" =~ ^172\..* ]]; do echo -n "."; sleep 1; done

k label node kind-worker app=client

k apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: client
  labels:
    app: client
spec:
  nodeSelector:
    app: client
  containers:
  - name: client
    image: nginx:latest
    command:
    - sleep
    args:
    - "999999"
EOF

k apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
  labels:
    app: server
spec:
  replicas: 20
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: client
            topologyKey: kubernetes.io/hostname
      containers:
      - name: server
        image: nginx
        ports:
        - containerPort: 80
EOF

k rollout status deployment server

host=$( kubectl get pod client \
    -ojsonpath='{ .spec.nodeName }' )
agent=$( kubectl get pod -n kube-system \
    -l k8s-app=cilium \
    --field-selector=spec.nodeName=${host} \
    -ojsonpath='{.items[].metadata.name}' )

# block FINs from leaving the pod
docker exec -it kind-worker bash

# within the kind node
container=$( crictl ps | grep $(crictl pods | grep client | awk '{ print $1 }') | awk '{ print $1 }' )
pid=$( crictl inspect ${container} | grep pid | head -1 | grep -Eo "[0-9]+" )
nsenter -t ${pid} -n bash

# within the container namespace
iptables -I OUTPUT --protocol tcp --destination-port 80 --tcp-flags FIN FIN -j DROP
exit  # container namespace

exit  # kind node

# generate traffic / ct entries
lbip=$(kubectl get svc server -ojsonpath='{.status.loadBalancer.ingress[].ip}')
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80

# scale down deployment so we get a service backend lookup miss
kubectl patch deployment server --patch '
spec:
  replicas: 1
'

k exec -it -n kube-system ${agent} -c cilium-agent -- bash
cilium config Debug=Enabled
cilium config DebugLB=Enabled
cilium config MonitorAggregationLevel=None
watch -n2 "cilium bpf ct list global -Dd | grep 172.19.250.1 | grep -v 'remaining: -' | grep -v TxClosing"

Example CT entries -- we'll reuse 10.244.1.188:60590 -> 172.19.250.1:80

TCP SVC 10.244.1.188:60590 -> 172.19.250.1:80 expires=2009248 (remaining: 7917 sec(s)) RxPackets=0 RxBytes=10 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1a LastTxReport=2001248 Flags=0x0010 [ SeenNonSyn ] RevNAT=11 SourceSecurityID=0 IfIndex=0
TCP SVC 10.244.1.188:35704 -> 172.19.250.1:80 expires=2009328 (remaining: 7997 sec(s)) RxPackets=0 RxBytes=13 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1a LastTxReport=2001328 Flags=0x0010 [ SeenNonSyn ] RevNAT=11 SourceSecurityID=0 IfIndex=0

In another terminal, issue a curl from the client pod using the source port from one of the CT entries above (e.g. 60590):

NOTE: you may need to wait for the port binding to be released before the port can be reused.
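
One way to wait for that (a sketch, assuming the ss utility from iproute2 is available in the client image):

# poll until no socket in the client pod still holds local port 60590
k exec -it client -- sh -c 'while ss -tan | grep -q ":60590 "; do sleep 1; done'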

k exec -it client -- curl --local-port 60590 172.19.250.1:80
curl: (7) Failed to connect to 172.19.250.1 port 80 after 0 ms: Couldn't connect to server
k exec -it client -- curl --local-port 55555 172.19.250.1:80
<!DOCTYPE html>
...

Debug output for that flow with CT collision:

k exec -it -n kube-system ${agent} -c cilium-agent -- cilium monitor -Dv --related-to $( k get cep client -ojsonpath='{ .status.id }' )
...
<- endpoint 3046 flow 0xb4cc68f6 , identity 57167->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.1.188:60590 -> 172.19.250.1:80 tcp SYN
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Conntrack lookup 1/2: src=10.244.1.188:60590 dst=172.19.250.1:80
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4 dir=2 scope=1
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: CT entry found lifetime=2009568, revnat=11
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: CT verdict: Reply, revnat=11
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Backend service lookup failed: backend_id=10
-> lxc9cc27971afd0: 172.19.250.1 -> 10.244.1.188 DestinationUnreachable(Port)
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Inheriting identity=2 from stack
<- stack flow 0xb4cc68f6 , identity world->unknown state unknown ifindex lxc9cc27971afd0 orig-ip 0.0.0.0: 172.19.250.1 -> 10.244.1.188 DestinationUnreachable(Port)
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Conntrack lookup 1/2: src=172.19.250.1:0 dst=10.244.1.188:0
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=2 dir=1 scope=2
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: CT entry found lifetime=2001628, revnat=0
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: CT verdict: Established, revnat=0
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Successfully mapped addr=172.19.250.1 to identity=2
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Attempting local delivery for container id 3046 from seclabel 57167
-> endpoint 3046 flow 0xb4cc68f6 , identity world->57167 state established ifindex 0 orig-ip 172.19.250.1: 172.19.250.1 -> 10.244.1.188 DestinationUnreachable(Port)

sypakine commented on June 20, 2024

This is not reproducible on main (version below) using my reproduction steps; toggling the interface state clears the conntrack entries. I'll have to cherry-pick the potential fix to 1.15 and see if I can still reproduce.

  • Daemon: 1.16.0-dev 57db22b202 2024-05-02T13:30:32+01:00 go version go1.22.3 linux/amd64

sypakine commented on June 20, 2024

Confirmed that the recent PR fixes the issue:

 k exec -it -n kube-system ${agent} -c cilium-agent -- cilium monitor -Dv --related-to $( k get cep client -ojsonpath='{ .status.id }' )
Listening for events on 48 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit
level=info msg="Initializing dissection cache..." subsys=monitor
<- endpoint 71 flow 0xfd382215 , identity 30992->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.1.220:37064 -> 172.19.250.1:80 tcp SYN
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack lookup 1/2: src=10.244.1.220:37064 dst=172.19.250.1:80
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4 dir=2 scope=1
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: CT entry found lifetime=71691, revnat=7
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: CT verdict: Reply, revnat=7
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Backend service lookup failed: backend_id=22
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Service backend slot lookup: slot=1, dport=80
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack lookup 1/2: src=10.244.1.220:37064 dst=10.244.0.78:80
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=1 dir=0 scope=2
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: CT verdict: New, revnat=0
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Successfully mapped addr=10.244.0.78 to identity=44471
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack create: proxy-port=0 revnat=7 src-identity=30992 lb=0.0.0.0
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Encapsulating to node 2886926339 (0xac130003) from seclabel 30992
-> overlay flow 0xfd382215 , identity 30992->44471 state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.244.1.220:37064 -> 10.244.0.78:80 tcp SYN
CPU 18: MARK 0x2035a252 FROM 71 DEBUG: Inheriting identity=44471 from stack
<- stack flow 0x2035a252 , identity 44471->unknown state unknown ifindex cilium_vxlan orig-ip 0.0.0.0: 10.244.0.78:80 -> 10.244.1.220:37064 tcp SYN, ACK

Can we consider backporting that change?

julianwiedmann commented on June 20, 2024

Can we consider backporting that change?

I remember cleaning up a few things beforehand to make that PR possible. Let me have a look; I wasn't expecting to fix an actual bug 😲. Do you have an idea what's going wrong exactly / why the patch is helping?

sypakine commented on June 20, 2024

Because the external traffic policy is Local, the service is programmed with two lookup scopes:

root@kind-worker:/home/cilium# cilium bpf lb list
SERVICE ADDRESS      BACKEND ADDRESS (REVNAT_ID) (SLOT)
172.19.250.1:80/i    10.244.0.186:80 (4) (1)
                     0.0.0.0:0 (4) (0) [LoadBalancer, Local, two-scopes]

this part (the external-scope lookup) returns NULL:

	key->scope = LB_LOOKUP_SCOPE_EXT;
	key->backend_slot = 0;
	svc = map_lookup_elem(&LB4_SERVICES_MAP_V2, key);
	if (svc) {
		...
	}

	return NULL;

https://sourcegraph.com/github.com/cilium/cilium@a368c8f0f34dfe9a47e8a621af31ea94337f6fb5/-/blob/bpf/lib/lb.h?L1261:16-1261:35

So because of that, we do not go on to look up a new backend:

		backend = lb4_lookup_backend(ctx, backend_id);
		if (unlikely(!backend || backend->flags != BE_STATE_ACTIVE)) {
			/* Drain existing connections, but redirect new ones to only
			 * active backends.
			 */
			if (backend && !state->syn)
				break;

			svc = lb4_lookup_service(key, false, true);
			if (!svc)
				goto no_service;

https://sourcegraph.com/github.com/cilium/cilium@v1.15.1/-/blob/bpf/lib/lb.h?L1626-1636

The new logic is much more straightforward:

  1. look up the backend from the CT entry
    • backend not found
  2. are there service backends?
    • yes
  3. look up a new backend

https://sourcegraph.com/github.com/cilium/cilium@main/-/blob/bpf/lib/lb.h?L1567-1578

julianwiedmann commented on June 20, 2024

Oh, ok - I think I see what you're hitting. Missed that the SVC access is coming from bpf_lxc.

So the initial SVC lookup uses scope_switch = is_defined(ENABLE_NODEPORT), while the fallback path uses scope_switch = false. Is that the problem?! If you temporarily fix that scope_switch parameter on v1.15, does that also solve the issue?

sypakine commented on June 20, 2024

Changing it to the below does fix the issue in my reproduction scenario:

			svc = lb4_lookup_service(key, is_defined(ENABLE_NODEPORT), true);
			if (!svc)
				goto no_service;

I'm not sure of the full implications of this / how safe it is.
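
For anyone who wants to try a change like this in the kind setup above, one approach (a sketch, not necessarily the exact workflow used here) is to rebuild the local dev image and restart the agents:

# rebuild the dev image containing the patched bpf/lib/lb.h and roll the cilium agents
make kind-image
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n kube-system rollout status daemonset/cilium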

julianwiedmann commented on June 20, 2024

Sweet, thank you! Could you send a PR for v1.15 with that fix (adding a scope_switch parameter to lb*_local())? That seems like a much cleaner, well-isolated fix, and it's backportable further back.

sypakine commented on June 20, 2024

Thanks @julianwiedmann -- done.

I don't see any BPF unit tests for this function, so let me know if you think it prudent to create a test for this change.

julianwiedmann commented on June 20, 2024

I don't see any BPF unit tests for this function, so let me know if you think it prudent to create a test for this change.

Good question! I think for your specific test we should be alright; the bugfix is obvious enough. But having more test coverage for eTP=local scenarios would indeed be good. I'm just not sure whether BPF unit tests are the right scope for that - the map programming is complicated enough that we could easily drift from how the agent sets up the two service entries.

I'd have a bit more confidence into e2e tests for the different eTP=local scenarios (external / in-cluster access, node with/without local backend, selected backend gets terminated, ...).

julianwiedmann commented on June 20, 2024

Closing via #32608.
