Comments (13)
Understood.
I'm almost certain that https://sourcegraph.com/github.com/cilium/cilium/-/commit/c7d547f330cc497bad0d6a1d0f7f3785953d9a98?visible=1 will fix it, as this is where we are getting the drop. This is inferred from the missing LB4 slot lookup debug messages in the log output above.
I'll continue attempting to reproduce the issue and confirm the hypothesis above.
👋 interesting find!
> Client: 1.12.10 c0f319ad22 2024-02-06T18:46:58+00:00 go version go1.20.14 linux/amd64
Could we please discuss this in the context of a supported release? Ideally with a repro on latest main -- the relevant code has changed a bit recently. It probably makes sense to peek at bpf/lib/lb.h, line 1567 at f211c17.
I managed to reproduce the issue on 1.15.1 -- the key was to use an LB service with externalTrafficPolicy: Local.
REPO_ROOT=$PWD
KUBEPROXY_MODE="none" make kind && \
make kind-image
helm upgrade -i cilium ./install/kubernetes/cilium \
--wait \
--namespace kube-system \
--set k8sServiceHost="kind-control-plane" \
--set k8sServicePort="6443" \
--set debug.enabled=true \
--set pprof.enabled=true \
--set enableIPv4Masquerade=false \
--set enableIPv6Masquerade=false \
--set hostFirewall.enabled=false \
--set socketLB.hostNamespaceOnly=true \
--set kubeProxyReplacement=true \
--set nodeinit.enabled=true \
--set ipam.mode=kubernetes \
--set ipv4.enabled=true \
--set ipv4NativeRoutingCIDR=10.244.0.0/16 \
--set ipv6.enabled=false \
--set image.override="localhost:5000/cilium/cilium-dev:local" \
--set image.pullPolicy=Never \
--set operator.image.override="localhost:5000/cilium/operator-generic:local" \
--set operator.image.pullPolicy=Never \
--set operator.image.suffix="" \
--set securityContext.privileged=true \
--set gatewayAPI.enabled=false \
--set socketLB.enabled=false \
--set bpf.hostLegacyRouting=true \
--set endpointRoutes.enabled=true \
--set localRedirectPolicy=true
kind export kubeconfig --name kind
alias k=kubectl
k apply -f - <<EOF
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
name: "default"
spec:
blocks:
- cidr: "172.19.250.0/24"
EOF
k apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: server
spec:
  selector:
    app: server
  ports:
  - port: 80
    name: http
  externalTrafficPolicy: Local
  type: LoadBalancer
EOF
time while ! [[ "$( kubectl get svc server -ojsonpath='{.status.loadBalancer.ingress[].ip}' )" =~ ^172\..* ]]; do echo -n "."; sleep 1; done
k label node kind-worker app=client
k apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: client
  labels:
    app: client
spec:
  nodeSelector:
    app: client
  containers:
  - name: client
    image: nginx:latest
    command:
    - sleep
    args:
    - "999999"
EOF
k apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
  labels:
    app: server
spec:
  replicas: 20
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: client
            topologyKey: kubernetes.io/hostname
      containers:
      - name: server
        image: nginx
        ports:
        - containerPort: 80
EOF
k rollout status deployment server
host=$( kubectl get pod client \
-ojsonpath='{ .spec.nodeName }' )
agent=$( kubectl get pod -n kube-system \
-l k8s-app=cilium \
--field-selector=spec.nodeName=${host} \
-ojsonpath='{.items[].metadata.name}' )
# block FINs from leaving the pod
docker exec -it kind-worker bash
# within the kind node
container=$( crictl ps | grep $(crictl pods | grep client | awk '{ print $1 }') | awk '{ print $1 }' )
pid=$( crictl inspect ${container} | grep pid | head -1 | grep -Eo "[0-9]+" )
nsenter -t ${pid} -n bash
# within the container namespace
iptables -I OUTPUT --protocol tcp --destination-port 80 --tcp-flags FIN FIN -j DROP
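# (dropping outbound FINs keeps the client's SVC conntrack entries open -- they
# never go TxClosing -- so they outlive the backend scale-down and can collide
# with a later connection that reuses the same source port)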
exit # container namespace
exit # kind node
# generate traffic / ct entries
lbip=$(kubectl get svc server -ojsonpath='{.status.loadBalancer.ingress[].ip}')
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
k exec -it client -- curl -s -o /dev/null ${lbip}:80
# scale down deployment so we get a service backend lookup miss
kubectl patch deployment server --patch '
spec:
  replicas: 1
'
k exec -it -n kube-system ${agent} -c cilium-agent -- bash
cilium config Debug=Enabled
cilium config DebugLB=Enabled
cilium config MonitorAggregationLevel=None
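# show only still-open SVC entries towards the LB VIP (skip expired entries and ones already marked TxClosing)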
watch -n2 "cilium bpf ct list global -Dd | grep 172.19.250.1 | grep -v 'remaining: -' | grep -v TxClosing"
Example CT entries -- we'll reuse 10.244.1.188:60590 -> 172.19.250.1:80
TCP SVC 10.244.1.188:60590 -> 172.19.250.1:80 expires=2009248 (remaining: 7917 sec(s)) RxPackets=0 RxBytes=10 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1a LastTxReport=2001248 Flags=0x0010 [ SeenNonSyn ] RevNAT=11 SourceSecurityID=0 IfIndex=0
TCP SVC 10.244.1.188:35704 -> 172.19.250.1:80 expires=2009328 (remaining: 7997 sec(s)) RxPackets=0 RxBytes=13 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1a LastTxReport=2001328 Flags=0x0010 [ SeenNonSyn ] RevNAT=11 SourceSecurityID=0 IfIndex=0
In another terminal, issue a curl from the client pod, reusing a source port from the CT entries above (e.g. 60590)
NOTE: may need to wait for the port binding to be released to reuse the port
k exec -it client -- curl --local-port 60590 172.19.250.1:80
curl: (7) Failed to connect to 172.19.250.1 port 80 after 0 ms: Couldn't connect to server
k exec -it client -- curl --local-port 55555 172.19.250.1:80
<!DOCTYPE html>
...
Debug output for that flow with CT collision:
k exec -it -n kube-system ${agent} -c cilium-agent -- cilium monitor -Dv --related-to $( k get cep client -ojsonpath='{ .status.id }' )
...
<- endpoint 3046 flow 0xb4cc68f6 , identity 57167->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.1.188:60590 -> 172.19.250.1:80 tcp SYN
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Conntrack lookup 1/2: src=10.244.1.188:60590 dst=172.19.250.1:80
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4 dir=2 scope=1
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: CT entry found lifetime=2009568, revnat=11
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: CT verdict: Reply, revnat=11
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Backend service lookup failed: backend_id=10
-> lxc9cc27971afd0: 172.19.250.1 -> 10.244.1.188 DestinationUnreachable(Port)
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Inheriting identity=2 from stack
<- stack flow 0xb4cc68f6 , identity world->unknown state unknown ifindex lxc9cc27971afd0 orig-ip 0.0.0.0: 172.19.250.1 -> 10.244.1.188 DestinationUnreachable(Port)
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Conntrack lookup 1/2: src=172.19.250.1:0 dst=10.244.1.188:0
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=2 dir=1 scope=2
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: CT entry found lifetime=2001628, revnat=0
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: CT verdict: Established, revnat=0
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Successfully mapped addr=172.19.250.1 to identity=2
CPU 28: MARK 0xb4cc68f6 FROM 3046 DEBUG: Attempting local delivery for container id 3046 from seclabel 57167
-> endpoint 3046 flow 0xb4cc68f6 , identity world->57167 state established ifindex 0 orig-ip 172.19.250.1: 172.19.250.1 -> 10.244.1.188 DestinationUnreachable(Port)
This is not reproducible on main (version below) using my reproduction steps; toggling the interface state clears the conntrack entries. I'll have to cherry-pick the potential fix onto 1.15 and see whether I can still reproduce it there.
Daemon: 1.16.0-dev 57db22b202 2024-05-02T13:30:32+01:00 go version go1.22.3 linux/amd64
Confirmed that recent PR fixes the issue:
k exec -it -n kube-system ${agent} -c cilium-agent -- cilium monitor -Dv --related-to $( k get cep client -ojsonpath='{ .status.id }' )
Listening for events on 48 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit
level=info msg="Initializing dissection cache..." subsys=monitor
<- endpoint 71 flow 0xfd382215 , identity 30992->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.244.1.220:37064 -> 172.19.250.1:80 tcp SYN
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack lookup 1/2: src=10.244.1.220:37064 dst=172.19.250.1:80
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4 dir=2 scope=1
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: CT entry found lifetime=71691, revnat=7
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: CT verdict: Reply, revnat=7
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Backend service lookup failed: backend_id=22
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Service backend slot lookup: slot=1, dport=80
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack lookup 1/2: src=10.244.1.220:37064 dst=10.244.0.78:80
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=1 dir=0 scope=2
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: CT verdict: New, revnat=0
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Successfully mapped addr=10.244.0.78 to identity=44471
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Conntrack create: proxy-port=0 revnat=7 src-identity=30992 lb=0.0.0.0
CPU 18: MARK 0xfd382215 FROM 71 DEBUG: Encapsulating to node 2886926339 (0xac130003) from seclabel 30992
-> overlay flow 0xfd382215 , identity 30992->44471 state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.244.1.220:37064 -> 10.244.0.78:80 tcp SYN
CPU 18: MARK 0x2035a252 FROM 71 DEBUG: Inheriting identity=44471 from stack
<- stack flow 0x2035a252 , identity 44471->unknown state unknown ifindex cilium_vxlan orig-ip 0.0.0.0: 10.244.0.78:80 -> 10.244.1.220:37064 tcp SYN, ACK
Can we consider backporting that change?
> Can we consider backporting that change?
I remember cleaning up a few things beforehand, to make that PR possible. Let me have a look, I wasn't expecting to fix an actual bug 😲. Do you have an idea what's going wrong exactly / why the patch is helping?
Because the external traffic policy is Local, the service is programmed with two lookup scopes:
root@kind-worker:/home/cilium# cilium bpf lb list
SERVICE ADDRESS BACKEND ADDRESS (REVNAT_ID) (SLOT)
172.19.250.1:80/i 10.244.0.186:80 (4) (1)
0.0.0.0:0 (4) (0) [LoadBalancer, Local, two-scopes]
this part is returning null:
key->scope = LB_LOOKUP_SCOPE_EXT;
key->backend_slot = 0;
svc = map_lookup_elem(&LB4_SERVICES_MAP_V2, key);
if (svc) {
...
}
return NULL;
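For reference, here is roughly what the elided part of that lookup does on v1.15 (an abridged sketch from memory, not a verbatim copy; helper and field names such as lb4_svc_is_two_scopes and the svc->count check are paraphrased and may not match the source exactly). With scope_switch=false it never retries in the internal scope, and with eTP=Local the external-scope lookup finds no usable backend on this node, so the whole call returns NULL:
static __always_inline struct lb4_service *
lb4_lookup_service(struct lb4_key *key, const bool scope_switch,
		   const bool check_svc_backends)
{
	struct lb4_service *svc;

	/* 1) always try the external scope first */
	key->scope = LB_LOOKUP_SCOPE_EXT;
	key->backend_slot = 0;
	svc = map_lookup_elem(&LB4_SERVICES_MAP_V2, key);
	if (svc) {
		/* scope_switch == false: stop here. With eTP=Local the
		 * external entry has no backend for this node, so the
		 * backend-count check makes this return NULL. */
		if (!scope_switch || !lb4_svc_is_two_scopes(svc))
			return (svc->count || !check_svc_backends) ? svc : NULL;
		/* 2) scope_switch == true: retry in the internal scope,
		 * which still carries the cluster-wide backends. */
		key->scope = LB_LOOKUP_SCOPE_INT;
		svc = map_lookup_elem(&LB4_SERVICES_MAP_V2, key);
		if (svc && (svc->count || !check_svc_backends))
			return svc;
	}

	return NULL;
}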
So because of that we do not try to look for a new backend:
backend = lb4_lookup_backend(ctx, backend_id);
if (unlikely(!backend || backend->flags != BE_STATE_ACTIVE)) {
	/* Drain existing connections, but redirect new ones to only
	 * active backends.
	 */
	if (backend && !state->syn)
		break;
	svc = lb4_lookup_service(key, false, true);
	if (!svc)
		goto no_service;
https://sourcegraph.com/github.com/cilium/cilium@v1.15.1/-/blob/bpf/lib/lb.h?L1626-1636
The new logic is much more straightforward:
- look up the backend recorded in the CT entry
- if that backend is not found, check whether the service still has backends
- if it does, look up a new backend
https://sourcegraph.com/github.com/cilium/cilium@main/-/blob/bpf/lib/lb.h?L1567-1578
Oh, ok - I think I see what you're hitting. Missed that the SVC access is coming from bpf_lxc.
So the initial SVC lookup uses scope_switch = is_defined(ENABLE_NODEPORT), while the fallback path uses scope_switch = false. Is that the problem?! If you temporarily fix that scope_switch parameter on v1.15, does that also solve the issue?
Changing to the below does fix the issue based on my reproduction scenario:
svc = lb4_lookup_service(key, is_defined(ENABLE_NODEPORT), true);
if (!svc)
	goto no_service;
I'm not sure of the full implications of this / how safe it is.
Sweet, thank you! Could you send a PR for v1.15 with that fix (adding a scope_switch parameter to lb*_local())? That seems like a much cleaner, well-isolated fix. And it's backportable further back.
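Roughly this shape, I'd imagine (a hypothetical sketch to illustrate the idea, not the actual patch; the new parameter name just mirrors the one on lb4_lookup_service()):
/* hypothetical sketch, not the actual patch: lb*_local() gains a
 * scope_switch parameter, and the stale-backend fallback uses it instead
 * of a hard-coded false. bpf_lxc would then pass is_defined(ENABLE_NODEPORT),
 * matching its initial service lookup. */
-	svc = lb4_lookup_service(key, false, true);
+	svc = lb4_lookup_service(key, scope_switch, true);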
Thanks @julianwiedmann -- done.
I don't see any BPF unit tests for this function, so let me know if you think it prudent to create a test for this change.
> I don't see any BPF unit tests for this function, so let me know if you think it prudent to create a test for this change.
Good question! I think for your specific test we should be alright, the bugfix is obvious enough. But having more test coverage for eTP=local scenarios would be good indeed. I'm just not sure whether BPF unit tests are the right scope for that - the map programming is complicated enough that we could easily drift from how the agent sets up the two service entries.
I'd have a bit more confidence in e2e tests for the different eTP=local scenarios (external / in-cluster access, node with/without local backend, selected backend gets terminated, ...).
Closing via #32608.