
Comments (5)

npdgm commented on July 20, 2024

Reproduction steps on the development environment:

  1. Spin up the kind cluster
    [ make kind-down ]
    make kind
    make kind-image
    make kind-install-cilium
  2. Check statedb for eth0 addresses
    for pod in $(kubectl -n kube-system get po -o NAME -l app.kubernetes.io/name=cilium-agent)
    do
       echo -e "\n# $pod"
       kubectl -n kube-system exec -c cilium-agent $pod -- cilium-dbg statedb node-addresses | grep eth0
    done
    Example:
    # pod/cilium-dkrc6
    172.25.0.3                  true       true      eth0
    fc00:c111::3                true       true      eth0
    
    # pod/cilium-pdjc7
    fc00:c111::2                true       true      eth0
    172.25.0.2                  true       true      eth0
    
  3. Add another address to the worker, choosing a different subnet because we want it to be ordered before the node address (172.25.0.1 is not available).
    ❯ docker exec kind-worker ip -br a s dev eth0
    eth0@if162       UP             172.25.0.2/16 fc00:c111::2/64 fe80::42:acff:fe19:2/64 
    
    ❯ docker exec kind-worker ip a a 172.16.0.123/16 dev eth0
    
    ❯ docker exec kind-worker ip -br a s dev eth0
    eth0@if162       UP             172.25.0.2/16 172.16.0.123/16 fc00:c111::2/64 fe80::42:acff:fe19:2/64 
    
  4. Check the statedb again
    Example:
    # pod/cilium-dkrc6
    172.25.0.3                  true       true      eth0
    fc00:c111::3                true       true      eth0
    
    # pod/cilium-pdjc7
    fc00:c111::2                true       true      eth0
    172.16.0.123                true       true      eth0
    
    172.25.0.2 is missing!
  5. Delete 172.16.0.123 and replace it with 172.25.0.123, in the same subnet as the original node IPs but ordered after the main node address.
    ❯ docker exec kind-worker ip a d 172.16.0.123/16 dev eth0
    ❯ docker exec kind-worker ip a a 172.25.0.123/16 dev eth0
    ❯ docker exec kind-worker ip -br a s dev eth0
    eth0@if162       UP             172.25.0.2/16 172.25.0.123/16 fc00:c111::2/64 fe80::42:acff:fe19:2/64 
    
  6. In this situation the statedb is correct
    # pod/cilium-dkrc6
    172.25.0.3                  true       true      eth0
    fc00:c111::3                true       true      eth0
    
    # pod/cilium-pdjc7
    fc00:c111::2                true       true      eth0
    172.25.0.2                  true       true      eth0
    172.25.0.123                false      false     eth0
    


npdgm commented on July 20, 2024

1.16.0-rc.0 is affected.
However, so far I could not reproduce the issue on 1.14.12, so it may have been introduced in the 1.15 branch.


joamaki commented on July 20, 2024

The issue here was that the set operations were performed on NodeAddress, which includes not only the IP address but also the "NodePort" flag. When a new address was added, the flag on an existing address changed, so that address was no longer considered part of the new set. The simplest fix for this is to swap the loops: first delete the existing addresses that are not part of the new set and then add the new ones in. I'll prepare a PR + backport to fix this.
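
To make the ordering concrete, here is a minimal Go sketch of the delete-then-add reconciliation described above. The NodeAddress struct and the map-backed table are simplified, hypothetical stand-ins, not Cilium's actual statedb types.

package nodeaddrs

import "net/netip"

// NodeAddress is a simplified stand-in for the statedb row: the address is
// the primary key, but the row also carries flags such as NodePort.
type NodeAddress struct {
	Addr     netip.Addr
	NodePort bool
	Primary  bool
	Device   string
}

// reconcile replaces the stored rows with the newly computed set. Deleting
// stale rows before inserting the new ones is the important part: with the
// old add-first ordering, a row whose NodePort flag changed was first
// rewritten and then deleted again, because the old (addr, old-flag) value
// is "not part of the new set" yet shares the same primary key (the addr).
func reconcile(table map[netip.Addr]NodeAddress, next []NodeAddress) {
	inNext := map[NodeAddress]bool{}
	for _, a := range next {
		inNext[a] = true
	}
	// 1. Delete existing rows that are no longer part of the new set.
	for _, old := range table {
		if !inNext[old] {
			delete(table, old.Addr)
		}
	}
	// 2. Insert the new rows (this also updates rows whose flags changed).
	for _, a := range next {
		table[a.Addr] = a
	}
}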

I'm still a bit puzzled by what impact this bug had. In v1.15 Cilium only supports one NodePort IP per device, and "enable-runtime-device-detection" is still disabled by default there (nothing happens when Table[NodeAddress] updates; only the very initial version matters). In v1.16, Cilium reacts to device/address changes by default and supports multiple NodePort IPs per device (with --nodeport-addresses), so this is definitely a release-blocking bug.
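
Continuing the hypothetical sketch above, the per-device difference could be pictured roughly like this; it is illustrative only, not Cilium's actual frontend selection logic.

// nodePortFrontends picks NodePort frontend IPs from the node-address rows.
// v1.15-like behaviour keeps only the first flagged address per device,
// while v1.16-like behaviour keeps every address flagged for NodePort.
func nodePortFrontends(addrs []NodeAddress, multiplePerDevice bool) []netip.Addr {
	var frontends []netip.Addr
	firstOnDevice := map[string]bool{}
	for _, a := range addrs {
		if !a.NodePort {
			continue // only rows flagged for NodePort are candidates
		}
		if !multiplePerDevice && firstOnDevice[a.Device] {
			continue // one NodePort IP per device, as in v1.15
		}
		firstOnDevice[a.Device] = true
		frontends = append(frontends, a.Addr)
	}
	return frontends
}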

A LoadBalancer service is essentially a NodePort + some static IPs. In your case was the VIP address specified as a LoadBalancer IP? If that was the case then my expectation would've been that the node addresses should not have mattered. (EDIT: hmm unless they clashed and you had --enable-runtime-device-detection enabled.. then the NodePort service might've overwritten the LoadBalancer service and it could've gotten removed afterwards...)

Could you dump the LoadBalancer service here? kubectl get svc/name-of-service -o yaml?
Also cilium-dbg service list would be useful (the relevant LoadBalancer and NodePort services). Were you accessing the service via the LoadBalancer "port" or via the "nodePort"?


npdgm commented on July 20, 2024

The simplest fix for this is to swap the loops: first delete the existing addresses that are not part of the new set and then add the new ones in. I'll prepare a PR + backport to fix this.

Great! Thanks a lot.

I'm still a bit puzzled by what impact this bug had.

Sorry, the report may have been confusing: it doesn't involve a LoadBalancer service in k8s. I'll explain.

Here's an example of the Helm values for an install where the issue was first discovered. With this same set of values, 1.14 is fine, but 1.15 and 1.16 exhibit the issue.
enableRuntimeDeviceDetection was never set and was left at its default, so yes, it was false with 1.15. But 1.16 with runtime detection had the exact same problem and behaviour regarding system VIPs.

autoDirectNodeRoutes: true
bgpControlPlane:
  enabled: true
bpf:
  masquerade: true
cgroup:
  autoMount:
    enabled: false
  hostRoot: /sys/fs/cgroup
endpointStatus:
  enabled: true
  status: policy health controllers log state
externalIPs:
  enabled: true
hubble:
  relay:
    enabled: true
  ui:
    enabled: true
ipam:
  mode: kubernetes
ipv4NativeRoutingCIDR: 10.244.0.0/16
k8sServiceHost: localhost
k8sServicePort: 7445
kubeProxyReplacement: true
loadBalancer:
  acceleration: native
  algorithm: maglev
  mode: hybrid
monitor:
  enabled: true
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
routingMode: native
securityContext:
  capabilities:
    ciliumAgent:
    - CHOWN
    - KILL
    - NET_ADMIN
    - NET_RAW
    - IPC_LOCK
    - SYS_ADMIN
    - SYS_RESOURCE
    - DAC_OVERRIDE
    - FOWNER
    - SETGID
    - SETUID
    cleanCiliumState:
    - NET_ADMIN
    - SYS_ADMIN
    - SYS_RESOURCE

A LoadBalancer service is essentially a NodePort + some static IPs. In your case was the VIP address specified as a LoadBalancer IP?

The VIP feature mentioned isn't managed by a k8s or LoadBalancer controller. An address is added to / removed from a node's main NIC by a system controller in Talos, just as ip addr add would do. It should be the same with keepalived, CARP, etc.
I reproduced it this way on the development stack in kind, with the default Cilium config and ip commands. The bug can already be observed in statedb and in various BPF maps, so there seems to be no need to dig into the Service bits.

An affected Service can be of the ClusterIP type; LoadBalancer is not required.
The problem actually happens when the selected Pods use hostNetwork, i.e. where the Endpoints are node addresses. Perhaps this is analogous to NodePorts in the datapath, I don't know.

The best example for this would be the default Kubernetes API service:

apiVersion: v1
kind: Service
metadata:
  labels:
    component: apiserver
    provider: kubernetes
  name: kubernetes
  namespace: default
spec:
  clusterIP: 10.16.0.1
  clusterIPs:
  - 10.16.0.1
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 6443
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    endpointslice.kubernetes.io/skip-mirror: "true"
  name: kubernetes
  namespace: default
subsets:
- addresses:
  - ip: 10.0.50.11
  - ip: 10.0.50.12
  - ip: 10.0.50.13
  ports:
  - name: https
    port: 6443
    protocol: TCP
---
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
- addresses:
  - 10.0.50.11
  conditions:
    ready: true
- addresses:
  - 10.0.50.12
  conditions:
    ready: true
- addresses:
  - 10.0.50.13
  conditions:
    ready: true
kind: EndpointSlice
metadata:
  labels:
    kubernetes.io/service-name: kubernetes
  name: kubernetes
  namespace: default
ports:
- name: https
  port: 6443
  protocol: TCP

In these dumps, endpoints in 10.0.50.0/24 are node addresses, not from the PodCIDR.

The service is accessed with the Service port 443.


joamaki commented on July 20, 2024

Hmm, Table[NodeAddress] is only used for 1) picking the NodePort/HostPort frontend IPs and 2) picking the BPF masquerade IP address. It should have no impact on ClusterIP services. It might be that there's another, unrelated bug involved here. I'll finish up the NodeAddress fix and will then try to reproduce this issue on my local Talos cluster.

EDIT: I forgot host IP syncing. It periodically marks the node addresses with the host identity (daemon/cmd/hostips-sync.go). This could explain it.
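
For intuition, here is a rough Go sketch of what such a periodic sync could look like; the package, interface and function names are hypothetical and not the actual daemon/cmd/hostips-sync.go code.

package hostsync

import (
	"net/netip"
	"time"
)

// IPCache is a stand-in for the component that assigns identities to IPs.
type IPCache interface {
	UpsertHostIP(addr netip.Addr) // mark addr with the "host" identity
	DeleteHostIP(addr netip.Addr) // drop the mapping again
}

// syncHostIPs periodically mirrors the node addresses into the IP cache.
// If an address briefly disappears from the node-address table (as in the
// bug above), its host-identity mapping is removed here, which would break
// ClusterIP services whose endpoints are node IPs (e.g. kube-apiserver).
func syncHostIPs(cache IPCache, nodeAddrs func() []netip.Addr, interval time.Duration, stop <-chan struct{}) {
	known := map[netip.Addr]bool{}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		current := map[netip.Addr]bool{}
		for _, addr := range nodeAddrs() {
			current[addr] = true
			cache.UpsertHostIP(addr)
		}
		// Unmark addresses that are no longer reported as node addresses.
		for addr := range known {
			if !current[addr] {
				cache.DeleteHostIP(addr)
			}
		}
		known = current
		select {
		case <-stop:
			return
		case <-ticker.C:
		}
	}
}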

Automation closed the PR, but I'll reopen this until backports for v1.16 and v1.15 are done.

@npdgm could you try to reproduce this with the latest build that includes the fix?

Digests at https://github.com/cilium/cilium/actions/runs/9938947454/job/27453160771
Instructions on how to install from a CI build: https://isovalent.com/blog/post/tutorial-tips-and-tricks-to-install-cilium/#h-the-ci-build-image-way

