Comments (8)

squat avatar squat commented on May 29, 2024

Hi @wangwill! Interesting! It's very curious that it only happens on the new node. I'd be interested to know what is different about the new node, e.g. are the kernel and OS versions the same?
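E.g. something quick like:

uname -r && cat /etc/os-release

run on both the new node and an old one should make any difference obvious.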

To be clear, are the logs appearing on the new node or on the old nodes now that the new node was added?

I see some references to this issue across GitHub, e.g. coreos/go-iptables#73 and containernetworking/plugins#461.

It's odd that this happens only to Kilo. Kilo interacts with iptables using the same mechanism that Kube-Proxy does (iptables-wrapper), so I would expect the same logs to appear in the Kube-Proxy containers. Can you check whether Kube-Proxy is also complaining about the same issue?
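For example, something like this should show whether kube-proxy is hitting the same error (a sketch; it assumes kube-proxy runs as pods labeled k8s-app=kube-proxy — on k3s it is embedded in the k3s process, so journalctl -u k3s would be the place to look instead):

kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -i nf_tables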

This issue seems to stem from the table in question being accessed by using the nft command before iptables. Do you maybe know if this is the case?
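One rough way to check (assuming the nft tool is installed on the host) is to dump the table and look for chains or expressions that iptables v1.8.7 cannot represent:

nft list table ip filter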

Finally, is this issue repeatable? I.e. does it happen for all new nodes, and does it persist after restarts?

wangwill avatar wangwill commented on May 29, 2024

Hi @squat,

  1. All nodes are running Ubuntu 22.04.2 LTS with kernel version 5.15.0-1038.
  2. Yes, the error logs are appearing in all Kilo pods, i.e. on the old nodes as well.
  3. k3s server logs:

E0710 04:53:53.588766 1343671 network_policy_controller.go:277] Aborting sync. Failed to cleanup stale iptables rules: unable to list chains: running [/usr/sbin/iptables -t filter -S --wait]: exit status 1: iptables v1.8.7 (nf_tables): table `filter' is incompatible, use 'nft' tool.

  4. I tried the iptables -L and iptables-nft -L commands.
    They display the errors above once the node has joined the cluster.
    iptables -L and iptables-nft -L work again after deleting the node and rebooting. (The other nodes in the cluster are still reporting the error.)
    After re-joining the node to the cluster, the same issue reappeared.

  5. Network traffic across the cluster is partly affected: existing nodes and pods are communicating correctly, while newly joined nodes are hitting readiness timeouts, failing to provision PVCs (Longhorn), and so on.

squat avatar squat commented on May 29, 2024

@wangwill thanks for the details. So indeed, it's not just a Kilo problem; it seems everyone using iptables-nft is affected, including the network policy controller and presumably also kube-proxy.

Was the cluster recently upgraded?

I suspect that the issue may have been around for a while but only became obvious when the new node was added. In other words, the network policy controller may have been failing to list rules ever since some event on the cluster affected nftables, but we only noticed it when Kilo failed to add the new node: whenever a new node joins, the other nodes need to update their iptables rules.

Can you look back into journald to check when the error was first logged by the k3s server?
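Something like this should surface the first occurrence (a sketch; on agent nodes the unit is typically k3s-agent rather than k3s):

journalctl -u k3s | grep -m 1 "is incompatible"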

wangwill avatar wangwill commented on May 29, 2024

@squat You are correct. This issue has been ongoing for a while. It is a new testing cluster and it hasn't been upgraded since the initial setup.

The error log can be traced back to 27 June 2023 after I applied:
kubectl apply -f https://raw.githubusercontent.com/squat/kilo/main/manifests/kube-router.yaml

But during this period, the 5-node cluster was running without any errors until today, when the major issue occurred.

This is the first time the error message popped up in the journal log:

Jun 27 15:58:32 node-a systemd[1]: run-containerd-runc-k8s.io-29813380f6485c48bdc001419f3b9004e6564b9955034975d65040e73c282875-runc.djDoNN.mount: Deactivated successfully.
Jun 27 15:58:37 node-a systemd[1]: run-containerd-runc-k8s.io-29813380f6485c48bdc001419f3b9004e6564b9955034975d65040e73c282875-runc.ajLgoa.mount: Deactivated successfully.
Jun 27 15:58:38 node-a systemd[1]: run-containerd-runc-k8s.io-d2d94380ec43c3c1124370c76ef1ecb4e5758abe1fa57da82197c322a0bc1c3b-runc.MlghJk.mount: Deactivated successfully.
Jun 27 15:59:10 node-a k3s[2061]: E0627 15:59:10.412783    2061 network_policy_controller.go:292] Failed to cleanup stale ipsets: failed to delete ipset KUBE-DST-RPURVQE4ODVUOI6S due to ipset v7.15: Set cannot be destroyed: it is in use by a kernel component
Jun 27 15:59:20 node-a k3s[2061]: W0627 15:59:20.564830    2061 machine.go:65] Cannot read vendor id correctly, set empty.
Jun 27 15:59:26 node-a systemd[1]: run-containerd-runc-k8s.io-abcc4a1abfebc12f51240fd2f857df76322c8d11d2aa2955a97f2d15aa3d9f21-runc.iFeJPp.mount: Deactivated successfully.
Jun 27 15:59:42 node-a systemd[1]: run-containerd-runc-k8s.io-29813380f6485c48bdc001419f3b9004e6564b9955034975d65040e73c282875-runc.HhplFH.mount: Deactivated successfully.
Jun 27 15:59:42 node-a systemd[1]: run-containerd-runc-k8s.io-29813380f6485c48bdc001419f3b9004e6564b9955034975d65040e73c282875-runc.knMnBf.mount: Deactivated successfully.
Jun 27 15:59:55 node-a k3s[2061]: I0627 15:59:55.408386    2061 topology_manager.go:205] "Topology Admit Handler"
Jun 27 15:59:55 node-a k3s[2061]: E0627 15:59:55.408609    2061 cpu_manager.go:394] "RemoveStaleState: removing container" podUID="bad077f1-13ab-4425-a277-69f8c8116f23" containerName="coredns"
Jun 27 15:59:55 node-a k3s[2061]: I0627 15:59:55.408647    2061 memory_manager.go:345] "RemoveStaleState removing state" podUID="bad077f1-13ab-4425-a277-69f8c8116f23" containerName="coredns"
Jun 27 15:59:55 node-a systemd[1]: Created slice libcontainer container kubepods-besteffort-pod811fccdc_9209_4bf7_b8d3_f76ff6a8b090.slice.
Jun 27 15:59:55 node-a k3s[2061]: I0627 15:59:55.570587    2061 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"xtables-lock\" (UniqueName: \"kubernetes.io/host-path/811fccdc-9209-4bf7-b8d3-f76ff6a8b090-xtables-lock\") pod \"kube-router-r7dnc\" (UID: \"811fccdc-9209-4bf7-b8d>
Jun 27 15:59:55 node-a k3s[2061]: I0627 15:59:55.570640    2061 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-hkgmd\" (UniqueName: \"kubernetes.io/projected/811fccdc-9209-4bf7-b8d3-f76ff6a8b090-kube-api-access-hkgmd\") pod \"kube-router-r7dnc\" (UID: \"811f>
Jun 27 15:59:55 node-a k3s[2061]: I0627 15:59:55.570670    2061 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"lib-modules\" (UniqueName: \"kubernetes.io/host-path/811fccdc-9209-4bf7-b8d3-f76ff6a8b090-lib-modules\") pod \"kube-router-r7dnc\" (UID: \"811fccdc-9209-4bf7-b8d3->
Jun 27 15:59:56 node-a systemd[1]: Started libcontainer container f5a5138c29c9ec7d1b67339d6ba68aa92df83d141c20c7241142fb8e3719215d.
Jun 27 15:59:56 node-a systemd[1]: run-containerd-runc-k8s.io-f5a5138c29c9ec7d1b67339d6ba68aa92df83d141c20c7241142fb8e3719215d-runc.CjFHMl.mount: Deactivated successfully.
Jun 27 15:59:57 node-a systemd[1]: run-containerd-runc-k8s.io-29813380f6485c48bdc001419f3b9004e6564b9955034975d65040e73c282875-runc.hJNNdF.mount: Deactivated successfully.
Jun 27 16:00:01 node-a systemd[1]: var-lib-rancher-k3s-agent-containerd-tmpmounts-containerd\x2dmount2592918531.mount: Deactivated successfully.
Jun 27 16:00:04 node-a systemd[1]: Started libcontainer container fcfb2763a590122421f91c3494582e8d34241efc7af9d4ed82178c9434b365ee.
Jun 27 16:00:07 node-a k3s[2061]: E0627 16:00:07.726665    2061 network_policy_controller.go:277] Aborting sync. Failed to cleanup stale iptables rules: unable to list chains: running [/usr/sbin/iptables -t filter -S --wait]: exit status 1: iptables v1.8.7 (nf_tables): table `filter' is incompatible, use 'nft' tool.
Jun 27 16:00:17 node-a systemd[1]: run-containerd-runc-k8s.io-29813380f6485c48bdc001419f3b9004e6564b9955034975d65040e73c282875-runc.pmPKEg.mount: Deactivated successfully.
Jun 27 16:00:20 node-a systemd[1]: run-containerd-runc-k8s.io-b6d22729787bcb4cfac75dfb839f5188ac6128d88beb60aa5acfaf7fbec665ca-runc.pojeIG.mount: Deactivated successfully.
Jun 27 16:00:32 node-a systemd[1]: run-containerd-runc-k8s.io-29813380f6485c48bdc001419f3b9004e6564b9955034975d65040e73c282875-runc.nOmMIK.mount: Deactivated successfully.
Jun 27 16:00:42 node-a systemd[1]: run-containerd-runc-k8s.io-29813380f6485c48bdc001419f3b9004e6564b9955034975d65040e73c282875-runc.boKLaJ.mount: Deactivated successfully.
Jun 27 16:00:45 node-a k3s[2061]: E0627 16:00:45.773842    2061 network_policy_controller.go:277] Aborting sync. Failed to cleanup stale iptables rules: unable to list chains: running [/usr/sbin/iptables -t filter -S --wait]: exit status 1: iptables v1.8.7 (nf_tables): table `filter' is incompatible, use 'nft' tool.
Jun 27 16:00:47 node-a systemd[1]: run-containerd-runc-k8s.io-29813380f6485c48bdc001419f3b9004e6564b9955034975d65040e73c282875-runc.chgelL.mount: Deactivated successfully.
Jun 27 16:00:52 node-a systemd[1]: run-containerd-runc-k8s.io-29813380f6485c48bdc001419f3b9004e6564b9955034975d65040e73c282875-runc.bnabLK.mount: Deactivated successfully.
Jun 27 16:01:07 node-a k3s[2061]: E0627 16:01:07.900918    2061 network_policy_controller.go:277] Aborting sync. Failed to cleanup stale iptables rules: unable to list chains: running [/usr/sbin/iptables -t filter -S --wait]: exit status 1: iptables v1.8.7 (nf_tables): table `filter' is incompatible, use 'nft' tool.
Jun 27 16:01:17 node-a systemd[1]: run-containerd-runc-k8s.io-29813380f6485c48bdc001419f3b9004e6564b9955034975d65040e73c282875-runc.lhonMn.mount: Deactivated successfully.
Jun 27 16:01:27 node-a systemd[1]: run-containerd-runc-k8s.io-29813380f6485c48bdc001419f3b9004e6564b9955034975d65040e73c282875-runc.CmnCMH.mount: Deactivated successfully.

squat avatar squat commented on May 29, 2024

Nice find. It sounds like we might be getting somewhere. Unfortunately the Kilo manifest for kube-router does not pin the container image to a particular version. Can you check what version you are running? Maybe it's in the logs.
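E.g. (a sketch, assuming the DaemonSet is named kube-router and lives in kube-system, as in the Kilo manifest):

kubectl -n kube-system get ds kube-router -o jsonpath='{.spec.template.spec.containers[0].image}'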

There are several references to incompatibility issues that arise when the k8s/host version of iptables is newer than kube-router's (xref: cloudnativelabs/kube-router#1370); I wonder if you're running into something related.

squat avatar squat commented on May 29, 2024

If you remove kube-router, do the issues go away (after a reboot)?
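E.g. by deleting the same manifest that was applied earlier:

kubectl delete -f https://raw.githubusercontent.com/squat/kilo/main/manifests/kube-router.yaml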

wangwill avatar wangwill commented on May 29, 2024

Removing kube-router didn't fix the issue.

https://docs.k3s.io/advanced#old-iptables-versions

I updated the k3s server with the --prefer-bundled-bin flag so that it uses the bundled iptables binaries rather than the OS ones.
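For anyone hitting this later, a sketch of one way to set it (assuming a systemd-managed k3s server; the flag can also be passed directly on the k3s command line):

# /etc/rancher/k3s/config.yaml
prefer-bundled-bin: true

followed by systemctl restart k3s.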

squat avatar squat commented on May 29, 2024

❤️
