Comments (24)

m13253 commented on September 28, 2024

I captured a log of bootstrapping my Babel instance (with all underlying links up and stable):

/var/log/babeld.log
(Please tell me if I should paste text instead of attaching a PNG here next time)

It seems that the problem happens at the first second, even without any link up / down.


A detailed trace with debug 3 shows:

Received update/prefix for 2a0d:2********4d/128 from fe80::90f8:f6ff:feb7:4b4d on vx-wg-sc.
install_route(2a0d:2********4d/128 from ::/0)
kernel_route: add 2a0d:2********4d/128 from ::/0 table 254 metric 65535 dev 6 nexthop fe80::90f8:f6ff:feb7:4b4d
Sending seqno 41062 from address 0x56534d99f2d0 (talk)
Netlink message: {seq:41062}netlink_read: No such device
kernel_route(ADD): No such device

I checked that dev 6 is there:

6: vx-wg-sc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1966 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether ea:78:41:57:48:78 brd ff:ff:ff:ff:ff:ff

Hmm, it's weird.

from babeld.

christf commented on September 28, 2024

could you try if the issue persists with https://github.com/christf/babeld/tree/FIX_nosuchdevice or #19

jech commented on September 28, 2024

Are the interfaces disappearing and reappearing later? There's a race condition in babeld that could cause these symptoms if an interface disappeared and reappeared under the same name within 30 seconds.
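
To illustrate the kind of race meant here, the following is a minimal sketch and not babeld's actual code. It assumes the daemon remembers an interface's kernel index between periodic checks; `struct cached_if`, `ifindex_changed` and `cached_if_is_stale` are hypothetical names. If an interface is destroyed and recreated under the same name, the kernel may assign it a new index, and a request that still uses the remembered index could be rejected until the next check refreshes it.

```c
#include <net/if.h>

/* Hypothetical cached view of an interface, refreshed only by the
   periodic interface check. */
struct cached_if {
    char name[IF_NAMESIZE];
    unsigned int ifindex;       /* index seen at the last check, 0 if absent */
};

/* Return 1 if the remembered index differs from the current one. */
int ifindex_changed(unsigned int cached, unsigned int current)
{
    return cached != current;
}

/* Look the interface up by name and compare with the remembered index.
   if_nametoindex() returns 0 when no interface has that name. */
int cached_if_is_stale(const struct cached_if *ifp)
{
    return ifindex_changed(ifp->ifindex, if_nametoindex(ifp->name));
}
```

With a 30-second check interval, the remembered index could stay stale for up to 30 seconds after the interface comes back.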

mweinelt commented on September 28, 2024

That (ifdown, rmmod, modprobe, ifup) happens sometimes when I upgrade wireguard, but I also lose routes without touching any of the links on the server.

jech commented on September 28, 2024

Please let me know if the following patch makes the problem go away. (It's a workaround, not a proper fix.)

diff --git a/babeld.c b/babeld.c
index 402230c..cafc143 100644
--- a/babeld.c
+++ b/babeld.c
@@ -558,7 +558,7 @@ main(int argc, char **argv)
     kernel_addr_changed = 0;
     kernel_dump_time = now.tv_sec + roughly(30);
     schedule_neighbours_check(5000, 1);
-    schedule_interfaces_check(30000, 1);
+    schedule_interfaces_check(100, 1);
     expiry_time = now.tv_sec + roughly(30);
     source_expiry_time = now.tv_sec + roughly(300);
 
@@ -740,7 +740,7 @@ main(int argc, char **argv)
 
         if(timeval_compare(&check_interfaces_timeout, &now) < 0) {
             check_interfaces();
-            schedule_interfaces_check(30000, 1);
+            schedule_interfaces_check(100, 1);
         }
 
         if(now.tv_sec >= expiry_time) {

mweinelt commented on September 28, 2024

It's weird. I watched that happen two evenings in a row when I reported the issue. The kernel rt currently stays populated correctly. When that changes again I'll test your patch.

Update: patch applied, babeld up and running, waiting for events to report back.

mweinelt commented on September 28, 2024

I didn't touch any interface. I restarted with the patch, and an hour or so later IPv6 routes are missing again.

babeld: 18 routes

# echo "dump" | timeout 1 nc :: 33123 | grep "installed yes" 
add route 564065fd3c60 prefix 172.23.42.1/32 from 0.0.0.0/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
add route 564065fd7270 prefix 172.23.42.2/32 from 0.0.0.0/0 installed yes id c4:34:36:e4:f6:42:71:a0 metric 96 refmetric 0 via fe80::2 if wg-snafu
add route 564065fd3f80 prefix 172.23.42.8/32 from 0.0.0.0/0 installed yes id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io
add route 564065fd6cf0 prefix 172.23.42.10/32 from 0.0.0.0/0 installed yes id 2a:d2:44:ff:fe:9d:d2:bf metric 288 refmetric 192 via fe80::2 if wg-glitch
add route 564065fd6810 prefix 172.23.42.64/26 from 0.0.0.0/0 installed yes id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io
add route 564065fd68b0 prefix 172.23.42.128/26 from 0.0.0.0/0 installed yes id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io
add route 564065fd3d00 prefix 172.23.42.226/31 from 0.0.0.0/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
add route 564065fd3da0 prefix 172.23.42.238/31 from 0.0.0.0/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
add route 564065fd3e40 prefix 172.23.42.240/31 from 0.0.0.0/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
add route 564065fd63a0 prefix fd42:23:42:100::/64 from ::/0 installed yes id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io
add route 564065fd6440 prefix fd42:23:42:110::/64 from ::/0 installed yes id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io
add route 564065fd3bc0 prefix fd42:23:42:b100::/56 from ::/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
add route 564065fd4070 prefix fd42:23:42:b200::/56 from ::/0 installed yes id c4:34:36:e4:f6:42:71:a0 metric 96 refmetric 0 via fe80::2 if wg-snafu
add route 564065fd6580 prefix fd42:23:42:b800::/56 from ::/0 installed yes id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io
add route 564065fd7410 prefix fd42:23:42:ba00::1/128 from ::/0 installed yes id 2a:d2:44:ff:fe:9d:d2:bf metric 288 refmetric 192 via fe80::2 if wg-glitch
add route 564065fcf030 prefix fd42:23:42:ff01::/64 from ::/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
add route 564065fd4010 prefix fd42:23:42:ff07::/64 from ::/0 installed yes id c4:34:36:e4:f6:42:71:a0 metric 96 refmetric 0 via fe80::2 if wg-snafu
add route 564065fcf1c0 prefix fd42:23:42:ff08::/64 from ::/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch

kernel rt ip4: 9 routes

# ip -4 r s t 100 
172.23.42.1 via 172.23.42.231 dev wg-glitch proto babel onlink 
172.23.42.2 via 172.23.42.233 dev wg-snafu proto babel onlink 
172.23.42.8 via 172.23.42.224 dev wg-io proto babel onlink 
172.23.42.10 via 172.23.42.231 dev wg-glitch proto babel onlink 
172.23.42.64/26 via 172.23.42.224 dev wg-io proto babel onlink 
172.23.42.128/26 via 172.23.42.224 dev wg-io proto babel onlink 
172.23.42.226/31 via 172.23.42.231 dev wg-glitch proto babel onlink 
172.23.42.238/31 via 172.23.42.231 dev wg-glitch proto babel onlink 
172.23.42.240/31 via 172.23.42.231 dev wg-glitch proto babel onlink 

kernel rt ipv6: 5 routes (missing 4)

# ip -6 r s t 100 
fd42:23:42:b100::/56 via fe80::2 dev wg-glitch proto babel metric 1024 onlink pref medium
fd42:23:42:b200::/56 via fe80::2 dev wg-snafu proto babel metric 1024 onlink pref medium
fd42:23:42:ff01::/64 via fe80::2 dev wg-glitch proto babel metric 1024 onlink pref medium
fd42:23:42:ff07::/64 via fe80::2 dev wg-snafu proto babel metric 1024 onlink pref medium
fd42:23:42:ff08::/64 via fe80::2 dev wg-glitch proto babel metric 1024 onlink pref medium

remark

FWIW, what's weird is that 172.23.42.10/fd42:23:42:ba00::1 is still in the rt: that's my laptop, which has been suspended for the last 3 hours. The route gets dropped as unreachable and then reappears.

jech commented on September 28, 2024

Either something is flushing the routes, or the interfaces go down-up without babeld noticing, or there's a bug somewhere. Sorry I cannot be of more help.

mweinelt commented on September 28, 2024

Is something about routing table 100 significant to babel?

I'm running bird alongside babeld on that machine, but it does not touch igp routing tables and only does static announcements.

I'm using ip monitor link to rule out link flaps, none happen.

The routing policy looks as follows:

0:      from all lookup local
200:    from all lookup igpt 
205:    from all lookup igp
210:    from all lookup peers
220:    from all lookup ebgp
230:    from all lookup blackhole
32766:  from all lookup main

with table ids

100    igp
101    igpt
110    peers
120    ebgp
130    blackhole

I think the following message is significant:

kernel_route(ADD): No such device

It happens when the remote wireguard end shuts down and comes back up, so this is likely a race condition. The local interface is not flapping when that happens.
I am not sure why it would say "No such device". Maybe wireguard, the transport I'm using, is doing something weird?

christf commented on September 28, 2024

babeld checks for interface changes at a fixed interval. If an interface goes down and comes back up within less than that interval, the observed behavior would occur. You could try lowering the interval for the interface checks even further, but it is not clear whether that would reduce the problem significantly.

I am not quite sure about a proper fix. It seems to me that it could be enough to add an additional check to netlink_read() in kernel_netlink.c. There it could be checked whether the current event has an nlmsg_type of RTM_NEWLINK|RTM_DELLINK|RTM_CHANGELINK, and if it does, the interface check should be triggered right away. Naturally, the scheduled check could then be omitted.
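
A minimal sketch of the check proposed above, under stated assumptions: it is not the actual patch, and `is_link_event` and `next_interfaces_check_ms` are hypothetical names. It tests only RTM_NEWLINK and RTM_DELLINK, both of which `<linux/rtnetlink.h>` defines; on Linux, changes to an existing link are announced as RTM_NEWLINK as well.

```c
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Return 1 if this netlink message announces a link being added,
   changed or removed, 0 for any other message type. */
int is_link_event(const struct nlmsghdr *nh)
{
    return nh->nlmsg_type == RTM_NEWLINK || nh->nlmsg_type == RTM_DELLINK;
}

/* Hypothetical policy: run the interface check immediately after a link
   event, otherwise keep the regular interval (in milliseconds). */
int next_interfaces_check_ms(const struct nlmsghdr *nh, int scheduled_ms)
{
    return is_link_event(nh) ? 0 : scheduled_ms;
}
```

In this sketch a route message such as RTM_NEWROUTE leaves the regular interval untouched, so the periodic check remains as a fallback.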

jech commented on September 28, 2024

christf commented on September 28, 2024

So I prepared a patch that works as described above. While it does not fix @mweinelt's issue, I still feel this is the right approach to detecting interface changes, as it reduces the likelihood of babeld not noticing an ifup/ifdown. I was able to verify this with a pair of veth interfaces and could see the check being triggered. Unfortunately, the interface check is also scheduled when there is no IP address assigned to an interface, so we cannot entirely get rid of the scheduled check before reworking check_link_local_addresses() in interface.c. I added FIXME comments for this in my patch.

In this issue, we are dealing with a point-to-point interface that is considered "UP" all the time while the remote endpoint flaps. Because of this, the kernel apparently never emits an RTM_DELLINK/RTM_NEWLINK. To solve this, we need to find out which signal the kernel can emit in this situation and subscribe to it. At this point I am not sure there is a generic solution, as I have not worked with point-to-point interfaces previously. That being said, using wireguard in conjunction with babeld is something that the freifunk community has wanted to do for a while. Without solving this issue, that cannot be done.

Interestingly, the routes get lost even though the interface is always up. My expectation would be that the kernel retains routes while the interface is up, so the fact that these routes get lost surprises me.

m13253 commented on September 28, 2024

I confirm this problem with a setup of [OpenVPN TCP -> WireGuard Mesh -> VxLAN].

I run babeld on the VxLAN interface. Although the upper two layers do not change, the OpenVPN link goes up and down frequently.

To test the bug, I typed:

sudo systemctl stop [email protected]
sleep 10
sudo systemctl start [email protected]
tail -f /var/log/babeld.log  # Debian babeld package does not go through systemd journal

I see babeld printing an infinite loop of

kernel_route(ADD): No such device

with a CPU utilization of 80%-100%.

This is weird, since OpenVPN does not flush routes on the upper layers. babeld is being disturbed by an interface it is not running on.

My babeld version: babeld-1.8.2-1-g8cbc75d.
To get this version, I first installed babeld package from Debian sid repo, then compiled the latest code and replaced the binary.

christf commented on September 28, 2024

I saw the same situation as you did (a device with the correct index exists, yet there is a "No such device" error message). At that point I was out of ideas and started checking when this error message happens, but I wasn't able to finish that.
I am seeing the same log with 1.8.2, and it completely fills the server logs here. Interestingly, I did not see this before 1.8.0. I saw it in master and in the unicast branch.

The load could be an entirely different issue or a follow-on error.

christf commented on September 28, 2024

nltrace reveals this:

Sending update to ens14 for 2a06:8187:fbab:2:9443:52a8:e9e1:a04a/128 from ::/0.
Sending update to babel-vpn-1374 for 2a06:8187:fbab:2:9443:52a8:e9e1:a04a/128 from ::/0.
Received update/prefix for 2a06:8187:fbab:2:9d61:3720:330d:7d0/128 from fe80::e8da:47ff:feea:17f7 on babel-vpn-1374.
install_route(2a06:8187:fbab:2:9d61:3720:330d:7d0/128 from ::/0)
kernel_route: add 2a06:8187:fbab:2:9d61:3720:330d:7d0/128 from ::/0 table 10 metric 65535 dev 7 nexthop fe80::e8da:47ff:feea:17f7
Sending seqno 3686 from address 0x55bcf55962d0 (talk)
netlink send(4):
Setting msg proto to 0
--------------------------   BEGIN NETLINK MESSAGE ---------------------------
  [NETLINK HEADER] 16 octets
    .nlmsg_len = 56
    .type = 24 <route/route::new>
    .flags = 1541 <REQUEST,ACK,MATCH,ATOMIC>
    .seq = 3686
    .port = 0
  [PAYLOAD] 12 octets
    0a 80 00 00 0a 2a 00 07 04 00 00 00             .....*......
  [ATTR 01] 16 octets
    2a 06 81 87 fb ab 00 02 9d 61 37 20 33 0d 07 d0 *........a7 3...
  [ATTR 06] 4 octets
    ff ff ff ff                                     ....
---------------------------  END NETLINK MESSAGE   ---------------------------
netlink recv(4):
Setting msg proto to 0
--------------------------   BEGIN NETLINK MESSAGE ---------------------------
  [NETLINK HEADER] 16 octets
    .nlmsg_len = 76
    .type = 2 <ERROR>
    .flags = 0 <>
    .seq = 3686
    .port = 2297
  [ERRORMSG] 20 octets
    .error = -19 "No such device"
  [ORIGINAL MESSAGE] 16 octets
    .nlmsg_len = 16
    .type = 24 <0x18>
    .flags = 1541 <REQUEST,ACK,MATCH,ATOMIC>
    .seq = 3686
    .port = 0
---------------------------  END NETLINK MESSAGE   ---------------------------
nl_recvmsgs_report() returns error
Netlink message: {seq:3686}netlink_read: No such device
kernel_route(ADD): No such device

The payload of the netlink message (a struct rtmsg) decodes as follows:
0a - 10 - rtm_family = AF_INET6
80 - 128 - rtm_dst_len, length of dst in bits
00 - rtm_src_len, length of src
00 - rtm_tos, TOS filter
0a - 10 - rtm_table, routing table
2a - 42 - rtm_protocol = babel
00 - rtm_scope
07 - rtm_type = RTN_UNREACHABLE
04 00 00 00 - rtm_flags = RTNH_F_ONLINK

ATTR 01 (RTA_DST) is the destination address.
ATTR 06 (RTA_PRIORITY) is the metric, which is KERNEL_INFINITY.

I cannot see a device encoded in the netlink message.
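
The manual decode can be cross-checked with a short sketch. It copies the 12 payload octets from the trace into a `struct rtmsg` and walks the attributes looking for RTA_OIF (attribute type 4, the output interface). The trace prints only attribute values, so the 4-octet attribute headers in `trace_attrs` are my reconstruction and assume a little-endian host; `decode_rtmsg` and `has_attr` are hypothetical helper names.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/rtnetlink.h>

/* The 12-octet [PAYLOAD] from the trace, i.e. one struct rtmsg. */
const unsigned char trace_rtmsg[12] = {
    0x0a, 0x80, 0x00, 0x00, 0x0a, 0x2a, 0x00, 0x07, 0x04, 0x00, 0x00, 0x00
};

/* The two attributes from the trace. The header octets (length, type)
   are reconstructed, little-endian; only the values appear in the trace. */
const unsigned char trace_attrs[28] = {
    0x14, 0x00, 0x01, 0x00,     /* rta_len = 20, rta_type = 1 (RTA_DST) */
    0x2a, 0x06, 0x81, 0x87, 0xfb, 0xab, 0x00, 0x02,
    0x9d, 0x61, 0x37, 0x20, 0x33, 0x0d, 0x07, 0xd0,
    0x08, 0x00, 0x06, 0x00,     /* rta_len = 8, rta_type = 6 (RTA_PRIORITY) */
    0xff, 0xff, 0xff, 0xff
};

/* Copy raw octets into a struct rtmsg; memcpy avoids alignment issues. */
struct rtmsg decode_rtmsg(const unsigned char *payload)
{
    struct rtmsg rtm;
    memcpy(&rtm, payload, sizeof(rtm));
    return rtm;
}

/* Return 1 if an attribute of the given type is present, 0 otherwise. */
int has_attr(const unsigned char *buf, size_t len, unsigned short type)
{
    size_t off = 0;
    while(off + sizeof(struct rtattr) <= len) {
        struct rtattr rta;
        memcpy(&rta, buf + off, sizeof(rta));
        if(rta.rta_len < sizeof(struct rtattr) || rta.rta_len > len - off)
            break;              /* malformed attribute, stop walking */
        if(rta.rta_type == type)
            return 1;
        off += RTA_ALIGN(rta.rta_len);
    }
    return 0;
}
```

Calling `has_attr(trace_attrs, sizeof(trace_attrs), RTA_OIF)` on this data finds no output-interface attribute, which matches the observation that the request carries no device.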

The code that assembles these flags in kernel_netlink.c (around line 1034) is in part 10+ years old. So why is this becoming an issue now?

christf commented on September 28, 2024

Edit: I thought "No such device" was not visible any more. It seems I was too impatient; I still see it.
In the last 7 hours I got:

# cut -d'}' -f2 < /tmp/netlink_nosuchdevice  |sort |uniq -c
16068629 kernel_route(ADD): No such device
     10 kernel_route(FLUSH): No such process
    644 kernel_route(MODIFY metric): No such device
16069273 netlink_read: No such device
    623 netlink_read: No such process

Edit:
Why do we even need routes with a metric of 65535?

christf commented on September 28, 2024

It seems that I can reproduce this using ip route add. If I add a route and omit the device, I get the same error message. If I specify the device, another attribute (ATTR 04) containing the device ID is added to the netlink request.

Update: playing with it some more, I can add an unreachable route (ip r a unreachable <prefix>) without a device. The route in the example above is unreachable, so ATTR 04 should not be required.

m13253 commented on September 28, 2024

I observed the missing-route problem with IPv4 as well.
Restarting babeld on the affected intermediate router solved the problem.

My network is dual-stack. I am sure it was not a connectivity problem, since IPv6 routes were working.

It is hard to capture logs as evidence, because the problem goes away as soon as I restart babeld with a different verbosity.

m13253 commented on September 28, 2024

could you try if the issue persists with https://github.com/christf/babeld/tree/FIX_nosuchdevice or #19

Seems the problem has gone away with the patch. Thanks! 🎉
I will continue to test the network for several days to see if the problem is really gone.

mweinelt commented on September 28, 2024

Looks good here as well, thanks!

m13253 commented on September 28, 2024

could you try if the issue persists with https://github.com/christf/babeld/tree/FIX_nosuchdevice or #19

Patch tested on Linux 4.17.0 and 4.14.52, network with 7 nodes.
Eliminated "no such device" errors during last 3 days.

Has anyone tested compatibility with older kernels?

christf commented on September 28, 2024

I have this running on 15 nodes, 2 of which run kernel 4.17 while the rest run 4.4.

m13253 commented on September 28, 2024

@jech The patch is already available. Would you please merge it?
We've been missing you for weeks :-)

jech commented on September 28, 2024

It looks correct to me, so I think I'm going to merge it without testing. I've got one minor nit (see my comment in #19).

My apologies for the delay.
