Comments (24)
I captured a log of bootstrapping my Babel instance (with all underlying links up and stable):
(Please tell me if I should paste text instead of attaching a PNG here next time)
It seems that the problem happens within the first second, even without any link going up or down.
A detailed trace with debug 3 shows:
Received update/prefix for 2a0d:2********4d/128 from fe80::90f8:f6ff:feb7:4b4d on vx-wg-sc.
install_route(2a0d:2********4d/128 from ::/0)
kernel_route: add 2a0d:2********4d/128 from ::/0 table 254 metric 65535 dev 6 nexthop fe80::90f8:f6ff:feb7:4b4d
Sending seqno 41062 from address 0x56534d99f2d0 (talk)
Netlink message: {seq:41062}netlink_read: No such device
kernel_route(ADD): No such device
I checked that dev 6 is there:
6: vx-wg-sc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1966 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ether ea:78:41:57:48:78 brd ff:ff:ff:ff:ff:ff
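For anyone repeating this check from a script, here is a minimal sketch. It relies on Python's standard `socket.if_indextoname`, which maps a kernel interface index (the "dev 6" in the trace) back to a name. The helper name `ifname_or_none` is made up for illustration.

```python
import socket

def ifname_or_none(ifindex):
    """Return the interface name for a kernel ifindex, or None if it does not exist."""
    try:
        return socket.if_indextoname(ifindex)
    except OSError:
        return None

# "dev 6" in the babeld trace is an ifindex. On the reporter's machine this
# would be expected to name vx-wg-sc; elsewhere the result will differ.
print(ifname_or_none(6))
```

A `None` result would mean the index really is gone. A name here, combined with the netlink error, points away from a simple missing interface.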
Hmm, it's weird.
from babeld.
Could you check whether the issue persists with https://github.com/christf/babeld/tree/FIX_nosuchdevice or #19?
from babeld.
Are the interfaces disappearing and reappearing later? There's a race condition in babeld that could cause these symptoms if an interface disappeared and reappeared under the same name within 30 seconds.
from babeld.
That (ifdown, rmmod, modprobe, ifup) happens sometimes when I upgrade wireguard, but I also lose routes without touching any of the links on the server.
from babeld.
Please let me know if the following patch makes the problem go away. (It's a workaround, not a proper fix.)
diff --git a/babeld.c b/babeld.c
index 402230c..cafc143 100644
--- a/babeld.c
+++ b/babeld.c
@@ -558,7 +558,7 @@ main(int argc, char **argv)
     kernel_addr_changed = 0;
     kernel_dump_time = now.tv_sec + roughly(30);
     schedule_neighbours_check(5000, 1);
-    schedule_interfaces_check(30000, 1);
+    schedule_interfaces_check(100, 1);
     expiry_time = now.tv_sec + roughly(30);
     source_expiry_time = now.tv_sec + roughly(300);
@@ -740,7 +740,7 @@ main(int argc, char **argv)
         if(timeval_compare(&check_interfaces_timeout, &now) < 0) {
             check_interfaces();
-            schedule_interfaces_check(30000, 1);
+            schedule_interfaces_check(100, 1);
         }
         if(now.tv_sec >= expiry_time) {
from babeld.
It's weird. I watched that happen two evenings in a row when I reported the issue. The kernel rt currently stays populated correctly. When that changes again I'll test your patch.
Update: patch applied, babeld up and running, waiting for events to report back.
from babeld.
I didn't touch any interface. I restarted with the patch, and an hour or so later IPv6 routes are missing again.
babeld: 18 routes
# echo "dump" | timeout 1 nc :: 33123 | grep "installed yes"
add route 564065fd3c60 prefix 172.23.42.1/32 from 0.0.0.0/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
add route 564065fd7270 prefix 172.23.42.2/32 from 0.0.0.0/0 installed yes id c4:34:36:e4:f6:42:71:a0 metric 96 refmetric 0 via fe80::2 if wg-snafu
add route 564065fd3f80 prefix 172.23.42.8/32 from 0.0.0.0/0 installed yes id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io
add route 564065fd6cf0 prefix 172.23.42.10/32 from 0.0.0.0/0 installed yes id 2a:d2:44:ff:fe:9d:d2:bf metric 288 refmetric 192 via fe80::2 if wg-glitch
add route 564065fd6810 prefix 172.23.42.64/26 from 0.0.0.0/0 installed yes id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io
add route 564065fd68b0 prefix 172.23.42.128/26 from 0.0.0.0/0 installed yes id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io
add route 564065fd3d00 prefix 172.23.42.226/31 from 0.0.0.0/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
add route 564065fd3da0 prefix 172.23.42.238/31 from 0.0.0.0/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
add route 564065fd3e40 prefix 172.23.42.240/31 from 0.0.0.0/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
add route 564065fd63a0 prefix fd42:23:42:100::/64 from ::/0 installed yes id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io
add route 564065fd6440 prefix fd42:23:42:110::/64 from ::/0 installed yes id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io
add route 564065fd3bc0 prefix fd42:23:42:b100::/56 from ::/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
add route 564065fd4070 prefix fd42:23:42:b200::/56 from ::/0 installed yes id c4:34:36:e4:f6:42:71:a0 metric 96 refmetric 0 via fe80::2 if wg-snafu
add route 564065fd6580 prefix fd42:23:42:b800::/56 from ::/0 installed yes id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io
add route 564065fd7410 prefix fd42:23:42:ba00::1/128 from ::/0 installed yes id 2a:d2:44:ff:fe:9d:d2:bf metric 288 refmetric 192 via fe80::2 if wg-glitch
add route 564065fcf030 prefix fd42:23:42:ff01::/64 from ::/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
add route 564065fd4010 prefix fd42:23:42:ff07::/64 from ::/0 installed yes id c4:34:36:e4:f6:42:71:a0 metric 96 refmetric 0 via fe80::2 if wg-snafu
add route 564065fcf1c0 prefix fd42:23:42:ff08::/64 from ::/0 installed yes id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch
kernel rt ipv4: 9 routes
# ip -4 r s t 100
172.23.42.1 via 172.23.42.231 dev wg-glitch proto babel onlink
172.23.42.2 via 172.23.42.233 dev wg-snafu proto babel onlink
172.23.42.8 via 172.23.42.224 dev wg-io proto babel onlink
172.23.42.10 via 172.23.42.231 dev wg-glitch proto babel onlink
172.23.42.64/26 via 172.23.42.224 dev wg-io proto babel onlink
172.23.42.128/26 via 172.23.42.224 dev wg-io proto babel onlink
172.23.42.226/31 via 172.23.42.231 dev wg-glitch proto babel onlink
172.23.42.238/31 via 172.23.42.231 dev wg-glitch proto babel onlink
172.23.42.240/31 via 172.23.42.231 dev wg-glitch proto babel onlink
kernel rt ipv6: 5 routes (missing 4)
# ip -6 r s t 100
fd42:23:42:b100::/56 via fe80::2 dev wg-glitch proto babel metric 1024 onlink pref medium
fd42:23:42:b200::/56 via fe80::2 dev wg-snafu proto babel metric 1024 onlink pref medium
fd42:23:42:ff01::/64 via fe80::2 dev wg-glitch proto babel metric 1024 onlink pref medium
fd42:23:42:ff07::/64 via fe80::2 dev wg-snafu proto babel metric 1024 onlink pref medium
fd42:23:42:ff08::/64 via fe80::2 dev wg-glitch proto babel metric 1024 onlink pref medium
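The comparison above can also be scripted rather than counted by eye. The following is only a sketch: it assumes the dump lines have the `add route ... prefix P ... installed yes` shape shown above and that the first field of each `ip route show` line is the destination. The function names are invented for this example.

```python
import ipaddress

def installed_prefixes(dump_text):
    """Prefixes that a babeld "dump" marks as "installed yes"."""
    prefixes = set()
    for line in dump_text.splitlines():
        fields = line.split()
        if "prefix" not in fields[:-1] or "installed yes" not in line:
            continue
        prefix = fields[fields.index("prefix") + 1]
        prefixes.add(ipaddress.ip_network(prefix, strict=False))
    return prefixes

def kernel_prefixes(ip_route_text):
    """Destination prefixes taken from the first field of "ip route show" lines."""
    prefixes = set()
    for line in ip_route_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        try:
            # A bare address such as 172.23.42.1 becomes a /32 (or /128) network.
            prefixes.add(ipaddress.ip_network(fields[0], strict=False))
        except ValueError:
            pass  # skip entries such as "default" or "unreachable ..."
    return prefixes

def missing_from_kernel(dump_text, ip_route_text):
    """Prefixes babeld believes are installed but the kernel table lacks."""
    missing = installed_prefixes(dump_text) - kernel_prefixes(ip_route_text)
    return sorted(str(net) for net in missing)

dump = (
    "add route 564065fd63a0 prefix fd42:23:42:100::/64 from ::/0 installed yes "
    "id 02:0d:b9:ff:fe:49:cc:f8 metric 98 refmetric 0 via fe80::1 if wg-io\n"
    "add route 564065fd3bc0 prefix fd42:23:42:b100::/56 from ::/0 installed yes "
    "id d4:57:67:c3:ff:0b:d5:f9 metric 96 refmetric 0 via fe80::2 if wg-glitch\n"
)
kernel = "fd42:23:42:b100::/56 via fe80::2 dev wg-glitch proto babel metric 1024 onlink pref medium\n"
print(missing_from_kernel(dump, kernel))
```

Feeding it the full dump and the output of both `ip -4 r s t 100` and `ip -6 r s t 100` should list exactly the routes that babeld considers installed but that are absent from table 100.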
remark
fwiw: what's weird is that 172.23.42.10 / fd42:23:42:ba00::1 is still in the rt, because that's my laptop, which has been in suspend for the last 3 hours. The route gets dropped due to unreach and reappears again.
from babeld.
Either something is flushing the routes, or the interfaces go down-up without babeld noticing, or there's a bug somewhere. Sorry I cannot be of more help.
from babeld.
Is something about routing table 100 significant to babel?
I'm running bird alongside babeld on that machine, but it does not touch igp routing tables and only does static announcements.
I'm using ip monitor link to rule out link flaps; none happen.
The routing policy looks as follows:
0: from all lookup local
200: from all lookup igpt
205: from all lookup igp
210: from all lookup peers
220: from all lookup ebgp
230: from all lookup blackhole
32766: from all lookup main
with table ids
100 igp
101 igpt
110 peers
120 ebgp
130 blackhole
I think the following message is significant:
kernel_route(ADD): No such device
It happens when the remote wireguard end shuts down and comes back up. So this is likely a race condition. The local interface is not flapping when that happens.
I am uncertain why it would say "No such device". Maybe wireguard, the transport I'm using, is doing something weird?
from babeld.
babeld checks at a fixed interval whether an interface has changed. If the interface goes down and comes back up in less time than that interval, the observed behavior would occur. You could try lowering the interval for the interface checks even further, but it is not clear whether that would get rid of the issue to any significant degree.
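To make the timing argument concrete, here is a toy model (not babeld code). It assumes checks run at exact multiples of the interval, ignoring whatever jitter babeld actually applies, and that the interface is down for one contiguous window. The function name and the 5-second flap are invented for illustration.

```python
def polls_seeing_interface_down(down_at_ms, up_at_ms, interval_ms, horizon_ms):
    """Times (ms) at which a periodic check would find the interface down.

    Toy model: a check runs at t = 0, interval, 2*interval, ... up to the
    horizon, and the interface is down during [down_at_ms, up_at_ms).
    """
    return [t for t in range(0, horizon_ms + 1, interval_ms)
            if down_at_ms <= t < up_at_ms]

# A 5-second flap that falls between two checks 30 s apart is never observed.
print(polls_seeing_interface_down(1_000, 6_000, 30_000, 60_000))

# With the 100 ms interval from the workaround patch the same flap is seen,
# but a flap shorter than 100 ms could still slip through.
print(len(polls_seeing_interface_down(1_000, 6_000, 100, 60_000)))
```

In this model, shortening the interval only narrows the window. It does not close it.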
I am not quite sure about a proper fix. It seems to me that it could be enough to add an additional check to netlink_read() in kernel_netlink.c. There it could check whether the current event has nlmsg_type RTM_NEWLINK, RTM_DELLINK or RTM_CHANGELINK and, if so, trigger the interface check right away. The scheduled check could then be omitted.
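As a rough illustration of that idea (written in Python rather than as a change to kernel_netlink.c), the sketch below walks a buffer of netlink messages and reports whether any of them is a link event. The header layout and the values RTM_NEWLINK = 16 and RTM_DELLINK = 17 come from the Linux headers. The function name is hypothetical, and only those two message types are checked.

```python
import struct

# struct nlmsghdr from <linux/netlink.h>: len (u32), type (u16), flags (u16),
# seq (u32), pid (u32), all in host byte order.
NLMSG_HDR = struct.Struct("=IHHII")
RTM_NEWLINK, RTM_DELLINK = 16, 17

def nlmsg_align(length):
    """Round a message length up to the 4-byte netlink alignment."""
    return (length + 3) & ~3

def has_link_event(buf):
    """Return True if the buffer contains an RTM_NEWLINK or RTM_DELLINK message."""
    offset = 0
    while offset + NLMSG_HDR.size <= len(buf):
        length, msg_type, _flags, _seq, _pid = NLMSG_HDR.unpack_from(buf, offset)
        if length < NLMSG_HDR.size:
            break  # malformed header, stop parsing
        if msg_type in (RTM_NEWLINK, RTM_DELLINK):
            return True
        offset += nlmsg_align(length)
    return False

# Two header-only messages: a route message (type 24) followed by a link message.
route_msg = NLMSG_HDR.pack(16, 24, 0, 1, 0)
link_msg = NLMSG_HDR.pack(16, RTM_NEWLINK, 0, 2, 0)
print(has_link_event(route_msg), has_link_event(route_msg + link_msg))
```

A caller could run the interface check whenever this returns true, instead of waiting for the next scheduled one.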
from babeld.
So I prepared a patch that works as described above. While it does not fix @mweinelt's issue, I still feel this is the right approach to detect interface changes, as it reduces the likelihood of babeld not noticing an ifup/ifdown. I was able to verify this with a pair of veth interfaces and could see the check being triggered. Unfortunately, the interface_check is also scheduled when there is no IP address assigned to an interface, so we cannot entirely get rid of the scheduled check before reworking check_link_local_addresses() in interface.c. I added FIXME comments for this in my patch.
In this issue, we are dealing with a point-to-point interface which is considered "UP" all the time while the remote endpoint flaps. Because of this, the kernel apparently never emits an RTM_DELLINK/RTM_NEWLINK.
To solve this, we need to find out which signal the kernel can emit in this situation and subscribe to it. At this point I am not sure there is a generic solution, as I have not worked with point-to-point interfaces previously. That being said, using wireguard in conjunction with babeld is something that the freifunk community has wanted to do for a while. Without solving this issue, that cannot be done.
Interestingly, the routes get lost even though the interface is always up. My expectation would be that the kernel retains routes while the interface is up. The fact that these routes get lost surprises me.
from babeld.
I can confirm this problem with a setup of [OpenVPN TCP -> WireGuard Mesh -> VxLAN].
I run Babeld on the VxLAN interface. Although the upper two layers do not change, the OpenVPN link goes up and down frequently.
To test the bug, I typed:
sudo systemctl stop [email protected]
sleep 10
sudo systemctl start [email protected]
tail -f /var/log/babeld.log # Debian babeld package does not go through systemd journal
I see babeld printing "kernel_route(ADD): No such device" in an infinite loop, with a CPU utilization of 80%-100%.
This is weird, since OpenVPN does not flush routes on the upper layers. Babeld is being affected by an interface it is not running on.
My babeld version: babeld-1.8.2-1-g8cbc75d.
To get this version, I first installed the babeld package from the Debian sid repo, then compiled the latest code and replaced the binary.
from babeld.
I saw the same situation as you did (a device with the correct index exists, yet there is a "No such device" error message). At that point I was out of ideas and started checking when this error message happens, but I wasn't able to finish that.
I am seeing the same log with 1.8.2. This completely fills server logs here. Interestingly I did not see this before 1.8.0. I saw it in master and in the unicast branch.
The load could be an entirely different issue or a subsequent error.
from babeld.
nltrace reveals this:
160713 Sending update to ens14 for 2a06:8187:fbab:2:9443:52a8:e9e1:a04a/128 from ::/0.
160714 Sending update to babel-vpn-1374 for 2a06:8187:fbab:2:9443:52a8:e9e1:a04a/128 from ::/0.
160715 Received update/prefix for 2a06:8187:fbab:2:9d61:3720:330d:7d0/128 from fe80::e8da:47ff:feea:17f7 on babel-vpn-1374.
160716 install_route(2a06:8187:fbab:2:9d61:3720:330d:7d0/128 from ::/0)
160717 kernel_route: add 2a06:8187:fbab:2:9d61:3720:330d:7d0/128 from ::/0 table 10 metric 65535 dev 7 nexthop fe80::e8da:47ff:feea:17f7
160718 Sending seqno 3686 from address 0x55bcf55962d0 (talk)
160719 netlink send(4):
160720 Setting msg proto to 0
160721 -------------------------- BEGIN NETLINK MESSAGE ---------------------------
160722 [NETLINK HEADER] 16 octets
160723 .nlmsg_len = 56
160724 .type = 24 <route/route::new>
160725 .flags = 1541 <REQUEST,ACK,MATCH,ATOMIC>
160726 .seq = 3686
160727 .port = 0
160728 [PAYLOAD] 12 octets
160729 0a 80 00 00 0a 2a 00 07 04 00 00 00 .....*......
160730 [ATTR 01] 16 octets
160731 2a 06 81 87 fb ab 00 02 9d 61 37 20 33 0d 07 d0 *........a7 3...
160732 [ATTR 06] 4 octets
160733 ff ff ff ff ....
160734 --------------------------- END NETLINK MESSAGE ---------------------------
160735 netlink recv(4):
160736 Setting msg proto to 0
160737 -------------------------- BEGIN NETLINK MESSAGE ---------------------------
160738 [NETLINK HEADER] 16 octets
160739 .nlmsg_len = 76
160740 .type = 2 <ERROR>
160741 .flags = 0 <>
160742 .seq = 3686
160743 .port = 2297
160744 [ERRORMSG] 20 octets
160745 .error = -19 "No such device"
160746 [ORIGINAL MESSAGE] 16 octets
160747 .nlmsg_len = 16
160748 .type = 24 <0x18>
160749 .flags = 1541 <REQUEST,ACK,MATCH,ATOMIC>
160750 .seq = 3686
160751 .port = 0
160752 --------------------------- END NETLINK MESSAGE ---------------------------
160753 nl_recvmsgs_report() returns error
160754 Netlink message: {seq:3686}netlink_read: No such device
160755 kernel_route(ADD): No such device
The payload of the netlink message decodes as follows:
0a - 10 - rtm_family = AF_INET6
80 - 128 - length of dst in bits
00 - length of src
00 - tos filter
0a - 10 - routing table
2a - 42 - protocol = babel
00 - scope
07 - rtm_type => RTN_UNREACHABLE
04 00 00 00 - rtm_flags => RTNH_F_ONLINK
ATTR 01 is the destination address
ATTR 06 is the metric which is KERNEL_INFINITY
I cannot see a coded device in the netlink message.
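The same decoding can be done mechanically. This is a small sketch that assumes the 12 payload bytes are a `struct rtmsg` as laid out in `<linux/rtnetlink.h>` (eight one-byte fields followed by a 32-bit flags word) and that the capture came from a little-endian host. The constants are the Linux values. In that header the flag value 4 in rtm_flags is named RTNH_F_ONLINK.

```python
import struct

# struct rtmsg: eight u8 fields followed by a u32 flags word.
# "<" assumes the trace was captured on a little-endian machine.
RTMSG = struct.Struct("<8BI")
FIELDS = ("rtm_family", "rtm_dst_len", "rtm_src_len", "rtm_tos",
          "rtm_table", "rtm_protocol", "rtm_scope", "rtm_type", "rtm_flags")

AF_INET6 = 10          # Linux value
RTPROT_BABEL = 42
RTN_UNREACHABLE = 7
RTNH_F_ONLINK = 4

def decode_rtmsg(hex_bytes):
    """Decode a hex dump of struct rtmsg into a field-name -> value dict."""
    return dict(zip(FIELDS, RTMSG.unpack(bytes.fromhex(hex_bytes))))

msg = decode_rtmsg("0a 80 00 00 0a 2a 00 07 04 00 00 00")
for name, value in msg.items():
    print(f"{name:13} = {value}")
print("unreachable route:", msg["rtm_type"] == RTN_UNREACHABLE)
print("onlink flag set:  ", bool(msg["rtm_flags"] & RTNH_F_ONLINK))
```

Note that `struct rtmsg` itself has no interface field. An output device would normally travel as a separate attribute, which matches the observation that none is present in this request.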
The code that assembles these flags in kernel_netlink.c (around line 1034) is in part 10+ years old. So why is this becoming an issue now?
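For reference, the header values in the trace can be decoded too. The sketch below uses the flag bits from `<linux/netlink.h>`. Bits 0x100 to 0x800 are overloaded, so nltrace's "MATCH,ATOMIC" labels are the GET-request names for what are EXCL and CREATE on a "new" request such as RTM_NEWROUTE. The helper name is made up.

```python
import errno

# Flag names as they apply to "new" requests (e.g. RTM_NEWROUTE, type 24).
NEW_REQUEST_FLAGS = {
    0x001: "NLM_F_REQUEST",
    0x004: "NLM_F_ACK",
    0x100: "NLM_F_REPLACE",
    0x200: "NLM_F_EXCL",
    0x400: "NLM_F_CREATE",
    0x800: "NLM_F_APPEND",
}

def decode_new_request_flags(flags):
    """List the names of the flag bits set in an nlmsg_flags value."""
    return [name for bit, name in sorted(NEW_REQUEST_FLAGS.items()) if flags & bit]

# .flags = 1541 from the request in the trace
print(decode_new_request_flags(1541))

# .error = -19 in the reply is a negated errno value
print(errno.errorcode[19])
```

So the request asks to create the route exclusively and to be acknowledged, and the acknowledgement carries ENODEV.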
from babeld.
Edit: I thought "No such device" was not visible any more. It seems I was too impatient. I still see it.
In the last 7h I got:
# cut -d'}' -f2 < /tmp/netlink_nosuchdevice |sort |uniq -c
16068629 kernel_route(ADD): No such device
10 kernel_route(FLUSH): No such process
644 kernel_route(MODIFY metric): No such device
16069273 netlink_read: No such device
623 netlink_read: No such process
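Where `sort | uniq -c` is awkward (for example when post-processing the log elsewhere), the same tally can be sketched in Python. This mirrors `cut -d'}' -f2`: it keeps the second `}`-separated field and passes lines without a `}` through unchanged. The function name is invented for this example.

```python
from collections import Counter

def tally_messages(lines):
    """Count log messages the way cut -d'}' -f2 | sort | uniq -c would."""
    counts = Counter()
    for line in lines:
        text = line.rstrip("\n")
        fields = text.split("}")
        # Like cut: take field 2 if there is a delimiter, else the whole line.
        counts[fields[1] if len(fields) > 1 else text] += 1
    return counts

sample = [
    "Netlink message: {seq:3686}netlink_read: No such device\n",
    "kernel_route(ADD): No such device\n",
    "Netlink message: {seq:3687}netlink_read: No such device\n",
]
for message, count in tally_messages(sample).most_common():
    print(count, message)
```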
Edit:
Why do we even need routes with a metric of 65535?
from babeld.
It seems that I can reproduce this using ip route add. If I add a route and omit the device, I get the same error message. If I do not omit the device, another attribute (ATTR 04) containing the device ID is added to the netlink request. Update after playing with it some more: I can run ip r a unreachable blabla without a device. In the above example this is an unreachable route, so ATTR 04 should not be required.
from babeld.
I observed the missing-route problem with IPv4 as well.
Restarting babeld on the affected intermediate router solved the problem.
My network is dual-stack. I am sure it was not a connectivity problem, since IPv6 routes were working.
It is hard to capture logs as evidence, because the problem goes away as soon as I restart babeld with a different verbosity.
from babeld.
could you try if the issue persists with https://github.com/christf/babeld/tree/FIX_nosuchdevice or #19
Seems the problem has gone away with the patch. Thanks! 🎉
I will continue to test the network for several days to see if the problem is really gone.
from babeld.
Looks good here as well, thanks!
from babeld.
could you try if the issue persists with https://github.com/christf/babeld/tree/FIX_nosuchdevice or #19
Patch tested on Linux 4.17.0 and 4.14.52, network with 7 nodes.
It has eliminated the "No such device" errors over the last 3 days.
Has anyone tested compatibility with older kernels?
from babeld.
I have this running with 15 nodes, of which 2 run kernel 4.17 and the rest run 4.4.
from babeld.
@jech The patch is already available. Would you please merge it?
We've been missing you for weeks :-)
from babeld.
It looks correct to me, so I think I'm going to merge it without testing. I've got one minor nit (see my comment in #19).
My apologies for the delay.
from babeld.