Comments (10)
Unfortunately, Dr. Jung (:tada:), I have been unable to capture a log in this situation. But given the current state of the client, I would not be surprised if there is some case in which the state machine goes berserk.
from tunneldigger.
@valcryst have you upstreamed this change into Gluon itself, or do you plan to?
In Saarland we have patched our own secondary tunneldigger watchdog to count the number of tunneldigger processes and issue a reboot if there are too many of them. That also seems to help.
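A minimal sketch of such a counting watchdog (the threshold, log tag and match pattern are assumptions, not the actual Saarland script): count the running tunneldigger processes and reboot when there are more than expected.

```shell
#!/bin/sh
# Hypothetical counting watchdog: reboot if more tunneldigger
# processes are running than a healthy node should have.
MAX=3  # main process plus worker threads on a typical node (assumed)
count=$(pgrep -f '^/usr/bin/tunneldigger' | wc -l)
if [ "$count" -gt "$MAX" ]; then
    logger -t td-watchdog "$count tunneldigger processes running, rebooting"
    reboot
fi
```

Run periodically (e.g. from cron), this catches the duplicate-instance state described below without needing to know its root cause.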
Thanks. :) (However you found out about that.^^)
Yeah, I have fairly low confidence in that state machine code... but without a log I also don't really know where to start debugging this.
We have been facing this issue for some months now, but were not able to get access to one of the affected routers until now.
We observed that this symptom always starts after the "daily" reboot and affects around 70-80 routers out of 1000.
When this happens, you see the Tunneldigger broker instantly running the "down-hook" after creating the tunnel.
This is what we saw on the Tunneldigger client on the router.
Sat Jun 11 13:09:25 2022 daemon.debug td-client: [broker2.ff-en.de:10122] Broker usage: 4587
Sat Jun 11 13:09:25 2022 daemon.info td-client: Selected broker1.ff-en.de:10122 as the best broker.
Sat Jun 11 13:09:25 2022 daemon.info td-client: [broker1.ff-en.de:10122] Tunnel successfully established.
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' is enabled
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Network device 'mesh-vpn' link is up
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' has link connectivity
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' is setting up now
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' is now up
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.162513] batman_adv: bat0: Adding interface: mesh-vpn
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.168110] batman_adv: bat0: The MTU of interface mesh-vpn is too small (1420) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1532 would solve the problem.
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.192795] batman_adv: bat0: Interface activated: mesh-vpn
Sat Jun 11 13:09:26 2022 daemon.err td-client: [broker1.ff-en.de:10122] Unable to create local L2TP tunnel!
Sat Jun 11 13:09:26 2022 daemon.err td-client: [broker1.ff-en.de:10122] Connection lost.
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.348378] batman_adv: bat0: Interface deactivated: mesh-vpn
Sat Jun 11 13:09:26 2022 daemon.notice netifd: Network device 'mesh-vpn' link is down
But more interesting is this `ps wwwww` output.
root@FF-WTT-Citymanagement-Wetter:~# ps wwwww
PID USER VSZ STAT COMMAND
1 root 1568 S /sbin/procd
2 root 0 SW [kthreadd]
4 root 0 IW< [kworker/0:0H]
6 root 0 IW< [mm_percpu_wq]
7 root 0 SW [ksoftirqd/0]
43 root 0 SW [oom_reaper]
90 root 0 IW< [writeback]
92 root 0 SW [kcompactd0]
93 root 0 IW< [crypto]
95 root 0 IW< [kblockd]
116 root 0 IW< [watchdogd]
129 root 0 IW [kworker/0:1]
138 root 0 SW [kswapd0]
197 root 0 SW [spi0]
292 root 0 IW< [ipv6_addrconf]
300 root 0 IW< [kworker/0:1H]
379 root 0 SWN [jffs2_gcd_mtd4]
446 root 1216 S /sbin/ubusd
447 root 924 S /sbin/askfirst /usr/libexec/login.sh
464 root 1028 S /sbin/urngd
635 root 0 IW< [cfg80211]
659 root 0 IW< [bat_events]
769 root 1296 S /sbin/logd -S 64
821 dnsmasq 1348 S /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
836 root 0 IW [kworker/0:2]
855 root 1080 S /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
1022 root 932 S /usr/sbin/gluon-arp-limiter
1069 root 1764 S /sbin/netifd
1148 root 932 S /usr/sbin/uradvd -i local-node -p 2a11:6c6:6000:feed::/64
1388 root 1212 S udhcpc -p /var/run/udhcpc-br-wan.pid -s /lib/netifd/dhcp.script -f -t 0 -i br-wan -x hostname:FF-WTT-Citymanagement-Wetter -C -O 121
1391 root 1052 S odhcp6c -s /lib/netifd/dhcpv6.script -t120 br-wan
1396 root 1052 S odhcp6c -s /lib/netifd/dhcpv6.script -L -m30 -t120 br-client
1531 root 1124 S /usr/sbin/gluon-radv-filterd -i br-client -c RADV_FILTER -t 20
1587 root 932 S /usr/sbin/micrond /usr/lib/micron.d
1664 root 928 S /usr/sbin/sse-multiplexd
1760 root 1364 S /usr/sbin/uhttpd -f -h /lib/gluon/status-page/www -r FF-WTT-Citymanagement-Wetter -x /cgi-bin -t 60 -T 30 -k 20 -A 1 -n 32 -N 100 -R -p 0.0.0.0:80 -p [::]:80
1930 root 1352 S /usr/sbin/dnsmasq -x /var/run/gluon-wan-dnsmasq.pid -u root -i lo -p 54 -h -r /var/gluon/wan-dnsmasq/resolv.conf
1941 root 1068 S /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
1950 nobody 1040 S /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
1951 nobody 1040 S /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
2049 root 1584 S /usr/sbin/hostapd -s -P /var/run/wifi-phy0.pid -B /var/run/hostapd-phy0.conf
2073 root 1216 S< /usr/sbin/ntpd -n -N -S /usr/sbin/ntpd-hotplug -p 2a11:6c6:6000:feed:ff::1 -p 2a11:6c6:6000:feed:ff::2 -p 91.212.181.1
4095 root 0 IW [kworker/u2:0]
4204 root 2324 S /usr/bin/respondd -d /usr/lib/respondd -p 1001 -g ff02::2:1001 -i mesh0 -g ff05::2:1001 -i br-client -t 10
4519 root 1148 S /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
4563 root 1220 S -ash
4979 root 1068 S /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
4980 nobody 1040 S /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
4981 nobody 1040 S /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
5038 root 1212 R ps wwwww
28583 root 0 IW [kworker/u2:3]
32214 root 0 IW [kworker/u2:2]
root@FF-WTT-Citymanagement-Wetter:~#
It looks like it gets started twice, and this lasts until the next reboot (eventually).
I suspect that some changes to how the tunneldigger watchdog kills the process introduced this?
Kind regards
Matthias
> It looks like it gets started twice
Oh, yeah that would not be good. So this is an interesting lead indeed. In fact there are 6 entries for tunneldigger on your list? On one of my devices I see 3, so I guess that is the normal number. (Is this one line per thread? And then the main process is forking and we still see the parent process? Or something like that?)
Which watchdog changes are you talking about? We have our own separate watchdog at https://git.hacksaar.de/FreifunkSaar/gluon-ffsaar/-/tree/master/gluon-ffsaar-watchdog since the "built-in" one has proven to be insufficient, but that one does a full device reboot, so that can't be it.
(In fact as a work-around I could make that watchdog initiate a reboot when it sees more than 3 instances of /usr/bin/tunneldigger...)
This would have to be a bug in `/etc/init.d/tunneldigger`, I think? That just uses `/sbin/start-stop-daemon` though, so it'd be a bug in there.
> Oh, yeah that would not be good. So this is an interesting lead indeed. In fact there are 6 entries for tunneldigger in your list? On one of my devices I see 3, so I guess that is the normal number. (Is this one line per thread? And then the main process is forking and we still see the parent process? Or something like that?)
Indeed, it's the main process plus one thread for each broker in the broker list.
> Which watchdog changes are you talking about? We have our own separate watchdog at https://git.hacksaar.de/FreifunkSaar/gluon-ffsaar/-/tree/master/gluon-ffsaar-watchdog since the "built-in" one has proven to be insufficient, but that one does a full device reboot, so that can't be it. (In fact as a work-around I could make that watchdog initiate a reboot when it sees more than 3 instances of /usr/bin/tunneldigger...)
Well, I stopped building firmware for our community long ago; someone else is doing it now.
I am referring to the old watchdog that was inside tunneldigger before it was accepted into the Gluon repo.
It included a `killall`, which I assume covered cases like this; the newer one only touches matching PIDs, as far as I can tell.
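The difference can be demonstrated with a toy example (using `sleep` as a stand-in, nothing tunneldigger-specific): a pid file only tracks one instance, so a pid-file based stop leaves a second, untracked copy running, while a name-based `killall` reaps every instance.

```shell
#!/bin/sh
# Start two copies of the same program, but record only one pid.
sleep 30 & echo $! > /tmp/td-demo.pid   # "tracked" instance, pid recorded
sleep 30 & stray=$!                     # "stray" instance, pid never recorded

kill "$(cat /tmp/td-demo.pid)"          # pid-file stop: only the tracked copy dies
if kill -0 "$stray" 2>/dev/null; then
    echo "stray instance survived the pid-file stop"
fi
killall sleep 2>/dev/null || true       # name-based kill catches the stray too
```

Running this prints that the stray survived the pid-file stop, which mirrors how a second tunneldigger instance could outlive an init-script `stop`.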
> This would have to be a bug in `/etc/init.d/tunneldigger`, I think? That just uses `/sbin/start-stop-daemon` though, so it'd be a bug in there.
You mean this?
https://github.com/freifunk-gluon/packages/blob/master/net/tunneldigger/files/tunneldigger.init#L79
I don't know how routers end up with two instances of tunneldigger,
but it could be a starting point to look into why it happens randomly after daily reboots.
> Indeed, it's the main process plus one thread for each broker in the broker list.
That's not what I see. We have 4 brokers and I see 3 processes total.
# ps | grep tunneldigger
1972 root 1068 S /usr/bin/tunneldigger -u 30b5c26e9d6b -i mesh-vpn -t 1 -b gw1.saar.freifunk.net:10001 -b gw2.saar.freifunk.net:10001 -b gw3.saar.freifunk.net:10001 -b gw4.saar.freifunk.net:10001 -I br-wan -a
1973 nobody 1040 S /usr/bin/tunneldigger -u 30b5c26e9d6b -i mesh-vpn -t 1 -b gw1.saar.freifunk.net:10001 -b gw2.saar.freifunk.net:10001 -b gw3.saar.freifunk.net:10001 -b gw4.saar.freifunk.net:10001 -I br-wan -a
1974 nobody 1040 S /usr/bin/tunneldigger -u 30b5c26e9d6b -i mesh-vpn -t 1 -b gw1.saar.freifunk.net:10001 -b gw2.saar.freifunk.net:10001 -b gw3.saar.freifunk.net:10001 -b gw4.saar.freifunk.net:10001 -I br-wan -a
OK, then that assumption was wrong.
Today I have more than 100 routers affected by this problem, and the only way to keep
the gateways running while they get hammered is to block them with a rate limiter.
This is how it looks on one of the gateways.
I think I will modify that tunneldigger watchdog in Gluon and add a killall to see if that fixes this mess.
Hey @RalfJung, I just wanted to let you know that we have overcome this issue by patching
the tunneldigger-watchdog that ships with Gluon. We couldn't investigate deeper and needed
a quick solution, because lots of our refugee installations were hit by this problem.
It seems that `killall -KILL tunneldigger` was our saviour here.
Kind regards
#!/usr/bin/lua
local uci = require('simple-uci').cursor()

local function restart_tunneldigger()
    os.execute('logger -t tunneldigger-watchdog "Restarting Tunneldigger."')
    os.execute('/etc/init.d/tunneldigger stop')
    os.execute('sleep 1')
    -- "stop" only signals the pid from the pid file; kill any stray instances too
    os.execute('killall -KILL tunneldigger')
    os.execute('rm -f /var/run/tunneldigger.mesh-vpn.pid')
    os.execute('sleep 5')
    os.execute('/etc/init.d/tunneldigger start')
end

local function read_pid_file()
    local pid_file = io.open('/var/run/tunneldigger.mesh-vpn.pid', 'r')
    if not pid_file then
        return nil
    end
    local pid = pid_file:read('*l')
    pid_file:close()
    return pid
end

local function has_mesh_vpn_neighbours()
    local handle = io.popen('batctl o', 'r')
    if not handle then
        return false
    end
    for line in handle:lines() do
        if line:find('mesh%-vpn') then
            handle:close()
            return true
        end
    end
    handle:close()
    return false
end

if uci:get_bool('tunneldigger', 'mesh_vpn', 'enabled') then
    if io.popen('pgrep -x /usr/bin/tunneldigger'):read('*l') ~= read_pid_file() then
        os.execute('logger -t tunneldigger-watchdog "Process pid does not match the pid file."')
        restart_tunneldigger()
        return
    end
    if not has_mesh_vpn_neighbours() then
        os.execute('logger -t tunneldigger-watchdog "No mesh-vpn neighbours found."')
        restart_tunneldigger()
        return
    end
end
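For reference, Gluon-style firmwares schedule such watchdogs through micrond (visible in the `ps` output above as `/usr/sbin/micrond /usr/lib/micron.d`). A crontab entry along these lines would run the script periodically; the file name, script path and interval here are assumptions, not the actual packaging:

```
# /usr/lib/micron.d/tunneldigger-watchdog  (assumed file name and interval)
* * * * *    /usr/bin/lua /lib/gluon/tunneldigger-watchdog.lua
```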
@RalfJung I've sent my feedback, but I'm no coder; maybe the experts on this can find out
the real cause that triggers this unwanted behaviour of multiple running tunneldigger instances.