When running tunneldigger on the current Debian stable kernel (4.9.51), only one client can connect. The second client fails because the l2tp tunnel interface does not appear. After fixing a bug in the netlink interface (#50), one can see that the kernel sends an EEXIST in reply to the session_create.
A lot of digging through the Linux kernel sources uncovered the cause: L2TPv3 session IDs have to be unique system-wide, but Tunneldigger hard-codes a session ID of 1 for every connection. That used to work only because of a kernel bug: the kernel failed to actually enforce uniqueness of the session ID. That bug got fixed by https://github.com/linux-stable/linux-stable/commit/dbdbc73b44782e22b3b4b6e8b51e7a3d245f3086, which was backported to a few stable series, in particular to 4.9.36.
Proposed fix
Fixing this in a compatible way will require protocol changes: both ends of the tunnel have to know each other's session ID, so they have to negotiate whether they use 1 or something more unique. I started working on a fix at https://github.com/freifunk-saar/tunneldigger/tree/wlanslovenija. The approach is summarized in the commit message over there, copied here for reference:
This patch adds unique session IDs to tunneldigger in a backwards-compatible
way. If both ends of the tunnel agree to use a unique session ID, they both
will use the tunnel ID as the session ID. To manage this mutual agreement, two
messages in the protocol are changed:
CONTROL_TYPE_PREPARE gains a new optional byte at the end that clients use to
indicate to the server whether they want to use a unique session ID. Old
servers will just ignore this additional byte. New servers now know they are
talking with a modern client, and use unique session IDs for this connection.
New servers talking with old clients will notice the absence of this request and
use 1 as the session ID.
Furthermore, CONTROL_TYPE_TUNNEL gains a new optional byte at the end that
servers use to tell clients that they acknowledge using unique session IDs. Old
clients will never see this additional byte, as the server only sends it if
unique session IDs were requested in CONTROL_TYPE_PREPARE. New clients know,
upon seeing this byte, that they are talking to a new server, and will hence use
unique session IDs. If a new client talks to an old server, it will receive an
old-style CONTROL_TYPE_TUNNEL and hence know that it has to use session ID 1.
So, both old and new clients can talk with both old and new servers. However, of
course, if the server has a recent enough kernel, even though it can communicate
with old clients, it still can only support one old client at a time.
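
To make the four client/server combinations concrete, here is a minimal Python sketch of the decision logic described in the commit message. It is not the code from the patch: the function names, the boolean flags standing in for "the optional byte was present", and the `negotiate` helper are all made up for illustration, and the actual wire format is not modeled.

```python
from typing import Tuple

LEGACY_SESSION_ID = 1  # the session ID that old clients and servers hard-code


def server_session_id(tunnel_id: int, prepare_has_flag: bool) -> Tuple[int, bool]:
    """New server: return (session_id, send_ack_byte).

    prepare_has_flag is a hypothetical stand-in for "CONTROL_TYPE_PREPARE
    carried the optional unique-session-ID byte".
    """
    if prepare_has_flag:
        return tunnel_id, True
    return LEGACY_SESSION_ID, False


def client_session_id(tunnel_id: int, tunnel_has_ack: bool) -> int:
    """New client: use the tunnel ID only if the server acknowledged it."""
    return tunnel_id if tunnel_has_ack else LEGACY_SESSION_ID


def negotiate(tunnel_id: int, new_client: bool, new_server: bool) -> Tuple[int, int]:
    """Simulate one handshake and return (client_sid, server_sid)."""
    # Old clients never send the extra byte; new clients always do.
    prepare_has_flag = new_client
    if new_server:
        server_sid, ack = server_session_id(tunnel_id, prepare_has_flag)
    else:
        # An old server ignores the extra byte and never acknowledges it.
        server_sid, ack = LEGACY_SESSION_ID, False
    if new_client:
        client_sid = client_session_id(tunnel_id, ack)
    else:
        # An old client always uses the hard-coded ID.
        client_sid = LEGACY_SESSION_ID
    return client_sid, server_sid
```

In this model both ends always end up with the same session ID, and only the new-client/new-server pairing moves away from 1.
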
I am running two of our four servers with this fix, so compatibility with old clients is tested already. However, due to #55, I can't say anything about long-term stability yet. I also couldn't yet test new clients as I am still fighting my firmware build system. (The client uses such an ancient version of libnl that I can't build it on the host.)
Open problems
As the last paragraph in the commit description says, there still is a potential problem: Once we upgrade one of our servers to a kernel including the problematic bugfix, only one old client will be able to talk to it at a time. There is nothing we can do about this, but what we want to avoid is a client trying to connect to a new server and failing, while there are old servers (with higher usage) that could still support this client. I first tried to (ab)use CONTROL_TYPE_USAGE to let the client indicate whether it supports unique session IDs, so that the server could report "I am full" to old clients and steer them elsewhere. However, clients actually seem to send some rather arbitrary data alongside that message (UUUUUUUU, to be precise -- wtf?!?), so I am worried that attaching meaningful bytes here will not work very well. We could introduce a CONTROL_TYPE_USAGE2, but I think I have a better idea.
Clients already have a retry loop to connect again if the connection to the broker failed. I think clients should remember which broker failed, and exclude that one in the next round. Only once all brokers have been excluded that way are they all enabled again. This will, I think, improve client behavior in general, not just for this particular issue. It will also solve this issue because (after #50) brokers will send an error to clients when the session ID is already used, making the client try some other server. So, as long as one of the available brokers still has an old kernel, old clients will reliably be able to connect. Furthermore, even if all N servers are on a new kernel, there can still be N old clients connected at the same time, one per server (and hopefully, they will fetch an auto-update and then become new clients).
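
The exclusion idea can be sketched roughly as follows. This is a Python illustration only (the real client is written in C), and `BrokerSelector`, `connect_with_exclusion`, `try_connect` and `max_rounds` are hypothetical names. The sketch also simply picks the first remaining broker, whereas the real client takes broker usage into account.

```python
class BrokerSelector:
    """Tracks which brokers failed so they are skipped in the next round."""

    def __init__(self, brokers):
        self.brokers = list(brokers)
        self.excluded = set()

    def candidates(self):
        """Return brokers not yet excluded; re-enable all once none are left."""
        available = [b for b in self.brokers if b not in self.excluded]
        if not available:
            # Every broker has failed once: start over with the full list.
            self.excluded.clear()
            available = list(self.brokers)
        return available

    def mark_failed(self, broker):
        self.excluded.add(broker)


def connect_with_exclusion(selector, try_connect, max_rounds=10):
    """Retry loop: try a candidate, exclude it on failure, then retry.

    try_connect is a caller-supplied callable returning True on success.
    Returns the broker that accepted the connection, or None.
    """
    for _ in range(max_rounds):
        broker = selector.candidates()[0]  # placeholder for usage-based choice
        if try_connect(broker):
            return broker
        selector.mark_failed(broker)
    return None
```

For example, with brokers `["a", "b", "c"]` where only `"c"` accepts the connection, the loop tries `a`, then `b`, then succeeds on `c` instead of hammering `a` repeatedly.
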
I started implementing this, but got stuck yesterday due to the aforementioned build system issues.