yahoo / l3dsr
Direct Server Return load balancing across Layer 3 boundaries.
Direct Server Return (DSR) load balancing is a common way to distribute network traffic using an approach that currently requires the load balancer and all hosts behind the Virtual IP (VIP) to be within the same Layer 2 broadcast domain. This is a severe limitation that hinders scaling VIPs beyond a single contiguous subnet. To overcome this limitation, we present a method to perform DSR load balancing across Layer 3 boundaries (``L3DSR''), a solution that allows Yahoo! to serve up to ten times as many VIPs on a single hardware Load Balancer compared to other Layer 3 load balancing methods.

In order to overcome Layer 2 limitations, we use the 6-bit Differentiated Services Code Point (DSCP) field of the IPv4 header used for packet classification to relay information to the server. The server inspects the header and rewrites the destination address based on the value of the DSCP field and according to its own mapping of DSCP values to destination addresses.

L3DSR is currently supported by:

- A10 AX3200 >= 2.2.5
- Brocade ADX Series >= 12.1d
- Brocade/Foundry ServerIron 450 - M7 and JetCore blades - >= 12.2.01p
- Citrix Netscaler running 8.x, 9.x
- Radware Alteon 4408, 4416, 5412 - SW versions 27 and above
- Radware AppDirector (All platforms) - 2.10 and above, requires the optional BWM license

On the server, L3DSR is currently supported by:

- FreeBSD >= 6.x
- RHEL4 >= 4.7 (IPv4 only), RHEL5 >= 5.4 (IPv6 >= 5.9), RHEL6 >= 6.0, and Fedora 17

L3DSR was developed at Yahoo! Inc. If you have questions or comments, please contact: Jan Schaumann <[email protected]> (overall design), Carl Stanley <[email protected]> (LBs), Quentin Barnes <[email protected]> (iptables-daddr), or Wayne Badger <[email protected]> (dsrtools/yvipagent).
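As a rough illustration of the server-side mapping described in the README, the sketch below prints one iptables DADDR rule per DSCP-to-VIP pair. The rule shape follows the commands quoted in the issues further down this page. The helper name emit_daddr_rules, the "DSCP VIP" input format, and the 192.0.2.x addresses are assumptions made for this example, and the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper (not part of l3dsr): read "DSCP VIP" pairs on stdin
# and print one DADDR rule per pair. Nothing is applied to the system.
emit_daddr_rules() {
    while read -r dscp vip; do
        # Skip blank lines and comment lines in the mapping.
        case "$dscp" in ''|'#'*) continue ;; esac
        printf 'iptables -t mangle -A PREROUTING -m dscp --dscp %s -j DADDR --set-daddr %s\n' \
            "$dscp" "$vip"
    done
}

# Example mapping with illustrative addresses.
printf '%s\n' '10 192.0.2.1' '11 192.0.2.2' | emit_daddr_rules
```

Printing the rules first makes it easy to review the mapping before anything touches the packet path.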
Hello,
The build fails on kernel 5.3 or later.
Specifically, the commits that remove skb_make_writable were merged in kernel 5.3-rc1.
Therefore, skb_make_writable needs to be replaced with skb_ensure_writable in xt_DADDR.c for kernel 5.3 or later.
I have only been able to get the Linux module working when connection tracking is enabled. This is not ideal on high-traffic sites, as the connection table fills up and the kernel starts dropping packets. If you know of any workaround for this, please let me know.
I can't build it on Ubuntu 17.10 (kernel 4.13, iptables v1.6.1):
make -C '/lib/modules/4.13.0-16-generic/build' M='/root/l3dsr/linux/kmod-xt'
make[1]: Entering directory '/usr/src/linux-headers-4.13.0-16-generic'
AR /root/l3dsr/linux/kmod-xt/built-in.o
CC [M] /root/l3dsr/linux/kmod-xt/xt_DADDR.o
Building modules, stage 2.
MODPOST 1 modules
CC /root/l3dsr/linux/kmod-xt/xt_DADDR.mod.o
LD [M] /root/l3dsr/linux/kmod-xt/xt_DADDR.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.13.0-16-generic'
make -C 'extensions-1.4'
make[1]: Entering directory '/root/l3dsr/linux/extensions-1.4'
cc -O2 -g -Wall -Wunused -fPIC -I../kmod-xt -c -o libxt_DADDR.o libxt_DADDR.c
libxt_DADDR.c:19:10: fatal error: iptables.h: No such file or directory
#include <iptables.h>
^~~~~~~~~~~~
compilation terminated.
Attempting to build the module via "make all" is returning an error on a fresh CentOS 6.4 install.
The error returned is:
+ for kvariant in '""'
+ ksrc=/usr/src/kernels/
+ /usr/bin/make -C /usr/src/kernels/ M=/home/caw/l3dsr/linux/build-native-x86_64/BUILD/iptables-daddr- 0.6.2/_kmod_build_/kmod MODVERSION=0.6.2
make[2]: Entering directory `/usr/src/kernels'
make[2]: *** No targets specified and no makefile found. Stop.
make[2]: Leaving directory `/usr/src/kernels'
error: Bad exit status from /var/tmp/rpm-caw/rpm-tmp.U9KPQN (%build)
The file /var/tmp/rpm-caw/rpm-tmp.U9KPQN contains the following:
for kvariant in ""
do
    ksrc="/usr/src/kernels/${kvariant:+.$kvariant}"
    /usr/bin/make \
        -C "$ksrc" \
        M="$PWD/_kmod_build_${kvariant}/kmod" \
        MODVERSION='0.6.2'
done
It seems that the line 'for kvariant in ""' should contain the subdir of the kernel source directory, such as:
[caw@dsr-srv1 rpm-caw]$ ls -l /usr/src/kernels
total 4
drwxr-xr-x. 22 root root 4096 Sep 27 23:12 2.6.32-358.18.1.el6.x86_64
What I'm not getting is what data the rpm-tmp file gets built from. Assistance appreciated :)
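For what it's worth, rpm generates the rpm-tmp script from the spec file's %build section after expanding its macros, which is why the error message ends in "(%build)". The small sketch below only demonstrates the shell expansion used in that script; the function name kernel_src_dir and the "debug" variant are made up for the example. It shows that an empty kvariant contributes nothing to the path and a non-empty one only adds a ".variant" suffix, which suggests (an inference, not something confirmed here) that the kernel version part of the path came from a macro that expanded to nothing.

```shell
#!/bin/sh
# ${var:+word} expands to "word" only when var is set and non-empty.
# kernel_src_dir is a hypothetical name used for this demonstration only.
kernel_src_dir() {
    kvariant=$1
    printf '/usr/src/kernels/%s\n' "${kvariant:+.$kvariant}"
}

kernel_src_dir ""        # prints /usr/src/kernels/
kernel_src_dir "debug"   # prints /usr/src/kernels/.debug
```

Neither result names a versioned directory such as 2.6.32-358.18.1.el6.x86_64, so changing the kvariant list alone would probably not fix the path.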
The DADDR iptables plugin rule, iptables -t mangle -A INPUT -m dscp --dscp 1 -j DADDR --set-daddr=192.168.0.2, can be replaced by tc, so no plugin is needed to use l3dsr:
tc qdisc add dev eth0 root handle 1: htb
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 match u32 0x00040000 0x00ff0000 at 0 action nat ingress 192.168.0.3 192.168.0.2
where the u32 match 0x00040000 0x00ff0000 at 0 matches TOS 0x4, which is DSCP 1.
192.168.0.3 is the real server IP, and 192.168.0.2 is the VIP.
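To make the u32 arithmetic easier to reproduce for other DSCP values, here is a minimal sketch that derives the value and mask pair. The function name dscp_to_u32 is an assumption for this example. It relies on two facts: DSCP occupies the top 6 bits of the TOS byte (so TOS = DSCP << 2), and the TOS byte sits in bits 16-23 of the first 32-bit word of the IPv4 header.

```shell
#!/bin/sh
# Hypothetical helper: print the tc u32 "value mask" pair for a DSCP (0-63).
dscp_to_u32() {
    dscp=$1
    # DSCP is the upper 6 bits of the TOS byte.
    tos=$(( dscp << 2 ))
    # The TOS byte is bits 16-23 of the header's first 32-bit word (offset 0).
    printf '0x%08x 0x00ff0000\n' $(( tos << 16 ))
}

dscp_to_u32 1    # prints 0x00040000 0x00ff0000
dscp_to_u32 10   # prints 0x00280000 0x00ff0000
```

Note that the mask 0x00ff0000 compares the whole TOS byte, including the two ECN bits; a mask of 0x00fc0000 would compare the DSCP bits only.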
I'm publishing a beta branch. This is to synchronize our internal repo with this repo. At this point our internal repo will be retired and replaced with this repo as upstream. Hopefully, this will prevent them from getting too far out of sync in the future. Also, people won't have to report problems that have already been fixed internally and not yet published here. Lastly, everyone can try out the new code before it ends up on master.
This code is still waiting for two things before going to master:

- [...] yvipagent, dsrtools, has not yet been integrated into this repo.

Since dsrtools is not available yet, those of you who want to test the code out will need to do one of the following:

- [...] linux/kmod-xt/xt_DADDR.c from "raw" to "mangle".
- [...] /etc/modprobe.d. See the paragraph starting at line 30 in linux/USING for details.
- [...] --with mangle option (autogenerates the /etc/modprobe.d/xt_DADDR.conf file for you).

Some highlights of the beta branch:

- [...] --without kmod when building rpms to prevent generating the kmod package.
- xt_DADDR's table value (raw or mangle) can be examined by reading /sys/module/xt_DADDR/parameters/table.
- [...] .bz2 to .xz.
- [...] pre and preun rpm scripts checking for kernel module.
- [...] hw csum failure for NICs using CHECKSUM_COMPLETE.
- [...] kmodtool directly, but call the %kernel_module_package macro.
- [...] mock build problem when native rpmdb format is different than its chroot.
- [...] --with mangle and --with override rpm build options.
- [...] skb_ensure_writable().
- [...] DESTDIR.

The test ip4.l3.007 appears to be flaky: the output doesn't always have mangle in the iptbl column for the "stopped" cases. Tested in the beta branch on rhel8 (4.18.0-193.19.1.el8_2.x86_64). The output of an additional run is copied into the gist https://gist.github.com/dmitris/7bf36afeb66743c9c7408348b116c8d4.
$ sudo DSRCTL='/home/dmitris/dev/hack/github.com/yahoo/l3dsr/linux/dsrtools/src/dsrctl' Tname='ip4.l3.007.d' ../runtest -t mangle
1,7c1,7
< type state name ipaddr dscp loopback iptables iptbl src
< ===== ======= ============= ============= ==== ======== ======== ===== ====
< l3dsr stopped 188.125.67.1 188.125.67.1 10 -- -- -- conf
< l3dsr stopped 188.125.67.2 188.125.67.2 11 -- -- -- conf
< l3dsr stopped 188.125.67.3 188.125.67.3 12 -- -- -- conf
< loopb started 188.125.67.68 188.125.67.68 -- lo:1 -- -- disc
< loopb started 188.125.67.69 188.125.67.69 -- lo:2 -- -- disc
---
> type state name ipaddr dscp loopback iptables iptbl src
> ===== ======= ============= ============= ==== ======== ======== ====== ====
> l3dsr stopped 188.125.67.1 188.125.67.1 10 -- -- mangle conf
> l3dsr stopped 188.125.67.2 188.125.67.2 11 -- -- mangle conf
> l3dsr stopped 188.125.67.3 188.125.67.3 12 -- -- mangle conf
> loopb started 188.125.67.68 188.125.67.68 -- lo:1 -- -- disc
> loopb started 188.125.67.69 188.125.67.69 -- lo:2 -- -- disc
The above difference is with expected.status.5.
Actual rv=1 Expected rv=0
===== FAILED: 2020-10-07 11:23:00: ip4.l3.007.d IPv4 L3DSR with other loopbacks already created
$ sudo DSRCTL='/home/dmitris/dev/hack/github.com/yahoo/l3dsr/linux/dsrtools/src/dsrctl' Tname='ip4.l3.007.d' ../runtest -t mangle
===== PASSED: 2020-10-07 11:23:03: ip4.l3.007.d IPv4 L3DSR with other loopbacks already created
$ sudo DSRCTL='/home/dmitris/dev/hack/github.com/yahoo/l3dsr/linux/dsrtools/src/dsrctl' Tname='ip4.l3.007.d' ../runtest -t mangle
1,7c1,7
< type state name ipaddr dscp loopback iptables iptbl src
< ===== ======= ============= ============= ==== ======== ======== ===== ====
< l3dsr stopped 188.125.67.1 188.125.67.1 10 -- -- -- conf
< l3dsr stopped 188.125.67.2 188.125.67.2 11 -- -- -- conf
< l3dsr stopped 188.125.67.3 188.125.67.3 12 -- -- -- conf
< loopb started 188.125.67.68 188.125.67.68 -- lo:1 -- -- disc
< loopb started 188.125.67.69 188.125.67.69 -- lo:2 -- -- disc
---
> type state name ipaddr dscp loopback iptables iptbl src
> ===== ======= ============= ============= ==== ======== ======== ====== ====
> l3dsr stopped 188.125.67.1 188.125.67.1 10 -- -- mangle conf
> l3dsr stopped 188.125.67.2 188.125.67.2 11 -- -- mangle conf
> l3dsr stopped 188.125.67.3 188.125.67.3 12 -- -- mangle conf
> loopb started 188.125.67.68 188.125.67.68 -- lo:1 -- -- disc
> loopb started 188.125.67.69 188.125.67.69 -- lo:2 -- -- disc
The above difference is with expected.status.5.
Actual rv=1 Expected rv=0
===== FAILED: 2020-10-07 11:23:07: ip4.l3.007.d IPv4 L3DSR with other loopbacks already created
$ pwd
/home/dmitris/dev/hack/github.com/yahoo/l3dsr/linux/dsrtools/tests/ip4.l3.007.d
Hello,
We use l3dsr on RPM-based distros.
However, the spec file (iptables-daddr.spec) does not support CentOS 8.
Are there plans to support CentOS 8 soon?
I edited the spec file to force the build to succeed:
# diff ~/build/l3dsr/linux/rpm/iptables-daddr.spec iptables-daddr.spec
14a15,17
> %if "%{dist}" == ".el8"
> %define rhel_version 700
> %endif
109,110c112,113
< BuildRequires: iptables-devel >= 1.4.7, iptables-devel < 1.5
< Requires: iptables >= 1.4.7, iptables < 1.5
---
> BuildRequires: iptables-devel >= 1.4.7, iptables-devel < 1.9
> Requires: iptables >= 1.4.7, iptables < 1.9
After installing this rpm package and doing a simple test, it looks like it works as shown below.
# iptables -t mangle -S
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-A PREROUTING -m dscp --dscp 0x0a -j DADDR --set-daddr 1.1.1.1 <---------------------------- vip set with iptables
# nft list table ip mangle
table ip mangle {
chain PREROUTING {
type filter hook prerouting priority -150; policy accept;
ip dscp 0x0a counter packets 25 bytes 2100 # DADDR set 1.1.1.1 <----------- Settings converted to nftables
}
chain INPUT {
type filter hook input priority -150; policy accept;
}
chain FORWARD {
type filter hook forward priority -150; policy accept;
}
chain OUTPUT {
type route hook output priority -150; policy accept;
}
chain POSTROUTING {
type filter hook postrouting priority -150; policy accept;
}
}
# tcpdump -ni eth0 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:22:22.174193 IP XX.XX.XX.XX > YY.YY.YY.YY: ICMP echo request, id 31130, seq 1, length 64
17:22:22.174232 IP 1.1.1.1 > XX.XX.XX.XX: ICMP echo reply, id 31130, seq 1, length 64
17:22:23.174028 IP XX.XX.XX.XX > YY.YY.YY.YY: ICMP echo request, id 31130, seq 2, length 64
17:22:23.174066 IP 1.1.1.1 > XX.XX.XX.XX: ICMP echo reply, id 31130, seq 2, length 64
17:22:24.174083 IP XX.XX.XX.XX > YY.YY.YY.YY: ICMP echo request, id 31130, seq 3, length 64
17:22:24.174127 IP 1.1.1.1 > XX.XX.XX.XX: ICMP echo reply, id 31130, seq 3, length 64
# iptables -t mangle -L -v
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
33 2772 DADDR all -- any any anywhere anywhere DSCP match 0x0a DADDR set 1.1.1.1
It seems to work for the time being.
RHEL 8 publishes kernel-abi-stablelists that Obsoletes kernel-abi-whitelists, but the Obsoletes is not present in RHEL 9.
We encountered a "hw csum failure" for the latest Mellanox driver which turned up a long latent bug in the xt_DADDR.c code. I'm making the patch available now on the hw_csum_failure branch for those that run across the issue before I can get some other fixes into master.
Some of the other upcoming changes that are pending for master are:

- [...] dsrtools which deprecates yvipagent

If you'd like any of these changes before I can publish them on github, let me know and I'll see if I can send you an isolated patch for it.
The beta branch now has dsrtools available.
dsrtools is a replacement for yvipagent. We've been using this tool internally for several years now on our RHEL 6 and RHEL 7 production systems.
As part of this latest update to beta, the tree has been reorganized. The directories and files that were immediately under the linux directory are now in a subdirectory named iptables-daddr. Also under linux is dsrtools.
Under the dsrtools directory, you'll find README, INSTALL, and USING documentation as well as man pages under the src directory for dsrctl(8) and dsr.conf(5).
Also under dsrtools is a tests directory. It is an extensive suite of tests. These tests can be used to validate changes to dsrtools to ensure they don't introduce regressions.
dsrtools is necessary for newer versions of iptables-daddr (1.9.0 and later) that use the raw table by default. If you wish to continue to use yvipagent, you'll have to force your version of iptables-daddr back to mangle. See Issue #5 on how to do that.
yvipagent is still available for now in the repository, but will be deprecated in a future release.
If you do any testing with these latest updates on the beta branch, please let us know how it goes, good or bad.
Once we get enough feedback, we'll merge the beta branch to master.
If you find any problems or have any questions about dsrtools, feel free to ask them here or contact Wayne Badger ([email protected]) or Quentin Barnes ([email protected]).
Someone suggested I consider using the "raw" table instead of the "mangle" table for this module so that it would run ahead of the conntrack module. That way, the daddr rewriting wouldn't confuse conntrack's tracking. I tried the idea out with some limited testing, and it seems to work, but I'm cautious about the move because I have not been able to find much documentation on the "raw" table.
I wrote a note on netdev a couple of weeks ago (https://www.mail-archive.com/[email protected]/msg125234.html), but so far no help.
Has anyone also hit the problem with conntrack, tried any workarounds, or has comments on using the "raw" table?
TASK [services : restart dsrctl service] ***************************************
00:14:46 fatal: [host.example.com]: FAILED! => {"changed": false, "msg": "Unable to restart service dsr: Job for dsr.service failed because the control process exited with error code. See \"systemctl status dsr.service\" and \"journalctl -xe\" for details.\n"}
systemctl status dsr.service
dsr.service - DSR control
Loaded: loaded (/usr/lib/systemd/system/dsr.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2021-05-11 02:58:32 UTC; 6 days ago
Docs: man:dsrctl(8)
Process: 15427 ExecStop=/usr/sbin/dsrctl stop (code=exited, status=0/SUCCESS)
Process: 15445 ExecStart=/usr/sbin/dsrctl start (code=exited, status=1/FAILURE)
Main PID: 15445 (code=exited, status=1/FAILURE)
May 11 02:58:32 host.example.com systemd[1]: Starting DSR control...
May 11 02:58:32 host.example.com dsrctl[15445]: Failed to get iptables (iptables -L -t raw -n).
May 11 02:58:32 host.example.com systemd[1]: dsr.service: main process exited, code=exited, status=1/FAILURE
May 11 02:58:32 host.example.com systemd[1]: Failed to start DSR control.
May 11 02:58:32 host.example.com systemd[1]: Unit dsr.service entered failed state.
May 11 02:58:32 host.example.com systemd[1]: dsr.service failed.
STDERR: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Noticed on: 7.9.15-1.el7
$ rpm -q dsrtools
dsrtools-1.4.0-20210314.02.el7.noarch
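Since the failure above comes from a transient xtables lock, one possible stopgap on the caller's side is to retry the failing command a few times. The sketch below is a generic retry wrapper, not something dsrtools provides; the function name retry and the one-second pause are assumptions. The commented usage line also passes -w, the wait option that the error message itself suggests.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command up to N times, pausing between tries.
retry() {
    attempts=$1
    shift
    i=1
    while :; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        # Give up once the attempt budget is used.
        [ "$i" -ge "$attempts" ] && return 1
        i=$(( i + 1 ))
        sleep 1
    done
}

# Possible usage (needs root and an iptables build that supports -w):
# retry 3 iptables -w -L -t raw -n
```

Retrying only papers over the contention; finding which process holds the lock during the restart is still worthwhile.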