
l3dsr's Introduction

Direct Server Return (DSR) load balancing is a common way to distribute
network traffic using an approach that currently requires the load
balancer and all hosts behind the Virtual IP (VIP) to be within the same
Layer 2 broadcast domain.  This is a severe limitation that hinders
scaling VIPs beyond a single contiguous subnet.  To overcome this
limitation, we present a method to perform DSR load balancing across Layer
3 boundaries (``L3DSR''), a solution that allows Yahoo! to serve up to ten
times as many VIPs on a single hardware Load Balancer compared to other
Layer 3 load balancing methods.

In order to overcome Layer 2 limitations, we use the 6-bit Differentiated
Services Code Point (DSCP) field of the IPv4 header used for packet
classification to relay information to the server.  The server inspects
the header and rewrites the destination address based on the value of the
DSCP field and according to its own mapping of DSCP values to destination
addresses.
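
The lookup described above can be sketched in a few lines of Python. This is an illustration only, not the code of the kernel module: the mapping table, its addresses (taken from the 192.0.2.0/24 documentation range), and the function names are all hypothetical. The only fact relied on is that DSCP occupies the upper six bits of the IPv4 TOS byte.

```python
import ipaddress

# Hypothetical server-side mapping of DSCP values to VIPs (illustrative only).
DSCP_TO_VIP = {
    10: ipaddress.IPv4Address("192.0.2.10"),
    11: ipaddress.IPv4Address("192.0.2.11"),
}


def dscp_from_tos(tos):
    """Return the 6-bit DSCP value; the low two bits of the TOS byte are ECN."""
    return (tos & 0xFF) >> 2


def rewrite_destination(dst, tos, mapping=DSCP_TO_VIP):
    """Return the VIP mapped to the packet's DSCP, or dst unchanged if none."""
    return mapping.get(dscp_from_tos(tos), dst)
```

A packet arriving with TOS 0x28 (DSCP 10) would have its destination replaced by the first VIP in the table, while a packet with an unmapped DSCP keeps its original destination.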

L3DSR is currently supported by:
 - A10 AX3200 >= 2.2.5
 - Brocade ADX Series >= 12.1d
 - Brocade/Foundry ServerIron 450
   - M7 and JetCore blades
   - >= 12.2.01p
 - Citrix Netscaler running 8.x, 9.x
 - Radware Alteon 4408, 4416, 5412
   - SW versions 27 and above
 - Radware AppDirector (All platforms)
   - 2.10 and above, requires the optional BWM license

On the server, L3DSR is currently supported by:
 - FreeBSD >= 6.x
 - RHEL4 >= 4.7 (IPv4 only), RHEL5 >= 5.4 (IPv6 >= 5.9),
   RHEL6 >= 6.0, and Fedora 17

L3DSR was developed at Yahoo! Inc.  If you have questions or comments,
please contact:
   Jan Schaumann <[email protected]> (overall design),
   Carl Stanley <[email protected]> (LBs),
   Quentin Barnes <[email protected]> (iptables-daddr), or
   Wayne Badger <[email protected]> (dsrtools/yvipagent).

l3dsr's Issues

Linux module only works with ip_conntrack loaded

I have only been able to get the Linux module working when connection tracking is enabled. This is not ideal on high-traffic sites, because the connection table fills up and the kernel starts dropping packets. If you know of a workaround, please let me know.

Add new kernels and iptables 1.6 support

I can't build it on Ubuntu 17.10 (kernel:4.13, iptables:v1.6.1)

make -C '/lib/modules/4.13.0-16-generic/build' M='/root/l3dsr/linux/kmod-xt'
make[1]: Entering directory '/usr/src/linux-headers-4.13.0-16-generic'
  AR      /root/l3dsr/linux/kmod-xt/built-in.o
  CC [M]  /root/l3dsr/linux/kmod-xt/xt_DADDR.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /root/l3dsr/linux/kmod-xt/xt_DADDR.mod.o
  LD [M]  /root/l3dsr/linux/kmod-xt/xt_DADDR.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.13.0-16-generic'
make -C 'extensions-1.4'
make[1]: Entering directory '/root/l3dsr/linux/extensions-1.4'
cc -O2 -g -Wall -Wunused -fPIC -I../kmod-xt  -c -o libxt_DADDR.o libxt_DADDR.c
libxt_DADDR.c:19:10: fatal error: iptables.h: No such file or directory
 #include <iptables.h>
          ^~~~~~~~~~~~
compilation terminated.

Build error on CentOS 6.4 system

Building the module via "make all" returns an error on a fresh CentOS 6.4 install.

The error returned is:

+ for kvariant in '""'
+ ksrc=/usr/src/kernels/
+ /usr/bin/make -C /usr/src/kernels/ M=/home/caw/l3dsr/linux/build-native-x86_64/BUILD/iptables-daddr-0.6.2/_kmod_build_/kmod MODVERSION=0.6.2
make[2]: Entering directory `/usr/src/kernels'
make[2]: *** No targets specified and no makefile found.  Stop.
make[2]: Leaving directory `/usr/src/kernels'
error: Bad exit status from /var/tmp/rpm-caw/rpm-tmp.U9KPQN (%build)

The file /var/tmp/rpm-caw/rpm-tmp.U9KPQN contains the following:

for kvariant in ""
do
    ksrc="/usr/src/kernels/${kvariant:+.$kvariant}"
    /usr/bin/make \
        -C "$ksrc" \
        M="$PWD/_kmod_build_${kvariant}/kmod" \
        MODVERSION='0.6.2'
done

It seems that the line 'for kvariant in ""' should contain the subdir of the kernel source directory, such as:

[caw@dsr-srv1 rpm-caw]$ ls -l /usr/src/kernels
total 4
drwxr-xr-x. 22 root root 4096 Sep 27 23:12 2.6.32-358.18.1.el6.x86_64

What I'm not getting is what data the rpm-tmp file gets built from. Assistance appreciated :)
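
For readers following the script above, the trailing-slash path can be traced to how the shell expands ${kvariant:+.$kvariant}: the expansion yields ".$kvariant" only when kvariant is set and non-empty, and nothing otherwise. The Python sketch below emulates that expansion. The kversion parameter is hypothetical and stands in for a kernel-version component that the path appears to be missing; the source does not confirm where that component should come from.

```python
def expand_plus(value, word):
    """Emulate the shell's ${var:+word}: word if var is non-empty, else ''."""
    return word if value else ""


def ksrc(kvariant, kversion=""):
    """Build the kernel source path the way the generated script does.

    kversion is a hypothetical stand-in for the missing version component.
    """
    return "/usr/src/kernels/" + kversion + expand_plus(kvariant, "." + kvariant)
```

With an empty variant and no version, ksrc("") gives "/usr/src/kernels/", which is the directory make was told to enter in the log.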

tc nat can do the same job

The rule for the DADDR iptables plugin, iptables -t mangle -A INPUT -m dscp --dscp 1 -j DADDR --set-daddr=192.168.0.2, can be replaced by tc, so no plugin is needed to use l3dsr:

tc qdisc add dev eth0 root handle 1: htb
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 match u32 0x00040000 0x00ff0000 at 0 action nat ingress 192.168.0.3 192.168.0.2

where u32 0x00040000 0x00ff0000 at 0 matches TOS 0x04, which is DSCP 1.
192.168.0.3 is the real server IP, and 192.168.0.2 is the VIP.
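
The value and mask in the filter follow from the layout of the first 32-bit word of the IPv4 header (version/IHL, TOS, total length) and from DSCP being the upper six bits of the TOS byte. A small Python sketch of that arithmetic follows; the function name and the ignore_ecn option are illustrative additions, not part of tc.

```python
def u32_tos_match(dscp, ignore_ecn=False):
    """Return (value, mask) strings for a u32 match at offset 0 on a DSCP value.

    The TOS byte sits in bits 16-23 of the first header word. With
    ignore_ecn=True the mask covers only the six DSCP bits, so packets
    with ECN bits set still match.
    """
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit value (0-63)")
    value = (dscp << 2) << 16
    mask = (0xFC if ignore_ecn else 0xFF) << 16
    return f"0x{value:08x}", f"0x{mask:08x}"
```

For DSCP 1 this reproduces the pair used in the filter above. Note that the 0x00ff0000 mask also compares the two ECN bits, so a narrower mask may be preferable where ECN is in use.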

Beta branch now available

I'm publishing a beta branch. This is to synchronize our internal repo with this repo. At this point our internal repo will be retired and replaced with this repo as upstream. Hopefully, this will prevent them getting too far out of sync in the future. Also, people won't have to report problems that have already been fixed internally and not yet published here. Lastly, everyone can try out the new code before it ends up on master.

This code is still waiting for two things before going to master:

  • The code paths for IPv4 and IPv6 UDP are still untested.
  • The replacement for yvipagent, dsrtools, has not yet been integrated into this repo.

Since dsrtools is not available yet, those of you who want to test the code will need to do one of the following:

  • In your clone of this repo, modify line 41 in linux/kmod-xt/xt_DADDR.c from "raw" to "mangle".
  • Create a file under /etc/modprobe.d. See paragraph starting at line 30 in linux/USING for details.
  • Build the kmod rpm with the --with mangle option (autogenerates /etc/modprobe.d/xt_DADDR.conf file for you).
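
Whichever option is used, the table actually in effect can be checked through the module parameter that the beta branch exposes at /sys/module/xt_DADDR/parameters/table. The helper below is a minimal Python sketch of such a check; the function name is hypothetical, and it assumes the file holds a single word such as raw or mangle.

```python
from pathlib import Path

PARAM_PATH = Path("/sys/module/xt_DADDR/parameters/table")


def daddr_table(path=PARAM_PATH):
    """Return the table xt_DADDR is registered in, or None if unreadable.

    None usually means the module is not loaded or predates the parameter.
    """
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None
```

A test script could call daddr_table() before adding rules and skip the run when the result is not the table it expects.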

Some highlights of the beta branch:

  • RHEL 4 and RHEL 5 support deprecated.
  • Add option --without kmod when building rpms to prevent generating the kmod package.
  • xt_DADDR's table's value (raw or mangle) can be examined by reading /sys/module/xt_DADDR/parameters/table.
  • Generated source tarball suffix changed from .bz2 to .xz.
  • Add RHEL 8 support.
  • Remove pre and preun rpm scripts checking for kernel module.
  • Fix hw csum failure for NICs using CHECKSUM_COMPLETE.
  • Correct UDP checksum handling when checksum computes to 0.
  • Ignore UDP checksum generation when checksum is ignored.
  • Handle IPv6 ICMP packets correctly.
  • No longer call kmodtool directly, but call the %kernel_module_package macro.
  • Fix mock build problem when native rpmdb format is different than its chroot.
  • Add --with mangle and --with override rpm build options.
  • Support 5.3 and later kernels with skb_ensure_writable().
  • Add support for the make macro DESTDIR.
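
The two UDP checksum items in the list above concern general rules for rewriting an address in place. The Python sketch below is not the module's code; it illustrates the standard technique under stated assumptions: the incremental update from RFC 1624, the IPv4 rule that a received UDP checksum of 0 means "no checksum" and must be left alone, and the rule that a checksum computing to 0 is transmitted as 0xFFFF. All function names are illustrative.

```python
def ones_complement_add(a, b):
    """Add two 16-bit values with end-around carry."""
    total = a + b
    return (total & 0xFFFF) + (total >> 16)


def internet_checksum(words):
    """Full one's-complement checksum over a sequence of 16-bit words."""
    acc = 0
    for word in words:
        acc = ones_complement_add(acc, word)
    return ~acc & 0xFFFF


def update_checksum(old_csum, old_words, new_words):
    """Incremental update per RFC 1624: HC' = ~(~HC + ~m + m')."""
    acc = ~old_csum & 0xFFFF
    for old, new in zip(old_words, new_words):
        acc = ones_complement_add(acc, ~old & 0xFFFF)
        acc = ones_complement_add(acc, new)
    return ~acc & 0xFFFF


def udp_checksum_after_rewrite(old_csum, old_words, new_words):
    """Adjust an IPv4 UDP checksum after the destination address changes."""
    if old_csum == 0:
        # The sender did not compute a checksum; do not generate one.
        return 0
    # A result of 0 would read as "no checksum", so send 0xFFFF instead.
    return update_checksum(old_csum, old_words, new_words) or 0xFFFF
```

Here old_words and new_words are the two 16-bit halves of the old and new destination address. The UDP checksum is affected because it covers a pseudo-header that includes the destination address.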

flaky tests (beta branch, rhel8)

The test ip4.l3.007 appears to be flaky: the output doesn't always have mangle in the iptbl column for the "stopped" cases. Tested in the beta branch on RHEL 8 (4.18.0-193.19.1.el8_2.x86_64). The output of an additional run is copied into the gist https://gist.github.com/dmitris/7bf36afeb66743c9c7408348b116c8d4.

$ sudo DSRCTL='/home/dmitris/dev/hack/github.com/yahoo/l3dsr/linux/dsrtools/src/dsrctl' Tname='ip4.l3.007.d' ../runtest -t mangle
1,7c1,7
< type  state   name           ipaddr         dscp loopback iptables iptbl src
< ===== ======= =============  =============  ==== ======== ======== ===== ====
< l3dsr stopped 188.125.67.1   188.125.67.1   10   --       --       --    conf
< l3dsr stopped 188.125.67.2   188.125.67.2   11   --       --       --    conf
< l3dsr stopped 188.125.67.3   188.125.67.3   12   --       --       --    conf
< loopb started 188.125.67.68  188.125.67.68  --   lo:1     --       --    disc
< loopb started 188.125.67.69  188.125.67.69  --   lo:2     --       --    disc
---
> type  state   name           ipaddr         dscp loopback iptables iptbl  src
> ===== ======= =============  =============  ==== ======== ======== ====== ====
> l3dsr stopped 188.125.67.1   188.125.67.1   10   --       --       mangle conf
> l3dsr stopped 188.125.67.2   188.125.67.2   11   --       --       mangle conf
> l3dsr stopped 188.125.67.3   188.125.67.3   12   --       --       mangle conf
> loopb started 188.125.67.68  188.125.67.68  --   lo:1     --       --     disc
> loopb started 188.125.67.69  188.125.67.69  --   lo:2     --       --     disc
The above difference is with expected.status.5.
Actual rv=1 Expected rv=0
===== FAILED: 2020-10-07 11:23:00: ip4.l3.007.d IPv4 L3DSR with other loopbacks already created

$ sudo DSRCTL='/home/dmitris/dev/hack/github.com/yahoo/l3dsr/linux/dsrtools/src/dsrctl' Tname='ip4.l3.007.d' ../runtest -t mangle
===== PASSED: 2020-10-07 11:23:03: ip4.l3.007.d IPv4 L3DSR with other loopbacks already created

$ sudo DSRCTL='/home/dmitris/dev/hack/github.com/yahoo/l3dsr/linux/dsrtools/src/dsrctl' Tname='ip4.l3.007.d' ../runtest -t mangle
1,7c1,7
< type  state   name           ipaddr         dscp loopback iptables iptbl src
< ===== ======= =============  =============  ==== ======== ======== ===== ====
< l3dsr stopped 188.125.67.1   188.125.67.1   10   --       --       --    conf
< l3dsr stopped 188.125.67.2   188.125.67.2   11   --       --       --    conf
< l3dsr stopped 188.125.67.3   188.125.67.3   12   --       --       --    conf
< loopb started 188.125.67.68  188.125.67.68  --   lo:1     --       --    disc
< loopb started 188.125.67.69  188.125.67.69  --   lo:2     --       --    disc
---
> type  state   name           ipaddr         dscp loopback iptables iptbl  src
> ===== ======= =============  =============  ==== ======== ======== ====== ====
> l3dsr stopped 188.125.67.1   188.125.67.1   10   --       --       mangle conf
> l3dsr stopped 188.125.67.2   188.125.67.2   11   --       --       mangle conf
> l3dsr stopped 188.125.67.3   188.125.67.3   12   --       --       mangle conf
> loopb started 188.125.67.68  188.125.67.68  --   lo:1     --       --     disc
> loopb started 188.125.67.69  188.125.67.69  --   lo:2     --       --     disc
The above difference is with expected.status.5.
Actual rv=1 Expected rv=0
===== FAILED: 2020-10-07 11:23:07: ip4.l3.007.d IPv4 L3DSR with other loopbacks already created

$ pwd
/home/dmitris/dev/hack/github.com/yahoo/l3dsr/linux/dsrtools/tests/ip4.l3.007.d

About rpm for CentOS 8

Hello,
We use l3dsr on RPM-based distros.
However, its spec file does not support CentOS 8.

iptables-daddr.spec

Are there plans to support CentOS 8 soon?
I edited the spec file to force the build to succeed.

# diff ~/build/l3dsr/linux/rpm/iptables-daddr.spec iptables-daddr.spec
14a15,17
>     %if "%{dist}" == ".el8"
>       %define rhel_version 700
>     %endif
109,110c112,113
< BuildRequires: iptables-devel >= 1.4.7, iptables-devel < 1.5
< Requires: iptables >= 1.4.7, iptables < 1.5
---
> BuildRequires: iptables-devel >= 1.4.7, iptables-devel < 1.9
> Requires: iptables >= 1.4.7, iptables < 1.9

After installing this rpm package and doing a simple test, it looks like it works as shown below.

# iptables -t mangle -S
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-A PREROUTING -m dscp --dscp 0x0a -j DADDR --set-daddr 1.1.1.1 <---------------------------- vip set with iptables
 
# nft list table ip mangle
table ip mangle {
        chain PREROUTING {
                type filter hook prerouting priority -150; policy accept;
                ip dscp 0x0a counter packets 25 bytes 2100 # DADDR set 1.1.1.1 <----------- Settings converted to nftables
        }
 
        chain INPUT {
                type filter hook input priority -150; policy accept;
        }
 
        chain FORWARD {
                type filter hook forward priority -150; policy accept;
        }
 
        chain OUTPUT {
                type route hook output priority -150; policy accept;
        }
 
        chain POSTROUTING {
                type filter hook postrouting priority -150; policy accept;
        }
}
  • tcpdump
    A ping with the ToS set is received, and the reply is sent with the VIP (1.1.1.1) as the source IP
# tcpdump -ni eth0 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:22:22.174193 IP XX.XX.XX.XX > YY.YY.YY.YY: ICMP echo request, id 31130, seq 1, length 64
17:22:22.174232 IP 1.1.1.1 > XX.XX.XX.XX: ICMP echo reply, id 31130, seq 1, length 64
17:22:23.174028 IP XX.XX.XX.XX > YY.YY.YY.YY: ICMP echo request, id 31130, seq 2, length 64
17:22:23.174066 IP 1.1.1.1 > XX.XX.XX.XX: ICMP echo reply, id 31130, seq 2, length 64
17:22:24.174083 IP XX.XX.XX.XX > YY.YY.YY.YY: ICMP echo request, id 31130, seq 3, length 64
17:22:24.174127 IP 1.1.1.1 > XX.XX.XX.XX: ICMP echo reply, id 31130, seq 3, length 64
  • iptables counter
    The VIP (1.1.1.1) is set when a packet matches the ToS-value rule
# iptables -t mangle -L -v
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
   33  2772 DADDR      all  --  any    any     anywhere             anywhere             DSCP match 0x0a DADDR set 1.1.1.1

It seems to work for the time being.

Mellanox "hw csum failure" error and other upcoming changes

We encountered a "hw csum failure" with the latest Mellanox driver, which exposed a long-latent bug in the xt_DADDR.c code. I'm making the patch available now on the hw_csum_failure branch for those who run across the issue before I can get some other fixes into master.

Some of the other upcoming changes that are pending for master are:

  • Change default table from "mangle" to "raw" to resolve problems with conntrack (issue #5)
  • Deprecate iptables 1.2 (RHEL 4) and iptables 1.3 (RHEL 5) support
  • Add dsrtools which deprecates yvipagent
  • Add rpm package building support for RHEL 8
  • Add support for suppressing kmod side of the build
  • Some general bookkeeping and clean up

If you'd like any of these changes before I can publish them on github, let me know and I'll see if I can send you an isolated patch for it.

"dsrtools" now available on "beta" branch

The beta branch now has dsrtools available.

dsrtools is a replacement for yvipagent. We've been using this tool internally for several years now on our RHEL 6 and RHEL 7 production systems.

As part of this latest update to beta, the tree has been reorganized. The directories and files that were immediately under the linux directory are now in a subdirectory named iptables-daddr. Also under linux is dsrtools.

Under the dsrtools directory, you'll find README, INSTALL, and USING documentation as well as man pages under the src directory for dsrctl(8) and dsr.conf(5).

Also under dsrtools is a tests directory. It is an extensive suite of tests. These tests can be used to validate changes to dsrtools to ensure they don't introduce regressions.

dsrtools is necessary for newer versions of iptables-daddr (1.9.0 and later) that use the raw table by default. If you wish to continue to use yvipagent, you'll have to force your version of iptables-daddr back to mangle. See Issue #5 on how to do that.

yvipagent is still available for now in the repository, but will be deprecated in a future release.

If you do any testing with these latest updates on the beta branch, please let us know how it goes, good or bad.

Once we get enough feedback, we'll merge the beta branch to master.

If you find any problems or have any questions with dsrtools, feel free to ask them here or contact Wayne Badger ([email protected]) or Quentin Barnes ([email protected]).

Using "raw" instead of "mangle" so conntrack can work

Someone suggested I consider using the "raw" table instead of the "mangle" table for this module so that it would run ahead of the conntrack module. That way, the daddr rewriting wouldn't confuse conntrack's tracking. I tried the idea out with some limited testing, and it seems to work, but I'm cautious about the move because I have not been able to find much documentation on the "raw" table.

I wrote a note on netdev a couple of weeks ago (https://www.mail-archive.com/[email protected]/msg125234.html), but so far no help.

Has anyone also hit the problem with conntrack, tried any workarounds, or has comments on using the "raw" table?
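
For context on the ordering discussed above: in the IPv4 netfilter hooks, the raw table runs at priority -300, connection tracking at -200, and mangle at -150 (the mangle value also appears in the nft listing in the CentOS 8 issue). Lower numbers run first. The Python sketch below only orders these three constants; the function names are illustrative.

```python
# Priorities of the IPv4 netfilter hooks relevant here; lower runs first.
HOOK_PRIORITY = {"raw": -300, "conntrack": -200, "mangle": -150}


def hook_order(priorities=HOOK_PRIORITY):
    """Return hook names in the order the kernel invokes them."""
    return sorted(priorities, key=priorities.get)


def rewrite_precedes_conntrack(table):
    """True if a rewrite done in this table happens before conntrack sees the packet."""
    return HOOK_PRIORITY[table] < HOOK_PRIORITY["conntrack"]
```

A rewrite in raw therefore happens before conntrack records the packet's destination, while a rewrite in mangle happens after it.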

make dsrtool handle iptables lock issues gracefully

At random times, the dsrtools start script fails with the following error message.

TASK [services : restart dsrctl service] ***************************************
00:14:46 fatal: [host.example.com]: FAILED! => {"changed": false, "msg": "Unable to restart service dsr: Job for dsr.service failed because the control process exited with error code. See \"systemctl status dsr.service\" and \"journalctl -xe\" for details.\n"}

A couple of error messages were noted:

systemctl status dsr.service
dsr.service - DSR control
   Loaded: loaded (/usr/lib/systemd/system/dsr.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2021-05-11 02:58:32 UTC; 6 days ago
     Docs: man:dsrctl(8)
  Process: 15427 ExecStop=/usr/sbin/dsrctl stop (code=exited, status=0/SUCCESS)
  Process: 15445 ExecStart=/usr/sbin/dsrctl start (code=exited, status=1/FAILURE)
 Main PID: 15445 (code=exited, status=1/FAILURE)

May 11 02:58:32 host.example.com systemd[1]: Starting DSR control...
May 11 02:58:32 host.example.com dsrctl[15445]: Failed to get iptables (iptables -L -t raw -n).
May 11 02:58:32 host.example.com systemd[1]: dsr.service: main process exited, code=exited, status=1/FAILURE
May 11 02:58:32 host.example.com systemd[1]: Failed to start DSR control.
May 11 02:58:32 host.example.com systemd[1]: Unit dsr.service entered failed state.
May 11 02:58:32 host.example.com systemd[1]: dsr.service failed.
STDERR: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?

Noticed on: 7.9.15-1.el7

$ rpm -q dsrtools
dsrtools-1.4.0-20210314.02.el7.noarch
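
One way to tolerate the lock contention shown in the log is to pass -w to iptables, as the error message suggests, and to retry only when the failure is the lock message. The Python sketch below illustrates that idea; it is not how dsrctl is implemented, and the function names, retry count, and delay are assumptions chosen for illustration.

```python
import subprocess
import time

LOCK_MESSAGE = "holding the xtables lock"


def list_table_cmd(table, wait=True):
    """Build the listing command from the log, optionally adding -w."""
    cmd = ["iptables"]
    if wait:
        cmd.append("-w")  # wait for the xtables lock instead of failing
    return cmd + ["-L", "-t", table, "-n"]


def run_iptables(cmd):
    """Run a command and return (exit status, stderr text)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stderr


def retry_on_lock(cmd, run=run_iptables, attempts=5, delay=0.5, sleep=time.sleep):
    """Retry cmd while it fails because another process holds the lock."""
    for attempt in range(attempts):
        status, stderr = run(cmd)
        if status == 0:
            return True
        if LOCK_MESSAGE not in stderr:
            return False  # a different failure; retrying will not help
        if attempt + 1 < attempts:
            sleep(delay)
    return False
```

Calling retry_on_lock(list_table_cmd("raw")) would run the same listing that failed in the log, waiting for the lock and retrying a few times before giving up.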
