
High CPU load on cloud VMs · tinc · 59 comments · OPEN

gsliepen avatar gsliepen commented on August 20, 2024 14
High CPU load on cloud VMs

from tinc.

Comments (59)

breisig avatar breisig commented on August 20, 2024 6

I would also be excited if multi-threading would be integrated :-)

from tinc.

dwmw2 avatar dwmw2 commented on August 20, 2024 2

How much of the time taken is actually in memory copy between kernel and userspace?

For OpenConnect we've looked at similar optimisation: https://gitlab.com/openconnect/openconnect/-/issues/263

One thing we're looking at is using vhost-net in the kernel instead of read/write on the tun device. With experimental zero-copy TX that might help to eliminate some of the copying, but at the very least it lets us offload it to the kernel's dedicated vhost thread without having to do threading (and locking) on the userspace side.

Longer term though, I'd love to use the kernel's own crypto support for this. A packet comes in on the UDP socket, is decrypted, fed to a BPF filter which decides whether to shove it directly into the tun device, or feed it up to userspace. Likewise, packets from the tun device are fed to a BPF program which prepends any necessary protocol header, then feeds it to AF_TLS to be encrypted and sent over UDP.

For OpenConnect we use DTLS and ESP, depending on the protocol. Tinc would need its own packet format. As we implement the bits we need for this in OpenConnect and ocserv, it would be good to make sure it could also support tinc.

from tinc.

dechamps avatar dechamps commented on August 20, 2024 1

Regarding encryption performance: I suspect the implementation of Chacha20-Poly1305 in tinc 1.1 is relatively slow compared to alternatives. It's a somewhat naive implementation written in plain C with no CPU-specific optimizations. I believe @gsliepen initially opted to use that because it was the simplest option at the time - no need for external dependencies (the implementation is inside the tinc source tree) or exotic build rules. Also, at the time the third-party crypto libraries either did not support this cipher or were just as slow, but I suspect that's not true today.

Benchmarks show that optimized implementations can make a lot of difference: https://bench.cr.yp.to/impl-stream/chacha20.html

from tinc.

stevesbrain avatar stevesbrain commented on August 20, 2024 1

I'm not in a position to assist at all, but I just wanted to say it's a fantastic write up @nh2 - thanks for taking the time :)

from tinc.

dechamps avatar dechamps commented on August 20, 2024 1

I threw together some hacky code to move sendto() calls to a separate thread, along with a 128-packet queue (I have not tried sendmmsg()). iperf3 performance (with crypto enabled) improved from 650 Mbit/s (see #110 (comment)) to 720 Mbit/s. New bottleneck appears to be the TUN write path.

If I combine that asynchronous UDP send code with my other hacky patch to use OpenSSL for Chacha20-Poly1305, I manage to reach 935 Mbit/s, just shy of the gigabit mark.

For those interested, here are the proofs of concept:

from tinc.

breisig avatar breisig commented on August 20, 2024 1

Any update on this? It sounds like it could really speed up tinc.

from tinc.

nh2 avatar nh2 commented on August 20, 2024

Correction from @gsliepen:

writev() is not the equivalent of sendmmsg(). If you do a writev() to a tun device, it will be treated as one packet.

last time I looked there was some functionality in the kernel, but it was not exposed to userspace

Indeed.

There's also an earlier comment on the mailing list on this.

My guess is that this is the kernel patch in question.

from tinc.

nh2 avatar nh2 commented on August 20, 2024

Regarding crypto:

from tinc.

ewoutp avatar ewoutp commented on August 20, 2024

I'm using tinc 1.0 and see similar high CPU loads.
Would it be worth switching to 1.1?

from tinc.

gsliepen avatar gsliepen commented on August 20, 2024

On Fri, Apr 29, 2016 at 04:51:13AM -0700, Ewout Prangsma wrote:

I'm using tinc 1.0 and see similar high CPU loads.
Would it be worth switching to 1.1?

You can try. Tinc 1.1 supports recvmmsg(), it might reduce system call
overhead a bit. But other than that, there is not much which would make
1.1 have a lower CPU load than 1.0.

Met vriendelijke groet / with kind regards,
Guus Sliepen [email protected]
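
For reference, recvmmsg() drains several UDP datagrams in one system call; a minimal sketch of the pattern (the batch size and buffer size here are illustrative, not tinc's actual values):

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/socket.h>

    #define BATCH 64
    #define MTU   1500

    /* Receive up to BATCH datagrams from a UDP socket with a single syscall. */
    static int recv_batch(int fd, unsigned char bufs[BATCH][MTU])
    {
        struct mmsghdr msgs[BATCH];
        struct iovec iovs[BATCH];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iovs[i].iov_base = bufs[i];
            iovs[i].iov_len = MTU;
            msgs[i].msg_hdr.msg_iov = &iovs[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        /* Returns the number of datagrams received; msgs[i].msg_len holds each length. */
        return recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
    }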

from tinc.

splitice avatar splitice commented on August 20, 2024

@nh2 we too are facing this issue on our DigitalOcean cloud VMs. We are running the older 1.0 branch (for stability reasons) and currently seeing only rx: 23.56 Mbit/s (3257 p/s) and tx: 50.07 Mbit/s (8574 p/s) on one of our more central nodes.

from tinc.

nh2 avatar nh2 commented on August 20, 2024

My guess is that this is the kernel patch in question.

I've sent an email to @alexgartrell to ask him if he's still interested in this patch.

It would be great if this could land in the kernel!

from tinc.

splitice avatar splitice commented on August 20, 2024

@nh2 isn't IFF_MULTI_READ for the read side of tun, not the write? Am I misunderstanding that patch?

from tinc.

nh2 avatar nh2 commented on August 20, 2024

@splitice Your understanding of the initial submission of the patch is right, but the conversation ended with

Sounds good to me. I'll get a patch turned around soon.

replying to

If we were to use
recvmmsg obviously we'd create a new interface based on sockets
for tun and expose the existing socket through that.

The current file-based tun interface was never designed to be
a high-performance interface. So let's take this opportunity
and create a new interface

so it was my understanding that this (making a tun replacement on which you can use all of the *mmsg functions) is what the plan was.

from tinc.

dechamps avatar dechamps commented on August 20, 2024

I just did some benchmarks today, on my Intel Core i7-2600 running Linux 4.14.

The baseline, using GCC 7.2.0 and tinc bdeba3f (latest 1.1 repo), is 1.86 Gbit/s raw SPTPS performance according to sptps_test ("SPTPS/UDP transmit"). Fiddling with GCC optimizations (-O3 -march=native) doesn't seem to change anything. Switching to clang 5.0.1 tends to make things worse (1.65 Gbit/s), unless further optimizations (beyond -O2) are enabled, in which case it's on par with GCC.

I set up a more realistic benchmark involving two tinc nodes running on the same machine, and then using iperf3 over the tunnel between the two nodes. In the baseline setup, the iperf3 throughput was 650 Mbit/s. During the test, both nodes used ~6.5 seconds of user CPU time per GB each. In addition, the transmitting node used ~6.2 seconds of kernel CPU time per GB, while the receiving node used ~5.5 seconds of kernel time per GB. (In other words, user/kernel CPU usage is roughly 50/50.)

I hacked together a patch to make tinc use OpenSSL 1.1.0g (EVP_chacha20_poly1305()) for Chacha20-Poly1305, instead of the tinc built-in code. Indeed OpenSSL has more optimized code for this cipher, including hand-written assembly. As a result, raw SPTPS performance jumped to 4.19 Gbit/s, a ~2X improvement over the baseline. (I would expect more bleeding-edge versions of OpenSSL to provide even better performance, as more CPU-specific optimizations have been done recently.)
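
For context, the OpenSSL 1.1.0 AEAD interface used by such a patch looks roughly like the sketch below (a hedged illustration, not the actual patch; the function name and buffers are made up, and the key/nonce sizes are the 32 and 12 bytes this cipher requires):

    #include <openssl/evp.h>

    /* Encrypt pt[0..pt_len) into ct and write the 16-byte Poly1305 tag into tag.
     * Returns 1 on success, 0 on error. */
    static int chacha20_poly1305_seal(const unsigned char key[32],
                                      const unsigned char nonce[12],
                                      const unsigned char *pt, int pt_len,
                                      unsigned char *ct, unsigned char tag[16])
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len, ok = 0;

        if (!ctx)
            return 0;

        if (EVP_EncryptInit_ex(ctx, EVP_chacha20_poly1305(), NULL, key, nonce) &&
            EVP_EncryptUpdate(ctx, ct, &len, pt, pt_len) &&
            EVP_EncryptFinal_ex(ctx, ct + len, &len) &&
            EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_AEAD_GET_TAG, 16, tag))
            ok = 1;

        EVP_CIPHER_CTX_free(ctx);
        return ok;
    }

Whether the resulting wire format matches tinc's built-in Chacha20-Poly1305 is a separate question (see the compatibility discussion further down in this thread).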

Unfortunately, because tinc spends a lot of time in the kernel, the improvement in the iperf3 benchmark was not as impressive: 785 Mbit/s, using ~4.0 seconds of user CPU time per GB. (Which means user/kernel CPU usage is roughly 40/60 in this test.)

I also tried libsodium 1.0.16, but the raw SPTPS performance wasn't as impressive: 1.95 Gbit/s, barely an improvement over the tinc built-in code.

It looks like it would be worth it to use OpenSSL for Chacha20-Poly1305 as it is clearly much faster than the current code. But in any case, the syscall side of things definitely needs to be improved as well as it immediately becomes the dominant bottleneck as crypto performance improves.

from tinc.

gsliepen avatar gsliepen commented on August 20, 2024

Yeah and syscall overhead is only going to grow thanks to Meltdown. I think a plan of attack is needed:

  1. Investigate ways to read AND write multiple packets in one go from/to /dev/tun.
  2. Modify tinc to batch writes to sockets and to /dev/tun.
  3. Make tinc use the OpenSSL versions of Chacha20-Poly1305 if it's linking with OpenSSL anyway.

I think that can all be done in parallel. Item 2 can just do write() in a loop until we find the optimum way to send batches of packets to /dev/tun.

I'd like to keep the C-only version of Chacha20-Poly1305 in tinc; it's very nice for running tinc on embedded devices where space is at a premium.

from tinc.

dechamps avatar dechamps commented on August 20, 2024

Another data point: if I bypass the crypto completely, I get 22.87 Gbit/s on the raw SPTPS throughput test (duh...). iperf3 throughput is about 1 Gbit/s, and user CPU usage is around ~2 seconds per gigabyte.

@gsliepen: according to my perf profiling on the sending side, it kinda looks like it's the UDP socket path that's expensive, not the TUN/TAP I/O paths. Though I suppose the relative cost of these paths could be system dependent.

   - 75.25% do_syscall_64                                                                                                                  ▒
      - 43.92% sys_sendto                                                                                                                  ▒
         - 43.80% SYSC_sendto                                                                                                              ▒
            - 42.69% sock_sendmsg                                                                                                          ▒
               - 42.42% inet_sendmsg                                                                                                       ▒
                  - 41.69% udp_sendmsg                                                                                                     ▒
                     - 32.25% udp_send_skb                                                                                                 ▒
                        - 31.73% ip_send_skb                                                                                               ▒
                           - 31.57% ip_local_out                                                                                           ▒
                              - 31.29% ip_output                                                                                           ▒
                                 - 30.95% ip_finish_output                                                                                 ▒
                                    - 30.62% ip_finish_output2                                                                             ▒
                                       - 26.26% __local_bh_enable_ip                                                                       ▒
                                          - 26.09% do_softirq.part.17                                                                      ▒
                                             - do_softirq_own_stack                                                                        ▒
                                                - 25.59% __softirqentry_text_start                                                         ▒
                                                   - 24.97% net_rx_action                                                                  ▒
                                                      + 24.25% process_backlog                                                             ▒
                                       + 3.81% dev_queue_xmit                                                                              ▒
                     + 5.82% ip_make_skb                                                                                                   ▒
                     + 2.59% ip_route_output_flow                                                                                          ▒
      - 10.58% sys_write                                                                                                                   ▒
         - 10.42% vfs_write                                                                                                                ▒
            - 10.12% __vfs_write                                                                                                           ▒
               - 10.08% new_sync_write                                                                                                     ▒
                  - 10.02% tun_chr_write_iter                                                                                              ▒
                     - 9.86% tun_get_user                                                                                                  ▒
                        - 8.75% netif_receive_skb                                                                                          ▒
                           - 8.72% netif_receive_skb_internal                                                                              ▒
                              - 8.63% __netif_receive_skb                                                                                  ▒
                                 - __netif_receive_skb_core                                                                                ▒
                                    - 8.46% ip_rcv                                                                                         ▒
                                       - 8.32% ip_rcv_finish                                                                               ▒
                                          - 8.02% ip_local_deliver                                                                         ▒
                                             - 7.99% ip_local_deliver_finish                                                               ▒
                                                - 7.87% tcp_v4_rcv                                                                         ▒
                                                   - 7.36% tcp_v4_do_rcv                                                                   ▒
                                                      + 7.26% tcp_rcv_established                                                          ▒
      - 9.98% sys_select                                                                                                                   ▒
         - 8.13% core_sys_select                                                                                                           ▒
            - 6.22% do_select                                                                                                              ▒
                 1.80% sock_poll                                                                                                           ▒
               + 0.98% __fdget                                                                                                             ▒
                 0.82% tun_chr_poll                                                                                                        ▒
      - 6.87% sys_read                                                                                                                     ▒
         - 6.56% vfs_read                                                                                                                  ▒
            - 5.32% __vfs_read                                                                                                             ▒
               - 5.21% new_sync_read                                                                                                       ▒
                  - 5.00% tun_chr_read_iter                                                                                                ▒
                     - 4.76% tun_do_read.part.42                                                                                           ▒
                        + 3.15% skb_copy_datagram_iter                                                                                     ▒
                        + 0.91% consume_skb                                                                                                ▒
            + 0.82% rw_verify_area                                                                                                         

One thing that comes to mind would be to have the socket I/O be done in a separate thread. Not only would that scale better (the crypto would be done in parallel with I/O, enabling the use of multiple cores), it would also make it possible for that thread to efficiently use sendmmsg() if more than one packet has accumulated inside the sending queue since the last send call started (coalescing).
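
The coalesced send itself is a single sendmmsg() call over whatever the queue has accumulated; a rough sketch (the queue layout here is hypothetical):

    #define _GNU_SOURCE
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Flush up to 128 already-encrypted packets in one syscall.
     * pkts/lens/dsts describe whatever the sending queue has accumulated. */
    static int send_batch(int fd, unsigned char **pkts, size_t *lens,
                          struct sockaddr_in *dsts, unsigned int n)
    {
        struct mmsghdr msgs[128];
        struct iovec iovs[128];

        if (n > 128)
            n = 128;
        memset(msgs, 0, n * sizeof(msgs[0]));
        for (unsigned int i = 0; i < n; i++) {
            iovs[i].iov_base = pkts[i];
            iovs[i].iov_len = lens[i];
            msgs[i].msg_hdr.msg_iov = &iovs[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
            msgs[i].msg_hdr.msg_name = &dsts[i];
            msgs[i].msg_hdr.msg_namelen = sizeof(dsts[i]);
        }

        /* Returns how many of the n messages were actually handed to the kernel. */
        return sendmmsg(fd, msgs, n, 0);
    }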

from tinc.

gsliepen avatar gsliepen commented on August 20, 2024

Hm, where's sys_recvfrom in your perf output? It would be nice to see how that compares to sys_sendto. (Of course, I should just run perf myself...)

from tinc.

dechamps avatar dechamps commented on August 20, 2024

I did not include it because it's negligible (thanks to the use of recvmmsg(), I presume):

      - 76.54% do_syscall_64                                                                                                               ▒
         + 43.47% sys_sendto                                                                                                               ▒
         + 11.39% sys_write                                                                                                                ▒
         + 10.27% sys_select                                                                                                               ▒
         + 6.74% sys_read                                                                                                                  ▒
         + 1.56% syscall_slow_exit_work                                                                                                    ▒
         - 1.51% sys_recvmmsg                                                                                                              ▒
            - 1.49% __sys_recvmmsg                                                                                                         ▒
               - 1.44% ___sys_recvmsg                                                                                                      ▒
                  - 1.01% sock_recvmsg_nosec                                                                                               ▒
                     - 1.00% inet_recvmsg                                                                                                  ▒
                          0.97% udp_recvmsg                                                                                                ▒
           0.70% syscall_trace_enter                                                                                                       

Here's how things look on the receiving side (which is not the bottleneck in my benchmark):

      - 73.10% do_syscall_64                                                                                                               ▒
         - 26.20% sys_sendto                                                                                                               ▒
            - 26.09% SYSC_sendto                                                                                                           ▒
               - 25.41% sock_sendmsg                                                                                                       ▒
                  - 25.26% inet_sendmsg                                                                                                    ▒
                     - 24.96% udp_sendmsg                                                                                                  ▒
                        + 19.31% udp_send_skb                                                                                              ▒
                        + 3.17% ip_make_skb                                                                                                ▒
                        + 1.73% ip_route_output_flow                                                                                       ▒
         - 26.06% sys_write                                                                                                                ▒
            - 25.54% vfs_write                                                                                                             ▒
               - 24.47% __vfs_write                                                                                                        ▒
                  - 24.36% new_sync_write                                                                                                  ▒
                     - 24.02% tun_chr_write_iter                                                                                           ▒
                        - 23.43% tun_get_user                                                                                              ▒
                           - 17.87% netif_receive_skb                                                                                      ▒
                              - 17.74% netif_receive_skb_internal                                                                          ▒
                                 - 17.29% __netif_receive_skb                                                                              ▒
                                    - 17.20% __netif_receive_skb_core                                                                      ▒
                                       - 16.42% ip_rcv                                                                                     ▒
                                          - 15.77% ip_rcv_finish                                                                           ▒
                                             + 14.30% ip_local_deliver                                                                     ▒
                                             + 0.95% tcp_v4_early_demux                                                                    ▒
                           + 1.12% copy_page_from_iter                                                                                     ▒
                           + 0.97% skb_probe_transport_header.constprop.62                                                                 ▒
                             0.90% __skb_get_hash_symmetric                                                                                ▒
                           + 0.74% build_skb                                                                                               ▒
         - 8.42% sys_select                                                                                                                ▒
            - 6.92% core_sys_select                                                                                                        ▒
               + 5.44% do_select                                                                                                           ▒
         - 5.84% sys_recvmmsg                                                                                                              ▒
            - 5.72% __sys_recvmmsg                                                                                                         ▒
               - 5.48% ___sys_recvmsg                                                                                                      ▒
                  - 2.83% sock_recvmsg_nosec                                                                                               ▒
                     - 2.81% inet_recvmsg                                                                                                  ▒
                        + 2.75% udp_recvmsg                                                                                                ▒
                  + 1.15% sock_recvmsg                                                                                                     ▒
                  + 0.85% copy_msghdr_from_user                                                                                            ▒
         - 3.05% sys_read                                                                                                                  ▒
            - 2.84% vfs_read                                                                                                               ▒
               - 2.10% __vfs_read                                                                                                          ▒
                  - 2.02% new_sync_read                                                                                                    ▒
                     - 1.86% tun_chr_read_iter                                                                                             ▒
                        + 1.69% tun_do_read.part.42                                                                                        

Even on the receiving side the UDP RX path is quite efficient and there is still a large amount of time spent in the UDP send path (presumably to send the TCP acknowledgements for the iperf3 stream).

(Note: as a reminder, all my perf reports are with all chacha-poly1305 code bypassed, to make syscall performance issues more obvious.)

from tinc.

dechamps avatar dechamps commented on August 20, 2024

I believe the main reason why the UDP TX path is so slow is because Linux runs all kinds of CPU-intensive logic in that call (including selecting routes and calling into netfilter, it seems), and it does that inline in the calling thread, even if the call is non-blocking.

If that's true, then it means that the performance of that path would also depend on the complexity of the network configuration on the machine that tinc is running on (i.e. routing table, iptables rules, etc.), which in the case of my test machine is actually fairly non-trivial, so my results might be a bit biased in that regard.

If we move these syscalls to a separate thread, then it might not do much in terms of CPU efficiency, but it would at least allow tinc to scale better by having all this kernel-side computation happen in parallel on a separate core.

from tinc.

gsliepen avatar gsliepen commented on August 20, 2024

Ok, so we need four single-producer, single-consumer ringbuffers: tun rx, tun tx, udp rx, udp tx. Each ringbuffer gets its own thread to do I/O. We also need to signal the main event loop; on Linux this could be done using eventfd. The threads doing UDP I/O can use sendmmsg()/recvmmsg() if possible.
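
The eventfd signalling can be as small as the sketch below (names are illustrative): an I/O thread bumps the counter after pushing onto its ring buffer, and the main loop watches the fd alongside its other descriptors.

    #include <sys/eventfd.h>

    static int wakeup_fd;   /* added to the main loop's select()/epoll set */

    int wakeup_init(void) {
        wakeup_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
        return wakeup_fd;
    }

    /* Called by an I/O thread after it pushes a packet onto its ring buffer. */
    void wakeup_signal(void) {
        eventfd_write(wakeup_fd, 1);
    }

    /* Called by the main loop when wakeup_fd becomes readable: clear the
     * counter, then drain the ring buffers. */
    void wakeup_ack(void) {
        eventfd_t count;
        eventfd_read(wakeup_fd, &count);
    }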

from tinc.

splitice avatar splitice commented on August 20, 2024

@dechamps I can confirm that on machines with more complicated routing rules that outgoing performance on tinc does decrease. I haven't got any benchmarks currently but it was something our devops team noted between staging and production. It wasn't too significant for us though (our configuration involves 6-8 routing rules, and significant IPTables rules, which tinc is able to bypass early).

Does sendmmsg have to evaluate the routing tables for each packet, or does it cache (post-3.6, with the removal of the route cache)? Not that I'm saying this wouldn't lead to other savings, but it might not be the savings being imagined.

from tinc.

dechamps avatar dechamps commented on August 20, 2024

Ok, so we need four single-producer, single-consumer ringbuffers: tun rx, tun tx, udp rx, udp tx. Each ringbuffer gets its own thread to do I/O.

Sounds good. As a first approach, TUN TX and UDP TX is much simpler than RX because these paths don't use the event loop at all - they just call write() and sendto() directly, dropping the packet on the floor if the call would block. This means there's no need to coordinate with the event loop for TX - it's just a matter of writing a drop-in replacement for write() and sendto() that transparently offloads the syscall to a separate thread for asynchronous execution, dropping the packet if the queue for that separate thread is too busy.
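
A minimal shape for such a drop-in replacement, written against the C11 threads API (which tinycthread also provides), might look like this; the names, the 128-slot queue, and the drop-on-full policy are illustrative, not the actual proof of concept:

    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <threads.h>   /* or tinycthread.h, which provides the same API */

    #define QLEN   128
    #define MAXPKT 1600

    struct queued_pkt {
        int fd;
        size_t len;
        socklen_t dstlen;
        struct sockaddr_storage dst;
        unsigned char data[MAXPKT];
    };

    static struct queued_pkt queue[QLEN];
    static size_t head, tail;          /* both protected by lock */
    static mtx_t lock;
    static cnd_t nonempty;

    /* Drop-in for sendto(): copy the packet, signal the worker, return at once.
     * If the queue is full the packet is dropped, mirroring what the existing
     * non-blocking send path does when the kernel buffer is full. */
    int async_sendto(int fd, const void *buf, size_t len,
                     const struct sockaddr *dst, socklen_t dstlen)
    {
        mtx_lock(&lock);
        if ((head + 1) % QLEN == tail || len > MAXPKT) {
            mtx_unlock(&lock);
            return -1;
        }
        struct queued_pkt *p = &queue[head];
        p->fd = fd;
        p->len = len;
        p->dstlen = dstlen;
        memcpy(p->data, buf, len);
        memcpy(&p->dst, dst, dstlen);
        head = (head + 1) % QLEN;
        cnd_signal(&nonempty);
        mtx_unlock(&lock);
        return (int)len;
    }

    /* Worker thread: performs the actual syscalls off the main thread. */
    static int sender_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            mtx_lock(&lock);
            while (head == tail)
                cnd_wait(&nonempty, &lock);
            struct queued_pkt p = queue[tail];
            tail = (tail + 1) % QLEN;
            mtx_unlock(&lock);
            sendto(p.fd, p.data, p.len, 0, (struct sockaddr *)&p.dst, p.dstlen);
        }
        return 0;
    }

    void async_send_init(void)
    {
        thrd_t t;
        mtx_init(&lock, mtx_plain);
        cnd_init(&nonempty);
        thrd_create(&t, sender_thread, NULL);
    }

The same worker could later drain several queued packets per wakeup and hand them to sendmmsg(), as suggested above.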

from tinc.

dechamps avatar dechamps commented on August 20, 2024

After some more investigation, one potential issue with sending UDP packets asynchronously is that it prevents the main thread from immediately getting EMSGSIZE feedback for PMTU discovery purposes. The sender thread would have to call back to update the MTU, which suddenly makes everything way more complicated. We might have to choose between one or the other for the time being.
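
For context, the feedback that gets lost is the synchronous pattern below (a sketch assuming a connected IPv4 UDP socket); with an asynchronous sender, the EMSGSIZE shows up on the worker thread, which would then have to report the refreshed path MTU back to the main thread.

    #include <errno.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    /* Synchronous send with immediate PMTU feedback. */
    ssize_t send_with_pmtu(int fd, const void *buf, size_t len, int *mtu_out)
    {
        ssize_t r = send(fd, buf, len, 0);
        if (r < 0 && errno == EMSGSIZE) {
            /* On Linux, the kernel caches the updated path MTU on a connected socket. */
            socklen_t sz = sizeof(*mtu_out);
            getsockopt(fd, IPPROTO_IP, IP_MTU, mtu_out, &sz);
        }
        return r;
    }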

from tinc.

gsliepen avatar gsliepen commented on August 20, 2024

Well, it slows it down a bit, but PMTU works without the EMSGSIZE feedback as well.

from tinc.

gsliepen avatar gsliepen commented on August 20, 2024

Very nice. One issue though is that it's making copies of packets, for obvious reasons. But since we're worried about performance already, we should probably have a pool of vpn_packet_t's that we can pick from and hand them over to other threads.

I also like that drop-in C11 thread library!

from tinc.

dechamps avatar dechamps commented on August 20, 2024

…and here's a proof of concept for the final piece, asynchronous TUN writes: dechamps@001cd46

With that combined with the asynchronous UDP send, I get 770 Mbit/s. If I combine everything described so far (asynchronous UDP send and TUN write, plus OpenSSL crypto), I can reach 1140 Mbit/s, finally breaching that symbolic Gigabit barrier. This is ~1.75x vanilla tinc performance. It's quite possible that this can be improved further by tuning some knobs or writing things in a smarter way (such as the suggestion that @gsliepen just made above).

CPU usage looked as follows during the fully combined iperf3 stress test:

Thread      Sending node   Receiving node
Main        95%            95%
UDP send    75%            35%
TUN write   21%            53%

So basically, each tinc node is now able to scale to two CPUs instead of just one.

I haven't looked at the new bottlenecks too closely, but at first glance the main thread seems to be spending as much time in select() as in crypto code. Perhaps that could be the next area for improvement (epoll() comes to mind).

from tinc.

gsliepen avatar gsliepen commented on August 20, 2024

I created a hopefully generic buffer pool with an asynchronous hook, see 9088b93. I kept async_device.[ch] and async_send.[ch], but instead of having the functions there do memcpy() into a buffer they get from the pool, functions that currently allocate a vpn_packet_t on the stack should do vpn_packet_t *pkt = async_pool_get(vpn_packet_pool) and pass that on until the place where we'd normally do the write(), and instead call async_pool_put(vpn_packet_pool, pkt). And something analogous for sending UDP packets, although that might be a bit more complicated.
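
For readers following along, a generic pool of that shape (a free list behind a mutex, with buffers handed between threads by pointer instead of memcpy()) can be as small as the sketch below; this is an illustration only, not the code in 9088b93:

    #include <stdlib.h>
    #include <threads.h>   /* or tinycthread.h, same API */

    typedef struct pool_entry {
        struct pool_entry *next;
        size_t len;
        unsigned char data[1600];        /* illustrative packet size */
    } pool_entry_t;

    typedef struct {
        pool_entry_t *free_list;
        mtx_t lock;
    } pool_t;

    void pool_init(pool_t *pool, size_t count) {
        pool->free_list = NULL;
        mtx_init(&pool->lock, mtx_plain);
        while (count--) {
            pool_entry_t *e = malloc(sizeof(*e));
            e->next = pool->free_list;
            pool->free_list = e;
        }
    }

    /* Take a buffer out of the pool; returns NULL if the pool is exhausted. */
    pool_entry_t *pool_get(pool_t *pool) {
        mtx_lock(&pool->lock);
        pool_entry_t *e = pool->free_list;
        if (e)
            pool->free_list = e->next;
        mtx_unlock(&pool->lock);
        return e;
    }

    /* Return a buffer once the last thread that touched it is done with it. */
    void pool_put(pool_t *pool, pool_entry_t *e) {
        mtx_lock(&pool->lock);
        e->next = pool->free_list;
        pool->free_list = e;
        mtx_unlock(&pool->lock);
    }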

from tinc.

gsliepen avatar gsliepen commented on August 20, 2024

Now with asynchronous reading from the tun device: cc5e809.

from tinc.

dechamps avatar dechamps commented on August 20, 2024

I'm not sure I'll have the time to clean this up any time soon, so if anyone is up for it, feel free to pick this up. Pinging @millert as perhaps he might be interested in some more coding fun.

from tinc.

millert avatar millert commented on August 20, 2024

I can devote some time to this. Do we want to go with tinycthread or would you rather use pthreads/winthreads directly?

from tinc.

dechamps avatar dechamps commented on August 20, 2024

@millert Thanks for volunteering :) @gsliepen indicated in #110 (comment) that he liked the idea of using tinycthread, and I agree, so I would recommend using that. (The alternatives are writing different code for the two platforms - which is, well, not great - or using pthreads-Win32, but that's very old and requires adding a dependency on another library that needs to be linked in; whereas tinycthread is just a single drop-in C file and it's future-proof since it implements a standard C API.)

@gsliepen: did you measure any improvements when you experimented with a generic buffer pool? I suspect that this wouldn't make much of a difference and that it would be simpler to just do the naive thing like I did in my code, but I'll admit I'm just speculating here.

@millert: I'm not sure if you're interested in the OpenSSL crypto stuff too, or just the multi-threaded I/O. If you're interested in interfacing with OpenSSL for Chacha20-Poly1305, keep in mind that I have not checked that OpenSSL uses the same message formats and conventions (with respect to keys, etc.) that tinc uses. There is documented evidence that at least three incompatible variants of ChaCha20-Poly1305 exist in the wild, and it's not clear to me which one OpenSSL uses (or even which one tinc uses, for that matter). I did not attempt to make my experimental "OpenSSL tinc" communicate with "vanilla tinc" nodes, and I suspect there might be some challenges there.

from tinc.

millert avatar millert commented on August 20, 2024

@dechamps I'm interested in all the pieces, though I had planned on starting with the multiple thread work. Ultimately, I'd like the ability to use ciphers other than just Chacha20-Poly1305 with tinc 1.1 so OpenSSL support would help further that goal.

from tinc.

millert avatar millert commented on August 20, 2024

@gsliepen Any reason the async tun reader can't just use a pipe() instead of eventfd() on non-Linux systems?

from tinc.

gsliepen avatar gsliepen commented on August 20, 2024

@dechamps: no I didn't measure a performance improvement, but the buffer pool is a step towards less memcpy()ing. The goal is that a buffer from the pool can be used all the way from a read() from tun to the point where the payload is encrypted before it is sent to a remote peer, even if these things happen in different threads. And vice versa for data received from a peer that needs to be sent to the tun device.

@millert: like @dechamps says, tinycthread is preferred. As for pipe() vs eventfd(): yes, that is a possible solution, but you have to be careful that you don't block, and if you would block, you need to do the right thing. And on Windows, you have to use yet another thing.
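
On systems without eventfd(), the classic self-pipe version of the same wakeup handles the don't-block concern by making both ends non-blocking and treating a full pipe as "a wakeup is already pending" (a sketch):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    static int wake_pipe[2];   /* [0]: read end for the event loop, [1]: write end for workers */

    int wake_pipe_init(void) {
        if (pipe(wake_pipe) < 0)
            return -1;
        fcntl(wake_pipe[0], F_SETFL, O_NONBLOCK);
        fcntl(wake_pipe[1], F_SETFL, O_NONBLOCK);
        return 0;
    }

    /* Worker side: EAGAIN just means the event loop already has a wakeup pending. */
    void wake_signal(void) {
        char c = 1;
        while (write(wake_pipe[1], &c, 1) < 0 && errno == EINTR)
            ;
    }

    /* Event-loop side, when wake_pipe[0] is readable: drain it completely. */
    void wake_ack(void) {
        char buf[64];
        while (read(wake_pipe[0], buf, sizeof(buf)) > 0)
            ;
    }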

from tinc.

millert avatar millert commented on August 20, 2024

@gsliepen I ran into a few bugs trying to make this work on macOS that I've fixed in https://github.com/millert/tinc/tree/1.1-multithreaded

However, there is still a deadlock that I haven't identified, where async_sendto() calls async_pool_put(), which sleeps on the pool mutex and never wakes up. This doesn't seem to happen on Linux. Linux didn't seem to mind that the mutex was not initialized either, so there are clearly some differences there.

from tinc.

millert avatar millert commented on August 20, 2024

Aha, the condition variable was also not being initialized. I missed that when I noticed the missing mtx_init().

from tinc.

wenerme avatar wenerme commented on August 20, 2024

I get almost the same speed (rx: 23.5 Mbit/s). With scp directly to the server I can reach about 100 Mbit/s; I hope this gets improved. I use tinc-pre 1.1.15.

tinc version v3.6.0-5371-gf3b959ae2a (built Nov 9 2017 21:32:15, protocol 17.7)

from tinc.

cyberfred2002 avatar cyberfred2002 commented on August 20, 2024

I stumbled upon this as I was having some of the same performance problems as documented. It appears that @millert has put some effort into performance enhancements with https://github.com/millert/tinc/tree/1.1-multithreaded and https://github.com/millert/tinc/commits/1.1-gcm-rebased. Are there any plans to bring them into 1.1?

from tinc.

EugenMayer avatar EugenMayer commented on August 20, 2024

I am really wondering how you get your numbers here - I seem to be able to get around 230 Mbit/s max, from DC node to DC node (OPNsense-based tinc). The direct connection, though, maxes out at 850 Mbit/s.

CPU load on direct connect was 4.5, with tinc 13.40.

Both OPNsense boxes are virtualized KVM instances, and I wonder if there is an issue with the CPU/machine type and access to the AES hardware or similar. I actually pass the CPU through using "host", running all this on Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz or better.

i am using

tinc: 1.0.35

Cipher=aes-256-cbc
Digest=sha256

Besides the heavy-duty improvements I have seen mentioned here, like swapping in OpenSSL and others, are there ways to get a (less impressive) improvement?
It seems there has been a lot of work in 1.1 - anything really to expect right now?


Seen people trying

Cipher=aes-128-cbc
ProcessPriority=high
Digest=sha256

zero changes for me

from tinc.

saschaarthur avatar saschaarthur commented on August 20, 2024

Any updates here? Loving tinc and using it on multiple servers, but this bottleneck is quite heavy. Multi-threading / CPU hardware acceleration would be awesome.

from tinc.

breisig avatar breisig commented on August 20, 2024

Any update?

from tinc.

rearden-steel avatar rearden-steel commented on August 20, 2024

@millert is there any chance to finish your work on this?

from tinc.

atomlab avatar atomlab commented on August 20, 2024

I have a problem with tinc when traffic goes above 100 Mbit/s: tinc's CPU usage goes up to 50% and I see increasing latency to other servers. I use TCP mode for tinc.

CPU i7-7700 CPU @ 3.60GHz
tinc version 1.0.26

from tinc.

JustBru00 avatar JustBru00 commented on August 20, 2024

I am also having issues with high CPU usage on the $5/month Digital Ocean VM. I am running Tinc version 1.0.26-1 on Ubuntu 16.04.6. This appears to be caused by many small packets being sent at the same time.

I am attempting to find a way to replicate my problem consistently although it seems to pop up out of nowhere.

from tinc.

dwmw2 avatar dwmw2 commented on August 20, 2024

The vhost support is now functional, if you want to play with it in tinc. I'm happy to license it under LGPLv2.1+ instead of just LGPLv2.1, which makes it compatible.

https://gitlab.com/openconnect/openconnect/-/compare/master...vhost

I do ideally need to suppress some notifications.

from tinc.

splitice avatar splitice commented on August 20, 2024

I can't stress enough how awesome something to improve performance (particularly on VMs, which have substantially more expensive userspace <-> kernelspace transitions) would be.

Here's some data on the top consumer of CPU time in perf. This is on 1.0.35 running on a Vultr VPS.

-   67.17%     0.00%  tincd    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     entry_SYSCALL_64_after_hwframe
   - do_syscall_64
      + 31.84% __x64_sys_pselect6
      + 23.58% ksys_write
      + 7.01% __x64_sys_sendto
      + 3.29% __x64_sys_recvfrom
      + 0.67% ksys_read

This is on an instance doing relatively little (approx 5mbps) and sitting at 8% CPU.

Looking at strace, pselect is way too common for a syscall that's rarely used. The calls all look like they have the same read/write set. Would there be an advantage in switching to epoll, then, and simply waiting on the epoll fd?

pselect6(26, [3 4 5 8 10 13 14 16 19 20 21 22 24 25], [], NULL, {tv_sec=1, tv_nsec=0}, {[], 8}) = 1 (in [5], left {tv_sec=0, tv_nsec=992037660})

Inside the kernel, pselect seems to be quite expensive, both in terms of lock contention (not sure what else is fighting, perhaps the network stack?) and just in the TCP polling.
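
The epoll version of that loop would look roughly like this (a sketch; tinc's real event loop tracks many more fds and timeouts):

    #include <sys/epoll.h>

    /* Register the fds once instead of rebuilding an fd_set on every iteration. */
    int make_epoll(int tun_fd, int udp_fd) {
        int ep = epoll_create1(0);

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = tun_fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, tun_fd, &ev);

        ev.data.fd = udp_fd;
        epoll_ctl(ep, EPOLL_CTL_ADD, udp_fd, &ev);
        return ep;
    }

    void event_loop(int ep) {
        struct epoll_event events[16];
        for (;;) {
            /* Unlike pselect(), the kernel does not re-scan and copy in the whole fd set per call. */
            int n = epoll_wait(ep, events, 16, 1000 /* ms */);
            for (int i = 0; i < n; i++) {
                /* dispatch on events[i].data.fd */
            }
        }
    }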

Also, @fangfufu, have you seen https://www.cs.cornell.edu/courses/cs5413/2014fa/projects/group_of_ej222_jj329_jsh263/ ? In my background research into pselect and epoll in tinc I came across this. It seems to indicate slight improvements are possible with epoll (and some other low-hanging fruit) in tinc. The big improvement they had was switching to sendmmsg (something that would require more significant changes). I'm benchmarking on 1.0 too, which doesn't have the 1.1 sendmmsg optimization, but in my testing AES significantly outperforms ChaCha with AES-NI, so for now 1.0 isn't replaced.

from tinc.

dwmw2 avatar dwmw2 commented on August 20, 2024

I haven't looked at the packet format you're using and whether you're using AEAD ciphers anyway but in case it's relevant, implementing stitched AES+SHA gave me a 40% improvement in crypto bandwidth — benchmarking pure crypto (just encrypting the same packet over and over again) took it to 5Gb/s as opposed to just 3Gb/s when I was calling into the crypto library separately for encryption vs. HMAC.

https://gitlab.com/openconnect/openconnect/-/commits/hacks2/

from tinc.

splitice avatar splitice commented on August 20, 2024

@dwmw2 that vhost code sounds really interesting. I've been reading over it for a couple days now.

If I am correct, it looks like TUNSETSNDBUF allows attaching an mmap'ed ring buffer to the "write" side of the tun/tap. That's awesome (and something I was entirely unaware existed!). It looks like TUNSETSNDBUF isn't even that new either; it's reasonably well supported.

The tun/tap write() calls are the single biggest users of CPU left on my WIP fork (syscall usage at near 50%, the main user being the write() calls). I dare say eliminating those could easily double the PPS tinc is capable of.

Any chance you could commit something (even hacky) as a fork? FYI, the file most likely to need the majority of the changes would be linux/device.c. I'd suggest, at least for benchmarking purposes, replacing write_packet with a call that copies to the ring buffer (and always returns success if there is space), and of course setup_device to bootstrap the queue.

I'd be happy to help out. Even if just running comparison benchmarks and review for commit / merge.

from tinc.

dwmw2 avatar dwmw2 commented on August 20, 2024

If I am correct, it looks like TUNSETSNDBUF allows attaching an mmap'ed ring buffer to the "write" side of the tun/tap. That's awesome (and something I was entirely unaware existed!). It looks like TUNSETSNDBUF isn't even that new either; it's reasonably well supported.

Strictly speaking, it's vhost-net that allows you to attach the mmap'ed ring to both the read and write sides of the tun/tap. It's mostly designed for virtualisation, to allow guests to have fast virtio Ethernet, which is why it is fairly buggy when it's used for L3 "tun" mode instead of L2(Ethernet) "tap" mode. To make it work, we have to enable a virtio-net header on the packets, even though we don't want it, because the kernel makes bogus assumptions that userspace will always ask for that. And because the XDP fast path for transmit is also hosed for tun mode completely (it really does assume it handles Ethernet packets), we also have to limit the send buffer size (sndbuf) because that has the undocumented side effect of disabling the broken XDP code path and falling back to a slightly slower, but actually working, code path in vhost-net. I have patches for the kernel bugs, but we want something that works on today's released kernels, which I've achieved with the workarounds.

tl;dr: er, actually, don't read any of that; just use the example code which works, and which I've told you you can use under LGPLv2.1+.

Basically, vhost-net just hides the fact that we're copying packet around between userspace and the kernel. I don't think it'll give you anything that a decently tuned threading model in userspace wouldn't. But it may well be simpler to implement than messing around with userspace threading.

I don't really have much time for playing with tinc myself. The code in my tree should transplant fairly easily; why don't you give it a try?

FWIW I took a brief look at the SPTPS protocol and it looks very much like you would benefit a lot from that stitched AES+SHA assembler code.

from tinc.

splitice avatar splitice commented on August 20, 2024

@dwmw2 Well, that's a discussion and a half.

The summary of the kernel situation is as I understand it (correct me if I am wrong):

  • Lowered performance as you have to transfer fake vhost header (it's all padding)
  • XDP fast path assumes packets start with Ethernet headers, incompatible with the above header
  • XDP fast path can be disabled by not compiling in XDP, running a lowered sndbuf, running an older kernel (pre XDP) or patched kernel

Even with the XDP fast path disabled there should be considerable gains just by reducing the number of syscalls (primarily the write()s you kick the ring with in your code). Unless the virtio-net copy code path is just that bad...

Q. With XDP disabled (e.g. a lowered sndbuf), pushing say 16-32 smaller packets (or whatever you can fit in your hacked sndbuf) per kick, do you not see any real-world performance improvements? Are you testing on servers or VMs (massively higher syscall cost; I've been measuring a scale of ~10 microseconds per call or more on public clouds)? As a result, tinc exceeding 50k PPS per core to tun/tap with the write() model, even after perfect optimization, seems unlikely on any VM.

Regarding your threading comment: I think threading has a role to play. But simply throwing more threads at a slow process isn't going to solve the problem. Currently, on the VMs I'm testing on, tinc tops out at 300-400 Mbit/s (200 Mbit/s both ways) with 1500-byte packets. On dedicated servers 1 Gbit/s is achievable, but I wouldn't call it smooth. Threads might multiply that by 2-3x, but at what cost? Likely an entire quad-core CPU.

from tinc.

splitice avatar splitice commented on August 20, 2024

@dwmw2

I don't really have much time for playing with tinc myself. The code in my tree should transplant fairly easily; why don't you give it a try?

I understand completely. However, you clearly have a good understanding of the potential pitfalls and the hacks in place to "make it work". It's not the tinc side of the integration that has me daunted...

which is why it is fairly buggy when it's used for L3 "tun" mode instead of L2(Ethernet) "tap" mode.

Tinc does L2 (Ethernet) "tap" mode as well as tun, so for testing purposes tap is probably ideal anyway.

from tinc.

dwmw2 avatar dwmw2 commented on August 20, 2024

The summary of the kernel situation is as I understand it (correct me if I am wrong):

• Lowered performance as you have to transfer fake vhost header (it's all padding)

It's ten bytes. I suspect it's in the noise.

• XDP fast path assumes packets start with Ethernet headers, incompatible with the above header
• XDP fast path can be disabled by not compiling in XDP, running a lowered sndbuf, running an older kernel (pre XDP) or patched kernel

It's not that the XDP fast path is incompatible with the virtio-net header. It's that it assumes Ethernet (L2) mode and doesn't work for tun. Like the non-XDP path, it also assumes a virtio-net header, which comes before the Ethernet header. So XDP might go a little bit faster if we could use it, but for OpenConnect I can't (without requiring my kernel patches).

With XDP disabled (e.g. a lowered sndbuf), pushing say 16-32 smaller packets (or whatever you can fit in your hacked sndbuf)

The sndbuf only has to be "less than INT_MAX" to cause the XDP code path to be disabled. It can still be quite large :)
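
For anyone experimenting, the tun-side setup being described (enable the virtio-net header, cap sndbuf below INT_MAX) looks roughly like the sketch below; this is pieced together from this thread, not taken from the OpenConnect code, and error handling is omitted:

    #include <fcntl.h>
    #include <linux/if_tun.h>
    #include <linux/virtio_net.h>
    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>

    int open_tun_for_vhost(const char *name) {
        int fd = open("/dev/net/tun", O_RDWR);

        struct ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_NO_PI | IFF_VNET_HDR;   /* vhost-net needs the virtio-net header */
        strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
        ioctl(fd, TUNSETIFF, &ifr);

        int hdrsz = sizeof(struct virtio_net_hdr);            /* the ten-byte header discussed above */
        ioctl(fd, TUNSETVNETHDRSZ, &hdrsz);

        int sndbuf = 1024 * 1024;                             /* anything below INT_MAX sidesteps the broken XDP path */
        ioctl(fd, TUNSETSNDBUF, &sndbuf);

        return fd;
    }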

…per kick, do you not see any real-world performance improvements? Are you testing on servers or VMs (massively higher syscall cost; I've been measuring a scale of ~10 microseconds per call or more on public clouds)? As a result, tinc exceeding 50k PPS per core to tun/tap with the write() model, even after perfect optimization, seems unlikely on any VM.

I haven't really done the serious performance testing yet. Aside from getting distracted by fixing the kernel because it offended me, and designing the optimal fully in-kernel data path that I actually want to use in future, there's also some low-hanging fruit on the userspace side that I need to do first, like re-using packets instead of constant malloc/free, and switching to epoll().

I think you have a good point that vhost-net could be better than pure threading, because of the eliminated syscall cost.

However you do clearly have a good understanding of the potential pitfalls and hacks in place to "make it work"

Sure, but all that is covered when I say "start by cutting and pasting my vhost.c". That is carefully crafted to use the code paths that do work on today's kernel and it ought to Just Work™.

from tinc.

dwmw2 avatar dwmw2 commented on August 20, 2024

OK, I managed to do some testing. Full details at https://gitlab.com/openconnect/openconnect/-/issues/263#note_613536995

In short, ESP TX performance on my EC2 c5.9xlarge instance:

  • Current OpenConnect master: 1.6Gb/s
  • AES+SHA1 stitched assembler: 2.1Gb/s
  • vhost-net: 1.8Gb/s
  • Both vhost and AES+SHA asm: 2.4Gb/s

So I'd definitely suggest both those are worth looking at.

RX performance is less clear but a lot of the time is now taken in select() and eventfd_write(), which should go away if the sending side could actually manage to saturate the link.

from tinc.

splitice avatar splitice commented on August 20, 2024

Perhaps relevant

[screenshot: ShareX_2021-06-29_12-55-09]

The screenshot shows the breakdown for 1.1 (left) and with epoll (right); in both cases just over 50% of the CPU time is going to TUN/TAP read()/write().

I think this supports my claim that tun/tap overheads are more important to eliminate than crypto, at least on virtual machines.

from tinc.

dwmw2 avatar dwmw2 commented on August 20, 2024

Dunno where your crypto went :)

Here's mine, running 'iperf3 -u -b 2600M' over the VPN, with and without '-R'. Receiving on the left (where I'm limited to the ~1.2Gb/s that the kernel ESP on the other end can send, so I spend more time in syscalls than I otherwise would because I genuinely do spend time waiting, at only about 80% CPU time), and sending on the right, where I am pegged at 100% CPU time sending about 2.5Gb/s of encrypted traffic, while vhost spends about 20% of another vCPU sending tun packets.
[screenshot: perf]

Not sure if you're following my vhost branch; I may have made it a bit confusing by rebasing on top of the epoll implementation I just merged to master. The main things I changed were writing directly to the tun device when we aren't busy, to keep latency down, and reading from the vhost 'call' eventfd last after doing all the actual work, before going back into epoll. Again to keep latency down.

from tinc.

splitice avatar splitice commented on August 20, 2024

@dwmw2 1-byte UDP payload packets encrypt quickly.

I've taken a copy of the vhost branch already. Not sure when I'll have the time to hack it in, though. It's more difficult than I first thought, due to tinc using a single packet on the stack vs. your zero-copy queues.

from tinc.

splitice avatar splitice commented on August 20, 2024

Also, I use perf with -g when tracing syscalls. That way the whole kernel side is included (something I find more representative of a call's cost).

from tinc.

dwmw2 avatar dwmw2 commented on August 20, 2024

@dwmw2 1b UDP payload packets encrypt quickly.

Heh, yeah. But perhaps less representative of real-world traffic. I suppose 'iperf3 -u -b 2800M' is hardly representative of real-world traffic either but it seems like a better holistic benchmark for the tunnel :)

I have indeed been using 'perf -g' as well; they give slightly different views but this one was simpler for a screenshot.

from tinc.
