Comments (8)
Hello, I have the same problem as you: iperf performance is very poor. Have you solved it yet?
from corundum.
If you have multiple NUMA nodes, you'll need to use numactl or similar to run everything that touches the NIC on the same NUMA node. You can see the NUMA node associated with the PCIe device in /sys/class/net/<dev>/device/numa_node. For iperf, you'll have to run both the client and the server under numactl. So if sysfs reports NUMA node 3, then you'll want to run numactl -N 3 iperf3 -s. It's always a good idea to compare the performance against a commercial 100G NIC as a sanity check - if the commercial NIC can't get anywhere near 100 Gbps, then there is something else going on.
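That lookup-and-pin step can be scripted. A minimal sketch, assuming the interface is named eth0 (substitute your own device); the added -m flag also binds memory allocation to the node, and the kernel reports -1 when there is no node affinity to worry about:

```shell
# Build the numactl prefix for a given NUMA node.
# A node of -1 means no affinity is reported, so no pinning is needed.
numa_prefix() {
    if [ "$1" -ge 0 ]; then
        printf 'numactl -N %s -m %s ' "$1" "$1"
    fi
}

# Look up the NIC's node ("eth0" is a placeholder) and print the
# pinned server command; run the iperf3 client the same way.
node=$(cat /sys/class/net/eth0/device/numa_node 2>/dev/null || echo -1)
echo "$(numa_prefix "$node")iperf3 -s"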
Strange that you're not getting anywhere near 10 Gbps even with one process; on the server I tested corundum on for the paper (dual socket Intel Xeon 6138), a single iperf process would run at 20-30 Gbps with 1500 byte MTU, or 40-50 Gbps with 9000 byte MTU.
Other things to check: look at lspci and make sure the card is running at gen 3 x16. Not all x16 slots have all lanes wired, and even when they do, the lanes aren't always all connected to the same slot - we had to pull some 10G NICs out of the server riser cards to get all 16 lanes on one slot, instead of splitting them across two slots. Also, make sure the card has registered all 32 interrupts. It currently uses MSI, and for some reason some motherboards don't fully support MSI and limit devices to 1 interrupt instead of 32, which will limit performance. Moving to MSI-X is on the to-do list.
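Both checks can be done from the shell. A sketch - the PCI address 81:00.0 and the driver name mqnic are placeholders to substitute for your system:

```shell
# Extract the negotiated link width from an lspci "LnkSta" line;
# a fully wired gen 3 x16 slot should report x16 (and Speed 8GT/s).
link_width() { sed -n 's/.*Width \(x[0-9]*\).*/\1/p'; }

# Usage on real hardware (81:00.0 is a placeholder address):
#   sudo lspci -vv -s 81:00.0 | grep LnkSta | link_width
# Count the registered interrupt vectors; expect 32 with working MSI:
#   grep -c mqnic /proc/interrupts

# Demonstrate the parser on a sample LnkSta line:
echo 'LnkSta: Speed 8GT/s, Width x16, TrErr- Train-' | link_width  # → x16
```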
I'm putting together some notes on performance tuning here: https://github.com/corundum/corundum/wiki/Performance-Tuning
I'm also going to take a look on my end and see if I missed any settings between what was used for the test in the paper and what is in the repo. I know I added some instrumentation to the driver for measuring a few things, but this would not improve performance. However, I may have adjusted a couple of other settings in the FPGA config that perhaps didn't make it into the repo. Now that the cocotb migration is complete, I have some time to do some performance measurements, and I will be sure to put together some notes on that.
I've tested two corundum FPGAs (ADM9V3 and VCU118) on two separate servers. Although I haven't reached 100 Gbps line rate, single-thread performance increased to 30 Gbps.
So I assume that namespace isolation has certain limitations compared to two real physical nodes.
Thanks for your suggestions :)
From an optimization perspective, the application (e.g. iperf), queue handling, and interrupt handling should all be bound to the correct NUMA node to get stable performance.
Here's an example of setting eth0's queue and interrupt CPU affinity to NUMA node 1 (my server has two 14C/28T E5 v4 processors, and cores 0-13 and 28-41 are in NUMA node 1). The IRQ numbers need to be checked in /proc/interrupts.
In my case, single-thread performance improved up to 16 Gbps.
```shell
for i in {0..255}; do echo 0003ff,f0003fff > /sys/class/net/eth0/queues/rx-$i/rps_cpus ; done
for i in {0..8191}; do echo 0003ff,f0003fff > /sys/class/net/eth0/queues/tx-$i/xps_cpus ; done
for i in {83..114}; do echo 0-13,28-41 > /proc/irq/$i/smp_affinity_list ; done
```
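The hex mask above encodes CPUs 0-13 and 28-41 as the comma-separated 32-bit groups that rps_cpus/xps_cpus expect. A small helper to compute such a mask from CPU ranges - a sketch; the kernel accepts the value with or without leading zeros:

```shell
# Turn CPU ranges (e.g. "0-13 28-41") into the comma-separated 32-bit
# hex groups used by rps_cpus/xps_cpus. Relies on 64-bit shell
# arithmetic (bash), so it covers CPUs 0-63.
cpu_mask() {
    mask=0
    for range in "$@"; do
        lo=${range%-*}; hi=${range#*-}
        [ "$lo" = "$range" ] && hi=$lo   # single CPU, no dash
        i=$lo
        while [ "$i" -le "$hi" ]; do
            mask=$(( mask | (1 << i) ))
            i=$(( i + 1 ))
        done
    done
    printf '%x,%08x\n' $(( (mask >> 32) & 0xffffffff )) $(( mask & 0xffffffff ))
}

cpu_mask 0-13 28-41   # → 3ff,f0003fff (same value as 0003ff,f0003fff)
```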
Hmm, that's a good point; my main use case for namespace isolation was testing the UDP stack without having to screw around with IP addresses if my local network conflicts. But for corundum, it's probably not sufficient for performance isolation, although I have been using it quite a bit recently since it's at least a decent tool for sanity checking.
Now, interrupt handling: if you want to figure out how to do that automatically in the driver, I would gladly accept a pull request for that. Although I would think Linux should be smart enough to figure out that interrupts for a device should be bound to the same NUMA node.
Recently I switched my test platform from Broadwell to Skylake and installed an MLX5 NIC to replace the VCU118, and most of the issues have disappeared :)
With network namespace isolation, the throughput between an ADM9V3 and an MLX5 can reach 95-97 Gbps on a single server (MTU = 8K; there is a problem at 9K MTU). The bottleneck appears to be on the CPU side (old architecture and low frequency). Even so, on servers from different vendors, e.g. Supermicro and Dell, the performance is not the same, which may be related to the BIOS implementation.
And you are right about the interrupt handling. On up-to-date kernels, disabling irqbalance and manually assigning hard/soft IRQs does not help overall throughput.
I've tried all kinds of BIOS- and OS-level optimizations on the old Broadwell server, and I have to laugh at myself that most of them were actually negative optimizations. Still, it was a good lesson to learn ;)
Thanks & Best regards,
Xiaohai