Comments (8)
Hello, I have the same problem as you: iperf performance is very poor. Have you solved it yet?
from corundum.
If you have multiple NUMA nodes, you'll need to use numactl or similar to run everything that touches the NIC on the same NUMA node. You can see the NUMA node associated with the PCIe device in /sys/class/net/<dev>/device/numa_node. For iperf, you'll have to run both the client and the server under numactl. So if sysfs reports NUMA node 3, then you'll want to run numactl -N 3 iperf3 -s. It's always a good idea to compare the performance against a commercial 100G NIC as a sanity check - if the commercial NIC can't get anywhere near 100 Gbps, then there is something else going on.
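That lookup-and-pin step can be scripted. A minimal sketch, assuming the interface is named eth0 (substitute your own device); the added -m flag also binds memory allocation to the node, and the kernel reports -1 when there is no node affinity to worry about:

```shell
# Build the numactl prefix for a given NUMA node.
# A node of -1 means no affinity is reported, so no pinning is needed.
numa_prefix() {
    if [ "$1" -ge 0 ]; then
        printf 'numactl -N %s -m %s ' "$1" "$1"
    fi
}

# Look up the NIC's node ("eth0" is a placeholder) and print the
# pinned server command; run the iperf3 client the same way.
node=$(cat /sys/class/net/eth0/device/numa_node 2>/dev/null || echo -1)
echo "$(numa_prefix "$node")iperf3 -s"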
Strange that you're not getting anywhere near 10 Gbps even with one process; on the server I tested corundum on for the paper (dual socket Intel Xeon 6138), a single iperf process would run at 20-30 Gbps with 1500 byte MTU, or 40-50 Gbps with 9000 byte MTU.
Other things to check: look at lspci and make sure the card is running at gen 3 x16. Not all x16 slots have all lanes wired, and even when they do, the lanes aren't always all connected to the same slot - we had to pull some 10G NICs out of the server riser cards to get all 16 lanes on one slot, instead of splitting them across two slots. Also, make sure the card has registered all 32 interrupts. It currently uses MSI, and for some reason some motherboards don't fully support MSI and limit devices to 1 interrupt instead of 32, which will limit performance. Moving to MSI-X is on the to-do list.
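Both checks can be done from the shell. A sketch - the PCI address 81:00.0 and the driver name mqnic are placeholders to substitute for your system:

```shell
# Extract the negotiated link width from an lspci "LnkSta" line;
# a fully wired gen 3 x16 slot should report x16 (and Speed 8GT/s).
link_width() { sed -n 's/.*Width \(x[0-9]*\).*/\1/p'; }

# Usage on real hardware (81:00.0 is a placeholder address):
#   sudo lspci -vv -s 81:00.0 | grep LnkSta | link_width
# Count the registered interrupt vectors; expect 32 with working MSI:
#   grep -c mqnic /proc/interrupts

# Demonstrate the parser on a sample LnkSta line:
echo 'LnkSta: Speed 8GT/s, Width x16, TrErr- Train-' | link_width  # → x16
```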
I'm putting together some notes on performance tuning here: https://github.com/corundum/corundum/wiki/Performance-Tuning
I'm also going to take a look on my end and see if I missed any settings between what was used for the test in the paper and what is in the repo. I know I added some instrumentation to the driver for measuring a few things, but this would not improve performance. However, I may have adjusted a couple of other settings in the FPGA config that perhaps didn't make it into the repo. Now that the cocotb migration is complete, I have some time to do some performance measurements, and I will be sure to put together some notes on that.
I've tested two corundum FPGAs (ADM9V3 and VCU118) on two separate servers. Although I haven't reached 100 Gbps line rate, single-thread performance increased to 30 Gbps.
So I assume that namespace isolation has certain limitations compared to two real physical nodes.
Thanks for your suggestions :)
From an optimization perspective, the application (e.g. iperf), queue handling, and interrupt handling should all be bound to the correct NUMA node to get stable performance.
Here's an example of setting eth0's queue and interrupt CPU affinity to NUMA node 1 (my server has two 14C/28T E5 v4 processors, and cores 0-13 and 28-41 are in NUMA node 1). The IRQ numbers need to be checked in /proc/interrupts.
In my case, single-thread performance improved up to 16 Gbps.
```shell
for i in {0..255}; do echo 0003ff,f0003fff > /sys/class/net/eth0/queues/rx-$i/rps_cpus ; done
for i in {0..8191}; do echo 0003ff,f0003fff > /sys/class/net/eth0/queues/tx-$i/xps_cpus ; done
for i in {83..114}; do echo 0-13,28-41 > /proc/irq/$i/smp_affinity_list ; done
```
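The hex mask above encodes CPUs 0-13 and 28-41 as the comma-separated 32-bit groups that rps_cpus/xps_cpus expect. A small helper to compute such a mask from CPU ranges - a sketch; the kernel accepts the value with or without leading zeros:

```shell
# Turn CPU ranges (e.g. "0-13 28-41") into the comma-separated 32-bit
# hex groups used by rps_cpus/xps_cpus. Relies on 64-bit shell
# arithmetic (bash), so it covers CPUs 0-63.
cpu_mask() {
    mask=0
    for range in "$@"; do
        lo=${range%-*}; hi=${range#*-}
        [ "$lo" = "$range" ] && hi=$lo   # single CPU, no dash
        i=$lo
        while [ "$i" -le "$hi" ]; do
            mask=$(( mask | (1 << i) ))
            i=$(( i + 1 ))
        done
    done
    printf '%x,%08x\n' $(( (mask >> 32) & 0xffffffff )) $(( mask & 0xffffffff ))
}

cpu_mask 0-13 28-41   # → 3ff,f0003fff (same value as 0003ff,f0003fff)
```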
Hmm, that's a good point; my main use case for namespace isolation was testing the UDP stack without having to screw around with IP addresses if my local network conflicts. But for corundum, it's probably not sufficient for performance isolation, although I have been using it quite a bit recently since it's at least a decent tool for sanity checking.
Now, interrupt handling: if you want to figure out how to do that automatically in the driver, I would gladly accept a pull request for that. Although I would think Linux should be smart enough to figure out that interrupts for a device should be bound to the same NUMA node.
Recently I switched my test platform from Broadwell to Skylake and installed an MLX5 NIC to replace the VCU118, and most of the issues have disappeared :)
With network namespace isolation, the throughput between an ADM9V3 and an MLX5 can reach 95-97 Gbps on a single server (MTU = 8K; there is a problem at 9K MTU). The bottleneck appears to be on the CPU side (old architecture and low frequency). Even so, on servers from different vendors, e.g. Supermicro and Dell, the performance is not the same, which may be related to the BIOS implementation.
And you are right about the interrupt handling. On up-to-date kernels, disabling irqbalance and manually assigning hard/soft IRQs does not help overall throughput.
I've tried all kinds of BIOS- and OS-level optimizations on the old Broadwell server, and I have to laugh at myself that most of them were actually negative optimizations. Still, it was a good lesson to learn ;)
Thanks & Best regards,
Xiaohai