Comments (39)

alexforencich commented on July 25, 2024

What FPGA board and what host CPU? Can you show the full lspci output of the card? And is this TX, RX, or simultaneous? And did you use iperf or iperf3? (iperf, while actually being multithreaded, is not as efficient as iperf3). Corundum supports IP checksum offload for RX and TX and Toeplitz hashing for RSS. LSO support is not planned.

alexforencich commented on July 25, 2024

Also, what's the link partner, and have you done a test with a commercial 100G NIC with the same host system and link partner?

wangshuaizs commented on July 25, 2024

A U200 board and an Intel(R) Xeon(R) Platinum 8163 CPU. I only tried one-to-one, so I am not sure whether it is TX or RX. iperf is used for the test.

lspci output is as follows:

25:00.0 Ethernet controller: Device 1234:1001
        Subsystem: Device 1234:1001
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin ? routed to IRQ 282
        NUMA node: 0
        Region 0: Memory at 4bfff000000 (64-bit, prefetchable) [size=16M]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] MSI: Enable+ Count=32/32 Maskable+ 64bit+
                Address: 00000000fee00418  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [1c0 v1] #19
        Kernel driver in use: mqnic

alexforencich commented on July 25, 2024

That looks like a reasonably performant CPU, and it looks like the card is running at full gen 3 x 16 bandwidth. Max payload of 256 is also reasonable. Is this a dual-socket system with NUMA, or is there only one CPU? If you have two CPUs, then you need to use numactl to run iperf on the same node, which in this case is node 0 (as reported by lspci). Also, the machine hosting the link partner also plays a role - is that an identical machine?
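As a sketch of that suggestion (the peer address and process count here are assumptions, not from this thread), pinning iperf to NUMA node 0 with numactl could look like:

# server side, bound to node 0 for both CPU and memory allocation
numactl --cpunodebind=0 --membind=0 iperf -s

# client side, likewise bound to the node local to its NIC
numactl --cpunodebind=0 --membind=0 iperf -c 192.168.0.2 -P 8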

wangshuaizs commented on July 25, 2024

There is only one CPU, and the two servers are identical. The performance of iperf3 is worse, only 8Gbps with default settings.

alexforencich commented on July 25, 2024

What's the link partner? And what kind of performance do you get with two commercial 100G NICs cross-connected in the same machines?

wangshuaizs commented on July 25, 2024

What does link partner mean? A server or a switch, or something else? How can I get the info about the link partner?

alexforencich commented on July 25, 2024

Link partner is whatever is on the other end of the QSFP28 cable that's plugged in to the FPGA board.

wangshuaizs commented on July 25, 2024

Ok, the two FPGA cards are connected through a P4 switch.
Also, it seems that RX is the bottleneck. In the 1-to-2 case, each iperf sender can get 26 Gbps, but in the 2-to-1 case, each iperf sender only gets 12 Gbps.

alexforencich commented on July 25, 2024

Ah, so you only have corundum running? Do you have any commercial 100G NICs, perhaps something from Intel or Mellanox? It would be good to do a quick sanity check to make sure you're seeing reasonable performance with a commercial NIC.

alexforencich commented on July 25, 2024

Also, IOMMU on or off? Have you tried with the IOMMU turned off?
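A quick way to check the IOMMU state (a sketch; the exact log strings vary by platform and kernel) is to look at the boot log and the kernel command line:

# look for IOMMU/DMAR initialization messages
dmesg | grep -i -e DMAR -e IOMMU

# check the kernel command line for intel_iommu= / amd_iommu= / iommu= options
cat /proc/cmdline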

wangshuaizs commented on July 25, 2024

Hi, I replaced one of the AU200 cards with a Mellanox CX6 NIC, and found that when running iperf from the AU200 to the CX6, 16 connections achieve 75.8 Gbps in total, but when running iperf from the CX6 to the AU200, the total throughput is only 23.7 Gbps. It seems that when Corundum is used as the receiver, the performance is low. Do you have any idea about the cause of the receiver bottleneck?
In addition, IOMMU is off.

kaoruzhu1 commented on July 25, 2024

I also noticed that RX performance is lower than TX.
For example, in a 10G variant on an AMD platform, single-thread TX performance is 9.x Gbps and quite stable, but RX is about 7.x Gbps with swings.
I currently have no idea what is happening.
I should dig deeper...

alexforencich commented on July 25, 2024

Interesting that the RX performance is that low. This is with 1500 byte MTU frames? Have you tried 9KB just for comparison? And I suppose if the CX6 can receive at 75.8 in an identical machine, then the machine in question should definitely be capable of receiving at that rate. Can you check the interrupt stats to make sure that all of the interrupts from the card are being used? There are also a few things that you can put an ILA on to get some idea of where the bottleneck might be. Some of these should probably be brought out as performance counters of some sort, for others I'm not sure the best way to go about doing that.
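One way to check the interrupt distribution while iperf is running (a sketch; matching on the driver name in /proc/interrupts is an assumption):

# confirm that all of the card's MSI vectors are firing and see which cores service them
grep mqnic /proc/interrupts

# or watch the counters change during a test
watch -n 1 'grep mqnic /proc/interrupts'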

Also, I will note that both the driver and the gateware can probably use some optimizations to improve performance. Variable-length descriptor support should help at least by supporting descriptor block reads, and there are definitely memory management improvements that can be made to the driver. Unfortunately, I am not an expert at kernel development; I'm hoping that there will be enough community interest in Corundum to turn up some potential contributors with more experience in that area who can help out.

wangshuaizs commented on July 25, 2024

I added a log before https://github.com/ucsdsysnet/corundum/blob/7c8abe261b2ec3e653da7bc881f769668a231bde/modules/mqnic/mqnic_rx.c#L296 to get info about cq_ring->head_ptr and cq_ring->tail_ptr. I found that there are some bursts in head - tail, but only a few CQEs are in the CQ ring most of the time. More details are shown in the figure.
[figure: cq_ring head_ptr and tail_ptr over time]

alexforencich commented on July 25, 2024

Another thing that you can try: increase the number of in-flight RX and TX operations like so:

parameter TX_DESC_TABLE_SIZE = 64;
parameter TX_PKT_TABLE_SIZE = 16;
parameter RX_DESC_TABLE_SIZE = 64;
parameter RX_PKT_TABLE_SIZE = 16;

And you are seeing multiple RX queues in use, correct? You just picked one for plotting?

wangshuaizs commented on July 25, 2024

Yes, multiple RX queues are used. The figure is plotted with data from a run where iperf has only one connection from the CX6 to the U200. I also tested with more connections, and each queue is used normally. When I tested with two U200 cards, head - tail has smaller fluctuations than with one CX6 NIC and one U200 card.

wangshuaizs commented on July 25, 2024

> Another thing that you can try: increase the number of in-flight RX and TX operations like so:
>
> parameter TX_DESC_TABLE_SIZE = 64;
> parameter TX_PKT_TABLE_SIZE = 16;
> parameter RX_DESC_TABLE_SIZE = 64;
> parameter RX_PKT_TABLE_SIZE = 16;
>
> And you are seeing multiple RX queues in use, correct? You just picked one for plotting?

Hi, Alex. I tried setting the parameters to the values you mentioned above, but it achieved the same performance as the default parameters.

wangshuaizs commented on July 25, 2024

Hi, Alex,

Sometimes the one-to-one throughput can be 17 Gbps, while it is 10 Gbps most of the time. So I added a log to get the CQ index, RX queue head, and RX queue tail before

while (cq_ring->head_ptr != cq_tail_ptr && done < budget)

The following two figures show the elapsed time between two consecutive outputs and the corresponding head, tail, and CQ index. Hope this is helpful in locating the cause.

[figure: elapsed time between consecutive log outputs]

[figure: head, tail, and CQ index]

Winters123 commented on July 25, 2024

The performance issue I met is different, but I figure it might be reasonable to ask here.

The CPU is an Intel i5 and the memory size is 8 GB. I use a 690T to implement Corundum. The Corundum NIC is connected to an Intel 10G NIC back to back. The issue is that every time the throughput gets higher than 100 Mbps (1500 B packets), there is around a 1% drop rate (I've also tried it at 1 Gbps and 4 Gbps).

Another issue that I observed (not sure if they are related) is that the IP address (configured using the ifconfig command) on eth0 gets lost from time to time with the message "Activation of network connection failed".

Do you have any idea what caused the issues?

isuckatdrifting commented on July 25, 2024

I am getting similar performance to @wangshuaizs on an Alveo U50 running at PCIe Gen3 (8 GT/s) x16.
The CPU is an Intel Xeon(R) E5-2690 v3 @ 2.60GHz (24C/48T), and the memory is DDR4-2400 (running at 2133 MT/s). The Linux host for the U50 is Ubuntu Server 20.04 LTS running on an SSD.
The peer uses the same CPU, motherboard, memory, and storage. The commercial NIC is a Mellanox CX4 100GbE. The Linux host for the CX4 is CentOS 7.

The hardware and software are compiled from the default configuration at commit 38f7666.

I set the MTU of corundum to 9000, as shown below.

enp4s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.0.5  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::20a:35ff:fe06:792e  prefixlen 64  scopeid 0x20<link>
        ether 00:0a:35:06:79:2e  txqueuelen 1000  (Ethernet)
        RX packets 123213551  bytes 185574188431 (185.5 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 50241829  bytes 70609934522 (70.6 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

The output of sudo lspci -d 1234:1001 -vvv is shown below.

04:00.0 Ethernet controller: Device 1234:1001
        Subsystem: Device 1234:1001
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin ? routed to IRQ 116
        NUMA node: 0
        Region 0: Memory at 3bffe000000 (64-bit, prefetchable) [size=16M]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] MSI: Enable+ Count=32/32 Maskable+ 64bit+
                Address: 00000000fee01018  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
                DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, TPHComp-, ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [1c0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Capabilities: [1f0 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                Port Arbitration Table [500] <?>
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Kernel driver in use: mqnic

The IOMMU seems to be turned off, judging from the lspci result compared to the output at https://github.com/corundum/corundum/wiki/Performance-Tuning.

The RX performance is roughly 10 Gbit/s, and the TX performance is roughly 20 Gbit/s. In the screenshot below, the left side is the CX4 peer, and the right side is the Ubuntu host with Corundum on the U50. I started an iperf client on the CentOS peer and an iperf server on Ubuntu. Plain iperf3 measures Corundum RX performance, and iperf3 -R measures Corundum TX performance.

[screenshot: iperf3 results on both hosts]
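For reference, a sketch of the commands as described above (the addresses are taken from later in this thread; treat them as assumptions):

# on the Corundum/U50 host: server
iperf3 -s

# on the CX4 peer: forward direction, i.e. Corundum RX
iperf3 -c 192.168.0.5

# -R reverses the direction, i.e. Corundum TX
iperf3 -c 192.168.0.5 -R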

I did a CX4 100GbE to CX4 100GbE test and reached roughly 90 Gbps in my hardware setup, so I am wondering where the bottleneck might be and whether there are some configurations I missed. Thanks.

isuckatdrifting commented on July 25, 2024

After doing a cold reboot of the peer server, making sure that no other processes were tapping the Linux network stack, and changing the peer MTU to 9000, the Corundum RX bandwidth managed to reach around 45 Gbps, as below. The white-background terminal is the peer, and the black-background terminal is Corundum: iperf3 -s on the Corundum host and iperf3 -c on the peer.

[screenshot: iperf3 RX test reaching ~45 Gbps]

But then if I do the reverse test (iperf3 -c -R on the Corundum host and iperf3 -s on the peer, or iperf3 -s on the peer and iperf3 -c on the Corundum host), iperf3 is unable to send out data after the first several hundred KBytes. Moreover, ping fails afterwards as well. It seems that the TX path of the current build fails on jumbo packets.

[screenshot: reverse-direction iperf3 test stalling]

I tried dual-side MTU=6000. The RX bandwidth is around 39 Gbps, and Corundum TX fails as well. Tests running at dual-side MTU=1500 are stable, though.

[screenshot]

[screenshot]

alexforencich commented on July 25, 2024

Well, that's a bit concerning. When it gets stuck, what does mqnic-dump report?

isuckatdrifting commented on July 25, 2024

I attach the report below, taken after a clean Corundum boot and a TCP TX request with dual-side MTU=9000. It seems that most queues are not moving. Under normal conditions, if I understand correctly, the head and the tail of the ring buffers should be non-zero. Is that right?

dump.log

alexforencich commented on July 25, 2024

If you only run one instance of iperf and you don't specify -P, then only one TX queue should be used. Looks like it used 4853. However, it doesn't look like the card is hung as both TXQ 4853 and TXCQ 4853 are empty, as well as all of the event queues. Did network manager delete the IP address or something?
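If NetworkManager is the suspect, one hedged way to take it out of the picture (the interface name and address are taken from the ifconfig output in this thread, so treat them as assumptions):

# check whether NetworkManager manages the interface
nmcli device status

# stop managing it, then re-add the address by hand
sudo nmcli device set enp4s0 managed no
sudo ip addr add 192.168.0.5/24 dev enp4s0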

isuckatdrifting commented on July 25, 2024

The IP address is still there according to ifconfig on Ubuntu, but the outbound port does not seem to be responding. As shown below, I tried to capture packets on the peer after a halted iperf test. It seems that there are no outbound packets from the FPGA -- that's why ping fails as well.

[screenshot: packet capture on the peer]

On the other side, on the Ubuntu machine running Corundum, the Wireshark capture of a ping request to the peer is shown below. It seems that the peer is not receiving the ARP requests and therefore not generating ARP responses, which blocks all subsequent packets.

[screenshot: Wireshark capture of ARP requests on the Corundum host]

From these observations, it is more likely to be a failure on the TX path from my perspective.

The dmesg output seems ok btw.

[screenshot: dmesg output]


After using taskset to set the CPU affinity, at dual-side MTU=1500, Corundum RX can reach 32 Gbps with one iperf process, but as I increase the parallelism with -P 2 and taskset -c 31,32, for example, the RX bandwidth does not seem to change noticeably.

The TX bandwidth is around 20 Gbps with the iperf direction reversed, and as I increase the parallel processes the bandwidth does not seem to change much either.

alexforencich commented on July 25, 2024

With iperf3, -P only opens multiple connections (and as such will use more TX/RX queues), but you'll have to explicitly run multiple iperf processes in parallel to use more than 1 CPU core.
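A sketch of what running several parallel iperf3 processes might look like (core numbers, port numbers, and the address are arbitrary assumptions):

# each iperf3 instance is single-threaded, so start one server per port...
for i in 0 1 2 3; do
    taskset -c $i iperf3 -s -p $((5201 + i)) &
done

# ...and one client per port on the other machine, each pinned to its own core
for i in 0 1 2 3; do
    taskset -c $i iperf3 -c 192.168.0.5 -p $((5201 + i)) &
done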

Anyway, it's interesting that the TX queues are empty when it's hung. Usually when I have seen a hang in the past, there will be packets stuck in the TX queues. So it's not clear what's going on. My initial thought is that maybe the packets are not being handed off to the NIC at all. But maybe the packets are being sent, but then they're getting dropped for some reason before reaching the TX MAC? Very strange.

alexforencich commented on July 25, 2024

Hmm, I just noticed that your lspci output lists MaxReadReq as 4096 (the maximum possible value). I do not think I have any machines that set MaxReadReq that high. It's possible that there is a bug somewhere wrt. the DMA engine; I may have to add some additional tests around that. I suspect that if that were an issue, we would see a different failure mode. But perhaps not. At any rate, try an MTU setting less than 4096 (perhaps 3000) and see if that makes a difference.

isuckatdrifting commented on July 25, 2024

Okay Alex, thanks for your help. I can use 1500 for now, because that bandwidth should be enough for most circumstances.
BTW I just made a script to run multiple iperf clients from the NUMA node dedicated to the NIC on each side, and managed to reach 63 Gbps for RX and 71 Gbps for TX. That's really impressive! 👍 👍
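For reference, a sketch of what such a batch script could look like, judging from the iperf0.log through iperf7.log files and the output shown further down in this thread (the NUMA node number, run time, and client count are assumptions):

#!/bin/bash
# batch_iperf_c.sh (sketch): launch 8 iperf clients from the NIC's NUMA node
for i in $(seq 0 7); do
    numactl --cpunodebind=0 --membind=0 iperf -c 192.168.0.2 -t 10 > iperf$i.log &
done
wait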

isuckatdrifting commented on July 25, 2024

I tried to gradually increase the dual-side MTU value, and starting from 4148, corundum TX fails. Hope it helps.

Host and peer MTU    TX function
1500                 good
2000                 good
3000                 good
4000                 good
4096                 good
4100                 good
4104                 good
4108                 good
4112                 good
4116                 good
4120                 good
4124                 good
4128                 good
4132                 good
4136                 good
4140                 good
4144                 good
4148                 fail
4172                 fail
4500                 fail
6000                 fail
9000                 fail

alexforencich commented on July 25, 2024

Yep, that definitely looks like there is some issue related to the max read request size of 4096. I don't have a script to force the max read request size to something else, but what you can try is poking that register in the PCIe config space with setpci and setting it to something in the range of 512-2048. One of our testbeds actually does set the max read request size to 4096, but it's only set up for 10G, so I have never tried running with an MTU larger than 1500 before. I will see if I can replicate the issue on my end so I can get it sorted out. I also have a possible lead on the bug; it looks like I am not interpreting a byte count field value of 0 as 4096 as the spec requires. Hopefully that's the only thing that I am missing.
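A hedged sketch of poking that register with setpci (the BDF is taken from the earlier lspci output; Max_Read_Request_Size is bits 14:12 of the PCIe Device Control register; double-check the masks before using this on real hardware):

# read the Device Control register (PCI Express capability, offset 8)
sudo setpci -s 04:00.0 CAP_EXP+8.w

# rewrite bits 14:12 to 010b = 512 bytes, preserving the rest of the register
val=$(sudo setpci -s 04:00.0 CAP_EXP+8.w)
sudo setpci -s 04:00.0 CAP_EXP+8.w=$(printf '%04x' $(( (0x$val & 0x8fff) | 0x2000 )))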

alexforencich commented on July 25, 2024

I was not able to replicate any strange behavior in my standalone DMA test, but I was able to replicate the hang with Corundum after changing the settings to support jumbo frames. I'll let you know when I have that particular issue fixed. Apparently the machine in question only sets the max read request size to 4096 on boot; if I do a hot reset later, it sets it to 512. So maybe I should also create a script to change the max read request size setting.

alexforencich commented on July 25, 2024

I think I have it fixed; try the latest commit on my fork and see if that works correctly for jumbo frames (https://github.com/alexforencich/corundum)

isuckatdrifting commented on July 25, 2024

I tried the latest commit on your fork and it works! 👍 🎉

isuckatdrifting commented on July 25, 2024

Oops, I think the PCIe hot reset changed the max read request size to 512. I may need to reboot the machine and try it again.

isuckatdrifting commented on July 25, 2024

It works, once again, at MaxReadReq=4096 and MTU=9000.

82:00.0 Ethernet controller: Device 1234:1001
        Subsystem: Device 1234:1001
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        NUMA node: 1
        Region 0: Memory at 3ffff000000 (64-bit, prefetchable) [size=16M]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] MSI: Enable- Count=1/32 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
                DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, TPHComp-, ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [1c0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Capabilities: [1f0 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                Port Arbitration Table [500] <?>
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-

shishi@ubuntu-r730➜  ~ cd corundum/modules/mqnic
shishi@ubuntu-r730➜  mqnic git:(master) ✗
shishi@ubuntu-r730➜  mqnic git:(master) ✗ ls
iperf0.log  iperf6.log      mqnic_board.o  mqnic_eq.o       mqnic_i2c.o    mqnic.mod.c     mqnic_port.o  mqnic_tx.o
iperf1.log  iperf7.log      mqnic_cq.c     mqnic_ethtool.c  mqnic_ioctl.h  mqnic.mod.o     mqnic_ptp.c
iperf2.log  Makefile        mqnic_cq.o     mqnic_ethtool.o  mqnic.ko       mqnic_netdev.c  mqnic_ptp.o
iperf3.log  modules.order   mqnic_dev.c    mqnic.h          mqnic_main.c   mqnic_netdev.o  mqnic_rx.c
iperf4.log  Module.symvers  mqnic_dev.o    mqnic_hw.h       mqnic_main.o   mqnic.o         mqnic_rx.o
iperf5.log  mqnic_board.c   mqnic_eq.c     mqnic_i2c.c      mqnic.mod      mqnic_port.c    mqnic_tx.c
shishi@ubuntu-r730➜  mqnic git:(master) ✗ sudo insmod mqnic.ko
shishi@ubuntu-r730➜  mqnic git:(master) ✗ sudo rmmod mqnic.ko
shishi@ubuntu-r730➜  mqnic git:(master) ✗ sudo insmod mqnic.ko
shishi@ubuntu-r730➜  mqnic git:(master) ✗ sudo ip link set dev enp130s0 up
shishi@ubuntu-r730➜  mqnic git:(master) ✗ sudo ip addr add 192.168.0.5/24 dev enp130s0
shishi@ubuntu-r730➜  mqnic git:(master) ✗ sudo ip link set mtu 9000 dev enp130s0
shishi@ubuntu-r730➜  mqnic git:(master) ✗ ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.110.101.144  netmask 255.255.255.0  broadcast 10.110.101.255
        inet6 fe80::1618:77ff:fe56:3b6c  prefixlen 64  scopeid 0x20<link>
        ether 14:18:77:56:3b:6c  txqueuelen 1000  (Ethernet)
        RX packets 1236  bytes 895381 (895.3 KB)
        RX errors 0  dropped 43  overruns 0  frame 0
        TX packets 717  bytes 80054 (80.0 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 38

eno2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 14:18:77:56:3b:6d  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 91

eno3: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 14:18:77:56:3b:6e  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 93

eno4: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 14:18:77:56:3b:6f  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 95

enp130s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.0.5  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::20a:35ff:fe06:792e  prefixlen 64  scopeid 0x20<link>
        ether 00:0a:35:06:79:2e  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 10  bytes 836 (836.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 92  bytes 7100 (7.1 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 92  bytes 7100 (7.1 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

shishi@ubuntu-r730➜  mqnic git:(master) ✗ cd ~
shishi@ubuntu-r730➜  ~ ./batch_iperf_c.sh
shishi@ubuntu-r730➜  ~ ------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  682 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.5 port 39242 connected with 192.168.0.2 port 5001
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size: 1.84 MByte (default)
------------------------------------------------------------
[  3] local 192.168.0.5 port 39244 connected with 192.168.0.2 port 5001
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  715 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.5 port 39250 connected with 192.168.0.2 port 5001
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  390 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  715 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.5 port 39246 connected with 192.168.0.2 port 5001
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  715 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.5 port 39256 connected with 192.168.0.2 port 5001
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  715 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.5 port 39248 connected with 192.168.0.2 port 5001
[  3] local 192.168.0.5 port 39254 connected with 192.168.0.2 port 5001
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  715 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.5 port 39258 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  11.6 GBytes  9.97 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  11.6 GBytes  9.96 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  11.6 GBytes  9.96 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  11.6 GBytes  9.96 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  11.6 GBytes  9.96 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  11.6 GBytes  9.96 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  11.6 GBytes  9.96 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  11.6 GBytes  9.96 Gbits/sec

alexforencich commented on July 25, 2024

Excellent, good to hear!

likewise commented on July 25, 2024

I wonder: with the MRRS value of 0 now correctly interpreted as 4096, was this also the underlying cause of the lowered performance in the 100 Gbps case from the original issue, i.e. the opening topic?

alexforencich commented on July 25, 2024

No. First, this seems to cause the design to hang. So it's not low performance, it's no performance. Second, the bug was introduced recently with the new generic PCIe DMA engine due to an oversight; the ultrascale "descriptor" format uses a wider field and as such does not have this ambiguity so the old ultrascale-specific DMA engine is unaffected.
