emmericp / ixy Goto Github PK

View Code? Open in Web Editor NEW

1.1K 1.1K 122.0 193 KB

A simple yet fast user space network driver for Intel 10 Gbit/s NICs written from scratch

License: BSD 3-Clause "New" or "Revised" License

CMake 0.26% Shell 0.08% C 99.66%

ixy's People

Contributors

Stargazers

Watchers

Forkers

pudelkom tobhe jnaulty windlessstorm jithinraj izogain laanak08 netsharec mehrdad-shokri fuxiocteract curtiszimmerman th3osmith ferranespigares longjohncoder rako9000 littlecho daveti ring2003 de8gman1990 passchaos adwait1-g sirousk zhenxian-hu mmisono frankurcrazy darcikhey27 ari23 huberste eleksaggr yumasi happydg rhiswell moneytech pedagand coldchen maymomox terrydang curioustauseef amoskovsky nanxiao michaeljclark claudelee tinti sjas tzwickl liumorgan cschlote alexgobin bobo1239 cyrex562 e-zurg dpzmick yilongli linanj gaowei-96 zhangrb backwardn ovebepari sunlite99 gurpreetshanky ranuzz gooichi vladandrew bonkf haowei790818 ackxolotl werekraken luizgsa21 wuli133144 panda-zcj hushunkui rspwfpgas weijian1989 dongcai008 zhutony xiaochencui gofreelee blocky2019 mybluecrab xiangchu0 liweiguang199609 neilty rocker9527 kalimuthu-velappan aggelis songhtdo pelimban scarletyoung emptywatson ronaldoa xuanzang6013 liaoyunkun strivin lnkgyv jackiedinh8 cyanide-burnout ct-clmsn crazybaoli katerega magnate3

ixy's Issues

mmap can be used for allocating such memory regions for DMA

mmap can be used for allocating such memory regions for DMA.

map(NULL, length, PROT_READ |PROT_WRITE, MAP_ANONYMOUS | MAP_HUGETLB | MAP_LOCKED, 0, 0);

This kind of allocation is done in programs that need specific properties on allocated memory blocks as well (e.g. Wine).

Thoughts on memory pools

Memory pools should be

fast
simple
multi-threaded
allow bulk alloc/free

currently they are only fast and simple. Let's see if we can get the other two properties as well without losing speed and simplicity.

A memory pool is just a memory allocator for a fixed number of fixed-size buffers. So its core is some data structure that keeps track of a fixed number of pointers. Reasonable data structures for that are ring buffers and stacks. Currently it's a stack.

Implementing bulk allow/free for that stack is a trivial change. But multi-threaded fast stacks (especially with bulk operations) are extremely difficult (they are a standard example for lock-free data structures suffering from the ABA bug...). That would go against our design goal of keeping it simple...

So let's use a queue? Multi-threaded queues are relatively simple. However, queues suffer from a different problem: poor temporal data locality as they will cycle through all the references -- the stack will re-use buffers that you just recently used.
The fix for that are per-thread caches in the memory pool. There are two problems with that: it's probably no longer simple and it doesn't work with all use cases. One scenario where this fails is some pipeline application where packets aren't sent out on the same thread that they where received. And this is the primary motiviation why we wanted a multi-threaded memory pool in the first place.

An interesting reference is DPDK which defaults to a queue-based memory pool with thread-local caches. And it has the same problem. Their solution for pipeline-based applications is a stack with a spin lock: http://dpdk.org/ml/archives/dev/2016-July/043106.html

Why not use some library? We want to keep ixy free of external dependencies, you should be able to go through the whole code and understand all parts, including the core data structures.

What are we going to do? Probably a stack with a sping lock. And benchmarking! Probably queue vs stack. And single-threaded stack vs. multi-threaded stack in an uncontended scenario. And some realistic contention setup. NUMA?

Broken loop

ixy/src/libixy-vfio.c

Lines 153 to 165 in e04d2d1

    
           for (int i = VFIO_PCI_MSIX_IRQ_INDEX; i >= 0; i--) { 
        
           	struct vfio_irq_info irq = {.argsz = sizeof(irq), .index = i}; 
        
           	check_err(ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &irq), "get IRQ Info"); 
        
           	/* if this vector cannot be used with eventfd continue with next*/ 
        
           	if ((irq.flags & VFIO_IRQ_INFO_EVENTFD) == 0) { 
        
           		error("IRQ doesn't support Event FD"); 
        
           		continue; 
        
           	} 
        
           	return i; 
        
           }

The loop is executed at most one time since error() calls abort().

Please add function to read MAC address of NIC

ppms function doesn't calculate packets per microsecond

/**
 * Calculate packets per microsecond based on the received number of packets and the elapsed time in nanoseconds since the
 * last calculation.
 * @param received_pkts Number of received packets.
 * @param elapsed_time_nanos Time elapsed in nanoseconds since the last calculation.
 * @return Packets per microsecond.
 */
static uint64_t ppms(uint64_t received_pkts, uint64_t elapsed_time_nanos) {
	return received_pkts / (elapsed_time_nanos / 1000000);
}

micro = 10 ^ -6
nano = 10 ^ -9

By using (elapsed_time_nanos / 1000000), this function calculates the amount of received packets per millisecond, so I don't know if the comments are wrong micro -> milli or the divisor is wrong 1000000 -> 1000.

VIRTIO-PCI: failed to open /sys/bus/pci/devices/<BDF>/resource0

There is no resource0 but resource in my Ubuntu 20.04 VM

So I changed resource0 in virtio.c to resource.

/sys/bus/pci/devices/<BDF>/resource is read-only, so ixy-pktgen failed because permission denied

I added write permissions to /sys/bus/pci/devices/<BDF>/resource (I also tried 777)

Then it can open the resource, but failed at device.h:154 write_io8(): pwrite io resource after virtio_legacy_init(): Configuring bar0

Is there something wrong with my usage? Can anyone help?

Implement self-test

Tests are important! And also difficult when you have hardware dependencies :(

We can implement a simple full system test by adding another example application: ixy-dump which dumps packets to a pcap file.

We can then use all three example applications together for a system test

send packets with ixy-pktgen (and let's add a sequence number generater here)
forward them with ixy-fwd
dump them to a pcap with ixy-dump

Then we can just check if the generated pcap file contains packets with increasing sequence numbers. This can be run on a single server with four connected interfaces.

The important part here is that this tests the actual applications, because there is nothing worse than having examples that just don't work because no one ever tests them.

SEGFAULT when using more than 1 queue

We have issue when we are using than 1 queue on ixgbe

Stack trace:
1 pkt_buf_free
2 ixgbe_tx_batch

Both queues are processed in single thread, we tried to use single mempool as well as mempool-per-queue. The result is the same - any other queue except #0 caught.

Please add support of VMXNET3

great 34C3 talk

thanks for your work and talk. i loved the fast speaking style - keeps me focused 👍 (i hope you ll get more time on the next congress(es)!

run ixy-pktgen on virtualbox virtio err: Device does not support required features

i run virtualbox 6.0 on macos 10.14.6
guest os is ubuntu 16.04.

guest network device is set Paravirtualized network adapter (virtio-net), as documented in https://www.virtualbox.org/manual/ch06.html

when i run app:
/home/zzlu/Downloads/uio_test/ixy/cmake-build-debug-remote/ixy-pktgen 0000:00:03.0 [DEBUG] /home/zzlu/Downloads/uio_test/ixy/src/pci.c:58 pci_open_resource(): Opening PCI resource at /sys/bus/pci/devices/0000:00:03.0/config [DEBUG] /home/zzlu/Downloads/uio_test/ixy/src/pci.c:58 pci_open_resource(): Opening PCI resource at /sys/bus/pci/devices/0000:00:03.0/config [INFO ] /home/zzlu/Downloads/uio_test/ixy/src/driver/virtio.c:352 virtio_init(): Detected virtio legacy network card [DEBUG] /home/zzlu/Downloads/uio_test/ixy/src/pci.c:58 pci_open_resource(): Opening PCI resource at /sys/bus/pci/devices/0000:00:03.0/resource0 [DEBUG] /home/zzlu/Downloads/uio_test/ixy/src/driver/virtio.c:275 virtio_legacy_init(): Configuring bar0 [DEBUG] /home/zzlu/Downloads/uio_test/ixy/src/driver/virtio.c:284 virtio_legacy_init(): Host features: 410fdda3 [ERROR] /home/zzlu/Downloads/uio_test/ixy/src/driver/virtio.c:289 virtio_legacy_init(): Device does not support required features

i can figure out this error in code, my host feature without VIRTIO_F_ANY_LAYOUT
how can i fix this?

Update README

After the recent changes and additions the readme should be updated.

Thoughts on driver wrapping

Universal structure contained in every driver and exposed to the user:

struct ixy_device {
    char* driver_name;
    uint16_t (*tx_batch)(struct ixy_device*, struct pkt_buf*, uint16_t);
    uint16_t (*rx_batch)(struct ixy_device*, struct pkt_buf*, uint16_t);
};

Specific driver implementations include this struct somewhere:

struct virtio_device {
    int fd;
    struct virt_queue* rx, *tx, *ctrl;

    struct ixy_device ixy;
};

In the public API we expose two functions, which just forward the calls to the fn pointers in the struct:

uint16_t ixy_rx_batch(struct ixy_device* dev, struct pkt_buf* bufs, uint16_t num_bufs);
uint16_t ixy_tx_batch(struct ixy_device* dev, struct pkt_buf* bufs, uint16_t num_bufs);

Which of course point to the appropriate driver implementations, so the following is always correct code:

#define IXY_TO_VIRT(dev) container_of(dev, struct virtio_device, ixy)

static uint16_t virtio_rx_batch(struct ixy_device* dev, ...) {
    struct virtio_device* virt = IXY_TO_VIRT(dev);
    ...
}

This unwrapping is only needed on the semi-public entry functions to a driver. Internally it can just pass its struct around.

Pros

Unified user-interface, an ixy-fwd app works with every driver
Very common practice

Cons

One level of pointer indirection; Performance impact?
Too much magic?
Even required for just two drivers?

Thoughts on NUMA

NUMA is really important for performance. There are two things to consider: thread-pinning and memory-pinning. Thread pinning is trivial and can be done with the usual affinity mask. The best way to pin memory is by linking against libnuma.
A dependency, eeww. But a simple dependency (just a wrapper for a few syscalls) that I'd see on a level with libpthread; a necessary evil.

Let's look at a forwarding application on a NUMA system with NICs connected to both CPUs.
It will typically have at least one thread per NIC that handles incoming packets and forwards them somewhere. It might need to cross a NUMA-boundary to do so.
In our experience, it's most efficient to pin both the thread and packet memory to the CPU node to which NIC receiving packets is connected. Sending from the wrong node is not as bad as receiving to the wrong node. Also, we (usually) can't know where to send the packets when receiving them, so we can't pin the memory correctly for that.

How to implement this?

read numa_node in NIC's sysfs directory to figure out where it's connected to
use libnuma to set a memory policy before allocating memory for it
pin the thread correctly

Sounds easy, right?
But is it worth implementing it? What do we gain beside added complexity?
Sure, this is obviously a must-have feature for a real-world high-performance driver.

But we've decided against implementing it for now.
Almost everyone will just look at the code and that NUMA stuff is not particularly interesting compared to the rest and it just adds noise.

That doesn't mean you can't use ixy on a NUMA system.
We obviously want to run some benchmarks and performance tests with different NUMA scenarios and we are just going to use the numactl command for that:

 numactl --strict --membind=0 --cpunodebind=0 ./ixy-pktgen <id> <id>

That works just fine with the current memory allocator and allows us to benchmark all relevant scenarios on a NUMA system with NICs attached to both nodes.

Packets left in queue in interrupt mode when traffic stops suddenly

Looks like we sometimes miss a few packets in the queue in interrupt mode if the traffic stops suddenly. Probable cause: we miss an interrupt at the end and the packets stay in the queue until a new packet arrives.

	for (int i = VFIO_PCI_MSIX_IRQ_INDEX; i >= 0; i--) {
	struct vfio_irq_info irq = {.argsz = sizeof(irq), .index = i};

	check_err(ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &irq), "get IRQ Info");

	/* if this vector cannot be used with eventfd continue with next*/
	if ((irq.flags & VFIO_IRQ_INFO_EVENTFD) == 0) {
	error("IRQ doesn't support Event FD");
	continue;
	}

	return i;
	}