
fast-wait-free-queue's Introduction

Fast Wait Free Queue

This is a benchmark framework for evaluating the performance of concurrent queues. Currently, it contains four concurrent queue implementations. They are:

  • A fast wait-free queue wfqueue,
  • Morrison and Afek's lcrq,
  • Fatourou and Kallimanis's ccqueue, and
  • Michael and Scott's msqueue

The benchmark framework also includes a synthetic queue benchmark, faa, which emulates both an enqueue and a dequeue with a fetch-and-add primitive to test the performance of fetch-and-add on a system.

The framework currently contains one benchmark, pairwise, in which all threads repeatedly execute pairs of enqueue and dequeue operations. Between two operations, pairwise uses a delay routine that adds a random delay (between 50 and 150 ns) to avoid artificially long run scenarios, where a cache line is held by one thread for a long time.

Requirements

  • GCC 4.1.0 or later (GCC 4.7.3 or later recommended): the current implementations use GCC's __atomic or __sync primitives for atomic memory access.
  • Linux kernel 2.5.8 or later
  • glibc 2.3 or later: we use sched_setaffinity to bind threads to cores.
  • atomic CAS2: lcrq requires CAS2, a 16-byte-wide compare-and-swap primitive. This is available on most recent Intel processors and on IBM Power8.
  • jemalloc (optional): jemalloc eliminates the memory allocator bottleneck. You can link with jemalloc by setting the JEMALLOC_PATH environment variable to the path where your jemalloc is installed.

How to install

Download one of the released source tarballs, then execute the following commands. The filename may differ depending on which tarball you downloaded.

$ tar zxf fast-wait-free-queue-1.0.0.tar.gz
$ cd fast-wait-free-queue-1.0.0
$ make

This should generate seven binaries (or six if your system does not support CAS2, in which case lcrq will fail to compile): wfqueue, wfqueue0, lcrq, ccqueue, msqueue, faa, and delay. These are the pairwise benchmark compiled using different queue implementations.

  • wfqueue0: the same as wfqueue except that its PATIENCE is set to 0.
  • delay: a synthetic benchmark used to measure the time spent in the delay routine.

How to run

You can execute a binary directly, passing the number of threads as an argument. Without an argument, the execution uses all available cores on the system.

For example,

./wfqueue 8

runs wfqueue with 8 threads.

If you would like to verify the results, compile the binaries with VERIFY=1 make. Executing a binary directly will then print either PASSED or error messages.

You can also use the driver script, which invokes a binary up to 10 times and reports, after each run, the mean running time so far, the running time of the current run, the standard deviation, and the margin of error (both absolute and as a percentage). The script terminates when the relative margin of error is small (< 0.02) or after invoking the binary 10 times.

For example,

./driver ./wfqueue 8

runs wfqueue with 8 threads up to 10 times and collects statistics.

You can also use the benchmark script, which invokes driver on all combinations of a list of binaries and a list of thread counts, and reports the mean running time and margin of error for each combination. Specify the list of binaries with the environment variable TESTS and the list of thread counts with the environment variable PROCS.

The generated output of benchmark can be used as a datafile for gnuplot. The first column of benchmark's output is the number of threads; after that, every two columns are the mean running time and margin of error for one queue implementation, in the same order as specified in TESTS.

For example,

TESTS=wfqueue:lcrq:faa:delay PROCS=1:2:4:8 ./benchmark

runs each of wfqueue, lcrq, faa, and delay using 1, 2, 4, and 8 threads.

Then you can plot them using,

set logscale x 2
plot "t" using 1:(20000/($2-$8)) t "wfqueue" w lines, \
     "t" using 1:(20000/($4-$8)) t "lcrq" w lines, \
     "t" using 1:(20000/($6-$8)) t "faa" w lines

How to map threads to cores

By default, the framework maps the thread with id i to the core with id i % p, where p is the number of available cores on the system; you can check each core's id in /proc/cpuinfo.

To implement a custom mapping, you can add a cpumap function in cpumap.h. The signature of cpumap is

int cpumap(int id, int nprocs)

where id is the id of the current thread and nprocs is the number of threads. cpumap should return the corresponding core id for the thread. cpumap.h contains several examples of the cpumap function. You should guard the definition of the added cpumap with a conditional macro and add the macro to CFLAGS in the makefile.

How to add a new queue implementation

We use a generic pointer, void *, to represent a value that can be stored in the queue. A queue should implement the queue interface, defined in queue.h.

  • queue_t: the struct type of the queue,
  • handle_t: a thread's handle to the queue, used to store thread local state,
  • void queue_init(queue_t * q, int nprocs): initialize a queue; this will be called only once,
  • void queue_register(queue_t * q, handle_t * th, int id): initialize a thread's handle; this will be called by every thread that uses the queue,
  • void enqueue(queue_t * q, handle_t * th, void * val): enqueues a value,
  • void * dequeue(queue_t * q, handle_t * th): dequeues a value,
  • void queue_free(queue_t * q, handle_t * h): deallocate a queue and clean up all resources associated with it,
  • EMPTY: a value that will be returned if a dequeue fails. This should be a macro that is defined in the header file.

How to add a new benchmark

A benchmark should implement the benchmark interface, defined in benchmark.h, and interact with a queue using the queue interface. The benchmark interface includes:

  • void init(int nprocs, int n): performs initialization of the benchmark; called only once at the beginning.
  • void thread_init(int id, int nprocs): performs thread local initialization of the benchmark; called once per thread, after init but before benchmark.
  • void * benchmark(int id, int nprocs): runs the benchmark once; called by each thread. Each call is timed and reported as one iteration. It can return a result, which is passed to verify to check correctness.
  • int verify(int nprocs, void * results): should verify the result of each thread and return 0 on success or a non-zero value on error.

fast-wait-free-queue's People

Contributors

chaoran, curiousleo, jeehoonkang, kchanqvq, pslydhh


fast-wait-free-queue's Issues

Incomplete description in README's "How to add a new queue implementation"

Hi! I'm porting the queue described in [1] to Rust: https://github.com/jeehoonkang/crossbeam-queue It's a work in progress, but soon I'd like to evaluate its performance using the benchmark tools in this repo (mainly for comparing with the implementation in wfqueue.c).

So I'm following the instructions in README.md [2] for adding a new queue implementation, but I found that README.md differs from what is described in queue.h, especially:

  • It seems I need to implement queue_free, but it's not documented in the README.
  • EMPTY is described in README, but it is not declared in queue.h.

I could roughly guess what they are, but I wonder if the author of this repository could clarify their meanings. Thank you very much!

[1] Yang and Mellor-Crummey. A Wait-free Queue as Fast as Fetch-and-Add. PPoPP 2016.
[2] https://github.com/chaoran/fast-wait-free-queue/blob/master/README.md#how-to-add-a-new-queue-implementation

wfqueue.c threads stuck at spin wfqueue.c:30

  Id   Target Id         Frame
  16   Thread 0x7fffbf7fe700 (LWP 22384) "a.out" 0x0000000000401779 in help_enq (i=33724383, c=0x7fffcc05db40, th=0x7fffb4001000, q=0x605000)
    at wfqueue.c:213
  15   Thread 0x7fffbffff700 (LWP 22383) "a.out" 0x000000000040192b in help_enq (i=33705511, c=0x7fffc80aa3c0, th=0x7fffb0001000, q=0x605000)
    at wfqueue.c:236
  14   Thread 0x7fffdcff9700 (LWP 22382) "a.out" spin (p=0x7fffcc0f3940) at wfqueue.c:30
  13   Thread 0x7fffdd7fa700 (LWP 22381) "a.out" spin (p=0x7fffcc05e140) at wfqueue.c:30
  12   Thread 0x7fffddffb700 (LWP 22380) "a.out" spin (p=0x7fffcc05e2c0) at wfqueue.c:30
  11   Thread 0x7fffde7fc700 (LWP 22379) "a.out" spin (p=0x7fffcc05e3c0) at wfqueue.c:30
  10   Thread 0x7fffdeffd700 (LWP 22378) "a.out" spin (p=0x7fffcc05c140) at wfqueue.c:32
* 1    Thread 0x7ffff7fe0740 (LWP 22369) "a.out" 0x0000000000400def in running_wfq_test (arg_producer=<optimized out>, arg_consumer=<optimized out>,

This issue happens on and off, and the run always gets stuck with the last 5 items outstanding:
Total Nproc = 16 cores
nProducerThread = 8, nConsumerThread = 7
nProducing = 8000000, nConsuming = 7999995

Syntax error in driver script

I get the following error with a high core count:

#! Host: archlinux
#! Benchmarks: wfqueue wfqueue0 faa lcrq ccqueue msqueue delay
#! Threads: 36
36 300.96 3.95 309.71 2.89 205.67 4.08(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
./driver: line 40: ((: 2 >= 10 || 2 >= 5 &&  == 1: syntax error: operand expected (error token is "== 1")
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
./driver: line 40: ((: 3 >= 10 || 3 >= 5 &&  == 1: syntax error: operand expected (error token is "== 1")
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
(standard_in) 1: syntax error
./driver: line 40: ((: 4 >= 10 || 4 >= 5 &&  == 1: syntax error: operand expected (error token is "== 1")

Probably because, on the following line, https://github.com/chaoran/fast-wait-free-queue/blob/master/driver#L40

$(echo "$PRECISION < 0.02" | bc) gives an empty result.

Memory leak in LCRQ

I'm now trying to use LCRQ in production and I found that queue_free isn't implemented for it. I'm trying to implement it myself, but I'm currently having trouble keeping track of the ring-queue buffers that get swapped out of the queue. Could you provide a proper implementation of LCRQ's queue_free?

wfqueue MCSP stuck at the spin function, wfqueue.c [30|32]

Hi chaoran, this is confirmed to get stuck in the MCSP case:
Nproc = 16
consumer: 14
producer: 1

0x0000000000400e5f in running_wfq_test (arg_producer=, arg_consumer=, arg_producing=,
arg_consuming=, total_threads=15, test_type=0x401fbd "MCSP") at main_test.c:121
121 while (__sync_fetch_and_add(&config.nConsuming, 0) < TEST_MAX_INPUT * (config.nProducer)) {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64
(gdb) info threads
Id Target Id Frame
16 Thread 0x7fffbf7fe700 (LWP 25694) "test" spin (p=0x7fffc8031980) at wfqueue.c:32
15 Thread 0x7fffbffff700 (LWP 25693) "test" spin (p=0x7fffc8032180) at wfqueue.c:32
14 Thread 0x7fffdcff9700 (LWP 25692) "test" spin (p=0x7fffc80324c0) at wfqueue.c:30
13 Thread 0x7fffdd7fa700 (LWP 25691) "test" 0x0000000000401788 in spin (p=0x7fffc8032a80) at wfqueue.c:30
12 Thread 0x7fffddffb700 (LWP 25690) "test" 0x0000000000401788 in spin (p=0x7fffc8032d80) at wfqueue.c:30
11 Thread 0x7fffde7fc700 (LWP 25689) "test" 0x0000000000401788 in spin (p=0x7fffc8033200) at wfqueue.c:30
10 Thread 0x7fffdeffd700 (LWP 25688) "test" spin (p=0x7fffc8033540) at wfqueue.c:32
9 Thread 0x7fffdf7fe700 (LWP 25687) "test" spin (p=0x7fffc80338c0) at wfqueue.c:32
8 Thread 0x7fffdffff700 (LWP 25686) "test" 0x0000000000401788 in spin (p=0x7fffc8033cc0) at wfqueue.c:30
7 Thread 0x7ffff4fec700 (LWP 25685) "test" 0x00000000004017f9 in help_enq (i=91476149, c=0x7fffc8033ec0, th=0x7fffd8001000, q=0x605000)
at wfqueue.c:213
6 Thread 0x7ffff57ed700 (LWP 25684) "test" 0x0000000000401788 in spin (p=0x7fffe4013080) at wfqueue.c:30
5 Thread 0x7ffff5fee700 (LWP 25683) "test" 0x0000000000401788 in spin (p=0x7fffe4013200) at wfqueue.c:30
4 Thread 0x7ffff67ef700 (LWP 25682) "test" 0x0000000000401788 in spin (p=0x7fffc802cd80) at wfqueue.c:30
3 Thread 0x7ffff6ff0700 (LWP 25681) "test" 0x0000000000401788 in spin (p=0x7fffe4013380) at wfqueue.c:30

* 1 Thread 0x7ffff7fe0740 (LWP 25676) "test" 0x0000000000400e5f in running_wfq_test (arg_producer=, arg_consumer=, arg_producing=, arg_consuming=, total_threads=15, test_type=0x401fbd "MCSP") at main_test.c:121

Helpers can't recognize different enq-requests in one cell

Hi Chaoran.
As I appreciate and study the wonderful technology and extraordinary ideas of this wait-free queue implementation, a tricky question occurred to me. It looks like a violation of Invariant 1 in [1] (an enqueue result cannot be changed in the future); maybe you can have a look when you have time.

It occurs when someone (the enqueuing thread itself or a dequeue helper thread) puts an enqueue request into a cell (index) through enq_slow or help_enq. When we put the request into the cell, we know enq.id, the unique request this cell is responsible for. But when other threads help this request to improve parallelism, they may read a different enqueue request there:

    long ei = ACQUIRE(&e->id);
    void *ev = ACQUIRE(&e->val);

    if (ei > i) {
        if (c->val == TOP && q->Ei <= i) return BOT;
    } else {
        if ((ei > 0 && CAS(&e->id, &ei, -i)) || (ei == -i && c->val == TOP)) {
            long Ei = q->Ei;
            while (Ei <= i && !CAS(&q->Ei, &Ei, i + 1))
                ;
            c->val = ev;
        }
    }

Actually, the enq field of one enqueuing thread evolves like a sequence: (val1, id1), (val2, id2), and so on. We can recognize that a value belongs to a later request, but we can't distinguish id1 from id2. So imagine two dequeue threads helping the same cell: thread1 may help_enq the cell with id1 (val1) and produce the result TOP (because another cell does the real help); then the enqueuing thread of id1 (val1) issues a new request (val2, id2), and thread2's help_enq processes the same cell with id2 (val2) and may produce the result val2 if id2 is appropriate. This behavior may lose data in help_deq, when a higher cell defeats a lower cell here:

        if (new != 0) {
            if (CASra(&deq->idx, &idx, new)) idx = new;
            if (idx >= new) new = 0;
        }

Because the valid value in the lower cell will be lost, and there seems to be no simple solution to me.

  1. Yang and Mellor-Crummey. A Wait-free Queue as Fast as Fetch-and-Add. PPoPP 2016.

Dead code in wfqueue.c

Hi,

I'm trying out a conversion of wfqueue.c to Rust. The following line of code in enq_slow() seems to be partly dead: if (CAS(&enq->id, &id, -i)) id = -i;. The CAS is still needed, I think, but the Rust compiler has identified that the store to id is never needed. Cursory examination of the code following the do-while loop would indeed suggest that is the case, as id is always overwritten afterwards.

Of course, the compiler could be wrong...
