I've been thinking quite a bit about the RFC published a few days ago. While implementing the RFC, a few issues with `'static` bounds came up that @Vtec234 correctly pointed out in this comment.
Somewhat relatedly, I also gave some thought to dynamically sized types. Note that we can't, for example, create an `Owned<[T]>` and then store it into an `Atomic<[T]>`. The problem is that atomic pointers would have to be two words wide (fat pointers), while the `AtomicPtr` powering `Atomic<_>` behind the scenes is just a single-word pointer (thin pointer).

If we want to have an atomic pointer to a DST, we have to embed the size (or length) inside the heap-allocated object to make the pointer thin. See for example thin-vec, which is a `Vec` that takes up just 1 word (instead of 3 words).
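To illustrate the idea, here's a minimal sketch of such a thin DST (the type and method names here are hypothetical, purely for illustration):

```rust
// A dynamically sized object addressable through a thin (single-word)
// pointer: the length lives inside the allocation itself.
#[repr(C)]
struct ThinSlice<T> {
    len: usize,   // Stored in the heap-allocated object...
    data: [T; 0], // ...immediately followed by `len` elements of `T`.
}

impl<T> ThinSlice<T> {
    // Computes the address of the `i`-th element from a thin pointer alone.
    unsafe fn elem(ptr: *mut Self, i: usize) -> *mut T {
        debug_assert!(i < (*ptr).len);
        (*ptr).data.as_mut_ptr().add(i)
    }
}
```

This is exactly the layout the `Array<T>` example near the end of this post uses.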
## Motivation
The `Scope::defer` method we have is quite nice:
```rust
impl Scope {
    fn defer<F: FnOnce() + Send + 'static>(&self, f: F);
}
```
The bounds on `F` are self-explanatory: this is a function that will be executed once, perhaps by another thread, after an arbitrary amount of time. This is a nice piece of documentation encoded in types, but I worry that every call to the `defer` method will actually try to cheat around the bounds.
## Example: stack
To see what I mean, consider the following scenario. We have a typical stack defined like this:
```rust
struct Node<T> {
    value: ManuallyDrop<T>,
    next: Atomic<Node<T>>,
}

struct Stack<T> {
    head: Atomic<Node<T>>,
}

unsafe impl<T: Send> Send for Stack<T> {}
unsafe impl<T: Send> Sync for Stack<T> {}
```
Now suppose that we have popped a node (`p: Ptr<'scope, Node<T>>`) from the top of the stack and need to destroy it. We'd typically do that using `scope.defer_free`, but let's forget about it today and try destructing by calling `scope.defer` instead:
```rust
scope.defer(|| drop(Box::from_raw(p.as_raw())));
```
But this doesn't compile, for two reasons:
- `p: 'static` is false because `p` holds the lifetime `'scope`.
- `Node<T>: Send` is false because `T: Send` is false.
To work around the bounds `F: Send` and `F: 'static`, let's cheat:
```rust
let raw = p.as_raw() as usize;
scope.defer(move || unsafe { drop(Box::from_raw(raw as *mut Node<T>)) });
```
## Example: skiplist
Here's another example. In a skiplist we have dynamically sized nodes:
```rust
#[repr(C)]
struct Node<K: 'static, V> {
    value: ManuallyDrop<V>,
    key: ManuallyDrop<K>,
    refcnt: AtomicUsize,
    height: usize,
    pointers: [Atomic<Node<K, V>>; 0], // This array is of length `height`.
}
```
To allocate and deallocate such nodes, we implement our custom functions:
```rust
impl<K: 'static, V> Node<K, V> {
    // Allocates `size_of::<Self>() + size_of::<Atomic<Node<K, V>>>() * height` bytes.
    unsafe fn allocate(height: usize) -> *mut Self {
        // ...
    }

    // Reads `(*raw).height` first to compute the size in bytes, then deallocates the object.
    unsafe fn deallocate(raw: *mut Self) {
        // ...
    }
}
```
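The snippets below also call a `tower` method that this post doesn't spell out; a minimal sketch of it (an assumption on my part, matching the layout above) could look like this:

```rust
impl<K: 'static, V> Node<K, V> {
    // Hypothetical sketch: the tower occupies the memory right past the
    // struct, so we index into the zero-length `pointers` array by offset.
    fn tower(&self, i: usize) -> &Atomic<Node<K, V>> {
        debug_assert!(i < self.height);
        unsafe { &*self.pointers.as_ptr().add(i) }
    }
}
```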
Now think about how we would go about inserting a node into the skiplist:
```rust
fn insert(&self, key: K, value: V) -> bool {
    epoch::pin(|scope| unsafe {
        let height = self.random_height();
        let raw = Node::allocate(height); // Custom allocation.

        // Fill in the blanks.
        ptr::write(&mut (*raw).key, ManuallyDrop::new(key));
        ptr::write(&mut (*raw).value, ManuallyDrop::new(value));
        ptr::write(&mut (*raw).height, height);

        let mut node = Owned::from_raw(raw);

        loop {
            let (found, left, right) = self.search(&(*raw).key);

            if found {
                return false; // DANGER: `node` will be incorrectly deallocated!
            }

            node.deref().tower(0).store(right[0], SeqCst);

            if let Err((_, n)) = left[0].tower(0).compare_and_set_owned(right[0], node, SeqCst, scope) {
                node = n;
            } else {
                break;
            }
        }

        // TODO: Continue building the tower...
        true
    })
}
```
A glaring problem here is that, since we're using custom allocation and deallocation, `Owned` has become unusable. We cannot create an `Owned` from our allocated raw pointer and then just drop it if insertion fails -- deallocation must happen through `Node::deallocate`!
A correct way to insert would be:
```rust
fn insert(&self, key: K, value: V) -> bool {
    epoch::pin(|scope| unsafe {
        let height = self.random_height();
        let raw = Node::allocate(height); // Custom allocation.

        // Fill in the blanks.
        ptr::write(&mut (*raw).key, ManuallyDrop::new(key));
        ptr::write(&mut (*raw).value, ManuallyDrop::new(value));
        ptr::write(&mut (*raw).height, height);

        let node = Ptr::from_raw(raw);

        loop {
            let (found, left, right) = self.search(&(*raw).key);

            if found {
                // Custom drop and deallocation.
                node.deref().key.manually_drop();
                node.deref().value.manually_drop();
                Node::deallocate(node.as_raw() as *mut _);
                return false;
            }

            node.deref().tower(0).store(right[0], SeqCst);

            if left[0].tower(0).compare_and_set(right[0], node, SeqCst, scope).is_ok() {
                break;
            }
        }

        // TODO: Continue building the tower...
        true
    })
}
```
This is a bit unfortunate because now we can't rely on `Owned` anymore. Instead, we're fiddling with an unsafe conversion to `Ptr` and then manually destructing the node where the destructor of `Owned` would otherwise be executed.

Moving on: let's turn to deferred destruction of nodes. Suppose that we've just removed a `node: Ptr<'scope, Node<K, V>>` and want to destroy it:
```rust
node.deref().value.manually_drop();

scope.defer(|| {
    node.deref().key.manually_drop();
    Node::deallocate(node.as_raw() as *mut _);
});
```
This is a curious case: why drop the value now and drop the key later? The reason is that we'll implement the skiplist so that other threads have to increment the refcount before using the value, but only if the node is not yet deleted (i.e. the refcount is positive). A node is removed only when the refcount becomes zero, and then we know no one has a chance of accessing the value anymore.
Disclaimer: We don't have to use refcounts, but for the sake of this text, please just accept that the skiplist is implemented this way. :)
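To make that protocol concrete, the acquire step could look roughly like this (a hypothetical helper, not part of the proposal): increment `refcnt` only while it's still positive.

```rust
impl<K: 'static, V> Node<K, V> {
    // Sketch: returns `true` if we incremented `refcnt` while it was still
    // positive, i.e. before the node was deleted. Only then may we touch
    // `value`.
    fn try_acquire(&self) -> bool {
        let mut refcnt = self.refcnt.load(SeqCst);
        while refcnt > 0 {
            match self.refcnt.compare_exchange_weak(refcnt, refcnt + 1, SeqCst, SeqCst) {
                Ok(_) => return true,
                Err(current) => refcnt = current,
            }
        }
        false // Already deleted: `value` is off-limits.
    }
}
```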
Anyways... As before, this doesn't compile. Let's cheat around `F: Send` and `F: 'static`:
```rust
node.deref().value.manually_drop();
let node = node.as_raw() as usize;

scope.defer(move || unsafe {
    let node = node as *mut Node<K, V>;
    (*node).key.manually_drop();
    Node::deallocate(node);
});
```
## Problem summary
The key takeaway I want to draw from this story is: even though `defer`'s bounds `F: Send` and `F: 'static` are quite nice, what is the point if we have to circumvent them every time?

I would argue that `defer` is inherently very unsafe -- we have to guarantee several things:
- That the node just became unreachable from the rest of the data structure.
- That it's okay for another thread to execute the deferred destructor (but this is naturally true for all concurrent data structures).
- That we won't touch non-`'static` data in the deferred destructor (e.g. that would be `value` in the case of skiplists).
When deferring destruction, we're performing the following steps:
- Grab a few local variables (`Ptr`s, which are not `'static`; or raw pointers, which are not `Send`).
- Unsafely move them into a closure.
- Unsafely drop and destroy some data in the closure.
We're already acutely aware of all the requirements for correct destruction, so the type bounds `F: Send` and `F: 'static` are not very helpful -- they're just getting in the way.
The other problem is that `Owned` does not support custom destructors. Sometimes it's useful to manually allocate and deallocate objects (actually, I'd say quite often: in deques, skiplists, Bw-Trees, even segmented queues).
## Proposed solution
Think about `Owned::drop` and `Scope::defer_drop` in the current API. There is a strong connection between these two methods: they should do exactly the same thing, but differ in when they do it. The former destructs immediately, and the latter destructs at a later time. They're like two sides of the same coin.
To support custom destructors, let's create the following trait with a default implementation (note that we're using the unstable specialization feature):
```rust
pub trait Destroy {
    unsafe fn destroy(ptr: *mut Self);
}

impl<T> Destroy for T {
    default unsafe fn destroy(ptr: *mut Self) {
        drop(Box::from_raw(ptr));
    }
}
```
Next, implement `Drop` for `Owned` like this:
```rust
impl<T: Destroy> Drop for Owned<T> {
    fn drop(&mut self) {
        unsafe { T::destroy((self.data & !low_bits::<T>()) as *mut T) }
    }
}
```
Let's also add a similar method to `Ptr`:
```rust
impl<'scope, T: Destroy> Ptr<'scope, T> {
    pub unsafe fn destroy(self) {
        T::destroy(self.as_raw() as *mut T);
    }
}
```
Now instead of doing this:
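```rust
// The cheat from the stack example above.
let raw = p.as_raw() as usize;
scope.defer(move || unsafe { drop(Box::from_raw(raw as *mut Node<T>)) });
```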
we can do this:
```rust
scope.defer(|| p.destroy());
```
The only thing left to do is to relax the bounds on `Scope::defer` a bit by removing `Send + 'static` and making it `unsafe`:
```rust
impl Scope {
    unsafe fn defer<F: FnOnce()>(&self, f: F);
}
```
This makes the method `Scope::defer_drop` redundant -- of the original `defer_free`/`defer_drop`/`defer` trio, only `defer` survives. :)
## Example: stack
Now let's see how we would destroy a popped node from a stack:
```rust
scope.defer(|| p.destroy());
```
This is not much less ergonomic than calling `defer_drop`!
## Example: skiplist
Let's implement `Destroy` for `Node`:
```rust
impl<K: 'static, V> Destroy for Node<K, V> {
    unsafe fn destroy(ptr: *mut Self) {
        (*ptr).key.manually_drop();
        Self::deallocate(ptr);
    }
}
```
Since we now have support for custom destructors in `Owned`, we can leverage it to insert nodes into the skiplist:
```rust
fn insert(&self, key: K, value: V) -> bool {
    epoch::pin(|scope| unsafe {
        let height = self.random_height();
        let raw = Node::allocate(height); // Custom allocation.

        // Fill in the blanks.
        ptr::write(&mut (*raw).key, ManuallyDrop::new(key));
        ptr::write(&mut (*raw).value, ManuallyDrop::new(value));
        ptr::write(&mut (*raw).height, height);

        let mut node = Owned::from_raw(raw); // Nice!

        loop {
            let (found, left, right) = self.search(&(*raw).key);

            if found {
                return false; // Not dangerous anymore - awesome!
            }

            node.deref().tower(0).store(right[0], SeqCst);

            if let Err((_, n)) = left[0].tower(0).compare_and_set_owned(right[0], node, SeqCst, scope) {
                node = n;
            } else {
                break;
            }
        }

        // TODO: Continue building the tower...
        true
    })
}
```
To clean up the initialization of `Node`, we could take a step further and implement:
```rust
impl<K: 'static, V> Node<K, V> {
    fn new(key: K, value: V, height: usize) -> Owned<Node<K, V>> {
        // ...
    }
}
```
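The body is elided above, but a plausible sketch of it (an assumption on my part, reusing `allocate` from before) would be:

```rust
impl<K: 'static, V> Node<K, V> {
    fn new(key: K, value: V, height: usize) -> Owned<Node<K, V>> {
        unsafe {
            // Allocate, fill in the fields, and hand ownership to `Owned`,
            // whose destructor now runs through `Destroy::destroy`.
            // (`refcnt` initialization is omitted, as in the snippets above.)
            let raw = Self::allocate(height);
            ptr::write(&mut (*raw).key, ManuallyDrop::new(key));
            ptr::write(&mut (*raw).value, ManuallyDrop::new(value));
            ptr::write(&mut (*raw).height, height);
            Owned::from_raw(raw)
        }
    }
}
```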
Finally, to destroy a removed node from the skiplist, we can just do:
```rust
node.deref().value.manually_drop();
scope.defer(|| node.destroy());
```
Nice and clean.
## Example: dynamically-sized arrays
Here's one more example. Let's implement a dynamically sized array that can be used as `Atomic<Array<T>>`:
```rust
#![feature(specialization)]

extern crate crossbeam_epoch as epoch;
extern crate stable_heap;

use std::mem;
use std::ops::{Index, IndexMut};
use std::ptr;

use epoch::{Destroy, Owned};

#[repr(C)]
struct Array<T> {
    len: usize,
    arr: [T; 0],
}

impl<T> Array<T> {
    fn size_and_align(len: usize) -> (usize, usize) {
        let size = mem::size_of::<Self>() + len * mem::size_of::<T>();
        let align = mem::align_of::<Self>();
        (size, align)
    }

    fn new(len: usize) -> Owned<Self>
    where
        T: Default,
    {
        let (size, align) = Self::size_and_align(len);
        unsafe {
            let mut a = Owned::from_raw(stable_heap::allocate(size, align) as *mut Self);
            a.len = len;
            for i in 0..len {
                ptr::write(a.arr.get_unchecked_mut(i), T::default());
            }
            a
        }
    }

    fn len(&self) -> usize {
        self.len
    }
}

impl<T> Destroy for Array<T> {
    unsafe fn destroy(raw: *mut Self) {
        let (size, align) = Self::size_and_align((*raw).len);
        stable_heap::deallocate(raw as *mut u8, size, align);
    }
}

impl<T> Index<usize> for Array<T> {
    type Output = T;

    fn index(&self, index: usize) -> &T {
        assert!(index < self.len);
        unsafe { self.arr.get_unchecked(index) }
    }
}

impl<T> IndexMut<usize> for Array<T> {
    fn index_mut(&mut self, index: usize) -> &mut T {
        assert!(index < self.len);
        unsafe { self.arr.get_unchecked_mut(index) }
    }
}

fn main() {
    // Let's create an `Owned` array of length 5.
    let mut a = Array::<usize>::new(5);
    for i in 0..5 {
        a[i] = i;
    }
    assert!(a.len() == 5);

    unsafe {
        epoch::unprotected(|scope| {
            // We can convert it into a `Ptr`, if we want...
            let p = a.into_ptr(scope);
            // And destroy it!
            p.destroy();
        });
    }
}
```
In the Chase-Lev deque we have `Buffer<T>`, which simply wraps a `Vec<T>`, creating an unnecessary extra level of indirection.
To access an element from the deque, we have to:
- Get a reference to the heap-allocated `Buffer<T>`.
- Get a reference to the heap-allocated array within the `Vec<T>`.
- Offset into the array.
Using `Array<T>`, we'd only have to do the following steps:

- Get a reference to the heap-allocated `Array<T>`.
- Offset into the array.
This gets us fewer memory allocations and faster access into the array.
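To make the difference concrete, here's a sketch of the two layouts (the field name is an assumption for illustration):

```rust
// Today: Atomic<Buffer<T>> -> Buffer -> Vec's own heap array -> element.
struct Buffer<T> {
    data: Vec<T>, // `Vec` points to a second, separate allocation.
}

// Proposed: Atomic<Array<T>> -> element, since the elements live inline
// right after `len` in a single allocation (see `Array<T>` above).
```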
## But specialization is still nightly-only!
The `Destroy` trait has a default impl for all types `T`, so we provide custom implementations by specializing it. Specialization is still an unstable feature. Until it gets stabilized, we can work around the problem like this:
```rust
impl<T> Destroy for T {
    #[cfg(not(feature = "nightly"))]
    unsafe fn destroy(ptr: *mut Self) {
        drop(Box::from_raw(ptr));
    }

    #[cfg(feature = "nightly")]
    default unsafe fn destroy(ptr: *mut Self) {
        drop(Box::from_raw(ptr));
    }
}
```
Of course, if you're not using the `nightly` feature, specialization is not available.
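Note that downstream custom impls would then have to be gated the same way. For example, the skiplist impl from before would become nightly-only (a sketch):

```rust
// Without specialization, this would conflict with the blanket
// `impl<T> Destroy for T`, so it must be gated on the `nightly` feature.
#[cfg(feature = "nightly")]
impl<K: 'static, V> Destroy for Node<K, V> {
    unsafe fn destroy(ptr: *mut Self) {
        (*ptr).key.manually_drop();
        Self::deallocate(ptr);
    }
}
```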
## Alternatives
Now that we have the `Destroy` trait, would anyone ever use `Scope::defer` for anything other than this?
```rust
scope.defer(|| ptr.destroy());
```
I can't imagine such a scenario off the top of my head, so perhaps we could ditch `Scope::defer` altogether and have only `defer_destroy`:
```rust
impl Scope {
    unsafe fn defer_destroy<T>(&self, ptr: Ptr<T>);
}
```
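With that signature, the stack example from before would read:

```rust
unsafe {
    scope.defer_destroy(p);
}
```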
## Final words
Whew - that was a long text! If some parts of it are confusing, let me know and I'll try to clarify.
Finally, I'd like to summarize what all this is actually attempting to achieve:
- Unify all the `defer*` methods into a single one to make the API smaller and simpler.
- Get rid of pervasive `'static` bounds and prevent scenarios like this one in sled. cc @jeehoonkang
- Allow custom destructors in `Owned` to improve support for DSTs.
- Introduce the equivalent of `Owned::drop` accessible via `Ptr` (namely, `Ptr::destroy`).
Let me know what you think! If you like at least some of the suggestions, I'll follow up with one or more narrowly scoped RFCs.