
Trc


Trc is a performant heap-allocated smart pointer for Rust that implements thread reference counting. Trc stands for Thread Reference Counted. Trc provides shared ownership of data, similar to Arc<T> and Rc<T>. Thread reference counting is based on the observation that most objects are only used by one thread, so two reference counts can be maintained: a non-atomic one for thread-local use, and an atomic one for sharing between threads. The atomic reference count tracks the number of threads using the data.
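The two-count idea can be illustrated with a stdlib-only sketch. Note that `SharedInner` and `LocalHandle` are invented names for illustration; this is not the crate's actual layout:

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical layout illustrating thread reference counting:
// one atomic count shared by all threads, plus a per-thread
// local count that needs no atomic operations.
struct SharedInner<T> {
    atomic_count: AtomicUsize, // number of *threads* using the data
    data: T,
}

struct LocalHandle<'a, T> {
    local_count: &'a Cell<usize>, // clones within this thread
    shared: &'a SharedInner<T>,
}

impl<'a, T> LocalHandle<'a, T> {
    // Cloning within a thread only bumps the cheap non-atomic count.
    fn clone_local(&self) -> LocalHandle<'a, T> {
        self.local_count.set(self.local_count.get() + 1);
        LocalHandle { local_count: self.local_count, shared: self.shared }
    }
}

fn main() {
    let inner = SharedInner { atomic_count: AtomicUsize::new(1), data: 42 };
    let local = Cell::new(1);
    let a = LocalHandle { local_count: &local, shared: &inner };
    let b = a.clone_local();
    // Two local clones, but still one thread: the atomic count stays 1.
    assert_eq!(local.get(), 2);
    assert_eq!(inner.atomic_count.load(Ordering::Relaxed), 1);
    assert_eq!(b.shared.data, 42);
}
```

This is why cloning within a thread can be cheaper than `Arc::clone`: only the non-atomic count is touched.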

A cycle between Trc pointers can never be deallocated, as the reference counts will never reach zero. The solution is Weak<T>: a non-owning reference to the data held by a Trc<T>. Weak pointers break reference cycles by adding a layer of indirection and acting as observers. They cannot access the data directly and must be converted back into a Trc. A Weak does not keep the value alive (the value can be dropped); it only keeps the backing allocation alive.

To soundly implement thread safety, Trc is !Send and !Sync. To solve this, Trc introduces SharedTrc<T>, which is Send and Sync. SharedTrc is the only way to safely send a Trc's data across threads without using a Weak. See SharedTrc for its API, which is similar to that of Weak.

Because Trc is not part of the standard library, the CoerceUnsized and Receiver traits cannot currently be implemented by default. However, Trc provides a dyn_unstable feature which enables these traits for Trc and SharedTrc; it requires nightly Rust (cargo +nightly ...).

Examples

See examples here.

Benchmarks

Click here for more benchmarks. Multiple different operating systems, CPUs, and architectures are tested.

Trc vs Arc performance

Use

To use Trc, simply run cargo add trc, or add trc = "1.2.3" to your Cargo.toml. Alternatively, to always use the latest version, add trc = {git = "https://github.com/EricLBuehler/trc.git"}.

trc's People

Contributors

dependabot[bot], ericlbuehler, hoxxep, s4ch, sigaloid

trc's Issues

Atomic `Drop` implementations are unsound

This is largely copied from my Reddit comment.

SharedTrc::drop:

fn drop(&mut self) {
    sub_value(&unsafe { self.data.as_ref() }.atomicref, 1);

    let atomic = unsafe { self.data.as_mut() }
        .atomicref
        .load(std::sync::atomic::Ordering::Relaxed);

    if atomic == 0 {
        unsafe { Box::from_raw(self.data.as_ptr()) };
    }
}

Trc::drop:

fn drop(&mut self) {
    *unsafe { self.threadref.as_mut() } -= 1;

    if *unsafe { self.threadref.as_ref() } == 0 {
        sub_value(&unsafe { self.shared.as_ref() }.atomicref, 1);
        let weak = unsafe { self.shared.as_mut() }
            .weakcount
            .load(std::sync::atomic::Ordering::Relaxed);
        let atomic = unsafe { self.shared.as_mut() }
            .atomicref
            .load(std::sync::atomic::Ordering::Relaxed);
        if weak == 0 && atomic == 0 {
            unsafe { Box::from_raw(self.threadref.as_ptr()) };
            unsafe { std::ptr::drop_in_place(self.shared.as_ptr()) };
        }
    }
}

are both unsound.

First off, your decrement operation and read operation are two different operations. You need to consider the possible interleavings of the operations: I can think of possible interleavings that cause double-free and use-after-free. Consider, for example:

  • Initial state: atomicref is 2.
  • A SharedTrc is dropped in thread A.
  • A SharedTrc is dropped in thread B.
  • Thread A decrements atomicref.
  • Thread B decrements atomicref.
  • Thread A reads atomicref, and gets 0.
  • Thread A deallocates self.data.
  • Thread B reads atomicref. This is a use after free, which is undefined behaviour. The likely (but not guaranteed) outcome is to read 0, so let's suppose this happens.
  • Thread B deallocates self.data. This is a double free, which is undefined behaviour.

The same issue exists for Trc::drop.

The solution here is to use the returned value from the fetch_sub operation to determine when to deallocate. If you get 1, then you are the last reference, and should deallocate. There is, of course, the tricky detail that you have to look at the weak count before deallocating - to solve this, look at how std::sync::Arc does it. It uses the weak count to determine when to deallocate. If there are no Weak references, then the weak count is 1. When the strong count reaches zero, the weak count is decremented - this means only the weak count needs to be considered when deallocating.
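A stdlib-only sketch of the suggested fix, with deallocation simulated by a counter so the behaviour can be asserted (`release` is a hypothetical helper for illustration, not the crate's code):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Sketch of a race-free drop: the *return value* of fetch_sub decides
// which thread deallocates, so no separate load is needed.
fn release(count: &AtomicUsize, deallocated: &AtomicUsize) {
    // Release ensures our prior writes are visible to the deallocating thread.
    if count.fetch_sub(1, Ordering::Release) == 1 {
        // Acquire fence: synchronise with every other Release decrement
        // before touching (here: "freeing") the data.
        fence(Ordering::Acquire);
        deallocated.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let count = Arc::new(AtomicUsize::new(8));
    let deallocated = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let (c, d) = (Arc::clone(&count), Arc::clone(&deallocated));
            thread::spawn(move || release(&c, &d))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Exactly one thread observed the count hit zero, however the
    // decrements interleaved.
    assert_eq!(deallocated.load(Ordering::Relaxed), 1);
}
```

Because the decrement and the zero-check are a single atomic operation, no interleaving can produce two "last reference" observations.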

Second off, the calls to as_mut() assert to the compiler that we are the only thread with any kind of access to self.data and self.shared. I'm not convinced that this is true, so I'm going to conservatively say this is undefined behaviour. This can be easily fixed by replacing the calls to as_mut() with calls to as_ref().

I believe that threadref.as_mut() is fine.

Third, you haven't performed the necessary synchronisation. Consider the type

use std::sync::atomic::{AtomicI32, Ordering::Relaxed};

struct MyStruct(AtomicI32);
impl MyStruct {
    fn new() -> Self {
        Self(AtomicI32::new(0))
    }
    fn inc(&self) {
        self.0.fetch_add(1, Relaxed);
    }
}
impl Drop for MyStruct {
    fn drop(&mut self) {
        println!("{}", self.0.get_mut());
    }
}

Now consider

  • Thread A constructs Trc::new(MyStruct::new()) and stores it in variable x.
  • Thread A constructs SharedTrc::from_trc(&x) and sends it to thread B.
  • Thread A calls x.inc(). This is a Relaxed atomic read-write. Let us call this access A.
  • Thread A drops x, resulting in a Relaxed atomic decrement.
  • Thread B drops the SharedTrc, resulting in a Relaxed atomic decrement.
  • Thread B calls MyStruct::drop. This is a non-atomic read. Let us call this access B.

Since access B is not atomic, and Acquire/Release semantics have not been used, it is not synchronised with access A. This is a data race, which is undefined behaviour.

To fix this:

  • Decrementing the counter needs to be a Release atomic operation.
  • There needs to be an Acquire fence before calling Drop::drop.

We can now establish that all counter decrements happen before the fence. Thus, we can establish that access A happens before access B, synchronising the accesses and avoiding the data race.
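The two fixes above can be sketched with std atomics only; `data` and `strong` are stand-ins for the Trc contents and strong count (a sketch of the pattern, not the crate's code):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Sketch of the suggested ordering: Release on the decrement publishes all
// prior writes; an Acquire fence on the last-reference path synchronises
// with every decrement before the destructor runs.
fn main() {
    let data = Arc::new(AtomicUsize::new(0));
    let strong = Arc::new(AtomicUsize::new(2));

    let (d, s) = (Arc::clone(&data), Arc::clone(&strong));
    let t = thread::spawn(move || {
        d.store(42, Ordering::Relaxed);    // "access A": a write to the data
        s.fetch_sub(1, Ordering::Release); // publish it with the decrement
    });
    t.join().unwrap();

    if strong.fetch_sub(1, Ordering::Release) == 1 {
        fence(Ordering::Acquire);          // synchronise before "dropping"
        // "access B": sees the write from the other thread, because the
        // Release decrement happens-before the Acquire fence.
        assert_eq!(data.load(Ordering::Relaxed), 42);
    }
}
```

The fence is only needed on the path that actually drops the value, which keeps the common (non-final) decrement cheap.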

So you want to be considered by T-libs?

Before it is worth putting in the queue for review:

  1. This crate must pass miri. Ideally it would also pass miri with the following flags:
  2. This crate must be sound against all the correctness errors we have encountered in Arc and Rc and solved:
  3. In addition, you must validate that it remains sound and fast even when used on "weak memory model" platforms. PowerISA (aka POWER aka PowerPC) is the weakest I know of for modern CPUs, but in a pinch, testing on AArch64 should flush out most problems, and is more important because we support aarch64-unknown-linux-gnu as a tier 1 platform. That makes the soundness of its implementation of consequently higher concern. We do take correctness seriously for tier 2 platforms; we're just humans with priorities.

  4. Finally, and this is the most subjective evaluation: there has to be a reason to actually include it in std instead of just leaving it as an external crate. It's not good enough for it to simply be faster in a few cases. We often reject new API simply if it is too confusing or niche to choose when to use it.

M1 Benchmarks

Making a separate issue for cleanliness.

Issue #10 point (3) described needing ARM / aarch64 benchmarks. Attached are M1 benchmarks that I can include in BENCHMARKS.md if helpful.

Running benches/benchmark.rs (target/release/deps/benchmark-dcc1722be38d6f11)

Clone Trc               time:   [29.543 ns 29.583 ns 29.630 ns]
Clone Arc               time:   [19.866 ns 19.898 ns 19.939 ns]
Clone Rc                time:   [16.147 ns 16.167 ns 16.192 ns]

Multiple clone Trc      time:   [439.11 ns 439.65 ns 440.27 ns]
Multiple clone Arc      time:   [968.11 ns 969.31 ns 970.65 ns]
Multiple clone Rc       time:   [409.21 ns 409.79 ns 410.41 ns]

Deref Trc               time:   [16.216 ns 16.234 ns 16.254 ns]
Deref Arc               time:   [16.603 ns 16.658 ns 16.732 ns]
Deref Rc                time:   [15.510 ns 15.527 ns 15.546 ns]

Multiple deref Trc      time:   [48.608 ns 48.942 ns 49.274 ns]
Multiple deref Arc      time:   [48.444 ns 48.821 ns 49.214 ns]
Multiple deref Rc       time:   [46.125 ns 46.199 ns 46.293 ns]

Multiple threads Trc    time:   [1.6919 ms 1.7013 ms 1.7121 ms]
Multiple threads Arc    time:   [2.6750 ms 2.6922 ms 2.7177 ms]

Multiple threads Trc Medium
                        time:   [1.9873 ms 1.9988 ms 2.0110 ms]
Multiple threads Arc Medium
                        time:   [6.4657 ms 6.4865 ms 6.5094 ms]

Multiple threads Trc Long
                        time:   [8.1004 ms 8.1267 ms 8.1547 ms]
Multiple threads Arc Long
                        time:   [97.849 ms 97.950 ms 98.056 ms]

Multiple threads Trc Super
                        time:   [34.427 ms 34.517 ms 34.609 ms]
Multiple threads Arc Super
                        time:   [476.58 ms 476.73 ms 476.88 ms]

More trait implementations and methods

Methods

  • Replace Weak::from_trc with Trc::downgrade
  • Replace Weak::to_trc with Weak::upgrade
  • Add Trc::get_mut
  • Add Trc::try_unwrap - Note the necessity for a race condition warning
  • Add Trc::into_inner
  • Add Trc::unwrap_or_clone

Traits

"D" signifies impl dependant on T implements the trait.

  • D AsFd
  • #11
  • D Error
  • D Hash
  • D Error
  • Unpin
  • UnwindSafe
  • Any
  • From<&[T]>

Seems unsound

What prevents me from simply cloning a Trc value, sending it to another thread, and then cloning both values on different threads? Trc should not implement Send and Sync, because it is neither. Rather, it should be able to produce a Send + Sync value that can be sent across threads and then converted back to a Trc.

Benchmarks shouldn't use `Instant::now()` in a loop

I see your benchmarks call Instant::now() in the inner loop. That timing function takes much longer than the deref operation being benchmarked, so what you end up benchmarking is how fast you can read the clock, rather than how fast you can deref.

let start = Instant::now();

A better approach is to hoist the timing outside the loop, so you only call the clock once per 100000 iterations. You should also use std::hint::black_box to ensure that computations don't get hoisted outside the loop or eliminated by the compiler. So, a better benchmark would be:

use std::rc::Rc;
use std::time::Instant;

fn test_deref_rc() -> f64 {
    let rc = Rc::new(100);

    let start = Instant::now();
    for _ in 0..100000 {
        std::hint::black_box(**std::hint::black_box(&rc));
    }
    let end = Instant::now();
    (end - start).as_nanos() as f64 / 100000.
}

Memory fences and general improvements

Quoted from u/TDplay from this comment.

Todo

  • Weak::to_trc race condition
  • Weak::clone race condition
  • impl Clone for SharedTrc
  • Support no_std
  • Trc::cmp documentation
  • Fix Pointer behavior
  • Use different Layout API.
  • Use ptr::drop_in_place instead of ptr::read

Quote

trc/src/trc.rs

Line 245 in 89f3eb2

let weak =

Not entirely sure about this part. You seem to decrement and check the weak count, and then construct and drop a Weak (thus calling Weak::drop, which should also decrement and check the weak count).

It also seems like you only drop the data when the weak count reaches zero, which seems to defeat the point of supporting Weak pointers.

Another point is to perhaps consider using ptr::drop_in_place instead of ptr::read. It makes the intent more obvious, and it can make some optimisations more obvious to the compiler (since it doesn't need to reason about removing the copy). I think this would also allow you to support dynamically-sized types.
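A small stdlib sketch of the suggested change: ptr::drop_in_place runs the destructor in place, and the allocation is then freed separately (`Noisy` and the drop counter are invented for illustration):

```rust
use std::cell::Cell;
use std::ptr;

thread_local! {
    // Counts how many times Noisy's destructor runs.
    static DROPS: Cell<u32> = Cell::new(0);
}

struct Noisy(u8);
impl Drop for Noisy {
    fn drop(&mut self) {
        DROPS.with(|d| d.set(d.get() + 1));
    }
}

fn main() {
    let p = Box::into_raw(Box::new(Noisy(0)));
    unsafe {
        // Runs Noisy's destructor in place; no copy of the value is made
        // (ptr::read would move the bytes out and drop the copy). This
        // also works for dynamically sized T.
        ptr::drop_in_place(p);
        // Free the allocation without running the destructor again.
        std::alloc::dealloc(p as *mut u8, std::alloc::Layout::new::<Noisy>());
    }
    // The destructor ran exactly once.
    assert_eq!(DROPS.with(|d| d.get()), 1);
}
```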

Trc::drop looks fine.

trc/src/trc.rs

Line 1148 in 89f3eb2

pub fn to_trc(this: &Self) -> Option<Trc<T>> {

Weak::to_trc has a race condition, leading to a possible use-after-free.

  • Initial state: We have a SharedTrc in thread A and a Weak in thread B.
  • Thread B calls Weak::to_trc.
  • Weak::to_trc checks the strong count, and reads 1.
  • Thread A drops the SharedTrc, decrementing the strong count.
  • SharedTrc::drop notices that the strong count is now zero, so it deallocates the data.
  • Weak::to_trc increments the strong count and constructs a Trc.
  • The Trc is now passed to user code, which accesses the contents, causing a use-after-free.

To resolve this, the zero-check and the increment need to be one atomic operation, and the increment needs to not happen if the zero check fails. fetch_update is perfect for this. Sadly, this is slower than fetch_add, but I can't see a way to avoid it, and it seems the standard library developers couldn't either.
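A stdlib-only sketch of the suggested fetch_update approach (`try_upgrade` is a hypothetical helper for illustration, not the crate's API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch of a race-free upgrade: the zero-check and the increment are a
// single atomic operation, so the count can never go from 0 back to 1.
fn try_upgrade(strong: &AtomicUsize) -> bool {
    strong
        .fetch_update(Ordering::Acquire, Ordering::Relaxed, |n| {
            if n == 0 { None } else { Some(n + 1) }
        })
        .is_ok()
}

fn main() {
    let strong = AtomicUsize::new(1);
    assert!(try_upgrade(&strong)); // 1 -> 2: upgrade succeeds
    assert_eq!(strong.load(Ordering::Relaxed), 2);

    let dead = AtomicUsize::new(0);
    assert!(!try_upgrade(&dead)); // already 0: upgrade must fail
    assert_eq!(dead.load(Ordering::Relaxed), 0);
}
```

fetch_update retries internally via compare_exchange_weak, which is why it is slower than a plain fetch_add but immune to this race.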

trc/src/trc.rs

Line 1191 in 89f3eb2

if prev > MAX_REFCOUNT {

Clone for Weak technically has a race condition: if a program clones an absurd number of Weaks in different threads, with each thread being pre-empted before it can call abort, it may be possible to overflow the weak count. In practice, I don't think any real program will ever do this; the actual likelihood of this behaviour is slim.

That's the big stuff out of the way, on to the small nitpicks:

Clone is not implemented for SharedTrc; this feels like an oversight.

You might be able to support no_std. Add this to your lib.rs:

#![cfg_attr(not(test), no_std)]

extern crate alloc;
#[cfg(feature = "force_lock")]
extern crate std;

It will then throw errors at every use of std. Replace these with either core or alloc.

trc/src/trc.rs

Line 899 in 89f3eb2

/// Create a new `Trc<T>` from the provided data. This is equivalent to calling `Trc::new` on the same data.

Documentation for Trc::cmp doesn't seem right.

trc/src/trc.rs

Line 793 in 89f3eb2

impl<T: Pointer> Pointer for Trc<T> {

Behaviour of the Pointer formatting seems a little strange. Trc is a pointer, so it should probably print the address the Trc points to. This would also be in-line with what the standard library smart pointers do: Arc, Rc, Box, &T, and &mut T all implement Pointer unconditionally.

trc/src/trc.rs

Line 1043 in 89f3eb2

let layout = unsafe { Layout::from_size_align_unchecked(size, align) };

There is no need for the unsafe call to Layout::from_size_align_unchecked; you can just use:

Layout::new::<SharedTrcInternal<T>>()

If you choose to support dynamically sized types, then you can instead use:

Layout::for_value
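The two suggested alternatives can be compared in a short stdlib-only example:

```rust
use std::alloc::Layout;

fn main() {
    // For a statically sized type, the layout is known at compile time,
    // with no unsafe needed.
    let a = Layout::new::<[u32; 4]>();

    // For a (possibly dynamically sized) value, compute it from the value.
    let slice: &[u32] = &[1, 2, 3, 4];
    let b = Layout::for_value(slice);

    // Same size (16 bytes) and alignment (4) either way.
    assert_eq!(a, b);
}
```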

Inadequate microbenchmarks

Your benchmarks are unrealistically simple and do not cover common concerns:

And your benchmarks are... probably inadequate. You are microbenching the most primitive operation. It can still be annoyingly obvious to the optimizer if the result is e.g. unused. You need more holistic tests that demonstrate Trc improves a semi-realistic workload. I mean, strictly speaking, that isn't necessary for acceptance to std, I just think you should, in order to better support your claim.

btw since your implementation does two allocations in new() rather than just one, you should probably also include a benchmark for Trc::new(). (and perhaps also one for converting between SharedTrc and Trc.)

And you should make sure the tested functions are actually generating different assembly for these types and operations when compiled. Otherwise, you may be actually microbenchmarking something else, like the quality of LLVM's optimizer in piercing certain abstractions. No, "black box" functions are not actually completely immune to the optimizer's gaze.
