nessex / rdst Goto Github PK

View Code? Open in Web Editor NEW

4.0 3.0 1.0 753 KB

A flexible, native Rust implementation of unstable radix sort.

License: Apache License 2.0

Rust 100.00%

rust radix-sort sorting-algorithms algorithms multithreading

rdst's Introduction

rdst

rdst is a flexible native Rust implementation of multi-threaded unstable radix sort.

Usage

let mut my_vec = vec![4, 7, 1, 6, 5, 3, 2, 8, 9];

my_vec.radix_sort_unstable();

In the simplest case, you can use this sort by simply calling my_vec.radix_sort_unstable(). If you have a custom type to sort, you may need to implement RadixKey for that type.

Default Implementations

RadixKey is implemented for Vec and [T] of the following types out-of-the-box:

u8, u16, u32, u64, u128, usize
i8, i16, i32, i64, i128, isize
f32, f64
[u8; N]

Implementing `RadixKey`

To be able to sort custom types, implement RadixKey as below.

LEVELS should be set to the total number of bytes you will consider for each item being sorted
get_level should return the corresponding bytes from the least significant byte to the most significant byte

Notes:

This allows you to implement radix keys that span multiple values, or to implement radix keys that only look at part of a value.
You should try to make this as fast as possible, so consider using branchless implementations wherever possible

use rdst::RadixKey;

struct MyType(u32);

impl RadixKey for MyType {
    const LEVELS: usize = 4;

    #[inline]
    fn get_level(&self, level: usize) -> u8 {
        (self.0 >> (level * 8)) as u8
    }
}

Partial `RadixKey`

If you know your type has bytes that will always be zero, you can skip those bytes to speed up the sorting process. For instance, if you have a u32 where values never exceed 10000, you only need to consider two of the bytes. You could implement this as such:

use rdst::RadixKey;
struct U32Wrapper(u32);

impl RadixKey for U32Wrapper {
    const LEVELS: usize = 2;

    #[inline]
    fn get_level(&self, level: usize) -> u8 {
        (self.0 >> (level * 8)) as u8
    }
}

Multi-value `RadixKey`

If your type has multiple values you need to search by, simply create a RadixKey that spans both values.

use rdst::RadixKey;
struct MyStruct {
    key_1: u8,
    key_2: u8,
    key_3: u8,
}
impl RadixKey for MyStruct {
    const LEVELS: usize = 3;

    #[inline]
    fn get_level(&self, level: usize) -> u8 {
        match level {
          0 => self.key_1[0],
          1 => self.key_2[1],
          _ => self.key_3[0],
        }
    }
}

Low-memory Variant

use rdst::RadixSort;
let mut my_vec: Vec<usize> = vec![10, 15, 0, 22, 9];
my_vec
    .radix_sort_builder()
    .with_low_mem_tuner()
    .sort();

This library also includes a mostly in-place variant of radix sort. This is useful in cases where memory or memory bandwidth are more limited. Generally, this algorithm is slightly slower than the standard algorithm, however in specific circumstances this algorithm may even provide a speed boost. It is worth benchmarking against your use-case if you need the ultimate level of performance.

Single-threaded Variant

To make this library use an entirely single-threaded set of algorithms and processes, you can use the following snippet.

use rdst::RadixSort;
let mut my_vec: Vec<usize> = vec![10, 15, 0, 22, 9];

my_vec
    .radix_sort_builder()
    // Use a tuner that only includes single-threaded algorithms
    .with_single_threaded_tuner()
    // Don't run multiple algorithms (even single-threaded ones) in parallel
    .with_parallel(false)
    .sort();

NOTE: If you are ONLY using the single-threaded variant of this radix sort, you can disable the default "multi-threaded" feature on the rdst dependency to remove large sub-dependencies like Rayon.

[dependencies.rdst]
version = "x.y.z"
default-features = false

With the "multi-threaded" feature disabled, even the default my_data.radix_sort_unstable() will use a single-threaded tuner.

Custom Tuners

Tuners are things which you can implement to control which sorting algorithms are used. There are many radix sorting algorithms implemented as part of this crate, and they all have their pros and cons. If you have a very specific use-case it may be worth your time to tune the sort yourself.

use rdst::RadixSort;
use rdst::tuner::{Algorithm, Tuner, TuningParams};

struct MyTuner;

impl Tuner for MyTuner {
    fn pick_algorithm(&self, p: &TuningParams, _counts: &[usize]) -> Algorithm {
        if p.input_len >= 500_000 {
            Algorithm::Ska
        } else {
            Algorithm::Lsb
        }
    }
}

let mut my_vec: Vec<usize> = vec![10, 25, 9, 22, 6];
my_vec
    .radix_sort_builder()
    .with_tuner(&MyTuner {})
    .sort();

License

Licensed under either of

Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

rdst's People

Contributors

Stargazers

Watchers

Forkers

vigna

rdst's Issues

Panic when sorting a billion items

The following program panics:

use rand::{rngs::SmallRng, Rng, SeedableRng};
use rdst::*;
#[derive(Default, Debug, Copy, Clone, PartialEq, Eq)]
pub struct SigVal {
    pub sig: [u64; 2],
    pub val: usize,
}

impl RadixKey for SigVal {
    const LEVELS: usize = 16;

    fn get_level(&self, level: usize) -> u8 {
        (self.sig[level / 8] >> level % 8) as u8
    }
}

fn main() {
    const N: usize = 1_000_000_000;
    let mut r = SmallRng::seed_from_u64(0);
    let mut v: Vec<_> = (0..N)
        .map(|i| SigVal {
            sig: [r.gen(), r.gen()],
            val: i,
        })
        .collect();
    v.radix_sort_unstable();
}

Result:

thread '<unnamed>' panicked at /Users/vigna/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rdst-0.20.11/src/sorts/out_of_place_sort.rs:207:19:
attempt to subtract with overflow

Wrong order after rdst

We have a file containing a billion pairs of usize that rdst fails to sort properly. You can get the data at http://vigna.di.unimi.it/data.tsv.gz (you'll have to gunzip it), and the program to replicate the problem is

use rayon::prelude::*;
use rdst::*;
use std::{fs::File, io::BufRead, io::BufReader};

#[derive(Clone, Copy)]
struct Pair([usize; 2]);

impl RadixKey for Pair {
    const LEVELS: usize = 16;

    fn get_level(&self, level: usize) -> u8 {
        (self.0[1 - level / 8] >> ((level % 8) * 8)) as u8
    }
}

fn main() {
    let mut v = Vec::with_capacity(1_000_000_000);
    for l in BufReader::new(File::open("data.tsv").unwrap()).lines() {
        let line = l.unwrap();
        let mut iter = line.split("\t").map(|x| x.parse::<usize>().unwrap());
        v.push(Pair([iter.next().unwrap(), iter.next().unwrap()]));
    }
    v.radix_sort_unstable();
    v.par_windows(2)
        .for_each(|v| assert!(v[0].0 <= v[1].0, "{:?} > {:?}", v[0].0, v[1].0));
}

which prints

thread '<unnamed>' panicked at src/bin/test.rs:25:23:
[1050093940, 84812442] > [1050093939, 86974995]

Algorithm::MtLsb does not sort correctly

While comparing algorithms I noticed that Algorithm::MtLsb does not always result in a sorted array. My datatype is a u32 and the vector is 40K values distributed over the range 0..4_000_000 with duplicate values here and there.

This is my implementation for the RadixKey trait (Yeet is a struct(u32, u32))

impl RadixKey for Yeet {
    const LEVELS: usize = 4;

    #[inline]
    fn get_level(&self, level: usize) -> u8 {
        (self.y >> (level * 8)) as u8
    }
}

Input of 129 `u32`values which is not sorted correctly by rdst

I noticed that rdst wouldn't reliably sort inputs and managed to reduce it down from my 80 GB of [u8; 36] input use case down to this small testcase, which fails consistently for me:

#[test]
fn broken_radix_sort() {
    let input: [u32; 129] = [
        0x10001, 0x1010000, 0, 0, 0x10000, 0x1000000, 0x1000000, 0x1010000, 0x1010000, 0,
        0x1000000, 0, 0x10000, 0, 0x1000000, 0x1010000, 0x1010000, 0x1010000, 0, 0, 0, 0,
        0x1000000, 0x1010000, 0x1000000, 0x10000, 0, 0x10000, 0x1000000, 0x1000000, 0x1010000,
        0x10000, 0x10000, 0x1000000, 0x1000000, 0, 0x10000, 0x1010000, 0x1000000, 0, 0x10000,
        0x1010000, 0x1000000, 0x1000000, 0x1000000, 0, 0x1010000, 0x1010000, 0, 0x1000000, 0x10000,
        0x1010000, 0x10000, 0, 0x10000, 0x1000000, 0x1010000, 0x1000000, 0x1010000, 0x10000,
        0x10000, 0x10000, 0x1000000, 0x10000, 0x1010000, 0x1000000, 0x1010000, 0x10000, 0,
        0x1000000, 0x1010000, 0x10000, 0, 0x1010000, 0x1010000, 0x10000, 0x10000, 0x1010000,
        0x1010000, 0x10000, 0, 0x10000, 0x10000, 0x10000, 0x10000, 0, 0x1000000, 0x10000,
        0x1000000, 0x1010000, 0x1010000, 0x10000, 0, 0x1010000, 0x1010000, 0x1010000, 0x10000,
        0x1000000, 0x1010000, 0x10000, 0x1000000, 0x1010000, 0x1010000, 0, 0x1000000, 0x1010000,
        0x10000, 0x1000000, 0x1010000, 0x1000000, 0, 0x1000000, 0x1010000, 0x1010000, 0x10000,
        0x1010000, 0, 0, 0x1010000, 0x1010000, 0, 0, 0x1010000, 0x1010000, 0, 0x1000000, 0x1000000,
        0x1000000, 0,
    ];

    let mut sort_with_radix = input;
    let mut sort_with_std = input;

    sort_with_radix.radix_sort_unstable();
    sort_with_std.sort_unstable();

    assert_eq!(sort_with_radix, sort_with_std);
}

thread 'broken_radix_sort' panicked at 'assertion failed: `(left == right)`
  left: `[0, 0, 65536, 0, 0, 65536, 0, 0, 0, 0, 0, 65536, 0, 65536, 65536, 65536, 0, 65536, 0, 65536, 0, 0, 65536, 65536, 0, 65536, 65536, 65536, 65536, 65536, 65536, 0, 65536, 0, 65536, 65536, 65536, 0, 65536, 65536, 65536, 65536, 0, 65536, 65536, 0, 65536, 65536, 0, 65536, 0, 65536, 0, 0, 0, 0, 0, 0, 65537, 16842752, 16777216, 16777216, 16842752, 16842752, 16777216, 16777216, 16842752, 16842752, 16842752, 16777216, 16842752, 16777216, 16777216, 16777216, 16842752, 16777216, 16777216, 16842752, 16777216, 16842752, 16777216, 16777216, 16777216, 16842752, 16842752, 16777216, 16842752, 16777216, 16842752, 16777216, 16842752, 16777216, 16842752, 16777216, 16842752, 16777216, 16842752, 16842752, 16842752, 16842752, 16842752, 16777216, 16777216, 16842752, 16842752, 16842752, 16842752, 16842752, 16777216, 16842752, 16777216, 16842752, 16842752, 16777216, 16842752, 16777216, 16842752, 16777216, 16777216, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16777216, 16777216, 16777216]`,
 right: `[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65536, 65537, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752, 16842752]`', src/main.rs:233:5

`subtract with overflow` panic in `lr_out_of_place_sort`

I'm sorting a list of 5000 struct elements that implement RadixKey as follows:

#[derive(Debug, Clone, Copy)]
struct IH {
    /// Index in the original text.
    idx: usize,
    /// Hash of a prefix-extension of the suffix starting at `i`.
    h: Mod,
}

/// Sort `IH` by the contained hash.
impl RadixKey for IH {
    const LEVELS: usize = 8;
    #[inline]
    fn get_level(&self, level: usize) -> u8 {
        (self.h.0 >> (level * 8)) as u8
    }
}

My tests keep running into this error:

thread 'test::random_ssa' panicked at /home/philae/.local/share/cargo/registry/src/index.crates.io-6f17d22bba15001f/rdst-0.20.11/src/sorts/out_of_place_sort.rs:249:9:
attempt to subtract with overflow
stack backtrace:
...
   3: rdst::sorts::out_of_place_sort::lr_out_of_place_sort
             at /home/philae/.local/share/cargo/registry/src/index.crates.io-6f17d22bba15001f/rdst-0.20.11/src/sorts/out_of_place_sort.rs:249:9
   4: rdst::sorts::lsb_sort::<impl rdst::sorter::Sorter>::lsb_sort_adapter
             at /home/philae/.local/share/cargo/registry/src/index.crates.io-6f17d22bba15001f/rdst-0.20.11/src/sorts/lsb_sort.rs:107:21
   5: rdst::sorter::Sorter::run_sort
             at /home/philae/.local/share/cargo/registry/src/index.crates.io-6f17d22bba15001f/rdst-0.20.11/src/sorter.rs:64:37
   6: rdst::sorter::Sorter::handle_chunk
             at /home/philae/.local/share/cargo/registry/src/index.crates.io-6f17d22bba15001f/rdst-0.20.11/src/sorter.rs:148:9
   7: rdst::sorter::Sorter::top_level_director
             at /home/philae/.local/share/cargo/registry/src/index.crates.io-6f17d22bba15001f/rdst-0.20.11/src/sorter.rs:164:9
   8: rdst::radix_sort_builder::RadixSortBuilder<T>::sort
             at /home/philae/.local/share/cargo/registry/src/index.crates.io-6f17d22bba15001f/rdst-0.20.11/src/radix_sort_builder.rs:155:9
   9: ssa::Ssa::new_params::dfs
             at ./src/lib.rs:129:13

This doesn't happen for sequences with length up to 4000 though (probably because a different algorithm is used?).

I'm not sure if my data has specific properties. If this is hard to replicate using some random fuzzing I'm happy to attach some data.

Sort by key

Feature suggestion: a version of the sort method with by_key, similar to sort_by_key for slices. This would avoid in many situations to have to define a struct containing a key and a value, when maybe tuples are sufficient.

Make rayon optional

In some cases projects have existing threading solutions and do not want/need to also pull in rayon. It would be useful for rayon and the dependencies needed for multi-threaded sorting to be an optional feature.

Panics when trying single-threaded sort

The following program tries to sort using a single thread.

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

use rdst::RadixSort;

const BATCH_SIZE: usize = 1_000_000;

fn sort_slice_rdst(data: &[f64]) -> Vec<f64> {
    let mut data = data.to_vec();
    data.radix_sort_builder().with_parallel(false).sort();
    data
}

#[test]
fn test_rdst() {
    let mut rng = StdRng::seed_from_u64(42);
    let data = (0..BATCH_SIZE)
        .map(|_| rng.gen_range(0..100_000) as f64)
        .collect::<Vec<f64>>();
    let _sorted = sort_slice_rdst(&data);
}

With a BATCH_SIZE of 1_000_000 it fails with


Bad algorithm: MtLsb for len: 1000000
thread 'rdst_test::test_rdst' panicked at 'Bad algorithm: MtLsb for len: 1000000', /home/jhorstmann/.cargo/registry/src/github.com-1ecc6299db9ec823/rdst-0.20.3/src/sorter.rs:69:22

Another BATCH_SIZE of 10_000_000 fails with a different panic:


range end index 10000000 out of range for slice of length 1250000
thread 'rdst_test::test_rdst' panicked at 'range end index 10000000 out of range for slice of length 1250000', /home/jhorstmann/.cargo/registry/src/github.com-1ecc6299db9ec823/rdst-0.20.3/src/sorts/ska_sort.rs:61:18

Idea: Skip counting for initial elements

Writing down a theory, which I never find the time to test out. It's most likely no good, but in the case of extremely large and well distributed datasets, there's a chance it is useful. For example, something like chiapos might benefit a lot from this.

Theory

It's possible to skip counting the first elements until you have a single bucket of the final data filled (or nearly filled).

Calculate a bucket_size of (data.len() / 256).floor()
Create a prefix sums array, with sums of [x for x * bucket_size in 0..256]
Calculate an end offsets array from the sums
Read the first bucket_size elements, and put them in the output array based on their radix and the corresponding prefix (write offset). Increment the write offset by 1 in the sums array.
Scan the sums array, find the end_offset - write_offset that is closest to being full. This difference will be the amount of data to read next without counting.
- Should probably skip when the difference is < 256 items
Repeat 4-5 until either a bucket is completely full, or there are < 256 items till the end of a bucket
Count the remaining items the traditional way
Derive the counts of the initially added items from the prefix sums, add those to the counts from 7.
Use the new and correct counts to determine the real prefix sums and end offsets
Move any items which were placed incorrectly
Add counts for initial items to the real prefix sums (since they are already in-place)
Place the remaining items per usual

Best case

For uniformly distributed data, this will be extremely effective. It will sort the data with very little correction required, and skipping almost all counting.

Worst case

Data which is exponentially distributed and ascending, but not quite sorted, will result in large errors in placement for the initial items.

Comments

This can only be useful for extremely large datasets, at least over 65,535 items but probably not useful till much larger still. However for those large datasets, it may provide a significant speed boost by skipping a lot of counting, or more correctly, deriving the counts after moving the first few items.

Best algorithm for random integers

Based on your experience and testing, do you have a suggested best tuning for sorting random 128-bit keys?