skeeto / hash-prospector Goto Github PK

View Code? Open in Web Editor NEW

656.0 656.0 26.0 89 KB

Automated integer hash function discovery

License: The Unlicense

Makefile 1.71% C 98.29%

hash-prospector's People

Contributors

Stargazers

Watchers

hash-prospector's Issues

For windows

Great job! I love this.
I've converted to windows for those who interested.
with help of Chat GPT
https://gist.github.com/benanil/581a895af67f53a6adfa5bd79b86212f

Example lowbias32() demonstrates periodicity biases

The headline example lowbias32 (with constants 0x7feb352d and 0x846ca68b) has periodicity biases. As a refresher, the periodicity of a perfect/linear function is the number of rounds that it takes before you arrive back at the starting value. For example:

uint64_t period = 0;
uint32_t start = rdrand32();
uint32_t walk = start;
do {
    period++;
    walk = lowbias32(walk);
} while (walk != start);
assert(period < (1ul << 32));
printf("lowbias32 start 0x%08X period: 0x%lX\n", start, period);

Ideally, the period for a 32-bit perfect/linear function is 2^32-1 (or 2^32 if zero as input isn't special) but for the example "lowbias32", the period is usually but not always 0xF671EECE, which is a form of bias. Even worse, there seem to be pathological inputs that have significantly lower periods. For example:

lowbias32 start 0x00000004 period: 0x94C3412
lowbias32 start 0x0000000C period: 0x94C3412
lowbias32 start 0x00000027 period: 0x94C3412
lowbias32 start 0x00000029 period: 0x94C3412
...
lowbias32 start 0x00000548 period: 0x321A6B
...
lowbias32 start 0x00000623 period: 0xA6646

etc.

Usage in Path Tracing

Or Monte Carlo simulations in general. I'm using this for a path tracer to generate coordinates within a pixel to take the next sample at. It's something like this:

struct vec2f { float x; float y; }

vec2f random_within_pixel(vec2f uv_min, vec2f uv_max, int seed) {
  int x = lowbias32(seed);
  int y = lowbias32(seed + 1);  // in the next call to this function: seed += 2
  float u = float(x) / std::numeric_limits<int>::max();
  float v = float(y) / std::numeric_limits<int>::max();
  return {
    uv_min.x + ((uv_max.x - uv_min.x) * u),
    uv_max.y + ((uv_max.y - uv_min.y) * v)
  };
}

The idea of using a hash instead of an RNG is not new for path tracing, Pixar does this with Correlated Multi-Jittered Sampling. Since I don't know much about the gory details of random number generation, I figured I'd get your take on whether or not this is a good use of lowbias32. Thanks in advance.

Provide a vectorized output bit count loop

This loop:

hash-prospector/prospector.c

Lines 567 to 568 in 79a6074

    
           for (int k = 0; k < 32; k++) 
        
               bins[j][k] += (set >> k) & 1;

vectorizes really well even with SSE2 and x86-64 is already required.

Note: if the density of 1's is low enough, branching algorithms similar to this https://lemire.me/blog/2019/05/03/really-fast-bitset-decoding-for-average-densities might be better, but they have worse worst-case performance

Class of Bijective 128-Bit Hashes for GPU

First of all, big fan of this class of hashes. Came across your work through Shadertoy and went way too far down the rabbithole trying to see if I could come up with a good adaption of this for 128-bit hashes on a GPU. I don't really know any C, so I couldn't give you the C equivalent here, but these are super fast on a GPU because 4-element vectors can swap 32-bit values between them at zero cost. I found an article by Marc Reynolds http://marc-b-reynolds.github.io/math/2017/10/13/XorRotate.html about reversing double XOR-rotates and came to the realization that a vector swizzle is just a rotation by multiples of 32 bits. So the GLSL code goes like this:

const uvec4 shft = uvec4(14U, 15U, 16U, 17U);
uvec4 bjhash128(uvec4 p) {
    p ^= p >> shft;
    p *= uvec4(0xEAF649A9U, 0x6AF649A9U, 0x050C2D35U, 0xAAF649A9U);
    p ^= p.yzwx ^ p.zwxy;
    p ^= p >> shft.yzwx;
    p *= uvec4(0x21F0AAADU, 0x0D0C2D35U, 0xE8F649A9U, 0xD30332D3U);
    p ^= p.yzwx ^ p.zwxy;
    return p ^ p >> shft.wxyz;
}

And the inverse:

uvec4 bjhash128_inv(uvec4 p) {
    p ^= p >> shft.wxyz;
    p.yz ^= p.yz >> (shft.xy << 1U);
    p = dblxoro128_inv(p);
    p *= uvec4(0x333C4925U, 0x12DC7D1DU, 0x349E6A99U, 0x3B43F55BU);
    p ^= p >> shft.yzwx;
    p.xw ^= p.xw >> (shft.yx << 1U);
    p = dblxoro128_inv(p);
    p *= uvec4(0x529E6A99U, 0xD29E6A99U, 0x5ADC7D1DU, 0x929E6A99U);
    p ^= p >> shft;
    p.xy ^= p.xy >> (shft.xy << 1U);
    return p;
}

And the Shadertoy is here https://www.shadertoy.com/view/mstXD2 with further explanation and some basic debugging tools. Obviously the above code was written with vector parallelism in mind, so I apologize if this doesn't translate to C very well, but I'm hoping this might give you a good starting point for developing a faster 128-bit variant in C.
I was rolling around the idea of writing a compute shader to do what your Hash Prospector does, but on the GPU so as to be able to scale it efficiently for 128-bit testing. As such, I still haven't done any tests for statistical robustness other than a visual inspection, so there's room for improvement in choosing the correct constants and bitwise shifts.
That being said, all the bits seem to be visually random, both the lowest and highest bits, so that gives me hope that this will be a feasible variant. I'm just curious as to what your thoughts are.

Normalize the score

I tweaked the score function as follow:

    double sum = 0;
    for (int j = 0; j < nbits; j++) {
        for (int k = 0; k < nbits; k++) {
            double diff = bins[j][k] - n / 2.0;
            sum += diff * diff;
        }
    }
    return 2 * sqrt(sum / n) / nbits;

Now the scores of good hash functions, for both 32 bit and 64 bit, different -q values, are around 1.0.

I don't understand the statistics, but I have a guess on how the score function works. The score function is equivalent to the fair coin test problem. You have 32x32 coins, perform coin flips, and yield 32x32 frequencies, these frequencies deviate from 0.5, compare the variance of frequency to the expected variance of fair coins. And the score can be less than 1.0 due to the variance of variance.

What I don't expect is the exact score is significantly larger than the estimated score. Many xorr,mul sequences yield an estimated score around 1.0, but the exact score of lowbias32 is 11.4. The lowest exact score I tested is two AES round, yielding an exact score of 1.4.

OP-REQ: byte shuffle for 64bit or 32bit operation

Bytes shuffling is another reversible operation and the effects should between mul and xorshift.

Generic version for 32/64 bit hash prospector.

if do not use jit, the 32/64 bit prospector may be more easy to maintain.

[Q] How long is too long when running `prospector`?

This tool seems quite amazing and we were hoping to use it to divine some hash functions for us but after around 4 hours of runtime with 0 output given:

./prospector -8 -p xrot2,mul

, I was wondering how long is too long?

Also, I am aware that xrot2 is only available from the PR here: #13

I'm wondering, is the xrot2 implementation simply faulty? Or is running against the 64 bit space simply too large to realistically test in that timeframe? I also noted the code was only using 1 core as well. Is there an option to parallelize the execution?

The "theoretical bias limit (bias = ~0.021)" for 32-bit hash functions isn't correct.

Introduction

I've been doing a lot of testing using the bias calculation method used in the repo lately, and I wanted to calculate the best possible bias (the bias that would occur if swapping one bit would change every bit with a probability of 50%).

Your blog post mentioned the value bias = ~0.021 in the context of 32-bit hashes, but since I'm testing other powes-of-two, I needed to calculate the bias by my self.

I came to the conclusion that the optimal bias for 32-bit hashes is ~0.006 and that, because I needed to modify the bias calculation code, triple32 has a bias of ~0.0083.

I validated this result by implementing a hash function that returns random number independent of the value to hash and the results confirm my calculations:

original method:
    calculated: ~0.021
    tested:      0.015610715938394663
my method:
    calculated: 0.006087376104758189
    tested:     0.006108424031481263

How I calculated the bias

I was initially skeptical towards the normalization method, because of the "FIXME" comment in the implementation of estimate_bias32, and thus I took the liberty to change how the bias is calculated.

The theoretical bias limit is equal to the mean of the probability distribution of obtaining a specific bias.
Here that is a binomial distribution where with a 50% probability either one or zero is added to the counter, thus the mean is at n/2. For an easier time we can now center the normal distribution by subtracting the mean (n/2) and divide by the n to normalize the bias. Then we take the absolute value since we are interested in the average error, and we need to get rid of the sign to calculate that.

Now we have a half-normal distribution (a normal distribution where the absolute value of the result is taken) from which the mean can be calculated using variance*sqrt(2/pi), where the variance is the variance of the original normal distribution where the absolute value hasn't been taken yet. The original variance can be calculated using the formula sqrt(n*0.5*0.5), since we are dealing with a binomial distribution with p=0.5 and q=0.5.

Here is the complete formula to calculate the theoretical bias limit using maxima:

n: 2**32;
originalVariance: sqrt(n*(0.5)*(0.5));
variance: originalVariance*sqrt(2/%pi);
scaledVariance: variance/n*1000,numer;
print(scaledVariance);
-> 0.006087376104758189

This changed the way the code needs to calculate the mean a bit. So we divide by 4294967296.0 instead of 2147483648.0, although this is just a cosmetic change. And we also take the absolute value of the difference instead of squaring it and calculating the square root later. You'd assume that this would have the same result since abs(a) == sqrt(a*a), but sqrt((b*b) / b) != sqrt(a*a)/b and sqrt(a1*a1 + a2*a2) != sqrt(a1*a1) + sqrt(a2*a2), so this change is vital.

hash-prospector/prospector.c

Lines 659 to 666 in ea7ee99

    
           double mean = 0.0; 
        
           for (int j = 0; j < 32; j++) { 
        
               for (int k = 0; k < 32; k++) { 
        
                   double diff = (bins[j][k] - 2147483648L) / 2147483648.0; 
        
                   mean += (diff * diff) / (32 * 32); 
        
               } 
        
           } 
        
           return sqrt(mean) * 1000.0;

double mean = 0.0;
for (int j = 0; j < 32; j++) {
    for (int k = 0; k < 32; k++) {
        double diff = (bins[j][k] - 2147483648L) / 4294967296.0;
        mean += fabs(diff);
    }
}
return mean * 1000.0 / (32 * 32);

[Q] could you add some 64 bit examples too

since the README lacks anything 64bit, i compiled a GCC with openmp support to run prospector myself. interestingly, only one thread is used. is the omp parallel pragma only used for a section of code that's rarely run ?
i'm running ./prospector -8, ftr. how long will i have to wait in average until a usable hash func is printed on a modern ryzen core ?

New best known functions

I used this project to learn combinatorial optimization.

EDIT: a better function was found in a below comment, the new best is now 0.10704308166917044

[16 21f0aaad 15 d35a2d97 15] = 0.10760229515479501

A metaheuristic was used with an estimate with len = 1 << 25 and constrained to be nearby other good functions (shifts [13, 19], muls popcnt [11, 19]). (Unfortunately I don't remember which metaheuristic I used for this). Finally an exhaustive exact search was applied nearby the best found. I verified the final bias with the hash-prospector library itself.

I also used some explicit SIMD acceleration in the bias calculation along with the current multi threading. The hash itself handles 8 "seeds" at once and the counting is faster too.
SIMD counting pseudo code:

// 256/32=8 so splat the 32-bit flips and use a mask to grab them all at once
__m256i sequence = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7);
__m256i mask = _mm256_sllv_epi32(_mm256_set1_epi8(1), sequence);

for (int i = 0; i < 8; ++i) {
    __m256i splat = _mm256_set1_epi32(_mm256_extract_epi32(flips, i));
    __m256i flipped = _mm256_cmpeq_epi8(_mm256_and_si256(splat, mask), mask);
    byte_bins = _mm256_sub_epi8(byte_bins, flipped);
}

// accumulate intermediate byte bins to larger bins every 255 / 8

Are score of 32bit and 16bit hashes with the same unit?

In prospector.c for searching 32/64 bits hashes, the scores are multiplied by 1000.0, while not in the hp16.c.

Is it on purpose?

Template parses rotate amounts as hex not decimal

The other shift type operations parse the shift amount as decimal, so it would be more logical to parse rotates as decimal too.

./prospector -8 -t 10000 -p rot:32

See these which might be better moved to the other set of switch cases.

hash-prospector/prospector.c

Line 675 in 627f16c

case HF32_ROT:

hash-prospector/prospector.c

Line 679 in 627f16c

case HF64_ROT:

About the avalanche bias

First I think this is a really awesome project! I stumbled across it some years ago when I was playing with a similar thing (automatically find good mixers) and I remember it as very inspiring!

So I found the lists of mixers (in the readme+an issue) and I though it would be a good correlation test to run through my Tests for Randomness (TFR) and PractRand, both using Pelle Evensen's harsh RRC-scheme to scramble the input bits.

I ran through +100 of the prospector mixers (love the plain text format plus 32-bit mixers are so quick to evaluate even with RRC), and I thought it might be interesting to share the results here.

What may come as a surprise is that the original 32-bit prospector (from the article) seems to performing very well. The original is the only one that goes to 2^18 in TFR the second runner up fails at 2^13 of all tested prospector mixers.

This leads me to the question if something in the search code of prospector changed since the original was found?

RRC-PractRand	TFR	Construction	Alias	Comment
14	12	16 85ebca6b 13 c2b2ae35 16 = ?	murmur3	For reference
16	18	15 2c1b3c6d 12 297a2d39 15 = 0.34968228323361017	original	From the prospector article
14	10	16 21f0aaad 15 d35a2d97 15 = 0.10760229515479501	low_bias	The one with lowest bias (by TheIronBorn)

For reference, the furthest I've ever seen a xmxmx 32-bit mixer go in RRC-PractRand is to 2^18 and in TFR to 2^19.

Now PractRand is a bit of a blackbox to me, but since I am very familiar with TFR I investigated how the low_bias one fails. If you run it through TFR it will say something like:

==========================================================================================
mix32::prospector_low_bias 2^10 with 128 samples using 32-bits (0s)
bic,coupon,divisibility,gap,mean,permutation,runs,sac,uniform
==========================================================================================
test       p-value  remark       stream/description
uniform-m  0        failure(10)  mix32::prospector_low_bias(reverse_co-31(counter-1))

***FAILED*** at 2^10 with a total of 1 failures and 0 suspicious results.

In the uniform test we expect a uniform distribution but the statistics says that this is very likely not the case. The mix32::prospector_low_bias(reverse_co-31(counter-1)) indicates the used test sample (out of 128). This is a unitary counter has reversed bits + complement and rotated 31 bits (as per RRC). If we were to recreate this in code it would look something like this in c++ (copy pasted a bit so pardon the overkill with templates etc):

#include <iostream>
#include <vector>

template <typename T>
constexpr int bit_sizeof() {
	return 8 * sizeof(T);
}

template <typename T>
constexpr T ror(T v, int r) {
	constexpr auto Bits = bit_sizeof<T>();
	return (v << r) | (v >> (Bits - r));
}

constexpr uint32_t reverse_bits(uint32_t x) {
	x = ((x >> 1) & 0x55555555) | ((x & 0x55555555) << 1);
	x = ((x >> 2) & 0x33333333) | ((x & 0x33333333) << 2);
	x = ((x >> 4) & 0x0F0F0F0F) | ((x & 0x0F0F0F0F) << 4);
	x = ((x >> 8) & 0x00FF00FF) | ((x & 0x00FF00FF) << 8);
	return (x >> 16) | (x << 16);
}

int main() {
	std::vector<uint32_t> ys;
	for (uint32_t i = 1; i <= 1024; ++i) {
		auto x = ror(~reverse_bits(i), 31);
		x ^= x >> 16;
		x *= 0x21f0aaad;
		x ^= x >> 15;
		x *= 0xd35a2d97;
		x ^= x >> 15;
		ys.push_back(x);
	}

	for (const auto y : ys) {
		std::cout << y << ",";
	}
}

This will output 2^10 values: 2731428922,3383114888,3653537803,1575712730,... now if we examine those with chi2 statistics in Mathematica we get:

PearsonChiSquareTest[low_bias_values, UniformDistribution[{0, 2^32-1}]] this outputs a p-value close to zero (2.23e-22).

Doing the same for the original mixer gives a p-value of 0.11 which means indistinguishable from uniform distribution. It's possible to see this with the naked eye, here are histograms of the output for the same rrc-stream from the lowbias, the original and trng (true random data):

Here it looks like the lowest bias version is not necessarily the best according to RRC-PractRand and TFR (see attached logs for many more examples). Maybe the bias calculated in prospector is an insufficient metric?

On a positive note, I strongly believe avalanche in some form is a sufficient test. In TFR, both sac and bic are very strong indicators, while most others (uniform, gap, coupons etc ..) statistics will fail early or not at all. So maybe only some minor changes could help.

I've attached two text files with logs for all prospector mixers I tested (tfr+practrand). The first column is how far the mixer came in the tests:

tfr_prospector.txt
pr_prospector.txt

This is more of a thought, if this seems interesting/relevant we can dig deeper, keep up the good work!

Any interest in CLMUL and XROT2?

Are there any plans or is there any interest at all in having

CLMUL, the carryless multiplication which has its own opcode on modern CPUs, and
XROT2, that is the invertible double xor rotate x ^= (x >>> a) ^ (x >>> b),

implemented?

I would expect CLMUL to have slightly less influence on bias than a regular integer multiplication due to the lack of carry's left-bound "smearing" effect.

XROT2 should show an effect on lowering bias which I would assume somewhere between one and two xor-shifts – remains to be proven. Execution on super-pipelined CPUs could speed-wise benefit from working on the very same x being rotated twice. However, little coding challenge here: two parameters a and b need to be parsed from command line's -p pattern.

[Q] 16-bit hashes?

There may be some exploratory or practical merit in evaluating 16-bit hashes as well. A lot of microcontrollers, embedded devices etc. are running lesser bitness CPUs than 32 bits, and there are very few publicly available good hashing algorithms for those devices. Searching their whole domains is also much easier than even for 32-bit hashes.

Add crc32c to the list of invertible functions.

crc32c instruction from SSE4.2 has latency of 3 cycles (similar to integer multiplication) and thoughput of 1 operation per cycle. It is invertible on uint32. If you change one bit of input, only one bit of output is changed and other are unchanged.

The operation itself is very bad as generic hash function (no avalanche), but it can be used in construction of hash functions. Or it can be used as a hash function in some applications where avalanche property is unneeded.

[Q] One round 32 bit hash?

Curious as to how well a one round 32 bit hash could perform, probably not great, but I can think of a few uses for it, like folding multiple coordinates into a single integer before using a more robust finalizer like the triple32.

Option to evaluate 64-bit functions as 32-bit hashes

This is an enhancement request, rather than a defect, so feel free to accept or close if you aren't doing further development.

In some situations a 32-bit hash is specified, but fast 64-bit arithmetic is available (e.g. Java on 64-bit machines).
Generating 64-bit hash functions of the form
mul
xorr
mul
rot:32
but calling them as 32-bit functions (zero extend input to 64 bits, take lower 32 bits of the output) can work much better than the 2-round 32-bit functions and nearly as well as the 3-round 32-bit functions. It would be nice if hash-prospector could search for these. I think this could be done quite simply with an extra flag and a different test for whether to call the 32-bit or 64-bit functions. The 64-bit generated functions appear to be callable as 32-bit functions without change.

[Q] Seeded integer hash functions

Thank you for the cool project and writeup!

I was curious — do you have recommendations for integer hash function families? That is, given some amount of seed entropy, choose an integer hash function; for example, a naive thing one could do with a 32 bit seed is return lowbias32(x ^ seed).

Consider adding another inversible function

The "xorsquare" function (which I briefly describe here) is of the form:

x=((x+a)&-2)^(x*x);

for any integer a.

Note: alternative version (x^=((x+a)*(x+a))&-2;) produces identical results for values of a that differ in the most significant bit.

It is straightforward to show (e.g. by induction on wordsize) that it is inversible.

	double mean = 0.0;
	for (int j = 0; j < 32; j++) {
	for (int k = 0; k < 32; k++) {
	double diff = (bins[j][k] - 2147483648L) / 2147483648.0;
	mean += (diff * diff) / (32 * 32);
	}
	}
	return sqrt(mean) * 1000.0;

skeeto / hash-prospector Goto Github PK

hash-prospector's People

Contributors

Stargazers

Watchers

Forkers

hash-prospector's Issues

Introduction

How I calculated the bias

Recommend Projects

Recommend Topics

Recommend Org