UMASH: a fast almost universal 64-bit string hash
STATUS: the hash and fingerprint algorithms are finalized, and so
is the mapping from umash_params_derive
inputs to UMASH parameters.
However, the ABI is not finalized; in particular, passing random bytes
to umash_params_prepare
may still result in different parameters.
UMASH is a string hash function with throughput (22 GB/s on a 2.5 GHz Xeon 8175M) and latency (9-20 ns for input sizes up to 64 bytes on the same machine) comparable to that of performance-optimised hashes like MurmurHash3, XXH3, or farmhash. Its 64-bit output is almost universal, and it, as well as both its 32-bit halves, passes both Reini Urban's fork of SMHasher and Yves Orton's extended version (after expanding each seed to a 320-byte key for the latter).
Unlike most other non-cryptographic hash functions
(CLHash is a rare exception) which
do not prevent seed-independent collisions
and thus usually suffer from such weaknesses,
UMASH provably avoids parameter-independent collisions. For any two
inputs of l
bytes or fewer, the probability that a randomly
parameterised UMASH assigns them the same 64 bit hash value is less
than ceil(l / 4096) 2**-55
. UMASH also offers a fingerprinting mode
that simply computes two nearly-independent hashes at the same time.
The resulting 128-bit fingerprint
collides pairs of l
-or-fewer-byte inputs with probability less than
ceil(l / 2**27)**2 * 2**-82
; that's less than 2**-70
(1e-21
) for
up to 7.5 GB of data.
See umash_reference.py
(pre-rendered in umash.pdf
) for details and
rationale about the design, and a proof sketch for the collision bound.
The blog post announcing UMASH
includes a higher level overview and may also provide useful context.
There has been a major change to the algorithm since that post; the new fingerprinting algorithm is described in this post but some documentation and comments still have to be overhauled. However, the reference implementation and tests do reflect this new algorithm.
If you're not into details, you can also just copy umash.c
and
umash.h
in your project: they're distributed under the MIT license.
The current implementation only build with gcc-compatible compilers
that support the integer overflow builtins
introduced by GCC 5 (April 2015) and targets x86-64 machines with the
CLMUL extension
(available since 2011 on Intel and AMD). That's simply because we
only use UMASH on such platforms at Backtrace.
There should be no reason we can't also target other compilers, or
other architectures with carry-less multiplication instructions
(e.g., VMULL
on ARMv8).
Quick start
Here's how to use UMASH for a simple batch hash or fingerprint computation.
First, we need to generate struct umash_params
that will define the
parameters ("key") of the UMASH hash or fingerprint function.
For a hashing use case, one could fill a struct umash_params params
with random bits (e.g., with
a getrandom(2)
syscall),
and call umash_params_prepare(¶ms)
to convert the random bits
into a valid key. This last call may fail by returning false
;
however, the probability of that happening are astronomically small
(less than 2**-100
) if the input data is actually uniformly random.
Fingerprinting often needs a deterministic set of parameters that will
be preserved across program invocations. For that use case, one
should either fill a struct umash_params
with hardcoded random contents
before calling umash_params_prepare
, or use umash_params_derive
to
deterministically generate an unpredictable set of parameters from
a 64-bit value and a 32-byte secret.
For a fingerprinting use case, each program should use its own 32-byte secret.
Given a fully initialised struct umash_params params
, we can now
call umash_full
or umash_fprint
to hash or fingerprint a sequence
of bytes. The seed
argument is orthogonal to the collision bounds,
but may be used to get different values, e.g., when growing a hash
table afer too many collisions. The fingerprint returned by
umash_fprint
is simply an array of two hash values. We can compute
either of these 64-bit hash values by calling umash_full
: letting
which = 0
computes the first hash value in the fingerprint, and
which = 1
computes the second. In practice, computing the second
hash value is as slow as computing a full fingerprint, so that's
rarely a good option.
See example.c
for a quick example.
$ cc -O2 -W -Wall example.c umash.c -mpclmul -o example
$ ./example "the quick brown fox"
Input: the quick brown fox
Fingerprint: 398c5bb5cc113d03, 3a52693519575aba
Hash 0: 398c5bb5cc113d03
Hash 1: 3a52693519575aba
We can confirm that the parameters are constructed deterministically,
and that calling umash_full
with which = 0
or which = 1
gets us
the two halves of the umash_fprint
fingerprint.
Please note that UMASH is still not fully finalised; while the source code should be deterministic for a given revision, different revisions may compute different values for the exact same inputs.
Hacking on UMASH
The test suite calls into a shared object with test-only external
symbols with Python 3, CFFI,
and Hypothesis. As long as Python3 and
venv are installed, you
may execute t/run-tests.sh
to download test dependencies, build the
current version of UMASH and run all the pytests in the t/
directory. t/run-tests-public.sh
only exercises the public
interface, which may be helpful to test a production build or when
making extensive internal changes.
The Python test code is automatically formatted with
black. We try to make sure the C code
sticks to something similar to the
FreeBSD KNF;
when in doubt, whatever t/format.sh
does is good enough.
See smhasher/HOWTO-SMHASHER.md
for patches to easily integrate UMASH
in the SMHasher hash performance and quality test suite.
We are also setting up Jupyter notebooks to make it easier to compare
different implementations, to visualise the results, and to
automatically run a set of statistical tests on that data. See
bench/README.md
for more details.
Help wanted
The UMASH algorithm is now frozen. However, there's plenty of work left:
- The block fingerprinting loop could be vastly improved, especially with AVX-256.
- We currently only use incremental and one-shot hashing interfaces. If someone needs parallel hashing, we can collaborate to find out what that interface could look like.
- A size-optimised implementation could be helpful.
- How fast could we go on a GPU?
And of course, portability to other C compilers or platforms is interesting.