dynatrace-oss / hash4j
Dynatrace hash library for Java
License: Apache License 2.0
Is your feature request related to a problem? Please describe.
When comparing large files for duplicates, the usual hashing algorithms are limited by the speed at which the full file content can be read. This becomes a problem when hashing large files. By hashing only parts of the data plus its size, you don't have to read all the data, yet you get a hash that can indicate duplicates with sufficient confidence.
Describe the solution you'd like
Implement imohash or a similar sampling hash algorithm.
Describe alternatives you've considered
Implementing it on top of hash4j.
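For illustration, an imohash-style sampling hash can be sketched in a few lines. The class name, sample size, and the use of CRC32 here are arbitrary illustrative choices, not imohash's actual parameters or algorithm:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class SamplingHashSketch {
  // Sample size is an arbitrary illustrative choice.
  static final int SAMPLE = 4096;

  // Hashes the input length plus three fixed-size samples (head, middle, tail)
  // instead of the whole input; small inputs are hashed in full.
  public static long sampleHash(byte[] data) {
    CRC32 crc = new CRC32();
    // Mix in the length so inputs of different sizes rarely collide.
    crc.update(ByteBuffer.allocate(8).putLong(data.length).array());
    if (data.length <= 3 * SAMPLE) {
      crc.update(data, 0, data.length);
    } else {
      crc.update(data, 0, SAMPLE);                    // head
      crc.update(data, data.length / 2, SAMPLE);      // middle
      crc.update(data, data.length - SAMPLE, SAMPLE); // tail
    }
    return crc.getValue();
  }
}
```

For file-sized inputs the same idea applies with seeks instead of array offsets, so only a few kilobytes are read regardless of file size.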
I've noticed that your Java implementation of komihash uses 4-byte values when hashing a string (hashCharsToLong). This looks like a slow approach. Why is that? You can hash using 8-byte values.
But anyway, please note that I've found and tested an "incremental" hashing approach with komihash; it's described on the project page. You may consider switching your hash streamer to this incremental approach. It does not require pre-buffering and is thus quite a lot more efficient for strings, and even for small values like char, int, and long. Also note that for register usage efficiency it's preferable to save the r1l, r2l, r3l, r4l multiplication results into the corresponding Seed1..4 values, and then XOR them with the Seed5..8 values.
To adopt the "incremental" hashing you'll need to implement a single "static" hashing function, without loading and storing the internal state. You'll just need to store a single "lastHash" value. It will likely be much more efficient than your current implementation.
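The single-"lastHash" pattern described above can be sketched generically. The mixer below is a murmur3-style finalizer used purely as a stand-in; it is not komihash's actual round function:

```java
public class ChainedHashSketch {
  // Murmur3-style 64-bit finalizer, a stand-in for a real single-shot seeded hash.
  static long mix(long z) {
    z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
    z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
    return z ^ (z >>> 33);
  }

  // One "static" hashing step: the previous result seeds the next chunk,
  // so the whole streaming state is a single long.
  static long hashChunk(long lastHash, long chunk) {
    return mix(lastHash ^ chunk);
  }

  public static void main(String[] args) {
    long lastHash = 0; // initial seed
    for (long chunk : new long[] {1L, 2L, 3L}) {
      lastHash = hashChunk(lastHash, chunk);
    }
    System.out.println(Long.toHexString(lastHash));
  }
}
```

Because each step consumes the previous result directly, no pre-buffering of input is needed and the streaming state fits in one register.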
I'm interested in seeing (roughly) how fast Hash4J's hash functions are, and possibly how fast the comparable Guava/OpenHFT functions are too. I could run these on my own machine, but published numbers would be a nice quick approximation of what kind of performance is reasonable.
In particular, the blog post describing the need for Hash4J mentions exact compatibility with the reference C++ implementation. I'm curious whether that compatibility comes at a noticeable cost compared to the alternative implementations?
May I suggest removing the compatibility references to all versions besides 5.0, and adding a reference to the recently released 5.10? I'd like to remove intermediate tags so as not to propagate sub-optimal releases.
Describe the bug
The output of MinHash.compute cannot be merged with another output for hash sizes under 64 bits.
To Reproduce
import com.dynatrace.hash4j.hashing.Hasher64;
import com.dynatrace.hash4j.hashing.Hashing;
import com.dynatrace.hash4j.similarity.ElementHashProvider;
import com.dynatrace.hash4j.similarity.SimilarityHasher;
import com.dynatrace.hash4j.similarity.SimilarityHashing;
import com.dynatrace.hash4j.util.PackedArray;
import java.util.Arrays;
import java.util.stream.IntStream;
public class MinHashBugRepro {

  public static byte[] merge(int components, int bits, byte[] left, byte[] right) {
    PackedArray.PackedArrayHandler pah = PackedArray.getHandler(bits);
    byte[] output = pah.create(components);
    IntStream.range(0, components)
        .forEach(idx -> pah.set(output, idx, Math.min(pah.get(left, idx), pah.get(right, idx))));
    return output;
  }

  public static void main(String[] args) {
    SimilarityHasher simHasher32 = SimilarityHashing.minHash(10, 32).createHasher();
    SimilarityHasher simHasher64 = SimilarityHashing.minHash(10, 64).createHasher();
    Hasher64 hasher = Hashing.wyhashFinal4();
    long hello = hasher.hashCharsToLong("hello");
    long world = hasher.hashCharsToLong("world");

    // when using a 32-bit hash, this merge function doesn't work
    byte[] one32 = simHasher32.compute(ElementHashProvider.ofValues(hello));
    byte[] two32 = simHasher32.compute(ElementHashProvider.ofValues(world));
    byte[] three32 = simHasher32.compute(ElementHashProvider.ofValues(hello, world));
    byte[] merged32 = merge(10, 32, one32, two32);
    System.out.println(Arrays.equals(merged32, three32)); // prints false

    // when using a 64-bit hash, this merge function does work
    byte[] one64 = simHasher64.compute(ElementHashProvider.ofValues(hello));
    byte[] two64 = simHasher64.compute(ElementHashProvider.ofValues(world));
    byte[] three64 = simHasher64.compute(ElementHashProvider.ofValues(hello, world));
    byte[] merged64 = merge(10, 64, one64, two64);
    System.out.println(Arrays.equals(merged64, three64)); // prints true
  }
}
Expected behavior
I expect the data stored in the byte[]s to be mergeable, even if the library does not include a merge function.
Additional context
I've tried writing different merge functions that decode the values as signed or unsigned and swap the byte order, but have been unable to find a version that works with the 32-bit output.
Possible fix
I am reasonably sure that this is because of this section:
if (hash < work[i]) {
work[i] = hash;
}
and that it would work if it used an unsigned less-than rather than the signed one, or if it masked the values before comparing.
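A minimal demonstration of the signed-vs-unsigned issue, assuming the fix would use Java's built-in Long.compareUnsigned; the variable names here are illustrative, not hash4j internals:

```java
public class UnsignedMinDemo {
  public static void main(String[] args) {
    long hash = 0x8000000000000000L; // top bit set: negative as signed, huge as unsigned
    long current = 1L;

    // A signed comparison wrongly treats the large unsigned value as the minimum...
    long signedMin = hash < current ? hash : current;

    // ...while an unsigned comparison keeps the true minimum hash value.
    long unsignedMin = Long.compareUnsigned(hash, current) < 0 ? hash : current;

    System.out.println(signedMin == hash);      // signed picks the wrong element
    System.out.println(unsignedMin == current); // unsigned picks the right one
  }
}
```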
Even more context
I ran across this when testing semilattice + distributive laws in https://github.com/andimiller/schrodinger
SuperMinHash merges fine with this merge method on all bit sizes; it does not work on SimHash or FastSimHash.
Could you please add a topic on the main page that lists benchmark results for all zero-allocation hash functions available? I may be asking for trouble for komihash, but it would be interesting to know anyway.
Hello Ertl,
UltraLogLog uses more bits per register than HyperLogLog for fast access and update. Can the idea of locality-sensitive hashing (e.g., approximating MinHash to estimate the Jaccard index), as in SetSketch, also be applied in the UltraLogLog case? Or is that not possible, so that we have to rely on the inclusion-exclusion rule after obtaining the cardinalities to approximate the Jaccard index, at the risk of losing the locality-sensitive hashing property?
Thanks,
Jianshu
After adding UltraLogLog support to Apache Pinot I've been looking at adding some of the MinHash variants, but to do this I need a reliable way to merge them together when running SQL queries or merging rows.
I'd like the SimilarityHasher interface to also have a merge method that takes two byte[] and returns a byte[] that represents the merged state.
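For MinHash specifically, such a merge is an elementwise unsigned minimum of the two signatures. The sketch below uses a plain int[] signature rather than hash4j's packed byte[] layout, so the class and method names are illustrative only:

```java
public class MinHashMergeSketch {
  // Elementwise unsigned minimum of two equal-length MinHash signatures.
  // Unsigned comparison matters: a hash value with the top bit set is a
  // large value, not a small (negative) one.
  public static int[] merge(int[] left, int[] right) {
    int[] out = new int[left.length];
    for (int i = 0; i < left.length; i++) {
      out[i] = Integer.compareUnsigned(left[i], right[i]) <= 0 ? left[i] : right[i];
    }
    return out;
  }
}
```

Merging this way is associative and commutative, which is what makes it usable when combining partial results across SQL query fragments or rows.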
There may be two bugs in the normal- and large-range computations (respectively) of UltraLogLog::estimate.
When evaluating the contribution of a register not in the small or large range, UltraLogLog::estimate makes use of the precomputed register contribution table [1] (due to equation 14). There are […] range of register values encountered in practice […]. Importantly, the table does not vary with respect to p. However, when indexing into UltraLogLog::REGISTER_CONTRIBUTIONS, the index is taken to be […], whereas I believe the correct index would be […].
Meanwhile, in the large-range calculation, it appears that register values can only contribute to […]. The range of (valid) register values for the large-range contribution does vary with p, being r == 4 * (65 - p) + {0,1,2,3}, instead of fixed values (seemingly corresponding to […]).
Please let me know if I have misunderstood anything in your paper [4] or UltraLogLog::estimate. Thank you for your time.
1. https://github.com/dynatrace-oss/hash4j/blob/fe692fa466b27e0fb44a4fcbe679c4a518892a71/src/main/java/com/dynatrace/hash4j/distinctcount/UltraLogLog.java#L557
2. https://github.com/dynatrace-oss/hash4j/blob/fe692fa466b27e0fb44a4fcbe679c4a518892a71/src/main/java/com/dynatrace/hash4j/distinctcount/UltraLogLog.java#L908
3. https://github.com/dynatrace-oss/hash4j/blob/fe692fa466b27e0fb44a4fcbe679c4a518892a71/src/main/java/com/dynatrace/hash4j/distinctcount/UltraLogLog.java#L917-L920
Hey folks,
I need a JVM version of WyHash that behaves compatibly with a particular version of upstream. I don't know offhand which upstream version I need to match; I can get that with a bit of effort.
It turns out hash4j matches that specific implementation already, unlike any other Java implementation I can find, so it'd be nice to use it, but I need to target JDK8 at the moment.
Would there be interest in adding a JDK8-compatible operating mode to hash4j? I've spent a bit of time on this unfinished experimental approach to the problem, I could continue with this if there's interest.
Thanks!
Note: there may be a license concern arising from my WIP; don't merge it until that's dealt with.
Is your feature request related to a problem? Please describe.
There is currently no easy way to get the HashValue result as a byte array. Also, the toString function doesn't return a human-readable version of the hash (like it does for, e.g., UUID). This is especially a problem for hashes with a size of 128 bits (or larger).
Describe the solution you'd like
A simple implementation of HashValue::toCharArray() and HashValue::toString() for all variants of HashValueXXX.
Describe alternatives you've considered
HashValue128::toCharArray():
ByteBuffer bb = ByteBuffer.wrap(new byte[16]);
bb.putLong(hash.getMostSignificantBits());
bb.putLong(hash.getLeastSignificantBits());
return bb.array();
HashValue128::toString():
new UUID(hash.getMostSignificantBits(), hash.getLeastSignificantBits()).toString()
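A self-contained version of these workarounds, with two longs standing in for hash.getMostSignificantBits()/hash.getLeastSignificantBits(); the class and method names here are hypothetical, not hash4j API:

```java
import java.nio.ByteBuffer;

public class HashValueFormatSketch {
  // Big-endian 16-byte encoding of a 128-bit hash given as two longs.
  public static byte[] toByteArray(long msb, long lsb) {
    return ByteBuffer.allocate(16).putLong(msb).putLong(lsb).array();
  }

  // Zero-padded, 32-character hex string, analogous to a human-readable toString().
  public static String toHexString(long msb, long lsb) {
    return String.format("%016x%016x", msb, lsb);
  }

  public static void main(String[] args) {
    System.out.println(toHexString(0x0123456789abcdefL, 0xfedcba9876543210L));
    // prints 0123456789abcdeffedcba9876543210
  }
}
```

Unlike the UUID-based workaround, the hex formatting contains no dashes, which keeps the output a plain 32-character representation of the 128 bits.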