dynatrace-oss / hash4j
Dynatrace hash library for Java
License: Apache License 2.0
Is your feature request related to a problem? Please describe.
When comparing large files for duplicates, the usual hashing algorithms are limited by the speed at which the full file content can be read. This becomes a problem when hashing large files. By hashing only parts of the data plus its size, you don't have to read all the data, yet you get a hash that can indicate duplicates with sufficient confidence.
Describe the solution you'd like
Implement imohash or a similar sampling hash algorithm.
Describe alternatives you've considered
Implementing it on top of hash4j.
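For illustration, an imohash-style sampling hash can be sketched in a few lines. The class name, sample size, and the use of CRC32 here are arbitrary illustrative choices, not imohash's actual parameters or algorithm:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class SamplingHashSketch {
  // Sample size is an arbitrary illustrative choice.
  static final int SAMPLE = 4096;

  // Hashes the input length plus three fixed-size samples (head, middle, tail)
  // instead of the whole input; small inputs are hashed in full.
  public static long sampleHash(byte[] data) {
    CRC32 crc = new CRC32();
    // Mix in the length so inputs of different sizes rarely collide.
    crc.update(ByteBuffer.allocate(8).putLong(data.length).array());
    if (data.length <= 3 * SAMPLE) {
      crc.update(data, 0, data.length);
    } else {
      crc.update(data, 0, SAMPLE);                    // head
      crc.update(data, data.length / 2, SAMPLE);      // middle
      crc.update(data, data.length - SAMPLE, SAMPLE); // tail
    }
    return crc.getValue();
  }
}
```

For file-sized inputs the same idea applies with seeks instead of array offsets, so only a few kilobytes are read regardless of file size.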
I've noticed that your Java implementation of komihash uses 4-byte values when hashing a string (hashCharsToLong). This looks like a slow approach. Why is that? You can hash using 8-byte values.
But anyway, please note that I've found and tested an "incremental" hashing approach with komihash; it's described on the project page. You may consider switching your hash streamer to this incremental approach. It does not require pre-buffering and is thus quite a lot more efficient for strings, and even for small values like char, int, and long. Also note that for register usage efficiency it's preferable to save the r1l, r2l, r3l, r4l multiplication results into the corresponding Seed1..4 values, and then XOR them with the Seed5..8 values.
To adopt the "incremental" hashing you'll need to implement a single "static" hashing function, without loading and storing the internal state. You'll just need to store a single "lastHash" value. It will likely be much more efficient than your current implementation.
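The single-"lastHash" pattern described above can be sketched generically. The mixer below is a murmur3-style finalizer used purely as a stand-in; it is not komihash's actual round function:

```java
public class ChainedHashSketch {
  // Murmur3-style 64-bit finalizer, a stand-in for a real single-shot seeded hash.
  static long mix(long z) {
    z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
    z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
    return z ^ (z >>> 33);
  }

  // One "static" hashing step: the previous result seeds the next chunk,
  // so the whole streaming state is a single long.
  static long hashChunk(long lastHash, long chunk) {
    return mix(lastHash ^ chunk);
  }

  public static void main(String[] args) {
    long lastHash = 0; // initial seed
    for (long chunk : new long[] {1L, 2L, 3L}) {
      lastHash = hashChunk(lastHash, chunk);
    }
    System.out.println(Long.toHexString(lastHash));
  }
}
```

Because each step consumes the previous result directly, no pre-buffering of input is needed and the streaming state fits in one register.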
I'm interested in seeing (roughly) how fast Hash4J's hash functions are, and possibly how fast the comparable Guava/OpenHFT functions are too. I could run these on my own machine, but published numbers would be a nice quick approximation of what kind of performance is reasonable.
In particular, the blog post describing the need for Hash4J mentions exact compatibility with the reference C++ implementation. I'm curious whether that compatibility comes at a noticeable cost compared to the alternative implementations?
May I suggest removing the compatibility references to all versions besides 5.0, and adding a reference to the recently released 5.10? I'd like to remove intermediate tags so as not to propagate sub-optimal releases.
Describe the bug
The output of MinHash.compute cannot be merged with another output for hash sizes under 64 bits.
To Reproduce
import com.dynatrace.hash4j.hashing.Hasher64;
import com.dynatrace.hash4j.hashing.Hashing;
import com.dynatrace.hash4j.similarity.ElementHashProvider;
import com.dynatrace.hash4j.similarity.SimilarityHasher;
import com.dynatrace.hash4j.similarity.SimilarityHashing;
import com.dynatrace.hash4j.util.PackedArray;
import java.util.Arrays;
import java.util.stream.IntStream;
public class MinHashBugRepro {

  public static byte[] merge(int components, int bits, byte[] left, byte[] right) {
    PackedArray.PackedArrayHandler pah = PackedArray.getHandler(bits);
    byte[] output = pah.create(components);
    IntStream.range(0, components)
        .forEach(idx -> pah.set(output, idx, Math.min(pah.get(left, idx), pah.get(right, idx))));
    return output;
  }

  public static void main(String[] args) {
    SimilarityHasher simHasher32 = SimilarityHashing.minHash(10, 32).createHasher();
    SimilarityHasher simHasher64 = SimilarityHashing.minHash(10, 64).createHasher();
    Hasher64 hasher = Hashing.wyhashFinal4();
    long hello = hasher.hashCharsToLong("hello");
    long world = hasher.hashCharsToLong("world");

    // when using a 32-bit hash, this merge function doesn't work
    byte[] one32 = simHasher32.compute(ElementHashProvider.ofValues(hello));
    byte[] two32 = simHasher32.compute(ElementHashProvider.ofValues(world));
    byte[] three32 = simHasher32.compute(ElementHashProvider.ofValues(hello, world));
    byte[] merged32 = merge(10, 32, one32, two32);
    System.out.println(Arrays.equals(merged32, three32)); // prints false

    // when using a 64-bit hash, this merge function does work
    byte[] one64 = simHasher64.compute(ElementHashProvider.ofValues(hello));
    byte[] two64 = simHasher64.compute(ElementHashProvider.ofValues(world));
    byte[] three64 = simHasher64.compute(ElementHashProvider.ofValues(hello, world));
    byte[] merged64 = merge(10, 64, one64, two64);
    System.out.println(Arrays.equals(merged64, three64)); // prints true
  }
}
Expected behavior
I expect the data stored in the byte[]s to be mergeable, even if the library does not include a merge function.
Additional context
I've tried writing different merge functions that decode the values as signed or unsigned and swap the byte order, but have been unable to find a version that works with the 32-bit output.
Possible fix
I am reasonably sure that this is because of this section:
if (hash < work[i]) {
work[i] = hash;
}
and that it would work if it used an unsigned less-than rather than the signed one, or if it masked the values before comparing.
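A minimal demonstration of the signed-vs-unsigned issue, assuming the fix would use Java's built-in Long.compareUnsigned; the variable names here are illustrative, not hash4j internals:

```java
public class UnsignedMinDemo {
  public static void main(String[] args) {
    long hash = 0x8000000000000000L; // top bit set: negative as signed, huge as unsigned
    long current = 1L;

    // A signed comparison wrongly treats the large unsigned value as the minimum...
    long signedMin = hash < current ? hash : current;

    // ...while an unsigned comparison keeps the true minimum hash value.
    long unsignedMin = Long.compareUnsigned(hash, current) < 0 ? hash : current;

    System.out.println(signedMin == hash);      // signed picks the wrong element
    System.out.println(unsignedMin == current); // unsigned picks the right one
  }
}
```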
Even more context
I ran across this when testing semilattice + distributive laws in https://github.com/andimiller/schrodinger
SuperMinHash merges fine with this merge method on all bit sizes; it does not work on SimHash or FastSimHash.
Could you please add a topic on the main page that lists benchmark results for all zero-allocation hash functions available? I may be asking for trouble for komihash, but it would be interesting to know anyway.
Hello Ertl,
UltraLogLog uses more bits per register than HyperLogLog for fast access and update. Can the idea of locality-sensitive hashing (e.g., approximating MinHash to estimate the Jaccard index), as in SetSketch, also be applied in the UltraLogLog case? Or is that not possible, so that we have to rely on the inclusion-exclusion rule after obtaining the cardinalities to approximate the Jaccard index, at the risk of losing the locality-sensitive hashing property?
Thanks,
Jianshu
After adding UltraLogLog support to Apache Pinot I've been looking at adding some of the MinHash variants, but to do this I need a reliable way to merge them together when running SQL queries or merging rows.
I'd like the SimilarityHasher interface to also have a merge method that takes two byte[] and returns a byte[] that represents the merged state.
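For MinHash specifically, such a merge is an elementwise unsigned minimum of the two signatures. The sketch below uses a plain int[] signature rather than hash4j's packed byte[] layout, so the class and method names are illustrative only:

```java
public class MinHashMergeSketch {
  // Elementwise unsigned minimum of two equal-length MinHash signatures.
  // Unsigned comparison matters: a hash value with the top bit set is a
  // large value, not a small (negative) one.
  public static int[] merge(int[] left, int[] right) {
    int[] out = new int[left.length];
    for (int i = 0; i < left.length; i++) {
      out[i] = Integer.compareUnsigned(left[i], right[i]) <= 0 ? left[i] : right[i];
    }
    return out;
  }
}
```

Merging this way is associative and commutative, which is what makes it usable when combining partial results across SQL query fragments or rows.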
There may be two bugs in the normal- and large-range computations (respectively) of UltraLogLog::estimate.
When evaluating the contribution of a register not in the small or large range, UltraLogLog::estimate makes use of the precomputed register contribution table [1] (due to equation 14). There are […] range of register values encountered in practice […]. Importantly, the table does not vary with respect to p. However, when indexing into UltraLogLog::REGISTER_CONTRIBUTIONS, the index is taken to be […], whereas I believe the correct index would be […].
Meanwhile, in the large-range calculation, it appears that register values can only contribute to […]. The range of (valid) register values for the large-range contribution does vary with p, being r == 4 * (65 - p) + {0,1,2,3}, instead of fixed values (seemingly corresponding to […]).
Please let me know if I have misunderstood anything in your paper [4] or UltraLogLog::estimate. Thank you for your time.
1. https://github.com/dynatrace-oss/hash4j/blob/fe692fa466b27e0fb44a4fcbe679c4a518892a71/src/main/java/com/dynatrace/hash4j/distinctcount/UltraLogLog.java#L557
2. https://github.com/dynatrace-oss/hash4j/blob/fe692fa466b27e0fb44a4fcbe679c4a518892a71/src/main/java/com/dynatrace/hash4j/distinctcount/UltraLogLog.java#L908
3. https://github.com/dynatrace-oss/hash4j/blob/fe692fa466b27e0fb44a4fcbe679c4a518892a71/src/main/java/com/dynatrace/hash4j/distinctcount/UltraLogLog.java#L917-L920
Hey folks,
I need a JVM version of WyHash that behaves compatibly with a particular version of upstream. I don't know offhand which upstream version I need to match; I can get that with a bit of effort.
It turns out hash4j matches that specific implementation already, unlike any other Java implementation I can find, so it'd be nice to use it, but I need to target JDK8 at the moment.
Would there be interest in adding a JDK8-compatible operating mode to hash4j? I've spent a bit of time on this unfinished experimental approach to the problem, I could continue with this if there's interest.
Thanks!
Note: there may be a license concern arising from my WIP; don't merge it until that's dealt with.
Is your feature request related to a problem? Please describe.
There is currently no easy way to get the HashValue result as a byte array. Also, the toString function doesn't return a human-readable version of the hash (like it does for, e.g., UUID). This is especially a problem for hashes with a size of 128 bits (or larger).
Describe the solution you'd like
A simple implementation of HashValue::toCharArray() and HashValue::toString() for all variants of HashValueXXX.
Describe alternatives you've considered
HashValue128::toCharArray():
ByteBuffer bb = ByteBuffer.wrap(new byte[16]);
bb.putLong(hash.getMostSignificantBits());
bb.putLong(hash.getLeastSignificantBits());
return bb.array();
HashValue128::toString():
new UUID(hash.getMostSignificantBits(), hash.getLeastSignificantBits()).toString()
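A self-contained version of these workarounds, with two longs standing in for hash.getMostSignificantBits()/hash.getLeastSignificantBits(); the class and method names here are hypothetical, not hash4j API:

```java
import java.nio.ByteBuffer;

public class HashValueFormatSketch {
  // Big-endian 16-byte encoding of a 128-bit hash given as two longs.
  public static byte[] toByteArray(long msb, long lsb) {
    return ByteBuffer.allocate(16).putLong(msb).putLong(lsb).array();
  }

  // Zero-padded, 32-character hex string, analogous to a human-readable toString().
  public static String toHexString(long msb, long lsb) {
    return String.format("%016x%016x", msb, lsb);
  }

  public static void main(String[] args) {
    System.out.println(toHexString(0x0123456789abcdefL, 0xfedcba9876543210L));
    // prints 0123456789abcdeffedcba9876543210
  }
}
```

Unlike the UUID-based workaround, the hex formatting contains no dashes, which keeps the output a plain 32-character representation of the 128 bits.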