aggregateknowledge / java-hll


Java library for the HyperLogLog algorithm

Home Page: http://research.neustar.biz/2013/12/24/open-source-release-java-hll/

License: Apache License 2.0

Language: Java (100.00%)

java-hll's People

Contributors

blinsay, yerenkow


java-hll's Issues

Large range HLL resulting in 0 cardinality

There are two unit tests, SparseHLLTest#largeRangeSmokeTest() and FullHLLTest#largeRangeSmokeTest(), that are supposed to test the cardinalities of very large HLLs. However, if you insert

System.out.println(cardinality + ", " + expected);

at the bottom of each, you will see that both the expected and actual cardinalities are 0. They should instead be very large numbers.

IntegrationTestGenerator: StackOverflowError when running it

When I run the IntegrationTestGenerator class, I get a StackOverflowError:

Exception in thread "main" java.lang.StackOverflowError
    at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
    at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
    at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
    at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
    at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
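
The repeated single frame at IntegrationTestGenerator.java:516 suggests unbounded self-recursion. As a purely hypothetical illustration (not the generator's actual code), a method that re-invokes itself with no terminating branch produces exactly this kind of trace:

// Hypothetical illustration only: with no base case, every call adds another
// stack frame until the JVM throws StackOverflowError.
static HLL newHLL(final int schemaVersion) {
    return newHLL(schemaVersion); // never reaches a constructor or returns
}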

Wrong max value for log2m?

The docs say that the log2m parameter must "be at least 4 and at most 31", but it seems that 30 is the highest possible value.

The following line raises a NegativeArraySizeException:

new HLL(31, 8, -1, true, HLLType.FULL);

Here's the full stack trace:

Exception in thread "main" java.lang.NegativeArraySizeException
    at net.agkn.hll.util.BitVector.<init>(BitVector.java:68)
    at net.agkn.hll.HLL.initializeStorage(HLL.java:434)
    at net.agkn.hll.HLL.<init>(HLL.java:202)
    at Spike.main(Spike.java:7)

If I change 31 to 30 it works fine.
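
A plausible explanation (an assumption on my part, not confirmed against the library source) is that the register count 2^log2m is computed as a signed int, and 1 << 31 is negative in Java, so the backing storage gets a negative size:

// Sketch of the suspected overflow, assuming the register count is held in an int.
int registers31 = 1 << 31; // -2147483648: 2^31 does not fit in a signed 32-bit int
int registers30 = 1 << 30; // 1073741824: the largest power of two that does fit
// new long[registers31] would throw NegativeArraySizeException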

Problem with toBytes and fromBytes (error is 76%, too big)

test1 (Scala code) gives an error of less than 2%:

import java.nio.charset.Charset
import com.google.common.hash.Hashing
import net.agkn.hll.HLL

def test1() {
  val charset = Charset.forName("UTF8")
  val m3 = Hashing.murmur3_128
  val hll = new HLL(13, 5)

  for (i <- 1 to 74570) {
    val dau1 = hll.cardinality.toInt

    val hashCode = m3.hashString("" + i, charset)
    hll.addRaw(hashCode.asLong)

    val dau2 = hll.cardinality.toInt
    if (dau1 == dau2) println(s"dau1 = dau2 = $dau1, i = $i")
  }
  println(hll.cardinality.toInt)
}

test2 gives an error of 76%!!!

def test2() {
  val charset = Charset.forName("UTF8")
  val m3 = Hashing.murmur3_128  

  var b: Array[Byte] = null
  for (i <- 1 to 74570) {
    val hll = if (b == null) new HLL(13, 5) else HLL.fromBytes(b)
    val dau1 = hll.cardinality.toInt

    val hashCode = m3.hashString("" + i, charset)
    hll.addRaw(hashCode.asLong)

    val dau2 = hll.cardinality.toInt
    if (dau1 == dau2) println(s"dau1 = dau2 = $dau1, i = $i")

    if (dau1 != dau2) b = hll.toBytes
  }

  val hll = if (b == null) new HLL(13, 5) else HLL.fromBytes(b)
  println(hll.cardinality.toInt)
}

The output looks like this (note that the value drops dramatically from 73818 to 17757):

...
dau1 = dau2 = 73818, i = 74560
dau1 = dau2 = 73818, i = 74561
dau1 = dau2 = 73818, i = 74562
dau1 = dau2 = 17757, i = 74565
dau1 = dau2 = 17760, i = 74567
...
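
For reference, the property test2 depends on, written as a minimal Java sketch (it uses the toBytes/fromBytes API shown above; the loop count is illustrative): serializing and immediately deserializing should leave the estimate unchanged.

// Minimal round-trip sketch: the estimate should survive toBytes/fromBytes.
HLL original = new HLL(13, 5);
for (long i = 1; i <= 74570; i++) {
    original.addRaw(i); // in real use these would be 64-bit hashes, as in the Scala code
}
HLL restored = HLL.fromBytes(original.toBytes());
// Expected: restored.cardinality() == original.cardinality()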

Higher log2m value not always producing more accurate estimates

My understanding of the HLL algorithm (which may be flawed, in which case please correct me and close this issue) is that for any fixed set of input values, the accuracy of the estimate from an HLL built from those values should increase as the "m" value used in the HLL increases.

I.e.: if you build two HLL instances with different log2m settings and add the exact same set of (raw) values to both, then the HLL with the larger log2m should give more accurate results than the HLL with the smaller log2m setting.

In my testing, however, I'm frequently encountering situations where "smaller" HLL instances produce more accurate cardinality estimates, which I can't explain.

I've created a reproducible test case that demonstrates the problem, which I will post as a separate comment.
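
To make the comparison concrete, here is a minimal Java sketch of the experiment described above (the parameters and input count are illustrative, not the reporter's actual test case):

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import net.agkn.hll.HLL;

// The same inputs are added to two HLLs that differ only in log2m; the larger
// one is expected to produce the more accurate estimate.
HashFunction m3 = Hashing.murmur3_128();
HLL small = new HLL(11, 5); // 2^11 registers
HLL large = new HLL(14, 5); // 2^14 registers
long n = 1000000;
for (long i = 0; i < n; i++) {
    long hash = m3.hashLong(i).asLong();
    small.addRaw(hash);
    large.addRaw(hash);
}
// Expectation under test: |large.cardinality() - n| <= |small.cardinality() - n|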

Unable to Serialize HLL ?

I am playing around with your cardinality libraries, which I'd like to wrap in a map/reduce paradigm and plug into various compute platforms; hence I chose the Adaptive HLL, which has a merge method on it.

I'd also like to save the object with its state on disk so that the compute/query platforms can interpret these computed objects and answer questions (instead of flowing raw event data again). But I am unable to serialize the Adaptive HyperLogLog object. Was this intentionally left out?

There is not much documentation, but the library seems quite clean and looks like a potential fit for what I have in mind. Can you please help?

Thanks,
Rohan
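
One workaround worth noting (assuming it applies here; the Adaptive HyperLogLog class mentioned above may predate it): the HLL class used in the other issues exposes toBytes()/fromBytes() for exactly this persist-then-merge pattern, rather than implementing java.io.Serializable.

// Sketch of persisting HLL state with the byte-array API; hll and otherHll are
// hypothetical, already-populated instances.
byte[] state = hll.toBytes();        // write these bytes to disk or a key-value store
HLL restored = HLL.fromBytes(state); // later, in the compute/query layer
restored.union(otherHll);            // merge partial results map/reduce style
long combinedEstimate = restored.cardinality();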

Fix documentation

The project introduction uses the following constructor, but no such constructor exists:

public HLL(final int log2m, final int regwidth)

Error rate is more than theoretical error rate

The research paper claims the error rate to be 1.04/sqrt(m).

For log2m = 11 and regwidth = 5 the theoretical error rate should be ~2.3%, but in our tests we have seen errors of up to 3.5% when inserting 100K unique alphanumeric UUIDs. Below is our pseudo code:

import java.nio.charset.StandardCharsets;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hasher;
import com.google.common.hash.Hashing;
import net.agkn.hll.HLL;

HLL hll = new HLL(11, 5);                          // log2m = 11, regwidth = 5
HashFunction hashFunction = Hashing.murmur3_128();

for (String str : UUIDs) {                         // UUIDs: the 100K unique strings
  Hasher hasher = hashFunction.newHasher();
  hasher.putString(str, StandardCharsets.UTF_8);
  hll.addRaw(hasher.hash().asLong());
}

How can we reduce this additional 1.2% error and fall within the theoretically claimed error rate?
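
For reference, the ~2.3% figure quoted above follows directly from the formula:

// m = 2^log2m registers; standard error is roughly 1.04 / sqrt(m).
double m = Math.pow(2, 11);          // log2m = 11 -> 2048 registers
double stdErr = 1.04 / Math.sqrt(m); // ~0.023, i.e. about 2.3%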

java.lang.NoSuchMethodError: it.unimi.dsi.fastutil.longs.LongOpenHashSet.clone()

I've found a condition where a NoSuchMethodError can be thrown. Below is what I do to trigger it (I hope you don't mind that it's in Scala). I am using hll version 1.6.0 from Maven.

val h1 = new HLL(13, 5)
val h2 = new HLL(13, 5)
h2.addRaw(2)
h1.union(h2) // <-- exception here

results in:

Exception in thread "main" java.lang.NoSuchMethodError: it.unimi.dsi.fastutil.longs.LongOpenHashSet.clone()Lit/unimi/dsi/fastutil/longs/LongOpenHashSet;
    at net.agkn.hll.HLL.heterogenousUnion(HLL.java:679)
    at net.agkn.hll.HLL.union(HLL.java:639)

Is anyone else running into this issue?
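
A NoSuchMethodError like this usually means an older fastutil jar on the classpath is shadowing the version java-hll was compiled against; that is my assumption about the cause, not a confirmed diagnosis. A quick way to see which jar the class is actually loaded from:

// Diagnostic sketch: print the location of the fastutil class that is actually
// being loaded, to spot an older version shadowing the expected one.
Class<?> c = it.unimi.dsi.fastutil.longs.LongOpenHashSet.class;
System.out.println(c.getProtectionDomain().getCodeSource().getLocation());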

Maximum or Average Error?

In the docs, the relative error is given by the expression ±1.04/√(2^log2m).

Is this an average error or a maximum error? If average, is it even possible to determine a maximum error?

MAXIMUM_LOG2M_PARAM inconsistent with storage specification and postgresql-hll

The storage specification and the postgresql-hll docs each say...

https://github.com/aggregateknowledge/hll-storage-spec/blob/v1.0.0/STORAGE.md

registerWidth may take values from 1 to 8, inclusive, and log2(numberOfRegisters) may take on 4 to 31, inclusive.

https://github.com/aggregateknowledge/postgresql-hll/blob/master/README.markdown#explanation-of-parameters-and-tuning

The log-base-2 of the number of registers used in the HyperLogLog algorithm. Must be at least 4 and at most 31.

However, in the Java code itself...

public static final int MAXIMUM_LOG2M_PARAM = 30;

From what I can see, this means postgresql-hll can produce a serialized HLL which cannot be parsed by HLL.fromBytes.

Unusual behavior for HLL with register width 6

There seems to be some kind of bug in HLL execution: when one instantiates an HLL with a regwidth of 6, the cardinality returned is consistently 0 at large sample sizes. The following test was constructed and run in FullHLLTest. Could this have something to do with cutoff measurements? Thanks!

/**
 * Test case for 6-bit cutoff problems
 */
@Test
public void sixBitSixteenKRegisterCutoffBugTest() {
    Long temp = System.currentTimeMillis();

    // this does NOT work
    final HLL brokenHll4 = new HLL(13, 6);
    {
        Random rand = new Random(temp);
        for(int i=0; i<16384*16384; i++) {
            long random = rand.nextLong();
            brokenHll4.addRaw(random);
        }

        final long cardinality = brokenHll4.cardinality();

        assertTrue(cardinality > 0); // this does NOT work
    }

    // this DOES work
    final HLL brokenHll5 = new HLL(13, 5);
    {
        Random rand = new Random(temp);
        for(int i=0; i<16384*16384; i++) {
            long random = rand.nextLong();
            brokenHll5.addRaw(random);
        }

        final long cardinality = brokenHll5.cardinality();

        assertTrue(cardinality > 0); // this DOES work
    }
}

Out-of-bounds when a blank HLL is unioned with a smaller blank HLL

A union operation done on two HLL objects immediately after they are initialized, with a smaller log2m in the parameter HLL than in the one performing the union, throws an ArrayIndexOutOfBoundsException. The code below should reproduce the error.

HLL shortHLL = new HLL(14, 6, -1, false, HLLType.FULL);
HLL longHLL = new HLL(18, 6, -1, false, HLLType.FULL);
longHLL.union(shortHLL);

int overflow during serialization of a FULL type HLL when log2m and regwidth are both large

BigEndianAscendingWordSerializer and BigEndianAscendingWordDeserializer both suffer from int overflow bugs: they multiply large ints and then assign the result to a long, instead of casting those ints to longs before multiplying them.
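
A hypothetical illustration of the pattern (not the serializer's actual code): the product of two large ints wraps around before the widening assignment to long.

int numRegisters = 1 << 30;                      // a large register count
int regwidthBits = 8;
long wrong = numRegisters * regwidthBits;        // int * int overflows first, then widens: 0
long right = (long) numRegisters * regwidthBits; // cast before multiplying: 8589934592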

I don't have a pull request handy (since this project no longer seems actively maintained, we've forked & imported directly into the Apache Solr code base) but I wanted to file this issue to make existing users aware of the bug -- you can see the details involved in fixing/testing in this issue/commit in the Lucene/Solr code base...

https://svn.apache.org/viewvc?view=revision&revision=1697969
https://issues.apache.org/jira/browse/SOLR-7954
