tdunning / t-digest Goto Github PK

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means

License: Apache License 2.0

Java 92.25% R 7.75%

quantile accuracy online-algorithms t-digest

t-digest's Introduction

t-digest ·

A new data structure for accurate online accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very friendly to parallel programs making it useful in map-reduce and parallel streaming applications implemented using, say, Apache Spark.

The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. This t-digest data structure can be used to estimate quantiles, compute other rank statistics or even to estimate related measures like trimmed means. The advantage of the t-digest over previous digests for this purpose is that the t-digest handles data with full floating point resolution. With small changes, the t-digest can handle values from any ordered set for which we can compute something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by alternative digest algorithms in spite of the fact that t-digests are much more compact, particularly when serialized.

In summary, the particularly interesting characteristics of the t-digest are that it

has smaller summaries when serialized
works on double precision floating point as well as integers.
provides part per million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles
is very fast (~ 140 ns per add)
is very simple (~ 5000 lines of code total, <1000 for the most advanced implementation alone)
has a reference implementation that has > 90% test coverage
can be used with map-reduce very easily because digests can be merged
requires no dynamic allocation after initial creation (MergingDigest only)
has no runtime dependencies

Recent News

There is a new article (open access!) in Software Impacts on the t-digest, focussed particularly on this reference implementation.

Lots has happened in t-digest lately. Most recently, with the help of people posting their observations of subtle misbehavior over the last 2 years, I figured out that the sort in the MergingDigest really needed to be stable. This helps particularly with repeated values. Stabilizing the sort appears to have no negative impact on accuracy nor significant change in speed, but testing is continuing. As part of introducing this change to the sort, I made the core implementation pickier about enforcing the size invariants which forced updates to a number of tests.

The basic gist of other recent changes is that the core algorithms have been made much more rigorous and the associated papers in the docs directory have been updated to match the reality of the most advanced implementations. The general areas of improvement include substantial speedups, a new framework for dealing with scale functions, real proofs of size bounds and invariants for all current scale functions, much improved interpolation algorithms, better accuracy testing and splitting the entire distribution into parts for the core algorithms, quality testing, benchmarking and documentation.

I am working on a 4.0 release that incorporates all of these improvements. The remaining punch list for the release is roughly:

~~verify all tests are clean and not disabled~~ (done!)
~~integrate all scale functions into AVLTreeDigest~~ (done!)
describe accuracy using the quality suite
extend benchmarks to include AVLTreeDigest as first-class alternative
measure merging performance
consider issue #87
review all outstanding issues (add unit tests if necessary or close if not)

Publication work is now centered around comparisons with the KLL digest (spoiler, the t-digest is much smaller and possibly 2 orders of magnitude more accurate than KLL). I would still like to see potential co-authors who could accelerate these submissions are encouraged to speak up! In the meantime, an archived pre-print of the main paper is available.

In research areas, there are some ideas being thrown around about how to bring strict guarantees similar to the GK or KLL algorithms to the t-digest. There is some promise here, but nothing real yet. If you are interested in a research project, this could be an interesting one.

Scale Functions

The idea of scale functions is the heart of the t-digest. But things don't quite work the way that we originally thought. Originally, it was presumed that accuracy should be proportional to the square of the size of a cluster. That isn't true in practice. That means that scale functions need to be much more aggressive about controlling cluster sizes near the tails. We now have 4 scale functions supported for both major digest forms (MergingDigest and AVLTreeDigest) to allow different trade-offs in terms of accuracy.

These scale functions now have associated proofs that they all preserve the key invariants necessary to build an accurate digest and that they all give tight bounds on the size of a digest. Having new scale functions means that we can get much better tail accuracy than before without losing much in terms of median accuracy. It also means that insertion into a MergingDigest is faster than before since we have been able to eliminate all fancy functions like sqrt, log or sin from the critical path (although sqrt is faster than you might think).

There are also suggestions that asymmetric scale functions would be useful. These would allow good single-tailed accuracy with (slightly) smaller digests. A paper has been submitted on this by the developer who came up with the idea and feedback from users about the utility of such scale functions would be welcome.

Better Interpolation

The better accuracy achieved by the new scale functions partly comes from the fact that the most extreme clusters near q=0 or q=1 are limited to only a single sample. Handling these singletons well makes a huge difference in the accuracy of tail estimates. Handling the transition to non-singletons is also very important.

Both cases are handled much better than before.

The better interpolation has been fully integrated and tested in both the MergingDigest and AVLTreeDigest with very good improvements in accuracy. The bug detected in the AVLTreeDigest that affected data with many repeated values has also been fixed.

Two-level Merging

We now have a trick for the MergingDigest that uses a higher value of the compression parameter (delta) while we are accumulating a t-digest and a lower value when we are about to store or display a t-digest. This two-level merging has a small (negative) effect on speed, but a substantial (positive) effect on accuracy because clusters are ordered more strictly. This better ordering of clusters means that the effects of the improved interpolation are much easier to observe.

Extending this to AVLTreeDigest is theoretically possible, but it isn't clear the effect it will have.

Repo Reorg

The t-digest repository is now split into different functional areas. This is important because it simplifies the code used in production by extracting the (slow) code that generates data for accuracy testing, but also because it lets us avoid any dependencies on GPL code (notably the jmh benchmarking tools) in the released artifacts.

The major areas are

core - this is where the t-digest and unit tests live
docs - the main paper and auxiliary proofs live here
benchmarks - this is the code that tests the speed of the digest algos
quality - this is the code that generates and analyzes accuracy information

Within the docs sub-directory, proofs of invariant preservation and size bounds are moved to docs/proofs and all figures in docs/t-digest-paper are collected into a single directory to avoid cluster.

LogHistogram and FloatHistogram

This package also has an implementation of FloatHistogram which is another way to look at distributions where all measurements are positive and where you want relative accuracy in the measurement space instead of accuracy defined in quantiles. This FloatHistogram makes use of the floating point hardware to implement variable width bins so that adding data is very fast (5ns/data point in benchmarks) and the resulting sketch is small for reasonable accuracy levels. For instance, if you require dynamic range of a million and are OK with about bins being about ±10%, then you only need 80 counters.

Since the bins for FloatHistogram's are static rather than adaptive, they can be combined very easily. Thus you can store a histogram for short periods of time and combined them at query time if you are looking at metrics for your system. You can also reweight histograms to avoid errors due to structured omission.

Another class called LogHistogram is also available in t-digest. The LogHistogram is very much like the FloatHistogram, but it incorporates a clever quadratic update step (thanks to Otmar Ertl) so that the bucket widths vary more precisely and thus the number of buckets can be decreased by about 40% while getting the same accuracy. This is particularly important when you are maintaining only modest accuracy and want small histograms.

In the future, I will incorporate some of the interpolation tricks from the main t-digest into the LogHistogram implementation.

Compile and Test

You have to have Java 1.8 to compile and run this code. You will also need maven (3+ preferred) to compile and test this software. In order to build the figures that go into the theory paper, you will need R. In order to format the paper, you will need latex. Pre-built pdf versions of all figures and papers are provided so you won't need latex if you don't need to make changes to these documents.

On Ubuntu, you can get the necessary pre-requisites for compiling the code with the following:

sudo apt-get install  openjdk-8-jdk git maven

Once you have these installed, use this to build and test the software:

cd t-digest; mvn test

Most of the very slow tests are in the quality module so if you just run the tests in core module, you can save considerable time.

Testing Accuracy and Comparing to Q-digest

The normal test suite produces a number of diagnostics that describe the scaling and accuracy characteristics of t-digests. In order to produce nice visualizations of these properties, you need to have many more samples. To get this enhanced view, run the tests in the quality module by running the full test suite once or, subsequently, by running just the tests in the quality sub-directory.

cd quality; mvn test

The data from these tests are stored in a variety of data files in the quality directory. Some of these files are quite large.

I have prepared detailed instructions on producing all of the figures used in the main paper.

Most of these scripts will complete almost instantaneously; one or two will take a few tens of seconds.

The output of these scripts are a collection of PDF files that can be viewed with any suitable viewer such as Preview on a Mac. Many of these images are used as figures in the main t-digest paper.

Implementations in Other Languages

The t-digest algorithm has been ported to other languages:

Python: tdigest
Go: github.com/spenczar/tdigest github.com/influxdata/tdigest
JavaScript: tdigest
C++: CPP TDigest, FB's Folly Implementation (high performance)
C++: TDigest as part of Apache Arrow
CUDA C++: tdigest.cu as part of libcudf in RAPIDS powering the approx_percentile and percentile_approx expressions in Spark SQL with RAPIDS Accelerator for Apache Spark
Rust: t-digest and its modified version in Apache Arrow Datafusion
Scala: TDigest.scala
C: tdigestc (w/ bindings to Go, Java, Python, JS via wasm)
C: t-digest-c as part of RedisBloom
Clojure: t-digest for Clojure
C#: t-digest-csharp (.NET Core)
Kotlin multiplatform: tdigest_kotlin_multiplatform
OCaml: tdigest. Purely functional, can also compile to JS via js_of_ocaml.
Redis: Redis Stack supports t-digest.

Continuous Integration

The t-digest project makes use of Travis integration with Github for testing whenever a change is made.

You can see the reports at:

https://travis-ci.org/tdunning/t-digest

travis update

Installation

The t-Digest library Jars are released via Maven Central Repository. The current version is 3.3.

     <dependency>
         <groupId>com.tdunning</groupId>
         <artifactId>t-digest</artifactId>
         <version>3.3</version>
     </dependency>

t-digest's People

Contributors

Stargazers

Watchers

Forkers

d13sl0w jaypalominodb apsaltis jpountz supercargo nremond allenday peternachbaur youyong205 fsgp dariadaria smorin daria-sukhareva saikswaroop tanbamboo merlinr68 oertl ominiavincit muntasirraihan payne julienledem musicx kalaidin rodrigoparada ksenji camdavidsonpilon chodar smarthi drej82 bpholden pippobaudos rayortigas sudarshansarolkar mindis mkolod ekoziol jayakumark iranaway itismewxg codeaudit lucentcosmos spenczar rstonkus brharrington k0stia himanshug markpavey thomasdarimont shub0 evilhonduco ompanda digideskio rkw0k pranasblk demianr callumkift jcrotinger colinsongf brucezu iceqp hanumathrao kkpapa ntirupattur bgloeckle mailmahee minhqnguyen orenov kyotoyaho tobrien nikolayvoronchikhin whiker tkelleyireland chr1st0p emlyn huangjun6919 christian-blades-cb anvinjain leoromanovsky zhangxinming1991 bilus benzei phamngoclinh96 dodecamer jbrownuf07 kiefersmith leusonmario kapiya vaful sudeepdino008 dsbib viniciusvdias fleeyou arproio shacharbz meytaldimant atzoum jdrew1303 isuranimalasri zer0id0l sathishs

t-digest's Issues

Add serializability

We use a patched version of t-digest because we have to send AVLTreeDigest over the wire (because we are using spark) and therefore the class has to be serializable. Would you be open to merge this as a pull request? If so we would provide one.

Btw. thx for this awesome library - saves us a ton of time computing percentiles!

Unit test in error

Hi,

I'm trying to compile the sources to be able to play a bit with them. Unfortunately, I got
this error and don't really understand how it could work ?

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.372 sec <<< FAILURE! - in com.tdunning.math.stats.ArrayDigestTest
testGamma(com.tdunning.math.stats.ArrayDigestTest)  Time elapsed: 0.372 sec  <<< ERROR!
java.lang.UnsupportedOperationException: Default operation
    at com.tdunning.math.stats.ArrayDigest.cdf(ArrayDigest.java:286)
    at com.tdunning.math.stats.ArrayDigestTest.runTest(ArrayDigestTest.java:212)
    at com.tdunning.math.stats.ArrayDigestTest.testGamma(ArrayDigestTest.java:153)

Getting IllegalArgumentException when adding digests

I'm using Spark to build a bunch of AVLTreeDigests, one for each partition of the data. All digests are initialized with compression 100, and values are all added with weight 1. When I then try to add these partition digests I get this exception:

java.lang.IllegalArgumentException
    at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:60)
    at com.tdunning.math.stats.AbstractTDigest.add(AbstractTDigest.java:166)
    at test.TDigestWrapper.$plus(TestDigest.scala:34)
    at test.DigestAccumulatorParam.addInPlace(TestDigest.scala:42)
    at test.DigestAccumulatorParam.addInPlace(TestDigest.scala:41)
    at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:80)
    at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:342)
    at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:337)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.apache.spark.Accumulators$.add(Accumulators.scala:337)
    at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:945)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1014)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1454)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

The two digests in question seem to be empty actually, but with different centroids. The accumulator one has an empty centroids array whereas the other digest has a single centroid Centroid{centroid=NaN, count=0}. That latter one was deserialized via AVLTreeDigest.fromBytes.

totalDigest add spark dataframe column / array

Currently, we are loading a spark dataframe column into totalDigest with a hackerish way of using .take() and foreach, as tdigests keep throwing an "object not serializable" error without .take(). Is there a more native way of loading large array / spark dataframe columns directly into totalDigest?

Sample code we using
val totalDigest = TDigest.createDigest(100)
val data =df.select("col").rdd.map(r => r.getDouble(0)).take(numberOfRows)
data.foreach(value => totalDigest.add(value))

Alternative, using array of bytes:
http://apache-spark-user-list.1001560.n3.nabble.com/Percentile-td19978.html#a20032

Revamp accuracy test

Should include uniform and skewed distributions for all active digest types
Should include merge strategy (1000 digests merged simultaneously)
Should include ordered insert and ordered merge insert

Should include bucket distribution testing (hopefully faster than before)

Should include accuracy versus size summary

Support datasets that contain more than 2B values

Current implementation uses ints in order to represent counts of values. It would be useful to switch to longs so that quantile estimations would still work if more than 2B values are accumulated.

Improvement to t-digest stability under merging

I noticed that merging t-digests appeared to cause some drift from the underlying distribution, at least using the method mentioned in the original paper where clusters are randomly shuffled and inserted. Inserting combined clusters from largest-to-smallest gave a big improvement in "convergence" of the merged clusters. I wrote up the results here:

http://erikerlandson.github.io/blog/2016/12/19/converging-monoid-addition-for-t-digest/

Histogram tree is slow

The implementation of a binary tree in t-digest is a quick and dirty version of an A-A tree. It could be made much faster through the use of type specific arrays or even just cleaner code.

Update cluster algorithm description in article and docs

high centroid frequency causes overflow - giving incorrect results

See sample code. When centroid frequency goes into billions and approaches INT.MAX_VALUE, quantile method seems to go through an overflow and reports incorrect result.

Data-structure state seems to be consistent - but when evaluating quantile, incorrect results get returned.

[https://github.com/rohancs/t-digest-bug/blob/master/src/main/java/TDigestBug.java]

Which TDigest do you recommend?

It looks like you have 4 kinds of TDigest in this codebase.

AVLTreeDigest
ArrayDigest
MergingDigest
TreeDigest

I cannot tell which implementation is the most highly recommended. Do you have a suggestion?

I am building a MapReduce job using Apache Spark.

underlying distribution : powerlaw

Could you please explain what is expected if the underlying distribution is a powerlaw (pareto specifically). The deviation of computed 50-pct from real 50-pct ranges from 0.01% to 15%. Furthermore, it deviates based on shuffle.

My data set has 600K data points. Do we have any specific scaling guide to replace 4*q*(1-q) for powerlaw types.

Very slow performance; what am I missing?

I have been playing with the t-digest on a sample dataset of latencies and comparing performance against raw sorting to derive quantile statistics. I am using the standard APIs (and recommended configuration values) to build the t-digest, however I am observing significantly worse times than raw sorting. Is this to be expected? I would appreciate any help in this regards.

Here is my sample code:

long[] latencies = new long[1000000];
...
TDigest tDigest = TDigest.createDigest(100);
for (int i=0; i < 1000000; i++) {
    tDigest.add(latencies[i]);
}
System.out.println("Median = " + tDigest.quantile(0.5));

I used a dataset with 1 million latency values (in ms). The max is 100,000. It took about 2.7 secs to build the digest and extract the 50% quantile value. With a dataset with 2 million latency values, the computation time doubled. Raw sorting is under 100ms.

Reducing the compression factor to 5 helps but computation time is still excessively high.

smallByteSize methods are very trappy in many classes -- should be changed or have warnings in javadocs

When integrating t-digest into Solr, the first patch contributed had code which looked like this...

ByteBuffer buf = ByteBuffer.allocate(tdigest.smallByteSize());
tdigest.asSmallBytes(buf);
res.add("percentiles", buf.array());

...the contributor who wrote that code did so based on the javadocs of ArrayDigest.smallByteSize, not realizing that under the covers hte internals of this method are:

allocate a ByteBuffer based on the result of byteSize()
do a complete serialization using asSmallBytes(ByteBuffer)
return the final position of the ByteBuffer

...meaning that ultimately at least 2x the smallByteSize of RAM was being allocated everytime, and the ArrayDigest was being serialized twice.

Methods like ArrayDigest.smallByteSize (AVLTreeDigest.smallByteSize, etc...) that have such a high internal cost should either be changed to just call byteSize() (since the overallocation the user will have to do is still better for space/time performance then doing that same overallocation internally plus a behind the scenes serialiation that is thrown away) or warn users about this heavy internal cost in the javadocs.

MergingDigest.centroids is wrong on an empty digest

When calling centroids on an empty MergingDigest, a list containing a single element with NaN as a centroid is returned instead of an empty list:

MergingDigest d = new MergingDigest(100);
System.out.println(d.centroids()); // [Centroid{centroid=NaN, count=0}]

Space complexity of t-digest

The q-digest has a size of at most 3k, where k is the compression parameter. What is the space complexity of the t-digest?

JMH testing for speed

This should be revamped and results recorded

Simple alternate algorithm using maxima, ranks and fixed cumulative weighting

Got down a rabbit hole, came out with this. It is a small part of a bigger project in javascript but some ideas might be of use to others. Sorry no Java.

In short, it flips around the value-rank compression approach. Instead of having approximated weighted-value centroids for a given rank, this algorithm uses actual values as-is but with approximated ranks.

maximum value and rank instead of weighted value centroids:

actual values are retained. Better for discrete or repeated values
cdf is directly implied from a given maximum and rank F(x) = P(X <= x)
ranks are interpolated (ie values are exact, ranks are not)

fixed size, simple array, no tree structure,

no compression cycles
fixed memory use
weighting function calculated only once (fixed target cumulative probability)

There is already a number of tests and benchmarks but it is still a proof of concept at this time since it is still fairly new. I can elaborate more if there is any interest. (unless it already exists somewhere?)

histo.pdf is corrupt

the histo pdf file
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf
appears to be corrupt.

Adjusting the centroid threshold values to obtain better accuracy at interesting values

The current algorithm defines a threshold for the number of values any centroid can hold as proportional to 4q(1-q). This means that computing quantiles at extreme values (near 0 or 1) gives the most accurate values, and computing the median value will gives the least accurate value (since the threshold is highest in the middle of the histogram.

One thought I had was that if the algorithm was told up-front what quantiles are the most interesting to the user could this distribution be modified to concentrate the accuracy (i.e. lower the threshold) at these values? So for example if the user is most interested in the median value the thresholds equation used could be proportional to 4q(q-1)+1.

I can see two complications coming from this approach:

If the user is interested in multiple quantile values the calculation of the threshold value could become complex. Of course there is nothing stopping the user from asking the algorithm for other quantile values, they would just receive lower accuracy results for these other values compared with the value the distribution is centred on. So maybe to start it could be restricted to a single 'interesting' quantile.
For use-cases where the values recorded in the histogram follow a normal Gaussian distribution (or approximate to this distribution) the number of compressions required will be higher than with the current distribution. The worst case of this would be when the user is most interested in the median value and the distribution is normal Gaussian. In this case the values recorded will be concentrated in the area that has the lowest threshold so will cause a lot of new centroids to be created and therefore more merge events (on the flip side of this the algorithm should preform much less merges in the ordered values case). This may just be the cost that needs to be paid to have high accuracy at the median?

I would appreciate any thoughts on this approach. Does it sound like something thats worth exploring? Have I missed something that would mean this wouldn't be a good idea? If this idea seems to make sense I can work on trying it out and seeing how it performs with different distributions of data.

TDigest class does not implement Serializable interface in last release.

Very good library, i have tested it, thanks a lot, however this library can't be used with Spark without implementing TDigest class Serializable interface.

My question is when will be new release with new features, in particular just adding Serializable support?
Thanks in advance.

Inverse quantile algorithm is non-contiguous

Not only it gives strange results at the end (and indeed, python implementation fixes that; sorry, not fluent at Java), but also the ranges that are covered as q goes up are non-contiguous.
I propose the following changes:

t <- 0, q <- $1 + q (\sum c_i.count - 1)$
for i <- 1..m
  if q \leq t + c_i.count
    if i = 1 : return c_i.mean
    if i = m : return c_i.mean
    low <- (c_{i-1}.mean + c_i.mean)/2
    high <- (c_{i+1}.mean + c_i.mean)/2
    return low + (high - low) * (q - t)/c_i.count

This has the property of being contiguous and returning precise values for $q = 0, 1$.

Deserialization of MergingDigest BuferUnderflowException in 3.1

Heya, I'm writing an adaptor and noticed that if I create and serialize a MergingDigest, I can't deserialize it. Instead I have to switch to the AVLTreeDigest implementation. Is that intentional? An example test:

@Test
  public void mergingDigestSerDes() throws Exception {
    final TDigest out = MergingDigest.createDigest(100);
    out.add(42.5);
    out.add(1);
    out.add(24.0);
    assertEquals(40.649, out.quantile(0.95), 0.001);
    
    final ByteBuffer output = ByteBuffer.allocate(out.smallByteSize());
    out.asSmallBytes(output);
    
    ByteBuffer input = ByteBuffer.wrap(output.array());
    try {
      MergingDigest.fromBytes(input);
    } catch (BufferUnderflowException e) {
      System.out.println("WTF?");
    }
    
    input = ByteBuffer.wrap(output.array());
    final TDigest in = AVLTreeDigest.fromBytes(input);
    assertEquals(40.649, in.quantile(0.95), 0.001);
  }

And from #52 , I'd like solution number 2 where the type of tree is also encoded in the byte array.

Graphing scripts should produce corresponding plots for all algorithms

The R scripts that produce the figures for the paper have hard coded path names that are no longer current. Now the algorithm name is embedded in the path name. It should be easy to produce all the plots at one go.

C++ t-digest is buggy

The clustering C++ implementation that you link to does not handle full clusters properly.

Scale centroid sizes according to sqrt(q(1-q)) instead of q(1-q)

Since the statistical error when estimating quantiles is proportional to sqrt(q*(1-q)) (compare Numerical Recipes, 3rd edition, p. 438), I think it could be better to choose the centroid sizes accordingly. In the add() method I replaced the line
double k = 4 * count * q * (1 - q) / compression;
by
double k = 2 * count * Math.sqrt(q * (1 - q)) / compression;
and got very interesting results as shown below. For comparison the figures are also shown for the original approach. The error-scaling and scaling figures show a reduction of digest sizes by a factor more than 2. The error figures suggest that the new approach is somewhat less accurate for the Gamma distribution. However, I believe that the accuracy is still small if compared to the statistical error introduced by data sampling. It would be interesting to measure the quality of the t-digest algorithm in terms of the statistical error as proposed in Numericial Recipes (3rd edition, p. 438):

..., there are statistical errors. One way to characterize these is to ask what
value j has A_j closest to the quantile reported by IQ agent, and then how small is |j-pN| as a fraction of sqrt(N p (1-p)). If this fraction is less than 1, then the estimate is “good enough,” meaning that no method can do substantially better at estimating the population quantiles given your sample.

(A_1, A_2, ..., A_N are the sampled values and IQ agent is the quantile estimator, t-digest in our case)

error-scaling (2_sqrt(q_(1-q)))
error-scaling (4_q_(1-q))
scaling (2_sqrt(q_(1-q)))
scaling (4_q_(1-q))
error (2_sqrt(q_(1-q)))
error (4_q_(1-q))
sizes (2_sqrt(q_(1-q)))
sizes(4_q_(1-q))

Build is unstable under some circumstances

Building and testing on my laptop is pretty stable but building and testing on travis has >50% failure rate. Building on a large AWS node also shows significant failure rates.

The build command on travis differs from the one I normally use. Travis uses:

 mvn -B clean install -DskipTests
 mvn -B test

I emulated this on the large AWS node and got the following results:

baseline
rm -rf ~/.m2 ; mvn -B clean install -DskipTests ; mvn -B test
      7 [INFO] BUILD FAILURE
     13 [INFO] BUILD SUCCESS

no-rm-baseline
mvn -B clean install -DskipTests ; mvn -B test
      6 [INFO] BUILD FAILURE
     14 [INFO] BUILD SUCCESS

no-clean-no-rm-baseline
mvn -B install -DskipTests ; mvn -B test
      5 [INFO] BUILD FAILURE
     15 [INFO] BUILD SUCCESS

bare-test-batch-baseline
mvn -B test
      3 [INFO] BUILD FAILURE
      7 [INFO] BUILD SUCCESS

no-batch-baseline
mvn test
      2 [INFO] BUILD FAILURE
      8 [INFO] BUILD SUCCESS

I see no difference in failure rates here. I also looked at the environment and was unable to discern any difference due to -B.

The pom has the following surefire configuration:

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>2.18.1</version>
            <configuration>
                <parallel>methods</parallel>
                <perCoreThreadCount>true</perCoreThreadCount>
                <threadCount>1</threadCount>
            </configuration>
        </plugin>

Commenting out the configuration elements here doesn't seem to change anything either.

Remove deprecated kinds of t-digest

This includes:

ArrayDigest
TreeDigest
GroupTree

and related tests

AVLTreeDigest with a lot of datas : integer overflow

For a spacial project, we use TDigest to make statistics on 5 billion stars. Thanks for your works, it's very useful !

With integer fields, AVLTreeDigest give false results while TreeDigest give good results.

The problem become from "int[] aggregatedCounts", in AVLGroupTree, sometimes sum overflow integer capacity and the result become negative.

I change "int" to "long" and after AVLTreeDigest gave good results (the same result as TreeDigest).

AVLGroupTree_patch.txt

Typo in the construction algorithm

Going by the text,

This set is reduced by retaining only centroids with a count less than 4 \delta q (1 - q) n

Going by the code,

c_i.count + w_n \leq 4 n \delta q(c_i) (1 - q(c_i))

The latter condition is not only incorrect, it renders the minimum subsequently taken completely useless, as w_n \leq Max(i) - c_i.count already. This leads to pumping the exact value a lot even if the capacity would be exceeded a little bit.

Also relevant is the fact that if c_i already exists in the set then all boundaries on its size are ignored.

MergingDigestTest fails on mvn test

After running mvn test, one test fails. This is the output:

[INFO] Error stacktraces are turned on.
[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building T-Digest 3.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ t-digest ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /Users/grega/Development/t-digest/src/main/resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.0:compile (default-compile) @ t-digest ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ t-digest ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /Users/grega/Development/t-digest/src/test/resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.0:testCompile (default-testCompile) @ t-digest ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- maven-surefire-plugin:2.16:test (default-test) @ t-digest ---
[INFO] Surefire report directory: /Users/grega/Development/t-digest/target/surefire-reports
[INFO] parallel='methods', perCoreThreadCount=true, threadCount=1, useUnlimitedThreads=false, threadCountSuites=0, threadCountClasses=0, threadCountMethods=0

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running com.tdunning.math.stats.SortTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.144 sec - in com.tdunning.math.stats.SortTest
Running com.tdunning.math.stats.AVLGroupTreeTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.221 sec - in com.tdunning.math.stats.AVLGroupTreeTest
Running com.tdunning.math.stats.GroupTreeTest
0:0.0(1)    1:0.0(1)    2:1.0(1)    3:1.0(1)    4:2.0(1)    5:2.5(2)    6:3.0(1)    7:3.0(1)    8:4.0(1)    9:4.0(1)    10:5.0(1)   Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.083 sec - in com.tdunning.math.stats.GroupTreeTest
Running com.tdunning.math.stats.IntAVLTreeTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.615 sec - in com.tdunning.math.stats.IntAVLTreeTest
Running com.tdunning.math.stats.TDigestTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec - in com.tdunning.math.stats.TDigestTest
Running com.tdunning.math.stats.AVLTreeDigestTest
#19.501140us per point
#794 centroids
Iteration 1# 15.788570us per point
#851 centroids
# big 4491 bytes
# small 4491 bytes
#15.375480us per point
#843 centroids
#53.819260us per point
#4045 centroids
#5.940800us per point
#852 centroids
Start 0Start 1Start 2Finish 1Finish 2Finish 0Tests run: 20, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 67.329 sec - in com.tdunning.math.stats.AVLTreeDigestTest
Running com.tdunning.math.stats.ArrayDigestTest
#19.488430us per point
#838 centroids
#25.380990us per point
#842 centroids
#18.371530us per point
#841 centroids
#15.674220us per point
#854 centroids
#5.107180us per point
#866 centroids
#4.435620us per point
#858 centroids
#3.549310us per point
#865 centroids
#4.670660us per point
#854 centroids
#4.219440us per point
#833 centroids
#4.117870us per point
#853 centroids
# big 4438 bytes
# small 4438 bytes
#44.124310us per point
#732 centroids
#26.484020us per point
#781 centroids
#21.444550us per point
#778 centroids
Iteration 1# 21.182410us per point
#847 centroids
#126.478320us per point
#3183 centroids
#50.969960us per point
#3195 centroids
#35.860900us per point
#3219 centroids
0.0 0.0000.1    0.0970.2    0.1990.3    0.3100.4    0.4060.5    0.4970.6    0.6170.7    0.7170.8    0.8130.9    0.9151.0    0.999# 25.518120us per point
#843 centroids
Start 1Start 0Start 2Finish 0Finish 1Finish 2Tests run: 22, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 116.769 sec - in com.tdunning.math.stats.ArrayDigestTest
Running com.tdunning.math.stats.MergingDigestTest
#30.942950us per point
#443 centroids
#28.387310us per point
#318 centroids
# big 4576 bytes
# small 4576 bytes
#41.353910us per point
#4045 centroids
#28.407520us per point
#314 centroids
Start 0Start 1Start 2Finish 0Finish 2Finish 1Tests run: 20, Failures: 0, Errors: 3, Skipped: 1, Time elapsed: 59.454 sec <<< FAILURE! - in com.tdunning.math.stats.MergingDigestTest
testMerge(com.tdunning.math.stats.MergingDigestTest)  Time elapsed: 3.628 sec  <<< ERROR!
java.util.concurrent.ExecutionException: java.lang.AssertionError: expected:<100000> but was:<69>
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:188)
    at com.tdunning.math.stats.TDigestTest.merge(TDigestTest.java:179)
    at com.tdunning.math.stats.MergingDigestTest.testMerge(MergingDigestTest.java:477)
Caused by: java.lang.AssertionError: expected:<100000> but was:<69>
    at org.junit.Assert.fail(Assert.java:91)
    at org.junit.Assert.failNotEquals(Assert.java:645)
    at org.junit.Assert.assertEquals(Assert.java:126)
    at org.junit.Assert.assertEquals(Assert.java:470)
    at org.junit.Assert.assertEquals(Assert.java:454)
    at com.tdunning.math.stats.TDigestTest$1.call(TDigestTest.java:145)
    at com.tdunning.math.stats.TDigestTest$1.call(TDigestTest.java:91)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

testNarrowNormal(com.tdunning.math.stats.MergingDigestTest)  Time elapsed: 0.073 sec  <<< ERROR!
java.lang.NullPointerException: null
    at com.tdunning.math.stats.MergingDigest.mergeNewValues(MergingDigest.java:148)
    at com.tdunning.math.stats.MergingDigest.add(MergingDigest.java:118)
    at com.tdunning.math.stats.AbstractTDigest.add(AbstractTDigest.java:154)
    at com.tdunning.math.stats.TDigestTest.runTest(TDigestTest.java:268)
    at com.tdunning.math.stats.MergingDigestTest.testNarrowNormal(MergingDigestTest.java:108)

testSorted(com.tdunning.math.stats.MergingDigestTest)  Time elapsed: 0.001 sec  <<< ERROR!
java.lang.NullPointerException: null

Running com.tdunning.math.stats.TreeDigestTest
#22.434220us per point
#749 centroids
#12.782380us per point
#834 centroids
# big 4464 bytes
# small 4464 bytes
Iteration 1# 12.420250us per point
#824 centroids
#96.659150us per point
#4700 centroids
#11.418890us per point
#847 centroids
Start 0Start 1Start 2Finish 2Finish 1Finish 0Tests run: 20, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 85.281 sec - in com.tdunning.math.stats.TreeDigestTest

Results :

Tests in error: 
  MergingDigestTest.testMerge:477->TDigestTest.merge:179 ? Execution java.lang.A...
  MergingDigestTest.testNarrowNormal:108->TDigestTest.runTest:268 ? NullPointer
  MergingDigestTest.testSorted ? NullPointer

Tests run: 103, Failures: 0, Errors: 3, Skipped: 4

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:04 min
[INFO] Finished at: 2015-03-14T11:45:54+01:00
[INFO] Final Memory: 10M/245M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.16:test (default-test) on project t-digest: There are test failures.
[ERROR] 
[ERROR] Please refer to /Users/grega/Development/t-digest/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.16:test (default-test) on project t-digest: There are test failures.

Please refer to /Users/grega/Development/t-digest/target/surefire-reports for the individual test results.
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:347)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:154)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:584)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:213)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:157)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.MojoFailureException: There are test failures.

Please refer to /Users/grega/Development/t-digest/target/surefire-reports for the individual test results.
    at org.apache.maven.plugin.surefire.SurefireHelper.reportExecution(SurefireHelper.java:82)
    at org.apache.maven.plugin.surefire.SurefirePlugin.handleSummary(SurefirePlugin.java:190)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:852)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:720)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
    ... 19 more
[ERROR] 
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Test method testScaling() always adds values in ascending order

The unit test method testScaling() defined in ArrayDigestTest, TreeDigestTest, and AVLTreeDigestTest generates a list of random numbers, sorts them, and always adds them in ascending order to the TDigest data structures. I am not sure if this is on purpose, since adding values in ascending order is a pathologic case.

jdk8 doclint incompatibility

Hi!

Library do not compile with jdk8. Main Issue is javadoc error, "self-closing elements not allowed" and "mallformed html"

https://kojipkgs.fedoraproject.org//work/tasks/8668/10158668/build.log
Simple auto-patching of first one can be done:
$ grep -ir -e "
"
src/main/java/com/tdunning/math/stats/TDigest.java: *

src/main/java/com/tdunning/math/stats/TDigest.java: *

src/main/java/com/tdunning/math/stats/TDigest.java: *

src/main/java/com/tdunning/math/stats/TDigest.java: *

src/main/java/com/tdunning/math/stats/TDigest.java: *

src/main/java/com/tdunning/math/stats/TDigest.java: *

src/main/java/com/tdunning/math/stats/TDigest.java: *

src/main/java/com/tdunning/math/stats/TDigest.java: *

src/main/java/com/tdunning/math/stats/TDigest.java: *

src/main/java/com/tdunning/math/stats/TDigest.java: *

src/main/java/com/tdunning/math/stats/TreeDigest.java: *

src/main/java/com/tdunning/math/stats/TreeDigest.java: *

src/main/java/com/tdunning/math/stats/TreeDigest.java: *

src/main/java/com/tdunning/math/stats/TreeDigest.java: *

src/main/java/com/tdunning/math/stats/TreeDigest.java: *

src/main/java/com/tdunning/math/stats/TreeDigest.java: *

src/main/java/com/tdunning/math/stats/TreeDigest.java: *

src/main/java/com/tdunning/math/stats/TreeDigest.java: *

src/main/java/com/tdunning/math/stats/ArrayDigest.java: *

$ sed "s;
;
;g" -i src/main/java/com/tdunning/math/stats/TDigest.java
$ sed "s;
;
;g" -i src/main/java/com/tdunning/math/stats/TreeDigest.java
$ sed "s;
;
;g" -i src/main/java/com/tdunning/math/stats/ArrayDigest.java
$ grep -ir -e "
"

Whether to repalce by

or
or just nothing is up to you.
I went with
jsut because it seemed to be most close to original idea.

Except the self closing element, there is huge nuyber of < and > chas i javadoc whcih have to berepalced by <> . I did this manually.

It resulted to:
http://pkgs.fedoraproject.org/cgit/t-digest.git/tree/jdk8-javadoc.patch?id=7c47e6cf5f8e8b7fd78bda838a78cbadd2da38a1

If you wont me to do an pull request with this patch, or with different replacement, I will be happy to proceed.

In all cases, it would be nice to have t-digest buildbale by jdk8 (as I have now - http://koji.fedoraproject.org/koji/taskinfo?taskID=10338753)

Error in AVLTreeDigest.fromBytes

Hello,

In AVLTreeDigest, in method asBytes, you write compression as Float but you read it as Double in fromBytes method.

Sincerely

Nicolas

bug_correction.txt

Question: Proof of bounds on merging digest size

I've been working on a python implementation of the merging digest (written mostly in c). Code is here. I'm running into bounds issues on the centroid array, following the theoretical size of ceil(compression * pi/2) (as seen here).

I modified the java implementation to test this bound, and get out-of-bounds errors when running the existing test suite. I also modified the java tests to output the input data, and ran my implementation on those inputs - finding the same results. As such, I trust my implementation matches yours - and since both have out-of-bounds errors on buffers of that size, I'm skeptical of that theoretical size.

From reading the paper, it appears you concluded that the bounds on the number of maximally merged centroids is 2 * compression (see here), which differs from what's stated in the code. I can kind of convince myself of this proof by iterating through a worst case scenario on paper, but am not math-y enough to write out a formal proof. Using a bound of 2 * compression seems to squash all the out-of-bounds errors though.

Question:

What is the actual theoretical bound?
Can you sketch out a proof of how you came by this number?

Release of version 3.2

Hi,

First up, thanks for releasing the code!

If possible, could you provide an ETA for the release of version 3.2? The recent Serialisation changes (#56) together with the pending PR #66 would allow us to use your library as is.

Thanks

General factory method for "fromBytes"

The new TDigest.createDigest() factory method in version 3.1 looks very handy, but seems to really only useful for people operating in simple single node environments. In order to serialize/deserialize/merge TDigests in a distributed application, it's currently still neccessary to use hardcoded implementations to leverage the fromBytes and as(Small)Bytes methods.

It would be helpful if there was a general (static) factory contract in the TDigest class for deserializing a ByteBuffer into a TDigest of unknonw concrete type.

Possible improvement to the speed of the algorithm

Hi, I've recently learned about t-digest and really like the algorithm. I coded a very naive implementation in C++ using std::multimap for centroids and found 3µs per measurement way too expensive (I'm trying to use it in a fast map/reduce-like setting with up to 10 million measurements per chunk, and 3µs made response unacceptable).

However in my search for improving the speed I found that too much time is spent in computing the head sum after finding closest centroids, which was possible to amortise: I first collect a big enough array of measurements (e.g. 8192), which when full is turned into centroids with weight 1 and sorted with the rest of the summary. Then I go over centroids from left to right compress it by checking if it is possible to merge two centroids by computing q = sum + (l.count + r.count) / 2 (for the possibly merged centroid) and compute k based on it. If l.count + r.count is less than or equal to k, then I merge two centroids into one, which becomes the new l, otherwise I consider the next pair and so forth.

In my very simple checks on random data so far I found this seems to give less accurate results than the original t-digest, however the accuracy was still pretty good considering the final summary size, and the best part is that this variant gives me around 70ns per measurement. I considered other variants, like computing all head sums and then trying to merge closest or smallest centroids first, but this didn't give much accuracy improvement while slowing it down.

I'm wondering what you might think about this extremely simplified version and what are fatal flaws with using it like this?

Future-proof serialization

Currently, we depend on Java's serialization which is not a good strategy.

It would be much better to do our own and make all the different kinds of Digest use interchangeable formats.

Allow arbitrary scaling laws for centroid sizes

As discussed in #30 it makes sense to allow other scaling laws for centroid sizes than 4*q*(1-q). For example, 2*sqrt(q*(1-q)) could be useful if the underlying distribution needs to be estimated. Sometimes only one-sided extreme quantiles are interesting. For example, in case of response times, where slow values (quantiles close to 100%) are more interesting, it could make sense to have a scaling law like (q<0.5)?1.:4.*q*(1-q).

Improved constraint on centroid sizes

I propose an improved condition, whether neighboring centroids can be merged or not: Let {(x_1, w_1), (x_2, w_2),...,(x_N, w_N)} be the centroids in ascending order where x_i is the mean and w_i is the weight of the i-th centroid. Let W be the sum of all weights W := w_1 + w_2 +...+ w_N.

The original t-digest condition is equivalent to w_i/W*f(q_i) <= 4*delta, where f(q):=1/(q*(1-q)) and deltais the reciprocal value of the compression.

Instead, I propose to use the constraint int_{z_i}^{z_{i+1}} f(q) dq <= 4*delta with z_i:=(sum_{j<i} w_i)/W. Note, that this inequality is more strict than the original constraint due to Jensen's inequality and because f(q) is a convex function. The integral of f(q) from z_i to z_{i+1} can be solved analytically and the new constraint can be expressed as
ln(z_{i+1}/(1-z_{i+1})) - ln(z_{i}/(1-z_{i})) <= 4*delta or equivalently as (z_{i+1}-z_{i})<=(e^{4*delta}-1)*z_{i}*(1-z_{i+1}). The last inequality can be evaluated very efficiently, if the constant first factor on the right hand side is precalculated.

Since the integral of f(q) is indefinite at boundaries 0 and 1, the new constraint inhibits in a natural way merging with the first or the last centroid. This is not the case for the original constraint.

Histogram

Hi! Not sure if opening a new Issue here is the right way to go about this. I'd like to understand why the histogram created through t-digest loses so much information.

The original histogram with 500 breaks (every 0.05). Double peak is clearly visible.
.

Histogram created using t-digest centroids's means and counts. Double peak is not visible.

The raw data that I used includes 49220071 data points from 0 to 60.

This is the code I used to produce histogram-from-tdigest

Is using Centroid's mean() and count() even the right way to go about this?

Thank you.

Abstract class TDigest mixes interface and abstract implementation

THese need to be disentangled and more public

Make TDigest serializable

I was wondering why TDigest is not Serializable. I found myself writing some boilerplate code to serialize/deserialize it between tasks when using it in Spark and it would be cool if it was serializable by default. This is what I have to do now:

def serialize(tdigest: TDigest) = {
  val arr = new Array[Byte](tdigest.byteSize)
  tdigest.asBytes(ByteBuffer.wrap(arr))
  arr
}

def deserialize(arr: Array[Byte]) = {
  AVLTreeDigest.fromBytes(ByteBuffer.wrap(arr))
}

Do you have any thoughts on that? I could submit a PR for this if you think it'd be useful.

Add time decay to t-digest

By adding a little more state, the averages kept in the centroids could have true exponential decay of results.

http://tdunning.blogspot.de/2011/03/exponential-weighted-averages-with.html

Typo on t-digest paper?

Hi guys.

In Section 3.5, you are referring to a Figure 8 for the q- vs. t- digest comparison. I think the correct figure would be Figure 7.

Cheers,

-gunes

Publish in maven central

Would be nice to have this available at maven central or any other maven repo.

ArrayDigest merge causes a much less accurate output than TreeDigest

I'm running a test that processes ~16 Million floats (representing url request timing). My use case is distributed processing (Spark), so I need to make use of the merge functionality. For the ArrayDigest implementation, the data accuracy is extremely high when adding centoids point by point. However, when switching to merge, all bets are off. My average absolute error goes from around 0.5% to 10%. If I switch to the TreeDigest implementation, the merge accuracy stays high.

Other potentially helpful facts:

I'm using a dataset of url times in the range of 0 - 20s.
In the ArrayDigest implementation, I need to call compact manually every 100 merges or so in order to prevent the process from hanging. TreeDigest does not need a manual compaction.

I'm hoping to use this on a project soon. If I do, I'm more than happy to help troubleshoot (and hopefully submit a patch). If not, I can send you my test dataset to help you hunt down the issue.

big difference in ArrayDigest vs AVLTreeDigest quantile result

Hi,

I'm running a very simple simulation and surprisingly found some big difference in quantile result from ArrayDigest and AVLTreeDigest. AVLTreeDigest is giving less accurate results.

Simulation setup:

send 20k measurement, where each measurement has 0.1% chance of being 100, and 99.9% chance of being 5
create a ArrayDigest or AVLTreeDigest with compression = 15
Expected result should be: 95-99 quantile = 5; 99.9 quantile = 100

Result (from one simulation run, but each run is very close to each other):

ArrayDigest:

0.95, quantile value 5.0
0.96, quantile value 5.0
0.97, quantile value 5.0
0.98, quantile value 5.0
0.99, quantile value 5.0
0.999, quantile value 100.0

AVLTreeDigest:

0.95, quantile value 5.0
0.96, quantile value 5.0
0.97, quantile value 5.0
0.98, quantile value 8.790523690773068
0.99, quantile value 56.1720698254364
0.999, quantile value 98.81546134663341

Is this expected?

Code can be found here

Questions about merging implementation

Hey, I'm working on a Go port of the MergingDigest. I noticed that you haven't finished updating your paper to cover the merging implementation, which left me with some questions about its behavior that I was hoping you could clarify.

First off, I notice that the implementations of cdfs and quantiles depend on an important assumption detailed in figure 2 of your paper: the samples assigned to a given centroid are approximately uniformly distributed between the midpoints to the adjacent centroids on either side. I generated a similar graph for your java implementation using deviation-merge.csv and this is what I got (edit: by the way this graph contains all 3 distributions, gamma seq and uniform, which might explain its slightly uneven shape):

Most of the values are between -0.5 and 0.5 as with your original fig2, but a significant amount of values are outside that in the -1 to 1 range, and there's also a long tail of values trailing out towards the previous centroid.

Here's the graph from my implementation (I've tried uniform, exponential and normal distributions, they all had the same shape):

It looks narrower than yours but the shape is similar. What I'm wondering is, do you think the uniform distribution you assumed for the other t-digest implementations still holds up for this one? Both your implementation and mine have a distinctly nonuniform shape with a long tail towards the previous centroid. I don't know how much error is being introduced by assuming a uniform distribution between the midpoints, but it doesn't really look uniform to me.

Second, I noticed that your implementation uses a quadratic equation to estimate the size of the temporary centroid buffer, seen here. Could you explain what data this quadratic is derived from? I would like to do some tests of my own to corroborate your equation.

One more thing I was curious about - in the updated version of your paper section 2.1, you mention the mapping from quantiles to indices, implemented here. I was wondering what motivated this particular choice, and if there were any other functions with similar shapes that you've considered exploring. I was also wondering how this upper bound was computed, and how it relates to the sinusoidal mapping that was chosen.

Finally, thanks for developing this data structure, and for providing an open-source reference implementation. I look forward to reading the updated version of your paper with more details on the merging variant.