Hey, I'm working on a <a href="https://github.com/stripe/veneur/blob/master/tdigest/me

Questions about merging implementation about t-digest HOT 4 OPEN

tdunning commented on May 25, 2024

Questions about merging implementation

from t-digest.

Comments (4)

tdunning commented on May 25, 2024

Thanks for your excellent comments.

I am re-running the distribution tests to see what has happened with the new algorithms, but I don't have answers on that yet.

The quadratic equation was just a hold-over from a previous world and was very wrong in the current world. It is now gone. The merging algorithm has hard bounds on size that are basically just 2 * compression.

Also, if you check out the latest master version of the paper, you should have a decent description of the algorithm. It is vastly simpler than before.

from t-digest.

tdunning commented on May 25, 2024

No progress on this, but there have been a number of changes that could improve behavior.

The distribution of values in a bin is critical to understand since a good characterization of this could improve accuracy, possibly substantially.

from t-digest.

tdunning commented on May 25, 2024

Tommy,

I just wanted to thank you for your observation. I have been working on the quality assessment in anticipation of the 4.0 release and the issue you have highlighted is a big deal that compromises tail accuracy significantly.

This can be seen quite dramatically by first using sorted test data. Here are the data points for the first few centroids of a typical run:

That samples that contribute to each centroid have a nice uniform distribution and the centroids have fewer and fewer samples as we get to the left end. All is as we would want.

On the other hand, if you look at what happens with randomly ordered data (but still with a really big sort buffer):

With a more rational-sized sort buffer, the tails of each centroid become elongated:

Even worse than just being elongated, these clusters are asymmetrical.

I am investigating merge strategies that might avoid this problem.

from t-digest.

tdunning commented on May 25, 2024

Progress report on this.

The issue that Tommy has highlighted is actually quite serious and has a bad effect on accuracy. In investigating this, I have learned a lot about different size limiting strategies as well. I will describe the side learning elsewhere, but the money graph is this one.

It is clear that the update strategy that is used in the tree digest (attach points to nearest cluster if space available) is doing a much better job and I attribute the improvement to the fact that the update strategy isn't biased. In the merging digest the updates are always done by a linear scan and that makes new data get added to centroids mostly from one side (either the left side, or on the outer side if we alternate scan direction). This causes the smearing seen in these graphs. The improvement in the graph above was had by changing from sqrt(q(1-q)) as a size limit to q(1-q). There are some normalizing factors in there, but the effect is that clusters near the limits are kept smaller for longer, hopefully limiting the effect of smear. This makes errors about 5x smaller for extreme q, but doesn't really solve the core problem yet.

Improving this situation and making it easier to run tests like this is a major priority for me right now. Of course, I have only minutes per week to work on this so progress is lamentably slow.

from t-digest.

Questions about merging implementation about t-digest HOT 4 OPEN

Comments (4)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent