
Comments (4)

tdunning commented on June 5, 2024

I haven't looked at this enough yet, but here are some thoughts:

  • I recently looked into keeping min/max for each centroid/bucket in
    t-digest itself. I convinced myself that the accuracy would be lower
    because, during early adaptation or when merging digests, you wind up
    with roughly the ideal square distribution of samples but with some
    small leaky tails. That means that min and max are not really good
    estimators of how wide the general mass of the bucket really is.
  • Centroids are really clean and simple to update. The issue above makes
    most other measures harder.
  • Have you looked at the MergingDigest work? It gives you completely static
    memory allocation and no tree. The addition of new samples is dead simple.
    CDF and quantile computation is nearly the same as for tree digests.
  • Can you say how your insertion algorithm is different from t-digest and
    different from Greenwald-Khanna's algorithm (paper attached)?


hville commented on June 5, 2024

Thanks for your time and the pointers; I am not familiar with the literature.
I have read the Greenwald-Khanna paper and the MergingDigest code, and I think that what I am doing is very close to the MergingDigest idea, but more naive and with the value-rank axis inverted throughout:

min/max vs centroids

I don't follow your point, but in any case the min values are not used and the max values are interpolated:
max[i-1] < newVal < max[i]
newRnk = rnk[i] - (rnk[i] - (rnk[i-1] + 1)) * (max[i] - newVal) / (max[i] - max[i-1])

I think this makes it much closer to t-Digest than to Greenwald-Khanna, since it only uses a single interpolated value: it interpolates the rank instead of the value.
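
For illustration, here is a minimal sketch of that rank interpolation (hypothetical names, not the actual library code; maxs holds each bucket's maximum and rnk the cumulative rank at that bucket's upper edge):

```typescript
// Sketch only: estimate the rank of newVal from per-bucket maxima and
// cumulative ranks, both sorted ascending.
function interpolatedRank(maxs: number[], rnk: number[], newVal: number): number {
  // find i such that maxs[i - 1] < newVal <= maxs[i]
  const i = maxs.findIndex(m => newVal <= m);
  if (i === 0) return 0;                          // at or below the first bucket's max
  if (i === -1) return rnk[rnk.length - 1] + 1;   // above every bucket's max
  const span = rnk[i] - (rnk[i - 1] + 1);         // ranks spanned by bucket i
  const frac = (maxs[i] - newVal) / (maxs[i] - maxs[i - 1]);
  return rnk[i] - span * frac;                    // linear interpolation on the value axis
}
```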

I also have a hunch that the memory is better used by increasing the number of buckets and overall resolution instead of keeping min values.

weighting

The relative target weighting of the buckets is fixed at the outset. Any distribution can be used; this is the current implementation:
w: target bucket relative weight
p = idx/(N-1): bucket relative position

Assuming that the bucket error is proportional to the bucket weight and that we want every bucket to have a similar relative error:
Constant RMS edge error = errRMS = sqrt( (w/p)^2 + (w/(1-p))^2 )
errRMS = w/(p(1-p)) * sqrt(1 - 2p(1-p))
w[i] = errRMS * (p(1-p)) / sqrt(1 - 2p(1-p))

w[i] ~= k p(1-p) (1 + 2p(1-p)) is a very close approximation using a Maclaurin series and is easy to integrate
Cumulative target weights = ( 15p^2 + 10p^3 - 30p^4 + 12p^5 ) / 7
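
As a rough sketch (hypothetical name, not the library's actual API), those cumulative targets can be precomputed once per size and shared across instances:

```typescript
// Cumulative relative target weights W(p) = (15p^2 + 10p^3 - 30p^4 + 12p^5) / 7,
// evaluated at p = i / (N - 1); W(0) = 0 and W(1) = 1.
function cumulativeTargets(nBuckets: number): number[] {
  const w: number[] = [];
  for (let i = 0; i < nBuckets; i++) {
    const p = i / (nBuckets - 1);
    w.push((15 * p ** 2 + 10 * p ** 3 - 30 * p ** 4 + 12 * p ** 5) / 7);
  }
  return w;
}
```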

This relative error approach is actually much closer to t-Digest than to Greenwald-Khanna; the latter uses absolute error instead.
I thought the t-Digest implementation paper used something like 1/w(1-w), but the MergingDigest uses a cumulative arcsin function that I think is the inverse of the above.
Again, it looks to me like it is very close to the t-Digest but with the value-rank axis reversed and the values precomputed by buckets.

insertion

Since the relative target bucket weights are fixed at the outset, there is relatively little cost to merging values and no buffer is required. Although the target weights need to be stored, they can be shared across many instances of the same size. The insertion works as follows (a rough sketch of the dispatch step follows the list):

  • for a new value v, find the index i such that v[i-1] < v <= v[i+1]
  • calculate the interpolated rank r for the new value
  • compare r with the precomputed target weights
  • if r/N > w[i+1] replace the bucket [i+1] and continue shifting right with the replaced bucket
  • if r/N < w[i] replace the bucket [i] and continue shifting left with the replaced bucket
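
A minimal sketch of that dispatch step only (hypothetical names; the continued shifting of the displaced bucket and the rank bookkeeping are elided here):

```typescript
// Simplified sketch: decide whether the new value stays put or displaces a
// neighbouring bucket, based on the fixed cumulative targets.
function insertOne(maxs: number[], rnk: number[], targets: number[], v: number): void {
  const n = rnk[rnk.length - 1] + 1;               // total count after this insertion
  let i = maxs.findIndex(m => v <= m);             // neighbourhood of v
  if (i === -1) i = maxs.length - 1;               // v is above every bucket's max
  i = Math.min(Math.max(i, 1), maxs.length - 2);   // clamp so i-1 and i+1 exist
  const r = interpolatedRank(maxs, rnk, v);        // as in the earlier sketch

  if (r / n > targets[i + 1]) {
    // rank is past bucket i+1's target: v takes that slot and the displaced
    // value would keep shifting to the right
    [maxs[i + 1], v] = [v, maxs[i + 1]];
  } else if (r / n < targets[i]) {
    // symmetric case: swap into bucket i and keep shifting left
    [maxs[i], v] = [v, maxs[i]];
  }
  // the counts in rnk for the affected buckets would be updated here
}
```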

I guess I could take some of the unit-test cases and check where/when the error profile passes or fails.


hville commented on June 5, 2024

I'm closing this since it is not an issue and was simply opened to share ideas.


tdunning commented on June 5, 2024

One last comment is that inverting the axes might be a very clever thought.
I am not clear whether that is entirely true, but it is worth following up
on at some point.

The point is that a rational approximation for sin(k) is super easy while
asin(q) is harder.
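
For context, a small sketch of what that inversion looks like, assuming the k_1 scale function from the t-digest paper, k(q) = d/(2*pi) * asin(2q - 1), with d the compression parameter:

```typescript
// Forward direction (rank -> scale) needs asin; the inverse (scale -> rank)
// only needs sin, which has cheap rational approximations.
function kOfQ(q: number, d: number): number {
  return (d / (2 * Math.PI)) * Math.asin(2 * q - 1);
}
function qOfK(k: number, d: number): number {
  return (Math.sin((2 * Math.PI * k) / d) + 1) / 2;
}
```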

