
Comments (4)

tdunning commented on June 5, 2024

I haven't looked at this enough yet, but here are some thoughts:

  • I recently looked into keeping min/max for each centroid/bucket in
    t-digest itself. I convinced myself that the accuracy would be lower
    because, during early adaptation or when merging digests, you wind up
    with roughly the ideal square distribution of samples but with some
    small leaky tails. That means that min and max are not really good
    estimators of how wide the general mass of the bucket really is.
  • Centroids are really clean and simple to update. The issue above makes
    most other measures harder.
  • Have you looked at the MergingDigest work? It gives you completely static
    memory allocation and no tree. The addition of new samples is dead simple.
    CDF and quantile computation is nearly the same as for tree digests.
  • Can you say how your insertion algorithm is different from t-digest and
    different from Greenwald-Khanna's algorithm (paper attached)?


hville commented on June 5, 2024

Thanks for your time and the pointers; I am not familiar with the literature.
I have read the Greenwald-Khanna paper and the MergingDigest code, and I think that what I am doing is very close to the MergingDigest idea, but more naive and with the value-rank axis inverted throughout:

min/max vs centroids

I don't follow your point, but in any case the min values are not used and the max values are interpolated:
max[i-1] < newVal < max[i]
newRnk = rnk[i] - (rnk[i] - (rnk[i-1] + 1)) * (max[i] - newVal) / (max[i] - max[i-1])

I think this makes it much closer to t-Digest than to Greenwald-Khanna, since it only uses a single interpolated value: it interpolates the rank instead of the value.
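
For illustration, here is a minimal sketch of that rank interpolation (hypothetical names, not the actual library code; maxs holds each bucket's maximum and rnk the cumulative rank at that bucket's upper edge):

```typescript
// Sketch only: estimate the rank of newVal from per-bucket maxima and
// cumulative ranks, both sorted ascending.
function interpolatedRank(maxs: number[], rnk: number[], newVal: number): number {
  // find i such that maxs[i - 1] < newVal <= maxs[i]
  const i = maxs.findIndex(m => newVal <= m);
  if (i === 0) return 0;                          // at or below the first bucket's max
  if (i === -1) return rnk[rnk.length - 1] + 1;   // above every bucket's max
  const span = rnk[i] - (rnk[i - 1] + 1);         // ranks spanned by bucket i
  const frac = (maxs[i] - newVal) / (maxs[i] - maxs[i - 1]);
  return rnk[i] - span * frac;                    // linear interpolation on the value axis
}
```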

I also have a hunch that the memory is better used by increasing the number of buckets and overall resolution instead of keeping min values.

weighting

The relative target weighting of the buckets is fixed at the outset. Any distribution can be used; this is the current implementation:
w: target bucket relative weight
p = idx/(N-1): bucket relative position

Assuming that the bucket error is proportional to the bucket weight and that we want every bucket to have a similar relative error:
Constant RMS edge error = errRMS = sqrt( (w/p)^2 + (w/(1-p))^2 )
errRMS = w/(p(1-p)) * sqrt(1 - 2p(1-p))
w[i] = errRMS * (p(1-p)) / sqrt(1 - 2p(1-p))

w[i] ~= k p(1-p) (1 + 2p(1-p)) is a very close approximation using a Maclaurin series and is easy to integrate
Cumulative target weights = ( 15p^2 + 10p^3 - 30p^4 + 12p^5 ) / 7
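
As a rough sketch (hypothetical name, not the library's actual API), those cumulative targets can be precomputed once per size and shared across instances:

```typescript
// Cumulative relative target weights W(p) = (15p^2 + 10p^3 - 30p^4 + 12p^5) / 7,
// evaluated at p = i / (N - 1); W(0) = 0 and W(1) = 1.
function cumulativeTargets(nBuckets: number): number[] {
  const w: number[] = [];
  for (let i = 0; i < nBuckets; i++) {
    const p = i / (nBuckets - 1);
    w.push((15 * p ** 2 + 10 * p ** 3 - 30 * p ** 4 + 12 * p ** 5) / 7);
  }
  return w;
}
```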

This relative error approach is actually much closer to t-Digest than to Greenwald-Khanna; the latter uses absolute error instead.
I thought the t-Digest implementation paper used something like 1/w(1-w), but the MergingDigest uses a cumulative arcsin function that I think is the inverse of the above.
Again, it looks to me like it is very close to the t-Digest but with the value-rank axis reversed and the values precomputed by buckets.

insertion

Since the relative target bucket weights are fixed at the outset, there is relatively little cost to merging values and no buffer is required. Although the target weights need to be stored, they can be shared across many instances of the same size. The insertion works as follows (a rough sketch of the dispatch step follows the list):

  • for a new value v, find the index i such that v[i-1] < v <= v[i+1]
  • calculate the interpolated rank r for the new value
  • compare r with the precomputed target weights
  • if r/N > w[i+1] replace the bucket [i+1] and continue shifting right with the replaced bucket
  • if r/N < w[i] replace the bucket [i] and continue shifting left with the replaced bucket
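
A minimal sketch of that dispatch step only (hypothetical names; the continued shifting of the displaced bucket and the rank bookkeeping are elided here):

```typescript
// Simplified sketch: decide whether the new value stays put or displaces a
// neighbouring bucket, based on the fixed cumulative targets.
function insertOne(maxs: number[], rnk: number[], targets: number[], v: number): void {
  const n = rnk[rnk.length - 1] + 1;               // total count after this insertion
  let i = maxs.findIndex(m => v <= m);             // neighbourhood of v
  if (i === -1) i = maxs.length - 1;               // v is above every bucket's max
  i = Math.min(Math.max(i, 1), maxs.length - 2);   // clamp so i-1 and i+1 exist
  const r = interpolatedRank(maxs, rnk, v);        // as in the earlier sketch

  if (r / n > targets[i + 1]) {
    // rank is past bucket i+1's target: v takes that slot and the displaced
    // value would keep shifting to the right
    [maxs[i + 1], v] = [v, maxs[i + 1]];
  } else if (r / n < targets[i]) {
    // symmetric case: swap into bucket i and keep shifting left
    [maxs[i], v] = [v, maxs[i]];
  }
  // the counts in rnk for the affected buckets would be updated here
}
```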

I guess I could take some of the unit-test cases and check where/when the error profile passes or fails.


hville commented on June 5, 2024

I'm closing this since it is not an issue and was simply opened to share ideas.


tdunning commented on June 5, 2024

One last comment is that inverting the axes might be a very clever thought.
I am not clear whether that is entirely true, but it is worth following up
on at some point.

The point is that a rational approximation for sin(k) is super easy while
asin(q) is harder.
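
For context, a small sketch of what that inversion looks like, assuming the k_1 scale function from the t-digest paper, k(q) = d/(2*pi) * asin(2q - 1), with d the compression parameter:

```typescript
// Forward direction (rank -> scale) needs asin; the inverse (scale -> rank)
// only needs sin, which has cheap rational approximations.
function kOfQ(q: number, d: number): number {
  return (d / (2 * Math.PI)) * Math.asin(2 * q - 1);
}
function qOfK(k: number, d: number): number {
  return (Math.sin((2 * Math.PI * k) / d) + 1) / 2;
}
```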

