Comments (4)
I haven't looked at this enough yet, but here are some thoughts:
- I recently looked into keeping min/max for each centroid/bucket in t-digest itself. I convinced myself that the accuracy would be lower because during early adaptation, or when merging digests, you wind up with the ideal square distribution of samples with some small leaky tails. That means that min and max are not really good estimators of how wide the general mass of the bucket really is.
- Centroids are really clean and simple to update. The issue above makes most other measures harder.
- Have you looked at the MergingDigest work? It gives you completely static memory allocation and no tree. The addition of new samples is dead simple. CDF and quantile computation is nearly the same as for tree digests.
- Can you say how your insertion algorithm is different from t-digest and different from Greenwald-Khanna's algorithm (paper attached)?
from t-digest.
Thanks for your time and the pointers since I am not familiar with the literature.
I have read the Greenwald-Khanna paper and the MergingDigest code and I think that what I am doing is very close to the MergingDigest idea, but more naive and with the value-rank axis inverted throughout:
min/max vs centroids
I don't follow your point, but in any case the min values are not used and the max values are interpolated:
max[i-1] < newVal < max[i]
newRnk = rnk[i] - (rnk[i] - (rnk[i-1] + 1)) * (max[i] - newVal) / (max[i] - max[i-1])
I think this makes it much closer to t-Digest than to Greenwald-Khanna, since it only uses a single interpolated value. It interpolates the rank instead of the value.
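For concreteness, here is a small sketch of that rank interpolation. This is only illustrative, not the actual implementation; `maxes`, `ranks`, and the edge-case handling are assumptions:

```python
import bisect

def interpolated_rank(maxes, ranks, new_val):
    """Estimate the rank of new_val by linear interpolation on the
    per-bucket max values, i.e. for max[i-1] < newVal <= max[i]:
    newRnk = rnk[i] - (rnk[i] - (rnk[i-1] + 1))
                    * (max[i] - newVal) / (max[i] - max[i-1])
    """
    i = bisect.bisect_left(maxes, new_val)  # maxes[i-1] < new_val <= maxes[i]
    if i == 0:
        return 1           # at or below the first bucket's max: lowest rank
    if i == len(maxes):
        return ranks[-1]   # above every max: clamp to the highest rank
    frac = (maxes[i] - new_val) / (maxes[i] - maxes[i - 1])
    return ranks[i] - (ranks[i] - (ranks[i - 1] + 1)) * frac

interpolated_rank([10, 20, 30], [5, 10, 15], 15)  # -> 8.0
```

With bucket maxes `[10, 20, 30]` and cumulative ranks `[5, 10, 15]`, a new value of 15 sits halfway between max[0] and max[1], so its rank interpolates halfway between rank 6 and rank 10.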
I also have a hunch that data is better used by increasing the number of buckets and overall resolution instead of keeping min values.
weighting
The relative target weighting of buckets is fixed at the onset. Any distribution can be used and this is the current implementation:
w : target bucket relative weight
p = idx/(N-1) : bucket relative position
Assuming that the bucket error is proportional to the bucket weight, and that we want every bucket to have a similar relative error:
Constant RMS edge error = errRMS = sqrt( (w/p)^2 + (w/(1-p))^2 )
errRMS = w/(p(1-p)) * sqrt(1 - 2p(1-p))
w[i] = errRMS * (p(1-p)) / sqrt(1 - 2p(1-p))
w[i] ~= k p(1-p) (1 + 2p(1-p))
is a very close approximation (via a Maclaurin series) and is easy to integrate:
Cumulative target weights = ( 15p^2 + 10p^3 - 30p^4 + 12p^5 ) / 7
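A quick numeric sanity check (illustrative Python, not part of any implementation) that integrating the approximate weight w(p) = p(1-p)(1 + 2p(1-p)) and normalizing its mass to 1 reproduces the quintic above:

```python
# w(p): the approximate per-bucket target weight (up to the constant k)
def w(p):
    return p * (1 - p) * (1 + 2 * p * (1 - p))

# Closed-form cumulative target weight from the text
def cumulative_closed_form(p):
    return (15 * p**2 + 10 * p**3 - 30 * p**4 + 12 * p**5) / 7

# Trapezoidal integration of w over [0, p], normalized by the total
# mass over [0, 1], which works out to 7/30 for this w
def cumulative_numeric(p, steps=100_000):
    h = p / steps
    integral = sum(w(i * h) for i in range(1, steps)) * h
    integral += (w(0) + w(p)) / 2 * h
    return integral / (7 / 30)

for p in (0.1, 0.25, 0.5, 0.9):
    assert abs(cumulative_numeric(p) - cumulative_closed_form(p)) < 1e-6
```

Note that the closed form evaluates to exactly 1 at p = 1 (15 + 10 - 30 + 12 = 7), confirming the normalization.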
This relative error approach is actually much closer to t-Digest than to Greenwald-Khanna; the latter uses absolute error instead.
I thought the t-Digest implementation paper used something like 1/w(1-w), but the MergingDigest uses a cumulative arcsin function, which I think is the inverse of the above.
Again, it looks to me like it is very close to the t-Digest but with the value-rank axis reversed and the values precomputed by buckets.
insertion
Since the relative target bucket weights are fixed at the onset, there is relatively little cost to merging values and no buffer is required. Although the target weights need to be stored, they can be shared across many instances of the same size.
- for a new value v, find the bucket i with v[i-1] < v <= v[i+1]
- calculate the interpolated rank r for the new value
- compare r with the precomputed target weights
- if r/N > w[i+1], replace the bucket [i+1] and continue shifting right with the replaced bucket
- if r/N < w[i], replace the bucket [i] and continue shifting left with the replaced bucket
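The "replace and continue shifting" step can be pictured as a cascading replacement. The fragment below is only an illustrative sketch: the rank bookkeeping and the r/N-versus-target comparison that choose the start index and the direction are assumed to have happened already, and `vals` is a hypothetical fixed-size sorted array of bucket values:

```python
def cascade_replace(vals, i, v, direction):
    """Put v into slot i; each displaced value moves one slot in
    `direction` (+1 = shift right, -1 = shift left) until the value
    at the far end falls off the array."""
    carry = v
    j = i
    while 0 <= j < len(vals):
        vals[j], carry = carry, vals[j]
        j += direction
    return vals

# shifting right from slot 1 (last value drops off):
cascade_replace([1, 2, 3, 4], 1, 1.5, +1)  # -> [1, 1.5, 2, 3]
# shifting left from slot 2 (first value drops off):
cascade_replace([1, 2, 3, 4], 2, 3.5, -1)  # -> [2, 3, 3.5, 4]
```

Because the displaced value always moves toward the array edge, the structure stays a fixed size with no buffer, matching the static-allocation property described above.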
I guess I could take some of the unit-test cases and check where/when the error profile passes or fails.
I'm closing this since it is not an issue and was simply open to share ideas.
One last comment is that inverting the axes might be a very clever thought.
I am not clear whether that is entirely true, but it is worth following up
on at some point.
The point is that a rational approximation for sin(k) is super easy while
asin(q) is harder.
On Tue, Sep 20, 2016 at 1:45 AM, hugo [email protected] wrote: