Comments (4)
Thanks for your excellent comments.
I am re-running the distribution tests to see what has happened with the new algorithms, but I don't have answers on that yet.
The quadratic equation was just a hold-over from a previous world and was very wrong in the current world. It is now gone. The merging algorithm has hard bounds on size that are basically just 2 * compression.
Also, if you check out the latest master version of the paper, you should have a decent description of the algorithm. It is vastly simpler than before.
from t-digest.
No progress on this, but there have been a number of changes that could improve behavior.
The distribution of values in a bin is critical to understand since a good characterization of this could improve accuracy, possibly substantially.
from t-digest.
Tommy,
I just wanted to thank you for your observation. I have been working on the quality assessment in anticipation of the 4.0 release and the issue you have highlighted is a big deal that compromises tail accuracy significantly.
This can be seen quite dramatically by first using sorted test data. Here are the data points for the first few centroids of a typical run:
That samples that contribute to each centroid have a nice uniform distribution and the centroids have fewer and fewer samples as we get to the left end. All is as we would want.
On the other hand, if you look at what happens with randomly ordered data (but still with a really big sort buffer):
With a more rational-sized sort buffer, the tails of each centroid become elongated:
Even worse than just being elongated, these clusters are asymmetrical.
I am investigating merge strategies that might avoid this problem.
from t-digest.
Progress report on this.
The issue that Tommy has highlighted is actually quite serious and has a bad effect on accuracy. In investigating this, I have learned a lot about different size limiting strategies as well. I will describe the side learning elsewhere, but the money graph is this one.
It is clear that the update strategy that is used in the tree digest (attach points to nearest cluster if space available) is doing a much better job and I attribute the improvement to the fact that the update strategy isn't biased. In the merging digest the updates are always done by a linear scan and that makes new data get added to centroids mostly from one side (either the left side, or on the outer side if we alternate scan direction). This causes the smearing seen in these graphs. The improvement in the graph above was had by changing from sqrt(q(1-q)) as a size limit to q(1-q). There are some normalizing factors in there, but the effect is that clusters near the limits are kept smaller for longer, hopefully limiting the effect of smear. This makes errors about 5x smaller for extreme q, but doesn't really solve the core problem yet.
Improving this situation and making it easier to run tests like this is a major priority for me right now. Of course, I have only minutes per week to work on this so progress is lamentably slow.
from t-digest.
Related Issues (20)
- Mergeability of t-digest HOT 3
- Allow AVLTreeDigest's to be identical to another given the same set of inputs HOT 1
- Release notes for 3.3? HOT 1
- Will merging multiple t-digest preserve the exact value of min/max? HOT 3
- Behavior when compression ratio is 1 HOT 1
- TDigest objet serializable HOT 1
- tag missing problem HOT 2
- Decay TDigest HOT 4
- T-Digest (Re)Construction
- Merge implementation of MergingDigest HOT 2
- Question on quantile calculation logic HOT 3
- -deleted- HOT 1
- Add support for double weights HOT 2
- how to implement sliding windows quantile? HOT 1
- Determining quality HOT 3
- OpenTelemetry, Summaries and TDigests HOT 5
- Have `TDigest` implement `Consumer` HOT 1
- New release? HOT 3
- AssertionError if weight > 1 HOT 3
- Modifying T-digest that handle deletion HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from t-digest.