Comments (8)
To the above formula for q
I just had an idea to use either left q
or right q
when computing k
, whichever gives the smallest q * (1 - q)
. This way two centroids on the opposite sides of the center would only get merged if each of them has enough capacity.
from t-digest.
Yes. Head sum is the critical issue.
What I would recommend is that you take a look at the MergingArray implementation. No need for trees. Very fast insertion.
The algorithm is even simpler than the tree versions.
The idea is that you simply accumulate new data into an array. When the array fills, you sort it and merge with the existing t-digest centroids. Because the merge transits the merged data in order, all of the head sum stuff is handled trivially. This also allows an incremental improvement suggested by otmar that we can get a very precise bound on whether the new data and the existing centroids should merge so that there is a finite bound on the size of the entire data structure. As a result, we can also avoid all dynamic allocation while the algorithm is running while preserving accuracy as before.
So far, I have only been able to match the performance of the existing tree based algorithm (the AVL-tree), but I really think that we could beat it significantly. Right now, the cost is about 200ns per new data point. I think that can drop noticeably.
from t-digest.
Ah! So sorry, somehow I missed MergingDigest
class among all other classes, and it looks a lot like what I had in mind, except the condition for merging is different. Description of integratedLocation
is very helpful, thank you very much!
from t-digest.
Hmm. I tried using merging condition of ∆integratedLocation <= 1
like in MergingDigest
, it causes a much smaller number of centroids in the end, and there are other differences like edge centroids no longer stay at count 1, etc. However this: https://github.com/snaury/t-digest-simple/blob/master/merging_tdigest.hpp (sorry for C++, I'm not as comfortable with Java) gives a similar number of final centroids to TreeDigest
and is much faster (looks like asin
is a little slower than simple multiplications).
from t-digest.
Yes, the integrated location approach does give fewer centroids. In fact,
it gives a bounded number while the other criteria tend to give an
unbounded O(log(n)) result.
Bounds are good.
On Thu, Oct 22, 2015 at 4:29 PM, Alexey Borzenkov [email protected]
wrote:
Hmm. I tried using merging condition of ∆integratedLocation <= 1 like in
MergingDigest, it causes a much smaller number of centroids in the end,
but accuracy is not as nice, and there are other differences like edge
centroids no longer stay at count 1, etc. However this:
https://github.com/snaury/t-digest-simple/blob/master/merging_tdigest.hpp
(sorry for C++, I'm not as comfortable with Java) gives a similar number of
final centroids to TreeDigest and is much faster (looks like asin is a
little slower than simple multiplications).—
Reply to this email directly or view it on GitHub
#61 (comment).
from t-digest.
Is there further action required here?
Did you get a fully functional cpp version going? Did you want to host it as part of this main repo?
from t-digest.
No, I think MergingDigest
in Java is already good, and although what I proposed is a little faster in practice it comes at the expense of memory. If I noticed MergingDigest
from the start I would probably not even created this issue.
As for C++ implementations I've done several and I feel like once you understand the algorithm it becomes very easy to reimplement and adapt it in a way suitable for the task at hand, while coding a generic implementation worth making it official would take too much work or make it very complex and hard to understand. There are a lot of different ways for computing quantiles and percentiles, for example, as well as choices of data types. I'm not sure I would be able to summon enough will to do the grunt work in my free time.
from t-digest.
OK.
Btw... I have some new ideas coming that could improve accuracy and
run-time noticeably.
On Sun, May 29, 2016 at 5:04 PM, Alexey Borzenkov [email protected]
wrote:
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#61 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AAPSepY8rRCLwyierL-F7ZRjXA5MbHocks5qGbj8gaJpZM4GT-8l
.
from t-digest.
Related Issues (20)
- Mergeability of t-digest HOT 3
- Allow AVLTreeDigest's to be identical to another given the same set of inputs HOT 1
- Release notes for 3.3? HOT 1
- Will merging multiple t-digest preserve the exact value of min/max? HOT 3
- Behavior when compression ratio is 1 HOT 1
- TDigest objet serializable HOT 1
- tag missing problem HOT 2
- Decay TDigest HOT 4
- T-Digest (Re)Construction
- Merge implementation of MergingDigest HOT 2
- Question on quantile calculation logic HOT 3
- -deleted- HOT 1
- Add support for double weights HOT 2
- how to implement sliding windows quantile? HOT 1
- Determining quality HOT 3
- OpenTelemetry, Summaries and TDigests HOT 5
- Have `TDigest` implement `Consumer` HOT 1
- New release? HOT 3
- AssertionError if weight > 1 HOT 3
- Modifying T-digest that handle deletion HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from t-digest.