Comments (8)
These results are anomalous.
In my own tests, I add a million or so points to a t-digest and see
insertion times of about 200 ns. You are seeing times that are more than
100 times longer than that.
One question I have is whether you are initializing the latencies to
interesting values or if they are all set to zero. If you have identical
values, there might be a corner case that is causing an inefficient data
structure.
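If the array is currently all zeros, one quick way to populate it with varied values is a seeded random fill. This is my own sketch (the class and method names are made up for illustration), not code from the thread:

```java
import java.util.Random;

public class InitLatencies {
    // Fill the test array with varied values in [0, 100000) instead of
    // all zeros, avoiding any identical-value corner case.
    static double[] randomLatencies(int n, long seed) {
        return new Random(seed).doubles(n, 0.0, 100_000.0).toArray();
    }

    public static void main(String[] args) {
        double[] latencies = randomLatencies(1_000_000, 42);
        System.out.println("first = " + latencies[0]);
    }
}
```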
Have you tried this experiment with the MergingDigest?
To do this, use the one line:
TDigest tDigest = MergingDigest.create(100);
On Tue, Dec 15, 2015 at 12:19 PM, akalani [email protected] wrote:
I have been playing with the t-digest on a sample dataset of latencies and comparing performance against raw sorting to derive quantile statistics. I am using the standard APIs (and recommended configuration values) to build the t-digest; however, I am observing significantly worse times than raw sorting. Is this to be expected? I would appreciate any help in this regard. Here is my sample code:
long[] latencies = new long[1000000];
TDigest tDigest = TDigest.createDigest(100);
for (int i = 0; i < 1000000; i++) {
    tDigest.add(latencies[i]);
}
System.out.println("Median = " + tDigest.quantile(0.5));
I used a dataset with 1 million latency values (in ms). The max is 100,000.
It took about 27 secs to build the digest and extract the 50% quantile value. With a dataset of 2 million latency values, the computation time doubled. Raw sorting is under 100 ms. Reducing the compression factor to 5 helps, but computation time is still excessively high.
—
Reply to this email directly or view it on GitHub
#62.
from t-digest.
I initialized the latencies by randomizing and duplicating a sample set of Google ping latencies. The values are in milliseconds; the max (the timeout value) in the set is 100,000, though the max values are outliers and occur infrequently.
Could the outlier timeout values be causing performance degradation?
I have built the TDigest using the MergingDigest API, but see no improvement in performance.
The outliers don't cause any problems at all. Nor does the actual distribution cause any issues, except possibly when there are massive numbers of duplicates. The MergingDigest should not have any distributional issues.
Can you share your code? Your data?
Here is the test that I wrote:
http://pastebin.com/Rmue62ch
The Java class takes the path to the latencies file as an input. Here is an output that it generated for the latencies dataset with 1 million records (time is in ms, and memory size is in KB):
TDigest: time = 1218
TDigest: median = 44.43894320782193
TDigest: mem size = 21496
Raw: time = 54
Raw: median = 45
Raw: mem size = 8000016
The data file is attached.
latencies_1m.txt.zip
A minor correction: the memory size reported above is in bytes, not kilobytes.
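For what it's worth, the raw figure is consistent with simple arithmetic (mine, not from the thread): a million 8-byte longs plus a roughly 16-byte array header on a typical 64-bit JVM.

```java
public class RawArraySize {
    public static void main(String[] args) {
        long payload = 1_000_000L * Long.BYTES; // 8 bytes per long
        long header = 16;                       // object header + length field (JVM-dependent)
        System.out.println(payload + header);   // 8000016, matching the report above
    }
}
```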
OK. I find a few issues:
- you are passing in longs; the digest works in doubles. This costs a bit.
- your timing was confused by the prints. I changed the code to store the times into variables.
- there is some warmup to be had.
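The warmup point can be seen in isolation with a small harness (my own sketch, independent of t-digest): time the same work repeatedly and the first pass typically pays the JIT-compilation cost.

```java
import java.util.Arrays;
import java.util.Random;

public class WarmupDemo {
    // Time one sort of a copy of the data, in milliseconds.
    static long timeSortMillis(double[] data) {
        double[] copy = Arrays.copyOf(data, data.length);
        long t0 = System.currentTimeMillis();
        Arrays.sort(copy);
        return System.currentTimeMillis() - t0;
    }

    public static void main(String[] args) {
        double[] data = new Random(42).doubles(1_000_000).toArray();
        // The first pass is usually the slowest; later passes reflect
        // steady-state performance.
        for (int pass = 1; pass <= 5; pass++) {
            System.out.println("pass " + pass + ": " + timeSortMillis(data) + " ms");
        }
    }
}
```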
Here is my adapted code:
long start = System.currentTimeMillis();
double[] latencies = readLatenciesFile(args[0]);
long t1 = System.currentTimeMillis();
TDigest tDigest = MergingDigest.createDigest(200);
for (int i = 0; i < RECORD_COUNT; i++) {
tDigest.add(latencies[i]);
}
long t2 = System.currentTimeMillis();
Arrays.sort(latencies);
long t3 = System.currentTimeMillis();
System.out.println("Read time = " + (t1 - start));
System.out.println("TDigest: time = " + (t2 - t1));
System.out.println("TDigest: median = " + tDigest.quantile(0.5));
// System.out.println("TDigest: mem size = " + SizeEstimator.estimate(tDigest));
System.out.println("Raw: time = " + (t3 - t2));
System.out.println("Raw: median = " + latencies[latencies.length / 2]);
System.out.println();
After this, I get results like this on my laptop (note that these are with compression = 200):
Read time = 551
TDigest: time = 571
TDigest: median = 44.55266098760193
Raw: time = 302
Raw: median = 44.552607870527396
Read time = 204
TDigest: time = 325
TDigest: median = 44.54911853552305
Raw: time = 78
Raw: median = 44.549239966209704
Read time = 156
TDigest: time = 295
TDigest: median = 44.5561107724475
Raw: time = 83
Raw: median = 44.55575939279197
Read time = 136
TDigest: time = 297
TDigest: median = 44.552868775685916
Raw: time = 83
Raw: median = 44.55243261312767
Read time = 139
TDigest: time = 307
TDigest: median = 44.557862132539675
Raw: time = 82
Raw: median = 44.557593756576985
This means that the t-digest is about 4x slower than sorting the array, and the cost per element is about 300 ns, which isn't far from the result I had from micro-benchmarking. I think there is another 2-4x speedup available in the code from simple improvements, but I don't have time to do it, so these times will have to stand.
It makes some sense that a raw sort of a million items would be very fast, since the merging digest has to touch the data more often. If there are k elements in the insertion buffer and n in the retained buffer for the merging digest, the buffer sorting time will be N log k versus N log N (with similar constants) for the entire sort, where k = 32, log k = 5, and N = 10^6, log N = 20. Insertion also requires a pass over the retained buffer for each sort of the insertion buffer, and thus requires n N/k = 6 N touches. Given the optimization work done by the Java library writers on sorting, this gap isn't unreasonable.
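Plugging the stated numbers into that cost model (my arithmetic; n = 192 is an assumed retained-buffer size chosen so that n/k = 6 as in the comment):

```java
public class TouchCount {
    // Touches from merging: one pass over the retained buffer (n entries)
    // for each flush of the insertion buffer (N/k flushes in total).
    static long mergeTouches(long n, long N, long k) {
        return n * (N / k);
    }

    public static void main(String[] args) {
        long N = 1_000_000L, k = 32, n = 192;   // n/k = 6, per the comment
        long sortTouches = N * 5;               // ~ N log2(k), since log2(32) = 5
        System.out.println("buffer sorts ~ " + sortTouches + " touches");
        System.out.println("merge passes ~ " + mergeTouches(n, N, k)
                + " touches = " + (mergeTouches(n, N, k) / N) + "N");
    }
}
```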
Thanks very much for your help. I will keep the above in mind.
I just did some experiments (see the t-digests-benchmark repo). I think that your observations lead to some good speedups. I experimented with allocating bigger insertion buffers and found that due to the phenomenon that you noted (the built-in sort is fast), we can make some big improvements.
Here are the results:
Benchmark (compression) (factor) Mode Samples Mean Mean error Units
c.t.MergeBench.add 100 1 avgt 5 2747.205 71.506 ns/op
c.t.MergeBench.add 100 10 avgt 5 625.746 14.814 ns/op
c.t.MergeBench.add 100 100 avgt 5 416.151 14.514 ns/op
c.t.MergeBench.add 200 1 avgt 5 3137.047 97.135 ns/op
c.t.MergeBench.add 200 10 avgt 5 659.008 33.876 ns/op
c.t.MergeBench.add 200 100 avgt 5 422.534 8.808 ns/op
c.t.MergeBench.add 500 1 avgt 5 3843.771 137.530 ns/op
c.t.MergeBench.add 500 10 avgt 5 749.032 27.951 ns/op
c.t.MergeBench.add 500 100 avgt 5 442.522 14.167 ns/op
c.t.MergeBench.add 1000 1 avgt 5 5875.645 218.997 ns/op
c.t.MergeBench.add 1000 10 avgt 5 953.075 19.278 ns/op
c.t.MergeBench.add 1000 100 avgt 5 474.294 17.505 ns/op
Mean is the time you want to look at. Factor determines how much more space I am allocating. Allocating 100x more space makes a whacking big difference. I will look at this some more and decide if that is a reasonable thing to do. If so, good news for speed, especially for higher compression factors.
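The mechanism behind the factor column can be sketched without the library (a toy model of the buffering strategy, not MergingDigest's actual code): values accumulate in an insertion buffer, each full buffer is sorted with the fast built-in sort and then merged into the retained data, so a buffer that is factor times larger triggers factor times fewer merge passes.

```java
import java.util.Arrays;
import java.util.Random;

public class BufferAmortization {
    // Count how many merge passes inserting n values triggers.
    static long countMergePasses(int n, int bufferSize) {
        double[] buffer = new double[bufferSize];
        Random rand = new Random(42);
        int fill = 0;
        long mergePasses = 0;
        for (int i = 0; i < n; i++) {
            buffer[fill++] = rand.nextDouble();
            if (fill == bufferSize) {
                Arrays.sort(buffer, 0, fill); // cheap: highly optimized built-in sort
                mergePasses++;                // costly: one pass over retained data
                fill = 0;
            }
        }
        return mergePasses;
    }

    public static void main(String[] args) {
        System.out.println("factor 1:   " + countMergePasses(1_000_000, 32));
        System.out.println("factor 100: " + countMergePasses(1_000_000, 3200));
    }
}
```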