Comments (3)
from t-digest.
The test that passes in 3.2 and fails in 3.3 is essentially a (seeded) uniform random distribution of 10000 doubles in the range [-10000.0, 10000.0]. We check abs((p_x-t_x)/p_x) < 0.005 for x in [75, 99, 99.9] (p_x is percentile, t_x is t-digest with compression 100). (The test also does the same checks on 4 non-overlapping subsets that in union equal the original distribution.)
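For anyone who wants to reproduce the check, here is a rough sketch of its shape. It is not the actual test: the class and method names are made up for illustration, the exact percentile uses simple linear interpolation between ranks (an assumption), and the estimator is passed in as a function of a quantile in [0, 1] so the sketch does not depend on the t-digest jar. In the real test the estimator would presumably be the digest's quantile lookup at compression 100.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Hypothetical helper, not part of t-digest or of the original test.
final class PercentileCheck {

    /** Exact percentile (0..100) of sorted data, interpolating linearly between ranks. */
    static double exactPercentile(double[] sorted, double pct) {
        double rank = pct / 100.0 * (sorted.length - 1);
        int lo = (int) Math.floor(rank);
        int hi = (int) Math.ceil(rank);
        return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
    }

    /** The quantity from the comment above: abs((p_x - t_x) / p_x). */
    static double relativeError(double exact, double estimate) {
        return Math.abs((exact - estimate) / exact);
    }

    /** Seeded uniform sample; nextDouble() is in [0, 1), so max itself is never drawn. */
    static double[] uniformSample(long seed, int n, double min, double max) {
        Random rnd = new Random(seed);
        double[] data = new double[n];
        for (int i = 0; i < n; i++) {
            data[i] = min + rnd.nextDouble() * (max - min);
        }
        return data;
    }

    /** True if every requested percentile is within the relative tolerance. */
    static boolean withinTolerance(double[] data, DoubleUnaryOperator estimator,
                                   double[] percentiles, double tolerance) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        for (double pct : percentiles) {
            double exact = exactPercentile(sorted, pct);
            double estimate = estimator.applyAsDouble(pct / 100.0);
            if (!(relativeError(exact, estimate) < tolerance)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[] data = uniformSample(42L, 10_000, -10_000.0, 10_000.0);
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        // Stand-in estimator: the exact percentile itself. Swap in a digest here.
        DoubleUnaryOperator standIn = q -> exactPercentile(sorted, q * 100.0);
        System.out.println(withinTolerance(data, standIn, new double[] {75, 99, 99.9}, 0.005));
    }
}
```

Note that the relative error blows up when the exact percentile is close to zero, which is not a concern for the 75th percentile and above on this range but would be for the median.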
Upon seeing these fail in 3.3, I added a single root-mean-square error calculation that accumulates the errors across this distribution, and re-seeded the test with 1000 distinct seeds. A bit hand-wavy, but in the end the calculation produces:
3.2 RMSE = 0.00090
3.3 RMSE = 0.00127
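To make the RMSE numbers a little less hand-wavy, the accumulation looks roughly like the sketch below. The class name is made up, and it assumes the accumulated error is the same relative error used in the per-percentile check; one instance would be shared across all seeds and percentiles.

```java
// Hypothetical accumulator, not part of t-digest.
final class RmseAccumulator {
    private double sumOfSquares;
    private long count;

    /** Records one (exact, estimate) pair as a squared relative error. */
    void add(double exact, double estimate) {
        double relative = (exact - estimate) / exact;
        sumOfSquares += relative * relative;
        count++;
    }

    /** Root-mean-square of all recorded relative errors, or NaN if none were recorded. */
    double rmse() {
        return count == 0 ? Double.NaN : Math.sqrt(sumOfSquares / count);
    }

    public static void main(String[] args) {
        RmseAccumulator acc = new RmseAccumulator();
        acc.add(100.0, 99.0);   // relative error of +1%
        acc.add(100.0, 101.0);  // relative error of -1%
        System.out.println(acc.rmse());
    }
}
```

Because the errors are squared before averaging, a few bad tail estimates dominate the result, which is worth keeping in mind when comparing the two releases.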
I suspect I might find upping the default value to 200 had the desired effect (same memory use, improved accuracy), so I will look into that.
I think another factor is I don't have a good idea of what "compression" is, or how to think about how developers may need to tweak compression from release to release. The javadocs say "100 is a common value for normal uses".
I would have expected a given compression X to be "equivalent" from release to release in one of these two dimensions:
- compression X achieves the same memory usage (and hopefully improved "quality")
- compression X achieves the same "quality" (and hopefully improved memory usage)
But it seems like compression is a more nuanced value? (i.e., based on your statements and my findings, X=100 uses less memory, but also gives less "quality" on 3.3, so it doesn't follow either of the two dimensions above.)
I wonder if it would be useful to have user-level abstractions that could be exposed as the two dimensions above?
I.e.,
    public static double qualityToCompression(double quality) {
        // maybe it's not a constant factor, but a more complex function
        return quality * QUALITY_TO_COMPRESS_FACTOR;
    }
    public static double memoryToCompression(double memory) {
        // maybe it's not a constant factor, but a more complex function
        return memory * MEMORY_TO_COMPRESS_FACTOR;
    }
Then user code would be able to lock in on the mode it is trying to optimize for:
TDigest tDigestForTesting = createDigest(qualityToCompression(100.0));
...
TDigest tDigestForProduction = createDigest(memoryToCompression(...));
...
This is a bit long-winded... but I very much appreciate your response, and will get back after I analyze the memory usage a bit more.