Comments (8)
Looks strange - are you able to share the data?
from tdigest.
It attached in the second line
or link:
https://github.com/CamDavidsonPilon/tdigest/files/2613099/my_set.zip
from tdigest.
Hi, i have a same issue here.
t.batch_update([1649.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0])
t.percentile(25)```
result was
`-523.5`
from tdigest.
Hi, I noticed this too. Thanks for everything you've done! In the cases it does work, it seems to be doing a great job! However, with this simple example posted above I'm now a little worried. Do you know when this error case occurs? Is this something inherent to TDigests? When should we watch out for something like this happening?
from tdigest.
@CamDavidsonPilon , @idanmoradarthas , @phucbui95 - I found where this regression was introduced:
commit 1f76113e3fce0d5b7189fd083085ba64d4129e05 (HEAD)
Author: Cameron Davidson-Pilon <[email protected]>
Date: Tue Dec 26 23:04:33 2017 -0500
bump to v0.5, drop python2 support
The commit before returns the correct results:
Author: Cameron Davidson-Pilon <[email protected]>
Date: Mon Feb 20 11:22:54 2017 -0500
test against 3.6
Fix: roll back the implementation of def percentile(self, p) back to the implementation in 9cd2536. @CamDavidsonPilon , are you able to explain this further? I still have to run your test cases to confirm, but this seems promising from some manual tests.
Examples:
Code:
from tdigest import TDigest
t = TDigest()
t.batch_update([1649.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0])
for percent in [0, 25, 50, 75, 100]:
value = t.percentile(percent)
print(f"p{percent} = {value}")
Output with 1f76113 (broken):
p0 = 69.0
p25 = -1906.0
p50 = -1116.0
p75 = -326.0
p100 = 464.0
Output with 9cd2536 (working):
p0 = 69.0
p25 = 69.0
p50 = 69.0
p75 = 69.0
p100 = 1649.0
from tdigest.
Addendum - this may not be as simple as it seems, but may be a step in the right direction. I further compared a test case (bigJump) in the reference implementation (https://github.com/tdunning/t-digest/blob/ff3232bc25a69961fc7bf4911f8de0026bd28c44/core/src/test/java/com/tdunning/math/stats/TDigestTest.java)
Here is the java test - notice the assertions, then compare with the values with the working python implementation (9cd2536):
TDigest digest = factory(100).create();
for (int i = 1; i < 20; i++) {
digest.add(i);
}
digest.add(1_000_000);
assertEquals(18, digest.quantile(0.89999999), 0);
assertEquals(19, digest.quantile(0.9), 0);
assertEquals(19, digest.quantile(0.949999999), 0);
assertEquals(1_000_000, digest.quantile(0.95), 0);
Working python implementation results (9cd2536):
Code:
from tdigest import TDigest
l = [i for i in range(20)]
l.append(1000000)
print(l)
t = TDigest()
t.batch_update(l)
for percent in [89.999999, 90, 94.9999999, 95,]:
value = t.percentile(percent)
print(f"p{percent} = {value}")
Output:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1000000]
p89.999999 = 18.39999979
p90 = 18.400000000000002
p94.9999999 = 225014.93950018956
p95 = 225014.94999999963
Notice that the p94.9999999 and p95 results are still quite different from the reference implementation. This will probably only happen in extreme scenarios, but I thought I'd point it out. Perhaps the reference implementation has some other conditions that handle these extremities. I don't think the python implementation is wrong, but this just may be one of the accuracy tradeoffs made with TDigest (just a guess).
from tdigest.
@rlele5 unfortunately I'm not involved much in this repo anymore - if you suggest a fix, with an appropriate test, I'd be happy to review and merge it.
from tdigest.
@CamDavidsonPilon , sounds good. I'll have to go back and understand why the change was made in the first place and will see if there are any further changes needed.
from tdigest.
Related Issues (20)
- Sometimes a test fails (test_uniform)
- tdigest.__version__ returns wrong version
- tdigest fails to serialize in spark
- Horribly slow on PyPy HOT 2
- Negative trimmed mean
- Updates to algorithm in latest paper [help needed] HOT 3
- wheel incorrectly contains `.pyc` files HOT 3
- Not able to install python package HOT 1
- CDF benchmark HOT 1
- well done HOT 1
- Is the code still maintained? HOT 1
- python2.7 ypeError: unbound method _cython_3_0_0a9.cython_function_or_method object must be called with RBTree instance as first argument (got AccumulationTree instance instead)
- pyspark example misses import
- Corner case where values are identical? HOT 1
- Best way to serialize and load HOT 1
- Scale function
- Tag request for git project: tdigest 0.5.2.2 HOT 1
- Conda Distribution
- Alternative faster t-digest implementation HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from tdigest.