Giter Site home page Giter Site logo

Comments (8)

CamDavidsonPilon avatar CamDavidsonPilon commented on May 21, 2024

Looks strange - are you able to share the data?

from tdigest.

idanmoradarthas avatar idanmoradarthas commented on May 21, 2024

It attached in the second line

or link:

https://github.com/CamDavidsonPilon/tdigest/files/2613099/my_set.zip

from tdigest.

phucbui95 avatar phucbui95 commented on May 21, 2024

Hi, i have a same issue here.

t.batch_update([1649.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0])
t.percentile(25)```
result was
`-523.5`

from tdigest.

rlele5 avatar rlele5 commented on May 21, 2024

Hi, I noticed this too. Thanks for everything you've done! In the cases it does work, it seems to be doing a great job! However, with this simple example posted above I'm now a little worried. Do you know when this error case occurs? Is this something inherent to TDigests? When should we watch out for something like this happening?

from tdigest.

rlele5 avatar rlele5 commented on May 21, 2024

@CamDavidsonPilon , @idanmoradarthas , @phucbui95 - I found where this regression was introduced:

commit 1f76113e3fce0d5b7189fd083085ba64d4129e05 (HEAD)
Author: Cameron Davidson-Pilon <[email protected]>
Date:   Tue Dec 26 23:04:33 2017 -0500

    bump to v0.5, drop python2 support

The commit before returns the correct results:

Author: Cameron Davidson-Pilon <[email protected]>
Date:   Mon Feb 20 11:22:54 2017 -0500

    test against 3.6

Fix: roll back the implementation of def percentile(self, p) back to the implementation in 9cd2536. @CamDavidsonPilon , are you able to explain this further? I still have to run your test cases to confirm, but this seems promising from some manual tests.


Examples:
Code:

from tdigest import TDigest
t = TDigest()
t.batch_update([1649.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0, 69.0])

for percent in [0, 25, 50, 75, 100]:
    value = t.percentile(percent)
    print(f"p{percent} = {value}")

Output with 1f76113 (broken):

p0 = 69.0
p25 = -1906.0
p50 = -1116.0
p75 = -326.0
p100 = 464.0

Output with 9cd2536 (working):

p0 = 69.0
p25 = 69.0
p50 = 69.0
p75 = 69.0
p100 = 1649.0

from tdigest.

rlele5 avatar rlele5 commented on May 21, 2024

Addendum - this may not be as simple as it seems, but may be a step in the right direction. I further compared a test case (bigJump) in the reference implementation (https://github.com/tdunning/t-digest/blob/ff3232bc25a69961fc7bf4911f8de0026bd28c44/core/src/test/java/com/tdunning/math/stats/TDigestTest.java)

Here is the java test - notice the assertions, then compare with the values with the working python implementation (9cd2536):

        TDigest digest = factory(100).create();
        for (int i = 1; i < 20; i++) {
            digest.add(i);
        }
        digest.add(1_000_000);

        assertEquals(18, digest.quantile(0.89999999), 0);
        assertEquals(19, digest.quantile(0.9), 0);
        assertEquals(19, digest.quantile(0.949999999), 0);
        assertEquals(1_000_000, digest.quantile(0.95), 0);

Working python implementation results (9cd2536):
Code:

from tdigest import TDigest

l = [i for i in range(20)]
l.append(1000000)
print(l)
t = TDigest()
t.batch_update(l)

for percent in [89.999999, 90, 94.9999999, 95,]:
    value = t.percentile(percent)
    print(f"p{percent} = {value}")

Output:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1000000]
p89.999999 = 18.39999979
p90 = 18.400000000000002
p94.9999999 = 225014.93950018956
p95 = 225014.94999999963

Notice that the p94.9999999 and p95 results are still quite different from the reference implementation. This will probably only happen in extreme scenarios, but I thought I'd point it out. Perhaps the reference implementation has some other conditions that handle these extremities. I don't think the python implementation is wrong, but this just may be one of the accuracy tradeoffs made with TDigest (just a guess).

from tdigest.

CamDavidsonPilon avatar CamDavidsonPilon commented on May 21, 2024

@rlele5 unfortunately I'm not involved much in this repo anymore - if you suggest a fix, with an appropriate test, I'd be happy to review and merge it.

from tdigest.

rlele5 avatar rlele5 commented on May 21, 2024

@CamDavidsonPilon , sounds good. I'll have to go back and understand why the change was made in the first place and will see if there are any further changes needed.

from tdigest.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.