Giter Site home page Giter Site logo

Comments (4)

tdunning avatar tdunning commented on May 25, 2024

Thanks for your excellent comments.

I am re-running the distribution tests to see what has happened with the new algorithms, but I don't have answers on that yet.

The quadratic equation was just a hold-over from a previous world and was very wrong in the current world. It is now gone. The merging algorithm has hard bounds on size that are basically just 2 * compression.

Also, if you check out the latest master version of the paper, you should have a decent description of the algorithm. It is vastly simpler than before.

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

No progress on this, but there have been a number of changes that could improve behavior.

The distribution of values in a bin is critical to understand since a good characterization of this could improve accuracy, possibly substantially.

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

Tommy,

I just wanted to thank you for your observation. I have been working on the quality assessment in anticipation of the 4.0 release and the issue you have highlighted is a big deal that compromises tail accuracy significantly.

This can be seen quite dramatically by first using sorted test data. Here are the data points for the first few centroids of a typical run:

image

That samples that contribute to each centroid have a nice uniform distribution and the centroids have fewer and fewer samples as we get to the left end. All is as we would want.

On the other hand, if you look at what happens with randomly ordered data (but still with a really big sort buffer):

image

With a more rational-sized sort buffer, the tails of each centroid become elongated:

image

Even worse than just being elongated, these clusters are asymmetrical.

I am investigating merge strategies that might avoid this problem.

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

Progress report on this.

The issue that Tommy has highlighted is actually quite serious and has a bad effect on accuracy. In investigating this, I have learned a lot about different size limiting strategies as well. I will describe the side learning elsewhere, but the money graph is this one.

image

It is clear that the update strategy that is used in the tree digest (attach points to nearest cluster if space available) is doing a much better job and I attribute the improvement to the fact that the update strategy isn't biased. In the merging digest the updates are always done by a linear scan and that makes new data get added to centroids mostly from one side (either the left side, or on the outer side if we alternate scan direction). This causes the smearing seen in these graphs. The improvement in the graph above was had by changing from sqrt(q(1-q)) as a size limit to q(1-q). There are some normalizing factors in there, but the effect is that clusters near the limits are kept smaller for longer, hopefully limiting the effect of smear. This makes errors about 5x smaller for extreme q, but doesn't really solve the core problem yet.

Improving this situation and making it easier to run tests like this is a major priority for me right now. Of course, I have only minutes per week to work on this so progress is lamentably slow.

from t-digest.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.