Giter Site home page Giter Site logo

Add support for double weights about t-digest HOT 2 OPEN

EnricoMi avatar EnricoMi commented on September 21, 2024
Add support for double weights

from t-digest.

Comments (2)

tdunning avatar tdunning commented on September 21, 2024

Yeah... there has been a fair bit of discussion on this.

The core question is what does the t-digest invariant actually mean with non-integer weights.

Do you have thoughts on that?

The key problems in the past include:

  • violation of invariant has been allowed for centroids with weight = 1. If weights < 1 are allowed, what is the status of this exemption? Is a weight less than one still an indivisible value?
  • if the exemption is removed so that all centroids must meet the invariant, we will require an infinite number of centroids for K=2 or K=3. How could that be resolved?
  • if we assume that any centroid that is added represents a single sample with variable weight rather than the number of samples represented by the weight, then we could allow the exemption from the scale invariant for all samples <= 1. What happens if we merge two such centroids and the weight is still < 1. Do we have to remember that this new centroid has more than one sample?

So, what do you think?

from t-digest.

EnricoMi avatar EnricoMi commented on September 21, 2024

Centroids now have to maintain their cardinality (the number of samples). Then, the exemption can be done based on the cardinality, not the weight (in fact, weight used to be some kind of cardinality, with the assumption of unit weight).

With non-integer weights, the t-digest invariant ${|C|}_{k} = k (q_{right}) − k (q_{left}) ≤ 1$ requires ${q}_{left}$ and ${q}_{right}$ to be normalized by the sum of all weights, not the number of all samples (which used to be the same with unit weights):

$$q_{left} = W_{left}(C)/∑w$$

$$q_{right} = q_{left} + |C|/∑w$$

Then, the invariant should behave identical to equivalent integer weights.

For example, non-integer weighted samples (sample value, sample weight)

(1, 0.1), (2, 0.25), (2, 0.2)

with quantiles $({q}_{left}, {q}_{right})$ for clusters $\{(1, 0.1)\}, \{2, 0.45)\}$

(0, 0.1/0.55), (0.1/0.55, 1)

are equivalent to these integer-weighted samples:

(1, 2), (2, 9)

with quantiles $({q}_{left}, {q}_{right})$ for clusters $\{(1, 2)\}, \{(2, 9)\}$

(0, 2/11), (2/11, 1)

Difference is that a cluster $\{(1, 0.1)\}$ has cardinality 1, which is exempted from the invariant, while cluster $\{(1, 2)\}$ has cardinality 2, which is not exempted from the invariant.

from t-digest.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.