Giter Site home page Giter Site logo

Comments (5)

lemire avatar lemire commented on August 27, 2024

What is rand() in this context?

from simdcompressionandintersection.

marsupialtail avatar marsupialtail commented on August 27, 2024

let's say it is a random number with uniform or normal distribution from 0 to 10k

from simdcompressionandintersection.

searchivarius avatar searchivarius commented on August 27, 2024

@marsupialtail with most lightweight integer encoding schemes the lower bound is equal to the number of binary digits necessary to encode the number. However, this doesn't include the length of the code. So if the length is a random variable X, the mean of this variable is a lower bound for the average number of bits per code (if I don't get something wrong). It will be about log_2 10K / 2 for uniform. What is a normal distribution is not clear since we cannot have negative numbers for most encoding schemes.

There's also some overhead (to encode code length) that might be sometimes tricky to compute.

However, this number, is not an absolute theoretical minimum, because it depends on the encoding scheme and statistical properties of the data, e.g., on repetitiveness. If you have a single number repeated gazillion of times, little space is needed.

from simdcompressionandintersection.

lemire avatar lemire commented on August 27, 2024

Let me add that by information theoretical arguments, it is a impossible to consistently store sequences of random integers between 0 and N using less than log N bits per integer on average. If you beat this, either you do not use true random integers or you have made a mistake.

In effect, random numbers cannot be compressed… and that’s a tautology because that is one definition of random (cannot be compressed).

from simdcompressionandintersection.

searchivarius avatar searchivarius commented on August 27, 2024

Only if they are independent. If independence isn't there, an easy counter example:
generate a random number in [0, N] then repeat it a random number [1, M] times.
repeat this procedure many times.
Although the numbers are random, this procedure generates highly compressible sequences when the median number of repetitions is high.

from simdcompressionandintersection.

Related Issues (14)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.