Approximate Mutual Information (AMI)

A compression-based measure of mutual information between strings.

Theory

Data Compression

There is a close relationship between the entropy of a string and the extent to which it can be losslessly compressed. According to Shannon's source-coding theorem, no lossless code can, on average, compress a string below its entropy. Compression schemes therefore give us an upper bound on the entropy of a string.
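As a quick illustration (a minimal sketch; the example strings are arbitrary), a highly repetitive string compresses far below its raw length, while a random string of the same length barely compresses at all:

import os
import zlib

low_entropy = b"ab" * 500        # 1,000 bytes, highly repetitive
high_entropy = os.urandom(1000)  # 1,000 bytes, near-maximal entropy

# Each compressed length upper-bounds the entropy of its string: the
# repetitive string shrinks to a few dozen bytes, while the random one
# stays at (or slightly above) its original length.
print(len(zlib.compress(low_entropy)))
print(len(zlib.compress(high_entropy)))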

Mutual Information

The mutual information between two random variables, denoted $I(X\ ;\ Y)$, measures how much information $X$ provides about $Y$ and vice versa. It is a symmetric measure given by

$$I(X\ ;\ Y) = H(X) - H(X|Y) = I(Y\ ;\ X) = H(Y) - H(Y|X)$$

where $H(\cdot)$ is the entropy.

It tells us how much our uncertainty about $X$ is reduced after observing $Y$, and vice versa. Since lossless compression gives an upper bound on the entropy of a string, we can also use it to approximate the mutual information between two strings. Let $L_C(x)$ denote the length of string $x$ when losslessly encoded using code $C(\cdot)$, and let $L_C(x|y)$ denote the length of $x$ when encoded after observing $y$. Note that while $H(X|Y) \le H(X)$ always holds, the inequality $L_C(x|y) \le L_C(x)$ may not. As such, we clamp $L_C(x|y)$ to at most $L_C(x)$.

To compute $L_C(x|y)$, we use the fact that $H(X|Y) = H(X, Y) - H(Y)$, giving the approximation $L_C(x|y) \approx L_C(y||x) - L_C(y)$, where $||$ denotes string concatenation. Hence, the mutual information between $x$ and $y$ can be approximated as

$$I(x\ ;\ y) \approx L_C(x) - L_C(x|y) \approx L_C(x) - L_C(y||x) + L_C(y)$$

Note that we may choose different codes for encoding $x$ and $x|y$. If a specialised conditional code for $x|y$ is available, it is preferable to the approximation $L_C(y||x) - L_C(y)$. Hence, the more general form of the equation is

$$I(x\ ;\ y) \approx L_x(x) - L_{x|y}(x|y)$$

in which separate codes are used.
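For example, zlib's preset-dictionary feature can act as a crude conditional code: priming the compressor with $y$ lets it reference substrings of $y$ while encoding $x$. The sketch below assumes this approach (conditional_length is an illustrative helper, not part of the implementation that follows):

import zlib

def conditional_length(x: bytes, y: bytes) -> int:
    """Length of x when encoded with y as a zlib preset dictionary."""
    # zdict primes the compressor's window with y (only its last 32 KB
    # are used), so matches against y are cheap when encoding x.
    compressor = zlib.compressobj(zdict=y)
    return len(compressor.compress(x) + compressor.flush())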

Implementation

Here's an example implementation of AMI in Python, using zlib as the encoder. Since zlib.compress operates on bytes, the strings are UTF-8 encoded first.

import zlib


def ami(x: str, y: str) -> float:
    """
    Returns the normalized approximate mutual information between
    strings x and y.
    """
    # zlib.compress operates on bytes, so encode the strings first.
    xb = x.encode("utf-8")
    yb = y.encode("utf-8")

    lx = len(zlib.compress(xb))        # L(x)
    ly = len(zlib.compress(yb))        # L(y)
    lyx = len(zlib.compress(yb + xb))  # L(y||x)
    lx_y = min(lx, max(0, lyx - ly))   # clamp: 0 <= L(x|y) <= L(x)
    ixy = lx - lx_y                    # I(x;y) ~= L(x) - L(x|y)

    return ixy / lx                    # normalize by L(x)
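For example (arbitrary test strings; with very short inputs, zlib's fixed header overhead dominates and the scores become noisy):

text = "the quick brown fox jumps over the lazy dog " * 20
print(ami(text, text))          # close to 1: identical strings
print(ami(text, "zzz " * 200))  # close to 0: unrelated strings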

Uses

AMI measures the amount of information shared between strings. If we sample from two independent random variables, the respective samples should share approximately no information, and their normalized AMI score will be very close to zero. In contrast, two identical strings should have a score close to 1, indicating that they're highly correlated. Scores between 0 and 1 therefore indicate the degree of correlation between strings.

Unlike some other measures, this correlation need not be linear, or even numerical in nature. In principle, any statistical regularity can be captured simply by changing the compression scheme. Zlib's standard encoder is a reasonable choice due to its efficiency and ability to significantly compress a broad range of file types. However, a priori knowledge about the strings in question can be used to inform the choice of encoder.
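One way to make that choice explicit is to parameterise the measure by the compressor (a sketch; ami_with is an illustrative name generalising the ami function above):

import bz2
import zlib
from typing import Callable

def ami_with(x: str, y: str, compress: Callable[[bytes], bytes]) -> float:
    """Normalized AMI under an arbitrary lossless compressor."""
    xb, yb = x.encode("utf-8"), y.encode("utf-8")
    lx = len(compress(xb))
    ly = len(compress(yb))
    lx_y = min(lx, max(0, len(compress(yb + xb)) - ly))
    return (lx - lx_y) / lx

# The same measure under two different encoders:
score_zlib = ami_with("abc" * 100, "abc" * 100, zlib.compress)
score_bz2 = ami_with("abc" * 100, "abc" * 100, bz2.compress)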

One use of AMI is document classification. For example, the distribution of French-language strings is sufficiently different from the distribution of Spanish-language strings that even a simple encoder such as zlib can accurately distinguish between them. For distributions that are more similar, a more specialised encoder may be required.
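As a sketch of this idea (using the ami function above; the inline corpora are toy placeholders, and real use would require long reference samples of each language):

def classify(text: str, corpora: dict[str, str]) -> str:
    """Label text with the reference corpus it shares the most information with."""
    return max(corpora, key=lambda label: ami(text, corpora[label]))

# Toy reference corpora; in practice these should be much longer.
corpora = {
    "french": "bonjour le monde, comment allez-vous aujourd'hui",
    "spanish": "hola mundo, como estas hoy por la manana",
}
print(classify("bonjour tout le monde", corpora))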
