cbg-ethz / bmi
Mutual information estimators and benchmark
Home Page: https://cbg-ethz.github.io/bmi/
License: MIT License
Generated results:
This PR is, as proposed by @grfrederic, about updating naming conventions.
Tasks:
from ... import ... as bmm
rather than from ... import ... as fine
Add some tasks to the benchmark based on the fine distributions framework.
For example, tasks involving outliers and discrete-continuous distributions.
Note that some more thinking is needed about exactly which tasks to add.
it would be nice to organize our estimator tests, that is have a generic testing function:
test_estimator_on_task(estimator, task, n_samples, seed, abs_error, rel_error)
and then use it to build our tests:
Add conditional MI estimators and samplers.
Note that we have the chain rule: I(X; Y, Z) = I(X; Z) + I(X; Y | Z).
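For jointly Gaussian variables the MI has a closed form in terms of covariance determinants, so the chain rule I(X; Y, Z) = I(X; Z) + I(X; Y | Z) can be sanity-checked numerically. The snippet below is an illustration only, not bmi code:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
cov = A @ A.T + np.eye(3)  # random positive-definite covariance of (X, Y, Z)
x, y, z = [0], [1], [2]

def mi(cov, a, b):
    # For Gaussians: I(A; B) = 0.5 * log(det S_A * det S_B / det S_AB)
    s = lambda idx: np.linalg.det(cov[np.ix_(idx, idx)])
    return 0.5 * np.log(s(a) * s(b) / s(a + b))

def cmi(cov, a, b, c):
    # I(A; B | C) = 0.5 * log(det S_AC * det S_BC / (det S_C * det S_ABC))
    s = lambda idx: np.linalg.det(cov[np.ix_(idx, idx)])
    return 0.5 * np.log(s(a + c) * s(b + c) / (s(c) * s(a + b + c)))

lhs = mi(cov, x, y + z)                # I(X; Y, Z)
rhs = mi(cov, x, z) + cmi(cov, x, y, z)  # I(X; Z) + I(X; Y | Z)
assert np.isclose(lhs, rhs)
```

This also suggests a natural test for any conditional MI sampler: compare the three quantities on Gaussian tasks where all of them are analytic.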
As @a-marx told me, KSG, G-KSG, gKNN, and LNN are implemented in this repository. For the demo, look here, at lines 97–113.
We can add it as a git submodule and plug it into our framework by creating an appropriate wrapper script in R. It is probably best to parametrize it with argparse.
In #110 there was an issue that NaNs are raised.
I think this may be a problem of numerical approximations: sometimes we may have a - (b + c) with all of a, b, c equal to -inf, which evaluates to NaN.
If all inputs are -inf, the result should instead be treated as 0.
I asked ChatGPT and it suggested the following construction:
import jax.numpy as jnp

def custom_subtract(a, b, c):
    # Calculate the result for a - (b + c)
    result = a - (b + c)
    # Create a mask for the special case when all inputs are -inf
    mask = jnp.logical_and(jnp.logical_and(a == -jnp.inf, b == -jnp.inf), c == -jnp.inf)
    # Return 0 where the mask is True, and the original result otherwise
    return jnp.where(mask, 0.0, result)

# Test
a = jnp.array(-float('inf'))
b = jnp.array(-float('inf'))
c = jnp.array(-float('inf'))
print(custom_subtract(a, b, c))  # Should print 0.0
More and more neural estimators are implemented in PyTorch. I feel that using JAX creates trouble for future additions: every new estimator needs a JAX reimplementation, which is a lot of work.
Python == 3.12.0; benchmark-mi == 0.1.3 (also tried 0.1.2), pandas == 2.2.2
Installed Visual C++ 14 Build Tools from MS website. Also, downloaded dependencies in requirements.txt.
Attempted to install via Pycharm Community, as well as through Windows terminal.
It appears that the installer is unable to use a slicing function from pandas. The pyproject.toml file lists pandas == '^1.5.3', but the current pandas release is 2.2.2. Could this be causing the issue?
Installation Output
UPDATING build\lib.win-amd64-cpython-312\pandas/_version.py
set build\lib.win-amd64-cpython-312\pandas/_version.py to '1.5.3'
running build_ext
building 'pandas._libs.algos' extension
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pandas
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (pandas)
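A possible fix, assuming the old pin is indeed the culprit: pandas 1.5.3 predates Python 3.12, so pip has no prebuilt wheel and falls back to compiling from source, which fails without the MSVC toolchain. Relaxing the pin would let pip install a binary pandas 2.x wheel instead (untested suggestion):

```toml
# pyproject.toml (suggested change): allow pandas 2.x so that pip can
# use a prebuilt wheel on Python 3.12 instead of building 1.5.3 from source
pandas = ">=1.5,<3"
```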
The Student-t distribution has a multivariate generalization (for which the MI is also known).
Moreover, the mentioned article describes MI of several other families of distributions, although sampling from them may be tricky and MI calculation may require numerical integration.
The normal sampler takes a JAX key; let's just allow an int as well and cast it to a JAX key manually.
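A minimal sketch of the casting idea (the helper name is an assumption, not the bmi API; note it only recognizes plain Python ints, not NumPy integer scalars):

```python
import jax

def cast_to_rng(seed):
    # Accept either a plain int seed or an already-constructed JAX PRNG key.
    if isinstance(seed, int):
        return jax.random.PRNGKey(seed)
    return seed
```

Samplers could then call `cast_to_rng(seed)` at the top of their `sample` method and accept both conventions transparently.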
In #23 we have an example of the spiraling diffeomorphism. Think whether the API is right and write the tests (currently they are missing).
Utilities to visualise the joint distribution of two 1-dimensional variables. Probably a thin wrapper around Seaborn.
Implement LNC estimator, which is KSG with additional correction terms.
Implement the G-KNN estimator.
Introduce principled versioning of the benchmark, using GitHub releases.
Additional changes:
There is a bug in smoothing the training:
src/bmi/estimators/neural/_mine_estimator.py:357: in estimate
return self.estimate_with_info(x, y).mi_estimate
src/bmi/estimators/neural/_mine_estimator.py:335: in estimate_with_info
training_log, trained_critic = mine_training(
src/bmi/estimators/neural/_mine_estimator.py:248: in mine_training
training_log.finish()
src/bmi/estimators/neural/_training_log.py:107: in finish
self.detect_warnings()
src/bmi/estimators/neural/_training_log.py:120: in detect_warnings
train_mi_smooth = (cs[w:] - cs[:-w]) / w
jax/_src/numpy/lax_numpy.py:5071: in deferring_binary_op
return binary_op(*args)
which arises in settings when the training is too short. I added a TODO in _training_log.py:
# TODO(Pawel, Frederic): If training smooth window is too
# long we will have an error that subtraction between (n,)
# and (0,) arrays cannot be performed.
train_mi_smooth = (cs[w:] - cs[:-w]) / w
Currently the JointDistribution wraps and unwraps X and Y samples into one array XY by concatenation and slicing. This is suboptimal: for example, X and Y need to have the same dtype, and working with continuous and categorical variables requires manual casting. Instead, we can use JointDistribution from TFP on JAX.
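The dtype problem can be illustrated with plain arrays (this is a standalone illustration, not bmi code):

```python
import numpy as np

# Concatenating continuous and categorical samples into one array
# silently promotes the dtype, so the categorical part must be cast back.
xs = np.array([[0.1], [0.2]])                 # continuous, float64
ys = np.array([[1], [0]], dtype=np.int64)     # categorical, int64
xy = np.concatenate([xs, ys], axis=1)         # everything becomes float64
assert xy.dtype == np.float64
# Recovering the categorical column requires a manual cast:
ys_back = xy[:, 1:].astype(ys.dtype)
assert ys_back.dtype == np.int64
```

Returning (X, Y) as a tuple of separately-typed arrays avoids this promotion entirely.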
There's room for improvement in the documentation:
- an estimator.cite() method.
Additionally, the following tutorial sections in the documentation would be useful to add:
A possible suggestion how to structure things: https://omnibenchmark.org/
Add the difference of cross-entropy estimator.
It would be nice to have a JAX (+Equinox) based implementation of Mutual Information Neural Estimation. A PyTorch version can be found, e.g., in the Latte project.
Currently we have an interface for Python estimators taking the X and Y samples, and an ExternalEstimator class which takes the path to the task.
The latter is very convenient when one loads tasks from disk. It'd be good to abstract it into an interface and make the existing Python estimators implement such an interface.
Tasks:
- Define an ITaskEstimator interface, for returning the parameters and estimating the MI using the loaded data.
- Refactor ExternalEstimator, so it implements the ITaskEstimator interface.
- Make the existing Python estimators provide an ITaskEstimator implementation.
- To emulate the experience from the Julia user's perspective, create a simple Julia script reading the data in the provided format and running some implemented MI estimators
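A possible shape for the interface, sketched as a structural protocol (method names and signatures are assumptions, not the final API):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ITaskEstimator(Protocol):
    """Hypothetical sketch of the proposed interface."""

    def estimate_from_path(self, path: str) -> float:
        """Load the task sample stored at `path` and estimate the MI."""
        ...

    def parameters(self) -> dict:
        """Return serializable estimator parameters (for logging results)."""
        ...
```

A Protocol keeps ExternalEstimator and the Python estimators decoupled: each just needs the two methods, with no shared base class.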
so we don't have to manually import like this:
import bmi.samplers.SplitMultinormal
Use TFP on JAX to define a new BMM:
X1 – X2 – ... – Xn
| | |
Y1 Y2 – ... – Yn
However, this requires first fixing #161, so that the joint distribution returns tuples of arrays for the X and Y samples.
G-KSG is a geodesic variant of KSG, based on geodesic random forests to find the neighborhoods.
There exists an R implementation.
Python 3.12 was released in October 2023 and we still currently have the following pins in pyproject.toml:
# <3.11 because of PyType. Update when it's resolved
# <3.12 because of SciPy. Update when it's resolved
python = ">=3.9,<3.11"
The idea for this issue is to update the dependencies, so that they work with Python 3.11 and 3.12. It's also likely that we can drop 3.9 entirely.
Tasks:
- Update pyproject.toml, resolving the dependencies appropriately.
- Update .github/workflows, so that we test against 3.11 and 3.12.

Currently the core class for BMMs is called JointDistribution, which may make it easy to confuse with the TensorFlow Probability convention.
MINE, InfoNCE, and other neural estimators could use a random state to enable dropout. This requires changes in the training loop and some refactoring.
We've made a lot of changes when moving to our new tasks/benchmark API. It would be nice to rethink which tasks/estimators etc should be exported by default. For example:
- In bmi.benchmark, should the functions for creating tasks (which live in bmi.benchmark.tasks) be reexported there, so that users create the tasks themselves, or should the tasks from the benchmark list be reexported under convenient names?
- Should estimators with external dependencies be exported in bmi.estimators? We could go with the latter and raise warnings when someone tries to initialize an estimator and R/Julia/some necessary packages are not installed.

The GMM-based estimator is not documented properly:
- Bump the version to 0.2.0.
- Use poetry to submit a new version of the package to PyPI.

The default setup for neural estimators should be training on a 50-50 train/test split, early stopping when the test MI stops growing, and returning the highest test MI as the estimate.
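The proposed default could be sketched as follows (function names are assumptions, not the bmi API; `train_step` stands for one optimizer update plus a test-MI evaluation):

```python
def train_with_early_stopping(train_step, max_steps=1000, patience=10):
    """Run `train_step()` repeatedly; each call returns the current test MI.
    Stop when the test MI has not improved for `patience` steps and
    return the highest test MI observed."""
    best_mi, steps_without_improvement = float("-inf"), 0
    for _ in range(max_steps):
        test_mi = train_step()
        if test_mi > best_mi:
            best_mi, steps_without_improvement = test_mi, 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break
    return best_mi
```

Reporting the best test MI (rather than the final one) guards against the late-training overestimation that MINE-style estimators are prone to.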
Implement an adaptive binning strategy for the histogram-based MI estimator. Also, consider binning by the number of samples and estimating the volume, rather than the current strategy (equally-sized bins with different numbers of samples per bin).
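A sketch of the equal-mass (quantile) binning idea for the 1-D × 1-D case, as an illustration only (not the bmi implementation):

```python
import numpy as np

def histogram_mi(x, y, n_bins=10):
    """Plug-in histogram MI with quantile bin edges, so that each
    marginal bin holds roughly the same number of samples."""
    x_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    y_edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x, shape (B, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, B)
    mask = p_xy > 0
    # KL(p_xy || p_x p_y), summed over non-empty cells, in nats
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())
```

With quantile edges the marginals are close to uniform, which reduces the sensitivity to heavy tails that equally-sized bins suffer from.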
This PR proposes how to refactor the package, so it's easier to use.
Currently tasks are objects holding several pre-generated samples. Instead of values generated for a range of seeds, tasks will be more like named samplers.
> xs, ys = task.sample(n_samples=5000, seed=42)
> task.id  # unique
mn_sparse_3x3
> task.name  # pretty
Multinormal (sparse) 3 × 3
> task.params
(serializable info about the task)
> task.save_metadata('path/to/save.yaml')
> task.save_sample('path/to/save.csv', n_samples=5000, seed=42)
(includes info from above)
> from bmi.tasks import read_sample
> x, y = read_sample('path/to/read')
Dumping task could be a functionality in benchmark, for example:
> from bmi.tasks import dump_task
> dump_task('path/', task, seeds=[0, 1, 2], samples=[1000, 2000])
should create:
path/
task_id/
metadata.yaml
samples/
1000-0.csv
1000-1.csv
1000-2.csv
2000-0.csv
2000-1.csv
2000-2.csv
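A hypothetical sketch of dump_task producing the layout above (the `metadata_yaml` method on the task is an assumption for illustration):

```python
import csv
from pathlib import Path

def dump_task(root, task, seeds, samples):
    """Write metadata plus one CSV per (n_samples, seed) pair, named
    '{n_samples}-{seed}.csv', under root/{task.id}/samples/."""
    task_dir = Path(root) / task.id
    sample_dir = task_dir / "samples"
    sample_dir.mkdir(parents=True, exist_ok=True)
    # metadata_yaml() is a hypothetical serialization hook
    (task_dir / "metadata.yaml").write_text(task.metadata_yaml())
    for n in samples:
        for seed in seeds:
            xs, ys = task.sample(n_samples=n, seed=seed)
            with open(sample_dir / f"{n}-{seed}.csv", "w", newline="") as f:
                writer = csv.writer(f)
                for x, y in zip(xs, ys):
                    writer.writerow([x, y])
```

Keeping the file names as pure (n_samples, seed) pairs makes re-running a benchmark with extra seeds purely additive.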
We need an official dictionary of tasks, BENCHMARK_TASKS:
> from bmi.benchmark import BENCHMARK_TASKS
> task = BENCHMARK_TASKS['some_task_id']
We can have a script for non-Python users that allows
easy task generation:
$ python generate_task.py TASK_ID SEEDS SAMPLES PATH
We wrap external estimators so they behave like regular estimators by saving the needed sample on the fly (ideally in /run/user/$uid or tmp/ or some other ramdisk).
> from bmi.estimators import InfoNCEEstimator
> from bmi.estimators import JuliaTransferEstimator
> from bmi.benchmark import BENCHMARK_TASKS
> task = BENCHMARK_TASKS['some_task_id']
> xs, ys = task.sample(5000, 0)
> InfoNCEEstimator.estimate(xs, ys)
> JuliaTransferEstimator.estimate(xs, ys)
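The wrapping could be sketched like this (the command-line contract, i.e. "tool takes a CSV path and prints the estimate", is an assumption, not an existing bmi script):

```python
import csv
import subprocess
import tempfile
from pathlib import Path

def estimate_external(command, xs, ys):
    """Write (xs, ys) to a temporary CSV, run the external tool on that
    path, and parse its stdout as the MI estimate."""
    with tempfile.TemporaryDirectory() as tmp:  # could point at a ramdisk
        sample_path = Path(tmp) / "sample.csv"
        with open(sample_path, "w", newline="") as f:
            writer = csv.writer(f)
            for x, y in zip(xs, ys):
                writer.writerow([x, y])
        result = subprocess.run(
            [*command, str(sample_path)],
            capture_output=True, text=True, check=True,
        )
    return float(result.stdout.strip())
```

The temporary directory is removed as soon as the subprocess finishes, so the on-disk sample never outlives the estimate call.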
We want benchmarks to be easily run and configured through Snakemake. This is out of scope for this issue, but we should keep it in mind.
In our samplers .mutual_information() is a callable method; in tasks it is a @property. It would be nice for them to be consistent.