cbg-ethz / bmi
Mutual information estimators and benchmark
Home Page: https://cbg-ethz.github.io/bmi/
License: MIT License
Generated results:
This PR is, as proposed by @grfrederic, about updating naming conventions.
Tasks:
from ... import ... as bmm
rather than from ... import ... as fine
Add some tasks to the benchmark based on the fine distributions framework.
For example, tasks involving outliers and discrete-continuous distributions.
Note that some more thinking is needed about exactly which tasks to add.
it would be nice to organize our estimator tests, that is have a generic testing function:
test_estimator_on_task(estimator, task, n_samples, seed, abs_error, rel_error)
and then use it to build our tests:
Add conditional MI estimators and samplers.
Note that we have the chain rule: I(X; Y, Z) = I(X; Z) + I(X; Y | Z).
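For jointly Gaussian variables the MI has a closed form in terms of covariance determinants, so the chain rule I(X; Y, Z) = I(X; Z) + I(X; Y | Z) can be sanity-checked numerically. The snippet below is an illustration only, not bmi code:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
cov = A @ A.T + np.eye(3)  # random positive-definite covariance of (X, Y, Z)
x, y, z = [0], [1], [2]

def mi(cov, a, b):
    # For Gaussians: I(A; B) = 0.5 * log(det S_A * det S_B / det S_AB)
    s = lambda idx: np.linalg.det(cov[np.ix_(idx, idx)])
    return 0.5 * np.log(s(a) * s(b) / s(a + b))

def cmi(cov, a, b, c):
    # I(A; B | C) = 0.5 * log(det S_AC * det S_BC / (det S_C * det S_ABC))
    s = lambda idx: np.linalg.det(cov[np.ix_(idx, idx)])
    return 0.5 * np.log(s(a + c) * s(b + c) / (s(c) * s(a + b + c)))

lhs = mi(cov, x, y + z)                # I(X; Y, Z)
rhs = mi(cov, x, z) + cmi(cov, x, y, z)  # I(X; Z) + I(X; Y | Z)
assert np.isclose(lhs, rhs)
```

This also suggests a natural test for any conditional MI sampler: compare the three quantities on Gaussian tasks where all of them are analytic.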
As @a-marx told me, KSG, G-KSG, gKNN, and LNN are implemented in this repository. For the demo, look here, at lines 97–113.
We can add it as a git submodule and plug it into our framework by creating an appropriate wrapper script in R. It is probably best to parametrize it with argparse.
In #110 there was an issue that NaNs are raised.
I think this may be a problem of numerical approximations: sometimes we may have a - (b + c) with all of a, b, c equal to -inf, which evaluates to NaN.
If all inputs are -inf, the result should instead be treated as 0.
I asked ChatGPT and it suggested the following construction:
import jax.numpy as jnp

def custom_subtract(a, b, c):
    # Calculate the result for a - (b + c)
    result = a - (b + c)
    # Create a mask for the special case when all inputs are -inf
    mask = jnp.logical_and(jnp.logical_and(a == -jnp.inf, b == -jnp.inf), c == -jnp.inf)
    # Return 0 where the mask is True, and the original result otherwise
    return jnp.where(mask, 0.0, result)

# Test
a = jnp.array(-float('inf'))
b = jnp.array(-float('inf'))
c = jnp.array(-float('inf'))
print(custom_subtract(a, b, c))  # Should print 0.0
More and more neural estimators are implemented in PyTorch. I feel that using JAX creates trouble for future additions: every new estimator needs a JAX reimplementation, which is a lot of work.
Python == 3.12.0; benchmark-mi == 0.1.3 (also tried 0.1.2), pandas == 2.2.2
Installed Visual C++ 14 Build Tools from MS website. Also, downloaded dependencies in requirements.txt.
Attempted to install via Pycharm Community, as well as through Windows terminal.
It appears that the installer is unable to use a slicing function from pandas. The pyproject.toml file lists pandas == '^1.5.3', but the current pandas release is 2.2.2. Could this be causing the issue?
Installation Output
UPDATING build\lib.win-amd64-cpython-312\pandas/_version.py
set build\lib.win-amd64-cpython-312\pandas/_version.py to '1.5.3'
running build_ext
building 'pandas._libs.algos' extension
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pandas
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (pandas)
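A possible fix, assuming the old pin is indeed the culprit: pandas 1.5.3 predates Python 3.12, so pip has no prebuilt wheel and falls back to compiling from source, which fails without the MSVC toolchain. Relaxing the pin would let pip install a binary pandas 2.x wheel instead (untested suggestion):

```toml
# pyproject.toml (suggested change): allow pandas 2.x so that pip can
# use a prebuilt wheel on Python 3.12 instead of building 1.5.3 from source
pandas = ">=1.5,<3"
```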
The Student-t distribution has a multivariate generalization (for which the MI is also known).
Moreover, the mentioned article describes MI of several other families of distributions, although sampling from them may be tricky and MI calculation may require numerical integration.
The normal sampler takes a JAX key; let's just allow an int as well and cast it to a JAX key manually.
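A minimal sketch of the casting idea (the helper name is an assumption, not the bmi API; note it only recognizes plain Python ints, not NumPy integer scalars):

```python
import jax

def cast_to_rng(seed):
    # Accept either a plain int seed or an already-constructed JAX PRNG key.
    if isinstance(seed, int):
        return jax.random.PRNGKey(seed)
    return seed
```

Samplers could then call `cast_to_rng(seed)` at the top of their `sample` method and accept both conventions transparently.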
In #23 we have an example of the spiraling diffeomorphism. Think whether the API is right and write the tests (currently they are missing).
Utilities to visualise the joint distribution of two 1-dimensional variables. Probably a thin wrapper around Seaborn.
Implement LNC estimator, which is KSG with additional correction terms.
Implement the G-KNN estimator.
Introduce principled versioning of the benchmark, using GitHub releases.
Additional changes:
There is a bug in smoothing the training:
src/bmi/estimators/neural/_mine_estimator.py:357: in estimate
return self.estimate_with_info(x, y).mi_estimate
src/bmi/estimators/neural/_mine_estimator.py:335: in estimate_with_info
training_log, trained_critic = mine_training(
src/bmi/estimators/neural/_mine_estimator.py:248: in mine_training
training_log.finish()
src/bmi/estimators/neural/_training_log.py:107: in finish
self.detect_warnings()
src/bmi/estimators/neural/_training_log.py:120: in detect_warnings
train_mi_smooth = (cs[w:] - cs[:-w]) / w
jax/_src/numpy/lax_numpy.py:5071: in deferring_binary_op
return binary_op(*args)
which arises in settings when the training is too short. I added a TODO in _training_log.py:
# TODO(Pawel, Frederic): If training smooth window is too
# long we will have an error that subtraction between (n,)
# and (0,) arrays cannot be performed.
train_mi_smooth = (cs[w:] - cs[:-w]) / w
Currently the JointDistribution wraps and unwraps X and Y samples into one array XY by concatenation and slicing. This is suboptimal: for example, X and Y need to have the same dtype, and working with continuous and categorical variables requires manual casting. Instead, we can use JointDistribution from TFP on JAX.
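The dtype problem can be illustrated with plain arrays (this is a standalone illustration, not bmi code):

```python
import numpy as np

# Concatenating continuous and categorical samples into one array
# silently promotes the dtype, so the categorical part must be cast back.
xs = np.array([[0.1], [0.2]])                 # continuous, float64
ys = np.array([[1], [0]], dtype=np.int64)     # categorical, int64
xy = np.concatenate([xs, ys], axis=1)         # everything becomes float64
assert xy.dtype == np.float64
# Recovering the categorical column requires a manual cast:
ys_back = xy[:, 1:].astype(ys.dtype)
assert ys_back.dtype == np.int64
```

Returning (X, Y) as a tuple of separately-typed arrays avoids this promotion entirely.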
There's room for improvement in the documentation:
- an estimator.cite() method.
Additionally, the following tutorial sections in the documentation would be useful to add:
A possible suggestion how to structure things: https://omnibenchmark.org/
Add the difference of cross-entropy estimator.
It would be nice to have a JAX (+Equinox) based implementation of Mutual Information Neural Estimation. A PyTorch version can be found, e.g., in the Latte project.
Currently we have an interface for Python estimators taking the X and Y samples, and an ExternalEstimator class which takes the path to the task.
The latter is very convenient when one loads tasks from disk. It'd be good to abstract it into an interface and make the existing Python estimators implement such an interface.
Tasks:
- Define an ITaskEstimator interface, for returning the parameters and estimating the MI using the loaded data.
- Refactor ExternalEstimator, so it implements the ITaskEstimator interface.
- Make the existing Python estimators provide an ITaskEstimator implementation.
- To emulate the experience from the Julia user's perspective, create a simple Julia script reading the data in the provided format and running some implemented MI estimators
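A possible shape for the interface, sketched as a structural protocol (method names and signatures are assumptions, not the final API):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ITaskEstimator(Protocol):
    """Hypothetical sketch of the proposed interface."""

    def estimate_from_path(self, path: str) -> float:
        """Load the task sample stored at `path` and estimate the MI."""
        ...

    def parameters(self) -> dict:
        """Return serializable estimator parameters (for logging results)."""
        ...
```

A Protocol keeps ExternalEstimator and the Python estimators decoupled: each just needs the two methods, with no shared base class.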
so we don't have to manually import like this:
import bmi.samplers.SplitMultinormal
Use TFP on JAX to define a new BMM:
X1 – X2 – ... – Xn
| | |
Y1 Y2 – ... – Yn
However, this requires first fixing #161, so that the joint distribution returns tuples of arrays for the X and Y samples.
G-KSG is a geodesic variant of KSG, based on geodesic random forests to find the neighborhoods.
There exists an R implementation.
Python 3.12 was released in October 2023 and we still currently have the following pins in pyproject.toml:
# <3.11 because of PyType. Update when it's resolved
# <3.12 because of SciPy. Update when it's resolved
python = ">=3.9,<3.11"
The idea for this issue is to update the dependencies, so that they work with Python 3.11 and 3.12. It's also likely that we can drop 3.9 entirely.
Tasks:
- Update pyproject.toml, resolving the dependencies appropriately.
- Update .github/workflows, so that we test against 3.11 and 3.12.

Currently the core class for BMMs is called JointDistribution, which may make it easy to confuse with the TensorFlow Probability convention.
MINE, InfoNCE, and other neural estimators could use a random state to enable dropout. This requires changes in the training loop and some refactoring.
We've made a lot of changes when moving to our new tasks/benchmark API. It would be nice to rethink which tasks/estimators etc should be exported by default. For example:
- In bmi.benchmark, should the functions for creating tasks (which live in bmi.benchmark.tasks) be reexported there, so that users create the tasks themselves, or should the tasks from the benchmark list be reexported under convenient names?
- Should estimators with external dependencies be exported in bmi.estimators? We could go with the latter and raise warnings when someone tries to initialize an estimator and R/Julia/some necessary packages are not installed.

The GMM-based estimator is not documented properly:
- Bump the version to 0.2.0.
- Use poetry to submit a new version of the package to PyPI.

The default setup for neural estimators should be training on a 50-50 train/test split, early stopping when the test MI stops growing, and returning the highest test MI as the estimate.
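The proposed default could be sketched as follows (function names are assumptions, not the bmi API; `train_step` stands for one optimizer update plus a test-MI evaluation):

```python
def train_with_early_stopping(train_step, max_steps=1000, patience=10):
    """Run `train_step()` repeatedly; each call returns the current test MI.
    Stop when the test MI has not improved for `patience` steps and
    return the highest test MI observed."""
    best_mi, steps_without_improvement = float("-inf"), 0
    for _ in range(max_steps):
        test_mi = train_step()
        if test_mi > best_mi:
            best_mi, steps_without_improvement = test_mi, 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break
    return best_mi
```

Reporting the best test MI (rather than the final one) guards against the late-training overestimation that MINE-style estimators are prone to.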
Implement an adaptive binning strategy for the histogram-based MI estimator. Also, consider binning by the number of samples and estimating the volume, rather than the current strategy (equally-sized bins with different numbers of samples per bin).
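A sketch of the equal-mass (quantile) binning idea for the 1-D × 1-D case, as an illustration only (not the bmi implementation):

```python
import numpy as np

def histogram_mi(x, y, n_bins=10):
    """Plug-in histogram MI with quantile bin edges, so that each
    marginal bin holds roughly the same number of samples."""
    x_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    y_edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x, shape (B, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, B)
    mask = p_xy > 0
    # KL(p_xy || p_x p_y), summed over non-empty cells, in nats
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())
```

With quantile edges the marginals are close to uniform, which reduces the sensitivity to heavy tails that equally-sized bins suffer from.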
This PR proposes how to refactor the package, so it's easier to use.
Currently tasks are objects holding several pre-generated samples. Instead of values generated for a range of seeds, tasks will be more like named samplers.
> xs, ys = task.sample(n_samples=5000, seed=42)
> task.id  # unique
mn_sparse_3x3
> task.name  # pretty
Multinormal (sparse) 3 × 3
> task.params
(serializable info about the task)
> task.save_metadata('path/to/save.yaml')
> task.save_sample('path/to/save.csv', n_samples=5000, seed=42)
(includes info from above)
> from bmi.tasks import read_sample
> x, y = read_sample('path/to/read')
Dumping task could be a functionality in benchmark, for example:
> from bmi.tasks import dump_task
> dump_task('path/', task, seeds=[0, 1, 2], samples=[1000, 2000])
should create:
path/
task_id/
metadata.yaml
samples/
1000-0.csv
1000-1.csv
1000-2.csv
2000-0.csv
2000-1.csv
2000-2.csv
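A hypothetical sketch of dump_task producing the layout above (the `metadata_yaml` method on the task is an assumption for illustration):

```python
import csv
from pathlib import Path

def dump_task(root, task, seeds, samples):
    """Write metadata plus one CSV per (n_samples, seed) pair, named
    '{n_samples}-{seed}.csv', under root/{task.id}/samples/."""
    task_dir = Path(root) / task.id
    sample_dir = task_dir / "samples"
    sample_dir.mkdir(parents=True, exist_ok=True)
    # metadata_yaml() is a hypothetical serialization hook
    (task_dir / "metadata.yaml").write_text(task.metadata_yaml())
    for n in samples:
        for seed in seeds:
            xs, ys = task.sample(n_samples=n, seed=seed)
            with open(sample_dir / f"{n}-{seed}.csv", "w", newline="") as f:
                writer = csv.writer(f)
                for x, y in zip(xs, ys):
                    writer.writerow([x, y])
```

Keeping the file names as pure (n_samples, seed) pairs makes re-running a benchmark with extra seeds purely additive.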
We need an official dictionary of tasks, BENCHMARK_TASKS:
> from bmi.benchmark import BENCHMARK_TASKS
> task = BENCHMARK_TASKS['some_task_id']
We can have a script for non-Python users that allows
easy task generation:
$ python generate_task.py TASK_ID SEEDS SAMPLES PATH
We wrap external estimators so they behave like regular estimators by saving the needed sample on the fly (ideally in /run/user/$uid or tmp/ or some other ramdisk).
> from bmi.estimators import InfoNCEEstimator
> from bmi.estimators import JuliaTransferEstimator
> from bmi.benchmark import BENCHMARK_TASKS
> task = BENCHMARK_TASKS['some_task_id']
> xs, ys = task.sample(5000, 0)
> InfoNCEEstimator.estimate(xs, ys)
> JuliaTransferEstimator.estimate(xs, ys)
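The wrapping could be sketched like this (the command-line contract, i.e. "tool takes a CSV path and prints the estimate", is an assumption, not an existing bmi script):

```python
import csv
import subprocess
import tempfile
from pathlib import Path

def estimate_external(command, xs, ys):
    """Write (xs, ys) to a temporary CSV, run the external tool on that
    path, and parse its stdout as the MI estimate."""
    with tempfile.TemporaryDirectory() as tmp:  # could point at a ramdisk
        sample_path = Path(tmp) / "sample.csv"
        with open(sample_path, "w", newline="") as f:
            writer = csv.writer(f)
            for x, y in zip(xs, ys):
                writer.writerow([x, y])
        result = subprocess.run(
            [*command, str(sample_path)],
            capture_output=True, text=True, check=True,
        )
    return float(result.stdout.strip())
```

The temporary directory is removed as soon as the subprocess finishes, so the on-disk sample never outlives the estimate call.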
We want benchmarks to be easily run and configured through Snakemake. This is out of scope for this issue, but we should keep it in mind.
In our samplers .mutual_information() is a callable method; in tasks it is a @property. It would be nice for them to be consistent.