emukit / emukit

A Python-based toolbox of various methods in decision making, uncertainty quantification and statistical emulation: multi-fidelity, experimental design, Bayesian optimisation, Bayesian quadrature, etc.

Home Page: https://emukit.github.io/emukit/

License: Apache License 2.0

Python 99.86% Stan 0.14%
machine-learning bayesian-optimization uncertainty-quantification multi-fidelity experimental-design bayesian-quadrature sensitivity-analysis decision-making emulation python

emukit's Introduction

Emukit

Build Status | Documentation Status | Tests Coverage | GitHub License

Website | Documentation | Contribution Guide

Emukit is a highly adaptable Python toolkit for enriching decision making under uncertainty. This is particularly pertinent to complex systems where data is scarce or difficult to acquire. In these scenarios, propagating well-calibrated uncertainty estimates within a design loop or computational pipeline ensures that constrained resources are used effectively.

The main features currently available in Emukit are:

  • Multi-fidelity emulation: build surrogate models when data is obtained from multiple information sources that have different fidelity and/or cost;
  • Bayesian optimisation: optimise physical experiments and tune parameters of machine learning algorithms;
  • Experimental design/Active learning: design the most informative experiments and perform active learning with machine learning models;
  • Sensitivity analysis: analyse the influence of inputs on the outputs of a given system;
  • Bayesian quadrature: efficiently compute the integrals of functions that are expensive to evaluate.

Emukit is agnostic to the underlying modelling framework, which means you can use any tool of your choice in the Python ecosystem to build the machine learning model, and still be able to use Emukit.
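For example, a GPy model can be wrapped for Emukit with the GPyModelWrapper shipped in emukit.model_wrappers; any other framework can be used by implementing the same small model interface. A minimal sketch:

import numpy as np
import GPy
from emukit.model_wrappers import GPyModelWrapper

X = np.random.rand(10, 1)
Y = np.sin(6 * X) + np.random.randn(10, 1) * 0.05
gpy_model = GPy.models.GPRegression(X, Y)   # any GPy regression model
emukit_model = GPyModelWrapper(gpy_model)   # now usable by Emukit's methods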

Installation

To install emukit, simply run

pip install emukit

For other install options, see our documentation.

Dependencies / Prerequisites

Emukit's primary dependencies are NumPy and GPy. See requirements.

Getting started

For examples see our tutorial notebooks.
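For a flavour, a minimal Bayesian optimisation loop looks roughly like this (a sketch assuming GPy as the modelling backend and Emukit's current loop API):

import numpy as np
import GPy
from emukit.bayesian_optimization.loops import BayesianOptimizationLoop
from emukit.core import ContinuousParameter, ParameterSpace
from emukit.model_wrappers import GPyModelWrapper

def f(x):                                   # toy objective, minimised over [0, 1]
    return (x - 0.3) ** 2

space = ParameterSpace([ContinuousParameter('x', 0, 1)])
X_init = np.array([[0.1], [0.5], [0.9]])
model = GPyModelWrapper(GPy.models.GPRegression(X_init, f(X_init)))

loop = BayesianOptimizationLoop(model=model, space=space)
loop.run_loop(f, 10)                        # run 10 iterations of the loop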

Documentation

To learn more about Emukit, refer to our documentation.

To learn about emulation as a concept, check out the Emukit playground project.

Citing the library

If you are using emukit, we would appreciate it if you could cite our papers about Emukit in your research:

@inproceedings{emukit2019,
  author = {Paleyes, Andrei and Pullin, Mark and Mahsereci, Maren and McCollum, Cliff and Lawrence, Neil and González, Javier},
  title = {Emulation of physical processes with {E}mukit},
  booktitle = {Second Workshop on Machine Learning and the Physical Sciences, NeurIPS},
  year = {2019}
}

@article{emukit2023,
  title={Emukit: A {P}ython toolkit for decision making under uncertainty},
  author={Andrei Paleyes and Maren Mahsereci and Neil D. Lawrence},
  journal={Proceedings of the Python in Science Conference},
  year={2023}
}

The papers themselves can be found at these links: NeurIPS workshop 2019, SciPy conference 2023.

License

Emukit is licensed under Apache 2.0. Please refer to LICENSE and NOTICE for further license information.

emukit's People

Contributors

aaronkl, alpiges, apaleyes, brunokm, charelstoncrabb, clairecp, davidjanz, dekuenstle, dependabot[bot], eamanu, ekalosak, fheilz, henrymoss, hyandell, javiergonzalezh, jeannotalpin, jejjohnson, kurtmckee, marpulli, mashanaslidnyk, mmahsereci, neochaos12, ntenenz, onponomarev, polivucci, rns294, rvvincelli, sennendoko, sunnyszy, tpielok


emukit's Issues

Create the documentation

Emukit should have documentation up and running on some doc hosting, e.g. RTD. Here is the scope:

  • Generate docs for API
  • Add notebooks to the docs
  • Additional docs on architecture of the library
  • Index page for all this stuff above
  • Host the documentation

Result from loops

At the minute the loop does not return anything and there is no quick way to get a solution from a loop. We also can't access the model easily. We should fix this.
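One possible shape for such an API (a hypothetical sketch; none of these names exist yet):

results = loop.get_results()            # hypothetical accessor on the loop
print(results.minimum_location)         # best input found so far
print(results.minimum_value)            # corresponding objective value
model = loop.model_updaters[0].model    # one possible route to the model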

Multi-output sensitivity analysis

The tests suggest that the sensitivity analysis should work with multi-output functions because the test model used has 5-dimensional outputs. However, the total variance output is a scalar and I'd expect it to have 5 entries, one for each output. Is there a bug in the code or my understanding?

Add build for Python 3.7

Right now we build for Python 3.5 and 3.6. Python 3.7 has been around for some time now; we should add a build for it and make sure we support this version too.

Add ways to customize outer loop flow

It should be possible to customize the loop flow without rewriting it. For instance, if a user wants to store some benchmarking information between iterations, they should be able to add it into the existing loop without redefining it.

Right now we have self.custom_step() at the end of each iteration, which does the job to some extent, but obviously isn't very flexible and does not cover many scenarios.
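One alternative is an event/subscriber hook (a hypothetical sketch of the idea, not existing API):

def record_benchmark(loop, loop_state):
    # user-supplied handler: store benchmarking info between iterations
    print("iteration %d: %d points evaluated so far"
          % (loop_state.iteration, loop_state.X.shape[0]))

# hypothetical subscription point, called by the loop after every iteration:
loop.iteration_end_event.append(record_benchmark)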

Integrating model hyperparameters

Integrating over model hyper-parameters needs to be implemented and offered as an option when hyper-priors are used in the models. This can provide fundamentally different (and often better) results than optimizing the hyper-parameters, as is currently done.
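A generic sketch of the idea, assuming the model exposes some way to sample and fix hyper-parameters (both helpers below are hypothetical):

import numpy as np

def integrated_acquisition(acquisition, model, x, n_samples=10):
    # average the acquisition over posterior hyper-parameter samples
    # instead of evaluating it at a single optimised setting
    samples = model.generate_hyperparameters_samples(n_samples)  # hypothetical
    values = []
    for theta in samples:
        model.fix_model_hyperparameters(theta)                   # hypothetical
        values.append(acquisition.evaluate(x))
    return np.mean(values, axis=0)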

matplotlib as a requirement

If I install emukit via pip, matplotlib won't be installed. However, it seems to be a requirement of GPy: at least if I do from emukit.test_functions import forrester_function, it throws "ImportError: No module named 'matplotlib'" somewhere in GPy.

Could we add matplotlib to requirements.txt?

Get optional dependencies straight

We have a few models with optional dependencies at the moment. They are not tested, and result in empty doc pages. We need to find a good place for them to go, and think about how to manage their dependencies appropriately.

Parallel evaluations of the user function

Following some previous discussions, we should start considering cases where the UserFunction is evaluated simultaneously at several locations. This is common in experimental design, BayesOpt, etc. For 1.0 we don't need implementations for everything, but it will be really useful to settle the design and structure of how we are going to handle these cases in the future.

Add logging

It is currently difficult to keep track of what is happening when a loop is running. The loops potentially take a long time to run. We should add logging to allow users to monitor the loops.
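A minimal sketch using the standard library's logging module (the placement inside the loop is illustrative):

import logging

_log = logging.getLogger(__name__)

# inside each loop iteration, something like:
#     _log.info("Iteration %d: evaluating %d new point(s)", i, new_x.shape[0])

# users then opt in to whatever verbosity they need:
logging.basicConfig(level=logging.INFO)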

Tidy up example functions

The Forrester function is implemented in both the Bayesian optimization and multi-fidelity modules. Shall we put all test functions/toy simulators into one top-level module?

Examples for the landing page

The landing page will be the first thing users see when approaching the library. When writing the blogs for the methods it is very important that we send a clear message of how the library should be used. For the examples involving loops (BO, ED, BQ) this is what I propose:

  • In each card on the landing page we explain the idea of the method that the card links to, and we add a simple example with code.
  • We use the same objective function and the same pattern in all the examples.
  • All the examples use method-specific loops and have the same structure:
  1. Definition of the objective function (same in all examples)
  2. Definition of the model (same in all examples)
  3. Definition of the elements specific to each method (this changes across examples).
  4. Creation of the loop (also specific).
  5. Run the loop and show results (same).

The code for 1, 2, and 5 should be the same in all the examples.
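To make the proposed structure concrete, here is a sketch of what a card's example could look like, instantiated for experimental design (assumes GPy as the backend; imports follow the current package layout):

import numpy as np
import GPy
from emukit.core import ContinuousParameter, ParameterSpace
from emukit.experimental_design import ExperimentalDesignLoop
from emukit.model_wrappers import GPyModelWrapper

# 1. definition of the objective function (same in all examples)
def f(x):
    return (x - 0.3) ** 2

# 2. definition of the model (same in all examples)
space = ParameterSpace([ContinuousParameter('x', 0, 1)])
X_init = np.array([[0.1], [0.5], [0.9]])
model = GPyModelWrapper(GPy.models.GPRegression(X_init, f(X_init)))

# 3. + 4. method-specific elements and loop creation (this is what changes;
#         here we rely on the ED loop's default model-variance acquisition)
loop = ExperimentalDesignLoop(space=space, model=model)

# 5. run the loop and show results (same in all examples)
loop.run_loop(f, 10)
print(loop.loop_state.X, loop.loop_state.Y)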

Sensitivity analysis will also use 1 and 2, but as there is no loop, the results will be shown directly (a specific call in place of 5).

The multi-fidelity methods are just models that we offer. The way to go here is to do 1 and 2 for some example and then repeat 3-5 for one of the other applications (ED?).

It is important that we don't use any class from the /examples folder (like GPBayesianOptimization), as the idea of this blog post is to show the modularity of the library and be clear about the core structure. We can have a separate card for examples and how to wrap things up.

Does it make sense?

clean up epmgp.py

It is not easy to read. We might also want to move it out of the util folder.

Stopping condition for cost sensitive evaluations

Currently the stopping condition is only checked at the beginning of each loop step. If the evaluation is cost/budget sensitive, e.g. in multi-fidelity/multi-source models or models where the input location determines the evaluation cost, then we might want to check after point calculation too. Alternatively the point calculator could be smarter and only return points which are still below budget, but that is harder to do.
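A sketch of where the extra check could go (all names here are hypothetical):

def run_iteration(loop_state, point_calculator, user_function, cost_fn, budget_left):
    # re-check the budget *after* point calculation,
    # not only at the start of the loop step
    new_x = point_calculator.compute_next_points(loop_state)
    if cost_fn(new_x) > budget_left:   # would this evaluation overspend?
        return None                    # signal the loop to stop early
    return user_function.evaluate(new_x)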

Use abstractions appropriately

Before going 1.0 we should review our use of abstractions. Based on recent conversations I propose this approach:

  • On the OuterLoop level we should be dealing in ModelUpdater, CandidatePointCalculator and so on.
  • On the concrete method level we should be dealing in problem-specific terms: acquisition, update step size, etc.
  • If a concrete method does not suit a user's needs, it should be clear how to create their own implementation of the OuterLoop. In fact, users should be encouraged to do that.

There are a few places in the code base where we currently don't follow this approach; if everyone agrees with this proposal, we should identify and fix those places.

Exact integration in integrated variance acquisition

Investigate whether we can use the BQ package to compute the integral of the variance in the integrated variance acquisition function in experimental design, for a uniform integration measure with constant integration bounds. We currently use Monte Carlo integration, which we should keep for cases where we can't integrate the GP exactly.
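For reference, with \sigma^2_{n+1}(x' \mid x) denoting the GP's predictive variance at x' after hypothetically adding x, and p the uniform measure on the bounded domain, the acquisition and our current Monte Carlo estimator are:

a(x) \;=\; \int \sigma^2_{n+1}(x' \mid x)\, p(x')\, \mathrm{d}x' \;\approx\; \frac{1}{M} \sum_{m=1}^{M} \sigma^2_{n+1}(x_m \mid x), \qquad x_m \sim p

For kernels with tractable kernel means (e.g. RBF on a box) the left-hand integral has a closed form, which is where the BQ package could come in.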

Validate notebooks

We need a way to make sure our notebooks are tested:

  1. At the very least they should be valid json files
  2. One step up is to make sure they are valid jupyter notebook files
  3. Even higher goal is to make sure they execute without errors

Point 3 was already implemented in GPyOpt: https://github.com/SheffieldML/GPyOpt/blob/master/manual/notebooks_check.py . We can consider adopting this script. However, we need to make sure that whatever validation method we choose can run as part of the Travis CI build.
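A minimal sketch of all three levels using nbformat and nbconvert (the function name is ours; the timeout is arbitrary):

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

def check_notebook(path):
    nb = nbformat.read(path, as_version=4)   # 1. fails on malformed JSON
    nbformat.validate(nb)                    # 2. fails on schema violations
    # 3. fails if any cell raises during execution
    ExecutePreprocessor(timeout=600).preprocess(nb, {'metadata': {'path': '.'}})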

Standard acquisition function test suite

We should be able to have some standard tests for all acquisition functions such as testing output shapes and numerical gradient checks. This would make it easier to create new acquisitions and make sure they are all tested.
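A sketch of such a check, assuming acquisitions expose evaluate and evaluate_with_gradients (the helper name is ours):

import numpy as np

def check_acquisition(acquisition, x, eps=1e-6, tol=1e-4):
    # shape check: one acquisition value per input row
    value = acquisition.evaluate(x)
    assert value.shape == (x.shape[0], 1)

    # numerical gradient check via central finite differences
    _, grad = acquisition.evaluate_with_gradients(x)
    for i in range(x.shape[1]):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[:, i] += eps
        x_minus[:, i] -= eps
        fd = (acquisition.evaluate(x_plus) - acquisition.evaluate(x_minus)) / (2 * eps)
        assert np.allclose(grad[:, [i]], fd, atol=tol)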

Updating acquisition function in the loop

Some acquisition functions, such as entropy search, require a recalculation of some parameters after each iteration. This doesn't neatly fit into our OuterLoop framework at the minute. We may need to add an update method to the acquisition that is called after observing new data.
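Schematically (the hook name is hypothetical):

class Acquisition:
    def update_parameters(self):
        """Hypothetical hook: recompute cached quantities (e.g. entropy
        search's internal parameters) after the model sees new data."""
        pass

# the outer loop would then call acquisition.update_parameters()
# once per iteration, right after the model update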

Dependency on GPy and GPyOpt

I'd like to open a discussion on Emukit's dependency on GPy and GPyOpt.

The GPy dependency causes us some trouble: it is the most likely cause of installation failures, it drags in matplotlib and plotly, and it has known issues with Python 3.7. For these reasons I would like to see if we could stop requiring that dependency.

We depend on GPy and GPyOpt in the following places:

  1. Model-free designs
    We could even copy over the relevant code pieces from GPyOpt and get rid of that dependency.
  2. Optimizer
    Same thing: we could copy, or re-implement.
  3. Multi-fidelity
    This is harder, but we could treat multi-fidelity as an optional extra of the package. People could successfully run decision loops in Emukit without the multi-fidelity feature.

Note that 1 and 2 together would let us remove GPyOpt as a dependency altogether.

Opinions welcome.

Verify Windows support

We may have Windows users, lots of them. Our builds currently run for Linux and macOS only. We should verify that emukit works fine on Windows, and see if we can have a Travis build for that.

cost sensitive loop

So far Emukit only considers the case where one has a single model of the objective function. However, there are many cases, for example EIperSec, Fabolas, MTBO, ...., where one has two models: one for the objective function and one for the cost of evaluating it.
It would be fairly straightforward to bootstrap from the existing BayesianOptimizationLoop module to implement these methods. However, a cost-sensitive loop seems to be a fundamental feature that might also be interesting for Bayesian quadrature or experimental design, and I am wondering whether it makes sense to implement a general CostSensitiveOuterLoop class instead?
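A sketch of the two-model idea (the class and all names are hypothetical; the cost model is any Emukit-style model with a predict method):

import numpy as np

class CostWeightedAcquisition:
    """Hypothetical sketch: divide an acquisition by the predicted evaluation cost."""

    def __init__(self, acquisition, cost_model):
        self.acquisition = acquisition
        self.cost_model = cost_model          # second model, e.g. a GP over cost

    def evaluate(self, x):
        value = self.acquisition.evaluate(x)
        cost_mean, _ = self.cost_model.predict(x)
        return value / np.maximum(cost_mean, 1e-8)   # e.g. EI per second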

WSABI for BQ package

Both WSABI-L and WSABI-M (Gunter et al. 2014), with their corresponding acquisition functions, might be a nice addition.

  • WSABI-L (is already useful without the variance implementation)
    • WSABI-L mean prediction
    • WSABI-L integral variance
    • WSABI-L acquisition function
  • WSABI-M (is already useful without the variance implementation)
    • WSABI-M mean prediction
    • WSABI-M integral variance
    • WSABI-M acquisition function
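For reference, both variants place a GP on the square root of the (non-negative) integrand, per Gunter et al. (2014), with \alpha a small offset and m, C the posterior mean and covariance of the warped GP:

f(x) = \alpha + \tfrac{1}{2}\,\tilde f(x)^2, \qquad \tilde f \sim \mathcal{GP}(m, C)

\text{WSABI-L (linearisation):}\quad \mathbb{E}[f(x)] \approx \alpha + \tfrac{1}{2} m(x)^2, \qquad \operatorname{Cov}[f(x), f(x')] \approx m(x)\, C(x, x')\, m(x')

\text{WSABI-M (moment matching):}\quad \mathbb{E}[f(x)] = \alpha + \tfrac{1}{2}\left(m(x)^2 + C(x, x)\right)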

PEP8 compliance

We say in our contribution guidelines that the code should adhere to the PEP8 standard. It currently does not. Let's fix it, and then check that it stays compliant.
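One way to check, and keep checking in CI (assuming flake8; any equivalent linter works, and the line-length limit is a placeholder):

pip install flake8
flake8 emukit/ --max-line-length=120 --count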
