kjappelbaum / pyepal

Multiobjective active learning with tunable accuracy/efficiency tradeoff and clear stopping criterion.

License: Apache License 2.0

Topics: active-learning, machine-learning, multiobjective, pareto, python, hacktoberfest

pyepal's Introduction


Generalized Python implementation of a modified version of the ε-PAL algorithm [1, 2].

More detailed documentation is available in the project docs.

Installation

To install the latest stable release, use

pip install pyepal

or the conda channel (recommended)

conda install pyepal -c conda-forge

To install the latest development version from the head, use

pip install git+https://github.com/kjappelbaum/pyepal.git

Developers can install the extras [testing, docs, pre-commit], e.g., with pip install pyepal[testing]. Installation should take only a few minutes.

Additional Notes

  • On macOS you might need to install libomp (e.g., brew install libomp) for multithreading in some models.

  • We currently support Python 3.7 and 3.8.

  • If you want to limit how many CPUs openblas uses, you can export OPENBLAS_NUM_THREADS=1

Usage

The main logic is implemented in the PALBase class. There are some prebuilt classes for common use cases (GPy, sklearn) that inherit from this class. For more details about how to use the code and notes about the tutorials, see the docs.

Pre-Built classes

scikit-learn

If you want to use a list of sklearn models, you can use the PALSklearn class. The following snippet shows how to use it for one step; the basic principle is the same for all the different PAL classes.

import numpy as np

from pyepal import PALSklearn
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# For each objective, initialize a model
gpr_objective_0 = GaussianProcessRegressor(RBF())
gpr_objective_1 = GaussianProcessRegressor(RBF())

# The minimal input to create a PAL instance is a list of models,
# the design space (X, in ML terms "feature matrix") and the number of objectives
palsklearn_instance = PALSklearn(X, [gpr_objective_0, gpr_objective_1], 2)

# the next step is to provide some initial measurements.
# You can do this with the update_train_set function, which you
# can use throughout the active learning process to update the training set.
# For this, provide a numpy array of indices in your design space
# and the corresponding measurements
sampled_indices = np.array([1,2,3])
measurements = np.array([[1,2],
                        [0.8, 1],
                        [7,1]])
palsklearn_instance.update_train_set(sampled_indices, measurements)

# Now, you're ready to run the first iteration.
# This will return the next index to sample and update all the attributes
# If there are no unclassified samples left, it will return None and
# print a statement saying that the classification is completed
index_to_sample = palsklearn_instance.run_one_step()

GPy

If you want to use a list of GPy models, you can use the PALGPy class.

Coregionalized GPR

Coregionalized GPR models can utilize correlations between the objectives and also work in cases where some of the objectives are not measured for all samples.
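For illustration, a minimal sketch using the PALCoregionalized class; the build_coregionalized_model helper from pyepal.models.gpr is assumed here (check the docs for its exact signature).

from pyepal import PALCoregionalized
from pyepal.models.gpr import build_coregionalized_model

# X_train/y_train are the measured points; y_train may contain np.nan
# for objectives that were not measured for a given sample
model = build_coregionalized_model(X_train, y_train)

# one coregionalized model covers all objectives
pal_coreg_instance = PALCoregionalized(X, [model], 2)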

Custom classes

You will need to implement the _train() and _predict() functions if you inherit from PALBase. If you want to tune the hyperparameters of your models as new training points are added, you can implement a schedule by overriding the _should_optimize_hyperparameters() function and the _set_hyperparameters() function, which sets the hyperparameters for the model(s).

If you need to train a model, use self.design_space as the feature matrix and self.y as the target vector. Note that in self.y all objectives are turned into maximization problems. That is, if one of your problems is a minimization problem, PyePAL will flip its sign in self.y.
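How PyePAL learns the optimization direction in the first place: a minimal sketch, assuming the goals keyword described in the docs (one entry per objective, "max" or "min"); check the docs for the exact parameter name.

# maximize objective 0, minimize objective 1; PyePAL then flips the
# sign of the second objective internally in self.y
palsklearn_instance = PALSklearn(
    X, [gpr_objective_0, gpr_objective_1], 2, goals=["max", "min"]
)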

A basic example of how a custom class can be implemented is the PALSklearn class:

import numpy as np

from pyepal import PALBase
# validate_number_models is a small input-validation helper inside
# pyepal (the import path may differ slightly between versions)
from pyepal.pal.validate_inputs import validate_number_models


class PALSklearn(PALBase):
    """PAL class for a list of Sklearn (GPR) models, with one model per objective"""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        validate_number_models(self.models, self.ndim)

    def _train(self):
        for i, model in enumerate(self.models):
            model.fit(self.design_space[self.sampled], self.y[self.sampled, i].reshape(-1, 1))

    def _predict(self):
        means, stds = [], []
        for model in self.models:
            mean, std = model.predict(self.design_space, return_std=True)
            means.append(mean.reshape(-1, 1))
            stds.append(std.reshape(-1, 1))

        self._means = np.hstack(means)
        self._stds = np.hstack(stds)

For scheduling of the hyperparameter optimization, we have some predefined schedules in the pyepal.pal.schedules module.
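For illustration, a minimal sketch of such a schedule, hand-rolled rather than taken from pyepal.pal.schedules (check the docs for the predefined helpers' exact names); the iteration counter attribute is an assumption.

class MyPAL(PALSklearn):
    def _should_optimize_hyperparameters(self) -> bool:
        # re-optimize the hyperparameters on every tenth iteration
        # (self.iteration is assumed to be the counter kept by PALBase)
        return self.iteration % 10 == 0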

Test the algorithms

If the full design space is known, you can use a while loop to fully explore the space with PyePAL. For the theoretical guarantees of PyePAL to hold, you need to sample until all uncertainties are below epsilon. In practice, it is usually enough to use as a termination criterion that there are no unclassified samples left. For this, you can use the following snippet:

import numpy as np

from pyepal import PALGPy
from pyepal.utils import exhaust_loop
from pyepal.models.gpr import build_model

# X (design space) and y (measurements) are assumed to be defined

# indices for initialization
sample_idx = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 60, 70])

# build one model per objective
model_0 = build_model(X[sample_idx], y[sample_idx], 0)
model_1 = build_model(X[sample_idx], y[sample_idx], 1)

# initialize the PAL instance
palinstance = PALGPy(X, [model_0, model_1], 2, beta_scale=1)
palinstance.update_train_set(sample_idx, y[sample_idx])

# This will run the sampling and training as long as there
# are unclassified samples
exhaust_loop(palinstance, y)

To measure the performance, you can use the get_hypervolume function from pyepal.pal.utils. More indicators are implemented in packages like deap, pagmo, or pymoo.
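For illustration, a minimal usage sketch (the exact signature of get_hypervolume and the reference vector below are assumptions; check the docs):

import numpy as np

from pyepal.pal.utils import get_hypervolume

# hypervolume enclosed between the measured points and a reference vector
hv = get_hypervolume(y[sample_idx], reference_vector=np.array([-5.0, -5.0]))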

References

  1. Zuluaga, M.; Krause, A.; Püschel, M. ε-PAL: An Active Learning Approach to the Multi-Objective Optimization Problem. Journal of Machine Learning Research 2016, 17 (104), 1–32.
  2. Zuluaga, M.; Sergent, G.; Krause, A.; Püschel, M. Active Learning for Multi-Objective Optimization; Dasgupta, S., McAllester, D., Eds.; Proceedings of machine learning research; PMLR: Atlanta, Georgia, USA, 2013; Vol. 28, pp 462–470.

Citation

If you find this code useful for your work, please cite the PyePAL paper and the Zenodo archive of the code.

Acknowledgments

The research was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement 666983, MaGic), by the NCCR-MARVEL, funded by the Swiss National Science Foundation, and by the Swiss National Science Foundation (SNSF) under Grant 200021_172759. Part of the work was performed as part of the Explore Together internship program at BASF.

pyepal's People

Contributors

byooooo, kjappelbaum, and several automation bots (deepsource-autofix[bot], dependabot[bot], github-actions[bot], lgtm-com[bot], pre-commit-ci[bot]).


pyepal's Issues

update docs

add docs for

  • intuition about the algorithm
  • batch sampling (#26)
  • exhaust loop (#28)
  • explain hyperparameters
  • plotting API docs
  • plotting in getting started
  • explain that epsilon does not matter anymore in case error bars do not overlap
  • properties of PAL class
  • example of overconfident GPR
  • plots of hyperparameter influences

There are also some typos, e.g.:

  • ds to keep in mind is that 𝜖-PAL will not work with the predicti + heading of this section

Test Coverage

Expected Behavior

  • Every function has a test; this would be especially important for those in pal.core. But it is not clear to me what good test cases would be.

Actual Behavior

  • Only a minority of the functions have tests.

Batch sampling utility

Expected Behavior

In dispersant_screener we had the batch sampling argument in the pal function. There should be an easy way to do the same here.

Actual Behavior

Here, I didn't implement it as it simply means that the sample function needs to be run n times.

Options

  • refactor the sampling function to also take the number of samples
  • add a utility/wrapper function

PALGBDT implementation

  • Use quantile loss on a GBDT to predict the error bars (using LightGBM as an optional dependency); see the sketch below.

We need to spend some thought on how to optimize the hyperparameters. I would expect GBDT to be much less sensitive to the hyperparameters than GPR, but we should still update them at times. Cross-validation is too expensive, but I'm also not sure whether we should include hyperopt or something similar as a dependency ...

Also, the criterion is less clear for GBDT: in the Bayesian case we can simply use the likelihood, whereas here we would need some cross-validation.
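For illustration, a minimal sketch of the quantile-loss idea using LightGBM's sklearn API (the quantile levels below are arbitrary illustrative choices, not pyepal defaults):

from lightgbm import LGBMRegressor

# one triple of quantile regressors per objective: lower bound,
# median prediction, and upper bound as surrogate error bars
lower = LGBMRegressor(objective="quantile", alpha=0.16)
median = LGBMRegressor(objective="quantile", alpha=0.5)
upper = LGBMRegressor(objective="quantile", alpha=0.84)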

warning about overconfident model

Expected Behavior

  • PAL warns the user when it likely makes a mistake

Actual Behavior

  • There is no warning in case of overconfident models

  • check if there is an established method to do this

  • otherwise, check some heuristics, e.g., whether the mean or summed variance is below some threshold (see the sketch below)
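For illustration, a hypothetical heuristic along these lines (all names are illustrative, not pyepal API):

import numpy as np

def warn_if_overconfident(stds, y_measured, factor=0.01):
    # flag models whose average predicted std is tiny compared to
    # the spread of the measured data
    threshold = factor * np.std(y_measured, axis=0)
    if np.any(np.mean(stds, axis=0) < threshold):
        print("Warning: the model may be overconfident for some objectives.")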

update example notebook

  • more iterations of run_one_step (maybe implement a simple function that "measures" something)

shall we enable branch protection?

@byooooo I feel that it would enhance the code quality if we enabled branch protection (at least after the first release). That is, every PR would need a review before being merged. (In this case, you'd need to review my changes. ;)

Update author list in code (?)

Expected Behavior

All authors that are on the paper are also listed as authors in setup.py. They should also all have access to the repo.

Actual Behavior

Only Brian and I have access to the code and are listed as authors of the code.

Add optional logging of hypervolume

I didn't port this part over yet. Either we should make sure that the dependencies are light (need to check the pygmo installation process), in which case it can be enabled by default; otherwise it should be optional.

drop the n_dims argument in the class initialisation?

In many classes, where we ask users to provide one model per objective, this argument is actually redundant. It is mostly needed as a check and for the coregionalized case, so we might as well drop it in the general case and only require it in the coregionalized case.

Pin dependencies

Expected Behavior

In the requirements.txt and setup.py versions are pinned and we use those versions in the tests.

Actual Behavior

Versions are not pinned.

Archive code on zenodo.

Expected Behavior

Releases of the code are automatically archived on Zenodo, and we have a badge that links to the archive entry.

Actual Behavior

Not implemented.

Rescale the hyperrectangles for the sampling step

Expected Behavior

We should sample in a scale-invariant way, i.e., divide the edge length of the hyperrectangle by the mean μ. That is, use the coefficient of variation instead of the variance.
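For illustration, a minimal sketch of the proposed rescaling (names are illustrative):

import numpy as np

def coefficient_of_variation(means, stds, eps=1e-12):
    # scale-invariant uncertainty: std relative to the magnitude of the mean
    return stds / (np.abs(means) + eps)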

deal with np.nan in cross-validation

Expected Behavior

Should not raise any error due to missing data. We should just skip those cases.

Actual Behavior

Raises an error.

Steps to Reproduce the Problem

Run ._crossvalidate() on some dataset with missing observations.
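For illustration, a minimal sketch of the proposed fix (not the pyepal implementation):

import numpy as np

def masked_mae(y_true, y_pred):
    # skip missing observations instead of raising an error
    mask = ~np.isnan(y_true)
    return np.mean(np.abs(y_true[mask] - y_pred[mask]))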

spell check

uncertainity -> uncertainty (misspelled at the moment)

Add badge to MyBinder for example notebooks

Expected Behavior

Example can be run on MyBinder without any installation.
We need to check how to best deal with the GPy dependency here.

Actual Behavior

Currently not possible because repo is private.

Add prospector to CI

Expected Behavior

Static analysis is performed with prospector before the code is checked in.

Actual Behavior

Using it in our pre-commit workflow is not so nice, as we'd need to install the dependencies. So the most reasonable approach is to check types/formatting etc. in the pre-commit workflow and then run prospector before pytest in the Python package workflow.

add home link in docs

Expected Behavior

Should be able to click somewhere to get back to home.

Actual Behavior

Not possible.

add option to exclude high-std points

Sometimes one wants to be extra sure and exclude high-variance points from the classification (z > 1 or 2), because then the model is probably quite unsure in that region of the design space.

multiprocessing for GP training

For sklearn, it might help to add a feature that allows one to train multiple GPs on multiple cores. We could add a multiprocessing wrapper so that training runs on multiple cores asynchronously (assuming, if I understand correctly, that sklearn GPs are trained on a single core).

example code:

import multiprocessing as mp

def _train_parallel(self):
    # Note: mp.Pool cannot pickle a closure like this worker; in
    # practice it would need to be a module-level function.
    def worker(i):
        self.models[i].fit(
            self.design_space[self.sampled[:, i]],
            self.y[self.sampled[:, i], i].reshape(-1, 1),
        )

    with mp.Pool(processes=mp.cpu_count()) as pool:
        pool.map(worker, range(len(self.models)))
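An alternative sketch that sidesteps the pickling limitation noted in the comment above: a thread pool (model fitting typically releases the GIL inside numpy/scipy, so threads can still give a speedup).

from concurrent.futures import ThreadPoolExecutor

def _train_threaded(self):
    def worker(i):
        self.models[i].fit(
            self.design_space[self.sampled[:, i]],
            self.y[self.sampled[:, i], i].reshape(-1, 1),
        )

    with ThreadPoolExecutor() as executor:
        # materialize the map to surface any exceptions from the workers
        list(executor.map(worker, range(len(self.models))))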

Docstrings for PAL classes

Expected Behavior

Autocomplete in Jupyter should show all options.

Actual Behavior

Due to missing docstrings, users do not see the possible arguments.
