dflemin3 / approxposterior

A Python package for approximate Bayesian inference and optimization using Gaussian processes

Home Page: https://dflemin3.github.io/approxposterior/

License: MIT License

Python 99.05% Makefile 0.22% TeX 0.73%
gaussian-processes python inference bayesian-inference approximate-inference bayesian-optimization

approxposterior's Introduction

approxposterior

A Python package for approximate Bayesian inference with computationally-expensive models

Overview

approxposterior is a Python package for efficient approximate Bayesian inference and Bayesian optimization of computationally-expensive models. approxposterior trains a Gaussian process (GP) surrogate for the computationally-expensive model and employs an active learning approach to iteratively improve the GP's predictive performance while minimizing the number of calls to the expensive model required to generate the GP's training set.

approxposterior implements both the Bayesian Active Learning for Posterior Estimation (BAPE, Kandasamy et al. (2017)) and Adaptive Gaussian process approximation for Bayesian inference with expensive likelihood functions (AGP, Wang & Li (2018)) algorithms for estimating posterior probability distributions when the model is computationally expensive. In such situations, the goal is to infer posterior probability distributions for model parameters, given some data, under the additional constraint of minimizing the number of forward model evaluations. approxposterior trains a Gaussian process (GP) surrogate for the likelihood evaluation by modeling the covariances in logprobability (logprior + loglikelihood) space, then uses this GP within an MCMC sampler for each likelihood evaluation to perform the inference. approxposterior iteratively improves the GP's predictive performance by leveraging the inherent uncertainty in the GP's predictions to identify high-likelihood regions of parameter space where the GP is uncertain. It then evaluates the forward model at these points to expand the training set in relevant regions of parameter space, re-training the GP to maximize its predictive ability while minimizing the size of the training set. Check out the BAPE paper by Kandasamy et al. (2017) and the AGP paper by Wang & Li (2018) for in-depth descriptions of the respective algorithms.
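In outline, the active learning loop looks something like the following minimal sketch (illustrative only: lnprob, utility, and the random-candidate search stand in for approxposterior's internals, and a george-style GP with a compute method is assumed):

import numpy as np

def active_learning_loop(gp, theta, y, lnprob, utility, niters, m, bounds):
    # Schematic BAPE/AGP loop. lnprob is the expensive logprior +
    # loglikelihood; utility scores candidate points using the GP.
    lows, highs = np.array(bounds).T
    for _ in range(niters):
        for _ in range(m):
            # Score random candidates; approxposterior instead numerically
            # optimizes the utility function over parameter space.
            cand = np.random.uniform(lows, highs, size=(1000, theta.shape[1]))
            xnew = cand[np.argmax([utility(gp, y, x) for x in cand])]

            # Only here do we pay for the expensive forward model.
            theta = np.vstack([theta, xnew])
            y = np.append(y, lnprob(xnew))

            # Re-condition (and periodically re-optimize) the GP.
            gp.compute(theta)
        # ...then run MCMC using the GP in place of lnprob...
    return gp, theta, y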

Documentation

Check out the documentation at https://dflemin3.github.io/approxposterior/ for a more in-depth explanation of the code, detailed API notes, and numerous examples with figures.

Installation

Using conda:

conda install -c conda-forge approxposterior

Using pip:

pip install approxposterior

This step can fail if george (the Python Gaussian Process package) is not properly installed and compiled. To install george, run

conda install -c conda-forge george

From source:

git clone https://github.com/dflemin3/approxposterior.git
cd approxposterior
python setup.py install

A simple example

Below is a simple application of approxposterior based on the Wang & Li (2018) example.

from approxposterior import approx, gpUtils, likelihood as lh, utility as ut
import numpy as np

# Define algorithm parameters
m0 = 50                           # Initial size of training set
m = 20                            # Number of new points to find each iteration
nmax = 2                          # Maximum number of iterations
bounds = [(-5,5), (-5,5)]         # Prior bounds
algorithm = "bape"                # Use the Kandasamy et al. (2017) formalism
seed = 57                         # RNG seed
np.random.seed(seed)

# emcee MCMC parameters
samplerKwargs = {"nwalkers" : 20}        # emcee.EnsembleSampler parameters
mcmcKwargs = {"iterations" : int(2.0e4)} # emcee.EnsembleSampler.run_mcmc parameters

# Sample design points from prior
theta = lh.rosenbrockSample(m0)

# Evaluate forward model log likelihood + lnprior for each theta
y = np.zeros(len(theta))
for ii in range(len(theta)):
    y[ii] = lh.rosenbrockLnlike(theta[ii]) + lh.rosenbrockLnprior(theta[ii])

# Default GP with an ExpSquaredKernel
gp = gpUtils.defaultGP(theta, y, white_noise=-12)

# Initialize object using the Wang & Li (2018) Rosenbrock function example
ap = approx.ApproxPosterior(theta=theta,
                            y=y,
                            gp=gp,
                            lnprior=lh.rosenbrockLnprior,
                            lnlike=lh.rosenbrockLnlike,
                            priorSample=lh.rosenbrockSample,
                            bounds=bounds,
                            algorithm=algorithm)

# Run!
ap.run(m=m, nmax=nmax, estBurnin=True, nGPRestarts=3, mcmcKwargs=mcmcKwargs,
       cache=False, samplerKwargs=samplerKwargs, verbose=True, thinChains=False,
       onlyLastMCMC=True)

# Check out the final posterior distribution!
import corner

# Load in chain from last iteration
samples = ap.sampler.get_chain(discard=ap.iburns[-1], flat=True, thin=ap.ithins[-1])

# Corner plot!
fig = corner.corner(samples, quantiles=[0.16, 0.5, 0.84], show_titles=True,
                    scale_hist=True, plot_contours=True)

# Plot where the forward model was evaluated
fig.axes[2].scatter(ap.theta[m0:,0], ap.theta[m0:,1], s=10, color="red", zorder=20)

# Save figure
fig.savefig("finalPosterior.png", bbox_inches="tight")

The final distribution will look something like this:

[finalPosterior.png: corner plot of the final approximate posterior distribution]

The red points were selected by approxposterior by maximizing the BAPE utility function. At each red point, approxposterior ran the forward model to evaluate the true likelihood and added this input-likelihood pair to the GP's training set. We retrain the GP each time to improve its predictive ability. Note how the points are selected in regions of high posterior density, precisely where we would want to maximize the GP's predictive ability! By using the BAPE point selection scheme, approxposterior does not waste computational resources by evaluating the forward model in low likelihood regions.
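For reference, the BAPE utility from Kandasamy et al. (2017) is the variance of the exponentiated GP, which is large exactly where the predicted logprobability is high and the GP is uncertain. A minimal sketch, assuming a george-style GP (approxposterior ships its own version in its utility module):

import numpy as np

def bape_utility(gp, y, x):
    # Var[exp(GP)] at x via the lognormal variance formula.
    mu, var = gp.predict(y, np.atleast_2d(x), return_var=True)
    return float(np.exp(2.0 * mu + var) * (np.exp(var) - 1.0))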

Check out the examples directory for Jupyter Notebook examples and explanations. Check out the full documentation for a more in-depth explanation of classes, methods, variables, and how to use the code.

Contribution

If you would like to contribute to this code, please feel free to fork the repository, make some edits, and open a pull request. If you find a bug, have a suggestion, etc, please open up an issue!

Citation

If you use this code, please cite the following:

Fleming and VanderPlas (2018):

@ARTICLE{Fleming2018,
   author = {{Fleming}, D.~P. and {VanderPlas}, J.},
    title = "{approxposterior: Approximate Posterior Distributions in Python}",
  journal = {The Journal of Open Source Software},
     year = 2018,
    month = sep,
   volume = 3,
    pages = {781},
      doi = {10.21105/joss.00781},
   adsurl = {http://adsabs.harvard.edu/abs/2018JOSS....3..781F},
  adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

Kandasamy et al. (2017):

@article{Kandasamy2017,
title = "Query efficient posterior estimation in scientific experiments via Bayesian active learning",
journal = "Artificial Intelligence",
volume = "243",
pages = "45 - 56",
year = "2017",
issn = "0004-3702",
doi = "https://doi.org/10.1016/j.artint.2016.11.002",
url = "http://www.sciencedirect.com/science/article/pii/S0004370216301394",
author = "Kirthevasan Kandasamy and Jeff Schneider and Barnabás Póczos",
keywords = "Posterior estimation, Active learning, Gaussian processes"}

Wang & Li (2018):

@article{Wang2018,
author = {Wang, Hongqiao and Li, Jinglai},
title = {Adaptive Gaussian Process Approximation for Bayesian Inference with Expensive Likelihood Functions},
journal = {Neural Computation},
volume = {30},
number = {11},
pages = {3072-3094},
year = {2018},
doi = {10.1162/neco\_a\_01127},
URL = { https://doi.org/10.1162/neco_a_01127},
eprint = {https://doi.org/10.1162/neco_a_01127}}

approxposterior's People

Contributors

dflemin3, jbirky, jedbrown, jlustigy, syrte


approxposterior's Issues

Add support to handle/save blobs

MCMC can also track blobs, the 2nd returned parameter(s) in emcee's LnLike functions. approxposterior should include the functionality to handle them, either by saving them or ignoring them, depending on the user's needs.
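A minimal sketch of how blobs flow through emcee (emcee 3 API; the quadratic logprobability is a placeholder):

import numpy as np
import emcee

def lnprob_with_blob(theta):
    lnp = -0.5 * np.sum(theta**2)
    # Anything returned after the logprobability is stored as a blob.
    return lnp, float(np.sum(theta))

ndim, nwalkers = 2, 10
p0 = np.random.randn(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob_with_blob)
sampler.run_mcmc(p0, 100)
blobs = sampler.get_blobs()  # shape (nsteps, nwalkers)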

Add MultiNest for posterior retrieval

I should implement MultiNest (specifically PyMultiNest), as nested sampling typically converges more quickly than MCMC and can handle multi-modal posterior distributions, whereas MCMC cannot. It should be pretty straightforward given that both samplers require an LnPrior and an LnLikelihood function. I'll need to make sure to properly handle the kwargs in the mcmcUtils validation functions.
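For reference, a minimal PyMultiNest sketch (the uniform prior bounds and Gaussian loglikelihood are placeholders; approxposterior would substitute the GP surrogate's logprobability):

import pymultinest

def prior(cube, ndim, nparams):
    # Map the unit hypercube to the prior bounds, e.g. uniform on [-5, 5].
    for i in range(ndim):
        cube[i] = -5.0 + 10.0 * cube[i]

def loglike(cube, ndim, nparams):
    # The GP surrogate's logprobability prediction would go here.
    return -0.5 * sum(cube[i]**2 for i in range(ndim))

pymultinest.run(loglike, prior, 2, outputfiles_basename="ap_mn_",
                resume=False, verbose=True)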

Installation Issues

I get errors when I try to install the package on a brand new Ubuntu 16.04 instance (run by Docker).

# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
# apt-get update
# apt-get install python curl git
# python --version
Python 2.7.12
# curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
# python get-pip.py
# pip --version
pip 10.0.1 from /usr/local/lib/python2.7/dist-packages/pip (python 2.7)

# pip install approxposterior
This fails with the following message:
...
Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-W2oMFN/subprocess32/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-GrV0mc/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-W2oMFN/subprocess32/

Trying to install from source now:
# git clone https://github.com/dflemin3/approxposterior.git
# cd approxposterior/
# python setup.py install
This fails with the following message:
...
Installed /usr/local/lib/python2.7/dist-packages/corner-2.0.1-py2.7.egg
Searching for george
Reading https://pypi.org/simple/george/
Downloading https://files.pythonhosted.org/packages/27/19/9de575be629e3a41c3ca6b1e2c80e0ae90a2e831436c5f70cc8d72e37ab7/george-0.3.1.tar.gz#sha256=175f1a8022430adab8bf8e5e0ffc3941e5f48ca831bd9460e69716cc2820e628
Best match: george 0.3.1
Processing george-0.3.1.tar.gz
Writing /tmp/easy_install-Nz8xxm/george-0.3.1/setup.cfg
Running george-0.3.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-Nz8xxm/george-0.3.1/egg-dist-tmp-WOjxot
error: Setup script exited with error: SandboxViolation: mkdir('/root/.local', 448) {}
The package setup script has attempted to modify files on your system
that are not within the EasyInstall build area, and has been aborted.
This package cannot be safely installed by EasyInstall, and may not
support alternate installation locations even if you run its setup
script by hand.  Please inform the package's author and the EasyInstall
maintainers to find out if a fix or workaround is available.

Installing conda as recommended in the instructions:
# apt-get install bzip2
# cd /tmp
# curl -O https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
# bash Anaconda3-5.0.1-Linux-x86_64.sh 
# source /root/.bashrc
# conda --version
conda 4.3.30

After conda is installed, pip returns the same error as before:
# pip install approxposterior
...
Command "/root/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-wjqzy3ec/george/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-tmg_y5gf-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-wjqzy3ec/george/

Trying to install the package from source again:
# python setup.py install
...
RuntimeError: Unsupported compiler -- at least C++11 support is needed!

# apt-get install gcc
# gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Same error again:
# python setup.py install
...
RuntimeError: Unsupported compiler -- at least C++11 support is needed!

Please provide more detailed instructions on building the package and avoiding these errors. Please list all dependencies, all required customizations, and all installation steps. If Ubuntu 16.04 is not your primary target, please specify which distribution you want the package to run on.

Scaling parameter values to improve GP hyperparameter optimization

In the original BAPE algorithm paper, Kandasamy+2015 scaled model parameter values between [0,1] using the appropriate simple linear transformation. Performing this scaling in approxposterior could be useful for convergence and numerical stability issues by keeping parameter values in a reasonable range, especially for metric scales.

This can be implemented without too much difficulty using the sklearn preprocessing module, e.g. the MinMaxScaler. Furthermore, the sklearn codebase is well-tested and robust, so its inclusion shouldn't introduce too many dependency issues.

To do this, I could either use the bounds kwarg that stipulates the hard bounds for model parameters, or I could train the scaler on the GP's initial theta, although I think the former idea is more desirable relative to the latter.
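A minimal sketch of the bounds-trained option using sklearn's MinMaxScaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

bounds = [(-5, 5), (-5, 5)]

# Fit the scaler on the hard parameter bounds rather than on the initial
# training set, so the [0,1] mapping is fixed for the whole run.
scaler = MinMaxScaler()
scaler.fit(np.array(bounds).T)  # rows: (lower bounds, upper bounds)

theta = np.random.uniform(-5, 5, size=(10, 2))
thetaScaled = scaler.transform(theta)             # now in [0, 1]
thetaOrig = scaler.inverse_transform(thetaScaled)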

conda installation is broken

approxposterior v0.2rc0 depends on emcee version > 2.2.1, their rc02. Currently, that version is not available on conda-forge, which breaks the conda installation. Installation using pip and from source still works, but I need to figure out the conda issue.

refactor burn-in estimation

Since the user has access to the entire emcee sampler object, and hence the full MCMC chains, the user could (and probably should) run their own burn-in/convergence diagnostics. However, it would be useful to give the user some more in-house burn-in estimation procedures, e.g. the Gelman-Rubin statistic, to help their analysis. I should also write a test for these.
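A minimal sketch of the Gelman-Rubin statistic, assuming the chains are reshaped to (nchains, nsteps, ndim):

import numpy as np

def gelmanRubin(chains):
    # Potential scale reduction factor R-hat per parameter; values
    # near 1 suggest the chains have converged.
    nchains, nsteps, _ = chains.shape
    chainMeans = chains.mean(axis=1)               # (nchains, ndim)
    B = nsteps * chainMeans.var(axis=0, ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean(axis=0)    # mean within-chain variance
    varHat = (nsteps - 1) / nsteps * W + B / nsteps
    return np.sqrt(varHat / W)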

Expose GP to user

The GP's performance is highly dependent on its hyperparameters, obviously (see here from the george docs), so my scheme of a "general" way of initializing the GP is not going to work for everyone's use case. I need to allow the user to fully define the GP: provide initial guesses for hyperparameters, e.g. the metric, amplitude, etc., and so on. My hyperparameter optimization scheme should be ok, however.
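A minimal sketch of what a fully user-defined GP could look like with george (the hyperparameter values are placeholders):

import numpy as np
import george
from george import kernels

ndim = 2
theta = np.random.uniform(-5, 5, size=(50, ndim))

# User supplies the amplitude and a per-dimension metric guess directly.
kernel = 1.0 * kernels.ExpSquaredKernel(metric=[1.0, 1.0], ndim=ndim)
gp = george.GP(kernel, fit_mean=True, white_noise=-12,
               fit_white_noise=False)
gp.compute(theta)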

Use other regression algorithms besides the GP for logprobability predictions

Currently, as per Kandasamy+2015's BAPE algorithm, I use the GP in a regression framework to predict logprobabilities for use with likelihood-based inferences, e.g. MCMC. In principle, there is no reason not to use a different regression algorithm, e.g. random forest, to make the logprobability predictions given the training set {theta, y}. In this framework, the GP would still be used to select points by maximizing the utility function, thereby only building the training set in high-likelihood regions of parameter space, but the GP would not be explicitly used for posterior estimation.

In general, this would require the user to specify an algorithm, a way to train it, and a way to optimize its hyperparameters. Sklearn pipelines could be excellent for this task; however, the question becomes how much work is placed on the user to implement this machinery for their runs. I imagine a meta-model class, with train, predict, etc. methods, that can ingest either the GP or an sklearn-friendly object. Using the superclass of sklearn estimators could make this tractable and would mostly require me writing an sklearn-like class to play nice with the george Gaussian process. The fit method, for instance, could call approxposterior's hyperparameter optimization routines, while the predict method could call george's predict method to estimate means and variances of the conditional distribution.
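A minimal sketch of such a hypothetical meta-model class (the names are illustrative, not an existing approxposterior API):

class SurrogateModel:
    # sklearn-style wrapper so the GP could be swapped for any
    # regressor exposing fit/predict.

    def __init__(self, model):
        self.model = model

    def fit(self, theta, y):
        self.model.fit(theta, y)
        return self

    def predict(self, theta):
        # sklearn regressors return means only; a GP wrapper would also
        # return variances for the utility function.
        return self.model.predict(theta)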

Checking GP predictive accuracy on-the-fly

One way to check the accuracy of the GP's predictions: for each of the m new points in parameter space identified by the GP, cache the GP's prediction before running the forward model, then compare it against the forward model's result to estimate metrics like the relative error.
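A minimal sketch of this check, assuming a george-style GP and an expensive lnprob callable:

import numpy as np

def checkGPAccuracy(gp, yTrain, thetaNew, lnprob):
    # Cache GP predictions at the newly selected points, then compare
    # against the true (expensive) logprobability.
    thetaNew = np.atleast_2d(thetaNew)
    mu, var = gp.predict(yTrain, thetaNew, return_var=True)
    yTrue = np.array([lnprob(t) for t in thetaNew])
    relErr = np.abs((mu - yTrue) / yTrue)
    return relErr, np.sqrt(var)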

Unnecessary creation of a new GP object in `findNextPoint`

Hi David, thanks for the nice package! It is really helpful!

I plan to apply your package to integer parameters, which might require playing some tricks on the kernel (e.g., adding a white noise).
Here I have a question relating to the code at approx.py#L712-L716

self.gp = george.GP(kernel=self.gp.kernel, fit_mean=True,
                    mean=self.gp.mean,
                    white_noise=self.gp.white_noise,
                    fit_white_noise=False)
self.gp.set_parameter_vector(currentHype)

It creates a new GP object; however, it does not preserve all the properties of the original self.gp, e.g., fit_white_noise can be True in a user-supplied gp.

While we can check the status in gp.models["white_noise"].unfrozen_mask in principle, creating a new GP object here seems unnecessary, because the following line updates everything we need (as far as I can see):

self.gp.compute(self.theta)

If the above understanding is correct, it sounds reasonable to remove lines L712-L716. Am I right? Or have I missed something? Thanks!

Let users set the name of the output files

Currently chainFile does this for the MCMC chains, but it would be nice to make this an attribute of approxposterior.approx.ApproxPosterior for all associated outputs.

Standardize code formatting

For the 1.0 release, approxposterior should use packages like flake8 or black to make sure I'm following all relevant PEPs and such. Furthermore, I can use these packages similar to coveralls during my CI checks.

-inf likelihood in allowable regions of parameter space

Some forward models can return 0 likelihood in parameter space which, formally speaking, is fine. The issue, however, is that 0 probability corresponds to a -inf loglikelihood. The GP (approxposterior uses https://github.com/dfm/george for the GP implementation) needs to learn off of these input-output pairs, (theta, y). The GP can learn with a -inf, but then all the predictions (means, variances, etc.) become -infs and prevent the algorithm from functioning properly.

Potential solutions:
- When a -inf loglikelihood arises, set it to some low value.
- Throw away -infs (but this requires throwing away the results of a computationally expensive likelihood evaluation?).
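A minimal sketch of the first option (the floor value is an arbitrary placeholder):

import numpy as np

LOGL_FLOOR = -1.0e10  # hypothetical finite stand-in for -inf

def safeLnprob(lnlike, lnprior, theta):
    # Clamp non-finite logprobabilities so the GP training set stays finite.
    lnp = lnlike(theta) + lnprior(theta)
    return lnp if np.isfinite(lnp) else LOGL_FLOOR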

Optimize the GP less

Currently, I re-optimize the GP hyperparameters every time I add a new design point. This can get computationally expensive, so it may be a good idea to add the option to re-optimize the GP only every nGPOpt times AP adds a new design point.

Explore approxposterior parallelization paradigm

Currently, I am getting varied training sets in approxposterior runs with the exact same initial conditions. I suspect that this has to do with the GP optimization (see #41 ) with random seeding, and how the findNextPoint method depends on the GP, and therefore so too does the training set that approxposterior generates.

However, this presents something of an opportunity to use approxposterior in parallel. Suppose we initialize N identical approxposterior objects and run them all in parallel. Simultaneously, we train a GP surrogate on the ensemble training set and run MCMC. The approxposterior "walkers" (if you will) never waste time deriving the approximate posterior using MCMC, and will constantly be helping to generate a master training set.

I'm really just spitballing here. In general, it would be great to take advantage of multiple cores to build an optimal training set for the surrogate model.

Fitting for white noise

In general, approxposterior won't just work on simulated data; it needs to handle real data, obviously! Need to implement the ability to supply, and fit for, white noise.

Add bounds, scale to ApproxPosterior object?

It could be useful, and would make the code cleaner, to have both bounds and scale be initialized with the ApproxPosterior object. Bounds is used in numerous places to maintain numerical stability and validate point selection. Scaling parameters is used throughout now as well, so it could make sense to initialize the scaler earlier on. Also, this could allow the user to select several common scalers, like MinMax, Normal, etc., using the sklearn interface. MinMax is easy to initialize with bounds for training the scaler, but Normal and other types of scaling require fitting on training data, which in our case could be the initial theta.

I can get around breaking the API by allowing the user to override bounds in the ApproxPosterior run method and with sensible default choices for both scale and bounds.

Add a warning for when the GP optimization optGPEveryN > m

If optGPEveryN > m, that is, the GP hyperparameter optimization cadence is greater than the number of new points to find per iteration, the GP hyperparameters will never be re-optimized. approxposterior should at least warn the user if this is occurring. See line 681 in approx.py for where this can be fixed and/or where an error/warning could be raised.

log prior called twice

The log prior is called twice:
mu += self._lnprior(theta_test) in _gpll(self, theta), and
y_t = self._lnlike(theta_t, *args, **kwargs) + self._lnprior(theta_t) in the run method. Pretty sure the former is overkill, but verify this.

Multiprocessing is slow: too much overhead spinning up new processes

Currently, the multiprocessing implementation for parallelizing GP optimizations and new design point selection is slow, presumably because spinning up new processes requires pickling the GP and sending it to each process, which is expensive given the GP's non-trivial structure and large-ish covariance matrix.

Potential fixes include schemes to share the data with each process, like the one documented here.
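One such pattern is a multiprocessing Pool initializer that ships the GP to each worker once rather than once per task; a minimal sketch (the utility function is a placeholder):

import multiprocessing as mp
import numpy as np

_sharedGP = None

def _initWorker(gp):
    # Each worker unpickles the GP exactly once, not per evaluation.
    global _sharedGP
    _sharedGP = gp

def _score(x):
    # Placeholder utility; the real one would query _sharedGP.
    return float(np.sum(x**2))

if __name__ == "__main__":
    candidates = [np.random.randn(2) for _ in range(100)]
    with mp.Pool(4, initializer=_initWorker, initargs=(None,)) as pool:
        scores = pool.map(_score, candidates)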

Single parameter inference causes ValueError

When performing single-parameter inference, np.hstack in findNextPoint line 663 fails because of a dimension mismatch.

Issue appears in version with pip, but appears solved using the GitHub version.

user-defined emcee samplers

User should be able to supply their own initialized emcee (http://dfm.io/emcee/current/, the MCMC implementation approxposterior is currently set up to use) sampler object. The code is mostly there for a user to provide an initialized emcee sampler object, but it needs to be finalized. The user will need to pass None, or some dummy function, for the sampler's likelihood function since approxposterior will replace it with the GP loglikelihood function.
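A minimal sketch of supplying a custom sampler (the DEMove choice and placeholder logprobability are illustrative; approxposterior would swap in the GP surrogate's logprobability):

import emcee

nwalkers, ndim = 20, 2

def placeholderLnprob(theta):
    # Dummy function; replaced internally by the GP loglikelihood.
    return 0.0

sampler = emcee.EnsembleSampler(nwalkers, ndim, placeholderLnprob,
                                moves=emcee.moves.DEMove())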

iburn estimate can fail

traceback:

[ 0.04016588  0.04634279  0.0908399   0.12369704]
Calling MIERUN for run: /Users/Jake/Documents/smart_runs/titan/runmie_titan_test3_zq2rn.scr
SUCCESS: MIERUN run complete: /Users/Jake/Documents/smart_runs/titan/runmie_titan_test3_zq2rn.scr
Calling SMART for run: /Users/Jake/Documents/smart_runs/titan/ap_tmp/runsmart_titan_test3_zq2rn_hitran2012_1666_16666cm.scr
SUCCESS: SMART run complete: /Users/Jake/Documents/smart_runs/titan/ap_tmp/runsmart_titan_test3_zq2rn_hitran2012_1666_16666cm.scr
-12597246.1497
/Users/Jake/Projects/Packages/approxposterior/approxposterior/mcmc_utils.py:39: RuntimeWarning: invalid value encountered in true_divide
  result = r/(variance*(np.arange(n, 0, -1)))

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      6     bounds=bounds, which_kernel=which_kernel,
      7     n_kl_samples=100000, verbose=False, debug=False,
----> 8     timing=False)

/Users/Jake/Projects/Packages/approxposterior/approxposterior/bp.py in run(self, theta, y, m0, m, M, nmax, Dmax, kmax, sampler, cv, seed, timing, which_kernel, bounds, debug, n_kl_samples, verbose, update_prior, **kw)
    286     # Estimate burn-in, save it
--> 287     iburn = mcmc_utils.estimate_burnin(sampler, nwalk, nsteps, ndim)
    288     self.iburns.append(iburn)

/Users/Jake/Projects/Packages/approxposterior/approxposterior/mcmc_utils.pyc in estimate_burnin(sampler, nwalk, nsteps, ndim)
    112     # Save autocorrelation length
--> 113     autolength.append(np.min(roots))

/Users/Jake/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in amin(a, axis, out, keepdims)
   2371     return _methods._amin(a, axis=axis,
-> 2372                           out=out, **kwargs)

/Users/Jake/anaconda2/lib/python2.7/site-packages/numpy/core/_methods.pyc in _amin(a, axis, out, keepdims)
     28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29     return umr_minimum(a, axis, None, out, keepdims)

ValueError: zero-size array to reduction operation minimum which has no identity

Potential fix: a try/except ValueError block where iburn is set to 1. Add an assert statement to ensure the sampler has non-zero length, i.e. to ensure that the MCMC actually ran.
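A minimal sketch of that fix (the function name is hypothetical):

import numpy as np

def estimateBurninSafe(roots, chainLength):
    # Fall back to iburn = 1 when the autocorrelation analysis
    # yields no usable roots.
    assert chainLength > 0, "MCMC sampler has zero-length chains"
    try:
        return int(np.min(roots))
    except ValueError:  # zero-size array
        return 1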

Discarding unimportant parameters

After each iteration, the code could run a check for which parameters have no impact on the likelihood, and they could be removed to speed up the next iteration. A user-defined parameter (or two) would need to be defined to set tolerances for such pruning. For cases with dozens or hundreds of independent variables, the removal of unimportant parameters could save significant computational time.

Can't clone from git@github.com:dflemin3

Hi! Per the install instructions in the README, I tried the following step:

➜  cloned_projects git clone git@github.com:dflemin3/approxposterior.git
Cloning into 'approxposterior'...
The authenticity of host 'github.com (192.30.253.112)' can't be established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'github.com,192.30.253.112' (RSA) to the list of known hosts.
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

When I tried to clone directly from the https address, things worked fine:

➜  cloned_projects git clone https://github.com/dflemin3/approxposterior.git
Cloning into 'approxposterior'...
remote: Enumerating objects: 312, done.
remote: Counting objects: 100% (312/312), done.
remote: Compressing objects: 100% (184/184), done.
remote: Total 2047 (delta 213), reused 199 (delta 128), pack-reused 1735
Receiving objects: 100% (2047/2047), 40.47 MiB | 8.51 MiB/s, done.
Resolving deltas: 100% (1297/1297), done.

Just wanted to leave a bit of feedback regarding the suggested install steps! (:

Can approxposterior derive posterior distributions for parameters that aren't initial conditions?

Right now, no. approxposterior can only derive posterior distributions for forward model inputs, i.e. initial conditions. It would be awesome to be able to derive posterior distributions for model outputs, like the final water content for an exoplanet subject to hydrodynamic atmospheric escape. This can't be done using the BAPE or AGP algorithms since the GP predicts the likelihood of a given set of initial conditions by training on model input-output pairs, where the output is transformed to a likelihood by comparing to observations, perhaps using a chi^2-like metric.

What I could do is allow the user to specify some ML algorithm to predict the output of interest, like a random forest model, that trains on the forward model input-output pairs, using cross-validation to optimize hyperparameters. I assume that this approach won't generalize well since the training sets will be small by design and the forward models are typically rather complicated. This would probably result in appreciable systematic error, but it's probably worth experimenting with for a bit to see if it's feasible.

Cross-Validation to select GP hyperparameters

Currently, approxposterior selects GP hyperparameters by optimizing the marginal loglikelihood. This can potentially lead to overfitting, so I should implement the ability for users to use K-fold cross-validation to optimize GP hyperparameters. This can get tricky as the dimensionality grows, however, so care should be taken when determining which hyperparameters to try during the cross-validation.
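A minimal sketch of K-fold CV scoring for a candidate george GP hyperparameter vector:

import numpy as np
from sklearn.model_selection import KFold

def cvScore(gp, theta, y, params, k=5):
    # Held-out mean squared error for one hyperparameter setting;
    # lower is better, complementing the marginal loglikelihood.
    gp.set_parameter_vector(params)
    errs = []
    for train, test in KFold(n_splits=k).split(theta):
        gp.compute(theta[train])
        mu = gp.predict(y[train], theta[test], return_cov=False)
        errs.append(np.mean((mu - y[test])**2))
    return np.mean(errs)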

Add run_mcmc method

Add a method to just run the MCMC part of approxposterior's run method for ApproxPosterior objects that have already been initialized. Adding this functionality would require making sure everything was initialized properly, e.g. theta and y exist, the mcmc and sampler kwargs are valid, and so on.

Utility functions for training set initialization

I should add utility functions to help users initialize their training sets. That is, given hard bounds for parameter ranges, functions that generate sets of initial conditions for the user's forward model to build the initial GP training set. Options will include sampling from the prior (already implemented), uniform sampling over the ranges, and Latin hypercube sampling, as was used in this paper and this paper by Simeon Bird and collaborators. The Latin hypercube sampling option seems particularly useful.

After implementing these utility functions, I should add tests and documentation/examples.
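A minimal sketch of the Latin hypercube option using scipy.stats.qmc (scipy >= 1.7):

import numpy as np
from scipy.stats import qmc

bounds = [(-5, 5), (-5, 5)]

sampler = qmc.LatinHypercube(d=len(bounds))
unit = sampler.random(n=50)            # stratified samples in [0, 1]^d
lows, highs = np.array(bounds).T
theta0 = qmc.scale(unit, lows, highs)  # initial design points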

Implement Bayesian Optimization

In principle, approxposterior should be able to perform Bayesian optimization, given the GP model, to find maximum likelihood solutions. This is obviously not a new idea and can be implemented using george, e.g. this example, but it would be a good tool to join the approxposterior ecosystem. For example, one could wish to plot the maximum likelihood estimate on a corner plot of approxposterior constraints, or maybe someone is only interested in the "best fit" solution. Either way, it could be a nice addition.

A great algorithm to implement would be Jones+1998.
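A minimal sketch of the Jones et al. (1998) expected improvement acquisition, assuming a george-style GP:

import numpy as np
from scipy.stats import norm

def expectedImprovement(gp, yTrain, x, yBest, xi=0.01):
    # EI for maximization, from the GP's predictive mean and variance.
    mu, var = gp.predict(yTrain, np.atleast_2d(x), return_var=True)
    sigma = np.sqrt(np.maximum(var, 1.0e-12))
    z = (mu - yBest - xi) / sigma
    return (mu - yBest - xi) * norm.cdf(z) + sigma * norm.pdf(z)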

allow args, kwargs in likelihood wrapper functions

User-defined loglikelihood functions should be able to accept args and kwargs. Typically, users want to compare model predictions to observed data in their likelihood functions, perhaps using a chi-squared metric, or something similar. A user could pass in the data via an arg or kwarg, so this functionality should be implemented.
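A minimal sketch of the intended pattern (the toy forward model is a placeholder):

import numpy as np

def lnlike(theta, data, sigma=1.0):
    # Chi-squared style loglikelihood taking the data as an arg.
    model = theta[0] + theta[1] * np.arange(len(data))  # placeholder model
    return -0.5 * np.sum(((data - model) / sigma)**2)

# Hypothetical plumbing inside approxposterior:
# y_t = self._lnlike(theta_t, *args, **kwargs) + self._lnprior(theta_t)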

Posterior Fitting Test

Make a test that validates our procedure of fitting a Gaussian Mixture Model to the posterior distribution. Can be done by fitting 2 disjoint Gaussians with known mus and sigmas, resampling from the fits and comparing the sample's mus and sigmas to the known values.
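A minimal sketch of such a test using sklearn's GaussianMixture:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(57)
mus, sigmas = np.array([-3.0, 3.0]), np.array([0.5, 1.0])

# Two well-separated Gaussians with known parameters.
samples = np.concatenate([rng.normal(m, s, 5000)
                          for m, s in zip(mus, sigmas)])

gmm = GaussianMixture(n_components=2, random_state=57)
gmm.fit(samples.reshape(-1, 1))

# Resample from the fit and compare recovered moments to the truth.
resampled, labels = gmm.sample(10000)
for k in range(2):
    comp = resampled[labels == k]
    print(comp.mean(), comp.std())  # should approximate a known (mu, sigma)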

more tests!

I need to implement more tests, like inf and NaN handling for likelihood function outputs, to ensure consistent, numerically-accurate behavior.
