flatironinstitute / bayes-kit
Bayesian inference and posterior analysis for Python
License: MIT License
Consider the scenario of testing a sampler against a model where we know the expectations (e.g. any Normal model). The Metropolis sampler has tests along the lines of
model = ...
# set up
draws = np.array([metropolis.sample()[0] for _ in range(M)])
mean = draws.mean(axis=0)
var = draws.var(axis=0, ddof=1)
np.testing.assert_allclose(mean, model.posterior_mean(), atol=0.1)
np.testing.assert_allclose(var, model.posterior_variance(), atol=0.1)
These are tests with absolute tolerances. Absolute-tolerance tests can require a large number of iterations, awkwardly chosen tolerance levels, or both.
I think it makes more sense to evaluate a sampler's draws using a Monte Carlo standard error; see mcse_mean(x):
model = ...
# setup
draws = np.array([metropolis.sample()[0] for _ in range(M)])
mean = draws.mean(axis=0)
se_mean = mcse_mean(draws, axis=0)  # needs implementation
std = draws.std(axis=0, ddof=1)
se_std = mcse_std(draws, axis=0)  # needs implementation
z = 5
assert np.all(np.abs((mean - model.posterior_mean()) / se_mean) < z)
assert np.all(np.abs((std - model.posterior_std()) / se_std) < z)
Such tests more naturally incorporate sampler efficiency into testing and validation, and should hopefully remove some of the awkwardly chosen tolerance values.
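For the mcse_mean / mcse_std functions marked "needs implementation" above, a simple batch-means estimator is one option. This is a hedged sketch with a hypothetical signature; bayes-kit might instead prefer an ESS-based version, sd(x) / sqrt(ess(x)):

```python
import numpy as np

def mcse_mean(x, batches=20):
    # Batch-means Monte Carlo standard error of the mean: split the
    # chain into `batches` contiguous batches and report the standard
    # error of the batch means. Name and signature are hypothetical.
    x = np.asarray(x)
    n = (len(x) // batches) * batches  # drop any remainder
    batch_means = x[:n].reshape(batches, -1).mean(axis=1)
    return batch_means.std(ddof=1) / np.sqrt(batches)
```

For i.i.d. draws this reduces to roughly sd(x) / sqrt(M); for autocorrelated draws the batch means absorb the autocorrelation, which is exactly what makes the z-score test above sensitive to sampler efficiency.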
The goal of this issue is to implement a more robust fix for the issue exposed through PR #28.
Many samplers in this package can accept a set of explicit initial parameter values (constructor parameter init). Currently, there's no check that the supplied prior for the model parameters actually has the right size/shape, leading to possible undefined behavior.
Most likely the fix will be something like modifying the constructors of the affected samplers to throw an error if init.shape != model.dims(), but there are a few issues to discuss:
- If the init parameter has the right overall length but a different shape from _theta, do we want to reshape or throw?
- Should the check be implemented via a Protocol with a partial implementation, or through a shared utility function, etc.? We've talked about inheritance vis-a-vis samplers, but haven't really implemented anything like it yet.
I expect the overall code changes to be minimal (unless we get really ambitious with project structure), but I'd like to have a record of the discussion in case we need to refer back to it.
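One possible shape of the shared-utility option (names hypothetical; the reshape-vs-throw question above is deliberately resolved as a throw here):

```python
import numpy as np

def validate_init(init, dims):
    # Throw on any mismatch rather than silently reshaping; a sampler
    # constructor would call this before assigning self._theta.
    init = np.asarray(init, dtype=np.float64)
    if init.shape != (dims,):
        raise ValueError(
            f"init must have shape ({dims},), but got {init.shape}"
        )
    return init
```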
Several of our algorithms seem to fail sporadically on this model and report an incorrect mean. The incorrect value is always roughly the same, 0.38
for the mean.
Running this locally, it seems that both smc and mala often converge to a number around 0.35, which makes me wonder two things:
First, is our analytic calculation of the mean correct? We currently have the value as 7/25 = 0.28
Second, is our testing tolerance too high? (almost certainly)
Failures
tempered_smc:
https://github.com/flatironinstitute/bayes-kit/actions/runs/4008341002/jobs/6882368677
https://github.com/flatironinstitute/bayes-kit/actions/runs/4008269276/jobs/6882211668
https://github.com/flatironinstitute/bayes-kit/actions/runs/3971370594/jobs/6808183840
https://github.com/flatironinstitute/bayes-kit/actions/runs/3971574251/jobs/6808622237
https://github.com/flatironinstitute/bayes-kit/actions/runs/4000634814/jobs/6865992560
mala:
https://github.com/flatironinstitute/bayes-kit/actions/runs/3971574251/jobs/6808622046
Implement this sampler, which is an ensemble Metropolis method.
There may be a better way to cache log density and gradient calls in the drghmc
sampler.
At the moment, a stack is used as a cache when we recursively compute the log acceptance probability of a proposed draw. The cache is primarily used in the sample()
and accept()
functions, when they each call the proposal_map()
function.
However, @WardBrian ingeniously pointed out that the recursive calls already create their own stack. Thus it appears to be possible to use two variables as a cache in the sample()
function: one variable for the current draw and one variable for the proposed draw. The cache in the accept()
function can thus be replaced by adding a single input argument to the accept()
function call and returning the appropriate cached value; this approach would implicitly use the recursive function call stack instead of manually implementing a cache stack.
This issue is intended to serve as a reminder to include this feature after the pull request #39 is accepted.
Add an implementation of partial momentum refresh (aka underdamped) MALA. The idea is to only partially refresh the momentum each iteration. This method was introduced in
I found Radford Neal's description to be the clearest (page 5, displayed equations in middle), where the momentum (which I'm calling rho
to match our other notation) is updated for mixture rate alpha in (0, 1)
as
z ~ normal(0, I)
rho[n] = alpha * rho[n-1] + sqrt(1 - alpha^2) * z
With alpha = 0
, we get back standard MALA and results should match. As alpha
increases, the fraction refreshed decreases and induces more persistence of momentum. With alpha = 1
, we get no refresh and the result is a single, pure Hamiltonian trajectory that continues each iteration (never leaving the same level set of the Hamiltonian other than for arithmetic/discretization error).
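The update above is a one-liner; a sketch assuming rho is a NumPy array, with names matching the notation above (function name hypothetical):

```python
import numpy as np

def refresh_momentum(rho, alpha, rng):
    # Partial momentum refresh: alpha = 0 is a full refresh (standard
    # MALA), alpha = 1 keeps the momentum unchanged. The sqrt factor
    # keeps the stationary distribution of rho standard normal.
    z = rng.standard_normal(rho.shape)
    return alpha * rho + np.sqrt(1.0 - alpha**2) * z
```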
Lines 120 to 135 in 8f8c494
I might be missing something here, but I think you need to normalize acor so that acor[0] is 1.
As it says on the tin:
refactor the Metropolis-adjusted Langevin (MALA) sampler to be an instance of Metropolis-Hastings (MH)
We do not want to implement standard Metropolis as an instance of MH, as that involves unnecessary arithmetic.
Implement the sampler described in section 5 of
The basic reversible update version (issue #23) should be done first.
Lines 57 to 75 in 8f8c494
I think line 74 should be n = n + 1
instead of n = n + 2
to be consistent with the description of the function. Also, with n = n + 2
I confirmed that you can get negative contributions to the area under the ACF.
Add an implementation of split R-hat as detailed in BDA3.
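A minimal sketch of split R-hat following the BDA3 recipe (split each chain in half, then compute between/within variances over the split chains; function name and array layout are assumptions):

```python
import numpy as np

def split_rhat(chains):
    # chains: array of shape (num_chains, num_draws)
    chains = np.asarray(chains, dtype=np.float64)
    m, n = chains.shape
    half = n // 2
    # Split each chain into its first and second halves (dropping one
    # draw per chain when n is odd), doubling the number of chains.
    split = chains[:, : 2 * half].reshape(2 * m, half)
    chain_means = split.mean(axis=1)
    between = half * chain_means.var(ddof=1)
    within = split.var(axis=1, ddof=1).mean()
    var_plus = (half - 1) / half * within + between / half
    return np.sqrt(var_plus / within)
```

Well-mixed chains give values near 1; chains stuck at different locations give values well above 1.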
This came up in a discussion with @gil2rok during #39, but applies to all of our algorithms. The user-provided log_density_gradient function may throw an exception (we may even expect it to, e.g. if the parameters are out of support).
Currently, if this happens, the exception is propagated all the way up out of the algorithm and into the caller's code. This is probably not what we want. Instead, we could do as Stan does and treat any exception in the log density calculation as resulting in a rejection (assuming the algorithm contains an accept-reject step).
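A minimal sketch of the Stan-style behavior (wrapper name hypothetical): catch the exception at the evaluation site and return -inf, which guarantees rejection in any accept-reject step:

```python
import numpy as np

def safe_log_density(log_density, theta):
    # Treat any exception from the user-provided log density as a log
    # density of -inf, which an MH accept step will always reject.
    try:
        return log_density(theta)
    except Exception:
        return -np.inf
```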
Implement Stan's cross-chain (R-hat) adjusted ESS.
I created a local test suite that tries to run bayes_kit.HMCDiag(..., stepsize=0.01, steps=50)
and bayes_kit.MALA(epsilon=0.01)
on a suite of problems from posteriordb. I am just testing that outputs are not NaN and that no RuntimeErrors are encountered. I ran a total of 92 tests: the two algorithms above x 46 posteriors from posteriordb. 48/92 failed, and 44/92 passed.
This is not an issue with BayesKit per se. HMC and MALA can surely be made to work with smarter initialization and tuning. But this does call attention to the fact that it would be nice to support smarter initialization/warmup procedures here.
Also, I'm curious if it would be useful to contribute my posteriordb integration (including these tests) to this repository directly. If so, I can create a separate PR for that.
Add the potential scale reduction convergence statistic that accepts ragged input arrays.
Per my discussion with @WardBrian, in HMC a draw can refer to two distinct concepts:
1. theta, which denotes a draw from the target distribution
2. (theta, rho), which contains position and momentum variables accordingly, denoting a draw from the joint distribution
In Bayes-Kit, the data type DrawAndLogP suggests definition 1 when it is defined here. However, comments from @bob-carpenter in my pull request here suggest definition 2. Additionally, Stan documentation uses the phrase "draw" in HMC to refer to definition 1 here.
My code for the drghmc
sampler uses the DrawAndLogP
data type, suggesting definition 1, but in its documentation uses a draw to refer to definition 2. I worry this may be confusing.
Lastly, if Bayes-Kit is intended as a pedagogical resource, it may be especially worthwhile to clarify this confusion. Would love to hear others' thoughts on this!
Lines 8 to 26 in 8f8c494
It seems that you are truncating the Fourier coefficients to force their number to be a power of 2. When transforming the power spectrum back, this will have the effect of sampling on a different grid than the original data. This can be solved by omitting the size parameter and just doing the full FFT. You may also consider keeping only the first half of the output array.
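For reference, a common FFT-based autocorrelation pattern (not necessarily what bayes-kit should adopt; function name hypothetical): zero-pad to 2n so the FFT computes the linear rather than circular autocorrelation, keep the first n lags, and normalize so lag 0 is exactly 1:

```python
import numpy as np

def autocorr_fft(x):
    x = np.asarray(x, dtype=np.float64)
    x = x - x.mean()
    n = len(x)
    # Zero-pad to 2n to avoid circular wraparound in the convolution.
    f = np.fft.fft(x, 2 * n)
    acov = np.fft.ifft(f * np.conj(f)).real[:n]  # keep first n lags
    return acov / acov[0]  # normalize so acor[0] is 1
```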
Lines 124 to 130 in 8f8c494
I think minprev should really be prev_min
Most of the tests in test_hmc.py
fail, likely as a result of the recent refactoring of hmc.py
. To reproduce, simply call the functions in test_hmc.py
from the base directory of the bayes-kit repo.
I noticed that there are two distinct errors causing these tests to fail.
The first issue occurs when using list comprehension to sample from the hmc
object:
draws = np.array([hmc.sample()[0] for _ in range(M)])
This line of code is used throughout test_hmc.py
and is supposed to create an array of HMC samples. Instead, this code creates an array containing the last hmc.sample() element repeated M times. My hunch is that hmc.sample() returns the same mutable array object on every call, so the comprehension collects M references to a single object rather than M independent draws.
To resolve this issue on my machine, I have simply removed the list comprehension:
draws = np.zeros(M)
for i in range(M):
    draws[i] = hmc.sample()[0]
After fixing the list comprehension error, the test_hmc_diag_rep()
and test_hmc_binom()
tests still fail. These errors are harder to diagnose and may arise from incorrect HMC implementation, improper HMC parameters, etc.
Goal: organize classes for inference and posterior analysis API
Continuous only, or also discrete parameters? (allow discrete data and generated quantities)
Two ways to operate (easily interconvertible)
How to deal with transforms and generated quantities like in Stan? They generate new random variables as functions of others, which get their own expectations, control variates, etc.
Densities just functions over vectors rather than over arbitrary sequences of data types
Could introduce a higher-level notion of a distribution like in Boost, which can involve densities and RNGs bundled; could apply to prior specification or as a way to describe posterior class (sample plus density plus derivatives)
How to store all the meta-data like configuration, dates, etc.? Probably just a dictionary that is ideally usable as input.
log prior plus log likelihood is equal to log density plus a constant if all are specified
float log_density(vector)
vector grad_log_density(vector)
and matrix hessian_log_density(vector)
float log_prior(vector)
and vector grad_log_prior(vector)
and matrix hessian_log_prior(vector)
float log_likelihood(vector)
and vector grad_log_likelihood(vector)
and matrix hessian_log_likelihood(vector)
vector prior_sample()
vector sample()
array(vector) ensemble_sample()
(vector, weight) importance_sample()
sampler sampling_importance_resampler(importance_sampler)
approximate samplers add approximate density, effectively giving importance sampler
importance_sampler approximate_sampler()
dictionary variational_fit()
approximate samplers add approximate density, effectively giving importance sampler
importance_sampler laplace_fit()
sampler approximate_sampler()
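The density part of the interface sketched above might be expressed as a Python Protocol (names and typing details are assumptions, not settled API):

```python
from typing import Protocol
import numpy as np
import numpy.typing as npt

Vector = npt.NDArray[np.float64]

class LogDensityModel(Protocol):
    def dims(self) -> int: ...
    def log_density(self, theta: Vector) -> float: ...
    def grad_log_density(self, theta: Vector) -> Vector: ...

class StdNormal:
    # Example model satisfying the protocol structurally.
    def dims(self) -> int:
        return 1
    def log_density(self, theta: Vector) -> float:
        return float(-0.5 * theta @ theta)  # up to an additive constant
    def grad_log_density(self, theta: Vector) -> Vector:
        return -theta
```

Because Protocols use structural subtyping, models need not inherit from anything, which sidesteps some of the inheritance questions above.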
Tricky because they add an expectation-specific shadow value that is averaged along with the draws to compute an estimate of the expectation.
vector control_variate(vector draws, array(vector) gradients)
Allow user to specify:
How to handle adaptation for adaptive samplers? They are just stateful in perhaps a discrete way like Stan Phase II warmup.
How to monitor convergence of adaptation to stop adaptation and start sampling?
all algorithms work with ragged data structures
During review of PR #39, a request was made to change the leapfrog implementation in that PR (see resolved issues in highlighted section). The end result was subtly different:
# pos = position
# mo = momentum
# grad = gradient
def leapfrog(pos, mo):
    grad = get_grad(pos)
    mo_mid = mo + (0.5 * stepsize) * grad
    pos += stepsize * mo_mid
    for i in range(stepcount - 1):
        grad = get_grad(pos)
        mo_mid += stepsize * grad
        pos += stepsize * mo_mid
    grad = get_grad(pos)
    mo = mo_mid + (0.5 * stepsize) * grad
    return (pos, mo)
Specifically, the steps in the stepcount
loop are reordered, and the loop iterates one fewer time. This is achieved by moving half a step forward, rather than backward, before entering the loop (and skipping the loop entirely in the event only a single step is taken).
The original implementation in that PR was an exact copy of the leapfrog implementation in hmc.py which was introduced in PR #30, so presumably any updates to the leapfrog implementation in #39 should be applied to the code touched in #30 as well.
Opening this issue to remind us to update the HMC implementation to match the finalized version in DRGHMC.
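To sanity-check either ordering, a self-contained variant can help (parameter names hypothetical; the globals above become explicit arguments). On a standard normal target, leapfrog should approximately conserve the Hamiltonian and be exactly time-reversible up to floating-point error:

```python
import numpy as np

def leapfrog(pos, mo, grad_fn, stepsize, stepcount):
    # Same update ordering as above, with copies so the caller's
    # arrays are not mutated in place.
    pos = np.array(pos, dtype=np.float64)
    mo_mid = np.asarray(mo, dtype=np.float64) + (0.5 * stepsize) * grad_fn(pos)
    pos += stepsize * mo_mid
    for _ in range(stepcount - 1):
        mo_mid = mo_mid + stepsize * grad_fn(pos)
        pos += stepsize * mo_mid
    mo_out = mo_mid + (0.5 * stepsize) * grad_fn(pos)
    return pos, mo_out
```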
Lines 99 to 135 in 8f8c494
I have a couple comments about this function. These are minor so feel free to ignore!
First, I think isotonic regression is a better choice than this scheme for approximating the decay by a monotonically decreasing sequence. This scheme will tend to underestimate the area under the curve. Take the following example: 1, 0.3, 0.6, 0
(Ignoring the pairing part of this) The present scheme would approximate it by 1, 0.3, 0.3, 0
, whereas isotonic regression would give 1, 0.45, 0.45, 0
. The integral of the former is 1.6, whereas that of the latter is 1.9.
My second comment is that when n is even, the (n-1)th sample of the autocorrelation is neglected.
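The isotonic fit in the example above can be computed with the pool-adjacent-violators algorithm; a minimal sketch for a non-increasing fit (function name hypothetical):

```python
def pava_decreasing(y):
    # L2-optimal monotone non-increasing fit via pool-adjacent-violators,
    # applied to the negated sequence (fit non-decreasing, then flip back).
    blocks = []  # [sum, count] for each pooled block of -y
    for v in (-x for x in y):
        blocks.append([v, 1])
        # Merge while the previous block's mean exceeds the new one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([-(s / c)] * c)
    return out
```

On the example from the comment above, 1, 0.3, 0.6, 0, this yields 1, 0.45, 0.45, 0.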
Add implementation of ranked R-hat as described in this paper: