bayes-kit's People

Contributors

bob-carpenter, gil2rok, jsoules, wardbrian, wrongu

bayes-kit's Issues

Testing against known models, use mcse instead of absolute tolerance

Consider the scenario of testing a sampler against a model where we know the expectations (e.g. any Normal model). The Metropolis sampler has tests along the lines of

model = ...
# set up 
draws = np.array([metropolis.sample()[0] for _ in range(M)])
mean = draws.mean(axis=0)
var = draws.var(axis=0, ddof=1)

np.testing.assert_allclose(mean, model.posterior_mean(), atol=0.1)
np.testing.assert_allclose(var, model.posterior_variance(), atol=0.1)

These are tests with absolute tolerances. Absolute tolerance tests can require a large number of iterations, awkwardly chosen tolerance levels, or both.

I think it makes more sense to evaluate a sampler's draws using a Monte Carlo standard error; see mcse_mean(x) $= sd(x) / \sqrt{ess\_mean(x)}$ and mcse_std(x) from the stan-dev package posterior. Such Monte Carlo standard error tests would look something like

model = ...
# setup
draws = np.array([metropolis.sample()[0] for _ in range(M)])
mean = draws.mean(axis=0)
se_mean = mcse_mean(draws, axis = 0) # needs implementation
std = draws.std(axis=0, ddof=1)
se_std = mcse_std(draws, axis = 0) # needs implementation

z = 5  # test at 5 Monte Carlo standard errors
assert np.all(np.abs((mean - model.posterior_mean()) / se_mean) < z)
assert np.all(np.abs((std - model.posterior_std()) / se_std) < z)

Such tests more naturally incorporate sampler efficiency into testing and validation, and should hopefully remove some of the awkwardly chosen tolerance values.
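
For reference, here is a minimal sketch of what mcse_mean and mcse_std could look like for a single scalar parameter, reusing the ess_imse estimator from bayes_kit/ess.py (axis handling is omitted, and the mcse_std formula is a rough delta-method approximation rather than the exact formula used by posterior):

import numpy as np
from bayes_kit.ess import ess_imse  # ESS estimator already in bayes_kit/ess.py

def mcse_mean(draws: np.ndarray) -> float:
    """Monte Carlo standard error of the mean: sd(x) / sqrt(ess_mean(x))."""
    draws = np.asarray(draws)
    return draws.std(ddof=1) / np.sqrt(ess_imse(draws))

def mcse_std(draws: np.ndarray) -> float:
    """Rough MCSE of the standard deviation via the delta method:
    se(s) ~= se(s^2) / (2 * s), with se(s^2) computed as for the mean but
    on the squared deviations."""
    draws = np.asarray(draws)
    s = draws.std(ddof=1)
    dev2 = (draws - draws.mean()) ** 2
    return dev2.std(ddof=1) / np.sqrt(ess_imse(dev2)) / (2 * s)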

For samplers, check that initialization values for parameters match the size/dimensions of the input model

The goal of this issue is to implement a more robust fix for the issue exposed through PR #28.

Many samplers in this package can accept a set of explicit initial parameter values (constructor parameter init). Currently, there is no check that the supplied initial values for the model parameters actually have the right size/shape, leading to possible undefined behavior.

Most likely the fix is going to be something like modifying the constructors of the affected samplers to throw an error if init.shape != model.dims(), but there are a few issues to discuss:

  1. Do we have a clear design decision that model dimensions are always linear?
  2. In the event that the init parameter has the right overall length but a different shape from _theta, do we want to reshape or throw?
  3. Can/should we generalize this check to avoid code repetition among the different affected models, either through a Protocol with a partial implementation, or through a shared utility function, etc? We've talked about inheritance vis-a-vis samplers, but haven't really implemented anything like it yet.

I expect the overall code changes to be minimal (unless we get really ambitious with project structure), but I'd like to have a record of the discussion in case we need to refer back to it.
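
For the record, a minimal sketch of what a shared check could look like (validate_init is a hypothetical helper name, and reshaping rather than throwing is just one possible answer to question 2 above):

import numpy as np
from numpy.typing import NDArray

def validate_init(init: NDArray[np.float64], dims: int) -> NDArray[np.float64]:
    """Check user-supplied initial values against the model's dimensionality.

    Raises ValueError if the number of elements is wrong; flattens to shape
    (dims,) if only the shape differs.
    """
    init = np.asarray(init, dtype=np.float64)
    if init.size != dims:
        raise ValueError(
            f"init must have model.dims() = {dims} elements, but {init.size = }"
        )
    return init.reshape(dims)

A sampler constructor could then call something like self._theta = validate_init(init, self._model.dims()) when init is provided.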

Testing instability in beta-binomial test model

Several of our algorithms seem to fail sporadically on this model and report an incorrect mean. The incorrect value is always roughly the same, 0.38 for the mean.

Running this locally, both smc and mala often seem to converge to a number around 0.35, which makes me wonder two things:
First, is our analytic calculation of the mean correct? We currently have the value as 7/25 = 0.28.

Second, is our testing tolerance too high? (almost certainly)

Failures

tempered_smc:
https://github.com/flatironinstitute/bayes-kit/actions/runs/4008341002/jobs/6882368677
https://github.com/flatironinstitute/bayes-kit/actions/runs/4008269276/jobs/6882211668
https://github.com/flatironinstitute/bayes-kit/actions/runs/3971370594/jobs/6808183840
https://github.com/flatironinstitute/bayes-kit/actions/runs/3971574251/jobs/6808622237
https://github.com/flatironinstitute/bayes-kit/actions/runs/4000634814/jobs/6865992560

mala:
https://github.com/flatironinstitute/bayes-kit/actions/runs/3971574251/jobs/6808622046

Improve Gradient Caching in DRGHMC Sampler

There may be a better way to cache log density and gradient calls in the drghmc sampler.

At the moment, a stack is used as a cache when we recursively compute the log acceptance probability of a proposed draw. The cache is primarily used in the sample() and accept() functions, when they each call the proposal_map() function.

However, @WardBrian ingeniously pointed out that the recursive calls already create their own stack. Thus it appears to be possible to use two variables as a cache in the sample() function: one variable for the current draw and one variable for the proposed draw. The cache in the accept() function can then be replaced by adding a single input argument to the accept() function call and returning the appropriate cached value; this approach would implicitly use the recursive function call stack instead of manually implementing a cache stack.

This issue is intended to serve as a reminder to include this feature after the pull request #39 is accepted.

partial momentum refresh MALA

Add an implementation of partial momentum refresh (aka underdamped) MALA. The idea is to only partially refresh the momentum each iteration. This method was introduced in

I found Radford Neal's description to be the clearest (page 5, displayed equations in middle), where the momentum (which I'm calling rho to match our other notation) is updated for mixture rate alpha in (0, 1) as

z ~ normal(0, I)
rho[n] = alpha * rho[n-1] + sqrt(1 - alpha^2) * z

With alpha = 0, we get back standard MALA and results should match. As alpha increases, the fraction refreshed decreases and induces more persistence of momentum. With alpha = 1, we get no refresh and the result is a single, pure Hamiltonian trajectory that continues each iteration (never leaving the same level set of the Hamiltonian other than for arithmetic/discretization error).
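
A minimal sketch of the refresh step, assuming rho is a NumPy array and rng is a numpy.random.Generator:

import numpy as np

def partial_momentum_refresh(rho: np.ndarray, alpha: float, rng: np.random.Generator) -> np.ndarray:
    """Partially refresh the momentum: rho <- alpha * rho + sqrt(1 - alpha^2) * z.

    alpha = 0 gives a full refresh (standard MALA); alpha = 1 leaves the
    momentum unchanged (a single continuing Hamiltonian trajectory).
    """
    z = rng.normal(size=rho.shape)
    return alpha * rho + np.sqrt(1 - alpha**2) * z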

normalize acor[0] to be 1?

bayes-kit/bayes_kit/ess.py

Lines 120 to 135 in 8f8c494

if len(chain) < 4:
    raise ValueError(f"ess requires len(chains) >=4, but {len(chain) = }")
acor = autocorr(chain)
n = first_neg_pair_start(acor)
prev_min = acor[1] + acor[2]
# convex minorization uses slow loop
accum = prev_min
i = 3
while i + 1 < n:
    minprev = min(prev_min, acor[i] + acor[i + 1])
    accum = accum + minprev
    i = i + 2
# end diff code
sigma_sq_hat = acor[0] + 2 * accum
ess = len(chain) / sigma_sq_hat
return ess

I might be missing something here, but I think you need to normalize acor so that acor[0] is 1.

in first_neg_pair_start, use n = n + 1

def first_neg_pair_start(chain: VectorType) -> IntType:
    """
    Return the index of first element of the sequence whose sum with the following
    element is negative, or the length of the sequence if there is no such element.

    Parameters:
        chain: input sequence
    Return:
        index of first element whose sum with following element is negative, or
        the number of elements if there is no such element
    """
    N = len(chain)
    n = 0
    while n + 1 < N:
        if chain[n] + chain[n + 1] < 0:
            return n
        n = n + 2
    return N

I think line 74 should be n = n + 1 instead of n = n + 2 to be consistent with the description of the function. Also, with n = n + 2 I confirmed that you can get negative contributions to the area under the ACF.
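
With that change, the loop would read (everything else unchanged):

    N = len(chain)
    n = 0
    while n + 1 < N:
        if chain[n] + chain[n + 1] < 0:
            return n
        n = n + 1  # step one element at a time, matching the docstring
    return N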

Handle user-provided log densities that throw exceptions

This came up in a discussion with @gil2rok during #39, but it applies to all of our algorithms. The user-provided log_density_gradient function may throw an exception (we may even expect it to, e.g. if the parameters are out of support).

Currently, if this happens, the exception is propagated all the way up out of the algorithm and into the caller's code. This is probably not what we want to do. Instead, we could do what Stan does and treat any exception in the log density calculation as resulting in a rejection (assuming the algorithm contains an accept-reject step).
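
As a rough sketch of what that could look like at the point where a proposal is evaluated (a generic symmetric-proposal Metropolis step with hypothetical names, not a particular sampler in this repo):

import numpy as np

def metropolis_accept_safe(model, theta_prop, theta_cur, logp_cur, rng):
    """Accept/reject step that treats exceptions raised by the user-supplied
    log density as rejections rather than propagating them to the caller."""
    try:
        logp_prop = model.log_density(theta_prop)
    except Exception:
        logp_prop = -np.inf  # e.g. parameters out of support -> automatic rejection
    if np.log(rng.uniform()) < logp_prop - logp_cur:
        return theta_prop, logp_prop
    return theta_cur, logp_cur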

Algorithms fail on surprisingly many posteriordb problems

I created a local test suite that tries to run bayes_kit.HMCDiag(..., stepsize=0.01, steps=50) and bayes_kit.MALA(epsilon=0.01) on a suite of problems from posteriordb. I am just testing that outputs are not NaN and that no RuntimeErrors are encountered. I ran a total of 92 tests: the two algorithms above times 46 posteriors from posteriordb. 48/92 failed and 44/92 passed.

This is not an issue with BayesKit per se. HMC and MALA can surely be made to work with smarter initialization and tuning. But this does call attention to the fact that it would be nice to support smarter initialization/warmup procedures here.

Also, I'm curious if it would be useful to contribute my posteriordb integration (including these tests) to this repository directly. If so, I can create a separate PR for that.

Clarifying Draw Terminology

Per my discussion with @WardBrian, in HMC a draw can refer to two distinct concepts:

  1. The position variable theta, which denotes a draw from the target distribution
  2. The tuple (theta, rho) that contains the position and momentum variables, respectively, denoting a draw from the joint distribution.

In Bayes-Kit, the data type DrawAndLogP suggests definition 1 when it is defined here. However, comments from @bob-carpenter in my pull request here suggest definition 2. Additionally, Stan documentation uses the phrase "draw" in HMC to refer to definition 1 here.

My code for the drghmc sampler uses the DrawAndLogP data type, suggesting definition 1, but its documentation uses a draw to refer to definition 2. I worry this may be confusing.

Lastly, if Bayes-Kit is intended as a pedagogical resource, it may be especially worthwhile to clarify this confusion. Would love to hear others' thoughts on this!

autocorr_fft - truncating to be a power of 2?

def autocorr_fft(chain: VectorType) -> VectorType:
    """
    Return sample autocorrelations at all lags for the specified sequence.
    Algorithmically, this function calls a fast Fourier transform (FFT).

    Parameters:
        chain: sequence whose autocorrelation is returned
    Returns:
        autocorrelation estimates at all lags for the specified sequence
    """
    size = 2 ** np.ceil(np.log2(2 * len(chain) - 1)).astype("int")
    var = np.var(chain)
    ndata = chain - np.mean(chain)
    fft = np.fft.fft(ndata, size)
    pwr = np.abs(fft) ** 2
    N = len(ndata)
    acorr = np.fft.ifft(pwr).real / var / N
    return acorr

It seems that the FFT length is being forced to a power of 2. When transforming the power spectrum back, this has the effect of sampling on a different grid than the original data. This can be solved by omitting the size parameter and just doing the full FFT. You may also consider keeping only the first half of the output array.

small bug in ess_imse

bayes-kit/bayes_kit/ess.py

Lines 124 to 130 in 8f8c494

prev_min = acor[1] + acor[2]
# convex minorization uses slow loop
accum = prev_min
i = 3
while i + 1 < n:
    minprev = min(prev_min, acor[i] + acor[i + 1])
    accum = accum + minprev

I think minprev should really be prev_min: as written, prev_min is never updated inside the loop, so the running minimum is not actually carried forward.
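
That is, the loop body would become:

    while i + 1 < n:
        prev_min = min(prev_min, acor[i] + acor[i + 1])
        accum = accum + prev_min
        i = i + 2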

HMC Tests Are Failing

Overview

Most of the tests in test_hmc.py fail, likely as a result of the recent refactoring of hmc.py. To reproduce, simply call the functions in test_hmc.py from the base directory of the bayes-kit repo.

I noticed that there are two distinct errors causing these tests to fail.

List Comprehension Error

The first issue occurs when using list comprehension to sample from the hmc object:

draws = np.array([hmc.sample()[0] for _ in range(M)])

This line of code is used throughout test_hmc.py and is supposed to create an array of HMC samples. Instead, this code creates an array containing the last hmc.sample() element repeated M times. My hunch is that this occurs because of late-binding in Python lists, as explained here.

To resolve this issue on my machine, I have simply removed the list comprehension:

draws = np.zeros(M)
for i in range(M):
    draws[i] = hmc.sample()[0]

Statistical Error

After fixing the list comprehension error, the test_hmc_diag_rep() and test_hmc_binom() tests still fail. These errors are harder to diagnose and may arise from incorrect HMC implementation, improper HMC parameters, etc.

API class design

Design considerations

Goal: organize classes for inference and posterior analysis API

  • Continuous only, or also discrete parameters? (allow discrete data and generated quantities)

  • Two ways to operate (easily interconvertible)

    • batch: take a batch of draws of a given size
    • online: take a single draw
  • How to deal with transforms and generated quantities like in Stan? They generate new random variables as functions of others, which get their own expectations, control variates, etc.

  • Densities just functions over vectors rather than over arbitrary sequences of data types

  • Could introduce a higher-level notion of a distribution like in Boost, which can involve densities and RNGs bundled; could apply to prior specification or as a way to describe posterior class (sample plus density plus derivatives)

  • How to store all the meta-data like configuration, dates, etc.? Probably just a dictionary that is ideally usable as input.

Models

log prior plus log likelihood is equal to log density plus a constant if all are specified

  • float log_density(vector), vector grad_log_density(vector), and matrix hessian_log_density(vector)

  • float log_prior(vector), vector grad_log_prior(vector), and matrix hessian_log_prior(vector)

  • float log_likelihood(vector), vector grad_log_likelihood(vector), and matrix hessian_log_likelihood(vector)

  • vector prior_sample()
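
For concreteness, a sketch of how this model interface could be written as a Python Protocol (purely illustrative; the names mirror the bullets above, with NumPy arrays standing in for vector and matrix):

from typing import Protocol

import numpy as np
from numpy.typing import NDArray

Vector = NDArray[np.float64]
Matrix = NDArray[np.float64]

class BayesianModel(Protocol):
    # joint log density (prior plus likelihood, up to a constant) and derivatives
    def log_density(self, theta: Vector) -> float: ...
    def grad_log_density(self, theta: Vector) -> Vector: ...
    def hessian_log_density(self, theta: Vector) -> Matrix: ...

    # prior and derivatives
    def log_prior(self, theta: Vector) -> float: ...
    def grad_log_prior(self, theta: Vector) -> Vector: ...
    def hessian_log_prior(self, theta: Vector) -> Matrix: ...

    # likelihood and derivatives
    def log_likelihood(self, theta: Vector) -> float: ...
    def grad_log_likelihood(self, theta: Vector) -> Vector: ...
    def hessian_log_likelihood(self, theta: Vector) -> Matrix: ...

    # draw a parameter vector from the prior
    def prior_sample(self) -> Vector: ...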

Monte Carlo samplers

  • vector sample()

  • array(vector) ensemble_sample()

  • (vector, weight) importance_sample()

  • sampler sampling_importance_resampler(importance_sampler)
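
In the same spirit, a sketch of the sampler interfaces listed above (again illustrative only):

from typing import Protocol

import numpy as np
from numpy.typing import NDArray

Vector = NDArray[np.float64]

class Sampler(Protocol):
    def sample(self) -> Vector: ...

class EnsembleSampler(Protocol):
    def ensemble_sample(self) -> list[Vector]: ...

class ImportanceSampler(Protocol):
    def importance_sample(self) -> tuple[Vector, float]: ...  # (draw, weight)

def sampling_importance_resampler(importance_sampler: ImportanceSampler) -> Sampler:
    """Wrap an importance sampler as a plain sampler via resampling (stub)."""
    ...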

Variational approximation

approximate samplers add an approximate density, effectively giving an importance sampler

  • importance_sampler approximate_sampler()
  • dictionary variational_fit()

Laplace approximation

approximate samplers add an approximate density, effectively giving an importance sampler

  • importance_sampler laplace_fit()
  • sampler approximate_sampler()

Control variates

Tricky because they add an expectation-specific shadow value that is averaged along with draws to compute the estimate of the expectation.

  • vector control_variate(vector draws, array(vector) gradients)

Top-level controllers

Allow user to specify:

  • total time to spend on inference
  • target ESS size (e.g., implement by iterative deepening)
  • number of iterations

How to handle adaptation for adaptive samplers? They are just stateful in perhaps a discrete way like Stan Phase II warmup.

How to monitor convergence of adaptation to stop adaptation and start sampling?

Posterior analysis

all algorithms work with ragged data structures

  • R-hat (basic, split, rank, mini group nR-hat)
  • ESS (basic, multi-chain R-hat based, different estimators?)
    • head vs. tail
  • sample mean, standard-deviation
  • standard error (if standard-dev and ESS available)
  • quantiles (median, central interval, arbitrary)
  • log density output
  • other diagnostics (e.g., tree depth, divergence, etc.)

Harmonize leapfrog implementations between HMC and DRGHMC

During review of PR #39, a request was made to change the leapfrog implementation in that PR (see resolved issues in highlighted section). The end result was subtly different:

# pos = position
# mo = momentum
# grad = gradient
# stepsize, stepcount, and get_grad are assumed to be in scope

def leapfrog(pos, mo):
    # initial half-step for the momentum, then a full step for the position
    grad = get_grad(pos)
    mo_mid = mo + (0.5 * stepsize) * grad
    pos += stepsize * mo_mid

    # remaining full steps; skipped entirely when stepcount == 1
    for i in range(stepcount - 1):
        grad = get_grad(pos)
        mo_mid += stepsize * grad
        pos += stepsize * mo_mid

    # final half-step for the momentum
    grad = get_grad(pos)
    mo = mo_mid + (0.5 * stepsize) * grad
    return (pos, mo)

Specifically, the steps in the stepcount loop are reordered, and the loop iterates one fewer time. This is achieved by moving half a step forward, rather than backward, before entering the loop (and skipping the loop entirely in the event only a single step is taken).

The original implementation in that PR was an exact copy of the leapfrog implementation in hmc.py which was introduced in PR #30, so presumably any updates to the leapfrog implementation in #39 should be applied to the code touched in #30 as well.

Opening this issue to remind us to update the HMC implementation to match the finalized version in DRGHMC.

minor comments on ess_imse

bayes-kit/bayes_kit/ess.py

Lines 99 to 135 in 8f8c494

def ess_imse(chain: VectorType) -> FloatType:
    """
    Return an estimate of the effective sample size (ESS) of the specified Markov chain
    using the initial monotone sequence estimator (IMSE). This is the most accurate
    of the available ESS estimators. Because of the convex minorization used,
    this approach is slower than using the IPSE function `ess_ipse`.

    This estimator was introduced in the following paper.

    Geyer, C.J., 1992. Practical Markov chain Monte Carlo. Statistical Science
    7(4):473--483.

    Parameters:
        chain: Markov chain whose ESS is returned
    Return:
        estimated effective sample size for the specified Markov chain
    Throws:
        ValueError: if there are fewer than 4 elements in the chain
    """
    if len(chain) < 4:
        raise ValueError(f"ess requires len(chains) >=4, but {len(chain) = }")
    acor = autocorr(chain)
    n = first_neg_pair_start(acor)
    prev_min = acor[1] + acor[2]
    # convex minorization uses slow loop
    accum = prev_min
    i = 3
    while i + 1 < n:
        minprev = min(prev_min, acor[i] + acor[i + 1])
        accum = accum + minprev
        i = i + 2
    # end diff code
    sigma_sq_hat = acor[0] + 2 * accum
    ess = len(chain) / sigma_sq_hat
    return ess

I have a couple comments about this function. These are minor so feel free to ignore!

First, I think isotonic regression is a better choice than this scheme for approximating the decay by a monotonically decreasing sequence. This scheme will tend to underestimate the area under the curve. Take the following example: 1, 0.3, 0.6, 0

(Ignoring the pairing part of this) The present scheme would approximate it by 1, 0.3, 0.3, 0, whereas isotonic regression would give 1, 0.45, 0.45, 0. The integral of the former is 1.6, whereas the latter is 1.9.

My second comment is that when n is even, the (n-1)th sample of acor is neglected.
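
For what it's worth, a quick sketch of the isotonic-regression alternative on the toy sequence above, using scikit-learn's IsotonicRegression (an extra dependency, so this is just an illustration):

import numpy as np
from sklearn.isotonic import IsotonicRegression

acor = np.array([1.0, 0.3, 0.6, 0.0])  # toy autocorrelation sequence from above

# Best monotonically decreasing approximation (pool-adjacent-violators algorithm).
lags = np.arange(len(acor))
acor_mono = IsotonicRegression(increasing=False).fit_transform(lags, acor)

print(acor_mono)        # [1.0, 0.45, 0.45, 0.0]
print(acor_mono.sum())  # 1.9, versus 1.6 for the pairwise-minimum scheme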
