bayes-kit's People

Contributors

bob-carpenter, gil2rok, jsoules, wardbrian, wrongu

bayes-kit's Issues

Testing against known models, use mcse instead of absolute tolerance

Consider the scenario of testing a sampler against a model where we know the expectations (e.g. any Normal model). The Metropolis sampler has tests along the lines of

model = ...
# set up 
draws = np.array([metropolis.sample()[0] for _ in range(M)])
mean = draws.mean(axis=0)
var = draws.var(axis=0, ddof=1)

np.testing.assert_allclose(mean, model.posterior_mean(), atol=0.1)
np.testing.assert_allclose(var, model.posterior_variance(), atol=0.1)

These are tests with absolute tolerances. Absolute tolerance tests can require a large number of iterations, awkwardly chosen tolerance levels, or both.

I think it makes more sense to evaluate a sampler's draws using a Monte Carlo standard error; see mcse_mean(x) $= sd(x) / \sqrt{ess\_mean(x)}$ and mcse_std(x) from the stan-dev package posterior. Such Monte Carlo standard error tests would look something like

model = ...
# setup
draws = np.array([metropolis.sample()[0] for _ in range(M)])
mean = draws.mean(axis=0)
se_mean = mcse_mean(draws, axis = 0) # needs implementation
std = draws.std(axis=0, ddof=1)
se_std = mcse_std(draws, axis = 0) # needs implementation

z = 5  # test at 5 Monte Carlo standard errors
assert np.all(np.abs((mean - model.posterior_mean()) / se_mean) < z)
assert np.all(np.abs((std - model.posterior_std()) / se_std) < z)

Such tests more naturally incorporate sampler efficiency into testing and validation, and should hopefully remove some of the awkwardly chosen tolerance values.
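
For reference, here is a minimal sketch of what mcse_mean and mcse_std could look like for a single scalar parameter, reusing the ess_imse estimator from bayes_kit/ess.py (axis handling is omitted, and the mcse_std formula is a rough delta-method approximation rather than the exact formula used by posterior):

import numpy as np
from bayes_kit.ess import ess_imse  # ESS estimator already in bayes_kit/ess.py

def mcse_mean(draws: np.ndarray) -> float:
    """Monte Carlo standard error of the mean: sd(x) / sqrt(ess_mean(x))."""
    draws = np.asarray(draws)
    return draws.std(ddof=1) / np.sqrt(ess_imse(draws))

def mcse_std(draws: np.ndarray) -> float:
    """Rough MCSE of the standard deviation via the delta method:
    se(s) ~= se(s^2) / (2 * s), with se(s^2) computed as for the mean but
    on the squared deviations."""
    draws = np.asarray(draws)
    s = draws.std(ddof=1)
    dev2 = (draws - draws.mean()) ** 2
    return dev2.std(ddof=1) / np.sqrt(ess_imse(dev2)) / (2 * s)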

For samplers, check that initialization values for parameters match the size/dimensions of the input model

The goal of this issue is to implement a more robust fix for the issue exposed through PR #28.

Many samplers in this package can accept a set of explicit initial parameter values (constructor parameter init). Currently, there is no check that the supplied initial values for the model parameters actually have the right size/shape, leading to possible undefined behavior.

Most likely the fix is going to be something like modifying the constructors of the affected samplers to throw an error if init.shape != model.dims(), but there are a few issues to discuss:

  1. Do we have a clear design decision that model dimensions are always linear?
  2. In the event that the init parameter has the right overall length but a different shape from _theta, do we want to reshape or throw?
  3. Can/should we generalize this check to avoid code repetition among the different affected models, either through a Protocol with a partial implementation, or through a shared utility function, etc? We've talked about inheritance vis-a-vis samplers, but haven't really implemented anything like it yet.

I expect the overall code changes to be minimal (unless we get really ambitious with project structure), but I'd like to have a record of the discussion in case we need to refer back to it.
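
For the record, a minimal sketch of what a shared check could look like (validate_init is a hypothetical helper name, and reshaping rather than throwing is just one possible answer to question 2 above):

import numpy as np
from numpy.typing import NDArray

def validate_init(init: NDArray[np.float64], dims: int) -> NDArray[np.float64]:
    """Check user-supplied initial values against the model's dimensionality.

    Raises ValueError if the number of elements is wrong; flattens to shape
    (dims,) if only the shape differs.
    """
    init = np.asarray(init, dtype=np.float64)
    if init.size != dims:
        raise ValueError(
            f"init must have model.dims() = {dims} elements, but {init.size = }"
        )
    return init.reshape(dims)

A sampler constructor could then call something like self._theta = validate_init(init, self._model.dims()) when init is provided.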

Testing instability in beta-binomial test model

Several of our algorithms seem to fail sporadically on this model and report an incorrect mean. The incorrect value is always roughly the same, 0.38 for the mean.

Running this locally, both smc and mala often seem to converge to a number around 0.35, which makes me wonder two things:
First, is our analytic calculation of the mean correct? We currently have the value as 7/25 = 0.28.

Second, is our testing tolerance too high? (almost certainly)

Failures

tempered_smc:
https://github.com/flatironinstitute/bayes-kit/actions/runs/4008341002/jobs/6882368677
https://github.com/flatironinstitute/bayes-kit/actions/runs/4008269276/jobs/6882211668
https://github.com/flatironinstitute/bayes-kit/actions/runs/3971370594/jobs/6808183840
https://github.com/flatironinstitute/bayes-kit/actions/runs/3971574251/jobs/6808622237
https://github.com/flatironinstitute/bayes-kit/actions/runs/4000634814/jobs/6865992560

mala:
https://github.com/flatironinstitute/bayes-kit/actions/runs/3971574251/jobs/6808622046

Improve Gradient Caching in DRGHMC Sampler

There may be a better way to cache log density and gradient calls in the drghmc sampler.

At the moment, a stack is used as a cache when we recursively compute the log acceptance probability of a proposed draw. The cache is primarily used in the sample() and accept() functions, when they each call the proposal_map() function.

However, @WardBrian ingeniously pointed out that the recursive calls already create their own stack. Thus it appears to be possible to use two variables as a cache in the sample() function: one variable for the current draw and one variable for the proposed draw. The cache in the accept() function can then be replaced by adding a single input argument to the accept() function call and returning the appropriate cached value; this approach would implicitly use the recursive function call stack instead of manually implementing a cache stack.

This issue is intended to serve as a reminder to include this feature after the pull request #39 is accepted.

partial momentum refresh MALA

Add an implementation of partial momentum refresh (aka underdamped) MALA. The idea is to only partially refresh the momentum each iteration. This method was introduced in

I found Radford Neal's description to be the clearest (page 5, displayed equations in middle), where the momentum (which I'm calling rho to match our other notation) is updated for mixture rate alpha in (0, 1) as

z ~ normal(0, I)
rho[n] = alpha * rho[n-1] + sqrt(1 - alpha^2) * z

With alpha = 0, we get back standard MALA and results should match. As alpha increases, the fraction refreshed decreases and induces more persistence of momentum. With alpha = 1, we get no refresh and the result is a single, pure Hamiltonian trajectory that continues each iteration (never leaving the same level set of the Hamiltonian other than for arithmetic/discretization error).
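
A minimal sketch of the refresh step, assuming rho is a NumPy array and rng is a numpy.random.Generator:

import numpy as np

def partial_momentum_refresh(rho: np.ndarray, alpha: float, rng: np.random.Generator) -> np.ndarray:
    """Partially refresh the momentum: rho <- alpha * rho + sqrt(1 - alpha^2) * z.

    alpha = 0 gives a full refresh (standard MALA); alpha = 1 leaves the
    momentum unchanged (a single continuing Hamiltonian trajectory).
    """
    z = rng.normal(size=rho.shape)
    return alpha * rho + np.sqrt(1 - alpha**2) * z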

normalize acor[0] to be 1?

bayes-kit/bayes_kit/ess.py

Lines 120 to 135 in 8f8c494

if len(chain) < 4:
    raise ValueError(f"ess requires len(chains) >=4, but {len(chain) = }")
acor = autocorr(chain)
n = first_neg_pair_start(acor)
prev_min = acor[1] + acor[2]
# convex minorization uses slow loop
accum = prev_min
i = 3
while i + 1 < n:
    minprev = min(prev_min, acor[i] + acor[i + 1])
    accum = accum + minprev
    i = i + 2
# end diff code
sigma_sq_hat = acor[0] + 2 * accum
ess = len(chain) / sigma_sq_hat
return ess

I might be missing something here, but I think you need to normalize acor so that acor[0] is 1.

in first_neg_pair_start, use n = n + 1

def first_neg_pair_start(chain: VectorType) -> IntType:
    """
    Return the index of first element of the sequence whose sum with the following
    element is negative, or the length of the sequence if there is no such element.

    Parameters:
        chain: input sequence
    Return:
        index of first element whose sum with following element is negative, or
        the number of elements if there is no such element
    """
    N = len(chain)
    n = 0
    while n + 1 < N:
        if chain[n] + chain[n + 1] < 0:
            return n
        n = n + 2
    return N

I think line 74 should be n = n + 1 instead of n = n + 2 to be consistent with the description of the function. Also, with n = n + 2 I confirmed that you can get negative contributions to the area under the ACF.
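
With that change, the loop would read (everything else unchanged):

    N = len(chain)
    n = 0
    while n + 1 < N:
        if chain[n] + chain[n + 1] < 0:
            return n
        n = n + 1  # step one element at a time, matching the docstring
    return N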

Handle user-provided log densities that throw exceptions

This came up in a discussion with @gil2rok during #39, but it applies to all of our algorithms. The user-provided log_density_gradient function may throw an exception (we may even expect it to, e.g. if the parameters are out of support).

Currently, if this happens, the exception is propagated all the way up out of the algorithm and into the caller's code. This is probably not what we want to do. Instead, we could do what Stan does and treat any exception in the log density calculation as resulting in a rejection (assuming the algorithm contains an accept-reject step).
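
As a rough sketch of what that could look like at the point where a proposal is evaluated (a generic symmetric-proposal Metropolis step with hypothetical names, not a particular sampler in this repo):

import numpy as np

def metropolis_accept_safe(model, theta_prop, theta_cur, logp_cur, rng):
    """Accept/reject step that treats exceptions raised by the user-supplied
    log density as rejections rather than propagating them to the caller."""
    try:
        logp_prop = model.log_density(theta_prop)
    except Exception:
        logp_prop = -np.inf  # e.g. parameters out of support -> automatic rejection
    if np.log(rng.uniform()) < logp_prop - logp_cur:
        return theta_prop, logp_prop
    return theta_cur, logp_cur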

Algorithms fail on surprisingly many posteriordb problems

I created a local test suite that tries to run bayes_kit.HMCDiag(..., stepsize=0.01, steps=50) and bayes_kit.MALA(epsilon=0.01) on a suite of problems from posteriordb. I am just testing that outputs are not NaN and that no RuntimeErrors are encountered. I ran a total of 92 tests: the two algorithms above times 46 posteriors from posteriordb. 48/92 failed and 44/92 passed.

This is not an issue with BayesKit per se. HMC and MALA can surely be made to work with smarter initialization and tuning. But this does call attention to the fact that it would be nice to support smarter initialization/warmup procedures here.

Also, I'm curious if it would be useful to contribute my posteriordb integration (including these tests) to this repository directly. If so, I can create a separate PR for that.

Clarifying Draw Terminology

Per my discussion with @WardBrian, in HMC a draw can refer to two distinct concepts:

  1. The position variable theta, which denotes a draw from the target distribution
  2. The tuple (theta, rho) that contains the position and momentum variables, respectively, denoting a draw from the joint distribution.

In Bayes-Kit, the data type DrawAndLogP suggests definition 1 when it is defined here. However, comments from @bob-carpenter in my pull request here suggest definition 2. Additionally, Stan documentation uses the phrase "draw" in HMC to refer to definition 1 here.

My code for the drghmc sampler uses the DrawAndLogP data type, suggesting definition 1, but its documentation uses a draw to refer to definition 2. I worry this may be confusing.

Lastly, if Bayes-Kit is intended as a pedagogical resource, it may be especially worthwhile to clarify this confusion. Would love to hear others' thoughts on this!

autocorr_fft - truncating to be a power of 2?

def autocorr_fft(chain: VectorType) -> VectorType:
    """
    Return sample autocorrelations at all lags for the specified sequence.
    Algorithmically, this function calls a fast Fourier transform (FFT).

    Parameters:
        chain: sequence whose autocorrelation is returned
    Returns:
        autocorrelation estimates at all lags for the specified sequence
    """
    size = 2 ** np.ceil(np.log2(2 * len(chain) - 1)).astype("int")
    var = np.var(chain)
    ndata = chain - np.mean(chain)
    fft = np.fft.fft(ndata, size)
    pwr = np.abs(fft) ** 2
    N = len(ndata)
    acorr = np.fft.ifft(pwr).real / var / N
    return acorr

It seems that the FFT length is being forced to a power of 2. When transforming the power spectrum back, this has the effect of sampling on a different grid than the original data. This can be solved by omitting the size parameter and just doing the full FFT. You may also consider keeping only the first half of the output array.

small bug in ess_imse

bayes-kit/bayes_kit/ess.py

Lines 124 to 130 in 8f8c494

prev_min = acor[1] + acor[2]
# convex minorization uses slow loop
accum = prev_min
i = 3
while i + 1 < n:
    minprev = min(prev_min, acor[i] + acor[i + 1])
    accum = accum + minprev

I think minprev should really be prev_min: as written, prev_min is never updated inside the loop, so the running minimum is not actually carried forward.
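
That is, the loop body would become:

    while i + 1 < n:
        prev_min = min(prev_min, acor[i] + acor[i + 1])
        accum = accum + prev_min
        i = i + 2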

HMC Tests Are Failing

Overview

Most of the tests in test_hmc.py fail, likely as a result of the recent refactoring of hmc.py. To reproduce, simply call the functions in test_hmc.py from the base directory of the bayes-kit repo.

I noticed that there are two distinct errors causing these tests to fail.

List Comprehension Error

The first issue occurs when using list comprehension to sample from the hmc object:

draws = np.array([hmc.sample()[0] for _ in range(M)])

This line of code is used throughout test_hmc.py and is supposed to create an array of HMC samples. Instead, this code creates an array containing the last hmc.sample() element repeated M times. My hunch is that this occurs because of late-binding in Python lists, as explained here.

To resolve this issue on my machine, I have simply removed the list comprehension:

draws = np.zeros(M)
for i in range(M):
    draws[i] = hmc.sample()[0]

Statistical Error

After fixing the list comprehension error, the test_hmc_diag_rep() and test_hmc_binom() tests still fail. These errors are harder to diagnose and may arise from incorrect HMC implementation, improper HMC parameters, etc.

API class design

Design considerations

Goal: organize classes for inference and posterior analysis API

  • Continuous only, or also discrete parameters? (allow discrete data and generated quantities)

  • Two ways to operate (easily interconvertible)

    • batch: take a batch of draws of a given size
    • online: take a single draw
  • How to deal with transforms and generated quantities like in Stan? They generate new random variables as functions of others, which get their own expectations, control variates, etc.

  • Densities just functions over vectors rather than over arbitrary sequences of data types

  • Could introduce a higher-level notion of a distribution like in Boost, which can involve densities and RNGs bundled; could apply to prior specification or as a way to describe posterior class (sample plus density plus derivatives)

  • How to store all the meta-data like configuration, dates, etc.? Probably just a dictionary that is ideally usable as input.

Models

log prior plus log likelihood is equal to log density plus a constant if all are specified

  • float log_density(vector), vector grad_log_density(vector), and matrix hessian_log_density(vector)

  • float log_prior(vector), vector grad_log_prior(vector), and matrix hessian_log_prior(vector)

  • float log_likelihood(vector), vector grad_log_likelihood(vector), and matrix hessian_log_likelihood(vector)

  • vector prior_sample()
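
For concreteness, a sketch of how this model interface could be written as a Python Protocol (purely illustrative; the names mirror the bullets above, with NumPy arrays standing in for vector and matrix):

from typing import Protocol

import numpy as np
from numpy.typing import NDArray

Vector = NDArray[np.float64]
Matrix = NDArray[np.float64]

class BayesianModel(Protocol):
    # joint log density (prior plus likelihood, up to a constant) and derivatives
    def log_density(self, theta: Vector) -> float: ...
    def grad_log_density(self, theta: Vector) -> Vector: ...
    def hessian_log_density(self, theta: Vector) -> Matrix: ...

    # prior and derivatives
    def log_prior(self, theta: Vector) -> float: ...
    def grad_log_prior(self, theta: Vector) -> Vector: ...
    def hessian_log_prior(self, theta: Vector) -> Matrix: ...

    # likelihood and derivatives
    def log_likelihood(self, theta: Vector) -> float: ...
    def grad_log_likelihood(self, theta: Vector) -> Vector: ...
    def hessian_log_likelihood(self, theta: Vector) -> Matrix: ...

    # draw a parameter vector from the prior
    def prior_sample(self) -> Vector: ...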

Monte Carlo samplers

  • vector sample()

  • array(vector) ensemble_sample()

  • (vector, weight) importance_sample()

  • sampler sampling_importance_resampler(importance_sampler)
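
In the same spirit, a sketch of the sampler interfaces listed above (again illustrative only):

from typing import Protocol

import numpy as np
from numpy.typing import NDArray

Vector = NDArray[np.float64]

class Sampler(Protocol):
    def sample(self) -> Vector: ...

class EnsembleSampler(Protocol):
    def ensemble_sample(self) -> list[Vector]: ...

class ImportanceSampler(Protocol):
    def importance_sample(self) -> tuple[Vector, float]: ...  # (draw, weight)

def sampling_importance_resampler(importance_sampler: ImportanceSampler) -> Sampler:
    """Wrap an importance sampler as a plain sampler via resampling (stub)."""
    ...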

Variational approximation

approximate samplers add an approximate density, effectively giving an importance sampler

  • importance_sampler approximate_sampler()
  • dictionary variational_fit()

Laplace approximation

approximate samplers add an approximate density, effectively giving an importance sampler

  • importance_sampler laplace_fit()
  • sampler approximate_sampler()

Control variates

Tricky because they add an expectation-specific shadow value that is averaged along with draws to compute the estimate of the expectation.

  • vector control_variate(vector draws, array(vector) gradients)

Top-level controllers

Allow user to specify:

  • total time to spend on inference
  • target ESS size (e.g., implement by iterative deepening)
  • number of iterations

How to handle adaptation for adaptive samplers? They are just stateful in perhaps a discrete way like Stan Phase II warmup.

How to monitor convergence of adaptation to stop adaptation and start sampling?

Posterior analysis

all algorithms work with ragged data structures

  • R-hat (basic, split, rank, mini group nR-hat)
  • ESS (basic, multi-chain R-hat based, different estimators?)
    • head vs. tail
  • sample mean, standard-deviation
  • standard error (if standard-dev and ESS available)
  • quantiles (median, central interval, arbitrary)
  • log density output
  • other diagnostics (e.g., tree depth, divergence, etc.)

Harmonize leapfrog implementations between HMC and DRGHMC

During review of PR #39, a request was made to change the leapfrog implementation in that PR (see resolved issues in highlighted section). The end result was subtly different:

# pos = position
# mo = momentum
# grad = gradient
# stepsize, stepcount, and get_grad are assumed to be in scope

def leapfrog(pos, mo):
    # initial half-step for the momentum, then a full step for the position
    grad = get_grad(pos)
    mo_mid = mo + (0.5 * stepsize) * grad
    pos += stepsize * mo_mid

    # remaining full steps; skipped entirely when stepcount == 1
    for i in range(stepcount - 1):
        grad = get_grad(pos)
        mo_mid += stepsize * grad
        pos += stepsize * mo_mid

    # final half-step for the momentum
    grad = get_grad(pos)
    mo = mo_mid + (0.5 * stepsize) * grad
    return (pos, mo)

Specifically, the steps in the stepcount loop are reordered, and the loop iterates one fewer time. This is achieved by moving half a step forward, rather than backward, before entering the loop (and skipping the loop entirely in the event only a single step is taken).

The original implementation in that PR was an exact copy of the leapfrog implementation in hmc.py which was introduced in PR #30, so presumably any updates to the leapfrog implementation in #39 should be applied to the code touched in #30 as well.

Opening this issue to remind us to update the HMC implementation to match the finalized version in DRGHMC.

minor comments on ess_imse

bayes-kit/bayes_kit/ess.py

Lines 99 to 135 in 8f8c494

def ess_imse(chain: VectorType) -> FloatType:
    """
    Return an estimate of the effective sample size (ESS) of the specified Markov chain
    using the initial monotone sequence estimator (IMSE). This is the most accurate
    of the available ESS estimators. Because of the convex minorization used,
    this approach is slower than using the IPSE function `ess_ipse`.

    This estimator was introduced in the following paper.

    Geyer, C.J., 1992. Practical Markov chain Monte Carlo. Statistical Science
    7(4):473--483.

    Parameters:
        chain: Markov chain whose ESS is returned
    Return:
        estimated effective sample size for the specified Markov chain
    Throws:
        ValueError: if there are fewer than 4 elements in the chain
    """
    if len(chain) < 4:
        raise ValueError(f"ess requires len(chains) >=4, but {len(chain) = }")
    acor = autocorr(chain)
    n = first_neg_pair_start(acor)
    prev_min = acor[1] + acor[2]
    # convex minorization uses slow loop
    accum = prev_min
    i = 3
    while i + 1 < n:
        minprev = min(prev_min, acor[i] + acor[i + 1])
        accum = accum + minprev
        i = i + 2
    # end diff code
    sigma_sq_hat = acor[0] + 2 * accum
    ess = len(chain) / sigma_sq_hat
    return ess

I have a couple comments about this function. These are minor so feel free to ignore!

First, I think isotonic regression is a better choice than this scheme for approximating the decay by a monotonically decreasing sequence. This scheme will tend to underestimate the area under the curve. Take the following example: 1, 0.3, 0.6, 0

(Ignoring the pairing part of this) The present scheme would approximate it by 1, 0.3, 0.3, 0, whereas isotonic regression would give 1, 0.45, 0.45, 0. The integral of the former is 1.6, whereas the latter is 1.9.

My second comment is that when n is even, the (n-1)th sample of acor is neglected.
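
For what it's worth, a quick sketch of the isotonic-regression alternative on the toy sequence above, using scikit-learn's IsotonicRegression (an extra dependency, so this is just an illustration):

import numpy as np
from sklearn.isotonic import IsotonicRegression

acor = np.array([1.0, 0.3, 0.6, 0.0])  # toy autocorrelation sequence from above

# Best monotonically decreasing approximation (pool-adjacent-violators algorithm).
lags = np.arange(len(acor))
acor_mono = IsotonicRegression(increasing=False).fit_transform(lags, acor)

print(acor_mono)        # [1.0, 0.45, 0.45, 0.0]
print(acor_mono.sum())  # 1.9, versus 1.6 for the pairwise-minimum scheme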
