stan-dev / loo

loo R package for approximate leave-one-out cross-validation (LOO-CV) and Pareto smoothed importance sampling (PSIS)

Home Page: https://mc-stan.org/loo

License: Other

R 95.29% TeX 4.71%
cross-validation stan r-package bayesian-data-analysis information-criterion bayesian-methods model-comparison bayesian bayesian-inference bayesian-statistics

loo's People

Contributors

avehtari, bnicenboim, cfhammill, ecmerkle, fweber144, jgabry, jrnold, krz, leevilindgren, mansmeg, mcol, paul-buerkner, rok-cesnovar, topipa, yannmclatchie, yao-yl


loo's Issues

Averaging predictive distributions

This is related to #82 but should be its own issue.

Currently we rely on users to use the model weights to create the appropriate mixture of predictive distributions. We should recommend (and demonstrate in a vignette with rstan and loo, and automate in rstanarm and brms) a method for doing this.

There are various options for doing it.
Today @avehtari and I discussed this and we are leaning towards taking (approximately) S*weight_k draws from each posterior predictive distribution, where there are K models and S is the desired sample size. Compared to sampling from each posterior predictive distribution with probability weight_k, this behaves better when some weights are small, and it is easier to implement than a stratified version (maybe an option at some point?).
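
A minimal sketch of that allocation, assuming yrep_list is a list of posterior predictive draw matrices (one per model, draws in rows) and w is the weight vector; the names and details are illustrative only:

# Hypothetical sketch: take about round(S * w_k) draws from the k-th posterior
# predictive distribution instead of sampling each draw's model with probability w_k.
mix_predictive <- function(yrep_list, w, S = 4000) {
  n_k <- round(S * w)  # roughly S * weight_k draws from model k
  draws <- Map(function(yrep, n) {
    yrep[sample(nrow(yrep), n, replace = TRUE), , drop = FALSE]
  }, yrep_list, n_k)
  do.call(rbind, draws)  # approximately S x N matrix of mixed predictive draws
}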

Extremely large k?

Currently I am using PSIS diagnostics to evaluate ADVI performance in all Stan examples. Here is one example in which the log density ratio between the target and the proposal is extremely heavy-tailed (if you are curious, it is from /ARM/Ch.18/weight_censored.stan, though that is unrelated). The input log_ratio is a vector of length 3000 without any NAs, but I get an error when I use psis() to estimate k:

> library(loo)
> load("log_ratio.RData")  ## Attached file.
> psis(log_ratios = log_ratio)  ## log_ratio is a vector of log ratios.
Error in if (any(k > too_high)) { : missing value where TRUE/FALSE needed

The log_ratio Rdata is attached at https://drive.google.com/open?id=1SdZB36hqkG8ftfh5L_efyehiVAGDnLWF
It seems the k-hat estimate is so large that it becomes NA. This error appears in both the released version (2.0) and the development version.

If we check the density of this log ratio, it is indeed heavy right-tailed. I tried using a subset of log_ratio, say log_ratio[1:300], and then I could get a finite k-hat = 60. If I increase that sample size, the estimated k-hat also increases and eventually blows up, which I believe is why this error occurs.

I can also make everything run by manually deleting the single largest log ratio.

From a practical perspective, I guess it is enough to add something (e.g., a pre-truncation of the log ratios, or an if statement that handles NA k-hat estimates) that prevents k-hat from blowing up, since k > 60 is already large enough for users to stop immediately. Furthermore, I am curious how we could solve this theoretically: we claim the generalized Pareto distribution can approximate any one-dimensional tail distribution well, so why do we get this "non-convergence" (k-hat -> infinity)?
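
To illustrate the if-statement idea (this is only a sketch, not the package's actual check; the 0.7 threshold is also just an example), an NA-safe comparison would avoid the error:

# Sketch: treat a non-finite khat as "too high" instead of letting
# `if (any(k > too_high))` receive an NA.
k_too_high <- function(k, too_high = 0.7) {
  !is.finite(k) | k > too_high
}

k <- c(0.3, 0.9, NA)
k_too_high(k)  # FALSE TRUE TRUE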

p_waic warnings in hierarchical models

As I understand it, the warning p_waic > 0.4 is there to highlight when the model is over-parameterized, in the sense of approaching one degree of freedom per data point.

My question is whether this is still a concern when rows in the log_lik matrix each correspond to many data points (e.g. if the information criterion is "focused" on predicting new blocks rather than new data points).

If my understanding above is correct, then it seems like it wouldn't be a problem, but I was hoping for confirmation. I checked Vehtari, Gelman, and Gabry 2016 as well as Gelman, Hwang and Vehtari 2013 and didn't find any specific discussion of this issue, so I just wanted to make sure I wasn't missing anything. Thanks so much.

compare should produce Akaike weights

Richard's compare function produces posterior probabilities that each model in the list has the best expected out-of-sample predictions. These are useful for weighted posterior predictions.

Problem installing loo from CRAN and GitHub

Here is the error:

Error in namespaceExport(ns, exports) :
undefined exports: allValue, allocArray, allocMatrix, allocVector, anyMissing, anyValue, binCounts, binMeans, colAlls, colAnyMissings, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins, colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, count, diff2, indexByRow, iqr, iqrDiff, logSumExp, madDiff, mean2, meanOver, product, rowAlls, rowAnyMissings, rowAnyNAs, rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads, rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars, sdDiff, signTabulate, sum2, sumOver, t_tx_OP_y, validateIndices, varDiff, weightedMad, weightedMean, weightedMedian, weightedSd, weightedVar, x_OP_y
ERROR: lazy loading failed for package ‘loo’

In vignette the first model is not using dist100

When preparing the data, the rescaled distance is defined:
wells$dist100 <- wells$dist / 100 # rescale
but the first model does not use it:
x <- scale(model.matrix(~ 0 + dist + arsenic, wells))
For the second model the comment says the change is log(arsenic), but the rescaled dist100 is now also used:
# First run a second model using log(arsenic) instead of arsenic
data$x <- scale(model.matrix(~ 0 + dist100 + log(arsenic), wells))

Change the first model to also use the rescaled dist100.

make a plot() method for a loo object

The plot produced by print.loo() with plot_k = TRUE is really good, but calling it that way is awkward. So you should make a plot.loo method that produces that plot, and if print.loo is called with plot_k = TRUE it should call plot.loo().

Also, for values greater than 0.5, the plot should print the observation number next to the point.
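
A rough sketch of what such a method might look like (the pareto_k field and the thresholds here are assumptions, not the package's implementation):

# Illustrative sketch of a plot.loo S3 method (pareto_k field assumed).
plot.loo <- function(x, label_threshold = 0.5, ...) {
  k <- x$pareto_k
  plot(k, xlab = "Observation", ylab = "Pareto shape k", pch = 20, ...)
  abline(h = c(0.5, 1), lty = 2, col = "darkgrey")
  high <- which(k > label_threshold)
  if (length(high)) text(high, k[high], labels = high, pos = 3, cex = 0.7)
  invisible(x)
}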

Add example of PPC to vignette

The current vignette demonstrates model comparison, but your paper (http://arxiv.org/abs/1507.04544) suggests that you can use LOO/PSIS-LOO for posterior predictive checks as well. Would one simply be looking at the distribution of the pointwise LOO values and comparing this distribution to some expected standard? Or maybe one would need to add code in generated quantities that samples new observations given the model and creates a log_lik2 for these simulated samples, permitting you to loo::compare(loo::loo(log_lik),loo::loo(log_lik2))?

model_weights and model_select should not print inside the function

If special printing is desired, such as the current

The stacking weights are:
Model 1 Model 2 
  xxx   xxx 

I suggest giving the output of model_weights and model_select a class and defining a print method for these classes. This is cleaner than printing automatically inside the function.
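
A minimal sketch of that pattern (class and constructor names are hypothetical):

# Hypothetical sketch: return a classed vector and print it via a method.
new_model_weights <- function(weights, model_names) {
  structure(setNames(weights, model_names), class = "model_weights")
}

print.model_weights <- function(x, digits = 3, ...) {
  cat("The stacking weights are:\n")
  print.default(round(unclass(x), digits), ...)
  invisible(x)
}

w <- new_model_weights(c(0.7, 0.3), c("Model 1", "Model 2"))
w  # auto-prints via print.model_weights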

k-fold-cv helper functions

Adapt helper functions from here to deal with k-fold CV. (See also #25).

The user should add to the data block a vector that indicates which observations go to training (trainidx).

And the models should look like this:

model {
...
for (n in trainidx) 
      target += model_lpdf[n];
}

and some models could be written simply as

model {
...
y[trainidx] ~ normal(mu, sigma);
}

And the generated quantities block has a vector[N_obs] log_lik. My helper functions extract the relevant log_lik values and calculate the elpd the same way it is done with loo.
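
A rough sketch of such a helper, assuming one fitted stanfit per fold and a matching list of held-out indices (names and structure are illustrative, not a proposed API):

# Illustrative helper: pool held-out log_lik columns across folds and compute
# elpd as loo does (log of the mean likelihood per observation).
kfold_elpd <- function(fits, holdout_idx, N_obs) {
  pointwise <- numeric(N_obs)
  for (k in seq_along(fits)) {
    # generated quantities gives log_lik for all N_obs; keep this fold's held-out columns
    ll <- loo::extract_log_lik(fits[[k]])[, holdout_idx[[k]], drop = FALSE]
    # log predictive density per held-out observation, averaged over posterior draws
    pointwise[holdout_idx[[k]]] <- matrixStats::colLogSumExps(ll) - log(nrow(ll))
  }
  list(elpd_kfold = sum(pointwise), pointwise = pointwise)
}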

problems using loo in clogit models

In an rstanarm branch, I introduced a stan_clogit function which is similar to the clogit function in the survival package but is described in more detail by Stata (http://www.stata.com/manuals13/rclogit.pdf). So, you will be able to do

post <- stan_clogit(case ~ spontaneous + induced, strata = stratum,
                            data = infert[order(infert$stratum, !infert$case),], QR = TRUE)

The problem is that some groups have the same values on the dummy variables spontaneous and induced, such as

> infert[infert$stratum == 3,]
    education age parity induced case spontaneous stratum pooled.stratum
3      0-5yrs  39      6       2    1           0       3              4
86     0-5yrs  39      6       2    0           0       3              4
168    0-5yrs  39      6       2    0           0       3              4

Thus, no matter what the posterior realizations of the coefficients on spontaneous and induced are, the likelihood for group 3 is the same.

I am pretty sure the correct concept for a clogit model is to imagine leaving out one group rather than one observation, but when I call loo::loo.matrix, the third column of the input is constant, which causes the Pareto k estimate to be infinite. Since leaving out the third group would only change the log-likelihood by a constant, an infinite k seems unreasonable.

The question becomes: what function should catch this? We could have stan_clogit drop groups that have only one unique row in the design matri{x,ces}. We could have loo.stanreg omit such groups. Or loo::loo could check which Pareto k estimates are infinite and change them to some number when the log-likelihoods are finite but constant. Thoughts @avehtari?

k_warnings needs a newline at the end

Otherwise, you get something that looks like this when R is executed in a terminal

All Pareto k estimates OK (k < 0.5)> 

It should look like

All Pareto k estimates OK (k < 0.5)
>

Comparison of kfold objects broken

With github main branch versions of rstanarm and loo

library("MASS")      # needed for mvrnorm()
library("rstanarm")
options(mc.cores = parallel::detectCores())

set.seed(1754)
n <- 60
k <- 30
rho <- 0.8
Sigma <- rho*array(1, c(k,k)) + (1-rho)*diag(k)
X <- mvrnorm(n, rep(0,k), Sigma)
b <- c(c(-1, 1, 2), rep(0,k-3))
y <- X %*% b + rnorm(n)*2
fake <- data.frame(X, y)

fit_1 <- stan_glm(y ~ ., prior=normal(0, 10, autoscale=FALSE), data=fake)
loo_1 <- loo(fit_1)
kfold_1 <- kfold(fit_1)
fit_2 <- update(fit_1, prior=hs())
kfold_2 <- kfold(fit_2)
compare_models(kfold_1,kfold_2)

produces an error

Error in if (dim(loo_a)[2] != dim(loo_b)[2]) { : 
  argument is of length zero

because dim() of both objects is NULL.

PSIS improvements

Brief description

The new and improved PSIS will take into account the MCMC effective sample size, fit the GPD using a tail fraction that shrinks toward 0 as the number of draws grows, use a weakly informative prior for the GPD, and provide effective sample size and variance estimates that are easier to interpret than khat values. Updated paper: https://arxiv.org/abs/1507.02646. Implementation details below.

The new-psis branch has some preliminary work on implementing these changes but it is not ready yet.

Implementation checklist

  • computation of relative MCMC efficiency for exp(log_lik)
  • new number of weights in tail for GPD fit
  • weakly informative prior for k
  • generic effective sample estimate for PSIS
  • variance of the expected loo predictive density var(exp(loo_i))=var(epd_i)
  • variance of the expected loo log predictive density var(loo_i)=var(elpd_i)
  • variance of other functions (e.g. when using E_loo)
  • khat of other functions (e.g. when using E_loo)
  • update documentation

Details

NOTE: the code snippets below are MATLAB-ish pseudocode, where some
variables are vectors and broadcasting operators are assumed. Art
Owen's book chapter http://statweb.stanford.edu/~owen/mc/Ch-var-is.pdf
also has most of the equations, which should be relatively straightforward to
implement in R. A small R sketch of two of these steps appears after the list.


  • Relative effective sample size of exp(log_lik) from MCMC with the
    usual algorithm. We could use neff for log_lik, but neff for exp(log_lik)
    is better for the normalization term in self-normalized IS.
rel_neff = mcmc_neff(exp(log_lik)) / S
  • The number of weights used to fit the generalized Pareto distribution now
    decreases with the number of draws S, and is also adjusted based on the
    MCMC neff for exp(log_lik). This addresses the questions about asymptotic
    properties, works better for thick-tailed proposal distributions, and
    accounts for dependent draws from the Markov chain. The number of weights
    used to fit the GPD is 3*sqrt(S), adjusted by the relative effective
    sample size of the MCMC, and no more than 20% of the weights:
n_pareto = ceil(min(0.2 * S, 3 * sqrt(S) / rel_neff))
  • The Pareto fit uses a prior for k. This stabilizes estimates for very small
    Monte Carlo sample sizes and low-neff cases. The weakly informative prior for
    the generalized Pareto distribution fit is a Gaussian prior centered on kprior = 0.5.
    After computing khat, adjust it as follows:
a = 10
khat = khat * n / (n + a) + a * 0.5 / (n + a)
  • Generic effective sample size for PSIS:
    • w denotes the normalized Pareto smoothed weights
    • the first part is the usual estimate which is accurate for k < 0.7
    • rel_neff is the relative effective sample size of MCMC from above
neff_psis = 1. / sum(w.^2) * rel_neff;
  • E[epd_i]
    • w denotes the normalized Pareto smoothed weights

E[epd_i] = sum(exp(log_lik) .* w)

  • Variance of the expected loo predictive density var(exp(loo_i))=var(epd_i)
    (this is not for the expected loo log predictive density)
var_epd_i = sum(w.^2.*(exp(log_lik)-E[epd_i]).^2);
  • In order to get the variance of the expected loo log predictive density
    var(loo_i)=var(elpd_i), we need to project the distribution of epd_i through
    log transform. We probably should do this with quadrature, but I quickly computed
    by sampling from the Gaussian and taking the logarithm.
    • with the addition of dividing by rel_neff
z ~ N(E[epd_i], var[epd_i])
var(elpd_i) = var(log(z))/rel_neff;
  • Total variance for elpd = sum(elpd_i). The square root of this could be reported to users as the Monte Carlo error estimate (which is reliable only if all k < 0.7):
var(elpd) = sum_i(var(elpd_i))
  • The corresponding variance for other functions (e.g. when using E_loo), where mu denotes the function values, with the rel_neff adjustment included:
var_epd_i = sum(w.^2 .* (mu - sum(w.*mu)).^2) / rel_neff;
  • khat estimate for other functions (e.g. when using E_loo). Compute a function-specific khat from the following (see Epifani et al. (2008) for justification), where h is the function and w are the raw weights:
sqrt(1 + h(theta)^2) .* w(theta)
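
As referenced in the note above, here is a small R sketch of two of these steps (the tail length and the khat prior adjustment); the function names are illustrative, not loo's internals:

# Illustrative only: number of tail draws used for the GPD fit, and the
# shrinkage of khat toward the prior mean kprior = 0.5.
psis_tail_length <- function(S, rel_neff) {
  # 3*sqrt(S), adjusted by the relative MCMC efficiency, capped at 20% of the draws
  ceiling(min(0.2 * S, 3 * sqrt(S) / rel_neff))
}

regularize_khat <- function(khat, n_tail, k_prior = 0.5, a = 10) {
  # the shrinkage weight a / (n_tail + a) vanishes as the tail sample grows
  khat * n_tail / (n_tail + a) + a * k_prior / (n_tail + a)
}

# example: S = 4000 draws with relative efficiency 0.5
n_tail <- psis_tail_length(4000, rel_neff = 0.5)  # 380 tail draws
regularize_khat(0.9, n_tail)                      # about 0.89, pulled toward 0.5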

lx() consumes too much RAM

Currently,

> loo:::lx
function (a, x) 
{
    b <- -a
    bx <- outer(x, b)
    d <- dim(bx)
    k <- .colMeans(log1p(bx), d[1], d[2])
    log(b/k) - k - 1
}

stores a giant bx matrix in RAM even though it only works on it one column at a time. Even if it is a bit slower, it would increase the number of situations where loo is RAM-feasible if you instead did

lx <- function(a,x) {
  a <- -a
  k <- sapply(a, FUN = function(y) mean(log1p(y * x)))
  log(a/k) - k - 1
}

loo fails with error on a particular matrix

I have a matrix—which I will happily send you—that causes loo to return an error:

> load('log_lik_mat.RData')
> dim(log_lik_mat)
[1]  100 3255
> loo(log_lik_mat, cores=1)
Error in if (sigma <= 0) return(rep(NaN, length(p))) : 
  missing value where TRUE/FALSE needed

The matrix contains no NaNs and no Infs but does contain a lot of zeros.

However, the following works fine:

> log_lik_mat <- matrix(0, nrow=100, ncol=3255)
> loo(log_lik_mat)
Computed from 100 by 3255 log-likelihood matrix

         Estimate  SE
elpd_loo      0.0 0.0
p_loo         0.0 0.0
looic         0.0 0.0

All Pareto k estimates OK (k < 0.5)> 

(Minor note: missing newline in output for successful result.)

warn when pareto_k is infinite

plot.loo should throw a warning when some of the shape parameters of the generalized Pareto distribution are Inf. Such points are omitted from the plot, which may give a misleading impression of how robust the estimates are.

Getting NA result

Hi, I'm running LOO on log-likelihood matrices from two very similar Stan models with the same number of samples and data points, and one of them is returning NA results. I don't see any obvious differences between the matrices (neither has any zeros or NaNs). Any clue why this might be the case? Happy to transfer the two matrices, but they're about 500 MB each.

Computed from 4000 by 18600 log-likelihood matrix

         Estimate SE
elpd_loo       NA NA
p_loo          NA NA
looic          NA NA

relative_eff uses slightly wrong algorithm

relative_eff computation in R has two problems:

  1. initial sequence truncation is based on individual lags, but should be based on pairs as in Stan
  2. autocorrelations are used differently, which may be suboptimal compared to FFT version used in Stan

Fix

  • use Geyer's initial monotone sequence as in Stan (a rough sketch is below)
  • check that the autocorrelation values match the code in Stan, or change the code to use the FFT
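
The pair-based truncation could look roughly like the following for a single chain; this is a rough sketch under that assumption, not Stan's or loo's actual implementation:

ess_geyer <- function(x) {
  n <- length(x)
  rho <- acf(x, lag.max = n - 1, plot = FALSE)$acf[, 1, 1]  # rho[1] is lag 0
  # Geyer: work with sums of adjacent autocorrelation pairs, Gamma_k = rho_{2k} + rho_{2k+1}
  G <- rho[seq(1, n - 1, by = 2)] + rho[seq(2, n, by = 2)]
  m <- which(G <= 0)[1]
  if (!is.na(m)) G <- G[seq_len(m - 1)]  # initial positive sequence: stop at first non-positive pair
  G <- cummin(G)                         # initial monotone sequence: enforce non-increasing pairs
  tau <- -1 + 2 * sum(G)                 # integrated autocorrelation time
  n / tau                                # effective sample size
}

# relative efficiency of exp(log_lik[, i]) for one chain, cf. loo::relative_eff():
# ess_geyer(exp(log_lik[, i])) / nrow(log_lik)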

error loo 2.0 when using reloo=TRUE

Hello,

I ran a model using brms, applied loo with reloo = TRUE, and got this:

Error: No Pareto k estimates found in the 'loo' object.
Traceback:

1. loo(m2.2, reloo = TRUE)
2. loo.brmsfit(m2.2, reloo = TRUE)
3. eval(cl, parent.frame())
4. eval(cl, parent.frame())
5. LOO(x = m2.2, reloo = TRUE)
6. LOO.brmsfit(x = m2.2, reloo = TRUE) ...

13. stop2("No Pareto k estimates found in the 'loo' object.")
14. stop(..., call. = FALSE)

Any ideas on what the problem is?

Would a model_avg function make sense?

Thanks for an interesting and well-documented package!

So, I am new to Bayesian approaches. In the frequentist world, I did many analyses like this: specify a set of plausible models, rank them by their AIC values, and average the models with the appropriate weights to obtain model-averaged parameter estimates, predictions, etc.

full_model <- lm(y ~ x1 + x2)
aic_table <- MuMIn::dredge(full_model)
model_averaging_results <- MuMIn::model.avg(aic_table)

As I understand it, loo provides a way to estimate model weights, which are a lot like Akaike weights in terms of their interpretation (i.e. models with a weight near 1 are likely to be the 'best' model in the set) and intended use (i.e. the aim is then to average across models - NOT to simply pick a single top model). Hope that's right?

Assuming I understand correctly, would it make sense to add a convenience function with a similar aim to MuMIn::model.avg? Ideally, the new loo::model_avg function could be used like this (here I assume you again wrote it in a way that allows integration with brms, rstanarm, etc.):

model1 <- brm(y ~ x1 + x2)
model2 <- brm(y ~ x1)
loo_values <- loo_model_weights(model1, model2)
model_averaging_results <- model_avg(object = list(model1, model2), 
                                     weights = loo_values)
summary(model_averaging_results) # averaged posterior distributions for each parameter 
predict(model_averaging_results) # averaged predicted values, etc etc

Maybe the model averaging is so easy, or so case-specific, that it doesn't need a separate function? But if that's the case, some worked examples in the vignette would be really handy. As it stands, I am not really sure what to do with the LOO values calculated for my models!

Thanks

model_weights does not yet seem to work

I recently updated to the dev version of loo 2.0 (new-psis branch) to prepare brms for its release. I also tried out the new model_weights function, which results in an error. Reproducible example:

library(brms)
fit1 <- brm(rating ~ treat + period, data = inhaler)
fit2 <- brm(rating ~ treat + period + carry, data = inhaler)

ll1 <- log_lik(fit1)
ll2 <- log_lik(fit2)
loo::model_weights(list(ll1, ll2))

"Error in elpd_loo[k] <- L$elpd_loo : replacement has length zero"

loo on rstanarm object failing

Not actually sure if this is a loo or rstanarm issue.

Trying to run loo on output from stan_glmer (I have used the data option). This ran successfully before (R 3.3.2), and I could easily compare the models. I've been rerunning analyses for a revision and have also upgraded R, rstanarm, and loo, but now I can't get loo running on the rstanarm objects.

Where Treatment is the output from running stan_glmer, I get the following error:

Treatment = stan_glmer (Acc~  Treatment + (Treatment|ID),
                           adapt_delta=.99, init_r = 0.1,
                           data=df, family='binomial')

LOO = loo(Treatment, cores=4)

Error in vapply(out, "[[", 2L, FUN.VALUE = numeric(1)) : 
  values must be length 1,
 but FUN(X[[1]]) result is length 0

LOO fails for data with only one observation

Reproducible example with the dev version of brms:

fit1 <- brm(count ~ log_Age_c, epilepsy)
LOO(fit1, newdata = epilepsy[1, ])
> Error in colSums2(x) : 
>   Argument 'dim' must be an integer vector of length two. 

I think there is a drop = FALSE argument missing somewhere when indexing observations. If possible it would be amazing if this could be fixed before releasing loo 2.0.
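
For illustration (not the actual loo internals), this is the kind of dimension-dropping that drop = FALSE prevents:

ll <- matrix(rnorm(4000), nrow = 1000, ncol = 4)  # S x N log-likelihood matrix
dim(ll[, 1])                # NULL: a single column silently drops to a vector
dim(ll[, 1, drop = FALSE])  # 1000 x 1: keeps the matrix dimensions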

allow input as a matrix of posterior draws, N, and a log-likelihood function

You could save a lot of RAM if the user did not have to input an S x N matrix of log-likelihoods. I think it should be possible for the user to pass a function like

ll <- function(i, draws) {
  # return a vector of log-likelihoods for observation i 
}

and have functions like loo call vapply(1:N, FUN = foo, FUN.VALUE = double(?)) where foo is defined in loo's NAMESPACE and calls ll(i, draws) internally. Then foo(i) takes the result of ll(i, draws) and returns whatever quantities are needed for the i-th observation. This would only require enough memory to hold about S doubles at a time.
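
For instance, with a toy normal model (everything here is hypothetical and only illustrates the proposed interface):

y <- rnorm(100)                                       # toy data
draws <- cbind(mu = rnorm(4000), sigma = rexp(4000))  # toy S x 2 matrix of posterior draws
ll <- function(i, draws) {
  # returns the S log-likelihood values for observation i
  dnorm(y[i], mean = draws[, "mu"], sd = draws[, "sigma"], log = TRUE)
}
str(ll(1, draws))  # numeric vector of length 4000; the S x N matrix is never formed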

loo-R2

Add a loo-R2 computation. Useful to get something that is easier to interpret than elpd, and better than plain R2 when comparing models.

looR2 <- function(fit) {
    y <- get_y(fit)
    ypred <- posterior_linpred(fit)
    ll <- log_lik(fit)
    r_eff <- relative_eff(exp(ll), chain_id = rep(1:4, each = 1000))
    psis_object <- psis(log_ratios = -ll, r_eff = r_eff)
    ypredloo <- E_loo(ypred, psis_object, log_ratios = -ll)$value
    eloo <- ypredloo-y
    return(1-var(eloo)/var(y))
}

psis() with small log ratios

When the log ratios passed to psis() are small (large negative values), it seems it cannot return the correct log weights. Small log ratios will often occur when a constant is dropped in the log probability computation.

For example:

log_ratios <- rnorm(1000, -100, 1)
psis1 <- psis(log_ratios = log_ratios)

Both psis1$log_weights and weights(psis1) return a constant log weight for each data point:

> summary(psis1$log_weights)
       V1        
 Min.   :-97.24  
 1st Qu.:-97.24  
 Median :-97.24  
 Mean   :-97.24  
 3rd Qu.:-97.24  
 Max.   :-97.24  
> summary(weights(psis1,normalize=T))
       V1        
 Min.   :-6.908  
 1st Qu.:-6.908  
 Median :-6.908  
 Mean   :-6.908  
 3rd Qu.:-6.908  
 Max.   :-6.908  

This can be solved by first centering the log ratios:

log_ratios <- log_ratios - mean(log_ratios)
psis_weights <- psis(log_ratios = log_ratios)$log_weights
>summary(psis_weights)
       V1         
 Min.   :-6.8539  
 1st Qu.:-4.2206  
 Median :-3.5175  
 Mean   :-3.5160  
 3rd Qu.:-2.8394  
 Max.   :-0.1453  

The k-hat estimates are the same in the two cases. I suspect it is a numerical problem with the exp and log operations.

broom tidiers for loo objects

It looks like there are already tidiers in the broom package for rstanarm and stanfit objects, but I didn't see any for loo objects. So I wrote some. I could submit these to broom or they could be included here.

augment.loo <- function(x, data = NULL, ...) {
  out <- as_tibble(x$pointwise)
  names(out) <- paste0(".", names(out))
  out$.pareto_k <- x$pareto_k
  if (!is.null(data)) {
    bind_cols(data, out)
  } else {
    out
  }
}

glance.loo <- function(x, ...) {
  out <- tibble::as_tibble(unclass(x[setdiff(names(x), c("pointwise", "pareto_k"))]))
  out$n <- attr(x, "log_lik_dim")[2]
  out$n_sims <- attr(x, "log_lik_dim")[1]
  out
}

tidy.loo <- function(x, ...) {
  elpd <- grep("^elpd_(loo|waic)$", names(x), value = TRUE)
  p <- grep("^p_(loo|waic)$", names(x), value = TRUE)
  ic <- grep("^(waic|looic)$", names(x), value = TRUE)
  tibble::tibble(
    param = c(elpd, p, ic),
    estimate = c(x[[elpd]], x[[p]], x[[ic]]),
    std.err = c(x[[paste0("se_", elpd)]],
                x[[paste0("se_", p)]],
                x[[paste0("se_", ic)]]))
}

Error message when using compare_models function for loo objects

I'm using the compare_models function on loo objects in rstanarm, and receiving an error message I don't understand. All the models I'm comparing use the same dataset (same # observations, same response variable, same family of distributions, same transformation applied to the response variable), but when I call compare_models, I receive the following error message:

Error: Not all models have the same y variable.

Full description of the problem is here: http://discourse.mc-stan.org/t/error-message-when-using-compare-models-function-for-loo-objects-of-hierarchical-models/3008

And the link to my R project (with .RData and .R files) is here: https://www.dropbox.com/sh/jdmjy2w55fd24pm/AABL0Q4Myl9rA38F850AGmUTa?dl=0

helper function for k-fold-cv

The package is named loo, but when PSIS-LOO fails and k-fold CV is used instead, the corresponding helper function could form an object that can be passed to the compare function.

loo crashes R when dealing with large log-likelihood matrices

loo keeps crashing R when I call loo() on a log-likelihood matrix of around 100 MB. The size is 5250 iterations by 2552 observations of log-likelihood. I've played around with reducing the size by limiting the number of rows, and I am able to get it to run without crashing. It also seems to happen more often in RStudio than in stand-alone R.

I suspect an issue with running out of memory?

Session info:
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

8-core laptop with 64 GB of RAM.

add elpd_diff as first column of compare output

for example:

     elpd_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
fit2   0.0     -78.7      4.6         4.2   1.3    157.4   9.3   
fit1  -4.7     -83.4      4.3         3.2   1.2    166.8   8.7   
fit3  -6.0     -84.7      3.5        10.7   2.0    169.5   7.0   

Bayesian bootstrap for uncertainty estimates

Summary:
Add

  • Function for Bayesian bootstrap
  • Function for getting BB-samples describing the predictive performance uncertainty

Description:
Bayesian bootstrap can be used to get samples from the distribution of the LOO predictive performance estimate. The Bayesian bootstrap is equivalent to a Dirichlet distribution model. The function should have optional arguments for alpha and the random seed. Alpha not equal to 1 is needed later. Using the same random seed allows easier comparison of models. See Aki Vehtari and Jouko Lampinen (2002). Bayesian model assessment and comparison using cross-validation predictive densities. Neural Computation, 14(10):2439-2468.

Example code with the log score (the seed for the Dirichlet draws is not fixed):

data(radon)
y<-radon$log_radon
## Fit the first model
modelA <- stan_lmer(
    log_radon ~ floor + log_uranium + floor:log_uranium + (1 + floor | county),
    data = radon,
    cores = 4,
    iter = 2000,
    chains = 4)
looA<-loo(modelA)
loos<-looA$pointwise[,1]
## number of observations
N<-length(loos)
## number of BB-samples
nb<-10000
## Dirichlet alpha
alpha<-1
## nb samples from the Dirichlet distribution
library(extraDistr)
dirw<-rdirichlet(nb,matrix(alpha,1,N))
## BB-samples from elppd
bbelppd=rowSums(t(t(as.matrix(dirw))*as.vector(loos)))*N

Problem with "extract_log_lik" in rstan 2.9.0/loo 0.1.5

I wanted to explore loo. I loaded some previously fit models and tried:

> class(Duree.Incr)
[1] "stanfit"
attr(,"package")
[1] "rstan"
> foo <- extract_log_lik(Duree.Incr)
Error in extract_log_lik(Duree.Incr) : Please load the 'rstan' package.

However :

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
 [1] LC_CTYPE=fr_FR.utf8       LC_NUMERIC=C            
 [3] LC_TIME=fr_FR.utf8        LC_COLLATE=fr_FR.utf8   
 [5] LC_MONETARY=fr_FR.utf8    LC_MESSAGES=fr_FR.utf8  
 [7] LC_PAPER=fr_FR.utf8       LC_NAME=C               
 [9] LC_ADDRESS=C              LC_TELEPHONE=C          
[11] LC_MEASUREMENT=fr_FR.utf8 LC_IDENTIFICATION=C     

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] loo_0.1.5     rstan_2.9.0-3 ggplot2_2.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3        matrixStats_0.50.1 lattice_0.20-33    MASS_7.3-45      
 [5] brms_0.8.0.9000    grid_3.2.3         plyr_1.8.3         nlme_3.1-125     
 [9] gtable_0.2.0       stats4_3.2.3       coda_0.18-1        scales_0.4.0     
[13] minqa_1.2.4        nloptr_1.0.4       Matrix_1.2-4       splines_3.2.3    
[17] statmod_1.4.24     lme4_1.1-11        tools_3.2.3        munsell_0.4.3    
[21] abind_1.4-3        parallel_3.2.3     compiler_3.2.3     inline_0.3.14    
[25] colorspace_1.2-6   gridExtra_2.2.1  

What is happening?

Error with `ppc_loo_interval`: missing value where TRUE/FALSE needed

I am getting the following error when running ppc_loo_intervals on a model estimated using CmdStan. For context, I've read the samples into an R stanfit object, and both y_rep and log_lik were included in the generated quantities block (unfortunately I can't share the model or the data).

Here is the command:

# estimate psisloo, plot loo_intervals
ppmat <- rstan::extract(fit,  'y_rep')[['y_rep']]
lw <- try(loo::extract_log_lik(fit, parameter_name = 'log_lik'))
psis <- loo::psislw(lw)
bayesplot::ppc_loo_intervals(y = true_vals, yrep = ppmat, lw = psis$lw_smooth)

Here is the error:

Error in if (all(w == w[1])) return(quantile(x, probs = probs, names = FALSE)) :
  missing value where TRUE/FALSE needed

Seems to be due to this being in the input:

>  any(is.na(psis$lw_smooth))
[1] TRUE
> which(is.na(psis$lw_smooth))
 [1]   6831   9815  10118  12118  13815  14879  16879  18879  20831  35594
[11]  37594  39594  46118  62521  78118  80831  83248  85248  88521  90521
[21]  97594 100831 103594 105594 114831 116803 122521 152831 162780 164831
[31] 166831 175594 176095 197594 237553 239553 255594 256010

However, there are no NAs in the log-lik input:

> any(is.na(lw))
[1] FALSE

And I don't see anything odd about these log-lik values compared to the others:

> summary(lw[!is.na(psis$lw_smooth)])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -59634   -8560   -6848   -9261   -5771   -3354
> summary(lw[is.na(psis$lw_smooth)])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -19224   -4574   -4278   -4658   -3701   -3417

These are not clustered within certain observations:

> apply(psis$lw_smooth, FUN = function(x) sum(is.na(x)), MARGIN = 2)
  [1] 0 0 0 1 1 1 2 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
 [38] 0 0 1 1 1 1 0 1 1 0 0 1 0 1 1 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 [75] 0 0 1 0 0 0 0 1 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
[112] 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0

Seems these are caused by values of psis$lw_smooth that are -Inf

Here is the relevant debug context:

debugging in: E_fun(x[, i], w[, i], probs)
debug: {
    stopifnot(all(probs > 0 & probs < 1))
    if (all(w == w[1]))
        return(quantile(x, probs = probs, names = FALSE))
    ord <- order(x)
    x <- x[ord]
    w <- w[ord]
    ww <- cumsum(w)
    ww <- ww/ww[length(ww)]
    qq <- numeric(length(probs))
    for (j in seq_along(probs)) {
        ids <- which(ww >= probs[j])
        wi <- min(ids)
        if (wi == 1) {
            qq[j] <- x[1]
        }
        else {
            w1 <- ww[wi - 1]
            x1 <- x[wi - 1]
            qq[j] <- x1 + (x[wi] - x1) * (probs[j] - w1)/(ww[wi] -
                w1)
        }
    }
    return(qq)
}
Browse[2]> all(w == w[1])
[1] NA
Browse[2]> any(is.na(w))
[1] TRUE
Browse[2]> w[is.na(w)]
[1] NaN

Looking within the loo::psis function:

Browse[2]>
debug: if (cores == 1) {
    out <- lapply(X = 1:N, FUN = .psis_loop)
} else {
    if (.Platform$OS.type != "windows") {
        out <- mclapply(X = 1:N, FUN = .psis_loop, mc.cores = cores)
    }
    else {
        cl <- makePSOCKcluster(cores)
        on.exit(stopCluster(cl))
        out <- parLapply(cl, X = 1:N, fun = .psis_loop)
    }
}
Browse[2]>
debug: out <- mclapply(X = 1:N, FUN = .psis_loop, mc.cores = cores)
Browse[2]> str(out[[128]]$lw_new)
 num [1:2000] -Inf -Inf -Inf -Inf -Inf ...

For now, I'm pretty sure this is just a terrible model fit (the loo checks are part of an evaluation pipeline), but I'm posting here in case it's helpful for others.

add se_diff column to result when comparing > 2 models

The approximate standard error of the elpd difference is included when comparing 2 models. We can also include se(elpd_diff) as a column in the output when comparing more than 2 models if, like elpd_diff, it refers to each row versus the first row rather than all pairwise combinations.
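
For reference, the two-model standard error comes from the pointwise differences, and the same calculation could be applied to each row against the first row (a minimal sketch, assuming vectors of pointwise elpd_loo values):

# Sketch: se of the elpd difference between another model and the reference model,
# computed from the pointwise elpd contributions.
se_elpd_diff <- function(pointwise_ref, pointwise_other) {
  d <- pointwise_other - pointwise_ref
  sqrt(length(d) * var(d))  # same form as the existing two-model comparison
}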

loo crashes on me

The waic function works, but loo fails with:

R3.1.3> try({
+   print(loo(log_lik))
+ })
Error in log_lik + psis$lw_smooth : non-conformable arrays
In addition: Warning message:
In mclapply(X = 1:N, FUN = .psis, mc.cores = cores) :
  scheduled cores 3, 4 encountered errors in user code, all values of the jobs will be affected

Any idea? This is loo 0.1.2
