rmcelreath / rethinking Goto Github PK

View Code? Open in Web Editor NEW

2.1K 2.1K 596.0 2.8 MB

Statistical Rethinking course and book package

R 100.00%

rethinking's People

Contributors

Stargazers

Watchers

Forkers

noamross davharris ddeepakkumar ashander vsbuffalo oliyiyi arturochian rtrangucci mattjyoung iddisam wfuu elcovi fototo ajaytsao xmur michaellevy stanke zhmxu loudermilk cello0418 tkngoutham mehbob kaneyori rainbowwhale amit-sp-singh ana2s007 thezedd fsgp mathesong zengfengbo rah5ff rdesannicolas guilhermegarcia lebailly pdmeira paulc00 akzaidi adamwespiser iglpdc bhatmonica thimotyb amitaikoretz imanojkumar e-mu-pi allendowney wavelets dagtann nilescbn sshegheva sandboxorg fengwangjiang peterquid brenden-sweetman linkfar jrombouts ioanfilip2 nsnen butterkaffee anuradha-bajpai zngtian fjro fred777 anhnguyendepocen bluelou wen036 xiaoxiangma rpruim rohitjacob fdabl butterflyology gmichel-smu dalejbarr miriam5ht clehene haowang666 madhatter106 geauxdojang murpher9 zdong1 eiskrzypczakhennig feiyang-sun maltelau eamayo scisus dxander k-gerst alkavaev mpgartland ryanbower aminelg pb-pravin cintrond medic20 kamal-hothi rainfireliang nk3182 caesaryang shark8me ahartikainen arvinds-ds

rethinking's Issues

Using MAP with custom likelihood functions

From the documentation, it is not clear to me how map() will form it's function calls to the likelihood function, in a way that makes it straightforward to write a custom function. Could you clarify this in the documentation?

It seems that map() expects the likelihood function to return log(p) values, and that it passes the option "log=true", and that it also passes the data itself as "x=". I can't find that this is described anywhere. My apologies if this is already explained somewhere, but I missed it.

It seems that perhaps this is standardized in R itself, but I wasn't able to find any official probability distribution API documentation, except this PDF on the cran list of distributions:

http://www.rmetrics.org/Meielisalp2009/Presentations/Scott.pdf
https://cran.r-project.org/web/views/Distributions.html

Some published likelihood functions I have seen don't seem to follow these guidelines, for example the Wallenius noncentral hypergeometric distribution provided by the biasedUrn package.

map2stan function crashing R on RStudio 99.902, R version 3.2.1 for mac

Attempting to recreate the following examples

ch10, Kline data
ch08, rugged data

System:
RStudio 99.902, R version 3.2.1 for Mac

Description:

when loading 'rethinking' library, get the following messages

Loading required package: rstan
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.10.1, packaged: 2016-06-24 13:22:16 UTC, GitRev: 85f7a56811da)
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
Loading required package: parallel
rethinking (Version 1.59)
Warning messages:
1: package ‘rstan’ was built under R version 3.2.5
2: package ‘ggplot2’ was built under R version 3.2.4
3: package ‘StanHeaders’ was built under R version 3.2.5

map function from rethinking library appears to be working properly based on reproducing outputs found in Statistical Rethinking
attempt to use 'map2stan' function and R session aborts without warnings or other error messages
R session abort occurring for both ch08 and ch10 examples, regardless of how the model is being passed to Stan in the map2Stan function (whether by model variable - e.g. m8.1 - or by full specification of the model in R, as in chapter 8 rugged example).

Sampling warning errors

After sampling for all chains has finished, it prints out something like the following:
Chain 1, Iteration: 1 / 1 [100%] (Sampling)#
# Elapsed Time: 2e-06 seconds (Warm-up)
# 0.000223 seconds (Sampling)
# 0.000225 seconds (Total)
#
I then get a warning about a single divergent transition
There were 1 divergent transitions after warmup. Increasing adapt_delta above 0.X may help.
where 0.X is some adapt_delta setting (changing it hasn't helped).

This seems to occur no matter what model I'm fitting, including very simple models like the following:
fakeData <- data.frame(y=rnorm(100,10,5))
test <- map2stan(alist(y ~ dnorm(mu,sigma), mu~dnorm(0,10), sigma~dunif(1,20)), data=fakeData)

Using rethinking v. 1.58. This has occurred with multiple versions of R, running on multiple versions of Ubuntu, and rstan 2.8 and 2.9

tan or atan link function

The Stan reference manual recommends reparameterizing the Cauchy distribution in terms of the uniform distribution and a tangent function (manual v2.11.0 p225). In the example in the manual, this

beta ~ cauchy(mu, tau);

becomes

beta_unif ~ uniform(-pi()/2, pi()/2);
beta = mu + tau * tan(beta_unif);

Writing this directly as part of the model specification for map2stan doesn't work, as it complains that tan isn't a known distribution. In my case mu=0 and tau=1, so we can rewrite the second line as:

atan(beta) = beta_unif

However, now the problem is that atan isn't a defined link function in this package. Would that be difficult to add? I'm happy to attempt the code modification myself if you think it would be straightforward.

Cannot simply use caret ^ in linear models

In version 1.58, using a caret in a linear model like this:

library(rethinking)
data(cars)
m <- map2stan(
    alist(
        y ~ dnorm(mu,sigma),
        mu <- a*x^b,
        a ~ dnorm(0,10),
        b ~ dnorm(0,1),
        sigma ~ dexp(1)
    ),
    data=list(y=cars$dist,x=cars$speed) )

produces an odd error, telling the user that the variable x cannot be found.

This bug is fixed in the dev branch and so will be gone in 1.59. In the meantime, a work-around is to put parentheses around the variable name:

m <- map2stan(
    alist(
        y ~ dnorm(mu,sigma),
        mu <- a*(x)^b,
        a ~ dnorm(0,10),
        b ~ dnorm(0,1),
        sigma ~ dexp(1)
    ),
    data=list(y=cars$dist,x=cars$speed) )

Different results with map2stan

Hi Richard!

Congratulations for your new book and your YouTube videos. I find them amazing, I'm learning a lot with them. Today I was replicating some of your examples and I realised that the results obtained with map() are quite different from those estimated with map2stan(). This gives the right answer:

m4.3 <- map(
    alist(
        height ~ dnorm( mu , sigma ) ,
        mu <- a + b*weight,
        a ~ dnorm( 156 , 100 ) ,
        b ~ dnorm( 0 , 10 ) ,
        sigma ~ dunif( 0 , 50 )
) ,
    data=d2 )

But this one does not:

m4.3 <- map2stan(
    alist(
        height ~ dnorm( mu , sigma ) ,
        mu <- a + b*weight,
        a ~ dnorm( 156 , 100 ) ,
        b ~ dnorm( 0 , 10 ) ,
        sigma ~ dunif( 0 , 50 )
) ,
    data=d2 )

Any leads? Thanks a lot! :)

unexpected behavior in rlkjcorr

Hi Richard, hope you're having a great time in Germany.

My understanding of the LKJ correlation matrix was that the shape parameter should have a consistent interpretation (e.g. eta==1 implies uniform) regardless of the number of dimensions. In version 1.58 of the rethinking package, I'm not finding that to be the case.

If I run the example with K==2 and eta==1, I get a uniform distribution, as expected. But if I change K to 20, I get a bell-shaped curve with SD≈0.2.

library(rethinking)
R <- rlkjcorr(1e4,K=20,eta=1)
dens( R[,1,2] )

Question re extension to varying slopes model

I am trying to build on the stancode generated from the model in code block 13.23 by adding a predictor to the random intercept (more generally, I want to code cross-level interactions) but I can't figure out the syntax in a model with MVN priors. I'm probably missing something obvious, and this may not be the place for what is essentially a Stan (not map2stan) question, but would appreciate suggestions or example code.

To clarify further: In a simpler model--say, one with a variable intercept and fixed effects coefficients--the relevant Stan code would be:

`....
transformed parameters {
vector[J] eb0;
for (j in 1:J)
eb0[j] = z[j] * g0 ;
}

model {
g0 ~ normal(0, 5);
sigma_b0 ~ cauchy(0, 2);
for (j in 1:J)
b0[j] ~ normal(eb0[j], sigma_b0);
for (n in 1:N)
y[n] ~ normal(b0[cid[n]] + x[n] * beta, sigma_y[cid[n]]);
}
`
where z is a matrix of predictors with 1 in the first column and g is a vector of coefficients. There is no obvious way to plug this into my version of code block 13.23:

`data {
int<lower=0> N; // num country-years
int<lower=1> J; // num countries
real y[N]; // outcome
real dincr[N];
real gdpgr[N];
int<lower=1, upper=17> cid[N]; // country
}

parameters {
matrix[3, J] z_mat;
cholesky_factor_corr[3] L_Rho;
real g1;
real g2;
real g3;
vector<lower=0.0001>[3] sigma_b;
real<lower=0.0001> sigma_y;
}

transformed parameters{
matrix[J, 3] b_mat;
vector[J] b1;
vector[J] b2;
vector[J] b3;
vector[J] eb1;
matrix[3, 3] Rho;
b_mat = (diag_pre_multiply(sigma_b, L_Rho) * z_mat)';
b1 = col(b_mat, 1);
b2 = col(b_mat, 2);
b3 = col(b_mat, 3);
Rho = L_Rho * L_Rho' ;
}

model {
vector[N] mu;
vector[N] B1;
vector[N] B2;
vector[N] B3;
L_Rho ~ lkj_corr_cholesky(2);
sigma_y ~ cauchy(0, 2);
sigma_b ~ cauchy(0, 2);
g1 ~ normal(4.8, 10);
g2 ~ normal(0, 5);
g3 ~ normal(0, 5);
to_vector(z_mat) ~ normal(0, 1);
for (i in 1:N){
B1[i] = g1 + b1[cid[i]];
}
for (i in 1:N){
B2[i] = g2 + b2[cid[i]];
}
for (i in 1:N){
B3[i] = g3 + b3[cid[i]];
}
for (i in 1:N){
mu[i] = B1[i] + B2[i] * dincr[i] + B3[i] * gdpgr[i];
}
y ~ normal(mu, sigma_y);
}`

Thanks for any attention to this.

Convert to coda?

Is it possible to convert map2stan model fit objects to coda objects, as it is for standard stan objects (e.g. http://jeromyanglim.tumblr.com/post/91434443911/how-to-convert-a-stan-fit-object-to-work-with-coda)

glimmer error when factor labels contain spaces

For example:

> library(rethinking)
> data(milk)
> d <- milk
> glimmer(kcal.per.g ~ clade, data = d)
Error in parse(text = flist_txt) : <text>:4:20: unexpected symbol
3:     mu <- Intercept +
4:         b_cladeNew World
                  ^

Possible fix (untested) - in xparse_glimmer_formula, change

  fixef <- colnames(model.matrix(f_nobars, data))

  fixef <- make.names(colnames(model.matrix(f_nobars, data)))

Issue with Traceplots inside of Map2stan

plot(m8.1stan)
Error in as.double(y) :
cannot coerce type 'S4' to vector of type 'double'

Similar to the bug with plot precise, I am getting the same error with trace plots. I've reinstalled everything, but R itself. Any assistance would be greatly appreciated. This is Chapter 8 R code 8.12, I've followed the book coding exactly, yet I'm still getting this error. If I plot with:

plot(post$a,type="l")
it works, but it still isn't what I was hoping for since it lacks information.
Thanks.

Adding to generated quantities block

I'd like to compute estimates of LOO and K-fold cross validation, as described here:

http://www.stat.columbia.edu/~gelman/research/unpublished/waic_stan.pdf

Is there any way to add to the generated quantities block of the stan code maptostan emits to enable one to capture the log likelihood as Gelman et al describe? e.g.

generated quantities {
  vector[N] log_lik;
  for (n in 1:N){
         log_lik[n] <- normal_log(y[n], X[n]*b, sigma);
       }
}

I realise it's possible to use maptostan first and then extract and edit the code, but this is somewhat of a fiddle. More generally, I wonder if it's possible to include the LL quantity always to enable map2stan fits to be compatible with this LOO package: https://github.com/stan-dev/loo

Unrepeatable issue with plot(precis(object)) - cannot coerce type 'S4' to vector of type 'double'

A few people have reported this error:

> plot(precis(m5.3))
Error in as.double(y) : 
cannot coerce type 'S4' to vector of type 'double'

It appears to be related to precis methods not exporting right. But I have been unable to replicate the error. It's on my radar. In the meantime, accessing the plot method directly with precis.plot() (or precis_plot()) does work.

Accessing the Course Material

Hello,
Is there a way to access the course material (http://xcelab.net/rm/statistical-rethinking/)?

Thank You.

Error in compare function with an argument "func"

Thank you for this great textbook. This book is truly what I have wished to read for a long time.

I tried a compare function in an overthinking box, on page 310, but an error message was displayed like this.

compare(m10.8,m10.9,func=DIC)
Error in newvec[j] <- pars_orig[[name_orig]] :
replacement has length zero

Without the func argument, this code worked.

`map2stan` error when data.frame is a `tibble`

Hi Richard.

When using Hadley Wickam's tidyverse tools, data.frames become tibbles, which return a character vector when put through class:

c("rowwise_df","tbl_df","tbl","data.frame")

This causes map and map2stan to fail the check whether the data are list or data.frame (lines 39 in map.R and 74 in map2stan.r).

I wondering if it would be possible to wrap the test in a function call to any, so that it would look like this:

!any((class(data) %in% c("list", "data.frame")))

This would lead to returning a single TRUE when checking with a tibble, rather than a bool vector, which then R defaults to the first result on the vector when testing the condition in the if clause.

Thank you.

Anders.

WaffleDivorce.csv is missing Nevada?

Not sure if this is intentional.

Running 2nd map2stan() model with rstan 2.14.1 leads to segfault

After the rstan 2.14.1 (it might have been 2.14, but at least by 2.14.1), running a second map2stan() model in the same R session causes a segfault in my setup. Example:

library(rethinking)

data(chimpanzees)

# don't want any variables with NAs
d <- list( 
  pulled_left = chimpanzees$pulled_left ,
  prosoc_left = chimpanzees$prosoc_left ,
  condition = chimpanzees$condition ,
  actor = as.integer( chimpanzees$actor ) ,
  blockid = as.integer( chimpanzees$block )
)

# RStan fit
m2 <- map2stan(
  alist(
    pulled_left ~ dbinom(1,theta),
    logit(theta) <- a + bp*prosoc_left + bpc*condition*prosoc_left ,
    a ~ dnorm(0,10),
    bp ~ dnorm(0,10),
    bpc ~ dnorm(0,10)
  ) ,
  data=d, chains=1, cores=1 )

m2_2 <- map2stan(
  alist(
    pulled_left ~ dbinom(1,theta),
    logit(theta) <- a + bp*prosoc_left + bpc*condition*prosoc_left ,
    a ~ dnorm(0,10),
    bp ~ dnorm(0,10),
    bpc ~ dnorm(0,10)
  ) ,
  data=d, chains=1, cores=1 )

Gives:

 *** caught segfault ***
address 0x1c, cause 'memory not mapped'

Traceback:
 1: FUN(X[[i]], ...)
 2: lapply(name, function(id) {    v <- .Internal(getSymbolInfo(as.character(id), PACKAGE, as.logical(withRegistrationInfo)))  ...
 3: getNativeSymbolInfo(name, PACKAGE)
 4: doTryCatch(return(expr), name, parentenv, handler)
 5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 6: tryCatchList(expr, classes, parentenv, handlers)
 7: tryCatch(getNativeSymbolInfo(name, PACKAGE), error = function(e) e)
 8: Module(module, mustStart = TRUE)
 9: .getModulePointer(x)
10: <S4 object of class "Module">$stan_fit4model11743559f5a88_pulled_left___dbinom_1__theta_
11: eval(expr, envir, enclos)
12: eval(call("$", mod, paste("stan_fit4", model_cppname, sep = "")))
13: object@mk_cppmodule(object)
14: .local(object, ...)
15: sampling(sm, data, pars, chains, iter, warmup, thin, seed, init,     check_data = TRUE, sample_file = sample_file, diagnostic_file = diagnostic_file,   ...
16: sampling(sm, data, pars, chains, iter, warmup, thin, seed, init,     check_data = TRUE, sample_file = sample_file, diagnostic_file = diagnostic_file,   ...
17: stan(model_code = model_code, model_name = modname, data = d,     init = initlist, iter = iter, warmup = warmup, chains = chains,   ...
18: doTryCatch(return(expr), name, parentenv, handler)
19: tryCatchOne(expr, names, parentenv, handlers[[1L]])
20: tryCatchList(expr, classes, parentenv, handlers)
21: tryCatch(expr, error = function(e) {    call <- conditionCall(e)  ...
22: try(stan(model_code = model_code, model_name = modname, data = d,     init = initlist, iter = iter, warmup = warmup, chains = chains,   ...
23: map2stan(alist(pulled_left ~ dbinom(1, theta), logit(theta) <- a +     bp * prosoc_left + bpc * condition * prosoc_left, a ~ dnorm(0,   ...

I am running R 3.3.2 Patched and reinstalled rethinking and all dependencies from source. I'm not sure if this is reproducible by anyone else.

My sessionInfo():

R version 3.3.2 Patched (2017-01-03 r71888)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rethinking_1.59    rstan_2.14.1       StanHeaders_2.14.0 ggplot2_2.2.1      fortunes_1.5-4     devtools_1.12.0   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8        MASS_7.3-45        munsell_0.4.3      colorspace_1.3-2   lattice_0.20-34    plyr_1.8.4        
 [7] tools_3.3.2        grid_3.3.2         gtable_0.2.0       loo_1.0.0          coda_0.19-1        withr_1.0.2       
[13] matrixStats_0.51.0 lazyeval_0.2.0     assertthat_0.1     digest_0.6.11      tibble_1.2         gridExtra_2.2.1   
[19] codetools_0.2-15   memoise_1.0.0      inline_0.3.14      scales_0.4.1       stats4_3.3.2       mvtnorm_1.0-5

re-using compiled model

Is there a way to make map2stan re-use a compiled model with new data?

I know it's possible to extract the stanfit object and use that, but I'm running a simulation study and would like to be able to re-use the map2stan object to fit new data and then use post() simulate data from these fitted values.

Problem with dlkjcorr in varying slopes simulation

When I run model 13.12 on the simulated data, I get the following error:

Error in map2stan(alist(wait ~ dnorm(mu, Sigma), mu <- a_cafe[cafe] + :
Rho : no dimension found for this matrix. Use explicit start value to define its dimension.
For example: Rho = diag(2)

All previous code is exactly per the book. In particular,

Rho <- matrix(c(1,rho,rho,1), nrow = 2) # correlation matrix

Differing errors by groups throws compilation error

I've been working with a collaborator who has a model build in STAN, but we'd like to translate it to rethinking as it's so much more accessible to the rest of our collaborators on the project. An odd feature of the model is that each Study (this is a meta-analysis - although I can think of other models this should be true for) has it's own residual error. So, in our model, our likelihood function is

  y ~ lognormal(y_loc, sd_e[Study])

for which our prior would be stated as

  sd_e[Study] ~ dcauchy(0,1) #prior for study level errors

However, before map2stan will run the model, I get the error

Error in constr_list[[sigma_name]] : no such index at level 1

This wasn't super helpful in debugging, but I eventually tracked it down to the sd_e parameter, as when I took [Study] out of the residual, the model fit just fine. I'm thinking that either this is the wrong way to use dcauchy or there is something wonkier here in terms of adding having multiple residual errors. But given that normal distributions work just fine with a group structure, I cannot think of why this should fail. Is there something in how the model is translated to STAN code?

documentation for arrowspline

arrowspline in plotting.r looks generally useful, but is undocumented. Same with segmentsby and a few others.

Posterior Predictions from Multiple Equations

So, I'm working with rethinking in a Structural Equation Modeling framework. It works beautifully, but, I'm having some issues in generating posterior predictions. Namely, effects of linked relationships don't seem to translate. Consider the following model

x1 -> y1 -> y2

where each coefficient is a slope of 2.

I can create a nice fake data frame

set.seed(31415)
ndf <- data.frame(x1=runif(100,0,100))
ndf$y1 <- rnorm(100, ndf$x1*2, 40)
ndf$y2 <- rnorm(100, ndf$y1*2, 80)

And then a model that fits nicely using map2stan

nf <- alist(
  y1 ~ dnorm(yhat1, sigma1),
  y2 ~ dnorm(yhat2, sigma2),

  yhat1 <-  b1*x1,
  yhat2 <- b2*y1,

  b1 ~ dnorm( 0 , 1000 ),
  b2 ~ dnorm( 0 , 1000 ),

  sigma1 ~ dcauchy( 0 , 1 ),
  sigma2 ~ dcauchy( 0 , 1 )

)

n.fit <- map2stan( 
  nf , 
  data=ndf , 
  start=list(a1=1, a2=1, b1=1, b2=1, sigma1=10, sigma2=10)
)

here

> coef(n.fit)
        b1         b2     sigma1     sigma2 
  1.930119   1.965780  37.934121  86.538586

Great! But, when I try and do posterior prediction with x starting as 3, which should produce roughly 12 for y2...

> link(n.fit, data=list(x1=3)) -> a
[ 1000 / 1000 ] 200 / 1000 ]

> sapply(a, mean)
   yhat1    yhat2 
5.790357 1.965780

Huh....why is yhat incorrect here?

rng_seed appears to no effect in map2stan()

Setting rng_seed does not appear to ensure repeatable analyses with map2stan(). For example, a simple model from Rethinking:

library(rethinking)
data("Howell1")

d <- Howell1
d2 <- d[ d$age >= 18 , ]

m1 <- map2stan(
  alist(
    height ~ dnorm(mu, sigma),
    mu ~ dnorm(178, 20),
    sigma ~ dunif(0, 50)),
  data = d2,
  WAIC = FALSE,
  seed = 8
)

m2 <- map2stan(
  alist(
    height ~ dnorm(mu, sigma),
    mu ~ dnorm(178, 20),
    sigma ~ dunif(0, 50)),
  data = d2,
  WAIC = FALSE,
  seed = 8
)

m1
m2

Produce slightly different results.

> m1
map2stan model fit
1000 samples from 1 chain

Formula:
height ~ dnorm(mu, sigma)
mu ~ dnorm(178, 20)
sigma ~ dunif(0, 50)

Log-likelihood at expected values: -1219.41 
Deviance: 2438.83 
DIC: 2442.6 
Effective number of parameters (pD): 1.89 

> m2
map2stan model fit
1000 samples from 1 chain

Formula:
height ~ dnorm(mu, sigma)
mu ~ dnorm(178, 20)
sigma ~ dunif(0, 50)

Log-likelihood at expected values: -1219.41 
Deviance: 2438.82 
DIC: 2442.72 
Effective number of parameters (pD): 1.95

all.equal(m1, m2) show lots of differences, which seem to be fundamental rather than related to sampling (e.g., timing).:

 [1] "Attributes: < Component “coef”: Mean relative difference: 0.0001766267 >"                                                                                                                                                                                                                         
 [2] "Attributes: < Component “deviance”: Mean relative difference: 3.669022e-06 >"                                                                                                                                                                                                                     
 [3] "Attributes: < Component “DIC”: Mean relative difference: 6.030028e-05 >"                                                                                                                                                                                                                          
 [4] "Attributes: < Component “pD”: Mean relative difference: 0.03561425 >"       
...
[54] "Attributes: < Component “start”: Component 1: Component 1: Mean relative difference: 0.2771743 >"                                                                                                                                                                                                 
[55] "Attributes: < Component “start”: Component 1: Component 2: Mean relative difference: 0.1926305 >"                                                                                                                                                                                                 
[56] "Attributes: < Component “vcov”: Mean relative difference: 0.03782353 >"

These differences are admittedly small, but my thinking is that the models should be identical except for timing differences (the case when using seed = is a pure stan() call).

is it possible to code multivariate outcomes in map2stan?

Hello Richard.

Great package, great book, and great lectures! Thank you.

I am wondering if there is a way of using map2stan for multivariate outcomes. You mention it briefly in one your lectures, but I haven't seen any code for it on the web or book.

I want to quantify the effect of disease on a bird species. I think it might affect weight, and skeleton size. So, we have paired measures of weight and skeleton size (which are related as you might imagine), and we have a predictor variable that is healthy/sick. At some point, I want scale up to multiple species analysis, and it would be awesome to get some pooling going at different levels.

I was hoping something like this might work:

  alist(
    c( weight, skeleton ) ~ dmvnorm2( c(mu_w, mu_s), sigma, Rho ),
    c( mu_w, mu_s ) <- c(alpha_w, alpha_s)[ status_id ],
    c( alpha_w, alpha_s )[ status_id ] ~ dmvnorm2( c(mu_wa, mu_sa), sigma_a, Rho_a),
    mu_wa ~ dnorm( 0, 50 ),
    mu_sa ~ dnorm( 0, 5 ),
    sigma ~ dcauchy( 0, 2 ),
    sigma_a ~ dcauchy( 0, 2 ), 
    Rho ~ dlkjcorr( 2 ),
    Rho_a ~ dlkjcorr( 2 )
 )

This does not seem work.

At first I got an error about not knowing the dimensions of Rho, and suggesting giving it a starting value. So, I gave map2stan this: start = list( Rho = diag(2)). But, that was to no avail.

Do I have the syntax completely wrong, or the model, or both, or am I asking too much of the map2stan AI? All of the above might be an acceptable answer.

Thank you again.

Anders.

Don't save samples of irrelevant / redundant parameters in map2stan

First of all, many thanks for this great book and package!

When playing with the examples in the documentation of map2stan, I noticed that using dmvnormNC leads to saving parameter samples of z_N_<group> and L_Rho_<group>. The former samples are irrelevant because they are just normal(0,1) and the latter are redundant (in my opinion) as Rho_<group> is also saved. Accordingly, I feel that it would make sense to exclude them via the pars argument of stan. What do you think?

Imputation w/varying effects bug

Found an odd (and new) bug with using imputation in a varying effects model. I have an easy fix in local repository, but want to change algorithm a little to provide a better fix.

So fix pending in next release, which should be 1.54.

Tweaking map2stan traceplots

Hi Richard,

I'm hoping you could help me look under the hood on how your package plots the traceplots of map2stan objects. Specifically, is there a way to pull out the plots for just some of the parameters (my current models have dozens and I don't always want to look at all of them). Also, can I adjust the device parameters (e.g., ask or mfrow)?

Thanks for the help

maximum a prior vs maximum a posterior

In the textbook, p. 115, the summary says

To fit these models to data, the chapter introduced maximum a prior (MAP) estimation.

I assume what is meant and was introduced is maximum a posteriori estimation.

Wrong (undefined) array size constant in map2stan model with missing values

First of all, thank you very much for your extremely well written and insightful book and the accompanying software package, which has already changed my 'statistical life'.

I was trying to use map2stan for solving a multilevel CFA, and in the course of that endeavour, came across a problem which I will try to demonstrate with the following minimal working example (the original CFA model is no longer recognizable from this, of course):

library(rethinking)
## package rethinking_1.59, installed from github

d <- data.frame( ## Never mind, just a small data set …
    x = c(1, 2, 3, 4),
    y = c(3, 4, 1, 2),
    z = c(1, 3, 2, 4),
    grp = c(1L, 1L, 2L, 2L) ## … with a group indicator
)

model <- alist(
    y ~ dnorm(mu_y, sd_y),
    z ~ dnorm(mu_z, sd_z),
    mu_y <- b_y * x + a[grp],
    mu_z <- b_z * x + a[grp], ## Probably the rest is irrelevant, but for completeness …
    a[grp] ~ dnorm(mu_a, sd_a),
    c(b_y, b_z) ~ dnorm(0, 100),
    c(mu_y, mu_z, mu_a) ~ dnorm(0, 100),
    c(sd_y, sd_z, sd_a) ~ dcauchy(0, 1)
)

model0 <- map2stan(model, data = d, sample = FALSE)
stancode(model0)
## (output omitted)

It seems like in the data declaration of the group index, int grp[N_z];, a wrong symbol, N_z, is given for the array size, which probably should be N_grp, declared in the stancode. This is not really a problem unless z contains missing values, because in that case, N_z will no longer be declared in the stancode and that code will not run any more:

d2 <- d
d2$z[1] <- NA

model2 <- map2stan(model, data = d2, sample = FALSE)
stancode(model2)
## data{
##     int<lower=1> N;
##     int<lower=1> N_z_missing;
##     int<lower=1> N_grp;
##     int<lower=1> z_missingness;
##     real y[N];
##     real z[N];
##     real x[N];
##     int grp[N_z];
## }
## (further output omitted)

I tried to find my way around the source code of map2stan, but don't really feel up to fixing that. Thanks for looking into this issue. If I can be of any further help, please let me know.

warning during installation

This probably not a real problem, but I thought I'd put it on your radar.

The following warning occurs in the midst of a bunch of messages about Creating a generic function for ‘x’ from package ‘y’ in package ‘rethinking’:

Warning: undefined slot classes in definition of "map2stan": stanfit(class "stanfit")

WAIC not working with Student t distribution?

after successfully fitting the following model:


m3 <- map2stan(

  alist( 

    s ~ student_t( v+1, mu, sigma ),

    mu <-  k1*v1 + k2*v2 

    k1  ~ dnorm(0,100),        
    k2  ~ dnorm(0,100),        

    sigma ~ dcauchy(0,0.5),

    v ~ dexp(30)

  ), 
  data=db, 
  constraints=list(sigma="lower=0", v="lower=0"),
  chains=12, 
  cores=12,
  WAIC=FALSE )

A following call to WAIC will produce the following error:

WAIC(m3)
Constructing posterior predictions
[ 12000 / 12000 ] 2400 / 12000 ]
Error in do.call(dname, args = args_list, envir = e1) : 
  could not find function "dstudent"

passing a function to stan through map2stan ?

Hi,

is there any way to pass a function defining a complex non-linear data model
through map2stan ???

My hand built .stan file would look like this:
functions {
vector viscosity(vector x, real eps, real sig) {
... function body
}
}
model {
y ~ normal(viscosity(x,eps,sig), sigma);
}

Specifying prior parameters from code?

Is there a way of specifying priors parameters from code/file? For example, in the code below, rather than writing dnorm(0,100) we may want to specify the prior mean and variance as a result of a calculation. Is there a way of feeding such computed prior parameters to the model?

`m3 <- map2stan(

  alist( 

    s ~ student_t( v+1, mu, sigma ),

    mu <-  k1*v1 + k2*v2 

    k1  ~ dnorm(0,100),        
    k2  ~ dnorm(0,100),        

    sigma ~ dcauchy(0,0.5),

    v ~ dexp(30)

  ), 
  data=db, 
  constraints=list(sigma="lower=0", v="lower=0"),
  chains=12, 
  cores=12,
  WAIC=FALSE )`

Multilevel ordered logistic regression model

For example Is there a way to extend the following example given at page 337 in the book to the multilevel case?


m11.1stan <- map2stan(
alist(
response ~ dordlogit( phi , cutpoints ),
phi <- 0,
cutpoints ~ dnorm(0,10)
) ,
data=list(response=d$response),
start=list(cutpoints=c(-2,-1,0,1,2,2.5)) ,
chains=2 , cores=2 )

Let's suppose we would like to estimate the 'cutpoints' for different education groups in a multilevel fashion. It would seem that we need to impose a hierarchical structure to 'cutpoints' .How could this be achieved within the Rethinking package? I have tried different options but none compiled for different reasons.

Thanks,
Antonio

precis objects don't inherit data.frame class

Hi Richard, I could be wrong, but it seems that precis() output is an S4 object that doesn't inherit the data.frame class.

> myPrecis = precis(myModel)
> class(myPrecis)
[1] "precis"
attr(,"package")
[1] "rethinking"
> as.data.frame(myPrecis)
Error in as.data.frame.default(myPrecis) : 
  cannot coerce class "structure("precis", package = "rethinking")" to a data.frame

Problem Installing Packages

Would you please advise how I can address the following installation error?

install.packages(c("coda","mvtnorm","devtools"))

There is a binary version available but the source version is later:
Error in install.packages : 8 arguments passed to .Internal(format) which requires 9

I am running current version of R (3.2.3) on 64-bit Windows 10.

I have installed Rtools33.exe and verified that all is OK in R with the commands:
Sys.getenv("PATH") #Check Rtools installation and path
system('g++ -v')
system('where make')

I have also run the following commands in R:
install.packages("loo") ## per #13
install.packages("rstan", dependencies = TRUE)
library(rstan) # observe startup messages

Any advice will be appreciated.

Red Owl (yes, my real name)

Matrix notation?

Hi,

thanks for this package, which seems really great. I also enjoyed the book a lot.

I wondered if there is a way in map2stan to use matrix notation, as is possible in stan. Suppose for example I have a matrix of covariates X and a vector of parameters B, such that y ~ XB, is there anyway to indicate it in map2stan?

Thanks!

rgampois seems to be incorrectly defined ( with respect to neg_binomial_2 from stan)

Hi,

Thank you so much writing an amazing textbook. I really enjoyed it a lot and also really liked the rethinking package. It made bayesian analysis enjoyable thing to do.

While using rethinking::rgampois, I noticed some inconsistency.
Here's my small experiment that illustrates the point:

# this is the function definition of rethinking::rgampois 
rgampois_rethinking <- function (n, mu, scale)  {
  shape <- mu/scale
  prob <- 1/(1 + scale)
  rnbinom(n, size = shape, prob = prob)
}

# this is what I propose
rgampois_proposed <- function(n, mu, scale) {
  prob <- scale/(scale + mu) 
 #  this definition is directly copied off of the documentation of rnbionom : 
 # "...An alternative parametrization (often used in ecology) is 
 # by the mean mu (see above), and size, the dispersion parameter, 
 # where prob = size/(size+mu). The variance is mu + mu^2/size in this parametrization...."

  rnbinom(n = n, size = scale,  prob = prob)
}

library(tidyverse)
library(rethinking)

b0_truth  <- 3
phi_truth <- 10
data_proposed <- rgampois_proposed(100,exp(b0_truth), phi_truth)
data_rethinking <- rgampois_rethinking(100, exp(b0_truth), phi_truth)

Plot

We can see from the bottom plot that the samples are very different from each other

list(data_proposed=data_proposed, data_rethinking=data_rethinking) %>%
  bind_rows() %>%
  gather(type, data) %>%
  ggplot(aes(x=data, fill=type)) +
  geom_density(alpha=0.3)

Fitting the data with map2stan

m2s_proposed <- map2stan(
  alist(
    y ~ dgampois(mu,phi),
    log(mu) ~ b0,
    b0 ~ dnorm(0,1),
    phi ~ dcauchy(0,1)
  ),
  data=data.frame(y=data_proposed)
)

m2s_rethinking <- map2stan(
  alist(
    y ~ dgampois(mu,phi),
    log(mu) ~ b0,
    b0 ~ dnorm(0,1),
    phi ~ dcauchy(0,1)
  ),
  data=data.frame(y=data_rethinking)
)

precis(m2s_proposed)

##     Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
## b0  3.03   0.05       2.97       3.11   734    1
## phi 8.27   1.91       5.44      11.07   753    1

precis(m2s_rethinking)

##     Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
## b0  3.05   0.09       2.92       3.18   719    1
## phi 1.78   0.28       1.36       2.21   616    1

As you can see, the estimate of phi is more or less closer to the correct phi_truth=10 in m2s_proposed but way off in m2s_rethinking.

dealing with interactions in sim after glimmer/map

I really like being able to do this:

df1 <- data_frame(x=rnorm(100), z=rnorm(100)) %>% mutate(y=x*z+rnorm(n()))

lm1 <- glimmer(y~x+z, data=df1)
lm1.fit <- map(lm1$f, lm1$d)
sim(lm1.fit, data=expand.grid(x=-1:1, z=-1:1))

But I can't do this

lm2 <- glimmer(y~x*z, data=df1)
lm2.fit <- map(lm2$f, lm2$d)
sim(lm2.fit, data=expand.grid(x=-1:1, z=-1:1))

Error is "Error in eval(expr, envir, enclos) : object 'b_x_X_z' not found"

Is there a way of doing this without manually reconstructing all the interaction terms required via model matrix and regexes for replacing : with X?

glimmer with family = "binomial"

I can't seem to get glimmer to work when family is "binomial":

glimmer(Species == "virginica" ~ Petal.Width, iris, "binomial")
Error in family$link : $ operator is invalid for atomic vectors

Using the standard glm, the formula works fine:

glm(Species == "virginica" ~ Petal.Width, "binomial", iris)

uniform distribution non compiling with negative values

The following test model compiles correctly with uniform(0,1) but fails to compile with uniform(-1,1).
Stan doesn't have a problem with uniform(-1,1) so it's something with map2stan I believe.

 m2 <- map2stan(

    alist(

      response ~ dnorm(m, 10),
      m <- k * v1,

      k ~ uniform(-1,1)
    ) , 
    data = db,
  )

Covariance Matrix not Fitting Correctly in Varying Slopes Model

Hi!

I've been running the code in chapter 13 of your book and currently the code does not extract out the correct Rho value after fitting the model.

The code in question is the following:


## R code 13.1
a <- 3.5            # average morning wait time
b <- (-1)           # average difference afternoon wait time
sigma_a <- 1        # std dev in intercepts
sigma_b <- 0.5      # std dev in slopes
rho <- (-0.7)       # correlation between intercepts and slopes

## R code 13.2
Mu <- c( a , b )

## R code 13.3
cov_ab <- sigma_a*sigma_b*rho
Sigma <- matrix( c(sigma_a^2,cov_ab,cov_ab,sigma_b^2) , ncol=2 )

## R code 13.4
matrix( c(1,2,3,4) , nrow=2 , ncol=2 )

## R code 13.5
sigmas <- c(sigma_a,sigma_b) # standard deviations
Rho <- matrix( c(1,rho,rho,1) , nrow=2 ) # correlation matrix

# now matrix multiply to get covariance matrix
Sigma <- diag(sigmas) %*% Rho %*% diag(sigmas)

## R code 13.6
N_cafes <- 20

## R code 13.7
library(MASS)
set.seed(5) # used to replicate example
vary_effects <- mvrnorm( N_cafes , Mu , Sigma )

## R code 13.8
a_cafe <- vary_effects[,1]
b_cafe <- vary_effects[,2]

## R code 13.9
plot( a_cafe , b_cafe , col=rangi2 ,
    xlab="intercepts (a_cafe)" , ylab="slopes (b_cafe)" )

# overlay population distribution
library(ellipse)
for ( l in c(0.1,0.3,0.5,0.8,0.99) )
    lines(ellipse(Sigma,centre=Mu,level=l),col=col.alpha("black",0.2))

## R code 13.10
N_visits <- 10
afternoon <- rep(0:1,N_visits*N_cafes/2)
cafe_id <- rep( 1:N_cafes , each=N_visits )
mu <- a_cafe[cafe_id] + b_cafe[cafe_id]*afternoon
sigma <- 0.5  # std dev within cafes
wait <- rnorm( N_visits*N_cafes , mu , sigma )
d <- data.frame( cafe=cafe_id , afternoon=afternoon , wait=wait )

## R code 13.11
R <- rlkjcorr( 1e4 , K=2 , eta=2 )
dens( R[,1,2] , xlab="correlation" )

## R code 13.12
m13.1 <- map2stan(
    alist(
        wait ~ dnorm( mu , sigma ),
        mu <- a_cafe[cafe] + b_cafe[cafe]*afternoon,
        c(a_cafe,b_cafe)[cafe] ~ dmvnorm2(c(a,b),sigma_cafe,Rho),
        a ~ dnorm(0,10),
        b ~ dnorm(0,10),
        sigma_cafe ~ dcauchy(0,2),
        sigma ~ dcauchy(0,2),
        Rho ~ dlkjcorr(2)
    ) ,
    data=d ,
    iter=5000 , warmup=2000 , chains=2 )

## R code 13.13
post <- extract.samples(m13.1)
dens( post$Rho[,1,2] )

You do not get a plot similar to that in Figure 13.4 when running your code.

You instead get the following:

If you run 1000000 reps with 4000 warmup you get the following, which doesn't include the big spike on zero:

Either way, it is centered at -0.19, not at -0.7 at it should be by design.

There is also a few errors that pop up on each chain, besides the one addressed by the overthinking box in the chapter:

[1] "The following numerical problems occured the indicated number of times after warmup on chain 2"
                                                                                                     count
validate transformed params: SRS_sigma_cafeRho is not symmetric. SRS_sigma_cafeRho[1,2] = inf, but S     1
[1] "When a numerical problem occurs, the Metropolis proposal gets rejected."
[1] "However, by design Metropolis proposals sometimes get rejected  even when there are no numerical problems."
[1] "Thus, if the number in the 'count' column is small,  do not ask about this message on stan-users."

Anyway, I am not sure what is going on, but from the looks of things, the sampling seems to be getting stuck on 0 in covariance estimation, any ideas on how to fix this?

Thanks!

Nonobvious behavior of `link`-function combined with map2stan models

When you train a model with map2stan with a non-specified n and then use link or extract.samples with a large n, the results are nonobvious: extract.samples (used by link) returns a set of size n but only filled up to the default map2stan-size.

Example code:

Train model:

m <- map2stan (
 formula, 
  data=ddd)
summary(m)

Relevant output:

Inference for Stan model: totalScore ~ dbinom(40, p).
1 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=1000.

Subsequent call to link with the same dataset as example but now n=1010:

posterior <- link(m, data=ddd, n=1010)

Result of 5th row:

posterior[,5]
   [1] 0.4477779 0.4661901 0.5257805 0.4876305 0.4294614 0.4631741 0.5056220 0.5049068 0.4612833 0.4821353 0.4681812 0.4635064 0.5055391 0.5054395
...
 [995] 0.5142565 0.4737347 0.4642776 0.4575579 0.4403849 0.4595757        NA        NA        NA        NA        NA        NA        NA        NA
[1009]        NA        NA

The last 10 samples are set to NA, due to link depending on extract.samples which silently ignores its n-argument when used in combination with map2stan.

Perhaps this should be relayed back to the user (ie. a warning message about n to be curbed to the map2stan n-value).

cryptic `map` error message if no density function is provided

I ran into this error message today when playing with map on the foxes data, and it took me a moment to figure out what was going on:

Error in solve.default(fit$hessian) : 
  Lapack routine dgesv: system is exactly singular: U[1,1] = 0`

I initially looked for multicollinearity based on the error message, but that turned out not to be the problem.

My code is below. The problem is that I forgot to include dnorm and just gave it a formula for the distribution's mean.

Since this will probably be a common form of user error, it would be nice if map had a more informative error message that triggered some time before solve got called. I unfortunately don't know enough about working with expressions to know exactly how one would go about defining the conditions for the error to trigger, but I thought I'd put it on your radar.

library(rethinking)
data("foxes")
head(foxes)

map(
  list(weight ~ a + b_f * avgfood + b_s * groupsize), 
  data = foxes, 
  start = list(a = mean(foxes$weight), b_f = 0, b_s = 0)
)

Error in docs for HPDI, PI

The documentation gives the default value of prob=0.95 but the code uses prob=0.89.

Bug in plot.map2stan

When selecting a subset of parameters for plotting the traces from a map2stan object, the method plot does not correctly select their n_eff.
Consider the model from the example in the help file to map2stan and compare
precis(m2)
precis(m2, pars = c('bp', 'bpc'))
plot(m2, pars = c('bp', 'bpc'))
The plot shows n_eff for the first two parameters instead of the selected ones.

[Package rethinking version 1.59]

Installation issue

Dear Richard and others,

I have installed rstan, but when I try to install rethinking, I get the following error:

ERROR: dependency 'loo' is not available for package 'rethinking'
removing 'C:/Users/drlabuser/Documents/R/win-library/3.1/rethinking'
Error: Command failed (1)

Any suggestions welcome!

Thank you,
Colleen

Model descriptions in book chapter 4 don't always match with code

This is mostly an issue for the textbook chapter 4.

In some cases the model definitions given in the book don't match with the corresponding R code. Furthermore the book references code which doesn't appear.

Examples:

page 87, R code for mu
page 95, R code for alpha
page 96, The text mentions a start list for the model but the R code doesn't contain such a list.

Variable order requirement for link, sim, ensemble

The prediction functions seem to require the variables in data to be provided in the order they appear in the model formula. For example:

data(nettle)
n = nettle
n$loglpc = log10(n$num.lang / n$k.pop)
n$logArea = log10(n$area)

m = lm(loglpc ~ mean.growing.season * sd.growing.season + log10(area), data = n)
dpred = data.frame(mean.growing.season = 11,
                   sd.growing.season = seq(0, 6, len = 100),
                   logArea = mean(n$logArea))
l = link(m, dpred)
summary(apply(l, 2, mean))

dpred = dpred[, 3:1]
l = link(m, dpred)
summary(apply(l, 2, mean))