pbs-assess / sdmtmb Goto Github PK
View Code? Open in Web Editor NEW:earth_americas: An R package for spatial and spatiotemporal GLMMs with TMB
Home Page: https://pbs-assess.github.io/sdmTMB/
:earth_americas: An R package for spatial and spatiotemporal GLMMs with TMB
Home Page: https://pbs-assess.github.io/sdmTMB/
I am running into an issue where I am trying to next sdmTMB_cv within another function, with designated parameters to turn on or off specific functionality.
This works:
spde <- make_mesh(pcod, c("X", "Y"), cutoff = 25)
m_cv <- sdmTMB_cv(
density ~ 0 + depth_scaled + depth_scaled2,
data = pcod, spde = spde,
family = tweedie(link = "log"), k_folds = 2
)
m_cv$fold_loglik
m_cv$sum_loglik
head(m_cv$data)
m_cv$models[[1]]
m_cv$max_gradients
This does not work:
cv_foo <- function(spatial_field=F){
out <- sdmTMB_cv(
density ~ 0 + depth_scaled + depth_scaled2,
data = pcod, spde = spde,
family = tweedie(link = "log"), k_folds = 2,
map_rf=ifelse(spatial_field,F,T)
)
out
}
m_cv2 <- cv_foo()
# Error in ifelse(spatial_field, F, T) : object 'spatial_field' not found
For what it's worth, I do not have this same issue with sdmTMB()
, so I think it has to do with how the dot args are recycled within sdmTMB_cv()
Setting weights to zero in the source data for years you want to predict on was a hack suggested by Eric. At the very least, add it in a vignette or example or a note in the help.
Should be doable to let the prediction data frame not match the fitted one, but requires some careful checking of the input data into TMB to make sure it makes sense. Currently it (R) checks to make sure they match on purpose.
Currently fixed at 3.
E.g.
student(link = "identity", df = 7)
Affects fitting and predicting
As suggested by @Cole-Monnahan-NOAA, this would be relatively easy and compliment the R-based simulation function.
See the appendix of https://arxiv.org/abs/2103.09929
Nice example here:
https://github.com/aaron-oz/tmb-inla-paper/blob/0c2fc80cbf5edbd161a8bdd0a0ea697b71dacef8/continuous_sims/4_setup_run_predict_tmb.R#L146-L176
Also, look into whether doing this and recreating the predictions in R (as in that code) would be a faster way of getting uncertainty on predictions than using adreport on the predictions, which is currently very slow.
Ideally this is a vector that varies by observation, but at minimum needs to be a single integer that could be applied to all observations. Probably passed in as column name in data frame containing trials? Or even easier, just assume it's N in the data frame
Should be done at the end of the TMB source code.
Probably treat it like the predictions: use a flag so that it's not calculated until necessary since it's expensive to calculate the standard errors.
Biomass weighted UTM Y, UTM X, and depth? Should we be working in a rotated projection so that Y is up and down the coast or is it better if it represents latitude?
Probably needs a prediction helper function to pass the posterior samples through the obj$report()
function as is done in the predict function for the joint precision matrix simulation.
Should also be accompanied by map
and start
arguments in sdmTMB()
and enhancing the ability to specify priors on parameters. From my testing, the PC prior helps a lot.
May want the ability to scale omega_s and epsilon_st deviations to help NUTS adaptation.
Would need a vignette with guidance.
In the AR1 branch, where current development is taking place, the predict function will only work on new data if the same spatial coordinates are predicted on each year. The simplest way to fix this, because the random effect projections are done on a matrix in the C++ code, is probably to expand the spatial and spatiotemporal projection mesh to include all spatial coordinates for all years and then keep track of a vector of elements to remove from proj_re_sp
and proj_re_st_vector
before calculating proj_eta
.
Is there any hope for this beyond coarsening the resolution of the prediction grid? I tried estimating the standard errors for a single west coast example and it was running for hours before I aborted the process.
Either summarize anomalies in residuals from predictions in a sdmTMB model (this can be done post-hoc outside of estimation), or internally within TMB with hard restrictions on model parameters by year. Example: AR(1) process, linear or exponential fits, etc
Useful for showing marginal effects of fixed effects without waiting for or bothering with random effects.
@Kotkot made the suggestion and provided some example code.
Imagine we have rand1, rand2, and rand3 random effects acting on the intercept in the model. Each have 10, 20, 30 categories. Then I have created a vector RE with length equal to 10+20+30=60 which stores all the RE estimate along with a vector storing the number of categories for each random effect (following the order of appearance in the model) i.e. c(10,20,30). And an observation matrix with the random effects in the column and observation as row.
Example code:
int temp = 0;
for (int i=0; i<Nfish_LA; i++) {
for (int k=0; k<n_RE_LA; k++){
if (k==0) mu(i) += RE_LA(ObsinRE_LA(i,k));
if (k>0) {
temp += nobs_RE_LA(k-1);
mu(i) += RE_LA(ObsinRE_LA(i,k)+temp);
}
}
temp = 0;
}
where RE_LA
is a vector of random intercept values that get worked along.
Formula can be parsed using utilities from lme4. E.g.
LA_formula <- y ~ x + (1|g)
RE_variable_LA <- barnames(findbars(LA_formula))
When a spatial only model is fit (with env covariates), and a new data frame is passed in with enviro covariates varying by year, the COG time series on the future values are not generated
With the recently added spatial-varying slopes random fields (thanks @eric-ward!) the prediction data frame is getting complicated. Might be a good time to put some more thought into what they should be called.
The current setup:
est
= Complete estimate in link space (everything is in link space)
est_fe
= Estimate from fixed effects
est_re_s
= Spatial random effect estimates
est_re_s_trend
= Spatial trend estimates (slope * time)
est_re_st
= Spatiotemporal random effect estimates
re_s_trend
= The spatial trend slopes themselves
It occurred to me that maybe it makes more sense just to name these columns based on the symbols that have become a bit of a convention and are used in the TMB code.
Therefore,
est_re_s
would instead be omega_s
est_re_st
would be epsilon_st
These are the realizations of these random effects at the prediction locations.
This makes me think we should come up with a symbol for the slope random field. Anything could do. Maybe gamma or psi or the always-cool-sounding zeta?
It probably doesn't make sense to show just the spatially-varying slope * time realizations and since that doesn't make much sense without the intercept included. I was wondering if instead we should make a column that is the sum of all the random effects. In the case of spatially varying slopes this would be the spatially varying intercept plus the spatially varying slope * time. In other cases this might be omega_s + epsilon_st. And in other cases this would just be the same as omega_s or epsilon_st. What to call this? est_re
?
Then we could have est_fe
representing the predictions just from all the fixed effects. This currently includes the predictions from the time varying effects, which technically aren't fixed effects but since they are constant across space it makes sense to me to include them here. Basically these are the predictions that don't include any random fields. An alternative would therefore be est_rf
and est_non_rf
for random field estimates and non random field estimates. Maybe that's nicer and avoids the whole fixed-effect-random-effect jargon problem.
Finally, we have est
, which represents the some of all the effects in link space.
OK, after that stream of consciousness, one proposal would be:
est
= Complete estimate in link space (everything is in link space)
est_non_rf
= Estimate from everything that isn't a random field
est_rf
= Estimate from all random fields combined together
omega_s
= Spatial (intercept) random field that is constant through time
epsilon_st
= Spatiotemporal (intercept) random fields (could be independent draws each year or AR1)
zeta_s
= Spatial slope random field
See code example in p 74-77 of https://arxiv.org/abs/2103.09929
Allows for matching INLA; may help convergence in some cases.
See the threshold vignette for guidance.
Currently the sdmTMB version of predict does not support an argument for adding standard errors to the predictions.
It would be great if you could enable this option in the package.
Thanks a lot and best wishes!
Jakob
Eventually target somewhere like MEE or maybe even PeerJ or JSS.
I have some right skewed data that I'd like to model with an inverse gamma distribution, but the model doesn't seem to want to converge. Here's a simple example:
library(tidyverse)
library(sdmTMB)
pcod_spde <- make_mesh(pcod, c("X", "Y"), cutoff = 10)
plot(pcod_spde)
test <- pcod %>%
mutate(test_gamma = rgamma(nrow(.), shape = 1, scale = 1))
# works
sdmTMB(test_gamma ~ 1,
family = Gamma(link = "log"),
data = test,
spde = pcod_spde,
spatial_only = TRUE)
# not working
sdmTMB(test_gamma ~ 1,
family = Gamma(link = "inverse"),
data = test,
spde = pcod_spde,
spatial_only = TRUE)
Any ideas? Anyways, thanks for putting together such a cool and useful package!
Similar to glmmfields: simulate, recover close enough values. Also, cache the output from a simple model and make sure it stays the same in the future.
(Partly done as of 2021-05-05.)
There's a type mismatch error that gets thrown with predict() when one of two things occurs:
The reason this is occurring is that scale() returns a column matrix, and that type is maintained in the dataframe. Several ways to fix this, but perhaps just put a as.data.frame() call in predict() to catch these?
The re_form argument should control if these get included in prediction as is done in lmer etc. I wonder if we need some way to distinguish including spatial/spatiotemporal random effects vs. random intercepts. Probably. This is how lmer does it:
(formula, NULL, or NA) specify which random effects to condition on when predicting. If NULL, include all random effects; if NA or ~0, include no random effects.
Right now, NULL, NA, and ~0 work but only affect the spatiotemporal effects.
I guess for regular prediction, you'd probably want to include all the random effects by default. For index standardization, you'd probably want to exclude the IID random intercepts.
Would allow for regularization. Let the user specify the SDs on a normal mean zero prior as a vector perhaps (to deal with the intercept or different scales of predictors). Split this out from the poorly thought out default priors right now.
Would allow for testing of turning off spatial estimation, for example, to compare to a standard GLMM.
This is not so much a bug as documenting some time-consuming debugging.
Two seemingly trivial changes in the .cpp template caused small changes in the parameter estimates and index values that were detected by the unit tests when I was merging into the master branch. I worked my way through the Git history to figure out what caused the issues.
Moving phi = exp(ln_phi)
from just before the observation likelihood section to the top changed the results. I have left it at the top since this is helpful for applying priors. That was part of this commit. I have confirmed that using map to share the range parameter makes no difference.
Recording the observation likelihood separately in nll_obs
and then adding it to jnll
changed the results very slightly. That was done here. I have reverted to only jnll
.
Going forward the unit tests should help flag these types of things as they happen. It would be good to have even more of these types of unit tests.
The one I was using:
library(sdmTMB)
pcod_spde <- make_mesh(pcod, c("X", "Y"), cutoff = 20)
m <- sdmTMB(
data = pcod,
formula = density ~ 0 + as.factor(year),
time = "year", spde = pcod_spde, family = tweedie(link = "log")
)
predictions <- predict(m, newdata = qcs_grid, return_tmb_object = TRUE)
ind <- get_index(predictions, bias_correct = FALSE)
expect_equal(ind$est,
c(209277.00119385, 316826.468229804, 355481.950149426, 91540.9936646985,
150187.860588992, 274230.962985684, 274541.848909578, 326960.546698788,
151502.369101489))
Various results:
Lines 325 to 330 in 8ac7d53
I was trying to base this off the parameterization in glmmTMB but got confused. We could instead just use the default parameterization and move on for now.
http://www.r-inla.org/barrier-model
Might be as simple as passing the appropriate SPDE object from INLA.
If we can't do this easily in TMB then that might be a reason to prefer a straight R-INLA approach for a coast-wide model because of the configuration of our surveys and islands.
I think I'll add them to the control list and move map and start there as well. I'll get the basics working...
There's currently a minimal simulation function in R/sim.R
.
[x] It needs separate spatial and spatiotemporal random fields (to match the setup of the model). Right now it just as a spatial field with one time slice.
[ ] It needs the ability to work with covariates
[x] It needs the ability to work with multiple time slices
[ ] It needs the ability to work with other observation models, although I think this one is lower on the priority list since this part is pretty easy to get right.
It's a bit more complicated than we need here, but there's some inspiration in our glmmfields package sim function. https://github.com/seananderson/glmmfields/blob/master/R/sim.R
The covariates could come in as a formula or just as a design matrix to make things simple.
https://github.com/seananderson/glmmfields/blob/2ebed79b295860eb06e89f09a43b0f666d10af4f/R/sim.R#L152
The random walk part is as simple as this:
https://github.com/seananderson/glmmfields/blob/2ebed79b295860eb06e89f09a43b0f666d10af4f/tests/testthat/test-fit-ar-processes.R#L28-L31
@ecophilina , this may be a good one for you to try tackling. I usually find that writing a simulation function is the best way to understand the model and we'll need the simulation function to test any extra functionality we add to the package. The RandomFields::RFsimulate()
function helps hide a lot of the complexity.
Say I've got three areas, A, B, and C which are essentially non-overlapping subsets of the large spatial domain. Each observation can be categorized as in an area.
Is it possible to get index values and SEs for each area, apart from the total?
I'm thinking of a couple possible ways to do this. First is to generating a polygon in R and generating a ton of points and then integrate it by pushing it through as newdata. Or to create separate meshes for each and somehow get those to work internally to sdmTMB.
Any quick thoughts on whether this is easy to do or not? I could also make a reprex is that's helpful.
Thanks for merging the threshold branch Sean. Turns out I was wrong in thinking that was the source of a recent error arising from application of the cross-validation function. I can run the sdmTMB_cv example just fine ( and previously ran without issues for the spatial trends analyses). However, with the code here: https://github.com/fate-spatialindicators/consonants/blob/master/code/run_wc_models_MI.R, when use_cv = TRUE (run models with cross-validation) I am getting this error:
Error in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, :
argument "x" is missing, with no default
Which, given the output from traceback() below seems to indicate that there is some kind of indexing issue occurring at line 115 of cross-val.R, or the data to be fit without one of the folds is not getting passed properly?
8: (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x,
i, exact = exact))(x, ..., exact = exact)
7: [[.data.frame
(d_fit, x)
6: d_fit[[x]]
5: cbind(x, y)
4: spde_function(d_fit[[x]], d_fit[[y]], n_knots = n_knots)
3: FUN(X[[i]], ...)
2: lapply(seq_len(k_folds), function(k) {
if (k_folds > 1) {
d_fit <- data[data[[fold_ids]] != k, , drop = FALSE]
d_withheld <- data[data[[fold_ids]] == k, , drop = FALSE]
}
else {
d_fit <- data
d_withheld <- data
}
d_fit_spde <- spde_function(d_fit[[x]], d_fit[[y]], n_knots = n_knots)
object <- sdmTMB(data = d_fit, formula = formula, time = time,
spde = d_fit_spde, ...)
predicted <- predict(object, newdata = d_withheld, xy_cols = c(x,
y))
cv_data <- d_withheld
cv_data$cv_predicted <- object$family$linkinv(predicted$est)
response <- get_response(object$formula)
withheld_y <- predicted[[response]]
withheld_mu <- cv_data$cv_predicted
cv_data$cv_loglik <- ll_sdmTMB(object, withheld_y, withheld_mu)
list(data = cv_data, model = object, pdHess = object$sd_report$pdHess,
max_gradient = max(abs(object$gradients)), bad_eig = object$bad_eig)
})
1: sdmTMB_cv(formula = as.formula(formula), time = time, k_folds = 10,
n_knots = 250, seed = 45, family = tweedie(link = "log"),
data = sub, anisotropy = TRUE, spatial_only = df$spatial_only[i],
)
E.g., should be >= 0 if Tweedie/NB2/Poisson, link = log or R will crash.
If you run get_cog() on a prediction object created with return_tmb_object = FALSE, it produces the wrong error message (unless something changed very recently that I missed?):
Error: It looks like the model was built with an older version of sdmTMB. Please update the model with your_model <- sdmTMB:::update_model(your_model)
first before running this function.
Check what other TMB packages are doing to keep size down (e.g. MSEtool vs. glmmTMB). Technically I think it's good to go as is, but we might want to shorten some example and vignette runs and check on Rhub.
The INLA problem could be insurmountable. See what other packages are doing.
Lines 19 to 27 in 8ac7d53
statmod::qres.gamma()
is a good reference
Not sure where best to do this, but it'd be nice for the randomized quantile residuals from a model fit with cross validation to be easily accessible. Right now, if we have
fit_cv <- sdmTMB_cv(...)
The residuals()
function can be applied to each element of the fit_cv$models
list -- but they're not in a single data frame. I think options could be to put these calculations in sdmTMB_cv and attach them to fit_cv$data
, or change residuals()
extract elements of a list
For example, it can be done currently like this:
fit_cv$data$qq = NA
for(i in 1:length(fit_cv$models)) {
fit_cv$data$qq[which(fit_cv$data$cv_fold==i)] = residuals(fit_cv$models[[i]])[which(fit_cv$data$cv_fold==i),1]
}
follow same general approach as glmmTMB
I noticed something weird with the tidy confidence intervals. First, the lower/upper CIs were flipped, and I added a commit to catch it -- but didn't solve the problem. I think it's something around this block of code
Line 70 in 0a04985
I'll dig into it later, just flagging it
What should the interface look like for this?
The idea is that some of the variables included as 'fixed effects' should be allowed to vary via a random walk with each time step.
So if a formula looks like this:
formula = y ~ x1 + x2, time = "year"
We want to be able to communicate that either x1
or x2
or both should follow a random walk with time.
We could borrow the usual random effect syntax:
formula = y ~ x1 + x2 | (x1 | year), time = "year"
but that has a slightly different meaning than in lme4, for example.
Alternatively we could add an argument:
formula = y ~ x1 + x2, time = "year", time_varying = "x1"
formula = y ~ x1 + x2, time = "year", time_varying = c("x1", "x2")
or use another formula:
formula = y ~ x1 + x2, time = "year", time_varying = ~ x1
formula = y ~ x1 + x2, time = "year", time_varying = ~ x1 + x2
Any thoughts on this? @ecophilina
I'm thinking the last option might be cleanest. Then there's the question of whether we want to duplicate x1 + x2
across the two arguments or whether it should just be in a time_varying component:
formula = y ~ x2, time_varying = ~ x1, time = "year"
formula = y ~ 1, time_varying = ~ x1 + x2, time = "year"
They are in the sd_report
. If m
is a fitted model:
b <- as.list(m$sd_report, "Estimate")
b$RE
and you can reconstruct them from the m$tmb_data$RE_indexes
. Where:
RE(RE_indexes(i, k))
i is row of data, k is the random effect grouping ID, e.g. (1 | g ) + (1 | h) would have a k of 1 and 2.
This line in index.R for get_index and get_cog will only work when the name of the time column of the input data frame is "year"
d$year <- sort(unique(obj$data$year))
Update update_model function so predict will still work for models built with previous versions.
Should be doable know but a bit clunky.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.