Comments (10)
(edited the original post to include sdmTMB and R versions)
from sdmtmb.
I don't think there's much wrong here? -- the truncated negative binomial code used in sdmTMB is taken from glmmTMB (see here: https://github.com/pbs-assess/sdmTMB/blob/c70fa5c3528cfabc6c8136d21b22b430f149a368/src/utils.h#L337C8-L337C25). The simulation below seems to recover the parameters ok -- I think the major issue is that one of your initial means was very small (0.1) and that was causing things to blow up / not converge for other some other seeds I tried.
library(dplyr)
library(tidyr)
library(sdmTMB)
set.seed(1)
n_draws <- 150
mod_pars <- data.frame(treatment = letters[1:2], mu = c(2, 5), phi = 0.1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)
## simulate data from negative binomial distribution
sim_data <- sim_data %>%
mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))
## remove 0s
sim_data <- sim_data %>% filter(., response > 0)
## fit model to simulated data
fit <- sdmTMB(
response ~ treatment,
data = sim_data, spatial = FALSE,
family = truncated_nbinom2()
)
prd_data <- data.frame(treatment = letters[1:2])
## predictions in link space
pred_lp <- predict(fit, newdata = prd_data, type = "link")
pred_lp
#> treatment est
#> 1 a 0.6644778
#> 2 b 1.5506004
## predictions in response space appear incorrect
## e.g. mean response can't be < 1 for a truncated negative binomial
pred_lp %>% mutate(., response = exp(est))
#> treatment est response
#> 1 a 0.6644778 1.943475
#> 2 b 1.5506004 4.714300
## calculate mean of the truncated negative binomial distribution
## assuming mean = mu * (1 - (1 + alpha * mu) ^ (-1 / alpha)) ^ (-1)
## get dispersion term
phi <- tidy(fit, "ran_pars")
alpha <- 1 / phi$estimate[1]
## predicted means in response space:
exp(pred_lp$est) * (1 - (1 + alpha * exp(pred_lp$est)) ^ (-1 / alpha)) ^ (-1)
#> [1] 2.336000 4.812081
## these appear to correspond to training data means
sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))
#> # A tibble: 2 × 2
#> treatment mean
#> <chr> <dbl>
#> 1 a 2.34
#> 2 b 4.81
Created on 2024-06-14 with reprex v2.1.0
from sdmtmb.
Thanks @ericward-noaa for looking into this and your quick response. And apologies if my pretty contrived example was not particularly clear (and for not having checked whether the model fitting had successfully converged...).
And thanks also for the link to the rtruncated_nbinom function. I tried to expose it to R so I could use it to simulate the training dataset directly rather than truncating random draws from a negative binomial distribution - unfortunately my C++ skills weren't up to the task.
The parameter estimates in your example do appear to be more consistent with the means of the negative binomial distribution(s) used to simulate the training data, as well as the actual means of the training data after removal of the 0s. But, the differences between the negative binomial and truncated negative binomial distributions become more pronounced with lower means (all else equal), as the difference in means are larger due to more 0s.
Hopefully the example below more clearly demonstrates why I think the predictions from predict
calls are incorrect for truncated_nbinom2
models (and positive components of delta_truncated_nbinom2
models):
library(dplyr)
library(tidyr)
library(sdmTMB)
set.seed(1)
n_draws <- 250
mod_pars <- data.frame(treatment = letters[1:4], mu = c(0.5, 0.75, 1, 2.5), phi = 1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)
## simulate data from negative binomial distribution
sim_data <- sim_data %>%
mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))
## remove 0s
sim_data <- sim_data %>% filter(., response > 0)
## fit model to simulated data
fit <- sdmTMB(
response ~ treatment,
data = sim_data, spatial = FALSE,
family = truncated_nbinom2()
)
sanity(fit)
#> ✔ Non-linear minimizer suggests successful convergence
#> ✔ Hessian matrix is positive definite
#> ✔ No extreme or very small eigenvalues detected
#> ✔ No gradients with respect to fixed effects are >= 0.001
#> ✔ No fixed-effect standard errors are NA
#> ✔ No standard errors look unreasonably large
#> ✔ No sigma parameters are < 0.01
#> ✔ No sigma parameters are > 100
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
## predictions in link space
prd_data <- data.frame(treatment = letters[1:4])
pred_lp <- predict(fit, newdata = prd_data, type = "link")
pred_lp
#> treatment est
#> 1 a -0.561669862
#> 2 b -0.197211601
#> 3 c -0.001578785
#> 4 d 0.960275459
## predictions in response space:
pred_lp %>% mutate(., response = exp(est))
#> treatment est response
#> 1 a -0.561669862 0.5702560
#> 2 b -0.197211601 0.8210169
#> 3 c -0.001578785 0.9984225
#> 4 d 0.960275459 2.6124160
## mean of simulated responses
sim_responses <- simulate(fit)
sim_responses <- data.frame(treatment = sim_data$treatment, sim_response = as.numeric(sim_responses))
sim_responses %>% group_by(treatment) %>% summarise(mean_response = mean(sim_response))
#> # A tibble: 4 × 2
#> treatment mean_response
#> <chr> <dbl>
#> 1 a 1.48
#> 2 b 1.70
#> 3 c 1.83
#> 4 d 3.17
## treatment specific mean responses in the training dataset:
sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))
#> # A tibble: 4 × 2
#> treatment mean
#> <chr> <dbl>
#> 1 a 1.56
#> 2 b 1.81
#> 3 c 1.98
#> 4 d 3.59
My understanding is that predict
should return the predicted mean repsonse. In which case the predictions are biased, and also inconsistent with both the assumptions of the model (a zero truncated negative binomial can't have a mean response < 1) and simulated data from the fitted model. The predictions are also suspiciously similar to the means of the (untruncated) negative binomial distributions that were used to generate the underlying data (i.e. mod_pars$mu
).
Using the formula for the mean of a zero truncated negative binomial from https://grodri.github.io/glms/notes/countmoments, I am able to recover predicted means that are consistent with both the training dataset, and simulated responses from the fitted model:
## calculate mean of the truncated negative binomial distribution
## assuming mean = mu * (1 - (1 + alpha * mu) ^ (-1 / alpha)) ^ (-1)
## get dispersion term
phi <- tidy(fit, "ran_pars")
alpha <- 1 / phi$estimate[1]
## predicted means in response space:
exp(pred_lp$est) * (1 - (1 + alpha * exp(pred_lp$est)) ^ (-1 / alpha)) ^ (-1)
#> [1] 1.561798 1.809524 1.984962 3.585366
These in combination suggest to me that the predict calls are not returning the predicted mean response from the truncated negative binomial distribution - perhaps instead a parameter that in part defines the mean? Again, I tried unsuccessfully to dig in to the source code, but didn't make much progress.
Thanks again for your assistance, and apologies if this is still unclear. I'm aware this isn't the gold standard in a minimal reproducible example!
Created on 2024-06-16 with reprex v2.1.0
from sdmtmb.
Thanks @tom-peatman for the clear example. I fleshed this out a bit more, adding more levels (I was interested if there was a gradient in any bias, etc). I don't really see one here -- the estimated means seem consistent with the original means (before data are removed) -- and simulated data are consistent with empirical means, after zeros are removed:
library(dplyr)
library(tidyr)
library(sdmTMB)
library(ggplot2)
set.seed(1)
n_draws <- 250
true_mu <- seq(0.2, 3, length.out = 20)
mod_pars <- data.frame(treatment = letters[1:length(true_mu)], mu = true_mu, phi = 1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)
## simulate data from negative binomial distribution
sim_data <- sim_data %>%
mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))
## remove 0s
sim_data <- sim_data %>% filter(., response > 0)
## fit model to simulated data
fit <- sdmTMB(
response ~ treatment,
data = sim_data, spatial = FALSE,
family = truncated_nbinom2()
)
sanity(fit)
#> ✔ Non-linear minimizer suggests successful convergence
#> ✔ Hessian matrix is positive definite
#> ✔ No extreme or very small eigenvalues detected
#> ✔ No gradients with respect to fixed effects are >= 0.001
#> ✔ No fixed-effect standard errors are NA
#> ✔ No standard errors look unreasonably large
#> ✔ No sigma parameters are < 0.01
#> ✔ No sigma parameters are > 100
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
## predictions in link / response space
prd_data <- data.frame(treatment = letters[1:length(true_mu)])
pred_lp <- predict(fit, newdata = prd_data, type = "link", se_fit = TRUE) %>%
mutate(response = exp(est), lo = exp(est - 2*est_se), hi = exp(est + 2*est_se))
#> Prediction can be slow when `se_fit = TRUE` and random fields are included
#> (i.e., `re_form = NA`). Consider using the `nsim` argument to take draws from
#> the joint precision matrix and summarizing the standard devation of those
#> draws.
pred_lp$true_mu <- true_mu
# There doesn't appear to be a consistent bias here
ggplot(pred_lp, aes(treatment, response)) +
geom_linerange(aes(ymin = lo, ymax = hi)) +
geom_point() +
geom_point(aes(treatment, true_mu), col="red") +
xlab("Treatment") + ylab("Estimated response") +
coord_flip() + theme_bw()
## treatment specific mean responses in the training dataset:
sim_mean <- sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))
pred_lp$sim_mean <- sim_mean$mean
# When we look at the means in the data with zeros removed (blue dots), these values don't
# include 0s, so are higher
ggplot(pred_lp, aes(treatment, response)) +
geom_linerange(aes(ymin = lo, ymax = hi)) +
geom_point() +
geom_point(aes(treatment, true_mu), col="red") +
geom_point(aes(treatment, sim_mean), col="blue") +
xlab("Treatment") + ylab("Estimated response") +
coord_flip() + theme_bw()
## mean of simulated responses
sim_responses <- simulate(fit)
sim_responses <- data.frame(treatment = sim_data$treatment, sim_response = as.numeric(sim_responses))
sim_resp <- sim_responses %>% group_by(treatment) %>% summarise(mean_response = mean(sim_response))
pred_lp$sim_resp <- sim_resp$mean_response
# We can also compare the simualted values (orange dots) to the filtered original data, and these are
# quite similar (neither include 0)
ggplot(pred_lp, aes(treatment, response)) +
geom_linerange(aes(ymin = lo, ymax = hi)) +
geom_point() +
geom_point(aes(treatment, true_mu), col="red") +
geom_point(aes(treatment, sim_mean), col="blue") +
geom_point(aes(treatment, sim_resp), col="orange") +
xlab("Treatment") + ylab("Estimated response") +
coord_flip() + theme_bw()
Created on 2024-06-17 with reprex v2.1.0
from sdmtmb.
Thanks @ericward-noaa for this, very helpful.
It looks like I've fundamentally misunderstood what is returned by predict
calls for truncated_nbinom2
models, and the positive component for delta_truncated_nbinom2
models. My expectation was that predict
should return the predicted mean response from the zero-truncated distribution (at either the link or response scale). However it seems that predict
actually returns the linear predictor, which in the case of the truncated_nbinom2
distribution is the mean of the untruncated distribution.
I suppose that truncated_nbinom2
models can be used in different situations, and depending on the context users might have different expectations on what predictions should be returned by predict
.
In a situation where 0s occur but the 0s are not observable, and modelled with a truncated_nbinom2
model, then predictions from the untruncated distribution may be appropriate, i.e. the predictions reflect the full range of possible responses (including 0s).
However, in the case of delta_truncated_nbinom2
models, I'd argue that predictions for the positives component should come from the zero truncated distribution. The positives component models numbers when present, and so 0s are not a possible outcome.
For what it's worth, I think it would be worth considering adding something to the predict
documentation that explains what is returned for delta_truncated_nbinom2
models, and/or a warning message when predicting for delta_truncated_nbinom2
models, to avoid any misunderstandings by users?
from sdmtmb.
Thinking more about this overnight, there's also potential cases where a delta model is being fit, but with the delta and positives components fitted separately so that different formulas can be specified for each component. Here, with a truncated_nbinom2
model for the positives, then I think predictions from the truncated distribution are appropriate?
I'm not sure how helpful any of this is - the dialogue has been very informative for me at least so thanks again @ericward-noaa for your input!
from sdmtmb.
Agree -- this is a great discussion, and I appreciate the thoughts. I'll work on folding some of your comments into the documentation to make it more clear what is being returned with the predict() function. I think in implementing this, the goal was try to be as consistent as possible with implementations in other packages like glmmTMB, but I think your points on the delta models are good and potentially confusing to users.
On the glmmTMB side, I should have added this in my last code snippet, but we can add on the glmmTMB equivalent with their truncated_nbinom2()
family and show that the predictions are perfectly correlated:
fit_glmmTMB <- glmmTMB(
response ~ treatment,
data = sim_data,
family = truncated_nbinom2()
)
pred_glmmTMB <- predict(fit_glmmTMB, newdata = prd_data, type = "link")
cor(pred_lp$est, pred_glmmTMB)
from sdmtmb.
I also didn't realize until this thread that the linear predictors on the truncated count distributions were predicting the expectation for the untruncated distributions. The help file in glmmTMB doesn't note this, but does link to the aster package vignette where the algorithms are described.
This part in that vignette is relevant:
Indeed, although the link-level predictions are coming back correctly, I believe we have the combined expectation from the delta_truncated_nbinom2()
wrong.
I had missed this part:
https://github.com/glmmTMB/glmmTMB/blob/4ab751238c2494d1a0f8959551fb4486bc3ff909/glmmTMB/src/glmmTMB.cpp#L1046-L1047
where log_nzprob
is calculated here:
https://github.com/glmmTMB/glmmTMB/blob/4ab751238c2494d1a0f8959551fb4486bc3ff909/glmmTMB/src/glmmTMB.cpp#L207-L213
That converts the mean of the untruncated distribution to the mean of the the truncated distribution before combining that linear predictor with the probability of a non zero.
I'll get this fixed.
Here were my experiments:
library(sdmTMB)
set.seed(1)
d0 <- data.frame(y = rnbinom(n = 4e4, mu = 3, size = 5))
mean(d0$y)
#> [1] 2.997775
fit <- sdmTMB(
y ~ 1,
data = d0,
spatial = FALSE,
family = delta_truncated_nbinom2()
)
fit$sd_report
#> sdreport(.) result
#> Estimate Std. Error
#> b_j 2.260764 0.017098746
#> b_j2 1.094016 0.004466339
#> ln_phi 1.572268 0.026848602
#> Maximum gradient component: 0.001687826
p <- predict(fit)
est <- plogis(p$est1) * exp(p$est2)
mean(d0$y)
#> [1] 2.997775
mean(est)
#> [1] 2.704267
# which is what is happening here:
p <- predict(fit, type = "response")
mean(p$est)
#> [1] 2.704267
library(glmmTMB)
#>
#> Attaching package: 'glmmTMB'
#> The following objects are masked from 'package:sdmTMB':
#>
#> lognormal, nbinom1, nbinom2, truncated_nbinom1, truncated_nbinom2,
#> tweedie
fit2 <- glmmTMB::glmmTMB(
y ~ 1,
ziformula = ~ 1,
data = d0,
family = glmmTMB::truncated_nbinom2()
)
# matches:
fit2$sdr
#> sdreport(.) result
#> Estimate Std. Error
#> beta 1.094016 0.004466339
#> betazi -2.260764 0.017098746
#> betad 1.572268 0.026848602
#> Maximum gradient component: 0.001687826
p2 <- predict(fit2)
# matches:
mean(exp(p2))
#> [1] 2.986243
p2 <- predict(fit2, type = "zlink")
mean(p2)
#> [1] -2.260764
plogis(mean(p2))
#> [1] 0.094425
# the mean of the *truncated* distribution:
p2 <- predict(fit2, type = "conditional")
mean(p2)
#> [1] 3.310355
s <- as.list(fit2$sdr, "Estimate")
plogis(s$betazi)
#> [1] 0.094425
s1 <- s$beta
etad <- s$betad
# convert to truncated mean:
logspace_add <- DPQ::logspace.add
logspace_sub <- DPQ::logspace.sub
(s2 <- logspace_add(0, s1 - etad))
#> [1] 0.4823433
(ans <- logspace_sub(0, -exp(etad) * s2))
#> [1] -0.1030396
xx <- exp(s1) / exp(ans)
xx
#> [1] 3.310355
# which matches:
p2 <- predict(fit2, type = "conditional")
mean(p2)
#> [1] 3.310355
# combined with glmmTMB:
p2 <- predict(fit2, type = "response")
mean(p2)
#> [1] 2.997775
# matches (with sdmTMB):
prob_1 <- plogis(tidy(fit, model = 1)$estimate)
xx * prob_1
#> [1] 2.997775
Created on 2024-06-18 with reprex v2.1.0
from sdmtmb.
OK, this is fixed in the main branch. See these new unit tests: https://github.com/pbs-assess/sdmTMB/blob/main/tests/testthat/test-truncated-dists.R
The link-scale predictions remain unchanged. However, the response-scale predictions and any internal calculations that rely on response-scale predictions (including, importantly, the delta families) now use the truncated mean.
Thanks for pointing this out, @tom-peatman!
from sdmtmb.
Thanks @seananderson for fixing this so quickly, much appreciated! And thanks again to you and @ericward-noaa for being so responsive to my questions on this, I'm very grateful.
from sdmtmb.
Related Issues (20)
- Rename time column internally to protect against extra_time + as.factor conflict HOT 2
- Add RMSE and MAE sdmTMB_cv examples to cross validation vignette HOT 3
- Make get_index() smarter about extra time slices and area argument HOT 2
- Release sdmTMB 0.5.0 HOT 1
- `mvn-laplace` residual estimation fails due to `predict.sdmTMB()` offset issue
- Simplify extra time implementation
- get_index potential computation issues for very large prediction grids HOT 7
- Fix prediction with newdata = NULL for non-delta models with extra_time
- Add names of random effects in tidy() HOT 1
- get_index() over alternate time variable? HOT 3
- Expanding abundance when fitting Bernoulli-cloglog model HOT 2
- Predictions with new data from delta models with random intercepts HOT 3
- Release sdmTMB 0.6.0
- Drop the 'proj' section of the .cpp template
- Add cli progress bars where appropriate
- Allow the 'area' argument to take a column name
- Add effective area occupied calculations in .cpp HOT 1
- Fix coef() to work with delta models
- tidy with exponentiate = TRUE returns CIs based on already exponentiated estimate HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from sdmtmb.