Giter Site home page Giter Site logo

Comments (10)

tom-peatman avatar tom-peatman commented on July 17, 2024

(edited the original post to include sdmTMB and R versions)

from sdmtmb.

ericward-noaa avatar ericward-noaa commented on July 17, 2024

I don't think there's much wrong here? -- the truncated negative binomial code used in sdmTMB is taken from glmmTMB (see here: https://github.com/pbs-assess/sdmTMB/blob/c70fa5c3528cfabc6c8136d21b22b430f149a368/src/utils.h#L337C8-L337C25). The simulation below seems to recover the parameters ok -- I think the major issue is that one of your initial means was very small (0.1) and that was causing things to blow up / not converge for other some other seeds I tried.

library(dplyr)
library(tidyr)
library(sdmTMB)
  
set.seed(1)
n_draws <- 150
mod_pars <- data.frame(treatment = letters[1:2], mu = c(2, 5), phi = 0.1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)

## simulate data from negative binomial distribution
sim_data <- sim_data %>%
  mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))

## remove 0s
sim_data <- sim_data %>% filter(., response > 0)

## fit model to simulated data
fit <- sdmTMB(
  response ~ treatment,
  data = sim_data, spatial = FALSE,
  family = truncated_nbinom2()
)

prd_data <- data.frame(treatment = letters[1:2])

## predictions in link space
pred_lp <- predict(fit, newdata = prd_data, type = "link")
pred_lp
#>   treatment       est
#> 1         a 0.6644778
#> 2         b 1.5506004

## predictions in response space appear incorrect
## e.g. mean response can't be < 1 for a truncated negative binomial
pred_lp %>% mutate(., response = exp(est))
#>   treatment       est response
#> 1         a 0.6644778 1.943475
#> 2         b 1.5506004 4.714300

## calculate mean of the truncated negative binomial distribution
## assuming mean = mu * (1 - (1 + alpha * mu) ^ (-1 / alpha)) ^ (-1)

## get dispersion term
phi <- tidy(fit, "ran_pars")
alpha <- 1 / phi$estimate[1]

## predicted means in response space:
exp(pred_lp$est) * (1 - (1 + alpha * exp(pred_lp$est)) ^ (-1 / alpha)) ^ (-1)
#> [1] 2.336000 4.812081

## these appear to correspond to training data means
sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))
#> # A tibble: 2 × 2
#>   treatment  mean
#>   <chr>     <dbl>
#> 1 a          2.34
#> 2 b          4.81

Created on 2024-06-14 with reprex v2.1.0

from sdmtmb.

tom-peatman avatar tom-peatman commented on July 17, 2024

Thanks @ericward-noaa for looking into this and your quick response. And apologies if my pretty contrived example was not particularly clear (and for not having checked whether the model fitting had successfully converged...).

And thanks also for the link to the rtruncated_nbinom function. I tried to expose it to R so I could use it to simulate the training dataset directly rather than truncating random draws from a negative binomial distribution - unfortunately my C++ skills weren't up to the task.

The parameter estimates in your example do appear to be more consistent with the means of the negative binomial distribution(s) used to simulate the training data, as well as the actual means of the training data after removal of the 0s. But, the differences between the negative binomial and truncated negative binomial distributions become more pronounced with lower means (all else equal), as the difference in means are larger due to more 0s.

Hopefully the example below more clearly demonstrates why I think the predictions from predict calls are incorrect for truncated_nbinom2 models (and positive components of delta_truncated_nbinom2 models):

library(dplyr)
library(tidyr)
library(sdmTMB)

set.seed(1)

n_draws <- 250
mod_pars <- data.frame(treatment = letters[1:4], mu = c(0.5, 0.75, 1, 2.5), phi = 1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)

## simulate data from negative binomial distribution
sim_data <- sim_data %>%
  mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))

## remove 0s
sim_data <- sim_data %>% filter(., response > 0)

## fit model to simulated data
fit <- sdmTMB(
  response ~ treatment,
  data = sim_data, spatial = FALSE,
  family = truncated_nbinom2()
)

sanity(fit)
#> ✔ Non-linear minimizer suggests successful convergence
#> ✔ Hessian matrix is positive definite
#> ✔ No extreme or very small eigenvalues detected
#> ✔ No gradients with respect to fixed effects are >= 0.001
#> ✔ No fixed-effect standard errors are NA
#> ✔ No standard errors look unreasonably large
#> ✔ No sigma parameters are < 0.01
#> ✔ No sigma parameters are > 100
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
## predictions in link space
prd_data <- data.frame(treatment = letters[1:4])
pred_lp <- predict(fit, newdata = prd_data, type = "link")
pred_lp
#>   treatment          est
#> 1         a -0.561669862
#> 2         b -0.197211601
#> 3         c -0.001578785
#> 4         d  0.960275459
## predictions in response space:
pred_lp %>% mutate(., response = exp(est))
#>   treatment          est  response
#> 1         a -0.561669862 0.5702560
#> 2         b -0.197211601 0.8210169
#> 3         c -0.001578785 0.9984225
#> 4         d  0.960275459 2.6124160
## mean of simulated responses
sim_responses <- simulate(fit)
sim_responses <- data.frame(treatment = sim_data$treatment, sim_response = as.numeric(sim_responses))
sim_responses %>% group_by(treatment) %>% summarise(mean_response = mean(sim_response))
#> # A tibble: 4 × 2
#>   treatment mean_response
#>   <chr>             <dbl>
#> 1 a                  1.48
#> 2 b                  1.70
#> 3 c                  1.83
#> 4 d                  3.17
## treatment specific mean responses in the training dataset:
sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))
#> # A tibble: 4 × 2
#>   treatment  mean
#>   <chr>     <dbl>
#> 1 a          1.56
#> 2 b          1.81
#> 3 c          1.98
#> 4 d          3.59

My understanding is that predict should return the predicted mean repsonse. In which case the predictions are biased, and also inconsistent with both the assumptions of the model (a zero truncated negative binomial can't have a mean response < 1) and simulated data from the fitted model. The predictions are also suspiciously similar to the means of the (untruncated) negative binomial distributions that were used to generate the underlying data (i.e. mod_pars$mu).

Using the formula for the mean of a zero truncated negative binomial from https://grodri.github.io/glms/notes/countmoments, I am able to recover predicted means that are consistent with both the training dataset, and simulated responses from the fitted model:

## calculate mean of the truncated negative binomial distribution
## assuming mean = mu * (1 - (1 + alpha * mu) ^ (-1 / alpha)) ^ (-1)

## get dispersion term
phi <- tidy(fit, "ran_pars")
alpha <- 1 / phi$estimate[1]

## predicted means in response space:
exp(pred_lp$est) * (1 - (1 + alpha * exp(pred_lp$est)) ^ (-1 / alpha)) ^ (-1)
#> [1] 1.561798 1.809524 1.984962 3.585366

These in combination suggest to me that the predict calls are not returning the predicted mean response from the truncated negative binomial distribution - perhaps instead a parameter that in part defines the mean? Again, I tried unsuccessfully to dig in to the source code, but didn't make much progress.

Thanks again for your assistance, and apologies if this is still unclear. I'm aware this isn't the gold standard in a minimal reproducible example!

Created on 2024-06-16 with reprex v2.1.0

from sdmtmb.

ericward-noaa avatar ericward-noaa commented on July 17, 2024

Thanks @tom-peatman for the clear example. I fleshed this out a bit more, adding more levels (I was interested if there was a gradient in any bias, etc). I don't really see one here -- the estimated means seem consistent with the original means (before data are removed) -- and simulated data are consistent with empirical means, after zeros are removed:

library(dplyr)
library(tidyr)
library(sdmTMB)
library(ggplot2)

set.seed(1)

n_draws <- 250 
true_mu <- seq(0.2, 3, length.out = 20)
mod_pars <- data.frame(treatment = letters[1:length(true_mu)], mu = true_mu, phi = 1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)

## simulate data from negative binomial distribution
sim_data <- sim_data %>%
  mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))

## remove 0s
sim_data <- sim_data %>% filter(., response > 0)

## fit model to simulated data
fit <- sdmTMB(
  response ~ treatment,
  data = sim_data, spatial = FALSE,
  family = truncated_nbinom2()
)

sanity(fit)
#> ✔ Non-linear minimizer suggests successful convergence
#> ✔ Hessian matrix is positive definite
#> ✔ No extreme or very small eigenvalues detected
#> ✔ No gradients with respect to fixed effects are >= 0.001
#> ✔ No fixed-effect standard errors are NA
#> ✔ No standard errors look unreasonably large
#> ✔ No sigma parameters are < 0.01
#> ✔ No sigma parameters are > 100
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf

## predictions in link / response space
prd_data <- data.frame(treatment = letters[1:length(true_mu)])
pred_lp <- predict(fit, newdata = prd_data, type = "link", se_fit = TRUE) %>% 
  mutate(response = exp(est), lo = exp(est - 2*est_se), hi = exp(est + 2*est_se))
#> Prediction can be slow when `se_fit = TRUE` and random fields are included
#> (i.e., `re_form = NA`). Consider using the `nsim` argument to take draws from
#> the joint precision matrix and summarizing the standard devation of those
#> draws.
pred_lp$true_mu <- true_mu

# There doesn't appear to be a consistent bias here
ggplot(pred_lp, aes(treatment, response)) + 
  geom_linerange(aes(ymin = lo, ymax = hi)) + 
  geom_point() + 
  geom_point(aes(treatment, true_mu), col="red") + 
  xlab("Treatment") + ylab("Estimated response") + 
  coord_flip() + theme_bw()

## treatment specific mean responses in the training dataset:
sim_mean <- sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))
pred_lp$sim_mean <- sim_mean$mean

# When we look at the means in the data with zeros removed (blue dots), these values don't 
# include 0s, so are higher
ggplot(pred_lp, aes(treatment, response)) + 
  geom_linerange(aes(ymin = lo, ymax = hi)) + 
  geom_point() + 
  geom_point(aes(treatment, true_mu), col="red") + 
  geom_point(aes(treatment, sim_mean), col="blue") + 
  xlab("Treatment") + ylab("Estimated response") + 
  coord_flip() + theme_bw()

## mean of simulated responses
sim_responses <- simulate(fit)
sim_responses <- data.frame(treatment = sim_data$treatment, sim_response = as.numeric(sim_responses))
sim_resp <- sim_responses %>% group_by(treatment) %>% summarise(mean_response = mean(sim_response))
pred_lp$sim_resp <- sim_resp$mean_response

# We can also compare the simualted values (orange dots) to the filtered original data, and these are 
# quite similar (neither include 0)
ggplot(pred_lp, aes(treatment, response)) + 
  geom_linerange(aes(ymin = lo, ymax = hi)) + 
  geom_point() + 
  geom_point(aes(treatment, true_mu), col="red") + 
  geom_point(aes(treatment, sim_mean), col="blue") + 
  geom_point(aes(treatment, sim_resp), col="orange") + 
  xlab("Treatment") + ylab("Estimated response") + 
  coord_flip() + theme_bw()

Created on 2024-06-17 with reprex v2.1.0

from sdmtmb.

tom-peatman avatar tom-peatman commented on July 17, 2024

Thanks @ericward-noaa for this, very helpful.

It looks like I've fundamentally misunderstood what is returned by predict calls for truncated_nbinom2 models, and the positive component for delta_truncated_nbinom2 models. My expectation was that predict should return the predicted mean response from the zero-truncated distribution (at either the link or response scale). However it seems that predict actually returns the linear predictor, which in the case of the truncated_nbinom2 distribution is the mean of the untruncated distribution.

I suppose that truncated_nbinom2 models can be used in different situations, and depending on the context users might have different expectations on what predictions should be returned by predict.

In a situation where 0s occur but the 0s are not observable, and modelled with a truncated_nbinom2 model, then predictions from the untruncated distribution may be appropriate, i.e. the predictions reflect the full range of possible responses (including 0s).

However, in the case of delta_truncated_nbinom2 models, I'd argue that predictions for the positives component should come from the zero truncated distribution. The positives component models numbers when present, and so 0s are not a possible outcome.

For what it's worth, I think it would be worth considering adding something to the predict documentation that explains what is returned for delta_truncated_nbinom2 models, and/or a warning message when predicting for delta_truncated_nbinom2 models, to avoid any misunderstandings by users?

from sdmtmb.

tom-peatman avatar tom-peatman commented on July 17, 2024

Thinking more about this overnight, there's also potential cases where a delta model is being fit, but with the delta and positives components fitted separately so that different formulas can be specified for each component. Here, with a truncated_nbinom2 model for the positives, then I think predictions from the truncated distribution are appropriate?

I'm not sure how helpful any of this is - the dialogue has been very informative for me at least so thanks again @ericward-noaa for your input!

from sdmtmb.

ericward-noaa avatar ericward-noaa commented on July 17, 2024

Agree -- this is a great discussion, and I appreciate the thoughts. I'll work on folding some of your comments into the documentation to make it more clear what is being returned with the predict() function. I think in implementing this, the goal was try to be as consistent as possible with implementations in other packages like glmmTMB, but I think your points on the delta models are good and potentially confusing to users.

On the glmmTMB side, I should have added this in my last code snippet, but we can add on the glmmTMB equivalent with their truncated_nbinom2() family and show that the predictions are perfectly correlated:

fit_glmmTMB <- glmmTMB(
  response ~ treatment,
  data = sim_data, 
  family = truncated_nbinom2()
)

pred_glmmTMB <- predict(fit_glmmTMB, newdata = prd_data, type = "link")

cor(pred_lp$est, pred_glmmTMB)

from sdmtmb.

seananderson avatar seananderson commented on July 17, 2024

I also didn't realize until this thread that the linear predictors on the truncated count distributions were predicting the expectation for the untruncated distributions. The help file in glmmTMB doesn't note this, but does link to the aster package vignette where the algorithms are described.

This part in that vignette is relevant:
image

Indeed, although the link-level predictions are coming back correctly, I believe we have the combined expectation from the delta_truncated_nbinom2() wrong.

I had missed this part:
https://github.com/glmmTMB/glmmTMB/blob/4ab751238c2494d1a0f8959551fb4486bc3ff909/glmmTMB/src/glmmTMB.cpp#L1046-L1047
where log_nzprob is calculated here:
https://github.com/glmmTMB/glmmTMB/blob/4ab751238c2494d1a0f8959551fb4486bc3ff909/glmmTMB/src/glmmTMB.cpp#L207-L213

That converts the mean of the untruncated distribution to the mean of the the truncated distribution before combining that linear predictor with the probability of a non zero.

I'll get this fixed.

Here were my experiments:

library(sdmTMB)

set.seed(1)
d0 <- data.frame(y = rnbinom(n = 4e4, mu = 3, size = 5))

mean(d0$y)
#> [1] 2.997775
fit <- sdmTMB(
  y ~ 1,
  data = d0, 
  spatial = FALSE,
  family = delta_truncated_nbinom2()
)
fit$sd_report
#> sdreport(.) result
#>        Estimate  Std. Error
#> b_j    2.260764 0.017098746
#> b_j2   1.094016 0.004466339
#> ln_phi 1.572268 0.026848602
#> Maximum gradient component: 0.001687826
p <- predict(fit)
est <- plogis(p$est1) * exp(p$est2)
mean(d0$y)
#> [1] 2.997775
mean(est)
#> [1] 2.704267
# which is what is happening here:
p <- predict(fit, type = "response")
mean(p$est)
#> [1] 2.704267
library(glmmTMB)
#> 
#> Attaching package: 'glmmTMB'
#> The following objects are masked from 'package:sdmTMB':
#> 
#>     lognormal, nbinom1, nbinom2, truncated_nbinom1, truncated_nbinom2,
#>     tweedie
fit2 <- glmmTMB::glmmTMB(
  y ~ 1,
  ziformula = ~ 1,
  data = d0, 
  family = glmmTMB::truncated_nbinom2()
)
# matches:
fit2$sdr
#> sdreport(.) result
#>         Estimate  Std. Error
#> beta    1.094016 0.004466339
#> betazi -2.260764 0.017098746
#> betad   1.572268 0.026848602
#> Maximum gradient component: 0.001687826
p2 <- predict(fit2)

# matches:
mean(exp(p2))
#> [1] 2.986243
p2 <- predict(fit2, type = "zlink")
mean(p2)
#> [1] -2.260764
plogis(mean(p2))
#> [1] 0.094425
# the mean of the *truncated* distribution:
p2 <- predict(fit2, type = "conditional")
mean(p2)
#> [1] 3.310355
s <- as.list(fit2$sdr, "Estimate")
plogis(s$betazi)
#> [1] 0.094425
s1 <- s$beta
etad <- s$betad

# convert to truncated mean:
logspace_add <- DPQ::logspace.add
logspace_sub <- DPQ::logspace.sub
(s2 <- logspace_add(0, s1 - etad))
#> [1] 0.4823433
(ans <- logspace_sub(0, -exp(etad) * s2))
#> [1] -0.1030396
xx <- exp(s1) / exp(ans)
xx
#> [1] 3.310355
# which matches:
p2 <- predict(fit2, type = "conditional")
mean(p2)
#> [1] 3.310355
# combined with glmmTMB:
p2 <- predict(fit2, type = "response")
mean(p2)
#> [1] 2.997775
# matches (with sdmTMB):
prob_1 <- plogis(tidy(fit, model = 1)$estimate)
xx * prob_1
#> [1] 2.997775

Created on 2024-06-18 with reprex v2.1.0

from sdmtmb.

seananderson avatar seananderson commented on July 17, 2024

OK, this is fixed in the main branch. See these new unit tests: https://github.com/pbs-assess/sdmTMB/blob/main/tests/testthat/test-truncated-dists.R

The link-scale predictions remain unchanged. However, the response-scale predictions and any internal calculations that rely on response-scale predictions (including, importantly, the delta families) now use the truncated mean.

Thanks for pointing this out, @tom-peatman!

from sdmtmb.

tom-peatman avatar tom-peatman commented on July 17, 2024

Thanks @seananderson for fixing this so quickly, much appreciated! And thanks again to you and @ericward-noaa for being so responsive to my questions on this, I'm very grateful.

from sdmtmb.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.