psolymos / opticut Goto Github PK

Likelihood Based Optimal Partitioning

Home Page: https://github.com/psolymos/opticut

R 100.00%

cran species indicator-species-analysis likelihood optimal-partitioning r ecology

opticut's Introduction

opticut: Likelihood Based Optimal Partitioning and Indicator Species Analysis

Likelihood based optimal partitioning for indicator species analysis and more. Finding the best binary partition for each species based on model selection, possibly controlling for modifying/confounding variables as described in Kemencei et al. (2014).

Versions

Install stable version from CRAN:

install.packages("opticut")

Install development version from GitHub:

devtools::install_github("psolymos/opticut")

User visible changes in the package are listed in the NEWS file.

Report a problem

Use the issue tracker to report a problem.

Typical workflow

library(opticut)

## --- community data ---
y <- cbind(
    Sp1 = c(4,6,3,5, 5,6,3,4, 4,1,3,2),
    Sp2 = c(0,0,0,0, 1,0,0,1, 4,2,3,4),
    Sp3 = c(0,0,3,0, 2,3,0,5, 5,6,3,4))

## --- stratification ---
g <-      c(1,1,1,1, 2,2,2,2, 3,3,3,3)

## --- find optimal partitions for each species ---
oc <- opticut(y, strata = g, dist = "poisson")
summary(oc)
#  Multivariate opticut results, comb = rank, dist = poisson
#
#  Call:
#  opticut.default(Y = y, strata = g, dist = "poisson")
#
#  Best supported models with logLR >= 2:
#      split assoc      I  mu0  mu1 logLR      w
#  Sp3   2 3    ++ 0.6471 0.75 3.50 4.793 0.6962
#  Sp2     3   +++ 0.8571 0.25 3.25 9.203 0.9577
#  2 binary splits
#  1 species not shown

## --- visualize the results ---
plot(oc, cut = -Inf)

## --- quantify uncertainty ---
uc <- uncertainty(oc, type = "asymp", B = 999)
summary(uc)
#  Multivariate opticut uncertainty results
#  type = asymp, B = 999, level = 0.95
#
#      split R      I   Lower  Upper
#  Sp1   1 2 1 0.2860 0.02341 0.5668
#  Sp3   2 3 1 0.6218 0.21456 0.8813
#  Sp2     3 1 0.8274 0.51229 0.9680

Dynamic documents with opticut

Here is a minimal Rmarkdown example: Rmd source, knitted PDF.

References

Kemencei, Z., Farkas, R., Pall-Gergely, B., Vilisics, F., Nagy, A., Hornung, E. & Solymos, P. (2014): Microhabitat associations of land snails in forested dolinas: implications for coarse filter conservation. Community Ecology 15:180--186. [link, PDF]

opticut's People

Contributors

Stargazers

Watchers

Forkers

ahmedscchaider

opticut's Issues

Overarching questions about Gaussian models

Non-centered covariates in Gaussian models can lead to negative estimates. This is undesirable from various reasons:

Lorenz-curve based thresholding will not work, and in the absence of that, we cannot assign split labels to multicut1 objects.
mu1/mu0 type prior indicator potential measures will be incorrect (even x + abs(mu0) will be NaN due to division by 0). mu0=0 (or -10^-14) is quite likely when data has lots of 0.
Lognormal, Gamma etc models have 0 mass at 0, thus require ZI component for almost all practical purposes

Having negative values, however, might be justified when one used non-abundance data, e.g. instrument measurements that can take negative values (and do not want to transform due to variances, 0s, etc)

Possible options:

Do not care (no error/warning), let the users do what they want.
Warning: give warning and a workaround (e.g. centering, fix_fitted option) for methods that otherwise fail (like bestpart, summary, plot). But the estimates can be processed further.
Error: constrain the analysis workflow to abundance type data.

Right now the implementation uses warnings. This is maybe as good as it gets.

Implement class for uncertainty with summary and plot methods

General:

Add classes: uncertainty and uncertainty1,
class will inherit from summary.opticut and add additional element(s) to it.
summary method can give summary statistics, there can be a confint method as well.

Plots for uncertainty1 (wrap these up in a single one, like in plot.lm?):

plot: draw hist for mu0, mu1, I, (note: I for multi might need facets or rows due to multiple models)
~~pairs: pairs plot for mu0, mu1,~~
selection frequencies in barplot (habitats vs proportions),
~~something like ladderplot in plotrix.~~

Plots for uncertainty:

for uncertainty: similar to plot.opticut but showing I and CIs corresponding to level=0.95 arg.
Cut and rank applies (cut used to open vs. solid circles for I). Use colors based on I?

Indicator value scaling is not sensitive

Add a tuning constant: I = abs(tanh(beta*scale))

Add unit testing for multicut and methods

multicut objects and methods require unit testing. Add /tests/tests_multicut.R.

getMLE.opticut should return coefs for custom dist when vcov=FALSE

This should work:

mle4 <- try(getMLE(m4, 1, vcov=FALSE), silent=TRUE)
stopifnot(inherits(mle4, "try-error"))

The error:

> getMLE(m4, 1, vcov=FALSE)
Error in .opticut1(Y = object$Y[j, i], X = object$X[j, , drop = FALSE],  : 
  Unable to return full model for custom distribution function.
Called from: .opticut1(Y = object$Y[j, i], X = object$X[j, , drop = FALSE], 
    Z1 = ZZ, dist = object$dist, ...)

Note: it works for multicut.

Add ABMI multi-taxa disturbance data set

ABMI multi-taxa disturbance data set used for calibration to be added.

Need to have the following features:

classification by disturbance levels (multiple scales)
classification by natural regions
classification by land cover types (multiple scales)
modifiers: lat/long, climate.

Following taxa to consider: terrestrial vascular plants, bryophytes, lichens, mites, birds.

complete data processing and test all 3 classifications.
add new rda file
document new data set

predict fails with (Y, X) models and works with formula

It works with formula interface, but not with (Y, X) interface.

dist="ordinal" is unreliable for small data, needs fuller testing

Predictions about expected values seem not to make sense in some cases (e.g. dune data). Null should be between mu0 and mu1 but this is not the case.

test based on simulations and see sample size and # category effects
fix or disallow based on results.

Add parallel infrasctructure to uncertainty method

Wrap up the 3 types in a function,
include cl=NULL argument and copy the section from opticut.

Define opticut as method

Define opticut as generic with default (opticut(y, strata=g)) and formula opticut(y ~ 1, strata=g) method. Follow vegan::cca implementation.

Column sorting in plot & wplot

Sorting of columns (strata in plot.opticut, splits in wplot.opticut) should be more flexible than rows (species).

separate row and column sorting (sort option applies only to rows)
wplot: not sure what rules to apply, but can be the default (alphabetical?), or just logical/integer allowed (sort as default, or user supplied sorting indices)
plot: provide alphabetical, similarity, richness, none (falls back to factor levels), or user supplied indices.

Add bestpart results and v=I*t to uncertainty method results

Need to parse split labels to get t matrix.

Add wrsi

wrsi: weighted relative selection index + documentation

Additional CRAN issues

As reported by BDR:

§A 3.1 of the 'R Installation and Administration Manual' describes how the reference BLAS and LAPACK implementations can be replaced by external versions with in principle faster execution but probably re-ordered calculations. Some binary distributions of R use these by default.

Problems using several implementations are now reported on the CRAN results pages under 'Additional issues'.

For packages

FixedPoint MFPCA adaptDA aster expectreg madness opticut pulsar robustbase sca

the issue occurred for more than one of these implementations.

You are unlikely to be able to reproduce these, but they do indicate issues with your package or its tests. > See the discussion on testing in §1.6 of 'Writing R Extensions' ... and

in some cases tolerances are too small.

where singular matrices are reported, take a look at the conditioning of that matrix computed on a platform you have access to.

if theory says a matrix is symmetric, symmetrize it ( 0.5*(A+t(A)) )

do not use identical() nor == on numerical results without exceptional cause.

singular vectors and eigenvectors have arbitrary signs.

Please correct as much as possible and submit a correction before Feb 12 to safely retain the package on CRAN.

The issue can be traced back to optilevels -> .optilevels which threw an error on ATLAS, MKL, OpenBLAS:

> ol <- optilevels(spp[,"Species3"], gr)
Error in if (Delta < 0) { : argument is of length zero
Calls: optilevels -> .optilevels
Execution halted

Delta is delta-AIC calculated from nobs, df, and model$logLik as returned by .opticut1, which calls glm and the summary method.

The only reason why Delta can be of length 0 is to get a numeric(0) or NULL value for model$logLik, which is probably happening at the glm level.

As a workaround, .opticut1 now calls lm instead of glm when dist == "gaussian" && link == "identity". Will see if this solves the problem.

More thorough testing of distributions

Need to do more thorough testing of distributions, especially for getMLE, fitted, and predict methods.

direction not shown besides split for uncertainty output

Split in summary.uncartainty can be misleading without assoc (direction). Need to:

include assoc,
or invert the labels.

Convenience function to manipulate strata levels

The error Collapse value found in levels. should be remedied quickly by a fix_levels command, i.e. take collapse value from ocoptions and replace by something else (e.g. _ by default):

fix_levels <- function(x, sep="_") {
    if (identical(getOption("ocoptions")$collapse, sep))
        stop("Nice try, but collapse option and sep argument are identical.")
    levels(x) <- gsub(getOption("ocoptions")$collapse, sep, x, fixed=TRUE)
    x
}

Integrate multicut with lorenz thresholding

Implement bestmodel, bestpart, and uncertainty methods for multicut objects.

bestmodel: simply return the single model.
bestpart: use fitted values, apply Lc threshold to turn it into 0/1 matrix (store cut points as attribute).
Update summary and plot to sort by bp results.
uncertainty: "asymp" and "boot" types to implement, bp to be used for R.

Add these to unit tests.

Add functionality to do inverse prediction

Create new branch (inverse-prediction) and implement the following functions:

ipredict generic and methods (default, opticut): currently called calibrate, but ipredict should better reflect the intention. This will only run a single instance with any ynew/xnew arguments.
multiclass measures based on Sokolova et al.
calibrate method to implement LOO. Implement comparison based on multiclass measures similarly to AIC but meybe only for 2 objects: LOTO (leave one taxon out) vs. reference (all taxa LOO)

collapse value messes up rankComb based output (summary/plot)

The default collapse value (through ocoptions) is " ". When that is changed, it can affect summary/plot results. rankComb is not affected, and opticut1 seems to work fine. I suspect that the summary does something. The code is frozen (the patch is hard coded to force the default, that works):

#    oc <- oComb(x, collapse = getOption("ocoptions")$collapse)
    oc <- oComb(x, collapse = " ")

Needs investigation.

Sorting needs to recognize strata and species separately

Currently the option and argument sort takes logical value. It should accept numeric in 1:2:

1 = sort rows/species only
2 = sort columns/strata only
any other value or FALSE = no sorting
1:2 or TRUE = sort both.

Internally the logical needs to turned into 0 (FALSE) or 1:2 (TRUE).
evaluate like: if (1 %in% sort) <sort_rows> etc.

Default value was TRUE should remain.

Bootstrap can lead to strata missing

Bootstrap can lead to complementary Z matrix. This only happened with small sample size so far.
Possible options:

options need to be relaxed and reset on exit: this will then refer to a not represented stratum -- not a good idea.
block bootstrap by stratum: use blocking as option (need to check iid assumptions), or provide B matrix from outside.
jackknife (LOO, leave-one-out) can be a solution if n>1 in stratum (i.e. small sample size in stratum can be bad in general).

y <- cbind(
    Spp1=c(4,6,3,5,5,6,3,4,4,1,3,2),
    Spp2=c(0,0,0,0,1,0,0,1,4,2,3,4),
    Spp3=c(0,0,3,0,2,3,0,5,5,6,3,4))
g <- rep(1:3, each=4)
x <- c(0.1,-0.2,1,0,-0.5,-1,0,0.5,0,0.8,-0.3,0.1)
m <- opticut(formula = y ~ 1, strata = g, dist = "poisson")

set.seed(1)
u <- uncertainty(m, type = "multi")
## Error in opticut1(Y = object$Y[j, i, drop = TRUE], X = object$X[j, ],  : 
##   Guess what! Complementary design variables found:
## use 'checkComb'
## Called from: opticut1(Y = object$Y[j, i, drop = TRUE], X = object$X[j, ], 
##    Z = zz, dist = object$dist, ...)

set.seed(2)
u <- uncertainty(m, type = "multi")
##   [==================================================] 100% elapsed = 03s

Add internal function to retrieve linkinv

Accessing the inverse link function part of the dist definition is often required. Write a .get_linkinv function for this purpose. Use this internal function at places where linkinv is needed.

wplot method for uncertainty should be renamed

wplot indicates w as in weight (model weight). selection frequencies are not really weights...

Check behavior when NA is in the data

Need to assess how NA in data impacts examples. By default, it should pass the data with NA to underlying functions. Sometimes missing na.rm arguments and or default model.matrix behavior prevents the expected behavior to unfold.

Add new data sets

ARU data for temporal classification
ABMI data for multiple taxa for prediction

pbapply related parallel pb changes

Finalize pbapply parallel pb and follow that in opticut. This should simplify handling of cl argument.

Allow intercept and modifiers in optilevel

Allow intercept when response is a factor (i.e. discrete units, not proportions). Have the current interface as the engine with intercept argument.
When intercept is present (factor input only!), allow definition of modifiers through another argument.
Also, does it make sense to allow this for proportions as input?
Think about a final formula interface following opticut.

Add `bestmodel` method for `optilevel` objects

Needs to be able to recover all the input arguments from the fitted model object.

Consider dropping dist=ordinal altogether

Ordinal models are problematic for various reasons when considering multiple species. Thus it does not really make sense to include them.

Prediction involves values corresponding to all levels of the response, thus making it multidimensional. We can use only zeta[1] which really makes it very similar to a binomial model anyways.

(Same does not apply to RSF/RSPF models, there results at least are comparable across species, only fitting is difficult due to the ragged nature of the data.)

RSF/RSPF resampling should not include available points

Resampling based uncertainty for RSF/RSPF based models should not touch available points, only used points are to be resampled.

Pass ... from opticut call to ... in uncertainty

RSF/RSPF need to pass m=0 to uncertainty functions. This should be passed from opticut call in general, other args might be missing as well if not done (e.g. offset in dist=poisson etc).

Also investigate: type=asymp failung for dist=ordered

Add cross validation method to opticut

Needed for correlation and weighted distribution based calibration to assess classification accuracy.

getMLE not returning variance components

Variance components (e.g. for Gaussian, Beta) are needed for possible extension of ipredict functionality. Gaussian would certainly be nice to have.

Add simple ROC and AUC functions

pROC is great, but it is awfully slow. Need to have simple ROC/AUC:

simple_roc <- function(labels, scores){
  labels <- labels[order(scores, decreasing=TRUE)]
  data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels), labels)
}
simple_auc <- function(ROC) {
    ROC$inv_spec <- 1-ROC$FPR
    dx <- diff(ROC$inv_spec)
    sum(dx * ROC$TPR[-1]) / sum(dx)
}

Make these nice (S3 classes, some methods etc) and add to package.

library(opticut)
library(ResourceSelection)
slp <- cut(goats$SLOPE,c(-1, 20, 30, 40, 50, 90))
table(slp, goats$STATUS)

## this works
o <- opticut(STATUS ~ ELEVATION, data=goats, strata=slp, dist="rsf")

## this doesn't
o <- opticut(STATUS ~ 1, data=goats, strata=slp, dist="rsf")

14: terms.formula(formula, data = data)
13: terms(formula, data = data)
12: model.frame.default(formula = Y ~ ., data = XX[, -1, drop = FALSE], 
        drop.unused.levels = TRUE)
11: model.frame(formula = Y ~ ., data = XX[, -1, drop = FALSE], drop.unused.levels = TRUE)
10: eval(expr, envir, enclos)
9: eval(mf, parent.frame())
8: ResourceSelection::rsf(Y ~ ., data = XX[, -1, drop = FALSE], 
       m = 0, B = 0, ...)
7: .opticut1(Y, X, Z1 = NULL, dist = dist, ...)
6: opticut1(Y = Y[, i], X = X, Z = Z, dist = dist, sset = sset, 
       ...)
5: FUN(X[[i]], ...)
4: lapply(X, FUN, ...)
3: pbapply::pblapply(seq_len(ncol(Y)), function(i, ...) opticut1(Y = Y[, 
       i], X = X, Z = Z, dist = dist, sset = sset, ...), cl = cl, 
       ...)
2: opticut.default(goats$STATUS, strata = slp, dist = "rsf")
1: opticut(goats$STATUS, strata = slp, dist = "rsf")

summary/bestpart fails for opticut when using custom strata matrix

Summary fails (due to bestpart) when object$comb is NA:

set.seed(2345)
n <- 50
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
x3 <- ifelse(x0 %in% 2:4, 1, 0)
lam1 <- exp(0.5 + 1*x1 + -0.2*x2)
Y1 <- rpois(n, lam1)
lam2 <- exp(1 + 0.5*x3)
Y2 <- rpois(n, lam2)
Y3 <- rpois(n, exp(0))
Y <- cbind(Spp1=Y1, Spp2=Y2, Spp3=Y3)
comb <- allComb(x0)
colnames(comb) <- letters[1:7]
ocfun <- opticut(Y ~ x2, strata=comb, dist=fun)
summary(ocfun) # this fails!
> traceback()
4: bestpart.opticut(object)
3: bestpart(object)
2: summary.opticut(ocfun)
1: summary(ocfun)