Giter Site home page Giter Site logo

opticut's Introduction

opticut: Likelihood Based Optimal Partitioning and Indicator Species Analysis

CRAN version CRAN RStudio mirror downloads Linux build status Windows build status Code coverage status License: GPL v2

Likelihood based optimal partitioning for indicator species analysis and more. Finding the best binary partition for each species based on model selection, possibly controlling for modifying/confounding variables as described in Kemencei et al. (2014).

Versions

Install stable version from CRAN:

install.packages("opticut")

Install development version from GitHub:

devtools::install_github("psolymos/opticut")

User visible changes in the package are listed in the NEWS file.

Report a problem

Use the issue tracker to report a problem.

Typical workflow

library(opticut)

## --- community data ---
y <- cbind(
    Sp1 = c(4,6,3,5, 5,6,3,4, 4,1,3,2),
    Sp2 = c(0,0,0,0, 1,0,0,1, 4,2,3,4),
    Sp3 = c(0,0,3,0, 2,3,0,5, 5,6,3,4))

## --- stratification ---
g <-      c(1,1,1,1, 2,2,2,2, 3,3,3,3)

## --- find optimal partitions for each species ---
oc <- opticut(y, strata = g, dist = "poisson")
summary(oc)
#  Multivariate opticut results, comb = rank, dist = poisson
#
#  Call:
#  opticut.default(Y = y, strata = g, dist = "poisson")
#
#  Best supported models with logLR >= 2:
#      split assoc      I  mu0  mu1 logLR      w
#  Sp3   2 3    ++ 0.6471 0.75 3.50 4.793 0.6962
#  Sp2     3   +++ 0.8571 0.25 3.25 9.203 0.9577
#  2 binary splits
#  1 species not shown

## --- visualize the results ---
plot(oc, cut = -Inf)

## --- quantify uncertainty ---
uc <- uncertainty(oc, type = "asymp", B = 999)
summary(uc)
#  Multivariate opticut uncertainty results
#  type = asymp, B = 999, level = 0.95
#
#      split R      I   Lower  Upper
#  Sp1   1 2 1 0.2860 0.02341 0.5668
#  Sp3   2 3 1 0.6218 0.21456 0.8813
#  Sp2     3 1 0.8274 0.51229 0.9680

Dynamic documents with opticut

Here is a minimal Rmarkdown example: Rmd source, knitted PDF.

References

Kemencei, Z., Farkas, R., Pall-Gergely, B., Vilisics, F., Nagy, A., Hornung, E. & Solymos, P. (2014): Microhabitat associations of land snails in forested dolinas: implications for coarse filter conservation. Community Ecology 15:180--186. [link, PDF]

opticut's People

Contributors

psolymos avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Forkers

ahmedscchaider

opticut's Issues

Overarching questions about Gaussian models

Non-centered covariates in Gaussian models can lead to negative estimates. This is undesirable from various reasons:

  • Lorenz-curve based thresholding will not work, and in the absence of that, we cannot assign split labels to multicut1 objects.
  • mu1/mu0 type prior indicator potential measures will be incorrect (even x + abs(mu0) will be NaN due to division by 0). mu0=0 (or -10^-14) is quite likely when data has lots of 0.
  • Lognormal, Gamma etc models have 0 mass at 0, thus require ZI component for almost all practical purposes

Having negative values, however, might be justified when one used non-abundance data, e.g. instrument measurements that can take negative values (and do not want to transform due to variances, 0s, etc)

Possible options:

  • Do not care (no error/warning), let the users do what they want.
  • Warning: give warning and a workaround (e.g. centering, fix_fitted option) for methods that otherwise fail (like bestpart, summary, plot). But the estimates can be processed further.
  • Error: constrain the analysis workflow to abundance type data.

Right now the implementation uses warnings. This is maybe as good as it gets.

Implement class for uncertainty with summary and plot methods

General:

  • Add classes: uncertainty and uncertainty1,
  • class will inherit from summary.opticut and add additional element(s) to it.
  • summary method can give summary statistics, there can be a confint method as well.

Plots for uncertainty1 (wrap these up in a single one, like in plot.lm?):

  • plot: draw hist for mu0, mu1, I, (note: I for multi might need facets or rows due to multiple models)
  • pairs: pairs plot for mu0, mu1,
  • selection frequencies in barplot (habitats vs proportions),
  • something like ladderplot in plotrix.

Plots for uncertainty:

  • for uncertainty: similar to plot.opticut but showing I and CIs corresponding to level=0.95 arg.
    Cut and rank applies (cut used to open vs. solid circles for I). Use colors based on I?

getMLE.opticut should return coefs for custom dist when vcov=FALSE

This should work:

mle4 <- try(getMLE(m4, 1, vcov=FALSE), silent=TRUE)
stopifnot(inherits(mle4, "try-error"))

The error:

> getMLE(m4, 1, vcov=FALSE)
Error in .opticut1(Y = object$Y[j, i], X = object$X[j, , drop = FALSE],  : 
  Unable to return full model for custom distribution function.
Called from: .opticut1(Y = object$Y[j, i], X = object$X[j, , drop = FALSE], 
    Z1 = ZZ, dist = object$dist, ...)

Note: it works for multicut.

Add ABMI multi-taxa disturbance data set

ABMI multi-taxa disturbance data set used for calibration to be added.

Need to have the following features:

  • classification by disturbance levels (multiple scales)
  • classification by natural regions
  • classification by land cover types (multiple scales)
  • modifiers: lat/long, climate.

Following taxa to consider: terrestrial vascular plants, bryophytes, lichens, mites, birds.

  • complete data processing and test all 3 classifications.
  • add new rda file
  • document new data set

Define opticut as method

Define opticut as generic with default (opticut(y, strata=g)) and formula opticut(y ~ 1, strata=g) method. Follow vegan::cca implementation.

Column sorting in plot & wplot

Sorting of columns (strata in plot.opticut, splits in wplot.opticut) should be more flexible than rows (species).

  • separate row and column sorting (sort option applies only to rows)
  • wplot: not sure what rules to apply, but can be the default (alphabetical?), or just logical/integer allowed (sort as default, or user supplied sorting indices)
  • plot: provide alphabetical, similarity, richness, none (falls back to factor levels), or user supplied indices.

Add wrsi

wrsi: weighted relative selection index + documentation

Additional CRAN issues

As reported by BDR:

§A 3.1 of the 'R Installation and Administration Manual' describes how the reference BLAS and LAPACK implementations can be replaced by external versions with in principle faster execution but probably re-ordered calculations. Some binary distributions of R use these by default.

Problems using several implementations are now reported on the CRAN results pages under 'Additional issues'.

For packages

FixedPoint MFPCA adaptDA aster expectreg madness opticut pulsar robustbase sca

the issue occurred for more than one of these implementations.

You are unlikely to be able to reproduce these, but they do indicate issues with your package or its tests. > See the discussion on testing in §1.6 of 'Writing R Extensions' ... and

  • in some cases tolerances are too small.

  • where singular matrices are reported, take a look at the conditioning of that matrix computed on a platform you have access to.

  • if theory says a matrix is symmetric, symmetrize it ( 0.5*(A+t(A)) )

  • do not use identical() nor == on numerical results without exceptional cause.

  • singular vectors and eigenvectors have arbitrary signs.

Please correct as much as possible and submit a correction before Feb 12 to safely retain the package on CRAN.

The issue can be traced back to optilevels -> .optilevels which threw an error on ATLAS, MKL, OpenBLAS:

> ol <- optilevels(spp[,"Species3"], gr)
Error in if (Delta < 0) { : argument is of length zero
Calls: optilevels -> .optilevels
Execution halted

Delta is delta-AIC calculated from nobs, df, and model$logLik as returned by .opticut1, which calls glm and the summary method.

The only reason why Delta can be of length 0 is to get a numeric(0) or NULL value for model$logLik, which is probably happening at the glm level.

As a workaround, .opticut1 now calls lm instead of glm when dist == "gaussian" && link == "identity". Will see if this solves the problem.

Convenience function to manipulate strata levels

The error Collapse value found in levels. should be remedied quickly by a fix_levels command, i.e. take collapse value from ocoptions and replace by something else (e.g. _ by default):

fix_levels <- function(x, sep="_") {
    if (identical(getOption("ocoptions")$collapse, sep))
        stop("Nice try, but collapse option and sep argument are identical.")
    levels(x) <- gsub(getOption("ocoptions")$collapse, sep, x, fixed=TRUE)
    x
}

Integrate multicut with lorenz thresholding

Implement bestmodel, bestpart, and uncertainty methods for multicut objects.

  • bestmodel: simply return the single model.
  • bestpart: use fitted values, apply Lc threshold to turn it into 0/1 matrix (store cut points as attribute).
  • Update summary and plot to sort by bp results.
  • uncertainty: "asymp" and "boot" types to implement, bp to be used for R.

Add these to unit tests.

Add functionality to do inverse prediction

Create new branch (inverse-prediction) and implement the following functions:

  • ipredict generic and methods (default, opticut): currently called calibrate, but ipredict should better reflect the intention. This will only run a single instance with any ynew/xnew arguments.
  • multiclass measures based on Sokolova et al.
  • calibrate method to implement LOO. Implement comparison based on multiclass measures similarly to AIC but meybe only for 2 objects: LOTO (leave one taxon out) vs. reference (all taxa LOO)

collapse value messes up rankComb based output (summary/plot)

The default collapse value (through ocoptions) is " ". When that is changed, it can affect summary/plot results. rankComb is not affected, and opticut1 seems to work fine. I suspect that the summary does something. The code is frozen (the patch is hard coded to force the default, that works):

#    oc <- oComb(x, collapse = getOption("ocoptions")$collapse)
    oc <- oComb(x, collapse = " ")

Needs investigation.

Sorting needs to recognize strata and species separately

Currently the option and argument sort takes logical value. It should accept numeric in 1:2:

  • 1 = sort rows/species only
  • 2 = sort columns/strata only
  • any other value or FALSE = no sorting
  • 1:2 or TRUE = sort both.
  1. Internally the logical needs to turned into 0 (FALSE) or 1:2 (TRUE).
  2. evaluate like: if (1 %in% sort) <sort_rows> etc.

Default value was TRUE should remain.

Bootstrap can lead to strata missing

Bootstrap can lead to complementary Z matrix. This only happened with small sample size so far.
Possible options:

  • options need to be relaxed and reset on exit: this will then refer to a not represented stratum -- not a good idea.
  • block bootstrap by stratum: use blocking as option (need to check iid assumptions), or provide B matrix from outside.
  • jackknife (LOO, leave-one-out) can be a solution if n>1 in stratum (i.e. small sample size in stratum can be bad in general).
y <- cbind(
    Spp1=c(4,6,3,5,5,6,3,4,4,1,3,2),
    Spp2=c(0,0,0,0,1,0,0,1,4,2,3,4),
    Spp3=c(0,0,3,0,2,3,0,5,5,6,3,4))
g <- rep(1:3, each=4)
x <- c(0.1,-0.2,1,0,-0.5,-1,0,0.5,0,0.8,-0.3,0.1)
m <- opticut(formula = y ~ 1, strata = g, dist = "poisson")

set.seed(1)
u <- uncertainty(m, type = "multi")
## Error in opticut1(Y = object$Y[j, i, drop = TRUE], X = object$X[j, ],  : 
##   Guess what! Complementary design variables found:
## use 'checkComb'
## Called from: opticut1(Y = object$Y[j, i, drop = TRUE], X = object$X[j, ], 
##    Z = zz, dist = object$dist, ...)

set.seed(2)
u <- uncertainty(m, type = "multi")
##   [==================================================] 100% elapsed = 03s

Add internal function to retrieve linkinv

Accessing the inverse link function part of the dist definition is often required. Write a .get_linkinv function for this purpose. Use this internal function at places where linkinv is needed.

Check behavior when NA is in the data

Need to assess how NA in data impacts examples. By default, it should pass the data with NA to underlying functions. Sometimes missing na.rm arguments and or default model.matrix behavior prevents the expected behavior to unfold.

Add new data sets

  • ARU data for temporal classification
  • ABMI data for multiple taxa for prediction

Allow intercept and modifiers in optilevel

  • Allow intercept when response is a factor (i.e. discrete units, not proportions). Have the current interface as the engine with intercept argument.
  • When intercept is present (factor input only!), allow definition of modifiers through another argument.
  • Also, does it make sense to allow this for proportions as input?
  • Think about a final formula interface following opticut.

Consider dropping dist=ordinal altogether

Ordinal models are problematic for various reasons when considering multiple species. Thus it does not really make sense to include them.

Prediction involves values corresponding to all levels of the response, thus making it multidimensional. We can use only zeta[1] which really makes it very similar to a binomial model anyways.

(Same does not apply to RSF/RSPF models, there results at least are comparable across species, only fitting is difficult due to the ragged nature of the data.)

Pass ... from opticut call to ... in uncertainty

RSF/RSPF need to pass m=0 to uncertainty functions. This should be passed from opticut call in general, other args might be missing as well if not done (e.g. offset in dist=poisson etc).

Also investigate: type=asymp failung for dist=ordered

Add simple ROC and AUC functions

pROC is great, but it is awfully slow. Need to have simple ROC/AUC:

simple_roc <- function(labels, scores){
  labels <- labels[order(scores, decreasing=TRUE)]
  data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels), labels)
}
simple_auc <- function(ROC) {
    ROC$inv_spec <- 1-ROC$FPR
    dx <- diff(ROC$inv_spec)
    sum(dx * ROC$TPR[-1]) / sum(dx)
}

Make these nice (S3 classes, some methods etc) and add to package.

Consider I definition for multicut

Current I for multicut is f^-1(mu_max)-f^-1(mu_min) for K partitions.

Alternative is: f^-1(mu1)-f^-1(mu0) where mu1 and mu0 are weighted means of x$mu and x$n belonging to 0/1 strata based on Lc threshold.

Intercept-only RSF model fails

library(opticut)
library(ResourceSelection)
slp <- cut(goats$SLOPE,c(-1, 20, 30, 40, 50, 90))
table(slp, goats$STATUS)

## this works
o <- opticut(STATUS ~ ELEVATION, data=goats, strata=slp, dist="rsf")

## this doesn't
o <- opticut(STATUS ~ 1, data=goats, strata=slp, dist="rsf")

14: terms.formula(formula, data = data)
13: terms(formula, data = data)
12: model.frame.default(formula = Y ~ ., data = XX[, -1, drop = FALSE], 
        drop.unused.levels = TRUE)
11: model.frame(formula = Y ~ ., data = XX[, -1, drop = FALSE], drop.unused.levels = TRUE)
10: eval(expr, envir, enclos)
9: eval(mf, parent.frame())
8: ResourceSelection::rsf(Y ~ ., data = XX[, -1, drop = FALSE], 
       m = 0, B = 0, ...)
7: .opticut1(Y, X, Z1 = NULL, dist = dist, ...)
6: opticut1(Y = Y[, i], X = X, Z = Z, dist = dist, sset = sset, 
       ...)
5: FUN(X[[i]], ...)
4: lapply(X, FUN, ...)
3: pbapply::pblapply(seq_len(ncol(Y)), function(i, ...) opticut1(Y = Y[, 
       i], X = X, Z = Z, dist = dist, sset = sset, ...), cl = cl, 
       ...)
2: opticut.default(goats$STATUS, strata = slp, dist = "rsf")
1: opticut(goats$STATUS, strata = slp, dist = "rsf")

summary/bestpart fails for opticut when using custom strata matrix

Summary fails (due to bestpart) when object$comb is NA:

set.seed(2345)
n <- 50
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
x3 <- ifelse(x0 %in% 2:4, 1, 0)
lam1 <- exp(0.5 + 1*x1 + -0.2*x2)
Y1 <- rpois(n, lam1)
lam2 <- exp(1 + 0.5*x3)
Y2 <- rpois(n, lam2)
Y3 <- rpois(n, exp(0))
Y <- cbind(Spp1=Y1, Spp2=Y2, Spp3=Y3)
comb <- allComb(x0)
colnames(comb) <- letters[1:7]
ocfun <- opticut(Y ~ x2, strata=comb, dist=fun)
summary(ocfun) # this fails!
> traceback()
4: bestpart.opticut(object)
3: bestpart(object)
2: summary.opticut(ocfun)
1: summary(ocfun)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.