
applicable

[badges: R-CMD-check, Codecov test coverage, Lifecycle: experimental, CRAN status]

Introduction

There are times when a model’s prediction should be taken with some skepticism. For example, if a new data point is substantially different from the training set, its predicted value may be suspect. In chemistry, it is not uncommon to create an “applicability domain” model that measures the amount of potential extrapolation new samples have from the training set. applicable contains different methods to measure how much a new data point is an extrapolation from the original data (if at all).
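To make the idea concrete, here is a minimal base-R sketch of the same concept (this is not the package's actual implementation): score how far new samples fall from the training data in principal-component space.

```r
# Illustrative only: a hand-rolled applicability check using PCA distance.
train  <- mtcars[1:25, -1]   # "training" predictors
new_pt <- mtcars[26:32, -1]  # "new" samples to assess

pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Distance of each sample from the center of the training data,
# measured in the principal-component space
dist_from_center <- function(x, pca) {
  scores <- predict(pca, newdata = x)
  sqrt(rowSums(scores^2))
}

train_dist <- dist_from_center(train, pca)
new_dist   <- dist_from_center(new_pt, pca)

# Percentile of each new distance relative to the training distances;
# values near 100 suggest potential extrapolation
round(100 * ecdf(train_dist)(new_dist), 1)
```

The package's `apd_pca()`/`score()` functions are the supported way to do this; the sketch above only shows the shape of the computation.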

Installation

You can install the released version of applicable from CRAN with:

install.packages("applicable")

Install the development version of applicable from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/applicable")

Vignettes

To learn about how to use applicable, check out the vignettes:

  • vignette("binary-data", "applicable"): Learn different methods to analyze binary data.

  • vignette("continuous-data", "applicable"): Learn different methods to analyze continuous data.

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

applicable's People

Contributors

emilhvitfeldt, hfrick, juliasilge, marlycormar, topepo


applicable's Issues

Breaking changes in dependency `isotree`

I am the maintainer of the isotree package, which is a 'Suggests' dependency of {applicable} on CRAN:
https://cran.r-project.org/web/packages/applicable/index.html

I would like to push an update to {isotree} which would break one of the unit tests of {applicable}.

In particular, I would like to change the default of the ndim argument to the isolation.forest() function:
https://github.com/david-cortes/isotree/blob/ad49b9717b41ce9bab86f2aeebe742679f0fca58/R/isoforest.R#L996
From the current (CRAN) default of min(3, NCOL(data)) to 1.

This would generate a problem in this unit test for {applicable}:
https://github.com/cran/applicable/blob/b66153447194c71778f7c04bc258722cc5cc5257/tests/testthat/test-isolation-fit.R#L26

To get the old behavior, one would now need to pass ndim = 2 in this test:

res_rec <- apd_isolation(rec, cells_tr, ntrees = 10, nthreads = 1, ndim = 2)

It would be ideal if an updated version with this change could be submitted to CRAN.

Leaving this as an issue instead of a PR, since the code here looks out of sync with the CRAN release and doesn't have the problematic file committed.

Upkeep for applicable

Pre-history

  • usethis::use_readme_rmd()
  • usethis::use_roxygen_md()
  • usethis::use_github_links()
  • usethis::use_pkgdown_github_pages()
  • usethis::use_tidy_labels()
  • usethis::use_tidy_style()
  • usethis::use_tidy_description()
  • urlchecker::url_check()

2020

  • usethis::use_package_doc()
    Consider letting usethis manage your @importFrom directives here.
    usethis::use_import_from() is handy for this.
  • usethis::use_testthat(3) and upgrade to 3e, testthat 3e vignette
  • Align the names of R/ files and test/ files for workflow happiness.
    usethis::rename_files() can be helpful.

2021

  • usethis::use_tidy_dependencies()
  • usethis::use_tidy_github_actions() and update artisanal actions to use setup-r-dependencies
  • Remove check environments section from cran-comments.md
  • Bump required R version in DESCRIPTION to 3.4
  • Use lifecycle instead of artisanal deprecation messages, as described in Communicate lifecycle changes in your functions
  • Add RStudio to DESCRIPTION as funder, if appropriate

some initial notes

Based on our previous conversation

  • I think that, for unsupervised models, we can use y = NA in these calls to mold().

  • For this line, you won't need to pass in outcome. The line outcome <- processed$outcomes[[1]] won't be needed either.

  • fit-implementation.R would include the call to prcomp() and that object would be returned here (instead of the coefs thing that gets automatically populated)

  • Since we will have multiple ad_* functions, you may want to combine the fit-* files into a single file for PCA (same for the predict-* files too).

Output an error message when column names don't match

Output a descriptive error message when the selector columns do not exist in the dataset. For example, in the code below, the model produced by apd_pca(predictors) doesn't contain columns matching "PC00[1-3]".

library(applicable)

predictors <- mtcars[, -1]
mod <- apd_pca(predictors)
autoplot(mod, matches("PC00[1-3]"))

As a result, it throws the following error.

Error: At least one layer must contain all faceting variables: component.

  • Plot is missing component
  • Layer 1 is missing component
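One possible shape for that check (a sketch, not a proposal for the package's internals): validate the selector against the available score columns and fail with a descriptive message before any plotting happens. check_selection() is a hypothetical helper name.

```r
# Sketch of a pre-flight check: fail early, with a descriptive message,
# when a regex selector matches none of the available columns.
check_selection <- function(pattern, columns) {
  hits <- grep(pattern, columns, value = TRUE)
  if (length(hits) == 0) {
    stop(
      "No columns match the pattern '", pattern, "'. ",
      "Available columns: ", paste(columns, collapse = ", "),
      call. = FALSE
    )
  }
  hits
}

cols <- c("PC1", "PC2", "PC3")
check_selection("PC[1-3]", cols)          # returns the matching names
try(check_selection("PC00[1-3]", cols))   # descriptive error instead
```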

Upkeep for applicable (2023)

Pre-history

  • usethis::use_readme_rmd()
  • usethis::use_roxygen_md()
  • usethis::use_github_links()
  • usethis::use_pkgdown_github_pages()
  • usethis::use_tidy_github_labels()
  • usethis::use_tidy_style()
  • usethis::use_tidy_description()
  • urlchecker::url_check()

2020

  • usethis::use_package_doc()
    Consider letting usethis manage your @importFrom directives here.
    usethis::use_import_from() is handy for this.
  • usethis::use_testthat(3) and upgrade to 3e, testthat 3e vignette
  • Align the names of R/ files and test/ files for workflow happiness.
    The docs for usethis::use_r() include a helpful script.
usethis::rename_files() may be useful.

2021

  • usethis::use_tidy_dependencies()
  • usethis::use_tidy_github_actions() and update artisanal actions to use setup-r-dependencies
  • Remove check environments section from cran-comments.md
  • Bump required R version in DESCRIPTION to 3.6
  • Use lifecycle instead of artisanal deprecation messages, as described in Communicate lifecycle changes in your functions

2022

2023

Necessary:

  • Double check license file uses '[package] authors' as copyright holder. Run use_mit_license()
  • Update logo (https://github.com/rstudio/hex-stickers); run use_tidy_logo()
  • usethis::use_tidy_coc()
  • usethis::use_tidy_github_actions()

Optional:

  • Review 2022 checklist to see if you completed the pkgdown updates
  • Prefer pak::pak("org/pkg") over devtools::install_github("org/pkg") in README
  • Consider running use_tidy_dependencies() and/or replace compat files with use_standalone()
  • use_standalone("r-lib/rlang", "types-check") instead of home grown argument checkers
  • Add alt-text to pictures, plots, etc; see https://posit.co/blog/knitr-fig-alt/ for examples

Created on 2023-10-30 with usethis::use_tidy_upkeep_issue(), using usethis v2.2.2

Interpretation of PCA scores

First of all, thank you for this package. I find it very useful.

This question is not directly related to the package but the output interpretation.

My goal is to provide the user with the predicted class along with the applicability of the observation. Since I have continuous variables, I decided to use the PCA score and take the distance_pctl column. The percentile interpretation is simple: 95 means only 5% of observations were more different than your query. Still, it raises the next question: what should the threshold be to separate queries that are acceptable from those that are very different, for which the prediction should therefore be rejected?

I know this is a tricky question, but I think a categorization of the applicability value would be useful to improve the interpretation of the results. I thought of setting a threshold of, say, 95, but I'm wondering whether there are more elegant approaches.

Thank you!!
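Not a maintainer answer, but one lightweight way to implement the categorization described above is base R's cut(); the 90/99 cutoffs here are purely illustrative and should come from domain knowledge:

```r
# Illustrative only: bucket applicability percentiles into coarse labels.
# The 90/99 cutoffs are arbitrary; domain knowledge should drive them.
distance_pctl <- c(12, 45, 88, 96, 99.9)
applicability <- cut(
  distance_pctl,
  breaks = c(-Inf, 90, 99, Inf),
  labels = c("inside domain", "borderline", "likely extrapolation")
)
applicability
```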

Move `master` branch to `main`

The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.

That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.

The purpose of this issue is to:

  • Help us firm up the list of targeted repositories
  • Make sure all maintainers are aware of what's coming
  • Give us an issue to close when the job is done
  • Give us a place to put advice for collaborators re: how to adapt

message id: euphoric_snowdog

leverage computations

After our conversation with @bwlewis, let's use the QR decomposition.

Here's some example code:

options(width = 100)
# Use the QR decomposition to get (X'X)^{-1}. Fail if it doesn't work. 
get_inv <- function(X) {
  if (!is.matrix(X)) {
    X <- as.matrix(X)
  }
  XpX <- t(X) %*% X
  XpX_inv <- try(qr.solve(XpX), silent = TRUE)
  if (inherits(XpX_inv, "try-error")) {
    stop(as.character(XpX_inv), call. = FALSE)
  }
  dimnames(XpX_inv) <- NULL
  XpX_inv
}



X1 <- mtcars[, -1]
round(get_inv(X1), 3)
#>         [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
#>  [1,]  0.085 -0.001 -0.001 -0.002  0.049 -0.027  0.120  0.032  0.018 -0.018
#>  [2,] -0.001  0.000  0.000 -0.001 -0.004  0.001  0.001  0.000  0.000  0.001
#>  [3,] -0.001  0.000  0.000  0.000  0.001  0.000 -0.002  0.000 -0.001 -0.001
#>  [4,] -0.002 -0.001  0.000  0.313  0.091 -0.049  0.005 -0.122 -0.086 -0.030
#>  [5,]  0.049 -0.004  0.001  0.091  0.507 -0.086  0.043  0.064  0.088 -0.158
#>  [6,] -0.027  0.001  0.000 -0.049 -0.086  0.031 -0.064  0.021 -0.036  0.031
#>  [7,]  0.120  0.001 -0.002  0.005  0.043 -0.064  0.625  0.142 -0.003  0.020
#>  [8,]  0.032  0.000  0.000 -0.122  0.064  0.021  0.142  0.570 -0.178  0.022
#>  [9,]  0.018  0.000 -0.001 -0.086  0.088 -0.036 -0.003 -0.178  0.265 -0.066
#> [10,] -0.018  0.001 -0.001 -0.030 -0.158  0.031  0.020  0.022 -0.066  0.096

bad <- cbind(int = rep(1, 150), model.matrix(~ .  + 0, data = iris))
get_inv(bad)
#> Error: Error in qr.solve(XpX) : singular matrix 'a' in solve

# A new sample: 
unk <- as.matrix(mtcars[3, -1, drop = FALSE ])

# leverage value
unk %*% get_inv(X1) %*% t(unk)
#>            Datsun 710
#> Datsun 710  0.2191155

# compare to base R:
lm_fit <- lm(mpg ~ . -  1, data = mtcars)
hatvalues(lm_fit)[3]
#> Datsun 710 
#>  0.2191155

Created on 2019-07-10 by the reprex package (v0.2.1)

We might want to have an option for including the intercept or not. I'm on the fence about it.
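For what it's worth, the intercept option would presumably amount to binding a column of ones onto X; a quick base-R check (my own sketch, not package code) that this reproduces hatvalues() from a model fit with an intercept:

```r
# Leverage with an intercept column, checked against base R's hatvalues().
X1 <- as.matrix(mtcars[, -1])
X_int <- cbind(1, X1)                        # add the intercept column

XpX_inv <- qr.solve(t(X_int) %*% X_int)
unk <- X_int[3, , drop = FALSE]              # Datsun 710
h_manual <- as.numeric(unk %*% XpX_inv %*% t(unk))

lm_fit <- lm(mpg ~ ., data = mtcars)         # intercept included
h_lm <- unname(hatvalues(lm_fit)[3])

all.equal(h_manual, h_lm)
```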

isolation forests

We could add an ad_iso_forest() method that would use an isolation forest to find anomalies.

The isotree package has a lot of features but requires an additional serialization step to save the model. The isolation package might be the best approach.

Pinging: @kevin-m-kent

Hotelling T^2 for Outlier Detection

Feature - Hotelling T2 for Outlier Detection

In chemometric models that use PCA/PLS or similar methods, we often use T2 for outlier detection. This could be a nice complement to the score.pca method that is already implemented.

Here is an example taken from Chapter 6 of Process Improvement using Data by Kevin Dunn.

[figure: T2 example from Chapter 6 of Process Improvement using Data]

Is this too niche? Worthwhile to implement? I'm also interested in the isolation methods, such as implementing isolation forests from #25. I'm not as familiar with those methods, so I'd need to learn more about them first.
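To sketch what the statistic looks like (my own base-R illustration, not package code): with k retained components, T2 for observation i is the sum over components of t_ik^2 / lambda_k, where lambda_k is the variance of component k; one common approximate control limit uses the F distribution.

```r
# Illustrative Hotelling T^2 from PCA scores, base R only.
X <- scale(mtcars[, -1])
pca <- prcomp(X)

k <- 3                       # number of retained components (arbitrary here)
n <- nrow(X)
scores <- pca$x[, 1:k]
lambda <- pca$sdev[1:k]^2    # variance of each retained component

# Hotelling T^2 for each observation: sum_k t_ik^2 / lambda_k
T2 <- rowSums(sweep(scores^2, 2, lambda, "/"))

# One common approximate 95% control limit based on the F distribution
limit <- k * (n - 1) / (n - k) * qf(0.95, k, n - k)
which(T2 > limit)
```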

Testing fitting & scoring functions

library(hardhat)
library(dplyr)

# ---------------------------------------------------------
# Testing model constructor
# ---------------------------------------------------------

# Run constructor.R
manual_model <- new_ad_pca("my_coef", default_xy_blueprint())
manual_model
names(manual_model)

manual_model$blueprint

# ---------------------------------------------------------
# Testing model fitting implementation
# ---------------------------------------------------------

# Run pca-fit.R
ad_pca_impl(iris %>% select(Sepal.Width))

# ---------------------------------------------------------
# Simulating user input and pass it to the fit bridge
# ---------------------------------------------------------

# Simulating formula interface
processed_1 <- mold(~., iris)
ad_pca_bridge(processed_1)

# Simulating x interface
iris_sub <- iris %>% select(-Species)
processed_2 <- mold(iris_sub, NA_real_)
ad_pca_bridge(processed_2)

# Simulating multiple outcomes. Error expected.
multi_outcome <- mold(Sepal.Width + Petal.Width ~ Sepal.Length + Species, iris)
ad_pca_bridge(multi_outcome)

# ---------------------------------------------------------
# Testing user facing fitting function
# ---------------------------------------------------------

predictors <- iris[c("Sepal.Width", "Petal.Width")]

# Data frame predictor
predictor <- iris['Sepal.Length']
ad_pca(predictor)

# Vector predictor.
# We should get the following error:
# "Error: `ad_pca()` is not defined for a 'numeric'."
predictor <- iris$Sepal.Length
ad_pca(predictor)

# Formula interface
ad_pca(~., iris)

# Using recipes. Fails with "Error: No variables or terms were selected."
library(recipes)
rec <- recipe(~., iris) %>%
  step_log(Sepal.Width) %>%
  step_dummy(Species, one_hot = TRUE)
ad_pca(rec, iris)


# ---------------------------------------------------------
# Testing model scoring implementation
# ---------------------------------------------------------

# Run pca-score.R
model <- ad_pca(Sepal.Width ~ Sepal.Length + Species, iris)
predictors <- forge(iris, model$blueprint)$predictors
predictors <- as.matrix(predictors)
score_ad_pca_numeric(model, predictors)


# ---------------------------------------------------------
# Testing score bridge function
# ---------------------------------------------------------

model <- ad_pca(~., iris)
predictors <- forge(iris, model$blueprint)$predictors
score_ad_pca_bridge("numeric", model, predictors)


# ---------------------------------------------------------
# Testing score interface function
# ---------------------------------------------------------

# Run 0.R
model <- ad_pca(~., iris)
score(model, iris)

# We should get an error:
# "Error: The class of `new_data`, 'factor', is not recognized."
# since `iris$Species` is not a data.frame
score(model, iris$Species)

# We should get an error:
# "Error: The following required columns are missing: 'Sepal.Length'."
# since `Sepal.Length` column is missing.
score(model, subset(iris, select = -Sepal.Length))

# The column `Species` is silently converted to a factor.
iris_character_col <- transform(iris, Species = as.character(Species))
score(model, iris_character_col)

# We should get an error:
# "Error: Can't cast `x$Species` <double> to `to$Species` <factor<12d60>>."
# since `Species` can't be forced to be a factor
iris_double_col <- transform(iris, Species = 1)
score(model, iris_double_col)

"pctl" output should either be reported [0, 1] or [0, 100], but not both.

Hello,

First, I love the package and the idea.

In playing around with apd_pca() and score(), I've stumbled across a potentially confusing format in the output of score().

When calculating the percentile of the PCA distance, once that percentile reaches ~100, the output is converted to "1". Is this the intended behavior?

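For anyone hitting this, the two conventions can be seen side by side with plain base R (illustrative, not the package's internals): a percentile can be reported on either [0, 1] (raw ecdf output) or [0, 100], and mixing the two in one column is what makes the output confusing.

```r
# Two equivalent percentile scales; a column should use exactly one of them.
set.seed(1)
train_dist <- sqrt(rchisq(1000, df = 5))
new_dist <- c(0.5, 2, 10)

p01  <- ecdf(train_dist)(new_dist)   # on [0, 1]
p100 <- 100 * p01                    # on [0, 100]

# Pick one convention and keep it; a bare "1" is ambiguous otherwise
cbind(p01 = round(p01, 4), p100 = round(p100, 2))
```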

Upkeep for applicable

2023

Necessary:

  • Update copyright holder in DESCRIPTION: person(given = "Posit Software, PBC", role = c("cph", "fnd"))
  • Double check license file uses '[package] authors' as copyright holder. Run use_mit_license()
  • Update email addresses *@rstudio.com -> *@posit.co
  • Update logo (https://github.com/rstudio/hex-stickers); run use_tidy_logo()
  • usethis::use_tidy_coc()
  • usethis::use_tidy_github_actions()

Optional:

  • Review 2022 checklist to see if you completed the pkgdown updates
  • Prefer pak::pak("org/pkg") over devtools::install_github("org/pkg") in README
  • Consider running use_tidy_dependencies() and/or replace compat files with use_standalone()
  • use_standalone("r-lib/rlang", "types-check") instead of home grown argument checkers
  • Add alt-text to pictures, plots, etc; see https://posit.co/blog/knitr-fig-alt/ for examples

Additions to the `score_ad_pca_numeric`

# add the distance column
# notes:
# te <- score(mod, test)
# diffs <- sweep(as.matrix(te), 2, means)  # center each column on its training mean
# sq_diff <- diffs^2
# dists <- sqrt(rowSums(sq_diff))          # Euclidean distance for each row
