
applicable

[badges: R-CMD-check, Codecov test coverage, Lifecycle: experimental, CRAN status]

Introduction

There are times when a model’s prediction should be taken with some skepticism. For example, if a new data point is substantially different from the training set, its predicted value may be suspect. In chemistry, it is not uncommon to create an “applicability domain” model that measures the amount of potential extrapolation new samples have from the training set. applicable contains different methods to measure how much a new data point is an extrapolation from the original data (if at all).
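To make the idea concrete, here is a minimal base-R sketch of the same concept (this is not the package's actual implementation): score how far new samples fall from the training data in principal-component space.

```r
# Illustrative only: a hand-rolled applicability check using PCA distance.
train  <- mtcars[1:25, -1]   # "training" predictors
new_pt <- mtcars[26:32, -1]  # "new" samples to assess

pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Distance of each sample from the center of the training data,
# measured in the principal-component space
dist_from_center <- function(x, pca) {
  scores <- predict(pca, newdata = x)
  sqrt(rowSums(scores^2))
}

train_dist <- dist_from_center(train, pca)
new_dist   <- dist_from_center(new_pt, pca)

# Percentile of each new distance relative to the training distances;
# values near 100 suggest potential extrapolation
round(100 * ecdf(train_dist)(new_dist), 1)
```

The package's `apd_pca()`/`score()` functions are the supported way to do this; the sketch above only shows the shape of the computation.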

Installation

You can install the released version of applicable from CRAN with:

install.packages("applicable")

Install the development version of applicable from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/applicable")

Vignettes

To learn about how to use applicable, check out the vignettes:

  • vignette("binary-data", "applicable"): Learn different methods to analyze binary data.

  • vignette("continuous-data", "applicable"): Learn different methods to analyze continuous data.

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

applicable's People

Contributors

emilhvitfeldt, hfrick, juliasilge, marlycormar, topepo


applicable's Issues

Breaking changes in dependency `isotree`

I am the maintainer of the isotree package, which is a 'Suggests' dependency of {applicable} on CRAN:
https://cran.r-project.org/web/packages/applicable/index.html

I would like to push an update to {isotree} which would break one of the unit tests of {applicable}.

In particular, I would like to change the default of the ndim argument to the isolation.forest() function:
https://github.com/david-cortes/isotree/blob/ad49b9717b41ce9bab86f2aeebe742679f0fca58/R/isoforest.R#L996
From the current (CRAN) default of min(3, NCOL(data)) to 1.

This would generate a problem in this unit test for {applicable}:
https://github.com/cran/applicable/blob/b66153447194c71778f7c04bc258722cc5cc5257/tests/testthat/test-isolation-fit.R#L26

To get the old behavior, one would now need to pass ndim = 2 in this test:

res_rec <- apd_isolation(rec, cells_tr, ntrees = 10, nthreads = 1, ndim = 2)

It would be ideal if an updated version with this change could be submitted to CRAN.

Leaving this as an issue instead of a PR, since the code here looks out of sync with the CRAN release and doesn't have the problematic file committed.

Upkeep for applicable

Pre-history

  • usethis::use_readme_rmd()
  • usethis::use_roxygen_md()
  • usethis::use_github_links()
  • usethis::use_pkgdown_github_pages()
  • usethis::use_tidy_labels()
  • usethis::use_tidy_style()
  • usethis::use_tidy_description()
  • urlchecker::url_check()

2020

  • usethis::use_package_doc()
    Consider letting usethis manage your @importFrom directives here.
    usethis::use_import_from() is handy for this.
  • usethis::use_testthat(3) and upgrade to 3e, testthat 3e vignette
  • Align the names of R/ files and test/ files for workflow happiness.
    usethis::rename_files() can be helpful.

2021

  • usethis::use_tidy_dependencies()
  • usethis::use_tidy_github_actions() and update artisanal actions to use setup-r-dependencies
  • Remove check environments section from cran-comments.md
  • Bump required R version in DESCRIPTION to 3.4
  • Use lifecycle instead of artisanal deprecation messages, as described in Communicate lifecycle changes in your functions
  • Add RStudio to DESCRIPTION as funder, if appropriate

some initial notes

Based on our previous conversation

  • I think that, for unsupervised models, we can use y = NA in these calls to mold().

  • For this line, you won't need to pass in outcome. The line outcome <- processed$outcomes[[1]] won't be needed either.

  • fit-implementation.R would include the call to prcomp() and that object would be returned here (instead of the coefs thing that gets automatically populated)

  • Since we will have multiple ad_* functions, you may want to combine the fit-* files into a single file for PCA (same for the predict-* files too).

Output an error message when column names don't match

Output a descriptive error message when the selector columns do not exist in the dataset. For example, in the code below, the model produced by apd_pca(predictors) doesn't contain columns matching "PC00[1-3]".

library(applicable)

predictors <- mtcars[, -1]
mod <- apd_pca(predictors)
autoplot(mod, matches("PC00[1-3]"))

As a result, it throws the following error.

Error: At least one layer must contain all faceting variables: component.

  • Plot is missing component
  • Layer 1 is missing component
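One possible shape for that check (a sketch, not a proposal for the package's internals): validate the selector against the available score columns and fail with a descriptive message before any plotting happens. check_selection() is a hypothetical helper name.

```r
# Sketch of a pre-flight check: fail early, with a descriptive message,
# when a regex selector matches none of the available columns.
check_selection <- function(pattern, columns) {
  hits <- grep(pattern, columns, value = TRUE)
  if (length(hits) == 0) {
    stop(
      "No columns match the pattern '", pattern, "'. ",
      "Available columns: ", paste(columns, collapse = ", "),
      call. = FALSE
    )
  }
  hits
}

cols <- c("PC1", "PC2", "PC3")
check_selection("PC[1-3]", cols)          # returns the matching names
try(check_selection("PC00[1-3]", cols))   # descriptive error instead
```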

Upkeep for applicable (2023)

Pre-history

  • usethis::use_readme_rmd()
  • usethis::use_roxygen_md()
  • usethis::use_github_links()
  • usethis::use_pkgdown_github_pages()
  • usethis::use_tidy_github_labels()
  • usethis::use_tidy_style()
  • usethis::use_tidy_description()
  • urlchecker::url_check()

2020

  • usethis::use_package_doc()
    Consider letting usethis manage your @importFrom directives here.
    usethis::use_import_from() is handy for this.
  • usethis::use_testthat(3) and upgrade to 3e, testthat 3e vignette
  • Align the names of R/ files and test/ files for workflow happiness.
    The docs for usethis::use_r() include a helpful script.
usethis::rename_files() may be useful.

2021

  • usethis::use_tidy_dependencies()
  • usethis::use_tidy_github_actions() and update artisanal actions to use setup-r-dependencies
  • Remove check environments section from cran-comments.md
  • Bump required R version in DESCRIPTION to 3.6
  • Use lifecycle instead of artisanal deprecation messages, as described in Communicate lifecycle changes in your functions

2022

2023

Necessary:

  • Double check license file uses '[package] authors' as copyright holder. Run use_mit_license()
  • Update logo (https://github.com/rstudio/hex-stickers); run use_tidy_logo()
  • usethis::use_tidy_coc()
  • usethis::use_tidy_github_actions()

Optional:

  • Review 2022 checklist to see if you completed the pkgdown updates
  • Prefer pak::pak("org/pkg") over devtools::install_github("org/pkg") in README
  • Consider running use_tidy_dependencies() and/or replace compat files with use_standalone()
  • use_standalone("r-lib/rlang", "types-check") instead of home grown argument checkers
  • Add alt-text to pictures, plots, etc; see https://posit.co/blog/knitr-fig-alt/ for examples

Created on 2023-10-30 with usethis::use_tidy_upkeep_issue(), using usethis v2.2.2

Interpretation of PCA scores

First of all, thank you for this package. I find it very useful.

This question is not directly related to the package but the output interpretation.

My goal is to provide the user with the predicted class along with the applicability of the observation. Since I have continuous variables, I decided to use the PCA score and take the distance_pctl column. The percentile interpretation is simple: 95 means only 5% of observations were more different than your query. Still, it raises the next question: what should the threshold be to separate queries that are acceptable from those that are very different, for which the prediction should therefore be rejected?

I know this is a tricky question, but I think a categorization of the applicability value would be useful to improve the interpretation of the results. I thought of setting a threshold of, say, 95, but I'm wondering whether there are more elegant approaches.

Thank you!!
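Not a maintainer answer, but one lightweight way to implement the categorization described above is base R's cut(); the 90/99 cutoffs here are purely illustrative and should come from domain knowledge:

```r
# Illustrative only: bucket applicability percentiles into coarse labels.
# The 90/99 cutoffs are arbitrary; domain knowledge should drive them.
distance_pctl <- c(12, 45, 88, 96, 99.9)
applicability <- cut(
  distance_pctl,
  breaks = c(-Inf, 90, 99, Inf),
  labels = c("inside domain", "borderline", "likely extrapolation")
)
applicability
```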

Move `master` branch to `main`

The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.

That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.

The purpose of this issue is to:

  • Help us firm up the list of targeted repositories
  • Make sure all maintainers are aware of what's coming
  • Give us an issue to close when the job is done
  • Give us a place to put advice for collaborators re: how to adapt

message id: euphoric_snowdog

leverage computations

After our conversation with @bwlewis, let's use the QR decomposition.

Here's some example code:

options(width = 100)
# Use the QR decomposition to get (X'X)^{-1}. Fail if it doesn't work. 
get_inv <- function(X) {
  if (!is.matrix(X)) {
    X <- as.matrix(X)
  }
  XpX <- t(X) %*% X
  XpX_inv <- try(qr.solve(XpX), silent = TRUE)
  if (inherits(XpX_inv, "try-error")) {
    stop(as.character(XpX_inv), call. = FALSE)
  }
  dimnames(XpX_inv) <- NULL
  XpX_inv
}



X1 <- mtcars[, -1]
round(get_inv(X1), 3)
#>         [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
#>  [1,]  0.085 -0.001 -0.001 -0.002  0.049 -0.027  0.120  0.032  0.018 -0.018
#>  [2,] -0.001  0.000  0.000 -0.001 -0.004  0.001  0.001  0.000  0.000  0.001
#>  [3,] -0.001  0.000  0.000  0.000  0.001  0.000 -0.002  0.000 -0.001 -0.001
#>  [4,] -0.002 -0.001  0.000  0.313  0.091 -0.049  0.005 -0.122 -0.086 -0.030
#>  [5,]  0.049 -0.004  0.001  0.091  0.507 -0.086  0.043  0.064  0.088 -0.158
#>  [6,] -0.027  0.001  0.000 -0.049 -0.086  0.031 -0.064  0.021 -0.036  0.031
#>  [7,]  0.120  0.001 -0.002  0.005  0.043 -0.064  0.625  0.142 -0.003  0.020
#>  [8,]  0.032  0.000  0.000 -0.122  0.064  0.021  0.142  0.570 -0.178  0.022
#>  [9,]  0.018  0.000 -0.001 -0.086  0.088 -0.036 -0.003 -0.178  0.265 -0.066
#> [10,] -0.018  0.001 -0.001 -0.030 -0.158  0.031  0.020  0.022 -0.066  0.096

bad <- cbind(int = rep(1, 150), model.matrix(~ .  + 0, data = iris))
get_inv(bad)
#> Error: Error in qr.solve(XpX) : singular matrix 'a' in solve

# A new sample: 
unk <- as.matrix(mtcars[3, -1, drop = FALSE ])

# leverage value
unk %*% get_inv(X1) %*% t(unk)
#>            Datsun 710
#> Datsun 710  0.2191155

# compare to base R:
lm_fit <- lm(mpg ~ . -  1, data = mtcars)
hatvalues(lm_fit)[3]
#> Datsun 710 
#>  0.2191155

Created on 2019-07-10 by the reprex package (v0.2.1)

We might want to have an option for including the intercept or not. I'm on the fence about it.
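For what it's worth, the intercept option would presumably amount to binding a column of ones onto X; a quick base-R check (my own sketch, not package code) that this reproduces hatvalues() from a model fit with an intercept:

```r
# Leverage with an intercept column, checked against base R's hatvalues().
X1 <- as.matrix(mtcars[, -1])
X_int <- cbind(1, X1)                        # add the intercept column

XpX_inv <- qr.solve(t(X_int) %*% X_int)
unk <- X_int[3, , drop = FALSE]              # Datsun 710
h_manual <- as.numeric(unk %*% XpX_inv %*% t(unk))

lm_fit <- lm(mpg ~ ., data = mtcars)         # intercept included
h_lm <- unname(hatvalues(lm_fit)[3])

all.equal(h_manual, h_lm)
```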

isolation forests

We could add an ad_iso_forest() method that would use an isolation forest to find anomalies.

The isotree package has a lot of features but requires an additional serialization step to save the model. The isolation package might be the best approach.

Pinging: @kevin-m-kent

Hotelling T^2 for Outlier Detection

Feature - Hotelling T2 for Outlier Detection

In chemometric models that use PCA/PLS or similar methods, we often use T2 for outlier detection. This could be a nice complement to the score.pca method that is already implemented.

Here is an example taken from Chapter 6 of Process Improvement using Data by Kevin Dunn.

[figure: T2 example from Chapter 6 of Process Improvement using Data]

Is this too niche? Worthwhile to implement? I'm also interested in the isolation methods, such as implementing isolation forests from #25. I'm not as familiar with those methods, so I'd need to learn more about them first.
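To sketch what the statistic looks like (my own base-R illustration, not package code): with k retained components, T2 for observation i is the sum over components of t_ik^2 / lambda_k, where lambda_k is the variance of component k; one common approximate control limit uses the F distribution.

```r
# Illustrative Hotelling T^2 from PCA scores, base R only.
X <- scale(mtcars[, -1])
pca <- prcomp(X)

k <- 3                       # number of retained components (arbitrary here)
n <- nrow(X)
scores <- pca$x[, 1:k]
lambda <- pca$sdev[1:k]^2    # variance of each retained component

# Hotelling T^2 for each observation: sum_k t_ik^2 / lambda_k
T2 <- rowSums(sweep(scores^2, 2, lambda, "/"))

# One common approximate 95% control limit based on the F distribution
limit <- k * (n - 1) / (n - k) * qf(0.95, k, n - k)
which(T2 > limit)
```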

Testing fitting & scoring functions

library(hardhat)
library(dplyr)

# ---------------------------------------------------------
# Testing model constructor
# ---------------------------------------------------------

# Run constructor.R
manual_model <- new_ad_pca("my_coef", default_xy_blueprint())
manual_model
names(manual_model)

manual_model$blueprint

# ---------------------------------------------------------
# Testing model fitting implementation
# ---------------------------------------------------------

# Run pca-fit.R
ad_pca_impl(iris %>% select(Sepal.Width))

# ---------------------------------------------------------
# Simulating user input and pass it to the fit bridge
# ---------------------------------------------------------

# Simulating formula interface
processed_1 <- mold(~., iris)
ad_pca_bridge(processed_1)

# Simulating x interface
iris_sub <- iris %>% select(-Species)
processed_2 <- mold(iris_sub, NA_real_)
ad_pca_bridge(processed_2)

# Simulating multiple outcomes. Error expected.
multi_outcome <- mold(Sepal.Width + Petal.Width ~ Sepal.Length + Species, iris)
ad_pca_bridge(multi_outcome)

# ---------------------------------------------------------
# Testing user facing fitting function
# ---------------------------------------------------------

predictors <- iris[c("Sepal.Width", "Petal.Width")]

# Data frame predictor
predictor <- iris['Sepal.Length']
ad_pca(predictor)

# Vector predictor.
# We should get the following error:
# "Error: `ad_pca()` is not defined for a 'numeric'."
predictor <- iris$Sepal.Length
ad_pca(predictor)

# Formula interface
ad_pca(~., iris)

# Using recipes. Fails with "Error: No variables or terms were selected."
library(recipes)
rec <- recipe(~., iris) %>%
  step_log(Sepal.Width) %>%
  step_dummy(Species, one_hot = TRUE)
ad_pca(rec, iris)


# ---------------------------------------------------------
# Testing model scoring implementation
# ---------------------------------------------------------

# Run pca-score.R
model <- ad_pca(Sepal.Width ~ Sepal.Length + Species, iris)
predictors <- forge(iris, model$blueprint)$predictors
predictors <- as.matrix(predictors)
score_ad_pca_numeric(model, predictors)


# ---------------------------------------------------------
# Testing score bridge function
# ---------------------------------------------------------

model <- ad_pca(~., iris)
predictors <- forge(iris, model$blueprint)$predictors
score_ad_pca_bridge("numeric", model, predictors)


# ---------------------------------------------------------
# Testing score interface function
# ---------------------------------------------------------

# Run 0.R
model <- ad_pca(~., iris)
score(model, iris)

# We should get an error:
# "Error: The class of `new_data`, 'factor', is not recognized."
# since `iris$Species` is not a data.frame
score(model, iris$Species)

# We should get an error:
# "Error: The following required columns are missing: 'Sepal.Length'."
# since `Sepal.Length` column is missing.
score(model, subset(iris, select = -Sepal.Length))

# The column `Species` is silently converted to a factor.
iris_character_col <- transform(iris, Species = as.character(Species))
score(model, iris_character_col)

# We should get an error:
# "Error: Can't cast `x$Species` <double> to `to$Species` <factor<12d60>>."
# since `Species` can't be forced to be a factor
iris_double_col <- transform(iris, Species = 1)
score(model, iris_double_col)

"pctl" output should either be reported [0, 1] or [0, 100], but not both.

Hello,

First, I love the package and the idea.

In playing around with apd_pca() and score(), I've stumbled across a potentially confusing format in the output of score().

When calculating the percentile of the PCA distance, once that percentile reaches ~100, the output is converted to "1". Is this the intended behavior?

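For anyone hitting this, the two conventions can be seen side by side with plain base R (illustrative, not the package's internals): a percentile can be reported on either [0, 1] (raw ecdf output) or [0, 100], and mixing the two in one column is what makes the output confusing.

```r
# Two equivalent percentile scales; a column should use exactly one of them.
set.seed(1)
train_dist <- sqrt(rchisq(1000, df = 5))
new_dist <- c(0.5, 2, 10)

p01  <- ecdf(train_dist)(new_dist)   # on [0, 1]
p100 <- 100 * p01                    # on [0, 100]

# Pick one convention and keep it; a bare "1" is ambiguous otherwise
cbind(p01 = round(p01, 4), p100 = round(p100, 2))
```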

Upkeep for applicable

2023

Necessary:

  • Update copyright holder in DESCRIPTION: person(given = "Posit Software, PBC", role = c("cph", "fnd"))
  • Double check license file uses '[package] authors' as copyright holder. Run use_mit_license()
  • Update email addresses *@rstudio.com -> *@posit.co
  • Update logo (https://github.com/rstudio/hex-stickers); run use_tidy_logo()
  • usethis::use_tidy_coc()
  • usethis::use_tidy_github_actions()

Optional:

  • Review 2022 checklist to see if you completed the pkgdown updates
  • Prefer pak::pak("org/pkg") over devtools::install_github("org/pkg") in README
  • Consider running use_tidy_dependencies() and/or replace compat files with use_standalone()
  • use_standalone("r-lib/rlang", "types-check") instead of home grown argument checkers
  • Add alt-text to pictures, plots, etc; see https://posit.co/blog/knitr-fig-alt/ for examples

Additions to the `score_ad_pca_numeric`

# add the distance column
# notes:
# te <- score(mod, test)
# diffs <- sweep(as.matrix(te), 2, means)  # center each column on its training mean
# sq_diff <- diffs^2
# dists <- sqrt(rowSums(sq_diff))          # Euclidean distance for each row
