tidymodels / model-implementation-principles

recommendations for creating R modeling packages

Home Page: https://tidymodels.github.io/model-implementation-principles/

HTML 68.21% CSS 20.26% R 0.06% JavaScript 11.47%

model-implementation-principles's Introduction

model-implementation-principles

recommendations for creating R modeling packages

model-implementation-principles's People

Contributors

alexpghayes, batpigandme, davisvaughan, gvwilson, juliasilge, marlycormar, topepo, wligtenberg

model-implementation-principles's Issues

Add section on how to test models

Things that should be included:

  • How to test a formula / recipe interface
  • How to test code that compares results to computation from other packages, or software written in other languages (e.g. comparing your R implementation to someone's reference implementation in MATLAB)
  • How to test models that are expensive to compute / will time out on Travis (e.g. by saving fitted models)

Some of this should overlap with the R package book (see related issues hadley/r-pkgs#477, hadley/r-pkgs#481).
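A minimal sketch of the last two points, using testthat; my_fit() and the .rds fixture paths are hypothetical stand-ins, and the reference values would be precomputed elsewhere (e.g. in MATLAB) and stored with the package:

library(testthat)

test_that("coefficients match the MATLAB reference implementation", {
  # reference values computed elsewhere and stored with the package (hypothetical path)
  reference <- readRDS(test_path("reference", "matlab_coefs.rds"))
  fit <- my_fit(mpg ~ ., data = mtcars)  # hypothetical modeling function
  # compare with a tolerance, since floating point details will differ
  expect_equal(unname(coef(fit)), reference, tolerance = 1e-6)
})

test_that("previously saved expensive model still predicts", {
  # the model is fit and saved ahead of time, so CI only pays for prediction
  fit <- readRDS(test_path("fixtures", "big_model.rds"))  # hypothetical fixture
  expect_equal(nrow(predict(fit, mtcars)), nrow(mtcars))
})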

Recommend tuning knob for resources allocated to model training

A broad class of hyperparameter tuning methods relies on being able to specify some amount of resources (iterations, time) to spend training a model. An even broader class of hyperparameter tuning algorithms is possible if model training can be paused and resumed.

We should develop some guidelines on how to do this.
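As a rough sketch (not an existing API) of what that could look like: the fit function takes an explicit iteration budget and returns enough state that a later call can pick up where it left off. All names below are made up, and the "model" is just a toy running-mean estimate.

# Toy "model": estimate a mean by incremental updates, keeping enough state
# (current estimate, update count) that training can be resumed later.
fit_incremental <- function(y, iterations = 10, state = NULL) {
  est  <- if (is.null(state)) 0 else state$est
  step <- if (is.null(state)) 1 else state$step
  for (i in seq_len(iterations)) {
    est  <- est + (sample(y, 1) - est) / step
    step <- step + 1
  }
  structure(list(est = est, step = step), class = "incremental_fit")
}

# Resume with an additional budget instead of starting over
more_iterations <- function(object, y, iterations) {
  fit_incremental(y, iterations = iterations, state = object)
}

f1 <- fit_incremental(mtcars$mpg, iterations = 5)         # small initial budget
f2 <- more_iterations(f1, mtcars$mpg, iterations = 100)   # spend more only if promising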

Move `master` branch to `main`

The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.

That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.

The purpose of this issue is to:

  • Help us firm up the list of targeted repositories
  • Make sure all maintainers are aware of what's coming
  • Give us an issue to close when the job is done
  • Give us a place to put advice for collaborators re: how to adapt

transform()-like generic for unsupervised methods

Moving the discussion here from Slack. We should define a generic for unsupervised transformations. Jenny pointed out that transform() would be a bad name since it would have a name conflict with base R.
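For concreteness, a minimal sketch of such a generic; apply_transform() is only a placeholder name, not a proposal:

apply_transform <- function(object, new_data, ...) {
  UseMethod("apply_transform")
}

# Example method for a simple centering/scaling "model"
apply_transform.scaler <- function(object, new_data, ...) {
  scale(new_data, center = object$center, scale = object$scale)
}

scaler <- structure(
  list(center = colMeans(mtcars), scale = apply(mtcars, 2, sd)),
  class = "scaler"
)
head(apply_transform(scaler, mtcars))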

Check for integer values for arguments

I don't know if this fits within this repo so please bear with me.

I have been thinking about what happens when we check the types of the arguments we pass to our models, and for the most part that task is fairly easy. But I feel we have overlooked the case of arguments such as times, epochs, and mtry that require integers. Often the approach is to silently round the double down, which I think is a little dangerous.

Examples

library(ranger)

ranger(
  mpg ~ ., 
  data = mtcars, 
  mtry = 3.8, 
  num.trees = 20.2, 
  importance = 'impurity'
)
#> Ranger result
#> 
#> Call:
#>  ranger(mpg ~ ., data = mtcars, mtry = 3.8, num.trees = 20.2,      importance = "impurity") 
#> 
#> Type:                             Regression 
#> Number of trees:                  20 
#> Sample size:                      32 
#> Number of independent variables:  10 
#> Mtry:                             3 
#> Target node size:                 5 
#> Variable importance mode:         impurity 
#> Splitrule:                        variance 
#> OOB prediction error (MSE):       5.142068 
#> R squared (OOB):                  0.8584392

library(parsnip)

rand_forest(mode = "regression", mtry = 2.3) %>%
  fit(mpg ~ ., 
      data = mtcars,
      engine = "ranger")
#> parsnip model object
#> 
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(formula = formula, data = data, mtry = 2.3, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 
#> 
#> Type:                             Regression 
#> Number of trees:                  500 
#> Sample size:                      32 
#> Number of independent variables:  10 
#> Mtry:                             2 
#> Target node size:                 5 
#> Variable importance mode:         none 
#> Splitrule:                        variance 
#> OOB prediction error (MSE):       6.186759 
#> R squared (OOB):                  0.829679

I feel like this problem is already being addressed in the vctrs package with vec_cast(), but it might be something to consider when implementing models in R.
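For example (output shown approximately; the exact error text depends on the vctrs version):

vctrs::vec_cast(3, integer())
#> [1] 3

# a fractional value signals a lossy cast instead of silently truncating
vctrs::vec_cast(3.8, integer())
#> Error: Can't convert from <double> to <integer> due to loss of precision.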

Prediction on empty tibbles?

What should happen? I think you should either get an empty tibble back, or an error. Could see it going either way.
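If the answer is an empty tibble back, the contract could look like this sketch (using lm() purely for illustration): a zero-row input yields a zero-row tibble with the usual prediction columns.

fit <- lm(mpg ~ disp + hp, data = mtcars)

new_data <- mtcars[0, ]   # zero-row input
tibble::tibble(.pred = unname(predict(fit, new_data)))
# returns a 0-row tibble with a `.pred` column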

Don't keep estimates in multiple objects

See tidymodels/broom#663 for an example, or a number of the bootstrap-style things that I can't remember. All the estimates should live in a single object, otherwise you end up with an interface like

do_something_with_model(first_half_of_estimates, second_half_of_estimates, ...)

which is not intuitive and a pain. Another example of this: https://github.com/tidymodels/broom/blob/master/R/joinerml-tidiers.R#L9.

It's perhaps worth providing advice on what to do when wrapping / extending existing functions / objects instead of implementing from scratch. Advice in this vein might belong more in https://github.com/alexpghayes/kaboom.
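A sketch of the preferred shape, with purely illustrative names: the fitting function returns one object that carries every estimate, and downstream functions take only that object.

fit_joint_model <- function(x, y) {
  structure(
    list(
      fixed_effects  = rep(0, ncol(x)),  # placeholder estimates
      random_effects = list(),
      sigma          = stats::sd(y)
    ),
    class = "joint_model"
  )
}

# downstream: do_something_with_model(fit), not
# do_something_with_model(fixed_effects, random_effects, ...)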

Notes from safe_predict development

I'm following the notes from the prediction section fairly closely as I write tests for modeltests.

Recommendations I think we should add:

  • The newdata argument should default to NULL
  • predict should do something graceful when categorical predictors in newdata have novel levels not present in the training data (a sketch of one option follows below). This should be carefully documented.
  • I would also add some comments about missing data. Currently the notes specify that the number of rows needs to be the same -- I would add that a reasonable default is to set predictions to NA in these cases (although imputation, etc., is also an option). Again, I think we just want to remind people to document this.
  • Add a recommendation that type default to a character vector of allowed types, so that the function signature looks like
predict_method <- function(object, newdata = NULL, type = c("class", "prob"), ...) {
  type <- match.arg(type)  # key!
}
  • Predictions should not depend on which observations are present in newdata. Mostly this is a stats::poly() and splines::ns() trap that people can get caught in:
fit <- lm(mpg ~ disp + hp + splines::ns(drat, 2), mtcars)

hp <- head(predict(fit, mtcars))
ph <- predict(fit, head(mtcars))

all.equal(hp, ph)
#> [1] "Mean relative difference: 0.1161239"

# loading the splines library fixes this

library(splines)
fit <- lm(mpg ~ disp + hp + ns(drat, 2), mtcars)

hp <- head(predict(fit, mtcars))
ph <- predict(fit, head(mtcars))

all.equal(hp, ph)
#> [1] TRUE

Created on 2018-09-04 by the reprex package (v0.2.0).

There's something similar for stats::poly() although the details escape me at the moment.
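Picking up the novel-levels recommendation from the list above, here is a minimal sketch of the "return NA for unseen levels" option, using lm() purely for illustration:

fit <- lm(mpg ~ factor(cyl), data = mtcars)

new_data <- data.frame(cyl = c(4, 6, 10))   # 10 cylinders never seen in training
seen <- new_data$cyl %in% unique(mtcars$cyl)

pred <- rep(NA_real_, nrow(new_data))       # default to NA for novel levels
pred[seen] <- predict(fit, new_data[seen, , drop = FALSE])
pred                                        # NA in the third position for the unseen level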

Questions

  • Say, for logistic regression, I have class A, with predicted prob 0.4 and class B with predicted prob 0.6 -- do people ever report uncertainty in these class probabilities?

This SO response seems to suggest that GPs for classification do this. I imagine this sort of reporting might be useful for something like a neural net, where the softmax probabilities aren't really meaningful, so you might want to look at bootstrapped class probabilities as a sanity check.

  • What should the factor levels of the .pred_class column be? The same levels as the outcome factor in the input dataset? Should factor levels that never appear in the training data get dropped? Etc.

Don't use `match.arg()` to validate `type`

I disagree with the use of match.arg() for validation, as it accepts partial matches.

match.arg("res", c("response"))
#> [1] "response"

It is mentioned here:
https://tidymodels.github.io/model-implementation-principles/model-predictions.html

I think type should default to whatever prediction type is likely to be the most common, and then the documentation should list the other accepted values for type.

Then validate with rlang::arg_match(), which does not accept partial matches and gives nice error messages.

type <- "resp"
choices <- c("response", "conf_int")
rlang::arg_match(type, choices)
#> Error: `type` should be one of: "response" or "conf_int"
#> Did you mean "response"?

Consider specifying the data structure of model objects

It seems that if model objects had a standard data structure, a lot of the need for broom would disappear.

I suggest adding some rules to define a broom-like API, and advice on how to make that easy.

  1. Model objects should have a glance() method that extracts a data frame of model-level information.
  2. Model objects should have a tidy() method that extracts a data frame of coefficient-level information.
  3. Model objects should have an augment() method that extracts a data frame of observation-level information.
  4. The three previous rules can be most easily accomplished if the model structure is a list of three data frames.

The names of these functions could be changed to something more meaningful (for example, coefficient_details() is clearer than tidy()), but that would require changing broom too.
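A sketch of what rule 4 could look like, assuming the glance(), tidy(), and augment() generics from the generics package are available; all other names are illustrative:

library(generics)

new_my_model <- function(model_info, coef_info, obs_info) {
  structure(
    list(
      glance  = model_info,  # one-row data frame of model-level statistics
      tidy    = coef_info,   # one row per coefficient
      augment = obs_info     # one row per observation
    ),
    class = "my_model"
  )
}

glance.my_model  <- function(x, ...) x$glance
tidy.my_model    <- function(x, ...) x$tidy
augment.my_model <- function(x, ...) x$augment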

Add section on _documenting_ modeling functions and predict methods

I think we can extract a few good practices regarding documentation of the top-level modeling function. A few that come to mind are:

  • Clearly document what x can be since we are going to allow it to be a data frame, matrix, or recipe.
  • Since the methods have very different signatures, I think it is good practice to use @rdname model_generic on methods such as model_generic.data.frame to ensure that the S3 method for data frames shows up on the documentation page. That way users know that data frames are allowed, otherwise they are like 🤷‍♂️ (a roxygen2 sketch follows below).
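A small roxygen2 sketch of that pattern, using a hypothetical model_generic():

#' Fit a model (hypothetical example)
#'
#' @param x A data frame, matrix, or recipe of predictors.
#' @param ... Additional arguments passed to methods.
#' @export
model_generic <- function(x, ...) {
  UseMethod("model_generic")
}

#' @rdname model_generic
#' @export
model_generic.data.frame <- function(x, ...) {
  # the data frame method now shows up on the ?model_generic page
}

#' @rdname model_generic
#' @export
model_generic.matrix <- function(x, ...) {
  # and so does the matrix method
}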
