
butcher's Introduction

butcher

R-CMD-check CRAN status Codecov test coverage Lifecycle: stable

Overview

Modeling or machine learning in R can result in fitted model objects that take up too much memory. There are two main culprits:

  1. Heavy usage of formulas and closures that capture the enclosing environment in model training
  2. Lack of selectivity in the construction of the model object itself

As a result, fitted model objects contain components that are often redundant and not required for post-fit estimation activities. The butcher package provides tooling to “axe” parts of the fitted output that are no longer needed, without sacrificing prediction functionality from the original model object.

Installation

Install the released version from CRAN:

install.packages("butcher")

Or install the development version from GitHub:

# install.packages("pak")
pak::pak("tidymodels/butcher")

Butchering

As an example, let’s wrap an lm model so it contains a lot of unnecessary stuff:

library(butcher)
our_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # we didn't know about
  lm(mpg ~ ., data = mtcars) 
}

This object is unnecessarily large:

library(lobstr)
obj_size(our_model())
#> 8.02 MB

When, in fact, it should only be:

small_lm <- lm(mpg ~ ., data = mtcars) 
obj_size(small_lm)
#> 22.22 kB

To understand which parts of our original model object are taking up the most memory, we can use the weigh() function:

big_lm <- our_model()
weigh(big_lm)
#> # A tibble: 25 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         8.05    
#>  2 qr.qr         0.00666 
#>  3 residuals     0.00286 
#>  4 fitted.values 0.00286 
#>  5 effects       0.0014  
#>  6 coefficients  0.00109 
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # ℹ 15 more rows

The problem here is in the terms component of our big_lm. Because of how lm() is implemented in the stats package, the environment in which our model was made is carried along in the fitted output. To remove the (mostly) extraneous component, we can use butcher():

cleaned_lm <- butcher(big_lm, verbose = TRUE)
#> ✔ Memory released: 8.03 MB
#> ✖ Disabled: `print()`, `summary()`, and `fitted()`
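
As an aside, one way to confirm that the captured environment is what weighs down terms (a quick check with lobstr, not part of the package's own workflow): terms objects store the formula's enclosing environment in their ".Environment" attribute, and sizing that environment should account for roughly the 8 MB seen in weigh() above.

# size of the environment captured by the terms component of the
# unbutchered model (includes the junk vector created in our_model())
obj_size(attr(big_lm$terms, ".Environment"))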

Comparing it against our small_lm, we find:

weigh(cleaned_lm)
#> # A tibble: 25 × 2
#>    object           size
#>    <chr>           <dbl>
#>  1 terms        0.00771 
#>  2 qr.qr        0.00666 
#>  3 residuals    0.00286 
#>  4 effects      0.0014  
#>  5 coefficients 0.00109 
#>  6 model.mpg    0.000304
#>  7 model.cyl    0.000304
#>  8 model.disp   0.000304
#>  9 model.hp     0.000304
#> 10 model.drat   0.000304
#> # ℹ 15 more rows

And now it takes up about the same memory as small_lm:

weigh(small_lm)
#> # A tibble: 25 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         8.06    
#>  2 qr.qr         0.00666 
#>  3 residuals     0.00286 
#>  4 fitted.values 0.00286 
#>  5 effects       0.0014  
#>  6 coefficients  0.00109 
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # ℹ 15 more rows

To make the most of the memory available to you, this package provides five S3 generics to remove parts of a model object:

  • axe_call(): To remove the call object.
  • axe_ctrl(): To remove controls associated with training.
  • axe_data(): To remove the original training data.
  • axe_env(): To remove environments.
  • axe_fitted(): To remove fitted values.

When you run butcher(), you execute all of these axing functions at once. Any kind of axing on the object will append a butchered class to the current model object class(es) as well as a new attribute named butcher_disabled that lists any post-fit estimation functions that are disabled as a result.
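
For example, with the cleaned_lm object from above, you can inspect both markers directly. The class shown in the comment is what the lm methods produce; the contents of the attribute depend on which axe functions ran.

class(cleaned_lm)
#> [1] "butchered_lm" "lm"

attr(cleaned_lm, "butcher_disabled")
# lists the disabled functions, e.g. print(), summary(), fitted()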

Model Object Coverage

Check out the vignette("available-axe-methods") to see butcher’s current coverage. If you are working with a new model object that could benefit from any kind of axing, we would love for you to make a pull request! You can visit the vignette("adding-models-to-butcher") for more guidelines, but in short, to contribute a set of axe methods:

  1. Run new_model_butcher(model_class = "your_object", package_name = "your_package")
  2. Use butcher helper functions weigh() and locate() to decide what to axe (see the sketch after this list)
  3. Finalize edits to R/your_object.R and tests/testthat/test-your_object.R
  4. Make a pull request!
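
A minimal sketch of steps 1 and 2, using lm() as a stand-in for the new model class (your_object and your_package are the placeholders from step 1):

library(butcher)

# 1. scaffold R/your_object.R and tests/testthat/test-your_object.R
#    (run from within your development copy of the package)
new_model_butcher(model_class = "your_object", package_name = "your_package")

# 2. inspect a fitted object to decide what to axe
fit <- lm(mpg ~ ., data = mtcars)
weigh(fit)                   # rank components by size
locate(fit, name = "terms")  # find where a component lives (argument name assumed)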

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

butcher's People

Contributors

abichat, ashbythorpe, ashesitr, davisvaughan, dpprdan, era127, galen-ft, hfrick, juliasilge, jyuu, simonpcouch, topepo


butcher's Issues

Check for existence before inserting prototype element

I just hit this interesting scenario where butcher() gives a larger object than we started with, so we get this "No memory released" warning.

library(earth)
library(butcher)

earth.mod <- earth(Volume ~ ., data = trees)

butcher(earth.mod)
#> ✖ No memory released. Do not butcher.
#> Selected 4 of 5 terms, and 2 of 2 predictors
#> Termination condition: RSq changed by less than 0.001 at 5 terms
#> Importance: Girth, Height
#> Number of terms at each degree of interaction: 1 3 (additive model)
#> GCV 11.25439    RSS 209.1139    GRSq 0.959692    RSq 0.9742029

axe_call(earth.mod)
#> ✔ Memory released: '144 B'
#> ✖ Disabled: `summary`, `update`
#> Selected 4 of 5 terms, and 2 of 2 predictors
#> Termination condition: RSq changed by less than 0.001 at 5 terms
#> Importance: Girth, Height
#> Number of terms at each degree of interaction: 1 3 (additive model)
#> GCV 11.25439    RSS 209.1139    GRSq 0.959692    RSq 0.9742029

Created on 2019-07-17 by the reprex package (v0.2.1)

Upon looking into this, axe_data() for .earth does:

function(x, verbose = TRUE, ...) {
  old <- x
  x$x <- data.frame(NA)
  x$y <- numeric(0)

  add_butcher_attributes(x, old,
                         disabled = c("update"),
                         verbose = verbose)
}

but $x and $y don't exist in earth.mod to start with, so this ends up adding memory rather than removing it! This is because keepxy = FALSE by default in earth().

I think you might want to have a function called exchange(x, what, with) that would replace the object at location x[[what]] only if something actually exists there already. Then you would use it like:

function(x, verbose = TRUE, ...) {
  old <- x
  x <- exchange(x, "x", data.frame(NA))
  x <- exchange(x, "y", numeric(0))
  
  add_butcher_attributes(
    x, 
    old,
    disabled = c("update"),
    verbose = verbose
  )
}
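
A minimal sketch of what such an exchange() helper might look like (hypothetical; written here only to illustrate the idea):

exchange <- function(x, what, with) {
  # Replace x[[what]] with a lightweight placeholder, but only if that
  # component already exists; otherwise return x unchanged, so axing
  # can never add memory to the object.
  if (!is.null(x[[what]])) {
    x[[what]] <- with
  }
  x
}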

rename functions

Based on r-design discussion "but more importantly, I think I'd recommend object-verb rather than verb-object because it leads to better autocomplete"

Change all the sub axe functions to begin with object-axe

predict.rpart

There is an issue with predict.rpart not being exported from the rpart package, so running predict(rpart_fit$fit) does not return the expected output. However, when fitting an rpart model in a clean session, the predict.rpart function works. It may thus be worth revisiting whether our assignment to the butcher class (and then reassignment back to rpart) introduces this issue.

Model specific help files

I think it might be useful to have the help files be "model specific" as well.

So, what I am suggesting is that lm would have one help file that would have axe_env.lm, axe_call.lm and so on all documented in one place.

That would allow you to document exactly what axe_misc.lm does, right in the lm specific help doc.

Doing this is a bit of a trick, because you have to really know how roxygen comments work. It looks something like this:

#' Axing an lm
#'
#' This is where all of the lm specific documentation lies
#'
#'
#' @name axe-lm
NULL

#' @rdname axe-lm
#' @export
axe_call.lm <- function(x, ...) {
  x$call <- call("dummy_call")
  x
}

#' @rdname axe-lm
#' @export
axe_env.lm <- function(x, ...) {
  # Environment in terms
  x$terms <- axe_env(x$terms, ...)
  # Environment in model
  attributes(x$model)$terms <- axe_env(attributes(x$model)$terms, ...)
  x
}


This is also where I would actually use lm() to create an lm object, and then you could call all of the relevant axing functions on it to show what they do.

This is slightly more complicated when a package is required to call the model (lm() comes with base R). Like for flexsurv, to be able to run library(flexsurv) in the examples section, you have to have flexsurv as a Suggests. I am still not sure if this is what we will do in the long run, but we can give it a try for now.
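
For example, the lm block above might gain an @examples section along these lines (a sketch; the exact examples are up for discussion):

#' Axing an lm
#'
#' This is where all of the lm specific documentation lies
#'
#' @examples
#' fit <- lm(mpg ~ ., data = mtcars)
#' axe_call(fit)
#' axe_env(fit)
#'
#' @name axe-lm
NULL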

name conflicts

Does the butcher-package export conflict with the export of the butcher() function?

Use `()` in `Disabled` messages for function names

library(earth)
#> Loading required package: Formula
#> Loading required package: plotmo
#> Loading required package: plotrix
#> Loading required package: TeachingDemos
library(butcher)

earth.mod <- earth(Volume ~ ., data = trees)

axe_call(earth.mod)
#> ✔ Memory released: '144 B'
#> ✖ Disabled: `summary`, `update`
#> Selected 4 of 5 terms, and 2 of 2 predictors
#> Termination condition: RSq changed by less than 0.001 at 5 terms
#> Importance: Girth, Height
#> Number of terms at each degree of interaction: 1 3 (additive model)
#> GCV 11.25439    RSS 209.1139    GRSq 0.959692    RSq 0.9742029

Created on 2019-07-17 by the reprex package (v0.2.1)

vignette

The vignette should not only describe how the package works, but should also:

  1. Include guidelines on how to extend the package
  2. Include examples on how much memory is freed for the user

Maybe add `verbose` to the axe generics

When you are just using axe_*() interactively, it might be hard to know that there is a verbose option because RStudio doesn't autocomplete it for methods (unless you are using the pipe, I think). I don't see any reason verbose couldn't be part of the generic.

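A sketch of what moving verbose into a generic could look like (shown for axe_call(); the other generics would follow the same pattern):

# with verbose in the generic signature, it autocompletes before
# method dispatch happens
axe_call <- function(x, verbose = FALSE, ...) {
  UseMethod("axe_call")
}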

recursive find

Edit this function so that it utilizes lobstr::sxp; there is an issue with flattening out the result:

library(butcher)
y <- butcher:::butcher_example("lm.rda")        # path to a stored example model
load(y)                                         # loads lm_fit into the session
lm_str <- lobstr::sxp(lm_fit$fit)               # inspect the underlying SEXP tree
butcher:::butcher_map(lm_str, rlang::is_named)  # map a predicate over that tree

This might be helpful for console messages, to demo how much memory was saved.

butcher class assignment

Class assignment (addition) only happens when the user calls the butcher() function. What if only the sub axe functions are used? How do we elegantly add class assignment without if statements? Is there a way to do so globally for any kind of axing done on an object?
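
One possible approach, shown as a hypothetical helper (not the package's current internals): every axe function could funnel its result through a small function that prepends the butchered class idempotently.

add_butchered_class <- function(x) {
  # Find the base class (ignoring any butchered_* markers already added),
  # then prepend its butchered_ counterpart; union() keeps the class
  # vector free of duplicates, so repeated axing is safe.
  base <- setdiff(class(x), grep("^butchered_", class(x), value = TRUE))[1]
  class(x) <- union(paste0("butchered_", base), class(x))
  x
}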

fix documentation

All sub-axe generics appear to be optimized for "not breaking predict()".

This package might be extended to support other functions, so remove this language.

Remove `dplyr` dependency

This might have been a temporary thing on your part, but dplyr is a large dependency, and we probably won't actually need it for anything. Generally I'd include dplyr as a package dependency if I was extending it or wanted group_by().

If you really need bind_rows() (I see you use it), I'd recommend switching it out for vctrs and vec_rbind(). You'll need the development version, so you'll need to set vctrs as a Remote in the DESCRIPTION file. Just ask me if you want to do this and I can show you what I mean by that.

test_that block structure

Generally, I would encourage using small test_that blocks for each model + axe_* function combination.

So it would be more like test_that("lm + axe_env() works", ...). This way it is self documenting (you don't have to do # LM), and when you run the tests and anything fails it tells you exactly which model failed.

You'll invert the test files so it will be test-lm.R, but the same principle will apply!
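
A sketch of what test-lm.R might look like under this structure (the expectations are illustrative):

test_that("lm + axe_call() works", {
  fit <- lm(mpg ~ ., data = mtcars)
  axed <- axe_call(fit)
  expect_s3_class(axed, "butchered_lm")
})

test_that("lm + axe_env() works", {
  fit <- lm(mpg ~ ., data = mtcars)
  axed <- axe_env(fit)
  # prediction functionality should be preserved after axing
  expect_equal(predict(axed, mtcars), predict(fit, mtcars))
})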

creating a template

include an easy template file with all possible axe functions, for the addition of new model objects

null objects

figure out how to treat NULL objects

what should it return?

Invert file structure

As we discussed, rather than axe_env.R holding axe_env.lm, we will probably move towards the broom approach of lm.R and that holding axe_env.lm along with the other lm specific methods

test helper for recipes

Create a helper function in test-recipes.R; it is way too repetitive there.

Capture the various step_ calls in an expression; only a few unique cases exist.

Generic function help files

To go along with #4, we need a way to make the help page for lm specific methods more discoverable. Right now, if you have the package installed, you'd have to do ?axe_env.lm. But that doesn't actually work, because the only thing you can really see is butcher::axe_env, not butcher::axe_env.lm. To see the lm specific page you'd have to do ?butcher:::axe_env.lm with the triple colon. This will be difficult for users.

To make it easier to find, I think we can take a page out of the generics + broom playbook.

In a fresh R session, try calling ?broom::tidy without loading broom. You should be taken to a page where you can click through to tidy from generics. At the bottom of that next page you should see a Methods section that shouldn't have any methods in it.


Now library(broom). And call ?broom::tidy again and click through to that page. It should be populated with links to all kinds of broom model specific help pages.


This is dynamic documentation, and we might be able to do a similar thing here. Doing ?axe_env would take you to a page that would dynamically link to all of the loaded model specific axe_env help pages.

To do this, we will probably have to copy the generics function for this, in this file. I don't think we should put all of the axe_*() functions in the generics package, which would be the normal approach to add a new common generic function:
https://github.com/r-lib/generics/blob/master/R/docs.R

To make the dynamic docs work, you do this in the roxygen help files, but it will be specific to the axe_env generic, not explain, and it won't be generics::: but probably butcher:::
https://github.com/r-lib/generics/blob/c15ac433450078b581cdfa5d9a3c10797f3a3fd5/R/explain.R#L7

Increase the granularity of axe generics

Not necessary to have an axe_misc that aggregates a number of model components together. Increase the granularity so the user has greater control over what to remove.

Examples include:

  • indices on the train object
  • env in each step type (thus can remove it all with the help of purrr)
  • formula for each step interaction
  • index associated with objects generated from ipred bagging function
  • some "bigMatrix" somewhere

In general, let's not lose the specificity!

s4 objects

Figure out how to include new butcher class assignment to s4 axed model object (see: fitted output from kernlab as an example)

pipe operations

Make sure it is possible to use the axe functions as part of a pipe (i.e., for any print methods, we need to ensure the object is returned invisibly).
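
For example, a print method on a butchered object (hypothetical; sketched here only to show the pattern) should end with invisible() so the object keeps flowing through a pipe:

print.butchered_lm <- function(x, ...) {
  cat("A butchered lm object\n")
  invisible(x)  # return the object invisibly so pipes keep working
}

# axing itself can then sit in a pipeline, e.g.:
# lm(mpg ~ ., data = mtcars) |> butcher() |> predict(newdata = mtcars)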

kknn

verify whether kknn predict requires call object

spark objects

Spark objects take up way too much memory (they are for testing only). Reconfigure so that these objects are only created on the fly for testing.

old predict output in tests

After ensuring the predict function works with reassignment, compare against the predict output expected from the original model fit object.

type stability

The current class assignment and re-assignment might aid with the use of predict, but requires new generics for any other function a user wants to use on a butchered object. We could consider introducing a function that explicitly assigns an axed model object to a butchered class, but the issue remains on how best to "exit gracefully" when the user uses an object from this class on a function that may no longer be compatible.

For the time being, attach an additional class (i.e., "butchered_lm") to the existing class as a means to mark that it has been axed.

If an object has been butchered, should that also be a means to recover the original object? Restore functionality?

user defines butchering severity?

Prototype, for a model object, the option for the user to choose what to axe (i.e., list different tiers of how much memory the object will take on disk against the severity of butchering). The user can then choose (or at least be aware of) how much functionality is lost, so the trade-off between memory and other post-fit analyses can be made deliberately.

model object tracker

Include a file that tracks all the model objects that have been included and tested.

It is important that the overall axe wrapper references this file and includes a

stop("Please consider contributing to the butcher package! No axe methods exist yet for objects of class ", class(x)[1], call. = FALSE)

if the particular model object is not yet supported.
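
A sketch of how the wrapper could use that message (the supported-class lookup from the tracker file is a hypothetical helper):

check_axe_support <- function(x, supported_classes) {
  # supported_classes would be read from the model object tracker file
  if (!class(x)[1] %in% supported_classes) {
    stop(
      "Please consider contributing to the butcher package! ",
      "No axe methods exist yet for objects of class ", class(x)[1],
      call. = FALSE
    )
  }
  invisible(x)
}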

checksum function?

Verify the integrity of a model object? Check what still works, or provide a means to compare two objects (butchered, non-butchered, partially butchered) to elucidate differences in functionality or memory footprint.

xgboost

include model objects from xgboost

include spark objects

Spark is an important engine leveraged through the parsnip package. Learn how to use it, how the model objects are created, and how they can be pared down.

  • How do different modelling objects from spark differ?
  • What are the differences in object size?

import predict

Should predict functions associated with particular model objects like rpart, glmnet and ranger be imported from their respective packages?

user-defined models

Ideate on user-defined models.

Are there axe functions that can be generalized enough to tackle these? At least remove the environment?

Get rid of global axe option?

Goes back to the question of "how much is enough?" We don't want to pare an object down to the point that it is so far from what it originally was... yet, we don't want to go forth and save a lot of extraneous components that will not be recalled at all.

More user feedback will be needed in figuring out whether a global axe option would be helpful

Don't export methods specific to a model if they just call the default method

I know I said otherwise in the meeting last week, but I now think that if a model doesn't have an applicable axe_env method, then it should just use axe_env.default directly rather than having an axe_env.model method that calls the default.

The reason for this is related to #4. If you only export methods that are actually relevant to the model, then they are the only things that will show up in the model specific help files.

Meaning you don't want axe_misc.lm to show up in the lm specific help doc if it doesn't actually do anything.

Environments attached to srcfile attribute

Can the environments attached to the srcfile attribute be removed?

y <- butcher_example("train.rda")
load(y)
x <- train_fit

weigh(x, 0)
#> # A tibble: 128 x 2
#>    object                     size
#>    <chr>                     <dbl>
#>  1 modelInfo.fit           0.0565
#>  2 modelInfo.predict       0.0524
#>  3 modelInfo.grid          0.0518
#>  4 modelInfo.sort          0.0415
#>  5 modelInfo.predictors    0.0414
#>  6 modelInfo.levels        0.0413
#>  7 modelInfo.prob          0.0412
#>  8 finalModel.learn.X      0.0152
#>  9 control.summaryFunction 0.00682
#> 10 call                    0.00179

We can see that what is weighing down these modelInfo components is the attached environments (i.e., str(attr(attr(x$modelInfo$fit, "srcref"), "srcfile"))).
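
One way to test this (standard base R, not a butcher API): drop the source references from the functions stored in modelInfo, since the srcfile environments hang off the srcref attributes.

# utils::removeSource() strips srcref attributes (and with them the
# attached srcfile environments) from a function
x$modelInfo$fit <- utils::removeSource(x$modelInfo$fit)
lobstr::obj_size(x$modelInfo$fit)  # compare against the weigh() output above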

`sysdata.rda` thoughts

I'm not entirely sure what the right approach should be here, but I think that rather than having 1 sysdata.rda file, I would rather take the readxl approach where the data files are all stored in inst/extdata/ and then a helper is provided to load the examples.

https://github.com/tidyverse/readxl/blob/master/R/example.R
https://github.com/tidyverse/readxl/tree/master/inst/extdata

The inst/extdata/ directory is blessed by CRAN
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Data-in-packages

As it currently stands, the issues with the sysdata.rda approach are:

  • It is hard to tell at a glance exactly what example data sets are there
  • These are actually all internal only, so you can't use them in examples. For example, axe_misc(glmnet_multi) won't actually work (I'm fairly certain). However, I'm not sure you'd actually want to use them in examples. Rather, you might want to show examples of creating the objects with the modeling functions, and then pruning them. This gives a clearer picture of how the package will be used.

I do really like the fact that there is a script that generates the test data files.

So here are my thoughts:

  • Create an inst/extdata/ folder. This will be included in the R package.
  • Create an inst/extdata-scripts/ folder. This won't be bundled with the R package, so use usethis::use_build_ignore("inst/extdata-scripts") to add it to the .Rbuildignore. (I think that should work)
  • I would have 1 file per data set in extdata-scripts/ that generates 1 dataset in extdata/. They should have matching file names.
  • Create a helper like readxl that can read in the datasets when requested (sketched after this list).
  • Use this helper in tests, but maybe not in examples. In examples, as mentioned above i think it would be most clear if we could actually library() the relevant package and show off the methods after using the package to create a model object.
  • I am unsure if the helper should be exported at the current moment.
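
A minimal sketch of the readxl-style helper suggested above (the name and location are assumptions):

#' Get the path to a butcher example object
#'
#' @param file Name of a file in inst/extdata/, e.g. "lm.rda".
#' @export
butcher_example <- function(file) {
  system.file("extdata", file, package = "butcher", mustWork = TRUE)
}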

I'm going to open another issue about what I think the examples sections could look like.
