
butcher's Introduction

butcher

R-CMD-check CRAN status Codecov test coverage Lifecycle: stable

Overview

Modeling or machine learning in R can result in fitted model objects that take up too much memory. There are two main culprits:

  1. Heavy usage of formulas and closures that capture the enclosing environment in model training
  2. Lack of selectivity in the construction of the model object itself

As a result, fitted model objects contain components that are often redundant and not required for post-fit estimation activities. The butcher package provides tooling to “axe” parts of the fitted output that are no longer needed, without sacrificing prediction functionality from the original model object.

Installation

Install the released version from CRAN:

install.packages("butcher")

Or install the development version from GitHub:

# install.packages("pak")
pak::pak("tidymodels/butcher")

Butchering

As an example, let’s wrap an lm model so it contains a lot of unnecessary stuff:

library(butcher)
our_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # we didn't know about
  lm(mpg ~ ., data = mtcars) 
}

This object is unnecessarily large:

library(lobstr)
obj_size(our_model())
#> 8.02 MB

When, in fact, it should only be:

small_lm <- lm(mpg ~ ., data = mtcars) 
obj_size(small_lm)
#> 22.22 kB

To understand which parts of our original model object are taking up the most memory, we can use the weigh() function:

big_lm <- our_model()
weigh(big_lm)
#> # A tibble: 25 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         8.05    
#>  2 qr.qr         0.00666 
#>  3 residuals     0.00286 
#>  4 fitted.values 0.00286 
#>  5 effects       0.0014  
#>  6 coefficients  0.00109 
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # ℹ 15 more rows

The problem here is in the terms component of our big_lm. Because of how lm() is implemented in the stats package, the environment in which our model was made is carried along in the fitted output. To remove the (mostly) extraneous component, we can use butcher():

cleaned_lm <- butcher(big_lm, verbose = TRUE)
#> ✔ Memory released: 8.03 MB
#> ✖ Disabled: `print()`, `summary()`, and `fitted()`
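
As an aside, one way to confirm that the captured environment is what weighs down terms (a quick check with lobstr, not part of the package's own workflow): terms objects store the formula's enclosing environment in their ".Environment" attribute, and sizing that environment should account for roughly the 8 MB seen in weigh() above.

# size of the environment captured by the terms component of the
# unbutchered model (includes the junk vector created in our_model())
obj_size(attr(big_lm$terms, ".Environment"))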

Comparing it against our small_lm, we find:

weigh(cleaned_lm)
#> # A tibble: 25 × 2
#>    object           size
#>    <chr>           <dbl>
#>  1 terms        0.00771 
#>  2 qr.qr        0.00666 
#>  3 residuals    0.00286 
#>  4 effects      0.0014  
#>  5 coefficients 0.00109 
#>  6 model.mpg    0.000304
#>  7 model.cyl    0.000304
#>  8 model.disp   0.000304
#>  9 model.hp     0.000304
#> 10 model.drat   0.000304
#> # ℹ 15 more rows

And now it takes up about the same memory as small_lm:

weigh(small_lm)
#> # A tibble: 25 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         8.06    
#>  2 qr.qr         0.00666 
#>  3 residuals     0.00286 
#>  4 fitted.values 0.00286 
#>  5 effects       0.0014  
#>  6 coefficients  0.00109 
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # ℹ 15 more rows

To make the most of the memory available to you, this package provides five S3 generics to remove parts of a model object:

  • axe_call(): To remove the call object.
  • axe_ctrl(): To remove controls associated with training.
  • axe_data(): To remove the original training data.
  • axe_env(): To remove environments.
  • axe_fitted(): To remove fitted values.

When you run butcher(), you execute all of these axing functions at once. Any kind of axing on the object will append a butchered class to the current model object class(es) as well as a new attribute named butcher_disabled that lists any post-fit estimation functions that are disabled as a result.
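
For example, with the cleaned_lm object from above, you can inspect both markers directly. The class shown in the comment is what the lm methods produce; the contents of the attribute depend on which axe functions ran.

class(cleaned_lm)
#> [1] "butchered_lm" "lm"

attr(cleaned_lm, "butcher_disabled")
# lists the disabled functions, e.g. print(), summary(), fitted()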

Model Object Coverage

Check out the vignette("available-axe-methods") to see butcher’s current coverage. If you are working with a new model object that could benefit from any kind of axing, we would love for you to make a pull request! You can visit the vignette("adding-models-to-butcher") for more guidelines, but in short, to contribute a set of axe methods:

  1. Run new_model_butcher(model_class = "your_object", package_name = "your_package")
  2. Use butcher helper functions weigh() and locate() to decide what to axe (see the sketch after this list)
  3. Finalize edits to R/your_object.R and tests/testthat/test-your_object.R
  4. Make a pull request!
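
A minimal sketch of steps 1 and 2, using lm() as a stand-in for the new model class (your_object and your_package are the placeholders from step 1):

library(butcher)

# 1. scaffold R/your_object.R and tests/testthat/test-your_object.R
#    (run from within your development copy of the package)
new_model_butcher(model_class = "your_object", package_name = "your_package")

# 2. inspect a fitted object to decide what to axe
fit <- lm(mpg ~ ., data = mtcars)
weigh(fit)                   # rank components by size
locate(fit, name = "terms")  # find where a component lives (argument name assumed)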

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

butcher's People

Contributors

abichat, ashbythorpe, ashesitr, davisvaughan, dpprdan, era127, galen-ft, hfrick, juliasilge, jyuu, simonpcouch, topepo


butcher's Issues

Check for existence before inserting prototype element

I just hit this interesting scenario where butcher() gives a larger object than we started with, so we get this "No memory released" warning.

library(earth)
library(butcher)

earth.mod <- earth(Volume ~ ., data = trees)

butcher(earth.mod)
#> ✖ No memory released. Do not butcher.
#> Selected 4 of 5 terms, and 2 of 2 predictors
#> Termination condition: RSq changed by less than 0.001 at 5 terms
#> Importance: Girth, Height
#> Number of terms at each degree of interaction: 1 3 (additive model)
#> GCV 11.25439    RSS 209.1139    GRSq 0.959692    RSq 0.9742029

axe_call(earth.mod)
#> ✔ Memory released: '144 B'
#> ✖ Disabled: `summary`, `update`
#> Selected 4 of 5 terms, and 2 of 2 predictors
#> Termination condition: RSq changed by less than 0.001 at 5 terms
#> Importance: Girth, Height
#> Number of terms at each degree of interaction: 1 3 (additive model)
#> GCV 11.25439    RSS 209.1139    GRSq 0.959692    RSq 0.9742029

Created on 2019-07-17 by the reprex package (v0.2.1)

Upon looking into this, axe_data() for .earth does:

function(x, verbose = TRUE, ...) {
  old <- x
  x$x <- data.frame(NA)
  x$y <- numeric(0)

  add_butcher_attributes(x, old,
                         disabled = c("update"),
                         verbose = verbose)
}

but $x and $y don't exist in earth.mod to start with, so this ends up adding memory rather than removing it! This is because keepxy = FALSE by default in earth().

I think you might want to have a function called exchange(x, what, with) that would replace the object at location x[[what]] only if something actually exists there already. Then you would use it like:

function(x, verbose = TRUE, ...) {
  old <- x
  x <- exchange(x, "x", data.frame(NA))
  x <- exchange(x, "y", numeric(0))
  
  add_butcher_attributes(
    x, 
    old,
    disabled = c("update"),
    verbose = verbose
  )
}
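
A minimal sketch of what such an exchange() helper might look like (hypothetical; written here only to illustrate the idea):

exchange <- function(x, what, with) {
  # Replace x[[what]] with a lightweight placeholder, but only if that
  # component already exists; otherwise return x unchanged, so axing
  # can never add memory to the object.
  if (!is.null(x[[what]])) {
    x[[what]] <- with
  }
  x
}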

rename functions

Based on r-design discussion "but more importantly, I think I'd recommend object-verb rather than verb-object because it leads to better autocomplete"

Change all the sub axe functions to begin with object-axe

predict.rpart

There is an issue with predict.rpart not being exported from the rpart package, so running predict(rpart_fit$fit) does not return the expected output. However, when fitting an rpart model in a clean session, the predict.rpart function works. It may thus be worth revisiting whether our assignment to the butcher class (and then reassignment back to rpart) introduces this issue.

Model specific help files

I think it might be useful to have the help files be "model specific" as well.

So, what I am suggesting is that lm would have one help file that would have axe_env.lm, axe_call.lm and so on all documented in one place.

That would allow you to document exactly what axe_misc.lm does, right in the lm specific help doc.

Doing this is a bit of a trick, because you have to really know how roxygen comments work. It looks something like this:

#' Axing an lm
#'
#' This is where all of the lm specific documentation lies
#'
#'
#' @name axe-lm
NULL

#' @rdname axe-lm
#' @export
axe_call.lm <- function(x, ...) {
  x$call <- call("dummy_call")
  x
}

#' @rdname axe-lm
#' @export
axe_env.lm <- function(x, ...) {
  # Environment in terms
  x$terms <- axe_env(x$terms, ...)
  # Environment in model
  attributes(x$model)$terms <- axe_env(attributes(x$model)$terms, ...)
  x
}


This is also where I would actually use lm() to create an lm object, and then you could call all of the relevant axing functions on it to show what they do.

This is slightly more complicated when a package is required to call the model (lm() comes with base R). Like for flexsurv, to be able to run library(flexsurv) in the examples section, you have to have flexsurv as a Suggests. I am still not sure if this is what we will do in the long run, but we can give it a try for now.
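
For example, the lm block above might gain an @examples section along these lines (a sketch; the exact examples are up for discussion):

#' Axing an lm
#'
#' This is where all of the lm specific documentation lies
#'
#' @examples
#' fit <- lm(mpg ~ ., data = mtcars)
#' axe_call(fit)
#' axe_env(fit)
#'
#' @name axe-lm
NULL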

name conflicts

Does the butcher-package export conflict with the export of the butcher() function?

Use `()` in `Disabled` messages for function names

library(earth)
#> Loading required package: Formula
#> Loading required package: plotmo
#> Loading required package: plotrix
#> Loading required package: TeachingDemos
library(butcher)

earth.mod <- earth(Volume ~ ., data = trees)

axe_call(earth.mod)
#> ✔ Memory released: '144 B'
#> ✖ Disabled: `summary`, `update`
#> Selected 4 of 5 terms, and 2 of 2 predictors
#> Termination condition: RSq changed by less than 0.001 at 5 terms
#> Importance: Girth, Height
#> Number of terms at each degree of interaction: 1 3 (additive model)
#> GCV 11.25439    RSS 209.1139    GRSq 0.959692    RSq 0.9742029

Created on 2019-07-17 by the reprex package (v0.2.1)

vignette

The vignette should not only describe how the package works, but should also:

  1. Include guidelines on how to extend the package
  2. Include examples on how much memory is freed for the user

Maybe add `verbose` to the axe generics

When you are just using axe_*() interactively, it might be hard to know that there is a verbose option because RStudio doesn't autocomplete it for methods (unless you are using the pipe, I think). I don't see any reason verbose couldn't be part of the generic.

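A sketch of what moving verbose into a generic could look like (shown for axe_call(); the other generics would follow the same pattern):

# with verbose in the generic signature, it autocompletes before
# method dispatch happens
axe_call <- function(x, verbose = FALSE, ...) {
  UseMethod("axe_call")
}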

recursive find

Edit this function so that it utilizes lobstr::sxp; there is an issue with flattening out the result:

library(butcher)
y <- butcher:::butcher_example("lm.rda")        # path to a stored example model
load(y)                                         # loads lm_fit into the session
lm_str <- lobstr::sxp(lm_fit$fit)               # inspect the underlying SEXP tree
butcher:::butcher_map(lm_str, rlang::is_named)  # map a predicate over that tree

This might be helpful for console messages, to demo how much memory was saved.

butcher class assignment

Class assignment (addition) only happens when the user calls the butcher() function. What if only the sub axe functions are used? How do we elegantly add class assignment without if statements? Is there a way to do so globally for any kind of axing done on an object?
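
One possible approach, shown as a hypothetical helper (not the package's current internals): every axe function could funnel its result through a small function that prepends the butchered class idempotently.

add_butchered_class <- function(x) {
  # Find the base class (ignoring any butchered_* markers already added),
  # then prepend its butchered_ counterpart; union() keeps the class
  # vector free of duplicates, so repeated axing is safe.
  base <- setdiff(class(x), grep("^butchered_", class(x), value = TRUE))[1]
  class(x) <- union(paste0("butchered_", base), class(x))
  x
}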

fix documentation

All sub-axe generics appear to be optimized for "not breaking predict()".

This package might be extended to support other functions, so remove this language.

Remove `dplyr` dependency

This might have been a temporary thing on your part, but dplyr is a large dependency, and we probably won't actually need it for anything. Generally I'd include dplyr as a package dependency if I was extending it or wanted group_by().

If you really need bind_rows() (I see you use it), I'd recommend switching it out for vctrs and vec_rbind(). You'll need the development version, so you'll need to set vctrs as a Remote in the DESCRIPTION file. Just ask me if you want to do this and I can show you what I mean by that.

test_that block structure

Generally, I would encourage using small test_that blocks for each model + axe_* function combination.

So it would be more like test_that("lm + axe_env() works", ...). This way it is self documenting (you don't have to do # LM), and when you run the tests and anything fails it tells you exactly which model failed.

You'll invert the test files so it will be test-lm.R, but the same principle will apply!
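
A sketch of what test-lm.R might look like under this structure (the expectations are illustrative):

test_that("lm + axe_call() works", {
  fit <- lm(mpg ~ ., data = mtcars)
  axed <- axe_call(fit)
  expect_s3_class(axed, "butchered_lm")
})

test_that("lm + axe_env() works", {
  fit <- lm(mpg ~ ., data = mtcars)
  axed <- axe_env(fit)
  # prediction functionality should be preserved after axing
  expect_equal(predict(axed, mtcars), predict(fit, mtcars))
})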

creating a template

include an easy template file with all possible axe functions, for the addition of new model objects

null objects

figure out how to treat NULL objects

what should it return?

Invert file structure

As we discussed, rather than axe_env.R holding axe_env.lm, we will probably move towards the broom approach of lm.R and that holding axe_env.lm along with the other lm specific methods

test helper for recipes

Create a helper function in test-recipes.R; it is way too repetitive there.

Capture the various step_ calls in an expression; only a few unique cases exist.

Generic function help files

To go along with #4, we need a way to make the help page for lm specific methods more discoverable. Right now, if you have the package installed, you'd have to do ?axe_env.lm. But that doesn't actually work, because the only thing you can really see is butcher::axe_env, not butcher::axe_env.lm. To see the lm specific page you'd have to do ?butcher:::axe_env.lm with the triple colon. This will be difficult for users.

To make it easier to find, I think we can take a page out of the generics + broom playbook.

In a fresh R session, try calling ?broom::tidy without loading broom. You should be taken to a page where you can click through to tidy from generics. At the bottom of that next page you should see a Methods section that shouldn't have any methods in it.


Now library(broom). And call ?broom::tidy again and click through to that page. It should be populated with links to all kinds of broom model specific help pages.


This is dynamic documentation, and we might be able to do a similar thing here. Doing ?axe_env would take you to a page that would dynamically link to all of the loaded model specific axe_env help pages.

To do this, we will probably have to copy the generics function for this, in this file. I don't think we should put all of the axe_*() functions in the generics package, which would be the normal approach to add a new common generic function:
https://github.com/r-lib/generics/blob/master/R/docs.R

To make the dynamic docs work, you do this in the roxygen help files, but it will be specific to the axe_env generic, not explain, and it won't be generics::: but probably butcher:::
https://github.com/r-lib/generics/blob/c15ac433450078b581cdfa5d9a3c10797f3a3fd5/R/explain.R#L7

Increase the granularity of axe generics

Not necessary to have an axe_misc that aggregates a number of model components together. Increase the granularity so the user has greater control over what to remove.

Examples include:

  • indices on the train object
  • env in each step type (thus can remove it all with the help of purrr)
  • formula for each step interaction
  • index associated with objects generated from ipred bagging function
  • some "bigMatrix" somewhere

In general, let's not lose the specificity!

s4 objects

Figure out how to include new butcher class assignment to s4 axed model object (see: fitted output from kernlab as an example)

pipe operations

Make sure it is possible to use the axe functions as part of a pipe (i.e., for any print methods, we need to ensure the object is returned invisibly).
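
For example, a print method on a butchered object (hypothetical; sketched here only to show the pattern) should end with invisible() so the object keeps flowing through a pipe:

print.butchered_lm <- function(x, ...) {
  cat("A butchered lm object\n")
  invisible(x)  # return the object invisibly so pipes keep working
}

# axing itself can then sit in a pipeline, e.g.:
# lm(mpg ~ ., data = mtcars) |> butcher() |> predict(newdata = mtcars)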

kknn

verify whether kknn predict requires call object

spark objects

Spark objects take up way too much memory (they are for testing only). Reconfigure so that these objects are only created on the fly for testing.

old predict output in tests

After ensuring the predict function works with reassignment, compare against the predict output expected from the original model fit object.

type stability

The current class assignment and re-assignment might aid with the use of predict, but requires new generics for any other function a user wants to use on a butchered object. We could consider introducing a function that explicitly assigns an axed model object to a butchered class, but the issue remains on how best to "exit gracefully" when the user uses an object from this class on a function that may no longer be compatible.

For the time being, attach an additional class (i.e., "butchered_lm") to the existing class as a means to mark that it has been axed.

If an object has been butchered, should that also be a means to recover the original object? Restore functionality?

user defines butchering severity?

Prototype, for a model object, the option for the user to choose what to axe (i.e., list different tiers of how much memory the object will take on disk against the severity of butchering). The user can then choose (or at least be aware of) how much functionality is lost, so the trade-off between memory and other post-fit analyses can be made deliberately.

model object tracker

Include a file that tracks all the model objects that have been included and tested.

It is important that the overall axe wrapper references this file and includes a

stop("Please consider contributing to the butcher package! No axe methods exist yet for objects of class ", class(x)[1], call. = FALSE)

if the particular model object is not yet supported.
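
A sketch of how the wrapper could use that message (the supported-class lookup from the tracker file is a hypothetical helper):

check_axe_support <- function(x, supported_classes) {
  # supported_classes would be read from the model object tracker file
  if (!class(x)[1] %in% supported_classes) {
    stop(
      "Please consider contributing to the butcher package! ",
      "No axe methods exist yet for objects of class ", class(x)[1],
      call. = FALSE
    )
  }
  invisible(x)
}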

checksum function?

Verify the integrity of a model object? Check what still works, or provide a means to compare two objects (butchered, non-butchered, partially butchered) to elucidate differences in functionality or memory footprint.

xgboost

include model objects from xgboost

include spark objects

Spark is an important engine leveraged through the parsnip package. Learn how to use it, how the model objects are created, and how they can be pared down.

  • How do different modelling objects from spark differ?
  • What are the differences in object size?

import predict

Should predict functions associated with particular model objects like rpart, glmnet and ranger be imported from their respective packages?

user-defined models

Ideate on user-defined models.

Are there axe functions that can be generalized enough to tackle these? At least remove the environment?

Get rid of global axe option?

Goes back to the question of "how much is enough?" We don't want to pare an object down to the point that it is so far from what it originally was... yet, we don't want to go forth and save a lot of extraneous components that will not be recalled at all.

More user feedback will be needed in figuring out whether a global axe option would be helpful

Don't export methods specific to a model if they just call the default method

I know I said otherwise in the meeting last week, but I now think that if a model doesn't have an applicable axe_env method, then it should just use axe_env.default directly rather than having an axe_env.model method that calls the default.

The reason for this is related to #4. If you only export methods that are actually relevant to the model, then they are the only things that will show up in the model specific help files.

Meaning you don't want axe_misc.lm to show up in the lm specific help doc if it doesn't actually do anything.

Environments attached to srcfile attribute

Can the environments attached to the srcfile attribute be removed?

y <- butcher_example("train.rda")
load(y)
x <- train_fit

weigh(x, 0)
#> # A tibble: 128 x 2
#>    object                     size
#>    <chr>                     <dbl>
#>  1 modelInfo.fit           0.0565
#>  2 modelInfo.predict       0.0524
#>  3 modelInfo.grid          0.0518
#>  4 modelInfo.sort          0.0415
#>  5 modelInfo.predictors    0.0414
#>  6 modelInfo.levels        0.0413
#>  7 modelInfo.prob          0.0412
#>  8 finalModel.learn.X      0.0152
#>  9 control.summaryFunction 0.00682
#> 10 call                    0.00179

We can see that what is weighing down these modelInfo components is the attached environments (i.e., str(attr(attr(x$modelInfo$fit, "srcref"), "srcfile"))).
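
One way to test this (standard base R, not a butcher API): drop the source references from the functions stored in modelInfo, since the srcfile environments hang off the srcref attributes.

# utils::removeSource() strips srcref attributes (and with them the
# attached srcfile environments) from a function
x$modelInfo$fit <- utils::removeSource(x$modelInfo$fit)
lobstr::obj_size(x$modelInfo$fit)  # compare against the weigh() output above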

`sysdata.rda` thoughts

I'm not entirely sure what the right approach should be here, but I think that rather than having 1 sysdata.rda file, I would rather take the readxl approach where the data files are all stored in inst/extdata/ and then a helper is provided to load the examples.

https://github.com/tidyverse/readxl/blob/master/R/example.R
https://github.com/tidyverse/readxl/tree/master/inst/extdata

The inst/extdata/ directory is blessed by CRAN
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Data-in-packages

As it currently stands, the issues with the sysdata.rda approach are:

  • It is hard to tell at a glance exactly what example data sets are there
  • These are actually all internal only, so you can't use them in examples. For example, axe_misc(glmnet_multi) won't actually work (I'm fairly certain). However, I'm not sure you'd actually want to use them in examples. Rather, you might want to show examples of creating the objects with the modeling functions, and then pruning them. This gives a clearer picture of how the package will be used.

I do really like the fact that there is a script that generates the test data files.

So here are my thoughts:

  • Create an inst/extdata/ folder. This will be included in the R package.
  • Create an inst/extdata-scripts/ folder. This won't be bundled with the R package, so use usethis::use_build_ignore("inst/extdata-scripts") to add it to the .Rbuildignore. (I think that should work)
  • I would have 1 file per data set in extdata-scripts/ that generates 1 dataset in extdata/. They should have matching file names.
  • Create a helper like readxl that can read in the datasets when requested (sketched after this list).
  • Use this helper in tests, but maybe not in examples. In examples, as mentioned above i think it would be most clear if we could actually library() the relevant package and show off the methods after using the package to create a model object.
  • I am unsure if the helper should be exported at the current moment.
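
A minimal sketch of the readxl-style helper suggested above (the name and location are assumptions):

#' Get the path to a butcher example object
#'
#' @param file Name of a file in inst/extdata/, e.g. "lm.rda".
#' @export
butcher_example <- function(file) {
  system.file("extdata", file, package = "butcher", mustWork = TRUE)
}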

I'm going to open another issue about what I think the examples sections could look like.
