
planning's Introduction

tidymodels

Overview

tidymodels is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse.

It includes a core set of packages that are loaded on startup:

  • broom takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.

  • dials has tools to create and manage values of tuning parameters.

  • dplyr contains a grammar for data manipulation.

  • ggplot2 implements a grammar of graphics.

  • infer is a modern approach to statistical inference.

  • parsnip is a tidy, unified interface to creating models.

  • purrr is a functional programming toolkit.

  • recipes is a general data preprocessor with a modern interface. It can create model matrices that incorporate feature engineering, imputation, and other helpful tools.

  • rsample has infrastructure for resampling data so that models can be assessed and empirically validated.

  • tibble has a modern re-imagining of the data frame.

  • tune contains the functions to optimize model hyper-parameters.

  • workflows has methods to combine pre-processing steps and models into a single object.

  • yardstick contains tools for evaluating models (e.g. accuracy, RMSE, etc.).

A list of all tidymodels functions across different CRAN packages can be found at https://www.tidymodels.org/find/.
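
To show how these pieces fit together, here is a minimal sketch of a typical flow: resample with rsample, preprocess with recipes, specify a model with parsnip, bundle everything with workflows, and evaluate with yardstick. The dataset and model choice are only for illustration.

library(tidymodels)

# Split the data and define a small preprocessing recipe.
set.seed(123)
car_split <- initial_split(mtcars, prop = 0.8)
rec <- recipe(mpg ~ ., data = training(car_split)) |>
  step_normalize(all_numeric_predictors())

# Specify a model, bundle it with the recipe, fit, and evaluate on the test set.
spec <- linear_reg() |> set_engine("lm")
wf_fit <- workflow() |>
  add_recipe(rec) |>
  add_model(spec) |>
  fit(data = training(car_split))

predict(wf_fit, new_data = testing(car_split)) |>
  bind_cols(testing(car_split)) |>
  rmse(truth = mpg, estimate = .pred)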

You can install the released version of tidymodels from CRAN with:

install.packages("tidymodels")

Install the development version from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/tidymodels")

When loading the package, the versions and conflicts are listed:

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
#> ✔ broom        1.0.5      ✔ recipes      1.0.10
#> ✔ dials        1.2.1      ✔ rsample      1.2.0 
#> ✔ dplyr        1.1.4      ✔ tibble       3.2.1 
#> ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
#> ✔ infer        1.0.6      ✔ tune         1.2.0 
#> ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
#> ✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
#> ✔ purrr        1.0.2      ✔ yardstick    1.3.1
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Learn how to get started at https://www.tidymodels.org/start/

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

planning's People

Contributors

andland, davisvaughan, hfrick, juliasilge, topepo, turgeonmaxime

planning's Issues

supporting list input for keras

Is there a way to create a recipe for a list of matrices? For example, being able to input a DTM and image pixels into a custom keras model at the same time would be great. Also, is there a way to incorporate a keras tokenizer into a recipe?

Thanks for all your work! And I would be delighted to help on that end if possible.

Feature engineering for image data

Image Classification with Tidymodels

Suggest Recipe: step_image

Is it possible to create a recipe that performs image processing?

Is it possible to process images, for example array reshaping as in Keras?

tidymodels is an excellent set of packages, but there are currently no options for image processing.

Add functionality to improve explainability to tidymodels

It would be great if methods for making models and predictions more explainable were made available through tidymodels and could be added to a workflow, for example those in the DALEX and iml packages. Are there any plans to do this?
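
For what it's worth, part of this already exists in the DALEXtra package, which can wrap a fitted workflow. A minimal sketch (the data and model are only illustrative, and the ranger package is required for this engine):

library(tidymodels)
library(DALEXtra)  # attaches DALEX as well

# Fit a small random forest workflow (illustrative only).
rf_wf <- workflow() |>
  add_formula(mpg ~ .) |>
  add_model(rand_forest(mode = "regression") |> set_engine("ranger")) |>
  fit(data = mtcars)

# Wrap the fitted workflow in an explainer, then compute permutation importance.
rf_explainer <- explain_tidymodels(rf_wf, data = select(mtcars, -mpg), y = mtcars$mpg)
model_parts(rf_explainer)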

option for OOB in bagging (e.g., RF)

Feature

In situations when running random forests (or other bagged models), OOB model information (predictions, error rates, etc.) should be available.

  1. First of all, I'm not convinced that OOB is a bad option. In this recent paper they say:

In line with results reported in the literature [5], the use of stratified subsampling with sampling fractions that are proportional to response class sizes of the training data yielded almost unbiased error rates in most settings with metric predictors. It therefore presents an easy way of reducing the bias in the OOB error. It does not increase the cost of constructing the RF, since unstratified sampling (bootstrap or subsampling) is simply replaced by stratified subsampling.

This indicates that OOB errors do a good job of estimating error rates (with the added benefit that they require no additional model fitting) as long as stratified subsampling is used instead of unstratified sampling.

  2. Even if nested resampling is superior (and I'll buy that there is an argument to be made), I find that cross-validation and OOB are stepping stones to understanding nested resampling. Do you argue that nested resampling is better than CV? If so, why have CV in the package? Again, OOB happens for free (see the sketch below for pulling it from a parsnip fit), and sometimes nested resampling isn't even that much better. I think that more people will use nested resampling if they understand OOB, and the path to understanding OOB happens when it is included in the tidymodels package.
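
For reference, engines such as ranger already compute the OOB error, and it can be pulled out of a parsnip fit today. A minimal sketch (the data and formula are illustrative; prediction.error is the slot ranger itself uses for its OOB estimate):

library(parsnip)

# Fit a random forest with the ranger engine (requires the ranger package).
rf_fit <- rand_forest(mode = "regression", trees = 500) |>
  set_engine("ranger") |>
  fit(mpg ~ ., data = mtcars)

# ranger stores the OOB estimate (MSE for regression, misclassification
# rate for classification) on the underlying engine fit.
rf_fit$fit$prediction.error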

Thanks for all that you do!! The tidymodels package is amazing, and I really appreciate all the hard work that has gone into creating it.

Sparse tibble support

This will serve as the main hub for issues across the tidymodels ecosystem regarding the implementation of sparse data in tibbles.

Right now we are still in the exploratory phase, with work happening in https://github.com/EmilHvitfeldt/sparsevctrs to implement sparse vector classes that can be used within a tibble.

Another thing we can do with this framework is to allow sparse data as input to functions such as vfold_cv(), fit(), predict(), etc., turning sparse data into sparse tibbles (a small sketch of what the xgboost case looks like today follows the checklist below).

  • Sparse vector classes
  • use sparse vector classes internally in recipes
  • use sparse vector to matrix conversion in appropriate places
    • recipes
    • workflows
    • parsnip (for xgboost)
    • rsample
  • make sure workflow/recipes interface is good to enable sparse data
    • possibly having sparse data by default if conversion isn't a big downside
  • make sure that end-to-end workflows use and pass sparse data correctly
  • consider improved printing to allow the user to make better use of sparsity
    • a user might use 90% sparsity-compliant methods; letting them know what they would need to change to make full use of sparsity would help (step_normalize() is one example of a step that destroys sparsity)
  • Have vignettes in appropriate places describing what sparsity is and which steps/models benefit from it
  • document which steps and models do and do not work with sparsity
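
For context, here is the kind of flow this is aiming at. Passing a dgCMatrix straight to fit_xy() with the xgboost engine should already work in parsnip; the checklist above covers extending sparsity through the rest of the pipeline. The data here is simulated.

library(Matrix)
library(parsnip)

# A small sparse predictor matrix (dgCMatrix) and a numeric outcome,
# standing in for the kind of data discussed above.
set.seed(1)
x <- rsparsematrix(nrow = 100, ncol = 20, density = 0.1)
colnames(x) <- paste0("x", seq_len(ncol(x)))
y <- rnorm(100)

# xgboost via parsnip can take the sparse matrix directly in fit_xy();
# the goal above is to carry sparsity through recipes, rsample, etc. as well.
spec <- boost_tree(mode = "regression") |> set_engine("xgboost")
sparse_fit <- fit_xy(spec, x = x, y = y)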

Exporting model runs

Hi again, I used MLflow for a while and have now switched to a custom solution inspired by it, which saves all artefacts into a folder with a unique run_id and writes params and metrics as JSON. This way all params are quickly available and a Shiny app can pick up the metadata. Do you think it would be beneficial to add a standardised export function to parsnip or workflows?

Best
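
For concreteness, a rough sketch of the kind of export helper being described; the folder layout and the write_run() name are hypothetical, not an existing tidymodels API.

library(jsonlite)

# Hypothetical helper: save a fitted workflow plus its parameters and metrics
# under a unique run id, mirroring the MLflow-style layout described above.
write_run <- function(fitted_wf, params, metrics, dir = "runs") {
  run_id <- format(Sys.time(), "%Y%m%d-%H%M%S")
  run_dir <- file.path(dir, run_id)
  dir.create(run_dir, recursive = TRUE)
  saveRDS(fitted_wf, file.path(run_dir, "workflow.rds"))
  write_json(params, file.path(run_dir, "params.json"), auto_unbox = TRUE)
  write_json(metrics, file.path(run_dir, "metrics.json"), auto_unbox = TRUE)
  invisible(run_id)
}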

Using tidymodels to predict species distributions

Species Distribution Models or Ecological Niche Models are a broad range of tools for estimating the distribution of a focal species based on (typically) environmental covariates.

There are a number of specific R packages available to fit these, typically focused on specific statistical methodologies (e.g. dismo).

I can see broad interest from the SDM community in using tidymodels as a route to exploring a broad range of statistical models/engines within a familiar tidy workflow. For example, I am currently trying to build an ensemble of 4 models (lm, gam, rf, and brt) using a tidymodels approach to estimate the distributions of ~30 marine species. One reason for choosing a tidy approach is being able to train models using spatial cross-validation via the spatialsample package.

However, I'm struggling to understand how best to generate predictions from tidymodel workflows that would fit back into the standard SDM workflow.

Typically we check the interpretability of a model through (1) partial dependence plots of specific covariates and (2) by generating spatial predictions from models using covariates from the geographic domain of interest e.g. raster::predict or terra::predict.

I can see a route to do (1) via DALEX following the Mario Kart guide from @juliasilge here but am struggling to find documentation on how to do (2). Can I pass the workflow object directly to predict or should I create an explainer object first?

Here's some pseudo-code for my current workflow for (2):

# fit the final model to the data (after cross-validation and tuning)
final_fit <- final_model |> fit(my_data)

# generate DALEX explainer
rf_explainer <- explain_tidymodels(final_fit, data = my_data, y = my_data$response)

# load a covariate raster stack and convert to xyz tibble
covs <- readRDS("data/covariate_stack.rds") |> 
     rasterToPoints() |> 
     as_tibble()

# generate predictions for each xyz from the covariate stack
spatial_predictions <- predict(rf_explainer, covs)
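
For what it's worth, a fitted workflow can usually be handed straight to predict() with a data frame of covariates, so (2) may not need the explainer at all. A minimal sketch under that assumption (the covariate column names must match those used in training; extra coordinate columns are ignored):

# Predict directly from the fitted workflow on the xyz covariate tibble,
# keeping the coordinates alongside for rasterising/mapping afterwards.
spatial_predictions <- covs |>
  dplyr::bind_cols(predict(final_fit, new_data = covs))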

Transfer Learning via tidymodels

First of all, thank you everyone for your hard work. I'm addicted to tidymodels and have been a heavy user since I discovered it.

I'm not sure if this is the right place, but I wonder if you have ever thought about implementing transfer learning via a tidymodels workflow?

Econometric Tools project ideas

I was browsing through the projects of the tidymodels organization and I see that there's a project on "Econometric Tools" and why people flock to Stata for this topic. I'm not sure where to put this comment; feel free to move it if it's more relevant elsewhere.

I think incorporating econometric models into the tidymodels framework in some way would be interesting, but it's definitely reinventing the wheel since there's already quite some work on econometrics in R. In particular, zelig is a bit like the "tidymodels" of econometric modelling, wrapping nearly all standard econometric models into a unified framework. I won't list all of the econometric packages out there because I think zelig is a good starting point for thinking about this.

Support for Competing Risk models

Hello,

I am very excited to see the tidymodels implementation of survival analysis. I am currently using competing risk models in my research and wanted to advocate for their implementation in the tidymodels framework in the future (but given their complexity, I imagine it would be low priority at this time).

The models of interest would be the Fine-Gray model (similar to the Cox proportional hazards model) as well as the cumulative incidence function (similar to the Kaplan-Meier estimator).

Competing risk dataset

As an example, we can use the mgus2 dataset from the survival package, where patients with MGUS can progress to a plasma cell malignancy (PCM) but are also at risk of death. We will impute to remove all NA values.

library(tidyverse)
library(survival)
library(mice)
library(rsample)  # for initial_split(), training(), and testing()

data(cancer, package = "survival")

# event coding: 0 = censored, 1 = plasma cell malignancy, 2 = death as a competing risk
imputed_mgus2 <- mgus2 %>% 
  mutate(etime = ifelse(pstat == 0, futime, ptime),
         event = ifelse(pstat == 0, 2 * death, 1)) %>% 
  mice(maxit = 2, m = 2, seed = 1, method = "cart") %>% 
  complete()

mgus2_split <- imputed_mgus2 %>% initial_split(prop = 0.99) 
train_dat <- mgus2_split %>% training()
test_pred <- mgus2_split %>% testing()

Fine-Gray Models

cmprsk:::crr

The model fitting function:

library(cmprsk)
fit_mgus_crr <- cmprsk::crr(ftime = train_dat$etime, fstatus = train_dat$event, 
            cov1 = data.matrix(train_dat[,c("sex", "hgb")]), 
            cencode = 0, failcode = 1, variance = TRUE)

Per the documentation, the predict function "returns a matrix with the unique type 1 failure times in the first column, and the other columns giving the estimated subdistribution function corresponding to the covariate combinations in the rows of cov1 and cov2, at each failure time (the value that the estimate jumps to at that failure time)."

fit_mgus_crr %>% predict(cov1 = data.matrix(test_pred[,c("sex", "hgb")]))

In addition, there is a tidier in the broom package for crr objects.

References: Subdistribution Analysis of Competing Risks (pdf)

crrp:::crrp

This package is useful for penalized Fine-Gray models, using LASSO, SCAD, MCP, and their group versions.

The model fitting function:

library(crrp)
fit_mgus_crrp <- crrp(time = train_dat$etime,
     fstatus = train_dat$event,
     X = as.matrix(cbind(train_dat$hgb, train_dat$creat)), 
     failcode = 1, cencode = 0, penalty = "LASSO", 
     lambda = 0.01, eps = 1E-6)

Unfortunately, there is no predict function, and the output from the crrp() function does not include convenient quantities such as a p-value. There are standard errors, so a p-value can be calculated manually. Creating a tidy wrapper may be difficult. In addition, this package has not been maintained since its first commit in 2015.
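
For reference, the manual calculation is just the usual Wald test, 2 * pnorm(-abs(beta / se)). A small sketch, with the caveat that the slot names on the fitted object are assumptions; inspect it with str() first.

# Hypothetical slot names -- check str(fit_mgus_crrp) for where crrp
# actually stores the coefficients and their standard errors.
beta <- fit_mgus_crrp$beta
se   <- fit_mgus_crrp$std.error
p_values <- 2 * pnorm(-abs(beta / se))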

References: Penalized Variable Selection in Competing Risks Regression (pdf)

fastcmprsk:::fastCrr

The fastcmprsk package uses a C backend to fit the same unregularized and regularized Fine-Gray models as above. However, the implementation uses a novel algorithm that is much faster.

The model fitting function:

library(fastcmprsk)
fit_mgus_fastcrr <- fastCrr(Crisk(ftime = etime,
                                  fstatus = event,
                                  cencode = 0, failcode = 1) ~ hgb + creat,
                            data = train_dat)

Per documentation, the predict function calculates the cumulative incidence function.

fit_mgus_fastcrr %>% predict(train_dat %>% select(hgb, creat))

Though there is no tidy wrapper for the fcrr objects, the output is almost equivalent to the above crr objects, and making a wrapper would be simple.
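
As a rough illustration of what such a wrapper could look like; the $coef and $var slot names are assumptions about the fcrr object rather than verified against the package, so check str() of the fit first.

library(tibble)

# Hypothetical tidy() method for fcrr objects; slot names are assumed.
tidy.fcrr <- function(x, ...) {
  est <- as.numeric(x$coef)
  se  <- sqrt(diag(as.matrix(x$var)))
  tibble(
    term      = paste0("term_", seq_along(est)),
    estimate  = est,
    std.error = se,
    statistic = est / se,
    p.value   = 2 * pnorm(-abs(est / se))
  )
}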

References: Fine-Gray Regression via Forward-Backward Scan (pdf)

fastcmprsk:::fastCrrp

This is the faster implementation of penalized Fine-Gray models, similar to the crrp package.

The model fitting function:

fit_mgus_fastcrrp <- fastCrrp(Crisk(ftime = etime,
                                    fstatus = event,
                                    cencode = 0, failcode = 1) ~ hgb + creat, 
                              lambda = 0.01, penalty = "LASSO",
                              data = train_dat)

Unfortunately, there is also no predict function associated with the fcrrp objects. In addition, there is no summary() function, so making a tidy wrapper would also be difficult.

References: Fine-Gray Regression via Forward-Backward Scan (pdf)

Idea - Statistical testing library

So for the past few weeks I've been mulling over the idea of a tidy-style interface for statistical testing. The motivation came out of a few recent projects I've run at work, which led me to discover a few areas to improve. The idea for this package differs from infer's approach in that it would try to add as little as possible to learn for people coming from a non-statistics background. I apologize if this seems all over the place; it grew out of a need that I feel current packages don't meet, combined with a mix of a few newer tidyverse-style verbs.
I'm opening this issue for two reasons: first, to comply with the "how to contribute" guidance for tidy(verse/models) packages, and second, to get input from others on ways forward. One thing I want to point out: while this may seem to duplicate the infer package, I believe this would be a library for non-statisticians, to whom infer's syntax might seem foreign.

The original problem: the work project that launched this was a survey my organization ran for a government agency. We broke the survey down by 4 important demographic factors we also wanted to test. We ran numerous tests, with chi-squared, Shapiro-Wilk, and Kruskal-Wallis as the main three. We ran tests on each applicable question, so all 5 breakdowns (including overall) were output from a function into a list.

Current approach: with R's base stats package, the way data must be passed to a statistical test varies by test and can require transformations before the test is run. With more and more data being stored in a tidy manner, being able to compute statistical tests directly from tidy datasets would be beneficial. I've heard from a number of coworkers (all non-statisticians) that the inconsistencies in the stats package are one of the reasons R has a steep learning curve.

My approach: create a library that follows two principles. First, be as simple as possible for non-statisticians to pick up and use quickly; second, do this while adding as few new verbs as possible, if any, to the tidyverse/tidymodels universe.

Current working idea - A package tentatively called Tidy Tests.

  • Naming convention: the stringr convention of str_ prefixes was invaluable for me when learning that library, and I think it's reasonable to adopt the same idea with a prefix such as tt_chisq or tt_kruskall.
  • group_by(): some statistical tests would inherently work well with group_by() for simplicity, for example df %>% group_by(group_var) %>% tt_chisq(test_var).
  • across() and where(): the release of across() is actually what brought all of this about. Currently you can reduce duplication across tests by using map(), but that isn't very user friendly. Adding the across() syntax for all tests could lower the skill level required to speed up repetitive statistical testing, with usage something like df %>% group_by(group_var) %>% tt_chisq(across(...)).

As the problem statement suggests, returning the results as a list may not have been the optimal way to return the data. By the end of the project, and on later ones, I started using broom heavily and creating functions to output "clean"-looking data at the end. I believe the best option would be some sort of cleaned output, similar to broom's tidied output for other objects.
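
For comparison, a minimal sketch of how the grouped chi-squared example above can be written today with dplyr and broom, which is roughly the boilerplate the proposed tt_ verbs would hide. The names df, group_var, and test_var are the hypothetical names from the example.

library(dplyr)
library(broom)

# Grouped goodness-of-fit chi-squared test of test_var, one tidy row per group.
results <- df %>%
  group_by(group_var) %>%
  group_modify(~ tidy(chisq.test(table(.x$test_var))))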

Move `master` branch to `main`

The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.

That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.

The purpose of this issue is to:

  • Help us firm up the list of targeted repositories
  • Make sure all maintainers are aware of what's coming
  • Give us an issue to close when the job is done
  • Give us a place to put advice for collaborators re: how to adapt


Mixed effects cox models

I couldn't find any tidymodels implementation of coxme::coxme() in either broom or broom.mixed. Would it be valid to add?
