
stevenpawley / recipeselectors


Additional recipes for supervised feature selection to be used with the tidymodels recipes package

Home Page: https://stevenpawley.github.io/recipeselectors/

License: Other

R 100.00%

recipeselectors's People

Contributors

ags-github, stevenpawley, topepo


recipeselectors's Issues

Error at refit using modeltime_refit and modeltime_calibrate

Using recipe selectors with modeltime yields an error at modeltime_refit() with new data and at modeltime_calibrate() with testing data, even though the new data and the testing data contain all of the predictor columns present in the training data fed into the recipe.

Error: The following required columns are missing from new_data: "volume_lag88_roll_364", "event_holiday_thanksgiving".
ℹ️ These columns have one of the following roles, which are required at bake() time: "predictor".

Any way to pull which predictors are chosen for the final model?

Super excited to be able to try out recipeselectors in some of my work. I'd like to use recipeselectors::step_select_forests() with a rand_forest() model. I'm under the impression that recipeselectors::step_select_forests() will alter the final model so that only predictors that exceed a particular scoring threshold (or have the n highest scores) are used. Is this correct? If so, is there any way for us to see which predictors are actually selected?

I know that recipeselectors::pull_importances() can give us an idea of which features are most important (including factor levels within a predictor), but this seems to be different from what I'm actually after.
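One hedged way to check this yourself (not a documented API of the package; the recipe and data-frame names below are placeholders) is to prep the recipe and compare the column names before and after baking:

library(recipes)
library(recipeselectors)

# rec is a recipe containing step_select_forests(); train_data is the training set
prepped <- prep(rec, training = train_data)
kept <- names(bake(prepped, new_data = NULL))
setdiff(names(train_data), kept)  # predictors that the selection step dropped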

Still being developed?

No real issue here, just checking whether this package is still being developed and whether there are plans to "fully" bring it into the tidymodels family. I'm finding the various step_select_*() functions very helpful, but I'm apprehensive about using them in production code since the last commit looks to be a couple of years old and the package is not on CRAN yet. Thanks for all the effort that's gone into it so far.

Fatal error with recipes >= 1.0.6

step_select_boruta uses the deprecated recipes::terms_select(), which has been superseded by recipes::recipes_eval_select().

Error in `step_select_boruta()`:
Caused by error:
! `terms_select()` was deprecated in recipes 1.0.6 and is now defunct.
ℹ Please use `recipes_eval_select()` instead.
Backtrace:
     ▆
  1. ├─... %>% colnames()
  2. ├─base::colnames(.)
  3. │ └─base::is.data.frame(x)
  4. ├─dplyr::select(., -condition)
  5. ├─recipes::bake(., new_data = NULL)
  6. ├─recipes::prep(.)
  7. └─recipes:::prep.recipe(.)
  8.   ├─recipes:::recipes_error_context(...)
  9.   │ ├─base::withCallingHandlers(...)
 10.   │ └─base::force(expr)
 11.   ├─recipes::prep(x$steps[[i]], training = training, info = x$term_info)
 12.   └─recipeselectors:::prep.step_select_boruta(...)
 13.     └─recipes::terms_select(terms = x$terms, info = info)
 14.       └─lifecycle::deprecate_stop("1.0.6", "terms_select()", "recipes_eval_select()")
 15.         └─lifecycle:::deprecate_stop0(msg)
 16.           └─rlang::cnd_signal(...)
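Until the step is updated, one possible interim workaround (untested, and assuming an older recipes release is still compatible with the rest of your dependencies) is to pin recipes to the last version in which terms_select() was only deprecated rather than defunct:

# install the release immediately before recipes 1.0.6
remotes::install_version("recipes", version = "1.0.5")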

Recipe step for dealing with highly correlated values for ensemble stacked models

Dear Steven Pawley & Max Kuhn,
and other enthusiasts,

Regarding my request for help on RStudio Community https://community.rstudio.com/t/does-themis-package-feature-functions-for-dealing-with-continuous-data-imbalance/110432

I need a solution for my poorly performing ensemble stacked model, which I suspect is related to the feature pre-processing steps. Your recipeselectors package seems promising.

I already perform these preprocessing steps

 step_impute_mode(Product) %>% 
  step_novel(Site_Type, Tree, -all_outcomes()) %>% 
  step_dummy(Site_Type, Tree, one_hot = TRUE, naming = partial(dummy_names,sep = "_")) %>% 
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_corr(all_numeric(), -all_outcomes()) %>% 
  step_lincomb(all_numeric(), -all_outcomes()) %>% 
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_impute_mode(all_nominal(), -all_outcomes()) %>%
  step_impute_knn(logRR) 

As you can see from the model evaluation graphs, something goes wrong in the modelling. For some reason, my model performs exceptionally poorly, especially around the centre.

Here is a snapshot of my ensemble stacked model output.

[model evaluation plots of the ensemble stacked model]

step_select_boruta and step_select_mrmr need method for internally handling NAs

step_select_boruta and step_select_mrmr cannot handle data with missing (NA) values. This forces the user to remove or impute NAs in a recipe step prior to the feature selection step, which might not be desirable. It would be handy if step_select_boruta and step_select_mrmr could internally omit NAs, allowing the user to preserve them in the training data.
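Until then, a common workaround (a sketch only; the outcome name, data name, and choice of imputation steps are assumptions about a typical recipe) is to impute missing values in earlier steps so that the selection step only ever sees complete data:

rec <- recipe(outcome ~ ., data = train_data) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_select_boruta(all_predictors(), outcome = "outcome")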

Possible new step: FCBF

Hi! I have been using the fast correlation-based filter (FCBF) algorithm in one of my projects and made a package that implements it as a recipe step. It uses the Bioconductor FCBF package as an engine and exposes it through a step_fcbf() function. I am new to package development, so I'm sure there are things I can still tidy up, but it seems like it would fit well with the recipeselectors package. Is this of interest?
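For context, hypothetical usage might look something like the code below; the exact signature is up to the package author, so treat the arguments as assumptions:

library(recipes)
data(cells, package = "modeldata")

rec <- recipe(class ~ ., data = cells[, -1]) %>%
  step_fcbf(all_numeric_predictors(), outcome = "class")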

terms_select() deprecated, please use recipes_eval_select()

Hi recipeselectors developer,

I am trying to use step_select_vip, however, I encounter the following error:
Error in step_select_vip():
Caused by error:
! terms_select() was deprecated in recipes 1.0.6 and is now defunct.
ℹ Please use recipes_eval_select() instead.

Since I don't use terms_select() directly, I guess step_select_vip() uses this function internally. Is there a way to solve this issue?

Thank you!
Sukis

How to view the table of scores and selected variables for information gain feature selection?

Dear @stevenpawley,

I can't find the scores table with the 'variable' and 'scores' columns containing the variable names and their scores. Could you help me find these values?

# Information gain feature selection ----

recipe_spec <- recipe(value ~ ., 
                      data = training(emergency_tscv$splits[[1]])) %>%
  step_timeseries_signature(date) %>%
  step_rm(matches("(.iso$)|(.xts$)|(.lbl$)|(hour)|(minute)|(second)|(am.pm)|(date_year$)")) %>%
  step_normalize(date_index.num, tempe_verage, tempemin, tempemax, -all_outcomes()) %>%
  step_select_infgain(all_predictors(), top_p = 25, outcome = "value") %>%
  step_mutate(data = factor(value, ordered = TRUE)) %>%
  step_dummy(all_nominal(), one_hot = TRUE)
# Model 1: Xgboost ----
wflw_fit_xgboost <- workflow() %>%
  add_model(
    boost_tree("regression") %>% set_engine("xgboost") 
  ) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(emergency_tscv$splits[[1]]))

After training the model with the above preprocessing steps, the scores should have been calculated, but I can't find them. Please can you help me? An example of how to view the scores would suffice.
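In the meantime, one way to at least see which predictors the information-gain step retained (a sketch; it shows the selected columns rather than the scores themselves) is to pull the prepped recipe out of the fitted workflow and inspect the baked column names:

prepped_rec <- workflows::extract_recipe(wflw_fit_xgboost)
# note: these are the columns after all steps, including the later dummy step
names(bake(prepped_rec, new_data = NULL))

If the selection step also provides a tidy() method, calling tidy() on the prepped recipe with that step's number may expose the scores, but I have not verified that.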

step_select_vip & dummy variables

Using step_dummy() followed by step_select_vip(all_predictors()) results in the top_p predictors plus all of the dummy variables. Is it possible to include the dummy variables in the top_p selection?

`terms_select()` was deprecated in recipes 1.0.6 and is now defunct in step_select_boruta

Hi!

I was trying to use step_select_vip or step_select_boruta but I get this error message:

Error in `step_select_boruta()`:
Caused by error:
! `terms_select()` was deprecated in recipes 1.0.6 and is now defunct.
ℹ Please use `recipes_eval_select()` instead.
Backtrace:
  1. recipes::prep(rec)
  2. recipes:::prep.recipe(rec)
  7. recipeselectors:::prep.step_select_boruta(...)
  8. recipes::terms_select(terms = x$terms, info = info)
  9. lifecycle::deprecate_stop("1.0.6", "terms_select()", "recipes_eval_select()")
 10. lifecycle:::deprecate_stop0(msg)

This is my code, trying to replicate the examples:

# cleaning data using feature selection
install.packages("tidymodels")
install.packages("conflicted")
install.packages("devtools")
devtools::install_github("stevenpawley/recipeselectors")
library(tidymodels)
library(devtools)
library(recipeselectors)
tidymodels_prefer(quiet = FALSE)

# define a base model to use for feature importances
base_model <- rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "permutation")

# create a preprocessing recipe
rec <- morfocol_sic_cal_clean %>%
  recipe(taxon ~ .) %>%
  recipeselectors::step_select_vip(all_numeric_predictors(),
                                   model = base_model, top_p = 2,
                                   outcome = "taxon") %>%
  step_nzv(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1.1

Could these lines be related to the error?

#' @export
prep.step_select_boruta <- function(x, training, info = NULL, ...) {
  # translate the terms arguments
  x_names <- recipes::terms_select(terms = x$terms, info = info)
  y_name <- recipes::terms_select(x$outcome, info = info)
  y_name <- y_name[1]
  if (length(x_names) > 0) {

Thanks!
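Yes, those are the offending calls. A minimal, untested sketch of how the prep method could be updated for recipes >= 1.0.6 (using recipes_eval_select(), which also needs the training data) might look like this:

prep.step_select_boruta <- function(x, training, info = NULL, ...) {
  # recipes_eval_select() replaces the defunct terms_select() and takes
  # the training data in addition to the variable info
  x_names <- recipes::recipes_eval_select(x$terms, training, info)
  y_name <- recipes::recipes_eval_select(x$outcome, training, info)
  y_name <- y_name[1]
  # ... the remainder of the original method is unchanged
}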

Issues when tuning parameters from 3 different sources

As I will show in the reprex below, I ran into issues when tuning model arguments and recipe arguments (from both recipes and recipeselectors) by merging the grids.
I tried numerous ways, but I always get the error message: preprocessor 3/3: Error: You cannot `prep()` a tuneable recipe. Argument(s) with `tune()`: 'top_p'. Do you want to use a tuning function such as `tune_grid()`?

If I tune all of the model and recipe arguments except top_p, it all works fine. How can I understand this issue?

#### LIBS

suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(themis))
suppressPackageStartupMessages(library(doParallel))
suppressPackageStartupMessages(library(recipeselectors))


#### DATA

df <- fread("Churn_Modelling.csv") # source: https://www.kaggle.com/shrutimechlearn/churn-modelling

set.seed(31)

split <- initial_split(df, prop = 0.8)
train <- training(split)
test <- testing(split)

k_folds_data <- vfold_cv(training(split), v = 10)

#### FEATURES 

# Define the recipe for Up-Sampling
rec <- recipe(Exited ~ ., data = train) %>%
	step_rm(one_of("RowNumber", "Surname")) %>%
	update_role(CustomerId, new_role = "Helper") %>%
	step_num2factor(all_outcomes(),
					levels = c("No", "Yes"),
					transform = function(x) {x + 1}) %>%
	step_normalize(all_numeric(), -has_role(match = "Helper")) %>%
	step_dummy(all_nominal(), -all_outcomes()) %>%
	step_nzv(all_predictors()) %>%
	themis::step_upsample(Exited) %>%
	step_other(all_nominal(), threshold = tune("cat_thresh")) %>% 
	step_corr(all_predictors(), threshold = tune("thresh_cor")) %>% 
	#step_pca(all_numeric(), -all_outcomes(), num_comp = tune())
	step_select_roc(all_predictors(), outcome = "Exited", top_p = tune())
    

#### MODEL

model_metrics <- metric_set(roc_auc)            

# xgboost model
xgb_spec <- boost_tree(
	trees = tune(), 
	tree_depth = tune(), min_n = tune(), 
	loss_reduction = tune(),                    
	sample_size = tune(), mtry = tune(),         
	learn_rate = tune(),                        
	stop_iter = tune()
) %>% 
	set_engine("xgboost") %>% 
	set_mode("classification")

# grid
xgb_grid <- grid_latin_hypercube(
	trees(),
	tree_depth(),
	min_n(),
	loss_reduction(),
	sample_size = sample_prop(),
	finalize(mtry(), train),
	learn_rate(),
	stop_iter(range = c(5L,50L)),
	size = 10
)

rec_grid <- grid_latin_hypercube(
	parameters(rec) %>% 
		update(top_p = top_p(c(1,11))) ,
	size = 10
)

comp_grid <- merge(xgb_grid, rec_grid)

# tune
cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)
set.seed(234)
model_res <- tune_grid(xgb_spec, preprocessor = rec,
					   resamples = k_folds_data,
					   grid = comp_grid,
					   metrics = model_metrics)
stopCluster(cl)
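For what it's worth, an alternative worth trying (an untested sketch; it assumes a tidymodels version that provides extract_parameter_set_dials()) is to bundle the model and recipe into a single workflow, extract one parameter set from that workflow, and let tune_grid() build the grid, so that nothing ever tries to prep() the tunable recipe directly:

wflow <- workflow() %>%
  add_model(xgb_spec) %>%
  add_recipe(rec)

# one parameter set covering both model and recipe arguments
wflow_params <- extract_parameter_set_dials(wflow) %>%
  update(top_p = top_p(c(1, 11)),
         mtry = finalize(mtry(), train))

set.seed(234)
model_res <- tune_grid(wflow,
                       resamples = k_folds_data,
                       grid = grid_latin_hypercube(wflow_params, size = 10),
                       metrics = model_metrics)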

Question.

Hello!

Is there stepwise feature selection within tidyverse?

general thoughts on feature selection in tidymodels

I'm debating where to include supervised feature selection inside of tidymodels. Should it be inside of recipes? I'll brainstorm out loud here; pardon my unsolicited ramblings.

pro-recipes pov:

  • Simple and already uses a specification that people know about (a recipe).

  • This lets the user define the pre-processing and filtering order. It is not obvious which should come first and may need some experimentation for each data set.

con-recipes pov:

  • Can't easily combine filters. For example, as with a volcano plot, I might want to filter on statistical significance (e.g. p-value/FDR) and the size of a difference simultaneously. In this specific case, there would be a step that has these two criteria as arguments but, in general, more complex filter combinations would be difficult within a recipe.

  • We might have to repeat some computations (but potentially a lot). Take a "select the best X predictors by ROC score" scenario. We'd like to loop over models for each value of X so that we don't repeat the recipe execution when it is not needed. We have something like this in parsnip (via multi_predict()) but it can't be done inside of a recipe. Imagine a complex text recipe that does stemming, tokenization, and a bunch of heavy computations before the filter step. For each value of X that we want to search over, those computations get repeated. (I mention below that this may be better solved in a specific function for RFE).

Originally I had thought up a filter specification (sort of like a recipe) that would define statistics (e.g. p-values, summary stats like ROC, model importance) and then rules to combine them (e.g. "ROC > 0.8 or being in the top 3 RF importance scores"). This would get included in a workflow and executed accordingly.

This method would be very expressive, but it would be yet another specification for users to fill out. That's why I haven't worked on it further.

Switching gears, here are some specific design thoughts for this package:

  • The package name is fairly general. Can you come up with something that is more about supervised feature selection (as opposed to just selection)?

  • Maybe we should have a specific naming convention for these steps (step_filter_*, step_select_* or something like that).

  • The filter steps might be parameterized so that the top n features are selected and/or features passing specific values are kept (e.g. keep features with ROC values > 0.8 and the top 3 ROC features). This would avoid selecting out all of the features. Looking through some of the steps, you may already have that but, in some cases, one overrides the other. Maybe default both to NA and require users to fill at least one in. Filling in both would be an effective "or" unless otherwise noted (see the sketch after this list).

  • Alternatively, if a filter excludes everything, you might want an option to always take the top score (even if it sucks).

  • For importance scores, we've been using the vip package so maybe importing that would be a good idea. They have the wrappers worked out already. This would also offset the number of dependencies for your package.

  • I think that the steps should mostly be filter methods (instead of wrappers). Some wrappers/algorithms (like RFE) could be done via the functions in tune. For more complex algorithms, I think that we would want functions that take a model workflow as input. Maybe functions like search_rfe(), search_sa(), etc. For the sake of package size, it might make sense for those to live in a separate package.

  • Some existing recipes use an argument name of outcome for specifying the outcome column.

  • Also, in terms of argument names, top_n or something similar is more generic than num_comp.

  • If threshold is an option to filter the importance scores, there should also be the option to standardize the score range (0-1 maybe?)
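To make the "top n and/or threshold" idea concrete, here is a small illustrative helper (purely hypothetical, not part of recipeselectors) showing "or" semantics for the two arguments plus a keep-the-best-score fallback when everything would otherwise be filtered out:

# scores: a named numeric vector of importance/filter scores per predictor
select_features <- function(scores, top_p = NA, threshold = NA) {
  keep_top <- if (!is.na(top_p)) {
    names(sort(scores, decreasing = TRUE))[seq_len(min(top_p, length(scores)))]
  } else {
    character(0)
  }
  keep_thr <- if (!is.na(threshold)) names(scores)[scores >= threshold] else character(0)
  selected <- union(keep_top, keep_thr)
  # fall back to the single best-scoring feature if nothing passes either rule
  if (length(selected) == 0) selected <- names(which.max(scores))
  selected
}

scores <- c(x1 = 0.92, x2 = 0.75, x3 = 0.40, x4 = 0.10)
select_features(scores, top_p = 1, threshold = 0.7)  # "x1" "x2"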

Which feature selectors to use for regression?

Hi Steven,

Thank you so much for making this great package!

Among the list of selectors that you showed here, which ones are usable for regression (i.e. where the target outcome is numeric)?

I've tried step_select_mrmr() with this data.

> toxic_feat_outcome_dat

# A tibble: 30 × 13
   toxic_outcome foo_energy charge boman hmoment
           <dbl>              <dbl>      <dbl>     <dbl>       <dbl>
 1         0.570              -750.      0.943      1.61       0.641
 2         0.626              -750.      6.09       5.30       0.278
 3         1.49              -1120.      6.99       2.49       0.461
 4         2.15               -938.      9.09       3.29       0.623
 5         1.04               -927.      3.12       2.66       0.469
 6         1.57              -1272.      9.00       5.73       0.604
 7         1.99              -1094.      4.57       4.33       0.329
 8         1.24               -933.      2.94       2.65       0.339
 9         1.40              -1076.      6.12       2.87       0.469
10         1.20              -1002.      4.94       3.48       0.427
# … with 20 more rows, and 8 more variables: hydrophobicity <dbl>,
#   insta <dbl>, length <dbl>, masshift <dbl>, mw <dbl>,
#   mz <dbl>, pi <dbl>, PEP <dbl>

It works for me. But I'm not sure if it's appropriate.

mrmr_rec <- recipe(toxic_outcome ~ ., data = toxic_feat_outcome_dat ) %>%
  step_select_mrmr(all_predictors(), outcome = "toxic_outcome", threads = 2,  
                   top_p = dim(toxic_feat_outcome_dat)[1], threshold = 0.9)

The reason I ask is that in your example the outcome class is categorical.

library(recipes)
data(cells, package = "modeldata")
rec <- recipe(class ~ ., data = cells[, -1]) %>%
  step_select_mrmr(all_predictors(), outcome = "class", top_p = 10, threshold = 0.9)

Thanks and hope to hear from you again.

Sincerely,
G.V.
