tidymodels / embed
Extra recipes for predictor embeddings
Home Page: https://embed.tidymodels.org
License: Other
Prepare for release:
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::revdep_check(num_workers = 4)
cran-comments.md
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
The readme has most of the essential information but is missing the titles and sections used in most other tidymodels packages.
See https://github.com/tidymodels/broom for an example.
In tidymodels/recipes#635 we decided to unify on keep_original_cols for the idea of retaining/preserving variables, but here we've got at least one other argument:
Lines 35 to 36 in d46dd82
In that PR I have an example of deprecating a different option (preserve) for step_pls() and switching over to the new argument name.
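A minimal base R sketch of how such a rename is often handled. The helper name resolve_keep_original_cols() and the warning text are hypothetical and are not taken from the PR:

```r
# Hypothetical helper: map a deprecated `preserve` argument onto
# `keep_original_cols`, warning when the old name is used.
resolve_keep_original_cols <- function(keep_original_cols = FALSE,
                                       preserve = NULL) {
  if (!is.null(preserve)) {
    warning(
      "`preserve` is deprecated; use `keep_original_cols` instead.",
      call. = FALSE
    )
    keep_original_cols <- preserve
  }
  stopifnot(is.logical(keep_original_cols), length(keep_original_cols) == 1)
  keep_original_cols
}

# The new argument passes through untouched.
resolve_keep_original_cols(keep_original_cols = TRUE)

# The old argument still works but emits a deprecation warning.
suppressWarnings(resolve_keep_original_cols(preserve = TRUE))
```

A step constructor could call a helper like this once, so the rest of the step only ever sees keep_original_cols.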
Hello, step_embed does not seem to work with parallel processing in caret. I think it may be related to topepo/caret#860.
I am getting this error, which is very similar to the error in the previous issue:
Error in {: task 1 failed - "$ operator is invalid for atomic vectors"
Here is a reproducible example:
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(embed)
#> Registered S3 method overwritten by 'xts':
#> method from
#> as.zoo.xts zoo
library(doParallel)
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] parallel stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] doParallel_1.0.15 iterators_1.0.12 foreach_1.4.7 embed_0.0.3
#> [5] recipes_0.1.6 dplyr_0.8.3 caret_6.0-84 ggplot2_3.2.1
#> [9] lattice_0.20-38
#>
#> loaded via a namespace (and not attached):
#> [1] minqa_1.2.4 colorspace_1.4-1 class_7.3-15
#> [4] ggridges_0.5.1 rsconnect_0.8.15 markdown_1.0
#> [7] base64enc_0.1-3 rstan_2.19.2 DT_0.7
#> [10] prodlim_2018.04.18 lubridate_1.7.4 codetools_0.2-16
#> [13] splines_3.6.0 knitr_1.23 shinythemes_1.1.2
#> [16] zeallot_0.1.0 bayesplot_1.7.0 jsonlite_1.6
#> [19] nloptr_1.2.1 tfruns_1.4 uwot_0.1.3
#> [22] shiny_1.3.2 compiler_3.6.0 backports_1.1.4
#> [25] assertthat_0.2.1 Matrix_1.2-17 lazyeval_0.2.2
#> [28] cli_1.1.0 later_0.8.0 htmltools_0.3.6
#> [31] prettyunits_1.0.2 tools_3.6.0 igraph_1.2.4.1
#> [34] gtable_0.3.0 glue_1.3.1 reshape2_1.4.3
#> [37] Rcpp_1.0.2 vctrs_0.2.0 nlme_3.1-139
#> [40] crosstalk_1.0.0 timeDate_3043.102 gower_0.2.1
#> [43] xfun_0.8 stringr_1.4.0 ps_1.3.0
#> [46] lme4_1.1-21 lifecycle_0.1.0 mime_0.7
#> [49] miniUI_0.1.1.1 gtools_3.8.1 MASS_7.3-51.4
#> [52] zoo_1.8-6 scales_1.0.0 ipred_0.9-9
#> [55] rstanarm_2.18.2 colourpicker_1.0 promises_1.0.1
#> [58] inline_0.3.15 shinystan_2.5.0 yaml_2.2.0
#> [61] reticulate_1.13 gridExtra_2.3 loo_2.1.0
#> [64] StanHeaders_2.18.1-10 keras_2.2.4.1 rpart_4.1-15
#> [67] stringi_1.4.3 highr_0.8 tensorflow_1.13.1
#> [70] dygraphs_1.1.1.6 boot_1.3-22 pkgbuild_1.0.3
#> [73] lava_1.6.6 rlang_0.4.0 pkgconfig_2.0.2
#> [76] matrixStats_0.54.0 evaluate_0.14 purrr_0.3.2
#> [79] rstantools_1.5.1 htmlwidgets_1.3 tidyselect_0.2.5
#> [82] processx_3.4.1 plyr_1.8.4 magrittr_1.5.0.9000
#> [85] R6_2.4.0 generics_0.0.2 pillar_1.4.2
#> [88] whisker_0.3-2 withr_2.1.2 xts_0.11-2
#> [91] survival_2.44-1.1 nnet_7.3-12 tibble_2.1.3
#> [94] crayon_1.3.4 rmarkdown_1.13.6 grid_3.6.0
#> [97] data.table_1.12.2 callr_3.3.1 ModelMetrics_1.2.2
#> [100] threejs_0.3.1 digest_0.6.20 xtable_1.8-4
#> [103] tidyr_0.8.99.9000 httpuv_1.5.1 RcppParallel_4.4.3
#> [106] stats4_3.6.0 munsell_0.5.0 shinyjs_1.0
mtcars2 <- as_tibble(mtcars)
mtcars2 <- mtcars2 %>%
  mutate(cyl = as.factor(paste0("num_", cyl))) %>%
  mutate(am = as.factor(ifelse(am == 1, "am", "not_am")))
rec <- recipe(am ~ cyl + hp, mtcars2) %>%
  step_embed(
    cyl,
    outcome = vars(am),
    options = embed_control(epochs = 75, validation_split = 0.2)
  )
ctrl <- trainControl(
  method = 'cv',
  number = 5,
  savePredictions = 'final',
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  sampling = NULL,
  returnData = FALSE
)
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
train(
  rec,
  mtcars2,
  method = "glm",
  metric = "ROC",
  trControl = ctrl
)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 6913 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Error in {: task 1 failed - "$ operator is invalid for atomic vectors"
stopCluster(cl)
Created on 2019-08-26 by the reprex package (v0.3.0)
And here is the same example, working without parallel processing:
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(embed)
#> Registered S3 method overwritten by 'xts':
#> method from
#> as.zoo.xts zoo
sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] embed_0.0.3 recipes_0.1.6 dplyr_0.8.3 caret_6.0-84
#> [5] ggplot2_3.2.1 lattice_0.20-38
#>
#> loaded via a namespace (and not attached):
#> [1] minqa_1.2.4 colorspace_1.4-1 class_7.3-15
#> [4] ggridges_0.5.1 rsconnect_0.8.15 markdown_1.0
#> [7] base64enc_0.1-3 rstan_2.19.2 DT_0.7
#> [10] prodlim_2018.04.18 lubridate_1.7.4 codetools_0.2-16
#> [13] splines_3.6.0 knitr_1.23 shinythemes_1.1.2
#> [16] zeallot_0.1.0 bayesplot_1.7.0 jsonlite_1.6
#> [19] nloptr_1.2.1 tfruns_1.4 uwot_0.1.3
#> [22] shiny_1.3.2 compiler_3.6.0 backports_1.1.4
#> [25] assertthat_0.2.1 Matrix_1.2-17 lazyeval_0.2.2
#> [28] cli_1.1.0 later_0.8.0 htmltools_0.3.6
#> [31] prettyunits_1.0.2 tools_3.6.0 igraph_1.2.4.1
#> [34] gtable_0.3.0 glue_1.3.1 reshape2_1.4.3
#> [37] Rcpp_1.0.2 vctrs_0.2.0 nlme_3.1-139
#> [40] iterators_1.0.12 crosstalk_1.0.0 timeDate_3043.102
#> [43] gower_0.2.1 xfun_0.8 stringr_1.4.0
#> [46] ps_1.3.0 lme4_1.1-21 lifecycle_0.1.0
#> [49] mime_0.7 miniUI_0.1.1.1 gtools_3.8.1
#> [52] MASS_7.3-51.4 zoo_1.8-6 scales_1.0.0
#> [55] ipred_0.9-9 rstanarm_2.18.2 colourpicker_1.0
#> [58] promises_1.0.1 parallel_3.6.0 inline_0.3.15
#> [61] shinystan_2.5.0 yaml_2.2.0 reticulate_1.13
#> [64] gridExtra_2.3 loo_2.1.0 StanHeaders_2.18.1-10
#> [67] keras_2.2.4.1 rpart_4.1-15 stringi_1.4.3
#> [70] highr_0.8 tensorflow_1.13.1 dygraphs_1.1.1.6
#> [73] foreach_1.4.7 boot_1.3-22 pkgbuild_1.0.3
#> [76] lava_1.6.6 rlang_0.4.0 pkgconfig_2.0.2
#> [79] matrixStats_0.54.0 evaluate_0.14 purrr_0.3.2
#> [82] rstantools_1.5.1 htmlwidgets_1.3 tidyselect_0.2.5
#> [85] processx_3.4.1 plyr_1.8.4 magrittr_1.5.0.9000
#> [88] R6_2.4.0 generics_0.0.2 pillar_1.4.2
#> [91] whisker_0.3-2 withr_2.1.2 xts_0.11-2
#> [94] survival_2.44-1.1 nnet_7.3-12 tibble_2.1.3
#> [97] crayon_1.3.4 rmarkdown_1.13.6 grid_3.6.0
#> [100] data.table_1.12.2 callr_3.3.1 ModelMetrics_1.2.2
#> [103] threejs_0.3.1 digest_0.6.20 xtable_1.8-4
#> [106] tidyr_0.8.99.9000 httpuv_1.5.1 RcppParallel_4.4.3
#> [109] stats4_3.6.0 munsell_0.5.0 shinyjs_1.0
mtcars2 <- as_tibble(mtcars)
mtcars2 <- mtcars2 %>%
  mutate(cyl = as.factor(paste0("num_", cyl))) %>%
  mutate(am = as.factor(ifelse(am == 1, "am", "not_am")))
rec <- recipe(am ~ cyl + hp, mtcars2) %>%
  step_embed(
    cyl,
    outcome = vars(am),
    options = embed_control(epochs = 75, validation_split = 0.2)
  )
ctrl <- trainControl(
  method = 'cv',
  number = 5,
  savePredictions = 'final',
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  sampling = NULL,
  returnData = FALSE
)
train(
  rec,
  mtcars2,
  method = "glm",
  metric = "ROC",
  trControl = ctrl
)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 8761 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 7367 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 1120 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 3630 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 6134 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 6598 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 8840 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Generalized Linear Model
#>
#> Recipe steps: embed
#> Resampling: Cross-Validated (5 fold)
#> Summary of sample sizes: 26, 25, 25, 25, 27
#> Resampling results:
#>
#> ROC Sens Spec
#> 0.775 0.7 0.7333333
Created on 2019-08-26 by the reprex package (v0.3.0)
Via the VBsparsePCA package.
I am just wondering about features created by embeddings using the label information. Is there any data leakage problem if I try to build a model with those added features?
Guo, C and Berkhahn F (2106)
I think I reported that before :)
I'm having trouble with printing a recipe that includes step_woe(). Other steps (printed via the recipe) produce a single line of explanation. Step_woe() looks like a raw print. Perhaps print.step_woe is not properly exported.
library(recipes)
library(embed)
library(modeldata)
data("credit_data")
rec <- recipe(Status ~ ., data = credit_data) %>%
  step_center(all_numeric()) %>%
  step_woe(Job, Home, outcome = vars(Status)) %>%
  step_scale(all_numeric())
print("The printed recipe follows - Note that step_woe is a 'raw' print compared to step_center & step_scale")
print(rec)
This package has some very heavy (system) dependencies that are required in any case (via Imports:). Is it an option for you to move some of them into Suggests: to make a base installation lighter? As far as I understand this package, most recipes steps from this package require just one or two of these dependencies (like tensorflow), so if I only use say one step from this package, I technically don't need the majority of the dependencies in Imports:. This in particular matters in model deployment.
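One common pattern for this, sketched in base R under the assumption that a heavy dependency has been moved to Suggests:. The helper name require_suggested() is hypothetical and not part of embed:

```r
# Hypothetical helper: fail with a clear message when an optional
# (Suggests:) package is needed by a step but is not installed.
require_suggested <- function(pkg, step) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(
      "Package '", pkg, "' is required for ", step,
      "; install it with install.packages(\"", pkg, "\").",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# A step would call the helper right before it touches the dependency.
# "stats" ships with R, so this call succeeds silently.
require_suggested("stats", "this example")
```

With a check like this at the top of each step, only the steps a user actually runs would need their dependencies installed.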
Hi!
I was just pondering if a function to conduct t-SNE analysis could be implemented.
Looking for dimensionality reduction options in the recipes package, I found that the embed package contains step_umap (and, of course, the recipes package contains step_pca).
I'd like to thank you guys for the work you are doing in this and in many other R packages. :)
step_umap() creates column names in an inconsistent manner compared to how the dimensionality reduction steps in {recipes} are doing it. Adding a prefix argument in step_umap() could properly resolve this issue.
library(recipes)
library(embed)
recipe(~., data = mtcars) %>%
  step_pca(all_predictors()) %>%
  prep() %>%
  bake(new_data = mtcars) %>%
  names()
#> [1] "PC1" "PC2" "PC3" "PC4" "PC5"
recipe(~., data = mtcars) %>%
  step_ica(all_predictors()) %>%
  prep() %>%
  bake(new_data = mtcars) %>%
  names()
#> [1] "IC1" "IC2" "IC3" "IC4" "IC5"
recipe(~., data = mtcars) %>%
  step_nnmf(all_predictors()) %>%
  prep() %>%
  bake(new_data = mtcars) %>%
  names()
#> [1] "NNMF1" "NNMF2"
recipe(~., data = mtcars) %>%
  step_umap(all_predictors()) %>%
  prep() %>%
  bake(new_data = mtcars) %>%
  names()
#> [1] "umap_1" "umap_2"
Created on 2021-03-12 by the reprex package (v0.3.0)
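A rough sketch of what a prefix argument could drive internally. The helper component_names() and its default prefix are assumptions for illustration, not how step_umap() is implemented:

```r
# Hypothetical helper: build component names from a user-supplied prefix,
# following the "PC1", "IC1", "NNMF1" pattern shown above.
component_names <- function(num_comp, prefix = "UMAP") {
  paste0(prefix, seq_len(num_comp))
}

component_names(2)
component_names(2, prefix = "umap_")
```

Passing prefix = "umap_" would reproduce the current names, so existing code could keep working while the default moves to the pattern used by the other steps.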
I successfully applied step_lencode_glm and step_embed to small datasets.
However, step_lencode_bayes takes forever: even on a very small dataset like AmesHousing it had not finished after 1 h, which is when I aborted.
Does this have method-intrinsic reasons?
Furthermore, I am confused by the output message "Linear embedding for factors via GLM for all_nominal()".
Which Bayesian model exactly is implemented here? A reference to a paper would be great, as I need it for a publication.
Here is a reproducible example.
library(caret)     # createDataPartition()
library(dplyr)     # select(), %>%
library(magrittr)  # %T>%
library(recipes)   # recipe(), prep(), juice(), bake()
library(embed)     # step_lencode_bayes()
dataset <- AmesHousing::make_ames()
target.label <- "Sale_Price"
target <- dataset[[target.label]]
features.labels <- dataset %>% select(-target.label) %>% names
train.index <- createDataPartition(target, p = 0.8, list = FALSE) %>% as.vector()
training.set <- dataset[train.index, ]
testing.set <- dataset[-train.index, ]
features.labels <- training.set %>%
  select(-target.label) %>% names %T>% print
recipe.base <- features.labels %>%
  paste(collapse = " + ") %>%
  paste(target.label, "~", .) %>%
  as.formula %>%
  recipe(training.set)
recipe.encoding <- recipe.base %>%
  step_lencode_bayes(all_nominal(), outcome = vars(target.label))
# this step takes forever
prep.encoding <- prep(recipe.encoding, training = training.set, retain = TRUE)
training.set.juiced <- juice(prep.encoding) %T>% print
# bake the held-out rows with the prepped recipe
testing.set.baked <- prep.encoding %>% bake(testing.set)
Running the prep step on the first 50 rows of the training set throws this cryptic error, for which I would appreciate an explanation:
Error: grouping factors must have > 1 sampled level
In addition: Warning messages:
1: There were 1 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Any hints welcome!
step_woe (via add_woe(), which is run during both prep() and bake()) warns if the dictionary contains more than 50 levels for the factor. It gets quite noisy when using the recipe in any resampling scheme.
Condition the warning based on an argument to step_woe(), say max_levels. If any of the provided predictors have more unique values than max_levels, emit the warning during prep(). The default value would be 50 to match current behavior.
Happy to make a PR if you agree. The reprex below demonstrates the current behavior on prep() and bake().
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(tibble)
library(embed)
packageVersion("embed")
#> [1] '0.1.4'
set.seed(100)
data <- tibble(
  y = factor(sample(c('yes', 'no'), size = 52*10, replace = TRUE)),
  fct = factor(sample(c(letters, LETTERS), size = 52*10, replace = TRUE))
)
rec <- recipe(y~fct, data = data) %>%
  step_woe(fct, outcome = vars(y))
# warns on `prep()`, which I appreciate
prec <- prep(rec)
#> Warning: Variable fct has 52 unique values. Is this expected? In case of numeric
#> variable, see ?step_discretize().
# also warns on `bake()`, which I don't much.
bake(prec, new_data = data)
#> Warning: Variable fct has 52 unique values. Is this expected? In case of numeric
#> variable, see ?step_discretize().
#> # A tibble: 520 x 2
#> y woe_fct
#> <fct> <dbl>
#> 1 no -0.590
#> 2 yes -0.254
#> 3 no 0.192
#> 4 no 0.817
#> 5 yes -0.542
#> 6 yes -0.0308
#> 7 no 0.123
#> 8 no -0.282
#> 9 no 1.22
#> 10 yes -0.318
#> # ... with 510 more rows
Created on 2021-02-24 by the reprex package (v1.0.0)
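The proposal could look roughly like the base R sketch below. The helper warn_if_many_levels() and the shortened warning text are hypothetical; only the max_levels name and its default of 50 come from the suggestion above:

```r
# Hypothetical helper: warn only when a predictor has more unique values
# than `max_levels`. Intended to be called from prep(), not from bake().
warn_if_many_levels <- function(x, name, max_levels = 50) {
  n_unique <- length(unique(x))
  if (n_unique > max_levels) {
    warning(
      "Variable ", name, " has ", n_unique, " unique values. ",
      "Is this expected?",
      call. = FALSE
    )
  }
  invisible(n_unique)
}

fct <- factor(c(letters, LETTERS))                 # 52 unique values
warn_if_many_levels(fct, "fct", max_levels = 100)  # silent: 52 <= 100
```

Raising max_levels silences the warning for predictors that are known to be wide, while the default keeps today's threshold.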
I have a large set of recipes (you could call them a cookbook). To manage all the recipes, I like to programmatically extract the step function and all the columns the function applies to.
However, step_woe has an inconsistency in the naming that makes programmatically extracting this data more difficult.
It refers to the terms as variables. You can see it in the source code here:
Line 411 in cdabcf2
Even though the variable name is term_names, for some reason the column name was called variables. This is inconsistent with other embed::step_* functions.
For example:
Line 294 in cdabcf2
I'm happy to help out with a PR if you are interested. I'm also working through an audit of recipes::step_* functions, but I haven't had a chance to post yet.
Here is a reproducible example showing all the tidy.step_* functions. Only step_woe has the variables column name:
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(embed)
#> Registered S3 method overwritten by 'xts':
#> method from
#> as.zoo.xts zoo
library(dplyr)
library(mlbench)
set.seed(124)
data(PimaIndiansDiabetes)
d <- PimaIndiansDiabetes
d <- d %>%
  as_tibble() %>%
  select(diabetes, everything())
# make factor variables
d <- d %>%
  mutate(mass_fct = factor(ifelse(mass > 30, "large", "small"))) %>%
  mutate(pregnant_fct = as.factor(pregnant)) %>%
  mutate(pressure_fct = factor(case_when(
    pressure < 30 ~ "low",
    between(pressure, 30, 50) ~ "medium",
    pressure > 50 ~ "high"
  ))) %>%
  mutate(triceps_fct = factor(ifelse(triceps > 0, "has", "none"))) %>%
  mutate(insulin_fct = factor(insulin)) %>%
  mutate(age_fct = factor(age))
# steps in `embed`
embed_rec <- recipe(diabetes ~ ., d) %>%
  embed::step_woe(mass_fct, outcome = diabetes) %>%
  embed::step_lencode_bayes(pregnant_fct, outcome = vars(diabetes)) %>%
  embed::step_lencode_glm(pressure_fct, outcome = vars(diabetes)) %>%
  embed::step_lencode_mixed(triceps_fct, outcome = vars(diabetes)) %>%
  embed::step_embed(
    insulin_fct,
    outcome = vars(diabetes),
    options = embed_control(epochs = 1)
  ) %>%
  embed::step_umap(pedigree, outcome = vars(diabetes))
embed_rec_prepped <- prep(embed_rec, d)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: `.key` is deprecated
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> boundary (singular) fit: see ?isSingular
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 2636 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
for (i in seq_along(embed_rec_prepped$steps)) {
  embed_rec_prepped %>%
    tidy(i) %>%
    print()
}
#> # A tibble: 2 x 9
#> variable predictor n_tot n_neg n_pos p_neg p_pos woe id
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 mass_fct large 465 250 215 0.5 0.802 0.473 woe_Jltqf
#> 2 mass_fct small 303 250 53 0.5 0.198 -0.928 woe_Jltqf
#> # A tibble: 18 x 4
#> level value terms id
#> <chr> <dbl> <chr> <chr>
#> 1 0 0.630 pregnant_fct lencode_bayes_Dh6Ag
#> 2 1 1.20 pregnant_fct lencode_bayes_Dh6Ag
#> 3 2 1.33 pregnant_fct lencode_bayes_Dh6Ag
#> 4 3 0.543 pregnant_fct lencode_bayes_Dh6Ag
#> 5 4 0.630 pregnant_fct lencode_bayes_Dh6Ag
#> 6 5 0.515 pregnant_fct lencode_bayes_Dh6Ag
#> 7 6 0.679 pregnant_fct lencode_bayes_Dh6Ag
#> 8 7 -0.0949 pregnant_fct lencode_bayes_Dh6Ag
#> 9 8 -0.139 pregnant_fct lencode_bayes_Dh6Ag
#> 10 9 -0.280 pregnant_fct lencode_bayes_Dh6Ag
#> 11 10 0.353 pregnant_fct lencode_bayes_Dh6Ag
#> 12 11 -0.0729 pregnant_fct lencode_bayes_Dh6Ag
#> 13 12 0.304 pregnant_fct lencode_bayes_Dh6Ag
#> 14 13 0.217 pregnant_fct lencode_bayes_Dh6Ag
#> 15 14 0.0188 pregnant_fct lencode_bayes_Dh6Ag
#> 16 15 0.166 pregnant_fct lencode_bayes_Dh6Ag
#> 17 17 0.173 pregnant_fct lencode_bayes_Dh6Ag
#> 18 ..new 0.341 pregnant_fct lencode_bayes_Dh6Ag
#> # A tibble: 4 x 4
#> level value terms id
#> <chr> <dbl> <chr> <chr>
#> 1 high 0.634 pressure_fct lencode_bayes_4bgX7
#> 2 low 0.223 pressure_fct lencode_bayes_4bgX7
#> 3 medium 0.916 pressure_fct lencode_bayes_4bgX7
#> 4 ..new 0.591 pressure_fct lencode_bayes_4bgX7
#> # A tibble: 3 x 4
#> level value terms id
#> <chr> <dbl> <chr> <chr>
#> 1 has 0.624 triceps_fct lencode_bayes_hv7LH
#> 2 none 0.624 triceps_fct lencode_bayes_hv7LH
#> 3 ..new 0.624 triceps_fct lencode_bayes_hv7LH
#> # A tibble: 187 x 5
#> insulin_fct_embed_1 insulin_fct_embed_2 level terms id
#> <dbl> <dbl> <chr> <chr> <chr>
#> 1 -0.0301 -0.0259 ..new insulin_… lencode_bayes_G…
#> 2 0.0146 0.0397 0 insulin_… lencode_bayes_G…
#> 3 -0.00895 -0.0290 14 insulin_… lencode_bayes_G…
#> 4 -0.0280 -0.00999 15 insulin_… lencode_bayes_G…
#> 5 -0.0267 -0.0226 16 insulin_… lencode_bayes_G…
#> 6 -0.0463 0.00294 18 insulin_… lencode_bayes_G…
#> 7 -0.0102 -0.0445 22 insulin_… lencode_bayes_G…
#> 8 0.00978 0.0479 23 insulin_… lencode_bayes_G…
#> 9 0.0182 -0.0329 25 insulin_… lencode_bayes_G…
#> 10 -0.00414 -0.0186 29 insulin_… lencode_bayes_G…
#> # … with 177 more rows
#> # A tibble: 1 x 2
#> terms id
#> <chr> <chr>
#> 1 pedigree umap_exfSj
Created on 2019-08-29 by the reprex package (v0.3.0)
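For anyone reading the first tibble above, the woe column can be reproduced from the printed counts. This is a base R sketch that assumes woe is the log ratio of the two within-class shares; that assumption matches the printed values but is inferred from the output, not taken from the embed source:

```r
# Counts read off the first tibble above (mass_fct): the n_neg and n_pos
# columns, one entry per level.
n_neg <- c(large = 250, small = 250)
n_pos <- c(large = 215, small = 53)

# Share of each outcome class that falls into each level
# (the p_neg and p_pos columns).
p_neg <- n_neg / sum(n_neg)
p_pos <- n_pos / sum(n_pos)

# Assumed definition: log ratio of the two shares.
woe <- log(p_pos / p_neg)
round(woe, 3)
```

The rounded result agrees with the 0.473 and -0.928 shown in the woe column.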
install.packages("embed")
Installing package into ‘/home/roxana/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/embed_0.1.1.tar.gz'
Content type 'application/x-gzip' length 46880 bytes (45 KB)
==================================================
downloaded 45 KB
*** caught segfault ***
address 0x7fc0dd39d008, cause 'invalid permissions'
Traceback:
1: dyn.load(file, DLLpath = DLLpath, ...)
2: library.dynam(lib, package, package.lib)
3: loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]])
4: asNamespace(ns)
5: namespaceImportFrom(ns, loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]), i[[2L]], from = package)
6: loadNamespace(package = package, lib.loc = lib.loc, keep.source = keep.source, keep.parse.data = keep.parse.data, partial = TRUE)
7: withCallingHandlers(expr, packageStartupMessage = function(c) invokeRestart("muffleMessage"))
8: suppressPackageStartupMessages(loadNamespace(package = package, lib.loc = lib.loc, keep.source = keep.source, keep.parse.data = keep.parse.data, partial = TRUE))
9: code2LazyLoadDB(package, lib.loc = lib.loc, keep.source = keep.source, keep.parse.data = keep.parse.data, compress = compress, set.install.dir = set.install.dir)
10: tools:::makeLazyLoading("embed", "/home/roxana/R/x86_64-pc-linux-gnu-library/3.6/00LOCK-embed/00new", keep.source = FALSE, keep.parse.data = FALSE, set.install.dir = "/home/roxana/R/x86_64-pc-linux-gnu-library/3.6/embed")
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault (core dumped)
ERROR: lazy loading failed for package ‘embed’
The downloaded source packages are in
‘/tmp/RtmpgjvCcx/downloaded_packages’
My sessionInfo() details as follows:
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=es_AR.UTF-8 LC_NUMERIC=C LC_TIME=es_AR.UTF-8
[4] LC_COLLATE=es_AR.UTF-8 LC_MONETARY=es_AR.UTF-8 LC_MESSAGES=es_AR.UTF-8
[7] LC_PAPER=es_AR.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.6.3 Matrix_1.2-18 magrittr_1.5 R6_2.4.1 generics_0.0.2
[6] tools_3.6.3 whisker_0.4 base64enc_0.1-3 Rcpp_1.0.5 reticulate_1.16
[11] keras_2.3.0.0 tensorflow_2.2.0 grid_3.6.3 zeallot_0.1.0 jsonlite_1.7.0
[16] tfruns_1.4 lattice_0.20-40
Prepare for release:
devtools::check_win_devel()
rhub::check_for_cran()
email.yml
then revdepcheck::revdep_email_maintainers()
Perform release:
devtools::check_win_devel() (again!)
devtools::submit_cran()
Wait for CRAN...
pkgdown::build_site()
tidymodels general update.
Template from r-lib/usethis#338
It might be nice to have an option to estimate pooled likelihood encodings in lme4, which is faster at the cost of less flexibility.
This is a silly example, but:
library(rstanarm)
library(lme4)
library(microbenchmark)
# random-intercept model per Species; parentheses make the grouping term explicit
fit_stan <- function() stan_lmer(Sepal.Width ~ (1 | Species), data = iris)
fit_lme4 <- function() lmer(Sepal.Width ~ (1 | Species), data = iris)
microbenchmark(
  fit_stan(),
  fit_lme4(),
  times = 5
)
Each stan_lmer fit takes 13 seconds on my computer, but lmer takes only a fifth of a second, which could add up quickly if users are calculating encodings across many resampled datasets.
I'd be happy to make a PR (won't be able to start working on it till next Monday), but wanted to see if you're interested first, and if you have any thoughts on an interface.
According to the rlang news, rlang::check_installed() was added in version 0.4.10, the most recent CRAN version as of filing this issue. This is used in the utility functions lme_coefs() and stan_coefs(). However, embed's DESCRIPTION does not impose any version restrictions on rlang.
I think the DESCRIPTION should be updated accordingly or some alternative for rlang::check_installed() should be used.
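A small base R sketch of the version check involved. The helper pkg_at_least() is hypothetical; the 0.4.10 threshold is the one cited from the rlang news above:

```r
# Hypothetical helper: TRUE when `pkg` is installed at `version` or newer.
# The && short-circuits, so packageVersion() only runs for installed packages.
pkg_at_least <- function(pkg, version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= version
}

# rlang::check_installed() needs rlang 0.4.10 or later.
if (pkg_at_least("rlang", "0.4.10")) {
  message("rlang::check_installed() is available")
} else {
  message("rlang is missing or too old; fall back to requireNamespace()")
}
```

Declaring the minimum version in DESCRIPTION makes a runtime check like this unnecessary, which is why updating DESCRIPTION is the simpler of the two options.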
All embed::step_* functions have an outcome parameter. However, step_woe doesn't accept a vars column selection, expecting the bare name. In fact, the bare name is captured by an enquo:
https://github.com/tmastny/embed/blob/cdabcf28a8c5086237637d08ca86642b6fc2af50/R/woe.R#L127
This is inconsistent with the other step functions. I'm happy to help with a PR, but I don't know which is intended (bare name or vars).
Here is a reproducible example, which fails with the error
#> Can't find column `vars(diabetes)` in `.data`.
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(embed)
#> Registered S3 method overwritten by 'xts':
#> method from
#> as.zoo.xts zoo
library(dplyr)
library(mlbench)
set.seed(124)
data(PimaIndiansDiabetes)
d <- PimaIndiansDiabetes
d <- d %>%
  as_tibble() %>%
  select(diabetes, everything())
# make factor variables
d <- d %>%
  mutate(mass_fct = factor(ifelse(mass > 30, "large", "small"))) %>%
  mutate(pregnant_fct = as.factor(pregnant)) %>%
  mutate(pressure_fct = factor(case_when(
    pressure < 30 ~ "low",
    between(pressure, 30, 50) ~ "medium",
    pressure > 50 ~ "high"
  ))) %>%
  mutate(triceps_fct = factor(ifelse(triceps > 0, "has", "none"))) %>%
  mutate(insulin_fct = factor(insulin)) %>%
  mutate(age_fct = factor(age))
# steps in `embed`
embed_rec <- recipe(diabetes ~ ., d) %>%
  embed::step_woe(mass_fct, outcome = vars(diabetes)) %>%
  embed::step_lencode_bayes(pregnant_fct, outcome = vars(diabetes)) %>%
  embed::step_lencode_glm(pressure_fct, outcome = vars(diabetes)) %>%
  embed::step_lencode_mixed(triceps_fct, outcome = vars(diabetes)) %>%
  embed::step_embed(
    insulin_fct,
    outcome = vars(diabetes),
    options = embed_control(epochs = 1)
  ) %>%
  embed::step_umap(pedigree, outcome = vars(diabetes))
embed_rec_prepped <- prep(embed_rec, d)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Can't find column `vars(diabetes)` in `.data`.
embed_rec <- recipe(diabetes ~ ., d) %>%
embed::step_woe(mass_fct, outcome = diabetes) %>%
embed::step_lencode_bayes(pregnant_fct, outcome = vars(diabetes)) %>%
embed::step_lencode_glm(pressure_fct, outcome = vars(diabetes)) %>%
embed::step_lencode_mixed(triceps_fct, outcome = vars(diabetes)) %>%
embed::step_embed(
insulin_fct,
outcome = vars(diabetes),
options = embed_control(epochs = 1)
) %>%
embed::step_umap(pedigree, outcome = vars(diabetes))
embed_rec_prepped <- prep(embed_rec, d)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: `.key` is deprecated
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> boundary (singular) fit: see ?isSingular
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 5396 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#> setting value
#> version R version 3.6.0 (2019-04-26)
#> os macOS Mojave 10.14.6
#> system x86_64, darwin15.6.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/Chicago
#> date 2019-08-29
#>
#> ─ Packages ──────────────────────────────────────────────────────────────
#> package * version date lib
#> assertthat 0.2.1 2019-03-21 [1]
#> backports 1.1.4 2019-04-10 [1]
#> base64enc 0.1-3 2015-07-28 [1]
#> bayesplot 1.7.0 2019-05-23 [1]
#> boot 1.3-22 2019-04-02 [1]
#> callr 3.3.1 2019-07-18 [1]
#> class 7.3-15 2019-01-01 [1]
#> cli 1.1.0 2019-03-19 [1]
#> codetools 0.2-16 2018-12-24 [1]
#> colorspace 1.4-1 2019-03-18 [1]
#> colourpicker 1.0 2017-09-27 [1]
#> crayon 1.3.4 2017-09-16 [1]
#> crosstalk 1.0.0 2016-12-21 [1]
#> desc 1.2.0 2018-05-01 [1]
#> devtools 2.0.2 2019-04-08 [1]
#> digest 0.6.20 2019-07-04 [1]
#> dplyr * 0.8.3 2019-07-04 [1]
#> DT 0.8 2019-08-07 [1]
#> dygraphs 1.1.1.6 2018-07-11 [1]
#> ellipsis 0.2.0.1 2019-07-02 [1]
#> embed * 0.0.3.9000 2019-08-30 [1]
#> evaluate 0.14 2019-05-28 [1]
#> fs 1.3.1 2019-05-06 [1]
#> generics 0.0.2 2018-11-29 [1]
#> ggplot2 3.2.1 2019-08-10 [1]
#> ggridges 0.5.1 2018-09-27 [1]
#> glue 1.3.1 2019-03-12 [1]
#> gower 0.2.1 2019-05-14 [1]
#> gridExtra 2.3 2017-09-09 [1]
#> gtable 0.3.0 2019-03-25 [1]
#> gtools 3.8.1 2018-06-26 [1]
#> highr 0.8 2019-03-20 [1]
#> htmltools 0.3.6 2017-04-28 [1]
#> htmlwidgets 1.3 2018-09-30 [1]
#> httpuv 1.5.1 2019-04-05 [1]
#> igraph 1.2.4.1 2019-04-22 [1]
#> inline 0.3.15 2018-05-18 [1]
#> ipred 0.9-9 2019-04-28 [1]
#> jsonlite 1.6 2018-12-07 [1]
#> keras 2.2.4.1 2019-04-05 [1]
#> knitr 1.23 2019-05-18 [1]
#> later 0.8.0 2019-02-11 [1]
#> lattice 0.20-38 2018-11-04 [1]
#> lava 1.6.6 2019-08-01 [1]
#> lazyeval 0.2.2 2019-03-15 [1]
#> lifecycle 0.1.0 2019-08-01 [1]
#> lme4 1.1-21 2019-03-05 [1]
#> loo 2.1.0 2019-03-13 [1]
#> lubridate 1.7.4 2018-04-11 [1]
#> magrittr 1.5.0.9000 2019-07-03 [1]
#> markdown 1.1 2019-08-07 [1]
#> MASS 7.3-51.4 2019-03-31 [1]
#> Matrix 1.2-17 2019-03-22 [1]
#> matrixStats 0.54.0 2018-07-23 [1]
#> memoise 1.1.0 2017-04-21 [1]
#> mime 0.7 2019-06-11 [1]
#> miniUI 0.1.1.1 2018-05-18 [1]
#> minqa 1.2.4 2014-10-09 [1]
#> mlbench * 2.1-1 2012-07-10 [1]
#> munsell 0.5.0 2018-06-12 [1]
#> nlme 3.1-139 2019-04-09 [1]
#> nloptr 1.2.1 2018-10-03 [1]
#> nnet 7.3-12 2016-02-02 [1]
#> pillar 1.4.2 2019-06-29 [1]
#> pkgbuild 1.0.5 2019-08-26 [1]
#> pkgconfig 2.0.2 2018-08-16 [1]
#> pkgload 1.0.2 2018-10-29 [1]
#> plyr 1.8.4 2016-06-08 [1]
#> prettyunits 1.0.2 2015-07-13 [1]
#> processx 3.4.1 2019-07-18 [1]
#> prodlim 2018.04.18 2018-04-18 [1]
#> promises 1.0.1 2018-04-13 [1]
#> ps 1.3.0 2018-12-21 [1]
#> purrr 0.3.2 2019-03-15 [1]
#> R6 2.4.0 2019-02-14 [1]
#> Rcpp 1.0.2 2019-07-25 [1]
#> RcppAnnoy 0.0.12 2019-05-12 [1]
#> RcppParallel 4.4.3 2019-05-22 [1]
#> recipes * 0.1.6 2019-07-02 [1]
#> remotes 2.1.0 2019-06-24 [1]
#> reshape2 1.4.3 2017-12-11 [1]
#> reticulate 1.13 2019-07-24 [1]
#> rlang 0.4.0 2019-06-25 [1]
#> rmarkdown 1.13.6 2019-07-09 [1]
#> rpart 4.1-15 2019-04-12 [1]
#> rprojroot 1.3-2 2018-01-03 [1]
#> rsconnect 0.8.15 2019-07-22 [1]
#> RSpectra 0.15-0 2019-06-11 [1]
#> rstan 2.19.2 2019-07-09 [1]
#> rstanarm 2.18.2 2018-11-10 [1]
#> rstantools 1.5.1 2018-08-22 [1]
#> scales 1.0.0 2018-08-09 [1]
#> sessioninfo 1.1.1 2018-11-05 [1]
#> shiny 1.3.2 2019-04-22 [1]
#> shinyjs 1.0 2018-01-08 [1]
#> shinystan 2.5.0 2018-05-01 [1]
#> shinythemes 1.1.2 2018-11-06 [1]
#> StanHeaders 2.18.1-10 2019-06-14 [1]
#> stringi 1.4.3 2019-03-12 [1]
#> stringr 1.4.0 2019-02-10 [1]
#> survival 2.44-1.1 2019-04-01 [1]
#> tensorflow 1.14.0 2019-08-01 [1]
#> testthat 2.1.1 2019-04-23 [1]
#> tfruns 1.4 2018-08-25 [1]
#> threejs 0.3.1 2017-08-13 [1]
#> tibble 2.1.3 2019-06-06 [1]
#> tidyr 0.8.99.9000 2019-08-26 [1]
#> tidyselect 0.2.5 2018-10-11 [1]
#> timeDate 3043.102 2018-02-21 [1]
#> usethis 1.5.0 2019-04-07 [1]
#> uwot 0.1.3 2019-04-07 [1]
#> vctrs 0.2.0 2019-07-05 [1]
#> whisker 0.4 2019-08-28 [1]
#> withr 2.1.2 2018-03-15 [1]
#> xfun 0.9 2019-08-21 [1]
#> xtable 1.8-4 2019-04-21 [1]
#> xts 0.11-2 2018-11-05 [1]
#> yaml 2.2.0 2018-07-25 [1]
#> zeallot 0.1.0 2018-01-28 [1]
#> zoo 1.8-6 2019-05-28 [1]
#> source
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> Github (tidymodels/embed@cdabcf2)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> Github (tidyverse/magrittr@4104d6b)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> Github (rstudio/rmarkdown@7b18786)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> Github (tidyverse/tidyr@a3431e3)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
Created on 2019-08-29 by the reprex package (v0.3.0)
The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.
That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.
The purpose of this issue is to:
message id: euphoric_snowdog
Maybe fine with the changes to recipes though.
Hi,
Thanks a lot for this embedding step! I'm a big fan of this package, and it makes it easy to accomplish a lot with just a few lines of code.
I recently encountered a problem while trying to reload a saved step_umap object to bake on new data. The error message is shown below:
Error in .External(list(name = "CppMethod__invoke_void", address = <pointer: (nil)>, : NULL value passed as symbol address
I searched for this error and found the issue jlmelville/uwot#19, but I wanted to check whether any workaround exists in the embed package implementation.
Thanks!
Prepare for release:
devtools::check()
devtools::check_win_devel()
rhub::check_for_cran()
rhub::check(platform = 'ubuntu-rchk')
rhub::check_with_sanitizers()
revdepcheck::revdep_check(num_workers = 4)
Submit to CRAN:
usethis::use_version()
cran-comments.md
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
Would it be interesting to have an option that lets the user choose how to blend the prior and posterior estimates through a hyperparameter (instead of lme4 or stan)?
Here's a description of the approach: https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf.
I've seen it used on Kaggle a lot. Users usually also add random noise to the coefficient estimates, with the noise level as another hyperparameter.
It should be easy to implement, and I would be happy to do it. It could sit on top of the glm approach (as a last optional step after the glm coefficients have been estimated) or be a separate method.
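A minimal sketch of this kind of blending, assuming the sigmoid weighting that is commonly used for it: each level's estimate is a weighted average of the level mean and the overall (prior) mean, with the weight growing with the level's sample size. The function name blend_encode() and the hyperparameter names k and f are illustrative only; nothing here exists in embed.

```r
# Hypothetical helper, not part of embed.
# x: categorical predictor, y: numeric (or 0/1) outcome.
# k: sample size at which level mean and prior get equal weight.
# f: controls how quickly the weight moves from prior to level mean.
blend_encode <- function(x, y, k = 5, f = 1) {
  prior <- mean(y)
  n <- tapply(y, x, length)
  level_mean <- tapply(y, x, mean)
  # Weight on the level's own mean; approaches 1 for well-populated levels.
  lambda <- 1 / (1 + exp(-(n - k) / f))
  blended <- lambda * level_mean + (1 - lambda) * prior
  setNames(as.numeric(blended), names(n))
}

# A level with 100 rows keeps its own mean; a level with 1 row is
# pulled almost all the way to the overall mean.
x <- rep(c("a", "b"), times = c(100, 1))
y <- c(rep(1, 100), 0)
enc <- blend_encode(x, y)
enc
```

The optional random noise mentioned above would be added to the returned values as a separate step and is left out of this sketch.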
Hi Max, I have a question about the approach in step_tfembed. It looks like it only supports a single hidden layer, and while it may eventually handle multiple predictors at once (#5), only those categoricals will be fed to the TF/keras model as inputs. Is that correct?
This seems quite different from the approach in Guo/Berkhahn and others where the embeddings are learned jointly with the task, considering all predictors (including non-categorical). Is the idea that even much narrower and shallower models should learn similarly good categorical embeddings, or that this approach strikes a nice effort/benefit balance? I'm curious if you could say something about the intuition there or point me towards other examples of this approach. Is this something you've had success with in your own experiments?
using https://github.com/jlmelville/uwot
I'm trying to learn how to do entity embedding for categorical variables but I keep getting this error and I can't figure out why.
rec <- recipe(Case~AnimacyObj+ Participants+ AgencySubj, data=causee)%>%
step_embed(AnimacyObj,Participants, AgencySubj,
outcome=vars(Case),
num_terms=3,
hidden_units=10,
options= embed_control(epochs = 25, validation_split=0.2))%>%
prep()
Error in if (is.na(b)) return(1L) : argument is of length zero
Why is woe different if I sort the outcome column? I get different woe values if in the first row of data the value of outcome column is 1 or 0.
library(dplyr)
mtcars_outcome_1_first <- mtcars
mtcar_outcome_0_first <- mtcars %>% arrange(am)
embed::dictionary(.data = mtcars_outcome_1_first %>% select(cyl, am), outcome = "am")
embed::dictionary(.data = mtcar_outcome_0_first %>% select(cyl, am), outcome = "am")
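For reference, weight of evidence can be computed by hand in base R. The sketch below is not embed's implementation; woe_table() and its event argument are hypothetical. It shows that the values do not depend on row order once the event level is fixed explicitly, and that swapping the event level flips the sign, which may be related to the difference reported above (this is an assumption, not a confirmed diagnosis).

```r
# Hypothetical helper, not part of embed. Assumes a binary outcome y.
woe_table <- function(x, y, event) {
  tab <- table(x, y)
  event_col <- as.character(event)
  other_col <- setdiff(colnames(tab), event_col)
  # Share of each predictor level within the event and non-event groups.
  p_event <- tab[, event_col] / sum(tab[, event_col])
  p_other <- tab[, other_col] / sum(tab[, other_col])
  log(p_event / p_other)
}

woe_1 <- woe_table(mtcars$cyl, mtcars$am, event = 1)
woe_0 <- woe_table(mtcars$cyl, mtcars$am, event = 0)

# Same data sorted by the outcome, as in the report above.
sorted <- mtcars[order(mtcars$am), ]
woe_1_sorted <- woe_table(sorted$cyl, sorted$am, event = 1)

all.equal(woe_1, woe_1_sorted)  # row order does not matter
all.equal(woe_1, -woe_0)        # swapping the event level flips the sign
```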
The step_lencode_mixed uses a default id of rand_id("lencode_bayes"). Should this be changed to rand_id("lencode_mixed")?
Replace with rlang::abort() and rlang::warn().
Prepare for release:
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::revdep_check(num_workers = 4)
cran-comments.md
Submit to CRAN:
usethis::use_version('patch')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
step_pca is very useful, but is slow and memory-intensive when run on more than a few hundred features, even if num_comp is much smaller than p. (In my experience this makes it especially time-intensive to tune the num_comp training parameter, which requires running the SVD preparation step many times.)
As a solution, this step could use the irlba package for truncated SVD, which is much faster and more memory efficient when the number of components is small compared to p.
I could imagine the step either automatically using irlba when num_comp is far smaller than p, or doing so only when the user requests something like truncated = TRUE, but in any case it would be very helpful!
Reproducible example, if we were trying to build a model to identify which Jane Austen book a line of text came from:
library(janeaustenr)
library(recipes)
library(textrecipes)
library(irlba)
# Train a model to match a single line to one of Jane Austen's books
books <- austen_books() %>%
filter(text != "")
rec <- recipe(book ~ text, books) %>%
step_tokenize(text) %>%
step_tokenfilter(text, max_tokens = 300) %>%
step_tfidf(text)
# This is slow (~40s for me), and uses so much memory that it's hard to terminate
rec %>%
step_pca(starts_with("tfidf"), num_comp = 5) %>%
prep() %>%
juice()
# But this is fast (~3.5s)
rec %>%
prep() %>%
juice() %>%
select(-book) %>%
as.matrix() %>%
irlba(nv = 5)
Hi
I noticed that all the glm based steps have a method for handling novel values: "For novel levels, a slightly timmed average of the coefficients is returned."
Would it make sense to add the same handling method to step_woe?
How do the step_lencode_* functions handle multiple variables?
Do they construct a single big model afterwards, or do they handle each variable separately?
Hi! Thanks for developing such a good (and needed) recipe extension!
I was looking into the documentation for a while and could not find a parameter to set the TensorFlow seed to get reproducible results. Each time I rerun the transformation I get a new seed; I'd rather fix this behaviour, at least when transforming the training data set.
Keep it up!
Cris
We are systematically re-licensing tidymodels packages to use the MIT license, to make our package licenses as clear and permissive as possible. To do so, we need the approval of all copyright holders, which I have found by reviewing contributions from all non-RStudio contributors.
@Athospd, @klahrich, @konradsemsch, would you permit us to re-license embed with the MIT license? If so, please comment "I agree" below.
Seems like there is a bug for step_umap() when trying to save a prepped recipe as .rds and reading it back to apply it to new data.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(tidyverse)
library(embed)
split <- seq.int(1, 150, by = 9)
tr <- iris[-split, ]
te <- iris[ split, ]
set.seed(11)
supervised <-
recipe(Species ~ ., data = tr) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors()) %>%
step_umap(all_predictors(), outcome = vars(Species), num_comp = 2) %>%
prep(training = tr)
write_rds(supervised, here::here(tempdir(), "umap.rds"))
saved_rec <- read_rds(here::here(tempdir(), "umap.rds"))
saved_rec %>% bake(new_data = te)
#> Error in .External(structure(list(name = "CppMethod__invoke_notvoid", : NULL value passed as symbol address
Created on 2021-08-02 by the reprex package (v2.0.0)
I'm sure this is not us (i.e. not the embed package) but I wonder if there is anything we can do about this.
The recipe is fine if you don't save as .rds and then read it back.
Love embed. It would be super awesome if there was a step_kmeans() or step_cluster() that added cluster assignments to a data frame.
Cluster assignments are super important for segmentation. K-Means and similar algorithms (e.g. K-modes) can help us to identify customer groups.
Embed is a good spot for this. step_umap() is a similar algorithm that I often use in combination with K-Means.
Let me know what you think.
Thanks, Matt
For prediction problems with K classes, it seems like a reasonable generalization would be to create K - 1 new predictor columns of class probabilities.
In the unpooled case, nnet::multinom would be an option at the cost of another dependency. I haven't actually played around with keras yet, but might be able to get a dependency-free softmax that way. Some small amount of regularization may be necessary, if I recall correctly.
In the partially pooled case, there's family = "categorical" in brms, or potentially K-1 binary fits from stan_glmer or glmer. In the latter case it'd probably be best to use K-1 binary fits for the unpooled case as well, for consistency.
I haven't used this personally, so I would love to hear from someone in the know whether this would actually be useful.
Prepare for release:
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::revdep_check(num_workers = 4)
cran-comments.md
Submit to CRAN:
usethis::use_version('patch')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
Hi, could we enable this recipes 0.1.15 enhancement in embed? Thanks!
The full tidyselect DSL is now allowed inside recipes step_*() functions. This includes the operators &, |, - and ! and the new where() function. Additionally, the restriction preventing user defined selectors from being used has been lifted (#572).
Prepare for release:
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::revdep_check(num_workers = 4)
cran-comments.md
Submit to CRAN:
usethis::use_version('patch')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
Could you add an example of how you would use embed when you want to create/get embeddings from multiple categorical variables? My assumption is you would have multiple embedding layers, but train the model at once. Is that possible with embed? Can you provide an example of that? I hope this makes sense. (This would be like Fig. 1 in the Guo & Berkhahn paper)
I have been running through the code on this embed:Tensorflow page: https://tidymodels.github.io/embed/articles/Applications/Tensorflow.html
I installed tensorflow/keras in an environment r-tensorflow using conda (could not get install_tensorflow to work due to HTTPS issues).
I have found that the code runs successfully if I load libraries and define environments in the following order (prior to running the code on the embed:Tensorflow page)
library(embed)
# Note that need to load tensorflow after embed
reticulate::use_python("C:/Users/duartem/.conda/envs/r-tensorflow/python.exe", required = TRUE)
library(tensorflow)
Sys.setenv(TENSORFLOW_PYTHON="C:/Users/duartem/.conda/envs/r-tensorflow/python")
reticulate::py_config()
If I instead load embed after tensorflow I get the following error:
Error in UseMethod("compile") : no applicable method for 'compile' applied to an object of class "c('tensorflow.python.keras.engine.training.Model', 'tensorflow.python.keras.engine.network.Network', 'tensorflow.python.keras.engine.base_layer.Layer', 'tensorflow.python.training.checkpointable.base.CheckpointableBase', 'python.builtin.object')"
This error occurs in the following step on the embed:Tensorflow page
tf_embed <-
recipe(Sale_Price ~ ., data = ames) %>%
step_log(Sale_Price, base = 10) %>%
# Add some other predictors that can be used by the network. We
# preprocess them first
step_YeoJohnson(Lot_Area, Full_Bath, Gr_Liv_Area) %>%
step_range(Lot_Area, Full_Bath, Gr_Liv_Area) %>%
step_embed(
Neighborhood,
outcome = vars(Sale_Price),
predictors = vars(Lot_Area, Full_Bath, Gr_Liv_Area),
num_terms = 5,
hidden_units = 10,
options = embed_control(epochs = 75, validation_split = 0.2)
) %>%
prep(training = ames)
Session Info:
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] AmesHousing_0.0.3 yardstick_0.0.3 tibble_2.1.2 rsample_0.0.4
[5] tidyr_0.8.3 purrr_0.3.2 parsnip_0.0.2 infer_0.4.0.1
[9] ggplot2_3.1.1 dials_0.0.2 scales_1.0.0 broom_0.5.2
[13] tidymodels_0.0.2 embed_0.0.2 recipes_0.1.5.9000 dplyr_0.8.1
[17] tensorflow_1.13.1
loaded via a namespace (and not attached):
[1] minqa_1.2.4 colorspace_1.4-1 class_7.3-15
[4] ggridges_0.5.1 rsconnect_0.8.13 markdown_0.9
[7] base64enc_0.1-3 tidytext_0.2.0 rstudioapi_0.10
[10] rstan_2.18.2 SnowballC_0.6.0 DT_0.6
[13] prodlim_2018.04.18 lubridate_1.7.4 codetools_0.2-16
[16] splines_3.6.0 knitr_1.23 shinythemes_1.1.2
[19] zeallot_0.1.0 bayesplot_1.7.0 jsonlite_1.6
[22] nloptr_1.2.1 pROC_1.14.0 packrat_0.5.0
[25] tfruns_1.4 shiny_1.3.2 compiler_3.6.0
[28] backports_1.1.4 assertthat_0.2.1 Matrix_1.2-17
[31] lazyeval_0.2.2 cli_1.1.0 later_0.8.0
[34] htmltools_0.3.6 prettyunits_1.0.2 tools_3.6.0
[37] igraph_1.2.4.1 gtable_0.3.0 glue_1.3.1
[40] reshape2_1.4.3 Rcpp_1.0.1 nlme_3.1-139
[43] crosstalk_1.0.0 timeDate_3043.102 gower_0.2.1
[46] xfun_0.7 stringr_1.4.0 ps_1.3.0
[49] lme4_1.1-21 mime_0.6 miniUI_0.1.1.1
[52] gtools_3.8.1 tidypredict_0.3.0 MASS_7.3-51.4
[55] zoo_1.8-6 ipred_0.9-9 rstanarm_2.18.2
[58] colourpicker_1.0 promises_1.0.1 parallel_3.6.0
[61] inline_0.3.15 shinystan_2.5.0 tidyposterior_0.0.2
[64] reticulate_1.12 gridExtra_2.3 loo_2.1.0
[67] StanHeaders_2.18.1 keras_2.2.4.1 rpart_4.1-15
[70] stringi_1.4.3 tokenizers_0.2.1 dygraphs_1.1.1.6
[73] boot_1.3-22 pkgbuild_1.0.3 lava_1.6.5
[76] rlang_0.3.4 pkgconfig_2.0.2 matrixStats_0.54.0
[79] lattice_0.20-38 labeling_0.3 rstantools_1.5.1
[82] htmlwidgets_1.3 processx_3.3.1 tidyselect_0.2.5
[85] plyr_1.8.4 magrittr_1.5 R6_2.4.0
[88] generics_0.0.2 pillar_1.4.1 whisker_0.3-2
[91] withr_2.1.2 xts_0.11-2 survival_2.44-1.1
[94] nnet_7.3-12 janeaustenr_0.1.5 crayon_1.3.4
[97] grid_3.6.0 callr_3.2.0 threejs_0.3.1
[100] digest_0.6.19 xtable_1.8-4 httpuv_1.5.1
[103] stats4_3.6.0 munsell_0.5.0 shinyjs_1.0
Hello,
I found a broken link to http://jse.amstat.org/v23n2/kim.pdf (Kim and Escobedo-Land (2015)(pdf)) in the page https://tidymodels.github.io/embed/articles/Applications/GLM.html .
Could you check when you update the site?
Prepare for release:
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::revdep_check(num_workers = 4)
cran-comments.md
Submit to CRAN:
usethis::use_version('patch')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
A fix is needed because of the tensorflow library.
Using
embed_0.1.0
tensorflow_2.2.0
reticulate_1.16
tidyverse_1.3.0
prep.encoding <- prep(recipe.encoding, training = training.set, retain = TRUE)
throws Error:
Error in if (is.na(b)) return(1L) : argument is of length zero
from this line:
compareVersion("2.0", as.character(tensorflow::tf_version()))
because tensorflow::tf_version() returns NULL.
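The failure can be reproduced without TensorFlow, since as.character(NULL) is a zero-length vector and compareVersion() tests it with is.na() inside an if(). A guarded comparison might look like the sketch below. The helper tf_version_at_least() is hypothetical and not part of embed; its version argument stands in for whatever tensorflow::tf_version() returns.

```r
# Hypothetical helper, not part of embed or tensorflow.
# Returns NA when no version is available instead of raising an error.
tf_version_at_least <- function(version, minimum = "2.0") {
  if (is.null(version) || length(version) == 0) {
    return(NA)
  }
  # compareVersion(a, b) returns -1, 0 or 1 for a < b, a == b, a > b.
  utils::compareVersion(minimum, as.character(version)) <= 0
}

tf_version_at_least(NULL)                      # no TensorFlow found
tf_version_at_least(package_version("2.2.0"))  # newer than the minimum
tf_version_at_least("1.14.0")                  # older than the minimum
```

Callers would still need to decide what to do with NA, for example stopping with a message that TensorFlow could not be found.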
Not in recipes anymore:
Running examples in ‘embed-Ex.R’ failed
The error most likely occurred in:
> ### Name: step_embed
> ### Title: Encoding Factors into Multiple Columns
> ### Aliases: step_embed tidy.step_embed embed_control
> ### Keywords: datagen
>
> ### ** Examples
>
> data(okc)
Warning in data(okc) : data set ‘okc’ not found
>
> rec <- recipe(Class ~ age + location, data = okc) %>%
+ step_embed(location, outcome = vars(Class),
+ options = embed_control(epochs = 10))
Error in is_tibble(data) : object 'okc' not found
Calls: %>% ... eval -> recipe -> recipe.formula -> form2args -> is_tibble
Execution halted
```
* checking tests ...
```
ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
>
> test_check(package = "embed")
── 1. Error: (unknown) (@test_woe.R#9) ────────────────────────────────────────
object 'credit_data' not found
Backtrace:
1. base::sample(1:nrow(credit_data), 2000)
2. base::nrow(credit_data)
══ testthat results ═══════════════════════════════════════════════════════════
[ OK: 112 | SKIPPED: 10 | WARNINGS: 1 | FAILED: 1 ]
1. Error: (unknown) (@test_woe.R#9)
Error: testthat unit tests failed
Execution halted
```