tidymodels / agua
Create and evaluate models using 'tidymodels' and 'h2o'
Home Page: https://agua.tidymodels.org
License: Other
Prepare for release:
git pull
devtools::build_readme()
urlchecker::url_check()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::cloud_check()
Update cran-comments.md
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
git push
usethis::use_github_release()
usethis::use_dev_version()
usethis::use_news_md()
git push
Since h2o uses different values for activation functions, we can convert the values passed to h2o_train_mlp() to be what h2o expects (e.g. "Tanh").
In the tune release following 1.2.1, tune's .catch_and_log(split) argument will be renamed to .catch_and_log(split_labels), and will take the format labels(split) rather than split. agua just passes that argument once here:
Lines 105 to 111 in 6a742f6
...and can pass its value conditional on tune's package version.
Related to tidymodels/tune#909.
The noted release is probably at least a couple months out, so this can be ignored for now.
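A minimal sketch of what "conditional on tune's package version" could look like. The helper name split_arg_for_tune() is hypothetical and not part of agua or tune, and the cutoff assumes the rename lands in the first version after 1.2.1, which also treats development versions such as 1.2.1.9000 as already renamed.

```r
# Hypothetical helper (not part of agua or tune): build the split-related
# argument for tune's .catch_and_log() depending on the installed tune version.
split_arg_for_tune <- function(split,
                               tune_version = utils::packageVersion("tune")) {
  # Accept either a character string or a package_version object.
  tune_version <- as.package_version(tune_version)

  if (tune_version > "1.2.1") {
    # Assumed newer interface: pass the labels of the split.
    list(split_labels = labels(split))
  } else {
    # Interface up to and including 1.2.1: pass the split itself.
    list(split = split)
  }
}
```

The returned list could then be spliced into the call with do.call() or rlang's !!! operator, so the call site only names the argument that the installed tune understands.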
For all models, not just those with engines in parsnip, we create engine-specific documentation in parsnip/man/rmd. Details are here. We should add docs for the h2o engines.
In the very latest h2o version, they have this function documented exported. This should stop the progress bars and some other output.
edit: @ledell @tomasfryda Was the function supposed to be exported (since it is documented)?
This line restricts h2o engines from being tuned unless there is a recipe. There are two other types of preprocessors so we should generalize this. There's probably code in tune to do this already.
Once the repo is public, let's use usethis::use_pkgdown(). I already made a CNAME entry so we should be able to use agua.tidymodels.org.
First release:
usethis::use_cran_comments()
Proofread Title: and Description:
Check that all exported functions have @return and @examples
Check that Authors@R: includes a copyright holder (role 'cph')
Prepare for release:
git pull
devtools::build_readme()
urlchecker::url_check()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
git push
usethis::use_github_release()
usethis::use_dev_version()
usethis::use_news_md()
git push
I cannot seem to tune with h2o; I keep getting an "unused argument" error. I was using an example that used keras (which works fine), switched the engine from keras to h2o, and expected it to work as well, but it didn't.
To track it down, I ran the code from https://agua.tidymodels.org/articles/tune.html, which is given below:
library(tidymodels)
library(agua)
library(ggplot2)
theme_set(theme_bw())
doParallel::registerDoParallel()
h2o_start()
data(ames)
set.seed(4595)
data_split <- ames %>%
  mutate(Sale_Price = log10(Sale_Price)) %>%
  initial_split(strata = Sale_Price)
ames_train <- training(data_split)
ames_test <- testing(data_split)
cv_splits <- vfold_cv(ames_train, v = 10, strata = Sale_Price)
ames_rec <-
  recipe(Sale_Price ~ Gr_Liv_Area + Longitude + Latitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_ns(Longitude, deg_free = tune("long df")) %>%
  step_ns(Latitude, deg_free = tune("lat df"))
lm_mod <- linear_reg(penalty = tune()) %>%
  set_engine("h2o")
lm_wflow <- workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(ames_rec)
grid <- lm_wflow %>%
  extract_parameter_set_dials() %>%
  grid_regular(levels = 5)
ames_res <- tune_grid(
  lm_wflow,
  resamples = cv_splits,
  grid = grid,
  control = control_grid(
    save_pred = TRUE,
    backend_options = agua_backend_options(parallelism = 5)
  )
)
ames_res
There were issues with some computations:
Run show_notes(.Last.tune.result) for more information.
If I run show_notes …, this is the output:
unique notes:
────────────────────────────────────────────────────────────
Error in fn(...): unused arguments (metrics_info = list(c("rmse", "rsq"), c("minimize", "maximize"), c("numeric", "numeric")), list(c("penalty", "deg_free", "deg_free"), c("penalty", "long df", "lat df"), c("model_spec", "recipe", "recipe"), c("linear_reg", "step_ns", "step_ns"), c("main", "ns_PCP7q", "ns_33KwK"), list(list("double", list(-10, 0), c(TRUE, TRUE), list("log-10", function (x)
log(x, base), function (x)
base^x, function (x, n = n_default)
{
raw_rng <- suppressWarnings(range(x, na.rm = TRUE))
if (any(!is.finite(raw_rng))) {
return(numeric())
}
rng <- log(raw_rng, base = base)
min <- floor(rng[1])
max <- ceiling(rng[2])
if (max == min) {
return(base^min)
}
by <- floor((max - min)/n) + 1
breaks <- base^seq(min, max, by = by)
relevant_breaks <- base^rng[1] <= breaks & breaks <= base^rng[2]
if (sum(relevant_breaks) >= (n - 2)) {
return(breaks)
}
while (by > 1) {
by <- by - 1
breaks <- base^seq(min, max, by = by)
When I run h2o_start() without things installed/configured correctly, I do get
> h2o_start()
The operation couldn’t be completed. Unable to locate a Java Runtime.
Please visit http://www.java.com for information on installing Java.
but then it just hangs there. It would be nice if that threw an error instead.
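One way to fail fast would be to probe for a working Java runtime before starting the server. This is only a sketch: java_available() and start_h2o_or_stop() are hypothetical names, not agua functions, and the check assumes that `java -version` exits with status 0 only when a usable runtime is present (checking that a `java` file exists on the PATH is not enough, since a placeholder executable can produce exactly the message above).

```r
# Hypothetical helper: TRUE only if `java -version` runs and exits with 0.
java_available <- function(java = "java") {
  status <- tryCatch(
    suppressWarnings(
      system2(java, "-version", stdout = FALSE, stderr = FALSE)
    ),
    error = function(e) 1L
  )
  identical(as.integer(status), 0L)
}

# Hypothetical wrapper: raise an error instead of hanging when Java is missing.
start_h2o_or_stop <- function() {
  if (!java_available()) {
    stop("No working Java runtime found; h2o needs Java to start.",
         call. = FALSE)
  }
  agua::h2o_start()
}
```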
Hi,
Thanks for bringing h2o capabilities to tidymodels!
h2o already includes various functions to help with model interpretation/explainability for binary classification and regression models.
These functions can also be applied to an h2o.automl() object.
All the available h2o functionality is documented here
Thanks!
Carlos.
Once the repo is public, let's set up usethis::use_tidy_github_actions().
We'd need to add a model definition to parsnip (with a default engine of h2o) and add the rest in agua.
Not sure what the main arguments should be (max number of models?).
Prepare for release:
git pull
urlchecker::url_check()
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
revdepcheck::cloud_check()
Update cran-comments.md
git push
Submit to CRAN:
usethis::use_version('patch')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version(push = TRUE)
The h2o functions take weights in the argument weights_column that is described as "Column with observation weights".
We have a method in workflows
At the top of the iteration function we should run h2o.no_progress() and then use an on.exit() to call h2o.show_progress().
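The pattern described above can be sketched as a small wrapper. The name with_h2o_progress_off() is hypothetical; the quiet and restore arguments default to h2o's own h2o.no_progress() and h2o.show_progress(), and are parameters only so the pattern can be tried without an h2o installation.

```r
# Hypothetical wrapper: silence h2o progress bars while `code` runs and
# restore them afterwards, even if `code` throws an error.
with_h2o_progress_off <- function(code,
                                  quiet = h2o::h2o.no_progress,
                                  restore = h2o::h2o.show_progress) {
  quiet()
  # Registered before `code` is evaluated, so it also runs on error.
  on.exit(restore(), add = TRUE)
  code
}
```

Inside the iteration function the same two lines (the quiet call followed by on.exit()) could be placed directly at the top instead of going through a wrapper.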
h2o parallelizes internally by multithreading the training for an individual model.
We could also use R's external parallelization (via foreach or futures) to send more models to the h2o server at the same time.
We could also use both approaches.
Right now, when using multicore, it just works. For PSOCK clusters, it does not. It produces the error that it cannot find the h2o server.
Can we create a helper that will set up PSOCK clusters so that we can use them? We would need to experiment on what the worker processes are missing. It might be as simple as loading the h2o package in each.
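A rough sketch of such a helper, under the untested assumption that attaching packages in each worker is all that is missing. Both function names are hypothetical; a real fix might also need each worker to reconnect to the running h2o server.

```r
# Hypothetical per-worker step: attach the given packages and report success.
attach_packages <- function(pkgs) {
  vapply(pkgs, require, logical(1), character.only = TRUE, quietly = TRUE)
}

# Hypothetical helper: run the attach step on every worker of a cluster
# created with parallel::makePSOCKcluster().
prepare_psock_workers <- function(cl, pkgs = c("h2o")) {
  parallel::clusterCall(cl, attach_packages, pkgs)
}
```

Usage would look like cl <- parallel::makePSOCKcluster(2), then prepare_psock_workers(cl) before registering the cluster as the parallel backend. The result is one logical vector per worker, so a FALSE entry shows which worker could not load a package.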
A reminder of todos after h2o's CRAN release (3.36.1.2)
change tune functions to use strategy = 'Sequential', also remove this line from tune
update relevant parts in the vignette discussing parallel processing with h2o.grid
add tuning benchmark
use new progress functions
display threshold for classification models (if available)
discuss if explainability functions in #31 should be added
Mirror the API for xgboost where we have a validation arg (default = 0) that splits off some data in the wrapper to supply as a validation frame.
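To make the idea concrete, here is a minimal base R sketch of the splitting step. split_validation() is a hypothetical helper, not an existing agua or parsnip function, and the simple random split is an assumption. Converting the two pieces to h2o frames is left out.

```r
# Hypothetical helper: hold out a proportion of rows as a validation set.
# validation = 0 (the default) keeps all rows for training.
split_validation <- function(data, validation = 0) {
  stopifnot(validation >= 0, validation < 1)

  n_val <- floor(nrow(data) * validation)
  if (n_val == 0L) {
    return(list(train = data, validation = NULL))
  }

  # Random row indices for the validation set.
  idx <- sample.int(nrow(data), n_val)
  list(
    train = data[-idx, , drop = FALSE],
    validation = data[idx, , drop = FALSE]
  )
}

parts <- split_validation(mtcars, validation = 0.25)
```

With the 32 rows of mtcars and validation = 0.25, this holds out 8 rows and keeps 24 for training.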
I am unable to run the code from the model tuning vignette here. When doing so, I get the following error when running tune_grid:
Error in get(x, envir = ns, inherits = FALSE) :
object 'tune_grid_loop_iter_h2o' not found
7. get(x, envir = ns, inherits = FALSE)
6. utils::getFromNamespace(x = "tune_grid_loop_iter_h2o", ns = "agua")
5. fn_tune_grid_loop(resamples, grid, workflow, metrics, control,
   rng)
4. tune_grid_loop(resamples = resamples, grid = grid, workflow = workflow,
   metrics = metrics, control = control, rng = rng)
3. tune_grid_workflow(object, resamples = resamples, grid = grid,
   metrics = metrics, pset = param_info, control = control)
2. tune_grid.workflow(lm_wflow, resamples = cv_splits, grid = grid,
   control = control_grid(save_pred = TRUE))
1. tune_grid(lm_wflow, resamples = cv_splits, grid = grid, control = control_grid(save_pred = TRUE))
Any thoughts?
The training wrapper functions (e.g., h2o_train_glm) did not receive possible interaction terms.
library(agua)
#> Loading required package: parsnip
h2o_start()
linear_mod <- linear_reg(penalty = 0.1) |>
  set_engine("h2o") %>%
  fit(mpg ~ wt * cyl, data = mtcars)
linear_mod$fit@parameters$x
#> [1] "wt" "cyl"
Created on 2022-06-22 by the reprex package (v2.0.1)
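For comparison, base R shows which terms the formula in the reprex expands to. This sketch uses only the stats functions terms() and model.matrix(), with no h2o involved, to show the interaction column that is missing from the predictor list above.

```r
f <- mpg ~ wt * cyl

# Term labels implied by the formula: wt, cyl, and the interaction wt:cyl.
term_labels <- attr(terms(f), "term.labels")

# The design matrix has a wt:cyl column in addition to the main effects
# and the intercept.
design <- model.matrix(f, data = mtcars)
design_columns <- colnames(design)
```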
Internal functions used in tune_grid_loop_iter_h2o that may need to be exported or carried to agua:
setup for parallel processing
tune:::load_namespace
finalize and fit workflows when looping parameters
tune:::catch_and_log
tune:::forge_from_workflow
workflows:::.fit_pre
formatting functions for predictions
compute metrics
tune::outcome_names
tune:::estimate_metrics
Original bug found in tidymodels/censored#269.
It isn't that much work for us to specify it and it stops us from being bitten later down the line.
Getting the following error when fitting a drf model:
Warning: stack imbalance in 'as.environment', 249 then 246
*** caught segfault ***
*** caught segfault ***
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'
*** caught segfault ***
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'
address 0x7fcfd18a4e7a, cause 'invalid permissions'
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'