Release agua 0.1.0

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

dials activation values

Since h2o uses different values for activation functions, we can take the values that are consistent with current tidymodels engines (e.g. "tanh") and translate them inside of h2o_train_mlp() to be what h2o expects (e.g. "Tanh").

Also, we could expand what dials has as possible values to include others in h2o that tidymodels does not currently have (or just fail if the value is not in our current list).
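
A minimal sketch of the translation idea, in base R only. The helper name translate_activation() and the lookup vector are hypothetical and not part of agua; only the "tanh" to "Tanh" pair comes from the text above, and the "relu" entry is an assumed example that should be checked against h2o's documentation. It takes the "fail if the value is not in our current list" route.

```r
# Hypothetical lookup from tidymodels-style names to the names h2o expects.
# Only "tanh" -> "Tanh" is taken from the issue; "relu" is an assumed example.
h2o_activations <- c(tanh = "Tanh", relu = "Rectifier")

translate_activation <- function(activation) {
  # Fail early for anything that is not a single, known value.
  if (length(activation) != 1 || !activation %in% names(h2o_activations)) {
    stop(
      "`activation` must be one of: ",
      paste0('"', names(h2o_activations), '"', collapse = ", "),
      call. = FALSE
    )
  }
  unname(h2o_activations[activation])
}

translate_activation("tanh")
#> [1] "Tanh"
```

Keeping the mapping in one named vector means that expanding the supported values later is a one-line change.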

breaking change in upcoming tune release

In the tune release following 1.2.1, tune's .catch_and_log(split) argument will be renamed to .catch_and_log(split_labels), and will take the format labels(split) rather than split. agua just passes that argument once here:

agua/R/tune.R

Lines 105 to 111 in 6a742f6

    
    workflow <- tune::.catch_and_log(
      .expr = workflows::.fit_pre(workflow, training_frame),
      control,
      split,
      iter_msg_preprocessor,
      notes = out_notes
    )

...and can pass its value conditional on tune's package version.

Related to tidymodels/tune#909.

The noted release is probably at least a couple months out, so this can be ignored for now.
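
One way the version check could look, as a sketch. The helper catch_and_log_split_arg() is hypothetical; the 1.2.1 cutoff and the two argument names are taken from the description above. It only decides which argument name applies and does not call tune itself.

```r
# Hypothetical helper: pick the argument name that tune::.catch_and_log()
# would expect for a given tune version. The cutoff (1.2.1) comes from the
# issue text; development versions such as 1.2.1.9000 count as "later".
catch_and_log_split_arg <- function(tune_version) {
  if (as.package_version(tune_version) > "1.2.1") "split_labels" else "split"
}

catch_and_log_split_arg("1.2.1")
catch_and_log_split_arg("1.2.1.9000")
```

In agua, the input would presumably be utils::packageVersion("tune"), with labels(split) passed when the result is "split_labels" and split passed otherwise.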

setup parsnip engine docs

For all models, not just those with engines in parsnip, we create engine-specific documentation in parsnip/man/rmd. Details are here. We should add docs for the h2o engines.

use h2o::with_no_h2o_progress

In the very latest h2o version, they have this function documented ~~exported~~. This should stop the progress bars and some other output.

edit: @ledell @tomasfryda Was the function supposed to be exported (since it is documented)?

use `parallelism` in `h2o.grid()`

We need to find a way to specify parallelism in h2o.grid() and allow parallel model building. One possible solution is using control_grid(parallel_over) and have a condition for that here.

@topepo

allow other preprocessors

This line restricts h2o engines from being tuned unless there is a recipe. There are two other types of preprocessors so we should generalize this. There's probably code in tune to do this already.

use pkgdown

Once the repo is public, let's use usethis::use_pkgdown(). I already made a CNAME entry so we should be able to use agua.tidymodels.org.

Release agua 0.1.0

First release:

usethis::use_cran_comments()
Update (aspirational) install instructions in README
Proofread Title: and Description:
Check that all exported functions have @return and @examples
Check that Authors@R: includes a copyright holder (role 'cph')
Check licensing of included files
Review https://github.com/DavisVaughan/extrachecks

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

Unused argument error while tuning

The problem

I cannot seem to tune with h2o; I keep getting an "unused argument" error. I was using an example that used keras (which works fine) and I just switched the engine from keras to h2o and thought it should also work. But it didn't.

To track it down, I decided to run the code from https://agua.tidymodels.org/articles/tune.html

which is given below:

copied R code

library(tidymodels)
library(agua)
library(ggplot2)
theme_set(theme_bw())
doParallel::registerDoParallel()
h2o_start()
data(ames)

set.seed(4595)
data_split <- ames %>%
  mutate(Sale_Price = log10(Sale_Price)) %>%
  initial_split(strata = Sale_Price)
ames_train <- training(data_split)
ames_test <- testing(data_split)
cv_splits <- vfold_cv(ames_train, v = 10, strata = Sale_Price)

ames_rec <-
  recipe(Sale_Price ~ Gr_Liv_Area + Longitude + Latitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_ns(Longitude, deg_free = tune("long df")) %>%
  step_ns(Latitude, deg_free = tune("lat df"))

lm_mod <- linear_reg(penalty = tune()) %>%
  set_engine("h2o")

lm_wflow <- workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(ames_rec)

grid <- lm_wflow %>%
  extract_parameter_set_dials() %>%
  grid_regular(levels = 5)

ames_res <- tune_grid(
  lm_wflow,
  resamples = cv_splits,
  grid = grid,
  control = control_grid(save_pred = TRUE,
    backend_options = agua_backend_options(parallelism = 5))
)

ames_res

The output is :

Tuning results

10-fold cross-validation using stratification

There were issues with some computations:

Error(s) x10: Error in fn(...): unused arguments (metrics_info = list(c("rmse", "rsq"), c("minimiz...

Run show_notes(.Last.tune.result) for more information.

Follow up on the suggestion

If I run show_notes …, this is the output:

unique notes:
───────────────────────────────────────────────────────────────────────────────────────────────────────
Error in fn(...): unused arguments (metrics_info = list(c("rmse", "rsq"), c("minimize", "maximize"), c("numeric", "numeric")), list(c("penalty", "deg_free", "deg_free"), c("penalty", "long df", "lat df"), c("model_spec", "recipe", "recipe"), c("linear_reg", "step_ns", "step_ns"), c("main", "ns_PCP7q", "ns_33KwK"), list(list("double", list(-10, 0), c(TRUE, TRUE), list("log-10", function (x)
log(x, base), function (x)
base^x, function (x, n = n_default)
{
raw_rng <- suppressWarnings(range(x, na.rm = TRUE))
if (any(!is.finite(raw_rng))) {
return(numeric())
}
rng <- log(raw_rng, base = base)
min <- floor(rng[1])
max <- ceiling(rng[2])
if (max == min) {
return(base^min)
}
by <- floor((max - min)/n) + 1
breaks <- base^seq(min, max, by = by)
relevant_breaks <- base^rng[1] <= breaks & breaks <= base^rng[2]
if (sum(relevant_breaks) >= (n - 2)) {
return(breaks)
}
while (by > 1) {
by <- by - 1
breaks <- base^seq(min, max, by = by)

Error for `h2o_start()` without java installed

When I run h2o_start() without things installed/configured correctly, I do get

> h2o_start()
The operation couldn’t be completed. Unable to locate a Java Runtime.
Please visit http://www.java.com for information on installing Java.

but then it just hangs there. It would be nice if that threw an error instead.
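
A rough sketch of a pre-flight check that could run before starting the server. Both helpers are hypothetical and not part of agua or h2o. The sketch assumes that a working runtime exits with status 0 for java -version and that a placeholder launcher without a runtime behind it (which is what the message above suggests) exits with a non-zero status.

```r
# Hypothetical helper: TRUE only if `java -version` runs and exits cleanly.
java_available <- function(java = Sys.which("java")) {
  java <- unname(java)
  if (!nzchar(java)) {
    return(FALSE)
  }
  status <- tryCatch(
    suppressWarnings(system2(java, "-version", stdout = FALSE, stderr = FALSE)),
    error = function(e) 1L
  )
  identical(as.integer(status), 0L)
}

# Hypothetical wrapper: raise an error instead of letting startup hang.
check_java <- function(java = Sys.which("java")) {
  if (!java_available(java)) {
    stop("No working Java runtime was found; h2o needs one to start.", call. = FALSE)
  }
  invisible(TRUE)
}
```

Calling check_java() at the top of a startup function would turn the silent hang into an immediate, catchable error.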

[New Functionality]: Add explainability/interpretability functions from h2o.

Hi,

Thanks for bringing h2o capabilities to tidymodels!

h2o already includes various functions to help with model interpretation/explainability for binary classification and regression models:

h2o.shap_summary_plot()
h2o.shap_explain_row_plot()
h2o.pd_multi_plot()
h2o.pd_plot()
h2o.ice_plot()

These functions can also be applied to an h2o.automl() object.

All the available h2o functionality is documented here

Thanks!
Carlos.

enable GitHub actions

Once the repo is public, let's set up usethis::use_tidy_github_actions().

auto_ml() model type

We'd need to add a model definition to parsnip (with a default engine of h2o) and add the rest in agua.

Not sure what the main arguments should be (max number of models?).

Release agua 0.1.4

Prepare for release:

Submit to CRAN:

usethis::use_version('patch')
devtools::submit_cran()
Approve email

Wait for CRAN...

Accepted 🎉
usethis::use_github_release()
usethis::use_dev_version(push = TRUE)

case weights

The h2o functions take weights in the argument weights_column that is described as "Column with observation weights".

remove tidy.workflows

We have a method in workflows

on.exit() for h2o tuning module

At the top of the iteration function we should run h2o.no_progress() and then use an on.exit() to:

run h2o.show_progress()
run a function that removes the model IDs that were created.
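
The pattern above could be sketched as follows. The wrapper with_quiet_h2o() and its argument names are hypothetical; h2o.no_progress(), h2o.show_progress() and h2o.rm() are existing h2o functions used here as lazily evaluated defaults, so the wrapper can also be exercised with stand-ins when no h2o server is available. It assumes fit_models() returns the IDs of the models it created.

```r
# Hypothetical wrapper around one tuning iteration.
with_quiet_h2o <- function(fit_models,
                           hide_progress = h2o::h2o.no_progress,
                           show_progress = h2o::h2o.show_progress,
                           remove_ids = h2o::h2o.rm) {
  hide_progress()
  # Registered first, so progress bars are restored even if fitting fails.
  on.exit(show_progress(), add = TRUE)

  model_ids <- character()
  # Evaluated at exit, so it sees whatever IDs were recorded by then.
  on.exit(if (length(model_ids) > 0) remove_ids(model_ids), add = TRUE)

  model_ids <- fit_models()
  invisible(model_ids)
}
```

One limitation of this sketch: if fit_models() fails partway through, the IDs of models created before the failure are never recorded, so they would not be removed.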

external parallel processing

h2o parallelizes internally by multithreading the training for an individual model.

We could also use R's external parallelization (via foreach or futures) to send more models to the h2o server at the same time.

We could also use both approaches.

Right now, when using multicore, it just works. For PSOCK clusters, it does not. It produces the error that it cannot find the h2o server.

Can we create a helper that will set up PSOCK clusters so that we can use them? We would need to experiment on what the worker processes are missing. It might be as simple as loading the h2o package in each.

todos after h2o August release

A reminder of todos after h2o's CRAN release (3.36.1.2)

change tune functions to use strategy = 'Sequential', also remove this line from tune
update relevant parts in the vignette discussing parallel processing with h2o.grid
add tuning benchmark
use new progress functions
display threshold for classification models (if available)
discuss if explainability functions in #31 should be added

validation set for xgboost and map

Mirror the api for xgboost where we have a validation arg (default = 0) that splits off some data in the wrapper to supply as a validation frame.
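
A minimal base-R sketch of the splitting step such a wrapper might perform. The helper split_validation() is hypothetical, and treating validation as a proportion in [0, 1) is an assumption about how the argument would behave. How the resulting validation rows are handed to h2o as a validation frame is left out.

```r
# Hypothetical helper: hold out a random proportion of rows for validation.
split_validation <- function(data, validation = 0) {
  stopifnot(is.numeric(validation), length(validation) == 1,
            validation >= 0, validation < 1)
  n_val <- floor(nrow(data) * validation)
  # With nothing to hold out, keep all rows for training.
  if (n_val == 0) {
    return(list(train = data, validation = NULL))
  }
  val_idx <- sample.int(nrow(data), n_val)
  list(
    train = data[-val_idx, , drop = FALSE],
    validation = data[val_idx, , drop = FALSE]
  )
}

set.seed(1)
parts <- split_validation(mtcars, validation = 0.25)
nrow(parts$train)
nrow(parts$validation)
```

The early return for n_val == 0 matters: negative indexing with an empty index vector would otherwise drop every row from the training set.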

Cannot run tuning example

I am unable to run the code from the model tuning vignette here. When doing so, I get the following error when running tune_grid:

Error in get(x, envir = ns, inherits = FALSE) :
  object 'tune_grid_loop_iter_h2o' not found
7. get(x, envir = ns, inherits = FALSE)
6. utils::getFromNamespace(x = "tune_grid_loop_iter_h2o", ns = "agua")
5. fn_tune_grid_loop(resamples, grid, workflow, metrics, control, rng)
4. tune_grid_loop(resamples = resamples, grid = grid, workflow = workflow, metrics = metrics, control = control, rng = rng)
3. tune_grid_workflow(object, resamples = resamples, grid = grid, metrics = metrics, pset = param_info, control = control)
2. tune_grid.workflow(lm_wflow, resamples = cv_splits, grid = grid, control = control_grid(save_pred = TRUE))
1. tune_grid(lm_wflow, resamples = cv_splits, grid = grid, control = control_grid(save_pred = TRUE))

Any thoughts?

Interaction terms are ignored

The training wrapper functions (e.g., h2o_train_glm) did not receive possible interaction terms.

library(agua)
#> Loading required package: parsnip
h2o_start()

linear_mod <- linear_reg(penalty = 0.1) |> 
  set_engine("h2o") %>% 
  fit(mpg ~ wt * cyl, data = mtcars)

linear_mod$fit@parameters$x
#> [1] "wt"  "cyl"

Created on 2022-06-22 by the reprex package (v2.0.1)
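
For comparison, base R's own formula expansion shows the column that goes missing in the reprex above. This is only an illustration of what the formula implies, not a description of how agua builds its predictors.

```r
# Expand the same formula with base R to see the terms it implies.
expanded_terms <- colnames(model.matrix(mpg ~ wt * cyl, data = mtcars))
expanded_terms
#> [1] "(Intercept)" "wt"          "cyl"         "wt:cyl"
```

The "wt:cyl" interaction column has no counterpart in linear_mod$fit@parameters$x, which lists only "wt" and "cyl".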

Internal functions used in tune_grid_loop_iter_h2o

Internal functions used in tune_grid_loop_iter_h2o that may need to be exported or carried to agua:

setup for parallel processing

~~tune:::load_namespace~~

finalize and fit workflows when looping parameters

~~tune:::catch_and_log~~
tune:::forge_from_workflow
~~workflows:::.fit_pre~~

formatting functions for predictions

~~parsnip~~

compute metrics

~~tune::outcome_names~~
~~tune:::estimate_metrics~~

Use of set_dependency() should use `mode` argument

Original bug found in tidymodels/censored#269.

It isn't that much work for us to specify it and it stops us from being bitten later down the line.

Error segfault

Getting the following error when fitting a drf model:
Warning: stack imbalance in 'as.environment', 249 then 246
*** caught segfault ***
*** caught segfault ***
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'
*** caught segfault ***
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'
address 0x7fcfd18a4e7a, cause 'invalid permissions'
*** caught segfault ***
address 0x64209498, cause 'memory not mapped'

