a-hanf / mlr3automl
Automated machine learning in mlr3
License: GNU Lesser General Public License v3.0
Dear Alex,
Thanks for your hard work on this amazing package! I watched your presentation in useR! session 3A and followed your examples on my own data. However, I'm having trouble analysing the results using DALEX and other related packages (ArenaR, Triplot, ModelStudio, rSAFE). I wonder if you could provide example code in your vignette that shows how to:
Thanks again for your amazing work,
John
python runbenchmark.py mlr3automl openml/t/59 -f 0
...
Error in makeActiveBinding(name, active[[name]], public_bind_env) :
symbol already has a regular binding
Calls: run ... assert_r6 -> checkR6 -> -> makeActiveBinding
Thanks for the great package. I planned to write an AutoML package for finance (investing) using mlr3, but it seems you have already made a great package (better than I would be able to do, for sure :)).
I have tried the package on my dataset with only one learner and Inf runtime. Here is the simple code:
bmr_results = AutoML(my_task, learner_list = c("classif.ranger"), runtime = Inf)
bmr_results$train()
I don't understand how I can inspect the results of the model after training.
I can see the following methods and attributes:
names(bmr_results)
[1] ".__enclos_env__" "custom_trafo" "additional_params" "portfolio" "tuner" "runtime" "tuning_terminator"
[8] "measure" "resampling" "preprocessing" "learner" "learner_timeout" "learner_list" "task"
[15] "clone" "initialize" "tuned_params" "resample" "predict" "train"
I can't see an aggregate method.
I have also tried to use the resample method instead of train, but I got the same result.
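For what it's worth, here is a hedged guess based only on the names() output above; none of these members are documented here, so the exact signatures are assumptions:
bmr_results$train()
bmr_results$tuned_params()    # assumption: returns the tuned hyperparameter values
bmr_results$learner           # assumption: the underlying tuned learner/pipeline
bmr_results$predict(my_task)  # assumption: predictions for a task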
Additionally, I would like to know if it is possible to use feature selection steps in the preprocessing?
Line 191 in e157226: when learner_timeout is not NULL, the condition evaluates to TRUE; when learner_timeout is NULL, it evaluates to NA.
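The exact expression at line 191 isn't reproduced here; as a generic illustration of why a condition on a possibly-NULL field needs an explicit guard (a sketch, not the package's actual code):
learner_timeout = NULL

# a bare comparison against a NULL (or NA) value does not give a clean
# TRUE/FALSE, which is what breaks if ()
learner_timeout > 0

# defensive variants that always yield TRUE or FALSE
isTRUE(learner_timeout > 0)
!is.null(learner_timeout) && learner_timeout > 0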
When I add additional learners, I very often get the following error:
Error in gunion(x) :
Assertion on 'ids of pipe operators' failed: Must have unique names, but element 7 is duplicated.
Sample code:
library(mlr3)
library(mlr3automl)
library(paradox)

# define additional hyperparameters for the extra learners
new_params = ParamSet$new(list(
  ParamInt$new("classif.kknn.k", lower = 1, upper = 5, default = 3),
  ParamDbl$new("classif.glmnet.alpha", lower = 0, upper = 1),
  ParamInt$new("classif.nnet.size", lower = 1, upper = 10),
  ParamDbl$new("classif.nnet.decay", lower = 0, upper = 0.5)
  # ParamInt$new("classif.bart.ntree", lower = 500, upper = 1000),
  # ParamDbl$new("classif.C50.CF", lower = 0, upper = 1),
  # ParamInt$new("classif.C50.trials", lower = 1, upper = 40)
))

# transformation: k is tuned on a log2 scale
my_trafo = function(x, param_set) {
  if ("classif.kknn.k" %in% names(x)) {
    x[["classif.kknn.k"]] = 2^x[["classif.kknn.k"]]
  }
  return(x)
}

task_ <- tsk("iris")
bmr_results = AutoML(task_,
                     learner_list = c("classif.ranger", "classif.xgboost", "classif.liblinear",
                                      "classif.kknn", "classif.glmnet", "classif.nnet"),
                     additional_params = new_params,
                     custom_trafo = my_trafo,
                     runtime = Inf)
If I remove the last learner (classif.nnet), I don't get the errors. But if I use some other learner from the mlr3extralearners package, I get the same error.
Currently, all the packages that mlr3automl depends on are in the Imports section of the DESCRIPTION file. I wonder if it would be helpful to move mlr3 to Depends, so you don't need to explicitly load it when running the examples?
> library(mlr3automl)
> iris_task <- tsk('iris')
Error in tsk("iris") : could not find function "tsk"
You have to do this:
library(mlr3)
library(mlr3automl)
iris_task <- tsk('iris')
model <- AutoML(iris_task)
model$train()
This is very unintuitive behaviour.
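A sketch of what moving mlr3 into Depends could look like in the DESCRIPTION file (the field contents here are illustrative, not the package's actual dependency list):
Depends:
    R (>= 3.5.0),
    mlr3
Imports:
    mlr3pipelines,
    mlr3tuning,
    paradox
Packages listed under Depends are attached together with mlr3automl, so tsk() would be available after a single library(mlr3automl) call.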
self$tuner = tnr("hyperband", eta = 3L) needs to be set to self$tuner = tnr("hyperband", eta = 3L, repetitions = Inf).
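For context, a minimal sketch of the proposed setting; my understanding is that repetitions = Inf lets the hyperband schedule restart indefinitely, so tuning only stops once the surrounding terminator fires:
library(mlr3tuning)
library(mlr3hyperband)

tuner = tnr("hyperband", eta = 3L, repetitions = Inf)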
Hello,
Thanks for this fantastic package. One strange thing I've noticed is that I don't seem to be able to activate the preprocessing stage. I can tell for a couple of reasons (I think)! I have some numeric data that contains NAs and that I want to do regression on. If I run my data as-is using AutoML(task)
with no options set then I get an error:
Error in check_prediction_data.PredictionDataRegr(pdata) :
Assertion on 'pdata$response' failed: Contains missing values (element 1).
OK, so if I remove the NAs first, then after running predict on the full data set (minus the NAs) I get something like:
Very good predictions, except that they lie at an angle to the x-y line, which I think is a result of a lack of scaling and centering, because if I do this manually first I get:
Obviously the plot looks different now, as mlr3 doesn't know about the scaling, but the predictions lie nicely around the x-y line.
So it seems the default is to do no preprocessing (it's not quite clear from the help pages). But when I set the option AutoML(task, preprocessing = "full"), I get no difference in the outcome with the original data or with the manually scaled data. Plus, if I leave in the NAs, I still get the error:
Error in check_prediction_data.PredictionDataRegr(pdata) :
Assertion on 'pdata$response' failed: Contains missing values (element 1).
The help pages suggest NAs can be handled, as they mention imputation, but I still get the error. And, as I mentioned above, the predictions on data after removing the NAs look the same as when not setting the preprocessing option. Am I missing something?
EDIT: but setting preprocessing = po("scale") does work:
so it seems like it's the "full", "stability" and "none" options that aren't being respected. Or I'm being stupid!
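For reference, the workaround from the EDIT as a complete call (a sketch; task stands for the regression task described above):
library(mlr3automl)
library(mlr3pipelines)

# passing a PipeOp directly, which (per the EDIT above) does get applied
model = AutoML(task, preprocessing = po("scale"))
model$train()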
Hi!
Is there currently an option to use mlr3automl for data where time is important?
Most importantly, it should use a resampling that respects time. However, as far as I know, the mlr3temporal package does not yet provide these options.
Is there a way to do a temporal train-test split with mlr3automl?
Thank You!
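One possible stopgap outside of mlr3automl itself: build an order-respecting holdout by hand with mlr3's custom resampling (a sketch; the task and the 80/20 split point are illustrative):
library(mlr3)

task = tsk("boston_housing")   # stand-in for a time-ordered task
n = task$nrow
split = floor(0.8 * n)

resampling = rsmp("custom")
resampling$instantiate(
  task,
  train_sets = list(seq_len(split)),    # earlier observations only
  test_sets  = list(seq(split + 1, n))  # later observations only
)
Whether an instantiated resampling like this can be passed through AutoML's resampling argument is something I haven't verified.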
I think you might be missing a command to install mlr3extralearners in your list of installation commands:
devtools::install_github('https://github.com/mlr-org/mlr3@master')
devtools::install_github('https://github.com/mlr-org/mlr3tuning@autotuner-notimeout')
devtools::install_github('https://github.com/a-hanf/mlr3automl@development')
Here's the error I got:
> devtools::install_github('https://github.com/a-hanf/mlr3automl@development')
Downloading GitHub repo a-hanf/mlr3automl@master
Skipping 2 packages not available: mlr3extralearners, glmnet
✓ checking for file ‘/private/var/folders/gj/cm0k4b_s42j30zs376cq_5hh0000gn/T/Rtmp6Nhtam/remotesa773542f4f5/a-hanf-mlr3automl-eed029b/DESCRIPTION’ ...
─ preparing ‘mlr3automl’:
✓ checking DESCRIPTION meta-information ...
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
─ building ‘mlr3automl_0.0.0.9000.tar.gz’
Installing package into ‘/Users/me/Library/R/3.5/library’
(as ‘lib’ is unspecified)
ERROR: dependency ‘mlr3extralearners’ is not available for package ‘mlr3automl’
* removing ‘/Users/me/Library/R/3.5/library/mlr3automl’
Error: Failed to install 'mlr3automl' from GitHub:
(converted from warning) installation of package ‘/var/folders/gj/cm0k4b_s42j30zs376cq_5hh0000gn/T//Rtmp6Nhtam/filea7745b7e50b/mlr3automl_0.0.0.9000.tar.gz’ had non-zero exit status
However, after installing mlr3extralearners, the install of mlr3automl worked:
devtools::install_github('https://github.com/mlr-org/mlr3extralearners@master')
It seems to me that it is not possible to use two or more tasks in AutoML?
If that's true, I would like to make a feature request.
This is usually possible when using mlr3's benchmark, where it is possible to define multiple tasks, learners, etc.
If I use two tasks in AutoML:
library(mlr3automl)
library(mlr3verse)
task_1 <- tsk("iris")
task_2 <- tsk("iris")
bmr_results = AutoML(list(task_1, task_2))
bmr_results = AutoML(c(task_1, task_2))
it returns an error:
Error in if (task$task_type == "classif") { : argument is of length zero
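Until multi-task support exists, a per-task loop seems like the obvious workaround (a hedged sketch; the second task is just a placeholder):
library(mlr3automl)
library(mlr3verse)

tasks = list(tsk("iris"), tsk("sonar"))
models = lapply(tasks, function(t) {
  m = AutoML(t)
  m$train()
  m
})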
BTW, is there any way I can contribute to this package and help with development? Maybe adding new learners; there are many of them in the mlr3 extension packages?
Here is code where I would have expected the aggregated results of the two identical benchmarks at the end to be identical, but they are not. Since I am only an intermediate-level coder in R, perhaps there is something wrong with my code. In any event, I pass this along for your consideration as a possible issue in mlr3automl. As you can imagine, this code takes a while to execute, ~10 minutes on my iMac Pro.
#############################################################
# Cross-validating the regression learners
#############################################################
library("doFuture")
library("doRNG")
library("future")
library("future.apply")
library("mlr3verse")
library("mlr3automl")
library("mlr3hyperband")
# set logger thresholds
lgr::get_logger("mlr3")$set_threshold("error")
lgr::get_logger("bbotk")$set_threshold("error")
# specify regression learners
learners = list(
  lrn("regr.featureless", id = "fl"),
  lrn("regr.lm", id = "lm"),
  lrn("regr.cv_glmnet", id = "glm"),
  lrn("regr.ranger", id = "rf"),
  lrn("regr.xgboost", id = "xgb"),
  lrn("regr.svm", id = "svm")
)
learner_ids = sapply(learners, function(x) x$id)
# define regression task
task = tsk("boston_housing")
# select small subset of features
task$select(c("age", "crim", "lat", "lon"))
# specify resampling
resampling = rsmp("cv")
# specify measure
measure = msr("regr.mse")
# autotuners for models with hyperparameters
learners[[3]] = create_autotuner(
learner = lrn("regr.cv_glmnet"),
tuner = tnr("hyperband")
)
learners[[4]] = create_autotuner(
learner = lrn("regr.ranger"),
tuner = tnr("hyperband"),
num_effective_vars = length(
task$feature_names
)
)
learners[[5]] = create_autotuner(
learner = lrn("regr.xgboost"),
tuner = tnr("hyperband")
)
learners[[6]] = create_autotuner(
learner = lrn("regr.svm"),
tuner = tnr("hyperband")
)
# create benchmark grid
design = benchmark_grid(
tasks = task,
learners = learners,
resamplings = resampling
)
# start parallel processing
registerDoFuture()
plan(multisession, workers = availableCores() - 1)
registerDoRNG(123456)
# execute benchmark
bmr1 = mlr3::benchmark(design)
# terminate parallel processing
plan(sequential)
# start parallel processing
registerDoFuture()
plan(multisession, workers = availableCores() - 1)
registerDoRNG(123456)
# execute benchmark
bmr2 = mlr3::benchmark(design)
# terminate parallel processing
plan(sequential)
# test for reproducibility
bmr1$aggregate()$regr.mse == bmr2$aggregate()$regr.mse
Here are a couple of interesting clues. If I run this code several times, the end result is the same each time (i.e., the same mix of TRUE and FALSE results for the different stochastic learners). But if I run this code in R and then run the same code in RStudio, I get a different mix of TRUE and FALSE results depending on the platform. Finally, if I run this code substituting a different dataset, then I get a different mix of TRUE and FALSE results at the end.
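One thing that might be worth trying (unverified for mlr3automl): as far as I understand, mlr3 parallelizes resampling iterations via future.apply with parallel-safe RNG streams, so calling set.seed() immediately before each benchmark() call, instead of relying on registerDoRNG(), may make the two runs comparable.
# hedged variation on the code above: seed right before each benchmark() run
plan(multisession, workers = availableCores() - 1)
set.seed(123456)
bmr1 = mlr3::benchmark(design)
plan(sequential)

plan(multisession, workers = availableCores() - 1)
set.seed(123456)
bmr2 = mlr3::benchmark(design)
plan(sequential)

bmr1$aggregate()$regr.mse == bmr2$aggregate()$regr.mse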
Using current versions of mlr3pipelines breaks mlr3automl due to this change: have a look at robustify ppl #489
Either AutoMLTuner specializes mlr3tuning::AutoTuner and AutoMLTuner$new(lrn("classif.svm")) creates the desired AutoTuner, or create_auto_tuner(lrn("classif.svm")) creates an mlr3tuning::AutoTuner object.
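The function-style variant resembles what shows up elsewhere in these issues as create_autotuner(); a sketch of how that call might look (the tuner choice is illustrative):
library(mlr3automl)
library(mlr3learners)
library(mlr3tuning)
library(mlr3hyperband)

at = create_autotuner(
  learner = lrn("classif.svm"),
  tuner = tnr("hyperband")
)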
I got a warning about not having the emoa package installed; not sure if this is a real issue or not. Just posting here for reference, in case not having it installed changes/hurts the results (if it's required for hyperband?).
> library(mlr3automl)
> iris_task <- tsk('iris')
> model <- AutoML(iris_task)
numeric_cols all_cols
no_encoding 4 4
one_hot_encoding 4 4
impact_encoding 4 4
Warning messages:
1: Package 'emoa' required but not installed for Tuner '<TunerHyperband>'
2: Package 'emoa' required but not installed for Optimizer '<OptimizerChain>'
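emoa is a regular CRAN package, so if the warning turns out to matter for hyperband, installing it should make it go away:
install.packages("emoa")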
There is an error when installing the new version of the package:
> devtools::install_github('a-hanf/mlr3automl')
Downloading GitHub repo a-hanf/mlr3automl@HEAD
√ checking for file 'C:\Users\Mislav\AppData\Local\Temp\Rtmp86dDna\remotes3efceab31af\a-hanf-mlr3automl-0a0d8c7/DESCRIPTION'
- preparing 'mlr3automl':
√ checking DESCRIPTION meta-information ...
- checking for LF line-endings in source and make files and shell scripts
- checking for empty or unneeded directories
- building 'mlr3automl_0.0.0.9000.tar.gz'
Installing package into ‘C:/Users/Mislav/Documents/R/win-library/4.1’
(as ‘lib’ is unspecified)
* installing *source* package 'mlr3automl' ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
converting help for package 'mlr3automl'
finding HTML links ... done
AutoML html
finding level-2 HTML links ... done
AutoMLBase html
Error: C:/Users/Mislav/AppData/Local/Temp/RtmpOowfbs/R.INSTALL6c4c644b4c74/mlr3automl/man/AutoMLBase.Rd:271: Bad \link text
* removing 'C:/Users/Mislav/Documents/R/win-library/4.1/mlr3automl'
* restoring previous 'C:/Users/Mislav/Documents/R/win-library/4.1/mlr3automl'
Warning message:
In i.p(...) :
installation of package ‘C:/Users/Mislav/AppData/Local/Temp/Rtmp86dDna/file3efc390535c3/mlr3automl_0.0.0.9000.tar.gz’ had non-zero exit status
Something broke:
model = AutoML(tsk("iris"), runtime=60)
Warning message:
The fallback learner 'response' and the base learner 'prob' have different predict types
model$train()
Error in makeActiveBinding(name, active[[name]], public_bind_env) :
symbol already has a regular binding
Called from: makeActiveBinding(name, active[[name]], public_bind_env)
It works without the runtime parameter.