Comments (4)
Thanks for your detailed report, and sorry for not getting back to you sooner; somehow I did not get a notification.
If I understand it correctly, you are raising two separate points in your issue here.
- It looks like your data set contains missing values in the response. The mlr3 regression learners do not handle this, and I would consider this expected behavior for supervised learning algorithms. What would you expect to happen here? We should at least give a more explicit warning (see the sketch at the end of this comment).
- Regarding the scaling of variables: you mentioned no differences in outcome based on scaling. Depending on how you were scaling the features, this is normal. mlr3automl has a few different algorithms for regression: ranger and xgboost are tree-based methods and indifferent to scaling. The regression SVM learner is from the e1071 package and will scale inputs itself. The other regression algorithms perform regularization, so scaling might have an effect there.
If you can provide some more details on the data set, which learners were selected and how scaling changed things, I can have a look to see if there's anything going wrong there. I'll update the docs to make the default preprocessing behavior more explicit.
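For reference, here is a minimal sketch of the first point (toy data made up by me, and the exact error message may vary between mlr3 versions):

```r
library(mlr3)
library(mlr3learners)  # provides regr.ranger

# Toy regression data with one missing value in the response
d <- data.frame(x = 1:10,
                y = c(1.8, 4.1, 6.2, 7.9, 10.1, 12.2, 13.8, 16.1, 18.0, NA))
task <- as_task_regr(d, target = "y")

# Training fails here: mlr3 regression learners do not handle
# missing values in the target
lrn("regr.ranger")$train(task)
```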
Hello and no worries.
- Yes, I realised that in the end and have raised an issue there; they are considering introducing this functionality, and it is in a commit now.
- My data set is purely numeric and I've generally been playing around with `ranger::ranger` and `kernlab::ksvm` (and `e1071::svm`). IIRC ranger gave very similar results to running `AutoML` with default settings. I appreciate this might be me thinking in circles, but it seems to me that `AutoML(task, preprocessing = "full")` and `AutoML(task, preprocessing = po("scale"))` ought to give identical results - but they don't. Yet `AutoML(task, preprocessing = "full")` and `AutoML(task)` do, and so do `AutoML(task, preprocessing = po("scale"))` and `AutoML(task)` if in the latter case I manually scale first. I could be completely wrong in my expectations, though!
Hope that's clearer; if not, let me know and I'll try to make a reprex.
> it seems to me that `AutoML(task, preprocessing = "full")` and `AutoML(task, preprocessing = po("scale"))` ought to give identical results - but they don't
The "full" preprocessing option extends the pipeline_robustify
function from mlr3pipelines
by adding tunable preprocessing options for imputation methods (if your data has missing values) and PCA. The pipeline po(scale)
only has the scaling operator, so these will be very different pipelines. Not sure why you expect them to return identical results, can you elaborate? Supplying a Graph object for the preprocessing does not extend the existing pipeline, but it replaces it. Maybe that was not clear from the docs?
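To illustrate the three variants (a minimal sketch, assuming the `AutoML()` interface as used in this thread; `tsk("mtcars")` is just a stand-in for your task):

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3automl)

task <- tsk("mtcars")

# Default preprocessing ("stability")
model_default <- AutoML(task)

# "full": extends pipeline_robustify() with tunable imputation and PCA
model_full <- AutoML(task, preprocessing = "full")

# A Graph REPLACES the built-in pipeline entirely;
# here scaling is the only preprocessing step
model_scale <- AutoML(task, preprocessing = po("scale"))

model_full$train()  # tune and train as usual
```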
> Yet `AutoML(task, preprocessing = "full")` and `AutoML(task)` do
This depends on your dataset. The differences between "full" and the default "stability" preprocessing are:
- different methods for imputation of missing data (not sure if this is relevant for you)
- different methods for encoding categorical covariates (irrelevant for your case)
- PCA for dimensionality reduction
Some more background: empirically, we saw that the "full" preprocessing pipeline performed slightly worse on our benchmark datasets (mostly because PCA hurt performance on some of the included data sets). Since all the above options are subject to hyperparameter tuning, you might find the same pipeline in the end with both options.
> and so do `AutoML(task, preprocessing = po("scale"))` and `AutoML(task)` if in the latter case I manually scale first.
If you include the scaling in the pipeline, you will apply the same scaling factors from the training sets to the test set (which would not happen in the manually scaled scenario).
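To make that concrete, a rough sketch of the two setups (using `regr.ranger` only because it came up; the point is where the scaling factors are estimated, not the learner):

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

task <- tsk("mtcars")
cv3 <- rsmp("cv", folds = 3)

# Scaling inside the pipeline: centering/scaling factors are estimated
# on each training fold and re-applied, unchanged, to the held-out fold
graph_learner <- as_learner(po("scale") %>>% lrn("regr.ranger"))
rr_pipeline <- resample(task, graph_learner, cv3)

# Manual scaling up front: factors are computed once on the whole data
# set, so held-out observations influence the scaling of the training data
d <- as.data.frame(task$data())
features <- setdiff(names(d), "mpg")
d[features] <- scale(d[features])
rr_manual <- resample(as_task_regr(d, target = "mpg"), lrn("regr.ranger"), cv3)
```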
I am a bit surprised that scaling makes a difference here, as `ranger` should be indifferent to scaling and both the SVMs you mention perform scaling internally.
Ok, so I've gone back over my old rmds from the time, and I think this is partly me being stupid (I was experimenting with tidymodels, h2o, and mlr3 (automl) all at the same time) and partly the vignette for automl maybe being a bit too concise at the moment, meaning I got my wires a bit crossed as a result. Now that I've read your explanation and the actual `?AutoML` doc rather than only the vignette, it's much clearer.
Just to expand: with mlr3 there's a commit where you can keep NAs in the target during imputation (imputing only the non-target variables, then dropping the target NAs afterwards). These samples can help the imputation of non-target NAs: if you have variables A:E (with E the target) and a sample with E missing, you don't necessarily want to drop that sample before imputation, because it can still be useful for imputing missing A:D values in other samples. So you keep the sample for imputation and drop it afterwards.
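In other words (a minimal base-R sketch of that order of operations, with made-up toy data):

```r
# Toy data: features A:D, target E; row 2 has a missing target
d <- data.frame(
  A = c(1, NA, 3, 4),
  B = c(2, 5, NA, 1),
  C = c(7, 8, 9, 10),
  D = c(NA, 2, 3, 4),
  E = c(10, NA, 30, 40)
)

# 1. Impute the features using ALL rows: row 2 still contributes its
#    observed B, C and D values to the column means used for imputation
features <- setdiff(names(d), "E")
d[features] <- lapply(d[features], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})

# 2. Only then drop rows with a missing target, before training
d_train <- d[!is.na(d$E), ]
```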
Anyway, I accidentally used the data set where I had left the target NAs in with AutoML and obviously did not read the error message properly, leading to me totally misinterpreting the preprocessing options. I probably shouldn't have been experimenting with so many packages at once, as I clearly had too many things going through my head to think this through properly. Sorry about that, but talking this through has helped a lot!
I have now run the "full" option on a data set with the target NAs removed (but the non-target NAs left in) and it worked beautifully. Thank you.