
Preprocessing not working? (mlr3automl, closed)

a-hanf commented on May 26, 2024
Preprocessing not working?


Comments (4)

a-hanf commented on May 26, 2024

Thanks for your detailed report, and sorry for not getting back to you sooner; somehow I did not get a notification.
If I understand it correctly, you are raising two separate points in your issue here.

  1. It looks like your data set contains missing values in the response. The mlr3 regression learners do not handle this, and I would consider this expected behavior for supervised learning algorithms (a minimal workaround is sketched after this list). What would you expect to happen here? We should at least give a more explicit warning.

  2. Regarding the scaling of variables: you mentioned seeing no difference in outcome based on scaling. Depending on how you were scaling the features, this is normal. mlr3automl has a few different algorithms for regression: ranger and xgboost are tree-based methods and are indifferent to scaling. The regression SVM learner is from the e1071 package and will scale inputs itself. The other regression algorithms perform regularization, so scaling might have an effect there.
    If you can provide some more details on the data set, which learners were selected and how scaling changed things, I can have a look to see if there's anything going wrong there. I'll update the docs to make the default preprocessing behavior more explicit.
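
A minimal sketch of the workaround for point 1 (an editor's illustration, not from the thread; the data frame `df` and target name "y" are placeholders):

```r
library(mlr3)
library(mlr3automl)

# Drop rows with a missing response before building the regression task;
# mlr3 regression learners require a complete target.
df_clean <- df[!is.na(df$y), ]
task <- TaskRegr$new("example", backend = df_clean, target = "y")
model <- AutoML(task)  # mlr3automl constructor, as used elsewhere in this thread
model$train()
```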

from mlr3automl.

py9mrg commented on May 26, 2024

Hello and no worries.

  1. Yes, I realised that in the end, and I raised an issue there asking them to consider introducing this functionality. It is in a commit now.
  2. My data set is purely numeric, and I've generally been playing around with ranger::ranger and kernlab::ksvm (and e1071::svm). IIRC ranger gave very similar results to running autoML with default settings. I appreciate this might be me thinking in circles, but it seems to me that autoML(task, preprocessing = "full") and autoML(task, preprocessing = po("scale")) ought to give identical results, but they don't. Yet autoML(task, preprocessing = "full") and autoML(task) do, and so do autoML(task, preprocessing = po("scale")) and autoML(task) if in the latter case I manually scale first (the exact calls are sketched below). I could be completely wrong in my expectations, though!
    Hope that's clearer; if not, let me know and I'll try to make a reprex.
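
For reference, the three calls being compared look like this (a sketch assuming an existing regression task `task`; `po()` comes from mlr3pipelines):

```r
library(mlr3pipelines)  # provides po()

m_full  <- AutoML(task, preprocessing = "full")       # built-in "full" pipeline
m_scale <- AutoML(task, preprocessing = po("scale"))  # user-supplied Graph: scaling only
m_def   <- AutoML(task)                               # package default preprocessing
```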


a-hanf commented on May 26, 2024

> it seems to me that autoML(task, preprocessing = "full") and autoML(task, preprocessing = po("scale")) ought to give identical results - but they don't

The "full" preprocessing option extends the pipeline_robustify function from mlr3pipelines by adding tunable preprocessing options for imputation methods (if your data has missing values) and PCA. The pipeline po(scale) only has the scaling operator, so these will be very different pipelines. Not sure why you expect them to return identical results, can you elaborate? Supplying a Graph object for the preprocessing does not extend the existing pipeline, but it replaces it. Maybe that was not clear from the docs?

> Yet autoML(task, preprocessing = "full") and autoML(task) do

This depends on your dataset. The differences between "full" and the default "stability" preprocessing are (the corresponding pipeline operators are sketched after this list):

  • different methods for imputation of missing data (not sure if this is relevant for you)
  • different methods for encoding categorical covariates (irrelevant for your case)
  • PCA for dimensionality reduction
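
For orientation, these bullets map onto standard mlr3pipelines operators; a hedged sketch of the kind of PipeOps involved:

```r
library(mlr3pipelines)

po("imputemedian")  # one of several imputation methods for missing data
po("encode")        # encoding of categorical covariates
po("pca")           # PCA for dimensionality reduction
```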

Some more background: empirically, we saw that the "full" preprocessing pipeline performed slightly worse on our benchmark datasets (mostly because PCA hurt performance on some of the included data sets). Since all of the above are subject to hyperparameter tuning, you might end up with the same pipeline either way.

> and so do autoML(task, preprocessing = po("scale")) and autoML(task) if in the latter case I manually scale first.

If you include the scaling in the pipeline, the scaling factors estimated on each training set are applied to the corresponding test set (which would not happen in the manually scaled scenario).
I am a bit surprised that scaling makes a difference here, as ranger should be indifferent to scaling and both the SVMs you mention perform scaling internally.
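
To make that distinction concrete, a minimal sketch (assumptions: ranger via mlr3learners; `df` is a placeholder, purely numeric data frame):

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3learners)

# Scaling inside the pipeline: means/sds are estimated on each training split
# and then applied unchanged to the corresponding test split during resampling.
graph_learner <- as_learner(po("scale") %>>% lrn("regr.ranger"))

# Manual up-front scaling: means/sds are computed on the whole data set,
# so the test rows influence the scaling parameters.
df_scaled <- as.data.frame(scale(df))
```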


py9mrg commented on May 26, 2024

OK, so I've gone back over my old Rmds from the time, and I think this is partly me being stupid (I was experimenting with tidymodels, h2o and mlr3 (automl) all at the same time) and partly the vignette for automl maybe being a bit too concise at the moment, so I got my wires crossed. Now that I've read your explanation and the actual ?AutoML doc rather than only the vignette, it's much clearer.

Just to expand: with mlr3 there's a commit where you can keep NAs in the target during imputation (imputing only the non-target variables, then dropping the target NAs afterwards). These samples can help the imputation of non-target NAs: if you have variables A:E (with E the target) and a sample with E missing, you don't necessarily want to drop that sample before imputation, because it can still be useful for imputing missing A:D values in other samples, if that makes sense. So you want to keep the sample for imputation and then drop it afterwards.
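
A base-R sketch of that workflow (using the placeholder names from the comment: data frame `df` with features A:D and target E):

```r
# Impute feature NAs using every row: a sample with E missing still
# contributes to the per-column statistics (medians here).
feats <- setdiff(names(df), "E")
df[feats] <- lapply(df[feats], function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
})

# Only afterwards drop the rows whose target is missing, before modelling.
df_fit <- df[!is.na(df$E), ]
```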

Anyway, I accidentally used the data set where I had left the target NAs in with automl, and obviously did not read the error message properly, which led me to totally misinterpret the preprocessing options. I probably shouldn't have been experimenting with so many packages at once, as I had too many things going through my head to think this through properly. Sorry about that, but talking this through has helped a lot!

I have now run the full option on a dataset with the target NAs removed (but the non-target NAs left in) and it worked beautifully. Thank you.

