
Comments (8)

GaelVaroquaux avatar GaelVaroquaux commented on September 22, 2024 3

And, besides, it's not very hard to write:

make_pipeline(TableVectorizer(), SimpleImputer(), RandomForestClassifier())

Not much more difficult than:

make_pipeline(TableVectorizer(), RandomForestClassifier())

from skrub.

jeromedockes avatar jeromedockes commented on September 22, 2024 1

Still, the goal of the TableVectorizer is to prepare a table so that the rest of the pipeline works on it without problems. Quite a few estimators lack support for missing values, and missing values are ubiquitous, so it is worth trying to find ways to improve the user experience. I would suggest keeping the issue open for discussion.


GaelVaroquaux avatar GaelVaroquaux commented on September 22, 2024 1

I disagree with your desire to have an option to do it automatically: there is no good default and it tends to depend a lot on the downstream estimator.

If you really want good behavior by default, you should really use HistGradientBoosting, which is very robust to many things.


GaelVaroquaux avatar GaelVaroquaux commented on September 22, 2024

This is not a bug in TableVectorizer: it's down to the learner to handle missing values (because the strategy to handle missing values must differ depending on the learner).

If the learner does not handle missing values, you should add an imputer (as you did).

In addition, RandomForests handle missing values in the upcoming release of scikit-learn: scikit-learn/scikit-learn#5870
So your specific problem will disappear real soon.

However, we recommend using HistGradientBoosting rather than RandomForest: it often works better.


jeromedockes avatar jeromedockes commented on September 22, 2024

But I agree that, at the very least, the default should probably be to output NaNs where there are missing values, as is currently the case.


tomMoral avatar tomMoral commented on September 22, 2024

I agree that this depends on the downstream classifier but I think having an option to "fill missing value" would be a nice feature as the goal of TableVectorizer is to take a table and "vectorize" it. (that is why this is a feature request and not a bug ;) )


tomMoral avatar tomMoral commented on September 22, 2024

Yes it is easy to fix (I used numerical_transformer=SimpleImputer) but I found this behavior unexpected as I thought (without reading the doc) that I would get a vector out of this Transformer.
I find the name confusing as this does not vectorize the table (to me, a vector should have a consistent type for all its entries).
This class only acts on the categories and not the numerical values so maybe it would be better to call it CategoryVectorizer, to make it clear it does not touch the numerics.

my 2cts :)


GaelVaroquaux avatar GaelVaroquaux commented on September 22, 2024

