
Comments (4)

jeromedockes commented on June 21, 2024

I mean 3 or 4, i.e. having a separate transformer for selection. Indeed, it can do more than just subset columns.

from skrub.

TheooJ commented on June 21, 2024

I agree that this would be useful. To reply to your points:

I like 2. more than 1. because I think it makes sense to have different dropping strategies for different transformers. That being said, if they share a common, widely used drop case, it would make sense to have this argument in TableVectorizer too. This way you would set up the dropping strategy for all of them at once.

  1. It could possibly account for strategies other than dropna. For instance, practitioners often drop 1) very sparse columns, or features that are present only for a small number of ids, 2) columns chosen by correlation (among features, or between a feature and the target), 3) outliers

  2. I like the idea of verbs over nouns, but I would choose nouns to avoid a contrast with scikit-learn
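
To make the first of the strategies listed above concrete, here is a minimal sketch of "drop very sparse columns". It is only an illustration, not skrub's API: the helper name `sparse_columns`, the `max_missing` threshold and the dict-of-lists table are all assumptions, chosen so the snippet runs without any dependency.

```python
def sparse_columns(table, max_missing=0.9):
    """Return the names of columns whose fraction of missing values
    (represented here by None) is strictly above `max_missing`.

    Hypothetical helper: `table` is a plain dict mapping
    column name -> list of values.
    """
    to_drop = []
    for name, values in table.items():
        if not values:
            # An empty column has no defined missing fraction; skip it.
            continue
        missing_fraction = sum(v is None for v in values) / len(values)
        if missing_fraction > max_missing:
            to_drop.append(name)
    return to_drop


table = {
    "id": [1, 2, 3, 4],
    "age": [31, None, 45, 28],
    "fax": [None, None, None, None],
}

to_drop = sparse_columns(table, max_missing=0.5)
kept = {name: values for name, values in table.items() if name not in to_drop}
print(to_drop)      # ['fax']
print(list(kept))   # ['id', 'age']
```

A per-transformer option would let each step pick its own threshold, while a single shared argument would apply one threshold everywhere.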

What do you think?

Vincent-Maladiere commented on June 21, 2024

@TheooJ I think Gaël's comment is about removing columns based on user-defined lists. You'd need a transformer to perform feature selection.

Maybe we could combine 1. and 4.: having a drop parameter on the TableVectorizer and allowing renaming or simple column manipulation operations in kwargs. In addition, we could also introduce the ColSelector for use outside of TableVectorizer.

Let's assume a slightly different identity from scikit-learn with verbs rather than nouns ;)


jeromedockes commented on June 21, 2024

I think I prefer option 3, "add a Drop transformer". The vectorizer has quite a few parameters already, and I believe a slightly longer pipeline with simpler steps is easier to understand than a pipeline where some steps do a lot of things. Also, I'm not sure, but there could be situations where a user wants control over where the drop happens, e.g. to drop a lot of columns as soon as possible to save memory, or to use a column for a join and drop it afterwards for prediction.
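
As a rough illustration of the "join, then drop" case, here is a sketch of what a standalone drop step could look like. Everything in it is hypothetical: `DropColumns` and `join_on` are made-up names, not skrub or scikit-learn objects, and tables are plain dicts of lists so the snippet runs on its own. It only mimics the scikit-learn `fit` / `transform` convention.

```python
class DropColumns:
    """Hypothetical transformer that removes a user-defined list of columns.

    Tables are plain dicts mapping column name -> list of values.
    """

    def __init__(self, columns):
        self.columns = list(columns)

    def fit(self, table, y=None):
        # Nothing to learn; fail early if a requested column is absent.
        missing = [c for c in self.columns if c not in table]
        if missing:
            raise KeyError(f"columns not found: {missing}")
        return self

    def transform(self, table):
        # Build a new table; the input is left untouched.
        return {
            name: values
            for name, values in table.items()
            if name not in self.columns
        }

    def fit_transform(self, table, y=None):
        return self.fit(table, y).transform(table)


def join_on(left, right, key):
    """Toy left join of two tables on `key` (keys in `right` assumed unique)."""
    row_of = {k: i for i, k in enumerate(right[key])}
    out = {name: list(values) for name, values in left.items()}
    for name, values in right.items():
        if name != key:
            out[name] = [
                values[row_of[k]] if k in row_of else None for k in left[key]
            ]
    return out


orders = {"customer_id": [1, 2, 1], "amount": [10.0, 25.0, 7.5]}
customers = {"customer_id": [1, 2], "country": ["FR", "DE"]}

# The key column is needed for the join...
joined = join_on(orders, customers, key="customer_id")
# ...and dropped afterwards, before the features reach a predictor.
features = DropColumns(["customer_id"]).fit_transform(joined)
print(features)
# {'amount': [10.0, 25.0, 7.5], 'country': ['FR', 'DE', 'FR']}
```

Because the drop is its own step, the user decides where it sits: right at the start to save memory, or after a join as shown here.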
