Dropping a column is a very common need that we may want to facilitate. Here are a f


API Idea: add a "drop" argument (skrub, open issue)

GaelVaroquaux commented on June 21, 2024 1


Comments (4)

jeromedockes commented on June 21, 2024 1

I mean 3 or 4, i.e. having a separate transformer for selection. Indeed, it can do more than just subset columns.


TheooJ commented on June 21, 2024

I agree that this would be useful. To reply to your points:

I like 2. more than 1. because I think it makes sense to have different dropping strategies for different transformers. That being said, if they share a common, widely used drop case, it would make sense to have this argument in TableVectorizer too. This way you would set up the dropping strategy for all of them at once.

It could also account for strategies other than dropna. For instance, practitioners often drop based on 1) very sparse columns, or features that are present only for a small number of ids, 2) correlations (among features, between feature and target), 3) outliers.
I like the idea of verbs over nouns, but I would choose nouns to avoid a contrast with scikit-learn.
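
To make the "very sparse columns" strategy concrete, here is a minimal sketch in plain pandas. The helper name `drop_sparse_columns` and its `max_null_fraction` parameter are hypothetical and only for illustration. They are not part of skrub.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_null_fraction: float = 0.9) -> pd.DataFrame:
    """Hypothetical helper: drop columns whose share of nulls exceeds the threshold."""
    # Per-column fraction of missing values, between 0.0 and 1.0.
    null_fraction = df.isna().mean()
    to_drop = null_fraction[null_fraction > max_null_fraction].index
    return df.drop(columns=to_drop)


df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "mostly_missing": [None, None, None, 1.0],  # 75% nulls
        "complete": [0.5, 0.1, 0.3, 0.9],
    }
)

# With a 50% threshold, only "mostly_missing" is removed.
cleaned = drop_sparse_columns(df, max_null_fraction=0.5)
print(list(cleaned.columns))
```

Correlation-based or outlier-based dropping would need the target or fitted statistics, which is one reason these strategies look more like feature selection than a simple drop list.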

wdyt?


Vincent-Maladiere commented on June 21, 2024

@TheooJ I think Gaël's comment is about removing columns based on user-defined lists. You'd need a separate transformer to perform feature selection.

Maybe we could combine 1. and 4.: having a drop parameter on the TableVectorizer and allowing renaming or simple column manipulation operations in kwargs. In addition, we could also introduce the ColSelector for usage outside of TableVectorizer.

Let's assume a slightly different identity from scikit-learn with verbs rather than nouns ;)


jeromedockes commented on June 21, 2024

I think I prefer option 3, "add a Drop transformer". The vectorizer has quite a few parameters already, and I believe a slightly longer pipeline with simpler steps is easier to understand than a pipeline where some steps do a lot of things. Also, I'm not sure, but there could be situations where a user wants control over where the drop happens, e.g. to drop a lot of columns as soon as possible to save memory, or to use a column for a join and drop it afterwards for prediction.
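
A rough sketch of what option 3 could look like, using the "join then drop" situation mentioned above. The `DropColumns` class, the `add_region` function and the example tables are all hypothetical and written only for this illustration. None of them is skrub's actual API.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


class DropColumns(TransformerMixin, BaseEstimator):
    """Hypothetical stand-alone drop step (an assumption, not skrub's API)."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        # Fail early if a requested column is absent from the input.
        missing = [c for c in self.cols if c not in X.columns]
        if missing:
            raise ValueError(f"Columns not found: {missing}")
        return self

    def transform(self, X):
        return X.drop(columns=self.cols)


orders = pd.DataFrame(
    {"order_id": [1, 2, 3], "region_id": [10, 20, 10], "amount": [5.0, 7.5, 3.0]}
)
regions = pd.DataFrame({"region_id": [10, 20], "region_name": ["north", "south"]})


def add_region(X):
    # The join needs "region_id", so the drop has to come after this step.
    return X.merge(regions, on="region_id", how="left")


pipeline = Pipeline(
    [
        ("join", FunctionTransformer(add_region)),
        ("drop", DropColumns(["order_id", "region_id"])),
    ]
)

features = pipeline.fit_transform(orders)
print(list(features.columns))
```

Because the drop is its own step, the user decides where it runs: after the join here, or at the very start of the pipeline when the goal is to save memory.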

