
FEAT enable pandas output in `TableVectorizer` as a parameter about skrub HOT 5 CLOSED

Vincent-Maladiere commented on June 17, 2024

FEAT enable pandas output in `TableVectorizer` as a parameter


Comments (5)

jovan-stojanovic commented on June 17, 2024

Interesting idea!

I do see that there is an increasing dichotomy in skrub between relying on dataframes or arrays, so this is an important discussion.

This is because we sit somewhere between projects like pandas and scikit-learn. That is, pipeline-wise, I usually see skrub as taking dataframes as input (e.g. from pandas) and returning arrays (for scikit-learn) for machine learning. This is what the TableVectorizer does.

I would then use Joiners as a step before the TableVectorizer, which is how the examples currently do it. The TableVectorizer then returns numerical arrays for scikit-learn models. In this case, there is no need for dataframes as output.


GaelVaroquaux commented on June 17, 2024

> This option is currently available with set_output(transform="pandas") via the SetOutputMixin inherited from TransformerMixin. However, most users won't know this option even exists. Instead, I suggest adding a return_dataframe parameter to TableVectorizer's __init__ method.

I'd rather not depart from the choices made in scikit-learn. We should instead document this feature better, which probably means, among other things, using it more in our examples.


Vincent-Maladiere commented on June 17, 2024

@jovan-stojanovic I see many use cases (including examples in the AggJoiner) where running TableVectorizer first is a must.

@GaelVaroquaux, I agree with you on the consistency with scikit-learn. I still think set_output is confusing for newcomers who won't catch it in the doc and an awful design pattern from a user perspective, IMHO.

Let's close this issue, then.


GaelVaroquaux commented on June 17, 2024

> @GaelVaroquaux, I agree with you on the consistency with scikit-learn. I still think set_output is confusing for newcomers who won't catch it in the doc and an awful design pattern from a user perspective, IMHO.

It's there for a variety of reasons:

- It can be enforced consistently in all the estimators that inherit from BaseEstimator. We feared that support across all the scikit-learn compatible libraries would be inconsistent.
- It can also be controlled via https://scikit-learn.org/stable/modules/generated/sklearn.set_config.html. Here the tension is between user-facing code (i.e. a data scientist using scikit-learn to analyse data) and library code (i.e. someone writing a library using scikit-learn).
- It opens the door to future evolution as the dataframe ecosystem changes and we can better support a variety of containers.

API choices in scikit-learn are made with a lot of care, and I hesitate to overrule them.


Vincent-Maladiere commented on June 17, 2024

Thanks for the clarification. I didn't have all these elements in mind.

I have a huge bias toward the user side and will always promote things that make life easier for regular data scientists. Ultimately, we do this for them. "User first and maintainers adapt", in some ways.
