Comments (5)
Interesting idea!
I do see that there is an increasing dichotomy in skrub between relying on dataframes or arrays, so this is an important discussion.
This is because we sit somewhere between projects like pandas and scikit-learn. Pipeline-wise, I usually see skrub as taking dataframes as input (e.g. from pandas) and returning arrays (for scikit-learn) for machine learning. This is what the TableVectorizer does.
I would then use Joiners as a step before the TableVectorizer, which returns numerical arrays for scikit-learn models; this is how it's currently done in the examples. In this case, there is no need for dataframes as output.
@jovan-stojanovic I see many use cases (including examples in the AggJoiner) where running TableVectorizer first is a must.
@GaelVaroquaux, I agree with you on the consistency with scikit-learn. I still think set_output is confusing for newcomers who won't catch it in the docs, and an awful design pattern from a user perspective, IMHO.
Let's close this issue, then.
Thanks for the clarification. I didn't have all these elements in mind.
I have a huge bias toward the user side and will always promote things that make life easier for regular data scientists. Ultimately, we do this for them. "User first and maintainers adapt", in some ways.