Comments (8)
And, besides, it's not very hard to write:
make_pipeline(TableVectorizer(), SimpleImputer(), RandomForestClassifier())
Not much more difficult than:
make_pipeline(TableVectorizer(), RandomForestClassifier())
from skrub.
Still, the goal of the TableVectorizer is to prepare a table so that the rest of the pipeline works on it without problems. Quite a few estimators lack support for missing values, and missing values are ubiquitous, so it is worth trying to find ways to improve the user experience. I would suggest keeping the issue open for discussion.
from skrub.
I disagree with your desire to have an option to do it automatically: there is no good default and it tends to depend a lot on the downstream estimator.
If you really want good behavior by default, you should use HistGradientBoosting, which is very robust to many things.
from skrub.
This is not a bug in TableVectorizer: it is up to the learner to handle missing values (because the strategy for handling missing values must differ depending on the learner).
If the learner does not handle missing values, you should add an imputer (as you did).
In addition, RandomForests handle missing values in the upcoming release of scikit-learn: scikit-learn/scikit-learn#5870
So your specific problem will disappear very soon.
However, we recommend using HistGradientBoosting instead of RandomForest: it often works better.
from skrub.
But I agree that, at the very least, the default should probably be to output NaNs where there are missing values, as is currently the case.
from skrub.
I agree that this depends on the downstream classifier, but I think having an option to "fill missing values" would be a nice feature, as the goal of TableVectorizer is to take a table and "vectorize" it. (That is why this is a feature request and not a bug ;) )
from skrub.
Yes, it is easy to fix (I used numerical_transformer=SimpleImputer), but I found this behavior unexpected, as I thought (without reading the doc) that I would get a vector out of this Transformer.
I find the name confusing, as this does not vectorize the table (to me, a vector should have a consistent type for all its entries).
This class only acts on the categories and not the numerical values, so maybe it would be better to call it CategoryVectorizer, to make it clear it does not touch the numerics.
my 2cts :)
from skrub.