Comments (10)
Having a conditional transformer might be useful when something more general than selecting columns is needed though, such as "apply a PCA if there are more than 200 columns"
from skrub.
With the upcoming "Recipe" (or "PipeBuilder" or whatever its name will be), it
will be easy to apply a transformation to only some columns.
For example you would be able to do something like this:
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.base import BaseEstimator
>>> from skrub._pipe_builder import PipeBuilder
>>> from skrub import selectors as s
>>> from skrub import TableVectorizer
>>> class DatetimeSplines(BaseEstimator):
... "dummy placeholder"
... def fit_transform(self, X, y=None):
... return self.transform(X)
...
... def transform(self, X):
... print(f"\ntransform: {X.columns.tolist()}\n")
... values = np.ones(X.shape[0])
... return pd.DataFrame({"spline_0": values, "spline_1": values})
>>> pipe = (
... PipeBuilder()
... .apply(DatetimeSplines(), cols=s.all() & "date")
... .apply(TableVectorizer())
... ).get_pipeline()
>>> df = pd.DataFrame({
... "date": ["2020-01-02", "2021-04-03"],
... "temp": [10.1, 17.5]
... })
The column "date" gets transformed by the spline transformer:
>>> pipe.fit_transform(df)
transform: ['date']
   temp  spline_0  spline_1
0  10.1       1.0       1.0
1  17.5       1.0       1.0
When there is no column matching the selector, the spline transformer is not applied:
>>> df = pd.DataFrame({
... "not_date": ["2020-01-02", "2021-04-03"],
... "temp": [10.1, 17.5]
... })
>>> pipe.fit_transform(df)
   not_date_year  not_date_month  not_date_day  not_date_total_seconds  temp
0         2020.0             1.0           2.0            1577923200.0  10.1
1         2021.0             4.0           3.0            1617408000.0  17.5
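The skip-when-nothing-matches behavior can be sketched in plain pandas (a toy helper, not skrub's implementation; `apply_if_matching` is a made-up name):

```python
import pandas as pd

def apply_if_matching(df, cols, transform):
    """Apply `transform` to the columns in `cols`, pass the rest through.

    Toy illustration of "apply a transformer only to matching columns":
    when no column matches, the transformer is simply skipped.
    """
    matching = [c for c in df.columns if c in cols]
    rest = df[[c for c in df.columns if c not in cols]]
    if not matching:
        return rest  # nothing matches: the transformer is skipped
    return pd.concat([rest, transform(df[matching])], axis=1)
```

With cols=["date"] the transform runs on that column; with a selector that matches nothing, the input passes through unchanged.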
Does that more or less address the problem you are facing?
However, if the important part is not really the name "date" but rather applying
the spline transformer to datetime columns only, you might already be able to
use the TableVectorizer's datetime_transformer parameter, by passing your
transformer instead of the default DatetimeEncoder.
(note: the snippet below does not run on the main branch, but it does on the branch of PR #902)
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator
from skrub import TableVectorizer

class DatetimeSplines(BaseEstimator):
    "dummy placeholder"

    def fit_transform(self, X, y=None):
        return self.transform(X)

    def transform(self, X):
        print(f"\ntransform: {X.columns.tolist()}\n")
        values = np.ones(X.shape[0])
        return pd.DataFrame({"spline_0": values, "spline_1": values})
>>> vectorizer = TableVectorizer(datetime_transformer=DatetimeSplines())
>>> df = pd.DataFrame({
... "date": ["2020-01-02", "2021-04-03"],
... "temp": [10.1, 17.5]
... })
>>> vectorizer.fit_transform(df)
transform: ['date']
   spline_0  spline_1  temp
0       1.0       1.0  10.1
1       1.0       1.0  17.5
>>> df = pd.DataFrame({
... "not_date": ["blue", "red"],
... "temp": [10.1, 17.5]
... })
>>> vectorizer.fit_transform(df)
   not_date_red  temp
0           0.0  10.1
1           1.0  17.5
> Does that more or less address the problem you are facing?
I think it does, just one thing: how would the DatetimeSplines featurizer know
which columns to select or ignore? Does the date column name need to be passed
into the estimator? There may also be more than one date column in the dataframe.
> However, if the important part is not really the name "date" but rather
> applying the spline transformer to datetime columns only
Do we want to assume that the user ran their dataframe code or do we want our library to infer that on their behalf? I am partially asking because polars/pandas handle the date stuff slightly differently. But I am also wondering about categorical types. Do we only one-hot encode columns that are categorical?
For selecting all datetime columns you could use the skrub.selectors.any_date()
selector -- I just need to update the PipeBuilder branch with the current state of PR #902 and I'll show a snippet.
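As a rough pandas analogue of such a selector (illustrative only, not the skrub selectors API), datetime columns can be picked out by dtype:

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.to_datetime(["2020-01-02", "2021-04-03"]),
    "B": [10.1, 17.5],
    "C": ["red", "blue"],
})

# Rough pandas analogue of an "any date" selector: pick columns by dtype,
# so any number of datetime columns is matched without naming them.
date_cols = df.select_dtypes(include="datetime").columns.tolist()
print(date_cols)  # ['A']
```

This also answers the "more than one date column" concern: selection is by dtype, not by a hard-coded column name.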
> Do we want to assume that the user ran their dataframe code or do we want our
> library to infer that on their behalf? I am partially asking because
> polars/pandas handle the date stuff slightly differently. But I am also
> wondering about categorical types. Do we only one-hot encode columns that are categorical?
I think we will have the TableVectorizer, which tries to guess on your behalf,
and the PipeBuilder, which allows you to build your own pipeline with more
control over the different choices.
The TableVectorizer will one-hot encode anything that is strings or Categorical
with a low cardinality. It will also try to parse strings as datetimes and apply
the datetime_encoder if it succeeds.
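That guessing can be sketched roughly in pandas (an illustrative heuristic, not the TableVectorizer code; the cardinality threshold here is invented):

```python
import pandas as pd

def guess_and_encode(col, low_cardinality_threshold=40):
    """Toy version of the guessing: parse datetimes, else one-hot low cardinality."""
    if col.dtype == object:
        # try to parse strings as datetimes
        try:
            return pd.to_datetime(col)
        except (ValueError, TypeError):
            pass
        # low-cardinality strings: one-hot encode
        if col.nunique() <= low_cardinality_threshold:
            return pd.get_dummies(col, prefix=col.name)
    # numbers etc. pass through
    return col
```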
If you wanted to manually control your pipeline you could do something like:
import pandas as pd
import numpy as np
from skrub import ToDatetime
from skrub import selectors as s
from skrub._pipe_builder import PipeBuilder
from skrub._on_each_column import SingleColumnTransformer

class DatetimeSplines(SingleColumnTransformer):
    "dummy placeholder"

    def fit_transform(self, col, y=None):
        return self.transform(col)

    def transform(self, col):
        name = col.name
        print(f" ==> transform: {name}")
        values = np.ones(len(col))
        return pd.DataFrame({f"{name}_spline_0": values, f"{name}_spline_1": values})

pipe = (
    PipeBuilder()
    .apply(ToDatetime(), allow_reject=True)
    .apply(DatetimeSplines(), cols=s.any_date())
).get_pipeline()
>>> df = pd.DataFrame({
... "A": ["2020-01-02", "2021-04-03"],
... "B": [10.1, 17.5],
... "C": ["2020-01-02T00:01:02", "2021-04-03T10:11:12"],
... "D": ["red", "blue"],
... })
>>> df
            A     B                    C     D
0  2020-01-02  10.1  2020-01-02T00:01:02   red
1  2021-04-03  17.5  2021-04-03T10:11:12  blue
>>> pipe.fit_transform(df)
==> transform: A
==> transform: C
   A_spline_0  A_spline_1     B  C_spline_0  C_spline_1     D
0         1.0         1.0  10.1         1.0         1.0   red
1         1.0         1.0  17.5         1.0         1.0  blue
allow_reject means "let the ToDatetime transformer decide whether it should be
applied to the column or not, and reject the columns that don't look like dates"
(by default it is False).
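The rejection idea can be sketched as a transformer that attempts the conversion and hands the column back untouched when parsing fails (a sketch of the concept, not skrub's actual ToDatetime implementation):

```python
import pandas as pd

class RejectingToDatetime:
    """Sketch of allow_reject: convert to datetime, or give the column back as-is."""

    def fit_transform(self, col, y=None):
        try:
            return pd.to_datetime(col)
        except (ValueError, TypeError):
            return col  # rejected: the column does not look like dates
```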
But if you want something completely automatic, e.g. because you are running on
many datasets that you don't inspect manually, then you're probably better off
using the TableVectorizer and letting it do the preprocessing and make those
choices for you.
It will apply all these processing steps:
- check input dataframe
  - fit_transform:
    - convert arrays to dataframes
    - ensure column names are strings
    - ensure column names are unique
    - check dataframe is not a pandas sparse dataframe
    - ensure dataframe is not lazy
  - transform:
    - same checks as fit_transform
    - check dataframe library is the same as in fit
    - check column names are the same as in fit
- fit_transform:
  - clean null strings
    - replace "N/A", "" etc. with actual nulls
  - to datetime
    - try to parse strings as datetimes
    - ensure consistent output dtype (resolution + timezone awareness + timezone)
  - to float
    - try to convert anything but dates and categoricals to float32
    - ensure consistent output dtype
  - clean categories (pandas)
    - ensure categories are strings stored with object dtype
    - ensure categorical columns don't contain pd.NA
    - ensure consistent output dtype
  - convert all remaining columns to string
    - convert pandas StringDtype to object & remove pd.NA
  - apply the user-defined transformers
    - low_cardinality_transformer (low-cardinality strings and categorical): by default one-hot encode, but you could use eg ToCategorical to take advantage of the HistGradientBoostingRegressor's categorical_features='from_dtype' option
    - high_cardinality_transformer (high-cardinality strings and categorical): by default GapEncoder; MinHashEncoder can be a good choice
    - datetime_encoder (dates & datetimes -- including those that have been parsed from strings during preprocessing): by default DatetimeEncoder; you could replace it with the custom encoder with splines
    - numeric_encoder (numbers): by default, passthrough
  - try to convert all outputs to float32
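The per-column cascade above can be sketched in plain pandas (illustrative only; the real TableVectorizer handles many more edge cases, uses a longer null-string list, and tries datetimes before floats, whereas this sketch tries floats first to stay short):

```python
import numpy as np
import pandas as pd

NULL_STRINGS = {"", "N/A", "n/a"}  # illustrative subset

def preprocess_column(col):
    """Toy version of the per-column preprocessing cascade."""
    if col.dtype != object:
        return col
    # replace "N/A", "" etc. with actual nulls
    col = col.map(lambda v: None if v in NULL_STRINGS else v)
    # try to convert to float32
    try:
        return col.astype(np.float32)
    except (ValueError, TypeError):
        pass
    # try to parse strings as datetimes
    try:
        return pd.to_datetime(col)
    except (ValueError, TypeError):
        pass
    # convert all remaining columns to string
    return col.astype(str)
```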