Hi @jovan-stojanovic, @GaelVaroquaux, @ogrisel, after further diving into this issue, we might not need to use the Frobenius norm or pairwise distance at all!
The concern about giving each column type (string, numerical) equal weight stems from the downstream kNN step, which uses the Euclidean distance. For example, suppose we run a fuzzy join on one string column and one numerical column. The Euclidean distance between two mixed embeddings with indices $i$ and $j$ is:
$$d_{ij} = \sqrt{\sum_{k\in \mathcal{S}_{ij}} (x_{ik} - x_{jk})^2 + (x_{ik_{num}} - x_{jk_{num}})^2}$$
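Where $\mathcal{S}_{ij}$ is the set of string-embedding dimensions entering the sum for the pair $(i, j)$ and $k_{num}$ is the index of the numerical column.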
Ideally, to equally take into account both numerical and string columns, we would like to ensure:
$$\frac{1}{N}\sum_{i=1}^N\sum_{k\in \mathcal{S}_{ij}} (x_{ik} - x_{jk})^2 \approx \frac{1}{N}\sum_{i=1}^N(x_{ik_{num}} - x_{jk_{num}})^2$$
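To make the two terms concrete, here is a minimal sketch that splits the squared distance between two mixed rows into its string and numerical parts. The helper name `block_sq_distances` and the toy values are made up for illustration; they are not part of dirty_cat.

```python
import math


def block_sq_distances(str_i, str_j, num_i, num_j):
    """Return the squared-distance contributions of the string and numerical blocks."""
    str_part = sum((a - b) ** 2 for a, b in zip(str_i, str_j))
    num_part = (num_i - num_j) ** 2
    return str_part, num_part


# Toy rows: a 3-dimensional string embedding plus one numerical feature.
str_i, num_i = [0.6, 0.8, 0.0], 0.5
str_j, num_j = [0.0, 0.6, 0.8], -1.0

str_part, num_part = block_sq_distances(str_i, str_j, num_i, num_j)
d_ij = math.sqrt(str_part + num_part)

# Same value as the Euclidean distance between the concatenated rows.
d_concat = math.dist(str_i + [num_i], str_j + [num_j])
print(str_part, num_part, d_ij, d_concat)
```

If one part is systematically much larger than the other, it dominates the nearest-neighbor search, which is the imbalance the condition above is meant to rule out.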
We now compute the Frobenius norm for the string and numerical column blocks.
Squared Frobenius norm for the string column:
We use `HashingVectorizer` followed by `TfidfTransformer`. By default, `TfidfTransformer` uses the l2 norm, so every row of the output has unit length. Note that `HashingVectorizer` creates a sparse matrix with $K$ columns.
$$
||\tilde{X}_{\mathrm{str}}||_F^2=\sum_{i=1}^N \sum_{k=1}^K \tilde{x}_{ik}^2 = \sum_{i=1}^N \sum_{k=1}^K \frac{x_{ik}^2}{||x_i||^2_2} = \sum_{i=1}^N \frac{1}{||x_i||^2_2} \sum_{k=1}^K x_{ik}^2 = N
$$
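The identity can be checked by hand on a small matrix. The sketch below is plain Python, not the scikit-learn implementation; the count matrix is made up, and the helper names are hypothetical.

```python
import math


def l2_normalize_rows(rows):
    """Divide each row by its l2 norm (all-zero rows are left unchanged)."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        normalized.append([v / norm for v in row] if norm > 0 else list(row))
    return normalized


def squared_frobenius_norm(rows):
    """Sum of the squares of all entries."""
    return sum(v * v for row in rows for v in row)


# Made-up term weights for N = 4 rows and K = 5 hashed features.
counts = [
    [3, 0, 1, 0, 0],
    [0, 2, 0, 0, 5],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 7, 0],
]
sq_norm = squared_frobenius_norm(l2_normalize_rows(counts))
print(sq_norm, len(counts))
```

One caveat: a row with no non-zero entry (for example an empty string) contributes 0 rather than 1, so the result is exactly $N$ only when every row is non-empty.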
Squared Frobenius norm for the numerical column:
We use `StandardScaler` on our single numerical column, so the scaled column has zero mean and unit variance.
$$||\tilde{X}_{\mathrm{num}}||_F^2=\sum_{i=1}^N \tilde{x}_i^2 = N \mathbb{E}[\tilde{X}^2] = N (\mathbb{V}(\tilde{X}) + \mathbb{E}^2[\tilde{X}]) = N (1 + 0) = N
$$
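The same check for the numerical block, again as a standard-library sketch rather than the scikit-learn code. It assumes the population standard deviation (ddof=0), which is what `StandardScaler` uses; the `years` values are made up.

```python
import math
import statistics


def standardize(values):
    """Center to zero mean and scale to unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, i.e. ddof=0
    return [(v - mean) / std for v in values]


# Made-up hiring years, N = 6.
years = [1998, 2004, 2007, 2011, 2015, 2016]
scaled = standardize(years)
sq_norm = sum(v * v for v in scaled)
print(sq_norm, len(years))
```

The equality is exact only on the data used to compute the mean and standard deviation; on other data it holds approximately.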
Therefore, we have established:
$$||\tilde{X}_{\mathrm{str}}||_F^2 = ||\tilde{X}_{\mathrm{num}}||_F^2 = N$$
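The initial proposal was to divide each element by the Frobenius norm / N of its feature block. Doing so would divide each element, string or numerical alike, by:
$$\frac{||\tilde{X}||_F}{N} = \frac{\sqrt{N}}{N} = \frac{1}{\sqrt{N}}$$
i.e. multiply each element by $\sqrt{N}$. Since this factor is the same for both blocks, it leaves their relative weight in the distance unchanged.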
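Multiplying every coordinate of every row by one common positive constant scales all pairwise distances by that constant, so nearest-neighbor matches do not change. A small sketch with made-up embeddings (the helper names are hypothetical, and $\sqrt{N}$ is used as the constant, assuming both blocks have squared Frobenius norm $N$ as derived above):

```python
import math


def nearest_neighbor(query, candidates):
    """Index of the candidate closest to the query in Euclidean distance."""
    return min(range(len(candidates)), key=lambda idx: math.dist(query, candidates[idx]))


def rescale(rows, factor):
    """Multiply every coordinate of every row by the same factor."""
    return [[factor * v for v in row] for row in rows]


# Toy mixed embeddings: two string dimensions followed by one numerical dimension.
candidates = [[0.6, 0.8, 0.5], [0.0, 1.0, -1.2], [1.0, 0.0, 0.3], [0.8, 0.6, 2.0]]
queries = [[0.7, 0.7, 0.4], [0.1, 0.9, -1.0]]

factor = math.sqrt(len(candidates))  # common factor applied to both blocks
before = [nearest_neighbor(q, candidates) for q in queries]

scaled_candidates = rescale(candidates, factor)
scaled_queries = rescale(queries, factor)
after = [nearest_neighbor(q, scaled_candidates) for q in scaled_queries]
print(before, after)
```

The matches would change only if the two blocks were rescaled by different factors.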
Experiment
from dirty_cat import datasets
from sklearn.model_selection import train_test_split
from dirty_cat._fuzzy_join import _numeric_encoding, _string_encoding
salaries = datasets.fetch_employee_salaries()
X, y = salaries.X, salaries.y
X_train, X_test, y_train, y_test = train_test_split(X, y)
# numerical encoding
num_cols = ["year_first_hired"]
main_num_encoding, aux_num_encoding = _numeric_encoding(
    X_train,
    num_cols,
    X_test,
    num_cols,
)
print((main_num_encoding ** 2).mean())
>>> 0.9876
print((aux_num_encoding ** 2).mean())
>>> 1.0372
# string encoding
str_cols = ["employee_position_title"]
main_str_encoding, aux_str_encoding = _string_encoding(
    X_train,
    str_cols,
    X_test,
    str_cols,
    encoder=None,
    analyzer="char_wb",
    ngram_range=(2, 4),
)
# squared l2 norm of the first five rows of each string encoding
print(
    (main_str_encoding[:5].toarray() ** 2).sum(axis=1)
)
>>> [1., 1., 1., 1., 1.]
print(
    (aux_str_encoding[:5].toarray() ** 2).sum(axis=1)
)
>>> [1., 1., 1., 1., 1.]
Conclusion
Due to the l2 scaling of `TfidfTransformer` for string columns and the `StandardScaler` for numerical columns, dividing by the Frobenius norm is unnecessary :)