Handle id columns differently about skrub HOT 5 OPEN

LeoGrin commented on September 26, 2024

Handle id columns differently

Comments (5)

jovan-stojanovic commented on September 26, 2024

Thanks @LeoGrin, great suggestion.

As a matter of fact, I think this is a very common use case. I think you are right that ID columns should be treated differently, and I like the idea of dropping them.

Maybe adding a warning alongside would be good, for instance:
The 'id_name' column was identified as an ID column. Use column_specific_transformers if you still wish to include it.

The only challenge would be how to differentiate between an ID column and some other high cardinality column (for instance, the population of two countries is never exactly the same but is not an ID).
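To make the proposal and its pitfall concrete, here is a minimal sketch in plain Python. It is not skrub's implementation: the helper name `drop_id_like_columns`, the `threshold` parameter, and the dict-of-lists table are all assumptions made for illustration. It drops every column whose values are all distinct and emits the warning suggested above.

```python
import warnings


def drop_id_like_columns(table, threshold=1.0):
    """Return a copy of `table` without columns that look like IDs.

    `table` maps column names to lists of values. A column is flagged when
    its fraction of distinct values reaches `threshold` (a hypothetical rule,
    not something skrub defines).
    """
    kept = {}
    for name, values in table.items():
        unique_ratio = len(set(values)) / len(values) if values else 0.0
        if unique_ratio >= threshold:
            warnings.warn(
                f"The '{name}' column was identified as an ID column. "
                "Use column_specific_transformers if you still wish to include it."
            )
            continue
        kept[name] = values
    return kept


# Toy data; the column names and values are made up for this example.
table = {
    "id_name": ["a1", "b2", "c3", "d4"],
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "population": [67_000_000, 83_000_000, 1_400_000_000, 125_000_000],
}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = drop_id_like_columns(table)
```

With this naive rule, `population` is dropped along with `id_name`, which is exactly the false positive described above: uniqueness alone cannot separate an ID from another high-cardinality column.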

GaelVaroquaux commented on September 26, 2024

Great discussion. To me the real challenge is: how do we come up with a heuristic that is simple enough and somewhat reliable? It needs to be simple so that users understand it. It probably revolves around the number of distinct n-grams compared to the number of rows. Typically, on dirty categories, I expect the number of n-grams to scale roughly as the log of the number of rows (it's documented in Patricio Cerda's papers).
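As a rough illustration of that idea, the sketch below counts distinct character 3-grams in a column and divides by the number of rows. Everything here is an assumption for the sake of the example: the function names, the choice of n = 3, the ratio as a score, and the synthetic data. It only shows the direction of the effect, not a validated threshold.

```python
def distinct_ngrams(values, n=3):
    """Return the set of character n-grams found across all values."""
    grams = set()
    for value in values:
        text = str(value)
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def ngrams_per_row(values, n=3):
    """Ratio of distinct n-grams to number of rows (hypothetical score)."""
    return len(distinct_ngrams(values, n)) / len(values)


# Dirty category: a few underlying entities with typos, repeated many times.
dirty = ["london", "londn", "paris", "pariss", "berlin"] * 200

# ID-like column: 1000 distinct zero-padded hex strings. Multiplying by an
# odd constant modulo 2**32 is a bijection, so no value repeats.
ids = [format(i * 2654435761 % 2**32, "08x") for i in range(1000)]

dirty_score = ngrams_per_row(dirty)
id_score = ngrams_per_row(ids)
```

The dirty column keeps reusing the same handful of n-grams however many rows are added, so its score shrinks as the table grows, while the ID-like column keeps producing new n-grams with almost every row.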

LilianBoulard commented on September 26, 2024

Thanks for the analysis!

It seems to me we should optionally drop this type of column during fetching.
This is something that should be implemented as part of #581.

I don't see how it would be possible to identify ID columns reliably as Jovan suggested.

LeoGrin commented on September 26, 2024

> The only challenge would be how to differentiate between an ID column and some other high cardinality column (for instance, the population of two countries is never exactly the same but is not an ID).

I agree this is a real challenge. We should definitely do this only on non-numerical columns. Weirder columns might be an issue (I'm thinking about the geolocation column in traffic_violations, which is a tuple of floats), but I'm not sure we want these columns treated as high-cardinality columns either.
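A minimal sketch of that restriction, in plain Python rather than skrub code: the helper `is_numeric_column` and the toy columns (including the `driver_id` name and all values) are hypothetical. Only columns that are not purely numeric remain candidates for ID detection.

```python
from numbers import Number


def is_numeric_column(values):
    """True when every non-missing value is a number (bools excluded)."""
    present = [v for v in values if v is not None]
    return bool(present) and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in present
    )


# Toy columns; the values are made up for this example.
columns = {
    "population": [67_000_000, 83_000_000, 125_000_000],
    "driver_id": ["A-001", "A-002", "A-003"],
    "geolocation": [(39.0, -77.1), (38.9, -77.0), (39.1, -77.2)],
}
candidates = [
    name for name, values in columns.items() if not is_numeric_column(values)
]
```

The tuple-valued `geolocation` column is not numeric under this check, so it stays among the candidates. That is the "weirder column" case: it needs its own decision rather than falling out of the numeric filter.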

> To me the real challenge is: how do we come up with a heuristic that is simple enough and somewhat reliable? It needs to be simple so that users understand it.

Agree

> It probably revolves around the number of distinct n-grams compared to the number of rows. Typically, on dirty categories, I expect the number of n-grams to scale roughly as the log of the number of rows (it's documented in Patricio Cerda's papers).

Thanks! I'm going to experiment with this.

GaelVaroquaux commented on September 26, 2024

> It seems to me we should drop this type of columns optionally during fetching.

That's going to solve the problem for our examples, but our users are likely to still face this problem.
