
Handle id columns differently (skrub issue, open)

LeoGrin commented on September 26, 2024
Handle id columns differently


Comments (5)

jovan-stojanovic commented on September 26, 2024

Thanks @LeoGrin, great suggestion.

As a matter of fact, I think this is a very common use case. I think you are right that ID columns should be treated differently, and I like the idea of dropping them.

Maybe adding a warning alongside would be good, for instance:
The 'id_name' column was identified as an ID column. Use column_specific_transformers if you still wish to include it.

The only challenge would be how to differentiate between an ID column and some other high cardinality column (for instance, the population of two countries is never exactly the same but is not an ID).
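To make the difficulty concrete, here is a minimal sketch of the naive approach, assuming pandas is available. The helper `drop_id_like_columns` and its `min_unique_ratio` parameter are hypothetical names, not part of skrub, and the table uses made-up toy values. It drops any column whose values are all distinct and emits the warning proposed above.

```python
import warnings

import pandas as pd


def drop_id_like_columns(df, min_unique_ratio=1.0):
    """Drop columns whose share of distinct values reaches min_unique_ratio.

    Hypothetical helper for illustration only; not a skrub function.
    """
    dropped = []
    for name in df.columns:
        unique_ratio = df[name].nunique(dropna=True) / max(len(df), 1)
        if unique_ratio >= min_unique_ratio:
            warnings.warn(
                f"The {name!r} column was identified as an ID column. "
                "Use column_specific_transformers if you still wish to include it."
            )
            dropped.append(name)
    return df.drop(columns=dropped), dropped


# Toy table with illustrative values: one true ID column, one categorical
# column, and one numeric column that simply never repeats.
countries = pd.DataFrame(
    {
        "country_id": ["c-001", "c-002", "c-003", "c-004"],
        "continent": ["Europe", "Europe", "Asia", "Asia"],
        "population": [67_000_000, 83_000_000, 1_400_000_000, 125_000_000],
    }
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warnings in this demo
    kept, dropped = drop_id_like_columns(countries)

print(dropped)
```

On this toy table the uniqueness test flags `population` along with `country_id`, which is exactly the false positive described above.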


GaelVaroquaux commented on September 26, 2024


LilianBoulard commented on September 26, 2024

Thanks for the analysis!

It seems to me we should optionally drop this type of column during fetching.
This is something that should be implemented as part of #581.

I don't see how it would be possible to identify ID columns reliably as Jovan suggested.


LeoGrin commented on September 26, 2024

> The only challenge would be how to differentiate between an ID column and some other high cardinality column (for instance, the population of two countries is never exactly the same but is not an ID).

I agree this is a real challenge. We should definitely do this only on non-numerical columns. Weirder columns might be an issue (I'm thinking about the geolocation column in traffic_violations, which is a tuple of floats), but I'm not sure we want these columns treated as high-cardinality columns either.
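As a rough sketch of that restriction, assuming pandas: the helper `candidate_id_columns` below is a hypothetical name, not a skrub function, and the table holds made-up values. It only looks at non-numeric columns and reports those in which every non-null value is distinct.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def candidate_id_columns(df):
    """Return non-numeric columns in which every non-null value is distinct.

    Hypothetical helper for illustration only; not a skrub function.
    """
    candidates = []
    for name in df.columns:
        col = df[name].dropna()
        if col.empty or is_numeric_dtype(col):
            continue  # numeric columns are never treated as IDs here
        # Compare string representations so unhashable values do not fail.
        if col.map(str).nunique() == len(col):
            candidates.append(name)
    return candidates


# Made-up rows: a string ID, a numeric column, a tuple-of-floats column,
# and a low-cardinality categorical column.
frame = pd.DataFrame(
    {
        "ticket_id": ["a1", "b2", "c3"],
        "population": [10, 20, 30],
        "geolocation": [(38.9, -77.0), (39.1, -76.9), (38.8, -77.2)],
        "state": ["MD", "MD", "VA"],
    }
)

print(candidate_id_columns(frame))
```

The numeric column is skipped, but the tuple-of-floats column is still reported as a candidate, so columns like geolocation would need a separate decision.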

> To me the real challenge is: how do we come up with a heuristic that is simple enough and somewhat reliable? It needs to be simple so that users understand it.

Agree

> It probably revolves around the number of distinct n-grams compared to the number of rows. Typically, on dirty categories, I expect the number of n-grams to scale roughly as the log of the number of rows (it's documented in Patricio Cerda's papers).

Thanks! I'm going to experiment with this.
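One possible starting point for such an experiment, as a sketch only: count the distinct character n-grams in a column and divide by the number of rows. The function names, the choice of `n=3`, and the synthetic data are all assumptions made for illustration. Hash-like strings stand in for an ID column and a handful of misspelled job titles stand in for a dirty category.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string."""
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def distinct_ngrams_per_row(values, n=3):
    """Number of distinct character n-grams in a column, divided by its length."""
    seen = set()
    for value in values:
        seen |= char_ngrams(str(value), n)
    return len(seen) / len(values)


# Synthetic "dirty category": a few spellings repeated over many rows.
dirty_category = [
    "police officer",
    "Police Officer",
    "police oficer",
    "firefighter",
    "fire fighter",
] * 20

# Synthetic "ID column": one hash-like string per row.
id_like = [hashlib.md5(str(i).encode()).hexdigest() for i in range(100)]

ratio_category = distinct_ngrams_per_row(dirty_category)
ratio_id = distinct_ngrams_per_row(id_like)
print(ratio_category, ratio_id)
```

For the repeated spellings the n-gram vocabulary stops growing once every variant has been seen, while each new hash-like string keeps adding n-grams, so the ratio is much larger for the ID-like column. Where to put a threshold, and whether it holds up on real tables, is what the experiment would have to show.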


GaelVaroquaux commented on September 26, 2024

