Comments (5)
Thanks @LeoGrin, great suggestion.
As a matter of fact, I think this is a very common use case. I think you are right that ID columns should be treated differently, and I like the idea of dropping them.
Maybe adding a warning alongside would be good, for instance:
The 'id_name' column was identified as an ID column. Use column_specific_transformers if you still wish to include it.
The only challenge would be how to differentiate between an ID column and some other high cardinality column (for instance, the population of two countries is never exactly the same but is not an ID).
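To make the difficulty concrete, here is a minimal sketch of the most naive heuristic (flag any column whose non-null values are all distinct) combined with the warning proposed above. The helper names `find_unique_columns` and `drop_id_candidates` and the toy data are assumptions for illustration only; this is not skrub's implementation.

```python
import warnings

import pandas as pd


def find_unique_columns(df: pd.DataFrame) -> list[str]:
    """Return columns in which every non-null value is distinct (naive ID guess)."""
    return [
        col
        for col in df.columns
        if df[col].notna().any() and df[col].nunique() == df[col].notna().sum()
    ]


def drop_id_candidates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the naive ID candidates and warn about each dropped column."""
    candidates = find_unique_columns(df)
    for col in candidates:
        warnings.warn(
            f"The {col!r} column was identified as an ID column. "
            "Use column_specific_transformers if you still wish to include it."
        )
    return df.drop(columns=candidates)


# Toy data with made-up values (not real statistics).
countries = pd.DataFrame(
    {
        "country_id": ["c-001", "c-002", "c-003"],
        "continent": ["Europe", "Europe", "Asia"],
        "population": [100, 250, 4000],  # all distinct, yet not an ID
    }
)

# The naive check flags the true ID and the population column alike.
flagged = find_unique_columns(countries)
```

With this rule alone, `population` would be dropped together with `country_id`, which is exactly the false positive described above.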
from skrub.
Thanks for the analysis!
It seems to me we should offer an option to drop this type of column during fetching.
This is something that should be implemented as part of #581.
I don't see how it would be possible to identify ID columns reliably as Jovan suggested.
from skrub.
The only challenge would be how to differentiate between an ID column and some other high cardinality column (for instance, the population of two countries is never exactly the same but is not an ID).
I agree this is a real challenge. We should definitely do this only on non-numerical columns. Weirder columns might be an issue (I'm thinking about the geolocation column in traffic_violations, which is a tuple of floats), but I'm not sure we want these columns treated as high-cardinality columns either.
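As a rough illustration of restricting the check to non-numerical columns, the sketch below uses pandas' `select_dtypes` before testing for uniqueness. The function name `non_numeric_unique_columns` and the toy frame are hypothetical, and the tuple-valued column only imitates the kind of geolocation column mentioned above.

```python
import pandas as pd


def non_numeric_unique_columns(df: pd.DataFrame) -> list[str]:
    """Non-numeric columns whose non-null values are all distinct."""
    non_numeric = df.select_dtypes(exclude="number")
    return [
        col
        for col in non_numeric.columns
        if non_numeric[col].notna().any()
        and non_numeric[col].nunique() == non_numeric[col].notna().sum()
    ]


# Toy data with made-up values.
df = pd.DataFrame(
    {
        "record_id": ["a1", "b2", "c3"],
        "population": [100, 250, 4000],  # numeric, so it is skipped
        "geolocation": [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)],  # tuples of floats
    }
)

candidates = non_numeric_unique_columns(df)
```

The numeric `population` column is no longer flagged, but the tuple-valued column still is, since pandas stores it with a non-numeric (object) dtype.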
To me the real challenge is: how do we come up with a heuristic that is simple enough and somewhat reliable? It needs to be simple so that users understand it.
Agree
It probably revolves around the number of distinct n-grams compared to the number of rows. Typically, on dirty categories, I expect the number of distinct n-grams to scale roughly as the log of the number of rows (it's documented in Patricio Cerda's papers).
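One possible way to explore this idea is to count distinct character n-grams and divide by the number of rows. The sketch below is only an assumption about what such a measure could look like: the function names, the choice of 3-grams, and the synthetic data are all hypothetical, and no decision threshold is implied.

```python
import hashlib


def distinct_ngrams(values, n=3):
    """Collect the distinct character n-grams over all strings in `values`."""
    grams = set()
    for value in values:
        text = str(value)
        grams.update(text[i : i + n] for i in range(len(text) - n + 1))
    return grams


def ngram_to_row_ratio(values, n=3):
    """Number of distinct n-grams divided by the number of rows."""
    values = list(values)
    return len(distinct_ngrams(values, n)) / max(len(values), 1)


n_rows = 1000

# Synthetic "dirty category": a few spelling variants repeated many times.
variants = ["police officer", "Police Officer II", "police oficer", "POLICE OFFICER"]
dirty_categories = [variants[i % len(variants)] for i in range(n_rows)]

# Synthetic ID-like column: a different pseudo-random hex string per row.
id_like = [hashlib.md5(str(i).encode()).hexdigest()[:12] for i in range(n_rows)]

ratio_categories = ngram_to_row_ratio(dirty_categories)
ratio_ids = ngram_to_row_ratio(id_like)
```

On this synthetic data the ID-like column yields a much larger ratio than the dirty category, because the category's n-gram vocabulary stops growing once all variants have been seen. Whether this separates real ID columns from other high-cardinality columns would need to be checked on actual datasets.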
Thanks! I'm going to experiment with this.
from skrub.