Comments (7)
What we require first is schema inference on hive partitions. Otherwise some parts may be strings and/or different date formats. There needs to be something in place for schema inference and communicating that schema result between the partitions first.
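To make the point concrete, here is a minimal sketch in plain Python, not Polars internals, of inferring a type per partition value and then reconciling the results across partitions. The function names, the two regexes, and the "fall back to string on disagreement" rule are all assumptions made for illustration.

```python
import re

# Illustrative patterns only; these are not the regexes Polars uses.
INT_RE = re.compile(r"-?\d+")
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def infer_value_type(value: str) -> str:
    """Guess a type name for a single partition value."""
    if INT_RE.fullmatch(value):
        return "int"
    if DATE_RE.fullmatch(value):
        return "date"
    return "str"


def parse_hive_path(path: str) -> dict[str, str]:
    """Collect the key=value segments of a hive-style path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def unify_partition_schema(paths: list[str]) -> dict[str, str]:
    """Infer one type per key; fall back to "str" when partitions disagree."""
    schema: dict[str, str] = {}
    for path in paths:
        for key, value in parse_hive_path(path).items():
            inferred = infer_value_type(value)
            if key not in schema:
                schema[key] = inferred
            elif schema[key] != inferred:
                # Assumed policy: a conflict downgrades the column to string.
                schema[key] = "str"
    return schema
```

With one partition written as `day=2023-01-01` and another as `day=01-02-2024`, the `day` key ends up as a string under this policy, which is the kind of mismatch that has to be resolved before the partitions can share a schema.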
from polars.
+1
I had a similar issue open before: #12894
@ritchie46 is this something the core team has plans around by chance? If not, I'm willing to take a stab at it given some guidance on the desired design.
@baycoder0 c-peters is part of the core team.
I think you'd want to get started here: polars/crates/polars-plan/src/logical_plan/hive.rs, lines 71 to 83 in e1a4179.
The csv datetime parser seems to be over here
Additionally, this is highly related to #13892
@deanm0000 I added the initial PR here: #14950. I'd like to get the initial checks. Also, I did exhaustive testing myself and would like to add unit tests. However, the tests in tests/unit/io/test_hive.py seem to be skipped due to PyArrow 15 right now. Should I add them there? And should I add columns to the foods1.ipc and foods2.ipc files that are used to test hive partitioning, or do you prefer that I create new files?
Are you still considering a parameter to pass a hive partitions schema, similar to how you can pass dtypes to override the inference? This is something @deanm0000 suggested in #13892.
There are some cases where it would be nice to override the regexp-based schema inference. For instance:
- It might be hard to detect all date formats correctly (is this string `%d/%m/%Y` or `%m/%d/%Y`?). In some cases (e.g. an 8-digit number), it may look like a date but really be an Integer or a String.
- BOOLEAN_RE would match `value=TRUE`, even when it might represent a String, not a Boolean.
Having schema inference as a default is convenient, but an option to override would be nice.