Comments (6)
Thanks, I edited the path when reading from data/mw/ to ./
from polars.
We (currently) allow comparisons of the form pl.col("") > | < | == | != literal. Would there be a problem with the first case, e.g. casting the datetime like this?
first.filter(pl.col("date") == dt.date(2024, 2, 1).strftime(...)).explain()
Any expression that might alter the value of the column (e.g. pl.col("date").str.to_date()) would significantly increase the complexity and overhead of hive partitioning, as we would need to run the expression for every file instead of comparing the statistic in the path directly to the literal.
It would be nice if we could provide feedback, e.g. a warning, when a query on a hive-partitioned dataset ends up scanning the whole dataset.
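To make the point about comparing the path value directly to a literal concrete, here is a minimal sketch in plain Python. It is not how polars implements hive pruning; the helper names (`partition_values`, `prune`), the supported operator set, and the warning text are all hypothetical and only illustrate the idea.

```python
import operator
import warnings
from pathlib import PurePosixPath

# Comparison operators a path-only pruner could support (assumed subset).
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}


def partition_values(path: str) -> dict[str, str]:
    """Extract hive-style key=value segments from a file path (values stay strings)."""
    parts = PurePosixPath(path).parts
    return dict(p.split("=", 1) for p in parts if "=" in p)


def prune(paths: list[str], column: str, op: str, literal: str) -> list[str]:
    """Keep only files whose partition value satisfies `column op literal`."""
    compare = OPS[op]
    kept = [p for p in paths if compare(partition_values(p)[column], literal)]
    if len(kept) == len(paths):
        # Crude stand-in for the feedback suggested above (hypothetical message).
        warnings.warn("predicate did not prune any hive partitions; scanning all files")
    return kept


paths = [
    "data/date=2024-01-31/0.parquet",
    "data/date=2024-02-01/0.parquet",
    "data/date=2024-02-02/0.parquet",
]
selected = prune(paths, "date", "==", "2024-02-01")
```

No file is opened here: the decision uses only the string in the path. An expression such as a string-to-date conversion on the column could not be answered this way without evaluating it per file.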
Ah, I guess I made the example slightly too small. What I am actually using, and what is most powerful (I think), is:
first.filter(pl.col("date").str.to_date().is_between(start, end)).explain()
Otherwise, to use the literal comparisons, I need to loop over the required dates and & them all together.
I currently do this in my environment, but it is pretty hacky.
I agree this should be fixed. We first need to do proper schema inference on hive partitions. Once that is in place, we can use an architecture for hive partitions similar to the one we use for parquet statistics pruning.
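As a loose illustration of why schema inference comes first, the sketch below parses partition values into typed Python values and then answers a range query from the paths alone. This is not the polars design; the inference order (int, then ISO date, then string) and the helper names are assumptions made for the example.

```python
import datetime as dt
from pathlib import PurePosixPath


def infer_value(raw: str):
    """Try int, then ISO date, otherwise keep the raw string (assumed order)."""
    try:
        return int(raw)
    except ValueError:
        pass
    try:
        return dt.date.fromisoformat(raw)
    except ValueError:
        return raw


def typed_partitions(path: str) -> dict:
    """Map each hive key in the path to its inferred, typed value."""
    parts = PurePosixPath(path).parts
    pairs = (p.split("=", 1) for p in parts if "=" in p)
    return {key: infer_value(value) for key, value in pairs}


def prune_between(paths, column, start, end):
    """Keep files whose typed partition value lies in [start, end]."""
    return [p for p in paths if start <= typed_partitions(p)[column] <= end]


paths = [
    "data/year=2024/date=2024-01-31/0.parquet",
    "data/year=2024/date=2024-02-01/0.parquet",
    "data/year=2024/date=2024-02-02/0.parquet",
]
kept = prune_between(paths, "date", dt.date(2024, 2, 1), dt.date(2024, 2, 28))
```

Once the partition column has a date type, a range check like `is_between(start, end)` can be evaluated against the path value directly, with no string conversion on the column.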
Awesome, is there an issue for tracking the schema inference? I can just follow along on that.
Thanks
Not yet, I have created an issue for hive partition schema (#14838)