Comments (4)
Same as #15323.
The statistics are written as a signed integer, and since the values exceed the INT64 limit, the statistics overflow:
<pyarrow._parquet.Statistics object at 0x7f2abffd0bd0>
has_min_max: True
min: 9223539763183779054
max: 9222809258525037712
null_count: 0
distinct_count: None
num_values: 262615
physical_type: INT64
logical_type: Int(bitWidth=64, isSigned=false)
converted_type (legacy): UINT_64
Note how the min is bigger than the max.
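A minimal sketch of how such inverted statistics can arise, assuming the writer compares UInt64 values using signed 64-bit ordering (that mechanism is an assumption here, and the helper names are hypothetical). It uses the two values from the statistics above: the reported min exceeds the INT64 maximum and the reported max does not.

```python
INT64_MAX = 2**63 - 1
TWO_64 = 2**64


def to_signed(u: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed one."""
    return u - TWO_64 if u > INT64_MAX else u


def to_unsigned(s: int) -> int:
    """Reinterpret a signed 64-bit value as unsigned."""
    return s % TWO_64


def signed_order_stats(values):
    """Compute min/max in signed order, then report them as unsigned.

    Hypothetical model of the faulty statistics computation.
    """
    signed = [to_signed(v) for v in values]
    return to_unsigned(min(signed)), to_unsigned(max(signed))


# The two values taken from the statistics printed above.
values = [9222809258525037712, 9223539763183779054]

bad_min, bad_max = signed_order_stats(values)
true_min, true_max = min(values), max(values)

print("signed-order stats:", bad_min, bad_max)
print("unsigned stats:    ", true_min, true_max)
print("min > max:", bad_min > bad_max)
```

Under that assumption, the value above the INT64 limit wraps to a negative number and is picked as the minimum, which reproduces the swapped min and max shown in the statistics.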
Essentially, when you use `eq`, it skips all the row groups based on the statistics, but `is_in` doesn't do partition pruning, which is why it returns results. You could also turn off optimizations in the collect for the `eq` case.
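To illustrate why every row group gets skipped, here is a simplified model of statistics-based pruning for an equality predicate. It is a sketch only, not the actual Polars reader logic, and `can_skip_eq` is a hypothetical name.

```python
def can_skip_eq(stat_min: int, stat_max: int, target: int) -> bool:
    """Return True if a row group can be skipped for `column == target`.

    Simplified model: the row group is skipped when the target lies
    outside the [stat_min, stat_max] range recorded in the statistics.
    """
    return target < stat_min or target > stat_max


# Inverted statistics as printed above (min > max).
bad_min, bad_max = 9223539763183779054, 9222809258525037712

# A value that the inverted statistics imply is present in the row group.
target = 9222809258525037712

# With inverted statistics the range is empty, so any target is "outside" it.
print(can_skip_eq(bad_min, bad_max, target))

# With the bounds in the correct order the row group is kept.
print(can_skip_eq(bad_max, bad_min, target))
```

Because min is greater than max, no value can satisfy both bounds, so an equality filter prunes the row group regardless of its contents. A code path that never consults the statistics is unaffected.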
from polars.
@deanm0000, I apologize for the duplicate issue, and I appreciate you identifying the cause.
from polars.
Will take a look
from polars.
Will re-open this issue. The fix in #16766 ensures we no longer write out parquet files with incorrect UInt64 min/max statistics, but the OP here gives an example that has more to do with reading an existing parquet file that contains incorrect statistics. I've changed this from a bug to an enhancement request, as there isn't really a bug in the polars parquet reader; the issue is in the parquet file itself.
Thanks @isvoboda for reporting this as well, I've edited your post to better highlight the underlying issue and use a smaller file.
from polars.