Comments (4)

deanm0000 commented on July 19, 2024

same as #15323

The min/max statistics are written as signed integers, and since the values are bigger than the INT64 limit, the statistics overflow:

    <pyarrow._parquet.Statistics object at 0x7f2abffd0bd0>
      has_min_max: True
      min: 9223539763183779054
      max: 9222809258525037712
      null_count: 0
      distinct_count: None
      num_values: 262615
      physical_type: INT64
      logical_type: Int(bitWidth=64, isSigned=false)
      converted_type (legacy): UINT_64

note how the min is bigger than the max.
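
To make the inversion concrete, here is a small sketch in plain Python using the two values from the statistics above. It assumes (my reading of the numbers, not something the writer's code is quoted as doing) that the min and max were picked by comparing the UInt64 values as two's-complement signed 64-bit integers. The helper `as_signed_int64` is hypothetical and only exists for this illustration.

```python
INT64_MAX = 2**63 - 1


def as_signed_int64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed one."""
    return value - 2**64 if value > INT64_MAX else value


# Values copied from the pyarrow statistics output above.
stat_min = 9223539763183779054
stat_max = 9222809258525037712

# Compared as unsigned values, the pair is inverted (min > max).
unsigned_order_ok = stat_min <= stat_max

# Compared as signed values, the "min" wraps to a negative number,
# so the ordering looks valid.
signed_order_ok = as_signed_int64(stat_min) <= as_signed_int64(stat_max)

print(unsigned_order_ok, signed_order_ok)  # False True
```

Under that assumption, the stored min is above the INT64 limit and wraps negative, while the stored max still fits in INT64, which matches what the statistics show.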

Essentially, when you use eq, all the row groups are skipped based on the statistics, but is_in doesn't do partition pruning, which is why it returns results. You could also turn off optimizations in the collect for the eq case.
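
Here is a rough sketch of why an equality filter finds nothing when the statistics are inverted. This is a generic min/max skipping check written for illustration, not the actual polars implementation; `can_skip_row_group` and the candidate values are hypothetical.

```python
def can_skip_row_group(target: int, stat_min: int, stat_max: int) -> bool:
    """Generic min/max check: skip when the target cannot lie in [min, max]."""
    return target < stat_min or target > stat_max


# Values copied from the pyarrow statistics output above (min > max).
stat_min = 9223539763183779054
stat_max = 9222809258525037712

# A few arbitrary probe values spanning the UInt64 range.
candidates = [0, stat_max, (stat_min + stat_max) // 2, stat_min, 2**64 - 1]

# With min > max the interval is empty, so every probe is skipped.
skipped = [can_skip_row_group(c, stat_min, stat_max) for c in candidates]
print(all(skipped))  # True
```

Any value is either below the stored min or, failing that, above the stored max, so a reader that trusts these statistics skips the row group for every equality predicate. A code path that does not consult the statistics still scans the data and returns the matching rows.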

from polars.

isvoboda commented on July 19, 2024

@deanm0000, I apologize for the duplicate issue, and I appreciate you identifying the cause.

nameexhaustion commented on July 19, 2024

Will take a look

nameexhaustion commented on July 19, 2024

Will re-open this issue. The fix in #16766 ensures we no longer write out parquet files with incorrect UInt64 min/max statistics, but the OP here gives an example that has more to do with reading an existing parquet file containing incorrect statistics. I've changed this from a bug to an enhancement request, as there isn't really a bug in the polars parquet reader; rather, the issue is in the parquet file itself.

Thanks @isvoboda for reporting this as well, I've edited your post to better highlight the underlying issue and use a smaller file.
