
Hive Partition Schema about polars HOT 7 OPEN

c-peters avatar c-peters commented on July 2, 2024 4
Hive Partition Schema

from polars.

Comments (7)

ritchie46 avatar ritchie46 commented on July 2, 2024 1

What we require first is schema inference on hive partitions. Otherwise some partitions may come out as strings, or with different date formats, while others do not. Something needs to be in place to infer the schema and to communicate that result between the partitions first.
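One way to picture "communicating the schema result between the partitions" is to infer a type per partition value and then widen to a common supertype. This is only a minimal sketch in plain Python, not polars code: the type names, the regex patterns, and the helpers `infer_one`, `supertype`, and `unify` are all hypothetical and chosen for illustration.

```python
import re
from functools import reduce
from typing import Iterable

# Assumed patterns for illustration; polars' real regexes may differ.
INT_PAT = re.compile(r"-?\d+")
FLOAT_PAT = re.compile(r"-?\d+\.\d+")


def infer_one(value: str) -> str:
    """Infer a type name for a single raw partition value."""
    if value == "__HIVE_DEFAULT_PARTITION__":
        return "null"
    if value in ("true", "false"):
        return "bool"
    if INT_PAT.fullmatch(value):
        return "int"
    if FLOAT_PAT.fullmatch(value):
        return "float"
    return "str"


def supertype(a: str, b: str) -> str:
    """Smallest type (in this toy lattice) that can hold both a and b."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"int", "float"}:
        return "float"
    return "str"  # anything else falls back to string


def unify(values: Iterable[str]) -> str:
    """Fold the per-value types of one partition column into a single type."""
    return reduce(supertype, (infer_one(v) for v in values), "null")
```

With this, a column whose directories are `x=1` and `x=2.5` resolves to `float`, while `x=1` and `x=abc` resolves to `str` for every partition instead of mixing types.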


baycoder0 avatar baycoder0 commented on July 2, 2024

+1

I had a similar issue open before: #12894

@ritchie46 is this something the core team has plans around by chance? If not, I'm willing to take a stab at it given some guidance on the desired design.


deanm0000 avatar deanm0000 commented on July 2, 2024

@baycoder0 c-peters is part of the core team.

I think you'd want to get started here

    let s = if INTEGER_RE.is_match(value) {
        let value = value.parse::<i64>().ok()?;
        Series::new(name, &[value])
    } else if BOOLEAN_RE.is_match(value) {
        let value = value.parse::<bool>().ok()?;
        Series::new(name, &[value])
    } else if FLOAT_RE.is_match(value) {
        let value = value.parse::<f64>().ok()?;
        Series::new(name, &[value])
    } else if value == "__HIVE_DEFAULT_PARTITION__" {
        Series::new_null(name, 1)
    } else {
        Series::new(name, &[percent_decode_str(value).decode_utf8().ok()?])
    };
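For readers who want to experiment with the same inference order outside of Rust, here is a rough Python sketch: integer, then boolean, then float, then the null sentinel, and finally a percent-decoded string. The regex patterns and the helpers `parse_value` and `parse_hive_path` are assumptions made for this example, not the definitions used in polars.

```python
import re
from typing import Dict, Union
from urllib.parse import unquote

# Assumed patterns; the actual INTEGER_RE / BOOLEAN_RE / FLOAT_RE in polars may differ.
INTEGER_RE = re.compile(r"-?\d+")
BOOLEAN_RE = re.compile(r"true|false", re.IGNORECASE)
FLOAT_RE = re.compile(r"-?\d+\.\d+")

PartitionValue = Union[int, bool, float, str, None]


def parse_value(value: str) -> PartitionValue:
    """Apply the checks in the same order as the Rust snippet."""
    if INTEGER_RE.fullmatch(value):
        return int(value)
    if BOOLEAN_RE.fullmatch(value):
        return value.lower() == "true"
    if FLOAT_RE.fullmatch(value):
        return float(value)
    if value == "__HIVE_DEFAULT_PARTITION__":
        return None
    return unquote(value)  # percent-decode, e.g. "New%20York" -> "New York"


def parse_hive_path(path: str) -> Dict[str, PartitionValue]:
    """Collect key=value directory segments from a hive-style path."""
    out: Dict[str, PartitionValue] = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:  # skip segments without "=", such as file names
            out[key] = parse_value(value)
    return out
```

Because each value is parsed in isolation, two partitions of the same column can end up with different types, which is the gap that cross-partition schema inference would need to close.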


deanm0000 avatar deanm0000 commented on July 2, 2024

The csv datetime parser seems to be over here


deanm0000 avatar deanm0000 commented on July 2, 2024

Additionally, this is highly related to #13892


baycoder0 avatar baycoder0 commented on July 2, 2024

@deanm0000 I opened an initial PR here: #14950. I'd like to get the initial checks. I have also tested it exhaustively myself and would like to add unit tests. However, the tests in tests/unit/io/test_hive.py seem to be skipped right now because of PyArrow 15. Should I add them there? And should I add columns to the foods1.ipc and foods2.ipc files that are used to test hive partitioning, or would you prefer that I create new files?


fcocquemas avatar fcocquemas commented on July 2, 2024

Are you still considering a parameter to pass a hive partitions schema, similar to how you can pass dtypes to override the inference? This is something @deanm0000 suggested in #13892.

There are some cases where it would be nice to override the regexp-based schema inference. For instance:

  • It might be hard to detect all date formats correctly (is this string %d/%m/%Y or %m/%d/%Y?). In some cases (e.g. an 8-digit number), a value may look like a date but really be an Integer or a String.
  • BOOLEAN_RE would match value=TRUE, even when it might represent a String, not a Boolean.
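To make the ambiguity concrete, the small check below uses only the Python standard library. It is an illustration, not polars behaviour, and `matching_formats` is a hypothetical helper that simply reports which candidate formats accept a string.

```python
from datetime import datetime
from typing import List

# The two candidate formats from the bullet above.
FORMATS = ["%d/%m/%Y", "%m/%d/%Y"]


def matching_formats(value: str, formats: List[str]) -> List[str]:
    """Return every format that parses `value` without error."""
    hits = []
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format rejects the value
        hits.append(fmt)
    return hits


# Both formats accept this value, so inference alone cannot pick one.
ambiguous = matching_formats("03/04/2024", FORMATS)

# Only day-first works here, because there is no month 25.
unambiguous = matching_formats("25/04/2024", FORMATS)

# An 8-digit value is simultaneously a valid integer and a valid %Y%m%d date.
looks_like_both = "20240102".isdigit() and bool(matching_formats("20240102", ["%Y%m%d"]))
```

A value such as `03/04/2024` is accepted by both formats, so any automatic choice is a guess unless other partition values disambiguate it.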

Having schema inference as a default is convenient, but an option to override would be nice.
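As a sketch of what "inference by default, override on request" could look like, the snippet below lets a caller supply a converter per partition column and falls back to regex-based inference otherwise. Everything here is hypothetical: `resolve_partition_values`, the `overrides` argument, and the patterns are illustrative names, not an existing or proposed polars API.

```python
import re
from typing import Callable, Dict, Optional

# Assumed inference patterns, for illustration only.
INTEGER_RE = re.compile(r"-?\d+")
BOOLEAN_RE = re.compile(r"true|false", re.IGNORECASE)


def infer(value: str) -> object:
    """Default regex-based inference for one raw partition value."""
    if INTEGER_RE.fullmatch(value):
        return int(value)
    if BOOLEAN_RE.fullmatch(value):
        return value.lower() == "true"
    return value


def resolve_partition_values(
    raw: Dict[str, str],
    overrides: Optional[Dict[str, Callable[[str], object]]] = None,
) -> Dict[str, object]:
    """Use the caller's converter where given, otherwise infer."""
    overrides = overrides or {}
    return {
        key: overrides[key](value) if key in overrides else infer(value)
        for key, value in raw.items()
    }


raw = {"flag": "TRUE", "code": "20240102", "year": "2024"}

# Pure inference turns "TRUE" into a bool and "20240102" into an int.
inferred = resolve_partition_values(raw)

# Overrides keep both as strings while "year" is still inferred.
overridden = resolve_partition_values(raw, {"flag": str, "code": str})
```

Columns without an override keep the convenient default, so the override only needs to list the columns where inference guesses wrong.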

