Giter Site home page Giter Site logo

Comments (9)

pacman82 avatar pacman82 commented on May 30, 2024

Hello @leo-schick ,

I am missing something here. Sure the physical type for timstamps is INT64, but its logical type should be TIMESTAMP. There is even a test to prove it.

See:

parquet_schema_out(out_str).stdout(contains("OPTIONAL INT64 a (TIMESTAMP(MICROS,false));"));

The test uses datetime2 and not datetime. Did you experience the issue with datetime?

Best, Markus

from odbc2parquet.

leo-schick avatar leo-schick commented on May 30, 2024

I experienced the issue with both...

from odbc2parquet.

leo-schick avatar leo-schick commented on May 30, 2024

Used odbc2parquet version: 0.13.2

Probably the issue is not the logical type but that Azure Synapse (which I use to read the parquet files in this sample) does not support the logical type but use the converted type (legacy).

I inspected now the parquet files with parquet-tools:

MSSQL column with type date

############ Column(OrderDate) ############
name: OrderDate
path: OrderDate
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Date
converted_type (legacy): DATE
compression: ZSTD (space_saved: 18%)

✔️ Works fine with Azure Synapse, using data type date.

MSSQL column with type datetime

############ Column(CreatedDateTime) ############
name: CreatedDateTime
path: CreatedDateTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: ZSTD (space_saved: 30%)

🛑 Does not work with Azure Synapse. I must read column with data type bigint. datetime and datetime2 fails.

MSSQL column with type datetime2(7)

############ Column(CreatedDateTime) ############
name: CreatedDateTime
path: CreatedDateTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: ZSTD (space_saved: 15%)

🛑 Does not work with Azure Synapse. I must read column with data type bigint. datetime and datetime2 fails.

MSSQL column datetimeoffset

############ Column(OrderTimestamp) ############
name: OrderTimestamp
path: OrderTimestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
compression: ZSTD (space_saved: 20%)

✔️ Works fine with Azure Synapse when I use data type datetime2.

It looks to me that Azure Synapse interprets not the logical_type but the converted_type which is probably deprecated. Would it be possible to add the converted_type schema information to the datetime and datetime2 data types as well?

from odbc2parquet.

leo-schick avatar leo-schick commented on May 30, 2024

Note: I found the official documentation which as a note about the deprecated converted_type:

Despite there is no exact corresponding ConvertedType for local timestamp semantic, in order to support forward compatibility with those libraries, which annotated their local timestamps with legacy TIMESTAMP_MICROS and TIMESTAMP_MILLIS annotation, Parquet writer implementation must annotate local timestamps with legacy annotations too, as shown below.

Backward compatibility:

ConvertedType LogicalType
TIME_MILLIS TimeType (isAdjustedToUTC = true, unit = MILLIS)
TIME_MICROS TimeType (isAdjustedToUTC = true, unit = MICROS)

Forward compatibility:

LogicalType ConvertedType
TimeType isAdjustedToUTC = true unit = MILLIS TIME_MILLIS
unit = MICROS TIME_MICROS
unit = NANOS -
isAdjustedToUTC = false unit = MILLIS TIME_MILLIS
unit = MICROS TIME_MICROS
unit = NANOS -

from odbc2parquet.

pacman82 avatar pacman82 commented on May 30, 2024

Yeah, we should do that. I wonder however if this is not better fixed upstream in the parquet crate. Actually I assumed it would already do that. Maybe I am mistaken and there is a good reason for it not to do it, but asking can not hurt.

from odbc2parquet.

pacman82 avatar pacman82 commented on May 30, 2024

Hi @leo-schick ,

thanks again for your thorough Bug report. My current understanding is that by setting the logical type, the converted type should be implied. So I opened an Upstream ticket. As a workaround I tried to set the converted type manually. but it seems to be ignored then I tested with parquet-tools. I currently can not think of another quick workaround, so for the moment I am waiting for feedback on the upstream ticket. Also to verify that my understanding is correct.

If the situtation does persist, I could try introducing a flag to instruct odbc2parquet to set the converted instead of the logical type. Currently my spare time is very rare, so the best course for now I'd say is to move forward with the upstream ticket.

Best Markus

from odbc2parquet.

pacman82 avatar pacman82 commented on May 30, 2024

It seems the missing converted type is due to odbc2parquet using INT64 instead of INT32 for Milliseconds precision timestamps. However switching to INT32 the code now fails if setting the logic type. So, still trying to figure this thing out, but working on it.

from odbc2parquet.

pacman82 avatar pacman82 commented on May 30, 2024

Hi @leo-schick ,

as stated in the conclusion of apache/arrow-rs#3017 the parquet crate (and therefore also odbc2parquet) do emit a converted type. The bug seems to be in the readers which are based on the C++ implemantion (like pyarrow). Please open a bug ticket there.

Cheers, Markus

from odbc2parquet.

pacman82 avatar pacman82 commented on May 30, 2024

Also closing this issue, as the converted type is written into the output by odbc2parquet. Fault seems to lie with readers. Please open a bug there, if you need these to work together with odbc2parquet.

from odbc2parquet.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.