Comments (9)
Hello @leo-schick,
I am missing something here. Sure, the physical type for timestamps is INT64, but its logical type should be TIMESTAMP. There is even a test to prove it. See:
Line 686 in aee2ca5
The test uses datetime2 and not datetime. Did you experience the issue with datetime?
Best, Markus
from odbc2parquet.
I experienced the issue with both...
Used odbc2parquet version: 0.13.2
Probably the issue is not the logical type, but that Azure Synapse (which I use to read the parquet files in this sample) does not support the logical type and only uses the (legacy) converted type.
I have now inspected the parquet files with parquet-tools:
MSSQL column with type date
############ Column(OrderDate) ############
name: OrderDate
path: OrderDate
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Date
converted_type (legacy): DATE
compression: ZSTD (space_saved: 18%)
✔️ Works fine with Azure Synapse, using data type date.
MSSQL column with type datetime
############ Column(CreatedDateTime) ############
name: CreatedDateTime
path: CreatedDateTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: ZSTD (space_saved: 30%)
🛑 Does not work with Azure Synapse. Reading the column as datetime or datetime2 fails; I must read it with data type bigint.
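Reading the column as bigint leaves raw milliseconds since the Unix epoch, so converting back to a timestamp becomes a client-side step. A minimal Python sketch of that conversion (assuming timeUnit=milliseconds as in the dump above; the function name is mine, not from the thread):

```python
from datetime import datetime, timedelta, timezone

def ms_to_datetime(ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch (the raw INT64 stored in
    the Parquet column) back to a naive UTC datetime."""
    sec, msec = divmod(ms, 1000)
    base = datetime.fromtimestamp(sec, tz=timezone.utc).replace(tzinfo=None)
    return base + timedelta(milliseconds=msec)

# 2023-01-01 00:00:00 UTC is 1672531200000 ms since the epoch
print(ms_to_datetime(1672531200000))  # → 2023-01-01 00:00:00
```

The same arithmetic is what a SQL-side workaround in Synapse would have to do after reading the column as bigint.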
MSSQL column with type datetime2(7)
############ Column(CreatedDateTime) ############
name: CreatedDateTime
path: CreatedDateTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: ZSTD (space_saved: 15%)
🛑 Does not work with Azure Synapse. Reading the column as datetime or datetime2 fails; I must read it with data type bigint.
MSSQL column with type datetimeoffset
############ Column(OrderTimestamp) ############
name: OrderTimestamp
path: OrderTimestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
compression: ZSTD (space_saved: 20%)
✔️ Works fine with Azure Synapse when I use data type datetime2.
It looks to me as if Azure Synapse does not interpret the logical_type but the converted_type, which is probably deprecated. Would it be possible to add the converted_type schema information for the datetime and datetime2 data types as well?
Note: I found the official documentation, which has a note about the deprecated converted_type:

Despite there is no exact corresponding ConvertedType for local timestamp semantic, in order to support forward compatibility with those libraries, which annotated their local timestamps with legacy TIMESTAMP_MICROS and TIMESTAMP_MILLIS annotation, Parquet writer implementation must annotate local timestamps with legacy annotations too, as shown below.

Backward compatibility:

| ConvertedType | LogicalType |
| --- | --- |
| TIME_MILLIS | TimeType (isAdjustedToUTC = true, unit = MILLIS) |
| TIME_MICROS | TimeType (isAdjustedToUTC = true, unit = MICROS) |

Forward compatibility:

| LogicalType | | ConvertedType |
| --- | --- | --- |
| TimeType, isAdjustedToUTC = true | unit = MILLIS | TIME_MILLIS |
| | unit = MICROS | TIME_MICROS |
| | unit = NANOS | - |
| TimeType, isAdjustedToUTC = false | unit = MILLIS | TIME_MILLIS |
| | unit = MICROS | TIME_MICROS |
| | unit = NANOS | - |
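The forward-compatibility table boils down to a simple lookup: the time unit alone decides the legacy annotation, isAdjustedToUTC plays no role, and NANOS has no legacy equivalent. A sketch of that rule (the function name is mine, not from the spec):

```python
from typing import Optional

# Forward-compatibility rule from the Parquet spec: which legacy ConvertedType
# a writer should emit alongside a Time/Timestamp logical type. The unit alone
# decides; isAdjustedToUTC is ignored, and NANOS has no legacy form (None).
_LEGACY = {
    ("TIME", "MILLIS"): "TIME_MILLIS",
    ("TIME", "MICROS"): "TIME_MICROS",
    ("TIMESTAMP", "MILLIS"): "TIMESTAMP_MILLIS",
    ("TIMESTAMP", "MICROS"): "TIMESTAMP_MICROS",
}

def legacy_converted_type(kind: str, unit: str) -> Optional[str]:
    return _LEGACY.get((kind, unit))

print(legacy_converted_type("TIMESTAMP", "MILLIS"))  # → TIMESTAMP_MILLIS
print(legacy_converted_type("TIMESTAMP", "NANOS"))   # → None
```

Under this rule, the datetime and datetime2 columns above (local timestamps, MILLIS/MICROS) should have carried TIMESTAMP_MILLIS and TIMESTAMP_MICROS respectively, instead of converted_type NONE.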
Yeah, we should do that. I wonder, however, whether this is not better fixed upstream in the parquet crate. Actually, I assumed it would already do that. Maybe I am mistaken and there is a good reason for it not to, but asking cannot hurt.
Hi @leo-schick,
thanks again for your thorough bug report. My current understanding is that by setting the logical type, the converted type should be implied, so I opened an upstream ticket. As a workaround I tried to set the converted type manually, but it seems to be ignored when I tested with parquet-tools. I currently cannot think of another quick workaround, so for the moment I am waiting for feedback on the upstream ticket, also to verify that my understanding is correct.
If the situation persists, I could try introducing a flag to instruct odbc2parquet to set the converted type instead of the logical type. Currently my spare time is very limited, so the best course for now, I'd say, is to move forward with the upstream ticket.
Best, Markus
It seems the missing converted type is due to odbc2parquet using INT64 instead of INT32 for millisecond-precision timestamps. However, after switching to INT32, the code now fails when setting the logical type. So I am still trying to figure this out, but working on it.
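For context on why an INT32 physical type cannot work for the values themselves (my own illustration, not from the thread): milliseconds-since-epoch for any recent date already overflow a signed 32-bit integer, which is why the spec ties TIMESTAMP_MILLIS to INT64.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1  # 2147483647

# Milliseconds since the Unix epoch for a date around this issue's filing.
ms = int(datetime(2022, 11, 1, tzinfo=timezone.utc).timestamp() * 1000)
print(ms)              # 1667260800000
print(ms > INT32_MAX)  # True: millisecond timestamps need INT64
```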
Hi @leo-schick,
as stated in the conclusion of apache/arrow-rs#3017, the parquet crate (and therefore also odbc2parquet) does emit a converted type. The bug seems to be in the readers based on the C++ implementation (like pyarrow). Please open a bug ticket there.
Cheers, Markus
Also closing this issue, as the converted type is written into the output by odbc2parquet. The fault seems to lie with the readers. Please open a bug there if you need these to work together with odbc2parquet.