Comments (18)

pacman82 avatar pacman82 commented on August 29, 2024 1

Hello @ErikJansenIRefact

arrow-odbc-py 1.2.2 is released. It raises an error with a descriptive message if it discovers out of range timestamps.

Best, Markus

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Hi, thanks for the bug report. Are you writing or reading data? What database and driver are you using?

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Hi,

Reading data. Connection is to SQLServer 2022 with Microsoft odbc version 18 driver on a Mac.

unixodbc 2.3.11
msodbcsql18 18.2.1.1

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Hello @ErikJansenIRefact ,

thanks, that seems easy enough for me to reproduce. Would you mind telling me the relational type of the column which causes the error? Just saw you described it in your initial statement.

Best, Markus

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Hm, okay, I am missing something. This test passes for me:

import os
from subprocess import run

import pyarrow as pa
from pytest import raises

from arrow_odbc import read_arrow_batches_from_odbc

# MSSQL is the test suite's ODBC connection string for the test database.


def test_timestamp_ns():
    """
    Query a table with one row. Should return one batch
    """
    table = "OneRow"
    os.system(f'odbcsv fetch -c "{MSSQL}" -q "DROP TABLE IF EXISTS {table};"')
    os.system(f'odbcsv fetch -c "{MSSQL}" -q "CREATE TABLE {table} (a DATETIME2(7));"')
    rows = "a\n2014-04-14 21:25:42.0748412"
    run(["odbcsv", "insert", "-c", MSSQL, table], input=rows, encoding="ascii")

    query = f"SELECT * FROM {table}"
    reader = read_arrow_batches_from_odbc(
        query=query, batch_size=100, connection_string=MSSQL
    )
    it = iter(reader)
    actual = next(it)

    schema = pa.schema([("a", pa.timestamp("ns"))])
    expected = pa.RecordBatch.from_pydict({"a": [1397510742074841200]}, schema)
    print(expected[0])
    print(actual[0])
    assert expected == actual

    with raises(StopIteration):
        next(it)

Still running on a Windows PC and with the ODBC 17 driver version, but I think it is more likely that I am missing something else.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Given that I now have a test reading Datetime2(7) which seems to work, and also looking at the original error message you sent, I have doubts I can reproduce it. It seems the error occurs converting from pyarrow to pyarrow. arrow-odbc uses [ns] to represent a Datetime2(7) so as not to lose any precision, as documented in the readme. It seems to me some downstream processing expects [us].

Best Markus

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Hi Markus,

I think the issue is caused by the datetime being a high value: '8900-12-31 00:00:00.0000000'. Perhaps this, in combination with a precision > 3, causes the issue.

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

I've looked into Apache Arrow to check whether there are limitations on the highest value of a datetime, but so far I have not found any relevant info.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

I really would need a reproducing example. High values did not cause this error when I inserted them in the test (the year did overflow in the database though, but reading it was fine). Also the error message points to an arrow-to-arrow conversion, which does not happen within this library; at least it is not supposed to. Do you have a minimal reproducing example?

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Hi Markus,

This is an example. In the query text you can try the scenarios: casting to datetime2(7) results in an error; casting to datetime2(3) works fine; casting to datetime2(4) raises no error, but when the data is retrieved (I used DuckDB) the value for valid_time is '1886-05-08 05:05:15.485381'.

from arrow_odbc import read_arrow_batches_from_odbc
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

connection_string = "Driver={ODBC Driver 18 for SQL Server};Server=localhost;TrustServerCertificate=yes;"
lfs = fs.LocalFileSystem()


def read_rows():
    return read_arrow_batches_from_odbc(
        query="select cast('8900-12-31' as datetime2(4)) as valid_time",
        connection_string=connection_string,
        batch_size=100,
        user="sa",
        password="iRefact2017",
    )


def write_parquet(writer, batch):
    writer.write(batch)


def export():
    reader = read_rows()
    writer = pq.ParquetWriter(
        where="./test.parquet",
        schema=reader.schema,
        filesystem=lfs,
        compression="snappy",
    )
    for batch in reader:
        write_parquet(writer, batch)
    writer.close()


export()

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Mmmh, the code is not nicely formatted. Hopefully you can use it.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Hello @ErikJansenIRefact ,

formatting was not a problem. If I replace the final call to export in your example with read_rows, it runs just fine. The exception is not raised when reading rows from the database; that works. It is raised when writing them to parquet. My reading of the error is that the parquet writer does not support nanosecond precision and does not convert silently, in order to avoid information loss you are not aware of. My suggestion would be to transform any timestamp with nanosecond precision into one with millisecond precision before writing it to parquet.

Best, Markus

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Did some more value tests. select cast('2200-12-31' as datetime2(7)) as valid_time works fine, while select cast('2300-12-31' as datetime2(7)) as valid_time results in an error. So it seems there is some upper bound for datetime values.

from arrow-odbc-py.
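That observation matches the range of Arrow's timestamp[ns] type: it stores nanoseconds since the Unix epoch in a signed 64-bit integer, so the largest representable instant falls in April 2262, between the two test values above. A quick check of that bound (pure arithmetic, no Arrow needed):

```python
from datetime import datetime, timedelta, timezone

# Largest value a signed 64-bit integer can hold, interpreted as
# nanoseconds since the Unix epoch.
max_ns = 2**63 - 1

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# datetime only resolves microseconds, so truncate the nanosecond count.
limit = epoch + timedelta(microseconds=max_ns // 1000)
print(limit)  # 2262-04-11 23:47:16.854775+00:00
```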

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Thanks Markus. So basically it's an issue in Arrow.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

You are welcome

IMHO it is more a decision of arrow than an issue. Not silently dropping information is likely a design choice. However, I think that parquet does support nanoseconds, so they might also just decide to support it in the writer.

FYI: If you just want to export a database to parquet you could also skip the intermediate step via arrow and use the odbc2parquet command line tool.

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Tested without writing to parquet. It seems the value is already converted incorrectly in arrow: the datetime2(7) value '8900-12-31' is converted to pyarrow wrongly. So it's not an issue in writing to parquet. I checked the docs; parquet version 2.6 is required to support nanosecond precision datetimes.

This is the result of the RecordBatch (converted to a pyarrow table). It is a nanosecond precision timestamp, but somewhere its value is converted incorrectly.

pyarrow.Table
valid_time: timestamp[ns]

valid_time: [[1886-05-08 05:05:15.485380608]]

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Sorry to bother you again, but could you take a look at whether something goes wrong in the conversion from the database to arrow?

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

This is a different issue. I also saw that behavior. It is highly likely that this is an integer overflow. I would need to investigate in order to figure out if something can be done about it.

from arrow-odbc-py.
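The overflow hypothesis is consistent with the value observed earlier in the thread: interpreting the nanosecond count of 8900-12-31 with 64-bit two's-complement wrap-around lands exactly on 1886-05-08 05:05:15.485380608. A sketch of that arithmetic:

```python
from datetime import datetime, timedelta, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Nanoseconds since the Unix epoch for 8900-12-31 00:00:00 UTC.
ns = int(datetime(8900, 12, 31, tzinfo=timezone.utc).timestamp()) * 10**9

# Simulate two's-complement wrap-around into a signed 64-bit integer,
# as would happen on an unchecked conversion to int64 nanoseconds.
wrapped = (ns + 2**63) % 2**64 - 2**63

# Truncated to microseconds, the wrapped value reproduces the bogus
# timestamp seen in the thread.
print(epoch + timedelta(microseconds=wrapped // 1000))
# 1886-05-08 05:05:15.485380+00:00
```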
