Comments (18)

pacman82 avatar pacman82 commented on August 29, 2024 1

Hello @ErikJansenIRefact

arrow-odbc-py 1.2.2 is released. It raises an error with a descriptive message if it discovers out of range timestamps.

Best, Markus

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Hi, thanks for the bug report. Are you writing or reading data? What database and driver are you using?

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Hi,

Reading data. Connection is to SQLServer 2022 with Microsoft odbc version 18 driver on a Mac.

unixodbc 2.3.11
msodbcsql18 18.2.1.1

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Hello @ErikJansenIRefact ,

thanks, that seems easy enough for me to reproduce. Would you mind telling me the relational type of the column which causes the error? Just saw you described it in your initial statement.

Best, Markus

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Hm, okay, I am missing something. This test passes for me:

import os
from subprocess import run

import pyarrow as pa
from pytest import raises

from arrow_odbc import read_arrow_batches_from_odbc

# MSSQL is the test suite's ODBC connection string for the test database.


def test_timestamp_ns():
    """
    Query a table with one row. Should return one batch
    """
    table = "OneRow"
    os.system(f'odbcsv fetch -c "{MSSQL}" -q "DROP TABLE IF EXISTS {table};"')
    os.system(f'odbcsv fetch -c "{MSSQL}" -q "CREATE TABLE {table} (a DATETIME2(7));"')
    rows = "a\n2014-04-14 21:25:42.0748412"
    run(["odbcsv", "insert", "-c", MSSQL, table], input=rows, encoding="ascii")

    query = f"SELECT * FROM {table}"
    reader = read_arrow_batches_from_odbc(
        query=query, batch_size=100, connection_string=MSSQL
    )
    it = iter(reader)
    actual = next(it)

    schema = pa.schema([("a", pa.timestamp("ns"))])
    expected = pa.RecordBatch.from_pydict({"a": [1397510742074841200]}, schema)
    print(expected[0])
    print(actual[0])
    assert expected == actual

    with raises(StopIteration):
        next(it)

Still running on a Windows PC and with the ODBC 17 driver version, but I think it is more likely that I am missing something else.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Given that I now have a test reading Datetime2(7) which seems to work, and also looking at the original error message you sent, I have doubts I can reproduce it. It seems the error occurs converting from pyarrow to pyarrow. arrow-odbc uses [ns] to represent a Datetime2(7) so as not to lose any precision, as documented in the readme. It seems to me some downstream processing expects [us].

Best Markus

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Hi Markus,

I think the issue is caused by the datetime being a high value: '8900-12-31 00:00:00.0000000'. Perhaps this, in combination with a precision > 3, causes the issue.

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

I've looked into Apache Arrow to check whether there are limitations on the highest value of a datetime, but so far I have not found any relevant info.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

I really would need a reproducing example. High values did not cause this error when I inserted them in the test (the year did overflow in the database though, but reading it was fine). Also the error message points to an arrow-to-arrow conversion, which does not happen within this library; at least it is not supposed to. Do you have a minimal reproducing example?

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Hi Markus,

This is an example. In the query text you can try the scenarios: casting to datetime2(7) results in an error; casting to datetime2(3) works fine; casting to datetime2(4) raises no error, but when the data is retrieved (I used DuckDB) the value for valid_time is '1886-05-08 05:05:15.485381'.

from arrow_odbc import read_arrow_batches_from_odbc
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

connection_string = "Driver={ODBC Driver 18 for SQL Server};Server=localhost;TrustServerCertificate=yes;"
lfs = fs.LocalFileSystem()


def read_rows():
    return read_arrow_batches_from_odbc(
        query="select cast('8900-12-31' as datetime2(4)) as valid_time",
        connection_string=connection_string,
        batch_size=100,
        user="sa",
        password="iRefact2017",
    )


def write_parquet(writer, batch):
    writer.write(batch)


def export():
    reader = read_rows()
    writer = pq.ParquetWriter(
        where="./test.parquet",
        schema=reader.schema,
        filesystem=lfs,
        compression="snappy",
    )
    for batch in reader:
        write_parquet(writer, batch)
    writer.close()


export()

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Mmmh, the code is not nicely formatted. Hopefully you can use it.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Hello @ErikJansenIRefact ,

formatting was not a problem. If I replace the final call to export in your example with read_rows, it runs just fine. The exception is not raised when reading rows from the database; that works. It is raised when writing them to parquet. My reading of the error is that the parquet writer does not support nanosecond precision and does not convert silently, in order to avoid information loss you are not aware of. My suggestion would be to transform any timestamp with nanosecond precision into one with millisecond precision before writing it to parquet.

Best, Markus

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Did some more value tests. select cast('2200-12-31' as datetime2(7)) as valid_time works fine, while select cast('2300-12-31' as datetime2(7)) as valid_time results in an error. So it seems there is some upper bound for datetime values.

from arrow-odbc-py.
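That observation matches the range of Arrow's timestamp[ns] type: it stores nanoseconds since the Unix epoch in a signed 64-bit integer, so the largest representable instant falls in April 2262, between the two test values above. A quick check of that bound (pure arithmetic, no Arrow needed):

```python
from datetime import datetime, timedelta, timezone

# Largest value a signed 64-bit integer can hold, interpreted as
# nanoseconds since the Unix epoch.
max_ns = 2**63 - 1

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# datetime only resolves microseconds, so truncate the nanosecond count.
limit = epoch + timedelta(microseconds=max_ns // 1000)
print(limit)  # 2262-04-11 23:47:16.854775+00:00
```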

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Thanks Markus. So basically it's an issue in Arrow.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

You are welcome

IMHO it is more a decision of arrow than an issue. Not silently dropping information is likely a design choice. However, I think that parquet does support nanoseconds, so they might also just decide to support it in the writer.

FYI: If you just want to export a database to parquet you could also skip the intermediate step via arrow and use the odbc2parquet command line tool.

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Tested without writing to parquet. It seems the value is already converted incorrectly in arrow: the datetime2(7) value '8900-12-31' is converted to pyarrow wrongly. So it's not an issue in writing to parquet. I checked the docs; parquet version 2.6 is required to support nanosecond precision datetimes.

This is the result of the RecordBatch (converted to a pyarrow table). It is a nanosecond precision timestamp, but somewhere its value is converted incorrectly.

pyarrow.Table
valid_time: timestamp[ns]

valid_time: [[1886-05-08 05:05:15.485380608]]

from arrow-odbc-py.

ErikJansenIRefact avatar ErikJansenIRefact commented on August 29, 2024

Sorry to bother you again, but could you take a look at whether something goes wrong in the conversion from the database to arrow?

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

This is a different issue. I also saw that behavior. It is highly likely that this is an integer overflow. I would need to investigate in order to figure out if something can be done about it.

from arrow-odbc-py.
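The overflow hypothesis is consistent with the value observed earlier in the thread: interpreting the nanosecond count of 8900-12-31 with 64-bit two's-complement wrap-around lands exactly on 1886-05-08 05:05:15.485380608. A sketch of that arithmetic:

```python
from datetime import datetime, timedelta, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Nanoseconds since the Unix epoch for 8900-12-31 00:00:00 UTC.
ns = int(datetime(8900, 12, 31, tzinfo=timezone.utc).timestamp()) * 10**9

# Simulate two's-complement wrap-around into a signed 64-bit integer,
# as would happen on an unchecked conversion to int64 nanoseconds.
wrapped = (ns + 2**63) % 2**64 - 2**63

# Truncated to microseconds, the wrapped value reproduces the bogus
# timestamp seen in the thread.
print(epoch + timedelta(microseconds=wrapped // 1000))
# 1886-05-08 05:05:15.485380+00:00
```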
