Giter Site home page Giter Site logo

Comments (18)

aersam avatar aersam commented on August 29, 2024

Only happens on Linux. Could also be a bug with MSFT's ODBC Driver, but I guess MSFT does not do bugs 😉

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

This does not happen on Windows because on windows arrow-odbc-py is compiled to use UTF-16 encoding to talk with the Database. On Linux and OS-X however the system local is usually UTF-8 and therefore arrow-odbc-py assumes the narrow encoding returned from the database is UTF-8. However, if your System local is configured not to be UTF-8 this assumption is wrong.

So my first piece of advice would be to check whether your System local is UTF-8. Other narrow ASCII encodings are currently not supported by arrow-odbc-py.

from arrow-odbc-py.

aersam avatar aersam commented on August 29, 2024

Ok, I'll try. For completness, here's the link to my failing gh action: https://github.com/bmsuisse/odbc2deltalake/actions/runs/8538422686

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Looking at this it is more likely that somehow binary and character size is confused. I just do not know there. Yet, likely the system local is innocent. Would you kindly enable debug logging and execute again. In the beginning it should log the database column names and their type, even before conversion to arrow. I would be interested in seeing that.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

You can log to standard error using: https://arrow-odbc.readthedocs.io/en/latest/arrow_odbc.html#arrow_odbc.log_to_stderr

from arrow-odbc-py.

aersam avatar aersam commented on August 29, 2024

Here we go:

DEBUG SQLAllocHandle allocated connection (Dbc) handle '0x55c451f30050'
WARN State: 01S00, Native error: 0, Message: [Microsoft][ODBC Driver 18 for SQL Server]Invalid connection string attribute
WARN State: 01000, Native error: 5701, Message: [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Changed database context to 'db_to_delta_test'.
WARN State: 01000, Native error: 5703, Message: [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Changed language setting to us_english.
DEBUG Database managment system name as reported by ODBC: Microsoft SQL Server
DEBUG ODBC driver reported for column 0. Relational type: BigInt; Nullability: NoNulls; Name: 'User_-_iD';
DEBUG ODBC driver reported for column 1. Relational type: Varchar { length: Some(100) }; Nullability: Nullable; Name: 'FirstName';
DEBUG ODBC driver reported for column 2. Relational type: WVarchar { length: None }; Nullability: Nullable; Name: 'LastName';
DEBUG ODBC driver reported for column 3. Relational type: Decimal { precision: 15, scale: 3 }; Nullability: Nullable; Name: 'Age';
DEBUG ODBC driver reported for column 4. Relational type: Varchar { length: Some(10) }; Nullability: NoNulls; Name: 'companyid';
DEBUG ODBC driver reported for column 5. Relational type: BigInt; Nullability: Nullable; Name: 'time_stäm';
DEBUG ODBC driver reported for column 6. Relational type: Timestamp { precision: 6 }; Nullability: Nullable; Name: '__timestamp';
DEBUG ODBC driver reported for column 7. Relational type: Bit; Nullability: Nullable; Name: '__is_deleted';
DEBUG ODBC driver reported for column 8. Relational type: Bit; Nullability: Nullable; Name: '__is_full_load';
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '2' reported 12.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '2' reported 100.
DEBUG Relational type of column 1: Varchar { length: Some(100) }
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '3' reported -9.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '3' reported 0.
DEBUG Relational type of column 2: WVarchar { length: None }
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '3' reported 0.
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '5' reported 12.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '5' reported 10.
DEBUG Relational type of column 4: Varchar { length: Some(10) }
INFO Column 'User_-_iD'
Bytes used per row: 8
INFO Column 'FirstName'
Bytes used per row: 409
INFO Column 'LastName'
Bytes used per row: 20009
INFO Column 'Age'
Bytes used per row: 26
INFO Column 'companyid'
Bytes used per row: 49
INFO Column 'time_stäm'
Bytes used per row: 16
INFO Column '__timestamp'
Bytes used per row: 24
INFO Column '__is_deleted'
Bytes used per row: 9
INFO Column '__is_full_load'
Bytes used per row: 9
INFO Total memory usage per row for single transit buffer: 20559
SELECT [User - iD] AS [User_-_iD], [FirstName] AS [FirstName], [LastName] AS [LastName], [Age] AS [Age], [companyid] AS [companyid], CAST([time stämp] AS BIGINT) AS [time_stämp], CAST(GETUTCDATE() AS datetime2(6)) AS __timestamp, CAST(0 AS BIT) AS __is_deleted, CAST(1 AS BIT) AS __is_full_load FROM [dbo].[user]
User_-_iD: int64 not null
FirstName: string
LastName: string
Age: decimal128(15, 3)
companyid: string not null
time_stäm: int64
__timestamp: timestamp[us]
__is_deleted: bool
__is_full_load: bool
DEBUG SQLFreeHandle dropped 0x55c451ef1b20 of type Stmt.
DEBUG SQLFreeHandle dropped 0x55c451f30050 of type Dbc.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

I could reproduce the Bug.

Only happens on Linux. Could also be a bug with MSFT's ODBC Driver, but I guess MSFT does not do bugs 😉

Actually MSFT has one of the more solid ODBC drivers I would say. I also admire many of the things they achieved in terms of engineering. Yet this time, I think it is on them. It is not a configuration issue of the client local. The umlaut ä is rendered correctly. Yet the string is one letter short. My guess is this has to do with the fact that ä is two bytes in UTF-8, yet the text length is returned in characters not bytes. Anyhow this bug needs to be fixed by Microsoft.

Windows version of of arrow-odbc-py is not affected, because on Windows I choose to use the wide UTF-16 version of the ODBC interface. I am happy to help writing and detailing the issue (however I also suspect that a knowledgeable MS employee would be a ble to reproduce it by looking at this thread alone), but I can not figure out how to report a bug to MS.

Best, Markus

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

As a workaround I'll give it a try to supply larger buffers than requested by the drivers ...

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

To be fair to Microsoft, the function in question SQLDescribeCol is document to return the length in characters not bytes. https://learn.microsoft.com/en-us/sql/odbc/reference/syntax/sqldescribecol-function?view=sql-server-ver16. Yet it is clearly intended to determine buffer lengths (usually expressed in bytes). Driver may be "fine" and the ODBC standard is broken here. Well, anyhow, seems in the narrow case I'll start multiplying its values by 3 and for wide code by 2, this should put it all on the safe side even in worst case scenarios. I need to fix this upstream in odbc-api though.

from arrow-odbc-py.

aersam avatar aersam commented on August 29, 2024

Hm, interesting. Would you think I can reproduce this behavior if I write it in C#/.Net Core using MSFT-only libs?

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Maybe not, depends if ADO relies on ODBC. Also relies on how wasteful authors of these libs are then using memory. I could avoid a lot of trouble if I would just allocate 4kb of buffer for each column name you wanna now. In that case you the name would need to be very long and contain special characters.

A minimal reproducing example in plain C, maybe based on an example for SQLDescribeCol would be feasable though. Just watch the string length output argument for a column name whose name differ in byte and character length.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

NameLengthPtr
[Output] Pointer to a buffer in which to return the total number of characters (excluding the null termination) available to return in *ColumnName. If the number of characters available to return is greater than or equal to BufferLength, the column name in *ColumnName is truncated to BufferLength minus the length of a null-termination character.

This formulation alone in the documents is problematic

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

It confuses binary buffer length and character length. Very likely related to the bug in question.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Definitly this. If you want to see it fail under windows too, you can use a character in the column name which is 4 Bytes in UTF-16 instead of two. E.g. 𐐏. Also MariaDB has the same bug; PostgreSQL is fine though.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Sadly allocating larger buffers is not a suitable workaround. The drivers only copy the bytes in the length of character data in the application provided buffers. The "good" news is that this makes the bug more relevant to fix.

Sorry, I cannot help here. I really tried.

from arrow-odbc-py.

aersam avatar aersam commented on August 29, 2024

Thank you for all your efforts! I'll try creating a repo in C# and getting this to the correct guys at Redmond - it's an interesting bug, after all 🙂

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Probably not even needed, just stating that SQLDescribeCol truncates column names those binary length exceeds the character length, is a precise description, which should enable them to reproduce.

from arrow-odbc-py.

pacman82 avatar pacman82 commented on August 29, 2024

Remark: Using SQLColAttribute instead of SQLDescribeCol might work.

from arrow-odbc-py.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.