
Comments (23)

lidavidm commented on August 28, 2024

I guess that means I'd keep the existing objects we have (that expose the raw address), and just have them implement the new dunder methods in addition

pitrou commented on August 28, 2024

So in summary, short term I think I would do:

  • Add the dunder methods to the handle classes of the low-level interface, which already enables using the low-level interface without pyarrow and with the capsule protocol

  • In the places that accept data (eg ingest), generalize to accept objects that implement the dunders in addition to hardcoded support for pyarrow

+1

pitrou commented on August 28, 2024

Yes, IIRC the official R-Python interop library supports shipping PyCapsule to R (cc @paleolimbot ).

paleolimbot commented on August 28, 2024

Here, in case it's helpful!

https://github.com/rstudio/reticulate/blob/74d139b1772d29dce24b22a828f2972ac97abacf/src/python.cpp#L1371-L1389

pitrou commented on August 28, 2024

Ah, that doesn't sound helpful actually, because reticulate is only considering the case where the "context" of the capsule (an embedded opaque pointer) was created by R. Here, the case AFAIU would be to pass a Python-created PyCapsule to R. Casting a Python-managed opaque pointer to an R SEXP is probably a bad idea (though who knows :-)).

pitrou commented on August 28, 2024

I mean, the code reference you gave was very helpful :-)

paleolimbot commented on August 28, 2024

I see... that's the other direction (R to Python). I assumed that would roundtrip but apparently it doesn't ( https://github.com/rstudio/reticulate/blob/74d139b1772d29dce24b22a828f2972ac97abacf/src/python.cpp#L755-L756 ).

Casting a Python-managed opaque pointer to an R SEXP is probably a bad idea (though who knows :-)).

I think in this case that's exactly what we want! It's easy to work around, though, as long as we can get an address (similar to what the Arrow R package does, except in this case the R external pointer will keep a Python object reference).

pitrou commented on August 28, 2024

Well, py_to_r doesn't seem to handle capsules at all? What am I missing?

paleolimbot commented on August 28, 2024

Nothing! reticulate doesn't handle them. In this case the R package would implement py_to_r.some.qualified.python.type.schema() and return an external pointer classed as nanoarrow_schema (for example). My point was that the semantics should be exactly the same as if the transformation was automatic (at least in this case).

lidavidm commented on August 28, 2024

Going to punt on this, since I think if we start returning PyCapsules, you won't be able to work with them unless you also have PyArrow 14, and I don't want to bump the minimum PyArrow version that much right now.

pitrou commented on August 28, 2024

I'm not sure I understand the relationship with PyArrow here. What are the PyCapsules in this issue supposed to convey?

lidavidm commented on August 28, 2024

We return custom wrappers around C Data Interface objects; we want to return compliant PyCapsules (and Joris already prototyped that), but first I also want some way to make the PyCapsules work with versions of PyArrow that can't import them (since they're opaque to Python code, right?)

pitrou commented on August 28, 2024

since they're opaque to Python code, right?

Indeed, they are, except using ctypes or cffi.
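
For illustration, a minimal sketch (not ADBC code) of the ctypes route: it pulls the raw ArrowSchema address out of a schema capsule. The capsule name b"arrow_schema" comes from the Arrow PyCapsule interface specification; capsule is assumed to have been produced by some object's __arrow_c_schema__() method.

    import ctypes

    # Bind the CPython C-API function that unwraps a PyCapsule.
    PyCapsule_GetPointer = ctypes.pythonapi.PyCapsule_GetPointer
    PyCapsule_GetPointer.restype = ctypes.c_void_p
    PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

    def capsule_to_address(capsule, name=b"arrow_schema"):
        """Return the raw struct address held by an Arrow PyCapsule."""
        return PyCapsule_GetPointer(capsule, name)

Note that the capsule still owns the underlying struct, so the returned address is only valid for as long as the capsule (and whatever produced it) stays alive.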

paleolimbot commented on August 28, 2024

I think the idiom to apply would be to implement __arrow_c_stream__() on the statement wrapper and have that be the one and only thing that returns a capsule?

lidavidm commented on August 28, 2024

But we have other methods (like get_table_schema) that need to return capsules.

paleolimbot commented on August 28, 2024

But we have other methods (like get_table_schema) that need to return capsules.

I think there it would return an object that implements __arrow_c_schema__()? I think the intention is that a user would do:

pyarrow.schema(conn.get_table_schema())
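
For illustration only (the class and attribute names here are hypothetical, not ADBC's actual ones), such a return value just needs the __arrow_c_schema__ dunder; pyarrow.schema() in pyarrow 14+ then consumes the capsule it returns:

    class TableSchemaHandle:
        """Hypothetical object returned by get_table_schema."""

        def __init__(self, schema_capsule):
            # A PyCapsule named "arrow_schema" wrapping an ArrowSchema struct.
            self._capsule = schema_capsule

        def __arrow_c_schema__(self):
            # The consumer takes ownership of the ArrowSchema, so a real
            # implementation should export a fresh capsule on every call
            # rather than handing out the same one twice.
            return self._capsule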

lidavidm commented on August 28, 2024

That still requires you to have PyArrow 14, which is the main thing I wanted to avoid

lidavidm commented on August 28, 2024

Well, given the PyArrow CVE I think the next release of ADBC will have to require 14.0.1 as a baseline.

pitrou commented on August 28, 2024

There's the hotfix as well if you don't want to require 14.0.1.

lidavidm commented on August 28, 2024

Right - but I think it behooves another Arrow project to follow our own advice 🙂

jorisvandenbossche commented on August 28, 2024

Started looking at this again, exploring how we can adopt the capsule protocol in ADBC python.

The low-level interface (_lib.pyx) has no dependency on pyarrow, and currently has ArrowSchemaHandle / ArrowArrayHandle / ArrowArrayStreamHandle classes that hold the C structs; instances of those classes are returned by the various methods.

I think in theory we could replace those current Handle classes with PyCapsules directly (and in the higher-level code, we can still extract the pointer from the PyCapsule when having to deal with a pyarrow version that doesn't support capsules directly).
However, those handle objects are currently exposed in the public API, so would we be OK with just removing them? (I don't know if there are external libraries that use those directly?)
We could also keep them, and add the dunder methods to them so they can be consumed without having to go through the integer address. Which is what David mentioned above (#70 (comment)), I assume:

I guess that means I'd keep the existing objects we have (that expose the raw address), and just have them implement the new dunder methods in addition

One advantage of keeping some generic wrapper/handle class that has the dunders, versus returning PyCapsules directly, is that the return values of the low-level interface can then more easily be passed to a library that expects an object with the dunder defined rather than a bare capsule (i.e. how we currently implemented support for this in pyarrow, e.g. pyarrow.array(..) checks for an __arrow_c_array__ attribute on the passed object, but doesn't accept capsules directly, xref apache/arrow#38010).
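
To make that concrete, a rough sketch of the "keep the handle, add the dunder" variant for the stream case (the class name and internals are illustrative, not the actual _lib.pyx code): the handle keeps exposing a raw address for pyarrow versions without capsule support, while also implementing __arrow_c_stream__ for capsule-aware consumers.

    import ctypes

    _PyCapsule_GetPointer = ctypes.pythonapi.PyCapsule_GetPointer
    _PyCapsule_GetPointer.restype = ctypes.c_void_p
    _PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

    class ArrowArrayStreamHandleSketch:
        """Illustrative stand-in for a low-level stream handle."""

        def __init__(self, stream_capsule):
            # A PyCapsule named "arrow_array_stream" owning an ArrowArrayStream.
            self._capsule = stream_capsule

        @property
        def address(self) -> int:
            # Old-style access, e.g. for pyarrow versions that only offer the
            # private RecordBatchReader._import_from_c(address) path.
            return _PyCapsule_GetPointer(self._capsule, b"arrow_array_stream")

        def __arrow_c_stream__(self, requested_schema=None):
            # New-style access: the consumer takes ownership of the stream,
            # so a real implementation must not hand the same capsule out twice.
            return self._capsule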


The DBAPI is much more tied to pyarrow, so I don't know if, in the short term, we want to enable getting Arrow data without a dependency on pyarrow, relying only on the capsule protocol. That would require quite a few changes.

Just getting an overview for myself:

  • Connection.adbc_get_info and adbc_get_table_types -> return info as a dict or list, but under the hood consume the stream with pyarrow and convert it to a Python list -> this doesn't expose pyarrow data, so in the short term these could continue to require pyarrow as a runtime dependency
  • Connection.adbc_get_objects returns a pyarrow.RecordBatchReader
  • Connection.adbc_get_table_schema returns a pyarrow.Schema
  • Cursor._bind / _prepare_execute / execute -> currently accept pyarrow RecordBatch/Table as parameters, but this can be expanded to any object that has the array or array stream protocol
  • Cursor.execute / adbc_read_partition set the _result object to a _RowIterator of an AdbcRecordBatchReader. This reader subclasses pyarrow.RecordBatchReader (to ensure ADBC error messages are properly propagated)
  • Cursor.adbc_ingest accepts pyarrow RecordBatch, Table or RecordBatchReader as data; this can also be generalized to any object that supports the protocol
  • Cursor.adbc_execute_schema / adbc_prepare returns a pyarrow.Schema
  • Cursor.fetchallarrow / fetch_arrow_table return a pyarrow.Table
  • Cursor.fetch_record_batch returns a pyarrow.RecordBatchReader
  • Cursor.fetchone / fetchmany / fetchall use the _RowIterator, which uses the pyarrow RecordBatchReader to iterate over the data -> this only uses pyarrow under the hood for converting the data to Python tuples, and so in the short term can continue to do that with a pyarrow runtime dependency.

Changing the methods that currently return a pyarrow Schema/Table/RecordBatchReader to return a generic object implementing the dunders seems too much of a breaking change (and would also be much less convenient for pyarrow users, e.g. requiring users to do pyarrow.schema(conn.get_table_schema()) instead of conn.get_table_schema()).
But would we want to add additional method variants of each of those that return something generic like that?

Of course, assuming you have pyarrow 14+ installed, by returning pyarrow objects which implement the dunders we also automatically make the capsule protocol available in ADBC. So it's more a question of to what extent we want to make it possible to use the DBAPI layer without pyarrow.


So in summary, short term I think I would do:

  • Add the dunder methods to the handle classes of the low-level interface, which already enables using the low-level interface without pyarrow and with the capsule protocol
  • In the places that accept data (eg ingest), generalize to accept objects that implement the dunders in addition to hardcoded support for pyarrow

And then longer term we can think about whether we also want to enable using the DBAPI in some form without a runtime dependency on pyarrow.
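
As a rough sketch of the second bullet in that summary (the function name and return convention are hypothetical, not ADBC's actual code), the idea is to check for the protocol dunders first and keep the existing pyarrow-specific handling as a fallback:

    def _resolve_bind_data(data):
        """Normalize data passed to ingest/bind (hypothetical helper)."""
        if hasattr(data, "__arrow_c_stream__"):
            # Capsule named "arrow_array_stream"; works for any producer
            # implementing the protocol, not just pyarrow.
            return "stream", data.__arrow_c_stream__()
        if hasattr(data, "__arrow_c_array__"):
            # Per the spec this returns a (schema_capsule, array_capsule) pair.
            return "array", data.__arrow_c_array__()

        # Fallback: the existing hardcoded pyarrow path, for pyarrow versions
        # that predate the capsule protocol.
        import pyarrow as pa
        if isinstance(data, (pa.RecordBatch, pa.Table, pa.RecordBatchReader)):
            return "pyarrow", data
        raise TypeError(f"cannot ingest object of type {type(data).__name__}")

With a recent pyarrow, RecordBatch/Table/RecordBatchReader implement these dunders themselves and take the first branches, so the isinstance fallback only matters for older pyarrow versions.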

lidavidm commented on August 28, 2024

Thanks! Yup, this is my plan (I started working on it over holiday but realized it was...holiday)

lidavidm commented on August 28, 2024

Completed in #1346
