apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow

Home Page: https://arrow.apache.org/adbc/

License: Apache License 2.0


arrow-adbc's Introduction

ADBC: Arrow Database Connectivity

License

ADBC is an API standard (version 1.0.0) for database access libraries ("drivers") in C, Go, and Java that uses Arrow for result sets and query parameters. Instead of writing code to convert to and from Arrow data for each individual database, applications can build against the ADBC APIs, and link against drivers that implement the standard. Additionally, a JDBC/ODBC-style driver manager is provided. This also implements the ADBC APIs, but dynamically loads drivers and dispatches calls to them.
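As a rough sketch of what this looks like for an application (abbreviated, with error handling omitted; the include path, driver name, and option key follow common conventions but may differ from your build), the driver manager exposes the same C API as a driver:

#include <stdint.h>

#include <adbc.h>  // ADBC API and driver manager declarations

void run_query_example(void) {
  struct AdbcError error = {0};

  // Tell the driver manager which driver to load, then initialize it.
  struct AdbcDatabase database = {0};
  AdbcDatabaseNew(&database, &error);
  AdbcDatabaseSetOption(&database, "driver", "adbc_driver_sqlite", &error);
  AdbcDatabaseInit(&database, &error);

  struct AdbcConnection connection = {0};
  AdbcConnectionNew(&connection, &error);
  AdbcConnectionInit(&connection, &database, &error);

  // Run a query; the result set arrives as an ArrowArrayStream.
  struct AdbcStatement statement = {0};
  AdbcStatementNew(&connection, &statement, &error);
  AdbcStatementSetSqlQuery(&statement, "SELECT 1", &error);

  struct ArrowArrayStream stream;
  int64_t rows_affected = -1;
  AdbcStatementExecuteQuery(&statement, &stream, &rows_affected, &error);
  // ...consume the stream with any Arrow implementation, then release...
  stream.release(&stream);

  AdbcStatementRelease(&statement, &error);
  AdbcConnectionRelease(&connection, &error);
  AdbcDatabaseRelease(&database, &error);
}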

Like JDBC/ODBC, the goal is to provide a generic API for multiple databases. ADBC, however, is focused on bulk columnar data retrieval and ingestion through an Arrow-based API rather than attempting to replace JDBC/ODBC in all use cases. Hence, ADBC is complementary to those existing standards.

Like Arrow Flight SQL, ADBC is an Arrow-based way to work with databases. However, Flight SQL is a protocol defining a wire format and network transport as opposed to an API specification. Flight SQL requires a database to specifically implement support for it, while ADBC is a client API specification for wrapping existing database protocols which could be Arrow-native or not. Together, ADBC and Flight SQL offer a fully Arrow-native solution for clients and database vendors.

For more about ADBC, see the introductory blog post.

Status

ADBC versions the API standard and the implementing libraries separately.

The API standard (version 1.0.0) is considered stable, but enhancements may be made.

Libraries are under development. For more details, see the documentation, or read the changelog.

Installation

Please see the documentation.

Documentation

The core API definitions can be read in adbc.h. User documentation can be found at https://arrow.apache.org/adbc

Development and Contributing

For detailed instructions on how to build the various ADBC libraries, see CONTRIBUTING.md.

arrow-adbc's People

Contributors

aiguofer, alexandreyc, birschick-bq, cocoa-xu, curthagenlocher, davidhcoe, dependabot[bot], eitsupi, elenahenderson, esodan, jacobmarble, jduo, joellubi, julian-brandrick, kou, krlmlr, lidavidm, lupko, mbrobbel, paleolimbot, ruowan, ryan-syed, soumyadsanyal, tokoko, vipere, vleslief-ms, willayd, wjones127, ywc88, zeroshade


arrow-adbc's Issues

[C] SQLite3 driver using nanoarrow

I did a little prototyping around this to try nanoarrow against some more real-world data. My interest in SQLite3 is mostly around GeoPackage (https://geopackage.org), a common spatial file format based on SQLite3. The experiment is currently living in its own repo ( https://github.com/paleolimbot/minigpkg/blob/main/src/minigpkg/nanoarrow_sqlite3.h#L112-L127 ), and while I don't think I currently have the bandwidth to implement the ADBC API, I'd be happy to circle back to this at some point, since an SQLite3 driver that depends only on sqlite3 seems like it would be useful here.

As noted in the current driver (https://github.com/apache/arrow-adbc/blob/main/c/drivers/sqlite/sqlite.cc#L139-L141), SQLite really needs an adaptive builder or an adaptive schema guesser to be useful.

Scope for "1.0"

IMO, we should have these features. Implemented for C/Python and Java:

  • Query execution
  • Data ingestion
  • Basic database metadata (listing tables, etc) (#14)
  • Error codes, with space for vendor-specific codes and SQLSTATE codes akin to JDBC/ODBC (#20)
  • Transaction semantics (#23)
  • Basic CI
  • Basic validation suites
  • Flight SQL driver


Not yet:

  • Async APIs
  • JNI-based driver

[C] libpq driver improvements

  • Handle null values, more types
  • Work out some of the error handling details that were left as TODOs (and probably do some refactoring)
  • Handle bulk ingestion
  • Use COPY FROM STDIN for bulk ingestion instead of a prepared statement (see https://www.postgresql.org/docs/current/populate.html; a sketch follows this list)
  • Handle prepared statements
  • Add tests for queries that don't return result sets
  • #568
  • Make sure concurrent statements don't interfere with each other/internal locking
  • Set a notice handler
  • #538 integration test with Polars
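For reference, a minimal sketch of the COPY FROM STDIN flow using raw libpq (the table, columns, and CSV row here are made up; the real driver would generate COPY data from the bound Arrow batches):

#include <stdio.h>
#include <string.h>

#include <libpq-fe.h>

// Returns 0 on success. `conn` is an already-established PGconn.
static int bulk_ingest_sketch(PGconn* conn) {
  PGresult* res = PQexec(conn, "COPY my_table (id, name) FROM STDIN (FORMAT csv)");
  if (PQresultStatus(res) != PGRES_COPY_IN) {
    fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
    PQclear(res);
    return 1;
  }
  PQclear(res);

  // Stream rows to the server; a driver would emit one chunk per Arrow batch.
  const char* row = "1,Arrow\n";
  if (PQputCopyData(conn, row, (int)strlen(row)) != 1) return 1;
  if (PQputCopyEnd(conn, NULL) != 1) return 1;

  // Drain the remaining results to get the final command status.
  int status = 0;
  while ((res = PQgetResult(conn)) != NULL) {
    if (PQresultStatus(res) != PGRES_COMMAND_OK) status = 1;
    PQclear(res);
  }
  return status;
}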

[C] Clean up CMake config

Right now the CMake config is copied from the Arrow C++ project, but as pointed out in #90, we don't necessarily need all of that config's complexity and features, and it will be hard to keep them in sync, so we probably want to refactor, trim down, or rebuild the CMake config.

Subtasks:

[Format] Improve partitioned data interface

We should improve the documentation/justification for this interface, describe better what happens when it's not supported, and make sure it lines up with what potential users of the interface expect.

In particular, it should line up with Spark's DataSourceV2. Looking at ReadSupport, the main thing is that we need to return the schema and partitions at the same time. So we might want to return something like this:

struct AdbcPartitions {
  struct ArrowSchema result_schema;  // schema shared by all partitions
  size_t num_partitions;             // number of partition descriptors
  uint8_t** partitions;              // opaque, serialized partition descriptors
  void* private_data;                // driver-specific state freed by AdbcPartitionsRelease
};
AdbcStatusCode AdbcPartitionsRelease(struct AdbcPartitions*, struct AdbcError*);

Also, should deserializing a partition descriptor give you a statement, or just directly give you a result reader?

Also see #61 which proposes refactoring the Execute API.

[Format] Retrieve expected param binding information

It would be great to be able to retrieve whatever information about parameter binding is available. Potential information includes:

  • Number of expected parameters
  • Names of expected parameters
  • Schema / types of expected parameters

Since this information cannot always be obtained reliably, there would need to be a way to indicate that no error occurred but that the information is not available.
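One hypothetical shape for this (the name and signature are illustrative only, not part of the spec): a statement-level call that fills in the parameter schema when the driver knows it and signals "not supported" without raising an error when it does not.

// Hypothetical sketch: retrieve the schema of the expected bind parameters.
// A driver that cannot provide it could return a dedicated non-error status
// (or an empty schema) rather than failing the statement.
AdbcStatusCode AdbcStatementGetParameterSchema(struct AdbcStatement* statement,
                                               struct ArrowSchema* schema,
                                               struct AdbcError* error);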

Clarify transaction semantics

  • JDBC, ODBC, Flight SQL (implicitly): auto-commit
  • PEP 249: manual commit

We should define what the default is and add a function to set this on the connection.
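One possible shape is a string option on the connection (the option key and values here are illustrative only):

// Illustrative sketch: toggle auto-commit through a generic connection option.
AdbcStatusCode AdbcConnectionSetOption(struct AdbcConnection* connection,
                                       const char* key, const char* value,
                                       struct AdbcError* error);

// e.g. AdbcConnectionSetOption(&connection, "adbc.connection.autocommit",
//                              "false", &error);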

Implement table/column reflection

See lidavidm/arrow#9

Similar to AdbcConnectionGetTables, there should be a way to query the columns available in a table with their names, types, NULL-ness etc.

I was basically modeling this off of Flight SQL: https://github.com/apache/arrow/blob/master/format/FlightSql.proto

In this case GetTables has a parameter to optionally fetch the Arrow schema of the tables as well. And then SQL-specific type info is encoded as column metadata.

I recall we talked about using something like MSSQL information_schema instead. This was also debated for Flight SQL. We decided against it since databases don't necessarily implement this, or implement slightly different things, and there's no easy way for a client to figure this out. Also, then you're dependent on knowing the database's SQL dialect in order to introspect it. While explicit calls like GetTables are tedious and less flexible, they don't require so much bootstrapping. (This is important for Flight SQL which does not want to have to implement client-side drivers; a Flight SQL client should be compatible with any database implementing the protocol. ADBC doesn't necessarily have the same constraint, though.)

GetTables also needs to (optionally) return the schema. Or we could have a separate method that models the schema explicitly? The difference would be that we could explicitly model the SQL-database metadata (which Flight SQL stuffs into Arrow field metadata).
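One possible shape for the per-table case (name and signature illustrative only): a call that returns the Arrow schema of a single table, with any SQL-specific type information carried as field metadata.

// Hypothetical sketch: fetch the Arrow schema of one table; NULL catalog or
// db_schema would mean "unspecified".
AdbcStatusCode AdbcConnectionGetTableSchema(struct AdbcConnection* connection,
                                            const char* catalog,
                                            const char* db_schema,
                                            const char* table_name,
                                            struct ArrowSchema* schema,
                                            struct AdbcError* error);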

Improve error handling

We should fill out the error codes. Things to consider:

  • Flight/gRPC status codes: the gRPC status codes are nice because they're well-defined.
  • PEP 249 exception hierarchy
  • SQLSTATE standard (JDBC/ODBC)

Do we want to extend AdbcError with space for database-specific error codes?
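For example (a sketch, not a final definition), AdbcError could carry a vendor-specific code and a SQLSTATE alongside the message:

struct AdbcError {
  char* message;                             // human-readable error message
  int32_t vendor_code;                       // driver/database-specific error code
  char sqlstate[5];                          // five-character SQLSTATE, as in JDBC/ODBC
  void (*release)(struct AdbcError* error);  // frees the contained resources
};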

Set up linters, sanitizers

We need to set up:

  • cpplint
  • clang-tidy
  • ASan/UBSan
  • Valgrind, possibly
  • flake8 for Cython (no longer supported)

Reorganize and complete Python bindings

The existing bindings aren't fully complete.

We should have two packages: a low-level package with no/minimal dependencies that mirrors the C API closely; and eventually, a high-level package that implements a PEP 249 compliant API with extensions (similar to turbodbc).

[Format] Simplify Execute and Query interface

Rather than the separate Execute / GetStream functions, it might be better to follow something similar to Flight SQL's interface or Go's database/sql API.

Have two functions:

  • Execute a query without expecting a result set: Execute(struct AdbcConnection*, const char*, struct AdbcResult*, struct AdbcError*), where AdbcResult would contain an optional last-inserted ID and the number of rows affected
  • Execute a query to retrieve a result set: Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, struct AdbcError*) where the ArrowArrayStream is populated with the result set.

Corresponding methods would exist for a Statement, just without the const char* parameter, since the query would already be prepared in the statement.
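Spelled out as declarations, the connection-level pair proposed above would look roughly like this (AdbcResult is the new struct described above, carrying the optional last-inserted ID and rows-affected count):

// Proposed sketch: run a statement that is not expected to produce a result set.
AdbcStatusCode Execute(struct AdbcConnection* connection, const char* query,
                       struct AdbcResult* result, struct AdbcError* error);

// Proposed sketch: run a query and receive the result set as an ArrowArrayStream.
AdbcStatusCode Query(struct AdbcConnection* connection, const char* query,
                     struct ArrowArrayStream* out, struct AdbcError* error);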

Some benefits of this idea:

  • Because the interface makes executing a query and retrieving its results a single API call, concurrency becomes easier for consumers of the interface to handle (any complexity would be handled by the driver implementation).
  • Drivers can be aware of whether or not the user expects a result set and can operate accordingly. They don't have to interrogate the backend to know whether a result set is available or whether the query was an update/insert vs. a select.

[Ruby] RubyGems packaging is broken

We can't use ../ for files that are included in the .gem:

$ cd ruby
$ rake install
red-adbc 1.0.0 built to pkg/red-adbc-1.0.0.gem.
rake aborted!
Running `gem install .../ruby/pkg/red-adbc-1.0.0.gem` failed with the following output:

ERROR:  While executing gem ... (Gem::Package::PathError)
    installing into parent path /tmp/local/lib/ruby/gems/3.2.0+2/gems/LICENSE.txt of /tmp/local/lib/ruby/gems/3.2.0+2/gems/red-adbc-1.0.0 is not allowed
...

[Python] Driver distribution

We don't want to/can't futz with shared libraries when distributing packages. Instead, since drivers have an entrypoint, just expose the entrypoint in Python?

e.g. adbc_driver_sqlite will bundle the driver and expose adbc_driver_sqlite._entrypoint(driver_address: int) -> int, roughly corresponding to AdbcStatusCode _entrypoint(struct AdbcDriver* driver). Then the driver manager can just import the package and call the entrypoint. (There'll need to be some support for this on the C++ side too.)

That way we can more easily distribute pip-installable packages.

[CI] CI/Packaging Setup

CI:

  • Integration test with Postgres

Packaging:

  • Nightly Java JARs (no native code, so only one build needed)
  • A single conda-forge feedstock (generating multiple packages)
  • #99
  • Python wheels (note: libpq has quite a few transitive dependencies, look at the licenses/make sure we are bundling everything appropriate)

AdbcStatementExecuteUpdate/AdbcStatementExecuteQuery & rows_affected

AdbcStatementExecuteUpdate can set rows_affected, which feels a bit dated given that most SQL systems now support the RETURNING clause; that makes the separation between ExecuteQuery and ExecuteUpdate redundant. I propose having only one method to run a query, possibly allowing the out argument of AdbcStatementExecuteQuery to be nullptr. rows_affected should be removed.

Cleanly separate C/Java code

We should have a structure like

/
  adbc.h
  c/
    driver_manager
    driver
    validation
  java/
    ...

(also, the Java "testsuite" package should be renamed to "validation")

Potential DBI-inspired APIs

  • Function to SELECT * from a table without providing a query (makes it easier to provide non-query-engine based backends, e.g. a Parquet file backend)

[Format] Minor gaps with existing APIs

  • A way to get the row count of a result set, when known (== DBAPI Cursor.rowcount)
  • A way to get the name of the 'current' catalog (== Ibis Backend.current_database)
  • A way to get the parameter style (== DBAPI paramstyle)

[C] How about adding support for driver directory?

Related: #84

Generally, a plugin system uses its own plugin directory. For example, PostgreSQL uses ${prefix}/${version}/lib/ and MySQL uses ${prefix}/lib/mysql/plugin/.

How about adding support for a driver directory like those? If we use a MODULE library rather than a SHARED library (see #84), it's strange for adbc_driver_postgres.so (no lib prefix) to live in /usr/lib/.

[C] How about using MODULE library not SHARED library for driver?

Generally, a C/C++ plugin is implemented as a MODULE library.

See also https://cmake.org/cmake/help/latest/command/add_library.html#normal-libraries for MODULE and SHARED libraries.

For example, PostgreSQL's extensions and MySQL's storage engines are implemented as MODULE libraries.

If we use a MODULE library, the driver module file name is something like adbc_driver_postgresql.so on Linux. It doesn't have a lib prefix or a shared object version such as 900.0.0.

[Format] Formalize thread safety and concurrency guarantees

Things to consider:

  • What do underlying APIs provide (libpq, duckdb, sqlite, JDBC, ODBC, Flight SQL)
  • What do wrapper APIs expect (JDBC, ODBC, DBI, dbapi, Go's database library)

Example: libpq disallows concurrent queries through a single PGconn, so multiple AdbcStatements can't be used if they share a connection (and the semantics of that get murky anyway), but what should the behavior be?

[C] Provide a "just query" method

For the common case of executing a single SQL string, let's have a method on the connection object for executing the query directly, without the need for an intermediate Statement object.
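Something like the following, perhaps (the name is illustrative only):

// Hypothetical sketch: execute a single SQL string on a connection and receive
// the result set directly, with no intermediate AdbcStatement.
AdbcStatusCode AdbcConnectionQuery(struct AdbcConnection* connection,
                                   const char* query,
                                   struct ArrowArrayStream* out,
                                   struct AdbcError* error);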

ADBC/Ibis pain points

  • Need way to query driver and server version
  • Need handling of multiple databases within a connection (or: this is exposed as a 'catalog' already)

[Python] Implement DBAPI

Will make implementing Ibis backends easier, since we can reuse more of the code for SQLAlchemy-based backends.

[C][Java] Expand validation suites

  • Port existing tests to validation suite
  • Clean up C validation suite test helpers, see if they can't be structured better
  • Make sure API is better covered
  • Find some way to better cover error cases as well (e.g. failure of driver entrypoint function) - some of these tests may be specific to the driver manager
  • Test things using the driver struct as well (tests should use both the driver manager and the library directly)

[Format] Document compatibility goals

  • Document ABI compatibility goals in header
  • Set up ABI checker
  • Rename ADBC_VERSION_0_0_1 to ADBC_VERSION_1_0_0
  • Ensure AdbcDriver struct is all up-to-date

[C] How about introducing naming rule for entrypoint name?

Currently, driver implementers can use any name for the entrypoint function. How about introducing a naming rule for the entrypoint function, such as AdbcDriverInit() or ${driver_name}_init()? If we have a naming rule, users don't need to specify an entrypoint option.

We may not be able to use a constant name such as AdbcDriverInit() because LoadLibraryEx() doesn't have a local-binding flag like RTLD_LOCAL for dlopen().
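For illustration, on POSIX platforms a driver manager could derive the symbol name from the driver/library name instead of taking an entrypoint option (a sketch; names and error handling are simplified):

#include <dlfcn.h>
#include <stdio.h>

// Resolve a per-driver entrypoint such as ${driver_name}_init() by convention.
static void* load_driver_init(const char* library, const char* init_symbol) {
  void* handle = dlopen(library, RTLD_NOW | RTLD_LOCAL);
  if (!handle) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return NULL;
  }
  void* init = dlsym(handle, init_symbol);
  if (!init) fprintf(stderr, "dlsym failed: %s\n", dlerror());
  return init;
}

// e.g. load_driver_init("adbc_driver_postgresql.so", "adbc_driver_postgresql_init");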
