apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow

Home Page: https://arrow.apache.org/adbc/

License: Apache License 2.0


arrow-adbc's Introduction

ADBC: Arrow Database Connectivity

License

ADBC is an API standard (version 1.0.0) for database access libraries ("drivers") in C, Go, and Java that uses Arrow for result sets and query parameters. Instead of writing code to convert to and from Arrow data for each individual database, applications can build against the ADBC APIs, and link against drivers that implement the standard. Additionally, a JDBC/ODBC-style driver manager is provided. This also implements the ADBC APIs, but dynamically loads drivers and dispatches calls to them.
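As a rough sketch of what this looks like for an application (abbreviated, with error handling omitted; the include path, driver name, and option key follow common conventions but may differ from your build), the driver manager exposes the same C API as a driver:

#include <stdint.h>

#include <adbc.h>  // ADBC API and driver manager declarations

void run_query_example(void) {
  struct AdbcError error = {0};

  // Tell the driver manager which driver to load, then initialize it.
  struct AdbcDatabase database = {0};
  AdbcDatabaseNew(&database, &error);
  AdbcDatabaseSetOption(&database, "driver", "adbc_driver_sqlite", &error);
  AdbcDatabaseInit(&database, &error);

  struct AdbcConnection connection = {0};
  AdbcConnectionNew(&connection, &error);
  AdbcConnectionInit(&connection, &database, &error);

  // Run a query; the result set arrives as an ArrowArrayStream.
  struct AdbcStatement statement = {0};
  AdbcStatementNew(&connection, &statement, &error);
  AdbcStatementSetSqlQuery(&statement, "SELECT 1", &error);

  struct ArrowArrayStream stream;
  int64_t rows_affected = -1;
  AdbcStatementExecuteQuery(&statement, &stream, &rows_affected, &error);
  // ...consume the stream with any Arrow implementation, then release...
  stream.release(&stream);

  AdbcStatementRelease(&statement, &error);
  AdbcConnectionRelease(&connection, &error);
  AdbcDatabaseRelease(&database, &error);
}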

Like JDBC/ODBC, the goal is to provide a generic API for multiple databases. ADBC, however, is focused on bulk columnar data retrieval and ingestion through an Arrow-based API rather than attempting to replace JDBC/ODBC in all use cases. Hence, ADBC is complementary to those existing standards.

Like Arrow Flight SQL, ADBC is an Arrow-based way to work with databases. However, Flight SQL is a protocol defining a wire format and network transport as opposed to an API specification. Flight SQL requires a database to specifically implement support for it, while ADBC is a client API specification for wrapping existing database protocols which could be Arrow-native or not. Together, ADBC and Flight SQL offer a fully Arrow-native solution for clients and database vendors.

For more about ADBC, see the introductory blog post.

Status

ADBC versions the API standard and the implementing libraries separately.

The API standard (version 1.0.0) is considered stable, but enhancements may be made.

Libraries are under development. For more details, see the documentation, or read the changelog.

Installation

Please see the documentation.

Documentation

The core API definitions can be read in adbc.h. User documentation can be found at https://arrow.apache.org/adbc

Development and Contributing

For detailed instructions on how to build the various ADBC libraries, see CONTRIBUTING.md.

arrow-adbc's People

Contributors

aiguofer, alexandreyc, birschick-bq, cocoa-xu, curthagenlocher, davidhcoe, dependabot[bot], eitsupi, elenahenderson, esodan, jacobmarble, jduo, joellubi, julian-brandrick, kou, krlmlr, lidavidm, lupko, mbrobbel, paleolimbot, ruowan, ryan-syed, soumyadsanyal, tokoko, vipere, vleslief-ms, willayd, wjones127, ywc88, zeroshade


arrow-adbc's Issues

[C] SQLite3 driver using nanoarrow

I did a little prototyping around this to try nanoarrow against some more real-world data. My interest in SQLite3 is mostly around GeoPackage (https://geopackage.org), a common spatial file format based on SQLite3. The experiment is currently living in its own repo ( https://github.com/paleolimbot/minigpkg/blob/main/src/minigpkg/nanoarrow_sqlite3.h#L112-L127 ), and while I don't think I currently have the bandwidth to implement the ADBC API, I'd be happy to circle back to this at some point, since an SQLite3 driver that depends only on sqlite3 seems like it would be useful here.

As noted in the current driver (https://github.com/apache/arrow-adbc/blob/main/c/drivers/sqlite/sqlite.cc#L139-L141), SQLite really needs an adaptive builder or an adaptive schema guesser to be useful.

Scope for "1.0"

IMO, we should have these features. Implemented for C/Python and Java:

  • Query execution
  • Data ingestion
  • Basic database metadata (listing tables, etc) (#14)
  • Error codes, with space for vendor-specific codes and SQLSTATE codes akin to JDBC/ODBC (#20)
  • Transaction semantics (#23)
  • Basic CI
  • Basic validation suites
  • Flight SQL driver


Not yet:

  • Async APIs
  • JNI-based driver

[C] libpq driver improvements

  • Handle null values, more types
  • Work out some of the error handling details that were left as TODOs (and probably do some refactoring)
  • Handle bulk ingestion
  • Use COPY FROM STDIN for bulk ingestion instead of a prepared statement (see https://www.postgresql.org/docs/current/populate.html; a sketch follows this list)
  • Handle prepared statements
  • Add tests for queries that don't return result sets
  • #568
  • Make sure concurrent statements don't interfere with each other/internal locking
  • Set a notice handler
  • #538 integration test with Polars
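For reference, a minimal sketch of the COPY FROM STDIN flow using raw libpq (the table, columns, and CSV row here are made up; the real driver would generate COPY data from the bound Arrow batches):

#include <stdio.h>
#include <string.h>

#include <libpq-fe.h>

// Returns 0 on success. `conn` is an already-established PGconn.
static int bulk_ingest_sketch(PGconn* conn) {
  PGresult* res = PQexec(conn, "COPY my_table (id, name) FROM STDIN (FORMAT csv)");
  if (PQresultStatus(res) != PGRES_COPY_IN) {
    fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
    PQclear(res);
    return 1;
  }
  PQclear(res);

  // Stream rows to the server; a driver would emit one chunk per Arrow batch.
  const char* row = "1,Arrow\n";
  if (PQputCopyData(conn, row, (int)strlen(row)) != 1) return 1;
  if (PQputCopyEnd(conn, NULL) != 1) return 1;

  // Drain the remaining results to get the final command status.
  int status = 0;
  while ((res = PQgetResult(conn)) != NULL) {
    if (PQresultStatus(res) != PGRES_COMMAND_OK) status = 1;
    PQclear(res);
  }
  return status;
}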

[C] Clean up CMake config

Right now the CMake config is copied from the Arrow C++ project, but as pointed out in #90, we don't necessarily need all of that config's complexity and features, and it will be hard to keep them in sync, so we probably want to refactor, trim down, or rebuild the CMake config.

Subtasks:

[Format] Improve partitioned data interface

We should improve the documentation/justification for this interface, describe better what happens when it's not supported, and make sure it lines up with what potential users of the interface expect.

In particular, it should line up with Spark's DataSourceV2. Looking at ReadSupport, the main thing is that we need to return the schema and partitions at the same time. So we might want to return something like this:

struct AdbcPartitions {
  struct ArrowSchema result_schema;  // schema shared by all partitions
  size_t num_partitions;             // number of partition descriptors
  uint8_t** partitions;              // opaque, serialized partition descriptors
  void* private_data;                // driver-specific state freed by AdbcPartitionsRelease
};
AdbcStatusCode AdbcPartitionsRelease(struct AdbcPartitions*, struct AdbcError*);

Also, should deserializing a partition descriptor give you a statement, or just directly give you a result reader?

Also see #61 which proposes refactoring the Execute API.

[Format] Retrieve expected param binding information

It would be great to be able to retrieve whatever information about parameter binding is available. Potential information includes:

  • Number of expected parameters
  • Names of expected parameters
  • Schema / types of expected parameters

Since this information cannot always be obtained reliably, there would need to be a way to indicate that no error occurred but that the information is not available.
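One hypothetical shape for this (the name and signature are illustrative only, not part of the spec): a statement-level call that fills in the parameter schema when the driver knows it and signals "not supported" without raising an error when it does not.

// Hypothetical sketch: retrieve the schema of the expected bind parameters.
// A driver that cannot provide it could return a dedicated non-error status
// (or an empty schema) rather than failing the statement.
AdbcStatusCode AdbcStatementGetParameterSchema(struct AdbcStatement* statement,
                                               struct ArrowSchema* schema,
                                               struct AdbcError* error);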

Clarify transaction semantics

  • JDBC, ODBC, Flight SQL (implicitly): auto-commit
  • PEP 249: manual commit

We should define what the default is and add a function to set this on the connection.
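One possible shape is a string option on the connection (the option key and values here are illustrative only):

// Illustrative sketch: toggle auto-commit through a generic connection option.
AdbcStatusCode AdbcConnectionSetOption(struct AdbcConnection* connection,
                                       const char* key, const char* value,
                                       struct AdbcError* error);

// e.g. AdbcConnectionSetOption(&connection, "adbc.connection.autocommit",
//                              "false", &error);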

Implement table/column reflection

See lidavidm/arrow#9

Similar to AdbcConnectionGetTables, there should be a way to query the columns available in a table with their names, types, NULL-ness etc.

I was basically modeling this off of Flight SQL: https://github.com/apache/arrow/blob/master/format/FlightSql.proto

In this case GetTables has a parameter to optionally fetch the Arrow schema of the tables as well. And then SQL-specific type info is encoded as column metadata.

I recall we talked about using something like MSSQL information_schema instead. This was also debated for Flight SQL. We decided against it since databases don't necessarily implement this, or implement slightly different things, and there's no easy way for a client to figure this out. Also, then you're dependent on knowing the database's SQL dialect in order to introspect it. While explicit calls like GetTables are tedious and less flexible, they don't require so much bootstrapping. (This is important for Flight SQL which does not want to have to implement client-side drivers; a Flight SQL client should be compatible with any database implementing the protocol. ADBC doesn't necessarily have the same constraint, though.)

GetTables also needs to (optionally) return the schema. Or we could have a separate method that models the schema explicitly? The difference would be that we could explicitly model the SQL-database metadata (which Flight SQL stuffs into Arrow field metadata).
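One possible shape for the per-table case (name and signature illustrative only): a call that returns the Arrow schema of a single table, with any SQL-specific type information carried as field metadata.

// Hypothetical sketch: fetch the Arrow schema of one table; NULL catalog or
// db_schema would mean "unspecified".
AdbcStatusCode AdbcConnectionGetTableSchema(struct AdbcConnection* connection,
                                            const char* catalog,
                                            const char* db_schema,
                                            const char* table_name,
                                            struct ArrowSchema* schema,
                                            struct AdbcError* error);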

Improve error handling

We should fill out the error codes. Things to consider:

  • Flight/gRPC status codes: the gRPC status codes are nice because they're well-defined.
  • PEP 249 exception hierarchy
  • SQLSTATE standard (JDBC/ODBC)

Do we want to extend AdbcError with space for database-specific error codes?
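For example (a sketch, not a final definition), AdbcError could carry a vendor-specific code and a SQLSTATE alongside the message:

struct AdbcError {
  char* message;                             // human-readable error message
  int32_t vendor_code;                       // driver/database-specific error code
  char sqlstate[5];                          // five-character SQLSTATE, as in JDBC/ODBC
  void (*release)(struct AdbcError* error);  // frees the contained resources
};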

Set up linters, sanitizers

We need to set up:

  • cpplint
  • clang-tidy
  • ASan/UBSan
  • Valgrind, possibly
  • flake8 for Cython (no longer supported)

Reorganize and complete Python bindings

The existing bindings aren't fully complete.

We should have two packages: a low-level package with no/minimal dependencies that mirrors the C API closely; and eventually, a high-level package that implements a PEP 249 compliant API with extensions (similar to turbodbc).

[Format] Simplify Execute and Query interface

Rather than the separate Execute / GetStream functions, it might be better to follow something similar to Flight SQL's interface or Go's database/sql API.

Have two functions:

  • Execute a query without expecting a result set: Execute(struct AdbcConnection*, const char*, struct AdbcResult*, struct AdbcError*), where AdbcResult would contain an optional last-inserted ID and the number of rows affected
  • Execute a query to retrieve a result set: Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, struct AdbcError*) where the ArrowArrayStream is populated with the result set.

Corresponding methods would exist for a Statement, just without the const char* parameter, since the query would already be prepared in the statement.
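Spelled out as declarations, the connection-level pair proposed above would look roughly like this (AdbcResult is the new struct described above, carrying the optional last-inserted ID and rows-affected count):

// Proposed sketch: run a statement that is not expected to produce a result set.
AdbcStatusCode Execute(struct AdbcConnection* connection, const char* query,
                       struct AdbcResult* result, struct AdbcError* error);

// Proposed sketch: run a query and receive the result set as an ArrowArrayStream.
AdbcStatusCode Query(struct AdbcConnection* connection, const char* query,
                     struct ArrowArrayStream* out, struct AdbcError* error);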

Some benefits of this idea:

  • Because the interface makes executing a query and retrieving its results a single API call, concurrency becomes easier for consumers of the interface to handle (any complexity would be handled by the driver implementation).
  • Drivers can be aware of whether or not the user expects a result set and can operate accordingly. They don't have to interrogate the backend to know whether a result set is available or whether the query was an update/insert vs. a select.

[Ruby] RubyGems packaging is broken

We can't use ../ for files that are included in the .gem:

$ cd ruby
$ rake install
red-adbc 1.0.0 built to pkg/red-adbc-1.0.0.gem.
rake aborted!
Running `gem install .../ruby/pkg/red-adbc-1.0.0.gem` failed with the following output:

ERROR:  While executing gem ... (Gem::Package::PathError)
    installing into parent path /tmp/local/lib/ruby/gems/3.2.0+2/gems/LICENSE.txt of /tmp/local/lib/ruby/gems/3.2.0+2/gems/red-adbc-1.0.0 is not allowed
...

[Python] Driver distribution

We don't want to/can't futz with shared libraries when distributing packages. Instead, since drivers have an entrypoint, just expose the entrypoint in Python?

e.g. adbc_driver_sqlite will bundle the driver and expose adbc_driver_sqlite._entrypoint(driver_address: int) -> int, roughly corresponding to AdbcStatusCode _entrypoint(struct AdbcDriver* driver). Then the driver manager can just import the package and call the entrypoint. (There'll need to be some support for this on the C++ side too.)

That way we can more easily distribute pip-installable packages.

[CI] CI/Packaging Setup

CI:

  • Integration test with Postgres

Packaging:

  • Nightly Java JARs (no native code, so only one build needed)
  • A single conda-forge feedstock (generating multiple packages)
  • #99
  • Python wheels (note: libpq has quite a few transitive dependencies, look at the licenses/make sure we are bundling everything appropriate)

AdbcStatementExecuteUpdate/AdbcStatementExecuteQuery & rows_affected

AdbcStatementExecuteUpdate can set rows_affected, which feels a bit dated given that most SQL systems now support the RETURNING clause; that makes the separation between ExecuteQuery and ExecuteUpdate redundant. I propose having only one method to run a query, possibly allowing the out argument of AdbcStatementExecuteQuery to be nullptr. rows_affected should be removed.

Cleanly separate C/Java code

We should have a structure like

/
  adbc.h
  c/
    driver_manager
    driver
    validation
  java/
    ...

(also, the Java "testsuite" package should be renamed to "validation")

Potential DBI-inspired APIs

  • Function to SELECT * from a table without providing a query (makes it easier to provide non-query-engine based backends, e.g. a Parquet file backend)

[Format] Minor gaps with existing APIs

  • A way to get the row count of a result set, when known (== DBAPI Cursor.rowcount)
  • A way to get the name of the 'current' catalog (== Ibis Backend.current_database)
  • A way to get the parameter style (== DBAPI paramstyle)

[C] How about adding support for driver directory?

Related: #84

Generally, a plugin system uses its own plugin directory. For example, PostgreSQL uses ${prefix}/${version}/lib/ and MySQL uses ${prefix}/lib/mysql/plugin/.

How about adding support for a driver directory like those? If we use a MODULE library rather than a SHARED library (see #84), it's strange for adbc_driver_postgres.so (no lib prefix) to live in /usr/lib/.

[C] How about using MODULE library not SHARED library for driver?

Generally, a C/C++ plugin is implemented as a MODULE library.

See also https://cmake.org/cmake/help/latest/command/add_library.html#normal-libraries for MODULE and SHARED libraries.

For example, PostgreSQL's extensions and MySQL's storage engines are implemented as MODULE libraries.

If we use a MODULE library, the driver module file name is something like adbc_driver_postgresql.so on Linux. It doesn't have a lib prefix or a shared object version such as 900.0.0.

[Format] Formalize thread safety and concurrency guarantees

Things to consider:

  • What do underlying APIs provide (libpq, duckdb, sqlite, JDBC, ODBC, Flight SQL)
  • What do wrapper APIs expect (JDBC, ODBC, DBI, dbapi, Go's database library)

Example: libpq disallows concurrent queries through a single PGconn, so multiple AdbcStatements can't be used if they share a connection (and the semantics of that get murky anyway), but what should the behavior be?

[C] Provide a "just query" method

For the common case of executing a single SQL string, let's have a method on the connection object for executing the query directly, without the need for an intermediate Statement object.
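Something like the following, perhaps (the name is illustrative only):

// Hypothetical sketch: execute a single SQL string on a connection and receive
// the result set directly, with no intermediate AdbcStatement.
AdbcStatusCode AdbcConnectionQuery(struct AdbcConnection* connection,
                                   const char* query,
                                   struct ArrowArrayStream* out,
                                   struct AdbcError* error);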

ADBC/Ibis pain points

  • Need way to query driver and server version
  • Need handling of multiple databases within a connection (or: this is exposed as a 'catalog' already)

[Python] Implement DBAPI

Will make implementing Ibis backends easier, since we can reuse more of the code for SQLAlchemy-based backends.

[C][Java] Expand validation suites

  • Port existing tests to validation suite
  • Clean up C validation suite test helpers, see if they can't be structured better
  • Make sure API is better covered
  • Find some way to better cover error cases as well (e.g. failure of driver entrypoint function) - some of these tests may be specific to the driver manager
  • Test things using the driver struct as well (tests should use both the driver manager and the library directly)

[Format] Document compatibility goals

  • Document ABI compatibility goals in header
  • Set up ABI checker
  • Rename ADBC_VERSION_0_0_1 to ADBC_VERSION_1_0_0
  • Ensure AdbcDriver struct is all up-to-date

[C] How about introducing naming rule for entrypoint name?

Currently, driver implementers can use any name for the entrypoint function. How about introducing a naming rule for the entrypoint function, such as AdbcDriverInit() or ${driver_name}_init()? If we have a naming rule, users don't need to specify an entrypoint option.

We may not be able to use a constant name such as AdbcDriverInit() because LoadLibraryEx() doesn't have a local-binding flag like RTLD_LOCAL for dlopen().
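For illustration, on POSIX platforms a driver manager could derive the symbol name from the driver/library name instead of taking an entrypoint option (a sketch; names and error handling are simplified):

#include <dlfcn.h>
#include <stdio.h>

// Resolve a per-driver entrypoint such as ${driver_name}_init() by convention.
static void* load_driver_init(const char* library, const char* init_symbol) {
  void* handle = dlopen(library, RTLD_NOW | RTLD_LOCAL);
  if (!handle) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return NULL;
  }
  void* init = dlsym(handle, init_symbol);
  if (!init) fprintf(stderr, "dlsym failed: %s\n", dlerror());
  return init;
}

// e.g. load_driver_init("adbc_driver_postgresql.so", "adbc_driver_postgresql_init");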
