Comments (17)
I've also implemented an Arrow query interface for go-duckdb in a fork: https://github.com/spicehq/go-duckdb/pull/1/files
It's not using ADBC, but the general approach might translate to calling the ADBC APIs as well. If you think it's useful, I can create a PR for it.
Also, here is a blog post that is relevant to this discussion: https://voltrondata.com/resources/zero-copy-sharing-using-apache-arrow-and-golang
from go-duckdb.
@comunidadio I've opened a PR for DuckDB that should make it possible for us to do this without ADBC: duckdb/duckdb#7570
from go-duckdb.
@marcboeker sounds good! Happy to float a PR to wrap around my two proposed APIs if they get accepted by DuckDB. On the ADBC front, it seems like quite a lift to me atm. Will update us here if that changes
from go-duckdb.
DuckDB APIs are now merged to the feature branch. Should be in master in the next 2-3 weeks.
from go-duckdb.
Quick update here: going by the commit message for pdet/duckdb@c93833c, it seems like ADBC itself may internally consume the new C APIs. So we can likely punt supporting ADBC until it comes up again.
from go-duckdb.
Hi @marcboeker, since #134 was merged, maybe this issue can be closed?
Do you have a plan to release a new version soon?
from go-duckdb.
If that works with DuckDB CLI or any other DuckDB library (e.g. Python, R) then it should work in Go too. I've never used Apache Arrow with DuckDB. I'm closing this. Feel free to reopen if you experience a problem when working with Apache Arrow tables in go-duckdb.
from go-duckdb.
I'm also interested in that. I couldn't find an example of how to use it with the Go library. Here is a recent blog post discussing it, though: https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/
from go-duckdb.
If you look at the NodeJS bindings that implement a register_buffer method for Connection, you'll see that it internally calls a SQL function called scan_arrow_ipc.
$> duckdb
v0.6.1 919cad22e8
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D INSTALL arrow;
D LOAD arrow;
D SELECT * FROM duckdb_functions() WHERE function_name LIKE '%arrow%';
┌─────────────┬────────────────┬───────────────┬─────────────┬─────────────┬────────────────────┬───────────────────────────────────────┬─────────┬──────────────────┬──────────────────┬──────────────┐
│ schema_name │ function_name  │ function_type │ description │ return_type │ parameters         │ parameter_types                       │ varargs │ macro_definition │ has_side_effects │ function_oid │
│ varchar     │ varchar        │ varchar       │ varchar     │ varchar     │ varchar[]          │ varchar[]                             │ varchar │ varchar          │ boolean          │ int64        │
├─────────────┼────────────────┼───────────────┼─────────────┼─────────────┼────────────────────┼───────────────────────────────────────┼─────────┼──────────────────┼──────────────────┼──────────────┤
│ main        │ scan_arrow_ipc │ table         │             │             │ [col0]             │ [STRUCT(ptr UBIGINT, size UBIGINT)[]] │         │                  │                  │ 1395         │
│ main        │ arrow_scan     │ table         │             │             │ [col0, col1, col2] │ [POINTER, POINTER, POINTER]           │         │                  │                  │ 77           │
│ main        │ to_arrow_ipc   │ table         │             │             │ [col0]             │ [TABLE]                               │         │                  │                  │ 1393         │
└─────────────┴────────────────┴───────────────┴─────────────┴─────────────┴────────────────────┴───────────────────────────────────────┴─────────┴──────────────────┴──────────────────┴──────────────┘
D
So it seems like this is possible if we pass to DuckDB a pointer to Arrow-encoded memory and its size. The JS bindings show us how to construct this string:
auto raw_ptr = reinterpret_cast<uint64_t>(arr.ArrayBuffer().Data());
auto length = (uint64_t)arr.ElementLength();
arrow_scan_function += "{'ptr': " + std::to_string(raw_ptr) + ", 'size': " + std::to_string(length) + "},";
My C++ is terrible, so I asked ChatGPT what the Golang equivalent of this code looks like. It helpfully outputted this (quite possibly incorrect) code:
rawPtr := uintptr(unsafe.Pointer(&arr[0]))
length := uint64(len(arr))
arrowScanFunction += fmt.Sprintf("{'ptr': %d, 'size': %d},", rawPtr, length)
Hopefully, this points us in the right direction, and I'll take a stab at this soon!
from go-duckdb.
Happy to report that the aforementioned approach works like a charm!
from go-duckdb.
@angadn Great to hear, thanks for the update. Yesterday I played around with querying Apache Arrow via IPC:
install arrow; load arrow; select * from scan_arrow_ipc([{'ptr':"0x11ed639b0", 'size':"9"}]);
For this, I loaded some data through Python into an in-memory buffer and tried to access it via its address:
<pyarrow.Buffer address=0x11ed639b0 size=9 is_cpu=True is_mutable=False>
This resulted in a segfault of DuckDB. Probably as the address is in the wrong format or the memory is not accessible from the DuckDB process.
How have you loaded the data into mem and accessed it? Maybe we can add an additional loader that loads the data into mem and create a view on it with a custom name for convenience, like:
std::string final_query = "CREATE OR REPLACE TEMPORARY VIEW " + name + " AS SELECT * FROM " + arrow_scan_function;
from go-duckdb.
This resulted in a segfault of DuckDB. Probably as the address is in the wrong format or the memory is not accessible from the DuckDB process.
Are you running DuckDB in a separate process? In that case, you'll need some IPC with shared-memory across the two processes
In the past, this has worked best for me:
- Use memfd_create to create an anonymous FD
- Write to the FD
- Send FD over a Unix Domain Socket as a CMSG using sendmsg
- Receive FD with recvmsg
- mmap to the given FD and read it from the recipient process
There are some implementation differences between FreeBSD, Linux, and macOS here, but these basic IPC mechanics should hold across all the platforms. When you send an FD over a Unix Domain Socket, the kernel intermediates and "translates" the sent FD into one which the recipient process can read.
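The steps above can be sketched in Go with only the standard syscall package. This is a rough illustration under a few assumptions: it uses a temporary file as a stand-in for memfd_create (which the standard library does not wrap directly), it runs both ends of a socketpair inside one process instead of two, and the names sendFD, recvFD, and roundTrip are invented for this example. It should build on Linux and macOS.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// sendFD passes fd over the Unix socket sock as an SCM_RIGHTS control message.
func sendFD(sock, fd int) error {
	rights := syscall.UnixRights(fd)
	// One byte of ordinary payload travels with the control message.
	return syscall.Sendmsg(sock, []byte{0}, rights, nil, 0)
}

// recvFD reads one file descriptor from the Unix socket sock.
func recvFD(sock int) (int, error) {
	buf := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for a single int32 FD
	_, oobn, _, _, err := syscall.Recvmsg(sock, buf, oob, 0)
	if err != nil {
		return -1, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	if len(msgs) != 1 {
		return -1, errors.New("expected exactly one control message")
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return -1, err
	}
	if len(fds) != 1 {
		return -1, errors.New("expected exactly one file descriptor")
	}
	return fds[0], nil
}

// roundTrip writes payload to a file, passes its FD through a socketpair,
// maps the received FD, and returns a copy of the mapped bytes.
func roundTrip(payload []byte) ([]byte, error) {
	if len(payload) == 0 {
		return nil, errors.New("cannot mmap an empty payload")
	}
	// Stand-in for memfd_create: a temporary file on disk.
	f, err := os.CreateTemp("", "arrow-ipc-*")
	if err != nil {
		return nil, err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.Write(payload); err != nil {
		return nil, err
	}

	pair, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return nil, err
	}
	defer syscall.Close(pair[0])
	defer syscall.Close(pair[1])

	if err := sendFD(pair[0], int(f.Fd())); err != nil {
		return nil, err
	}
	fd, err := recvFD(pair[1])
	if err != nil {
		return nil, err
	}
	defer syscall.Close(fd)

	mapped, err := syscall.Mmap(fd, 0, len(payload), syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return nil, err
	}
	defer syscall.Munmap(mapped)

	out := make([]byte, len(mapped))
	copy(out, mapped)
	return out, nil
}

func main() {
	got, err := roundTrip([]byte("placeholder for Arrow IPC bytes"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("read %d bytes through the passed FD\n", len(got))
}
```

In a real two-process setup, the sender and receiver would each hold one end of a connected Unix domain socket, and the receiver would hand the mapped address and size to DuckDB instead of copying the bytes out.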
How have you loaded the data into mem and accessed it?
I used the Arrow IPC package to serialise my records into a []byte. I cannot think of a way to do it without this step. If there's some way you can reference the underlying Go structs directly, that's amazing because eliminating this step will make it truly zero-copy in user-space.
Maybe we can add an additional loader that loads the data into mem and create a view on it with a custom name for convenience, like:
std::string final_query = "CREATE OR REPLACE TEMPORARY VIEW " + name + " AS SELECT * FROM " + arrow_scan_function;
Yes. IMO we can also just keep it in Go and use the existing go-duckdb query interface to do what we want.
from go-duckdb.
@marcboeker I came across the arrow_scan function, and I realise now that we could theoretically dodge the IPC serialisation penalty if the Arrow structures are initialised using the C Data Interface.
The function seems to require 3 pointers, but I'm unable to find any example usages to do a quick POC. Digging deeper now!
from go-duckdb.
DuckDB 0.8.0 introduced support for ADBC (Arrow Database Connectivity).
Perhaps we should try https://github.com/apache/arrow-adbc/tree/main/go/adbc - not sure if that's worth integrating in go-duckdb or keeping separate?
from go-duckdb.
DuckDB 0.8.0 introduced support for ADBC (Arrow Database Connectivity).
Perhaps we should try https://github.com/apache/arrow-adbc/tree/main/go/adbc - not sure if that's worth integrating in go-duckdb or keeping separate?
@comunidadio just had another look at this - seems to me like for this to work, we'd have to write a C-to-C++ bridge to call the DuckDB ADBC functions, and then wrap them in go-duckdb with CGO? It does look like a clean and future-proof alternative to me, and looks like it should be zero-copy as well!
It does however seem like it'll be a greater lift than wrapping around the two APIs I proposed (if they get accepted). @marcboeker would love to have your thoughts on how much effort the ADBC approach would take, and whether my thinking is right!
from go-duckdb.
@angadn Thanks for digging into this. TBH I've never worked with Apache Arrow, so I can't really comment on the best approach for interfacing with it. If you would like to have Arrow support in go-duckdb, it would be great if you could provide the PR, as this is not really on my list of upcoming features.
from go-duckdb.
@phillipleblanc just had a look at your fork, nice work.
It would be neat to try and land some of that back into this repo, potentially as a nested module so that folks could opt out of pulling in all the arrow parts.
from go-duckdb.