Querying Apache Arrow table about go-duckdb HOT 17 CLOSED

angadn avatar angadn commented on July 24, 2024 3
Querying Apache Arrow table

from go-duckdb.

Comments (17)

phillipleblanc avatar phillipleblanc commented on July 24, 2024 5

I've also implemented an Arrow query interface for go-duckdb in a fork: https://github.com/spicehq/go-duckdb/pull/1/files

It's not using ADBC, but the general approach might translate to calling the ADBC APIs as well. If you think it's useful, I can create a PR for it.

Also, here is a blog post that is relevant to this discussion: https://voltrondata.com/resources/zero-copy-sharing-using-apache-arrow-and-golang

angadn avatar angadn commented on July 24, 2024 3

@comunidadio I've opened a PR for DuckDB that should make it possible for us to do this without ADBC: duckdb/duckdb#7570

angadn avatar angadn commented on July 24, 2024 2

@marcboeker sounds good! Happy to float a PR to wrap around my two proposed APIs if they get accepted by DuckDB. On the ADBC front, it seems like quite a lift to me atm. Will update us here if that changes

angadn avatar angadn commented on July 24, 2024 2

DuckDB APIs are now merged to the feature branch. Should be in master in the next 2-3 weeks

angadn avatar angadn commented on July 24, 2024 1

Quick update here: going by the commit message for pdet/duckdb@c93833c, it seems like ADBC itself may internally consume the new C APIs. So we can likely punt supporting ADBC until it comes up again 🙂

levakin avatar levakin commented on July 24, 2024 1

Hi @marcboeker, since #134 was merged, maybe this issue can be closed?
Do you plan to release a new version soon?

marcboeker avatar marcboeker commented on July 24, 2024

If that works with DuckDB CLI or any other DuckDB library (e.g. Python, R) then it should work in Go too. I've never used Apache Arrow with DuckDB. I'm closing this. Feel free to reopen if you experience a problem when working with Apache Arrow tables in go-duckdb.

yevgenypats avatar yevgenypats commented on July 24, 2024

I'm also interested in that. I couldn't find an example of how to use it with the Go library. Here is a blog post that discusses it, though: https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/

angadn avatar angadn commented on July 24, 2024

If you look at the NodeJS bindings that implement a register_buffer method for Connection, you'll see that it internally calls a SQL function called scan_arrow_ipc.

$> duckdb
v0.6.1 919cad22e8
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D INSTALL arrow;
D LOAD arrow;
D SELECT * FROM duckdb_functions() WHERE function_name LIKE '%arrow%';
┌─────────────┬────────────────┬───────────────┬─────────────┬─────────────┬────────────────────┬───────────────────────────────────────┬─────────┬──────────────────┬──────────────────┬──────────────┐
│ schema_name │ function_name  │ function_type │ description │ return_type │     parameters     │            parameter_types            │ varargs │ macro_definition │ has_side_effects │ function_oid │
│   varchar   │    varchar     │    varchar    │   varchar   │   varchar   │     varchar[]      │               varchar[]               │ varchar │     varchar      │     boolean      │    int64     │
├─────────────┼────────────────┼───────────────┼─────────────┼─────────────┼────────────────────┼───────────────────────────────────────┼─────────┼──────────────────┼──────────────────┼──────────────┤
│ main        │ scan_arrow_ipc │ table         │             │             │ [col0]             │ [STRUCT(ptr UBIGINT, size UBIGINT)[]] │         │                  │                  │         1395 │
│ main        │ arrow_scan     │ table         │             │             │ [col0, col1, col2] │ [POINTER, POINTER, POINTER]           │         │                  │                  │           77 │
│ main        │ to_arrow_ipc   │ table         │             │             │ [col0]             │ [TABLE]                               │         │                  │                  │         1393 │
└─────────────┴────────────────┴───────────────┴─────────────┴─────────────┴────────────────────┴───────────────────────────────────────┴─────────┴──────────────────┴──────────────────┴──────────────┘
D

So it seems like this is possible if we pass to DuckDB a pointer to Arrow-encoded memory and its size. The JS bindings show us how to construct this string:

auto raw_ptr = reinterpret_cast<uint64_t>(arr.ArrayBuffer().Data());
auto length = (uint64_t)arr.ElementLength();

arrow_scan_function += "{'ptr': " + std::to_string(raw_ptr) + ", 'size': " + std::to_string(length) + "},";

My C++ is terrible, so I asked ChatGPT what the Go equivalent of this code looks like. It helpfully output this (quite possibly incorrect) code:

rawPtr := uintptr(unsafe.Pointer(&arr[0]))
length := uint64(len(arr))

arrowScanFunction += fmt.Sprintf("{'ptr': %d, 'size': %d},", rawPtr, length)

Hopefully, this points us in the right direction, and I'll take a stab at this soon!

angadn avatar angadn commented on July 24, 2024

Happy to report that the aforementioned approach works like a charm! 😄

marcboeker avatar marcboeker commented on July 24, 2024

@angadn Great to hear, thanks for the update. Yesterday I played around with querying Apache Arrow via IPC:

install arrow; load arrow; select * from scan_arrow_ipc([{'ptr':"0x11ed639b0", 'size':"9"}]);

For that, I loaded some data through Python into an in-memory buffer and tried to access it via its address:

<pyarrow.Buffer address=0x11ed639b0 size=9 is_cpu=True is_mutable=False>

This resulted in a segfault of DuckDB. Probably as the address is in the wrong format or the memory is not accessible from the DuckDB process.
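On the "wrong format" possibility: the function listing earlier in this thread types `ptr` and `size` as UBIGINT, so a hexadecimal string like the one pyarrow prints would presumably have to become a plain integer first. The following is a small Go sketch of that conversion only; `ipcStructLiteral` is a hypothetical helper name, and the sketch does nothing to make an address from another process valid inside DuckDB.

```go
package main

import (
	"fmt"
	"strconv"
)

// ipcStructLiteral is a hypothetical helper. It turns a hex address string
// (as printed by pyarrow) and a size into a {'ptr': ..., 'size': ...} literal
// whose values are plain decimal integers.
func ipcStructLiteral(hexAddr string, size uint64) (string, error) {
	// Base 0 makes ParseUint honour the "0x" prefix.
	ptr, err := strconv.ParseUint(hexAddr, 0, 64)
	if err != nil {
		return "", fmt.Errorf("parse address %q: %w", hexAddr, err)
	}
	return fmt.Sprintf("{'ptr': %d, 'size': %d}", ptr, size), nil
}

func main() {
	lit, err := ipcStructLiteral("0x11ed639b0", 9)
	if err != nil {
		panic(err)
	}
	fmt.Println("SELECT * FROM scan_arrow_ipc([" + lit + "]);")
}
```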

How have you loaded the data into mem and accessed it? Maybe we can add an additional loader that loads the data into mem and create a view on it with a custom name for convenience, like:

std::string final_query = "CREATE OR REPLACE TEMPORARY VIEW " + name + " AS SELECT * FROM " + arrow_scan_function;

angadn avatar angadn commented on July 24, 2024

> This resulted in a segfault of DuckDB. Probably as the address is in the wrong format or the memory is not accessible from the DuckDB process.

Are you running DuckDB in a separate process? In that case, you'll need some form of IPC with shared memory across the two processes.

In the past, this has worked best for me:

  1. Use memfd_create to create an anonymous FD
  2. Write to the FD
  3. Send FD over a Unix Domain Socket as a CMSG using sendmsg
  4. Receive FD with recvmsg
  5. mmap to the given FD and read it from the recipient process

There are some implementation differences between FreeBSD, Linux, and macOS here, but these basic IPC mechanics should hold across all the platforms. When you send an FD over a Unix Domain Socket, the kernel intermediates and "translates" the sent FD into one which the recipient process can read.
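The steps above can be sketched in Go with only the standard library. Two things in this sketch are assumptions made to keep it short and portable across Unix systems: a temporary file stands in for the anonymous FD from memfd_create, and both ends of the socket pair live in one process, whereas in practice the receiving end would be a separate process. The names `sendFD`, `recvFD`, and `roundTrip` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// sendFD passes fd over the Unix socket sock as an SCM_RIGHTS control message.
func sendFD(sock, fd int) error {
	return syscall.Sendmsg(sock, []byte{0}, syscall.UnixRights(fd), nil, 0)
}

// recvFD receives one descriptor sent with sendFD. The kernel installs a new
// descriptor number that is valid in the receiving process.
func recvFD(sock int) (int, error) {
	data := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one 32-bit descriptor
	_, oobn, _, _, err := syscall.Recvmsg(sock, data, oob, 0)
	if err != nil {
		return -1, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	if len(msgs) != 1 {
		return -1, fmt.Errorf("expected 1 control message, got %d", len(msgs))
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return -1, err
	}
	if len(fds) != 1 {
		return -1, fmt.Errorf("expected 1 descriptor, got %d", len(fds))
	}
	return fds[0], nil
}

// roundTrip writes payload to a file, sends the file's descriptor over a
// socket pair, maps the received descriptor, and returns a copy of the bytes.
// payload must be non-empty because a zero-length mapping is rejected.
func roundTrip(payload []byte) ([]byte, error) {
	f, err := os.CreateTemp("", "arrow-ipc-*") // stand-in for memfd_create
	if err != nil {
		return nil, err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.Write(payload); err != nil {
		return nil, err
	}

	pair, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return nil, err
	}
	defer syscall.Close(pair[0])
	defer syscall.Close(pair[1])

	if err := sendFD(pair[0], int(f.Fd())); err != nil {
		return nil, err
	}
	fd, err := recvFD(pair[1])
	if err != nil {
		return nil, err
	}
	defer syscall.Close(fd)

	mapped, err := syscall.Mmap(fd, 0, len(payload), syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return nil, err
	}
	defer syscall.Munmap(mapped)
	return append([]byte(nil), mapped...), nil
}

func main() {
	out, err := roundTrip([]byte("bytes shared through a passed descriptor"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", out)
}
```

The copy at the end of `roundTrip` exists only so the sketch can unmap before returning; a consumer that wants to avoid copying would keep the mapping alive and read from it directly.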

> How have you loaded the data into mem and accessed it?

I used the Arrow IPC package to serialise my records into a []byte. I cannot think of a way to do it without this step. If there's some way you can reference the underlying Go structs directly, that's amazing because eliminating this step will make it truly zero-copy in user-space

> Maybe we can add an additional loader that loads the data into mem and create a view on it with a custom name for convenience, like:
>
> std::string final_query = "CREATE OR REPLACE TEMPORARY VIEW " + name + " AS SELECT * FROM " + arrow_scan_function;

Yes. IMO we can also just keep it in Go and use the existing go-duckdb query interface to do what we want 😄
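As a rough illustration of keeping this in Go, the sketch below builds the same view statement as the C++ line quoted above. `createArrowViewSQL` and `quoteIdent` are hypothetical helper names, and the scan expression in `main` uses placeholder values rather than a real buffer address.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent wraps name in double quotes and doubles any embedded quotes,
// the standard SQL way of quoting an identifier.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// createArrowViewSQL is a hypothetical helper that mirrors the C++ snippet:
// it names a temporary view over an already-built scan expression.
func createArrowViewSQL(name, scanExpr string) string {
	return fmt.Sprintf("CREATE OR REPLACE TEMPORARY VIEW %s AS SELECT * FROM %s",
		quoteIdent(name), scanExpr)
}

func main() {
	// Placeholder pointer and size, not a real Arrow IPC buffer.
	scan := "scan_arrow_ipc([{'ptr': 1, 'size': 2}])"
	fmt.Println(createArrowViewSQL("my_arrow_table", scan))
}
```

The resulting string could then be passed to the usual database/sql `Exec` call on a go-duckdb connection, after which the view can be queried by name for as long as the underlying buffer stays alive.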

angadn avatar angadn commented on July 24, 2024

@marcboeker I came across the arrow_scan function, and I realise now that we could theoretically dodge the IPC serialisation penalty if the Arrow structures are initialised using the C Data Interface.

The function seems to require 3 pointers, but I'm unable to find any example usages to do a quick POC. Digging deeper now!

comunidadio avatar comunidadio commented on July 24, 2024

DuckDB 0.8.0 introduced support for ADBC (Arrow Database Connectivity).
Perhaps we should try https://github.com/apache/arrow-adbc/tree/main/go/adbc - not sure whether that's worth integrating into go-duckdb or keeping separate?

angadn avatar angadn commented on July 24, 2024

> DuckDB 0.8.0 introduced support for ADBC (Arrow Database Connectivity).
> Perhaps we should try https://github.com/apache/arrow-adbc/tree/main/go/adbc - not sure whether that's worth integrating into go-duckdb or keeping separate?

@comunidadio just had another look at this - seems to me like for this to work, we'd have to write a C-to-C++ bridge to call the DuckDB ADBC functions, and then wrap them in go-duckdb with CGO? It does look like a clean and future-proof alternative to me, and looks like it should be zero-copy as well!

It does however seem like it'll be a greater lift than wrapping around the two APIs I proposed (if they get accepted). @marcboeker would love to have your thoughts on how much effort the ADBC approach looks like/if my thinking is right! 🙂

marcboeker avatar marcboeker commented on July 24, 2024

@angadn Thanks for digging into this. TBH I've never worked with Apache Arrow, so I can't really comment on the best approach on how to interface with Apache Arrow. If you would like to have Arrow support in go-duckdb, it would be great if you can provide the PR, as this is not really on my list of upcoming features.

catkins avatar catkins commented on July 24, 2024

@phillipleblanc just had a look at your fork, nice work.

It would be neat to try and land some of that back into this repo, potentially as a nested module so that folks could opt out of pulling in all the arrow parts.
