explanation of Arrow.Stream vs. Arrow.Table seems ambiguous about arrow-julia HOT 3 OPEN

bdklahn commented on September 23, 2024

explanation of Arrow.Stream vs. Arrow.Table seems ambiguous

from arrow-julia.

Comments (3)

Moelf commented on September 23, 2024

While Arrow.Table will iterate all record batches in an arrow file/stream, concatenating columns

this is saying it iterates AND concatenate, so holding the entire table in RAM (especially a problem if you're dealing with compressed file).

from arrow-julia.

bdklahn commented on September 23, 2024

While Arrow.Table will iterate all record batches in an arrow file/stream, concatenating columns

this is saying it iterates AND concatenate, so holding the entire table in RAM (especially a problem if you're dealing with compressed file).

That's part of the problem: The documentation is NOT saying that. You are.

After re-reading a dozen times, I think I might understand what this means to say.

I think it is partly confusing to use the word "iterate". That is a verb, which indicates something is now happening (e.g when a Table or Stream object are instantiated).
It might make more sense to say "iterator" (noun). That is an object like a generator or like a file handle which points to a position on disk (or memory), and has some state knowledge of how far to jump ahead, for each iteration.

I believe when a Table is instantiated, it presents a view where all the batches appear as if there is a single "batch" (one table). In this case, an iterator might be constructed to have a step size of only one record (e.g. row). When a Stream object is constructed, it looks like each step will produce all the records in a batch, and wrap them as a Table. Each iteration will produce a new table (as the docs indicate).

I understand (from deduction and experience) that having compressed arrow will require an additional "buffer" of the actual binary format, because compressed bits aren't a memmap-optimized data form. So, yes, if you use compressed arrow, loading in batch by batch can mitigate this. But if you are using Arrow, at all, it almost doesn't make any sense to compress (even, e.g. the lz4 compressed feather format). If you are doing any compression, maybe just use parquet. So, given normal (uncompressed) Arrow, an "entire table in RAM" should not be an issue. That's one of the main purposes of using Arrow, in the first place: "out of memory" processing. I would think the memory-loaded schema size for Table would not be significantly bigger than that of each tabular batch (if at all).

Anyway, in terms of the Table interface, it might even be bad practice to even mention iteration. Typically you want to encourage thinking about things in terms of vectorization. (e.g. broadcasting over as many rows at once) vs. any implication of processing anything row by row. -at least for DataFrame style interfacing. For Stream, it makes sense to mention iteration (for the comparison context), because the "processing" here is the actual process of of loading (and unloading) batches of records in and out of memory.

I'm not sure if something a little more like this would make sense (and is accurate):

". . . While Arrow.Table provides an interface where all record batches appear vertically concatenated into a single table, Arrow.Stream creates an iterator, where each iteration produces a separate table for from each batch. A Stream can be helpful for large compressed (e.g. lz4-compressed feather files)", where decompressed arrow data will need to be buffered in memory. The buffer would only need to accommodate one batch at a time, vs. all the batches at once as would be the case with Arrow.Table."

That could probably be simplified. I left out the "concatenating columns", because I'm not sure how relevant or distracting that might be, in this context. I mean . . . that's already implicit in the definition of an Arrow batch.

from arrow-julia.

Moelf commented on September 23, 2024

That's part of the problem: The documentation is NOT saying that. You are.

beat me, ever since this project merged into Apache monorepo, it's impossible to get anything through in a responsive manner, matter of fact the "doc is not rendering up to date" took what I feels like almost a year to address. So sorry, I can only add information in github issues :)

from arrow-julia.

explanation of Arrow.Stream vs. Arrow.Table seems ambiguous about arrow-julia HOT 3 OPEN

Comments (3)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent