Comments (3)

Moelf avatar Moelf commented on September 23, 2024

While Arrow.Table will iterate all record batches in an arrow file/stream, concatenating columns

This is saying that it iterates AND concatenates, so it holds the entire table in RAM (especially a problem if you're dealing with a compressed file).

from arrow-julia.

bdklahn avatar bdklahn commented on September 23, 2024

While Arrow.Table will iterate all record batches in an arrow file/stream, concatenating columns

This is saying that it iterates AND concatenates, so it holds the entire table in RAM (especially a problem if you're dealing with a compressed file).

That's part of the problem: The documentation is NOT saying that. You are.

After re-reading a dozen times, I think I might understand what this means to say.

I think it is partly confusing to use the word "iterate". That is a verb, which suggests something happens right away (e.g., when a Table or Stream object is instantiated).
It might make more sense to say "iterator" (noun). That is an object, like a generator or a file handle, that points to a position on disk (or in memory) and keeps some state about how far to jump ahead on each iteration.

I believe when a Table is instantiated, it presents a view where all the batches appear as if there is a single "batch" (one table). In this case, an iterator might be constructed to have a step size of only one record (e.g. row). When a Stream object is constructed, it looks like each step will produce all the records in a batch, and wrap them as a Table. Each iteration will produce a new table (as the docs indicate).
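The distinction described above can be sketched in a language-neutral way. The following is only a conceptual model in plain Python, not Arrow.jl's API: `Batch`, `make_batches`, `as_table`, and `as_stream` are hypothetical names invented for illustration, and the sketch does not try to settle whether the real library copies columns or chains them lazily.

```python
from itertools import chain
from typing import Dict, Iterator, List

# Hypothetical stand-in for one record batch: column name -> values.
Batch = Dict[str, List[int]]


def make_batches() -> List[Batch]:
    # Made-up data: three record batches that share one schema.
    return [
        {"id": [1, 2], "x": [10, 20]},
        {"id": [3], "x": [30]},
        {"id": [4, 5], "x": [40, 50]},
    ]


def as_table(batches: List[Batch]) -> Batch:
    """Table-like view: each column is every batch's chunk joined end to end."""
    names = batches[0].keys()
    return {name: list(chain.from_iterable(b[name] for b in batches)) for name in names}


def as_stream(batches: List[Batch]) -> Iterator[Batch]:
    """Stream-like view: a lazy iterator that yields one small table per batch."""
    for batch in batches:
        yield batch


table = as_table(make_batches())
per_batch_rows = [len(t["id"]) for t in as_stream(make_batches())]
```

In this model the table-like view exposes five rows as one column set, while the stream-like view hands back three separate tables of two, one, and two rows.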

I understand (from deduction and experience) that compressed Arrow data requires an additional buffer holding the decompressed binary format, because compressed bytes can't be used directly through a memory map. So, yes, if you use compressed Arrow, loading it batch by batch can mitigate this. But if you are using Arrow at all, it almost doesn't make sense to compress (even with, e.g., the lz4-compressed Feather format). If you want compression, maybe just use Parquet. So, given normal (uncompressed) Arrow, an "entire table in RAM" should not be an issue. That's one of the main purposes of using Arrow in the first place: "out of memory" (larger-than-RAM) processing. I would think the memory-loaded schema size for a Table would not be significantly bigger than that of each tabular batch (if at all).
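The buffering argument can be illustrated with a rough sketch. It is written under loud assumptions: `zlib` from the Python standard library stands in for lz4, the batches are made-up byte strings rather than Arrow record batches, and `compress_batches`, `decompress_all`, and `decompress_streaming` are hypothetical helpers, not part of any Arrow library.

```python
import zlib
from typing import Iterator, List


def compress_batches(batches: List[bytes]) -> List[bytes]:
    # Compress each batch independently, as a per-batch codec would.
    return [zlib.compress(b) for b in batches]


def decompress_all(compressed: List[bytes]) -> int:
    """Decompress every batch up front; return the bytes held at the same time."""
    buffers = [zlib.decompress(c) for c in compressed]
    return sum(len(b) for b in buffers)


def decompress_streaming(compressed: List[bytes]) -> Iterator[bytes]:
    """Decompress one batch per iteration; earlier buffers can be released."""
    for c in compressed:
        yield zlib.decompress(c)


# Made-up batches of 1000, 4000 and 2000 zero bytes.
raw = [bytes(1000), bytes(4000), bytes(2000)]
compressed = compress_batches(raw)

all_at_once = decompress_all(compressed)
peak_streaming = max(len(buf) for buf in decompress_streaming(compressed))
```

With these sizes the all-at-once path holds 7000 decompressed bytes together, while the streaming path never needs more than the largest single batch (4000 bytes).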

Anyway, in terms of the Table interface, it might be bad practice to even mention iteration. Typically you want to encourage thinking in terms of vectorization (e.g., broadcasting over as many rows at once as possible) rather than implying that anything is processed row by row, at least for DataFrame-style interfaces. For Stream, it makes sense to mention iteration (for the comparison context), because the "processing" here is the actual process of loading (and unloading) batches of records in and out of memory.

I'm not sure if something a little more like this would make sense (and is accurate):

". . . While Arrow.Table provides an interface where all record batches appear vertically concatenated into a single table, Arrow.Stream creates an iterator, where each iteration produces a separate table from each batch. A Stream can be helpful for large compressed files (e.g. lz4-compressed feather files), where decompressed arrow data will need to be buffered in memory. The buffer would only need to accommodate one batch at a time, vs. all the batches at once as would be the case with Arrow.Table."

That could probably be simplified. I left out the "concatenating columns", because I'm not sure how relevant or distracting that might be, in this context. I mean . . . that's already implicit in the definition of an Arrow batch.

Moelf avatar Moelf commented on September 23, 2024

That's part of the problem: The documentation is NOT saying that. You are.

Beats me. Ever since this project merged into the Apache monorepo, it has been impossible to get anything through in a responsive manner. As a matter of fact, the "doc is not rendering up to date" issue took what feels like almost a year to address. So sorry, I can only add information in GitHub issues :)
