Comments (3)
"While Arrow.Table will iterate all record batches in an arrow file/stream, concatenating columns"
This is saying it iterates AND concatenates, so it holds the entire table in RAM (especially a problem if you're dealing with a compressed file).
from arrow-julia.
"While Arrow.Table will iterate all record batches in an arrow file/stream, concatenating columns"
"this is saying it iterates AND concatenate, so holding the entire table in RAM (especially a problem if you're dealing with compressed file)."
That's part of the problem: The documentation is NOT saying that. You are.
After re-reading a dozen times, I think I might understand what this means to say.
I think part of the confusion comes from the word "iterate". That is a verb, which suggests something is happening right now (e.g. when a Table or Stream object is instantiated).
It might make more sense to say "iterator" (noun). That is an object, like a generator or a file handle, that points to a position on disk (or in memory) and keeps some state about how far to jump ahead on each iteration.
I believe when a Table is instantiated, it presents a view where all the batches appear as if there is a single "batch" (one table). In this case, an iterator might be constructed to have a step size of only one record (e.g. row). When a Stream object is constructed, it looks like each step will produce all the records in a batch, and wrap them as a Table. Each iteration will produce a new table (as the docs indicate).
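To make that concrete, here is a minimal sketch of the difference as I understand it, assuming Arrow.jl and Tables.jl are installed. The column name `id` and the temporary file are made up for illustration, and it only shows what the two objects look like from the outside, not how they are implemented.

```julia
using Arrow, Tables

# Hypothetical data: three small column tables, written as three record batches.
path = tempname() * ".arrow"
parts = Tables.partitioner([(id = [1, 2, 3],), (id = [4, 5, 6],), (id = [7, 8, 9],)])
Arrow.write(path, parts)

# Arrow.Table: a single table whose columns span every record batch.
tbl = Arrow.Table(path)
n_table = length(tbl.id)

# Arrow.Stream: an iterator; each step yields one record batch as its own Arrow.Table.
batch_lengths = Int[]
for batch in Arrow.Stream(path)
    push!(batch_lengths, length(batch.id))
end
```

Here `n_table` counts the rows of the one combined table, while `batch_lengths` gets one entry per record batch.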
I understand (from deduction and experience) that having compressed arrow will require an additional "buffer" of the actual binary format, because compressed bits aren't a memmap-optimized data form. So, yes, if you use compressed arrow, loading in batch by batch can mitigate this. But if you are using Arrow, at all, it almost doesn't make any sense to compress (even, e.g. the lz4 compressed feather format). If you are doing any compression, maybe just use parquet. So, given normal (uncompressed) Arrow, an "entire table in RAM" should not be an issue. That's one of the main purposes of using Arrow, in the first place: "out of memory" processing. I would think the memory-loaded schema size for Table would not be significantly bigger than that of each tabular batch (if at all).
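For the compressed case, a sketch of the batch-by-batch approach might look like the following. The helper `streamed_sum`, the column name `x`, and the data are hypothetical. Whether this really lowers peak memory depends on Arrow.jl's internals and the garbage collector, which I have not measured.

```julia
using Arrow, Tables

# Hypothetical data: two record batches, written with lz4 compression.
path = tempname() * ".arrow"
parts = Tables.partitioner([(x = collect(1:1000),), (x = collect(1001:2000),)])
Arrow.write(path, parts; compress = :lz4)

# Reduce one record batch at a time instead of materializing the whole table.
function streamed_sum(path::AbstractString, col::Symbol)
    total = 0
    for batch in Arrow.Stream(path)
        # Each `batch` is an Arrow.Table holding a single record batch.
        total += sum(Tables.getcolumn(batch, col))
    end
    return total
end

total = streamed_sum(path, :x)
```

The same reduction on `Arrow.Table(path)` should give the same answer. The only difference is how many batches are in use at once.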
Anyway, in terms of the Table interface, it might be bad practice to mention iteration at all. Typically you want to encourage thinking in terms of vectorization (e.g. broadcasting over as many rows as possible at once) rather than implying that anything is processed row by row, at least for DataFrame-style interfacing. For Stream, it makes sense to mention iteration (for the comparison context), because the "processing" here is the actual process of loading (and unloading) batches of records in and out of memory.
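As a small illustration of the two styles on a Table (the column names `price` and `qty` are made up, and this is just a sketch, not a recommendation from the Arrow.jl docs):

```julia
using Arrow, Tables

# Hypothetical two-column table written to a temporary file.
path = tempname() * ".arrow"
Arrow.write(path, (price = [1.5, 2.0, 4.0], qty = [2, 1, 3]))
tbl = Arrow.Table(path)

# Column-at-a-time (vectorized) style: broadcast over whole columns.
revenue = tbl.price .* tbl.qty

# Row-by-row style, shown only for comparison.
revenue_rows = [row.price * row.qty for row in Tables.rows(tbl)]
```

Both give the same values. The first form is the one I would expect the Table docs to encourage.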
I'm not sure if something a little more like this would make sense (and is accurate):
". . . While Arrow.Table provides an interface where all record batches appear vertically concatenated into a single table, Arrow.Stream creates an iterator, where each iteration produces a separate table from each batch. A Stream can be helpful for large compressed files (e.g. lz4-compressed feather files), where decompressed arrow data will need to be buffered in memory. The buffer would only need to accommodate one batch at a time, vs. all the batches at once as would be the case with Arrow.Table."
That could probably be simplified. I left out the "concatenating columns", because I'm not sure how relevant or distracting that might be, in this context. I mean . . . that's already implicit in the definition of an Arrow batch.
"That's part of the problem: The documentation is NOT saying that. You are."
Beats me. Ever since this project merged into the Apache monorepo, it's been impossible to get anything through in a responsive manner; as a matter of fact, the "doc is not rendering up to date" issue took what feels like almost a year to address. So sorry, I can only add information in GitHub issues :)