Comments (7)
Oh, I misunderstood: it's inside a nested vector. I guess copying those would do it?
df = DataFrame(Arrow.Table("/tmp/test.arrow"); copycols=true);
transform!(df, :foo => ByRow(copy) => :foo)
from arrow-julia.
The workaround is to ask DataFrames to copy the columns:
DataFrame(Arrow.Table("/tmp/test.arrow")); copycols=true)
The reason for the current behavior is:
- Arrow.Table exposes an immutable view of the underlying byte buffer (e.g. for 0-copy reads from mmap'd data)
- DataFrame accepts arbitrary vectors as columns (again, to support things like 0-copy reads)
- the naive composition therefore results in immutable columns and confusing errors
(not saying it is ideal, just how/why we got here)
From the perspective of Arrow, a Vector{Vector{}} is stored as a content vector and an offset vector, similar to how https://github.com/JuliaArrays/ArraysOfArrays.jl works.
Now, if it actually used that, the push!() would have worked just fine, but instead Arrow.jl is doing something on its own.
Btw, if you're interested in a fully systematic way of dealing with Arrow-like schema, https://github.com/JuliaHEP/AwkwardArray.jl is something we're prototyping.
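To make the content/offset idea concrete, here is a minimal sketch in plain Julia. It is not Arrow.jl's actual implementation, and the names content, offsets, rows, and owned are hypothetical. It shows that each row is a view into one shared buffer, so it cannot grow, while copying a row yields an independent Vector that can.

```julia
# Hypothetical toy data: three rows of a nested integer column.
content = [1, 2, 3, 4, 5, 6]   # all elements stored back to back
offsets = [0, 2, 3, 6]         # row i spans content[offsets[i]+1 : offsets[i+1]]

# Build each row as a view into the shared buffer (no copying).
rows = [view(content, offsets[i]+1:offsets[i+1]) for i in 1:length(offsets)-1]

# A view only references a fixed range of `content`, so it cannot grow.
push_failed = try
    push!(rows[1], 99)
    false
catch
    true
end

# `copy` materializes each view as an independent Vector{Int}, which can grow.
owned = map(copy, rows)
push!(owned[1], 99)
```

Copying row by row is the same idea as applying copy to every element of the nested column: the shared buffer is left untouched and only the copies are mutated.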
Now, if it actually used that, the push!() would have worked just fine, but instead Arrow.jl is doing something on its own.
I don't think that's really accurate. The issue isn't the in-memory layout; it's that Arrow.Table's columns are deliberately immutable, since they are static views into the underlying bytes that back the table.
When compression is involved, it won't be purely mmap'd. In general I agree; I'm saying that if the resulting table used that layout, push!() would have worked. But it would likely be immutable out of the gate however we implement it.
Right, I'm not saying it's always mmap'd; that was just an example. I'm saying Arrow.Table always has immutable columns in the current design of this package.
Thanks for the quick comments!
The workaround is to ask DataFrames to copy the columns:
DataFrame(Arrow.Table("/tmp/test.arrow")); copycols=true)
Hmm, I don't see any effect of that here:
julia> typeof(df.foo)
Vector{Vector{Int64}} (alias for Array{Array{Int64, 1}, 1})
julia> Arrow.write("/tmp/test.arrow", df);
julia> df2 = DataFrame(Arrow.Table("/tmp/test.arrow"); copycols=true);
julia> typeof(df2.foo)
Vector{SubArray{Int64, 1, Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}} (alias for Array{SubArray{Int64, 1, Arrow.Primitive{Int64, Array{Int64, 1}}, Tuple{UnitRange{Int64}}, true}, 1})
The snippet you posted is a little ambiguous (its parentheses are unbalanced), but additionally calling copy or DataFrame with copycols=true (which seems to be the default for copy anyway) doesn't help either:
julia> df2 = DataFrame(DataFrame(Arrow.Table("/tmp/test.arrow")); copycols=true);
julia> typeof(df2.foo)
Vector{SubArray{Int64, 1, Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}} (alias for Array{SubArray{Int64, 1, Arrow.Primitive{Int64, Array{Int64, 1}}, Tuple{UnitRange{Int64}}, true}, 1})
julia> df2 = copy(DataFrame(Arrow.Table("/tmp/test.arrow")); copycols=true);
julia> typeof(df2.foo)
Vector{SubArray{Int64, 1, Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}} (alias for Array{SubArray{Int64, 1, Arrow.Primitive{Int64, Array{Int64, 1}}, Tuple{UnitRange{Int64}}, true}, 1})