Comments (4)
It also fails with `compress=:zstd`.
from arrow-julia.
I have the same problem: writing works without compression but fails with it. If I allow only a single thread it doesn't fail, so perhaps compression is not thread-safe?
```julia
using Random
using DataFrames
using Arrow
using Tables

# Return the inclusive ID range covered by batch `i`.
function nextidrange(minId, maxId, batchsize, i)
    fromId = minId + batchsize * (i - 1)
    toId = min(maxId, (minId + batchsize * i) - 1)
    return fromId, toId
end

minId = 1
maxId = 1000
idrange = (maxId - minId) + 1
df = DataFrame(ID=minId:maxId, B=rand(idrange), C=randstring.(fill(5, idrange)));
batchsize = 100
numbatches = ceil(Int32, idrange / batchsize)

# Split the DataFrame into views of `batchsize` rows each.
partitions = Array{SubDataFrame}(undef, 0)
for i = 1:numbatches
    fromId, toId = nextidrange(minId, maxId, batchsize, i)
    push!(partitions, filter([:ID] => x -> fromId <= x <= toId, df; view = true))
end

# Write the partitions as zstd-compressed record batches.
io = IOBuffer()
Arrow.write(io, Tables.partitioner(partitions), compress=:zstd)
seekstart(io)

# Read the record batches back and report the row count of each.
recordbatches = Arrow.Stream(io)
ab = Array{DataFrame}(undef, 0)
for b in recordbatches
    bt = b |> DataFrame
    println("Rows = $(nrow(bt))")
    push!(ab, bt)
end
```
Thanks for the reports @kobusherbst and @altre; the compression machinery is indeed not thread-safe. I've mostly resolved that in my local branch, but #108 is also interacting with my testing, so I'm trying to solve both issues in one go to get threaded writing working reliably. Sorry for the slowness, but I think I'm getting close.
Thank you @quinnj. Both issues are deal breakers for me, as I have to work with huge datasets of more than 600 million rows.