arrow-julia's Introduction

Arrow

This is a pure Julia implementation of the Apache Arrow data standard. This package provides Julia AbstractVector objects for referencing data that conforms to the Arrow standard. This allows users to seamlessly interface Arrow formatted data with a great deal of existing Julia code.

Please see this document for a description of the Arrow memory layout.

Installation

The package can be installed by typing the following in a Julia REPL:

julia> using Pkg; Pkg.add("Arrow")

Local Development

When developing on Arrow.jl it is recommended that you run the following to ensure that any changes to ArrowTypes.jl are immediately available to Arrow.jl without requiring a release:

julia --project -e 'using Pkg; Pkg.develop(path="src/ArrowTypes")'

Format Support

This implementation supports the 1.0 version of the specification, including support for:

  • All primitive data types
  • All nested data types
  • Dictionary encodings and messages
  • Extension types
  • Streaming, file, record batch, and replacement and isdelta dictionary messages

It currently doesn't include support for:

  • Tensors or sparse tensors
  • Flight RPC
  • C data interface

Third-party data formats:

See the full documentation for details on reading and writing arrow data.
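As a quick orientation, here is a minimal round-trip sketch. It assumes only the `Arrow.write(io, table)` and `Arrow.Table(io)` calls that also appear in the issue reports further down; the column names `id` and `name` are made up for illustration.

```julia
using Arrow

# Write a small NamedTuple-of-vectors table to an in-memory buffer.
io = IOBuffer()
Arrow.write(io, (id = [1, 2, 3], name = ["a", "b", "c"]))

# Rewind the buffer and read the data back.
# Columns are exposed as AbstractVector objects.
seekstart(io)
tbl = Arrow.Table(io)

ids = collect(tbl.id)
labels = collect(tbl.name)
```

Any Tables.jl-compatible source should work in place of the NamedTuple, but that is not shown here.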

arrow-julia's People

Contributors

baumgold, carlolucibello, davidanthoff, dmbates, ericphanson, etpinard, expandingman, guilhermebodin, jacobadenbaum, joaoaparicio, jrevels, kou, kristofferc, nhdaly, nickrobinson251, okartal, omus, pcjentsch, piever, poncito, quinnj, raulcd, sglyon, simeonschaub, simondanisch, simsurace, tanmaykm, thecedarprince, tpgillam, visr

arrow-julia's Issues

Releasing the lock on arrow file

Reading a file with Arrow.Table places a lock on the file. On Windows, I could not find a function that releases the lock within a session.
The only approach that works is:

  1. remove all references to the table,
  2. call GC.gc(),
  3. then delete the file.

This is valid and makes sense, but is there some other way to do it (maybe not)?

If not, I understand that the recommended pattern for such cases is to open the file, pass the IO to Arrow.Table, and close the file. Is this the correct and recommended pattern if you want to avoid memory mapping and locking the file?

This is needed in scenarios where you want to use Arrow as temporary storage on disk but clean up after the session.

(I am writing a blog post on Arrow.jl now)
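The "read the bytes yourself" pattern asked about above could be sketched as follows. This is a sketch only: it relies on `Arrow.Table` accepting a byte vector (as other reports on this page do), the temporary path is made up, and whether this avoids the lock on Windows is an assumption that has not been verified here.

```julia
using Arrow

# Hypothetical temporary file used only for this illustration.
path = joinpath(mktempdir(), "tmp.arrow")
Arrow.write(path, (x = [1, 2, 3],))

# Read the whole file into memory first, so that Arrow.Table works on an
# in-memory byte vector instead of memory mapping the file.
bytes = read(path)
tbl = Arrow.Table(bytes)
total = sum(tbl.x)

# No mapping of the file is held by `tbl`, so it can be deleted right away
# (assumed to hold on Windows as well).
rm(path)
```

The trade-off is that the whole file is held in memory, which defeats the purpose of memory mapping for very large files.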

On Win10 the write does not always happen after Arrow.write returns

julia> using DataFrames, Arrow

julia> df = DataFrame(x1=[true, false], x2=[1,2],x3=[1,missing],x4=[1.0,2.0],x5=[1.0,missing])
2×5 DataFrame
│ Row │ x1   │ x2    │ x3      │ x4      │ x5       │
│     │ Bool │ Int64 │ Int64?  │ Float64 │ Float64? │
├─────┼──────┼───────┼─────────┼─────────┼──────────┤
│ 1   │ 1    │ 1     │ 1       │ 1.0     │ 1.0      │
│ 2   │ 0    │ 2     │ missing │ 2.0     │ missing  │

julia> Arrow.write("small.arrow", df)
"small.arrow"

julia> stat("small.arrow")
StatStruct(mode=0o100666, size=0)

julia> exit()

$ julia --banner=no
julia> stat("small.arrow")
StatStruct(mode=0o100666, size=734)

It seems that the write buffer is not closed properly (it only gets closed when you exit Julia).
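One possible way to sidestep the problem is to manage the file handle explicitly, so that it is closed as soon as the write finishes. The sketch below assumes only `Arrow.write(io, table)`; the path is hypothetical, and this has not been checked against the Windows behaviour reported here.

```julia
using Arrow

# Hypothetical temporary path for this illustration.
path = joinpath(mktempdir(), "small.arrow")

# Open the file ourselves: the do-block closes (and therefore flushes)
# the handle when it exits, whatever Arrow.write does internally.
open(path, "w") do io
    Arrow.write(io, (x1 = [true, false], x2 = [1, 2]))
end

nbytes = filesize(path)
```

Note that writing to an `IO` may produce the streaming format rather than the file format, so this is not necessarily a byte-for-byte substitute for writing to a path.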

`NamedTuple{...,Union{...}}` values are serializable but inaccessible once deserialized

similarly flavored problem to #75:

julia> write_table(tbl) = (io = IOBuffer(); Arrow.write(io, tbl); seekstart(io); Arrow.Table(io))
write_table (generic function with 1 method)

julia> struct Foo
           a::Union{Int,String}
       end

julia> tbl = write_table((foos=[Foo(1), Foo("x")],))
Arrow.Table: (foos = Foo[Foo(1), Foo("x")],)

julia> tbl.foos[1] # works fine
Foo(1)

julia> const FooEmulation = NamedTuple{(:a,),Tuple{Union{Int,String}}}
NamedTuple{(:a,),Tuple{Union{Int64, String}}}

julia> tbl2 = write_table((foos=FooEmulation[(a=1,),(a="x",)],));

julia> tbl2.foos[1] # uh oh
ERROR: MethodError: Cannot `convert` an object of type Int64 to an object of type Arrow.UnionT{Arrow.Flatbuf.UnionModeModule.Dense,nothing,Tuple{Union{Missing, Int64},String}}
Closest candidates are:
  convert(::Type{T}, ::T) where T at essentials.jl:171
Stacktrace:
 [1] convert(::Type{Tuple{Arrow.UnionT{Arrow.Flatbuf.UnionModeModule.Dense,nothing,Tuple{Union{Missing, Int64},String}}}}, ::Tuple{Int64}) at ./essentials.jl:310
 [2] Tuple{Arrow.UnionT{Arrow.Flatbuf.UnionModeModule.Dense,nothing,Tuple{Union{Missing, Int64},String}}}(::Tuple{Int64}) at ./tuple.jl:225
 [3] NamedTuple{(:a,),Tuple{Arrow.UnionT{Arrow.Flatbuf.UnionModeModule.Dense,nothing,Tuple{Union{Missing, Int64},String}}}}(::Tuple{Int64}) at ./namedtuple.jl:90
 [4] getindex(::Arrow.Struct{NamedTuple{(:a,),Tuple{Arrow.UnionT{Arrow.Flatbuf.UnionModeModule.Dense,nothing,Tuple{Union{Missing, Int64},String}}}},Tuple{Arrow.DenseUnion{Arrow.UnionT{Arrow.Flatbuf.UnionModeModule.Dense,nothing,Tuple{Union{Missing, Int64},String}},Tuple{Arrow.Primitive{Union{Missing, Int64},Array{Int64,1}},Arrow.List{String,Int32,Array{UInt8,1}}}}}}, ::Int64) at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/struct.jl:38
 [5] top-level scope at REPL[65]:1

Unable to handle Arrow IPC stream?

julia> Arrow.Table(transcode(LZ4FrameDecompressor, HTTP.get(url, Dict("Accept" => "application/vnd.arrow.stream", "Accept-Encoding" => "lz4")).body); debug=true)
didn't find continuation byte to keep parsing messages: 1772
ERROR: type Nothing has no field custom_metadata

This data parses fine with pyarrow:

    pa = PyCall.pyimport("pyarrow")
    bsr = pa.RecordBatchStreamReader(pa.PythonFile(pa.CompressedInputStream(pa.PythonFile(IOBuffer(bytes)), "lz4")))
    df = bsr.read_all().to_pandas(date_as_object=false)

Is this format expected to be supported atm?

flatbuffers split

It really, really should be its own package (I assume that's the ultimate intention, but even if this is so, now we have an issue to track). And, at the risk of beating a dead horse, I am very serious when I say the package in that form is just really, really hard to understand if something goes wrong. I just want to warn that, having had experience with this, unless that package is rewritten to make all the steps a lot more transparent, I seriously doubt that anyone other than @quinnj is ever going to wind up doing much work on it. It's not even necessarily that there's anything overly convoluted going on in reading and writing, it's just nigh on impossible to pick it apart into any kind of manageable pieces.

Problems with "hard cases" of column names

julia> df = DataFrame(" "=>[true, false], "\t"=>[1,2],"\n"=>[1,missing],"\r"=>[1.0,2.0],"  "=>[1.0,missing])
2×5 DataFrame
│ Row │      │ \t    │ \n      │ \r      │          │
│     │ Bool │ Int64 │ Int64?  │ Float64 │ Float64? │
├─────┼──────┼───────┼─────────┼─────────┼──────────┤
│ 1   │ 1    │ 1     │ 1       │ 1.0     │ 1.0      │
│ 2   │ 0    │ 2     │ missing │ 2.0     │ missing  │

julia> Arrow.write("t.arrow", df)
"t.arrow"

julia> Arrow.Table("t.arrow")
Arrow.Table: NamedTuple()

Arrow conversion type for naive Julia DateTime - Arrow.Date or Arrow.Timestamp?

I was surprised that a DateTime gets automatically converted to an Arrow Date with MILLISECOND units. Could it be more natural to convert it to an Arrow Timestamp with no timezone info, as documented here: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L234-L238

This seems to be what pyarrow does by default.

I'm not quite clear on why Date MILLISECOND (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L196-L204) exists at all - it seems to overlap with Timestamp. Perhaps you know better?

Unable to round-trip a DataFrame

I am not sure what it is about this structure that causes the round-trip to fail. My actual case involves the same structure and names. I am just simulating data here for convenience.

julia> tt = DataFrame(C=rand(40), T1=rand(40), T2=rand(40), R1=rand(40), R2=rand(40), Trial = PooledArray(repeat(string.(1:4), inner=10)));

julia> Arrow.write("tt.arrow", tt)
"tt.arrow"

julia> ttnew = Arrow.Table("tt.arrow")
Arrow.Table: Error showing value of type Arrow.Table:
ERROR: KeyError: key :C not found
Stacktrace:
 [1] getindex at ./dict.jl:467 [inlined]
 [2] getcolumn at /home/bates/.julia/packages/Arrow/uVYhe/src/table.jl:38 [inlined]
 [3] #1 at ./none:0 [inlined]
 [4] iterate at ./generator.jl:47 [inlined]
 [5] collect(::Base.Generator{Array{Symbol,1},Tables.var"#1#2"{Arrow.Table}}) at ./array.jl:686
 [6] _totuple at ./tuple.jl:258 [inlined]
 [7] Tuple at ./tuple.jl:230 [inlined]
 [8] NamedTuple(::Arrow.Table) at /home/bates/.julia/packages/Tables/pKMcn/src/Tables.jl:177
 [9] show(::IOContext{REPL.Terminals.TTYTerminal}, ::Arrow.Table) at /home/bates/.julia/packages/Tables/pKMcn/src/Tables.jl:183
 [10] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::Arrow.Table) at ./multimedia.jl:47
 [11] display(::REPL.REPLDisplay, ::MIME{Symbol("text/plain")}, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:214
 [12] display(::REPL.REPLDisplay, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:218
 [13] display(::Any) at ./multimedia.jl:328
 [14] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [15] invokelatest at ./essentials.jl:709 [inlined]
 [16] print_response(::IO, ::Any, ::Bool, ::Bool, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:238
 [17] print_response(::REPL.AbstractREPL, ::Any, ::Bool, ::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:223
 [18] (::REPL.var"#do_respond#54"{Bool,Bool,REPL.var"#64#73"{REPL.LineEditREPL,REPL.REPLHistoryProvider},REPL.LineEditREPL,REPL.LineEdit.Prompt})(::Any, ::Any, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:822
 [19] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [20] invokelatest at ./essentials.jl:709 [inlined]
 [21] run_interface(::REPL.Terminals.TextTerminal, ::REPL.LineEdit.ModalInterface, ::REPL.LineEdit.MIState) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/LineEdit.jl:2355
 [22] run_frontend(::REPL.LineEditREPL, ::REPL.REPLBackendRef) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:1144
 [23] (::REPL.var"#38#42"{REPL.LineEditREPL,REPL.REPLBackendRef})() at ./task.jl:356

downstream packages need to put `Arrow.ArrowTypes.registertype!` statements in `__init__`

Otherwise Arrow.ArrowTypes.ARROW_TO_JULIA_TYPE_MAPPING doesn't get updated on package load, since the registertype! call gets executed at precompile time and that state is not preserved for using time. It seems like this is just a consequence of the design choice to use a global dictionary instead of dispatch (like StructTypes does). This caught me off-guard so I figured I'd file an issue; I think it should be mentioned in the docs at least.
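The general pattern can be illustrated without Arrow.jl at all. In the sketch below, `RegistrySketch`, `TYPE_MAPPING`, and `register!` are hypothetical stand-ins for a package, a global mapping dictionary, and a registration function; they are not Arrow.jl's actual API.

```julia
module RegistrySketch

# Hypothetical stand-in for a global type-mapping dictionary.
const TYPE_MAPPING = Dict{String,DataType}()

struct Foo
    a::Int
end

# Hypothetical registration helper that mutates the global dictionary.
register!(name::String, T::DataType) = (TYPE_MAPPING[name] = T; nothing)

# A top-level `register!(...)` call here would run while precompiling, and
# its effect on a dictionary owned by another module would not be saved.
# `__init__` runs every time the module is loaded, so registering here
# makes the entry available at `using` time.
function __init__()
    register!("RegistrySketch.Foo", Foo)
end

end # module

registered = haskey(RegistrySketch.TYPE_MAPPING, "RegistrySketch.Foo")
```

A downstream package would follow the same shape, with its real registration call placed inside `__init__`.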

Inconsistent Bool vectors from Julia to Python via Arrow

I am just trying to bring awareness to this issue, but please feel free to close it if I have misused the API (it would still be worthwhile to keep a record of the "correct usage").

In Julia:

using DataFrames, Arrow, Random
Random.seed!(1)
test = DataFrame(a = randn(10000))
test.a = test.a .> 0.5
Arrow.write("test.arrow", test)
test = Arrow.Table("test.arrow") |> DataFrame
@show sum(test.a)

which gives 3133.

Then I load the same file from python using pyarrow:

import pyarrow.feather as feather
import pandas as pd
dt = feather.read_feather("test.arrow")
dt.sum()

which gives 384. Browsing the vector by eye in Python also shows that it is very inconsistent with the one saved in Julia.

Version info:
Julia:

Julia Version 1.5.0
Commit 96786e2 (2020-08-01 23:44 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E7-8891 v2 @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-9.0.1 (ORCJIT, ivybridge)

Arrow.jl is 0.3.0.

Python is on 3.8.2 with Pyarrow 1.0.1.

Thanks!

documentation

We should have full-blown Documenter.jl documentation. I will help, I just wanted to open this issue to track.

Anyway, thanks so much for taking this whole thing over! I must admit I feel a pang of regret for not carrying it all the way through myself, especially when I was so close, but as I was doing it entirely in my free time, and the feared nightmares in which it would be absolutely crucial for me to have a full robust arrow package in order to even use Julia at work never materialized, it was incredibly hard for me to stay motivated, especially when I hit major obstacles (like the flatbuffers issues). Let this be a lesson not to take on major projects during playtime that are frankly just no goddamn fun 😆

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Does Arrow.jl Support Handling for Larger-Than-Memory Conditions?

Many thanks to @quinnj and the rest of the JuliaData group for making this wonderful package!

I have been exploring Arrow.jl and the Apache Arrow standards and was wondering if Arrow.jl was able to support handling for Larger-Than-Memory conditions? I was exploring mmap based off of the email chain here as I conversed with members from the Apache Arrow community: https://lists.apache.org/thread.html/rd4a366fbb7104e306e342bd41a18f8ff24bdfc6151906270920999b6%40%3Cuser.arrow.apache.org%3E

I noticed that the fundamental Arrow.Table function utilizes memory mapping so my question can be distilled to, "Does Arrow.jl handle memory mapping for me so I don't have to worry about Larger-Than-Memory conditions?"

No error when writing NamedTuple "table" with different-length columns

julia> b = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> Arrow.write(b, (;a = Int[], b = ["asd"], c=collect(1:100)))
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=1296, maxsize=Inf, ptr=1297, mark=-1)

julia> Arrow.Table(seekstart(b))
Arrow.Table: (a = Int64[], b = ["asd"], c = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  91, 92, 93, 94, 95, 96, 97, 98, 99, 100])

I'm surprised that this roundtrips fine. Should there be an error if a table's columns have different lengths, both when reading and writing? Or is this allowable in the Arrow spec (curious what the usecase would be)?
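Until the question is settled, a caller could guard against this before writing. The helper below, `check_equal_lengths`, is hypothetical and not part of Arrow.jl; it only handles NamedTuple "tables" like the one in this report.

```julia
# Hypothetical helper: throw if the columns of a NamedTuple "table"
# do not all have the same length; otherwise return the table unchanged.
function check_equal_lengths(tbl::NamedTuple)
    lens = map(length, values(tbl))
    if !(isempty(lens) || all(==(lens[1]), lens))
        pairs = collect(zip(keys(tbl), lens))
        throw(ArgumentError("columns have different lengths: $pairs"))
    end
    return tbl
end

# Columns of equal length pass through unchanged.
ok = check_equal_lengths((a = [1, 2], b = ["x", "y"]))

# The table from this report is rejected.
rejected = try
    check_equal_lengths((a = Int[], b = ["asd"], c = collect(1:100)))
    false
catch err
    err isa ArgumentError
end
```

The result could then be passed on, for example as `Arrow.write(io, check_equal_lengths(tbl))`.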

memory leaking when reading compressed arrow files

I previously reported a memory leak that occurs when I use Python's pandas to read Parquet files with the pyarrow engine:
https://issues.apache.org/jira/browse/ARROW-6874?focusedCommentId=17171226&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17171226

When I use Arrow.jl to read Feather files, I see the same problem.
The code is here:

using Arrow, DataFrames, Glob, ProgressMeter

@showprogress 1 "reading market data" for fp in glob("*.feather", "/mnt/Data/market_data/")[1:4]
    println(fp)
    open(fp) do h
        tab = Arrow.Table(read(h))  # read the whole file into memory
        df = DataFrame(tab)
        finalize(tab)
        tab = nothing
        GC.gc()                     # memory still grows after this
    end
end

Memory keeps growing.
It may be the same problem with a different cause.

Attempting to serialize `DataType`s induces segfault

Didn't really expect this to work necessarily out of the box, but probably should throw an error rather than crash if possible 😁

Running on 32560d3

julia> using Arrow

julia> io = IOBuffer();

julia> Arrow.write(io, (types = DataType[Float64, Int64],))
Unreachable reached at 0x12d517537

signal (4): Illegal instruction: 4
in expression starting at REPL[3]:1
ToStruct at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/struct.jl:67
#47 at ./none:0
iterate at ./generator.jl:47 [inlined]
collect at ./array.jl:686 [inlined]
_totuple at ./tuple.jl:258 [inlined]
Tuple at ./tuple.jl:230 [inlined]
#arrowvector#46 at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/struct.jl:89
arrowvector##kw at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/struct.jl:81 [inlined]
#arrowvector#10 at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:78
arrowvector##kw at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:69
unknown function (ip: 0x12d5189c4)
#arrowvector#4 at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:52
unknown function (ip: 0x12d5171d6)
arrowvector##kw at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:48
unknown function (ip: 0x12d516e54)
#47 at ./none:0
iterate at ./generator.jl:47 [inlined]
collect_to! at ./array.jl:732
collect_to_with_first! at ./array.jl:710
unknown function (ip: 0x12d516a2a)
collect at ./array.jl:691
_totuple at ./tuple.jl:258 [inlined]
Tuple at ./tuple.jl:230 [inlined]
#arrowvector#46 at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/struct.jl:89
arrowvector##kw at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/struct.jl:81 [inlined]
#arrowvector#10 at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:78
arrowvector##kw at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:69
unknown function (ip: 0x12d511d34)
#arrowvector#4 at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:52
unknown function (ip: 0x12d50fb86)
arrowvector##kw at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:48
unknown function (ip: 0x12d50f804)
#47 at ./none:0
iterate at ./generator.jl:47 [inlined]
collect at ./array.jl:686
_totuple at ./tuple.jl:258 [inlined]
Tuple at ./tuple.jl:230 [inlined]
#arrowvector#46 at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/struct.jl:89
arrowvector##kw at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/struct.jl:81 [inlined]
#arrowvector#10 at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:78
arrowvector##kw at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:69
unknown function (ip: 0x12d50cacd)
#arrowvector#4 at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:52
unknown function (ip: 0x12d508bff)
arrowvector##kw at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:48 [inlined]
#toarrowvector#3 at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:36
toarrowvector##kw at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:34 [inlined]
toarrowvector##kw at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/arraytypes.jl:34 [inlined]
#113 at /Users/jarrettrevels/.julia/dev/Arrow/src/write.jl:202
eachcolumn at /Users/jarrettrevels/.julia/packages/Tables/iG2a3/src/utils.jl:70 [inlined]
toarrowtable at /Users/jarrettrevels/.julia/dev/Arrow/src/write.jl:201
unknown function (ip: 0x12d50825f)
macro expansion at /Users/jarrettrevels/.julia/dev/Arrow/src/write.jl:111 [inlined]
macro expansion at ./task.jl:332 [inlined]
write at /Users/jarrettrevels/.julia/dev/Arrow/src/write.jl:108
#write#106 at /Users/jarrettrevels/.julia/dev/Arrow/src/write.jl:83 [inlined]
write at /Users/jarrettrevels/.julia/dev/Arrow/src/write.jl:83
unknown function (ip: 0x12d502908)
jl_apply at /Users/jarrettrevels/data/repos/julia/src/./julia.h:1690 [inlined]
do_call at /Users/jarrettrevels/data/repos/julia/src/interpreter.c:117
eval_body at /Users/jarrettrevels/data/repos/julia/src/interpreter.c:0
jl_interpret_toplevel_thunk at /Users/jarrettrevels/data/repos/julia/src/interpreter.c:660
jl_toplevel_eval_flex at /Users/jarrettrevels/data/repos/julia/src/toplevel.c:840
jl_toplevel_eval_flex at /Users/jarrettrevels/data/repos/julia/src/toplevel.c:790
jl_toplevel_eval at /Users/jarrettrevels/data/repos/julia/src/toplevel.c:849 [inlined]
jl_toplevel_eval_in at /Users/jarrettrevels/data/repos/julia/src/toplevel.c:883
eval at ./boot.jl:331
eval_user_input at /Users/jarrettrevels/data/repos/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:134
repl_backend_loop at /Users/jarrettrevels/data/repos/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:195
start_repl_backend at /Users/jarrettrevels/data/repos/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:180
#run_repl#37 at /Users/jarrettrevels/data/repos/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:292
run_repl at /Users/jarrettrevels/data/repos/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288
#806 at ./client.jl:399
jfptr_YY.806_40248 at /Users/jarrettrevels/data/repos/julia/usr/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/jarrettrevels/data/repos/julia/src/./julia.h:1690 [inlined]
do_apply at /Users/jarrettrevels/data/repos/julia/src/builtins.c:655
jl_f__apply at /Users/jarrettrevels/data/repos/julia/src/builtins.c:669 [inlined]
jl_f__apply_latest at /Users/jarrettrevels/data/repos/julia/src/builtins.c:705
#invokelatest#1 at ./essentials.jl:710 [inlined]
invokelatest at ./essentials.jl:709 [inlined]
run_main_repl at ./client.jl:383
exec_options at ./client.jl:313
_start at ./client.jl:506
jfptr__start_53133 at /Users/jarrettrevels/data/repos/julia/usr/lib/julia/sys.dylib (unknown line)
true_main at /usr/local/bin/julia (unknown line)
main at /usr/local/bin/julia (unknown line)
Allocations: 12975029 (Pool: 12972329; Big: 2700); GC: 13
[1]    41167 illegal hardware instruction  julia
julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc (2020-08-01 23:44 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.4.0)
  CPU: Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

tables containing `Set` values are serializable but corresponding deserialized `Arrow.Table`s are inaccessible

Running on 32560d3:

julia> using Arrow

julia> construct_table(args...) = (io = IOBuffer(); Arrow.write(io, args...); seekstart(io); Arrow.Table(io))
construct_table (generic function with 1 method)

julia> t = construct_table((sets = [Set([1,2,3]), Set([1,2,3])],));

julia> collect(t.sets)
ERROR: MethodError: Cannot `convert` an object of type Pair{Int64,Nothing} to an object of type Int64
Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
  ...
Stacktrace:
 [1] setindex!(::Dict{Int64,Nothing}, ::Nothing, ::Pair{Int64,Nothing}) at ./dict.jl:372
 [2] push!(::Set{Int64}, ::Pair{Int64,Nothing}) at ./set.jl:57
 [3] union!(::Set{Int64}, ::Dict{Int64,Nothing}) at ./abstractset.jl:91
 [4] Set{Int64}(::Dict{Int64,Nothing}) at ./set.jl:10
 [5] getindex at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/struct.jl:44 [inlined]
 [6] copyto_unaliased!(::IndexLinear, ::Array{Set{Int64},1}, ::IndexLinear, ::Arrow.Struct{Set{Int64},Tuple{Arrow.Map{Dict{Int64,Nothing},Int32,Arrow.Struct{Arrow.KeyValue{Int64,Nothing},Tuple{Arrow.Primitive{Int64,Array{Int64,1}},Arrow.Struct{Nothing,Tuple{}}}}}}}) at ./abstractarray.jl:860
 [7] copyto! at ./abstractarray.jl:840 [inlined]
 [8] _collect_indices at ./array.jl:642 [inlined]
 [9] collect(::Arrow.Struct{Set{Int64},Tuple{Arrow.Map{Dict{Int64,Nothing},Int32,Arrow.Struct{Arrow.KeyValue{Int64,Nothing},Tuple{Arrow.Primitive{Int64,Array{Int64,1}},Arrow.Struct{Nothing,Tuple{}}}}}}}) at ./array.jl:626
 [10] top-level scope at REPL[36]:1

How to unpack a `Arrow.Timestamp{Arrow.Flatbuf.TimeUnitModule.MILLISECOND,:UTC}`?

I've ended up with one of these:

julia> df.timestamp[3]
Arrow.Timestamp{Arrow.Flatbuf.TimeUnitModule.MILLISECOND,:UTC}(1556703758692)

julia> convert(DateTime, df.timestamp[3])
ERROR: MethodError: Cannot `convert` an object of type Arrow.Timestamp{Arrow.Flatbuf.TimeUnitModule.MILLISECOND,:UTC} to an object of type DateTime
Closest candidates are:
  convert(::Type{DateTime}, ::Date) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Dates/src/conversions.jl:30
  convert(::Type{DateTime}, ::Millisecond) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Dates/src/conversions.jl:34
  convert(::Type{DateTime}, ::Arrow.Date{Arrow.Flatbuf.DateUnitModule.MILLISECOND,Int64}) at~/.julia/packages/Arrow/JIMGY/src/eltypes.jl:164
  ...
Stacktrace:
 [1] top-level scope at REPL[123]:1

Is there any easy way to unpack it into a (tz-less, implicitly UTC) DateTime?
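As a stopgap, the raw integer can be converted by hand with the Dates standard library. The sketch below starts from the millisecond count printed above and treats it as milliseconds since the Unix epoch in UTC; how to pull that integer out of the `Arrow.Timestamp` wrapper is not shown and would depend on the Arrow.jl version.

```julia
using Dates

# Raw value copied from the Arrow.Timestamp shown above:
# milliseconds since the Unix epoch (UTC).
ms = 1556703758692

# Add the offset to the epoch to get a timezone-less, implicitly UTC DateTime.
dt = DateTime(1970, 1, 1) + Millisecond(ms)
```

Integer arithmetic on `Millisecond` keeps the sub-second part exact, which a floating-point conversion through seconds might not.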

Wrong handling of `missing` with `Char`

An example:

julia> df = DataFrame(x1=[true, missing], x2=['a', missing], x3=[Date(2020,11,19), missing])
2×3 DataFrame
 Row │ x1       x2       x3         
     │ Bool?    Char?    Date?      
─────┼──────────────────────────────
   1 │    true  a        2020-11-19
   2 │ missing  missing  missing    

julia> Arrow.write("test.arrow", df)
"test.arrow"

julia> Arrow.Table("test.arrow") |> DataFrame
2×3 DataFrame
 Row │ x1       x2    x3         
     │ Bool?    Char  Date?      
─────┼───────────────────────────
   1 │    true  a     2020-11-19
   2 │ missing  \0    missing    

(I additionally made sure that Bool and Date are OK)

Arrow.Table error while reading more than 2^32 bytes

Problem

I was trying to construct an Arrow.Table from an HTTP response but errors keep happening when the response is too large.

The info of the HTTP response

julia> varinfo(r"res")
  name       size summary               
  –––– –––––––––– ––––––––––––––––––––––
  res  17.176 GiB HTTP.Messages.Response

Then when I was trying to construct an Arrow.Table, I kept getting the following error message

julia> tbl = Arrow.Table(res.body)
ERROR: TaskFailedException:
InexactError: trunc(UInt32, 4295335456)
Stacktrace:
 [1] throw_inexacterror(::Symbol, ::Type{UInt32}, ::Int64) at ./boot.jl:558
 [2] checked_trunc_uint at ./boot.jl:588 [inlined]
 [3] toUInt32 at ./boot.jl:672 [inlined]
 [4] UInt32 at ./boot.jl:712 [inlined]
 [5] convert at ./number.jl:7 [inlined]
 [6] Array at /home/poyu/.julia/packages/Arrow/iCrzQ/src/FlatBuffers/table.jl:98 [inlined]
 [7] Arrow.FlatBuffers.Array{Arrow.Flatbuf.Buffer,S,TT} where TT where S(::Arrow.Flatbuf.RecordBatch, ::UInt16) at /home/poyu/.julia/packages/Arrow/iCrzQ/src/FlatBuffers/table.jl:108
 [8] getproperty at /home/poyu/.julia/packages/Arrow/iCrzQ/src/metadata/Message.jl:90 [inlined]
 [9] buildbitmap(::Arrow.Batch, ::Arrow.Flatbuf.RecordBatch, ::Int64, ::Int64) at /home/poyu/.julia/packages/Arrow/iCrzQ/src/table.jl:361
 [10] build(::Arrow.Flatbuf.Field, ::Arrow.Flatbuf.Utf8, ::Arrow.Batch, ::Arrow.Flatbuf.RecordBatch, ::Dict{Int64,Arrow.DictEncoding}, ::Int64, ::Int64, ::Bool) at /home/poyu/.julia/packages/Arrow/iCrzQ/src/table.jl:404
 [11] build(::Arrow.Flatbuf.Field, ::Arrow.Batch, ::Arrow.Flatbuf.RecordBatch, ::Dict{Int64,Arrow.DictEncoding}, ::Int64, ::Int64, ::Bool) at /home/poyu/.julia/packages/Arrow/iCrzQ/src/table.jl:355
 [12] iterate(::Arrow.VectorIterator, ::Tuple{Int64,Int64,Int64}) at /home/poyu/.julia/packages/Arrow/iCrzQ/src/table.jl:331
 [13] iterate at /home/poyu/.julia/packages/Arrow/iCrzQ/src/table.jl:328 [inlined]
 [14] copyto!(::Array{Any,1}, ::Arrow.VectorIterator) at ./abstractarray.jl:733
 [15] _collect at ./array.jl:630 [inlined]
 [16] collect at ./array.jl:624 [inlined]
 [17] macro expansion at /home/poyu/.julia/packages/Arrow/iCrzQ/src/table.jl:249 [inlined]
 [18] (::Arrow.var"#90#96"{Bool,Dict{Int64,Arrow.DictEncoding},Arrow.Batch})() at ./threadingconstructs.jl:169
Stacktrace:
 [1] wait at ./task.jl:267 [inlined]
 [2] fetch at ./task.jl:282 [inlined]
 [3] macro expansion at /home/poyu/.julia/packages/Arrow/iCrzQ/src/table.jl:198 [inlined]
 [4] (::Arrow.var"#86#92"{Arrow.Table,Channel{Task}})() at ./threadingconstructs.jl:169
Stacktrace:
 [1] wait at ./task.jl:267 [inlined]
 [2] Arrow.Table(::Array{UInt8,1}, ::Int64, ::Nothing; convert::Bool) at /home/poyu/.julia/packages/Arrow/iCrzQ/src/table.jl:256
 [3] Table at /home/poyu/.julia/packages/Arrow/iCrzQ/src/table.jl:184 [inlined] (repeats 2 times)
 [4] top-level scope at REPL[23]:1

I have little idea of how Arrow.jl is working, but I suspect the error happens because the Array{T}(t::Table, off) in FlatBuffers/table.jl is trying to convert a 64-bit pointer to UInt32
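The failing conversion itself is easy to reproduce in isolation. This sketch uses the value from the error message above and says nothing about where in Arrow.jl the conversion is triggered.

```julia
# Value taken from the InexactError in the report above.
offset = 4295335456

# typemax(UInt32) is 4294967295, so the value does not fit in 32 bits.
fits = offset <= typemax(UInt32)

# Converting it to UInt32 throws the same kind of error as in the stack trace.
err = try
    UInt32(offset)
    nothing
catch e
    e
end

# A 64-bit unsigned integer holds the value without trouble.
wide = UInt64(offset)
```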

Version Info

My Arrow.jl version

julia> Pkg.status("Arrow")
Status `~/.julia/environments/v1.5/Project.toml`
  [69666777] Arrow v0.4.0

My Julia version

julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, broadwell)

factor read as string

I have a Feather v2 file saved from R in which some fields are of factor type. When importing it in R or Python, I get those fields as factor/category, as expected. In Julia they are read as strings. I would like to read them as categorical variables instead.

Reading only a subset of columns

Please correct me if this is possible already. I looked through the source code and the documentation and did not find a clear way to do this: basically, I want to read a FeatherV2 file, but not mmap every single column. I already know which columns I need and I'd like to tell Arrow.Table the subset of columns I want read into memory.

This is similar to this issue on Feather.jl.

This seems to be possible in the R arrow package using col_select.

reconsidering the current type registration/serialization mechanism (and its internal usage)

E.g.

julia> using Arrow, UUIDs

julia> table = (;x = [uuid4() for _ = 1:5])
(x = UUID[UUID("03fec02f-3d33-4cf8-88e0-1e152faccff7"), UUID("a789baa7-9059-4bc4-b0d9-2af1dca4bf88"), UUID("2a4ecfd7-049e-468d-8fed-8f967b732ee0"), UUID("91b9cad6-b84c-4ee8-aeaf-7448778fb1d5"), UUID("39edead7-1670-430a-8e4d-fefe3615c2d0")],)

julia> Arrow.write("uuids.arrow", table)
"uuids.arrow"

julia> Arrow.Table("uuids.arrow")
Arrow.Table: (x = UUID[UUID("03fec02f-3d33-4cf8-88e0-1e152faccff7"), UUID("a789baa7-9059-4bc4-b0d9-2af1dca4bf88"), UUID("2a4ecfd7-049e-468d-8fed-8f967b732ee0"), UUID("91b9cad6-b84c-4ee8-aeaf-7448778fb1d5"), UUID("39edead7-1670-430a-8e4d-fefe3615c2d0")],)

In a new session:

julia> using Arrow, UUIDs

julia> Arrow.Table("uuids.arrow")
┌ Warning: unsupported ARROW:extension:name type: "JuliaLang.UUID"
└ @ Arrow.ArrowTypes ~/.julia/packages/Arrow/CyJ4L/src/arrowtypes.jl:141
Arrow.Table: (x = NamedTuple{(:value,),Tuple{UInt128}}[(value = 0x03fec02f3d334cf888e01e152faccff7,), (value = 0xa789baa790594bc4b0d92af1dca4bf88,), (value = 0x2a4ecfd7049e468d8fed8f967b732ee0,), (value = 0x91b9cad6b84c4ee8aeaf7448778fb1d5,), (value = 0x39edead71670430a8e4dfefe3615c2d0,)],)

I believe this is due to the write-time registration of types at

https://github.com/JuliaData/Arrow.jl/blob/3ab2b18829c1656198a85759360389b6bbb22ab3/src/arraytypes/struct.jl#L86
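Until the registration survives across sessions, the fallback values still look recoverable by hand: the NamedTuple printed in the second session carries the UUID's 128-bit payload in its `value` field. A minimal sketch using only the UUIDs standard library; the helper name `restore_uuid` is hypothetical and not part of Arrow.jl:

```julia
using UUIDs

# Hypothetical helper: rebuild a UUID from the `(value = ::UInt128,)` fallback
# that Arrow.Table returned in the fresh session above.
restore_uuid(nt::NamedTuple{(:value,),Tuple{UInt128}}) = UUID(nt.value)

# First element of the fallback column, copied from the output above.
fallback = (value = 0x03fec02f3d334cf888e01e152faccff7,)
restored = restore_uuid(fallback)
println(restored)
```

Broadcasting `restore_uuid.(column)` over the fallback column would apply the same conversion to every row, at the cost of materializing a new vector.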

remove global metadata cache (OBJ_METADATA)

EDIT: I rescoped this issue, see #90 (comment)

I noticed that metadata is stored in a global IdDict - would it make sense to provide an unsetmetadata!(x) (or use a sentinel e.g. setmetadata!(x, nothing)) that calls delete!(OBJ_METADATA, x)?

I could see a memory-leak scenario where, e.g., a long-running service writes a bunch of Arrow objects, attaches a small amount of metadata to each one, and eventually runs out of memory.
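To make the proposal concrete, here is a small sketch of the pattern with a stand-in cache. `METADATA_SKETCH`, `set_meta!`, `get_meta` and `unset_meta!` are hypothetical names for illustration only; the real cache and its accessors are internal to Arrow.jl and may differ:

```julia
# Stand-in for the package-level cache keyed by object identity.
METADATA_SKETCH = IdDict{Any,Dict{String,String}}()

set_meta!(x, meta::Dict{String,String}) = (METADATA_SKETCH[x] = meta; x)
get_meta(x) = get(METADATA_SKETCH, x, nothing)

# The proposed escape hatch: drop the entry so the cache no longer keeps
# `x` (and its metadata) reachable.
unset_meta!(x) = (delete!(METADATA_SKETCH, x); x)

col = [1, 2, 3]
set_meta!(col, Dict("source" => "sensor-7"))
before = get_meta(col)

unset_meta!(col)
after = get_meta(col)
```

As long as an entry sits in the `IdDict`, the key object stays reachable, which is why a service that never removes entries would grow without bound.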

Error writing simple ZonedDateTime

julia> using TimeZones

julia> using Arrow

julia> Arrow.write(IOBuffer(),(;a=[now(tz"UTC")]))
ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace:
 [1] fieldcount at ./reflection.jl:725 [inlined]
 [2] arrowvector(::Arrow.ArrowTypes.StructType, ::Type{TimeZone}, ::Type{TimeZone}, ::Arrow.ToStruct{TimeZone,2,Array{ZonedDateTime,1}}, ::Array{Arrow.DictEncoding,1}, ::Nothing; kw::Base.Iterators.Pairs{Symbol,Union{Nothing, Bool},NTuple{4,Symbol},NamedTuple{(:compression, :largelists, :denseunions, :dictencodenested),Tuple{Nothing,Bool,Bool,Bool}}}) at ~/.julia/packages/Arrow/JIMGY/src/arraytypes.jl:419
 [3] arrowvector(::Type{TimeZone}, ::Type{TimeZone}, ::Arrow.ToStruct{TimeZone,2,Array{ZonedDateTime,1}}, ::Array{Arrow.DictEncoding,1}, ::Nothing; kw::Base.Iterators.Pairs{Symbol,Union{Nothing, Bool},NTuple{4,Symbol},NamedTuple{(:compression, :largelists, :denseunions, :dictencodenested),Tuple{Nothing,Bool,Bool,Bool}}}) at ~/.julia/packages/Arrow/JIMGY/src/arraytypes.jl:82
 [4] arrowvector(::Arrow.ToStruct{TimeZone,2,Array{ZonedDateTime,1}}, ::Array{Arrow.DictEncoding,1}, ::Nothing; dictencoding::Bool, dictencode::Bool, kw::Base.Iterators.Pairs{Symbol,Union{Nothing, Bool},NTuple{4,Symbol},NamedTuple{(:compression, :largelists, :denseunions, :dictencodenested),Tuple{Nothing,Bool,Bool,Bool}}}) at ~/.julia/packages/Arrow/JIMGY/src/arraytypes.jl:62
 [5] (::Arrow.var"#41#43"{Base.Iterators.Pairs{Symbol,Union{Nothing, Bool},NTuple{4,Symbol},NamedTuple{(:compression, :largelists, :denseunions, :dictencodenested),Tuple{Nothing,Bool,Bool,Bool}}},Array{ZonedDateTime,1},Array{Arrow.DictEncoding,1}})(::Int64) at ./none:0
 [6] iterate at ./generator.jl:47 [inlined]
 [7] collect_to!(::Array{Arrow.Primitive{Arrow.Date{Arrow.Flatbuf.DateUnitModule.MILLISECOND,Int64},Arrow.Date{Arrow.Flatbuf.DateUnitModule.MILLISECOND,Int64},Arrow.Converter{Arrow.Date{Arrow.Flatbuf.DateUnitModule.MILLISECOND,Int64},Arrow.ToStruct{DateTime,1,Array{ZonedDateTime,1}}}},1}, ::Base.Generator{UnitRange{Int64},Arrow.var"#41#43"{Base.Iterators.Pairs{Symbol,Union{Nothing, Bool},NTuple{4,Symbol},NamedTuple{(:compression, :largelists, :denseunions, :dictencodenested),Tuple{Nothing,Bool,Bool,Bool}}},Array{ZonedDateTime,1},Array{Arrow.DictEncoding,1}}}, ::Int64, ::Int64) at ./array.jl:732
 [8] collect_to_with_first!(::Array{Arrow.Primitive{Arrow.Date{Arrow.Flatbuf.DateUnitModule.MILLISECOND,Int64},Arrow.Date{Arrow.Flatbuf.DateUnitModule.MILLISECOND,Int64},Arrow.Converter{Arrow.Date{Arrow.Flatbuf.DateUnitModule.MILLISECOND,Int64},Arrow.ToStruct{DateTime,1,Array{ZonedDateTime,1}}}},1}, ::Arrow.Primitive{Arrow.Date{Arrow.Flatbuf.DateUnitModule.MILLISECOND,Int64},Arrow.Date{Arrow.Flatbuf.DateUnitModule.MILLISECOND,Int64},Arrow.Converter{Arrow.Date{Arrow.Flatbuf.DateUnitModule.MILLISECOND,Int64},Arrow.ToStruct{DateTime,1,Array{ZonedDateTime,1}}}}, ::Base.Generator{UnitRange{Int64},Arrow.var"#41#43"{Base.Iterators.Pairs{Symbol,Union{Nothing, Bool},NTuple{4,Symbol},NamedTuple{(:compression, :largelists, :denseunions, :dictencodenested),Tuple{Nothing,Bool,Bool,Bool}}},Array{ZonedDateTime,1},Array{Arrow.DictEncoding,1}}}, ::Int64) at ./array.jl:710
 [9] collect(::Base.Generator{UnitRange{Int64},Arrow.var"#41#43"{Base.Iterators.Pairs{Symbol,Union{Nothing, Bool},NTuple{4,Symbol},NamedTuple{(:compression, :largelists, :denseunions, :dictencodenested),Tuple{Nothing,Bool,Bool,Bool}}},Array{ZonedDateTime,1},Array{Arrow.DictEncoding,1}}}) at ./array.jl:691
 [10] _totuple at ./tuple.jl:258 [inlined]
 [11] Tuple at ./tuple.jl:230 [inlined]
 [12] arrowvector(::Arrow.ArrowTypes.StructType, ::Type{ZonedDateTime}, ::Type{ZonedDateTime}, ::Array{ZonedDateTime,1}, ::Array{Arrow.DictEncoding,1}, ::Nothing; kw::Base.Iterators.Pairs{Symbol,Union{Nothing, Bool},NTuple{4,Symbol},NamedTuple{(:compression, :largelists, :denseunions, :dictencodenested),Tuple{Nothing,Bool,Bool,Bool}}}) at ~/.julia/packages/Arrow/JIMGY/src/arraytypes.jl:419
 [13] arrowvector(::Type{ZonedDateTime}, ::Type{ZonedDateTime}, ::Array{ZonedDateTime,1}, ::Array{Arrow.DictEncoding,1}, ::Nothing; kw::Base.Iterators.Pairs{Symbol,Union{Nothing, Bool},NTuple{4,Symbol},NamedTuple{(:compression, :largelists, :denseunions, :dictencodenested),Tuple{Nothing,Bool,Bool,Bool}}}) at ~/.julia/packages/Arrow/JIMGY/src/arraytypes.jl:82
 [14] arrowvector(::Array{ZonedDateTime,1}, ::Array{Arrow.DictEncoding,1}, ::Nothing; dictencoding::Bool, dictencode::Bool, kw::Base.Iterators.Pairs{Symbol,Union{Nothing, Bool},NTuple{4,Symbol},NamedTuple{(:compression, :largelists, :denseunions, :dictencodenested),Tuple{Nothing,Bool,Bool,Bool}}}) at ~/.julia/packages/Arrow/JIMGY/src/arraytypes.jl:62
 [15] toarrowvector(::Array{ZonedDateTime,1}, ::Array{Arrow.DictEncoding,1}, ::Nothing; compression::Nothing, kw::Base.Iterators.Pairs{Symbol,Bool,NTuple{4,Symbol},NamedTuple{(:largelists, :denseunions, :dictencode, :dictencodenested),NTuple{4,Bool}}}) at ~/.julia/packages/Arrow/JIMGY/src/arraytypes.jl:45
 [16] (::Arrow.var"#96#97"{Bool,Nothing,Bool,Bool,Bool,Array{Any,1},Array{Type,1},Array{Arrow.DictEncoding,1}})(::Array{ZonedDateTime,1}, ::Int64, ::Symbol) at ~/.julia/packages/Arrow/JIMGY/src/write.jl:250
 [17] macro expansion at ~/.julia/packages/Tables/xHhzi/src/utils.jl:71 [inlined]
 [18] eachcolumn at ~/.julia/packages/Tables/xHhzi/src/utils.jl:65 [inlined]
 [19] toarrowtable(::NamedTuple{(:a,),Tuple{Array{ZonedDateTime,1}}}, ::Bool, ::Nothing, ::Bool, ::Bool, ::Bool) at ~/.julia/packages/Arrow/JIMGY/src/write.jl:249
 [20] macro expansion at ~/.julia/packages/Arrow/JIMGY/src/write.jl:128 [inlined]
 [21] macro expansion at ./task.jl:332 [inlined]
 [22] write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::NamedTuple{(:a,),Tuple{Array{ZonedDateTime,1}}}, ::Bool, ::Bool, ::Nothing, ::Bool, ::Bool, ::Bool) at ~/.julia/packages/Arrow/JIMGY/src/write.jl:125
 [23] #write#91 at ~/.julia/packages/Arrow/JIMGY/src/write.jl:47 [inlined]
 [24] write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::NamedTuple{(:a,),Tuple{Array{ZonedDateTime,1}}}) at ~/.julia/packages/Arrow/JIMGY/src/write.jl:47
 [25] top-level scope at REPL[133]:1

Is this expected?
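Reading the trace, the failing call appears to be `fieldcount` on `TimeZone`, the abstract type of the second field of `ZonedDateTime`. The behaviour can be reproduced without TimeZones.jl; the types below are hypothetical stand-ins, not the real definitions:

```julia
# Hypothetical stand-ins for the TimeZones.jl types named in the trace.
abstract type AbstractZoneSketch end

struct FixedZoneSketch <: AbstractZoneSketch
    offset_seconds::Int
end

struct ZonedStampSketch
    utc_millis::Int64
    zone::AbstractZoneSketch   # abstractly typed field, like `::TimeZone`
end

# The outer struct has a definite layout.
outer_fields = fieldcount(ZonedStampSketch)

# Asking the same of the abstract field type throws an ArgumentError.
inner_error = try
    fieldcount(fieldtype(ZonedStampSketch, :zone))
    nothing
catch err
    err
end

# The concrete runtime type of the stored value is fine again.
stamp = ZonedStampSketch(0, FixedZoneSketch(3600))
concrete_fields = fieldcount(typeof(stamp.zone))
```

So any approach that recurses on declared field types, rather than on the runtime types of the values, would hit this error for structs with abstractly typed fields.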

support for heterogeneously typed tuples

Currently writing heterogeneously typed tuples errors:

julia> Arrow.write("foo.arrow", DataFrame(a = [(1, 2.)]))
ERROR: MethodError: no method matching String(::Int64)
Closest candidates are:
  String(::String) at boot.jl:321
  String(::Array{UInt8,1}) at strings/string.jl:39
  String(::Base.CodeUnits{UInt8,String}) at strings/string.jl:77
  ...
Stacktrace:
 [1] fieldoffset(::Arrow.FlatBuffers.Builder, ::Int64, ::Arrow.Primitive{Int64,Arrow.ToStruct{Int64,1,Array{Tuple{Int64,Float64},1}}}) at /local/scratch/ssd/sschaub/.julia/dev/Arrow/src/write.jl:313
 [2] (::Arrow.var"#74#75"{Arrow.FlatBuffers.Builder,Arrow.Struct{Tuple{Int64,Float64},Tuple{Arrow.Primitive{Int64,Arrow.ToStruct{Int64,1,Array{Tuple{Int64,Float64},1}}},Arrow.Primitive{Float64,Arrow.ToStruct{Float64,2,Array{Tuple{Int64,Float64},1}}}}},Tuple{Int64,Int64}})(::Int64) at ./tuple.jl:0
 [3] iterate at ./generator.jl:47 [inlined]
 [4] collect at ./array.jl:686 [inlined]
 [5] arrowtype(::Arrow.FlatBuffers.Builder, ::Arrow.Struct{Tuple{Int64,Float64},Tuple{Arrow.Primitive{Int64,Arrow.ToStruct{Int64,1,Array{Tuple{Int64,Float64},1}}},Arrow.Primitive{Float64,Arrow.ToStruct{Float64,2,Array{Tuple{Int64,Float64},1}}}}}) at /local/scratch/ssd/sschaub/.julia/dev/Arrow/src/eltypes.jl:389
 [6] fieldoffset(::Arrow.FlatBuffers.Builder, ::Symbol, ::Arrow.Struct{Tuple{Int64,Float64},Tuple{Arrow.Primitive{Int64,Arrow.ToStruct{Int64,1,Array{Tuple{Int64,Float64},1}}},Arrow.Primitive{Float64,Arrow.ToStruct{Float64,2,Array{Tuple{Int64,Float64},1}}}}}) at /local/scratch/ssd/sschaub/.julia/dev/Arrow/src/write.jl:349
 [7] (::Arrow.var"#115#116"{(:a,),Arrow.FlatBuffers.Builder,Arrow.ToArrowTable})(::Int64) at ./array.jl:0
 [8] iterate at ./generator.jl:47 [inlined]
 [9] collect at ./array.jl:686 [inlined]
 [10] makeschema(::Arrow.FlatBuffers.Builder, ::Tables.Schema{(:a,),Tuple{Tuple{Int64,Float64}}}, ::Arrow.ToArrowTable) at /local/scratch/ssd/sschaub/.julia/dev/Arrow/src/write.jl:272
 [11] makeschemamsg(::Tables.Schema{(:a,),Tuple{Tuple{Int64,Float64}}}, ::Arrow.ToArrowTable) at /local/scratch/ssd/sschaub/.julia/dev/Arrow/src/write.jl:308
 [12] macro expansion at /local/scratch/ssd/sschaub/.julia/dev/Arrow/src/write.jl:114 [inlined]
 [13] macro expansion at ./task.jl:332 [inlined]
 [14] write(::IOStream, ::DataFrame, ::Bool, ::Bool, ::Nothing, ::Bool, ::Bool, ::Bool, ::Int64) at /local/scratch/ssd/sschaub/.julia/dev/Arrow/src/write.jl:108
 [15] #104 at /local/scratch/ssd/sschaub/.julia/dev/Arrow/src/write.jl:77 [inlined]
 [16] open(::Arrow.var"#104#105"{Bool,Nothing,Bool,Bool,Bool,Int64,DataFrame}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:325
 [17] open at ./io.jl:323 [inlined]
 [18] #write#103 at /local/scratch/ssd/sschaub/.julia/dev/Arrow/src/write.jl:76 [inlined]
 [19] write(::String, ::DataFrame) at /local/scratch/ssd/sschaub/.julia/dev/Arrow/src/write.jl:76
 [20] top-level scope at REPL[3]:1

This error is fixed if I replace String with string here, but reading it back still errors:

julia> Arrow.Table("foo.arrow")
Arrow.Table: (a = [Error showing value of type Arrow.Table:
ERROR: MethodError: no method matching Tuple{Int64,Float64}(::Int64, ::Float64)
Closest candidates are:
  Tuple{Int64,Float64}(::Any) where T<:Tuple at tuple.jl:230
Stacktrace:
 [1] getindex at /local/scratch/ssd/sschaub/.julia/dev/Arrow/src/arraytypes/struct.jl:44 [inlined]
 [2] isassigned(::Arrow.Struct{Tuple{Int64,Float64},Tuple{Arrow.Primitive{Int64,Array{Int64,1}},Arrow.Primitive{Float64,Array{Float64,1}}}}, ::Int64) at ./abstractarray.jl:408
 [3] show_delim_array(::IOContext{REPL.Terminals.TTYTerminal}, ::Arrow.Struct{Tuple{Int64,Float64},Tuple{Arrow.Primitive{Int64,Array{Int64,1}},Arrow.Primitive{Float64,Array{Float64,1}}}}, ::Char, ::String, ::Char, ::Bool, ::Int64, ::Int64) at ./show.jl:740
 [4] show_delim_array at ./show.jl:733 [inlined]
 [5] show_vector(::IOContext{REPL.Terminals.TTYTerminal}, ::Arrow.Struct{Tuple{Int64,Float64},Tuple{Arrow.Primitive{Int64,Array{Int64,1}},Arrow.Primitive{Float64,Array{Float64,1}}}}, ::Char, ::Char) at ./arrayshow.jl:476
 [6] show_vector at ./arrayshow.jl:461 [inlined]
 [7] show at ./arrayshow.jl:432 [inlined]
 [8] show(::IOContext{REPL.Terminals.TTYTerminal}, ::NamedTuple{(:a,),Tuple{Arrow.Struct{Tuple{Int64,Float64},Tuple{Arrow.Primitive{Int64,Array{Int64,1}},Arrow.Primitive{Float64,Array{Float64,1}}}}}}) at ./namedtuple.jl:150
 [9] show(::IOContext{REPL.Terminals.TTYTerminal}, ::Arrow.Table) at /local/scratch/ssd/sschaub/.julia/packages/Tables/8Ud85/src/Tables.jl:185
 [10] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::Arrow.Table) at ./multimedia.jl:47
 [11] display(::REPL.REPLDisplay, ::MIME{Symbol("text/plain")}, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:214
 [12] display(::REPL.REPLDisplay, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:218
 [13] display(::Any) at ./multimedia.jl:328
 [14] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [15] invokelatest at ./essentials.jl:709 [inlined]
 [16] print_response(::IO, ::Any, ::Bool, ::Bool, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:238
 [17] print_response(::REPL.AbstractREPL, ::Any, ::Bool, ::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:223
 [18] (::REPL.var"#do_respond#54"{Bool,Bool,VSCodeServer.var"#40#41"{REPL.LineEditREPL,REPL.LineEdit.Prompt},REPL.LineEditREPL,REPL.LineEdit.Prompt})(::Any, ::Any, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:822
 [19] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [20] invokelatest at ./essentials.jl:709 [inlined]
 [21] run_interface(::REPL.Terminals.TextTerminal, ::REPL.LineEdit.ModalInterface, ::REPL.LineEdit.MIState) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/LineEdit.jl:2355
 [22] run_frontend(::REPL.LineEditREPL, ::REPL.REPLBackendRef) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:1144
 [23] (::REPL.var"#38#42"{REPL.LineEditREPL,REPL.REPLBackendRef})() at ./task.jl:356

Tuples probably have to be special-cased for this to work, but I am not quite sure what the best solution would be. If anyone can give me some tips on what needs to be done here, I could take a stab at it.
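One possible shape for the special case, as a sketch only: both errors above come from treating a tuple type like a struct type. Tuple field names are integers, and a tuple type is constructed from a single tuple rather than from splatted fields. The names `rebuild` and `PairSketch` are hypothetical and not part of Arrow.jl:

```julia
# Structs are rebuilt by splatting field values into the constructor,
# but a tuple type only accepts a single tuple (or iterable) argument.
rebuild(::Type{T}, fields...) where {T<:Tuple} = T(fields)
rebuild(::Type{T}, fields...) where {T} = T(fields...)

struct PairSketch   # hypothetical struct, for comparison with the tuple case
    a::Int64
    b::Float64
end

t = rebuild(Tuple{Int64,Float64}, 1, 2.0)
p = rebuild(PairSketch, 1, 2.0)

# Field "names" of a tuple type are integers, so `string` is needed to turn
# them into text; `String(1)` has no method, as the first trace shows.
name = string(first(fieldnames(Tuple{Int64,Float64})))
```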

Error with DatePart('Z')

Hello, I have encountered a small problem with one of my date formats after adding Arrow to the environment.

MWE
julia> using Dates

julia> time_str = "2020-12-17T20:11:39.093Z"
"2020-12-17T20:11:39.093Z"

julia> DateTime(time_str, dateformat"yyyy-mm-ddTHH:MM:SS.sZ")
2020-12-17T20:11:39.093

julia> using Arrow

julia> DateTime(time_str, dateformat"yyyy-mm-ddTHH:MM:SS.sZ")
ERROR: ArgumentError: Unable to parse date time. Expected directive DatePart(Z) at char 22

On julia v1.5.3 and Arrow#v1.0.3

It's not a big deal, since I can just remove the 'Z' character and it works fine, but I feel it could be worth looking into.

Thank you
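A possible workaround that keeps the trailing 'Z' in the input: the Dates documentation says a code character can be used as a literal delimiter by escaping it with a backslash. The sketch below assumes that rule also applies once another package has registered `Z` as a directive; it was not checked against the Arrow version in this report:

```julia
using Dates

time_str = "2020-12-17T20:11:39.093Z"

# Escaping `Z` marks it as a literal character, so the format should no
# longer depend on whether `Z` has been registered as a format directive.
fmt = DateFormat("yyyy-mm-ddTHH:MM:SS.s\\Z")

parsed = DateTime(time_str, fmt)
println(parsed)
```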

is not a valid arrow file

When attempting to read an Arrow file with a typo in the file name, the error message could say that the file doesn't exist rather than that it is not a valid Arrow file.

In-memory zero copy support

I'm very excited to see support for Arrow in Julia! I was interested in seeing how Julia's Arrow performance compares to Python's, based on the blog post at https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a. The performance is much slower, at least on the first iteration through the data. After analyzing performance and looking at the implementation a bit, it looks like the Julia implementation is copying the dataset into memory even though the file is memory-mapped. The memory allocation is high and matches the size of the dataset.

I understand that support for Arrow in Julia is new, but I was wondering if my understanding is correct and if you are planning to be able to iterate over columns in a memory-mapped Arrow file without having to copy the entire dataset into memory first.

Regards --Roland
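For reference, this is the kind of access pattern the question is about, sketched with the Mmap standard library only. It illustrates the general idea of viewing memory-mapped bytes as a typed column without copying; it is not a description of how Arrow.jl reads files, and the file name is made up:

```julia
using Mmap

# Write a column of Float64 values as raw bytes to a scratch file.
path = joinpath(mktempdir(), "column.bin")
data = collect(1.0:1000.0)
write(path, data)

# Map the file and reinterpret the bytes in place.
bytes = Mmap.mmap(path)                 # Vector{UInt8} backed by the file
column = reinterpret(Float64, bytes)    # lazy typed view over the same memory

# Iterating the view reads straight from the mapping.
total = sum(column)
```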

Segmentation Fault with Threads.@spawn + Tables.partitioner + write with compression

I'm not completely certain where the problem actually is here, but this works when removing compression or not using the partitioner:

using Arrow
using DataFrames
using Tables  # needed for Tables.partitioner below

const flup = DataFrame([[rand('a':'h') for _ in 1:300] for _ in 'a':'h'], Symbol.('a':'h'))
seqis = [Iterators.partition(1:300, 20) for _ in 1:100]
d = mktempdir()
try 
    t = Threads.@spawn for (i,is) in enumerate(seqis)
        Arrow.write(joinpath(d, "$i.arrow"), Tables.partitioner((flup[i,:] for i in is)); compress=:lz4)
    end
    wait(t)
finally
    rm(d, recursive=true)
end

Causes:

compress at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/primitive.jl:77
unknown function (ip: 0x7f157f2f25ff)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
double free or corruption (out)
#toarrowvector#3 at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/arraytypes.jl:38

signal (6): Aborted
in expression starting at none:0
toarrowvector##kw at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/arraytypes.jl:34 [inlined]
toarrowvector##kw at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/arraytypes.jl:34 [inlined]
#113 at /home/ec2-user/.julia/pac#113 at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/write.jl:202
unknown function (ip: 0x7f157f2e9unknown function (ip: 0x7f157f2e904e)

signal (11): Segmentation fault
in expression starting at none:0   
LZ4HC_InsertAndGetWiderMatch.constprop.5 at /home/ec2-user/.julia/artifacts/25d411cb0031c3b42c10c945cdf6eb5253cb44b7/lib/liblz4.so (unknown line)
LZ4HC_compress_generic_internal.part.1 at /home/ec2-user/.julia/artifacts/25d411cb0031c3b42c10c945cdf6eb5253cb44b7/lib/liblz4.so (unknown line)
LZ4_compress_HC_continue at /home/ec2-user/.julia/artifacts/25d411cb0031c3b42c10c945cdf6eb5253cb44b7/lib/liblz4.so (unknown line)
LZ4F_flush at /home/ec2-user/.julia/artifacts/25d411cb0031c3b42c10c945cdf6eb5253cb44b7/lib/liblz4.so (unknown line)
LZ4F_compressEnd at /home/ec2-user/.julia/artifacts/25d411cb0031c3b42c10c945cdf6eb5253cb44b7/lib/liblz4.so (unknown line)
LZ4F_compressEnd at /home/ec2-user/.julia/packages/CodecLz4/2JFgC/src/headers/lz4frame.jl:297
process at /home/ec2-user/.julia/packages/CodecLz4/2JFgC/src/frame_compression.jl:129
transcode at /home/ec2-user/.julia/packages/TranscodingStreams/MsN8d/src/transcode.jl:90
compress at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/compressed.jl:45 [inlined]
compress at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/compressed.jl:49 [inlined]
signal (11): Segmentation fault
in expression starting at none:0
jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
eachcolumn at /home/ec2-user/.julia/packages/Tables/iG2a3/src/utils.jl:70
unknown function (ip: 0x7f157f2e748c)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
toarrowtable at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/write.jl:201
unknown function (ip: 0x7f157f2e5d76)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
macro expansion at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/write.jl:127 [inlined]
#109 at ./threadingconstructs.jl:169
unknown function (ip: 0x7f157f2fa13c)
gsignal at /usr/bin/../lib64/libc.so.6 (unknown line)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
abort at /usr/bin/../lib64/libc.so.6 (unknown line)
unknown function (ip: 0x7f15bb6b95dd)
unknown function (ip: (nil))
Allocations: 48301139 (Pool: 48286788; Big: 14351); GC: 51
LZ4HC_InsertAndGetWiderMatch.constprop.5 at /home/ec2-user/.julia/artifacts/25d411cb0031c3b42c10c945cdf6eb5253cb44b7/lib/liblz4.so (unknown line)
LZ4HC_compress_generic_internal.part.1 at /home/ec2-user/.julia/artifacts/25d411cb0031c3b42c10c945cdf6eb5253cb44b7/lib/liblz4.so (unknown line)
LZ4_compress_HC_continue at /home/ec2-user/.julia/artifacts/25d411cb0031c3b42c10c945cdf6eb5253cb44b7/lib/liblz4.so (unknown line)
LZ4F_flush at /home/ec2-user/.julia/artifacts/25d411cb0031c3b42c10c945cdf6eb5253cb44b7/lib/liblz4.so (unknown line)
LZ4F_compressEnd at /home/ec2-user/.julia/artifacts/25d411cb0031c3b42c10c945cdf6eb5253cb44b7/lib/liblz4.so (unknown line)
LZ4F_compressEnd at /home/ec2-user/.julia/packages/CodecLz4/2JFgC/src/headers/lz4frame.jl:297
process at /home/ec2-user/.julia/packages/CodecLz4/2JFgC/src/frame_compression.jl:129
transcode at /home/ec2-user/.julia/packages/TranscodingStreams/MsN8d/src/transcode.jl:90
compress at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/compressed.jl:45 [inlined]
compress at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/compressed.jl:49 [inlined]
compress at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/primitive.jl:77
unknown function (ip: 0x7f157f2f25ff)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
#toarrowvector#3 at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/arraytypes.jl:38
toarrowvector##kw at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/arraytypes.jl:34 [inlined]
toarrowvector##kw at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/arraytypes.jl:34 [inlined]
#113 at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/write.jl:202
unknown function (ip: 0x7f157f2e904e)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
eachcolumn at /home/ec2-user/.julia/packages/Tables/iG2a3/src/utils.jl:70
unknown function (ip: 0x7f157f2e748c)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
toarrowtable at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/write.jl:201
unknown function (ip: 0x7f157f2e5d76)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
macro expansion at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/write.jl:127 [inlined]
#109 at ./threadingconstructs.jl:169
unknown function (ip: 0x7f157f2fa13c)
jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
unknown function (ip: 0x7f15bb6b95dd)
unknown function (ip: (nil))
Allocations: 48301139 (Pool: 48286788; Big: 14351); GC: 51
__libc_message at /usr/bin/../lib64/libc.so.6 (unknown line)
malloc_printerr at /usr/bin/../lib64/libc.so.6 (unknown line)
Segmentation fault

Some tricky data for Arrow

I have this data in the JDF.jl tests. Arrow is currently failing to write it. We could incorporate something like this in the tests.

using Arrow  # needed for the Arrow.write call below
using JDF
using DataFrames
using Random: randstring
using WeakRefStrings

df = DataFrame([collect(1:100) for i = 1:3000])
df[!, :int_missing] =
    rand([rand(rand([UInt, Int, Float64, Float32, Bool])), missing], nrow(df))

df[!, :missing] .= missing
df[!, :strs] = [randstring(8) for i = 1:nrow(df)]
df[!, :stringarray] = StringVector([randstring(8) for i = 1:nrow(df)])

df[!, :strs_missing] = [rand([missing, randstring(8)]) for i = 1:nrow(df)]
df[!, :stringarray_missing] =
    StringVector([rand([missing, randstring(8)]) for i = 1:nrow(df)])
df[!, :symbol_missing] = [rand([missing, Symbol(randstring(8))]) for i = 1:nrow(df)]
df[!, :char] = getindex.(df[!, :strs], 1)
df[!, :char_missing] = allowmissing(df[!, :char])
df[rand(1:nrow(df), 10), :char_missing] .= missing

@time JDF.save("a.jdf", df) # works with JDF 


# but fails with Arrow
Arrow.write("a.arrow", df)
error msg 4,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Union{Missing, UInt64},Missing,String,String,Union{Missing, String},Union{Missing, String},Union{Missing, Symbol},Char,Union{Missing, Char}}}, ::DataFrame, ::Dict{Int64,Tuple{Int64,Type,Any}}) at C:\Users\RTX2080\.julia\packages\Arrow\FjbLX\src\write.jl:258 [9] macro expansion at C:\Users\RTX2080\.julia\packages\Arrow\FjbLX\src\write.jl:71 [inlined] [10] macro expansion at .\task.jl:332 [inlined] [11] write(::IOStream, ::DataFrame, ::Bool, ::Bool) at C:\Users\RTX2080\.julia\packages\Arrow\FjbLX\src\write.jl:58 [12] #29 at C:\Users\RTX2080\.julia\packages\Arrow\FjbLX\src\write.jl:21 [inlined] [13] open(::Arrow.var"#29#30"{Bool,DataFrame}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at .\io.jl:325 [14] open at .\io.jl:323 [inlined] [15] #write#28 at C:\Users\RTX2080\.julia\packages\Arrow\FjbLX\src\write.jl:20 [inlined] [16] write(::String, ::DataFrame) at C:\Users\RTX2080\.julia\packages\Arrow\FjbLX\src\write.jl:20 [17] top-level scope at REPL[44]:1 [18] include_string(::Function, ::Module, ::String, ::String) at .\loading.jl:1088

Error with DictEncoded columns of more than 127 unique values

julia> using DataFrames, Arrow

julia> let
           _df = DataFrame(a=string.(1:128))
           _df.a = Arrow.DictEncode(_df.a)
           df = DataFrame(Arrow.Table(Arrow.write("_df", _df)))
           copy(df)
       end
ERROR: InexactError: trunc(Int8, 128)
Stacktrace:
 [1] throw_inexacterror(::Symbol, ::Type{Int8}, ::Int64) at ./boot.jl:558
 [2] checked_trunc_sint at ./boot.jl:580 [inlined]
 [3] toInt8 at ./boot.jl:595 [inlined]
 [4] Int8 at ./boot.jl:705 [inlined]
 [5] convert at ./number.jl:7 [inlined]
 [6] setindex!(::Dict{String,Int8}, ::Int64, ::String) at ./dict.jl:380
 [7] Dict{String,Int8}(::Base.Generator{Base.Iterators.Enumerate{Array{String,1}},Arrow.var"#64#65"}) at ./dict.jl:103
 [8] copy(::Arrow.DictEncoded{String,Int8,SentinelArrays.ChainedVector{String,Arrow.List{String,Int32,Array{UInt8,1}}}}) at /Users/yvishnevsky/.julia/packages/Arrow/JIMGY/src/arraytypes.jl:702
 [9] DataFrame(::Array{AbstractArray{T,1} where T,1}, ::DataFrames.Index; copycols::Bool) at /Users/yvishnevsky/.julia/packages/DataFrames/GtZ1l/src/dataframe/dataframe.jl:148
 [10] manipulate(::DataFrame, ::Base.OneTo{Int64}; copycols::Bool, keeprows::Bool) at /Users/yvishnevsky/.julia/packages/DataFrames/GtZ1l/src/abstractdataframe/selection.jl:542
 [11] #manipulate#299 at /Users/yvishnevsky/.julia/packages/DataFrames/GtZ1l/src/abstractdataframe/selection.jl:550 [inlined]
 [12] #select#296 at /Users/yvishnevsky/.julia/packages/DataFrames/GtZ1l/src/abstractdataframe/selection.jl:493 [inlined]
 [13] getindex at /Users/yvishnevsky/.julia/packages/DataFrames/GtZ1l/src/dataframe/dataframe.jl:468 [inlined]
 [14] #copy#160 at /Users/yvishnevsky/.julia/packages/DataFrames/GtZ1l/src/dataframe/dataframe.jl:832 [inlined]
 [15] #DataFrame#125 at /Users/yvishnevsky/.julia/packages/DataFrames/GtZ1l/src/dataframe/dataframe.jl:165 [inlined]
 [16] DataFrame(::DataFrame) at /Users/yvishnevsky/.julia/packages/DataFrames/GtZ1l/src/dataframe/dataframe.jl:165
 [17] top-level scope at REPL[8]:5

I've also been able to inadvertently trigger an OutOfMemoryError when loading a 4-field 7,900-element Arrow file into a DataFrame, when I index into the 7,800th field.

I haven't reduced this down to a reproducible case without private data, but maybe the fact that it's an out of memory error provides some sort of hint. The specific column that I am unable to index into is also a DictEncoded column.
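The `InexactError` in the trace comes from storing the 1-based position 128 in a `Dict{String,Int8}`, and `typemax(Int8)` is 127. A sketch of choosing the index type from the pool size instead of fixing it; `index_type_for` is a hypothetical helper, not an Arrow.jl function:

```julia
# Hypothetical helper: smallest signed integer type able to hold `n`.
function index_type_for(n::Integer)
    for T in (Int8, Int16, Int32, Int64)
        n <= typemax(T) && return T
    end
    throw(ArgumentError("too many unique values: $n"))
end

pool = string.(1:128)                  # same pool size as in the report
K = index_type_for(length(pool))

# With a wide enough value type, building the reverse lookup succeeds.
lookup = Dict{String,K}(v => i for (i, v) in enumerate(pool))
```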

writing column with missing / struct data errors

using Arrow, Dates

struct MyDate
    date::Date
end
Arrow.ArrowTypes.registertype!(MyDate, MyDate)
table = (; date = [rand(Bool) ? MyDate(now()) : missing for _ = 1:10])
io = IOBuffer()
Arrow.write(io, table)

yields

julia> Arrow.write(io, table)
ERROR: TaskFailedException:MethodError: Cannot `convert` an object of type Day to an object of type Arrow.Date{Arrow.Flatbuf.DateUnitModule.DAY,Int32}
Closest candidates are:  convert(::Type{Arrow.Date{Arrow.Flatbuf.DateUnitModule.DAY,Int32}}, ::Date) at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/eltypes.jl:188
  convert(::Type{T}, ::T) where T at essentials.jl:171  Arrow.Date{Arrow.Flatbuf.DateUnitModule.DAY,Int32}(::Any) where {U, T} at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/eltypes.jl:165
Stacktrace: [1] arrowconvert(::Type{T} where T, ::Day) at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arrowtypes.jl:32 [2] getindex at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/utils.jl:118 [inlined]
 [3] iterate at ./abstractarray.jl:986 [inlined] [4] writearray(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Type{Arrow.Date{Arrow.Flatbuf.DateUnitModule.DAY,Int32}}, ::Arrow.Converter{Arrow.Date{Arrow.Flatbuf.DateUnitModule.DAY,Int32},Arrow.ToStruct{Date,1,Array{Union{Missing, MyDate},1}}}) at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/utils.jl:50
 [5] writebuffer(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Arrow.Primitive{Arrow.Date{Arrow.Flatbuf.DateUnitModule.DAY,Int32},Arrow.Converter{Arrow.Date{Arrow.Flatbuf.DateUnitModule.DAY,Int32},Arrow.ToStruct{Date,1,Array{Union{Missing, MyDate},1}}}}, ::Int64) at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/primitive.jl:102
 [6] writebuffer(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Arrow.Struct{Union{Missing, MyDate},Tuple{Arrow.Primitive{Arrow.Date{Arrow.Flatbuf.DateUnitModule.DAY,Int32},Arrow.Converter{Arrow.Date{Arrow.Flatbuf.DateUnitModule.DAY,Int32},Arrow.ToStruct{Date,1,Array{Union{Missing, MyDate},1}}}}}}, ::Int64) at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/arraytypes/struct.jl:127
 [7] write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Arrow.Message, ::Tuple{Array{Arrow.Block,1},Array{Arrow.Block,1}}, ::Base.RefValue{Tables.Schema}, ::Int64) at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/write.jl:249
 [8] macro expansion at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/write.jl:106 [inlined]
 [9] (::Arrow.var"#107#110"{Base.GenericIOBuffer{Array{UInt8,1}},Int64,Arrow.OrderedChannel{Arrow.Message},Base.RefValue{Tables.Schema},Tuple{Array{Arrow.Block,1},Array{Arrow.Block,1}}})() at ./threadingconstructs.jl:169
Stacktrace:
 [1] wait at ./task.jl:267 [inlined]
 [2] write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::NamedTuple{(:date,),Tuple{Array{Union{Missing, MyDate},1}}}, ::Bool, ::Bool, ::Nothing, ::Bool, ::Bool, ::Bool, ::Int64) at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/write.jl:141
 [3] #write#106 at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/write.jl:83 [inlined]
 [4] write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::NamedTuple{(:date,),Tuple{Array{Union{Missing, MyDate},1}}}) at /home/ec2-user/.julia/packages/Arrow/CyJ4L/src/write.jl:83
 [5] top-level scope at REPL[24]:1

Arrow incorrectly handles SubString{String}

Writing then reading SubStrings (which can naturally appear in DataFrames read from CSV files with CSV.jl) results in garbled UInt8 output:

julia> using DataFrames, Arrow

julia> let
               s = "a" ^ 100
               df = DataFrame(a=[SubString(s, 1:10), SubString(s, 11:20)])
               DataFrame(Arrow.Table(Arrow.write("s", df)))
       end
2×1 DataFrame
│ Row │ a                                                            │
│     │ Array{UInt8,1}                                               │
├─────┼──────────────────────────────────────────────────────────────┤
│ 1   │ [0x61, 0x61, 0x61, 0x61, 0x61, 0x61, 0x61, 0x61, 0x61, 0x61] │
│ 2   │ [0x61, 0x61, 0x61, 0x61, 0x61, 0x61, 0x61, 0x61, 0x61, 0x61] │

julia> let
               s = "a" ^ 100
               df = DataFrame(a=[String(SubString(s, 1:10)), String(SubString(s, 11:20))])
               DataFrame(Arrow.Table(Arrow.write("s", df)))
       end
2×1 DataFrame
│ Row │ a          │
│     │ String     │
├─────┼────────────┤
│ 1   │ aaaaaaaaaa │
│ 2   │ aaaaaaaaaa │
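As a stopgap on the caller's side, the conversion in the second example can be applied to a whole column before writing. A minimal sketch; `materialize` is a hypothetical helper name:

```julia
# Hypothetical helper: turn any SubString elements into owned Strings and
# leave everything else (including `missing`) untouched.
materialize(col::AbstractVector) = map(x -> x isa SubString ? String(x) : x, col)

s = "a"^100
col = [SubString(s, 1:10), SubString(s, 11:20)]
fixed = materialize(col)

# Columns with missing values keep them.
with_missing = materialize(Union{Missing,SubString{String}}[SubString(s, 1:2), missing])
```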

TimeZones.jl support

From what I gather, the C bindings for Arrow use DateTimes with time zones.

It would be very useful to support TimeZones.jl for that.

Arrow.DictEncoded conversion in, e.g. DataFrame

I'm not sure if this is an Arrow issue, or a Tables issue, or a DataFrames issue but when I converted an Arrow.Table for which the columntable type is

::NamedTuple{(:Subj, :CC, :trial, :Q, :F, :P, :Pword, :Item, :rt, :Score), Tuple{Arrow.DictEncoded{String, UInt8}, Arrow.DictEncoded{String, UInt8}, Arrow.Primitive{Int16, Int16}, Arrow.DictEncoded{String, UInt8}, Arrow.DictEncoded{String, UInt8}, Arrow.DictEncoded{String, UInt8}, Arrow.DictEncoded{String, UInt16}, Arrow.DictEncoded{String, UInt16}, Arrow.Primitive{Int16, Int16}, Arrow.DictEncoded{String, UInt8}}}

all of my compact representations as Arrow.DictEncoded{String, UInt8} were converted to Vector{String}, which, in these cases, inflates the amount of storage used considerably. Of course, I can go back and convert them to PooledArrays but it seems that somewhere along the line one of the packages should recognize that an Arrow.DictEncoded{String, UInt8} is best converted to a PooledVector{String, UInt8, Vector{UInt8}}
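To put rough numbers on the storage difference, here is a toy dictionary encoding written with Base only. It is a sketch of the concept, not of `Arrow.DictEncoded` or PooledArrays internals; `encode_sketch` and `decode_sketch` are hypothetical names and the column contents are made up:

```julia
# Toy dictionary encoding: a pool of unique values plus one UInt8 code per row.
function encode_sketch(col::Vector{String})
    pool = unique(col)
    length(pool) <= 256 || throw(ArgumentError("pool too large for UInt8 codes"))
    code_of = Dict(v => UInt8(i - 1) for (i, v) in enumerate(pool))
    codes = [code_of[v] for v in col]
    return pool, codes
end

# Expanding the codes again gives back a plain Vector{String}.
decode_sketch(pool, codes) = [pool[c + 1] for c in codes]

col = [string("subject_", i % 20) for i in 1:10_000]
pool, codes = encode_sketch(col)

encoded_bytes = Base.summarysize(pool) + Base.summarysize(codes)
decoded_bytes = Base.summarysize(col)
println((encoded = encoded_bytes, decoded = decoded_bytes))
```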

Opening a file that does not exist still succeeds

julia> using Arrow

julia> Arrow.Table("doesnotexist.arrow")
Arrow.Table: NamedTuple()

Reading a non-existent file should probably error, matching read("doesnotexist.arrow") and CSV.read("doesnotexist.csv"), rather than silently creating an empty file.

I noticed that Mmap.mmap(file) creates the file if it does not exist, which might be the underlying reason for the current behavior.
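
One possible guard, sketched here with Base functions only, is to check that the path is an existing regular file before anything gets a chance to create it. The helper name require_existing_file is hypothetical and not part of Arrow.jl; this only illustrates the idea, not how the package does or should implement it.

```julia
# Hypothetical guard (not part of Arrow.jl): refuse to touch a path that is
# not an existing regular file, so nothing can create it as a side effect.
function require_existing_file(path::AbstractString)
    isfile(path) || throw(ArgumentError("\"$path\" is not a valid file"))
    return path
end

# Usage: a path that does not exist throws and is not created.
missing_path = joinpath(mktempdir(), "doesnotexist.arrow")
threw = try
    require_existing_file(missing_path)
    false
catch err
    err isa ArgumentError
end
```

A reader could call such a check first and only then hand the path on to the memory-mapping code.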

CategoricalArrays strikes again

I think we should handle this:

julia> df = DataFrame(x6=categorical(["a","b"]), x7=categorical(["a",missing]))
2×2 DataFrame
│ Row │ x6   │ x7      │
│     │ Cat… │ Cat…?   │
├─────┼──────┼─────────┤
│ 1   │ a    │ a       │
│ 2   │ b    │ missing │

julia> Arrow.write("test.arrow", df)
ERROR: MethodError: no method matching arrowtype(::Arrow.FlatBuffers.Builder, ::Type{CategoricalValue{String,UInt32}})

Re-use PyArrow memory via PyCall

Hi @quinnj, thank you for being willing to consider this wild attempt! The only packages you need to recreate this are PyArrow and awkward-1.0 on the Python side and PyCall.jl on the Julia side.

Create example arr:

julia> using PyCall

julia> ak = pyimport("awkward");

julia> arr = ak.Array(py"[[1,2,3], [], [4,5]]")
PyObject <Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>

julia> arr.layout
PyObject <ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x0000037ff330"/></offsets>
    <content><NumpyArray format="l" shape="5" data="1 2 3 4 5" at="0x000003a256a0"/></content>
</ListOffsetArray64>

Then you can get a pyarrow object via:

julia> arr_arrow = ak.to_arrow(arr)
PyObject <pyarrow.lib.ListArray object at 0x7f2144343048>
...
..

julia> @time [Int64[x...] for x in arr]
  0.034800 seconds (38.72 k allocations: 2.031 MiB)
3-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 []
 [4, 5]

Currently the fastest, least-copy method of re-using the memory has been:

function view_ak(arr)
    c = PyArray(arr.layout."content")
    o = PyArray(arr.layout."offsets")
    @views [c[o[i]+1:o[i+1]] for i in 1:length(o)-1]
end

julia> @time view_ak(arr)
  0.000089 seconds (37 allocations: 1.609 KiB)
3-element Array{SubArray{Int64,1,PyArray{Int64,1},Tuple{UnitRange{Int64}},false},1}:
 [1, 2, 3]
 0-element view(::PyArray{Int64,1}, 4:3) with eltype Int64
 [4, 5]
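
For anyone without a Python setup, the same offsets-plus-content indexing can be tried on plain Julia vectors. This is only a sketch: the content and offsets values are copied from the arr.layout output above, and list_views is a hypothetical helper standing in for view_ak, with ordinary Vectors in place of PyArray.

```julia
# Stand-ins for the list layout shown above: one flat content buffer plus
# zero-based offsets, where list i spans offsets[i]+1 : offsets[i+1].
content = Int64[1, 2, 3, 4, 5]
offsets = Int64[0, 3, 3, 5]

# Hypothetical helper mirroring view_ak, but on plain Julia vectors.
function list_views(content::AbstractVector, offsets::AbstractVector{<:Integer})
    return [view(content, offsets[i]+1:offsets[i+1]) for i in 1:length(offsets)-1]
end

lists = list_views(content, offsets)

# The views share memory with `content`: a write through a view is visible
# in the underlying buffer, which is the point of avoiding the copy.
lists[1][1] = 10
```

Because each element is a view, no list data is copied; only the small outer vector of views is allocated.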
