Comments (12)
The UUID case is now fixed (defined by default in Arrow) and we've updated the docs to mention that registertype! needs to be called. I'm considering some larger changes to type serialization, so this might be something we make easier as part of that.
from arrow-julia.
@quinnj what's the motivation behind https://github.com/JuliaData/Arrow.jl/blob/3ab2b18829c1656198a85759360389b6bbb22ab3/src/arraytypes/struct.jl#L86? Is it just to give the "convenience behavior" listed in the OP or is there a deeper reason? If it's just the former, I wonder if it's better just to remove it...I ran into another related issue just now.
I'm essentially implementing the following (which is also why I needed #150):
struct Foo ... end
struct _FooArrow ... end
Foo(::_FooArrow) = ...
Arrow.ArrowTypes.registertype!(Foo, _FooArrow)
Arrow.ArrowTypes.arrowconvert(::Type{_FooArrow}, f::Foo) = ...
The above would, in theory, allow me to have full control over Arrow <-> Julia conversion for my Foo type.
The problem is that Arrow.jl is automatically calling ArrowTypes.registertype!(_FooArrow, _FooArrow) on write even though I don't want it to :( As a caller I can't really think of a scenario where I would want auto-registration, but I could be missing something.
Even if we can't get rid of it in general, would it be possible to gate this behavior behind a flag passed to Arrow.write (autoregister=true)?
from arrow-julia.
(The workaround is to just call Arrow.ArrowTypes.registertype!(UUID, UUID) before deserializing. But I think the hidden statefulness is still confusing / problematic).
from arrow-julia.
Just to add another reason in favor of removing it, mutating the global registration dict at write-time seems like it could be an issue for concurrent writing from different threads (ref #90 (comment) for other thread safety issues). Whereas the user could be sure to always manually register outside the threaded region of code.
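To make the threading point concrete, here is a minimal sketch. It uses a hypothetical stand-in registry (REGISTRY, REGISTRY_LOCK and register! are made-up names, not Arrow.jl internals): registration happens once up front, and the threaded region only ever reads.

```julia
# Hypothetical stand-in for a global type registry; not Arrow.jl's actual internals.
const REGISTRY = Dict{Type,Type}()
const REGISTRY_LOCK = ReentrantLock()

# Guarded registration, so that concurrent callers cannot corrupt the Dict.
function register!(juliatype::Type, arrowtype::Type)
    lock(REGISTRY_LOCK) do
        REGISTRY[juliatype] = arrowtype
    end
    return nothing
end

struct Foo
    x::Int
end

# Register once, before any threaded work starts...
register!(Foo, Foo)

# ...so the threaded region only reads the registry and never mutates it.
results = Vector{Type}(undef, 8)
Threads.@threads for i in 1:8
    results[i] = REGISTRY[Foo]
end
```

If registration instead happened implicitly inside the loop body, every iteration could mutate the shared Dict, which is the situation described above.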
from arrow-julia.
Yeah, these are good points for removing the auto registering. The main reason for having it was convenience.
from arrow-julia.
@jrevels, can you explain your use-case/example a bit more? What I don't quite follow is how _FooArrow will be supported. The 2nd argument to registertype! should be a native arrow type that your custom type converts to.
from arrow-julia.
Hold up, don't mind me. I'm digging back through all the code and in the structs.jl file we know how to serialize a _FooArrow, so yeah, I think I understand the example better now.
from arrow-julia.
Wait, backsies again. So the problem with not autoregistering is that without ArrowTypes.registertype!(_FooArrow, _FooArrow), we don't know how to deserialize the struct; it would just deserialize as a NamedTuple. Here's where my thinking is going, though I recognize the code itself doesn't currently reflect this vision:
- Stop autoregistering StructTypes; we'd require users to call ArrowTypes.registertype! for custom types
- By default, we'd assume that ArrowTypes.registertype!(::Type{T}) where {T} = registertype!(T, T), which means the custom struct would be serialized as-is, and when deserializing, we'd just call (essentially) T(serialized_fields...)
- For more customizability when serializing custom structs, you would do something like: ArrowTypes.registertype!(T, @NamedTuple{field1::Int, field2::String}), where, e.g., you only want to serialize field1 and field2 of your custom type. This would then require a corresponding definition like: ArrowTypes.arrowconvert(::Type{@NamedTuple{field1::Int, field2::String}}, x::T) = (field1=x.field1, field2=x.field2), though I think we could provide some kind of auto-convert fallback, like: ArrowTypes.arrowconvert(::Type{T}, x) where {T <: NamedTuple} = (; (nm => getfield(x, nm) for nm in fieldnames(T))...)
- In addition, you'd be able to define a custom ArrowTypes.arrowconvert(::Type{T}, x::@NamedTuple{field1::Int, field2::String}) = T(x.field1, x.field2); this would allow "hooking" into deserialization, to fix cases like #135
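A rough, self-contained sketch of how the pieces above might fit together. Everything here is an assumption for illustration: MockArrowTypes, its MAPPING Dict, Person and PersonArrow are hypothetical, and the module only mimics the proposed behavior rather than reproducing what ArrowTypes actually does.

```julia
# Self-contained mock of the proposal; `MockArrowTypes` is hypothetical
# and does not reproduce ArrowTypes' real internals.
module MockArrowTypes

const MAPPING = Dict{Type,Type}()

# Explicit registration: Julia type => type it is serialized as.
registertype!(::Type{T}, ::Type{S}) where {T,S} = (MAPPING[T] = S; nothing)
# Proposed default: serialize the struct as-is.
registertype!(::Type{T}) where {T} = registertype!(T, T)

# Proposed auto-convert fallback: copy the fields named by `T` out of `x`.
arrowconvert(::Type{T}, x) where {T<:NamedTuple} =
    T(map(nm -> getfield(x, nm), fieldnames(T)))

end # module

struct Person
    name::String
    age::Int
    cache::Vector{Int}  # not meant to be serialized
end

const PersonArrow = @NamedTuple{name::String, age::Int}

# Only `name` and `age` are serialized.
MockArrowTypes.registertype!(Person, PersonArrow)

# Deserialization hook: rebuild the struct from the serialized fields.
MockArrowTypes.arrowconvert(::Type{Person}, x::PersonArrow) =
    Person(x.name, x.age, Int[])

p = Person("Ada", 36, [1, 2, 3])
lowered = MockArrowTypes.arrowconvert(MockArrowTypes.MAPPING[Person], p)
restored = MockArrowTypes.arrowconvert(Person, lowered)
```

Here the serialization direction relies entirely on the NamedTuple fallback, while the deserialization direction is the one user-defined method, which is the "hook" mentioned in the last bullet.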
from arrow-julia.
So the problem with not autoregistering, is that without ArrowTypes.registertype!(_FooArrow, _FooArrow), we don't know how to deserialize the struct, it would just deserialize as a NamedTuple.
Ah, but for me this is the desired behavior :) I want it to deserialize as NamedTuple unless I, the caller, tell it explicitly not to. Right now it feels like Arrow.jl is making the decision for me, and it's making the wrong one (AFAICT).
Reopening this issue as it seems like the discussion may lead to some action items :)
from arrow-julia.
ref beacon-biosignals/Onda.jl#68 for a motivating example.
my thoughts are very rough/not super well-considered yet, but off the top of my head, here are my big "wants" (some of these might already be possible w/ existing behavior):
- For the current Dict-based registertype! mechanism to somehow be replaced/augmented by method dispatch (motivation here: https://github.com/beacon-biosignals/Onda.jl/pull/68/files#diff-9d1b70fd041b1dbbe08ff4096cf1c68daa131b7d249d2ba3101e9079e129f44cR505 ; if there's another way to do this that I'm not seeing with the current system, that'd be dope). The barriers here AFAICT are a) dynamic dispatch might be slower than the current Dict lookup during deserialization, so we'd have to amortize it by doing it up front based on the present extension metadata (I think this should work?) and b) this mechanism would be less dynamic than registertype! currently is (I would be happy to make it less dynamic, but maybe there's a use case where the extra dynamism is useful?). If handling this starts to look like it requires @eval, we could consider a combined approach, e.g. keep the current *_MAPPING Dict but have it contain extension string => Julia function pairs, to hide Julia's "smart dispatch" behind a "dumb dispatch" tier. That way the mapping would be clear/resolvable without eval but callers could take advantage of Julia's dispatch for e.g. allowing full use of type parameters.
- For there to be very clear/well-named/well-documented/separate "pre-serialization hook" and "post-deserialization hook" transformations. I'm not overly concerned with names as long as the duality is clear, e.g. lower/raise, toarrow/fromarrow, etc. I wouldn't want to use a single function for both as, IME, doing so forces the caller to "do more than they intend to do" by overloading that function. Like, I want to be able to define a pre-serialization hook to go from A in Julia -> B in Arrow, without implying anything about the transformation behavior of A in Arrow -> B in Julia.
- For me to be able to specify/toggle behavior at the callsite, i.e. turn off the pre/post-hooks for a specific type during a specific read/write operation. Overriding these sorts of global behaviors is really useful in practice for e.g. migrating data between different representation versions w/o needing to adjust your running environment.
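A very rough sketch of what these three "wants" could look like together. All names are hypothetical (toarrow, fromarrow, FROMARROW_MAPPING, read_value and the hooks keyword are made up for illustration and are not claimed to exist in Arrow.jl):

```julia
# Hypothetical names throughout; none of these are claimed to exist in Arrow.jl.
struct Celsius
    degrees::Float64
end

# Pre-serialization hook: Julia value -> Arrow-friendly value.
toarrow(x) = x                      # default: no transformation
toarrow(c::Celsius) = c.degrees

# Post-deserialization hook, defined independently of `toarrow`.
fromarrow(::Type{Celsius}, x::Float64) = Celsius(x)

# "Dumb dispatch" tier: extension name => plain Julia function.
# The function itself is free to use ordinary method dispatch.
const FROMARROW_MAPPING = Dict{String,Function}(
    "Celsius" => x -> fromarrow(Celsius, x),
)

# `hooks=false` skips the post-deserialization hook at the callsite.
function read_value(extension::String, raw; hooks::Bool=true)
    hooks || return raw
    f = get(FROMARROW_MAPPING, extension, identity)
    return f(raw)
end

stored = toarrow(Celsius(21.5))
```

With this shape, read_value("Celsius", stored) goes through the hook, while read_value("Celsius", stored; hooks=false) hands back the raw stored value, which is the kind of per-call override described in the last bullet.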
from arrow-julia.
preserving some relevant convo from the Julia Slack (https://julialang.slack.com/archives/C674VR0HH/p1615846377461400?thread_ts=1615681758.430500&cid=C674VR0HH)
from arrow-julia.
Closing now that #156 is merged/tagged
from arrow-julia.