Giter Site home page Giter Site logo

Comments (12)

quinnj avatar quinnj commented on May 26, 2024 1

The UUID case is now fixed (defined by default in Arrow) and we've updated the docs to mention the need to call registertype!. I'm considering some larger changes to type serializing and such, so this might be something we make easier with that.

from arrow-julia.

jrevels avatar jrevels commented on May 26, 2024 1

@quinnj what's the motivation behind https://github.com/JuliaData/Arrow.jl/blob/3ab2b18829c1656198a85759360389b6bbb22ab3/src/arraytypes/struct.jl#L86? Is it just to give the "convenience behavior" listed in the OP or is there a deeper reason? If it's just the former, I wonder if it's better just to remove it...I ran into another related issue just now.

I'm essentially implementing the following (which is also why I needed #150):

struct Foo ... end

struct _FooArrow ... end

Foo(::_FooArrow) = ...

Arrow.ArrowTypes.registertype!(Foo, _FooArrow)

Arrow.ArrowTypes.arrowconvert(::Type{_FooArrow}, f::Foo) = ...

the above in theory would allow me to have full control over Arrow <-> Julia conversion for my Foo type.

The problem is that Arrow.jl is automatically calling ArrowTypes.registertype!(_FooArrow, _FooArrow) on write even though I don't want it to :( as a caller I can't really think of a scenario where I would want auto-registration, but I could be missing something.

Even if we can't get rid of it in general, would it be possible to gate this behavior behind a flag passed to Arrow.write (autoregister=true)?

from arrow-julia.

ericphanson avatar ericphanson commented on May 26, 2024

(The workaround is to just call Arrow.ArrowTypes.registertype!(UUID, UUID) before deserializing. But I think the hidden statefulness is still confusing / problematic).

from arrow-julia.

ericphanson avatar ericphanson commented on May 26, 2024

Just to add another reason in favor of removing it, mutating the global registration dict at write-time seems like it could be an issue for concurrent writing from different threads (ref #90 (comment) for other thread safety issues). Whereas the user could be sure to always manually register outside the threaded region of code.

from arrow-julia.

quinnj avatar quinnj commented on May 26, 2024

Yeah, these are good points for removing the auto registering. The main reason for having it was convenience.

from arrow-julia.

quinnj avatar quinnj commented on May 26, 2024

@jrevels , can you explain your use-case/example a bit more? What I dont' quite follow is how _FooArrow will be supported? The 2nd argument to registertype! should be a native arrow type that your custom type converts to.

from arrow-julia.

quinnj avatar quinnj commented on May 26, 2024

Hold up, don't mind me. I'm digging back through all the code and in the structs.jl file we know how to serialize a _FooArrow, so yeah, I think I understand the example better now.

from arrow-julia.

quinnj avatar quinnj commented on May 26, 2024

Wait, backsies again. So the problem with not autoregistering, is that without ArrowTypes.registertype!(_FooArrow, _FooArrow), we don't know how to deserialize the struct, it would just deserialize as a NamedTuple. Here's where my thinking is going, though I recognize the code itself doesn't currently reflect this vision:

  • Stop autoregistering StructTypes, we'd require users to call ArrowTypes.registertype! for custom types
  • By default, we'd assume that Arrow.Types.registertype!(::Type{T}) where {T} = registertype!(T, T), which means the custom struct would be serialized as-is, and when deserializing, we'd just call (essentially) T(serialized_fields...)
  • For more customizability when serializing custom structs, you would do something like: ArrowTypes.registertype!(T, @NamedTuple{field1::Int, field2::String}), where, e.g., you only want to serialize field1 and field2 of your custom type. This would then require a corresponding definition like: ArrowTypes.arrowconvert(::Type{@NamedTuple{field1::Int, field2::String}}, x::T) = (field1=x.field1, field2=x.field2), though I think we could provide some kind of auto-convert fallback, like: ArrowTypes.arrowconvert(::Type{T}, x) where {T <: NamedTuple} = (; nm=>getfield(x, nm) for nm in names(T))
  • In addition, you'd be able to define a custom ArrowTypes.arrowconvert(::Type{T}, x::@NamedTuple{field1::Int, field2::Strong}) = T(x.field1, x.field2); this would allow "hooking" into deserialization, to fix cases like #135

from arrow-julia.

jrevels avatar jrevels commented on May 26, 2024

So the problem with not autoregistering, is that without ArrowTypes.registertype!(_FooArrow, _FooArrow), we don't know how to deserialize the struct, it would just deserialize as a NamedTuple.

Ah, but for me this is the desired behavior :) I want it to deserialize as NamedTuple unless I, the caller, tell it explicitly not to. Right now it feels like Arrow.jl is making the decision for me, and it's making the wrong one (AFAICT).

Reopening this issue as it seems like the discussion may lead to some action items :)

from arrow-julia.

jrevels avatar jrevels commented on May 26, 2024

ref beacon-biosignals/Onda.jl#68 for a motivating example.

my thoughts are very rough/not super well-considered yet, but off the top of my head, here are my big "wants" (some of these might already be possible w/ existing behavior):

  • For the current Dict-based registertype! mechanism to somehow be replaced/augmented by method dispatch (motivation here: https://github.com/beacon-biosignals/Onda.jl/pull/68/files#diff-9d1b70fd041b1dbbe08ff4096cf1c68daa131b7d249d2ba3101e9079e129f44cR505 ; if there's another way to do this that I'm not seeing with the current system, that'd be dope). The barriers here AFAICT are a) dynamic dispatch might be slower than the current Dict look up during deserialization, so we'd have to amortize it by doing it up front based on the present extension metadata (I think this should work?) and b) this mechanism would be less dynamic than registertype! currently is (I would be happy to make it less dynamic, but maybe there's a use case where the extra dynamism is useful?). If handling this starts to look like it requires @eval, we could consider a combined approach, e.g. keep the current *_MAPPING Dict but have it contain extension string => Julia function pairs, to hide Julia's "smart dispatch" behind a "dumb dispatch" tier. That way the mapping would be clear/resolvable without eval but callers could take advantage of Julia's dispatch for e.g. allowing full use of type parameters.
  • For there to be very clear/well-named/well-documented/separate "pre-serialization hook" and "post-deserialization hook" transformations. I'm not overly concerned with names as long as the duality is clear, e.g. lower/raise, toarrow/fromarrow, etc. I wouldn't want to use a single function for both as, IME, doing so forces the caller to "do more than they intend to do" by overloading that function. Like, I want to be able to define a pre-serialization hook to go from A in Julia -> B in Arrow, without implying anything about the transformation behavior of A in Arrow -> B in Julia.
  • For me to be able to specify/toggle behavior at the callsite. i.e. turn off the pre/post-hooks for a specific type during a specific read/write operation. Overriding these sorts of global behaviors is really useful in practice for e.g. migrating data between different representation versions w/o needing to adjust your running environment.

from arrow-julia.

jrevels avatar jrevels commented on May 26, 2024

preserving some relevant convo from the Julia Slack (https://julialang.slack.com/archives/C674VR0HH/p1615846377461400?thread_ts=1615681758.430500&cid=C674VR0HH)

Screen Shot 2021-03-16 at 10 27 42 PM

from arrow-julia.

quinnj avatar quinnj commented on May 26, 2024

Closing now that #156 is merged/tagged

from arrow-julia.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.