arrow2-convert's People

Contributors

emilk, jleibs, jondo2010, jorgecarleitao, joshuataylor, ncpenke, nielsmeima, sebosp, teymour-aldridge


arrow2-convert's Issues

Add support for large types

The complexity is mostly in the serialize path since for deserialize we can just look at the arrow type (LargeList, LargeUtf8, etc) and cast to the appropriate array type.

A couple of ways I can think of to support this for serialize:

  1. Only support i64 offsets, and provide a conversion method that converts large types to small types in another pass

  2. Support an attribute either on a container or per field to use the large offset.
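For option 2, a hypothetical attribute could look like the following sketch (the `large` attribute name and placement are invented for illustration, not an existing API):

```rust
// Hypothetical syntax sketch -- `large` is an invented attribute name.
#[derive(ArrowField)]
struct Foo {
    // A per-field opt-in would map this field to LargeUtf8 instead of Utf8...
    #[arrow_field(large)]
    name: String,
    // ...and this one to LargeList/LargeBinary instead of List/Binary.
    #[arrow_field(large)]
    values: Vec<u8>,
}
```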

Renaming crates before publishing?

One thing that stands out is that we have two crates, arrow2_derive and derive_internals, both of which need to be published.

If we were to follow the typical convention, the derive_internals crate should actually be arrow2_derive, and the current arrow2_derive, which contains the recently added traits should be called something else, perhaps arrow2_convert?

There is some additional functionality that could go into arrow2_convert, providing a higher-level helper API than what the arrow2 crate offers.

@jorgecarleitao thoughts?

Bump arrow2 version

arrow2-derive currently depends on arrow2 = { version = "0.4", default-features = false }. This makes it incompatible with arrow2 0.6.2. Since the version change is not exactly trivial, I'm opening this issue.

I couldn't resolve the following trying to bump the version manually:

error[E0277]: a value of type `DataType` cannot be built from an iterator over elements of type `Field`
   --> src/test.rs:270:69
    |
270 |         let fields = (0..FooArray::n_fields()).map(FooArray::field).collect();
    |                                                                     ^^^^^^^ value of type `DataType` cannot be built from `std::iter::Iterator<Item=Field>`
    |
    = help: the trait `FromIterator<Field>` is not implemented for `DataType`
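One possible fix for this particular error, assuming the test only needs a struct data type: collect into a concrete `Vec<Field>` first, then construct the `DataType` explicitly (a sketch against the arrow2 API, untested here):

```rust
// DataType no longer implements FromIterator<Field>, so collect into a
// Vec<Field> and build the struct data type from it explicitly.
let fields: Vec<Field> = (0..FooArray::n_fields()).map(FooArray::field).collect();
let data_type = DataType::Struct(fields);
```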

Fallible deserialization and constrained types/collections

(This is indirectly related to #79)

A few examples I was also wondering about that might be worth pondering: legitimate collections where a direct .collect() just doesn't cut it (along with the current TryIntoCollection logic).

  1. Types like vec1::Vec1<T> (a fairly popular crate for what it does). It can be iterated over but, obviously, doesn't support FromIterator because construction can fail if there are no items. It also can't implement Default, and you have to provide the first element via new(first). Serialization is not a problem, but what about deserialization?

  2. Any constrained collections, e.g. maybe EvenVec<T: Integer> that has a try_push() which may fail if the integer is not even. Again, FromIterator can't be implemented (but Default can be, unlike the previous example).

In both of these examples you would probably have them implement TryFrom<Vec<T>> or TryFrom<&[T]> (or both, depending on whether it uses Vec<T> internally or not) - in fact vec1::Vec1 already implements both.

  • So, one way would be to convert/collect elements from arrow into a Vec and then TryFrom-convert it into your custom collection. I believe this will cover the majority of cases like this.
  • Another way would be to add support for failures during deserialization (see next example).
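The collect-then-TryFrom route from the first bullet can be shown with a self-contained stand-in for such a constrained collection. EvenVec and NotEven are invented for illustration, and the plain Vec here stands in for what TryIntoCollection would produce:

```rust
use std::convert::TryFrom;

// Invented constrained collection: only holds even integers.
#[derive(Debug, PartialEq)]
struct EvenVec(Vec<i32>);

#[derive(Debug, PartialEq)]
struct NotEven(i32);

impl TryFrom<Vec<i32>> for EvenVec {
    type Error = NotEven;
    fn try_from(v: Vec<i32>) -> Result<Self, Self::Error> {
        // Reject the conversion if any element violates the constraint.
        match v.iter().find(|&&x| x % 2 != 0) {
            Some(&odd) => Err(NotEven(odd)),
            None => Ok(EvenVec(v)),
        }
    }
}

fn main() {
    // Step 1: collect deserialized elements into a plain Vec
    // (stand-in for the output of try_into_collection).
    let decoded: Vec<i32> = vec![2, 4, 6];
    // Step 2: fallibly convert into the constrained collection.
    let even = EvenVec::try_from(decoded).unwrap();
    assert_eq!(even, EvenVec(vec![2, 4, 6]));

    // The conversion fails if the constraint is violated.
    assert_eq!(EvenVec::try_from(vec![2, 3]), Err(NotEven(3)));
}
```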

In regards to fallible deserialization, there's an even simpler example without containers:

  1. struct EvenNumber(i32) with an EvenNumber::try_new(i32) -> Result<Self, Error> fallible constructor. Again, serializing this is fine. But what about deserialization? The current signature returns deserialize(...) -> Option<T> which wouldn't support cases like this.

  2. Similar example, but from standard library - the family of NonZero{U,I}* types - https://doc.rust-lang.org/stable/std/num/index.html.
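The NonZero case can be sketched in plain std to show why an Option<T> return value alone is not enough: a null slot and an invalid value both need to be representable. deserialize_nonzero is an invented helper for illustration, not part of the crate:

```rust
use std::num::NonZeroU32;

// A fallible deserialization step: the raw value may be null (None) or
// may violate the type's invariant (zero), and the two must be distinct.
fn deserialize_nonzero(raw: Option<u32>) -> Result<Option<NonZeroU32>, String> {
    match raw {
        // A null slot deserializes to None, which is fine.
        None => Ok(None),
        // A present value must pass the NonZero constraint, or we error.
        Some(x) => NonZeroU32::new(x)
            .map(Some)
            .ok_or_else(|| format!("invalid value for NonZeroU32: {x}")),
    }
}

fn main() {
    assert_eq!(deserialize_nonzero(None), Ok(None));
    assert_eq!(deserialize_nonzero(Some(7)), Ok(NonZeroU32::new(7)));
    // With the current Option<T> signature, this case has no home.
    assert!(deserialize_nonzero(Some(0)).is_err());
}
```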

Add support for Vec<Option<Struct>>

This is mostly in place with #10, but we need to finish the deserialize path to check the validity bits, and do the right thing with the nested field iterators. Test cases are also needed.

Crash while serializing

thread 'main' panicked at 'attempt to subtract with overflow', src/analysis/tasks.rs:270:75
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e0098a5cc3a87d857e597af824d0ce1ed1ad85e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/e0098a5cc3a87d857e597af824d0ce1ed1ad85e0/library/core/src/panicking.rs:65:14
   2: core::panicking::panic
             at /rustc/e0098a5cc3a87d857e597af824d0ce1ed1ad85e0/library/core/src/panicking.rs:114:5
   3: <lisa_rust_analysis::analysis::tasks::MutableTaskStateArray as arrow2::array::TryPush<core::option::Option<__T>>>::try_push
             at ./tools/analysis/src/analysis/tasks.rs:270:75
   4: <lisa_rust_analysis::analysis::tasks::TaskState as arrow2_convert::serialize::ArrowSerialize>::arrow_serialize
             at ./tools/analysis/src/analysis/tasks.rs:270:75
   5: <lisa_rust_analysis::analysis::tasks::MutableTasksStatesRowArray as arrow2::array::TryPush<core::option::Option<__T>>>::try_push
             at ./tools/analysis/src/analysis/tasks.rs:575:24
   6: <lisa_rust_analysis::analysis::tasks::TasksStatesRow as arrow2_convert::serialize::ArrowSerialize>::arrow_serialize
             at ./tools/analysis/src/analysis/tasks.rs:575:24
   7: arrow2_convert::serialize::arrow_serialize_extend_internal
             at /home/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow2_convert-0.3.2/src/serialize.rs:328:9
   8: arrow2_convert::serialize::arrow_serialize_to_mutable_array
             at /home/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow2_convert-0.3.2/src/serialize.rs:343:5
   9: <Collection as arrow2_convert::serialize::TryIntoArrow<alloc::boxed::Box<dyn arrow2::array::Array>,Element>>::try_into_arrow
             at /home/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow2_convert-0.3.2/src/serialize.rs:427:12

This crashes in debug, and compiling in --release mode silently leads to a corrupted file that pyarrow chokes on when using pyarrow.compute.struct_field:

pyarrow.lib.ArrowIndexError: Index -1 out of bounds

is it possible to run `try_into_collection` on a `Chunk` instead of an `Array`?

Starting with the parquet_read_parallel example from arrow2, I am trying to deserialize a Chunk into a Vec of structs.

Using the deserialize_parallel function as defined in the above example, the following code currently works for me:

pub struct Document {
    content: String,
}

...
let chunk = deserialize_parallel(&mut columns)?;
let array = StructArray::new(
    DataType::Struct(fields.clone()),
    chunk.arrays().to_vec(),
    None,
);
let documents: Vec<Document> = array.to_boxed().try_into_collection().unwrap();

Questions:

  1. With the currently exposed APIs in arrow2 and arrow2-convert, is there a better way to convert the Chunk into a Struct? I think the extra conversion from Chunk to StructArray with the to_boxed at the end is perhaps not the most efficient.
  2. Would it be possible to expose TryIntoCollection::try_into_collection directly on the Chunk as well?

Can't (de)serialize `Buffer<u8>`

arrow_array_deserialize_iterator::<Buffer<u8>>(&array) yields:

error[E0277]: the trait bound `u8: ArrowEnableVecForType` is not satisfied
  --> arrow2_convert/tests/test_deserialize.rs:85:16
   |
85 |     arrow_array_deserialize_iterator::<Buffer<u8>>(&array)
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `ArrowEnableVecForType` is not implemented for `u8`
   |
   = help: the following other types implement trait `ArrowEnableVecForType`:
             f32
             f64
             i16
             i32
             i64
             i8
             u16
             u32
             u64
   = note: required for `Buffer<u8>` to implement `ArrowDeserialize`

So all types are supported except for u8. Buffer<u8> is sometimes desired over Vec<u8>, because of the internal ref-counting of Buffer.

Re-export `ArrowSerialize`, `ArrowDeserialize` and `ArrowField` at the top

Let's be ergonomic, and more like serde.

There are only 3 modules, they are all tiny (essentially one main symbol in each), and having to always write out arrow2_convert::serialize::ArrowSerialize is not really cool.

Also, having both ArrowField symbols (trait and proc-macro) in the same scope would be a more common way of doing things and a bit nicer.

Plus proc-macros (and regular macros) will generate a lot less code spam.

Suggestion: export all three main traits at the crate root. If that's ok, I can open a quick PR (and I'll update the proc-macros in the generics branch).

New release? 🙏

arrow2 0.17.0 was released last week, and it would be nice to have a matching arrow2-convert release!

Support conversion of rust struct to an arrow2 chunk

Created from the discussion in jorgecarleitao/arrow2#1092.

A rust struct can conceptually represent either an Arrow Struct or an arrow2::Chunk (a column group). The arrow2::Chunk is important since it's used in the deserialization/serialization API for parquet and flight conversion.

We can extend the arrow2_convert::TryIntoArrow and arrow2_convert::FromArrow traits to convert to/from arrow2::Chunk, but there are two possible mappings from a vector of structs, Vec<S> to Chunk:

  1. The Chunk has a single field of type Struct
  2. The Chunk contains the same number of fields as the struct.

1 can be easily supported by wrapping the arrow2::Array in a Chunk.

2 has a couple of approaches:

a. A new derive macro to generate the mapping to a Chunk (e.g. ArrowChunk or ArrowRoot).
b. Providing a helper method to convert an arrow2::StructArray to a Chunk by unwrapping the fields.

One related use-case that could guide this design is to support generic typed versions of the arrow2 csv, json, parquet, and flight serialize/deserialize methods, where the schema is specified by a rust struct (opened #41 for this). To achieve this, it would be useful to access the deserialize/serialize methods of each column separately for parallelism which is cleaner via 2a.

Generics example

@ncpenke Hey - I'm in the process of adding generics support (to structs, as a first step), which is a pretty painful process to say the least 🤣

As a matter of fact, most of it compiles, except a few weird quirks. It would really help if you could provide a hand-written example of how it's supposed to work so I wouldn't be guessing blindly.

First, deserialization. There's this bound in the ArrowDeserialize trait that seems to be causing problems:

pub trait ArrowDeserialize: ArrowField + Sized
where
    Self::ArrayType: ArrowArray,
    for<'a> &'a Self::ArrayType: IntoIterator, // <----
{ ... }

Basically, if for some struct Foo<A, B> for all implementations we require

impl OneOfTheTraits for OneOfTheStructs<A, B>
where 
    A: ArrowDeserialize, 
    B: ArrowDeserialize,
{ ... }

this leads to

error[E0277]: `&'a <A as ArrowDeserialize>::ArrayType` is not an iterator
   = help: the trait `for<'a> Iterator` is not implemented for `&'a <A as ArrowDeserialize>::ArrayType`
   = note: required because of the requirements on the impl of `for<'a> IntoIterator` for `&'a <A as ArrowDeserialize>::ArrayType`

However, if we require

impl OneOfTheTraits for OneOfTheStructs<A, B>
where
    A: ArrowDeserialize,
    B: ArrowDeserialize,
    for<'a> &'a <A as ArrowDeserialize>::ArrayType: IntoIterator,
    for<'a> &'a <B as ArrowDeserialize>::ArrayType: IntoIterator,
{ ... }

this results in

overflow evaluating the requirement `for<'_a> &'_a FooArray<_, _>: IntoIterator`

Wonder if you'd have any ideas on this, or maybe provide a trivial working example? 🤔

(I have a feeling that those trait bounds together and all the for<'_> are messing things up big time; perhaps this could be rewritten once GATs are stabilized in less than a month from now in 1.65?... idk)

Provide a helper function to flatten a `Chunk` that wraps a `StructArray`

From the discussion in: #40.

When converting a Vec<S> into an arrow2::Chunk via try_into_arrow where S is a struct, the resulting Chunk will wrap a StructArray. We'd like to facilitate the conversion of this Chunk to another Chunk that directly wraps the fields of the StructArray.

One way this could be accomplished is by introducing a FlattenChunk trait, which defines a flatten method, that consumes a Chunk and returns a modified Chunk with the fields of a StructArray or the original Chunk if it was not wrapping a StructArray. This trait could be implemented for Chunk<A> where A: AsRef<dyn Array>.
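A sketch of the proposed trait, using the names from this proposal (not an existing API):

```rust
use arrow2::{array::Array, chunk::Chunk, error::Result};

// Proposed trait; would be implemented for Chunk<A> where A: AsRef<dyn Array>.
pub trait FlattenChunk {
    /// If the chunk wraps a single StructArray, return a chunk of its
    /// child arrays; otherwise return the original chunk unchanged.
    fn flatten(self) -> Result<Chunk<Box<dyn Array>>>;
}
```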

`#[arrow(transparent)]` derive mode

This should be fairly trivial to implement. Similar to #[serde(transparent)], so that we could easily wrap newtypes like

#[arrow(transparent)]
struct NewType(i32);

When generics land, should also support generic newtypes.

trait bound not satisfied

Hey! This seems to be exactly what I need for translating structured log data that I've packed into structs cleanly into the arrow format, so I'm really excited by the potential of this crate.

I seem to be getting an issue in my project:

error[E0277]: the trait bound `arrow2::array::struct_::StructArray: From<LogDataArray>` is not satisfied
  --> src/main.rs:37:35
   |
37 | #[derive(Debug, Clone, PartialEq, StructOfArrow)]
   |                                   ^^^^^^^^^^^^^ the trait `From<LogDataArray>` is not implemented for `arrow2::array::struct_::StructArray`
   | 
  ::: /home/weaton/.cargo/git/checkouts/arrow2-derive-1c9d82f1d5ff8f2d/d4f7231/src/lib.rs:9:24
   |
9  | pub trait ArrowStruct: Into<StructArray> {
   |                        ----------------- required by this bound in `ArrowStruct`
   |
   = help: the following implementations were found:
             <arrow2::array::struct_::StructArray as From<arrow2::array::growable::structure::GrowableStruct<'a>>>
             <arrow2::array::struct_::StructArray as From<arrow2::record_batch::RecordBatch>>
   = note: required because of the requirements on the impl of `Into<arrow2::array::struct_::StructArray>` for `LogDataArray`
   = note: this error originates in the derive macro `StructOfArrow` (in Nightly builds, run with -Z macro-backtrace for more info)

Any idea what could be causing this? When I drop my struct definition into your test suite it compiles, which has me scratching my head.

Here's my cargo.toml entry, am I missing something silly?

arrow2-derive = { git = "https://github.com/jorgecarleitao/arrow2-derive.git", branch = "main" }

Thanks again for your work on this!

Add support for unions/enums

For sparse vs dense, should we expose this as an attribute on the type, or could the same type be sparse or dense based on the use-case (in which case it should be some kind of runtime flag)?

improve "Data type mismatch" error message

Currently the error message doesn't give enough information to debug what's going on.

Err(arrow2::error::Error::InvalidArgumentError(
    "Data type mismatch".to_string(),
))

Something like this helped me debug an issue I was running into:

Err(arrow2::error::Error::InvalidArgumentError(format!(
    "Data type mismatch. Expected: {:?} | Found: {:?}",
    &<ArrowType as ArrowField>::data_type(),
    arr.data_type()
)))

It produced an error message that looks like:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Data type mismatch. Expected: Struct([...redacted...]) | Found: Struct(...redacted...)")', examples/read_parquet_specific_columns.rs:95:79

Could potentially make it even better by doing a diff.

[Help] Implement arrow_serialize/deserialize for newtype over num_complex

I'd like to create a newtype struct that contains num_complex::Complex32 and can convert to/from arrow.

struct Phasor(num_complex::Complex32);

I'm having trouble understanding the documentation for implementing ArrowSerialize and ArrowDeserialize with types other than primitives (as shown in complex_example.rs). Can anyone help point me in the right direction? Here is what I have so far, though I know I'm not approaching this right...

impl arrow2_convert::field::ArrowField for Phasor {
    type Type = Self;

    fn data_type() -> arrow2::datatypes::DataType {
        arrow2::datatypes::DataType::Extension(
            "phasor".to_string(),
            Box::new(arrow2::datatypes::DataType::Struct(vec![
                Field::new("re", arrow2::datatypes::DataType::Float32, false),
                Field::new("im", arrow2::datatypes::DataType::Float32, false),
            ])),
            None,
        )
    }
}

impl arrow2_convert::serialize::ArrowSerialize for Phasor {
    type MutableArrayType = arrow2::array::MutableStructArray;

    fn new_array() -> Self::MutableArrayType {
        Self::MutableArrayType::new(
            <Self as arrow2_convert::field::ArrowField>::data_type(),
            vec![],
        )
    }

    fn arrow_serialize(v: &Self, array: &mut Self::MutableArrayType) -> arrow2::error::Result<()> {
        let real: &mut MutablePrimitiveArray<PhasorType> = array.value(0).unwrap();
        real.try_push(Some(v.re()));
        let imag: &mut MutablePrimitiveArray<PhasorType> = array.value(1).unwrap();
        imag.try_push(Some(v.im()));

        array.push(true);
        Ok(())
    }
}

impl arrow2_convert::deserialize::ArrowDeserialize for Phasor {
    type ArrayType = arrow2::array::StructArray;

    fn arrow_deserialize(v: Option<???>) -> Option<Self> {
        v.map(|t| Phasor::new(t.get(0).unwrap(), t.get(1).unwrap()))
    }
}

arrow2_convert::arrow_enable_vec_for_type!(Phasor);

I've seen issue #79 reference adding support for remote types, but until then and to better understand arrow2 I'd like to understand how to do this manually. Thanks in advance!

Implement `ArrowSerialize/ArrowDeserialize` for `Utf8Scalar`

Utf8Scalar is similar to Option<String>, except it is refcounted and thus supports O(1) cloning and slicing.

In other words, this should work:

use arrow2_convert::{ArrowDeserialize, ArrowField, ArrowSerialize};

#[derive(ArrowField, ArrowSerialize, ArrowDeserialize)]
pub struct Foo {
    pub string: arrow2::scalar::Utf8Scalar<i32>,
}

Support remote types and remote containers

There are two separate problems; listing them here in case we decide to implement them later.

First, same as in this serde example: https://serde.rs/remote-derive.html, it would be nice to provide a way to set up remote derives for foreign types.

The other problem is with containers, the main one being Vec. It would be nice if we could provide a way to use other vec-like types. This could look e.g. like this (note that it would sort of cover #72):

use fancy_crate::FancyVec;
use std::collections::VecDeque;

#[derive(ArrowField)]
struct Foo {
    #[arrow(vec)]
    vec_custom: Vec<i32, MyAlloc>,
    #[arrow(vec(push = "push_back"))]
    deque: VecDeque<i32>,
    #[arrow(vec)]
    vec_remote: FancyVec<i32>,
}

Add support for nested types

I.e. given two derived structs Foo and Bar, where Foo uses Bar, deriving Foo's schema should incorporate Bar's schema.
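A minimal illustration of the desired behavior (field names invented):

```rust
#[derive(ArrowField)]
struct Bar {
    x: i32,
}

#[derive(ArrowField)]
struct Foo {
    // Foo's derived schema should embed Bar's derived schema here
    // as a nested struct field.
    bar: Bar,
    y: String,
}
```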

Fix deserialization for union slices

Currently, this doesn't work because we iterate through a UnionArray manually and ignore the slice offset. We need to use scalars (#33) to take the slice offset into account.

enable customizing list inner child element name?

When Spark outputs a parquet file, I believe it always uses the inner list item name of element as opposed to item:

message spark_schema {
  ....
  OPTIONAL group mylistcolumn (LIST) {
    REPEATED group list {
      OPTIONAL BYTE_ARRAY element (UTF8);
    }
  }
  ...
}

It appears this crate (or one of its dependencies, perhaps arrow2 itself?) is always assuming that the inner field name of a list is item rather than element.

Expected: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "item", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])

Actual: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "element", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])

I'm guessing this is because of this line of code?

arrow2::datatypes::DataType::List(Box::new(<T as ArrowField>::field("item")))

  1. If this is controlled by arrow2-convert, can we perhaps customize this via an annotation on the struct member?
  2. Should the default be re-evaluated if parquet-mr / Spark uses element?
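For option 1, a hypothetical annotation could look like this (list_item_name is an invented attribute name, not an existing one):

```rust
#[derive(ArrowField)]
struct Foo {
    // Hypothetical: override the default inner field name ("item")
    // with the Spark/parquet-mr convention ("element").
    #[arrow_field(list_item_name = "element")]
    mylistcolumn: Vec<i32>,
}
```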

P.S. Likely not related, but I ran into a very similar error in this other crate as well: timvw/qv#31

derive(ArrowField) breaks in macros

I'm trying to either:

  • apply #[derive(ArrowField)] to a struct defined in a macro_rules
  • apply my macro_rules on a struct using #[derive(ArrowField)]

The struct def is parsed using this little book of rust macros:
https://veykril.github.io/tlborm/decl-macros/building-blocks/parsing.html#struct

use arrow2_convert::ArrowField;

macro_rules! mymacro {
    (
        $( #[$meta:meta] )*
        //  ^~~~attributes~~~~^
            $vis:vis struct $name:ident {
                $(
                    $( #[$field_meta:meta] )*
                    //          ^~~~field attributes~~~!^
                        $field_vis:vis $field_name:ident : $field_ty:ty
                    //          ^~~~~~~~~~~~~~~~~a single field~~~~~~~~~~~~~~~^
                ),*
                    $(,)? }
    ) => {
        $( #[$meta] )*
            $vis struct $name {
                $(
                    $( #[$field_meta] )*
                        $field_vis $field_name : $field_ty
                ),*
            }
    }
}

mymacro! {
    #[derive(ArrowField)]
    struct Foo2 {
        myfield: u8,
    }
}

Sadly this fails with:

error: proc-macro derive panicked
  --> src/main.rs:97:14
   |
97 |     #[derive(ArrowField)]
   |              ^^^^^^^^^^
   |
   = help: message: Only types are supported atm

error: could not compile `dataframe` due to previous error

The only macro that seemed to work is this one:

macro_rules! mymacro2 {
    ($($tts:tt)*) => { $($tts)* }
}

Other derive macros such as JsonSchema don't seem to have this issue:
https://docs.rs/schemars/latest/schemars/

EDIT: This also fails, but differently:

mymacro! {
    struct Foo2 {
        myfield: u8,
    }
}

#[derive(ArrowField)]
struct Foo3(Foo2);

fails with:

error: proc-macro derive panicked
   --> src/main.rs:114:10
    |
114 | #[derive(ArrowField)]
    |          ^^^^^^^^^^
    |
    = help: message: called `Option::unwrap()` on a `None` value

Add support for deserializing column selections

If a rust type tree (struct and recursively nested fields) uses a subset of fields that are present in a schema, then it should be possible to deserialize the type tree from an arrow representation.
