dataengineeringlabs / arrow2-convert
Derive for arrow2
License: Apache License 2.0
Add a CI check: cargo fmt --all -- --check
The complexity is mostly in the serialize path since for deserialize we can just look at the arrow type (LargeList, LargeUtf8, etc) and cast to the appropriate array type.
A couple of ways I can think of to support this for serialize:
Only support i64 offsets, and provide a conversion method that converts large types to small types in another pass
Support an attribute either on a container or per field to use the large offset.
One thing that stands out is that we have two crates: arrow2_derive and derive_internals, both of which need to be published.
If we were to follow the typical convention, the derive_internals crate should actually be arrow2_derive, and the current arrow2_derive, which contains the recently added traits, should be called something else, perhaps arrow2_convert?
There is some additional functionality that could go into arrow2_convert to provide helper APIs that are higher-level than what the arrow2 crate provides.
@jorgecarleitao thoughts?
arrow2-derive currently depends on arrow2 = { version = "0.4", default-features = false }. This makes it incompatible with arrow2 0.6.2. Since the version change is not exactly trivial, I'm opening this issue.
I couldn't resolve the following while trying to bump the version manually:
error[E0277]: a value of type `DataType` cannot be built from an iterator over elements of type `Field`
--> src/test.rs:270:69
|
270 | let fields = (0..FooArray::n_fields()).map(FooArray::field).collect();
| ^^^^^^^ value of type `DataType` cannot be built from `std::iter::Iterator<Item=Field>`
|
= help: the trait `FromIterator<Field>` is not implemented for `DataType`
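The error above means collect() was asked to build a DataType directly, which has no FromIterator impl. A stand-in sketch of the usual fix, with simplified placeholder types (not arrow2's), assuming the newer arrow2 DataType::Struct variant holds a Vec<Field>: collect into the Vec explicitly, then wrap it.

```rust
// Simplified stand-ins for arrow2's Field and DataType (sketch only):
struct Field(String);

enum DataType {
    Struct(Vec<Field>),
}

fn main() {
    // collect() needs a FromIterator target; DataType has none, but
    // Vec<Field> does, so collect into a Vec first and wrap it:
    let fields: Vec<Field> = (0..3).map(|i| Field(format!("f{}", i))).collect();
    let dt = DataType::Struct(fields);
    match dt {
        DataType::Struct(inner) => assert_eq!(inner.len(), 3),
    }
}
```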
Support &str, &[u8], etc. Need to think through how nesting would work for this; it would also require handling lifetimes correctly.
(This is indirectly related to #79)
A few examples I was also wondering about that might be worth pondering: legitimate collections where a direct .collect() just doesn't cut it (given the current TryIntoCollection logic).
Types like vec1::Vec1<T> (a fairly popular crate for what it does). It can be iterated over but, obviously, doesn't support FromIterator because it can fail if there are no items. It also can't implement Default, and you have to provide the first element via new(first). Serialization is not a problem, but what about deserialization?
Any constrained collections, e.g. maybe an EvenVec<T: Integer> that has a try_push() which may fail if the integer is not even. Again, FromIterator can't be implemented (but Default can be, unlike the previous example).
In both of these examples you would probably have them implement TryFrom<Vec<T>> or TryFrom<&[T]> (or both, depending on whether it uses Vec<T> internally or not) - in fact, vec1::Vec1 already implements both.
One could then deserialize into a Vec and TryFrom-convert it into your custom collection. I believe this will cover the majority of cases like this.
In regards to fallible deserialization, there's an even simpler example without containers: a struct EvenNumber(i32) with an EvenNumber::try_new(i32) -> Result<Self, Error> fallible constructor. Again, serializing this is fine. But what about deserialization? The current signature returns deserialize(...) -> Option<T>, which wouldn't support cases like this.
A similar example, but from the standard library - the family of NonZero{U,I}* types: https://doc.rust-lang.org/stable/std/num/index.html.
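To make the two patterns above concrete, here is a self-contained sketch with stand-in types (NonEmptyVec is a simplified hypothetical version of vec1::Vec1; none of this is the crate's API):

```rust
use std::convert::TryFrom;
use std::num::NonZeroU32;

// Stand-in for vec1::Vec1: a vec that must be non-empty, so it cannot
// implement FromIterator, but can implement TryFrom<Vec<T>>.
struct NonEmptyVec<T>(Vec<T>);

impl<T> TryFrom<Vec<T>> for NonEmptyVec<T> {
    type Error = &'static str;
    fn try_from(v: Vec<T>) -> Result<Self, Self::Error> {
        if v.is_empty() {
            Err("need at least one element")
        } else {
            Ok(NonEmptyVec(v))
        }
    }
}

// A constrained scalar with a fallible constructor, which a
// deserialize(...) -> Option<T> signature cannot express.
struct EvenNumber(i32);

impl EvenNumber {
    fn try_new(v: i32) -> Result<Self, String> {
        if v % 2 == 0 {
            Ok(EvenNumber(v))
        } else {
            Err(format!("{} is not even", v))
        }
    }
}

fn main() {
    // Deserialize into a Vec first, then TryFrom-convert:
    assert!(NonEmptyVec::try_from(vec![1, 2, 3]).is_ok());
    assert!(NonEmptyVec::<i32>::try_from(Vec::new()).is_err());

    assert!(EvenNumber::try_new(4).is_ok());
    assert!(EvenNumber::try_new(3).is_err());

    // std's NonZero types have the same fallible-construction shape:
    assert!(NonZeroU32::new(0).is_none());
}
```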
This is mostly in place with #10, but need to finish the deserialize path to check the valid bits, and do the right thing with the nested field iterators. Also need test cases.
thread 'main' panicked at 'attempt to subtract with overflow', src/analysis/tasks.rs:270:75
stack backtrace:
0: rust_begin_unwind
at /rustc/e0098a5cc3a87d857e597af824d0ce1ed1ad85e0/library/std/src/panicking.rs:575:5
1: core::panicking::panic_fmt
at /rustc/e0098a5cc3a87d857e597af824d0ce1ed1ad85e0/library/core/src/panicking.rs:65:14
2: core::panicking::panic
at /rustc/e0098a5cc3a87d857e597af824d0ce1ed1ad85e0/library/core/src/panicking.rs:114:5
3: <lisa_rust_analysis::analysis::tasks::MutableTaskStateArray as arrow2::array::TryPush<core::option::Option<__T>>>::try_push
at ./tools/analysis/src/analysis/tasks.rs:270:75
4: <lisa_rust_analysis::analysis::tasks::TaskState as arrow2_convert::serialize::ArrowSerialize>::arrow_serialize
at ./tools/analysis/src/analysis/tasks.rs:270:75
5: <lisa_rust_analysis::analysis::tasks::MutableTasksStatesRowArray as arrow2::array::TryPush<core::option::Option<__T>>>::try_push
at ./tools/analysis/src/analysis/tasks.rs:575:24
6: <lisa_rust_analysis::analysis::tasks::TasksStatesRow as arrow2_convert::serialize::ArrowSerialize>::arrow_serialize
at ./tools/analysis/src/analysis/tasks.rs:575:24
7: arrow2_convert::serialize::arrow_serialize_extend_internal
at /home/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow2_convert-0.3.2/src/serialize.rs:328:9
8: arrow2_convert::serialize::arrow_serialize_to_mutable_array
at /home/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow2_convert-0.3.2/src/serialize.rs:343:5
9: <Collection as arrow2_convert::serialize::TryIntoArrow<alloc::boxed::Box<dyn arrow2::array::Array>,Element>>::try_into_arrow
at /home/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow2_convert-0.3.2/src/serialize.rs:427:12
This crashes in debug, and compiling in --release mode silently leads to a corrupted file that pyarrow chokes on when using pyarrow.compute.struct_field:
pyarrow.lib.ArrowIndexError: Index -1 out of bounds
Starting with the parquet_read_parallel example from arrow2, I am trying to deserialize a Chunk into a Vec of structs. Using the deserialize_parallel function as defined in the above example, the following code currently works for me:
pub struct Document {
content: String,
}
...
let chunk = deserialize_parallel(&mut columns)?;
let array = StructArray::new(
DataType::Struct(fields.clone()),
chunk.arrays().to_vec(),
None,
);
let documents: Vec<Document> = array.to_boxed().try_into_collection().unwrap();
Questions:
1. The conversion from the Chunk to a StructArray, with the to_boxed at the end, is perhaps not the most efficient.
2. Could TryIntoCollection::try_into_collection be implemented directly on the Chunk as well?

arrow_array_deserialize_iterator::<Buffer<u8>>(&array) yields:
error[E0277]: the trait bound `u8: ArrowEnableVecForType` is not satisfied
--> arrow2_convert/tests/test_deserialize.rs:85:16
|
85 | arrow_array_deserialize_iterator::<Buffer<u8>>(&array)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `ArrowEnableVecForType` is not implemented for `u8`
|
= help: the following other types implement trait `ArrowEnableVecForType`:
f32
f64
i16
i32
i64
i8
u16
u32
u64
= note: required for `Buffer<u8>` to implement `ArrowDeserialize`
So all types are supported except for u8. Buffer<u8> is sometimes desired over Vec<u8>, because of the internal ref-counting of Buffer.
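To illustrate the motivation, a stdlib sketch using Arc<[u8]> as a stand-in for the ref-counted Buffer<u8>: cloning shares the allocation rather than copying the bytes.

```rust
use std::sync::Arc;

fn main() {
    // Arc<[u8]> as a stand-in for a ref-counted byte buffer:
    let buf: Arc<[u8]> = Arc::from(vec![1u8, 2, 3]);

    // Cloning is O(1): it bumps a reference count, no byte copy.
    let shared = buf.clone();
    assert_eq!(Arc::strong_count(&buf), 2);
    assert_eq!(&shared[..], &[1u8, 2, 3][..]);

    // Cloning a Vec<u8> instead copies all the bytes into a new allocation.
    let v = vec![1u8, 2, 3];
    let copied = v.clone();
    assert_eq!(copied, v);
}
```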
Let's be ergonomic, and more like serde.
There are only 3 modules, they are all tiny, with essentially 1 main symbol in each, and having to always write out arrow2_convert::serialize::ArrowSerialize is not really cool.
Also, having both ArrowField symbols (trait and proc-macro) in the same scope would be a more common way of doing things and a bit nicer.
Plus, proc-macros (and regular macros) will generate a lot less code spam.
Suggestion: export all three main traits at the crate root. If that's ok, I can open a quick PR (and I'll update the proc-macros in the generics branch).
arrow2 0.17.0 was released last week, and it would be nice to have a new arrow2-convert release to match!
Created from the discussion in jorgecarleitao/arrow2#1092.
A rust struct can conceptually represent either an Arrow Struct or an arrow2::Chunk (a column group). The arrow2::Chunk is important since it's used in the deserialization/serialization API for parquet and flight conversion.
We can extend the arrow2_convert::TryIntoArrow and arrow2_convert::FromArrow traits to convert to/from arrow2::Chunk, but there are two possible mappings from a vector of structs, Vec<S>, to a Chunk:
1. The Chunk has a single field of type Struct.
2. The Chunk contains the same number of fields as the struct.
1 can be easily supported by wrapping an arrow2::Array in a Chunk.
2 has a couple of approaches:
a. A new derive macro to generate the mapping to a Chunk (eg. ArrowChunk or ArrowRoot).
b. Providing a helper method to convert an arrow2::StructArray to a Chunk by unwrapping the fields.
One related use-case that could guide this design is to support generic typed versions of the arrow2 csv, json, parquet, and flight serialize/deserialize methods, where the schema is specified by a rust struct (opened #41 for this). To achieve this, it would be useful to access the deserialize/serialize methods of each column separately for parallelism which is cleaner via 2a.
@ncpenke Hey - I'm in the process of adding generics support (to structs, as a first step), which is a pretty painful process to say the least 🤣
As a matter of fact, most of it compiles, except a few weird quirks. It would really help if you could provide a hand-written example of how it's supposed to work so I wouldn't be guessing blindly.
First, deserialization. There's this bound in the ArrowDeserialize trait that seems to be causing problems:
pub trait ArrowDeserialize: ArrowField + Sized
where
Self::ArrayType: ArrowArray,
for<'a> &'a Self::ArrayType: IntoIterator, // <----
{ ... }
Basically, if for some struct Foo<A, B>
for all implementations we require
impl OneOfTheTraits for OneOfTheStructs<A, B>
where
A: ArrowDeserialize,
B: ArrowDeserialize,
{ ... }
this leads to
error[E0277]: `&'a <A as ArrowDeserialize>::ArrayType` is not an iterator
= help: the trait `for<'a> Iterator` is not implemented for `&'a <A as ArrowDeserialize>::ArrayType`
= note: required because of the requirements on the impl of `for<'a> IntoIterator` for `&'a <A as ArrowDeserialize>::ArrayType`
However, if we require
impl OneOfTheTraits for OneOfTheStructs<A, B>
where
A: ArrowDeserialize,
B: ArrowDeserialize,
for<'a> &'a <A as ArrowDeserialize>::ArrayType: IntoIterator,
for<'a> &'a <B as ArrowDeserialize>::ArrayType: IntoIterator,
{ ... }
this results in
overflow evaluating the requirement `for<'_a> &'_a FooArray<_, _>: IntoIterator`
Wonder if you'd have any ideas on this, or could maybe provide a trivial working example? 🤔
(I have a feeling that those trait bounds together and all the for<'_> are messing things up big time; perhaps this could be rewritten once GATs are stabilized in less than a month from now in 1.65?... idk)
From the discussion in: #40.
When converting a Vec<S> into an arrow2::Chunk via try_into_arrow, where S is a struct, the resulting Chunk will wrap a StructArray. We'd like to facilitate the conversion of this Chunk to another Chunk that directly wraps the fields of the StructArray.
One way this could be accomplished is by introducing a FlattenChunk trait, which defines a flatten method that consumes a Chunk and returns a modified Chunk with the fields of the StructArray, or the original Chunk if it was not wrapping a StructArray. This trait could be implemented for Chunk<A> where A: AsRef<dyn Array>.
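A stand-in sketch of that idea with simplified placeholder types (not arrow2's; the names are illustrative only):

```rust
// Simplified stand-ins for arrow2's Array and Chunk (sketch only):
#[derive(Clone, Debug, PartialEq)]
enum Array {
    Int32(Vec<i32>),
    Struct(Vec<Array>),
}

#[derive(Debug, PartialEq)]
struct Chunk(Vec<Array>);

trait FlattenChunk {
    // Consume the chunk; if it wraps a single struct array, return a
    // chunk of that struct's fields, otherwise return it unchanged.
    fn flatten(self) -> Chunk;
}

impl FlattenChunk for Chunk {
    fn flatten(self) -> Chunk {
        if let [Array::Struct(fields)] = self.0.as_slice() {
            return Chunk(fields.clone());
        }
        self
    }
}

fn main() {
    let wrapped = Chunk(vec![Array::Struct(vec![
        Array::Int32(vec![1, 2]),
        Array::Int32(vec![3, 4]),
    ])]);
    // The struct's two fields become the chunk's two columns:
    assert_eq!(wrapped.flatten().0.len(), 2);

    // A chunk not wrapping a struct array passes through unchanged:
    let plain = Chunk(vec![Array::Int32(vec![1])]);
    assert_eq!(plain.flatten(), Chunk(vec![Array::Int32(vec![1])]));
}
```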
Currently we're borrowing from arrow2's error type
This should be fairly trivial to implement. Similar to #[serde(transparent)], so that we could easily wrap newtypes like
#[arrow(transparent)]
struct NewType(i32);
When generics land, should also support generic newtypes.
The following enum gets mapped to a bool array currently:
enum Foo {
Variant1,
}
Couldn't it be mapped to a null array instead? I just recently started using arrow and your lib, so I'm trying to figure out the landscape of how things are done and how to take best advantage of what's available.
Hey! This seems to be exactly what I need for translating structured log data that I've packed into structs cleanly into the arrow format, so I'm really excited by the potential of this crate.
I seem to be getting an issue in my project:
error[E0277]: the trait bound `arrow2::array::struct_::StructArray: From<LogDataArray>` is not satisfied
--> src/main.rs:37:35
|
37 | #[derive(Debug, Clone, PartialEq, StructOfArrow)]
| ^^^^^^^^^^^^^ the trait `From<LogDataArray>` is not implemented for `arrow2::array::struct_::StructArray`
|
::: /home/weaton/.cargo/git/checkouts/arrow2-derive-1c9d82f1d5ff8f2d/d4f7231/src/lib.rs:9:24
|
9 | pub trait ArrowStruct: Into<StructArray> {
| ----------------- required by this bound in `ArrowStruct`
|
= help: the following implementations were found:
<arrow2::array::struct_::StructArray as From<arrow2::array::growable::structure::GrowableStruct<'a>>>
<arrow2::array::struct_::StructArray as From<arrow2::record_batch::RecordBatch>>
= note: required because of the requirements on the impl of `Into<arrow2::array::struct_::StructArray>` for `LogDataArray`
= note: this error originates in the derive macro `StructOfArrow` (in Nightly builds, run with -Z macro-backtrace for more info)
Any idea what could be causing this? When I drop my struct definition into your test suite it compiles, which is causing me to scratch my head.
Here's my Cargo.toml entry, am I missing something silly?
arrow2-derive = { git = "https://github.com/jorgecarleitao/arrow2-derive.git", branch = "main" }
Thanks again for your work on this!
For sparse vs dense, should we expose this as an attribute on the type, or could the same type be sparse or dense based on use-case (in which case it should be some kind of runtime flag)?
Currently the error message doesn't give enough information to debug what's going on.
arrow2-convert/arrow2_convert/src/deserialize.rs
Lines 305 to 307 in 6c37e29
Something like this helped me debug an issue I was running into:
Err(arrow2::error::Error::InvalidArgumentError(format!(
"Data type mismatch. Expected: {:?} | Found: {:?}",
&<ArrowType as ArrowField>::data_type(),
arr.data_type()
)))
It produced an error message that looks like:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Data type mismatch. Expected: Struct([...redacted...]) | Found: Struct(...redacted...)")', examples/read_parquet_specific_columns.rs:95:79
Could potentially make it even better by doing a diff.
I'd like to create a newtype struct that contains num_complex::Complex32 and can convert to/from arrow.
struct Phasor(num_complex::Complex32);
I'm having trouble understanding the documentation for implementing ArrowSerialize and ArrowDeserialize with types other than primitives (as shown in complex_example.rs). Can anyone help point me in the right direction? Here is what I have so far, though I know I'm not approaching this right...
impl arrow2_convert::field::ArrowField for Phasor {
type Type = Self;
fn data_type() -> arrow2::datatypes::DataType {
arrow2::datatypes::DataType::Extension(
"phasor".to_string(),
Box::new(arrow2::datatypes::DataType::Struct(vec![
Field::new("re", arrow2::datatypes::DataType::Float32, false),
Field::new("im", arrow2::datatypes::DataType::Float32, false),
])),
None,
)
}
}
impl arrow2_convert::serialize::ArrowSerialize for Phasor {
type MutableArrayType = arrow2::array::MutableStructArray;
fn new_array() -> Self::MutableArrayType {
Self::MutableArrayType::new(
<Self as arrow2_convert::field::ArrowField>::data_type(),
vec![],
)
}
fn arrow_serialize(v: &Self, array: &mut Self::MutableArrayType) -> arrow2::error::Result<()> {
let real: &mut MutablePrimitiveArray<PhasorType> = array.value(0).unwrap();
real.try_push(Some(v.re()));
let imag: &mut MutablePrimitiveArray<PhasorType> = array.value(1).unwrap();
imag.try_push(Some(v.im()));
array.push(true);
Ok(())
}
}
impl arrow2_convert::deserialize::ArrowDeserialize for Phasor {
type ArrayType = arrow2::array::StructArray;
fn arrow_deserialize(v: Option<???>) -> Option<Self> {
v.map(|t| Phasor::new(t.get(0).unwrap(), t.get(1).unwrap()))
}
}
arrow2_convert::arrow_enable_vec_for_type!(Phasor);
I've seen issue #79 reference adding support for remote types, but until then and to better understand arrow2 I'd like to understand how to do this manually. Thanks in advance!
Utf8Scalar is similar to Option<String>, except it is refcounted and thus supports O(1) cloning and slicing.
In other words, this should work:
use arrow2_convert::{ArrowDeserialize, ArrowField, ArrowSerialize};
#[derive(ArrowField, ArrowSerialize, ArrowDeserialize)]
pub struct Foo {
pub string: arrow2::scalar::Utf8Scalar<i32>,
}
To better support chunk and scalar conversions.
There are two separate problems, listing them here in case we decide to implement it later.
First, same as in this example in serde: https://serde.rs/remote-derive.html, would be nice to provide a way to set up remote derives for foreign types.
Another problem is with containers, the main one being Vec. It would be nice if we could provide a way to use other vec-like types. This could look e.g. like this (note that it would sort of cover #72):
use fancy_crate::FancyVec;
use std::collections::VecDeque;
#[derive(ArrowField)]
struct Foo {
#[arrow(vec)]
vec_custom: Vec<i32, MyAlloc>,
#[arrow(vec(push = "push_back"))]
deque: VecDeque<i32>,
#[arrow(vec)]
vec_remote: FancyVec<i32>,
}
Sorry to be a bother but would it be possible to release v0.3.0 on crates.io?
This is now supported in latest main of arrow2
I.e. given two derived structs Foo and Bar, where Foo uses Bar, Foo's schema should use Bar's schema to be derived.
Currently, this doesn't work because we iterate through a UnionArray manually, and ignore the slice offset. Need to use scalars (#33) to consider the slice offset
When Spark outputs a parquet file, I believe it always uses the inner list item name of element as opposed to item:
message spark_schema {
....
OPTIONAL group mylistcolumn (LIST) {
REPEATED group list {
OPTIONAL BYTE_ARRAY element (UTF8);
}
}
...
}
It appears this crate (or one of its dependencies, perhaps arrow2 itself?) is always assuming that the inner field name of a list is item rather than element.
Expected: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "item", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])
Actual: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "element", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])
I'm guessing this is because of this line of code?
arrow2-convert/arrow2_convert/src/field.rs
Line 214 in 7d9e132
Could this be changed to support element?
P.S. Likely not related, but I ran into a very similar error in this other crate as well: timvw/qv#31
Ideally we should be able to support serializing and deserializing of:
Arc<T>
Arc<[T]>
For all T: ArrowSerialize + ArrowDeserialize
I'm trying to either:
1. apply #[derive(ArrowField)] to a struct defined in a macro_rules macro, or
2. apply #[derive(ArrowField)] to a struct that wraps one.
The struct def is parsed using this little book of rust macros:
https://veykril.github.io/tlborm/decl-macros/building-blocks/parsing.html#struct
use arrow2_convert::ArrowField;
macro_rules! mymacro {
(
$( #[$meta:meta] )*
// ^~~~attributes~~~~^
$vis:vis struct $name:ident {
$(
$( #[$field_meta:meta] )*
// ^~~~field attributes~~~!^
$field_vis:vis $field_name:ident : $field_ty:ty
// ^~~~~~~~~~~~~~~~~a single field~~~~~~~~~~~~~~~^
),*
$(,)? }
) => {
$( #[$meta] )*
$vis struct $name {
$(
$( #[$field_meta] )*
$field_vis $field_name : $field_ty
),*
}
}
}
mymacro! {
#[derive(ArrowField)]
struct Foo2 {
myfield: u8,
}
}
Sadly this fails with:
error: proc-macro derive panicked
--> src/main.rs:97:14
|
97 | #[derive(ArrowField)]
| ^^^^^^^^^^
|
= help: message: Only types are supported atm
error: could not compile `dataframe` due to previous error
The only macro that seemed to work is this one:
macro_rules! mymacro2 {
($($tts:tt)*) => { $($tts)* }
}
Other derive macros such as JsonSchema don't seem to have this issue: https://docs.rs/schemars/latest/schemars/
EDIT: This also fails, but differently:
mymacro! {
struct Foo2 {
myfield: u8,
}
}
#[derive(ArrowField)]
struct Foo3(Foo2);
fails with:
error: proc-macro derive panicked
--> src/main.rs:114:10
|
114 | #[derive(ArrowField)]
| ^^^^^^^^^^
|
= help: message: called `Option::unwrap()` on a `None` value
If a rust type tree (struct and recursively nested fields) uses a subset of fields that are present in a schema, then it should be possible to deserialize the type tree from an arrow representation.
Similar to serde, we should be able to skip any fields that implement Default.
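A sketch of the desired behavior with a hypothetical #[arrow(default)] attribute (mirroring serde's #[serde(default)]; the attribute name is illustrative only): a field missing from the source schema would fall back to its Default value.

```rust
// Hypothetical behavior sketch: `meta` is absent from the arrow data,
// so deserialization would fill it with Default (attribute imaginary).
#[derive(Debug, PartialEq, Default)]
struct Metadata {
    tags: Vec<String>,
}

#[derive(Debug, PartialEq)]
struct Record {
    id: i64,
    // imagine: #[arrow(default)]
    meta: Metadata,
}

fn main() {
    // What deserializing a source containing only `id` would produce:
    let rec = Record { id: 7, meta: Metadata::default() };
    assert_eq!(rec.id, 7);
    assert_eq!(rec.meta, Metadata { tags: Vec::new() });
}
```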