
arrow2's Introduction

THIS CRATE IS UNMAINTAINED

As of 2024-01-17 this crate is no longer maintained. See discussion #1429 for more details.

Arrow2: Transmute-free Arrow


A Rust crate to work with Apache Arrow. The most feature-complete implementation of the Arrow format after the C++ implementation.

Check out the guide for a general introduction on how to use this crate, and the API docs for detailed documentation of each of its APIs.

Features

  • Most feature-complete implementation of Apache Arrow after the reference implementation (C++)
    • Decimal 256 unsupported (not a Rust native type)
  • C data interface supported for all Arrow types (read and write)
  • C stream interface supported for all Arrow types (read and write)
  • Full interoperability with Rust's Vec (see the sketch after this list)
  • MutableArray API to work with bitmaps and arrays in-place
  • Full support for timestamps with timezones, including arithmetics that take timezones into account
  • Support to read from, and write to:
    • CSV
    • Apache Arrow IPC (all types)
    • Apache Arrow Flight (all types)
    • Apache Parquet (except deep nested types)
    • Apache Avro (all types)
    • NDJSON (newline-delimited JSON)
    • ODBC (some types)
  • Extensive suite of compute operations
    • aggregations
    • arithmetics
    • cast
    • comparison
    • sort and merge-sort
    • boolean (AND, OR, etc) and boolean kleene
    • filter, take
    • hash
    • if-then-else
    • nullif
    • temporal (day, month, week day, hour, etc.)
    • window
    • ... and more ...
  • Extensive set of cargo feature flags to reduce compilation time and binary size
  • Fully decoupled IO between CPU-bound and IO-bound tasks, allowing this crate both to be used in async contexts without blocking and to leverage parallelism
  • Fastest known implementation of Avro and Parquet (e.g. faster than the official C++ implementations)
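As a quick taste of the Vec interoperability mentioned above, here is a minimal sketch, assuming arrow2's PrimitiveArray constructors (exact constructor names can vary slightly between 0.x releases):

use arrow2::array::{Array, PrimitiveArray};

fn main() {
    // The Vec's allocation is reused directly as the array's values buffer.
    let array = PrimitiveArray::<i32>::from_vec(vec![1, 2, 3]);
    assert_eq!(array.len(), 3);

    // Nullable data can be built from `Option<T>` items.
    let nullable = PrimitiveArray::<i32>::from(&[Some(1), None, Some(3)]);
    assert_eq!(nullable.null_count(), 1);
}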

Safety and Security

This crate uses unsafe when strictly necessary:

  • when the compiler can't prove certain invariants, and
  • in FFI

We have extensive tests over these, all of which run and pass under MIRI. Most uses of unsafe fall into 3 categories:

  • The Arrow format has invariants over UTF-8 that can't be written in safe Rust
  • TrustedLen and trait specialization are still nightly features
  • FFI

We actively monitor for vulnerabilities in Rust's advisory and either patch or mitigate them (see e.g. .cargo/audit.yaml and .github/workflows/security.yaml).

Reading from untrusted data currently may panic! on the following formats:

  • Apache Parquet
  • Apache Avro

We are actively addressing this.

Integration tests

Our tests include roundtrip against:

  • Apache Arrow IPC (both little and big endian) generated by C++, Java, Go, C# and JS implementations.
  • Apache Parquet format (in its different configurations) generated by Arrow's C++ and Spark's implementations
  • Apache Avro generated by the official Rust Avro implementation

Check DEVELOPMENT.md for our development practices.

Versioning

We follow SemVer 2.0, as used by Cargo and the rest of the Rust ecosystem. We also use 0.x.y versioning, since we are still iterating over the API.

Design

This repo and crate's primary goal is to offer a safe Rust implementation of the Arrow specification. As such, it

  • MUST NOT implement any logical type other than the ones defined in the Arrow specification, schema.fbs.
  • MUST lay out memory according to the Arrow specification
  • MUST support reading from and writing to the C data interface with zero-copy.
  • MUST support reading from, and writing to, the IPC specification, which it MUST verify against golden files available here.

Design documents about each of the parts of this repo are available on their respective READMEs.

FAQ

Any plans to merge with the Apache Arrow project?

Maybe. The primary reason to have this repo and crate is to be able to prototype and mature using a fundamentally different design based on a transmute-free implementation. This requires breaking backward compatibility and losing features, which is impossible to achieve on the Arrow repo.

Furthermore, the arrow project currently has a release mechanism that is unsuitable for this type of work:

  • A release of Apache Arrow consists of a release of all implementations of the Arrow format at once, under the same version. It is currently at 5.0.0.

This implies that the crate version is independent of the changelog or its API stability, which violates SemVer. This procedure makes the crate incompatible with Rust's (and many others') ecosystem, which heavily relies on SemVer to constrain software versions.

Secondly, this implies the arrow crate is versioned as >0.x. This places expectations about API stability that are incompatible with this effort.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

arrow2's People

Contributors

abreis, alexander-beedie, anirishduck, ariesdevil, b41sh, cjermain, dandandan, dousir9, elferherrera, haoyang670, hohav, houqp, illumination-k, jimexist, jorgecarleitao, mdrach, ncpenke, ozgrakkurt, qqwy, reswqa, rinchannowww, ritchie46, simonvandel, sundy-li, tustvold, vasanthakumarv, xuanwo, yjhmelody, yjshen, zhyass


arrow2's Issues

Auto vectorized vs packed_simd in arrow

I found that if the primitive array has no null values, auto-vectorization can outperform manual SIMD.

arrow2-sum 2^13 u32     time:   [545.01 ns 546.26 ns 547.72 ns]                                 
                        change: [-0.2234% +0.2358% +0.8664%] (p = 0.41 > 0.05)
                        No change in performance detected.


arrow2-sum 2^13 u32 nosimd                                                                            
                        time:   [316.59 ns 317.36 ns 318.20 ns]
                        change: [+0.4290% +0.8618% +1.2727%] (p = 0.00 < 0.05)
                        Change within noise threshold.

arrow2-sum null 2^13 u32                                                                             
                        time:   [4.3197 us 4.3290 us 4.3394 us]
                        change: [-0.1470% +0.3333% +0.8057%] (p = 0.19 > 0.05)


arrow2-sum null 2^13 u32 nosimd                                                                             
                        time:   [10.149 us 10.168 us 10.190 us]
                        change: [-0.8429% -0.4038% +0.1192%] (p = 0.11 > 0.05)

Code:

fn bench_sum(arr_a: &PrimitiveArray<u32>) {
    sum(criterion::black_box(arr_a)).unwrap();
}

fn bench_sum_nosimd(arr_a: &PrimitiveArray<u32>) {
    nosimd_sum_u32(criterion::black_box(arr_a)).unwrap();
}

fn nosimd_sum_u32(arr_a: &PrimitiveArray<u32>) -> Option<u32> {
    let mut sum = 0;
    if arr_a.null_count() == 0 {
        arr_a.values().as_slice().iter().for_each(|f| {
            sum += *f;
        });
    } else if let Some(validity) = arr_a.validity() {
        // validity bits are 1 for valid values, so multiplying by the bit
        // zeroes out the null slots
        arr_a
            .values()
            .as_slice()
            .iter()
            .zip(validity.iter())
            .for_each(|(f, is_valid)| sum += *f * (is_valid as u32));
    }
    // keep the result observed so the optimizer does not remove the loop
    assert!(sum > 0);
    Some(sum)
}

I have not yet used Godbolt to inspect the generated assembly...

Support loading Feather v2 (IPC) files with more than 1 million tables

As can be seen in pola-rs/polars#1023, loading of Feather v2 (IPC) files with more than 1 million tables does not work.

(py)arrow had the same bug: https://issues.apache.org/jira/projects/ARROW/issues/ARROW-10056

It boils down to the flatbuffer verification code, which has max_tables=1_000_000 by default.
Increasing this limit solves the problem.

In (py)arrow the max table value is determined per dataset based on the footer size, to prevent specially crafted, very small IPC files from taking an extraordinary amount of time to verify:

apache/arrow#9447

In [33]: %time df2 = pl.read_ipc('test.v2.feather', use_pyarrow=False)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed exec> in <module>

/software/polars/py-polars/polars/io.py in read_ipc(file, use_pyarrow, storage_options)
    415             tbl = pa.feather.read_table(data)
    416             return pl.DataFrame.from_arrow(tbl)
--> 417         return pl.DataFrame.read_ipc(data)
    418 
    419 

/software/polars/py-polars/polars/eager/frame.py in read_ipc(file)
    606         """
    607         self = DataFrame.__new__(DataFrame)
--> 608         self._df = PyDataFrame.read_ipc(file)
    609         return self
    610 

RuntimeError: Any(ArrowError(Ipc("Unable to get root as footer: TooManyTables")))

I think if you replace gen::File::root_as_footer with gen::File::root_as_footer_with_opts, you can set the max_tables option. The relevant call:

let footer = gen::File::root_as_footer(&footer_data[..])

https://docs.rs/flatbuffers/2.0.0/src/flatbuffers/get_root.rs.html#39-49
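For illustration, a sketch of that change in context. This is a fragment, not a complete program: gen::File and footer_data come from arrow2's IPC reader, the flatbuffers 2.0 VerifierOptions API is assumed, and the limit value is illustrative.

use flatbuffers::VerifierOptions;

// Verify the footer with an explicit table limit instead of the defaults.
let opts = VerifierOptions {
    // pyarrow derives this bound from the footer size; a fixed value is shown here
    max_tables: 1_000_000_000,
    ..Default::default()
};
let footer = gen::File::root_as_footer_with_opts(&opts, &footer_data[..])?;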

Convenience methods on StructArray

StructArray in arrow-rs has a few convenience methods that StructArray in arrow2 is missing:

/// Returns the field at pos. 
fn column(&self, pos: usize) -> &ArrayRef;

/// Return the number of fields in this struct array. 
fn num_columns(&self) -> usize;

/// Returns the fields of the struct array. 
fn columns(&self) -> Vec<&ArrayRef>;

/// Returns child array refs of the struct array. 
fn columns_ref(&self) -> Vec<ArrayRef>;

/// Return field names in this struct array.
fn column_names(&self) -> Vec<&str>;

/// Return child array whose field name equals to column_name. 
fn column_by_name(&self, column_name: &str) -> Option<&ArrayRef>;

For reference, arrow2::StructArray has:

fn values(&self) -> &[Arc<dyn Array>]; // technically the same as `columns()`
fn fields(&self) -> &[Field];

which is enough to replicate those methods. Having them available on StructArray is quite convenient though, particularly for new users.
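For illustration, a minimal sketch of column_by_name built on top of those two methods. This is a hypothetical helper, and the exact way to read a field's name may differ across arrow2 versions.

use std::sync::Arc;
use arrow2::array::{Array, StructArray};

/// Hypothetical helper mirroring arrow-rs' `column_by_name`.
fn column_by_name<'a>(array: &'a StructArray, name: &str) -> Option<&'a Arc<dyn Array>> {
    array
        .fields()
        .iter()
        // `field.name` may be an accessor method (`field.name()`) in some versions
        .position(|field| field.name == name)
        .map(|i| &array.values()[i])
}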

Is it okay if I submit a PR to bring most of them back?
(and I'm thinking of making the return types more idiomatic, but that can be discussed in the PR)

Investigate how to add support for timezones in timestamp

We currently support timezones at logical level, via Timestamp(_, tz), but have no mechanism to perform operations with them, The goal is to investigate how to add support for them. This will likely require depending on chrono-tz.

A simple concrete goal here is to add support to cast from and to Timestamp(_, str) and Timestamp(_, None).

A question here is how we represent these things: it seems that in chrono-tz, timezones are static types (i.e. struct), while timezones in Arrow are dynamic (i.e. str).
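For reference, a minimal sketch of bridging the dynamic (string) timezone to a chrono-tz timezone, assuming the chrono and chrono-tz crates (the timezone name and timestamp value are illustrative):

use chrono::TimeZone;
use chrono_tz::Tz;

fn main() {
    // Arrow carries the timezone dynamically as a string; chrono-tz can parse it
    // at runtime into its `Tz` enum, which then acts like a static chrono timezone.
    let tz: Tz = "Europe/Lisbon".parse().expect("unknown timezone");
    let dt = tz.timestamp_opt(1_500_000_000, 0).unwrap();
    println!("{}", dt);
}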

Make IpcWriteOptions easier to find.

Expose arrow2::io::ipc::write::IpcWriteOptions instead of arrow2::io::ipc::write::common::IpcWriteOptions, for easier access and logical meaning.

Improve the performance in cast kernel using AsPrimitive trait in generic dispatch

Beat arrow2's cast performance with fewer than 10 LOC: about 7 times faster when casting a 2^20-element u32 array to a u64 array.

Benchmark code:

https://github.com/sundy-li/learn/blob/master/arrow-vs-arrow2/benches/cast_kernels.rs

Gnuplot not found, using plotters backend
cast u32 to u64 1048576 time:   [4.3537 ms 4.3615 ms 4.3702 ms]                                     
                        change: [-0.0612% +0.1772% +0.4521%] (p = 0.17 > 0.05)
                        No change in performance detected.


cast u32 to u64 v2 1048576                                                                            
                        time:   [630.99 us 645.11 us 664.46 us]
                        change: [-11.727% -4.8932% +1.8462%] (p = 0.20 > 0.05)
                        No change in performance detected.
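For context, a minimal sketch of the technique, assuming the num-traits crate (function names are illustrative, not arrow2's API):

use num_traits::AsPrimitive;

/// Cast a slice of one primitive type to another via `AsPrimitive`,
/// letting the compiler monomorphize (and auto-vectorize) the conversion.
fn cast_primitive<I, O>(values: &[I]) -> Vec<O>
where
    I: AsPrimitive<O>,
    O: Copy + 'static,
{
    values.iter().map(|v| v.as_()).collect()
}

fn main() {
    let input: Vec<u32> = (0..8).collect();
    let output: Vec<u64> = cast_primitive(&input);
    assert_eq!(output[7], 7u64);
}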

Why not use unsigned physical representation for Date/DateTime types


| `DataType`            | `PhysicalType`            |
|-----------------------|---------------------------|
| `Date32`              | `PrimitiveArray<i32>`     |
| `Date64`              | `PrimitiveArray<i64>`     |
| `Time32(_)`           | `PrimitiveArray<i32>`     |
| `Time64(_)`           | `PrimitiveArray<i64>`     |

Since Date represents the time elapsed since the UNIX epoch, I wonder why an unsigned physical representation is not used for these types.

CheckedDiv will not work on float

I was checking that the kernel for checked division won't work for floats. That is because it depends on the trait CheckedDiv, which is only implemented for integers (trait).

I'm wondering if, instead of depending on that trait, it should implement its own version of the checked operation that includes floats. We can check if the value is Zero and return None if that is the case. At least Zero is implemented for floats.
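A minimal sketch of that idea, assuming the num-traits Zero trait (the function name is illustrative):

use num_traits::Zero;
use std::ops::Div;

/// Checked division that also works for floats: `None` when the divisor is zero.
fn checked_div_zero<T: Div<Output = T> + Zero>(a: T, b: T) -> Option<T> {
    if b.is_zero() {
        None
    } else {
        Some(a / b)
    }
}

fn main() {
    assert_eq!(checked_div_zero(1.0_f64, 0.0), None);
    assert_eq!(checked_div_zero(1.0_f64, 2.0), Some(0.5));
}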

Add integration tests for writing to parquet

Currently we generate parquet files from pyarrow and read them from Rust. Let's add the other way around: write from Rust and read from pyarrow.

This will require a small executable that writes them, so that python can call it when running the tests.

Split crate in smaller crates

Hi,

I would like to gauge interest in splitting this crate into 3 smaller crates,

  • arrow2-core
  • arrow2-compute
  • arrow2-io

so that we can:

  1. apply a different versioning to them. E.g. the core part hasn't had a backward-incompatible change for some time now, but all other parts have had some recently.
  2. add features more dedicated to each of the parts. E.g. compute currently drags in a lot of code, even if people only want part of the compute
  3. Reduce the compilation times

arrow2-core would contain:

alloc
array
bitmap
buffer
datatypes
ffi
types
util
error

arrow2-compute would contain the compute, and arrow2-io would contain io.

Consider slipstream for auto-vectorization

@Dandandan raised the idea of considering the slipstream crate for SIMD / auto-vectorization.

I agree that we should consider it - if anything, to offload that part of the code-base to slipstream. IMO our requirements are:

  • As secure or more secure than the existing implementation
  • As fast or faster than the existing implementation
  • compatible with the bitmaps that we use for masks, including bitmaps with offsets

Add `extend` / `extend_unchecked` on builders/ mutablearrays

What do you think of extending the MutableArray APIs with those methods?

MutableBuffer already exposes this API. It is often more user-friendly to work on the builders, as they maintain the correct composite of buffers (a bitmap, offsets, a values buffer, etc.). With an extend / extend_unchecked we can probably amortize allocations and bounds checks.
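For illustration, a minimal sketch of what such an extend would replace, assuming MutablePrimitiveArray and its per-item push (a real extend/extend_unchecked could reserve once and grow the bitmap and values buffer in bulk instead):

use arrow2::array::{MutableArray, MutablePrimitiveArray};

fn main() {
    let mut array = MutablePrimitiveArray::<i64>::new();
    // Today: one push per item, with per-push capacity checks.
    for item in [Some(1i64), None, Some(3)] {
        array.push(item);
    }
    assert_eq!(array.len(), 3);
}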

[question] Questions regarding project aim

Not gone through the source yet, so I hope I don't ask dumb things.

This is very cool and probably takes away a lot of the gripes with the current implementation. I also think this can help compile times quite a lot.

As the memory model of arrow is stable. Do you think it would be possible to define traits for Arrow2 arrays and default arrow Arrays that make it possible to generalize the existing kernels between both crates? Or is that a ball and chain you don't want to have for this repo (which I can imagine).

And do you aim to publish to crates.io? :)

Filter operation on sliced utf8 arrays are incorrect

This test fails because the Utf8 column has a different length than that of the primitive values.

I haven't been able to isolate it yet. But what is different for the utf8 column is that it is sliced into subslices that are sent to different threads. There the filter is applied (also sliced), and then on the main thread everything is concatenated.

Poor performance in `sort::sort_to_indices` with limit option in arrow2

Hello, I did a benchmark vs arrow1.

It shows that arrow2 has greatly improved performance in min, sum, max, etc.

But the sort::sort_to_indices function with a limit of 100 shows a performance drop.

Benchmark results:

arrow2-sort 2^13 f32    time:   [67.831 us 67.978 us 68.144 us]   

arrow1-sort 2^13 f32    time:   [34.306 us 34.405 us 34.521 us] 

Code: https://github.com/sundy-li/learn/tree/master/arrow-vs-arrow2/benches

Scripts:
cargo bench -- "arrow1-sort 2\^13 f32"
cargo bench -- "arrow2-sort 2\^13 f32"

Add support to read from Parquet

  • Create new module io/parquet
  • add dependency on the parquet crate without arrow feature
  • add conditional feature parquet and feature-gate io/parquet with it
  • re-write the arrow/ section of parquet's crate for this API

This is a large task ^_^

Note that parquet's crate has non-trivial and potentially unsound signatures of the physical types. E.g. INT96 is represented as Option<[u32; 3]> which has 16, not 12 bytes. We should spend some time investigating if a re-write of that is needed to e.g. pass MIRI checks and be generally safe and sound.

IPC writing writes all values in sliced arrays with offsets.

I am new to Rust and Arrow (so apologies for my lack of clarity in explaining this issue) and have a use case where I need to transform columnar record batches to row-oriented record batches, i.e. a single record batch for each row.

Full code to reproduce: https://github.com/dectl/arrow2-slice-bug

I expected each of the arrays in the rows vector to contain values for a single row, since I sliced the columns using the row number as the offset with a length of 1.

In this example, I read 5 rows from a csv file and create a record batch. The row-oriented record batches produced are all the same as the original record batch of 5. Looking at the debug output for rows[0] below, it appears that the arrays point to the full record batch buffer, containing data for all 5 rows, rather than the slice of data for the single row. Is there perhaps an issue with the offsets in this implementation or am I doing something wrong?

RecordBatch { schema: Schema { fields: [Field { name: "trip_id", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "lpep_pickup_datetime", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "lpep_dropoff_datetime", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "passenger_count", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "trip_distance", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "fare_amount", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "tip_amount", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "total_amount", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "payment_type", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "trip_type", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [Utf8Array { data_type: Utf8, offsets: Buffer { data: Bytes { ptr: 0x555aec9da480, len: 6, data: [0, 36, 72, 108, 144, 180] }, offset: 0, length: 2 }, values: Buffer { data: Bytes { ptr: 0x555aec9da780, len: 180, data: [97, 49, 55, 99, 53, 50, 97, 49, 45, 100, 53, 52, 49, 45, 52, 54, 99, 56, 45, 57, 100, 49, 101, 45, 51, 51, 51, 57, 97, 56, 55, 49, 53, 100, 50, 48, 48, 49, 101, 100, 100, 48, 52, 56, 45, 97, 48, 100, 48, 45, 52, 100, 49, 49, 45, 56, 102, 99, 99, 45, 99, 54, 100, 51, 52, 48, 51, 51, 52, 57, 53, 50, 56, 55, 49, 54, 97, 56, 97, 98, 45, 98, 99, 56, 51, 45, 52, 53, 50, 56, 45, 97, 57, 98, 100, 45, 54, 55, 101, 48, 51, 53, 49, 54, 50, 49, 51, 100, 99, 56, 99, 99, 48, 48, 57, 97, 45, 57, 57, 97, 52, 45, 52, 49, 48, 101, 45, 56, 56, 56, 54, 45, 52, 101, 55, 57, 56, 101, 57, 49, 56, 51, 52, 50, 101, 50, 54, 54, 54, 97, 57, 101, 45, 53, 102, 55, 48, 45, 52, 54, 98, 53, 45, 57, 55, 98, 99, 45, 53, 55, 53, 98, 100, 52, 56, 100, 53, 57, 57, 100] }, offset: 0, length: 180 }, validity: None, offset: 0 }, Utf8Array { data_type: Utf8, offsets: Buffer { data: Bytes { ptr: 0x555aec9daa80, len: 6, data: [0, 19, 38, 57, 76, 95] }, offset: 0, length: 2 }, values: Buffer { data: Bytes { ptr: 0x555aec9dac80, len: 95, data: [50, 48, 49, 57, 45, 49, 50, 45, 49, 56, 32, 49, 53, 58, 53, 50, 58, 51, 48, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 52, 53, 58, 53, 56, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 52, 49, 58, 51, 56, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 53, 50, 58, 52, 54, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 49, 57, 58, 53, 55] }, offset: 0, length: 95 }, validity: None, offset: 0 }, Utf8Array { data_type: Utf8, offsets: Buffer { data: Bytes { ptr: 0x555aec9dae80, len: 6, data: [0, 19, 38, 57, 76, 95] }, offset: 0, length: 2 }, values: Buffer { data: Bytes { ptr: 0x555aec9db080, len: 95, data: [50, 48, 49, 57, 45, 49, 50, 45, 49, 56, 32, 49, 53, 58, 53, 52, 58, 51, 57, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 53, 54, 58, 51, 57, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 53, 50, 58, 52, 57, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 49, 58, 49, 52, 58, 50, 49, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 51, 48, 58, 53, 54] }, offset: 0, 
length: 95 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9db300, len: 5, data: [5.0, 2.0, 1.0, 2.0, 1.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4180, len: 5, data: [0.0, 1.28, 2.47, 6.3, 2.3] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4380, len: 5, data: [3.5, 20.0, 10.5, 21.0, 10.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4580, len: 5, data: [0.01, 4.06, 3.54, 0.0, 0.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4780, len: 5, data: [4.81, 24.36, 15.34, 25.05, 11.3] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4980, len: 5, data: [1.0, 1.0, 1.0, 2.0, 1.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4c80, len: 5, data: [1.0, 2.0, 1.0, 1.0, 1.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }] }

This seems wrong: Bytes { ptr: 0x555aec9db300, len: 5, data: [5.0, 2.0, 1.0, 2.0, 1.0] }, offset: 0, length: 1 }

use std::path::Path;
use std::sync::Arc;

use arrow2::io::csv::read;
use arrow2::array::Array;
use arrow2::record_batch::RecordBatch;


fn main() {
    let mut reader = read::ReaderBuilder::new().from_path(&Path::new("sample.csv")).unwrap();
    let schema = Arc::new(read::infer_schema(&mut reader, Some(10), true, &read::infer).unwrap());
    
    let mut rows = vec![read::ByteRecord::default(); 5];
    read::read_rows(&mut reader, 0, &mut rows).unwrap();
    let rows = rows.as_slice();

    let record_batch = read::deserialize_batch(
        rows,
        schema.fields(),
        None,
        0,
        read::deserialize_column,
    ).unwrap();

    println!("full record batch:");
    println!("{:?}", record_batch);

    // Convert columnar record batch to a vector of "row" (single record) record batches
    let mut rows = Vec::new();

    for i in 0..record_batch.num_rows() {
        let mut row = Vec::new();
        for column in record_batch.columns() {
            let arr: Arc<dyn Array> = column.slice(i, 1).into();
            row.push(arr);
        }
        rows.push(row);
    }
    
    println!("row record batches:");
    for (i, row) in rows.into_iter().enumerate() {
        println!("row {}", i);
        let row_record_batch = RecordBatch::try_new(schema.clone(), row).unwrap();
        println!("{:?}", row_record_batch);
    }
}

Below is my equivalent code for the official arrow-rs implementation, which appears to work as expected.

use std::fs::File;
use std::path::Path;

use arrow::csv;
use arrow::util::pretty::print_batches;
use arrow::record_batch::RecordBatch;


fn main() {
    let file = File::open(&Path::new("sample.csv")).unwrap();
    let builder = csv::ReaderBuilder::new()
        .has_header(true)
        .infer_schema(Some(10));
    let mut csv = builder.build(file).unwrap();

    let record_batch = csv.next().unwrap().unwrap();
    let schema = record_batch.schema().clone();

    println!("full record batch:");
    print_batches(&[record_batch.clone()]).unwrap();

    // Convert columnar record batch to a vector of "row" (single record) record batches
    let mut rows = Vec::new();


    for i in 0..record_batch.num_rows() {
        let mut row = Vec::new();
        for column in record_batch.columns() {
            row.push(column.slice(i, 1));
        }
        rows.push(row);
    }
    

    // Using new record_batch.slice() method now implemented in SNAPSHOT-5.0.0
    /*
    for i in 0..record_batch.num_rows() {
        rows.push(record_batch.slice(i, 1));
    }
    */

    println!("row record batches:");
    for row in rows {
        let row_record_batch = RecordBatch::try_new(schema.clone(), row).unwrap();
        print_batches(&[row_record_batch]).unwrap();
    }

    /*
    for row in rows {
        print_batches(&[row]).unwrap();
    }
     */
}

Combine cast in simd sum

Currently, there are 2 ways to make sum(UInt8Array) -> UInt64

A. Cast UInt8Array to UInt64Array, and apply the simd sum kernel.
B. Iterate over the UInt8Array and accumulate the sum in one loop.

I did some benchmarks; B seems to perform better even with null values in the array.
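A minimal sketch of approach B, assuming arrow2's PrimitiveArray accessors (this is not the crate's actual kernel):

use arrow2::array::PrimitiveArray;

/// Widen each element while summing, instead of first casting the whole
/// UInt8 array to UInt64 and then summing.
fn sum_u8_as_u64(array: &PrimitiveArray<u8>) -> u64 {
    match array.validity() {
        None => array.values().as_slice().iter().map(|&v| v as u64).sum(),
        Some(validity) => array
            .values()
            .as_slice()
            .iter()
            .zip(validity.iter())
            .filter(|(_, is_valid)| *is_valid)
            .map(|(&v, _)| v as u64)
            .sum(),
    }
}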

Implement methods to `*Primitive` for `T: NaturalNativeType`

When creating nested types like a dictionary, it would be nice to not have to use .to(data_type) when the datatype is the "natural" datatype of the data (e.g. Utf8).

E.g.

let data = vec![Some("a"), Some("b"), Some("a")];

let data = data.into_iter().map(Result::Ok);
let a = DictionaryPrimitive::<i32, Utf8Primitive<i32>, &str>::try_from_iter(data)?;
let array = a.to(DataType::Dictionary(
            Box::new(DataType::Int32),
            Box::new(DataType::Utf8),
));

Allow `let array = a.into()` instead, where the logical type comes from the natural type for the physical type.

Keep the existing API for when the type is not the natural one.

Feature: Decimal and Timestamp type in ffi

Hi,

I'm working on some bindings between Python and Rust to optimize IO and parsing of data from cloud storages. I didn't follow the mailing list, so I would like to ask whether there is a chance to have Decimal and Timestamp types to and from ffi. From my view, they are really PrimitiveArray<i128> or PrimitiveArray<i64> underneath, with a logical type.

Thank you

Add interface to extend `MutableBitmap` from a slice of a `Bitmap`

One of the slowest operations in growable is copying a slice of bits from a Bitmap into a MutableBitmap.

The goal of this issue is to implement

impl MutableBitmap {
    fn extend_from_bitmap(&mut self, bitmap: &Bitmap, start: usize, len: usize) {
        todo!()
    }
}

that extends MutableBitmap by a slice [start, start+len) taken from Bitmap. Note that both MutableBitmap and Bitmap may have an offset, and start itself may not be a multiple of 8.

Then, re-run the growable benchmark after replacing the code in extend_validity and build_extend_null_bits with calls to this function.

This is essentially a (fun) exercise with bitmaps. This will likely benefit from merge_reversed, a useful tool to shift bytes within arrow bitmaps.
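For comparison, a naive bit-by-bit baseline that a faster byte-wise implementation could be validated and benchmarked against (assuming MutableBitmap::push and Bitmap::get_bit):

use arrow2::bitmap::{Bitmap, MutableBitmap};

/// Naive reference implementation: copy the slice one bit at a time.
fn extend_from_bitmap_naive(dst: &mut MutableBitmap, src: &Bitmap, start: usize, len: usize) {
    for i in start..start + len {
        dst.push(src.get_bit(i));
    }
}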

Out of bound read

There is an out of bound read here.

index + 1 is out of bounds. There are some funky things going on there, as we transmute u8 to u64.

This also exists in the original crate, filed under ARROW-11865

Consider bitvec for bitmap ops

@myrrlyn recently forked a simple repo for benchmarks of bitmap operations and shows a 10x improvement over a simple operation: https://github.com/myrrlyn/bitmap_perf_test 🚀

@myrrlyn, I notice that the return type of the function changed to BitVec<Lsb0, u8>, which I assume is the container for bitmaps of bitvec. In the use-case that motivated that repo (which is this repo and arrow's repo), we have a bit more constraints into what we can use because we must fulfill arrow's specification. This imposes constraints into how we can allocate and deallocate memory (we have our own allocator to support FFI and other shenanigans).

In this context, I wonder: does BitVec support a custom container and/or allocator? E.g. is there an API to wrap a pre-allocated &mut [u8] into BitVec and use bitvec's algorithms / implementations?

Add documentation to guide

I am considering moving some of the pages I have in the arrow guide I was writing to the guide Arrow2 has.

What do you think?
