Giter Site home page Giter Site logo

Comments (18)

elferherrera avatar elferherrera commented on August 15, 2024 1

I've been reading the some rules for these operations and there are more cases like that. For example, what would happen if we add two arrays with different precision and scale? Let's say a Decimal(4,3) + Decimal(5,2), then the resulting array should be Decimal(6,3) to accommodate for all the new numbers and decimals.

I see what you mean with the different functions, but how will you select when to apply them? I'm wondering if for each number in the array you will have to check for these cases and apply the required one.

from arrow2.

jorgecarleitao avatar jorgecarleitao commented on August 15, 2024 1

Yes, good idea. You can do that: write Closes #XX in the description. :)

from arrow2.

elferherrera avatar elferherrera commented on August 15, 2024

Hi Jorge. I was going through the code thinking about these arithmetics. I would guess the arithmetics have to consider the precision and scale of the numbers, right?
For example, if we have a Decimal(5,2) (precision 5 and scale 2) the max number that can exist for that primitive is 999.99. Therefore, adding 555.55 + 555.55 should return 999.99. Is that correct?
Because now if you add those two numbers you would get 1111.1 which violates the selected precision and scale

from arrow2.

jorgecarleitao avatar jorgecarleitao commented on August 15, 2024

Hi @elferherrera ! Very good points, to which I do not have a good answer for. My initial though would be to check how postgres does it.

One idea is to add different functions for different cases, e.g. checked_add that writes a null when overflows, saturating_add that saturates, add, that panics, "adaptive_add", that increases the precision when overflowing?

I was hopping that each these variations would re-use most the code.

from arrow2.

jorgecarleitao avatar jorgecarleitao commented on August 15, 2024

I think that as long as we are consistent, it should not be a problem. We already panick with other types atm: e.g. if any of the slots of an Int32 is i32::MAX and we add with 1 on the other side, it panics.

My thinking would be that arrow2 should not have to take a decision on which type to apply; the problem-specific applications would decide. For example, DataFusion would use the one closest to postgres, Polars could use another variation, etc. A bit like Rust does: it offers checked_add, add, saturating_add, etc, and leave each user to pick the one that fits their use-case.

The "adaptive" case is a bit tricky in datafusion atm because it assumes that the logical schema remains constant throughout the application. My hypothesis is that the user would cast the decimal to a larger precision if they feel that the operation would overflow on the current precision.

from arrow2.

elferherrera avatar elferherrera commented on August 15, 2024

At the moment there is no Decimal in Rust Arrow, right?

from arrow2.

jorgecarleitao avatar jorgecarleitao commented on August 15, 2024

I think that there is ongoing work to support it here.

from arrow2.

elferherrera avatar elferherrera commented on August 15, 2024

Coming back with another question about the arithmetics module. The arithmetic function takes references to dyn Array, that would mean that in the future these arithmetics could be performed on a dictionary or a struct. However, that will panic when it is downcasted to a PrimaryArray.

Don't you think it will be better to specify that the arguments for the function should be PrimitiveArray from the beginning?

from arrow2.

jorgecarleitao avatar jorgecarleitao commented on August 15, 2024

The general design in the crate has been: offer both untyped and typed APIs, so that the users can decide which one they prefer. The untyped is less prone to optimizations but easier to use, the typed benefits from help from the compiler wrt to type checking.

The idea would be to extend that function to other logical types. E.g. Primitive + Dictionary makes sense when rhs is a dictionary of primitives with the same logical type. Same for Float32 + Decimal. We can extend this this via

match (lhs.data_type(), rhs.data_type()) {
}

instead of requiring both sides to be equal and use match lhs.data_type() {}.

The typed API is arithmetic_primitive (currently not pub, though xD).

The other idea is to offer a second function can_arithmetic() -> bool, that declares the valid combinations. We do this in cast (can_cast).

from arrow2.

elferherrera avatar elferherrera commented on August 15, 2024

I see. It makes sense to have both. I was wondering why you had arithmetic and arithmetic_primitive, now you had made that very clear. I guess eventually there will have to be a arithmetic_dictionary or arithmetic_structs.

However, I'm curious to know how are you planning to find what type of array the dyn Array will be? how will the function know that it is a primitive array of or a dictionary?

from arrow2.

jorgecarleitao avatar jorgecarleitao commented on August 15, 2024

Ahh, now I understand your question.

The easiest way to think about it is to reduce the trait Array to the following two functions:

trait Array {
    fn as_any(&self) -> &dyn std::any::Any;

    fn data_type(&self) -> &DataType;
}

the general flow of any function (kernel or deserialization to any format), is:

match array.data_type() {
        DataType::Float32 => {
            let array = array.as_any().downcast_ref::<PrimitiveArray<f32>>().unwrap();
            // let array = f32-specific operator
        }
        DataType::Float64 => {
            let array = array.as_any().downcast_ref::<PrimitiveArray<f64>>().unwrap();
            // let array = f64-specific operator
        }
        _ => Err("This operator is only valid for float point.".to_string()),
}

i.e. we match array.data_type() and downcast to the corresponding physical representation. There is a one-to-many correspondence between a DataType (e.g. DataType::Utf8) and a physical in-memory representation (e.g. Utf8Array<i32>). I tried to describe the idea in here, which includes the map between DataType and the physical representation.

from arrow2.

elferherrera avatar elferherrera commented on August 15, 2024

I think I get it. It will make sense to add that logic in the macro
macro that I added to the arithmetics module. By the way, I have been adding more docs to the functions to help me understand what they do 😀.

Regarding the Array Trait. I think it needs another function that can tell you what type of array it is. For example, we could have a primitive array and I dictionary array. Both implement Array, so both can be used as arguments in the arithmetics function. However, the dictionary will panic when you do this:

let rhs = $rhs.as_any().downcast_ref::<$primitive_array_type>();

Matching on the data type wont be enough to know the type of array used for the arithmetic. Or what do you think?

from arrow2.

jorgecarleitao avatar jorgecarleitao commented on August 15, 2024

It will make sense to add that logic in the macro macro that I added to the arithmetics module.

Yeap, that looks great! I would be grateful if you would like to contribute that as a PR.

Matching on the data type wont be enough to know the type of array used for the arithmetic. Or what do you think?

Note that the DataType of a dictionary array is not the same as the datatype of a primitive array. One is marked as DataType::Dictionary, the other is simply DataType::Int*. So, we will be able to match the dictionary and downcast it accordingly. Or am I missing something fundamental here?

from arrow2.

elferherrera avatar elferherrera commented on August 15, 2024

Note that the DataType of a dictionary array is not the same as the datatype of a primitive array. One is marked as DataType::Dictionary, the other is simply DataType::Int*. So, we will be able to match the dictionary and downcast it accordingly. Or am I missing something fundamental here?

hahaha... my mistake. I just checked the DataType enum and I saw dictionary there. For a moment I thought that DataType only had primitives.

Yeap, that looks great! I would be grateful if you would like to contribute that as a PR.

Sure thing. I was planning to add it in the PR for the decimals arithmetics Im working on. Can I use the same branch to do multiple PRs? or do I have to create a separate branch per PR?

from arrow2.

jorgecarleitao avatar jorgecarleitao commented on August 15, 2024

Github does not allow the same branch to multiple PRs. It is more important to me that the individual commits are well organized so that we can cherry-pick / revert / create a changelog out of them. So, pure docs separated from functionality is good, but they do not have to be different PRs; the same PR with two commits is fine for now.

from arrow2.

elferherrera avatar elferherrera commented on August 15, 2024

cool... then let me create a branch with these changes to PR the macro

from arrow2.

elferherrera avatar elferherrera commented on August 15, 2024

A quick favour, do you mind checking this file with the add options for decimal?

It has the template I want to use for the other operations: subtraction, multiplication and division.

For the adaptive function I managed to adjust the scale for the resulting vector but not the precision. The problem I have is that I don't know how to keep track of the precision during the iteration process in the binary function. I'm thinking that the only way I can achieve this is by using a for loop and store the results in a vector. It will be slower but we can keep track of the precision in case it changes in one of the operations.

I created a new function called binary_with_change that will change the data_type for the resulting array. But now that I'm writing about it, I think that function has to be in the decimal folder and it should be private.

from arrow2.

elferherrera avatar elferherrera commented on August 15, 2024

@jorgecarleitao do you mind linking this issue with the Decimals PR?

from arrow2.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.