Comments (1)

piccolbo commented on June 16, 2024

The vectorized reduce feature in rmr seems too low-level for plyrmr. The problem is that we need to combine a vectorized grouping option with a reducer that can deal with multiple groups. That leaves two elements of an expression that must be kept aligned, or errors will follow. For instance:

data %|%
group(cols, .vectorized = TRUEorFALSE) %|%
do(some.fun)

Whether .vectorized can be set to TRUE depends on what some.fun is capable of. Ideally one would like to get a separation of concerns such that if some.fun can deal with multiple groups, it will set the grouping option accordingly. Then we can remove the .vectorized option from group and let the reduce side determine what it can deal with. The same is true for the combiner or .recursive argument. The knowledge that an operation is associative and commutative rests naturally with the operation itself and can be encapsulated therein. The trouble starts when we combine multiple operations that require different settings. Let's see if we can sort it out.
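One way to picture the operation carrying its own capabilities is to store them as attributes on the function, so that the grouping step can read them instead of taking a .vectorized argument. This is only a sketch: declare, is.vectorized, is.mergeable, group.sum and one.group.summary are hypothetical names for illustration, not part of plyrmr or rmr.

```r
# Hypothetical helper: tag a function with what it is capable of.
declare = function(f, vectorized = FALSE, mergeable = FALSE)
  structure(f, vectorized = vectorized, mergeable = mergeable)

# Hypothetical accessors the grouping step could use.
is.vectorized = function(f) isTRUE(attr(f, "vectorized"))
is.mergeable = function(f) isTRUE(attr(f, "mergeable"))

# Handles all groups in one call; rowsum is base R.
# Summation is associative and commutative, hence mergeable.
group.sum = declare(
  function(values, keys) rowsum(values, keys),
  vectorized = TRUE,
  mergeable = TRUE)

# Only knows how to process a single group.
one.group.summary = declare(function(.data) data.frame(n = nrow(.data)))

is.vectorized(group.sum)
is.vectorized(one.group.summary)
```

With something like this, group would not need a .vectorized option: the step that runs the reducer could query the function itself.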

API for vectorized reduce ops

  • The data argument passed to a vectorized operation is extended with an attribute, keys, containing the names (as a character vector) of the columns in the data that define the grouping.
  • The return value carries the same attribute, and it can't be modified. The range of the key columns can only be reduced, meaning unique(.data[, attributes(.data)$keys]) is the allowed range for the keys in the output. This is to prevent a vectorized operation from modifying the grouping.
  • Preservation of groups can be enforced by checking the keys attribute and the range of the corresponding columns. This could be expensive to enforce, so it may remain at the level of specs.
  • A non-vectorized operation f can be promoted to a vectorized one as follows:
keycols = .data[, attributes(.data)$keys, drop = FALSE]  # key columns, kept as a data frame
lapply(split(.data, keycols), f)  # apply f to each group separately

This isn't more efficient; it is just meant to clarify the semantics. It could be used to combine vectorized and non-vectorized operations if necessary. It could also be dplyr-enhanced to get some speed back, though it would still not be immune to small-group syndrome.
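A slightly fuller sketch of the promotion and of the key-range check, in base R. It assumes the input is a data frame with a keys attribute and that f returns a one-row data frame without the key columns; promote and keys.preserved are hypothetical names used only here.

```r
# Promote a per-group function f to one that handles all groups at once.
promote = function(f) {
  function(.data) {
    keys = attr(.data, "keys")
    groups = split(.data, .data[, keys, drop = FALSE], drop = TRUE)
    pieces = lapply(groups, function(g) {
      # re-attach the key values so the grouping is visible in the output
      keyvals = unique(g[, keys, drop = FALSE])
      data.frame(keyvals, f(g), row.names = NULL)
    })
    out = do.call(rbind, pieces)
    rownames(out) = NULL
    structure(out, keys = keys)  # the output carries the same keys attribute
  }
}

# Spec check: same keys attribute, and output key values drawn from the input.
keys.preserved = function(input, output) {
  keys = attr(input, "keys")
  id = function(d)
    do.call(paste, c(unname(as.list(d[, keys, drop = FALSE])), sep = "\r"))
  identical(attr(output, "keys"), keys) && all(id(output) %in% id(input))
}

input = structure(
  data.frame(k = c("a", "b", "a", "b"), x = 1:4),
  keys = "k")
total = function(g) data.frame(total = sum(g$x))  # per-group only

result = promote(total)(input)
keys.preserved(input, result)
```

The check scans the key columns of both input and output, which is the cost that may make it more suitable as a spec than as something enforced on every call.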

Composition rules

  • If the reduce operation is composed of functions f1 ... fn and f1 is mergeable, then the combiner is f1 and the reducer is the composition f1 ... fn.
  • If the reduce operation is composed of functions f1 ... fn, we could either
    • make all functions vectorized if at least one is vectorized, and use vectorized grouping. This may incur some overhead from having to split the data again and again instead of a single time; or
    • use vectorized grouping iff all functions are vectorized. After all, just one non-vectorized function is enough to drop from C speed to R speed.
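The rules above can be sketched as a small planning function. This assumes capabilities are stored as function attributes, that f1 is applied first in the composition, and it picks the second grouping option (vectorized iff all functions are). plan.reduce and the two accessors are hypothetical names, not existing plyrmr functions.

```r
is.vectorized = function(f) isTRUE(attr(f, "vectorized"))
is.mergeable = function(f) isTRUE(attr(f, "mergeable"))

plan.reduce = function(...) {
  fs = list(...)
  list(
    # vectorized grouping iff every function is vectorized
    vectorized = all(vapply(fs, is.vectorized, logical(1))),
    # the combiner is f1 when f1 is mergeable, otherwise there is none
    combiner = if (is.mergeable(fs[[1]])) fs[[1]] else NULL,
    # the reducer applies f1, then f2, ..., then fn
    reducer = function(x) Reduce(function(acc, f) f(acc), fs, x))
}

f1 = structure(function(x) sum(x), vectorized = TRUE, mergeable = TRUE)
f2 = structure(function(x) x / 2, vectorized = FALSE)

plan = plan.reduce(f1, f2)
plan$vectorized      # FALSE, because f2 is not vectorized
plan$reducer(1:4)    # sum first, then halve
```

Switching to the first option would mean promoting f2 and setting vectorized to TRUE whenever any function is vectorized.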

Make vectorized reduce the only option?

There is a clear trend in rmr evolution toward more vectorization. It makes programs slightly harder to write, but it is necessary for speed in many cases, though not all. The question is: is it worth having two modes, vectorized and not, for the sake of simplicity when vectorized mode is not necessary (large groups)? Since rmr 2.0 we no longer have a non-vectorized map, even though vectorization is unnecessary and slightly more complicated in the case of large records. In plyrmr we can tackle this in a different way. We can turn on vectorization all the time. We allow operations to declare their vectorization compatibility. We can promote the ones that are not vectorized. For instance, select(data, sum(col)) is not vectorized (we are always talking with respect to the keys; sum is vectorized with respect to its argument). But in a grouped context sum(col) would be wrapped in a dplyr::group_by to provide the vectorization, at least whenever dplyr can. Otherwise natively vectorized ops, possibly implemented in C, would be executed as they are.
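To make the distinction concrete, here is a small base R comparison. It uses split plus vapply as a stand-in for the promotion (the text proposes dplyr::group_by for that role) and base rowsum as an example of an operation that is already vectorized with respect to the keys. The data frame and column names are made up for illustration.

```r
data = data.frame(k = c("a", "b", "a", "b"), col = c(1, 2, 3, 4))

# Promoted: sum is vectorized over its argument but not over the keys,
# so it is called once per group.
promoted = vapply(split(data$col, data$k), sum, numeric(1))

# Natively vectorized with respect to the keys: one call covers every group.
native = rowsum(data$col, data$k)[, 1]

all.equal(promoted, native)
```

Both give the same per-group sums; the difference is one R-level function call per group in the promoted version, which is where small groups hurt.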

from plyrmr.
