expose vectorized reduce feature in a plyrmr-esque way (revolutionanalytics/plyrmr)

piccolbo commented on June 16, 2024

The vectorized reduce feature in rmr seems too low-level for plyrmr. The problem is that we need to combine a vectorized grouping option with a reducer that can deal with multiple groups. That leaves us with two elements of an expression that need to be aligned, or errors will follow. For instance:

data %|%
group(cols, .vectorized = TRUEorFALSE) %|%
do(some.fun)

Whether .vectorized can be set to TRUE depends on what some.fun is capable of. Ideally one would like to get a separation of concerns such that if some.fun can deal with multiple groups, it will set the grouping option accordingly. Then we can remove the .vectorized option from group and let the reduce side determine what it can deal with. The same is true for the combiner or .recursive argument. The knowledge that an operation is associative and commutative rests naturally with the operation itself and can be encapsulated therein. The trouble starts when we combine multiple operations that require different settings. Let's see if we can sort it out.
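One way to picture "the knowledge rests with the operation itself" is to store the capabilities as attributes on the function, so that the grouping step can read them instead of asking the user. The sketch below is only an illustration in base R: declare_op, has_capability, the attribute names "vectorized" and "mergeable", and the column names are all assumptions, not part of plyrmr or rmr.

```r
# Hypothetical helper: tag a function with the capabilities it supports.
declare_op = function(f, vectorized = FALSE, mergeable = FALSE) {
  attr(f, "vectorized") = vectorized
  attr(f, "mergeable") = mergeable
  f
}

# Read a capability flag; untagged functions default to FALSE.
has_capability = function(f, what) isTRUE(attr(f, what))

# A per-group sum that can handle many groups in one call. Summation is
# associative and commutative, so it is also declared mergeable.
group_sum = declare_op(
  function(data) {
    keys = attr(data, "keys")
    aggregate(data["x"], by = data[keys], FUN = sum)
  },
  vectorized = TRUE, mergeable = TRUE)

# A plain function with no declaration is treated conservatively.
first_row = function(data) data[1, , drop = FALSE]

has_capability(group_sum, "vectorized")
has_capability(first_row, "vectorized")
```

With something like this in place, group would not need a .vectorized argument: the reduce side could inspect the function it is handed and choose the grouping mode itself.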

API for vectorized reduce ops

the data argument passed to a vectorized operation is extended with an attribute, keys, containing the names (as a character vector) of the columns in the data that define the grouping
the return value carries the same attribute, which can't be modified. The range of the key columns can only shrink: unique(.data[, attributes(.data)$keys]) is the allowed range for the keys in the output. This prevents a vectorized operation from modifying the grouping.
preservation of groups can be enforced by checking the keys attribute and the range of the corresponding columns. This could be expensive to enforce, so it may remain at the level of the spec.
a non vectorized operation f can be promoted to a vectorized one as follows

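# key columns named by the keys attribute (kept as a data frame)
keycols = data[, attributes(data)$keys, drop = FALSE]
# apply f to each non-empty group separately
lapply(split(data, keycols, drop = TRUE), f)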

This isn't more efficient, it's just to clarify the semantics. Could be used to combine vectorized and non vectorized if necessary. Could be dplyr-enhanced to get some speed back, still not immune to small group syndrome.
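The promotion and the key-preservation check described above can be sketched together in base R. This is a minimal illustration under stated assumptions: promote, keys_preserved and group_mean are hypothetical names, the grouping is assumed to travel in an attribute called "keys", and each per-group function is assumed to return a data frame that still contains the key columns.

```r
# Hypothetical wrapper: turn a per-group function f into one that accepts
# a multi-group data frame carrying a "keys" attribute.
promote = function(f) {
  function(data) {
    keys = attr(data, "keys")
    groups = split(data, data[keys], drop = TRUE)
    out = do.call(rbind, lapply(groups, f))
    rownames(out) = NULL
    attr(out, "keys") = keys  # the grouping is passed through unchanged
    out
  }
}

# Hypothetical check: the keys attribute is unchanged and every key
# combination in the output already occurs in the input.
keys_preserved = function(input, output) {
  keys = attr(input, "keys")
  out_keys = unique(output[keys])
  identical(attr(output, "keys"), keys) &&
    nrow(merge(out_keys, unique(input[keys]))) == nrow(out_keys)
}

# Per-group function: one summary row per group.
group_mean = function(g) data.frame(k = g$k[1], m = mean(g$x))

d = data.frame(k = c("a", "a", "b"), x = c(1, 3, 10))
attr(d, "keys") = "k"
res = promote(group_mean)(d)
keys_preserved(d, res)
```

The check costs a pass over the unique keys of both input and output, which is the expense the spec-only option avoids.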

Composition rules

if the reduce operation is composed of functions f1 ... fn and f1 is mergeable, then the combiner is f1 and the reducer is the composition f1 ... fn
if the reduce operation is composed of functions f1 ... fn, we could either
- make all functions vectorized if at least one is vectorized, and use vectorized grouping. This may incur some overhead from splitting the data again and again instead of a single time
- use vectorized grouping iff all functions are vectorized. After all, a single non-vectorized function is enough to drop from C speed to R speed.
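The rules above can be written down as a small planning function. The sketch assumes capabilities are stored as function attributes named "vectorized" and "mergeable", that f1 is applied first in the composition, and it picks the second alternative (vectorized grouping only if every step is vectorized). declare_op, compose and plan_reduce are hypothetical names, not plyrmr API.

```r
# Hypothetical tagging convention: capabilities stored as attributes.
declare_op = function(f, vectorized = FALSE, mergeable = FALSE) {
  attr(f, "vectorized") = vectorized
  attr(f, "mergeable") = mergeable
  f
}

# Compose f1 ... fn left to right: f1 is applied first.
compose = function(fs) function(x) Reduce(function(acc, f) f(acc), fs, x)

# Derive the reduce plan from a list of tagged functions.
plan_reduce = function(fs) {
  flags = function(what)
    vapply(fs, function(f) isTRUE(attr(f, what)), logical(1))
  list(
    # First rule: f1 doubles as the combiner only if it is mergeable.
    combiner = if (flags("mergeable")[1]) fs[[1]] else NULL,
    reducer = compose(fs),
    # Second rule, second alternative: all steps must be vectorized.
    vectorized = all(flags("vectorized")))
}

f1 = declare_op(function(x) sum(x), mergeable = TRUE)
f2 = declare_op(function(x) x * 2, vectorized = TRUE)
plan = plan_reduce(list(f1, f2))
plan$vectorized
```

Switching to the first alternative would mean replacing all with any and promoting the untagged steps before composing them.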

Make vectorized reduce the only option?

There is a clear trend in rmr evolution toward more vectorization. It makes programs slightly harder to write, but is necessary for speed in many cases, though not all. The question is: is it worth having two modes, vectorized and not, for the sake of simplicity when vectorized mode is not necessary (large groups)? Since rmr 2.0 we no longer have a non-vectorized map, even though vectorization is not necessary, and is slightly more complicated, in the case of large records. In plyrmr we can tackle this in a different way. We can turn on vectorization all the time. We allow operations to declare their vectorization compatibility. We can promote the ones that are not vectorized. For instance, select(data, sum(col)) is not vectorized (we are always talking with respect to the keys; sum is vectorized with respect to its argument). But in a grouped context sum(col) would be wrapped in a dplyr::group_by to provide the vectorization, at least whenever dplyr can. Otherwise natively vectorized ops, possibly implemented in C, would be executed as they are.
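The difference between a promoted sum and a natively group-aware one can be seen on a toy data frame. This sketch uses base R only: rowsum stands in here for the dplyr::group_by wrapping mentioned above, and the column names k and col are made up for the example.

```r
d = data.frame(k = c("a", "b", "a", "b"), col = c(1, 2, 3, 4))

# Ungrouped: sum collapses the whole column to a single value.
total = sum(d$col)

# Promoted: split by key, then call sum once per group from R.
promoted = vapply(split(d$col, d$k), sum, numeric(1))

# Group-aware: rowsum computes all group sums in a single call.
native = rowsum(d$col, d$k)[, 1]
```

Both grouped versions give the same per-group totals; the promoted one pays one R-level function call per group, which is where small groups become costly.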
