Deion Thank you to the hard-working <code class="notranslate

SkipList must be introduced FYI: <a href="https://github.com/pa

I was just curious as rolling rank has popped up a few times, e.g. <a class="issue-lin

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Rolling ewm/prod/rank about polars HOT 4 OPEN

Mirac-Le commented on July 18, 2024

Rolling ewm/prod/rank

from polars.

Comments (4)

MarcoGorelli commented on July 18, 2024 2

yes, that would be less efficient than a dedicated function

from polars.

wukan1986 commented on July 18, 2024 2

SkipList must be introduced

FYI:
https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/window/aggregations.pyx#L1281

from functools import lru_cache
from typing import Tuple

import numpy as np
from numba import jit
from pandas._libs.window.aggregations import roll_kurt as _roll_kurt
from pandas._libs.window.aggregations import roll_rank as _roll_rank
from polars import Series
from polars import Expr, Int32, UInt16, map_batches

@lru_cache
@jit(nopython=True, nogil=True, fastmath=True, cache=True)
def get_window_bounds(
        num_values: int = 0,
        window_size: int = 10,
) -> Tuple[np.ndarray, np.ndarray]:
    end = np.arange(1, num_values + 1, dtype=np.int64)
    start = end - window_size
    start = np.clip(start, 0, num_values)
    return start, end


def roll_rank(x: Series, d: int, pct: bool = True, method: str = 'average', ascending: bool = True):
    start, end = get_window_bounds(len(x), d)
    """
    https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/window/aggregations.pyx#L1281

    def roll_rank(const float64_t[:] values, ndarray[int64_t] start,
              ndarray[int64_t] end, int64_t minp, bint percentile,
              str method, bint ascending) -> np.ndarray:

    O(N log(window)) implementation using skip list
    """
    ret = _roll_rank(x.to_numpy().astype(float), start, end, d, pct, method, ascending)
    return Series(ret, nan_to_null=True)

def ts_rank(x: Expr, d: int = 5) -> Expr:
    return x.map_batches(lambda a: roll_rank(a, d, True))

from polars.

cmdlineluser commented on July 18, 2024 1

I was just curious as rolling rank has popped up a few times, e.g. #4808

Are these operations currently possible in Polars via Expr.rolling()?

df = pl.DataFrame({"A": [1, 4, 2, 3, 5, 3]})

(df.with_row_index()
   .with_columns(
      rank = pl.col("A").rank().rolling(index_column="index", period="3i").list[-1],
      prod = pl.col("A").product().rolling(index_column="index", period="3i"),
      ewm  = pl.col("A").ewm_mean(span=1.5).rolling(index_column="index", period="3i").list[-1]
   )
)

shape: (6, 5)
┌───────┬─────┬──────┬──────┬──────────┐
│ index ┆ A   ┆ rank ┆ prod ┆ ewm      │
│ ---   ┆ --- ┆ ---  ┆ ---  ┆ ---      │
│ u32   ┆ i64 ┆ f64  ┆ i64  ┆ f64      │
╞═══════╪═════╪══════╪══════╪══════════╡
│ 0     ┆ 1   ┆ 1.0  ┆ 1    ┆ 1.0      │
│ 1     ┆ 4   ┆ 2.0  ┆ 4    ┆ 3.5      │
│ 2     ┆ 2   ┆ 2.0  ┆ 8    ┆ 2.290323 │
│ 3     ┆ 3   ┆ 2.0  ┆ 24   ┆ 2.870968 │
│ 4     ┆ 5   ┆ 3.0  ┆ 30   ┆ 4.580645 │
│ 5     ┆ 3   ┆ 1.5  ┆ 45   ┆ 3.322581 │
└───────┴─────┴──────┴──────┴──────────┘

But because a list is accumulated, it is less efficient than dedicated functions would be?

from polars.

Mirac-Le commented on July 18, 2024

@cmdlineluser @MarcoGorelli @wukan1986

Thank you all for your responses; I've learned a lot from you. In my actual projects, I need to calculate rolling window rank/ewm/prod and other operators, grouped by some_col. I use the following code to calculate the EMA for the sliding window, and the results seem to be the same as those obtained from xarray's rolling_exp.mean().

polars code:

df.with_columns(
        pl.col(col_name)
        .ewm_mean(span=window_size, ignore_nulls=True)
        .over("some_col")
        .alias(f"{col_name}_ema_{window_size}")
    )

Could you please let me know if there are similar methods to calculate rank/prod, etc.?

For the rank/prod calculation, the current solution I think is:

df.with_row_index().rolling(
    index_column="dt", group_by="some_col", period="3m"
).agg(pl.col("val1").rank().last().alias("rank"))

To calculate prod, I just replace .rank() with .cumprod()
(Sorry, I might have been a bit misleading. I meant cumprod, not prod.)

My dt is at the minute level, but there is a period of time where it is discontinuous. Therefore, I would prefer to determine the sliding window based on the number of rows, i.e., period='3i'. Do I need to manually create an incrementing integer column based on the group?

I hope to learn the most elegant and standard way to do this from your best practices. Thanks again for all your help!

from polars.

Rolling ewm/prod/rank about polars HOT 4 OPEN

Comments (4)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent