Comments (4)
Yes, that would be less efficient than a dedicated function.
from polars.
A skip list would have to be introduced for that.
FYI:
https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/window/aggregations.pyx#L1281
from functools import lru_cache
from typing import Tuple

import numpy as np
from numba import jit
from pandas._libs.window.aggregations import roll_kurt as _roll_kurt
from pandas._libs.window.aggregations import roll_rank as _roll_rank
from polars import Expr, Int32, Series, UInt16, map_batches


@lru_cache
@jit(nopython=True, nogil=True, fastmath=True, cache=True)
def get_window_bounds(
    num_values: int = 0,
    window_size: int = 10,
) -> Tuple[np.ndarray, np.ndarray]:
    # Half-open [start, end) bounds of the trailing window for each row.
    end = np.arange(1, num_values + 1, dtype=np.int64)
    start = end - window_size
    start = np.clip(start, 0, num_values)
    return start, end


def roll_rank(x: Series, d: int, pct: bool = True, method: str = 'average', ascending: bool = True):
    """
    https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/window/aggregations.pyx#L1281
    def roll_rank(const float64_t[:] values, ndarray[int64_t] start,
                  ndarray[int64_t] end, int64_t minp, bint percentile,
                  str method, bint ascending) -> np.ndarray:
    O(N log(window)) implementation using skip list
    """
    start, end = get_window_bounds(len(x), d)
    # minp is set to d, so rows without a full window come back as NaN.
    ret = _roll_rank(x.to_numpy().astype(float), start, end, d, pct, method, ascending)
    return Series(ret, nan_to_null=True)


def ts_rank(x: Expr, d: int = 5) -> Expr:
    return x.map_batches(lambda a: roll_rank(a, d, True))
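The idea behind the sorted-structure approach can be sketched in pure Python. This is only an illustration, not the pandas implementation: it keeps the window in a sorted list via `bisect`, so lookups are O(log window) but insertion and removal are O(window), which is the cost a skip list avoids. The name `rolling_rank_sorted` is hypothetical, and NaN handling is left out.

```python
from bisect import bisect_left, bisect_right, insort
from typing import List, Optional


def rolling_rank_sorted(values: List[float], window: int) -> List[Optional[float]]:
    """Average rank of the newest value within each trailing window.

    Hypothetical helper for illustration; yields None until the window is full.
    """
    sorted_window: List[float] = []
    out: List[Optional[float]] = []
    for i, v in enumerate(values):
        insort(sorted_window, v)
        if i >= window:
            # Drop the value that just left the window.
            old = values[i - window]
            del sorted_window[bisect_left(sorted_window, old)]
        if i + 1 < window:
            out.append(None)
            continue
        lo = bisect_left(sorted_window, v)   # count of values strictly smaller than v
        hi = bisect_right(sorted_window, v)  # count of values <= v
        out.append((lo + 1 + hi) / 2)        # mean of the 1-based ranks lo+1 .. hi
    return out


print(rolling_rank_sorted([1, 4, 2, 3, 5, 3], 3))
# [None, None, 2.0, 2.0, 3.0, 1.5]
```

Replacing the sorted list with a skip list (or another balanced structure) is what would bring the per-row update cost down to O(log window).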
I was just curious as rolling rank has popped up a few times, e.g. #4808
Are these operations currently possible in Polars via Expr.rolling()?
import polars as pl

df = pl.DataFrame({"A": [1, 4, 2, 3, 5, 3]})
(df.with_row_index()
   .with_columns(
      rank = pl.col("A").rank().rolling(index_column="index", period="3i").list[-1],
      prod = pl.col("A").product().rolling(index_column="index", period="3i"),
      ewm  = pl.col("A").ewm_mean(span=1.5).rolling(index_column="index", period="3i").list[-1]
   )
)
shape: (6, 5)
┌───────┬─────┬──────┬──────┬──────────┐
│ index ┆ A   ┆ rank ┆ prod ┆ ewm      │
│ ---   ┆ --- ┆ ---  ┆ ---  ┆ ---      │
│ u32   ┆ i64 ┆ f64  ┆ i64  ┆ f64      │
╞═══════╪═════╪══════╪══════╪══════════╡
│ 0     ┆ 1   ┆ 1.0  ┆ 1    ┆ 1.0      │
│ 1     ┆ 4   ┆ 2.0  ┆ 4    ┆ 3.5      │
│ 2     ┆ 2   ┆ 2.0  ┆ 8    ┆ 2.290323 │
│ 3     ┆ 3   ┆ 2.0  ┆ 24   ┆ 2.870968 │
│ 4     ┆ 5   ┆ 3.0  ┆ 30   ┆ 4.580645 │
│ 5     ┆ 3   ┆ 1.5  ┆ 45   ┆ 3.322581 │
└───────┴─────┴──────┴──────┴──────────┘
But because a list is accumulated, is this less efficient than dedicated functions would be?
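To make the prod and ewm columns above concrete, here is a plain-Python sketch that recomputes them window by window. It assumes the adjusted weighting with alpha = 2 / (span + 1), and the helper names `windowed_prod` and `windowed_ewm_last` are hypothetical. Like the list-based approach, it rebuilds every window from scratch, which is the overhead a dedicated rolling kernel would avoid.

```python
from math import prod
from typing import List


def windowed_prod(values: List[float], window: int) -> List[float]:
    """Product of each trailing window of at most `window` rows."""
    return [prod(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]


def windowed_ewm_last(values: List[float], window: int, span: float) -> List[float]:
    """Adjusted exponentially weighted mean of each trailing window (last value only)."""
    alpha = 2.0 / (span + 1.0)
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        # The newest value gets weight 1, the one before it (1 - alpha), and so on.
        weights = [(1.0 - alpha) ** k for k in range(len(chunk) - 1, -1, -1)]
        out.append(sum(w * v for w, v in zip(weights, chunk)) / sum(weights))
    return out


a = [1, 4, 2, 3, 5, 3]
print(windowed_prod(a, 3))  # [1, 4, 8, 24, 30, 45]
print([round(x, 6) for x in windowed_ewm_last(a, 3, span=1.5)])
# [1.0, 3.5, 2.290323, 2.870968, 4.580645, 3.322581]
```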
@cmdlineluser @MarcoGorelli @wukan1986
Thank you all for your responses; I've learned a lot from you. In my actual projects, I need to calculate rolling window rank/ewm/prod and other operators, grouped by some_col. I use the following code to calculate the EMA for the sliding window, and the results seem to be the same as those obtained from xarray's rolling_exp.mean().
polars code:
df.with_columns(
    pl.col(col_name)
    .ewm_mean(span=window_size, ignore_nulls=True)
    .over("some_col")
    .alias(f"{col_name}_ema_{window_size}")
)
Could you please let me know if there are similar methods to calculate rank/prod, etc.?
For the rank/prod calculation, my current solution is:
df.with_row_index().rolling(
    index_column="dt", group_by="some_col", period="3m"
).agg(pl.col("val1").rank().last().alias("rank"))
To calculate prod, I just replace .rank() with .cum_prod(). (Sorry, I might have been a bit misleading. I meant a cumulative product, not prod.)
My dt is at the minute level, but there is a period of time where it is discontinuous. Therefore, I would prefer to determine the sliding window based on the number of rows, i.e., period='3i'. Do I need to manually create an incrementing integer column based on the group?
I hope to learn the most elegant and standard way to do this from your best practices. Thanks again for all your help!