Pathological slowdown in `LazyFrame.group_by_dynamic` with simple input (pola-rs/polars)

MarcoGorelli commented on June 11, 2024

Yeah it's the equivalent of having 1 observation every 2 minutes, and then resampling so they're every 10 microseconds..

So I think a slowdown is expected. I'm not saying it's not addressable, but I don't think it's at all common to do this, so it's low-prio compared with other open issues.

from polars.

ritchie46 commented on June 11, 2024

Minimal repro:

import polars as pl

# 20 rows for a single id, spaced roughly 110_000_000 units apart
df = pl.DataFrame(
    {
        "id": [67] * 20,
        "time_ns": [
            15016000000, 15126000000, 15236000000, 15346000000,
            15456000000, 15566000000, 15676000000, 15786000000,
            15896000000, 16006000000, 16116000000, 16226000001,
            16336000001, 16446000001, 16556000000, 16666000000,
            16776000000, 16886000001, 16996000001, 17106000001,
        ],
    }
).set_sorted("time_ns")

# windows of 10 index units over a span of about 2 billion units
df.group_by_dynamic("time_ns", every="10i", check_sorted=False).agg(
    pl.col("id").alias("group")
)

MarcoGorelli commented on June 11, 2024

determining the groups takes a long time

polars/crates/polars-time/src/group_by/dynamic.rs

Lines 315 to 324 in 42a4b01

    
           let (groups, lower, upper) = group_by_windows( 
        
               w, 
        
               ts, 
        
               options.closed_window, 
        
               tu, 
        
               tz, 
        
               include_lower_bound, 
        
               include_upper_bound, 
        
               options.start_by, 
        
           );

If you're making groups every 10 units and your measurements span 2 billion units, then that's a lot of groups... There's probably some fast path that could be introduced to avoid creating a lot of them, though.
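To put a rough number on "a lot of groups", here is a back-of-the-envelope sketch in plain Python using the first and last `time_ns` values from the repro. The `+ 1` and the exact count are approximations; the real number depends on how window starts are aligned and which side is closed.

```python
# First and last `time_ns` values from the minimal repro.
first_ts = 15_016_000_000
last_ts = 17_106_000_001
every = 10    # window step, in the same unit as the timestamps
n_rows = 20   # number of observations in the repro

span = last_ts - first_ts
# Approximate number of candidate windows that cover the span.
n_windows = span // every + 1

print(f"span: {span:,} units")
print(f"candidate windows: {n_windows:,}")
print(f"windows per observation: {n_windows // n_rows:,}")
```

So on the order of 200 million windows are considered for 20 rows, and at most 20 of them can contain data.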


ritchie46 commented on June 11, 2024

Yes, we seem to iterate A LOT! Care to look at that one? Then I will do the pivots. :D

MarcoGorelli commented on June 11, 2024

I think this isn't so simple to speed up; there's already an early continue:

polars/crates/polars-time/src/windows/group_by.rs

Lines 79 to 85 in 25536cf

    'bounds: for bi in bounds_iter {
        // find starting point of window
        for &t in &time[start..time.len().saturating_sub(1)] {
            // the window is behind the time values.
            if bi.is_future(t, closed_window) {
                continue 'bounds;
            }

this may require a larger refactor..
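For illustration only, here is a plain-Python sketch of the difference between visiting every window and jumping straight to the window that holds the next timestamp. It is not the polars implementation: it assumes sorted non-negative integer timestamps, fixed-width left-closed windows aligned to multiples of `every`, and a period equal to `every`. The function names are hypothetical. The real code also has to handle offsets, calendar-aware durations, and overlapping windows, which is presumably part of why a larger refactor would be needed.

```python
def naive_windows(times, every):
    """Visit every window between the first and last timestamp."""
    start = times[0] - times[0] % every
    groups, visited, i = {}, 0, 0
    while start <= times[-1]:
        visited += 1
        # Collect the timestamps that fall in [start, start + every).
        while i < len(times) and times[i] < start + every:
            groups.setdefault(start, []).append(times[i])
            i += 1
        start += every
    return groups, visited


def skipping_windows(times, every):
    """Jump directly to the window containing the next unassigned timestamp."""
    groups, visited, i = {}, 0, 0
    while i < len(times):
        start = times[i] - times[i] % every
        visited += 1
        while i < len(times) and times[i] < start + every:
            groups.setdefault(start, []).append(times[i])
            i += 1
    return groups, visited


times = [0, 1000, 1001, 5000]
naive_groups, naive_visited = naive_windows(times, 10)
fast_groups, fast_visited = skipping_windows(times, 10)
print(naive_visited, fast_visited)  # 501 3
```

Both functions return the same non-empty groups; only the number of windows visited differs.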

ritchie46 commented on June 11, 2024

Oh, I didn't realize we went in steps of 10 through 2 billion units. Ok.. :/

kszlim commented on June 11, 2024

Is there a way to make it work with a time and/or duration datatype? I guess I could convert the column to seconds and then it should work fine with indices?

MarcoGorelli commented on June 11, 2024

Regardless of what dtype you convert it to, if your `every` is 8 orders of magnitude smaller than the distance between points, then there's going to be a perf impact.

May I ask what your use case is here? I think you may be better off using a different operation (truncate perhaps?)

kszlim commented on June 11, 2024

Just trying to do a lazy downsample within groups.

mcrumiller commented on June 11, 2024

If you're doing an operation on every 10 elements, you could try something like `unstack`, although you're going to generate a lot of columns. For this I would almost suggest `to_numpy().reshape(-1, 10).mean(axis=1)` or something of the sort.
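As a dependency-free stand-in for the reshape-and-mean idea, the sketch below averages fixed-size chunks of a list in plain Python. The `chunk_means` helper is hypothetical. Like `reshape(-1, 10)`, it requires the length to be a multiple of the chunk size, and it groups by row count rather than by time, so it only matches a time-based downsample when samples are evenly spaced.

```python
from statistics import fmean


def chunk_means(values, size):
    """Mean of each consecutive chunk of `size` values."""
    if len(values) % size != 0:
        raise ValueError("length must be a multiple of the chunk size")
    return [fmean(values[i:i + size]) for i in range(0, len(values), size)]


values = list(range(20))
means = chunk_means(values, 10)
print(means)  # [4.5, 14.5]
```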

kszlim commented on June 11, 2024

> Regardless of what dtype you convert it to, if your `every` is 8 orders of magnitude smaller than the distance between points, then there's going to be a perf impact.
>
> May I ask what your use case is here? I think you may be better off using a different operation (truncate perhaps?)

I'm trying to downsample my data to about 50 Hz, but my data isn't labeled by timestamp; instead it's just some sort of monotonic clock from a given epoch.
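A truncation-style downsample along the lines suggested earlier in the thread can be sketched in plain Python as follows. The assumptions are that the clock counts nanoseconds (so 50 Hz means 20,000,000-unit buckets) and that keeping the first sample per bucket is acceptable; the `downsample` helper and the sample rows are made up for illustration. In polars the same bucket key could presumably be built with floor division on the column, for example `pl.col("time_ns") // bucket_width`, and passed to a regular `group_by` together with `id`.

```python
TARGET_HZ = 50
UNITS_PER_SECOND = 1_000_000_000  # assumption: the clock counts nanoseconds
bucket_width = UNITS_PER_SECOND // TARGET_HZ  # 20_000_000 units per bucket


def downsample(rows, width):
    """Keep the first (id, t) row seen in each (id, bucket) pair."""
    kept = {}
    for id_, t in rows:
        bucket = t - t % width  # truncate t down to the start of its bucket
        kept.setdefault((id_, bucket), (id_, t))
    return list(kept.values())


# Made-up rows, sorted by time, for two ids.
rows = [
    (67, 0),
    (68, 1_000_000),
    (67, 5_000_000),
    (67, 19_999_999),
    (67, 20_000_000),
    (67, 45_000_000),
]
result = downsample(rows, bucket_width)
print(result)
# [(67, 0), (68, 1000000), (67, 20000000), (67, 45000000)]
```

Only buckets that actually contain a row are ever created, so the cost scales with the number of rows rather than with the span divided by the bucket width.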
