Comments (11)
Yeah, it's the equivalent of having one observation every 2 minutes and then resampling so they're every 10 microseconds.
So I think a slowdown is expected. I'm not saying it's not addressable, but I don't think it's at all common to do this, so it's low priority compared with other open issues.
from polars.
Minimal repro:
import polars as pl

# 20 rows sharing one id; timestamps are roughly 110,000,000 units apart
df = pl.DataFrame(
    {
        "id": [67] * 20,
        "time_ns": [
            15016000000, 15126000000, 15236000000, 15346000000, 15456000000,
            15566000000, 15676000000, 15786000000, 15896000000, 16006000000,
            16116000000, 16226000001, 16336000001, 16446000001, 16556000000,
            16666000000, 16776000000, 16886000001, 16996000001, 17106000001,
        ],
    }
).set_sorted("time_ns")

# every="10i" asks for a window every 10 index units over a span of about 2 billion units
df.group_by_dynamic("time_ns", every="10i", check_sorted=False).agg(pl.col("id").alias("group"))
from polars.
Determining the groups takes a long time:
polars/crates/polars-time/src/group_by/dynamic.rs
Lines 315 to 324 in 42a4b01
If you're making groups every 10 units and your measurements span 2 billion units, then that's a lot of groups. There's probably some fast path which could be introduced to avoid creating a lot of them, though.
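To put a rough number on "a lot of groups", here is a back-of-the-envelope sketch using the first and last timestamps from the repro. It assumes one candidate window per `every` step across the whole span; the exact number of iterations inside Polars depends on window alignment and is not taken from the source, so treat this as an order-of-magnitude estimate only.

```python
# Rough estimate of how many candidate windows the repro implies.
# Assumption: one window start per `every` step between the first and last timestamp.
first_ts = 15_016_000_000   # first value of time_ns in the repro
last_ts = 17_106_000_001    # last value of time_ns in the repro
every = 10                  # from every="10i"
n_rows = 20                 # rows in the repro

span = last_ts - first_ts
candidate_windows = span // every + 1
windows_per_row = candidate_windows // n_rows

print(f"span: {span:,} units")
print(f"candidate windows: {candidate_windows:,}")
print(f"candidate windows per actual row: {windows_per_row:,}")
```

Under that assumption almost every candidate window is empty, which is why a fast path that skips empty windows would help.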
from polars.
Yes, we seem to iterate A LOT! Care to look at that one? Then I will do the pivots. :D
from polars.
I don't think this is so simple to speed up; there's already an early continue:
polars/crates/polars-time/src/windows/group_by.rs
Lines 79 to 85 in 25536cf
this may require a larger refactor..
from polars.
Oh, I didn't realize we went in steps of 10 through 2 billion units. Ok.. :/
from polars.
Is there a way to make it work with a time and/or duration datatype? I guess I could convert the column to seconds and then it should work fine with indices?
from polars.
Regardless of what dtype you convert it to, if your `every` is 8 orders of magnitude smaller than the distance between points, then there's going to be a perf impact.
May I ask what your use case is here? I think you may be better off using a different operation (truncate perhaps?).
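One way to read the "different operation" suggestion for an integer clock is to floor-divide each timestamp by a bucket width and group on the result, so that only buckets containing data are ever created. The sketch below shows the idea in plain Python on the repro's timestamps; `bucket_width` and `bucket_counts` are hypothetical names chosen for illustration, and the width of 500,000,000 units is an arbitrary assumption. In Polars the same key could presumably be built with an expression like `pl.col("time_ns") // bucket_width` and passed to a regular group-by, but that is not something the thread confirms.

```python
from itertools import groupby

# Timestamps from the repro (already sorted).
time_ns = [
    15016000000, 15126000000, 15236000000, 15346000000, 15456000000,
    15566000000, 15676000000, 15786000000, 15896000000, 16006000000,
    16116000000, 16226000001, 16336000001, 16446000001, 16556000000,
    16666000000, 16776000000, 16886000001, 16996000001, 17106000001,
]

bucket_width = 500_000_000  # hypothetical bucket size, in the same units as time_ns


def bucket_counts(times, width):
    """Return (bucket_start, row_count) pairs for sorted integer timestamps.

    Only buckets that actually contain rows are produced, so the cost scales
    with the number of rows rather than with span / width.
    """
    return [
        (key * width, len(list(group)))
        for key, group in groupby(times, key=lambda t: t // width)
    ]


buckets = bucket_counts(time_ns, bucket_width)
for start, count in buckets:
    print(start, count)
```

Unlike a dynamic group-by, this produces no rows for empty buckets, which may or may not be what a given downsampling task needs.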
from polars.
Just trying to do a lazy downsample within groups.
from polars.
If you're doing an operation on every 10 elements, you could try something like `unstack`, although you're going to generate a lot of columns. For this I would almost suggest `to_numpy().reshape(-1, 10).mean(axis=1)` or something of the sort.
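A minimal sketch of the reshape approach, assuming the values are already in a NumPy array (for example via `to_numpy()` on a column). The input array here is made-up illustrative data, and trimming the trailing partial chunk is one possible choice, since `reshape(-1, 10)` requires a length divisible by 10.

```python
import numpy as np

# Hypothetical input: 25 samples, standing in for a column converted with to_numpy().
values = np.arange(25, dtype=np.float64)
chunk = 10

# reshape(-1, chunk) needs a length divisible by chunk, so drop the trailing remainder.
usable = len(values) - len(values) % chunk

# Each row of the reshaped array holds `chunk` consecutive samples; average per row.
means = values[:usable].reshape(-1, chunk).mean(axis=1)
print(means)
```

Note that this averages every 10 rows regardless of their timestamps, so it only matches a time-based downsample when the samples are evenly spaced.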
from polars.
> Regardless of what dtype you convert it to, if your `every` is 8 orders of magnitude smaller than the distance between points, then there's going to be a perf impact. May I ask what your use case is here? I think you may be better of using a different operation (truncate perhaps?)
I'm trying to downsample my data to about 50 Hz, but my data isn't labeled by timestamp; instead it's labeled by some sort of monotonic clock counting from a given epoch.
from polars.