Comments (4)
Just trying to understand the example here:
In the Pandas example, you're sorting before log / diff:
# pandas
df.sort_values('date').groupby('code').close.transform(np.log).diff()
But in the Polars example, you're sorting after log / diff:
# polars
pldf.with_columns(
    pl.col('close').log().diff().sort_by('date').over('code').alias("log_return1")
)
Should it not come before log/diff in order to be equivalent to the Pandas example?
pldf.with_columns(
    pl.col('close').sort_by('date').log().diff().over('code').alias("log_return1")
)
from polars.
@cmdlineluser I thought pl.Expr was lazily evaluated, like in Spark. Something like:
w = Window.partitionBy(['store_id', 'product_id', 'date']).orderBy(col('time_create').desc())
F.avg('sale_count').over(w)
Yes, but if you sort after .diff(), it produces different results:
import polars as pl
import numpy as np
df = pl.from_repr("""
┌─────────────────────┬──────┬───────┐
│ date                ┆ code ┆ close │
│ ---                 ┆ ---  ┆ ---   │
│ datetime[ns]        ┆ str  ┆ i64   │
╞═════════════════════╪══════╪═══════╡
│ 2021-01-01 00:04:00 ┆ 0001 ┆ 17    │
│ 2021-01-01 00:01:00 ┆ 0001 ┆ 18    │
│ 2021-01-01 00:02:00 ┆ 0001 ┆ 3     │
│ 2021-01-01 00:06:00 ┆ 0001 ┆ 3     │
│ 2021-01-01 00:05:00 ┆ 0001 ┆ 14    │
│ 2021-01-01 00:03:00 ┆ 0001 ┆ 7     │
│ 2021-01-01 00:09:00 ┆ 0001 ┆ 2     │
│ 2021-01-01 00:00:00 ┆ 0001 ┆ 12    │
│ 2021-01-01 00:08:00 ┆ 0001 ┆ 14    │
│ 2021-01-01 00:07:00 ┆ 0001 ┆ 2     │
└─────────────────────┴──────┴───────┘
""")
Your pandas example:
(df.to_pandas()
   .sort_values('date')
   .groupby('code')
   .close.transform(np.log).diff()
)
# 7         NaN
# 1    0.405465
# 2   -1.791759
# 5    0.847298
# 0    0.887303
# 4   -0.194156
# 3   -1.540445
# 9   -0.405465
# 8    1.945910
# 6   -1.945910
# Name: close, dtype: float64
Sorting before vs. after diff in Polars:
df.select(
    yes = pl.col('close').sort_by('date').log().diff().over('code'),
    no = pl.col('close').log().diff().sort_by('date').over('code')
)
# shape: (10, 2)
# ┌───────────┬───────────┐
# │ yes       ┆ no        │
# │ ---       ┆ ---       │
# │ f64       ┆ f64       │
# ╞═══════════╪═══════════╡
# │ null      ┆ 1.791759  │
# │ 0.405465  ┆ 0.057158  │
# │ -1.791759 ┆ -1.791759 │
# │ 0.847298  ┆ -0.693147 │
# │ 0.887303  ┆ null      │
# │ -0.194156 ┆ 1.540445  │
# │ -1.540445 ┆ 0.0       │
# │ -0.405465 ┆ -1.94591  │
# │ 1.94591   ┆ 0.154151  │
# │ -1.94591  ┆ -1.252763 │
# └───────────┴───────────┘
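To see why the two columns differ without involving Polars at all, here is a minimal library-free sketch of the same idea. The three rows and the `diff` helper are made up for illustration and are not how Polars implements it; they only show that diffing in stored row order and then reordering by date is not the same as ordering by date first.

```python
# Toy rows (made up for illustration), deliberately stored out of date order.
rows = [
    {"date": 3, "close": 40},
    {"date": 1, "close": 10},
    {"date": 2, "close": 20},
]


def diff(values):
    """First difference; the first element has no predecessor, so it is None."""
    return [None] + [b - a for a, b in zip(values, values[1:])]


# Sort first, then diff: each value is compared with the previous date.
by_date = sorted(rows, key=lambda r: r["date"])
sort_then_diff = diff([r["close"] for r in by_date])

# Diff first (in stored row order), then reorder the results by date.
raw_diff = diff([r["close"] for r in rows])
order = sorted(range(len(rows)), key=lambda i: rows[i]["date"])
diff_then_sort = [raw_diff[i] for i in order]

print(sort_then_diff)  # [None, 10, 20]
print(diff_then_sort)  # [-30, 10, None]
```

In the second variant each difference was already computed against the neighbour in storage order, so sorting afterwards only moves those values around.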
Oh, I understand. Unlike other aggregation functions, diff depends on row order, and the values it produces do not remember the position of their corresponding date.
Thank you for the clarification.