Comments (10)
@avimallu I had the same thought on lru_cache, but it doesn't work because neither pl.Series nor even np.ndarray is hashable for the underlying cache.
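For context, functools.lru_cache keys its cache on the hash of the arguments, so an unhashable argument fails before the wrapped function even runs. A minimal stdlib sketch, using a plain list as a stand-in for pl.Series/np.ndarray (which fail the same way, since both leave __hash__ unset):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def double(xs):
    return [v * 2 for v in xs]

try:
    # Lists, like pl.Series and np.ndarray, are unhashable, so the
    # cache lookup itself raises before double() is ever called.
    double([1, 2, 3])
except TypeError as exc:
    print(f"cache lookup failed: {exc}")  # unhashable type: 'list'
```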
Ping me next week. I will see if I can put something behind an env var.
I am not sure about that. We will not call into python for equality of functions and pointers checking failed in the past.
Can these be special cased so that a user can say that it's safe to CSE this expression? This ends up being a pretty big annoyance in some cases and makes certain programming patterns ugly.
At work, I essentially provide a framework where users pass me expressions and I apply them to a base table, as well as adding an over to the user-provided expressions. The only way to avoid UDFs being recomputed would be to reference them by name in a later context. That's okay if the UDF is cheap, but some of them are quite expensive, so having CSE work would be great.
I think this can create many bugs, a door I don't want to open at this point in time. We can look at enabling it for UDFs later.
Even if it's completely opt-in? This is a bit of a blocker for me. I'm curious whether it'd be possible to bring back the pl.Expr.cache method as an alternative to this instead?
> The only way to avoid UDFs from being recomputed would be by referencing them by name in a later context.

Does lru_cache not work for your case?
> The only way to avoid UDFs from being recomputed would be by referencing them by name in a later context.
>
> Does lru_cache not work for your case?

@avimallu I don't think you understand the feature request.
import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").map_batches(lambda x: x*2).alias("b")
derived_expr_0 = udf_expr.mul(2).alias("c")
derived_expr_1 = udf_expr.mul(3).alias("d")
ldf = ldf.with_columns(udf_expr, derived_expr_0, derived_expr_1)
print(ldf.explain())
You'll notice that the following plan is printed, meaning the UDF gets evaluated 3x:
WITH_COLUMNS:
[col("a").python_udf().alias("b"), [(col("a").python_udf()) * (2.cast(Unknown(Any)))].alias("c"), [(col("a").python_udf()) * (3.cast(Unknown(Any)))].alias("d")], []
DF ["a"]; PROJECT */1 COLUMNS; SELECTION: None
Contrast it with:
import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").mul(2).alias("b")
derived_expr_0 = udf_expr.mul(2).alias("c")
derived_expr_1 = udf_expr.mul(3).alias("d")
ldf = ldf.with_columns(udf_expr, derived_expr_0, derived_expr_1)
print(ldf.explain())
Which will print out:
WITH_COLUMNS:
[col("__POLARS_CSER_0xd39686281a38356a").alias("b"), [(col("__POLARS_CSER_0xd39686281a38356a")) * (2)].alias("c"), [(col("__POLARS_CSER_0xd39686281a38356a")) * (3)].alias("d")], [[(col("a")) * (2)].alias("__POLARS_CSER_0xd39686281a38356a")]
DF ["a"]; PROJECT */1 COLUMNS; SELECTION: None
This requires Polars-side changes, or else you have to explicitly write your query/code like:
import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").map_batches(lambda x: x*2).alias("b")
derived_expr_0 = pl.col("b").mul(2).alias("c")
derived_expr_1 = pl.col("b").mul(3).alias("d")
ldf = ldf.with_columns(udf_expr)
ldf = ldf.with_columns(derived_expr_0, derived_expr_1)
print(ldf.explain())
But in my use case, users build up trees of expressions which they pass to my framework to evaluate. That becomes very ugly if CSE isn't supported, and it breaks the abstraction.
@ritchie46 if you're not swamped this would be great! Or if you can give me some high level guidance I can try to take a crack at this if you don't think it's too complicated.