Comments (10)
@avimallu I had the same thought on lru_cache, but it doesn't work because neither pl.Series nor even np.ndarray is hashable for the underlying cache.
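For context, functools.lru_cache keys its cache on the hash of the arguments, so an unhashable argument fails before the wrapped function even runs. A minimal stdlib sketch, using a plain list as a stand-in for pl.Series/np.ndarray (which fail the same way, since both leave __hash__ unset):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def double(xs):
    return [v * 2 for v in xs]

try:
    # Lists, like pl.Series and np.ndarray, are unhashable, so the
    # cache lookup itself raises before double() is ever called.
    double([1, 2, 3])
except TypeError as exc:
    print(f"cache lookup failed: {exc}")  # unhashable type: 'list'
```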
Ping me next week. I will see if I can put something behind an env var.
I am not sure about that. We will not call into python for equality of functions and pointers checking failed in the past.
Can these be special cased so that a user can say that it's safe to CSE this expression? This ends up being a pretty big annoyance in some cases and makes certain programming patterns ugly.
At work, I essentially provide a framework where users pass me expressions and I apply them to a base table, as well as adding an over to the user-provided expressions. The only way to avoid UDFs being recomputed would be to reference them by name in a later context. That's okay if the UDF is cheap, but some of them are quite expensive, so having CSE work would be great.
I think this can create many bugs, a door I don't want to open at this point in time. We can look at enabling it for UDFs later.
Even if it's completely opt-in? This is a bit of a blocker for me. I'm curious whether it'd be possible to bring back the pl.Expr.cache method as an alternative to this instead?
> The only way to avoid UDFs from being recomputed would be by referencing them by name in a later context.

Does lru_cache not work for your case?
> The only way to avoid UDFs from being recomputed would be by referencing them by name in a later context.
>
> Does lru_cache not work for your case?

@avimallu I don't think you understand the feature request.
import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").map_batches(lambda x: x*2).alias("b")
derived_expr_0 = udf_expr.mul(2).alias("c")
derived_expr_1 = udf_expr.mul(3).alias("d")
ldf = ldf.with_columns(udf_expr, derived_expr_0, derived_expr_1)
print(ldf.explain())
You'll notice that the following plan is printed, meaning the UDF gets evaluated 3x:
WITH_COLUMNS:
[col("a").python_udf().alias("b"), [(col("a").python_udf()) * (2.cast(Unknown(Any)))].alias("c"), [(col("a").python_udf()) * (3.cast(Unknown(Any)))].alias("d")], []
DF ["a"]; PROJECT */1 COLUMNS; SELECTION: None
Contrast it with:
import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").mul(2).alias("b")
derived_expr_0 = udf_expr.mul(2).alias("c")
derived_expr_1 = udf_expr.mul(3).alias("d")
ldf = ldf.with_columns(udf_expr, derived_expr_0, derived_expr_1)
print(ldf.explain())
Which will print out:
WITH_COLUMNS:
[col("__POLARS_CSER_0xd39686281a38356a").alias("b"), [(col("__POLARS_CSER_0xd39686281a38356a")) * (2)].alias("c"), [(col("__POLARS_CSER_0xd39686281a38356a")) * (3)].alias("d")], [[(col("a")) * (2)].alias("__POLARS_CSER_0xd39686281a38356a")]
DF ["a"]; PROJECT */1 COLUMNS; SELECTION: None
This requires Polars-side changes, or else you have to explicitly write your query/code like:
import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").map_batches(lambda x: x*2).alias("b")
derived_expr_0 = pl.col("b").mul(2).alias("c")
derived_expr_1 = pl.col("b").mul(3).alias("d")
ldf = ldf.with_columns(udf_expr)
ldf = ldf.with_columns(derived_expr_0, derived_expr_1)
print(ldf.explain())
But in my use case, users build up trees of expressions which they pass to my framework to evaluate. That becomes very ugly if CSE isn't supported, and it breaks the abstraction.
@ritchie46 if you're not swamped this would be great! Or if you can give me some high level guidance I can try to take a crack at this if you don't think it's too complicated.