Comments (6)
As mentioned by others, this is not a fair comparison, as the input/output formats are different: we don't manipulate in place but generate a copy. Also, Polars has proper nulls, which means it has to consult a separate memory location (the validity buffer) in addition to the values, whereas pandas only has to look at the values themselves since it uses NaN.
Finally, the original test of 280,000 rows is far too small; at that size you're almost benchmarking the Polars DSL parsing/optimizer rather than the data manipulation itself.
Repeating the above experiment with 100M rows, I get the following results on my Apple M2 machine (pandas first, then Polars):
378 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
742 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm currently finishing a PR that would reduce the gap to this:
379 ms ± 5.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
544 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Further improvement via branchless filling is still possible, but it's low priority at the moment, as it's rather labour-intensive to write.
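For context, "branchless" filling replaces the per-element if-null branch with pure index arithmetic. This is not how Polars implements it, just a NumPy sketch of the general idea for a forward fill (backward fill is the same operation on the reversed array):

```python
import numpy as np

def ffill_branchless(a: np.ndarray) -> np.ndarray:
    """Forward-fill NaNs without a per-element branch.

    Each position receives the index of the most recent non-NaN value,
    computed with a running maximum instead of an if/else per element.
    """
    valid = ~np.isnan(a)
    # index of the last valid element at or before each position;
    # leading NaNs keep index 0, which points at a NaN and so stays NaN
    idx = np.where(valid, np.arange(len(a)), 0)
    np.maximum.accumulate(idx, out=idx)
    return a[idx]

def bfill_branchless(a: np.ndarray) -> np.ndarray:
    # backward fill is forward fill on the reversed array
    return ffill_branchless(a[::-1])[::-1]
```

Every element pays the same fixed cost (a compare, a max, a gather), which trades a data-dependent branch for predictable, vectorizable work.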
from polars.
I'm not seeing the same behavior:
242 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
239 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Reopening the case, as I still get faster results for the pandas counterpart:
2.21 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.86 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Let me know if anyone gets something else.
I don't think your test is apples to apples: df["random_value"].bfill() doesn't return a DataFrame, it returns a Series. A more apples-to-apples test would compare two function calls that each return a DataFrame, so something like:
%%timeit
df2.with_columns(pl.col("random_value").backward_fill())

%%timeit
df.assign(random_value=df["random_value"].bfill())
When I do that comparison with 100M rows and 20% nulls, I get: Polars takes 795 ms and pandas takes 1.44 s.
@deanm0000 I see. The whole idea is to create a new column in a dataframe where I do backward filling. Using with_columns instead of select, I get the following results, where Polars is the second line:
2.33 ms ± 709 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.25 ms ± 603 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The pandas way of adding a new column or manipulating an existing one is usually df["New_Col"] = ..., so it would be somewhat misleading to compare against assign, which almost nobody uses.
I see your point, but it's not a bug that pandas is faster for this operation.
Someone should correct me if I have this wrong, but I think the difference is that NumPy arrays are mutable whereas Arrow arrays are immutable. That means when you just want to change a subset of values, pandas/NumPy can do it in place, whereas performing the same operation on Arrow arrays requires rewriting all the values.