Comments (5)
It's optimized for the use-case where you call df = pl.concat(...) just once up top, and then you do all the rest of your processing on df.
The idea is that most operations on df will be faster if df is contiguous in memory, rather than separate chunks.
from polars.
Can you try concat with rechunk=False?
@s-banach It was about 3000x faster. Are there any reasons this isn't set to False by default?
@s-banach I see. Unfortunately, there were no performance gains when I used rechunk=False with how="align".
You are comparing apples with peaches.
Every time you extend df1, you mutate the original memory. So the first extend is much cheaper than the later ones (which trigger a realloc), because the size of df1 increases every iteration.
for i in range(10):
    df1.extend(df2)
    print(df1.height)
1000000
1500000
2000000
2500000
3000000
3500000
4000000
4500000
5000000
5500000
So your extend "benchmark" has a different input every iteration, whilst your concat "benchmark" needs to concat the 5_500_000 case every time. Please take the time to validate your benchmarks.
As for the difference: they will perform differently. That's fine, they do different things.
Don't use extend, as it mutates the underlying memory, which leads to these kinds of bugs, as the warning in the extend docstring states.