Comments (9)
@ritchie46 See attached - the dataframe from df = pl.read_database_uri(...)
contains chunks of size 155, which matches the error message. I wrote out 2 columns using write_parquet("tasks.parquet", row_group_size=155),
which I assume preserves the chunks.
To reproduce
pl.read_parquet("tasks.parquet").filter(~pl.col("task_id").is_in([1]))
ShapeError: filter's length: 155 differs from that of the series: 0
(I think the JSON encoding was a furphy, i.e. a red herring, but I included that column anyway)
from polars.
Thanks for the report. Would love to get a repro on this.
Sorry, saving the dataframe (to parquet or json) and then reloading it "fixes" the issue, otherwise I'd be happy to privately share the data. I can reproduce at will on both my Ubuntu 22.04 workstation and my M3 MacBook Pro.
If I have time I'll try spinning up a postgres docker image and see if I can create a simpler reproduction.
I also couldn't work out which specific version introduced the issue, but it's not present in 0.20.26.
So I just discovered that
df = pl.read_database_uri(...)
df.with_columns(pl.col("status").str.json_decode().filter(pl.col("id") == 1))
fails with the error above, and
df = pl.read_database_uri(...)
df.rechunk().with_columns(pl.col("status").str.json_decode().filter(pl.col("id") == 1))
works - I'm assuming that's why saving/loading also "fixes" it.
Although I found I had to add rechunk at multiple points in the query to make it work for every query (in general before any filter operation, so when I filtered twice I had to rechunk twice?).
What is the schema? I think if you create a chunked dataframe with the same schema you should be able to create a repro.
I think I have an issue that may share similarities with this one: after concatenating dataframes, some operations raise a ShapeError. My issue is also "fixed" by writing the dataframe to parquet and loading it back, or by using rechunk.
I am not sure if it's the same underlying issue, but it provides an easy repro: #16516
I can reproduce the error with the attached file.
I'm not sure if it is the same issue, but I also get a panic trying to rewrite it to disk:
>>> pl.read_parquet("Downloads/tasks.parquet").write_parquet("1.parquet")
thread 'polars-0' panicked at crates/polars-arrow/src/array/struct_/mod.rs:117:52:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("The children must have an equal number of values.\n However, the values at index 7 have a length of 161, which is different from values at index 0, 0."))
Note that join on chunked dataframes also raises errors, i.e.
thread 'polars-9' panicked at crates/polars-ops/src/chunked_array/gather/chunked.rs:84:5:
assertion `left == right` failed: implementation error
left: 1
right: 12
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Edit: I'm seeing this on v0.30.26 as well
Got a complete query, @david-waterworth?