Comments (8)
I'm having trouble understanding the issue here. Is there a bug? Is it a performance issue?
@stinodego Sorry, I jumped right into the weeds of proving the problem and didn't do a good job of summarizing it.
The problem is that if we run a query where all the relevant information is in the file paths, it will read every file when it shouldn't need to read any of them.
Suppose we have this directory structure
mydataset/fruit=apple/0000.parquet
mydataset/fruit=banana/0000.parquet
and then we do
df = pl.scan_parquet("mydataset/**/*.parquet")
The initial scan is going to get the list of files and directories immediately so that if we do
df.select(pl.col('fruit').unique()).collect()
then it should be able to give us ['apple', 'banana'] without reading any files. In fact, reading the files provides no help to completing that query because the column doesn't even exist in the files.
Despite there being no relevant data in any of the files, it reads all the files.
For any small/simple MRE you wouldn't notice it's doing this, which is why my first example deletes a file so you can see that it really is trying to read it. For big data, especially on the cloud where we're subject to network speeds and read charges, that is the difference between a free, instant answer and one where we download GBs (or maybe TBs) of data from the cloud for no reason.
I don't know if this is a bug per se but it is definitely a performance issue.
To extend this a bit.
Even if I did something like
df.select(pl.col('fruit')).collect() # not with unique
then it should still only read the metadata of the files to get the number of rows, rather than reading any of the underlying data.
@stinodego When the optimizer is looking at stats to determine what it needs to do, can it do a check similar to this pseudo-code?
if stats.min_value == stats.max_value and stats.null_count == 0:
    return [stats.min_value] * stats.num_rows
else:
    return read_physical_source()
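A self-contained version of that check might look like the following (`ColumnStats`, `project_column`, and `read_physical_source` are hypothetical names for this sketch, not polars API):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ColumnStats:
    min_value: object
    max_value: object
    null_count: int
    num_rows: int


def project_column(stats: ColumnStats, read_physical_source: Callable[[], list]) -> list:
    # If every value is provably identical (min == max, no nulls),
    # the column can be materialized from statistics alone.
    if stats.min_value == stats.max_value and stats.null_count == 0:
        return [stats.min_value] * stats.num_rows
    return read_physical_source()


def must_not_read() -> list:
    raise AssertionError("physical read should have been skipped")


# Hive-style stats: min == max == 'apple', no nulls, 3 rows.
vals = project_column(ColumnStats("apple", "apple", 0, 3), must_not_read)
print(vals)  # ['apple', 'apple', 'apple']
```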
Ok, I see what you're saying. So we should push down predicates somehow and determine whether all information is already present in the paths themselves rather than the files.
So this is a performance issue. It would definitely be great if we could implement this.
I don't think it's a predicate pushdown issue; it's a projection issue, isn't it? That is to say, it's not filtering anything.
When it does the initial scan of a hive path, it parses the directories and creates statistics max == min == [value] for all the hive directories. If those stats are available to the projection optimizer, then generically it seems it should be able to check: if max == min and null_count == 0, skip reading anything and just project that value. I would think that check would work even beyond hive partitions, for any highly repeated data. Am I off base, or could that be a way to implement it?
I meant to refer to projection pushdown indeed.
@nameexhaustion Care to take a look at this one? I ask because I see you crushing other hive-related issues, and there might be some economies of scale.
I don't think we can do this; we need to know how many rows the files have in order to project the correct number of rows, even if the projection is hive-only.