Comments (10)
It might be related to this issue #14384 and this comment: #14384 (comment)
I appreciate this issue is closed, but for anyone else with this problem who finds it in future:

As things currently stand, it's about twice as fast to fetch files from S3 using `object_store` (Rust) or `boto3` (Python) as it is using Polars when looping over a load of calls to `read_*`.

My current understanding is that this is caused by re-establishing a connection to S3 for each read rather than using a single connection for all reads.

At the moment, if you want to read multiple files over the same connection, it's better to manage the connection yourself and pass a file-like object to `read_*` than to let Polars open the files.
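In case it helps, a minimal sketch of that workaround (bucket and keys are placeholders; one boto3 client is created once and reused for every download):

```python
import io

import boto3
import polars as pl

# One client, created once: boto3 keeps an HTTP connection pool alive,
# so subsequent downloads skip the connection/TLS setup cost.
s3 = boto3.client("s3")

def read_parquet_pooled(bucket: str, key: str) -> pl.DataFrame:
    # Download the whole object, then hand Polars a file-like object.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pl.read_parquet(io.BytesIO(body))

files_to_read = ["prefix/a.parquet", "prefix/b.parquet"]  # placeholders
frames = [read_parquet_pooled("my-bucket", key) for key in files_to_read]
```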
I appreciate the help from the maintainers in this thread 👍
I've managed to run some benchmarks now.
Here's what I did:
I have a load of Parquet files on S3 that range from ~2 KB to ~2 MB depending on the size of the shard. I made a list of 100 random files (the same list each time, so there'll be some optimisations on the S3 side between runs, I imagine). These files are listed in `files_to_read`, and I hard-coded the bucket name and had the AWS creds in the environment.

This was all run from my computer in the same region as the S3 bucket. I imagine all speeds would be faster on AWS (Lambda or EC2), but I'm also assuming that the relative speeds would be the same.
Method 1: pull the whole file from S3, new client each time (`pl.read_parquet()`)
Total time: 22.967 seconds
Time per file: 0.2297 seconds
------------------------------
Method 2: scan, limit 30, collect (`pl.scan_parquet().limit(30).collect()`)
Total time: 22.057 seconds
Time per file: 0.2206 seconds
------------------------------
Method 3: boto3 client connection pooling, then read (`boto3.client()` created outside the loop, data read via boto3 into `pl.read_parquet`)
Total time: 8.437 seconds
Time per file: 0.0844 seconds
------------------------------
With method 2, I'd assumed Polars might download less data, but it looks like it's doing more or less the same as downloading all the data. Maybe I've misunderstood how this works, or I've made some other mistake.
Either way, I think this shows the performance improvement that a connection pool could make possible for my use case.
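For clarity, method 2 amounted to the following (the path is a placeholder); I'd assumed the limit would prune most of the download:

```python
import polars as pl

# Lazily scan the remote file, take the first 30 rows, then materialise.
df = (
    pl.scan_parquet("s3://my-bucket/prefix/shard-000.parquet")  # placeholder
    .limit(30)
    .collect()
)
```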
Thanks for the speedy response!
I've built the version with 0c86901 and run my tests again. This time I used an EC2 instance (to speed up build times on my slow laptop).
The good news is that things are faster.
Before the change, `read_parquet` took about 11.5 seconds on 100 files. After the change it's taking about 6 seconds. A roughly 50% speed-up is excellent!

However, boto3 with a global client is still outperforming Polars by a long way. The baseline for reading via boto3 and then passing the file on to Polars is about 2.5 seconds.

I don't know if this is because boto3 is doing something that `object_store` doesn't (i.e. it's not related to this project), or if there is still some value in being able to use a single connection that's manually passed in via `storage_options`.
I think this issue might need re-opening if the latter is true and this is something in the control of Polars. If it's not, then I expect the existing change is going to be about all this project can do. I'll leave that up to someone who knows more about it than I do!
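For context, `storage_options` today takes configuration values rather than a live client; a minimal sketch of how I'm using it, with option keys that are my assumption from the object_store config names:

```python
import polars as pl

# Config is passed per call; as far as I can tell there's no way to hand
# Polars an already-open client or connection pool.
df = pl.read_parquet(
    "s3://my-bucket/prefix/shard-000.parquet",  # placeholder
    storage_options={"aws_region": "eu-west-1"},  # assumed key name/region
)
```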
Thanks again for the quick fix!
Could you try taking Polars out of the equation and running the benchmark directly with object_store? Curious what the baseline differences are.
I don't know Rust very well, but I took some time hacking something together.
A release build of a simple script that loops over the exact same files as my Python script actually completes slightly faster than boto3:

- Rust/`object_store`: ~2 seconds
- boto3/`read_parquet`: ~2.5-2.8 seconds
- Polars/`read_parquet("s3://...")` (built from master): ~6 seconds
As I say, the new Polars version is much better than before the change, but there's something going on that's not related to object_store, I believe.
Script outline (imports added and a missing semicolon fixed; the bucket name and file list are placeholders for my real values):

```rust
use object_store::{aws::AmazonS3Builder, path::Path, ObjectStore};
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bucket_name = "my-bucket"; // placeholder
    // Credentials and region come from the environment; one store for all requests.
    let s3_store = AmazonS3Builder::from_env()
        .with_bucket_name(bucket_name)
        .build()?;
    let files_to_download = vec![...]; // the same 100 keys as the Python test

    let start = Instant::now(); // start timing
    for file_path in files_to_download {
        let object_key = format!("prefix/{}", file_path);
        let object_path = Path::from(object_key);
        // GET the whole object and buffer it in memory.
        let data = s3_store.get(&object_path).await?.bytes().await?;
        println!("Downloaded {} bytes for {}", data.len(), file_path);
    }
    let duration = start.elapsed(); // end timing
    println!("Total: {} milliseconds", duration.as_millis());
    Ok(())
}
```
Of course in my Python scripts I'm also reading the parquet into Polars, but I doubt that's adding a lot of overhead.
Alright, interesting. Have you got some info on the Polars query you run?

Polars downloads all row groups separately (concurrently), and in the case of projections it downloads the columns concurrently.

How many row groups have you got per file? If you create files with a single row group, does that improve things? (I am on vacation, so I have limited internet and cannot run cloud tests.)
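If it helps, Polars can write single row-group files via `write_parquet`'s `row_group_size` parameter; a quick sketch (paths are placeholders):

```python
import polars as pl

df = pl.read_parquet("shard.parquet")  # placeholder input

# A row group at least as large as the frame puts all rows in one group
# (the files discussed here are <= ~80k rows).
df.write_parquet("shard-single-rg.parquet", row_group_size=df.height)
```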
For this test I'm literally just calling `read_parquet` on the whole file and not performing any query beyond that.

I can have a go at making smaller files. The largest file contains about 80k rows; most are about 25k rows. Sorry, I don't know enough about the workings of Parquet... is a row group the same as a row?! There are 8 columns in each file, with a fair amount of duplication of column values across rows.
> For this test I'm literally just calling `read_parquet` on the whole file and not performing any query beyond that. I can have a go at making smaller files. [...] Is a row group the same as a row?!
You can check this function to read the number of row groups. A row group is a sharding mechanism in the Parquet format, designed so that Parquet readers can batch their data ingestion by this unit.
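If you'd rather not call into the Rust code, an equivalent check can be done with pyarrow (the path is a placeholder):

```python
import pyarrow.parquet as pq

# Only the footer metadata is read, so this is cheap even for large files.
meta = pq.ParquetFile("shard.parquet").metadata
print(meta.num_row_groups, "row group(s),", meta.num_rows, "rows")
```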
Ok thanks.
I'm at the limits of my current understanding of Parquet, but I think a row group might relate to the number of actual files data is split across, e.g. when using hive partitions?

Either way, I sampled loads of my files, including the ones in the tests, and they are all a single row group.

I've manually partitioned my data according to the domain I'm working in, so I know that when I'm using `read_parquet` I'm only ever downloading a single file at a time and then processing it later. All the tests above are on one file at a time.

I did try using hive partitions (before the patch related to this issue went in), but the processing time was very slow, I assume partly because of the lack of a connection pool. I've not tried again with the patch, but as a single file downloads quickly enough via boto3 for my needs, I don't think changing the sharding approach is related to the issue here.