Comments (10)
It might be related to this issue #14384 and this comment: #14384 (comment)
I appreciate this issue is closed, but for anyone else with this problem who finds it in future:

As things currently stand, it's about twice as fast to fetch files from S3 using `object_store` (Rust) or `boto3` (Python) as it is using Polars when looping over a load of calls to `read_*`.

My current understanding is that this is caused by re-establishing a connection to S3 for each read rather than using a single connection for all reads.

At the moment, if you want to read multiple files over the same connection, it's better to manage the connection yourself and pass a file-like object to `read_*` than to let Polars open the files.
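In case it helps, a minimal sketch of that workaround (bucket and keys are placeholders; one boto3 client is created once and reused for every download):

```python
import io

import boto3
import polars as pl

# One client, created once: boto3 keeps an HTTP connection pool alive,
# so subsequent downloads skip the connection/TLS setup cost.
s3 = boto3.client("s3")

def read_parquet_pooled(bucket: str, key: str) -> pl.DataFrame:
    # Download the whole object, then hand Polars a file-like object.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pl.read_parquet(io.BytesIO(body))

files_to_read = ["prefix/a.parquet", "prefix/b.parquet"]  # placeholders
frames = [read_parquet_pooled("my-bucket", key) for key in files_to_read]
```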
I appreciate the help from the maintainers in this thread 👍
I've managed to run some benchmarks now.
Here's what I did:
I have a load of Parquet files on S3 that range from ~2 KB to ~2 MB depending on the size of the shard. I made a list of 100 random files (the same list each time, so there'll be some optimisations on the S3 side between runs, I imagine). These files are listed in `files_to_read`, and I hard-coded the bucket name and had the AWS creds in the environment.

This was all run from my computer in the same region as the S3 bucket. I imagine all speeds would be faster on AWS (Lambda or EC2), but I'm also assuming that the relative speeds would be the same.
Method 1: pull the whole file from S3, new client each time (`pl.read_parquet()`)
Total time: 22.967 seconds
Time per file: 0.2297 seconds
------------------------------
Method 2: scan, limit 30, collect (`pl.scan_parquet().limit(30).collect()`)
Total time: 22.057 seconds
Time per file: 0.2206 seconds
------------------------------
Method 3: boto3 client connection pooling, then read (`boto3.client()` created outside the loop, data read via boto3 into `pl.read_parquet`)
Total time: 8.437 seconds
Time per file: 0.0844 seconds
------------------------------
With method 2, I'd assumed Polars might download less data, but it looks like it's doing more or less the same as downloading all the data. Maybe I've misunderstood how this works, or I've made some other mistake.
Either way, I think this shows the performance improvement that a connection pool could make possible for my use case.
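For clarity, method 2 amounted to the following (the path is a placeholder); I'd assumed the limit would prune most of the download:

```python
import polars as pl

# Lazily scan the remote file, take the first 30 rows, then materialise.
df = (
    pl.scan_parquet("s3://my-bucket/prefix/shard-000.parquet")  # placeholder
    .limit(30)
    .collect()
)
```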
Thanks for the speedy response!
I've built the version with 0c86901 and run my tests again. This time I used an EC2 instance (to speed up build times on my slow laptop).
The good news is that things are faster.
Before the change, `read_parquet` took about 11.5 seconds on 100 files. After the change it's taking about 6 seconds. A roughly 50% speed-up is excellent!

However, boto3 with a global client is still outperforming Polars by a long way. The baseline for reading via boto3 and then passing the file on to Polars is about 2.5 seconds.

I don't know if this is because boto3 is doing something that `object_store` doesn't (i.e. it's not related to this project), or if there is still some value in being able to use a single connection that's manually passed in via `storage_options`.
I think this issue might need re-opening if the latter is true and this is something in the control of Polars. If it's not, then I expect the existing change is going to be about all this project can do. I'll leave that up to someone who knows more about it than I do!
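For context, `storage_options` today takes configuration values rather than a live client; a minimal sketch of how I'm using it, with option keys that are my assumption from the object_store config names:

```python
import polars as pl

# Config is passed per call; as far as I can tell there's no way to hand
# Polars an already-open client or connection pool.
df = pl.read_parquet(
    "s3://my-bucket/prefix/shard-000.parquet",  # placeholder
    storage_options={"aws_region": "eu-west-1"},  # assumed key name/region
)
```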
Thanks again for the quick fix!
Could you try taking Polars out of the equation and running the benchmark directly with object_store? Curious what the baseline differences are.
I don't know Rust very well, but I took some time hacking something together.
A release build of a simple script that loops over the exact same files as my Python script actually completes slightly faster than boto3:

- Rust/`object_store`: ~2 seconds
- boto3/`read_parquet`: ~2.5-2.8 seconds
- Polars/`read_parquet("s3://...")` (built from master): ~6 seconds
As I say, the new Polars version is much better than before the change, but there's something going on that's not related to object_store, I believe.
Script outline (imports added and a missing semicolon fixed; the bucket name and file list are placeholders for my real values):

```rust
use object_store::{aws::AmazonS3Builder, path::Path, ObjectStore};
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bucket_name = "my-bucket"; // placeholder
    // Credentials and region come from the environment; one store for all requests.
    let s3_store = AmazonS3Builder::from_env()
        .with_bucket_name(bucket_name)
        .build()?;
    let files_to_download = vec![...]; // the same 100 keys as the Python test

    let start = Instant::now(); // start timing
    for file_path in files_to_download {
        let object_key = format!("prefix/{}", file_path);
        let object_path = Path::from(object_key);
        // GET the whole object and buffer it in memory.
        let data = s3_store.get(&object_path).await?.bytes().await?;
        println!("Downloaded {} bytes for {}", data.len(), file_path);
    }
    let duration = start.elapsed(); // end timing
    println!("Total: {} milliseconds", duration.as_millis());
    Ok(())
}
```
Of course in my Python scripts I'm also reading the parquet into Polars, but I doubt that's adding a lot of overhead.
Alright, interesting. Have you got some info on the Polars query you run?

Polars downloads all row groups separately (concurrently), and in the case of projections it downloads the columns concurrently.

How many row groups have you got per file? If you create files with a single row group, does that improve things? (I am on vacation, so I have limited internet and cannot run cloud tests.)
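If it helps, Polars can write single row-group files via `write_parquet`'s `row_group_size` parameter; a quick sketch (paths are placeholders):

```python
import polars as pl

df = pl.read_parquet("shard.parquet")  # placeholder input

# A row group at least as large as the frame puts all rows in one group
# (the files discussed here are <= ~80k rows).
df.write_parquet("shard-single-rg.parquet", row_group_size=df.height)
```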
For this test I'm literally just calling `read_parquet` on the whole file and not performing any query beyond that.

I can have a go at making smaller files. The largest file contains about 80k rows; most are about 25k rows. Sorry, I don't know enough about the workings of Parquet... is a row group the same as a row?! There are 8 columns in each file, with a fair amount of duplication of column values across rows.
> For this test I'm literally just calling `read_parquet` on the whole file and not performing any query beyond that. I can have a go at making smaller files. [...] Is a row group the same as a row?!
You can check this function to read the number of row groups. A row group is a sharding mechanism in the Parquet format, designed so that Parquet readers can batch their data ingestion by this unit.
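If you'd rather not call into the Rust code, an equivalent check can be done with pyarrow (the path is a placeholder):

```python
import pyarrow.parquet as pq

# Only the footer metadata is read, so this is cheap even for large files.
meta = pq.ParquetFile("shard.parquet").metadata
print(meta.num_row_groups, "row group(s),", meta.num_rows, "rows")
```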
Ok thanks.
I'm at the limits of my current understanding of Parquet, but I think a row group might relate to the number of actual files data is split across, e.g. when using hive partitions?

Either way, I sampled loads of my files, including the ones in the tests, and they are all a single row group.

I've manually partitioned my data according to the domain I'm working in, so I know that when I'm using `read_parquet` I'm only ever downloading a single file at a time and then processing it later. All the tests above are on one file at a time.

I did try using hive partitions (before the patch related to this issue went in), but the processing time was very slow, I assume partly because of the lack of a connection pool. I've not tried again with the patch, but as a single file downloads quickly enough via boto3 for my needs, I don't think changing the sharding approach is related to the issue here.