datafusion-contrib / datafusion-objectstore-s3
S3 as an ObjectStore for DataFusion
License: Apache License 2.0
For use in CI
Hi, I'd like to register an S3 object store to read files hosted on S3. I followed the example I found [here](#15).
Here is my code, but it throws `the trait 'ObjectStore' is not implemented for 'AmazonS3FileSystem'`:
```rust
let mut execution_ctx = ExecutionContext::new();
execution_ctx.register_object_store(
    "s3",
    Arc::new(AmazonS3FileSystem::new(None, None, None, None, None, None).await),
);
```
How can I fix this? Thanks!
It seems support was added for this based on https://github.com/awslabs/aws-sdk-rust/releases/tag/v0.0.17-alpha
Look into integrating this into S3FileSystem, or using it to create a TableProvider.
Hey, do we have async support for reading S3 files? I see support for reading local files using the async ParquetRecordBatchStreamBuilder.
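For reference, a minimal sketch of the local async path the comment refers to, using the parquet crate's async reader (the file name is a placeholder; an S3 version would additionally need an AsyncFileReader backed by the S3 client):

```rust
use futures::TryStreamExt;
use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a local file asynchronously; "data.parquet" is illustrative.
    let file = File::open("data.parquet").await?;
    // Build an async stream of Arrow RecordBatches from the Parquet file.
    let stream = ParquetRecordBatchStreamBuilder::new(file).await?.build()?;
    let batches: Vec<_> = stream.try_collect().await?;
    println!("read {} batches", batches.len());
    Ok(())
}
```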
As @seddonm1 notes, aws_config::load_from_env() doesn't allow specifying an endpoint, which is required for integration testing. That being said, it would still be nice to support this feature. Maybe we could add a boolean argument to AmazonS3FileSystem::new, like load_from_env, and dispatch accordingly, as sketched below.
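A minimal, self-contained sketch of that dispatch (the types here are illustrative, not the crate's API; the real AmazonS3FileSystem::new takes six Option parameters):

```rust
/// Illustrative stand-in for the resolved AWS configuration.
#[derive(Debug)]
enum S3Config {
    /// Resolved via aws_config::load_from_env() (no custom endpoint).
    FromEnv,
    /// Built manually so integration tests can target a local Minio.
    Manual { endpoint: String },
}

fn resolve_config(load_from_env: bool, endpoint: Option<String>) -> S3Config {
    if load_from_env {
        // Region and credentials come from AWS_* environment variables.
        S3Config::FromEnv
    } else {
        // Fall back to an explicit endpoint, defaulting to local Minio.
        S3Config::Manual {
            endpoint: endpoint.unwrap_or_else(|| "http://localhost:9000".into()),
        }
    }
}
```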
Add testing for the cases below (see the sketch that follows):
- Bad data
- DataFusion integration (ctx.register_object_store)
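A hedged sketch of the DataFusion integration case (API names follow the snippets quoted elsewhere in this thread and may differ across versions):

```rust
use std::sync::Arc;
use datafusion::prelude::ExecutionContext;
use datafusion_objectstore_s3::object_store::s3::S3FileSystem;

#[tokio::test]
async fn registers_s3_object_store() {
    let mut ctx = ExecutionContext::new();
    // DataFusion integration: "s3://" URIs should resolve to this store.
    ctx.register_object_store("s3", Arc::new(S3FileSystem::default().await));
    // The "bad data" case would then read a deliberately corrupt Parquet
    // object and assert that a readable error (not a panic) is returned.
}
```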
Based on the design in the base implementation: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/2/files
...it's very common to set up different access control for different buckets, so we will need to support creating different clients with specific configs for different buckets in the future. For example, in our production environment, we have Spark jobs that access different buckets hosted in different AWS accounts.
Originally posted by @houqp in #20 (comment)
With context provided by @houqp:
An IAM policy attached to IAM users (via access/secret key) is easier to get started with. For a more secure and production-ready setup, you would want to use an IAM role instead of IAM users, so there are no long-lived secrets. The place where things get complicated is cross-account S3 write access. In order to do this, you need to assume an IAM role in the S3 bucket owner's account to perform the write, otherwise the bucket owner account won't be able to truly own the newly written objects. The result of that is the bucket owner won't be able to further share the objects with other accounts. In short, in some cases, the object store needs to assume and switch to different IAM roles depending on which bucket it is writing to. For cross-account S3 reads, we don't have this problem, so you can usually get by with a single IAM role.
And potential designs also provided by @houqp:
1. Maintain a set of protocol-specific clients internally within the S3 object store implementation, one for each bucket.
2. Extend the ObjectStore abstraction in DataFusion to support a hierarchy-based object store lookup, i.e. first look up an object-store-specific URI key generator by scheme, then calculate a unique object store key for the given URI for the actual object store lookup.

I am leaning towards option 1 because it doesn't force this complexity into all object stores. For example, the local file object store will never need to dispatch to different clients based on file path. @yjshen, curious what your thoughts are on this. A sketch of option 1 follows.
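A self-contained sketch of option 1 (types are illustrative, not a real API): the S3 object store keeps one client per bucket and picks the right one from the URI.

```rust
use std::collections::HashMap;

/// Stand-in for a bucket-specific client; in practice this would wrap an
/// S3 client built with that bucket's credentials or an assumed IAM role.
struct BucketClient {
    role_arn: Option<String>,
}

struct S3ObjectStore {
    /// One client per bucket, keyed by bucket name.
    clients: HashMap<String, BucketClient>,
}

impl S3ObjectStore {
    /// Resolve the client for a URI such as
    /// "s3://logs-account-a/2022/01/part-0.parquet" -> "logs-account-a".
    fn client_for(&self, uri: &str) -> Option<&BucketClient> {
        let bucket = uri.strip_prefix("s3://")?.split('/').next()?;
        self.clients.get(bucket)
    }
}
```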
Set up GitHub Actions build and testing flows.
I think the API is not correct to work with DataFusion: currently the AmazonS3FileSystem requires a bucket to be set. This means that one ObjectStore can only retrieve from one bucket. This does not fit with the DataFusion API, which resolves ObjectStores by URI scheme (i.e. s3:// vs file://). I will refactor this so we can dynamically request files correctly.
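To illustrate the constraint (registration call as quoted earlier in this thread): DataFusion keys object stores by scheme, so the single store registered for s3 must serve every bucket.

```rust
// The one store registered for the "s3" scheme...
ctx.register_object_store("s3", Arc::new(S3FileSystem::default().await));

// ...must handle both of these URIs, so it cannot be pinned to one bucket:
// s3://bucket-a/data/file.parquet
// s3://bucket-b/other/file.parquet
```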
I was looking at the C++/Python Apache Arrow docs and saw that they use the name S3FileSystem (https://arrow.apache.org/docs/search.html?q=s3). While we of course don't need to follow their lead, I do think that naming makes more sense, given that this doesn't need to connect to Amazon S3 specifically (though they are the standard-bearer). We are, of course, using the official Amazon library to connect (I'm not sure if the C++/Python implementation is), and I'm not sure if that was taken into account when naming. One more example is Python's s3fs, which is built on top of botocore, another AWS library. @seddonm1, what do you think? If it's OK with you, I will make the update.
Testing on the master branch using the commands provided in the README fails. As suggested by these sources (StackOverflow, GitHub Issue, Minio BlogPost), the issue seems to be that Minio doesn't add the files in the parquet-testing/data folder to the data bucket because of a new versioning scheme, and consequently the tests aren't able to read the files from the docker deployment.
This change in Minio's behaviour is recent, which suggests that an older version might solve the issue. I tried older versions, and the newest one that works for the tests is minio/minio:RELEASE.2022-05-26T05-48-41Z.hotfix.204b42b6b.
Please update the GitHub Actions with this image. Thanks!
I have been following different issues in the main DataFusion repo and this one, and from what I can gather you want to enable processing multiple Parquet files stored on S3. Is this already possible, and if so, is there an example of how it can be done?
I think this would make it clearer, when used with DataFusion, where an error is coming from.
I can't get doc tests to pass using the object_store returned from the ctx.object_store method (I get a file-not-found error). Does anything seem off to you, @seddonm1?
Originally posted by @matthewmturner in #38 (comment)
Use the default AWS credentials provider in a Default impl for AmazonS3FileSystem, so we can call

```rust
AmazonS3FileSystem::default().await
```

instead of

```rust
AmazonS3FileSystem::new(None, None, None, None, None, None).await
```
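A minimal sketch of what that could look like (not the crate's actual code; note that the std Default trait cannot be async, so this would be an inherent default method):

```rust
impl AmazonS3FileSystem {
    /// Construct using the default AWS credentials provider chain by
    /// passing None for all six configuration parameters.
    pub async fn default() -> Self {
        Self::new(None, None, None, None, None, None).await
    }
}
```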
I am trying to add an S3 object store for a project using datafusion version 28. The README.md for this project notes that the S3FileSystem can be registered as an ObjectStore on an ExecutionContext, but the latter is not something that exists any longer in datafusion. Instead, the register_object_store function is now part of RuntimeEnv. Given a SessionContext object ctx, I am trying to register the S3 store with:
```rust
use datafusion_objectstore_s3::object_store::s3::S3FileSystem;
// ...
let s3_file_system = Arc::new(S3FileSystem::default().await);
let rv = ctx.runtime_env();
let s3url = Url::parse("s3").unwrap();
rv.register_object_store(&s3url, s3_file_system);
```
However, this produces the following error:
```
error[E0277]: the trait bound `S3FileSystem: object_store::ObjectStore` is not satisfied
   --> src/main.rs:127:42
    |
127 |     rv.register_object_store(&s3url, s3_file_system);
    |                                      ^^^^^^^^^^^^^^ the trait `object_store::ObjectStore` is not implemented for `S3FileSystem`
    |
    = help: the following other types implement trait `object_store::ObjectStore`:
              Box<(dyn object_store::ObjectStore + 'static)>
              object_store::chunked::ChunkedStore
              object_store::limit::LimitStore<T>
              object_store::local::LocalFileSystem
              object_store::memory::InMemory
              object_store::prefix::PrefixStore<T>
              object_store::throttle::ThrottledStore<T>
    = note: required for the cast from `S3FileSystem` to the object type `dyn object_store::ObjectStore`

For more information about this error, try `rustc --explain E0277`.
error: could not compile `shimsql` due to previous error
```
Is this crate incompatible with datafusion 28, or is there some use declaration I'm missing that's necessary for this trait implementation to be visible?
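For what it's worth, a hedged sketch of the path newer DataFusion versions expect: they resolve stores through the separate object_store crate, whose AWS implementation does implement the required ObjectStore trait (the bucket name and URL below are illustrative, and the snippet assumes an enclosing function that returns a Result):

```rust
use std::sync::Arc;
use object_store::aws::AmazonS3Builder;
use url::Url;

// Build an S3 store from AWS_* environment variables.
let s3 = AmazonS3Builder::from_env()
    .with_bucket_name("my-bucket")
    .build()?;

// Register under a full URL (a bare "s3" fails Url::parse).
let url = Url::parse("s3://my-bucket")?;
ctx.runtime_env().register_object_store(&url, Arc::new(s3));
```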
Currently, aws-sdk-rust does not support anonymous bucket access (for accessing things like AWS Open Data). There is a workaround proposed in this request: [request]: Anonymous Credentials Provider.
This issue exists to make future users aware of the known limitation; there are no immediate plans to implement the workaround until AWS has had time to (hopefully) implement it upstream.
For example, run a SQL query on a table from S3, as sketched below.
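A minimal sketch of such an example (untested; API names follow the README-era snippets in this thread, and the bucket/path are placeholders):

```rust
use std::sync::Arc;
use datafusion::error::Result;
use datafusion::prelude::ExecutionContext;
use datafusion_objectstore_s3::object_store::s3::S3FileSystem;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    // Make "s3://" URIs resolve to the S3 object store.
    ctx.register_object_store("s3", Arc::new(S3FileSystem::default().await));

    // Register a Parquet table backed by S3, then query it with SQL.
    ctx.register_parquet("trips", "s3://my-bucket/trips/").await?;
    let df = ctx.sql("SELECT count(*) FROM trips").await?;
    df.show().await?;
    Ok(())
}
```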