
datafusion-objectstore-s3's Issues

Register object store issue

Hi, I'd like to register the S3 object store to read files hosted on S3. I followed the example I found [here](#15).

Here is the relevant part of my code, but it throws `the trait 'ObjectStore' is not implemented for 'AmazonS3FileSystem'`:

let mut execution_ctx = ExecutionContext::new();
execution_ctx.register_object_store(
    "s3",
    Arc::new(AmazonS3FileSystem::new(None, None, None, None, None, None).await),
);

How can I fix this? Thanks!

Support aws_config::load_from_env()

As @seddonm1 notes, aws_config::load_from_env() doesn't allow specifying an endpoint, which is required for integration testing.

That being said, it would still be nice to support this feature.

Maybe we could add a boolean argument to AmazonS3FileSystem::new, such as load_from_env, and dispatch accordingly.
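A minimal sketch of the proposed dispatch, using stand-in types rather than the real aws_config API (in the real implementation, the environment branch would call aws_config::load_from_env().await):

```rust
// Everything below is a hypothetical sketch, not the crate's real code.
#[derive(Debug, PartialEq)]
enum ConfigSource {
    Environment, // would come from aws_config::load_from_env().await
    Explicit,    // built from the credentials/endpoint passed to new()
}

fn resolve_config(load_from_env: bool, endpoint: Option<&str>) -> Result<ConfigSource, String> {
    if load_from_env {
        // load_from_env() cannot take a custom endpoint, so reject the combination.
        if endpoint.is_some() {
            return Err("a custom endpoint is not supported with load_from_env".to_string());
        }
        Ok(ConfigSource::Environment)
    } else {
        Ok(ConfigSource::Explicit)
    }
}
```

Rejecting the endpoint + load_from_env combination explicitly would also make the integration-testing limitation mentioned above an error rather than a silent surprise.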

Improve Testing

Add testing for the cases below.

Bad Data

  • Non-existent file
  • Non-existent bucket

DataFusion Integration

  • Test for ctx.register_object_store

Support creating client specific configs for different buckets

...it's very common to set up different access control for different buckets, so we will need to support creating different clients with specific configs for different buckets in the future. For example, in our production environment, we have Spark jobs that access different buckets hosted in different AWS accounts.

Originally posted by @houqp in #20 (comment)

With context provided by @houqp:

An IAM policy attached to IAM users (via access/secret key) is easier to get started with. For a more secure and production-ready setup, you would want to use IAM roles instead of IAM users, so there are no long-lived secrets. The place where things get complicated is cross-account S3 write access. In order to do this, you need to assume an IAM role in the S3 bucket owner's account to perform the write; otherwise the bucket owner's account won't be able to truly own the newly written objects, and as a result the bucket owner won't be able to further share those objects with other accounts. In short, in some cases the object store needs to assume and switch to different IAM roles depending on which bucket it is writing to. For cross-account S3 reads, we don't have this problem, so you can usually get by with a single IAM role.

And potential designs also provided by @houqp:

  1. Maintain a set of protocol-specific clients internally within the S3 object store implementation, one per bucket.

  2. Extend the ObjectStore abstraction in DataFusion to support a hierarchy-based object store lookup, i.e., first look up an object-store-specific URI key generator by scheme, then calculate a unique object store key for the given URI for the actual object store lookup.

I am leaning towards option 1 because it doesn't force this complexity into all object stores. For example, the local file object store will never need to dispatch to different clients based on file path. @yjshen, curious what your thoughts are on this.
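Option 1 could be sketched roughly as follows, with a hypothetical BucketClient standing in for a real per-bucket-configured S3 client (the field names and types here are illustrative only):

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a configured S3 client (region, credentials,
// assumed IAM role, ...). A real type would wrap an AWS SDK client.
#[derive(Clone, Debug, PartialEq)]
struct BucketClient {
    role_arn: Option<String>,
}

// The object store keeps one client per bucket and falls back to a default
// client for buckets with no specific configuration.
struct S3ObjectStore {
    default_client: BucketClient,
    per_bucket: HashMap<String, BucketClient>,
}

impl S3ObjectStore {
    fn client_for(&self, bucket: &str) -> &BucketClient {
        self.per_bucket.get(bucket).unwrap_or(&self.default_client)
    }
}
```

This keeps the bucket-to-client dispatch entirely inside the S3 implementation, matching the argument above that other object stores (e.g. the local filesystem) should not have to know about it.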

Setup CI

Set up GitHub Actions build and testing workflows.

API not quite correct

I think the API is not correct to work with DataFusion:

Currently the AmazonS3FileSystem requires a bucket to be set. This means that one ObjectStore can only retrieve from one bucket. That does not fit the DataFusion API, which resolves ObjectStores by URI scheme (i.e. s3:// vs. file://).

I will refactor this so we can dynamically request files correctly.
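The direction described above amounts to registering one store per scheme and deriving the bucket from the full URI at request time. A stdlib-only sketch of that split (not the crate's actual code):

```rust
/// Split an s3:// URI into (bucket, key) so that a single store registered
/// under the "s3" scheme can serve requests for any bucket.
fn split_s3_uri(uri: &str) -> Option<(&str, &str)> {
    let rest = uri.strip_prefix("s3://")?;
    let (bucket, key) = rest.split_once('/')?;
    if bucket.is_empty() {
        return None;
    }
    Some((bucket, key))
}
```

With this shape, the ObjectStore stays bucket-agnostic at registration time, and each get/list call resolves its own bucket.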

Rename `AmazonS3FileSystem` to `S3FileSystem`

I was looking at the C++/Python Apache Arrow docs and saw they use the name S3FileSystem (https://arrow.apache.org/docs/search.html?q=s3). While we of course don't need to follow their lead, I do think that naming makes more sense, given that this doesn't need to connect to Amazon S3 specifically (though Amazon is the standard-bearer).

We are, of course, using the official Amazon library to connect (I'm not sure if the C++/Python implementations are), and I'm not sure if that was taken into account when naming.

One more example is Python's s3fs, which is built on top of botocore, another AWS library.

@seddonm1, what do you think? If you're OK with it, I will make the update.

Tests failing, update the Minio version used for testing

Testing on the master branch using the commands provided in the README fails. As suggested in several places (a StackOverflow answer, a GitHub issue, and a Minio blog post), the issue seems to be that Minio doesn't add the files in the parquet-testing/data folder to the data bucket because of the new versioning scheme; consequently, the tests aren't able to read the files from the Docker deployment.

This change in Minio's behaviour is recent, which suggests that an older version might solve the issue. I tried older versions, and the newest one that works for the tests is minio/minio:RELEASE.2022-05-26T05-48-41Z.hotfix.204b42b6b.

Please update the GitHub Actions workflow to use this image. Thanks!
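For reference, the pin boils down to using the tag quoted above; how it is wired into the CI workflow is repo-specific and assumed here:

```shell
# Pull the last Minio release known to work with the existing test setup
# (tag taken from this issue).
docker pull minio/minio:RELEASE.2022-05-26T05-48-41Z.hotfix.204b42b6b
```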

Multi parquet s3 example?

I have been following different issues in the main DataFusion repo and in this one, and from what I can gather, the goal is to enable processing multiple Parquet files stored on S3. Is this already possible, and if so, is there an example of how it can be done?

Implement Default on AmazonS3FileSystem

Use the default AWS credentials provider in a Default impl for AmazonS3FileSystem so we can call

AmazonS3FileSystem::default().await

instead of

AmazonS3FileSystem::new(
    None,
    None,
    None,
    None,
    None,
    None,
)
.await
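One wrinkle: the real constructor is async, and Default::default cannot be, so this would likely need an inherent default()-style constructor rather than the Default trait. A simplified synchronous stand-in to show the shape (the fields and two-argument signature here are hypothetical, not the crate's real API):

```rust
// Hypothetical, simplified stand-in: the real AmazonS3FileSystem::new is
// async and takes six Option arguments. All-None means "fall back to the
// default AWS credentials provider".
#[derive(Debug)]
struct AmazonS3FileSystem {
    region: Option<String>,
    endpoint: Option<String>,
}

impl AmazonS3FileSystem {
    fn new(region: Option<String>, endpoint: Option<String>) -> Self {
        Self { region, endpoint }
    }
}

impl Default for AmazonS3FileSystem {
    // Forward to new() with all-None arguments.
    fn default() -> Self {
        Self::new(None, None)
    }
}
```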

Unable to use S3FileSystem with RuntimeEnv::register_object_store

I am trying to add an S3 object store for a project using DataFusion version 28. The README.md for this project notes that the S3FileSystem can be registered as an ObjectStore on an ExecutionContext, but ExecutionContext no longer exists in DataFusion. Instead, the register_object_store function is now part of RuntimeEnv. Given a SessionContext object ctx, I am trying to register the S3 store with:

use datafusion_objectstore_s3::object_store::s3::S3FileSystem;
...
let s3_file_system = Arc::new(S3FileSystem::default().await);
let rv = ctx.runtime_env();
let s3url = Url::parse("s3://my-bucket").unwrap(); // "s3" alone is not a valid URL
rv.register_object_store(&s3url, s3_file_system);

However, this produces the following error:

error[E0277]: the trait bound `S3FileSystem: object_store::ObjectStore` is not satisfied
   --> src/main.rs:127:42
    |
127 |         rv.register_object_store(&s3url, s3_file_system);
    |                                          ^^^^^^^^^^^^^^ the trait `object_store::ObjectStore` is not implemented for `S3FileSystem`
    |
    = help: the following other types implement trait `object_store::ObjectStore`:
              Box<(dyn object_store::ObjectStore + 'static)>
              object_store::chunked::ChunkedStore
              object_store::limit::LimitStore<T>
              object_store::local::LocalFileSystem
              object_store::memory::InMemory
              object_store::prefix::PrefixStore<T>
              object_store::throttle::ThrottledStore<T>
    = note: required for the cast from `S3FileSystem` to the object type `dyn object_store::ObjectStore`

For more information about this error, try `rustc --explain E0277`.
error: could not compile `shimsql` due to previous error

Is this crate incompatible with DataFusion 28, or is there some `use` declaration I'm missing that's necessary for this trait implementation to be visible?
