
datafusion-objectstore-s3's Introduction

DataFusion-ObjectStore-S3

S3 as an ObjectStore for Datafusion.

Querying files on S3 with DataFusion

This crate implements the DataFusion ObjectStore trait on AWS S3 and on other implementors of the S3 API. We leverage the official AWS Rust SDK for interacting with S3. While it is our understanding that the AWS APIs we are using are relatively stable, we can make no assurances on API stability, either on AWS' part or within this crate. This crate's API is tightly coupled to DataFusion, a fast-moving project, and as such we will make changes in line with those upstream changes.

Examples

Examples for querying AWS and other implementors, such as MinIO, are shown below.

Load credentials from the default AWS credential provider (such as the environment or ~/.aws/credentials):

let s3_file_system = Arc::new(S3FileSystem::default().await);

S3FileSystem::default() is a convenience wrapper for S3FileSystem::new(None, None, None, None, None, None).

Connect to an implementor of the S3 API (MinIO, in this case) using an access key and secret:

// Example credentials provided by MinIO
const MINIO_ACCESS_KEY_ID: &str = "AKIAIOSFODNN7EXAMPLE";
const MINIO_SECRET_ACCESS_KEY: &str = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY";
const PROVIDER_NAME: &str = "Static";
const MINIO_ENDPOINT: &str = "http://localhost:9000";

let s3_file_system = S3FileSystem::new(
    Some(SharedCredentialsProvider::new(Credentials::new(
        MINIO_ACCESS_KEY_ID,
        MINIO_SECRET_ACCESS_KEY,
        None,
        None,
        PROVIDER_NAME,
    ))), // Credentials provider
    None, // Region
    Some(Endpoint::immutable(Uri::from_static(MINIO_ENDPOINT))), // Endpoint
    None, // RetryConfig
    None, // AsyncSleep
    None, // TimeoutConfig
)
.await;

Using DataFusion's ListingTableConfig, we register a table with a DataFusion ExecutionContext so that it can be queried:

let filename = "data/alltypes_plain.snappy.parquet";

let config = ListingTableConfig::new(s3_file_system, filename).infer().await?;

let table = ListingTable::try_new(config)?;

let mut ctx = ExecutionContext::new();

ctx.register_table("tbl", Arc::new(table))?;

let df = ctx.sql("SELECT * FROM tbl").await?;
df.show().await?;

We can also register the S3FileSystem directly as an ObjectStore on an ExecutionContext. This provides an idiomatic way of creating TableProviders that can be queried.

execution_ctx.register_object_store(
    "s3",
    Arc::new(S3FileSystem::default().await),
);

let input_uri = "s3://parquet-testing/data/alltypes_plain.snappy.parquet";

let (object_store, _) = execution_ctx.object_store(input_uri)?;

let config = ListingTableConfig::new(object_store, input_uri).infer().await?;

let table_provider: Arc<dyn TableProvider + Send + Sync> = Arc::new(ListingTable::try_new(config)?);

Testing

Tests are run against MinIO, which provides a containerized implementation of the Amazon S3 API.

First fetch the test data (the parquet-testing submodule):

git submodule update --init --recursive

Then start the MinIO container:

docker run \
--detach \
--rm \
--publish 9000:9000 \
--publish 9001:9001 \
--name minio \
--volume "$(pwd)/parquet-testing:/data" \
--env "MINIO_ROOT_USER=AKIAIOSFODNN7EXAMPLE" \
--env "MINIO_ROOT_PASSWORD=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
quay.io/minio/minio server /data \
--console-address ":9001"

Once MinIO is running, run the tests as usual:

cargo test

datafusion-objectstore-s3's People

Contributors

matthewmturner, seddonm1, timvw

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

datafusion-objectstore-s3's Issues

API not quite correct

I think the API is not correct to work with DataFusion:

Currently the AmazonS3FileSystem requires a bucket to be set. This means that one ObjectStore can only retrieve from one bucket. This does not fit with the DataFusion API which resolves ObjectStores by URI scheme (i.e. s3:// vs file://).

I will refactor this so we can dynamically request files correctly.
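
A minimal sketch of the scheme-based lookup described above, reusing the ctx.object_store accessor pattern from the README (the URIs are placeholders):

// The registry picks a store by URI scheme, so a single registered S3 store
// must be able to serve any bucket named in the URI.
let (s3_store, _) = execution_ctx.object_store("s3://some-bucket/some/key.parquet")?;

// The same lookup with a file:// URI resolves to the local filesystem store instead.
let (local_store, _) = execution_ctx.object_store("file:///tmp/some/key.parquet")?;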

Setup CI

Set up GitHub Actions build and testing flows.

Support aws_config::load_from_env()

As @seddonm1 notes, aws_config::load_from_env() doesn't allow specifying an endpoint, which is required for integration testing.

That being said, it would still be nice to support this feature.

Maybe we could add a boolean argument to AmazonS3FileSystem::new, such as load_from_env, and dispatch accordingly.
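
A sketch of that dispatch against a recent aws-config; load_from_env here is the proposed flag, the else branch is only a placeholder, and exact types may differ in the SDK version this crate pins:

use aws_config::SdkConfig;

async fn build_sdk_config(load_from_env: bool) -> SdkConfig {
    if load_from_env {
        // Resolve credentials and region from the environment, ~/.aws, etc.
        aws_config::load_from_env().await
    } else {
        // Placeholder: in practice this branch would carry the explicitly
        // supplied credentials/endpoint/region, as in the README's MinIO example.
        SdkConfig::builder().build()
    }
}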

Unable to use S3FileSystem with RuntimeEnv::register_object_store

I am trying to add an S3 object store for a project using datafusion version 28. The README.md for this project notes that the S3FileSystem can be registered as an ObjectStore on an ExecutionContext, but the latter no longer exists in datafusion. Instead, the register_object_store function is now part of RuntimeEnv. Given a SessionContext object ctx, I am trying to register the S3 store with

use datafusion_objectstore_s3::object_store::s3::S3FileSystem;
...
let s3_file_system = Arc::new(S3FileSystem::default().await);
let rv = ctx.runtime_env();
let s3url = Url::parse("s3").unwrap();
rv.register_object_store(&s3url, s3_file_system);

However, this produces the following error:

error[E0277]: the trait bound `S3FileSystem: object_store::ObjectStore` is not satisfied
   --> src/main.rs:127:42
    |
127 |         rv.register_object_store(&s3url, s3_file_system);
    |                                          ^^^^^^^^^^^^^^ the trait `object_store::ObjectStore` is not implemented for `S3FileSystem`
    |
    = help: the following other types implement trait `object_store::ObjectStore`:
              Box<(dyn object_store::ObjectStore + 'static)>
              object_store::chunked::ChunkedStore
              object_store::limit::LimitStore<T>
              object_store::local::LocalFileSystem
              object_store::memory::InMemory
              object_store::prefix::PrefixStore<T>
              object_store::throttle::ThrottledStore<T>
    = note: required for the cast from `S3FileSystem` to the object type `dyn object_store::ObjectStore`

For more information about this error, try `rustc --explain E0277`.
error: could not compile `shimsql` due to previous error

Is this crate incompatible with datafusion 28, or is there some use declaration I'm missing that's necessary for this trait implementation to be visible?

Improve Testing

Add testing for the cases below; a sketch of one such test follows the list.

Bad Data

  • Non-existent file
  • Non-existent bucket

DataFusion Integration

  • Test for ctx.register_object_store
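
A sketch of the non-existent file case, following the README's ListingTableConfig flow; the test name, path, and assertion are illustrative rather than taken from the crate's test suite, and the imports are assumed to match the README examples:

#[tokio::test]
async fn nonexistent_file_returns_error() {
    let s3_file_system = Arc::new(S3FileSystem::default().await);

    // Inferring a listing over a key that does not exist should surface an
    // error rather than panic.
    let result = ListingTableConfig::new(s3_file_system, "data/does_not_exist.parquet")
        .infer()
        .await;

    assert!(result.is_err());
}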

Support creating client specific configs for different buckets

...it's very common to set up different access control for different buckets, so we will need to support creating different clients with specific configs for different buckets in the future. For example, in our production environment, we have Spark jobs that access different buckets hosted in different AWS accounts.

Originally posted by @houqp in #20 (comment)

With context provided by @houqp:

An IAM policy attached to IAM users (via access/secret key) is easier to get started with. For a more secure and production-ready setup, you would want to use an IAM role instead of IAM users so there are no long-lived secrets. The place where things get complicated is cross-account S3 write access. In order to do this, you need to assume an IAM role in the S3 bucket owner account to perform the write, otherwise the bucket owner account won't be able to truly own the newly written objects. The result is that the bucket owner won't be able to further share the objects with other accounts. In short, in some cases the object store needs to assume and switch to different IAM roles depending on which bucket it is writing to. For cross-account S3 reads we don't have this problem, so you can usually get by with a single IAM role.

And potential designs also provided by @houqp:

  1. Maintain a set of protocol-specific clients internally within the S3 object store implementation, one for each bucket

  2. Extend the ObjectStore abstraction in DataFusion to support a hierarchy-based object store lookup, i.e. first look up an object-store-specific URI key generator by scheme, then calculate a unique object store key for the given URI for the actual object store lookup.

I am leaning towards option 1 because it doesn't force this complexity into all object stores. For example, the local file object store will never need to dispatch to different clients based on file path. @yjshen, curious what your thoughts are on this.
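
A minimal sketch of option 1, assuming a hypothetical per-bucket client map; the struct and field names are illustrative, not the crate's actual definitions:

use std::collections::HashMap;
use std::sync::Arc;

struct MultiBucketS3Store {
    // Client used when a bucket has no dedicated configuration.
    default_client: Arc<aws_sdk_s3::Client>,
    // Clients built with bucket-specific credentials or assumed roles, keyed by bucket name.
    bucket_clients: HashMap<String, Arc<aws_sdk_s3::Client>>,
}

impl MultiBucketS3Store {
    fn client_for(&self, bucket: &str) -> Arc<aws_sdk_s3::Client> {
        self.bucket_clients
            .get(bucket)
            .cloned()
            .unwrap_or_else(|| self.default_client.clone())
    }
}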

Register object store issue

Hi, I'd like to register an S3 object store to read files hosted on S3. I followed the example I found here (#15).

Here is the relevant part of my code, but it throws the error the trait 'ObjectStore' is not implemented for 'AmazonS3FileSystem':

let mut execution_ctx = ExecutionContext::new();
execution_ctx.register_object_store(
    "s3",
    Arc::new(AmazonS3FileSystem::new(None, None, None, None, None, None).await),
);

How can I fix this? Thanks!

Tests failing, update the Minio version used for testing

Testing on the master branch using the commands provided in the README fails. The issue, as suggested by several sources (a StackOverflow answer, a GitHub issue, and a MinIO blog post), seems to be that MinIO doesn't add the files in the parquet-testing/data folder to the data bucket because of the new versioning scheme, and consequently the tests aren't able to read the files from the Docker deployment.

This change in MinIO's behaviour is recent, which suggests that an older version might solve the issue. I tried older versions, and the newest one that works for the tests is minio/minio:RELEASE.2022-05-26T05-48-41Z.hotfix.204b42b6b.

Please update the GitHub Actions workflow with this image. Thanks!

Rename `AmazonS3FileSystem` to `S3FileSystem`

I was looking at the C++/Python Apache Arrow docs and saw they use the name S3FileSystem (https://arrow.apache.org/docs/search.html?q=s3). While we of course don't need to follow their lead, I do think that naming makes more sense given that this doesn't need to connect to Amazon S3 (though they are the standard bearer).

We are of course using the official Amazon library to connect (I'm not sure if the C++/Python implementation is), and I'm not sure whether that was taken into account when naming.

One more example is Python's s3fs, which is built on top of botocore, another AWS library.

@seddonm1, what do you think? If that's OK with you, I will make the update.

Multi parquet s3 example?

I have been following different issues in the main DataFusion repo and this one, and from what I can gather, you want to enable processing multiple Parquet files stored on S3. Is this already possible, and if so, is there an example of how it can be done?
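
A sketch of one way this could look with the ListingTable flow from the README, assuming the bucket and prefix below are placeholders and that ListingTable lists every Parquet file under the prefix:

// Hypothetical prefix containing several Parquet files.
let input_uri = "s3://my-bucket/path/to/parquet-dir/";

let (object_store, _) = execution_ctx.object_store(input_uri)?;

// Inferring over the prefix exposes all Parquet files beneath it as a single table.
let config = ListingTableConfig::new(object_store, input_uri).infer().await?;
let table = ListingTable::try_new(config)?;

execution_ctx.register_table("tbl", Arc::new(table))?;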

Implement Default on AmazonS3FileSystem

Use the default AWS credentials provider in a Default impl for AmazonS3FileSystem so we can call

AmazonS3FileSystem::default().await

instead of

AmazonS3FileSystem::new(
    None,
    None,
    None,
    None,
    None,
    None,
)
.await
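
Since Rust's Default trait cannot expose an async constructor, a sketch of the convenience method as an inherent async function (illustrative, not necessarily the crate's eventual implementation):

impl AmazonS3FileSystem {
    // Passing all Nones falls back to the default AWS credential provider chain.
    pub async fn default() -> Self {
        Self::new(None, None, None, None, None, None).await
    }
}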
