
datafusion-objectstore-s3's Introduction

DataFusion-ObjectStore-S3

S3 as an ObjectStore for Datafusion.

Querying files on S3 with DataFusion

This crate implements the DataFusion ObjectStore trait on AWS S3 and on other implementors of the S3 API. We leverage the official AWS Rust SDK for interacting with S3. While it is our understanding that the AWS APIs we are using are relatively stable, we can make no assurances on API stability, either on AWS' part or within this crate. This crate's API is tightly coupled to DataFusion, a fast-moving project, and as such we will make changes in line with those upstream changes.

Examples

Examples for querying AWS and other implementors, such as MinIO, are shown below.

Load credentials from the default AWS credential provider (such as the environment or ~/.aws/credentials):

let s3_file_system = Arc::new(S3FileSystem::default().await);

S3FileSystem::default() is a convenience wrapper for S3FileSystem::new(None, None, None, None, None, None).

Connect to an implementor of the S3 API (MinIO, in this case) using an access key and secret:

// Example credentials provided by MinIO
const MINIO_ACCESS_KEY_ID: &str = "AKIAIOSFODNN7EXAMPLE";
const MINIO_SECRET_ACCESS_KEY: &str = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY";
const PROVIDER_NAME: &str = "Static";
const MINIO_ENDPOINT: &str = "http://localhost:9000";

let s3_file_system = S3FileSystem::new(
    Some(SharedCredentialsProvider::new(Credentials::new(
        MINIO_ACCESS_KEY_ID,
        MINIO_SECRET_ACCESS_KEY,
        None,
        None,
        PROVIDER_NAME,
    ))), // Credentials provider
    None, // Region
    Some(Endpoint::immutable(Uri::from_static(MINIO_ENDPOINT))), // Endpoint
    None, // RetryConfig
    None, // AsyncSleep
    None, // TimeoutConfig
)
.await;

Using DataFusion's ListingTableConfig, we register a table with a DataFusion ExecutionContext so that it can be queried:

let filename = "data/alltypes_plain.snappy.parquet";

let config = ListingTableConfig::new(s3_file_system, filename).infer().await?;

let table = ListingTable::try_new(config)?;

let mut ctx = ExecutionContext::new();

ctx.register_table("tbl", Arc::new(table))?;

let df = ctx.sql("SELECT * FROM tbl").await?;
df.show().await?;

We can also register the S3FileSystem directly as an ObjectStore on an ExecutionContext. This provides an idiomatic way of creating TableProviders that can be queried.

execution_ctx.register_object_store(
    "s3",
    Arc::new(S3FileSystem::default().await),
);

let input_uri = "s3://parquet-testing/data/alltypes_plain.snappy.parquet";

let (object_store, _) = execution_ctx.object_store(input_uri)?;

let config = ListingTableConfig::new(object_store, input_uri).infer().await?;

let table_provider: Arc<dyn TableProvider + Send + Sync> = Arc::new(ListingTable::try_new(config)?);

Testing

Tests are run against MinIO, which provides a containerized implementation of the Amazon S3 API.

First fetch the test data (the parquet-testing submodule):

git submodule update --init --recursive

Then start the MinIO container:

docker run \
--detach \
--rm \
--publish 9000:9000 \
--publish 9001:9001 \
--name minio \
--volume "$(pwd)/parquet-testing:/data" \
--env "MINIO_ROOT_USER=AKIAIOSFODNN7EXAMPLE" \
--env "MINIO_ROOT_PASSWORD=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
quay.io/minio/minio server /data \
--console-address ":9001"

Once MinIO is running, run the tests as usual:

cargo test

datafusion-objectstore-s3's People

Contributors

matthewmturner, seddonm1, timvw

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

datafusion-objectstore-s3's Issues

API not quite correct

I think the API is not correct to work with DataFusion:

Currently the AmazonS3FileSystem requires a bucket to be set. This means that one ObjectStore can only retrieve from one bucket. This does not fit with the DataFusion API which resolves ObjectStores by URI scheme (i.e. s3:// vs file://).

I will refactor this so we can dynamically request files correctly.
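
A minimal sketch of the scheme-based lookup described above, reusing the ctx.object_store accessor pattern from the README (the URIs are placeholders):

// The registry picks a store by URI scheme, so a single registered S3 store
// must be able to serve any bucket named in the URI.
let (s3_store, _) = execution_ctx.object_store("s3://some-bucket/some/key.parquet")?;

// The same lookup with a file:// URI resolves to the local filesystem store instead.
let (local_store, _) = execution_ctx.object_store("file:///tmp/some/key.parquet")?;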

Setup CI

Set up GitHub Actions build and testing flows.

Support aws_config::load_from_env()

As @seddonm1 notes, aws_config::load_from_env() doesn't allow specifying an endpoint, which is required for integration testing.

That being said, it would still be nice to support this feature.

Maybe we could add a boolean argument to AmazonS3FileSystem::new, such as load_from_env, and dispatch accordingly.
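
A sketch of that dispatch against a recent aws-config; load_from_env here is the proposed flag, the else branch is only a placeholder, and exact types may differ in the SDK version this crate pins:

use aws_config::SdkConfig;

async fn build_sdk_config(load_from_env: bool) -> SdkConfig {
    if load_from_env {
        // Resolve credentials and region from the environment, ~/.aws, etc.
        aws_config::load_from_env().await
    } else {
        // Placeholder: in practice this branch would carry the explicitly
        // supplied credentials/endpoint/region, as in the README's MinIO example.
        SdkConfig::builder().build()
    }
}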

Unable to use S3FileSystem with RuntimeEnv::register_object_store

I am trying to add an S3 object store for a project using datafusion version 28. The README.md for this project notes that the S3FileSystem can be registered as an ObjectStore on an ExecutionContext, but the latter no longer exists in datafusion. Instead, the register_object_store function is now part of RuntimeEnv. Given a SessionContext object ctx, I am trying to register the S3 store with

use datafusion_objectstore_s3::object_store::s3::S3FileSystem;
...
let s3_file_system = Arc::new(S3FileSystem::default().await);
let rv = ctx.runtime_env();
let s3url = Url::parse("s3").unwrap();
rv.register_object_store(&s3url, s3_file_system);

However, this produces the following error:

error[E0277]: the trait bound `S3FileSystem: object_store::ObjectStore` is not satisfied
   --> src/main.rs:127:42
    |
127 |         rv.register_object_store(&s3url, s3_file_system);
    |                                          ^^^^^^^^^^^^^^ the trait `object_store::ObjectStore` is not implemented for `S3FileSystem`
    |
    = help: the following other types implement trait `object_store::ObjectStore`:
              Box<(dyn object_store::ObjectStore + 'static)>
              object_store::chunked::ChunkedStore
              object_store::limit::LimitStore<T>
              object_store::local::LocalFileSystem
              object_store::memory::InMemory
              object_store::prefix::PrefixStore<T>
              object_store::throttle::ThrottledStore<T>
    = note: required for the cast from `S3FileSystem` to the object type `dyn object_store::ObjectStore`

For more information about this error, try `rustc --explain E0277`.
error: could not compile `shimsql` due to previous error

Is this crate incompatible with datafusion 28, or is there some use declaration I'm missing that's necessary for this trait implementation to be visible?

Improve Testing

Add testing for the cases below; a sketch of one such test follows the list.

Bad Data

  • Non-existent file
  • Non-existent bucket

DataFusion Integration

  • Test for ctx.register_object_store
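
A sketch of the non-existent file case, following the README's ListingTableConfig flow; the test name, path, and assertion are illustrative rather than taken from the crate's test suite, and the imports are assumed to match the README examples:

#[tokio::test]
async fn nonexistent_file_returns_error() {
    let s3_file_system = Arc::new(S3FileSystem::default().await);

    // Inferring a listing over a key that does not exist should surface an
    // error rather than panic.
    let result = ListingTableConfig::new(s3_file_system, "data/does_not_exist.parquet")
        .infer()
        .await;

    assert!(result.is_err());
}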

Support creating client specific configs for different buckets

...it's very common to set up different access control for different buckets, so we will need to support creating different clients with specific configs for different buckets in the future. For example, in our production environment, we have Spark jobs that access different buckets hosted in different AWS accounts.

Originally posted by @houqp in #20 (comment)

With context provided by @houqp:

An IAM policy attached to IAM users (via access/secret key) is easier to get started with. For a more secure and production-ready setup, you would want to use an IAM role instead of IAM users so there are no long-lived secrets. The place where things get complicated is cross-account S3 write access. In order to do this, you need to assume an IAM role in the S3 bucket owner account to perform the write, otherwise the bucket owner account won't be able to truly own the newly written objects. The result is that the bucket owner won't be able to further share the objects with other accounts. In short, in some cases the object store needs to assume and switch to different IAM roles depending on which bucket it is writing to. For cross-account S3 reads we don't have this problem, so you can usually get by with a single IAM role.

And potential designs also provided by @houqp:

  1. Maintain a set of protocol-specific clients internally within the S3 object store implementation, one for each bucket

  2. Extend the ObjectStore abstraction in DataFusion to support a hierarchy-based object store lookup, i.e. first look up an object-store-specific URI key generator by scheme, then calculate a unique object store key for the given URI for the actual object store lookup.

I am leaning towards option 1 because it doesn't force this complexity into all object stores. For example, the local file object store will never need to dispatch to different clients based on file path. @yjshen, curious what your thoughts are on this.
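
A minimal sketch of option 1, assuming a hypothetical per-bucket client map; the struct and field names are illustrative, not the crate's actual definitions:

use std::collections::HashMap;
use std::sync::Arc;

struct MultiBucketS3Store {
    // Client used when a bucket has no dedicated configuration.
    default_client: Arc<aws_sdk_s3::Client>,
    // Clients built with bucket-specific credentials or assumed roles, keyed by bucket name.
    bucket_clients: HashMap<String, Arc<aws_sdk_s3::Client>>,
}

impl MultiBucketS3Store {
    fn client_for(&self, bucket: &str) -> Arc<aws_sdk_s3::Client> {
        self.bucket_clients
            .get(bucket)
            .cloned()
            .unwrap_or_else(|| self.default_client.clone())
    }
}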

Register object store issue

Hi, I'd like to register an S3 object store to read files hosted on S3. I followed the example I found here (#15).

Here is the relevant part of my code, but it throws the error the trait 'ObjectStore' is not implemented for 'AmazonS3FileSystem':

let mut execution_ctx = ExecutionContext::new();
execution_ctx.register_object_store(
    "s3",
    Arc::new(AmazonS3FileSystem::new(None, None, None, None, None, None).await),
);

How can I fix this? Thanks!

Tests failing, update the Minio version used for testing

Testing on the master branch using the commands provided in the README fails. The issue, as suggested by several sources (a StackOverflow answer, a GitHub issue, and a MinIO blog post), seems to be that MinIO doesn't add the files in the parquet-testing/data folder to the data bucket because of the new versioning scheme, and consequently the tests aren't able to read the files from the Docker deployment.

This change in MinIO's behaviour is recent, which suggests that an older version might solve the issue. I tried older versions, and the newest one that works for the tests is minio/minio:RELEASE.2022-05-26T05-48-41Z.hotfix.204b42b6b.

Please update the GitHub Actions workflow with this image. Thanks!

Rename `AmazonS3FileSystem` to `S3FileSystem`

I was looking at the C++/Python Apache Arrow docs and saw they use the name S3FileSystem (https://arrow.apache.org/docs/search.html?q=s3). While we of course don't need to follow their lead, I do think that naming makes more sense given that this doesn't need to connect to Amazon S3 (though they are the standard bearer).

We are of course using the official Amazon library to connect (I'm not sure if the C++/Python implementation is), and I'm not sure whether that was taken into account when naming.

One more example is Python's s3fs, which is built on top of botocore, another AWS library.

@seddonm1, what do you think? If that's OK with you, I will make the update.

Multi parquet s3 example?

I have been following different issues in the main DataFusion repo and this one, and from what I can gather, you want to enable processing multiple Parquet files stored on S3. Is this already possible, and if so, is there an example of how it can be done?
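
A sketch of one way this could look with the ListingTable flow from the README, assuming the bucket and prefix below are placeholders and that ListingTable lists every Parquet file under the prefix:

// Hypothetical prefix containing several Parquet files.
let input_uri = "s3://my-bucket/path/to/parquet-dir/";

let (object_store, _) = execution_ctx.object_store(input_uri)?;

// Inferring over the prefix exposes all Parquet files beneath it as a single table.
let config = ListingTableConfig::new(object_store, input_uri).infer().await?;
let table = ListingTable::try_new(config)?;

execution_ctx.register_table("tbl", Arc::new(table))?;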

Implement Default on AmazonS3FileSystem

Use the default AWS credentials provider in a Default impl for AmazonS3FileSystem so we can call

AmazonS3FileSystem::default().await

instead of

AmazonS3FileSystem::new(
    None,
    None,
    None,
    None,
    None,
    None,
)
.await
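
Since Rust's Default trait cannot expose an async constructor, a sketch of the convenience method as an inherent async function (illustrative, not necessarily the crate's eventual implementation):

impl AmazonS3FileSystem {
    // Passing all Nones falls back to the default AWS credential provider chain.
    pub async fn default() -> Self {
        Self::new(None, None, None, None, None, None).await
    }
}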
