datafusion-contrib / datafusion-objectstore-s3
S3 as an ObjectStore for DataFusion
License: Apache License 2.0
For use in CI
Hi, I'd like to register an S3 object store to read files hosted on S3. I followed the example I found [here](#15).
Here is my code, but it throws `the trait 'ObjectStore' is not implemented for 'AmazonS3FileSystem'`:
```rust
let mut execution_ctx = ExecutionContext::new();
execution_ctx.register_object_store(
    "s3",
    Arc::new(AmazonS3FileSystem::new(None, None, None, None, None, None).await),
);
```
How can I fix this? Thanks!
It seems support was added for this based on https://github.com/awslabs/aws-sdk-rust/releases/tag/v0.0.17-alpha
Look into integrating this into S3FileSystem, or using it to create a TableProvider.
Hey, do we have async support for reading S3 files? I see support for reading local files using the async ParquetRecordBatchStreamBuilder.
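For reference, a minimal sketch of the local async path the comment refers to, using the parquet crate's async reader (the file name is a placeholder; an S3 version would additionally need an AsyncFileReader backed by the S3 client):

```rust
use futures::TryStreamExt;
use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a local file asynchronously; "data.parquet" is illustrative.
    let file = File::open("data.parquet").await?;
    // Build an async stream of Arrow RecordBatches from the Parquet file.
    let stream = ParquetRecordBatchStreamBuilder::new(file).await?.build()?;
    let batches: Vec<_> = stream.try_collect().await?;
    println!("read {} batches", batches.len());
    Ok(())
}
```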
As @seddonm1 notes, aws_config::load_from_env() doesn't allow specifying an endpoint, which is required for integration testing. That being said, it would still be nice to support this feature. Maybe we could add a boolean argument to AmazonS3FileSystem::new, like load_from_env, and dispatch accordingly, as sketched below.
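A minimal, self-contained sketch of that dispatch (the types here are illustrative, not the crate's API; the real AmazonS3FileSystem::new takes six Option parameters):

```rust
/// Illustrative stand-in for the resolved AWS configuration.
#[derive(Debug)]
enum S3Config {
    /// Resolved via aws_config::load_from_env() (no custom endpoint).
    FromEnv,
    /// Built manually so integration tests can target a local Minio.
    Manual { endpoint: String },
}

fn resolve_config(load_from_env: bool, endpoint: Option<String>) -> S3Config {
    if load_from_env {
        // Region and credentials come from AWS_* environment variables.
        S3Config::FromEnv
    } else {
        // Fall back to an explicit endpoint, defaulting to local Minio.
        S3Config::Manual {
            endpoint: endpoint.unwrap_or_else(|| "http://localhost:9000".into()),
        }
    }
}
```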
Add testing for the cases below (see the sketch that follows):
- Bad data
- DataFusion integration (ctx.register_object_store)
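A hedged sketch of the DataFusion integration case (API names follow the snippets quoted elsewhere in this thread and may differ across versions):

```rust
use std::sync::Arc;
use datafusion::prelude::ExecutionContext;
use datafusion_objectstore_s3::object_store::s3::S3FileSystem;

#[tokio::test]
async fn registers_s3_object_store() {
    let mut ctx = ExecutionContext::new();
    // DataFusion integration: "s3://" URIs should resolve to this store.
    ctx.register_object_store("s3", Arc::new(S3FileSystem::default().await));
    // The "bad data" case would then read a deliberately corrupt Parquet
    // object and assert that a readable error (not a panic) is returned.
}
```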
Based on the design in the base implementation: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/2/files
...it's very common to set up different access control for different buckets, so we will need to support creating different clients with specific configs for different buckets in the future. For example, in our production environment, we have Spark jobs that access different buckets hosted in different AWS accounts.
Originally posted by @houqp in #20 (comment)
With context provided by @houqp:
An IAM policy attached to IAM users (via access/secret key) is easier to get started with. For a more secure and production-ready setup, you would want to use an IAM role instead of IAM users, so there are no long-lived secrets. The place where things get complicated is cross-account S3 write access. In order to do this, you need to assume an IAM role in the S3 bucket owner's account to perform the write, otherwise the bucket owner account won't be able to truly own the newly written objects. The result of that is the bucket owner won't be able to further share the objects with other accounts. In short, in some cases, the object store needs to assume and switch to different IAM roles depending on which bucket it is writing to. For cross-account S3 reads, we don't have this problem, so you can usually get by with a single IAM role.
And potential designs also provided by @houqp:
1. Maintain a set of protocol-specific clients internally within the S3 object store implementation, one for each bucket.
2. Extend the ObjectStore abstraction in DataFusion to support a hierarchy-based object store lookup, i.e. first look up an object-store-specific URI key generator by scheme, then calculate a unique object store key for the given URI for the actual object store lookup.

I am leaning towards option 1 because it doesn't force this complexity into all object stores. For example, the local file object store will never need to dispatch to different clients based on file path. @yjshen, curious what your thoughts are on this. A sketch of option 1 follows.
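A self-contained sketch of option 1 (types are illustrative, not a real API): the S3 object store keeps one client per bucket and picks the right one from the URI.

```rust
use std::collections::HashMap;

/// Stand-in for a bucket-specific client; in practice this would wrap an
/// S3 client built with that bucket's credentials or an assumed IAM role.
struct BucketClient {
    role_arn: Option<String>,
}

struct S3ObjectStore {
    /// One client per bucket, keyed by bucket name.
    clients: HashMap<String, BucketClient>,
}

impl S3ObjectStore {
    /// Resolve the client for a URI such as
    /// "s3://logs-account-a/2022/01/part-0.parquet" -> "logs-account-a".
    fn client_for(&self, uri: &str) -> Option<&BucketClient> {
        let bucket = uri.strip_prefix("s3://")?.split('/').next()?;
        self.clients.get(bucket)
    }
}
```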
Set up GitHub Actions build and testing flows.
I think the API is not correct to work with DataFusion: currently the AmazonS3FileSystem requires a bucket to be set. This means that one ObjectStore can only retrieve from one bucket. This does not fit with the DataFusion API, which resolves ObjectStores by URI scheme (i.e. s3:// vs file://). I will refactor this so we can dynamically request files correctly.
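To illustrate the constraint (registration call as quoted earlier in this thread): DataFusion keys object stores by scheme, so the single store registered for s3 must serve every bucket.

```rust
// The one store registered for the "s3" scheme...
ctx.register_object_store("s3", Arc::new(S3FileSystem::default().await));

// ...must handle both of these URIs, so it cannot be pinned to one bucket:
// s3://bucket-a/data/file.parquet
// s3://bucket-b/other/file.parquet
```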
I was looking at the C++/Python Apache Arrow docs and saw that they use the name S3FileSystem (https://arrow.apache.org/docs/search.html?q=s3). While we of course don't need to follow their lead, I do think that naming makes more sense, given that this doesn't need to connect to Amazon S3 specifically (though they are the standard-bearer). We are, of course, using the official Amazon library to connect (I'm not sure if the C++/Python implementation is), and I'm not sure if that was taken into account when naming. One more example is Python's s3fs, which is built on top of botocore, another AWS library. @seddonm1, what do you think? If it's OK with you, I will make the update.
Testing on the master branch using the commands provided in the README fails. As suggested by these sources (StackOverflow, GitHub Issue, Minio BlogPost), the issue seems to be that Minio doesn't add the files in the parquet-testing/data folder to the data bucket because of a new versioning scheme, and consequently the tests aren't able to read the files from the docker deployment.
This change in Minio's behaviour is recent, which suggests that an older version might solve the issue. I tried older versions, and the newest one that works for the tests is minio/minio:RELEASE.2022-05-26T05-48-41Z.hotfix.204b42b6b.
Please update the GitHub Actions with this image. Thanks!
I have been following different issues in the main DataFusion repo and this one, and from what I can gather you want to enable processing multiple Parquet files stored on S3. Is this already possible, and if so, is there an example of how it can be done?
I think this would make it clearer, when used with DataFusion, where an error is coming from.
I can't get doc tests to pass using the object_store returned from the ctx.object_store method (I get a file-not-found error). Does anything seem off to you, @seddonm1?
Originally posted by @matthewmturner in #38 (comment)
Use the default AWS credentials provider in a Default impl for AmazonS3FileSystem, so we can call

```rust
AmazonS3FileSystem::default().await
```

instead of

```rust
AmazonS3FileSystem::new(None, None, None, None, None, None).await
```
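A minimal sketch of what that could look like (not the crate's actual code; note that the std Default trait cannot be async, so this would be an inherent default method):

```rust
impl AmazonS3FileSystem {
    /// Construct using the default AWS credentials provider chain by
    /// passing None for all six configuration parameters.
    pub async fn default() -> Self {
        Self::new(None, None, None, None, None, None).await
    }
}
```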
I am trying to add an S3 object store for a project using datafusion version 28. The README.md for this project notes that the S3FileSystem can be registered as an ObjectStore on an ExecutionContext, but the latter is not something that exists any longer in datafusion. Instead, the register_object_store function is now part of RuntimeEnv. Given a SessionContext object ctx, I am trying to register the S3 store with:
```rust
use datafusion_objectstore_s3::object_store::s3::S3FileSystem;
// ...
let s3_file_system = Arc::new(S3FileSystem::default().await);
let rv = ctx.runtime_env();
let s3url = Url::parse("s3").unwrap();
rv.register_object_store(&s3url, s3_file_system);
```
However, this produces the following error:
```
error[E0277]: the trait bound `S3FileSystem: object_store::ObjectStore` is not satisfied
   --> src/main.rs:127:42
    |
127 |     rv.register_object_store(&s3url, s3_file_system);
    |                                      ^^^^^^^^^^^^^^ the trait `object_store::ObjectStore` is not implemented for `S3FileSystem`
    |
    = help: the following other types implement trait `object_store::ObjectStore`:
              Box<(dyn object_store::ObjectStore + 'static)>
              object_store::chunked::ChunkedStore
              object_store::limit::LimitStore<T>
              object_store::local::LocalFileSystem
              object_store::memory::InMemory
              object_store::prefix::PrefixStore<T>
              object_store::throttle::ThrottledStore<T>
    = note: required for the cast from `S3FileSystem` to the object type `dyn object_store::ObjectStore`

For more information about this error, try `rustc --explain E0277`.
error: could not compile `shimsql` due to previous error
```
Is this crate incompatible with datafusion 28, or is there some use declaration I'm missing that's necessary for this trait implementation to be visible?
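For what it's worth, a hedged sketch of the path newer DataFusion versions expect: they resolve stores through the separate object_store crate, whose AWS implementation does implement the required ObjectStore trait (the bucket name and URL below are illustrative, and the snippet assumes an enclosing function that returns a Result):

```rust
use std::sync::Arc;
use object_store::aws::AmazonS3Builder;
use url::Url;

// Build an S3 store from AWS_* environment variables.
let s3 = AmazonS3Builder::from_env()
    .with_bucket_name("my-bucket")
    .build()?;

// Register under a full URL (a bare "s3" fails Url::parse).
let url = Url::parse("s3://my-bucket")?;
ctx.runtime_env().register_object_store(&url, Arc::new(s3));
```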
Currently, aws-sdk-rust does not support anonymous bucket access (for accessing things like AWS Open Data). There is a workaround proposed in this request: [request]: Anonymous Credentials Provider.
This issue exists to make future users aware of the known limitation; there are no immediate plans to implement the workaround until AWS has had time to (hopefully) implement it upstream.
For example, run a SQL query on a table from S3, as sketched below.
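A minimal sketch of such an example (untested; API names follow the README-era snippets in this thread, and the bucket/path are placeholders):

```rust
use std::sync::Arc;
use datafusion::error::Result;
use datafusion::prelude::ExecutionContext;
use datafusion_objectstore_s3::object_store::s3::S3FileSystem;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    // Make "s3://" URIs resolve to the S3 object store.
    ctx.register_object_store("s3", Arc::new(S3FileSystem::default().await));

    // Register a Parquet table backed by S3, then query it with SQL.
    ctx.register_parquet("trips", "s3://my-bucket/trips/").await?;
    let df = ctx.sql("SELECT count(*) FROM trips").await?;
    df.show().await?;
    Ok(())
}
```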