Comments (8)

mdneuzerling commented on June 5, 2024

I've had some time to think about this and put some code down. Interacting with S3 requires (at least) three pieces of information:

  • Object key, analogous to a file path
  • Bucket name
  • Configuration, including credentials

The first two are easy. The configuration (possibly inspired by boto's Session.client) is more involved. Refer to the documentation for details. It optionally specifies the following configuration fields, some of which are considered secrets:

  • access_key_id
  • secret_access_key
  • session_token
  • profile (something like a config file)
  • endpoint
  • region

On my machine and with my straightforward AWS configuration, I can get away with using the default paws::s3() configuration. This is probably because I've run aws configure in a terminal. So we do have a sensible default value.
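As a rough illustration of the configuration fields listed above, here is a minimal base R sketch that collects them into the nested list shape that `paws::s3(config = ...)` accepts. The helper name `s3_config()` is hypothetical and not part of paws or targets; fields left unset are dropped so that paws can fall back to its own defaults.

```r
# Hypothetical helper (not part of paws or targets): gather the optional S3
# settings into the nested list that paws accepts as `config`.
# Unset fields are dropped so paws can fall back to its defaults,
# e.g. whatever `aws configure` wrote to disk.
s3_config <- function(access_key_id = NULL,
                      secret_access_key = NULL,
                      session_token = NULL,
                      profile = NULL,
                      endpoint = NULL,
                      region = NULL) {
  drop_null <- function(x) Filter(Negate(is.null), x)
  # The three secret fields live under credentials$creds.
  creds <- drop_null(list(
    access_key_id = access_key_id,
    secret_access_key = secret_access_key,
    session_token = session_token
  ))
  credentials <- drop_null(list(
    creds = if (length(creds) > 0) creds,
    profile = profile
  ))
  drop_null(list(
    credentials = if (length(credentials) > 0) credentials,
    endpoint = endpoint,
    region = region
  ))
}

cfg <- s3_config(region = "us-east-1", profile = "default")
# client <- paws::s3(config = cfg)  # requires the paws package
```

Calling `s3_config()` with no arguments yields an empty list, which should behave like the default `paws::s3()` setup described above.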

So I'm expecting two target archetypes:

tar_s3_get_object(key, bucket, config = paws::s3(), <other_targets_args>, ...)
tar_s3_push_object(key, bucket, config = paws::s3(), <other_targets_args>, ...)

The archetypes would then use either the ETag/hash or the last-modified date to avoid re-downloading. These checks are actually performed server-side, so there's no need for targets to compare hashes.
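To make the ETag idea concrete, here is a small sketch. It is a simplified variant, not the server-side check itself: it asks S3 for the current ETag with a cheap `head_object()` call and skips `get_object()` when the ETag matches the one recorded earlier. The function name `s3_get_if_changed()` and its return shape are assumptions for illustration; `client` is anything that exposes `head_object()` and `get_object()` the way a paws S3 client does. A fully server-side version would presumably pass the stored ETag through the `IfNoneMatch` argument of `get_object()` instead.

```r
# Hypothetical helper: download `key` only when its ETag differs from the
# one recorded after the previous download.
s3_get_if_changed <- function(client, bucket, key, known_etag = NULL) {
  # HEAD request: returns metadata (including the ETag) without the body.
  remote_etag <- client$head_object(Bucket = bucket, Key = key)$ETag
  if (!is.null(known_etag) && identical(remote_etag, known_etag)) {
    return(list(changed = FALSE, etag = remote_etag, body = NULL))
  }
  object <- client$get_object(Bucket = bucket, Key = key)
  list(changed = TRUE, etag = object$ETag, body = object$Body)
}

# Usage, assuming the paws package and working credentials; the bucket and
# key names are made up:
# first <- s3_get_if_changed(paws::s3(), "my-bucket", "data/raw.csv")
# again <- s3_get_if_changed(paws::s3(), "my-bucket", "data/raw.csv",
#                            known_etag = first$etag)
```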

The intent here is that ... is passed to PAWS so that users can deal with things like server-side encryption without it weighing down the targets syntax. I'll get some draft code together but I fully expect this to take a few rounds of iteration.

from tarchetypes.

wlandau commented on June 5, 2024

Thanks for starting on this!

I agree that most of the config should happen up front and not encumber the target's interface.

After looking at your proposal, I see a couple of alternative routes. Not sure which one I like more yet.

  1. Just get and push existing files. This is how I read your proposal. Maybe I am missing an implicit argument for the target's command.
  2. For the "push" archetype, act more like tar_target(format = "file"), which contains some customizable R code to run that returns a file path.
tar_s3_push_object(name, command, key, bucket, config = paws::s3(), ...)
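One way to picture the second route, as a sketch only: the command produces a file and returns its path, and the upload happens afterwards, with any extra arguments forwarded to the S3 call. The function `s3_push_file()` below is hypothetical and leaves out everything targets would handle (naming, hashing, storage); `client` is assumed to expose `put_object()` like a paws S3 client.

```r
# Hypothetical sketch of the "push" step: `command` is a function that
# writes a file and returns its path; the file is then uploaded under `key`.
# Extra arguments in `...` go straight to put_object(), for example
# ServerSideEncryption = "AES256".
s3_push_file <- function(command, key, bucket, client, ...) {
  path <- command()
  stopifnot(is.character(path), length(path) == 1, file.exists(path))
  # Read the whole file into a raw vector to use as the request body.
  body <- readBin(path, what = "raw", n = file.size(path))
  client$put_object(Bucket = bucket, Key = key, Body = body, ...)
  # Return the local path, mirroring what a format = "file" command returns.
  path
}

# Usage, assuming the paws package and working credentials; the bucket,
# key, and write_report() are made up:
# s3_push_file(write_report, "reports/out.csv", "my-bucket", paws::s3(),
#              ServerSideEncryption = "AES256")
```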

mdneuzerling commented on June 5, 2024

Concerns with my approach:

  1. How does CRAN feel about using the system command like this?
  2. Does the AWS CLI approach work on Windows?
  3. How do we handle the authentication side of things when uploading data to S3? (Things like server-side encryption). Downloading is generally simpler than uploading with S3.

These issues may be somewhat resolved if we wrap a fully-featured AWS API like the PAWS package. This has the added bonus of opening up the possibility of other integrations with AWS besides S3.

I'm a bit busy at the moment, despite being in strict lockdown, but I can try to have a look if you'd like?

wlandau commented on June 5, 2024

Yeah, I was hoping to use paws. If you have time, I would appreciate help and input. I plan to learn paws for ropensci/targets#152, but you have a huge head start.

wlandau commented on June 5, 2024

Probably best to build this on top of tar_change_raw() like in #9. Another thought is to use #9 somehow like Miles mentioned here.

mdneuzerling commented on June 5, 2024

I think I may still be in a drake state of mind, trying to replicate file_in and file_out. I'll try to get a better understanding of the targets equivalent concepts.

wlandau commented on June 5, 2024

Yeah, with targets, all files are dynamic (e.g. tar_target(format = "file")).

On reflection, I would actually prefer ropensci/targets#154 if it works out. I think S3 will be more seamless and efficient that way.

wlandau commented on June 5, 2024

Let's go with ropensci/targets#176 instead. I think it's as seamless as Metaflow.
