Comments (8)

mdneuzerling commented on June 11, 2024 7

Quick note on this: Metaflow doesn't support anonymous functions as written here. I think it's an easy, non-breaking change and I've drafted some code. I'll clean it up and submit that as a pull request to the repo.

from tarchetypes.

wlandau commented on June 11, 2024 1

I read up more on AWS ParallelCluster, AWS Batch, and Metaflow's HPC, and I no longer think ParallelCluster is something that makes sense to integrate with directly. I think a Metaflow target archetype makes more sense to start with, and the versioning could still help even after targets adopts the cloud.

Some future development ideas:

  1. For S3 storage on its own, let's try #8 (using https://mdneuzerling.com/post/sourcing-data-from-s3-with-drake/).
  2. AWS Batch scheduling as an externalized algorithm subclass (related: ropensci/targets#148). Should look like the existing clustermq and future algorithm subclasses but built on top of paws::batch().

wlandau commented on June 11, 2024 1

Just learned some neat stuff from experimenting with metaflow.org/sandbox. It gave me another idea for AWS S3 integration in targets: ropensci/targets#154.

wlandau commented on June 11, 2024 1

Update: thanks to http://metaflow.org/sandbox, I think I figured out what AWS S3 integration in targets should look like, and we're off to a great start: ropensci/targets#176.

AWS Batch integration is going to be a lot harder. What we really need is a batchtools or clustermq for AWS Batch, plus a future extension on top of that. With that in place, there should be nothing more to implement in targets itself.

If/when we get that far, the value added from tar_metaflow() will just be the versioning system. But that in itself is a big deal, and it's something targets is never going to have on its own. (targets instead tries to make the data store light and readable so third-party data versioning tools have an easier time.)

wlandau commented on June 11, 2024

I need to figure out how to write download_artifact_from_aws(), but that should be straightforward in principle.

A bigger issue is probably the way tar_metaflow() creates an entire new flow for each new target. This could lead to thousands of flows in practice, and I do not know if that will incur extra overhead. We could alternatively try to stick to a single flow for the entire targets pipeline, but that flow would have a completely different definition for each target, which might not bode well either.

wlandau commented on June 11, 2024

> Quick note on this: Metaflow doesn't support anonymous functions as written here

Seems straightforward to work around if we define a function from inside the command for the target.

> I think it's an easy, non-breaking change and I've drafted some code. I'll clean it up and submit that as a pull request to the repo.

Thank you so much, David! Really looking forward to this! If it works out, it could be a huge win-win.

wlandau commented on June 11, 2024

My opinion is changing on this one. I think tar_metaflow() would still be nice for a small number of targets that need both AWS computing and S3 data versioning. However, since targets (and drake) already do distributed computing on clusters, I think AWS ParallelCluster might be a more natural fit for heavily scaled-out pipelines. Related: ropensci-books/targets#21.

wlandau commented on June 11, 2024

On reflection, I am closing this issue. The maintainers of clustermq and future have expressed interest in supporting some form of AWS compute, which would automatically let targets deploy work to the cloud. I believe this is the best route for targets.
