Comments (8)
Quick note on this: Metaflow doesn't support anonymous functions as written here. I think it's an easy, non-breaking change and I've drafted some code. I'll clean it up and submit that as a pull request to the repo.
from tarchetypes.
I read up more on AWS ParallelCluster, AWS Batch, and Metaflow's HPC support, and I no longer think ParallelCluster makes sense to integrate with directly. I think a Metaflow target archetype makes more sense to start with, and the versioning could still help even after `targets` adopts the cloud.
Some future development ideas:
- For S3 storage on its own, let's try #8 (using https://mdneuzerling.com/post/sourcing-data-from-s3-with-drake/).
- AWS Batch scheduling as an externalized algorithm subclass (related: ropensci/targets#148). Should look like the existing `clustermq` and `future` algorithm subclasses but built on top of `paws::batch()`.
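To make the second idea concrete, here is a minimal sketch of the submit-and-poll shape such a backend would need. It is written in Python against a boto3-style Batch client (the `paws::batch()` interface in R mirrors the same API), and the function name, queue, and job definition are placeholders, not anything that exists in `targets` today.

```python
import time

# Hypothetical sketch of one piece of an AWS Batch scheduling backend.
# `client` is a boto3-style Batch client; paws::batch() exposes the same
# operations in R. All resource names here are illustrative placeholders.

def submit_and_wait(client, name, queue, definition, poll_seconds=5):
    """Submit one job and block until it reaches a terminal state."""
    job = client.submit_job(jobName=name, jobQueue=queue, jobDefinition=definition)
    job_id = job["jobId"]
    while True:
        described = client.describe_jobs(jobs=[job_id])
        status = described["jobs"][0]["status"]
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(poll_seconds)
```

A real algorithm subclass would multiplex many jobs at once rather than blocking on each one; this only shows the per-job submit/poll loop.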
Just learned some neat stuff from experimenting with metaflow.org/sandbox. It gave me another idea for AWS S3 integration in `targets`: ropensci/targets#154.
Update: thanks to http://metaflow.org/sandbox, I think I figured out what AWS S3 integration in `targets` should look like, and we're off to a great start: ropensci/targets#176.
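One plausible shape for that S3 integration, assuming a bucket with versioning enabled: store each target's data as one object and record the version ID in the pipeline's metadata, so old results remain retrievable by version. The sketch below uses boto3-style S3 calls; the function names and key layout are illustrative, not the actual `targets` implementation.

```python
# Illustrative sketch of versioned S3 storage for a target's data.
# `s3` is a boto3-style S3 client; bucket versioning is assumed enabled.
# Function names and the "targets/" key prefix are placeholders.

def store_target(s3, bucket, target_name, data: bytes) -> str:
    """Upload a target's serialized data and return the S3 version ID."""
    response = s3.put_object(Bucket=bucket, Key=f"targets/{target_name}", Body=data)
    return response["VersionId"]

def load_target(s3, bucket, target_name, version_id: str) -> bytes:
    """Fetch the exact object version recorded in the pipeline's metadata."""
    response = s3.get_object(
        Bucket=bucket, Key=f"targets/{target_name}", VersionId=version_id
    )
    return response["Body"].read()
```

The point of keeping the version ID in metadata is that re-running an old pipeline state can fetch exactly the bytes it saw before, even after the object has been overwritten.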
AWS Batch integration is going to be a lot harder. What we really need is a `batchtools` or `clustermq` for AWS Batch, plus a `future` extension on top of that. With that in place, there should be nothing more to implement in `targets` itself.
If/when we get that far, the value added from `tar_metaflow()` will just be the versioning system. But that in itself is a big deal, and it's something `targets` is never going to have on its own. (`targets` instead tries to make the data store light and readable so third-party data versioning tools have an easier time.)
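As a toy illustration of what "light and readable" means here: a data store whose metadata is one plain-text line per target, pairing the target's name with a content hash, is trivial for an external versioning tool to diff and track. The record format below is purely illustrative, not the actual `targets` metadata schema.

```python
import hashlib

# Toy sketch of a flat, human-readable metadata record: target name plus a
# short content hash, one line per target. An external tool (e.g. git) can
# diff such a file cheaply. The format is illustrative only.

def metadata_line(target_name: str, data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()[:16]  # short content hash
    return f"{target_name}\t{digest}"
```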
I need to figure out how to write `download_artifact_from_aws()`, but that should be straightforward in principle.
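A hedged sketch of what `download_artifact_from_aws()` could look like on the Python side: given a run handle from Metaflow's client API (e.g. `metaflow.Run("FlowName/run_id")`, which an R wrapper could reach via reticulate), pull one named artifact out of the run's data container. The function name and error handling are my own placeholders.

```python
# Hypothetical sketch of download_artifact_from_aws(). `run` is a Metaflow
# client Run object, whose .data attribute exposes the finished run's
# artifacts as attributes. Names here are placeholders.

def download_artifact(run, artifact_name: str):
    """Return one artifact from a finished Metaflow run's data container."""
    data = run.data
    if not hasattr(data, artifact_name):
        raise KeyError(f"artifact {artifact_name!r} not found")
    return getattr(data, artifact_name)
```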
A bigger issue is probably the way `tar_metaflow()` creates an entire new flow for each new target. This could lead to thousands of flows in practice, and I do not know if that will incur extra overhead. We could alternatively try to stick to a single flow for the entire `targets` pipeline, but that flow would have a completely different definition for each target, which might not bode well either.
> Quick note on this: Metaflow doesn't support anonymous functions as written here

Seems straightforward to work around if we define a function from inside the command for the target.

> I think it's an easy, non-breaking change and I've drafted some code. I'll clean it up and submit that as a pull request to the repo.

Thank you so much, David! Really looking forward to this! If it works out, it could be a huge win-win.
My opinion is changing on this one. I think `tar_metaflow()` would still be nice for a small number of targets that need both AWS computing and S3 data versioning. However, since `targets` (and `drake`) already do distributed computing on clusters, I think AWS ParallelCluster might be a more natural fit for heavily scaled-out pipelines. Related: ropensci-books/targets#21.
On reflection, I am closing this issue. The maintainers of `clustermq` and `future` have expressed interest in supporting some form of AWS compute, which would automatically let `targets` deploy work to the cloud. I believe this is the best route for `targets`.