Comments (5)
@dgrahn: It could work, but it would definitely take some legwork :). If there is interest in working on a Slurm plugin (the first step toward making this compatible with Slurm), we can definitely help, but we would probably need some support from the community for testing and so on.
from metaflow.
@dgrahn We currently don't integrate with Slurm, but it's an interesting idea. @romain-intel Do you have any suggestions for Slurm alternatives for HPC workloads?
It's been a while, and I worked with Torque at the time, but Slurm is definitely something we can consider. I suppose that, since you mention Slurm, you intend to set up multiple DSS-8440s in a cluster and would like to evaluate whether Metaflow can help run large workloads on such a cluster. It could work, but I am not sure it is quite ready for prime time. You would still need to set up quite a few things to make sure it all works properly (Slurm, isolation of the accelerators if you wanted to share them among multiple flows, etc.). Metaflow would also not necessarily take advantage of some of Slurm's features (requesting multiple nodes at the same time, for example), since Metaflow "communicates" through a central location (S3 in the case of the AWS integration, but you could imagine a shared file system in a cluster-like setup).
On a single DSS-8440, you could use Metaflow without Slurm as well and make use of the multiple accelerators that way (since Metaflow would launch multiple processes on the same machine).
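To make the "multiple processes on the same machine" idea concrete, here is a minimal sketch of pinning one worker process to each accelerator. It is not Metaflow code and does not show how Metaflow does this internally. It assumes NVIDIA GPUs, where the `CUDA_VISIBLE_DEVICES` environment variable limits which devices a process can see; `launch_workers`, `WORKER_CODE`, and the accelerator count are hypothetical names and values chosen for illustration.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for a real workload: the child process only reports
# which device it was allowed to see.
WORKER_CODE = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"


def launch_workers(num_accelerators):
    """Start one child process per accelerator and collect what each one saw."""
    procs = []
    for gpu_id in range(num_accelerators):
        # Copy the parent environment and restrict the child to a single device.
        # Setting the variable before the child starts means any CUDA-based
        # library it imports will only see that one device.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
        procs.append(
            subprocess.Popen(
                [sys.executable, "-c", WORKER_CODE],
                env=env,
                stdout=subprocess.PIPE,
                text=True,
            )
        )
    # All children are already running in parallel; now wait for each in turn.
    return [proc.communicate()[0].strip() for proc in procs]


if __name__ == "__main__":
    # The count of 4 is an assumption; use however many accelerators are installed.
    for gpu_id, seen in enumerate(launch_workers(4)):
        print(f"worker {gpu_id} sees device {seen}")
```

In a flow, each parallel branch would play the role of one worker here, and the same kind of per-process restriction is one possible way to approach the accelerator isolation mentioned above.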
I am not sure of your exact use case, but I am happy to discuss it a little more. It's been a while since I worked in HPC, but I can hopefully still have a somewhat coherent conversation :).
@romain-intel It'll be one DSS-8440 and a few lower-powered GPU machines that were previously available. It sounds like Metaflow might not be the right technology for that use case at this point in time.
Any progress? :)