Support HPC & GPU local clusters (metaflow issue, open, 8 comments)

netflix commented on August 27, 2024
Support HPC & GPU local clusters

from metaflow.

Comments (8)

dgasmith commented on August 27, 2024

You may want to consider opening this up to arbitrary workload managers such as Parsl, Dask (job queue), RADICAL, etc. While these are typically full workflow/workload managers, their core workload capability can be used to execute arbitrary tasks on these machines. Some serious thought will need to go into this, since you manage your own environments as well, but there has been some pretty hefty work on getting these kinds of task management systems onto SLURM/PBS and general academic clusters and leadership platforms.

romain-intel commented on August 27, 2024

To add a bit of flavor on #29, which was very specific to Slurm and accelerators: the batch plugin that is currently included is basically a thin way of launching a process on a remote machine (in this case, a batch instance). You could write a similar plugin for your specific environment. You would basically need to provide:

  • a commonly accessible object store (batch uses S3, but you could use NFS-mounted partitions, Lustre partitions, or whatever shared file system you have)
  • a way to remotely launch and monitor a process and collect its stdout/stderr. This could technically be as simple as a slightly fancier ssh invocation that runs a command on the remote host
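
The two requirements above could be sketched roughly as follows. This is a minimal illustration and not Metaflow code: `SharedDirStore`, `ssh_argv`, and `launch_and_collect` are hypothetical names introduced here. The store assumes a directory that is visible on every node (for example an NFS or Lustre mount), and the launcher simply runs an argv and captures its output; on a real cluster that argv would be an `ssh` or scheduler command.

```python
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path


class SharedDirStore:
    """Hypothetical key/value store backed by a directory on a shared file system."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key, data: bytes):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write under a temporary name, then rename, so readers never see a partial file.
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_bytes(data)
        tmp.replace(path)

    def get(self, key) -> bytes:
        return (self.root / key).read_bytes()


def ssh_argv(host, remote_argv):
    """Build an ssh command line (assumes key-based login to `host` already works)."""
    # ssh hands the command to a remote shell, so quote it as one string.
    return ["ssh", host, shlex.join(remote_argv)]


def launch_and_collect(argv, timeout=None):
    """Run a process, wait for it, and return (exit_code, stdout, stderr)."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    store = SharedDirStore(tempfile.mkdtemp())
    store.put("run1/input", b"hello")
    # Runs locally here; on a cluster this could be ssh_argv("node01", [...]).
    code, out, err = launch_and_collect([sys.executable, "-c", "print('done')"])
    print(code, out.strip(), store.get("run1/input"))
```

A real plugin would also need to poll a long-running job rather than block on it, which this sketch leaves out.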

Without knowing your exact setup, it's hard for me to help further but hopefully this helps a little bit.

romain-intel commented on August 27, 2024

We currently have little documentation regarding the internals of Metaflow (our initial release is primarily targeted at users of Metaflow rather than its developers). Feel free, however, to ask any questions here. To get you started, the batch plugin is in plugins/aws/batch. In there, a few things to keep in mind:

  • When launching something on batch, Metaflow actually launches two things: a local process that is responsible for launching a batch job and monitoring it, and then the actual batch job.
  • The local process is launched through batch_cli.py (i.e., a command line that contains batch step ...).
  • This command line is generated in batch_decorator.py, in the runtime_step_cli function. When a step needs to execute, the runtime (see runtime.py) launches a subprocess to execute it, and to do so it generates a command line with which to call Metaflow (see the use of runtime_step_cli in runtime.py).
  • batch.py provides the functionality called by batch_cli.py.
  • batch_client.py contains the code that communicates with the AWS Batch backend. This is probably the part that you will need to change the most to talk to Slurm (or whatever else you want).
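
The command-line rewriting described above can be illustrated with a small, self-contained sketch. It only mirrors the pattern and does not reproduce Metaflow's actual classes or signatures: `CLIArgs`, `SlurmDecoratorSketch`, `build_step_argv`, the `slurm step` subcommand, and the `partition` option are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CLIArgs:
    """Hypothetical stand-in for the command line a runtime builds for one step."""
    entrypoint: list
    commands: list = field(default_factory=lambda: ["step"])
    command_args: list = field(default_factory=list)
    command_options: dict = field(default_factory=dict)

    def to_argv(self):
        argv = list(self.entrypoint) + list(self.commands) + list(self.command_args)
        for name, value in self.command_options.items():
            argv += ["--" + name, str(value)]
        return argv


class SlurmDecoratorSketch:
    """Hypothetical decorator that reroutes a step through a scheduler subcommand."""

    def __init__(self, partition="gpu"):
        self.partition = partition

    def runtime_step_cli(self, cli_args):
        # Swap the plain `step` command for `slurm step`: the local launcher
        # process that would submit and monitor the real job.
        cli_args.commands = ["slurm", "step"]
        cli_args.command_options["partition"] = self.partition


def build_step_argv(flow_file, step_name, decorators):
    """What a runtime could do just before spawning the subprocess for a step."""
    args = CLIArgs(entrypoint=["python", flow_file], command_args=[step_name])
    for deco in decorators:
        deco.runtime_step_cli(args)
    return args.to_argv()


print(build_step_argv("flow.py", "train", []))
print(build_step_argv("flow.py", "train", [SlurmDecoratorSketch()]))
```

In this sketch the decorator never runs the job itself. It only changes which subcommand the local subprocess executes, which matches the two-process split described in the first bullet.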

Let me know if you have more questions. As I mentioned, I'm happy to provide more information or help if needed. Please also see the caveats I posted in #29.

savingoyal commented on August 27, 2024

Please check #29.

oguitart commented on August 27, 2024

Thank you for the explanation. We have an HPC cluster with SGE and a GPU cluster with Slurm, so I think your suggestion of creating a plugin is the right way to go. I need to start checking the documentation and code to understand how to create this kind of plugin.

oguitart commented on August 27, 2024

Thank you very much for all the information. I'll let you know if I have any questions.

IanQS commented on August 27, 2024

Is it possible to run Metaflow across a local cluster of machines? I've got a cluster of machines locally, and I'd rather use that than prematurely deploy to AWS when I may not need it. I've tried googling "metaflow local cluster", but this issue was the first result and the rest didn't look particularly relevant (all of them advocate training and verifying locally before scaling to AWS).

savingoyal commented on August 27, 2024

@IanQS If you can deploy Kubernetes on top of this cluster, then our latest release, which adds support for Kubernetes, will get you going.
