
Comments (4)

gpbatra commented on July 3, 2024

Quick correction here (just to make sure we all have consistent terminology -- Kaan, please correct me if there's better terminology for each concept).

Task = a worker. Easiest to think of this as a single vCPU with dedicated resources, isolated from all other tasks. However, it could also be a completely isolated CPU thread on a single machine with 10 CPUs.

Each task gets a number of data shards. A single shard could be something like a single row, or could be all rows with sample_id X. Shard definition is determined by the programmer -- shard routing is handled by dataflow.

Dataflow is in charge of sending shards to tasks. We have no control over this.


For this case, the most scalable approach is for each task to instantiate a client once and only once, to be used across as many shards as it receives, as opposed to

  • every task sharing one global client
  • every task opening up a new client for every shard that it sees.
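A sketch of that per-task pattern (plain Python rather than the actual Beam SDK; the class and all names are illustrative, with a placeholder object standing in for a real GCS client):

```python
class ShardProcessor:
    """Illustrative per-task pattern: the client is created once per task
    and reused for every shard the task receives (names are hypothetical)."""

    def __init__(self):
        self.client = None
        self.clients_created = 0

    def setup(self):
        # Runs once when the task starts: instantiate the client exactly once.
        if self.client is None:
            self.client = object()  # stand-in for e.g. a GCS storage client
            self.clients_created += 1

    def process_shard(self, shard):
        # Every shard routed to this task reuses the same cached client.
        return (shard, id(self.client))


task = ShardProcessor()
task.setup()
results = [task.process_shard(s) for s in ["shard-a", "shard-b", "shard-c"]]
print(task.clients_created)  # 1 — one client regardless of shard count
```

The key point is that the client's lifetime is tied to the task, not to any individual shard, so the connection-setup cost is paid once.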

from ml4h.

kyuksel commented on July 3, 2024

After looking at some of the Beam/Dataflow documentation and online discussions, the following terms seem to be the most relevant:

  • PCollection: every step in a pipeline takes in and outputs one or more of them; e.g. in our pipeline, the list of (sample_id, list of rows) tuples constitute a PCollection
  • Bundle: a piece of a PCollection that gets sent to a worker (e.g. 10 of the (sample_id, list of rows) tuples)
  • Worker: hardware that executes an operation on a bundle

From Beam execution model:

Instead of processing all elements simultaneously, the elements in a PCollection are processed in bundles. The division of the collection into bundles is arbitrary and selected by the runner. This allows the runner to choose an appropriate middle-ground between persisting results after every element, and having to retry everything if there is a failure. For example, a streaming runner may prefer to process and commit small bundles, and a batch runner may prefer to process larger bundles.

Things get fuzzier with the Dataflow runner's dynamic work rebalancing scheme, where it may send bundles to a number of workers and then, if it detects stragglers, take pieces of bundles away from the stragglers and give them to workers that have already finished and/or to newly spawned workers.

TL;DR
It seems like the best we can do/control is to have one bundle use a single GCS client. Most of the time, that'll probably be the only client a worker uses. If it happens to get additional bundles to work on, it would create a new client for each one, which people say is fine and Beam-idiomatic.
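A sketch of that per-bundle variant (again plain Python, not the Beam SDK; the method names mirror Beam's start_bundle/finish_bundle hooks, and the client is a placeholder):

```python
class BundleFn:
    """Illustrative per-bundle pattern: a fresh client per bundle,
    released when the bundle finishes (all names are hypothetical)."""

    def __init__(self):
        self.client = None
        self.clients_created = 0

    def start_bundle(self):
        # The runner calls this before each bundle: open a client for it.
        self.client = object()  # stand-in for a real GCS client
        self.clients_created += 1

    def process(self, element):
        return (element, id(self.client))

    def finish_bundle(self):
        # The runner calls this after each bundle: release the client.
        self.client = None


fn = BundleFn()
for bundle in [["e1", "e2"], ["e3"]]:  # a worker may receive several bundles
    fn.start_bundle()
    for element in bundle:
        fn.process(element)
    fn.finish_bundle()
print(fn.clients_created)  # 2 — one client per bundle
```

Compared with the per-task pattern, this trades a little extra client churn (one per bundle instead of one per worker) for not having to reason about client lifetime across bundles.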


gpbatra commented on July 3, 2024

Hi -- a single worker can get multiple bundles. Most efficient is to have one client per worker, as discussed.

Dan's code implements per-bundle start/finish hooks.

According to the Apache Beam docs, we could be using per-task setup/teardown functions:
https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/transforms/DoFn.Teardown.html
https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/transforms/DoFn.Setup.html

Note that these docs are for the Java SDK, so hopefully the hooks have made it into Python as well. This is a fairly critical and well-known part of the MR paradigm, so I'd be very surprised if they weren't implemented.

Example things that are a good idea to do in this method:

  • Close a network connection that was opened in DoFn.Setup
  • Shut down a helper process that was started in DoFn.Setup
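For what it's worth, the Python SDK does expose matching hooks (DoFn.setup and DoFn.teardown, alongside start_bundle and finish_bundle). Putting them together, the lifecycle is setup → (start_bundle → process* → finish_bundle)* → teardown. A plain-Python sketch of that ordering (not the Beam SDK; a recording list stands in for real side effects like opening and closing a network connection):

```python
class LifecycleFn:
    """Records the order of Beam-style lifecycle hooks (illustrative only)."""

    def __init__(self):
        self.calls = []

    def setup(self):
        self.calls.append("setup")          # e.g. open a network connection

    def start_bundle(self):
        self.calls.append("start_bundle")

    def process(self, element):
        self.calls.append("process")

    def finish_bundle(self):
        self.calls.append("finish_bundle")

    def teardown(self):
        self.calls.append("teardown")       # e.g. close the connection from setup


fn = LifecycleFn()
fn.setup()                                  # once per DoFn instance
for bundle in [["a"], ["b", "c"]]:
    fn.start_bundle()
    for element in bundle:
        fn.process(element)
    fn.finish_bundle()
fn.teardown()                               # once, when the instance is retired
```

This is why a client opened in setup and closed in teardown is created once per worker instance, while one opened in start_bundle is created once per bundle.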


erikr commented on July 3, 2024

@gpbatra is this issue sufficiently stale to be closed?

