
Comments (4)

gpbatra commented on July 3, 2024

Quick correction here (just to make sure we all have consistent terminology -- Kaan, please correct me if there's better terminology for each concept).

Task = a worker. Easiest to think of this as a single vCPU with dedicated resources, isolated from all other tasks. However, it could also be a completely isolated CPU thread on a single machine with 10 CPUs.

Each task gets a number of data shards. A single shard could be something like a single row, or could be all rows with sample_id X. Shard definition is determined by the programmer -- shard routing is handled by dataflow.

Dataflow is in charge of sending shards to tasks. We have no control over this.


For this case, the most scalable approach is for each task to instantiate a client once and only once, to be used across as many shards as it receives, as opposed to

  • every task sharing one global client
  • every task opening up a new client for every shard that it sees.
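A sketch of that per-task pattern (plain Python rather than the actual Beam SDK; the class and all names are illustrative, with a placeholder object standing in for a real GCS client):

```python
class ShardProcessor:
    """Illustrative per-task pattern: the client is created once per task
    and reused for every shard the task receives (names are hypothetical)."""

    def __init__(self):
        self.client = None
        self.clients_created = 0

    def setup(self):
        # Runs once when the task starts: instantiate the client exactly once.
        if self.client is None:
            self.client = object()  # stand-in for e.g. a GCS storage client
            self.clients_created += 1

    def process_shard(self, shard):
        # Every shard routed to this task reuses the same cached client.
        return (shard, id(self.client))


task = ShardProcessor()
task.setup()
results = [task.process_shard(s) for s in ["shard-a", "shard-b", "shard-c"]]
print(task.clients_created)  # 1 — one client regardless of shard count
```

The key point is that the client's lifetime is tied to the task, not to any individual shard, so the connection-setup cost is paid once.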

from ml4h.

kyuksel commented on July 3, 2024

After looking at some of the Beam/Dataflow documentation and online discussions, the following terms seem to be the most relevant:

  • PCollection: every step in a pipeline takes in and outputs one or more of them; e.g. in our pipeline, the list of (sample_id, list of rows) tuples constitute a PCollection
  • Bundle: a piece of a PCollection that gets sent to a worker (e.g. 10 of the (sample_id, list of rows) tuples)
  • Worker: hardware that executes an operation on a bundle

From Beam execution model:

Instead of processing all elements simultaneously, the elements in a PCollection are processed in bundles. The division of the collection into bundles is arbitrary and selected by the runner. This allows the runner to choose an appropriate middle-ground between persisting results after every element, and having to retry everything if there is a failure. For example, a streaming runner may prefer to process and commit small bundles, and a batch runner may prefer to process larger bundles.

Things get fuzzier with the Dataflow runner's dynamic work rebalancing scheme, where it may send bundles to a number of workers and then, if it detects stragglers, take pieces of bundles away from the stragglers and give them to workers that have already finished and/or to newly spawned workers.

TL;DR
It seems like the best we can do/control is to have one bundle use a single GCS client. Most of the time, that'll probably be the only client a worker uses. If it happens to get additional bundles to work on, it would create a new client for each one, which people say is fine and Beam-idiomatic.
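A sketch of that per-bundle variant (again plain Python, not the Beam SDK; the method names mirror Beam's start_bundle/finish_bundle hooks, and the client is a placeholder):

```python
class BundleFn:
    """Illustrative per-bundle pattern: a fresh client per bundle,
    released when the bundle finishes (all names are hypothetical)."""

    def __init__(self):
        self.client = None
        self.clients_created = 0

    def start_bundle(self):
        # The runner calls this before each bundle: open a client for it.
        self.client = object()  # stand-in for a real GCS client
        self.clients_created += 1

    def process(self, element):
        return (element, id(self.client))

    def finish_bundle(self):
        # The runner calls this after each bundle: release the client.
        self.client = None


fn = BundleFn()
for bundle in [["e1", "e2"], ["e3"]]:  # a worker may receive several bundles
    fn.start_bundle()
    for element in bundle:
        fn.process(element)
    fn.finish_bundle()
print(fn.clients_created)  # 2 — one client per bundle
```

Compared with the per-task pattern, this trades a little extra client churn (one per bundle instead of one per worker) for not having to reason about client lifetime across bundles.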


gpbatra commented on July 3, 2024

Hi -- a single worker can get multiple bundles. Most efficient is to have one client per worker, as discussed.

Dan's code implements per-bundle start/finish hooks.

According to the Apache Beam docs, we could be using per-task setup/teardown functions:
https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/transforms/DoFn.Teardown.html
https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/transforms/DoFn.Setup.html

Note that these docs are for the Java SDK, so hopefully the hooks have made it into Python as well. This is a fairly critical and well-known part of the MR paradigm, so I'd be very surprised if they weren't implemented.

Example things that are a good idea to do in this method:

  • Close a network connection that was opened in DoFn.Setup
  • Shut down a helper process that was started in DoFn.Setup
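For what it's worth, the Python SDK does expose matching hooks (DoFn.setup and DoFn.teardown, alongside start_bundle and finish_bundle). Putting them together, the lifecycle is setup → (start_bundle → process* → finish_bundle)* → teardown. A plain-Python sketch of that ordering (not the Beam SDK; a recording list stands in for real side effects like opening and closing a network connection):

```python
class LifecycleFn:
    """Records the order of Beam-style lifecycle hooks (illustrative only)."""

    def __init__(self):
        self.calls = []

    def setup(self):
        self.calls.append("setup")          # e.g. open a network connection

    def start_bundle(self):
        self.calls.append("start_bundle")

    def process(self, element):
        self.calls.append("process")

    def finish_bundle(self):
        self.calls.append("finish_bundle")

    def teardown(self):
        self.calls.append("teardown")       # e.g. close the connection from setup


fn = LifecycleFn()
fn.setup()                                  # once per DoFn instance
for bundle in [["a"], ["b", "c"]]:
    fn.start_bundle()
    for element in bundle:
        fn.process(element)
    fn.finish_bundle()
fn.teardown()                               # once, when the instance is retired
```

This is why a client opened in setup and closed in teardown is created once per worker instance, while one opened in start_bundle is created once per bundle.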


erikr commented on July 3, 2024

@gpbatra is this issue sufficiently stale to be closed?

