Comments (4)
Quick correction here (just to make sure we all have consistent terminology -- Kaan, please correct me if there's better terminology for each concept).
Task = a worker. Easiest to think of this as a single vCPU with dedicated resources, in isolation from all other tasks. However, it could also be a completely isolated CPU thread on a single machine with 10 CPUs.
Each task gets a number of data shards. A single shard could be something like a single row, or could be all rows with sample_id X. Shard definition is determined by the programmer -- shard routing is handled by Dataflow.
Dataflow is in charge of sending shards to tasks. We have no control over this.
For this case, the most scalable approach is for each task to instantiate a client once and only once, to be reused across however many shards it receives, as opposed to
- every task sharing one global client, or
- every task opening a new client for every shard it sees.
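As an illustration of the per-task pattern above, here is a minimal plain-Python sketch (no Beam involved). `make_client`, `Task`, and `handle_shard` are hypothetical stand-ins for the real GCS client constructor and a Dataflow task, not actual ml4h code:

```python
# Hypothetical stand-ins for the real GCS client and per-shard work.
CLIENTS_CREATED = 0

def make_client():
    global CLIENTS_CREATED
    CLIENTS_CREATED += 1
    return object()  # pretend this is an expensive GCS client


class Task:
    """One worker: creates its client lazily, once, and reuses it for every shard."""

    def __init__(self):
        self._client = None

    def client(self):
        if self._client is None:      # expensive construction happens once per task
            self._client = make_client()
        return self._client

    def handle_shard(self, shard):
        c = self.client()             # same client object for every shard
        return (id(c), shard)


task = Task()
results = [task.handle_shard(s) for s in ["shard-a", "shard-b", "shard-c"]]
# One client total, no matter how many shards this task sees.
```

The lazy one-time construction is what makes this scale: client setup cost is paid once per task rather than once per shard.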
from ml4h.
After looking at some of the Beam/Dataflow documentation and online discussions, the following terms seem to be the most relevant:
- PCollection: every step in a pipeline takes in and outputs one or more of them; e.g. in our pipeline, the list of (sample_id, list of rows) tuples constitutes a PCollection
- Bundle: a piece of a PCollection that gets sent to a worker (e.g. 10 of the (sample_id, list of rows) tuples)
- Worker: hardware that executes an operation on a bundle
From the Beam execution model docs:
Instead of processing all elements simultaneously, the elements in a PCollection are processed in bundles. The division of the collection into bundles is arbitrary and selected by the runner. This allows the runner to choose an appropriate middle-ground between persisting results after every element, and having to retry everything if there is a failure. For example, a streaming runner may prefer to process and commit small bundles, and a batch runner may prefer to process larger bundles.
Things get fuzzier with the Dataflow runner's dynamic work rebalancing scheme: Dataflow may send bundles to a number of workers and then, if it detects stragglers, take pieces of bundles away from the stragglers and give them to workers that have already finished and/or to newly spawned workers.
TL;DR
It seems like the best we can do/control is to have each bundle use a single GCS client. Most of the time, that will probably be the only client a worker uses. If a worker happens to get additional bundles to work on, each would come with a new client, which people say is fine and Beam-idiomatic.
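A sketch of this per-bundle lifecycle, with a plain-Python driver standing in for the runner (which in reality decides bundle boundaries itself); `GcsClientStub` and `ReadRowsDoFn` are hypothetical names, not real ml4h code:

```python
class GcsClientStub:
    """Stand-in for a GCS client; counts how many get constructed."""
    count = 0

    def __init__(self):
        GcsClientStub.count += 1
        self.serial = GcsClientStub.count


class ReadRowsDoFn:
    """DoFn-style class: one client per bundle, via start/finish hooks."""

    def start_bundle(self):
        self.client = GcsClientStub()   # fresh client at each bundle start

    def process(self, element):
        yield (element, self.client.serial)

    def finish_bundle(self):
        self.client = None              # release the client with the bundle


def run_bundles(dofn, bundles):
    """Toy driver: the real runner chooses these bundle boundaries for us."""
    out = []
    for bundle in bundles:
        dofn.start_bundle()
        for element in bundle:
            out.extend(dofn.process(element))
        dofn.finish_bundle()
    return out


results = run_bundles(ReadRowsDoFn(), [["a", "b"], ["c"]])
# Two bundles -> two clients; elements within a bundle share one client.
```

Within a bundle every element sees the same client serial; a new bundle brings a new client, which is exactly the cost profile described above.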
from ml4h.
Hi -- a single worker can get multiple bundles. Most efficient is to have one client per worker, as discussed.
Dan's code implements per-bundle start/finish hooks.
According to Apache Beam, we could be using per-task setup/teardown functions:
https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/transforms/DoFn.Teardown.html
https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/transforms/DoFn.Setup.html
Note that these docs are for Java, so hopefully the hooks have made it into Python. This is a fairly critical and well-known part of the MapReduce paradigm, so I'd be very surprised if it were not implemented.
Example things that are a good idea to do in this method:
- Close a network connection that was opened in DoFn.Setup
- Shut down a helper process that was started in DoFn.Setup
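These hooks did land in the Python SDK, as `DoFn.setup()` and `DoFn.teardown()`. A plain-Python sketch of the lifecycle they enable -- the client now outlives individual bundles and is released once at the end. As before, `GcsClientStub`, `ReadRowsDoFn`, and the toy driver are hypothetical stand-ins, not real Beam or ml4h code:

```python
class GcsClientStub:
    """Stand-in for a GCS client; counts how many get constructed."""
    count = 0

    def __init__(self):
        GcsClientStub.count += 1


class ReadRowsDoFn:
    """DoFn-style class: one client per worker, via setup/teardown hooks."""

    def setup(self):
        self.client = GcsClientStub()   # once per DoFn instance (worker)

    def process(self, element):
        yield element                   # reuses self.client across bundles

    def teardown(self):
        self.client = None              # e.g. close the network connection here


def run_worker(dofn, bundles):
    """Toy driver: setup before any bundle, teardown after all of them."""
    out = []
    dofn.setup()
    for bundle in bundles:
        for element in bundle:
            out.extend(dofn.process(element))
    dofn.teardown()
    return out


results = run_worker(ReadRowsDoFn(), [["a", "b"], ["c"]])
# One client total across both bundles, unlike the per-bundle pattern.
```

Compared with the start_bundle/finish_bundle sketch, the same two bundles now cost a single client construction, which is the "one client per worker" behavior discussed above.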
from ml4h.
@gpbatra is this issue sufficiently stale to be closed?
from ml4h.