Comments (6)
Another use case is to load a dataset in a balanced way.
It is also similar to environment variables; maybe we can unify these two functionalities in some way, or provide a better one that replaces both.
We also need to think a little about the API, i.e. which arguments the function should take (perhaps the total number of nodes or workers and the current worker index). How does it interact with the scheduler? And how do we make sure people don't abuse it to circumvent the scheduler?
I'd say add it with a disclaimer that it is experimental, encourage people to use it rarely and see what it gets used for.
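To make the balanced-loading use case concrete, here is a minimal sketch of what a per-worker function could do if it received the total number of workers and its own index, as suggested above. The name shard_for_worker and the file names are hypothetical and purely illustrative; nothing here is an existing Ray API.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_worker(items: Sequence[T], num_workers: int, worker_index: int) -> List[T]:
    """Return the items that the worker with 0-based index `worker_index` should load.

    Hypothetical helper: assumes every worker is told the same `num_workers`
    and a distinct `worker_index`.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in [0, num_workers)")
    # Round-robin striding keeps shard sizes within one item of each other.
    return list(items[worker_index::num_workers])


# Illustrative dataset of 10 files split across 4 workers.
files = [f"part-{i:03d}.csv" for i in range(10)]
shards = [shard_for_worker(files, 4, i) for i in range(4)]
```

Each file lands in exactly one shard, and no shard is more than one file larger than another.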
from ray.
As we discussed earlier, one elegant approach is to provide a single method for running a function on all workers, and pass a counter into that function indicating how many other workers on that machine have already started executing that function (using Redis to do this atomically).
So if there are 4 workers on one machine, and we call something like ray.run_function_on_all_workers(f), then one worker will call f(1), one worker will call f(2), one will call f(3), and one will call f(4).
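The counter behaviour described above can be simulated in a single process. This is only a sketch under stated assumptions: threads stand in for workers, a lock-protected dictionary stands in for an atomic Redis increment, and NodeCounters and simulate_run_on_all_workers are hypothetical names, not Ray functions.

```python
import threading
from collections import defaultdict
from typing import Callable, Dict, List


class NodeCounters:
    """In-process stand-in for an atomic per-node counter (e.g. a Redis increment)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Dict[str, int] = defaultdict(int)

    def increment(self, node_id: str) -> int:
        # The lock makes read-modify-write atomic, as the real store would.
        with self._lock:
            self._counts[node_id] += 1
            return self._counts[node_id]


def simulate_run_on_all_workers(
    f: Callable[[int], None],
    workers_per_node: Dict[str, int],
    counters: NodeCounters,
) -> None:
    """Run f once per simulated worker, passing that worker's per-node counter value."""

    def worker(node_id: str) -> None:
        f(counters.increment(node_id))

    threads = [
        threading.Thread(target=worker, args=(node_id,))
        for node_id, count in workers_per_node.items()
        for _ in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


seen: List[int] = []
seen_lock = threading.Lock()


def f(counter: int) -> None:
    # Record which counter value this worker received.
    with seen_lock:
        seen.append(counter)


simulate_run_on_all_workers(f, {"node-a": 4}, NodeCounters())
```

With four simulated workers on one node, f receives each of the values 1 through 4 exactly once, in an unspecified order.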
This is a very interesting question/discussion. To address this specific use case, I think what you are looking for is an anti-affinity placement constraint, i.e. the ability to run one task per locality domain (e.g., per node, per worker, per rack). We could provide this functionality in the global scheduler by extending the scheduler API to accept a bag of tasks (collectively defined as a job) and a placement constraint associated with that job. Then, when the placement decision is made for this bag of tasks, it is either made in a way that honors the constraint, or the job is atomically rejected if the constraint cannot be satisfied. Of course, it would be ideal to do this in a general way and to make the placement constraint specification optional (perhaps even as a separate loadable module).
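As a rough illustration of the "bag of tasks plus constraint, placed or rejected as a whole" idea, here is a toy placement function. It is a sketch only: place_with_anti_affinity, the task names, and the free_slots map are hypothetical, and a real scheduler would account for far more than free slots.

```python
from typing import Dict, List, Optional


def place_with_anti_affinity(
    tasks: List[str], free_slots: Dict[str, int]
) -> Optional[Dict[str, str]]:
    """Assign every task to a distinct node, or reject the whole job.

    Returns a task -> node mapping, or None if there are fewer usable nodes
    than tasks. Nothing is mutated, so a rejected job leaves no partial placement.
    """
    # Only nodes with at least one free slot are candidates.
    candidates = [node for node, slots in sorted(free_slots.items()) if slots > 0]
    if len(tasks) > len(candidates):
        return None  # constraint cannot be satisfied: reject atomically
    return dict(zip(tasks, candidates))


cluster = {"node-a": 1, "node-b": 0, "node-c": 2}
accepted = place_with_anti_affinity(["t1", "t2"], cluster)
rejected = place_with_anti_affinity(["t1", "t2", "t3"], cluster)
```

Here the two-task job is placed on node-a and node-c, while the three-task job is rejected because only two nodes have free slots.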
Another take on this would be to approach the problem from an operating-systems perspective. We could think about a basic primitive that Ray could expose (for instance, a scoped atomic counter backed by Redis) that enables ensembles of distributed Ray tasks to do leader election, for example. Counters could be node- or worker-scoped and persist for the lifetime of the task ensemble. It is easy to see how node-scoped atomic counters would enable "at most once per node" functionality, while worker-scoped atomic counters would enable "at most once per worker" functionality. A Ray function that relies on some "once per worker" or "once per node" pre-processing would add a simple if statement that checks the worker-scoped or node-scoped atomic counter and calls init() if the counter is zero. init() could run either in the same task context or as a separate task. The latter requires a mechanism that guarantees init() runs in the same locale as the caller, so some minimal placement/locality awareness is still needed here. But we could make it relative (as opposed to absolute) to preserve the resource abstraction: a locality constraint could be supported in the form "same locale as me" (affinity) or "not same locale as me" (anti-affinity).
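The "check the scoped counter, call init() if it is zero" pattern might look roughly like the following. This is a sketch under assumptions: ScopedCounters is a hypothetical in-process stand-in for the Redis-backed primitive, and init() runs in the same task context rather than as a separate task.

```python
import threading
from typing import Callable, Dict, List, Tuple


class ScopedCounters:
    """Hypothetical scoped atomic counters, keyed by (scope, id)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: Dict[Tuple[str, str], int] = {}

    def fetch_and_increment(self, scope: str, key: str) -> int:
        """Atomically return the previous value and add one."""
        with self._lock:
            previous = self._values.get((scope, key), 0)
            self._values[(scope, key)] = previous + 1
            return previous


def task(counters: ScopedCounters, node_id: str, init: Callable[[str], None]) -> str:
    # Only the first task to arrive on this node sees zero and runs init().
    if counters.fetch_and_increment("node", node_id) == 0:
        init(node_id)
    return f"work done on {node_id}"


initialized: List[str] = []
counters = ScopedCounters()

# Three tasks land on node-a and two on node-b.
results = [
    task(counters, node, initialized.append)
    for node in ["node-a", "node-a", "node-b", "node-a", "node-b"]
]
```

init runs once for node-a and once for node-b. Note that in a concurrent setting this sketch does not make later tasks wait for init() to finish; that would need an additional "ready" signal.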
Attempting to achieve everything we need by using what the system already provides is the way to go. As tempting as it is, I would discourage side-channel (i.e. internal/invisible) data/task propagation/distribution/broadcast. Thinking about and exposing expressive/composable basic system primitives will make Ray feel more and more like a microkernel!
Closing for now.
Just wanted to second this request. In my context, I want to set some defaults for libraries I use (Python logging, NumPy print options, etc.) on all workers. I'm working around it by just stuffing this code in __init__.py, which gets executed on all the workers, but this is pretty nasty.
@robertnishihara
Hi, in the released version 1.13.0, ray.worker doesn't have the get_global_worker() function (ray.worker.get_global_worker()). How can I run a function on all workers now?