Comments (8)
To be more explicit, we would like other deployment solutions to have the same `CUDA_VISIBLE_DEVICES` tricks that are used within the `LocalCUDACluster` class within this repository. One simple-ish way to do this is to create optional hooks to replace the `dask-worker` command with a new command, which we will provide as `dask-cuda-worker`. This CLI utility would presumably start a few `Nanny`s with different environment variables set (there is an `env=` keyword in `Nanny`), similar to how `LocalCUDACluster` does.
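For context, the trick in question amounts to giving each worker process a different rotation of the device list, so worker `i` lists GPU `i` first while the other devices remain visible for peer access. A minimal sketch, where the helper name and rotation scheme are illustrative rather than the exact dask-cuda implementation:

```python
def cuda_visible_devices(i, device_ids):
    """Return a CUDA_VISIBLE_DEVICES string with device_ids rotated
    so that worker i lists its own GPU first."""
    n = len(device_ids)
    rotated = [device_ids[(i + j) % n] for j in range(n)]
    return ",".join(str(d) for d in rotated)

# With 4 GPUs: worker 0 gets "0,1,2,3", worker 1 gets "1,2,3,0", etc.
env_vars = [{"CUDA_VISIBLE_DEVICES": cuda_visible_devices(i, range(4))}
            for i in range(4)]
```

Each dictionary in `env_vars` would then be passed as the `env=` keyword of one `Nanny`.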
A concrete proposal for this on the dask-jobqueue side is referenced above in dask/dask-jobqueue#229 . But this approach could also be used in other tooling like dask-kubernetes, dask-ssh, and so on.
from dask-cuda.
One approach to this would be to create a thin wrapper around `dask-worker` that made some opinions about `--nprocs` and `--env` and then passed those opinions down to the main `dask-worker` command, maybe something like the following:
```python
import click

from distributed.cli.dask_worker import main as original_main

@click.command()
@click.argument(...)  # pass through everything
def main(*args, **kwargs):
    # figure_out_how_many_gpus_we_have and cuda_visible_devices are
    # hypothetical helpers here
    nprocs = figure_out_how_many_gpus_we_have()
    environment_variables = [cuda_visible_devices(i) for i in range(nprocs)]
    original_main(*args, nprocs=nprocs, env=environment_variables, **kwargs)
```
But I suspect that `click` has some special way to pass through keywords that isn't well represented here.
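For what it's worth, click's documented pattern for wrapping another CLI is to ignore unknown options and collect everything into a single unprocessed argument tuple. A hedged sketch, where the `echo` stands in for forwarding the collected arguments to distributed's real `dask-worker` entry point:

```python
import click

@click.command(context_settings=dict(ignore_unknown_options=True))
@click.argument("passthrough_args", nargs=-1, type=click.UNPROCESSED)
def main(passthrough_args):
    # In a real wrapper we would compute nprocs/env here and then forward
    # passthrough_args to distributed's dask-worker main().
    click.echo(" ".join(passthrough_args))
```

With `ignore_unknown_options=True`, flags that this wrapper does not define (e.g. `--nthreads`) land in `passthrough_args` instead of raising a usage error.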
This would require an `--env` keyword on `dask-worker`; I'll raise an issue there.
Alternatively we could screw Python and probably just write a small bash script to do this. In hindsight this might be simpler if the `--env` trick is deemed a hack from the dask side.
Proposing the following interaction between `--nprocs` and a new flag `--gpus` to `dask-worker`. No CUDA code is required in `dask-worker`, so no new worker program. No additional dependencies. Modifications to `dask-ssh` should be straightforward.
```python
@click.option('--gpus', type=str, default=None,
              help="GPUs to map to worker processes. "
                   "If used with --nprocs then nprocs must evenly divide the "
                   "number of GPUs. Specified as an ordered list of ranges, "
                   "e.g. '0-7' or '0,2,6-7'.")
```
Checks
- `--gpus` parses as a comma-separated list of integer ranges, expanded into `gpu-list`.
- Each GPU exists on the system as `/dev/nvidia#`.
- `len(gpu-list) / nprocs` is an integer.
Functionality
For each group of `len(gpu-list)/nprocs` GPUs, a nanny+worker will be started with `CUDA_VISIBLE_DEVICES` set to the GPUs in that group.
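The parsing and grouping described above can be sketched in a few lines. These are illustrative helpers, not actual `dask-worker` code, and the `/dev/nvidia#` existence check is omitted:

```python
def parse_gpu_ranges(spec):
    """Expand a --gpus spec like '0,2,6-7' into [0, 2, 6, 7]."""
    gpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            gpus.extend(range(int(lo), int(hi) + 1))
        else:
            gpus.append(int(part))
    return gpus

def gpu_groups(gpu_list, nprocs):
    """Split gpu_list into nprocs equal groups; each group becomes one
    worker's CUDA_VISIBLE_DEVICES value."""
    if len(gpu_list) % nprocs:
        raise ValueError("--nprocs must evenly divide the number of GPUs")
    size = len(gpu_list) // nprocs
    return [",".join(map(str, gpu_list[k * size:(k + 1) * size]))
            for k in range(nprocs)]
```

For example, `--gpus '0-3' --nprocs 2` would yield the groups `"0,1"` and `"2,3"`.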
I'm not sure if this is what you're proposing, but we probably can't add this to the `dask-worker` command in the core project. The logic is pretty CUDA specific. My guess is that we'll have to find a way to wrap around it in this repository rather than introduce this logic into the core project.
To expand on my previous remark, I would have a few concerns:
- The logic we're proposing is around modifying `CUDA_VISIBLE_DEVICES`, so it's very CUDA specific. If some other GPU manufacturer were to show up they could easily cry foul. If someone else were to ask for similar functionality tied to AMD hardware, the community would probably reject it too.
- The logic we're proposing here is honestly likely to change as we learn more about how to use GPUs and Dask together. The proposal isn't stable enough to make its way into the user-level API of Dask itself. There might be other good ways to handle Dask and GPUs; we generally need more experimentation before we commit ourselves to a particular blessed approach.
- It's not yet clear that Dask wants to think at all about GPUs. They're not special cased anywhere currently in the system. Baking in logic about GPUs specifically is a broader decision beyond this change. There are enough disparate special interests pushing on this project that we end up having to say "no" unfortunately often, even when it would make a particular group's situation easier. Scope creep gets pretty bad otherwise and maintenance becomes difficult.
To get things moving, we might also consider just copy-pasting the entire `dask-worker` implementation, modifying the `nprocs` logic a bit, and calling it a day. I suspect that learning enough about `click` to properly reuse code might take a while (I certainly don't know how to do this at least).
Also cc @quasiben in case he has recommendations.