Comments (12)
Have you tried the LocalCUDACluster
Python object or dask-cuda-worker
CLI tool from this repository? I wonder if you would have more luck with them. At the very least, it would remove MPI from the picture and simplify things
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
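If it helps while debugging, the slow worker startup described later in this thread can be made explicit with a small stdlib-only polling helper like this (a sketch; `get_worker_count` is a hypothetical stand-in for something like `len(client.scheduler_info()["workers"])`):

```python
import time

def wait_for_workers(get_worker_count, expected, timeout=300, interval=1.0):
    # Poll a worker-count callable until `expected` workers are present
    # or `timeout` seconds have passed; return the final count seen.
    deadline = time.monotonic() + timeout
    count = get_worker_count()
    while count < expected and time.monotonic() < deadline:
        time.sleep(interval)
        count = get_worker_count()
    return count

# Hypothetical usage with a dask client:
#   wait_for_workers(lambda: len(client.scheduler_info()["workers"]), 16)
```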
from dask-cuda.
No, I haven't tried LocalCUDACluster, but we need MPI. I can try this for debugging purposes, but in the final solution we need to launch MPI processes from Dask.
appears only 13 times, and it took quite some additional time for it to get printed 16 times (I am assuming this gets printed once per worker). Any clues?
So most workers start up quickly, but a few take a long time?
Do you have a sense for what the workers that are taking a long time are doing? Honestly, if I were to debug this I would probably just start putting print statements in the dask-mpi code to see how far things had gotten. Did MPI even start the Python process? Is it hanging on some import (maybe the file system is slow)? Is it hanging on a network connection somewhere in Dask?
Unfortunately I don't have any easy answers for you. My hope is that the dask-mpi script is simple enough to inspect that you'll be able to insert a few print statements to help narrow down the problem.
So most workers start up quickly, but a few take a long time?
=>
Say there are 16 workers: 7-8 start within a few seconds, it takes 10-20 seconds to reach 10, a minute or two to get to 13-14, and 5 minutes to get all 16.
I am not running anything, just waiting for the workers to start.
And I am experimenting with dask-worker (instead of using mpirun & dask-mpi), and pretty much the same thing happens. To be more specific, I am trying
dask-worker --nprocs 16 --no-bokeh --nthreads 4 --no-scheduler --scheduler-file /datasets/pagerank_demo/tmp/cluster.json
So this doesn't look like an MPI issue; it's either a Dask issue or a system issue. And what I am experiencing sounds pretty similar to
I will try a few more things and will post here if I find something... but any additional guidance will be greatly appreciated.
Thanks for investigating and for sharing your results here.
I will try a few more things and will post here if I find something... but any additional guidance will be greatly appreciated.
I would be curious what in the process of starting workers takes a long time. Is it just taking a long time for the process to start up? Are imports taking a long time? Is it taking a long time to establish a network connection?
One way to answer these questions would be to insert print statements like
import os

print("Process started up", os.getpid())
import foo
import bar
import other_libraries
print("Imports finished", os.getpid())
...
worker = Worker(...)
worker.start(...)
print("Worker started", os.getpid())
...
and so on. If you have time, you could do this in the distributed/cli/dask_worker.py
file to try to narrow down what the slow part of this process is.
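One way to make such print statements more informative is to attach elapsed times, so the slow stages stand out across workers. A minimal sketch (the stage names and placement are placeholders, not dask code):

```python
import os
import time

_t0 = time.perf_counter()

def checkpoint(label):
    # Print seconds elapsed since process start, tagged with the PID,
    # so slow stages stand out when many workers log at once.
    elapsed = time.perf_counter() - _t0
    print(f"[pid {os.getpid()}] {label}: +{elapsed:.2f}s")
    return elapsed

checkpoint("process started")
# ... heavy imports would go here ...
checkpoint("imports finished")
# ... worker construction and worker.start() would go here ...
checkpoint("worker started")
```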
Thanks, yes, I will try that with either dask_worker.py or dask_mpi.py. I am currently reading dask_mpi.py to figure out what it is actually doing...
72 worker = W(scheduler_file=scheduler_file,
73 loop=loop,
74 name=rank if scheduler else None,
75 ncores=nthreads,
76 local_dir=local_directory,
77 services={('bokeh', bokeh_worker_port): BokehWorker},
78 memory_limit=memory_limit)
in
dask_mpi.py is taking long (some workers finish this pretty fast, but for the remaining workers this part takes long). I will dig into the W() call, but I am eager to hear any advice.
I copy-and-pasted the dask_mpi.py file I am playing with as well.
1 from functools import partial
2
3 import click
4 from mpi4py import MPI
5 from tornado.ioloop import IOLoop
6 from tornado import gen
7 from warnings import warn
8
9 from distributed import Scheduler, Nanny, Worker
10 from distributed.bokeh.worker import BokehWorker
11 from distributed.cli.utils import check_python_3, uri_from_host_port
12 from distributed.utils import get_ip_interface
13
14
15 comm = MPI.COMM_WORLD
16 rank = comm.Get_rank()
17 loop = IOLoop()
18
19
20 @click.command()
21 @click.option('--scheduler-file', type=str, default='scheduler.json',
22 help='Filename to JSON encoded scheduler information. ')
23 @click.option('--interface', type=str, default=None,
24 help="Network interface like 'eth0' or 'ib0'")
25 @click.option('--nthreads', type=int, default=0,
26 help="Number of threads per worker.")
27 @click.option('--memory-limit', default='auto',
28 help="Number of bytes before spilling data to disk. "
29 "This can be an integer (nbytes) "
30 "float (fraction of total memory) "
31 "or 'auto'")
32 @click.option('--local-directory', default='', type=str,
33 help="Directory to place worker files")
34 @click.option('--scheduler/--no-scheduler', default=True,
35 help=("Whether or not to include a scheduler. "
36 "Use --no-scheduler to increase an existing dask cluster"))
37 @click.option('--nanny/--no-nanny', default=True,
38 help="Start workers in nanny process for management")
39 @click.option('--bokeh-port', type=int, default=8787,
40 help="Bokeh port for visual diagnostics")
41 @click.option('--bokeh-worker-port', type=int, default=8789,
42 help="Worker's Bokeh port for visual diagnostics")
43 @click.option('--bokeh-prefix', type=str, default=None,
44 help="Prefix for the bokeh app")
45 def main(scheduler_file, interface, nthreads, local_directory, memory_limit,
46 scheduler, bokeh_port, bokeh_prefix, nanny, bokeh_worker_port):
47 if interface:
48 host = get_ip_interface(interface)
49 else:
50 host = None
51
52 if rank == 0 and scheduler:
53 try:
54 from distributed.bokeh.scheduler import BokehScheduler
55 except ImportError:
56 services = {}
57 else:
58 services = {('bokeh', bokeh_port): partial(BokehScheduler,
59 prefix=bokeh_prefix)}
60 scheduler = Scheduler(scheduler_file=scheduler_file,
61 loop=loop,
62 services=services)
63 addr = uri_from_host_port(host, None, 8786)
64 scheduler.start(addr)
65 try:
66 loop.start()
67 loop.close()
68 finally:
69 scheduler.stop()
70 else:
71 W = Nanny if nanny else Worker
72 worker = W(scheduler_file=scheduler_file,
73 loop=loop,
74 name=rank if scheduler else None,
75 ncores=nthreads,
76 local_dir=local_directory,
77 services={('bokeh', bokeh_worker_port): BokehWorker},
78 memory_limit=memory_limit)
79 addr = uri_from_host_port(host, None, 0)
80
81 @gen.coroutine
82 def run():
83 yield worker._start(addr)
84 while worker.status != 'closed':
85 yield gen.sleep(0.2)
86
87 try:
88 loop.run_sync(run)
89 loop.close()
90 finally:
91 pass
92
93 @gen.coroutine
94 def close():
95 yield worker._close(timeout=2)
96
97 loop.run_sync(close)
98
99
100 def go():
101 check_python_3()
102 warn("The dask-mpi command line utility in the `distributed` "
103 "package is deprecated. "
104 "Please install the `dask-mpi` package instead. "
105 "More information is available at https://mpi.dask.org")
106 main()
107
108
109 if __name__ == '__main__':
110 go()
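To confirm that the W(...) call in the listing above is the slow part, one option (a sketch, not dask-mpi's actual code) is to wrap the constructor in a small timing helper:

```python
import time

def timed(label, fn, *args, **kwargs):
    # Call fn with the given arguments, print how long it took,
    # and return its result unchanged.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label} took {time.perf_counter() - start:.2f}s")
    return result

# Hypothetical usage inside dask_mpi.py, replacing the plain call:
#   worker = timed(f"rank {rank} {W.__name__}()", W,
#                  scheduler_file=scheduler_file, loop=loop, ...)
```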
Hrm, that's a bit surprising. That's just the worker constructor; I wouldn't have expected anything in there to be particularly expensive to call. If you're able to dive further and find out which line is causing the problem, that would be welcome.
381 self._workspace = WorkSpace(os.path.abspath(local_dir))
382 self._workdir = self._workspace.new_work_dir(prefix='worker-')
383 self.local_dir = self._workdir.dir_path
This part is slow... so it sounds like a file system problem... it's taking too long to create directories...
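A quick way to confirm the file-system hypothesis is to time directory creation on the suspect path versus local scratch (the probe below is an illustration, not dask code):

```python
import os
import tempfile
import time

def time_mkdirs(base, n=50):
    # Create and remove n directories under base and return the mean
    # per-directory latency in seconds.
    start = time.perf_counter()
    for i in range(n):
        path = os.path.join(base, f"probe-{i}")
        os.mkdir(path)
        os.rmdir(path)
    return (time.perf_counter() - start) / n

# On local scratch this is typically well under a millisecond per
# directory; on a loaded NFS mount it can be orders of magnitude slower.
local = tempfile.mkdtemp()
print(f"{local}: {time_mkdirs(local) * 1e3:.3f} ms/dir")
```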
I think this was due to the network file system problem. After pointing --local-directory at a different file system, all the workers now launch pretty fast. Thanks for the help!!!
Thank you for the help in diagnosing this issue @seunghwak ! This is likely to be useful information for others as well.
Yeah, this is a common issue if --local-directory
points to a directory on NFS, for instance. The contents of these directories don't need to stick around particularly long, so normally I just point them at node-local scratch space.