
Comments (12)

mrocklin avatar mrocklin commented on September 17, 2024

Have you tried the LocalCUDACluster Python object or dask-cuda-worker CLI tool from this repository? I wonder if you would have more luck with them. At the very least, it would remove MPI from the picture and simplify things

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
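The dask-cuda-worker CLI route would look something like the following (a sketch, assuming it accepts the same --scheduler-file option as dask-worker; the path is just a placeholder):

dask-scheduler --scheduler-file /path/to/cluster.json    # on one node
dask-cuda-worker --scheduler-file /path/to/cluster.json  # on each worker node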


seunghwak avatar seunghwak commented on September 17, 2024

No, I haven't tried LocalCUDACluster, but we need MPI. I can try this for debugging purposes, but in the final solution we need to launch MPI processes from Dask.


mrocklin avatar mrocklin commented on September 17, 2024

appears only 13 times, and it took quite some additional time for this to get printed 16 times (I am assuming that this gets printed once per worker). Any clues?

So most workers start up quickly, but a few take a long time?

Do you have a sense for what the workers that are taking a long time are doing? Honestly, if I were to debug this I would probably just start putting print statements in the dask-mpi code to see how far things had gotten. Did MPI even start the Python process? Is it hanging on some import (maybe the file system is slow?) Is it hanging on a network connection somewhere in Dask?

Unfortunately I don't have any easy answers for you. My hope is that the dask-mpi script is simple enough to inspect that you'll be able to insert a few print statements to help narrow down the problem.


seunghwak avatar seunghwak commented on September 17, 2024

So most workers start up quickly, but a few take a long time?
=>
Say there are 16 workers: 7-8 start within a few seconds, it takes 10-20 seconds to launch 10, a minute or two to get to 13-14, and 5 minutes to get all 16.

I am not running anything, just waiting for the workers to start.

I am also experimenting with dask-worker (instead of using mpirun & dask-mpi), and pretty much the same thing happens. To be more specific, I am running

dask-worker --nprocs 16 --no-bokeh --nthreads 4 --no-scheduler --scheduler-file /datasets/pagerank_demo/tmp/cluster.json

So this doesn't look like an MPI issue; it's either a Dask issue or a system issue. What I am experiencing sounds pretty similar to

dask/dask-jobqueue#193

I will try a few more things and will post here if I find something... but any additional guidance would be greatly appreciated.


mrocklin avatar mrocklin commented on September 17, 2024

Thanks for investigating and for sharing your results here.

I will try a few more things and will post here if I find something... but any additional guidance would be greatly appreciated.

I would be curious which part of the worker startup process is taking so long. Is it just taking a long time for the process to start up? Are imports taking a long time? Is it taking a long time to establish a network connection?

One way to answer these questions would be to insert print statements like

print("Process started up", os.getpid())
import foo
import bar
import other_libraries
print("Imports finished", os.getpid())
...
worker = Worker(...)
worker.start(...)
print("Worker started", os.getpid())
...

and so on. If you have time, you could do this in the distributed/cli/dask_worker.py file to try to narrow down what the slow part of this process is.


seunghwak avatar seunghwak commented on September 17, 2024

Thanks, yes, I will try that with either dask_worker.py or dask_mpi.py. I am currently reading dask_mpi.py to figure out what it is actually doing...


seunghwak avatar seunghwak commented on September 17, 2024
 72         worker = W(scheduler_file=scheduler_file,
 73                    loop=loop,
 74                    name=rank if scheduler else None,
 75                    ncores=nthreads,
 76                    local_dir=local_directory,
 77                    services={('bokeh', bokeh_worker_port): BokehWorker},
 78                    memory_limit=memory_limit)

in dask_mpi.py is taking a long time (some workers finish it pretty fast, but for the remaining workers this part takes a long time). I will dig into the W() call, but I am eager to take any advice.
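For example, one way to dig into this would be to bracket that constructor with timestamps (a rough sketch against lines 71-78 of the file pasted below; the print format is only illustrative):

import time

t0 = time.time()
W = Nanny if nanny else Worker
worker = W(scheduler_file=scheduler_file,
           loop=loop,
           name=rank if scheduler else None,
           ncores=nthreads,
           local_dir=local_directory,
           services={('bokeh', bokeh_worker_port): BokehWorker},
           memory_limit=memory_limit)
# Report how long the constructor took on this MPI rank
print("rank", rank, "constructor took", time.time() - t0, "seconds")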

I copy & pasted the dask_mpi.py file I am playing with as well.

  1 from functools import partial
  2 
  3 import click
  4 from mpi4py import MPI
  5 from tornado.ioloop import IOLoop
  6 from tornado import gen
  7 from warnings import warn
  8 
  9 from distributed import Scheduler, Nanny, Worker
 10 from distributed.bokeh.worker import BokehWorker
 11 from distributed.cli.utils import check_python_3, uri_from_host_port
 12 from distributed.utils import get_ip_interface
 13 
 14 
 15 comm = MPI.COMM_WORLD
 16 rank = comm.Get_rank()
 17 loop = IOLoop()
 18 
 19 
 20 @click.command()
 21 @click.option('--scheduler-file', type=str, default='scheduler.json',
 22               help='Filename to JSON encoded scheduler information. ')
 23 @click.option('--interface', type=str, default=None,
 24               help="Network interface like 'eth0' or 'ib0'")
 25 @click.option('--nthreads', type=int, default=0,
 26               help="Number of threads per worker.")
 27 @click.option('--memory-limit', default='auto',
 28               help="Number of bytes before spilling data to disk. "
 29                    "This can be an integer (nbytes) "
 30                    "float (fraction of total memory) "
 31                    "or 'auto'")
 32 @click.option('--local-directory', default='', type=str,
 33               help="Directory to place worker files")
 34 @click.option('--scheduler/--no-scheduler', default=True,
 35               help=("Whether or not to include a scheduler. "
 36                     "Use --no-scheduler to increase an existing dask cluster"))
 37 @click.option('--nanny/--no-nanny', default=True,
 38               help="Start workers in nanny process for management")
 39 @click.option('--bokeh-port', type=int, default=8787,
 40               help="Bokeh port for visual diagnostics")
 41 @click.option('--bokeh-worker-port', type=int, default=8789,
 42               help="Worker's Bokeh port for visual diagnostics")
 43 @click.option('--bokeh-prefix', type=str, default=None,
 44               help="Prefix for the bokeh app")
 45 def main(scheduler_file, interface, nthreads, local_directory, memory_limit,
 46          scheduler, bokeh_port, bokeh_prefix, nanny, bokeh_worker_port):
 47     if interface:
 48         host = get_ip_interface(interface)
 49     else:
 50         host = None
 51 
 52     if rank == 0 and scheduler:
 53         try:
 54             from distributed.bokeh.scheduler import BokehScheduler
 55         except ImportError:
 56             services = {}
 57         else:
 58             services = {('bokeh',  bokeh_port): partial(BokehScheduler,
 59                                                         prefix=bokeh_prefix)}
 60         scheduler = Scheduler(scheduler_file=scheduler_file,
 61                               loop=loop,
 62                               services=services)
 63         addr = uri_from_host_port(host, None, 8786)
 64         scheduler.start(addr)
 65         try:
 66             loop.start()
 67             loop.close()
 68         finally:
 69             scheduler.stop()
 70     else:
 71         W = Nanny if nanny else Worker
 72         worker = W(scheduler_file=scheduler_file,
 73                    loop=loop,
 74                    name=rank if scheduler else None,
 75                    ncores=nthreads,
 76                    local_dir=local_directory,
 77                    services={('bokeh', bokeh_worker_port): BokehWorker},
 78                    memory_limit=memory_limit)
 79         addr = uri_from_host_port(host, None, 0)
 80 
 81         @gen.coroutine
 82         def run():
 83             yield worker._start(addr)
 84             while worker.status != 'closed':
 85                 yield gen.sleep(0.2)
 86 
 87         try:
 88             loop.run_sync(run)
 89             loop.close()
 90         finally:
 91             pass
 92
 93         @gen.coroutine
 94         def close():
 95             yield worker._close(timeout=2)
 96 
 97         loop.run_sync(close)
 98 
 99 
100 def go():
101     check_python_3()
102     warn("The dask-mpi command line utility in the `distributed` "
103          "package is deprecated.  "
104          "Please install the `dask-mpi` package instead. "
105          "More information is available at https://mpi.dask.org")
106     main()
107 
108 
109 if __name__ == '__main__':
110     go()


mrocklin avatar mrocklin commented on September 17, 2024

Hrm, that's a bit surprising. That's just the worker constructor

https://github.com/dask/distributed/blob/ae18f6597619c2c365c46bc0ff5671444b713c7e/distributed/worker.py#L255-L262

I wouldn't have expected anything in there to be particularly expensive to call. If you're able to dive further and find out which line is causing the problems, that would be welcome.


seunghwak avatar seunghwak commented on September 17, 2024
 381         self._workspace = WorkSpace(os.path.abspath(local_dir))
 382         self._workdir = self._workspace.new_work_dir(prefix='worker-')
 383         self.local_dir = self._workdir.dir_path

This part is slow... So it sounds like a file system problem... It's taking too long to create directories...
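A quick way to test that hypothesis independently of Dask would be to time directory creation on the suspect file system versus a node-local one (a minimal sketch; the two base paths are just examples):

import os
import tempfile
import time

def time_mkdir(base):
    # Create a temporary directory under `base`, measure how long that takes, then remove it
    t0 = time.time()
    path = tempfile.mkdtemp(prefix='worker-', dir=base)
    elapsed = time.time() - t0
    os.rmdir(path)
    return elapsed

print('shared fs:', time_mkdir('/datasets/pagerank_demo/tmp'))  # network file system path
print('local fs:', time_mkdir('/tmp'))                          # node-local scratch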


seunghwak avatar seunghwak commented on September 17, 2024

I think this was due to a network file system problem. After pointing --local-directory to another file system, all the workers now launch pretty fast. Thanks for the help!!!
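For reference, the change amounts to adding that option to the worker launch command, along these lines (the target path is just an example of node-local scratch space):

dask-worker --nprocs 16 --no-bokeh --nthreads 4 --no-scheduler --scheduler-file /datasets/pagerank_demo/tmp/cluster.json --local-directory /tmp/dask-workspace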


mrocklin avatar mrocklin commented on September 17, 2024

Thank you for the help in diagnosing this issue @seunghwak ! This is likely to be useful information for others as well.


jakirkham avatar jakirkham commented on September 17, 2024

Yeah, this is a common issue if --local-directory points to a directory on NFS, for instance. The content in these directories doesn't need to stick around particularly long, so normally I just point them to node-local scratch space.

