
Comments (12)

mrocklin avatar mrocklin commented on September 17, 2024

Have you tried the LocalCUDACluster Python object or dask-cuda-worker CLI tool from this repository? I wonder if you would have more luck with them. At the very least, it would remove MPI from the picture and simplify things

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
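The dask-cuda-worker CLI route would look something like the following (a sketch, assuming it accepts the same --scheduler-file option as dask-worker; the path is just a placeholder):

dask-scheduler --scheduler-file /path/to/cluster.json    # on one node
dask-cuda-worker --scheduler-file /path/to/cluster.json  # on each worker node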


seunghwak avatar seunghwak commented on September 17, 2024

No, I haven't tried LocalCUDACluster, but we need MPI. I can try this for debugging purposes, but in the final solution we need to launch MPI processes from Dask.


mrocklin avatar mrocklin commented on September 17, 2024

appears only 13 times, and it took quite some additional time for this to get printed 16 times (I am assuming that this gets printed once per worker). Any clues?

So most workers start up quickly, but a few take a long time?

Do you have a sense for what the workers that are taking a long time are doing? Honestly, if I were to debug this I would probably just start putting print statements in the dask-mpi code to see how far things had gotten. Did MPI even start the Python process? Is it hanging on some import (maybe the file system is slow?) Is it hanging on a network connection somewhere in Dask?

Unfortunately I don't have any easy answers for you. My hope is that the dask-mpi script is simple enough to inspect that you'll be able to insert a few print statements to help narrow down the problem.


seunghwak avatar seunghwak commented on September 17, 2024

So most workers start up quickly, but a few take a long time?
=>
Say there are 16 workers: 7-8 start within a few seconds, it takes 10-20 seconds to launch 10, a minute or two to get to 13-14, and 5 minutes to get all 16.

I am not running anything, just waiting for the workers to start.

I am also experimenting with dask-worker (instead of using mpirun & dask-mpi), and pretty much the same thing happens. To be more specific, I am running

dask-worker --nprocs 16 --no-bokeh --nthreads 4 --no-scheduler --scheduler-file /datasets/pagerank_demo/tmp/cluster.json

So this doesn't look like an MPI issue; it's either a Dask issue or a system issue. What I am experiencing sounds pretty similar to

dask/dask-jobqueue#193

I will try a few more things and will post here if I find something... but any additional guidance would be greatly appreciated.


mrocklin avatar mrocklin commented on September 17, 2024

Thanks for investigating and for sharing your results here.

I will try a few more things and will post here if I find something... but any additional guidance would be greatly appreciated.

I would be curious which part of the worker startup process is taking so long. Is it just taking a long time for the process to start up? Are imports taking a long time? Is it taking a long time to establish a network connection?

One way to answer these questions would be to insert print statements like

print("Process started up", os.getpid())
import foo
import bar
import other_libraries
print("Imports finished", os.getpid())
...
worker = Worker(...)
worker.start(...)
print("Worker started", os.getpid())
...

and so on. If you have time, you could do this in the distributed/cli/dask_worker.py file to try to narrow down what the slow part of this process is.


seunghwak avatar seunghwak commented on September 17, 2024

Thanks, yes, I will try that with either dask_worker.py or dask_mpi.py. I am currently reading dask_mpi.py to figure out what it is actually doing...


seunghwak avatar seunghwak commented on September 17, 2024
 72         worker = W(scheduler_file=scheduler_file,
 73                    loop=loop,
 74                    name=rank if scheduler else None,
 75                    ncores=nthreads,
 76                    local_dir=local_directory,
 77                    services={('bokeh', bokeh_worker_port): BokehWorker},
 78                    memory_limit=memory_limit)

in dask_mpi.py is taking a long time (some workers finish it pretty fast, but for the remaining workers this part takes a long time). I will dig into the W() call, but I am eager to take any advice.
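For example, one way to dig into this would be to bracket that constructor with timestamps (a rough sketch against lines 71-78 of the file pasted below; the print format is only illustrative):

import time

t0 = time.time()
W = Nanny if nanny else Worker
worker = W(scheduler_file=scheduler_file,
           loop=loop,
           name=rank if scheduler else None,
           ncores=nthreads,
           local_dir=local_directory,
           services={('bokeh', bokeh_worker_port): BokehWorker},
           memory_limit=memory_limit)
# Report how long the constructor took on this MPI rank
print("rank", rank, "constructor took", time.time() - t0, "seconds")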

I copy & pasted the dask_mpi.py file I am playing with as well.

  1 from functools import partial
  2 
  3 import click
  4 from mpi4py import MPI
  5 from tornado.ioloop import IOLoop
  6 from tornado import gen
  7 from warnings import warn
  8 
  9 from distributed import Scheduler, Nanny, Worker
 10 from distributed.bokeh.worker import BokehWorker
 11 from distributed.cli.utils import check_python_3, uri_from_host_port
 12 from distributed.utils import get_ip_interface
 13 
 14 
 15 comm = MPI.COMM_WORLD
 16 rank = comm.Get_rank()
 17 loop = IOLoop()
 18 
 19 
 20 @click.command()
 21 @click.option('--scheduler-file', type=str, default='scheduler.json',
 22               help='Filename to JSON encoded scheduler information. ')
 23 @click.option('--interface', type=str, default=None,
 24               help="Network interface like 'eth0' or 'ib0'")
 25 @click.option('--nthreads', type=int, default=0,
 26               help="Number of threads per worker.")
 27 @click.option('--memory-limit', default='auto',
 28               help="Number of bytes before spilling data to disk. "
 29                    "This can be an integer (nbytes) "
 30                    "float (fraction of total memory) "
 31                    "or 'auto'")
 32 @click.option('--local-directory', default='', type=str,
 33               help="Directory to place worker files")
 34 @click.option('--scheduler/--no-scheduler', default=True,
 35               help=("Whether or not to include a scheduler. "
 36                     "Use --no-scheduler to increase an existing dask cluster"))
 37 @click.option('--nanny/--no-nanny', default=True,
 38               help="Start workers in nanny process for management")
 39 @click.option('--bokeh-port', type=int, default=8787,
 40               help="Bokeh port for visual diagnostics")
 41 @click.option('--bokeh-worker-port', type=int, default=8789,
 42               help="Worker's Bokeh port for visual diagnostics")
 43 @click.option('--bokeh-prefix', type=str, default=None,
 44               help="Prefix for the bokeh app")
 45 def main(scheduler_file, interface, nthreads, local_directory, memory_limit,
 46          scheduler, bokeh_port, bokeh_prefix, nanny, bokeh_worker_port):
 47     if interface:
 48         host = get_ip_interface(interface)
 49     else:
 50         host = None
 51 
 52     if rank == 0 and scheduler:
 53         try:
 54             from distributed.bokeh.scheduler import BokehScheduler
 55         except ImportError:
 56             services = {}
 57         else:
 58             services = {('bokeh',  bokeh_port): partial(BokehScheduler,
 59                                                         prefix=bokeh_prefix)}
 60         scheduler = Scheduler(scheduler_file=scheduler_file,
 61                               loop=loop,
 62                               services=services)
 63         addr = uri_from_host_port(host, None, 8786)
 64         scheduler.start(addr)
 65         try:
 66             loop.start()
 67             loop.close()
 68         finally:
 69             scheduler.stop()
 70     else:
 71         W = Nanny if nanny else Worker
 72         worker = W(scheduler_file=scheduler_file,
 73                    loop=loop,
 74                    name=rank if scheduler else None,
 75                    ncores=nthreads,
 76                    local_dir=local_directory,
 77                    services={('bokeh', bokeh_worker_port): BokehWorker},
 78                    memory_limit=memory_limit)
 79         addr = uri_from_host_port(host, None, 0)
 80 
 81         @gen.coroutine
 82         def run():
 83             yield worker._start(addr)
 84             while worker.status != 'closed':
 85                 yield gen.sleep(0.2)
 86 
 87         try:
 88             loop.run_sync(run)
 89             loop.close()
 90         finally:
 91             pass
 92
 93         @gen.coroutine
 94         def close():
 95             yield worker._close(timeout=2)
 96 
 97         loop.run_sync(close)
 98 
 99 
100 def go():
101     check_python_3()
102     warn("The dask-mpi command line utility in the `distributed` "
103          "package is deprecated.  "
104          "Please install the `dask-mpi` package instead. "
105          "More information is available at https://mpi.dask.org")
106     main()
107 
108 
109 if __name__ == '__main__':
110     go()


mrocklin avatar mrocklin commented on September 17, 2024

Hrm, that's a bit surprising. That's just the worker constructor

https://github.com/dask/distributed/blob/ae18f6597619c2c365c46bc0ff5671444b713c7e/distributed/worker.py#L255-L262

I wouldn't have expected anything in there to be particularly expensive to call. If you're able to dive further and find out which line is causing the problems, that would be welcome.


seunghwak avatar seunghwak commented on September 17, 2024
 381         self._workspace = WorkSpace(os.path.abspath(local_dir))
 382         self._workdir = self._workspace.new_work_dir(prefix='worker-')
 383         self.local_dir = self._workdir.dir_path

This part is slow... So it sounds like a file system problem... It's taking too long to create directories...
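A quick way to test that hypothesis independently of Dask would be to time directory creation on the suspect file system versus a node-local one (a minimal sketch; the two base paths are just examples):

import os
import tempfile
import time

def time_mkdir(base):
    # Create a temporary directory under `base`, measure how long that takes, then remove it
    t0 = time.time()
    path = tempfile.mkdtemp(prefix='worker-', dir=base)
    elapsed = time.time() - t0
    os.rmdir(path)
    return elapsed

print('shared fs:', time_mkdir('/datasets/pagerank_demo/tmp'))  # network file system path
print('local fs:', time_mkdir('/tmp'))                          # node-local scratch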


seunghwak avatar seunghwak commented on September 17, 2024

I think this was due to a network file system problem. After pointing --local-directory to another file system, all the workers now launch pretty fast. Thanks for the help!!!
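For reference, the change amounts to adding that option to the worker launch command, along these lines (the target path is just an example of node-local scratch space):

dask-worker --nprocs 16 --no-bokeh --nthreads 4 --no-scheduler --scheduler-file /datasets/pagerank_demo/tmp/cluster.json --local-directory /tmp/dask-workspace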


mrocklin avatar mrocklin commented on September 17, 2024

Thank you for the help in diagnosing this issue @seunghwak ! This is likely to be useful information for others as well.


jakirkham avatar jakirkham commented on September 17, 2024

Yeah, this is a common issue if --local-directory points to a directory on NFS, for instance. The content in these directories doesn't need to stick around particularly long, so normally I just point them to node-local scratch space.

