
dask-databricks's Issues

Add a DatabricksRunner

There have been some recent discussions, such as https://dask.discourse.group/t/dask-on-databricks-clusters/2086/8, about being able to side load a Dask cluster onto a Databricks multi-node cluster. A Runner could be a good model for this.

A Databricks cluster can be given an init script that gets executed on every node of the cluster. We could create a runner that is launched from that init script as a background process.

It looks like Databricks sets environment variables like DB_IS_DRIVER and DB_DRIVER_IP, so the scheduler could start on the driver node and all of the workers can find it because they know the driver's IP address.
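
As a rough sketch, a runner launched from the init script might branch on those variables like this (the exact variable values here are assumptions that would need verifying against a real Databricks cluster):

import os
import subprocess

def start_dask_component():
    # Assumption: DB_IS_DRIVER is "TRUE" on the driver node only
    if os.environ.get("DB_IS_DRIVER", "").upper() == "TRUE":
        # Driver node: start the scheduler as a background process
        subprocess.Popen(["dask", "scheduler"])
    else:
        # Worker node: the driver's IP tells us where the scheduler is
        scheduler = f"{os.environ['DB_DRIVER_IP']}:8786"
        subprocess.Popen(["dask", "worker", scheduler])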

It probably doesn't make sense to support running client code as that would likely be driven from a notebook.

Add support for alternative worker commands and config options

Right now the dask worker command is hard-coded.

https://github.com/jacobtomlinson/dask-databricks/blob/a64698074bcbe4117a0a16e32737b4fdeba3d491/dask_databricks/cli.py#L51

We should support configuring this so it can be set to other things like dask cuda worker. We may also want to pass extra flags to dask scheduler and dask worker, and we should expose that via the dask databricks run CLI. For example:

#!/bin/bash

# Install Dask Databricks
/databricks/python/bin/pip install dask-databricks

# Start Dask cluster
dask databricks run --worker-command "dask cuda worker" --worker-args "--nthreads 10"

High-level plan and scope

This issue outlines the high-level plan and scope of dask-databricks.

Overview

Running Dask clusters on Databricks is possible today but is not well supported. The goal of dask-databricks is to provide folks with useful tools to make launching and using Dask clusters on Databricks easier.

Specifically this tool is intended to launch a Dask cluster alongside a Databricks multi-node Spark cluster. Spark's architecture is similar to Dask's in that it has a driver node that coordinates work and many worker nodes that run the work. When using Databricks notebooks the cells of the notebook are executed on the driver node. When using PySpark or other libraries the tasks are submitted to the driver, which distributes them across the workers.

Via init scripts it is possible to run arbitrary commands on each node as it starts up, and nodes are aware of their role and of the driver node's network information via environment variables. Therefore we can use an init script to start Dask scheduler and worker processes alongside the Spark driver and worker processes.

From the user's perspective in a notebook environment, they can then choose to use either PySpark or Dask and utilize the same set of hardware.

Outline

There are two key technical aspects to using Dask on Databricks: 1) starting the cluster and 2) using the cluster. These functions will be provided by two separate parts of dask-databricks.

Cluster startup tool

A CLI tool will be provided to simplify launching the Dask cluster components. Users should be able to create a minimal init script with something like this.

#!/bin/bash

pip install dask-databricks
dask databricks run

It would be great if that's all that is needed to launch the Dask cluster.

Cluster manager

Then from the notebook side it should be easy for the user to create a Dask Client and connect to the cluster.

It could be nice to implement a DatabricksCluster object that provides convenience methods to the user. This would be similar to the HelmCluster in Dask Kubernetes, which, unlike most cluster managers, doesn't create the cluster but instead finds and connects to the existing cluster and then provides the Cluster API.

from dask_databricks import DatabricksCluster
from dask.distributed import Client

# If `dask databricks run` was successful this will connect to the scheduler,
# otherwise it will raise a useful error to help folks troubleshoot their cluster
cluster = DatabricksCluster()

# Connect the client
client = Client(cluster)

# Easily view logs from scheduler and workers in the notebook
cluster.get_logs()

Add docs

We need to write and build some documentation with Sphinx using the dask-sphinx-theme.

Add CI

Add GitHub Actions CI to run pytest.

Add DatabricksCluster class

Add a cluster manager class to simplify connecting a Client and provide useful utilities.

Currently to connect a client we can do the following.

from dask.distributed import Client
import os

# In a notebook the code runs on the driver node, where the scheduler was
# started, so SPARK_LOCAL_IP resolves to the scheduler's address
client = Client(f'{os.environ["SPARK_LOCAL_IP"]}:8786')

It would be nice if we could do this the "Dask way" with a cluster manager class like this.

from dask.distributed import Client
from dask_databricks import DatabricksCluster

cluster = DatabricksCluster()
client = Client(cluster)

Then we could have a cluster object that we can use to call useful utilities like cluster.get_logs().
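
Under the hood, the connection logic such a class would wrap might look roughly like the following (a hypothetical helper to illustrate the idea, not the actual API):

import os

from dask.distributed import Client

def connect_to_databricks_dask(port=8786):
    # Hypothetical helper: connect to a scheduler started by
    # `dask databricks run`, with a friendlier failure mode
    ip = os.environ.get("SPARK_LOCAL_IP")
    if ip is None:
        raise RuntimeError(
            "SPARK_LOCAL_IP is not set; either this is not a Databricks "
            "node or the Dask cluster was never started."
        )
    return Client(f"{ip}:{port}")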

Check process health

When starting the scheduler and workers we just call them and move on.

https://github.com/jacobtomlinson/dask-databricks/blob/a6ff0d4ff5ca91c29b174386474a69b9454c48d4/dask_databricks/cli.py#L38

https://github.com/jacobtomlinson/dask-databricks/blob/a6ff0d4ff5ca91c29b174386474a69b9454c48d4/dask_databricks/cli.py#L51

It would be a better user experience if we watched their health for a little while. Eventually we need to exit the init script and leave the Dask processes running, and we don't want to watch them forever, otherwise the init script will never exit. But in the overall timeline of Databricks cluster startup we could afford to spend a few seconds watching the Dask components to make sure they are healthy and don't exit prematurely.

Here are a few ideas for health checks we could implement:

  • Scheduler startup
    • For a few seconds after starting the scheduler, check the process hasn't exited
    • Poll the network socket like the workers do and wait for the scheduler to start
    • Watch stdout for the Scheduler at: ... log line
  • Worker startup
    • Poll the worker socket (requires starting all workers on the same port)
    • Watch stdout for the Start worker at: log line

We should add at least one check for the scheduler and workers.
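
For example, the socket-polling check might look something like this sketch (assuming the scheduler listens on the default port 8786):

import socket
import time

def wait_for_scheduler(host, port=8786, timeout=30):
    # Poll the scheduler's socket until it accepts connections
    # or the timeout expires; returns True on success
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(0.5)  # not up yet, retry shortly
    return False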

We could add a configurable timeout, and if the components don't start up in that time we would exit with a non-zero return code and some useful logs about what is going on.

#!/bin/bash

# Install Dask Databricks
/databricks/python/bin/pip install dask-databricks

# Start Dask cluster
dask databricks run --timeout 30s

When an init script exits with a non-zero code the whole cluster provisioning fails, so we want to be cautious about when we do this; but if the scheduler or worker processes fail to start up cleanly, that seems like a good time to do it.


Publish to conda-forge

Once published to PyPI in #15 we should also publish to conda-forge.

To do this we should generate a recipe with grayskull and submit it to the conda-forge/staged-recipes repo.
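
Generating the recipe is typically a one-liner (a sketch; the generated recipe will still need reviewing before submission):

# Generate a conda recipe from the published PyPI package
grayskull pypi dask-databricks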
