
dask-gke's Introduction

Dask


Dask is a flexible parallel computing library for analytics. See documentation for more information.

LICENSE

New BSD. See License File.

dask-gke's People

Contributors

jbcrail, jlowin, kforeman, martindurant, mrocklin, ogrisel, tomaugspurger


dask-gke's Issues

CLI for cluster management

This will replace scripts/make_cluster.sh and is loosely modeled on the gcloud dataproc clusters command, but specific to Dask.

Example usage:

$ dask-kube
Usage: dask-kube [--version] [--help] <command> [args]

Create and manage Dask clusters using Kubernetes on Google Cloud Platform.

Common commands:
    create             Create a cluster.
    delete             Delete a cluster.
    update             Update cluster configuration.
    describe           Show the details of a cluster.
    list               Show all clusters.

Notebooks need to refer to scheduler address

Currently the notebooks connect to Client('127.0.0.1:8786'). This made sense on EC2, when both the scheduler and the Jupyter notebook server ran on the same machine. Now these probably need to be changed to Client('dask-scheduler:8786'), which seems to work well.
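In code, the change amounts to the following (dask-scheduler is the service name that the existing Kubernetes configuration defines):

from dask.distributed import Client

# inside the cluster the scheduler is resolvable by its Kubernetes service name
client = Client('dask-scheduler:8786')  # previously Client('127.0.0.1:8786')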

Include Bokeh server for one worker

My understanding is that it is difficult to provide an external view of the Bokeh server page for every worker. However, is it possible to provide an external view of the Bokeh server page for one worker? It can be very helpful to get information from a representative worker.

Use port 80 or 443 for notebook and bokeh application?

By convention we serve on ports 8787 (dask/bokeh) and 8888 (jupyter). However, these ports may be blocked in enterprise environments. Perhaps we should serve on standard ports like 80 (HTTP) and 443 (HTTPS), which are usually not blocked?

Remove password from default jupyter configuration

Last week, I found dask-kubernetes and was excited to play around with it. It ended up taking me a couple of days of work to get the Docker container recompiled so that I could test it, because the password for the Jupyter notebook is not documented anywhere. Why not disable the password by default?
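For reference, a minimal sketch of what a password-less default could look like in jupyter_notebook_config.py; these exact settings are an assumption rather than the repository's actual configuration, and disabling authentication like this is only safe behind a firewall or a port-forward:

# jupyter_notebook_config.py -- sketch only; unsafe on a public IP
c = get_config()
c.NotebookApp.token = ''     # disable token authentication
c.NotebookApp.password = ''  # disable password authentication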

option to use preemptible instances

I thought it would be nice to save some money by adding the option to use preemptible instances, so I added a quick fix to do this. Will issue a PR.

Increase scheduler memory

For large graphs we can easily extend beyond the 1GB limit here. We might consider going to 2GB or 4GB.

Kubernetes deployment of many workers on many-cpu machines is very slow

I was trying to compare joblib against joblib with the distributed backend, so I tried to provision two or three 96- or 64-CPU machines on Google Cloud with 40-60 workers, plus 40-60 CPUs for the notebook worker. I found that the provisioning process took quite a long time. This may just reflect underlying slowness in the container process, but if you have any insight that would be great.

Deleting a cluster should delete associated resources

I believe that the command dask-kubernetes delete NAME only deletes the GKE cluster but does not delete the associated load balancer(s) on GCP, which continue to run and accrue charges.

dask-kubernetes delete NAME should delete the GKE cluster and associated resources on GCP.

Rerender fails

It looks like there is a typo in the name/cluster variable. I also get follow-on errors about update_config after fixing this.

Missing packages in notebook 06-nyc-parquet.ipynb

I tried to execute the cells in the notebook above.

[screenshot of the error output omitted]

To fix it, I installed the missing package locally on my machine and connected dask.distributed to the GKE Dask cluster, but then:

distributed.utils - ERROR - Data is compressed as snappy but we don't have this installed

By the way, I installed Dask via helm install stable/dask.
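One way to catch this sort of environment mismatch early is to compare package versions across the client, scheduler, and workers; dask.distributed provides a check for this:

from dask.distributed import Client

client = Client('dask-scheduler:8786')
# flags mismatched packages between client, scheduler, and workers
client.get_versions(check=True)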

Move to dask org

This seems to be stable enough that it should probably move to the main org.

Command to print out useful information

After creating the cluster I need to know a few things, like where the Jupyter notebook and Bokeh pages are. The README is quite good about pointing me to kubectl get services -l app=dask. Relying on Kubernetes in this way may be the right approach to avoid endless scope. However, if we do want to start doing more things like resize, it would be convenient to have the important addresses printed to the console so that I could easily copy-paste them into my web browser (or click on them, if my terminal turns them into links).

Additionally, it would be useful to get instructions on how to launch the Kubernetes web interface. I think this is typically done by setting up some sort of proxy (e.g. kubectl proxy). Launching the web interface might be sufficient to cover a lot of the requests like resize. We could just say "oh, just launch the web interface and it should be easy to do from there."

Disable public IP access by default

For security reasons, I think that the default configuration should not expose the Jupyter / scheduler services on a public IP address (even if the Jupyter notebook asks for a password, sending a password over HTTP without TLS is unsafe).

It would be better to advertise the use of:

kubectl port-forward name-of-service localport:serviceport

We could even have some dask-kubernetes helper commands to do that automatically and open the notebook and other HTTP status pages on http://localhost:localport instead.
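A minimal sketch of such a helper, assuming a service named jupyter-notebook (the actual service names may differ) and a kubectl recent enough to forward services directly, following the port-forward form shown above:

import subprocess
import webbrowser

# forward the (assumed) jupyter-notebook service to a local port, then open it
proc = subprocess.Popen(
    ['kubectl', 'port-forward', 'service/jupyter-notebook', '8888:8888'])
webbrowser.open('http://localhost:8888')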

Develop system to update docker image

This repository tends to fall behind whenever we update Dask. We might want to move the basic image to something in a standard dask/ namespace. We may also want to add instructions on how to rebuild and republish this image whenever we issue a release. This might go into the Dask release-procedure notes.

'webbrowser.open' on OSX

Calling any browser page gives an error message like doesn’t understand the “open location” message. (-1708), because of a change in the latest OSX. This can only be fixed in Python itself (webbrowser is a built-in package) or by setting an environment variable like export BROWSER=open.

click.launch seems to work fine, however. Since click is already a dependency of this project, should that become the way browser pages are opened? Does it work on all platforms?
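For comparison, the click version is a one-liner; click.launch defers to the platform's native opener rather than the webbrowser module:

import click

# opens the URL with the OS-native opener (e.g. open on OSX)
click.launch('http://localhost:8787/status')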

Add credentials command

Currently we set the kubectl credentials when creating a cluster, but that only remains valid as long as credentials are not set elsewhere. We should provide a command to set the credentials for a given cluster, for users who wish to use kubectl directly.

A possible alternative would be to report the Kubernetes "context" for a cluster, since passing --context to kubectl works without explicitly fetching credentials. We should probably provide both.

Relies on being run from the repo directory

It appears that this tool requires being run from a directory that has a kubernetes subdirectory with appropriate YAML files. While this is probably true in development, it may not be true if someone pip-installs this library. Instead, would it be possible to look for a local kubernetes directory and then, if one does not exist, fall back to ~/.dask/kubernetes, which we could populate with some default config files? A sketch of the suggested lookup follows the transcript below.

mrocklin@carbon:~$ dask-kubernetes create test-1
executing: gcloud config set compute/zone us-east1-b
Updated property [compute/zone].
executing: gcloud container clusters create test-1 --num-nodes 6 --machine-type n1-standard-4 --no-async --disk-size 50 --tags=dask,anacondascale
Creating cluster test-1...done.                                                                                   
Created [https://container.googleapis.com/v1/projects/continuum-compute/zones/us-east1-b/clusters/test-1].
kubeconfig entry generated for test-1.
NAME    ZONE        MASTER_VERSION  MASTER_IP       MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
test-1  us-east1-b  1.5.7           35.185.100.244  n1-standard-4  1.5.7         6          RUNNING
executing: gcloud container clusters get-credentials test-1
Fetching cluster endpoint and auth data.
kubeconfig entry generated for test-1.
executing: kubectl create -f kubernetes
error: the path "kubernetes" does not exist
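A minimal sketch of the suggested lookup, with a hypothetical helper name and defaults location:

import os
import shutil

def find_kubernetes_dir(defaults='defaults/kubernetes'):  # hypothetical helper
    # prefer a kubernetes/ directory in the current working directory
    if os.path.isdir('kubernetes'):
        return 'kubernetes'
    # otherwise fall back to ~/.dask/kubernetes, seeding it with default configs
    fallback = os.path.expanduser('~/.dask/kubernetes')
    if not os.path.isdir(fallback):
        shutil.copytree(defaults, fallback)
    return fallback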

Add JupyterLab service

It would be interesting (and community-minded) to add a JupyterLab instance running alongside the classic Jupyter notebook.

Change name to dask-kubernetes?

I've been trying to remove the name distributed from any package or documentation. It's a bad name for a software project. I've been calling the whole thing just dask. This would leave us with

kubernetes-dask

Many other Dask projects order the name as dask-foo: for example dask-drmaa, dask-marathon, dask-ssh, dask-ec2, etc. It might be more consistent to follow this pattern.

Programmatic interface

For various reasons it would be convenient to have a KubernetesCluster object in Python that we could manipulate during execution, rather than using the command line. In my immediate use case I'm building benchmarks and would like to try the same computation on clusters of many different sizes.
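A sketch of what that could look like; every name here is hypothetical, since no such API exists yet:

from dask.distributed import Client
from dask_kubernetes import KubernetesCluster  # hypothetical class

cluster = KubernetesCluster(name='bench-1', num_workers=8)  # hypothetical
client = Client(cluster.scheduler_address)                  # hypothetical attribute
for n in (8, 16, 32):
    cluster.resize(n)      # hypothetical method
    run_benchmark(client)  # user-supplied benchmark function
cluster.close()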

Insight into configuration options

I've recently started exploring using dask-kubernetes to automate cluster deployment. Is it possible to add a few more comments to the defaults.yaml configuration file?

For example:
One of the first things I wanted to do is to see how things scale as I add more workers. There's a basic formula that I think I understand

count * cpus_per_worker + jupyter_cpu + scheduler_cpu <= num_nodes * cpus_per_node

But why is the cpus_per_worker non-integer? When I dig a bit further, the default value of 1.8 appears to get normalized to 2.
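A worked example with assumed numbers (6 nodes of 4 CPUs, 1 CPU each for the notebook and the scheduler; none of these values are taken from the actual defaults):

# illustrative values only
num_nodes, cpus_per_node = 6, 4
jupyter_cpu, scheduler_cpu = 1, 1
cpus_per_worker = 1.8

available = num_nodes * cpus_per_node - jupyter_cpu - scheduler_cpu  # 22
max_workers = int(available // cpus_per_worker)                      # 12
print(max_workers)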

Dask Scheduler PodUnschedulable

Hi, I am trying to create a cluster, but my dask-scheduler is always unschedulable even though I have enough resources. I will add my cluster.yaml here. It also has autoscaling enabled, but I can't create my cluster.

[screenshot of the unschedulable dask-scheduler pod]

I see this on my Kubernetes Engine console:

[screenshot from the Kubernetes Engine console]

Building a Helm Chart for dask

Hi Martin,

I am doing some work in the Kubernetes space around ML / DL, mostly on the infrastructure side, to make it easy to onboard clustering. One of the things I have worked on is a Helm chart for TensorFlow (write-up here).

Someone told me that Anaconda would make a lot of sense, so I started looking into ways to run it on K8s, and your repo is the most advanced setup I can find. However, I try to stick to Helm packages in general, and this project doesn't have one.

Are you accepting PRs, and is this something you would value?
If not, do you mind if I borrow your structure to start a chart? Any license you'd like to apply?

Many thanks! I am also available on Twitter (@SaMnCo_23) to discuss.

Support demonstration workload

One common use of dask-ec2 today is to support conferences and demonstrations. To satisfy this application we would need the following:

  1. Include a copy of the notebooks from https://github.com/dask/dask-ec2/tree/master/notebooks, replacing data in the dask-data S3 bucket with equivalent data in a dask-data GCS bucket
  2. Include instructions to scale the cluster up and down, so that it can run cheaply forever and only scale up as needed
  3. Include instructions on how users can alter the conda environment in the Dockerfile and distribute that image rather than the standard image provided here.

Add CLI command to display cluster info

Similar to dask and dask-ec2, it would be helpful to give users easy access to common UI URLs for a given cluster.

Example usage:

$ dask-kubernetes info NAME

Addresses
---------
   Web Interface:  http://xxx.xxx.xxx.xxx:8787/status
 Scheduler (TCP):         xxx.xxx.xxx.xxx:8786
Scheduler (HTTP):         xxx.xxx.xxx.xxx:9786
           Bokeh:         xxx.xxx.xxx.xxx:8788
         Jupyter:         xxx.xxx.xxx.xxx:8888

To connect to scheduler inside of cluster
-----------------------------------------
from dask.distributed import Client
c = Client('dask-scheduler:8786')

To connect to scheduler outside of cluster
------------------------------------------
from dask.distributed import Client
c = Client('xxx.xxx.xxx.xxx:8786')

Resize operation handles both pods and nodes?

It looks like the current dask-kubernetes resize operation passes through to kubectl. Users are recommended to increase both the number of pods and the number of nodes:

Excerpt

To add machines to the cluster, you may do the following

dask-kubernetes resize nodes CLUSTER COUNT

To add worker containers, you may do the following

dask-kubernetes resize pods CLUSTER COUNT

Suggestion

Is it possible to take on this bookkeeping ourselves so that users don't need to understand the difference here?
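A sketch of what that bookkeeping could look like, reusing the CPU formula from the "Insight into configuration options" issue above; the helper and the overhead value are hypothetical:

import math

def nodes_for_pods(pods, cpus_per_worker, cpus_per_node, overhead_cpus=2):
    # hypothetical helper: nodes needed for `pods` workers plus
    # scheduler/notebook overhead
    total_cpus = pods * cpus_per_worker + overhead_cpus
    return math.ceil(total_cpus / cpus_per_node)

# e.g. 20 workers at 1.8 CPUs each on 4-CPU nodes -> 10 nodes
print(nodes_for_pods(20, 1.8, 4))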

Error connecting to the scheduler

The cluster spawned successfully and showed a message like this:

Addresses
---------
   Web Interface:  http://104.196.202.253:8787/status
Jupyter Notebook:  http://35.185.38.137:8888
     Jupyter Lab:  http://35.185.38.137:8889
Config directory:  /Users/ajkale/.dask/kubernetes/dask-cluster-1
 Config settings:  /Users/ajkale/.dask/kubernetes/dask-cluster-1.yaml

To connect to scheduler inside of cluster
-----------------------------------------
from dask.distributed import Client
c = Client('dask-scheduler:8786')

or from outside the cluster

c = Client('104.196.202.253:8786')

Live pods:
{u'dask-scheduler': [u'dask-scheduler-9hkw6'], u'jupyter-notebook': [u'jupyter-notebook-4jzr2'], u'dask-worker': [u'dask-worker-mdr4z']}

A couple of issues:

  1. The Jupyter notebook and lab need a password. I didn't find an easy way to get those.
  2. From outside the cluster, if I try c = Client('104.196.202.253:8786') as mentioned in the info, I get this error:
distributed.utils - ERROR - in <closed TCP>: Stream is closed: while trying to call remote method 'identity'
Traceback (most recent call last):
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/distributed/utils.py", line 223, in f
    result[0] = yield make_coro()
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/distributed/client.py", line 680, in _start
    yield self._ensure_connected(timeout=timeout)
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/distributed/client.py", line 716, in _ensure_connected
    yield self.scheduler.identity()
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/distributed/core.py", line 433, in send_recv_from_rpc
    % (e, key,))
CommClosedError: in <closed TCP>: Stream is closed: while trying to call remote method 'identity'

Naive run-through of README failed

So I ran through the README, got to this point, and then failed with the following error:

mrocklin@workstation:~/workspace/dask-kubernetes$ source scripts/make_cluster.sh 
Updated property [compute/zone].
ERROR: (gcloud.container.clusters.create) ResponseError: code=404, message=The resource "projects/dask-cluster" was not found.
Fetching cluster endpoint and auth data.
ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=404, message=The resource "projects/dask-cluster" was not found.
Unable to connect to the server: dial tcp 104.196.158.131:443: i/o timeout

My first guess here is that I have some config file hidden somewhere on my computer pointing gcloud at an old resource. I suspect that I'm failing to properly execute this step in the README:

Register on the Google Cloud Platform, setup a billing account and create a project with the Google Compute Engine API enabled.

I am (perhaps naively) assuming that my previous projects on my work account already satisfy this constraint. Looking there now, it looks like there are a couple of these active. How does gcloud choose between them? How does it choose between these and my personal account? Do we know why it is finding this old dask-cluster project?

This is probably all on me rather than on this project, but any thoughts on how gcloud determines which project it uses would be welcome.

Next steps

  • update README
  • fetching IPs/ports from Kubernetes is probably fragile; we should use the full JSON syntax to get exactly the right values
  • wait until cluster is ready (with potential roll-back on failure)
  • print information when cluster creates
  • automatically start notebook/status
  • enable changing parameters in the supplied file, in the generated settings file, or via command-line options, and apply them to the cluster; we can already apply changes to the Kubernetes YAML files. Does this require rolling updates? If the scheduler is changed, must all workers restart?
  • Store the 'context' for every cluster, to avoid the time needed to look it up on every call.

Status page not connecting

When I create a fresh cluster I find that the status page does not show up. I'm not sure how to get more information to diagnose this.

periodic failures when creating a new cluster

I have run into this a few times:

$ dask create foo cluster.yml
....
replicationcontroller "jupyter-notebook" created
replicationcontroller "dask-scheduler" created
replicationcontroller "dask-worker" created
INFO: Waiting for kubernetes... (^C to stop)
INFO: Services are up
INFO: Services are up
The connection to the server x.x.x.x was refused - did you specify the right host or port?
CRITICAL: Traceback (most recent call last):
  File "/Users/bmabey/anaconda/envs/drugdiscovery/lib/python3.6/site-packages/dask_kubernetes-0.0.1-py3.6.egg/dask_kubernetes/cli/main.py", line 26, in start

Is there anything I can do to have the cluster continue to be set up after this? Eventually dask info foo returned information, but I was unable to connect to any of the services.

Deprecate or rename this project

This project is more specifically about provisioning nodes on GKE and then launching a Kubernetes cluster on those nodes. It has been incredibly helpful in the past, but with advances in more general solutions like the Helm chart and daskernetes, it may be time to start deprecating this project.

Thoughts? cc @martindurant

GCSFS fails to initialize

>>> fs = gcsfs.GCSFileSystem()
Enter the following code when prompted in the browser:
XXXX-XXXX
RuntimeError: Waited too long for browser authentication.

If we are running on Google Cloud, there must be some way of using the existing credentials. We might be able to ask our Google contacts about this.
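gcsfs can in fact read credentials from the GCE metadata service, which avoids the interactive browser flow entirely; this assumes the instances were created with a suitable storage scope:

import gcsfs

# token='cloud' picks up the VM's service-account credentials
# from the metadata server instead of prompting in a browser
fs = gcsfs.GCSFileSystem(token='cloud')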

Unable to connect to scheduler

I gave the recent changes a shot. The first time through, I was able to connect to the scheduler, but the workers kept going up and down. The second time through, I'm unable to connect to the scheduler at all.
