Dask is a flexible parallel computing library for analytics. See documentation for more information.
New BSD. See License File.
Kubernetes setup to bootstrap distributed on Google Container Engine.
This will replace scripts/make_cluster.sh and is loosely based on the command gcloud dataproc clusters, but specific to Dask.
Example usage:
$ dask-kube
Usage: dask-kube [--version] [--help] <command> [args]
Create and manage Dask clusters using Kubernetes on Google Cloud Platform.
Common commands:
create Create a cluster.
delete Delete a cluster.
update Update cluster configuration.
describe Show the details of a cluster.
list Show all clusters.
Currently the notebooks connect to Client('127.0.0.1:8786'). This made sense on EC2, when both the scheduler and the Jupyter notebook server were on the same machine. Now these probably need to be changed to Client('dask-scheduler:8786'), which seems to work well.
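The two cases can be captured in a small helper so notebooks work both colocated with the scheduler and inside a Kubernetes cluster; the function name and the placeholder IP are illustrative, not part of the project:

```python
def scheduler_address(inside_cluster, external_ip=None):
    """Pick the address a dask.distributed Client should connect to.

    Inside the cluster, Kubernetes DNS resolves the 'dask-scheduler'
    service name; from outside, a load balancer IP is needed.
    """
    if inside_cluster:
        return "dask-scheduler:8786"
    return "%s:8786" % external_ip

# from dask.distributed import Client
# client = Client(scheduler_address(inside_cluster=True))
```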
So my understanding is that it is difficult to provide an external view of the bokeh server page for each worker. However, is it possible to provide an external view of the bokeh server page for one worker? It can be very helpful to get information from a representative worker.
By convention we serve on ports 8787 (dask/bokeh) and 8888 (jupyter). However these ports may be blocked within enterprise environments. Perhaps we should serve on typical ports like 80 (for http) and 443 (for https) which are usually not blocked?
Got copy-paste of notebook command
Last week, I found dask-kubernetes and was excited to play around with it. It ended up taking me a couple of days of work to get the docker container re-compiled so that I could test it. This is because the password for the jupyter notebook is not specified. Why not disable the password by default?
I thought it'd be nice to save some money by adding the option to use preemptible instances, so I added a quickfix to do this. Will issue PR.
For large graphs we can easily extend beyond the 1GB limit here. We might consider going to 2GB or 4GB.
I was trying to do some comparisons between joblib and joblib-with-distributed-backend, and so tried to provision two or three 96- or 64-CPU machines on Google Cloud, with 40-60 workers and 40-60 CPUs for the notebook worker. I found that the provisioning process took quite a long time. I think this may just reflect underlying slowness in the container process, but if you have any insight that would be great.
I believe that the command dask-kubernetes delete NAME only deletes the GKE cluster, but does not delete the associated load balancer(s) from GCP, which continue to run and accrue billing. dask-kubernetes delete NAME should delete the GKE cluster and all associated resources on GCP.
It looks like there is a typo in the name/cluster variable. I also get follow-on errors about update_config after I fix this.
$ dask-kubernetes create test
No handlers could be found for logger "dask_kubernetes.cli.utils"
Got the same error on ubuntu 14.04 and on a mac
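The warning above comes from Python 2's logging module when a library emits a record before any handler has been configured; configuring the root logger once at CLI startup would silence it and surface the real message. A minimal sketch:

```python
import logging

# Configure the root logger once at CLI startup so library loggers
# (e.g. "dask_kubernetes.cli.utils") have a handler to emit through.
logging.basicConfig(level=logging.INFO,
                    format="%(levelname)s:%(name)s:%(message)s")

log = logging.getLogger("dask_kubernetes.cli.utils")
log.info("handlers are now configured")
```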
The readme has a note about the lack of maintenance and activity. Should this repo be archived?
This seems to be stable enough that it should probably move to the main org
After creating the cluster I need to know a few things, like where the Jupyter notebook and Bokeh page are. The README is quite good about pointing me to kubectl get services -l app=dask. Relying on Kubernetes in this way may be the right approach to avoid endless scope. However, if we do want to start doing more things like resize, it would be convenient to have the important addresses printed on the console so that I could easily copy-paste them into my web browser (or click on them, if my terminal appropriately turns them into links).
Additionally, it would be useful to get instructions on how to launch the kubernetes web interface. I think that this is typically done by setting up some sort of proxy. Launching the web interface might be sufficient to cover a lot of the requests like resize. We could just say "oh, just launch the web interface and it should be easy to do from there."
https://github.com/dask/dask-kubernetes/blob/master/dask_kubernetes/cli/main.py#L96
As suggested by the comment on that line, it indeed breaks on Python 2, but removing the exist_ok argument fixed it for me.
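A sketch of the usual Python 2/3 compatible replacement for os.makedirs(path, exist_ok=True); the helper name is mine:

```python
import errno
import os
import tempfile

def makedirs_exist_ok(path):
    """Python 2/3 compatible os.makedirs(path, exist_ok=True)."""
    try:
        os.makedirs(path)
    except OSError as e:
        # Only swallow "already exists"; re-raise real failures
        if e.errno != errno.EEXIST:
            raise

d = os.path.join(tempfile.mkdtemp(), "a", "b")
makedirs_exist_ok(d)
makedirs_exist_ok(d)  # second call no longer raises
```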
For security reasons, I think that the default configuration should not map the jupyter / scheduler services on a public IP address (even if jupyter notebook asks for a password, passing a password over HTTP without TLS is unsafe).
It would be better to advertise the use of:
kubectl port-forward name-of-service localport:serviceport
We could even have some dask-kubernetes helper commands to do that automatically and open the notebook and other HTTP status pages on http://localhost:localport instead.
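A sketch of what such a helper command might look like in Python, assuming kubectl is on the PATH; the function name and signature are hypothetical, not an existing dask-kubernetes command:

```python
import subprocess
import webbrowser

def port_forward(service, local_port, service_port, launch=False):
    """Hypothetical helper: forward a Kubernetes service to localhost.

    Builds the kubectl command; with launch=True it starts the forward
    and opens the page in a browser.
    """
    cmd = ["kubectl", "port-forward", "service/%s" % service,
           "%d:%d" % (local_port, service_port)]
    if launch:
        proc = subprocess.Popen(cmd)  # runs until interrupted
        webbrowser.open("http://localhost:%d" % local_port)
        return proc
    return cmd

# Build (but don't run) the command for the notebook service:
port_forward("jupyter-notebook", 8888, 8888)
```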
This repository tends to fall behind whenever we update Dask. We might want to move the basic image to something in a standard dask/ namespace. We may also want to add instructions on how to rebuild and republish this image whenever we issue a release. This might go into the Dask release-procedure notes.
Calling any browser page gives an error message like doesn't understand the "open location" message. (-1708), because of a change in the latest OSX. This can only be fixed in Python itself (webbrowser is a built-in package) or by setting an environment variable like export BROWSER=open. click.launch seems to work fine, however. Since click is already a dependency of this project, should that become the way that browser pages are opened? Does it work on all platforms?
Now that we have released 0.14.3 it would be nice to update the docker image. How should one do this in the future?
Currently we set the kubectl credentials when creating a cluster, but that only remains valid so long as credentials are not set elsewhere. We should provide a command to set credentials for a given cluster, for users who wish to use kubectl directly. A possible alternative would be to report the Kubernetes "context" for a cluster, since passing --context to kubectl works without explicitly fetching credentials. We should probably provide both.
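A sketch of both options: the gcloud invocation that (re)registers kubectl credentials, and reporting the context name, which GKE conventionally writes into kubeconfig as gke_&lt;project&gt;_&lt;zone&gt;_&lt;cluster&gt;. The helper names are illustrative:

```python
def credentials_cmd(cluster, zone):
    """The gcloud call that writes kubectl credentials for a cluster."""
    return ["gcloud", "container", "clusters", "get-credentials",
            cluster, "--zone", zone]

def context_name(project, zone, cluster):
    """GKE names kubeconfig contexts gke_<project>_<zone>_<cluster>,
    so the context can be reported without re-fetching credentials."""
    return "gke_%s_%s_%s" % (project, zone, cluster)

print(" ".join(credentials_cmd("test-1", "us-east1-b")))
print(context_name("continuum-compute", "us-east1-b", "test-1"))
```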
It appears that this tool requires being run from a directory that has a kubernetes subdirectory with appropriate yaml files. While this is probably true in development, it may not be true if someone pip installs this library. Instead, would it be possible to look for a local kubernetes directory and then, if one does not exist, perhaps look at ~/.dask/kubernetes, where perhaps we populate some default config files?
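The proposed lookup order could be as simple as the following; the function name is mine, and populating the fallback directory with default config files is elided:

```python
import os

def config_dir(cwd=None):
    """Resolve the kubernetes yaml directory: prefer ./kubernetes,
    otherwise fall back to ~/.dask/kubernetes (which we would populate
    with default config files on first use)."""
    cwd = cwd or os.getcwd()
    local = os.path.join(cwd, "kubernetes")
    if os.path.isdir(local):
        return local
    return os.path.expanduser("~/.dask/kubernetes")
```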
mrocklin@carbon:~$ dask-kubernetes create test-1
executing: gcloud config set compute/zone us-east1-b
Updated property [compute/zone].
executing: gcloud container clusters create test-1 --num-nodes 6 --machine-type n1-standard-4 --no-async --disk-size 50 --tags=dask,anacondascale
Creating cluster test-1...done.
Created [https://container.googleapis.com/v1/projects/continuum-compute/zones/us-east1-b/clusters/test-1].
kubeconfig entry generated for test-1.
NAME ZONE MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
test-1 us-east1-b 1.5.7 35.185.100.244 n1-standard-4 1.5.7 6 RUNNING
executing: gcloud container clusters get-credentials test-1
Fetching cluster endpoint and auth data.
kubeconfig entry generated for test-1.
executing: kubectl create -f kubernetes
error: the path "kubernetes" does not exist
It would be interesting (and community-minded) to add a JupyterLab instance running alongside the classic Jupyter notebook.
I've been trying to remove the name distributed from any package or documentation. It's a bad name for a software project. I've been calling the whole thing just dask. This would leave us with kubernetes-dask. Many other dask projects switch the name here to something like dask-foo. For example dask-drmaa, dask-marathon, dask-ssh, dask-ec2, etc. It might be more consistent to follow this pattern.
For various reasons it would be convenient to have a KubernetesCluster object in Python that we could play with during execution, rather than use the command line. In my immediate use case I'm building benchmarks and would like to try the same computation with many clusters of different sizes.
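A sketch of what such an object might look like, wrapping the existing CLI via subprocess; the class and its methods are hypothetical, not an existing API:

```python
import subprocess

class KubernetesCluster(object):
    """Hypothetical Python wrapper around the dask-kubernetes CLI, so
    benchmarks can create, resize, and delete clusters programmatically."""

    def __init__(self, name, num_workers=2):
        self.name = name
        self.num_workers = num_workers

    def _run(self, *args):
        # Shell out to the existing command-line tool
        subprocess.check_call(["dask-kubernetes"] + list(args))

    def create(self):
        self._run("create", self.name)

    def resize(self, n):
        self._run("resize", "pods", self.name, str(n))
        self.num_workers = n

    def close(self):
        self._run("delete", self.name)

    def __enter__(self):
        self.create()
        return self

    def __exit__(self, *exc):
        self.close()

# e.g. benchmark the same computation at several sizes:
# with KubernetesCluster("bench", num_workers=10) as cluster:
#     cluster.resize(20)
```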
I've recently started exploring using dask-kubernetes to automate cluster deployment. Is it possible to add a few more comments to the defaults.yaml configuration file?
For example:
One of the first things I wanted to do is to see how things scale as I add more workers. There's a basic formula that I think I understand
count * cpus_per_worker + jupyter_cpu + scheduler_cpu <= num_nodes * cpus_per_node
But why is the cpus_per_worker non-integer? When I dig a bit further, the default value of 1.8 appears to get normalized to 2.
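Rearranging the formula for count gives a quick way to check cluster sizes numerically. The Jupyter and scheduler CPU reservations below are assumed values; a fractional cpus_per_worker like 1.8 plausibly leaves headroom on each node for Kubernetes system pods, though whether the observed rounding to 2 happens at request time or only in display is worth confirming:

```python
def max_workers(num_nodes, cpus_per_node, cpus_per_worker,
                jupyter_cpu=1.0, scheduler_cpu=1.0):
    """count <= (num_nodes * cpus_per_node - jupyter - scheduler) / cpus_per_worker"""
    available = num_nodes * cpus_per_node - jupyter_cpu - scheduler_cpu
    return int(available // cpus_per_worker)

# The 6 x n1-standard-4 default (4 CPUs per node) from the transcripts here:
print(max_workers(6, 4, 1.8))  # -> 12
print(max_workers(6, 4, 2.0))  # -> 11
```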
Hi Martin,
I am doing some work in the Kubernetes space around ML / DL, mostly on the infrastructure side to make it easy to onboard clustering. One of the things I have worked on is a Helm chart for Tensorflow (write up here)
Someone told me that Anaconda would make a lot of sense, so I started looking into ways to run it on K8s, and your repo is the most advanced setup I can find. However, I try to stick to Helm packages in general, and you don't have that.
Are you accepting PRs, and is this something you would value?
If not, do you mind if I borrow your structure to start a chart? Any license you'd like to apply?
Many thanks! I am also available on Twitter to discuss @SaMnCo_23.
One common use of dask-ec2 today is to support conferences and demonstrations. To satisfy this application we would need the following:
dask-data gcs container

Similar to dask and dask-ec2, it would be helpful to give users easy access to common UI URLs for a given cluster.
Example usage:
$ dask-kubernetes info NAME
Addresses
---------
Web Interface: http://xxx.xxx.xxx.xxx:8787/status
Scheduler (TCP): xxx.xxx.xxx.xxx:8786
Scheduler (HTTP): xxx.xxx.xxx.xxx:9786
Bokeh: xxx.xxx.xxx.xxx:8788
Jupyter: xxx.xxx.xxx.xxx:8888
To connect to scheduler inside of cluster
-----------------------------------------
from dask.distributed import Client
c = Client('dask-scheduler:8786')
To connect to scheduler outside of cluster
------------------------------------------
from dask.distributed import Client
c = Client('xxx.xxx.xxx.xxx:8786')
It looks like the current dask-kubernetes resize operation passes through to kubectl. Users are recommended to increase both the number of pods and the number of nodes:
To add machines to the cluster, you may do the following
dask-kubernetes resize nodes CLUSTER COUNT
To add worker containers, you may do the following
dask-kubernetes resize pods CLUSTER COUNT
Is it possible to take on this bookkeeping ourselves, so that users don't need to understand the difference here?
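A sketch of how the tool could do this bookkeeping itself: derive the node count from the requested worker count, then drive both existing subcommands. The packing factor and the extra node for the scheduler/notebook pods are assumptions for illustration:

```python
import math

def resize(cluster, workers, workers_per_node=2):
    """Derive node count from worker count, then emit both resizes.

    workers_per_node and the +1 node (scheduler + notebook pods) are
    illustrative assumptions about how pods pack onto nodes.
    """
    nodes = int(math.ceil(workers / float(workers_per_node))) + 1
    return [
        ["dask-kubernetes", "resize", "nodes", cluster, str(nodes)],
        ["dask-kubernetes", "resize", "pods", cluster, str(workers)],
    ]

# A real implementation would subprocess.check_call each command:
for cmd in resize("test-1", 10):
    print(" ".join(cmd))
```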
I got the cluster successfully spawned and it showed a message like this
Addresses
---------
Web Interface: http://104.196.202.253:8787/status
Jupyter Notebook: http://35.185.38.137:8888
Jupyter Lab: http://35.185.38.137:8889
Config directory: /Users/ajkale/.dask/kubernetes/dask-cluster-1
Config settings: /Users/ajkale/.dask/kubernetes/dask-cluster-1.yaml
To connect to scheduler inside of cluster
-----------------------------------------
from dask.distributed import Client
c = Client('dask-scheduler:8786')
or from outside the cluster
c = Client('104.196.202.253:8786')
Live pods:
{u'dask-scheduler': [u'dask-scheduler-9hkw6'], u'jupyter-notebook': [u'jupyter-notebook-4jzr2'], u'dask-worker': [u'dask-worker-mdr4z']}
Couple of issues: when I connect with c = Client('104.196.202.253:8786') as mentioned in the info, I get this error:
distributed.utils - ERROR - in <closed TCP>: Stream is closed: while trying to call remote method 'identity'
Traceback (most recent call last):
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/distributed/utils.py", line 223, in f
result[0] = yield make_coro()
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/distributed/client.py", line 680, in _start
yield self._ensure_connected(timeout=timeout)
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/distributed/client.py", line 716, in _ensure_connected
yield self.scheduler.identity()
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/Users/ajkale/anaconda2/lib/python2.7/site-packages/distributed/core.py", line 433, in send_recv_from_rpc
% (e, key,))
CommClosedError: in <closed TCP>: Stream is closed: while trying to call remote method 'identity'
So I ran through the README, got to this point, and then failed with the following error:
mrocklin@workstation:~/workspace/dask-kubernetes$ source scripts/make_cluster.sh
Updated property [compute/zone].
ERROR: (gcloud.container.clusters.create) ResponseError: code=404, message=The resource "projects/dask-cluster" was not found.
Fetching cluster endpoint and auth data.
ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=404, message=The resource "projects/dask-cluster" was not found.
Unable to connect to the server: dial tcp 104.196.158.131:443: i/o timeout
My first guess here is that I have some config file hidden somewhere on my computer pointing gcloud to use an old resource. I suspect that I'm failing to properly execute this step in the README:
Register on the Google Cloud Platform, setup a billing account and create a project with the Google Compute Engine API enabled.
I am (perhaps naively) assuming that my previous projects on my work account already satisfy this constraint. Looking there now, it looks like there are a couple of these active. How does gcloud choose between them? How does it choose between these and my personal account? Do we know why it is finding this old dask-cluster project?
This is probably all on me rather than on this project, but any thoughts on how gcloud determines which project it uses would be welcome.
When I create a fresh cluster I find that the status page does not show up. I'm not sure how to get more information to diagnose this.
I have run into this a few times:
$ dask create foo cluster.yml
....
replicationcontroller "jupyter-notebook" created
replicationcontroller "dask-scheduler" created
replicationcontroller "dask-worker" created
INFO: Waiting for kubernetes... (^C to stop)
INFO: Services are up
INFO: Services are up
The connection to the server x.x.x.x was refused - did you specify the right host or port?
CRITICAL: Traceback (most recent call last):
File "/Users/bmabey/anaconda/envs/drugdiscovery/lib/python3.6/site-packages/dask_kubernetes-0.0.1-py3.6.egg/dask_kubernetes/cli/main.py", line 26, in start
Is there anything I can do to have the cluster continue to be set up after this? Eventually dask info foo returned information, but I was unable to connect to any of the services.
I am working on developing a dask-kubernetes deployment that will need to access datasets stored on a GCP persistent disk. I've tried a number of approaches without success and am wondering if anyone in the dask-kubernetes development circle has any experience in this area.
xref: pangeo-data/pangeo#16
This project is more specifically about provisioning nodes on GKE and then launching a kubernetes cluster on those nodes. This has been incredibly helpful in the past, but with advances in more general solutions like the helm chart and daskernetes it may be time to start deprecating this project.
Thoughts? cc @martindurant
Are you interested in a patch that auto-generates an SSL certificate for the jupyter-notebook process?
>>> fs = gcsfs.GCSFileSystem()
Enter the following code when prompted in the browser:
XXXX-XXXX
RuntimeError: Waited too long for browser authentication.
If we are running from gcloud there must be some way of using the existing credentials. We might be able to ask our Google contacts about this.
I gave the recent changes a shot. First time through I was able to connect to the scheduler but the workers kept going up and down. Second time through I'm unable to connect to the scheduler.