
dask / dask-examples

361 stars, 26 watchers, 223 forks, 622.78 MB

Easy-to-run example notebooks for Dask

Home Page: https://examples.dask.org/

License: Creative Commons Attribution Share Alike 4.0 International

Languages: Jupyter Notebook 97.07%, Python 2.66%, Shell 0.13%, Makefile 0.14%
Topics: hacktoberfest

dask-examples's Introduction

Dask Example Notebooks

This repository includes easy-to-run example notebooks for Dask. They are intended to be educational and give users a start on common workflows.

They should be easy to run locally if you download this repository. They are also available on the cloud by clicking on the link below:

Binder Build Status

Contributing

This repository is a great opportunity to start contributing to Dask. Please note that examples submitted to this repository should follow these guidelines:

  1. Run top-to-bottom without intervention from the user

  2. Not require external data sources that may disappear over time (external data sources that are highly unlikely to disappear are fine)

  3. Not be resource intensive; they should run within 2GB of memory

  4. Be clear and contain enough prose to explain the topic at hand

  5. Be concise and limited to one or two topics, such that a reader can get through the example within a few minutes of reading

  6. Be of general relevance to Dask users, and so not too specific to a particular problem or use case

    As an example, "how to do dataframe joins" is a great topic, while "how to do dataframe joins in the particular case when one column is a categorical and the other is object dtype" is probably too specific

  7. If the example requires a library not included in binder/environment.yml, then it should be pip installed in the first cell of the notebook, with a brief explanation of what functionality the library adds. A brief example follows:

    ### Install Extra Dependencies
    
    We first install the library X for interacting with Y
    !pip install X

Updating the Binder environment

  1. Modify binder/environment-base.yml with new or updated dependencies

  2. Run a linux/amd64 Docker container with mamba available. For example:

    docker run --platform=linux/amd64 -it --rm --mount type=bind,source=$(pwd)/binder,target=/binder condaforge/mambaforge /bin/bash

    This mounts the ./binder folder in /binder in the Docker container

  3. Create the environment

    mamba env create -f environment-base.yml

    This may take quite a while.

  4. Export the environment specification:

    mamba env export -n dask-examples --no-builds -f environment.yml

dask-examples's People

Contributors

albertdefusco, alex-rakowski, charlesbluca, dvatterott, edesz, genevievebuckley, guillaumeeb, habi, ian-r-rose, jacobtomlinson, jrbourbeau, jsignell, martindurant, mrocklin, quasiben, raybellwaves, richardscottoz, rima-ag, rmsare, rochaporto, sasha-kap, saulshanabrook, scharlottej13, sephib, smartlixx, stsievert, tadejong, tomaugspurger, volkerh, willirath


dask-examples's Issues

Static Site Feedback

http://dask.org/dask-examples/

Stencil example overlap?

In the Stencil example, the parallel code doesn't do anything to handle boundaries or overlap. I assume this means that at the block edges there could be some discrepancies? Should this be mentioned, or is the effect small enough not to be significant?

Lab extension no longer active

Our binder environment no longer seems to have the dask labextension active. Perhaps this was caused by an upstream change in the jupyter environment?

@ian-r-rose any thoughts on what might have happened here?

Looking at some build logs I'm getting this

Step 48/53 : RUN ./binder/postBuild
 ---> Running in 31a382825faa
> /srv/conda/envs/notebook/bin/npm pack dask-labextension
dask-labextension-0.3.0.tgz
Incompatible extension:

"[email protected]" is not compatible with the current JupyterLab
Conflicting Dependencies:
JupyterLab              Extension        Package
>=0.18.6 <0.19.0        >=0.19.1 <0.20.0 @jupyterlab/application
>=0.18.4 <0.19.0        >=0.19.1 <0.20.0 @jupyterlab/apputils
>=0.18.4 <0.19.0        >=0.19.1 <0.20.0 @jupyterlab/console
>=0.18.5 <0.19.0        >=0.19.1 <0.20.0 @jupyterlab/notebook

Found compatible version: 0.1.2
> /srv/conda/envs/notebook/bin/npm pack dask-labextension@0.1.2
dask-labextension-0.1.2.tgz
> node /srv/conda/envs/notebook/lib/python3.6/site-packages/jupyterlab/staging/yarn.js install
yarn install v1.9.4

Organization

Thinking optimistically, if we get many examples in this repository, how should we organize them? I recommend that we have a small number of notebooks at the top level, showing an introduction for each topic like arrays, dataframes, delayed, ML, ..., and also a directory for topics that have many additional notebooks.

Add example for dealing with satellite imagery

This notebook demonstrates using XArray and Dask with a large satellite image. It downloads the image from S3 with rasterio and then loads it in chunks.

It doesn't do a whole lot with it afterwards though, which presumably we should change. This, along with ensuring that the computation we choose also fits nicely in RAM (might have to play around with chunk sizes a bit) is presumably the hard part of this exercise.

Additionally, there is an open question of whether or not we should include rasterio in the docker image (I suspect that it brings in GDAL, which would greatly increase the image size). Instead we might include a !pip install xarray rasterio line or something similar at the top of the notebook.
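Below is a rough sketch of the chunked-load step described above, using the older xr.open_rasterio API (newer code would go through rioxarray). The file name and chunk sizes are illustrative, and it assumes xarray and rasterio have been pip installed as suggested:

import xarray as xr

# lazily open a (hypothetical) GeoTIFF as a chunked Dask-backed DataArray
scene = xr.open_rasterio("scene.tif", chunks={"band": 1, "x": 2048, "y": 2048})

# a trivial follow-up computation that fits comfortably in memory
band_means = scene.mean(dim=["x", "y"]).compute()
print(band_means.values)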

Regularly check for broken links

Inspired by #151, which I encountered manually, I just ran http://examples.dask.org through a dead link checker and found a handful of broken links. I don't have time to fix them all right now but I just thought I'd drop the results here.

It might be a good idea to include a dead link check as part of the website deploy, but that may also be overkill!

Status URL Source link text
-1 Invalid URL http://127.0.0.1:8787/status http://127.0.0.1:8787/status
404 Not Found https://docs.dask.org/en/latest/bag-overview.html Dask Bag Documentation
404 Not Found https://www.continuum.io/sites/default/files/dask_stacked.png
-1 Invalid URL http://10.20.0.141:8787/status http://10.20.0.141:8787/status
404 Not Found https://ml.dask.org/examples/xgboost.html http://ml.dask.org/examples/xgboost.html
404 Not Found https://xgboost.readthedocs.io/en/latest/python/python_intro https://xgboost.readthedocs.io/en/latest/python/python_intro
404 Not Found https://distributed.readthedocs.io/en/latest/local-cluster.html local cluster
404 Not Found https://docs.scipy.org/doc/numpy-1.16.0/reference/c-api https://docs.scipy.org/doc/numpy-1.16.0/reference/c-api
404 Not Found https://scikit-learn.org/stable/modules/scaling_strategies.html user guide [301 from http://scikit-learn.org/stable/modules/scaling_strategies.html]
404 Not Found https://numpy.org/doc/stable/reference/c-api.generalized-ufuncs.html Generalized Universal Functions [302 from https://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html]
404 Not Found https://examples.dask.org/proxy/8787/status dashboard's status page
404 Not Found https://examples.dask.org/proxy/8787/graph dashboard's graph page
404 Not Found https://examples.dask.org/applications/' Cleaning up temporary directories and files
404 Not Found https://examples.dask.org/applications/clip.gif img/src
404 Not Found https://examples.dask.org/surveys/examples.dask.org dask examples
-1 Timeout http://www.celeryproject.org/ Celery
404 Not Found https://distributed.readthedocs.io/en/latest/setup.html scale out to a cluster
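A dead link check along these lines could be as simple as the sketch below (the URL list is illustrative; a real check would first scrape the rendered pages for links):

import requests

urls = ["https://examples.dask.org/", "https://docs.dask.org/en/latest/"]
for url in urls:
    try:
        status = requests.head(url, allow_redirects=True, timeout=10).status_code
    except requests.RequestException as exc:
        status = f"error: {exc}"
    print(status, url)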

Add example for embarrassingly parallel parametrized computation

Would it be of interest to people here to have an example showing how to do embarrassingly parallel computation on some simulation code with Dask?

The idea is that you have some code that does heavy computation: it takes some parameters and returns one or more values as a result. You want to evaluate the program on a grid, with a predefined list of inputs, or maybe with randomly picked inputs (Monte Carlo simulation). The code runs in a few seconds or minutes for each input, and you've got to run it with several thousand or more different inputs. This is often seen in HPC with physical simulations, and it is not very efficient to do it only with job arrays or the like in PBS or other schedulers.

Basic steps would be:

  1. Generate or read a list of parameters
  2. Use client.map or delayed to iterate over these parameters and run the computation
  3. Gather all results and save them to a tabular file.
  4. Finally, do some analysis on the results, such as means or other aggregations (a rough sketch follows below).
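A rough sketch of those steps, assuming a hypothetical simulate function standing in for the real simulation code:

import pandas as pd
from dask.distributed import Client

def simulate(x, y):
    # stand-in for an expensive simulation that returns one or more values
    return {"x": x, "y": y, "result": x * y}

client = Client()  # or connect to an existing cluster

# 1. Generate or read a list of parameters
params = [(x, y) for x in range(100) for y in range(100)]

# 2. Map the simulation over the parameters
futures = client.map(lambda p: simulate(*p), params)

# 3. Gather all results and save them to a tabular file
df = pd.DataFrame(client.gather(futures))
df.to_csv("results.csv", index=False)

# 4. Some simple analysis on the results
print(df["result"].mean())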

I'm currently working on something like that for showing it to our scientists.

Any thought on that?

Move default branch from "master" -> "main"

@jrbourbeau and I are in the process of moving the default branch for this repo from master to main.

  • Changed in GitHub
  • Merged PR to change branch name in code (xref #183)

What you'll see

Once the name on github is changed (the first box above is Xed, or this issue closed), when you try to git pull you'll get

Your configuration specifies to merge with the ref 'refs/heads/master'
from the remote, but no such ref was fetched.

What you need to do

First: head to your fork and rename the default branch there
Then:

git branch -m master main
git fetch origin
git branch -u origin/main main

HyperbandSearchCv gives ImportError

I was following through the example notebook: https://github.com/dask/dask-examples/blob/master/machine-learning/hyperparam-opt.ipynb

In the cell:

from dask_ml.model_selection import HyperbandSearchCV

It gives following error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-10-cc8522d594f9> in <module>()
----> 1 from dask_ml.model_selection import HyperbandSearchCV

~/miniconda3/envs/dataSc/lib/python3.7/site-packages/dask_ml/model_selection/__init__.py in <module>()
      5 """
      6 from ._search import GridSearchCV, RandomizedSearchCV, compute_n_splits, check_cv
----> 7 from ._split import ShuffleSplit, KFold, train_test_split
      8 
      9 

~/miniconda3/envs/dataSc/lib/python3.7/site-packages/dask_ml/model_selection/_split.py in <module>()
     10 import numpy as np
     11 import sklearn.model_selection as ms
---> 12 from sklearn.model_selection._split import (
     13     BaseCrossValidator,
     14     _validate_shuffle_split,

ImportError: cannot import name '_validate_shuffle_split_init' from 'sklearn.model_selection._split' (/Users/poudel/miniconda3/envs/dataSc/lib/python3.7/site-packages/sklearn/model_selection/_split.py)

My imports:

[('numpy', '1.16.4'), ('pandas', '0.25.0'), ('dask', '2.5.0'), ('sklearn', '0.21.2')]

How to fix the error?

Dask-ML Scikit-Learn ImportError

It seems that the recent sklearn update has broken dask-ml, which breaks our CI.

This is odd though, since we seem to pin Scikit-Learn in our environment.yml

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-001abd10b0d1> in <module>
      1 import dask_ml.datasets
----> 2 import dask_ml.cluster
      3 import matplotlib.pyplot as plt
~/miniconda/envs/test/lib/python3.7/site-packages/dask_ml/cluster/__init__.py in <module>
      1 """Unsupervised Clustering Algorithms"""
      2 
----> 3 from .k_means import KMeans  # noqa
      4 from .minibatch import PartialMiniBatchKMeans  # noqa
      5 from .spectral import SpectralClustering  # noqa
~/miniconda/envs/test/lib/python3.7/site-packages/dask_ml/cluster/k_means.py in <module>
     21 )
     22 from ..utils import _timed, _timer, check_array, row_norms
---> 23 from ._compat import _k_init
     24 
     25 logger = logging.getLogger(__name__)
~/miniconda/envs/test/lib/python3.7/site-packages/dask_ml/cluster/_compat.py in <module>
      2 
      3 if SK_022:
----> 4     from sklearn.cluster._k_means import _k_init
      5 else:
      6     from sklearn.cluster.k_means_ import _k_init
ModuleNotFoundError: No module named 'sklearn.cluster._k_means'

Use fsspec release

We switched to installing fsspec from master in #95. We should revert back to installing a release version once there's a new release with the changes in fsspec/filesystem_spec#128. Note: at the time of opening this issue the latest release is fsspec==0.4.4

Jupyterlab broken in binder?

Currently, I get a broken JupyterLab on binder while the standard notebook interface seems to be working.

(Possibly related to #77.)

Maintenence: update mybinder environment.yml dependency versions

I noticed that the mybinder environment.yml file pins dask to version 0.20, but the latest dask release is now up to 1.1.2. It's probably time to update or unpin some of these dependencies. Should we do that?

Currently:
https://github.com/dask/dask-examples/blob/master/binder/environment.yml

channels:
  - conda-forge
dependencies:
  - python=3
  - bokeh=0.13
  - dask=0.20
  - dask-ml=0.10.0
  - distributed=1.24
  - jupyterlab=0.35.1
  - nodejs=8.9
  - numpy
  - pandas
  - pyarrow==0.10.0
  - scikit-learn=0.20
  - matplotlib
  - nbserverproxy
  - nomkl
  - h5py
  - xarray
  - bottleneck
  - py-xgboost
  - pip:
    - graphviz
    - dask_xgboost
    - seaborn
    - mimesis

Switch to JupyterLab and use dask-labextension

I believe that we should switch the default environment to use JupyterLab and that we should encode a specific layout on container startup. In this way the user is presented with the dashboard without having to do anything.

@ian-r-rose is the best way to provide a default layout to create a JSON file and insert it into the .jupyter/lab/workspaces directory, or is there another common route for these things?

Use an already trained Torch model to predict on lots of data

Extending on #35 it would be nice to have an example using Dask and Torch together to parallelize prediction. This should be a simple embarrassingly parallel use case, but I suspect that it would be pragmatic for lots of folks.

The challenge, I think, is constructing a simple example that hopefully doesn't get too much into Torch or a dataset. In my ideal world this would be something like

import torchvision
model = torchvision.get_model("model_name")

dataset = get_canned_dataset()
>>> imshow(dataset[0])  # show an example image

>>> model.predict(dataset[0])
"this is a cat"

... # then dask things here

Does anyone have good pointers to such a simple case?
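For what it's worth, here is one hedged sketch of the "dask things" part, using map_blocks over a random dask array in place of a real canned dataset (the model choice and array shapes are illustrative only):

import dask.array as da
import numpy as np
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)
model.eval()

# a dask array standing in for the images, chunked along the batch dimension
images = da.random.random((1000, 3, 224, 224), chunks=(100, 3, 224, 224))

def predict_chunk(block):
    # run the model on one chunk of images and return predicted class ids
    with torch.no_grad():
        out = model(torch.from_numpy(block).float())
    return out.argmax(dim=1).numpy()

predictions = images.map_blocks(predict_chunk, drop_axis=[1, 2, 3], dtype=np.int64)
print(predictions.compute()[:10])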

cc @stsievert @TomAugspurger @AlbertDeFusco

Occasional failure in HTTP bytes

When running CI in this project I sometimes run across the following error:

~/miniconda/envs/test/lib/python3.7/site-packages/dask/bag/core.py in reify()
   1603 def reify(seq):
   1604     if isinstance(seq, Iterator):
-> 1605         seq = list(seq)
   1606     if seq and isinstance(seq[0], Iterator):
   1607         seq = list(map(list, seq))
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bag/core.py in map_chunk()
   1769                 yield f(**k)
   1770     else:
-> 1771         for a in zip(*args):
   1772             yield f(*a)
   1773 
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bag/text.py in file_to_blocks()
    103 def file_to_blocks(lazy_file):
    104     with lazy_file as f:
--> 105         for line in f:
    106             yield line
    107 
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bytes/http.py in read()
    247             # EOF (python files don't error, just return no data)
    248             return b''
--> 249         self._fetch(self.loc, end)
    250         data = self.cache[self.loc - self.start:end - self.start]
    251         self.loc = end
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bytes/http.py in _fetch()
    258             self.start = start
    259             self.end = end + self.blocksize
--> 260             self.cache = self._fetch_range(start, self.end)
    261         elif start < self.start:
    262             if self.end - end > self.blocksize:
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bytes/http.py in _fetch_range()
    320             if cl <= end - start:
    321                 # data size OK
--> 322                 return r.content
    323             else:
    324                 raise ValueError('Got more bytes (%i) than requested (%i)' % (
~/miniconda/envs/test/lib/python3.7/site-packages/requests/models.py in content()
    826                 self._content = None
    827             else:
--> 828                 self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
    829 
    830         self._content_consumed = True
~/miniconda/envs/test/lib/python3.7/site-packages/requests/models.py in generate()
    751                         yield chunk
    752                 except ProtocolError as e:
--> 753                     raise ChunkedEncodingError(e)
    754                 except DecodeError as e:
    755                     raise ContentDecodingError(e)
ChunkedEncodingError: ('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))
ChunkedEncodingError: ('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))
You can ignore this error by setting the following in conf.py:
    nbsphinx_allow_errors = True
Notebook error:
CellExecutionError in applications/json-data-on-the-web.ipynb:
------------------
df.spec.value_counts().nlargest(20).to_frame().compute()
------------------

@martindurant , this seems to be in your general domain. Do you have any suggestions on what might be happening here?

Go through dataframe notebook with a pedagogical eye

At AnacondaCon @AlbertDeFusco went through the dataframe notebook with me and provided a dozen small corrections to improve the flow of a novice user through the notebook. I diligently copied down the changed notebook but seem to have lost all memory of where I placed it.

@AlbertDeFusco if you have time can I ask you for feedback on this notebook again?

Your attention on any of the notebooks here, or on others, would also be very welcome.

Some feedback from a quick readthrough

array.ipynb

  • The description at the top "Dask Dataframes coordinate many Pandas dataframes," applies to dataframe, not array.

  • If this notebook is intended as one of the main intros, it's sort of annoying that the dashboard link situation is so fussy--I first clicked without running the cell, then ran the cell and got a link that won't work, then could click the right link. Since the link isn't going to be right, it might be less noise not to present the (cool) HTML repr of the client.

  • "This creates a 10000, 10000 arrays of random number" -> "This creates a 10000, 10000 array of random number"

  • "This creates a 10000, 10000 arrays of random numbers, split into a 10x10 grid of 1000x1000 NumPy arrays." I know you probably want to avoid verbosity, but as this is one of the fundamental dask-unique concepts it might be nice to answer some basic questions about how chunks work, e.g. "This creates a 10000x10000 array of random numbers, represented as many numpy arrays of size 1000x1000 (or smaller if the array cannot be divided evenly). In this case there are 100 (10x10) numpy arrays of size 1000x1000."

dataframe.ipynb

  • compute is used but not explained like it was in the array notebook

  • The mix of eager operations and lazy operations is significant, and I don't know if someone would walk away having a real sense for what operations would do what.

  • The warning /srv/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected. Before: .apply(func) After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result or: .apply(func, meta=('x', 'f8')) for series result """Entry point for launching an IPython kernel. in the last cell is a bit hard to work your way through. (There was also a warning in cell 9, but it's a lot more digestible).

delayed.ipynb

  • from random import random within each function isn't a very normal pattern in Python in general, and I got hung up on why you would do it for a moment. I still don't know whether it was meaningful (like, hoping the random state would get initialized differently in each process) or just because the need was internal.

  • zs = dask.persist(*zs) is unexplained and can look a bit scary.

  • "Note the red bars for inter-worker communication. Also note how there is lots of parallelism at the beginning but less towards the end as we reach the top of the tree where there is less work to do." This seems to lack the part where it originally told me to look somewhere before I computed.

Find good XArray example

The XArray docs likely have some good informative examples that we could include in this repository. We might also consider including XArray in the environment.

cc @shoyer @jhamman for insight

XArray might also want to fork and make their own repository, but they'd be welcome here as well
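One candidate along these lines, assuming XArray's tutorial dataset (which needs an internet connection the first time it is fetched):

import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature", chunks={"time": 1000})
monthly = ds.air.groupby("time.month").mean().compute()
print(monthly)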

Can not use "conda install" in binder environment

It seems that, for whatever reason, it is not possible to install a package into the binder environment via conda.
I tried via the terminal and directly from within the Jupyter notebook. I started binder by clicking one of the links from the dask-examples page.

It does not really matter which package (I tried dask-sql, but also e.g. pandas, which is already installed). Both fail with:

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are missing from the target environment:
  - nodejs==14
  - numpy==1.18

The installation seems to work with mamba, which makes me think it is some problem in the way conda resolves the packages - but I have no idea what is going wrong here.

Now comes the "funny" part: If I explicitly install

conda install numpy=1.18 nodejs=14

it actually works, even though conda list already shows them as installed. It will downgrade some packages:

The following packages will be UPDATED:

  certifi                          2020.6.20-py38h32f6830_0 --> 2020.6.20-py38h924ce5b_2

The following packages will be DOWNGRADED:

  jupyterlab                                     2.1.5-py_0 --> 2.1.0-py_1
  pandas                               1.0.5-py38hcb8c335_0 --> 1.0.0-py38hb3f55d8_0
  scikit-learn                        0.23.2-py38h5d63f67_2 --> 0.23.0-py38h3a94b23_0

After that, I can also install other packages...

Use an already trained Keras model to predict on lots of data

A common approach is to train on a bit of data and then use that trained model to predict on lots of data. We could do this using ParallelPostFit in dask-ml, or we can use X.map_blocks or df.map_partitions. In either case we might want to be a bit careful about avoiding repeated serialization costs. For example, in the following case I suspect that we include the serialized model in every task

# maybe bad?
model = load_model()
predictions = X.map_blocks(model.predict)  

It's probably better to encourage the user to keep the model delayed

# probably better
model = dask.delayed(load_model)()
predictions = X.map_blocks(model.predict)  
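A slightly fuller, hedged sketch of the delayed-model pattern, using to_delayed rather than map_blocks (load_model and "model.h5" are hypothetical; any model with a predict method would do):

import dask
import dask.array as da

def load_model():
    # hypothetical loader for a trained Keras model saved to disk
    from tensorflow import keras
    return keras.models.load_model("model.h5")

X = da.random.random((100_000, 10), chunks=(10_000, 10))

# keep the model as a single delayed object so it appears once in the graph
model = dask.delayed(load_model)()

predict = dask.delayed(lambda m, block: m.predict(block))
parts = [predict(model, block) for block in X.to_delayed().ravel()]
predictions = dask.compute(*parts)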

We should also ensure that dask-ml does this correctly, and includes the model as a single task in the graph so that it gets sent around appropriately (cc @TomAugspurger )

I'm also generally curious if a Keras model that lives on the GPU will eventually make its way back onto the GPU when deserializing.

Error from dask xgboost

I was trying a contrived example with the following code... got this long, unexpected error. Any idea how to proceed?

import dask_xgboost
params = {'objective': 'binary:logistic',
          'max_depth': 4, 'eta': 0.01, 'subsample': 0.5,
          'min_child_weight': 0.5}

bst = dask_xgboost.train(client, params, train_df.to_dask_array(), label_df.to_dask_array(), num_boost_round=10)
> TypeError                                 Traceback (most recent call last)
> <ipython-input-14-eb3f56620d79> in <module>
>       4           'min_child_weight': 0.5}
>       5 
> ----> 6 bst = dask_xgboost.train(client, params, train_df.to_dask_array(), label_df.to_dask_array(), num_boost_round=10)
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/dask_xgboost/core.py in train(client, params, data, labels, dmatrix_kwargs, **kwargs)
>     240     """
>     241     return client.sync(
> --> 242         _train, client, params, data, labels, dmatrix_kwargs, **kwargs
>     243     )
>     244 
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
>     765         else:
>     766             return sync(
> --> 767                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
>     768             )
>     769 
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
>     332     if error[0]:
>     333         typ, exc, tb = error[0]
> --> 334         raise exc.with_traceback(tb)
>     335     else:
>     336         return result[0]
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/utils.py in f()
>     316             if callback_timeout is not None:
>     317                 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
> --> 318             result[0] = yield future
>     319         except Exception as exc:
>     320             error[0] = sys.exc_info()
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/tornado/gen.py in run(self)
>    1131 
>    1132                     try:
> -> 1133                         value = future.result()
>    1134                     except Exception:
>    1135                         self.had_exception = True
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/tornado/gen.py in run(self)
>    1139                     if exc_info is not None:
>    1140                         try:
> -> 1141                             yielded = self.gen.throw(*exc_info)
>    1142                         finally:
>    1143                             # Break up a reference to itself
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/dask_xgboost/core.py in _train(client, params, data, labels, dmatrix_kwargs, **kwargs)
>     169     for part in parts:
>     170         if part.status == "error":
> --> 171             yield part  # trigger error locally
>     172 
>     173     # Because XGBoost-python doesn't yet allow iterative training, we need to
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/tornado/gen.py in run(self)
>    1131 
>    1132                     try:
> -> 1133                         value = future.result()
>    1134                     except Exception:
>    1135                         self.had_exception = True
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/asyncio/tasks.py in _wrap_awaitable(awaitable)
>     601     that will later be wrapped in a Task by ensure_future().
>     602     """
> --> 603     return (yield from awaitable.__await__())
>     604 
>     605 
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/client.py in __await__(self)
>     410 
>     411     def __await__(self):
> --> 412         return self.result().__await__()
>     413 
>     414 
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/client.py in result(self, timeout)
>     219         result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
>     220         if self.status == "error":
> --> 221             typ, exc, tb = result
>     222             raise exc.with_traceback(tb)
>     223         elif self.status == "cancelled":
> 
> TypeError: cannot unpack non-iterable coroutine object
> 

Dask Bag examples

We currently lack dask bag examples in this repository. Two come to mind:

  1. Read JSON data, and do some groupby aggregation with both Bag.groupby and Bag.foldby
  2. Read text data and do some basic wordcount

For the JSON data it might make sense to add a dataset generation tool for nested records data, similar to dask.datasets.timeseries, and then use that to generate JSON data to disk, similar to how we generate CSV data in http://examples.dask.org/dataframes/01-data-access.html#Create-artificial-dataset.

We would then read the JSON data, and do some minimal processing.

For the text data I wonder if there is an online dataset we can download. I suspect that the complete works of Shakespeare are around somewhere. We might do a simple thing like read, split, frequencies. Or we might do more complex work afterwards by bringing in NLTK, stemming words, removing stopwords, etc.
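A minimal word-count sketch along the lines described above, assuming a local text file (the Shakespeare download and any NLTK post-processing are omitted):

import dask.bag as db

lines = db.read_text("shakespeare.txt", blocksize="1MiB")
words = lines.str.strip().str.lower().str.split().flatten()
top = words.frequencies().topk(20, key=lambda pair: pair[1])
print(top.compute())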

Placement of interactive dashboard in JupyterLab

Not sure if this strictly counts as a bug, but here goes.

When running the example notebook for delayed online (after clicking "launch binder"), I was expecting the interactive dashboard to show up in the two panes on the right side of the JupyterLab window. This turned out not to be the case, as the two panes remained blank throughout.

To get the dashboard to appear, I had to click the link that shows up after a Client object is instantiated.

Is this intended behavior? Is there a way to peg the dashboard to the two "pre-allocated" panes within the same JupyterLab window?

Screenshots: [two JupyterLab screenshots, not reproduced here]

Migrate CI to GitHub Actions

Due to changes in the Travis CI billing, the Dask org is migrating CI to GitHub Actions.

This repo contains a .travis.yml file which needs to be replaced with an equivalent .github/workflows/ci.yml file.

See dask/community#107 for more details.

Include outputs in notebooks

In #14 (comment) @stsievert says

I see most of the other notebooks in this repo are without their outputs in the notebook. Letting the outputs be included in the notebook would lower the barrier to viewing these notebooks, and would still allow users to try binder if they want. Why don't we include the outputs in the notebooks?

This seemed to be worth discussion, so I've raised a separate issue for it here.

JupyterLab workspace no longer active

It seems that our pre-defined workspace is no longer showing up on dask-examples. When I run jupyter lab workspaces export I do find that it's using the same workspace that we give it, but it's not showing up correctly.

@ian-r-rose do you have any thoughts on why this might be? Has JupyterLab changed how it interprets workspaces recently?

Continuous Integration

This repository could use continuous integration to ensure that the examples continue to work over time

I'm not familiar with tools to test notebooks, but other projects do this, so it must be doable.
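For reference, one common approach is to execute each notebook with nbconvert's ExecutePreprocessor; a minimal sketch (the notebook path is illustrative):

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("array.ipynb", as_version=4)
ExecutePreprocessor(timeout=600, kernel_name="python3").preprocess(
    nb, {"metadata": {"path": "."}}
)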

Workspace not showing up

When we first load the page, the dashboard address isn't populated and the task stream and progress plots aren't arranged on the screen.

My first guess was that this was due to a change in the workspace file spec, so I decided to generate a new one. However, I ran into an error that I don't fully understand.

jovyan@jupyter-dask-2ddask-2dexamples-2d4c9mozvq:~$ jupyter lab workspaces export 
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/bin/jupyter-lab", line 10, in <module>
    sys.exit(main())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_core/application.py", line 266, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/notebookapp.py", line 1758, in start
    super(NotebookApp, self).start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_core/application.py", line 255, in start
    self.subapp.start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterlab/labapp.py", line 276, in start
    super().start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_core/application.py", line 255, in start
    self.subapp.start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterlab/labapp.py", line 136, in start
    page_url = config.page_url
AttributeError: 'LabConfig' object has no attribute 'page_url'

cc @ian-r-rose

CI Failures

It looks like CI is now failing

https://travis-ci.org/github/dask/dask-examples/builds/716774531

Some highlights

Prophet

install command: /home/travis/miniconda/envs/test/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-emefs74w
cwd: /tmp/pip-install-n5i_pkvw/fbprophet/python
Complete output (44 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib
creating build/lib/fbprophet
creating build/lib/fbprophet/stan_model
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py", line 122, in <module> setup(
  File "/home/travis/miniconda/envs/test/lib/python3.8/site-packages/setuptools/__init__.py", line 163, in setup return distutils.core.setup(**attrs)
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/core.py", line 148, in setup dist.run_commands()
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/dist.py", line 966, in run_commands self.run_command(cmd)
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/dist.py", line 985, in run_command cmd_obj.run()
  File "/home/travis/miniconda/envs/test/lib/python3.8/site-packages/wheel/bdist_wheel.py", line 223, in run self.run_command('build')
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/cmd.py", line 313, in run_command self.distribution.run_command(command)
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/dist.py", line 985, in run_command cmd_obj.run()
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/command/build.py", line 135, in run self.run_command(cmd_name)
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/cmd.py", line 313, in run_command self.distribution.run_command(command)
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/dist.py", line 985, in run_command cmd_obj.run()
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py", line 48, in run build_models(target_dir)
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py", line 36, in build_models from fbprophet.models import StanBackendEnum
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/fbprophet/__init__.py", line 8, in <module> from fbprophet.forecaster import Prophet
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/fbprophet/forecaster.py", line 17, in <module> from fbprophet.make_holidays import get_holiday_names, make_holidays_df
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/fbprophet/make_holidays.py", line 14, in <module> import fbprophet.hdays as hdays_part2
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/fbprophet/hdays.py", line 13, in <module> from convertdate.islamic import from_gregorian, to_gregorian
ModuleNotFoundError: No module named 'convertdate'
----------------------------------------
ERROR: Failed building wheel for fbprophet
Running setup.py clean for fbprophet
ERROR: Command errored out with exit status 1:
command: /home/travis/miniconda/envs/test/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' clean --all
cwd: /tmp/pip-install-n5i_pkvw/fbprophet
Complete output (5 lines):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py", line 119, in <module> with open('requirements.txt', 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
----------------------------------------
ERROR: Failed cleaning build dir for fbprophet
Building wheel for pystan (setup.py): started
Building wheel for pystan (setup.py): finished with status 'error'
ERROR: Command errored out with exit status 1:
command: /home/travis/miniconda/envs/test/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n5i_pkvw/pystan/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n5i_pkvw/pystan/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-149xclu8
cwd: /tmp/pip-install-n5i_pkvw/pystan/
Complete output (1 lines): Cython>=0.22 and NumPy are required.
Version check issue
Function:  execute_task
args:      ((<function fit at 0x7f2cd61d4280>, DecisionTreeClassifier(max_depth=4, min_samples_leaf=4, min_samples_split=9), (<function cv_extract at 0x7f2cd61d5e50>, <dask_ml.model_selection.methods.CVCache object at 0x7f2cd61cfa60>, array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  1., 10., ...,  2.,  0.,  0.],
       ...,
       [ 0.,  6., 16., ..., 11.,  1.,  0.],
       [ 0.,  0., 10., ...,  8.,  6.,  0.],
       [ 0.,  0.,  7., ...,  0.,  0.,  0.]]), array([4, 4, 5, 2, 1, 5, 6, 7, 7, 7, 3, 6, 3, 2, 9, 5, 2, 8, 2, 7, 5, 7,
       5, 5, 4, 8, 5, 6, 4, 2, 0, 7, 3, 5, 5, 4, 7, 4, 8, 9, 3, 1, 0, 5,
       1, 9, 6, 9, 1, 0, 5, 5, 8, 3, 8, 8, 9, 1, 2, 5, 8, 9, 6, 1, 7, 9,
       7, 8, 9, 8, 0, 4, 5, 3, 0, 1, 3, 7, 7, 1, 1, 8, 3, 2, 8, 9, 3, 2,
       7]), True, True, 0), (<function cv_extract at 0x7f2cd61d5e50>, <dask_ml.model_selection.methods.CVCache object at 0x7f2cd61cfa60>, array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., .
kwargs:    {}
Exception: TypeError("'<' not supported between instances of 'Version' and 'tuple'")
TPot
nbconvert.preprocessors.execute.CellExecutionError: An error occurred while executing the following cell:
------------------
tp.fit(X_train, y_train)
------------------
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    713                 warnings.simplefilter('ignore')
--> 714                 self._pop, _ = eaMuPlusLambda(
    715                     population=self._pop,
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/gp_deap.py in eaMuPlusLambda(population, toolbox, mu, lambda_, cxpb, mutpb, ngen, pbar, stats, halloffame, verbose, per_generation_function)
    225 
--> 226     population[:] = toolbox.evaluate(population)
    227 
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in _evaluate_individuals(self, population, features, target, sample_weight, groups)
   1333                             warnings.simplefilter('ignore')
-> 1334                             tmp_result_scores = list(dask.compute(*tmp_result_scores))
   1335 
~/miniconda/envs/test/lib/python3.8/site-packages/dask/base.py in compute(*args, **kwargs)
    443 
--> 444     results = schedule(dsk, keys, **kwargs)
    445     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2687             try:
-> 2688                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2689             finally:
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1981                 local_worker = None
-> 1982             return self.sync(
   1983                 self._gather,
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    831         else:
--> 832             return sync(
    833                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    338         typ, exc, tb = error[0]
--> 339         raise exc.with_traceback(tb)
    340     else:
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/utils.py in f()
    322                 future = asyncio.wait_for(future, callback_timeout)
--> 323             result[0] = yield future
    324         except Exception as exc:
~/miniconda/envs/test/lib/python3.8/site-packages/tornado/gen.py in run(self)
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1846                         else:
-> 1847                             raise exception.with_traceback(traceback)
   1848                         raise exc
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/methods.py in cv_extract()
    165 def cv_extract(cvs, X, y, is_X, is_train, n):
--> 166     return cvs.extract(X, y, n, is_X, is_train)
    167 
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/methods.py in extract()
    110                 return self._extract_pairwise(X, y, n, is_train=is_train)
--> 111             return self._extract(X, y, n, is_x=True, is_train=is_train)
    112         if y is None:
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/methods.py in _extract()
    130         inds = self.splits[n][0] if is_train else self.splits[n][1]
--> 131         result = _safe_indexing(X if is_x else y, inds)
    132 
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/utils.py in _safe_indexing()
    219     elif hasattr(X, "shape"):
--> 220         return _array_indexing(X, indices, indices_dtype, axis=axis)
    221     else:
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/utils.py in _array_indexing()
    298     """Index an array or scipy.sparse consistently across NumPy version."""
--> 299     if np_version < (1, 12) or sp.issparse(array):
    300         # FIXME: Remove the check for NumPy when using >= 1.12
TypeError: '<' not supported between instances of 'Version' and 'tuple'
During handling of the above exception, another exception occurred:
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-c5bcc440217f> in <module>
----> 1 tp.fit(X_train, y_train)
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    754                     # raise the exception if it's our last attempt
    755                     if attempt == (attempts - 1):
--> 756                         raise e
    757             return self
    758 
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    745                         self._pbar.close()
    746 
--> 747                     self._update_top_pipeline()
    748                     self._summary_of_best_pipeline(features, target)
    749                     # Delete the temporary cache before exiting
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in _update_top_pipeline(self)
    827             # If user passes CTRL+C in initial generation, self._pareto_front (halloffame) shoule be not updated yet.
    828             # need raise RuntimeError because no pipeline has been optimized
--> 829             raise RuntimeError('A pipeline has not yet been optimized. Please call fit() first.')
    830 
    831     def _summary_of_best_pipeline(self, features, target):
RuntimeError: A pipeline has not yet been optimized. Please call fit() first.
RuntimeError: A pipeline has not yet been optimized. Please call fit() first.

Docs aren't updating

I think that our docs don't seem to be updating after we push. We've added a couple notebooks in recent months that aren't showing up.

cc @TomAugspurger and @jcrist who might know a bit more about the doc build system here (I think doctr)

Dataframe notebook + video series

I'm inclined to create a series of notebooks around dask dataframe that also include short 1-5 minute screencasts of the notebook in the top cell. I'll propose the following structure based on experience with common stack overflow questions. Feedback is welcome.

  1. Load and store Dask Dataframes
    • Dump artificial CSV data to disk using dd.demo.make_timeseries().to_csv()
    • Use read_csv, customize by adding datetime dtype
    • Dump to parquet, customize by adding dtypes
    • Read from parquet, customize by using the columns keyword for faster speed
    • Do some trivial computation on that column to show the speed difference
  2. Groupby
    • Create artificial dataset with dd.demo.make_timeseries()
    • Use groupby aggregations, stress that it's the same as pandas, show the use of compute, and dask.compute
    • Use groupby apply (steal the scikit-learn example from the current main dataframe example) and show how it's much slower
  3. Set index
    • Point out the index and divisions
    • Show that operations like loc are fast while filtering on other columns is slow
    • Use set_index to change the index, show that it is expensive
    • Use persist to keep that computation in memory
    • Use loc on the newly created index column (probably name), and show that it is now fast
    • Use groupby-apply (link back to old notebook) and show that it too is now fast

At some point after we do a couple series like this we might also want to move to JupyterLab and provide a welcome notebook that shows people how to drag the video off to the side.
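A condensed, hedged sketch of the load/store and set_index pieces of the outline above (file names are illustrative, and the parquet steps assume pyarrow or fastparquet is installed):

import dask.dataframe as dd

# 1. Dump artificial CSV data to disk and read it back with a datetime dtype
df = dd.demo.make_timeseries(
    "2000-01-01", "2000-12-31",
    {"id": int, "name": str, "x": float, "y": float},
    freq="1H", partition_freq="1M", seed=1,
)
df.to_csv("ts-*.csv")
df = dd.read_csv("ts-*.csv", parse_dates=["timestamp"])

# Store to parquet and read back only the columns we need
df.to_parquet("ts.parquet")
df = dd.read_parquet("ts.parquet", columns=["name", "x", "y"])

# 2. Groupby aggregations look like pandas but are lazy until compute()
print(df.groupby("name").x.mean().compute())

# 3. Set an index (expensive), persist it in memory, then loc is fast
df = df.set_index("name").persist()
print(df.loc["Alice"].head())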

Make the live notebook element more prominent

When I've shown people examples.dask.org they often don't realize that they can click on the "Launch Binder" button and get a live session. This is despite our header at the top which says:

You can run this notebook in a live session or view it on Github

I think that we might make this more prominent by

  1. Using a button, similar to the "Launch Binder" button, but more obvious to people who are unfamiliar with Binder
  2. Making that button very large?
  3. Making that button stay on the screen, even after the user scrolls down?
  4. ...?

If only we knew someone with some basic web design skills ...

cc @jrbourbeau , in case you or anyone around you has ideas ;)

Update prophet notebook

The current prophet notebook talks about installing from master. My guess is that this is no longer necessary.

It would be useful if someone can check on this, and update the notebook if possible.

notebook example for transferring code from pure pandas to dask

Hi,
I'm writing a notebook example to highlight some key differences between pandas and dask. Are you interested in such a PR?
If so, I currently have the following topics (are there any additional topics that I should include?):

  1. Dask does not update in place, so there is no "inplace=True" (e.g. rename, reset_index, dropna); see the sketch after this list
  2. reading/saving dataframes (with *)
  3. some gotchas with the index
  4. dd.Aggregation vs groupby.apply
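A tiny illustration of topic 1, with illustrative column names:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4, 5, 6]})
df = dd.from_pandas(pdf, npartitions=1)

# pandas allows in-place mutation:
pdf.dropna(inplace=True)

# dask has no inplace=True, so reassign the result instead:
df = df.dropna()
print(df.compute())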
