
dask / dask-examples

361 stars, 26 watchers, 223 forks, 622.78 MB

Easy-to-run example notebooks for Dask

Home Page: https://examples.dask.org/

License: Creative Commons Attribution Share Alike 4.0 International

Languages: Jupyter Notebook 97.07%, Python 2.66%, Shell 0.13%, Makefile 0.14%
Topics: hacktoberfest

dask-examples's Introduction

Dask Example Notebooks

This repository includes easy-to-run example notebooks for Dask. They are intended to be educational and give users a start on common workflows.

They should be easy to run locally if you download this repository. They are also available on the cloud by clicking on the link below:

Binder Build Status

Contributing

This repository is a great opportunity to start contributing to Dask. Please note that examples submitted to this repository should follow these guidelines:

  1. Run top-to-bottom without intervention from the user

  2. Not require external data sources that may disappear over time (external data sources that are highly unlikely to disappear are fine)

  3. Not be resource intensive; they should run within 2GB of memory

  4. Be clear and contain enough prose to explain the topic at hand

  5. Be concise and limited to one or two topics, such that a reader can get through the example within a few minutes of reading

  6. Be of general relevance to Dask users, and so not too specific to a particular problem or use case

    As an example, "how to do dataframe joins" is a great topic, while "how to do dataframe joins in the particular case when one column is a categorical and the other is object dtype" is probably too specific

  7. If the example requires a library not included in binder/environment.yml, then it should be pip installed in the first cell of the notebook, with a brief explanation of what functionality the library adds. A brief example follows:

    ### Install Extra Dependencies
    
    We first install the library X for interacting with Y
    !pip install X

Updating the Binder environment

  1. Modify binder/environment-base.yml with new or updated dependencies

  2. Run a linux/amd64 Docker container with mamba available. For example:

    docker run --platform=linux/amd64 -it --rm --mount type=bind,source=$(pwd)/binder,target=/binder condaforge/mambaforge /bin/bash

    This mounts the ./binder folder in /binder in the Docker container

  3. Create the environment

    mamba env create -f environment-base.yml

    This may take quite a while.

  4. Export the environment specification:

    mamba env export -n dask-examples --no-builds -f environment.yml

dask-examples's People

Contributors

albertdefusco, alex-rakowski, charlesbluca, dvatterott, edesz, genevievebuckley, guillaumeeb, habi, ian-r-rose, jacobtomlinson, jrbourbeau, jsignell, martindurant, mrocklin, quasiben, raybellwaves, richardscottoz, rima-ag, rmsare, rochaporto, sasha-kap, saulshanabrook, scharlottej13, sephib, smartlixx, stsievert, tadejong, tomaugspurger, volkerh, willirath


dask-examples's Issues

Static Site Feedback

http://dask.org/dask-examples/

Stencil example overlap?

In the Stencil example, the parallel code doesn't do anything to handle boundaries or overlap. I assume this means that at the block edges there could be some discrepancies? Should this be mentioned, or is the effect small enough not to be significant?

Lab extension no longer active

Our binder environment no longer seems to have the dask labextension active. Perhaps this was caused by an upstream change in the jupyter environment?

@ian-r-rose any thoughts on what might have happened here?

Looking at some build logs I'm getting this

Step 48/53 : RUN ./binder/postBuild
 ---> Running in 31a382825faa
> /srv/conda/envs/notebook/bin/npm pack dask-labextension
dask-labextension-0.3.0.tgz
Incompatible extension:

"[email protected]" is not compatible with the current JupyterLab
Conflicting Dependencies:
JupyterLab              Extension        Package
>=0.18.6 <0.19.0        >=0.19.1 <0.20.0 @jupyterlab/application
>=0.18.4 <0.19.0        >=0.19.1 <0.20.0 @jupyterlab/apputils
>=0.18.4 <0.19.0        >=0.19.1 <0.20.0 @jupyterlab/console
>=0.18.5 <0.19.0        >=0.19.1 <0.20.0 @jupyterlab/notebook

Found compatible version: 0.1.2
> /srv/conda/envs/notebook/bin/npm pack dask-labextension@0.1.2
dask-labextension-0.1.2.tgz
> node /srv/conda/envs/notebook/lib/python3.6/site-packages/jupyterlab/staging/yarn.js install
yarn install v1.9.4

Organization

Thinking optimistically, if we get many examples in this repository, how should we organize them? I recommend that we have a small number of notebooks at the top level, showing an introduction for each topic like arrays, dataframes, delayed, ML, ..., and also a directory for topics that have many additional notebooks.

Add example for dealing with satellite imagery

This notebook demonstrates using XArray and Dask with a large satellite image. It downloads the image from S3 with rasterio and then loads it in chunks.

It doesn't do a whole lot with it afterwards though, which presumably we should change. This, along with ensuring that the computation we choose also fits nicely in RAM (might have to play around with chunk sizes a bit) is presumably the hard part of this exercise.

Additionally, there is an open question of whether or not we should include rasterio in the docker image (I suspect that it brings in GDAL, which would greatly increase the image size). Instead we might include a !pip install xarray rasterio line or something similar at the top of the notebook.
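Below is a rough sketch of the chunked-load step described above, using the older xr.open_rasterio API (newer code would go through rioxarray). The file name and chunk sizes are illustrative, and it assumes xarray and rasterio have been pip installed as suggested:

import xarray as xr

# lazily open a (hypothetical) GeoTIFF as a chunked Dask-backed DataArray
scene = xr.open_rasterio("scene.tif", chunks={"band": 1, "x": 2048, "y": 2048})

# a trivial follow-up computation that fits comfortably in memory
band_means = scene.mean(dim=["x", "y"]).compute()
print(band_means.values)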

Regularly check for broken links

Inspired by #151, which I encountered manually, I just ran http://examples.dask.org through a dead link checker and found a handful of broken links. I don't have time to fix them all right now but I just thought I'd drop the results here.

It might be a good idea to include a dead link check as part of the website deploy, but that may also be overkill!

Status URL Source link text
-1 Invalid URL http://127.0.0.1:8787/status http://127.0.0.1:8787/status
404 Not Found https://docs.dask.org/en/latest/bag-overview.html Dask Bag Documentation
404 Not Found https://www.continuum.io/sites/default/files/dask_stacked.png
-1 Invalid URL http://10.20.0.141:8787/status http://10.20.0.141:8787/status
404 Not Found https://ml.dask.org/examples/xgboost.html http://ml.dask.org/examples/xgboost.html
404 Not Found https://xgboost.readthedocs.io/en/latest/python/python_intro https://xgboost.readthedocs.io/en/latest/python/python_intro
404 Not Found https://distributed.readthedocs.io/en/latest/local-cluster.html local cluster
404 Not Found https://docs.scipy.org/doc/numpy-1.16.0/reference/c-api https://docs.scipy.org/doc/numpy-1.16.0/reference/c-api
404 Not Found https://scikit-learn.org/stable/modules/scaling_strategies.html user guide [301 from http://scikit-learn.org/stable/modules/scaling_strategies.html]
404 Not Found https://numpy.org/doc/stable/reference/c-api.generalized-ufuncs.html Generalized Universal Functions [302 from https://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html]
404 Not Found https://examples.dask.org/proxy/8787/status dashboard's status page
404 Not Found https://examples.dask.org/proxy/8787/graph dashboard's graph page
404 Not Found https://examples.dask.org/applications/' Cleaning up temporary directories and files
404 Not Found https://examples.dask.org/applications/clip.gif img/src
404 Not Found https://examples.dask.org/surveys/examples.dask.org dask examples
-1 Timeout http://www.celeryproject.org/ Celery
404 Not Found https://distributed.readthedocs.io/en/latest/setup.html scale out to a cluster
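A dead link check along these lines could be as simple as the sketch below (the URL list is illustrative; a real check would first scrape the rendered pages for links):

import requests

urls = ["https://examples.dask.org/", "https://docs.dask.org/en/latest/"]
for url in urls:
    try:
        status = requests.head(url, allow_redirects=True, timeout=10).status_code
    except requests.RequestException as exc:
        status = f"error: {exc}"
    print(status, url)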

Add example for embarrassingly parallel parametrized computation

Would it be of interest to people here to have an example showing how to do embarrassingly parallel computation on some simulation code with Dask?

The idea is that you have some code that does heavy computation: it takes some parameters and returns one or more values as a result. You want to evaluate the program on a grid, with a predefined list of inputs, or maybe with randomly picked inputs (Monte Carlo simulation). The code runs in a few seconds or minutes for each input, and you've got to run it with several thousand or more different inputs. This is often seen in HPC with physical simulations, and it is not very efficient to do it only with job arrays or the like in PBS or other schedulers.

Basic steps would be:

  1. Generate or read a list of parameters
  2. Use client.map or delayed to iterate over these parameters and run the computation
  3. Gather all results and save them to a tabular file.
  4. Finally, do some analysis on the results, such as means or other aggregations (a rough sketch follows below).
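A rough sketch of those steps, assuming a hypothetical simulate function standing in for the real simulation code:

import pandas as pd
from dask.distributed import Client

def simulate(x, y):
    # stand-in for an expensive simulation that returns one or more values
    return {"x": x, "y": y, "result": x * y}

client = Client()  # or connect to an existing cluster

# 1. Generate or read a list of parameters
params = [(x, y) for x in range(100) for y in range(100)]

# 2. Map the simulation over the parameters
futures = client.map(lambda p: simulate(*p), params)

# 3. Gather all results and save them to a tabular file
df = pd.DataFrame(client.gather(futures))
df.to_csv("results.csv", index=False)

# 4. Some simple analysis on the results
print(df["result"].mean())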

I'm currently working on something like that for showing it to our scientists.

Any thought on that?

Move default branch from "master" -> "main"

@jrbourbeau and I are in the process of moving the default branch for this repo from master to main.

  • Changed in GitHub
  • Merged PR to change branch name in code (xref #183)

What you'll see

Once the name on github is changed (the first box above is Xed, or this issue closed), when you try to git pull you'll get

Your configuration specifies to merge with the ref 'refs/heads/master'
from the remote, but no such ref was fetched.

What you need to do

First: head to your fork and rename the default branch there
Then:

git branch -m master main
git fetch origin
git branch -u origin/main main

HyperbandSearchCv gives ImportError

I was following through the example notebook: https://github.com/dask/dask-examples/blob/master/machine-learning/hyperparam-opt.ipynb

In the cell:

from dask_ml.model_selection import HyperbandSearchCV

It gives following error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-10-cc8522d594f9> in <module>()
----> 1 from dask_ml.model_selection import HyperbandSearchCV

~/miniconda3/envs/dataSc/lib/python3.7/site-packages/dask_ml/model_selection/__init__.py in <module>()
      5 """
      6 from ._search import GridSearchCV, RandomizedSearchCV, compute_n_splits, check_cv
----> 7 from ._split import ShuffleSplit, KFold, train_test_split
      8 
      9 

~/miniconda3/envs/dataSc/lib/python3.7/site-packages/dask_ml/model_selection/_split.py in <module>()
     10 import numpy as np
     11 import sklearn.model_selection as ms
---> 12 from sklearn.model_selection._split import (
     13     BaseCrossValidator,
     14     _validate_shuffle_split,

ImportError: cannot import name '_validate_shuffle_split_init' from 'sklearn.model_selection._split' (/Users/poudel/miniconda3/envs/dataSc/lib/python3.7/site-packages/sklearn/model_selection/_split.py)

My imports:

[('numpy', '1.16.4'), ('pandas', '0.25.0'), ('dask', '2.5.0'), ('sklearn', '0.21.2')]

How to fix the error?

Dask-ML Scikit-Learn ImportError

It seems that the recent sklearn update has broken dask-ml, which breaks our CI.

This is odd though, since we seem to pin Scikit-Learn in our environment.yml

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-001abd10b0d1> in <module>
      1 import dask_ml.datasets
----> 2 import dask_ml.cluster
      3 import matplotlib.pyplot as plt
~/miniconda/envs/test/lib/python3.7/site-packages/dask_ml/cluster/__init__.py in <module>
      1 """Unsupervised Clustering Algorithms"""
      2 
----> 3 from .k_means import KMeans  # noqa
      4 from .minibatch import PartialMiniBatchKMeans  # noqa
      5 from .spectral import SpectralClustering  # noqa
~/miniconda/envs/test/lib/python3.7/site-packages/dask_ml/cluster/k_means.py in <module>
     21 )
     22 from ..utils import _timed, _timer, check_array, row_norms
---> 23 from ._compat import _k_init
     24 
     25 logger = logging.getLogger(__name__)
~/miniconda/envs/test/lib/python3.7/site-packages/dask_ml/cluster/_compat.py in <module>
      2 
      3 if SK_022:
----> 4     from sklearn.cluster._k_means import _k_init
      5 else:
      6     from sklearn.cluster.k_means_ import _k_init
ModuleNotFoundError: No module named 'sklearn.cluster._k_means'

Use fsspec release

We switched to installing fsspec from master in #95. We should revert back to installing a release version once there's a new release with the changes in fsspec/filesystem_spec#128. Note: at the time of opening this issue the latest release is fsspec==0.4.4

Jupyterlab broken in binder?

Currently, I get a broken JupyterLab on binder while the standard notebook interface seems to be working.

(Possibly related to #77.)

Maintenence: update mybinder environment.yml dependency versions

I noticed that the mybinder environment.yml file pins dask to version 0.20, but the latest dask release is now up to 1.1.2. It's probably time to update or unpin some of these dependencies. Should we do that?

Currently:
https://github.com/dask/dask-examples/blob/master/binder/environment.yml

channels:
  - conda-forge
dependencies:
  - python=3
  - bokeh=0.13
  - dask=0.20
  - dask-ml=0.10.0
  - distributed=1.24
  - jupyterlab=0.35.1
  - nodejs=8.9
  - numpy
  - pandas
  - pyarrow==0.10.0
  - scikit-learn=0.20
  - matplotlib
  - nbserverproxy
  - nomkl
  - h5py
  - xarray
  - bottleneck
  - py-xgboost
  - pip:
    - graphviz
    - dask_xgboost
    - seaborn
    - mimesis

Switch to JupyterLab and use dask-labextension

I believe that we should switch the default environment to use JupyterLab and that we should encode a specific layout on container startup. In this way the user is presented with the dashboard without having to do anything.

@ian-r-rose is the best way to provide a default layout to create a JSON file and insert it into the .jupyter/lab/workspaces directory, or is there another common route for these things?

Use an already trained Torch model to predict on lots of data

Extending on #35 it would be nice to have an example using Dask and Torch together to parallelize prediction. This should be a simple embarrassingly parallel use case, but I suspect that it would be pragmatic for lots of folks.

The challenge, I think, is constructing a simple example that hopefully doesn't get too much into Torch or a dataset. In my ideal world this would be something like

import torchvision
model = torchvision.get_model("model_name")

dataset = get_canned_dataset()
>>> imshow(dataset[0])  # show an example image

>>> model.predict(dataset[0])
"this is a cat"

... # then dask things here

Does anyone have good pointers to such a simple case?
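For what it's worth, here is one hedged sketch of the "dask things" part, using map_blocks over a random dask array in place of a real canned dataset (the model choice and array shapes are illustrative only):

import dask.array as da
import numpy as np
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)
model.eval()

# a dask array standing in for the images, chunked along the batch dimension
images = da.random.random((1000, 3, 224, 224), chunks=(100, 3, 224, 224))

def predict_chunk(block):
    # run the model on one chunk of images and return predicted class ids
    with torch.no_grad():
        out = model(torch.from_numpy(block).float())
    return out.argmax(dim=1).numpy()

predictions = images.map_blocks(predict_chunk, drop_axis=[1, 2, 3], dtype=np.int64)
print(predictions.compute()[:10])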

cc @stsievert @TomAugspurger @AlbertDeFusco

Occasional failure in HTTP bytes

When running CI in this project I sometimes run across the following error:

~/miniconda/envs/test/lib/python3.7/site-packages/dask/bag/core.py in reify()
   1603 def reify(seq):
   1604     if isinstance(seq, Iterator):
-> 1605         seq = list(seq)
   1606     if seq and isinstance(seq[0], Iterator):
   1607         seq = list(map(list, seq))
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bag/core.py in map_chunk()
   1769                 yield f(**k)
   1770     else:
-> 1771         for a in zip(*args):
   1772             yield f(*a)
   1773 
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bag/text.py in file_to_blocks()
    103 def file_to_blocks(lazy_file):
    104     with lazy_file as f:
--> 105         for line in f:
    106             yield line
    107 
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bytes/http.py in read()
    247             # EOF (python files don't error, just return no data)
    248             return b''
--> 249         self._fetch(self.loc, end)
    250         data = self.cache[self.loc - self.start:end - self.start]
    251         self.loc = end
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bytes/http.py in _fetch()
    258             self.start = start
    259             self.end = end + self.blocksize
--> 260             self.cache = self._fetch_range(start, self.end)
    261         elif start < self.start:
    262             if self.end - end > self.blocksize:
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bytes/http.py in _fetch_range()
    320             if cl <= end - start:
    321                 # data size OK
--> 322                 return r.content
    323             else:
    324                 raise ValueError('Got more bytes (%i) than requested (%i)' % (
~/miniconda/envs/test/lib/python3.7/site-packages/requests/models.py in content()
    826                 self._content = None
    827             else:
--> 828                 self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
    829 
    830         self._content_consumed = True
~/miniconda/envs/test/lib/python3.7/site-packages/requests/models.py in generate()
    751                         yield chunk
    752                 except ProtocolError as e:
--> 753                     raise ChunkedEncodingError(e)
    754                 except DecodeError as e:
    755                     raise ContentDecodingError(e)
ChunkedEncodingError: ('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))
ChunkedEncodingError: ('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))
You can ignore this error by setting the following in conf.py:
    nbsphinx_allow_errors = True
Notebook error:
CellExecutionError in applications/json-data-on-the-web.ipynb:
------------------
df.spec.value_counts().nlargest(20).to_frame().compute()
------------------

@martindurant , this seems to be in your general domain. Do you have any suggestions on what might be happening here?

Go through dataframe notebook with a pedagogical eye

At AnacondaCon @AlbertDeFusco went through the dataframe notebook with me and provided a dozen small corrections to improve the flow of a novice user through the notebook. I diligently copied down the changed notebook but seem to have lost all memory of where I placed it.

@AlbertDeFusco if you have time can I ask you for feedback on this notebook again?

Your attention on any of the notebooks here, or on others, would also be very welcome.

Some feedback from a quick readthrough

array.ipynb

  • The description at the top "Dask Dataframes coordinate many Pandas dataframes," applies to dataframe, not array.

  • If this notebook is intended as one of the main intros, it's sort of annoying that the dashboard link situation is so fussy--I first clicked without running the cell, then ran the cell and got a link that won't work, then could click the right link. Since the link isn't going to be right, it might be less noise not to present the (cool) HTML repr of the client.

  • "This creates a 10000, 10000 arrays of random number" -> "This creates a 10000, 10000 array of random number"

  • "This creates a 10000, 10000 arrays of random numbers, split into a 10x10 grid of 1000x1000 NumPy arrays." I know you probably want to avoid verbosity, but as this is one of the fundamental dask-unique concepts it might be nice to answer some basic questions about how chunks work, e.g. "This creates a 10000x10000 array of random numbers, represented as many numpy arrays of size 1000x1000 (or smaller if the array cannot be divided evenly). In this case there are 100 (10x10) numpy arrays of size 1000x1000."

dataframe.ipynb

  • compute is used but not explained like it was in the array notebook

  • The mix of eager operations and lazy operations is significant, and I don't know if someone would walk away having a real sense for what operations would do what.

  • The warning /srv/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected. Before: .apply(func) After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result or: .apply(func, meta=('x', 'f8')) for series result """Entry point for launching an IPython kernel. in the last cell is a bit hard to work your way through. (There was also a warning in cell 9, but it's a lot more digestible).

delayed.ipynb

  • from random import random within each function isn't a very normal pattern in Python in general, and I got hung up on why you would do it for a moment. I still don't know whether it was meaningful (like, hoping the random state would get initialized differently in each process) or just because the need was internal.

  • zs = dask.persist(*zs) is unexplained and can look a bit scary.

  • "Note the red bars for inter-worker communication. Also note how there is lots of parallelism at the beginning but less towards the end as we reach the top of the tree where there is less work to do." This seems to lack the part where it originally told me to look somewhere before I computed.

Find good XArray example

The XArray docs likely have some good informative examples that we could include in this repository. We might also consider including XArray in the environment.

cc @shoyer @jhamman for insight

XArray might also want to fork and make their own repository, but they'd be welcome here as well
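One candidate along these lines, assuming XArray's tutorial dataset (which needs an internet connection the first time it is fetched):

import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature", chunks={"time": 1000})
monthly = ds.air.groupby("time.month").mean().compute()
print(monthly)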

Can not use "conda install" in binder environment

It seems that, for whatever reason, it is not possible to install a package into the binder environment via conda.
I tried via the terminal and directly from within the Jupyter notebook. I started binder by clicking one of the links from the dask-examples page.

It does not really matter which package (I tried dask-sql, but also e.g. pandas, which is already installed). Both fail with:

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are missing from the target environment:
  - nodejs==14
  - numpy==1.18

The installation seems to work with mamba, which makes me think it is some problem in the way conda resolves the packages - but I have no idea what is going wrong here.

Now comes the "funny" part: If I explicitly install

conda install numpy=1.18 nodejs=14

it actually works, even though conda list already shows them as installed. It will downgrade some packages:

The following packages will be UPDATED:

  certifi                          2020.6.20-py38h32f6830_0 --> 2020.6.20-py38h924ce5b_2

The following packages will be DOWNGRADED:

  jupyterlab                                     2.1.5-py_0 --> 2.1.0-py_1
  pandas                               1.0.5-py38hcb8c335_0 --> 1.0.0-py38hb3f55d8_0
  scikit-learn                        0.23.2-py38h5d63f67_2 --> 0.23.0-py38h3a94b23_0

After that, I can also install other packages...

Use an already trained Keras model to predict on lots of data

A common approach is to train on a bit of data and then use that trained model to predict on lots of data. We could do this using ParallelPostFit in dask-ml, or we can use X.map_blocks or df.map_partitions. In either case we might want to be a bit careful about avoiding repeated serialization costs. For example, in the following case I suspect that we include the serialized model in every task

# maybe bad?
model = load_model()
predictions = X.map_blocks(model.predict)  

It's probably better to encourage the user to keep the model delayed

# probably better
model = dask.delayed(load_model)()
predictions = X.map_blocks(model.predict)  
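A slightly fuller, hedged sketch of the delayed-model pattern, using to_delayed rather than map_blocks (load_model and "model.h5" are hypothetical; any model with a predict method would do):

import dask
import dask.array as da

def load_model():
    # hypothetical loader for a trained Keras model saved to disk
    from tensorflow import keras
    return keras.models.load_model("model.h5")

X = da.random.random((100_000, 10), chunks=(10_000, 10))

# keep the model as a single delayed object so it appears once in the graph
model = dask.delayed(load_model)()

predict = dask.delayed(lambda m, block: m.predict(block))
parts = [predict(model, block) for block in X.to_delayed().ravel()]
predictions = dask.compute(*parts)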

We should also ensure that dask-ml does this correctly, and includes the model as a single task in the graph so that it gets sent around appropriately (cc @TomAugspurger )

I'm also generally curious if a Keras model that lives on the GPU will eventually make its way back onto the GPU when deserializing.

Error from dask xgboost

I was trying a contrived example with the following code... got this long, unexpected error. Any idea how to proceed?

import dask_xgboost
params = {'objective': 'binary:logistic',
          'max_depth': 4, 'eta': 0.01, 'subsample': 0.5,
          'min_child_weight': 0.5}

bst = dask_xgboost.train(client, params, train_df.to_dask_array(), label_df.to_dask_array(), num_boost_round=10)
> TypeError                                 Traceback (most recent call last)
> <ipython-input-14-eb3f56620d79> in <module>
>       4           'min_child_weight': 0.5}
>       5 
> ----> 6 bst = dask_xgboost.train(client, params, train_df.to_dask_array(), label_df.to_dask_array(), num_boost_round=10)
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/dask_xgboost/core.py in train(client, params, data, labels, dmatrix_kwargs, **kwargs)
>     240     """
>     241     return client.sync(
> --> 242         _train, client, params, data, labels, dmatrix_kwargs, **kwargs
>     243     )
>     244 
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
>     765         else:
>     766             return sync(
> --> 767                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
>     768             )
>     769 
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
>     332     if error[0]:
>     333         typ, exc, tb = error[0]
> --> 334         raise exc.with_traceback(tb)
>     335     else:
>     336         return result[0]
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/utils.py in f()
>     316             if callback_timeout is not None:
>     317                 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
> --> 318             result[0] = yield future
>     319         except Exception as exc:
>     320             error[0] = sys.exc_info()
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/tornado/gen.py in run(self)
>    1131 
>    1132                     try:
> -> 1133                         value = future.result()
>    1134                     except Exception:
>    1135                         self.had_exception = True
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/tornado/gen.py in run(self)
>    1139                     if exc_info is not None:
>    1140                         try:
> -> 1141                             yielded = self.gen.throw(*exc_info)
>    1142                         finally:
>    1143                             # Break up a reference to itself
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/dask_xgboost/core.py in _train(client, params, data, labels, dmatrix_kwargs, **kwargs)
>     169     for part in parts:
>     170         if part.status == "error":
> --> 171             yield part  # trigger error locally
>     172 
>     173     # Because XGBoost-python doesn't yet allow iterative training, we need to
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/tornado/gen.py in run(self)
>    1131 
>    1132                     try:
> -> 1133                         value = future.result()
>    1134                     except Exception:
>    1135                         self.had_exception = True
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/asyncio/tasks.py in _wrap_awaitable(awaitable)
>     601     that will later be wrapped in a Task by ensure_future().
>     602     """
> --> 603     return (yield from awaitable.__await__())
>     604 
>     605 
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/client.py in __await__(self)
>     410 
>     411     def __await__(self):
> --> 412         return self.result().__await__()
>     413 
>     414 
> 
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/client.py in result(self, timeout)
>     219         result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
>     220         if self.status == "error":
> --> 221             typ, exc, tb = result
>     222             raise exc.with_traceback(tb)
>     223         elif self.status == "cancelled":
> 
> TypeError: cannot unpack non-iterable coroutine object
> 

Dask Bag examples

We currently lack dask bag examples in this repository. Two come to mind:

  1. Read JSON data, and do some groupby aggregation with both Bag.groupby and Bag.foldby
  2. Read text data and do some basic wordcount

For the JSON data it might make sense to add a dataset generation tool for nested records data, similar to dask.datasets.timeseries, and then use that to generate JSON data to disk, similar to how we generate CSV data in http://examples.dask.org/dataframes/01-data-access.html#Create-artificial-dataset.

We would then read the JSON data, and do some minimal processing.

For the text data I wonder if there is an online dataset we can download. I suspect that the complete works of Shakespeare are around somewhere. We might do a simple thing like read, split, frequencies. Or we might do more complex work afterwards by bringing in NLTK, stemming words, removing stopwords, etc.
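A minimal word-count sketch along the lines described above, assuming a local text file (the Shakespeare download and any NLTK post-processing are omitted):

import dask.bag as db

lines = db.read_text("shakespeare.txt", blocksize="1MiB")
words = lines.str.strip().str.lower().str.split().flatten()
top = words.frequencies().topk(20, key=lambda pair: pair[1])
print(top.compute())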

Placement of interactive dashboard in JupyterLab

Not sure if this strictly counts as a bug, but here goes.

When running the example notebook for delayed online (after clicking "launch binder"), I was expecting the interactive dashboard to show up in the two panes on the right side of the JupyterLab window. This turned out not to be the case, as the two panes remained blank throughout.

To get the dashboard to appear, I had to click the link that shows up after a Client object is instantiated.

Is this intended behavior? Is there a way to peg the dashboard to the two "pre-allocated" panes within the same JupyterLab window?

Screenshots: [two JupyterLab screenshots, not reproduced here]

Migrate CI to GitHub Actions

Due to changes in the Travis CI billing, the Dask org is migrating CI to GitHub Actions.

This repo contains a .travis.yml file which needs to be replaced with an equivalent .github/workflows/ci.yml file.

See dask/community#107 for more details.

Include outputs in notebooks

In #14 (comment) @stsievert says

I see most of the other notebooks in this repo are without their outputs in the notebook. Letting the outputs be included in the notebook would lower the barrier to viewing these notebooks, and would still allow users to try binder if they want. Why don't we include the outputs in the notebooks?

This seemed to be worth discussion, so I've raised a separate issue for it here.

JupyterLab workspace no longer active

It seems that our pre-defined workspace is no longer showing up on dask-examples. When I run jupyter lab workspaces export I do find that it's using the same workspace that we give it, but it's not showing up correctly.

@ian-r-rose do you have any thoughts on why this might be? Has JupyterLab changed how it interprets workspaces recently?

Continuous Integration

This repository could use continuous integration to ensure that the examples continue to work over time

I'm not familiar with tools to test notebooks, but other projects do this, so it must be doable.
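For reference, one common approach is to execute each notebook with nbconvert's ExecutePreprocessor; a minimal sketch (the notebook path is illustrative):

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("array.ipynb", as_version=4)
ExecutePreprocessor(timeout=600, kernel_name="python3").preprocess(
    nb, {"metadata": {"path": "."}}
)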

Workspace not showing up

When we first load the page, the dashboard address isn't populated and the task stream and progress plots aren't arranged on the screen.

My first guess was that this was due to a change in the workspace file spec, so I decided to generate a new one. However, I ran into an error that I don't fully understand.

jovyan@jupyter-dask-2ddask-2dexamples-2d4c9mozvq:~$ jupyter lab workspaces export 
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/bin/jupyter-lab", line 10, in <module>
    sys.exit(main())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_core/application.py", line 266, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/notebookapp.py", line 1758, in start
    super(NotebookApp, self).start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_core/application.py", line 255, in start
    self.subapp.start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterlab/labapp.py", line 276, in start
    super().start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_core/application.py", line 255, in start
    self.subapp.start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterlab/labapp.py", line 136, in start
    page_url = config.page_url
AttributeError: 'LabConfig' object has no attribute 'page_url'

cc @ian-r-rose

CI Failures

It looks like CI is now failing

https://travis-ci.org/github/dask/dask-examples/builds/716774531

Some highlights

Prophet

install command: /home/travis/miniconda/envs/test/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-emefs74w
cwd: /tmp/pip-install-n5i_pkvw/fbprophet/python
Complete output (44 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib
creating build/lib/fbprophet
creating build/lib/fbprophet/stan_model
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py", line 122, in <module> setup(
  File "/home/travis/miniconda/envs/test/lib/python3.8/site-packages/setuptools/__init__.py", line 163, in setup return distutils.core.setup(**attrs)
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/core.py", line 148, in setup dist.run_commands()
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/dist.py", line 966, in run_commands self.run_command(cmd)
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/dist.py", line 985, in run_command cmd_obj.run()
  File "/home/travis/miniconda/envs/test/lib/python3.8/site-packages/wheel/bdist_wheel.py", line 223, in run self.run_command('build')
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/cmd.py", line 313, in run_command self.distribution.run_command(command)
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/dist.py", line 985, in run_command cmd_obj.run()
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/command/build.py", line 135, in run self.run_command(cmd_name)
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/cmd.py", line 313, in run_command self.distribution.run_command(command)
  File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/dist.py", line 985, in run_command cmd_obj.run()
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py", line 48, in run build_models(target_dir)
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py", line 36, in build_models from fbprophet.models import StanBackendEnum
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/fbprophet/__init__.py", line 8, in <module> from fbprophet.forecaster import Prophet
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/fbprophet/forecaster.py", line 17, in <module> from fbprophet.make_holidays import get_holiday_names, make_holidays_df
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/fbprophet/make_holidays.py", line 14, in <module> import fbprophet.hdays as hdays_part2
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/fbprophet/hdays.py", line 13, in <module> from convertdate.islamic import from_gregorian, to_gregorian
ModuleNotFoundError: No module named 'convertdate'
----------------------------------------
ERROR: Failed building wheel for fbprophet
Running setup.py clean for fbprophet
ERROR: Command errored out with exit status 1:
command: /home/travis/miniconda/envs/test/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' clean --all
cwd: /tmp/pip-install-n5i_pkvw/fbprophet
Complete output (5 lines):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py", line 119, in <module> with open('requirements.txt', 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
----------------------------------------
ERROR: Failed cleaning build dir for fbprophet
Building wheel for pystan (setup.py): started
Building wheel for pystan (setup.py): finished with status 'error'
ERROR: Command errored out with exit status 1:
command: /home/travis/miniconda/envs/test/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n5i_pkvw/pystan/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n5i_pkvw/pystan/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-149xclu8
cwd: /tmp/pip-install-n5i_pkvw/pystan/
Complete output (1 lines): Cython>=0.22 and NumPy are required.
Version check issue
Function:  execute_task
args:      ((<function fit at 0x7f2cd61d4280>, DecisionTreeClassifier(max_depth=4, min_samples_leaf=4, min_samples_split=9), (<function cv_extract at 0x7f2cd61d5e50>, <dask_ml.model_selection.methods.CVCache object at 0x7f2cd61cfa60>, array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  1., 10., ...,  2.,  0.,  0.],
       ...,
       [ 0.,  6., 16., ..., 11.,  1.,  0.],
       [ 0.,  0., 10., ...,  8.,  6.,  0.],
       [ 0.,  0.,  7., ...,  0.,  0.,  0.]]), array([4, 4, 5, 2, 1, 5, 6, 7, 7, 7, 3, 6, 3, 2, 9, 5, 2, 8, 2, 7, 5, 7,
       5, 5, 4, 8, 5, 6, 4, 2, 0, 7, 3, 5, 5, 4, 7, 4, 8, 9, 3, 1, 0, 5,
       1, 9, 6, 9, 1, 0, 5, 5, 8, 3, 8, 8, 9, 1, 2, 5, 8, 9, 6, 1, 7, 9,
       7, 8, 9, 8, 0, 4, 5, 3, 0, 1, 3, 7, 7, 1, 1, 8, 3, 2, 8, 9, 3, 2,
       7]), True, True, 0), (<function cv_extract at 0x7f2cd61d5e50>, <dask_ml.model_selection.methods.CVCache object at 0x7f2cd61cfa60>, array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., .
kwargs:    {}
Exception: TypeError("'<' not supported between instances of 'Version' and 'tuple'")
TPot
nbconvert.preprocessors.execute.CellExecutionError: An error occurred while executing the following cell:
------------------
tp.fit(X_train, y_train)
------------------
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    713                 warnings.simplefilter('ignore')
--> 714                 self._pop, _ = eaMuPlusLambda(
    715                     population=self._pop,
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/gp_deap.py in eaMuPlusLambda(population, toolbox, mu, lambda_, cxpb, mutpb, ngen, pbar, stats, halloffame, verbose, per_generation_function)
    225 
--> 226     population[:] = toolbox.evaluate(population)
    227 
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in _evaluate_individuals(self, population, features, target, sample_weight, groups)
   1333                             warnings.simplefilter('ignore')
-> 1334                             tmp_result_scores = list(dask.compute(*tmp_result_scores))
   1335 
~/miniconda/envs/test/lib/python3.8/site-packages/dask/base.py in compute(*args, **kwargs)
    443 
--> 444     results = schedule(dsk, keys, **kwargs)
    445     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2687             try:
-> 2688                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2689             finally:
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1981                 local_worker = None
-> 1982             return self.sync(
   1983                 self._gather,
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    831         else:
--> 832             return sync(
    833                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    338         typ, exc, tb = error[0]
--> 339         raise exc.with_traceback(tb)
    340     else:
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/utils.py in f()
    322                 future = asyncio.wait_for(future, callback_timeout)
--> 323             result[0] = yield future
    324         except Exception as exc:
~/miniconda/envs/test/lib/python3.8/site-packages/tornado/gen.py in run(self)
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1846                         else:
-> 1847                             raise exception.with_traceback(traceback)
   1848                         raise exc
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/methods.py in cv_extract()
    165 def cv_extract(cvs, X, y, is_X, is_train, n):
--> 166     return cvs.extract(X, y, n, is_X, is_train)
    167 
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/methods.py in extract()
    110                 return self._extract_pairwise(X, y, n, is_train=is_train)
--> 111             return self._extract(X, y, n, is_x=True, is_train=is_train)
    112         if y is None:
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/methods.py in _extract()
    130         inds = self.splits[n][0] if is_train else self.splits[n][1]
--> 131         result = _safe_indexing(X if is_x else y, inds)
    132 
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/utils.py in _safe_indexing()
    219     elif hasattr(X, "shape"):
--> 220         return _array_indexing(X, indices, indices_dtype, axis=axis)
    221     else:
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/utils.py in _array_indexing()
    298     """Index an array or scipy.sparse consistently across NumPy version."""
--> 299     if np_version < (1, 12) or sp.issparse(array):
    300         # FIXME: Remove the check for NumPy when using >= 1.12
TypeError: '<' not supported between instances of 'Version' and 'tuple'
During handling of the above exception, another exception occurred:
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-c5bcc440217f> in <module>
----> 1 tp.fit(X_train, y_train)
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    754                     # raise the exception if it's our last attempt
    755                     if attempt == (attempts - 1):
--> 756                         raise e
    757             return self
    758 
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    745                         self._pbar.close()
    746 
--> 747                     self._update_top_pipeline()
    748                     self._summary_of_best_pipeline(features, target)
    749                     # Delete the temporary cache before exiting
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in _update_top_pipeline(self)
    827             # If user passes CTRL+C in initial generation, self._pareto_front (halloffame) shoule be not updated yet.
    828             # need raise RuntimeError because no pipeline has been optimized
--> 829             raise RuntimeError('A pipeline has not yet been optimized. Please call fit() first.')
    830 
    831     def _summary_of_best_pipeline(self, features, target):
RuntimeError: A pipeline has not yet been optimized. Please call fit() first.
RuntimeError: A pipeline has not yet been optimized. Please call fit() first.

Docs aren't updating

I think that our docs don't seem to be updating after we push. We've added a couple notebooks in recent months that aren't showing up.

cc @TomAugspurger and @jcrist who might know a bit more about the doc build system here (I think doctr)

Dataframe notebook + video series

I'm inclined to create a series of notebooks around dask dataframe that also include short 1-5 minute screencasts of the notebook in the top cell. I'll propose the following structure based on experience with common stack overflow questions. Feedback is welcome.

  1. Load and store Dask Dataframes
    • Dump artificial CSV data to disk using dd.demo.make_timeseries().to_csv()
    • Use read_csv, customize by adding datetime dtype
    • Dump to parquet, customize by adding dtypes
    • Read from parquet, customize by using the columns keyword for faster speed
    • Do some trivial computation on that column to show the speed difference
  2. Groupby
    • Create artificial dataset with dd.demo.make_timeseries()
    • Use groupby aggregations, stress that it's the same as pandas, show the use of compute, and dask.compute
    • Use groupby apply (steal the scikit-learn example from the current main dataframe example) and show how it's much slower
  3. Set index
    • Point out the index and divisions
    • Show that operations like loc are fast while filtering on other columns is slow
    • Use set_index to change the index, show that it is expensive
    • Use persist to keep that computation in memory
    • Use loc on the newly created index column (probably name), and show that it is now fast
    • Use groupby-apply (link back to old notebook) and show that it too is now fast

At some point after we do a couple series like this we might also want to move to JupyterLab and provide a welcome notebook that shows people how to drag the video off to the side.
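A condensed, hedged sketch of the load/store and set_index pieces of the outline above (file names are illustrative, and the parquet steps assume pyarrow or fastparquet is installed):

import dask.dataframe as dd

# 1. Dump artificial CSV data to disk and read it back with a datetime dtype
df = dd.demo.make_timeseries(
    "2000-01-01", "2000-12-31",
    {"id": int, "name": str, "x": float, "y": float},
    freq="1H", partition_freq="1M", seed=1,
)
df.to_csv("ts-*.csv")
df = dd.read_csv("ts-*.csv", parse_dates=["timestamp"])

# Store to parquet and read back only the columns we need
df.to_parquet("ts.parquet")
df = dd.read_parquet("ts.parquet", columns=["name", "x", "y"])

# 2. Groupby aggregations look like pandas but are lazy until compute()
print(df.groupby("name").x.mean().compute())

# 3. Set an index (expensive), persist it in memory, then loc is fast
df = df.set_index("name").persist()
print(df.loc["Alice"].head())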

Make the live notebook element more prominent

When I've shown people examples.dask.org they often don't realize that they can click on the "Launch Binder" button and get a live session. This is despite our header at the top which says:

You can run this notebook in a live session or view it on Github

I think that we might make this more prominent by

  1. Using a button, similar to the "Launch Binder" button, but more obvious to people who are unfamiliar with Binder
  2. Making that button very large?
  3. Making that button stay on the screen, even after the user scrolls down?
  4. ...?

If only we knew someone with some basic web design skills ...

cc @jrbourbeau , in case you or anyone around you has ideas ;)

Update prophet notebook

The current prophet notebook talks about installing from master. My guess is that this is no longer necessary.

It would be useful if someone can check on this, and update the notebook if possible.

notebook example for transferring code from pure pandas to dask

Hi,
I'm writing a notebook example to highlight some key differences between pandas and dask. Are you interested in such a PR?
If so, I currently have the following topics (are there any additional topics that I should include?):

  1. Dask does not update in place, so there is no "inplace=True" (e.g. rename, reset_index, dropna); see the sketch after this list
  2. reading/saving dataframes (with *)
  3. some gotchas with the index
  4. dd.Aggregation vs groupby.apply
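A tiny illustration of topic 1, with illustrative column names:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4, 5, 6]})
df = dd.from_pandas(pdf, npartitions=1)

# pandas allows in-place mutation:
pdf.dropna(inplace=True)

# dask has no inplace=True, so reassign the result instead:
df = df.dropna()
print(df.compute())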
