Dask is a flexible parallel computing library for analytics. See documentation for more information.
New BSD. See License File.
Easy-to-run example notebooks for Dask
Home Page: https://examples.dask.org/
License: Creative Commons Attribution Share Alike 4.0 International
Would this be of interest for people here to have an example with Dask showing how to do embarrassingly parallel computation on some simulation code?
The idea is that you have code that does some heavy computation: it takes some parameters and returns one or more values as a result. You want to evaluate the program on a grid, with a predefined list of inputs, or with randomly sampled inputs (Monte Carlo simulation). The code runs in a few seconds or minutes for each input, and you need to run it with several thousand or more different inputs. This is something that is often seen in HPC with physical simulations, and doing it only with job arrays or similar mechanisms in PBS or other schedulers is not very efficient.
Basic steps would be:
I'm currently working on something like that for showing it to our scientists.
Any thought on that?
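For reference, a minimal sketch of the pattern I have in mind, with a hypothetical simulate function and parameter grid standing in for real simulation code:

```python
import dask
from dask.distributed import Client

client = Client()  # or a dask-jobqueue cluster on an HPC scheduler

def simulate(x, y):
    # stand-in for an expensive simulation that returns one or more values
    return x ** 2 + y

# evaluate the simulation over a predefined parameter grid in parallel
params = [(x, y) for x in range(100) for y in range(10)]
results = dask.compute(*[dask.delayed(simulate)(x, y) for x, y in params])
```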
Any objections to adding this GitHub app to this repository? https://www.reviewnb.com
It requires
Read access to code
Read access to metadata
Read and write access to pull requests
You can see an example at stijnvanhoey/pandas-getting-started-tutorials#2 / https://app.reviewnb.com/stijnvanhoey/pandas-getting-started-tutorials/pull/2/files/.
We currently lack dask bag examples in this repository. Two come to mind:
Bag.groupby
and Bag.foldby
For the JSON data it might make sense to add a dataset generation tool for nested records data, similar to dask.datasets.timeseries
, and then use that to generate JSON data to disk, similar to how we generate CSV data in http://examples.dask.org/dataframes/01-data-access.html#Create-artificial-dataset.
We would then read the JSON data, and do some minimal processing.
For the text data I wonder if there is an online dataset we can download. I suspect that the complete works of shakespeare is around somewhere. We might do a simple thing like read, split, frequencies. Or we might do more complex work afterwards by bringing in NLTK, stemming words, removing stopwords, etc..
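As a rough sketch of the word-frequency idea, using tiny inline strings instead of a downloaded text (the Shakespeare dataset is still an open question):

```python
import dask.bag as db

lines = db.from_sequence(["the quick brown fox", "the lazy dog", "the end"])
words = lines.map(str.split).flatten()

# foldby combines a per-partition reduction with a final combine step,
# which is usually much cheaper than a full groupby shuffle
counts = words.foldby(
    key=lambda w: w,
    binop=lambda total, _: total + 1,
    initial=0,
    combine=lambda a, b: a + b,
    combine_initial=0,
)
counts.compute()  # [('the', 3), ('quick', 1), ...]
```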
@jrbourbeau and I are in the process of moving the default branch for this repo from master to main.
Once the name on GitHub is changed (the first box above is checked, or this issue is closed), when you try to git pull you'll get:
Your configuration specifies to merge with the ref 'refs/heads/master'
from the remote, but no such ref was fetched.
First: head to your fork and rename the default branch there
Then:
git branch -m master main
git fetch origin
git branch -u origin/main main
Due to changes in the Travis CI billing, the Dask org is migrating CI to GitHub Actions.
This repo contains a .travis.yml
file which needs to be replaced with an equivalent .github/workflows/ci.yml
file.
See dask/community#107 for more details.
This repository could use continuous integration to ensure that the examples continue to work over time
I'm not familiar with tools to test notebooks, but other projects do this, so it must be doable.
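One possible approach, sketched below, is to execute every notebook with nbconvert's ExecutePreprocessor and fail on any cell error (nbval is another option):

```python
import glob

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

for path in glob.glob("**/*.ipynb", recursive=True):
    nb = nbformat.read(path, as_version=4)
    # raises CellExecutionError if any cell in the notebook fails
    ExecutePreprocessor(timeout=600, kernel_name="python3").preprocess(
        nb, {"metadata": {"path": "."}}
    )
```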
Once Scikit-Learn 0.20 is released we should create an example for scaling random forests with joblib.
We might consider embedding this video from SciPy 2018 starting at 7:42
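A rough sketch of what such an example might look like, assuming a running distributed client and a scikit-learn version that uses the standalone joblib package (older releases vendored it):

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client()  # importing distributed registers the "dask" joblib backend
X, y = make_classification(n_samples=10_000, n_features=20)

clf = RandomForestClassifier(n_estimators=200)
with joblib.parallel_backend("dask"):
    # each tree is fit as a separate joblib task on the cluster
    clf.fit(X, y)
```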
@djhoese any interest?
There is also https://examples.dask.org/applications/satellite-imagery-geotiff.html , but maybe there is more that we could add there or other examples that we could add.
As a reminder, examples in dask-examples are tested, rendered statically to serve from examples.dask.org, and available for users as runnable binders. It's a nice way to reach users, and a nice way to have a simple test against many other parts of the ecosystem.
Thinking optimistically, if we get many examples in this repository, how should we organize them? I recommend that we have a small number of notebooks at the top level, showing an introduction for each topic like arrays, dataframes, delayed, ML, ..., and also a directory for topics that have many additional notebooks.
Our binder environment no longer seems to have the dask labextension active. Perhaps this was caused by an upstream change in the jupyter environment?
@ian-r-rose any thoughts on what might have happened here?
Looking at some build logs I'm getting this
Step 48/53 : RUN ./binder/postBuild
---> Running in 31a382825faa
> /srv/conda/envs/notebook/bin/npm pack dask-labextension
dask-labextension-0.3.0.tgz
Incompatible extension:
"[email protected]" is not compatible with the current JupyterLab
Conflicting Dependencies:
JupyterLab Extension Package
>=0.18.6 <0.19.0 >=0.19.1 <0.20.0 @jupyterlab/application
>=0.18.4 <0.19.0 >=0.19.1 <0.20.0 @jupyterlab/apputils
>=0.18.4 <0.19.0 >=0.19.1 <0.20.0 @jupyterlab/console
>=0.18.5 <0.19.0 >=0.19.1 <0.20.0 @jupyterlab/notebook
Found compatible version: 0.1.2
> /srv/conda/envs/notebook/bin/npm pack dask-labextension@0.1.2
dask-labextension-0.1.2.tgz
> node /srv/conda/envs/notebook/lib/python3.6/site-packages/jupyterlab/staging/yarn.js install
yarn install v1.9.4
Would be good to have a link to https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate in https://github.com/dask/dask-examples/blob/master/dataframes/02-groupby.ipynb
This is probably more so because googling dask dataframe group by takes me to https://examples.dask.org/dataframes/02-groupby.html and I know there is more info in the docs https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate
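For context, the docs page covers dictionary-style aggregations along the lines of the following sketch (using the artificial timeseries dataset):

```python
import dask

df = dask.datasets.timeseries()
# dictionary-style aggregations, as documented on the dataframe-groupby page
df.groupby("name").agg({"x": "mean", "y": ["min", "max"]}).compute()
```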
When we first load the page, the dashboard address isn't populated and the task stream and progress plots aren't arranged on the screen.
My first guess was that this was due to a change in the workspace file spec, so I decided to generate a new one. However I ran into an error that I don't fully understand.
jovyan@jupyter-dask-2ddask-2dexamples-2d4c9mozvq:~$ jupyter lab workspaces export
Traceback (most recent call last):
File "/srv/conda/envs/notebook/bin/jupyter-lab", line 10, in <module>
sys.exit(main())
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_core/application.py", line 266, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/notebookapp.py", line 1758, in start
super(NotebookApp, self).start()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_core/application.py", line 255, in start
self.subapp.start()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterlab/labapp.py", line 276, in start
super().start()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_core/application.py", line 255, in start
self.subapp.start()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterlab/labapp.py", line 136, in start
page_url = config.page_url
AttributeError: 'LabConfig' object has no attribute 'page_url'
cc @ian-r-rose
I'm inclined to create a series of notebooks around dask dataframe that also include short 1-5 minute screencasts of the notebook in the top cell. I'll propose the following structure based on experience with common stack overflow questions. Feedback is welcome.
dd.demo.make_timeseries().to_csv()
dd.demo.make_timeseries()
At some point after we do a couple series like this we might also want to move to JupyterLab and provide a welcome notebook that shows people how to drag the video off to the side.
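As a starting point, a sketch of generating and writing the artificial dataset for such a series (dd.demo.make_timeseries above; dask.datasets.timeseries is the equivalent in current releases):

```python
import dask

# small artificial time-series dataframe, one partition per day
df = dask.datasets.timeseries()

# write it to disk so the first notebook can demonstrate reading CSVs back in
df.to_csv("data/2000-*.csv")
```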
Plots are no longer showing up. I believe that this may be due to recent security changes in JLab. Any suggestions on how to handle this @ian-r-rose ?
I've also pushed a binder with more up-to-date versions here: https://mybinder.org/v2/gh/mrocklin/dask-examples/update-labextension?urlpath=lab
In preparation for a dask-jobqueue workshop, I'm playing with partially preemptible clusters. Would it make sense to add an example using a mocked-up, partially resilient LocalCluster to show that Dask is able to handle dying resources?
A draft that works on the current examples binder is here:
https://gist.github.com/willirath/c6e5e26204b33bb4415923329a6b8723
Currently, I get a broken JupyterLab on binder while the standard notebook interface seems to be working.
(Possibly related to #77.)
At the bottom of
https://examples.dask.org/machine-learning.html#Training-on-Large-Datasets
For all the estimators implemented in Dask-ML, see the API documentation.
Points to https://dask-ml.readthedocs.io/en/latest/modules/.html
Should be https://ml.dask.org/modules/api.html# ?
The XArray docs likely have some good informative examples that we could include in this repository. We might also consider including XArray in the environment.
cc @shoyer @jhamman for insight
XArray might also want to fork and make their own repository, but they'd be welcome here as well
See https://discourse.jupyter.org/t/gesis-hub-blocks-dask-dashboard/3019.
With
Line 4 in c14b7dc
https://binder.something.example.com/user/${JUPYTERHUB_USER}/proxy/8787
which is not necessarily true.
When running CI in this project I sometimes run across the following error:
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bag/core.py in reify()
1603 def reify(seq):
1604 if isinstance(seq, Iterator):
-> 1605 seq = list(seq)
1606 if seq and isinstance(seq[0], Iterator):
1607 seq = list(map(list, seq))
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bag/core.py in map_chunk()
1769 yield f(**k)
1770 else:
-> 1771 for a in zip(*args):
1772 yield f(*a)
1773
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bag/text.py in file_to_blocks()
103 def file_to_blocks(lazy_file):
104 with lazy_file as f:
--> 105 for line in f:
106 yield line
107
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bytes/http.py in read()
247 # EOF (python files don't error, just return no data)
248 return b''
--> 249 self._fetch(self.loc, end)
250 data = self.cache[self.loc - self.start:end - self.start]
251 self.loc = end
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bytes/http.py in _fetch()
258 self.start = start
259 self.end = end + self.blocksize
--> 260 self.cache = self._fetch_range(start, self.end)
261 elif start < self.start:
262 if self.end - end > self.blocksize:
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bytes/http.py in _fetch_range()
320 if cl <= end - start:
321 # data size OK
--> 322 return r.content
323 else:
324 raise ValueError('Got more bytes (%i) than requested (%i)' % (
~/miniconda/envs/test/lib/python3.7/site-packages/requests/models.py in content()
826 self._content = None
827 else:
--> 828 self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
829
830 self._content_consumed = True
~/miniconda/envs/test/lib/python3.7/site-packages/requests/models.py in generate()
751 yield chunk
752 except ProtocolError as e:
--> 753 raise ChunkedEncodingError(e)
754 except DecodeError as e:
755 raise ContentDecodingError(e)
ChunkedEncodingError: ('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))
You can ignore this error by setting the following in conf.py:
nbsphinx_allow_errors = True
Notebook error:
CellExecutionError in applications/json-data-on-the-web.ipynb:
------------------
df.spec.value_counts().nlargest(20).to_frame().compute()
------------------
@martindurant , this seems to be in your general domain. Do you have any suggestions on what might be happening here?
In #134, CI was failing after 20 minutes with
The job exceeded the maximum log length, and has been terminated.
We avoid that by pinning to 2.3.1. Ideally we would unpin sphinx and figure out what's wrong. Perhaps we start by changing the sphinx-build -M html . _build -vv
to just be -v
CC-BY-SA may make the most sense
This notebook demonstrates using XArray and Dask with a large satellite image. It downloads the image from S3 with rasterio and then loads it in chunks.
It doesn't do a whole lot with it afterwards though, which presumably we should change. This, along with ensuring that the computation we choose also fits nicely in RAM (might have to play around with chunk sizes a bit) is presumably the hard part of this exercise.
Additionally, there is an open question of whether or not we should include rasterio in the docker image (I suspect that it brings in GDAL, which would greatly increase the image size). Instead we might include a !pip install xarray rasterio
line or something similar at the top of the notebook.
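A minimal sketch of the lazy-loading step (the path is a placeholder, and the exact chunk sizes would need tuning so the chosen computation fits in RAM; this uses the xr.open_rasterio API available at the time):

```python
import xarray as xr

path = "large_image.tif"  # placeholder for the GeoTIFF downloaded from S3

# chunks= makes rasterio read the image lazily as a dask array
da = xr.open_rasterio(path, chunks={"band": 1, "x": 2048, "y": 2048})

# a small reduction that streams through the chunks rather than loading everything
da.mean(dim=["x", "y"]).compute()
```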
http://dask.org/dask-examples/
code blocks have the yellow background we've removed in a few places IIRC
nbsphinx_execute = 'always' in our conf.py
:class: refs
# and the rest ##
It seems that for whatever reason, it is not possible to install a package into the binder environment via conda.
I tried via the terminal or directly from within the jupyter notebook. I started binder by clicking one of the links from the dask-example page.
It does not really matter which package (I tried dask-sql, but also e.g. pandas, which is already installed). Both fail with:
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are missing from the target environment:
- nodejs==14
- numpy==1.18
The installation seems to work with mamba, which makes me think it is some problem in the way conda resolves the packages - but I have no idea what is going wrong here.
Now comes the "funny" part: If I explicitly install
conda install numpy=1.18 nodejs=14
it actually works, even though conda list already shows them as installed. It will downgrade some packages:
The following packages will be UPDATED:
certifi 2020.6.20-py38h32f6830_0 --> 2020.6.20-py38h924ce5b_2
The following packages will be DOWNGRADED:
jupyterlab 2.1.5-py_0 --> 2.1.0-py_1
pandas 1.0.5-py38hcb8c335_0 --> 1.0.0-py38hb3f55d8_0
scikit-learn 0.23.2-py38h5d63f67_2 --> 0.23.0-py38h3a94b23_0
After that, I can also install other packages...
Not sure if this strictly counts as a bug, but here goes.
When running the example notebook for delayed
online (after clicking "launch binder"), I was expecting the interactive dashboard to show up in the two panes on the right side of the JupyterLab window. This turned out not to be the case, as the two panes remained blank throughout.
To get the dashboard to appear, I had to click the link that shows up after a Client
object is instantiated.
Is this intended behavior? Is there a way to peg the dashboard to the two "pre-allocated" panes within the same JupyterLab window?
The notebook imports on old version of dask-ml. Fix this by updating binder/environment.yml to use newer version of dask-ml.
dask-ml=1.1.1 -> dask-ml=1.2.0
Inspired by #151, which I encountered manually, I just ran http://examples.dask.org through a dead link checker and found a handful of broken links. I don't have time to fix them all right now but I just thought I'd drop the results here.
It might be a good idea to include a dead link check as part of the website deploy, but that may also be overkill!
It looks like CI is now failing
https://travis-ci.org/github/dask/dask-examples/builds/716774531
Some highlights
command: /home/travis/miniconda/envs/test/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-emefs74w
cwd: /tmp/pip-install-n5i_pkvw/fbprophet/python
Complete output (44 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib
creating build/lib/fbprophet
creating build/lib/fbprophet/stan_model
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py", line 122, in <module>
setup(
File "/home/travis/miniconda/envs/test/lib/python3.8/site-packages/setuptools/__init__.py", line 163, in setup
return distutils.core.setup(**attrs)
File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/core.py", line 148, in setup
dist.run_commands()
File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/travis/miniconda/envs/test/lib/python3.8/site-packages/wheel/bdist_wheel.py", line 223, in run
self.run_command('build')
File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/travis/miniconda/envs/test/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py", line 48, in run
build_models(target_dir)
File "/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py", line 36, in build_models
from fbprophet.models import StanBackendEnum
File "/tmp/pip-install-n5i_pkvw/fbprophet/python/fbprophet/__init__.py", line 8, in <module>
from fbprophet.forecaster import Prophet
File "/tmp/pip-install-n5i_pkvw/fbprophet/python/fbprophet/forecaster.py", line 17, in <module>
from fbprophet.make_holidays import get_holiday_names, make_holidays_df
File "/tmp/pip-install-n5i_pkvw/fbprophet/python/fbprophet/make_holidays.py", line 14, in <module>
import fbprophet.hdays as hdays_part2
File "/tmp/pip-install-n5i_pkvw/fbprophet/python/fbprophet/hdays.py", line 13, in <module>
from convertdate.islamic import from_gregorian, to_gregorian
ModuleNotFoundError: No module named 'convertdate'
----------------------------------------
ERROR: Failed building wheel for fbprophet
Running setup.py clean for fbprophet
ERROR: Command errored out with exit status 1:
command: /home/travis/miniconda/envs/test/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' clean --all
cwd: /tmp/pip-install-n5i_pkvw/fbprophet
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-n5i_pkvw/fbprophet/python/setup.py", line 119, in <module>
with open('requirements.txt', 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
----------------------------------------
ERROR: Failed cleaning build dir for fbprophet
Building wheel for pystan (setup.py): started
Building wheel for pystan (setup.py): finished with status 'error'
ERROR: Command errored out with exit status 1:
command: /home/travis/miniconda/envs/test/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n5i_pkvw/pystan/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n5i_pkvw/pystan/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-149xclu8
cwd: /tmp/pip-install-n5i_pkvw/pystan/
Complete output (1 lines):
Cython>=0.22 and NumPy are required.
Function: execute_task
args: ((<function fit at 0x7f2cd61d4280>, DecisionTreeClassifier(max_depth=4, min_samples_leaf=4, min_samples_split=9), (<function cv_extract at 0x7f2cd61d5e50>, <dask_ml.model_selection.methods.CVCache object at 0x7f2cd61cfa60>, array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 1., 10., ..., 2., 0., 0.],
...,
[ 0., 6., 16., ..., 11., 1., 0.],
[ 0., 0., 10., ..., 8., 6., 0.],
[ 0., 0., 7., ..., 0., 0., 0.]]), array([4, 4, 5, 2, 1, 5, 6, 7, 7, 7, 3, 6, 3, 2, 9, 5, 2, 8, 2, 7, 5, 7,
5, 5, 4, 8, 5, 6, 4, 2, 0, 7, 3, 5, 5, 4, 7, 4, 8, 9, 3, 1, 0, 5,
1, 9, 6, 9, 1, 0, 5, 5, 8, 3, 8, 8, 9, 1, 2, 5, 8, 9, 6, 1, 7, 9,
7, 8, 9, 8, 0, 4, 5, 3, 0, 1, 3, 7, 7, 1, 1, 8, 3, 2, 8, 9, 3, 2,
7]), True, True, 0), (<function cv_extract at 0x7f2cd61d5e50>, <dask_ml.model_selection.methods.CVCache object at 0x7f2cd61cfa60>, array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., .
kwargs: {}
Exception: TypeError("'<' not supported between instances of 'Version' and 'tuple'")
nbconvert.preprocessors.execute.CellExecutionError: An error occurred while executing the following cell:
------------------
tp.fit(X_train, y_train)
------------------
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
713 warnings.simplefilter('ignore')
--> 714 self._pop, _ = eaMuPlusLambda(
715 population=self._pop,
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/gp_deap.py in eaMuPlusLambda(population, toolbox, mu, lambda_, cxpb, mutpb, ngen, pbar, stats, halloffame, verbose, per_generation_function)
225
--> 226 population[:] = toolbox.evaluate(population)
227
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in _evaluate_individuals(self, population, features, target, sample_weight, groups)
1333 warnings.simplefilter('ignore')
-> 1334 tmp_result_scores = list(dask.compute(*tmp_result_scores))
1335
~/miniconda/envs/test/lib/python3.8/site-packages/dask/base.py in compute(*args, **kwargs)
443
--> 444 results = schedule(dsk, keys, **kwargs)
445 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2687 try:
-> 2688 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
2689 finally:
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1981 local_worker = None
-> 1982 return self.sync(
1983 self._gather,
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
831 else:
--> 832 return sync(
833 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
338 typ, exc, tb = error[0]
--> 339 raise exc.with_traceback(tb)
340 else:
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/utils.py in f()
322 future = asyncio.wait_for(future, callback_timeout)
--> 323 result[0] = yield future
324 except Exception as exc:
~/miniconda/envs/test/lib/python3.8/site-packages/tornado/gen.py in run(self)
734 try:
--> 735 value = future.result()
736 except Exception:
~/miniconda/envs/test/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1846 else:
-> 1847 raise exception.with_traceback(traceback)
1848 raise exc
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/methods.py in cv_extract()
165 def cv_extract(cvs, X, y, is_X, is_train, n):
--> 166 return cvs.extract(X, y, n, is_X, is_train)
167
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/methods.py in extract()
110 return self._extract_pairwise(X, y, n, is_train=is_train)
--> 111 return self._extract(X, y, n, is_x=True, is_train=is_train)
112 if y is None:
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/methods.py in _extract()
130 inds = self.splits[n][0] if is_train else self.splits[n][1]
--> 131 result = _safe_indexing(X if is_x else y, inds)
132
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/utils.py in _safe_indexing()
219 elif hasattr(X, "shape"):
--> 220 return _array_indexing(X, indices, indices_dtype, axis=axis)
221 else:
~/miniconda/envs/test/lib/python3.8/site-packages/dask_ml/model_selection/utils.py in _array_indexing()
298 """Index an array or scipy.sparse consistently across NumPy version."""
--> 299 if np_version < (1, 12) or sp.issparse(array):
300 # FIXME: Remove the check for NumPy when using >= 1.12
TypeError: '<' not supported between instances of 'Version' and 'tuple'
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-7-c5bcc440217f> in <module>
----> 1 tp.fit(X_train, y_train)
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
754 # raise the exception if it's our last attempt
755 if attempt == (attempts - 1):
--> 756 raise e
757 return self
758
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
745 self._pbar.close()
746
--> 747 self._update_top_pipeline()
748 self._summary_of_best_pipeline(features, target)
749 # Delete the temporary cache before exiting
~/miniconda/envs/test/lib/python3.8/site-packages/tpot/base.py in _update_top_pipeline(self)
827 # If user passes CTRL+C in initial generation, self._pareto_front (halloffame) shoule be not updated yet.
828 # need raise RuntimeError because no pipeline has been optimized
--> 829 raise RuntimeError('A pipeline has not yet been optimized. Please call fit() first.')
830
831 def _summary_of_best_pipeline(self, features, target):
RuntimeError: A pipeline has not yet been optimized. Please call fit() first.
RuntimeError: A pipeline has not yet been optimized. Please call fit() first.
I was following through the example notebook: https://github.com/dask/dask-examples/blob/master/machine-learning/hyperparam-opt.ipynb
In the cell:
from dask_ml.model_selection import HyperbandSearchCV
It gives following error:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-10-cc8522d594f9> in <module>()
----> 1 from dask_ml.model_selection import HyperbandSearchCV
~/miniconda3/envs/dataSc/lib/python3.7/site-packages/dask_ml/model_selection/__init__.py in <module>()
5 """
6 from ._search import GridSearchCV, RandomizedSearchCV, compute_n_splits, check_cv
----> 7 from ._split import ShuffleSplit, KFold, train_test_split
8
9
~/miniconda3/envs/dataSc/lib/python3.7/site-packages/dask_ml/model_selection/_split.py in <module>()
10 import numpy as np
11 import sklearn.model_selection as ms
---> 12 from sklearn.model_selection._split import (
13 BaseCrossValidator,
14 _validate_shuffle_split,
ImportError: cannot import name '_validate_shuffle_split_init' from 'sklearn.model_selection._split' (/Users/poudel/miniconda3/envs/dataSc/lib/python3.7/site-packages/sklearn/model_selection/_split.py)
My imports:
[('numpy', '1.16.4'), ('pandas', '0.25.0'), ('dask', '2.5.0'), ('sklearn', '0.21.2')]
How to fix the error?
As was commented on the dask-sql repo
It seems that the recent sklearn update has broken dask-ml, which breaks our CI.
This is odd though, since we seem to pin Scikit-Learn in our environment.yml
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-3-001abd10b0d1> in <module>
1 import dask_ml.datasets
----> 2 import dask_ml.cluster
3 import matplotlib.pyplot as plt
~/miniconda/envs/test/lib/python3.7/site-packages/dask_ml/cluster/__init__.py in <module>
1 """Unsupervised Clustering Algorithms"""
2
----> 3 from .k_means import KMeans # noqa
4 from .minibatch import PartialMiniBatchKMeans # noqa
5 from .spectral import SpectralClustering # noqa
~/miniconda/envs/test/lib/python3.7/site-packages/dask_ml/cluster/k_means.py in <module>
21 )
22 from ..utils import _timed, _timer, check_array, row_norms
---> 23 from ._compat import _k_init
24
25 logger = logging.getLogger(__name__)
~/miniconda/envs/test/lib/python3.7/site-packages/dask_ml/cluster/_compat.py in <module>
2
3 if SK_022:
----> 4 from sklearn.cluster._k_means import _k_init
5 else:
6 from sklearn.cluster.k_means_ import _k_init
ModuleNotFoundError: No module named 'sklearn.cluster._k_means'
I was running the notebook for batch prediction
The following code gave an error : predictions = dask.compute(*predictions)
Error:
TypeError: stack(): argument 'tensors' (position 1) must be tuple of Tensors, not builtin_function_or_method
I can't figure out why. Any solutions?
When I've shown people examples.dask.org they often don't realize that they can click on the "Launch Binder" button and get a live session. This is despite our header at the top which says:
You can run this notebook in a live session or view it on Github
I think that we might make this more prominent by
If only we knew someone with some basic web design skills ...
cc @jrbourbeau , in case you or anyone around you has ideas ;)
We switched to installing fsspec
from master
in #95. We should revert back to installing a release version once there's a new release with the changes in fsspec/filesystem_spec#128. Note: at the time of opening this issue the latest release is fsspec==0.4.4
In the Stencil example, the parallel code doesn't do anything to handle boundaries or overlapping. I assume this means that at the block edges there could be some discrepancies? Should this be mentioned, or is it small enough not to be significant?
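If we do want to handle the block edges, dask.array.map_overlap shares a halo between neighbouring blocks; here is a sketch with a simple stand-in stencil (assumes scipy is available):

```python
import dask.array as da
from scipy.ndimage import uniform_filter

x = da.random.random((4000, 4000), chunks=(1000, 1000))

# depth=1 exchanges a one-cell halo between neighbouring blocks, so the
# 3x3 filter sees the same neighbourhood it would on the full array
y = x.map_overlap(uniform_filter, depth=1, boundary="reflect", size=3)
y.compute()
```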
The current prophet notebook talks about installing from master. My guess is that this is no longer necessary.
It would be useful if someone can check on this, and update the notebook if possible.
I think that our docs don't seem to be updating after we push. We've added a couple notebooks in recent months that aren't showing up.
cc @TomAugspurger and @jcrist who might know a bit more about the doc build system here (I think doctr)
A common approach is to train on a bit of data and then use that trained model to predict on lots of data. We could do this using ParallelPostFit in dask-ml, or we can use X.map_blocks
or df.map_partitions
. In either case we might want to be a bit careful about avoiding repeated serialization costs. For example, in the following case I suspect that we include the serialized model in every task
# maybe bad?
model = load_model()
predictions = X.map_blocks(model.predict)
It's probably better to encourage the user to keep the model delayed
# probably better
model = dask.delayed(load_model)()
predictions = X.map_blocks(model.predict)
We should also ensure that dask-ml does this correctly, and includes the model as a single task in the graph so that it gets sent around appropriately (cc @TomAugspurger )
I'm also generally curious if a Keras model that lives on the GPU will eventually make its way back onto the GPU when deserializing.
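For comparison, the ParallelPostFit route mentioned above looks roughly like this (small in-memory training set, hypothetical sizes):

```python
import numpy as np
import dask.array as da
from dask_ml.wrappers import ParallelPostFit
from sklearn.linear_model import LogisticRegression

# train on a small in-memory sample
clf = ParallelPostFit(LogisticRegression())
clf.fit(np.random.random((1_000, 10)), np.random.randint(0, 2, 1_000))

# predict lazily on a large dask array; the fitted model should appear
# once in the graph rather than being re-serialized per block
X = da.random.random((1_000_000, 10), chunks=(100_000, 10))
predictions = clf.predict(X)
```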
The links at the top of the Image Processing page aren't working. Clicking any of the links at the top doesn't bring you down the page to the content as it should.
https://examples.dask.org/applications/image-processing.html
I'm thinking maybe it has to do with the extra apostrophe in the last (cleanup) link? Didn't have time to test that theory though.
I noticed that the mybinder environment.yml file pins dask to version 0.20, but the latest dask release is now up to 1.1.2. It's probably time to update or unpin some of these dependencies. Should we do that?
Currently:
https://github.com/dask/dask-examples/blob/master/binder/environment.yml
channels:
- conda-forge
dependencies:
- python=3
- bokeh=0.13
- dask=0.20
- dask-ml=0.10.0
- distributed=1.24
- jupyterlab=0.35.1
- nodejs=8.9
- numpy
- pandas
- pyarrow==0.10.0
- scikit-learn=0.20
- matplotlib
- nbserverproxy
- nomkl
- h5py
- xarray
- bottleneck
- py-xgboost
- pip:
- graphviz
- dask_xgboost
- seaborn
- mimesis
Extending on #35 it would be nice to have an example using Dask and Torch together to parallelize prediction. This should be a simple embarrassingly parallel use case, but I suspect that it would be pragmatic for lots of folks.
The challenge, I think, is constructing a simple example that hopefully doesn't get too much into Torch or a dataset. In my ideal world this would be something like
import torchvision
model = torchvision.get_model("model_name")
dataset = get_canned_dataset()
>>> imshow(dataset[0]) # show an example image
>>> model.predict(dataset[0])
"this is a cat"
... # then dask things here
Does anyone have good pointers to such a simple case?
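Not an answer to the dataset question, but a rough sketch of the Dask side, with random tensors standing in for real images (assumes torch and torchvision are installed; the model choice and batching are placeholders):

```python
import dask
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)
model.eval()

# keep the model as a single delayed object so it is not re-serialized per task
dmodel = dask.delayed(model)

@dask.delayed
def predict(model, batch):
    with torch.no_grad():
        return model(batch).argmax(dim=1).numpy()

batches = [torch.randn(8, 3, 224, 224) for _ in range(4)]  # stand-in for real images
predictions = dask.compute(*[predict(dmodel, b) for b in batches])
```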
@cicdw any interest in adding an example notebook building a Prefect workflow to examples.dask.org ? https://github.com/dask/dask-examples#contributing
This site gets decent traffic, and we provide testing, HTML static output generation, and serving as a binder.
It seems that our pre-defined workspace is no longer showing up on dask-examples. When I run jupyter lab workspaces export
I do find that it's using the same workspace that we give it, but it's not showing up correctly.
@ian-r-rose do you have any thoughts on why this might be? Has JupyterLab changed how it interprets workspaces recently?
Hi,
I'm writing a notebook example to highlight some key differences between pandas and dask. Are you interested in such a PR?
If so, I currently have the following topics (are there any additional topics that I should include?):
In #14 (comment) @stsievert says
I see most of the other notebooks in this repo are without their outputs in the notebook. Letting the outputs be included in the notebook would lower the barrier to viewing these notebooks, while still allowing users to try binder if they want. Why don't we include the outputs in the notebooks?
This seemed to be worth discussion, so I've raised a separate issue for it here.
I was trying a contrived example with the following code...got this long unexpected error. Any idea in how to proceed?
import dask_xgboost
params = {'objective': 'binary:logistic',
'max_depth': 4, 'eta': 0.01, 'subsample': 0.5,
'min_child_weight': 0.5}
bst = dask_xgboost.train(client, params, train_df.to_dask_array(), label_df.to_dask_array(), num_boost_round=10)
> TypeError Traceback (most recent call last)
> <ipython-input-14-eb3f56620d79> in <module>
> 4 'min_child_weight': 0.5}
> 5
> ----> 6 bst = dask_xgboost.train(client, params, train_df.to_dask_array(), label_df.to_dask_array(), num_boost_round=10)
>
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/dask_xgboost/core.py in train(client, params, data, labels, dmatrix_kwargs, **kwargs)
> 240 """
> 241 return client.sync(
> --> 242 _train, client, params, data, labels, dmatrix_kwargs, **kwargs
> 243 )
> 244
>
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
> 765 else:
> 766 return sync(
> --> 767 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
> 768 )
> 769
>
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
> 332 if error[0]:
> 333 typ, exc, tb = error[0]
> --> 334 raise exc.with_traceback(tb)
> 335 else:
> 336 return result[0]
>
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/utils.py in f()
> 316 if callback_timeout is not None:
> 317 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
> --> 318 result[0] = yield future
> 319 except Exception as exc:
> 320 error[0] = sys.exc_info()
>
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/tornado/gen.py in run(self)
> 1131
> 1132 try:
> -> 1133 value = future.result()
> 1134 except Exception:
> 1135 self.had_exception = True
>
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/tornado/gen.py in run(self)
> 1139 if exc_info is not None:
> 1140 try:
> -> 1141 yielded = self.gen.throw(*exc_info)
> 1142 finally:
> 1143 # Break up a reference to itself
>
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/dask_xgboost/core.py in _train(client, params, data, labels, dmatrix_kwargs, **kwargs)
> 169 for part in parts:
> 170 if part.status == "error":
> --> 171 yield part # trigger error locally
> 172
> 173 # Because XGBoost-python doesn't yet allow iterative training, we need to
>
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/tornado/gen.py in run(self)
> 1131
> 1132 try:
> -> 1133 value = future.result()
> 1134 except Exception:
> 1135 self.had_exception = True
>
> ~/app/anaconda3/envs/dask/lib/python3.7/asyncio/tasks.py in _wrap_awaitable(awaitable)
> 601 that will later be wrapped in a Task by ensure_future().
> 602 """
> --> 603 return (yield from awaitable.__await__())
> 604
> 605
>
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/client.py in __await__(self)
> 410
> 411 def __await__(self):
> --> 412 return self.result().__await__()
> 413
> 414
>
> ~/app/anaconda3/envs/dask/lib/python3.7/site-packages/distributed/client.py in result(self, timeout)
> 219 result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
> 220 if self.status == "error":
> --> 221 typ, exc, tb = result
> 222 raise exc.with_traceback(tb)
> 223 elif self.status == "cancelled":
>
> TypeError: cannot unpack non-iterable coroutine object
>
I believe that we should switch the default environment to use JupyterLab and that we should encode a specific layout on container startup. In this way the user is presented with the dashboard without having to do anything.
@ian-r-rose is the best way to provide a default layout to create a JSON file and insert it into the .jupyter/lab/workspaces
directory, or is there another common route for these things?
At AnacondaCon @AlbertDeFusco went through the dataframe notebook with me and provided a dozen small corrections to improve the flow of a novice user through the notebook. I diligently copied down the changed notebook but seem to have lost all memory of where I placed it.
@AlbertDeFusco if you have time can I ask you for feedback on this notebook again?
Your attention on any of the notebooks here, or on others, would also be very welcome.
array.ipynb
The description at the top "Dask Dataframes coordinate many Pandas dataframes," applies to dataframe, not array.
If this notebook is intended as one of the main intros, it's sort of annoying that the dashboard link situation is so fussy--I first clicked without running the cell, then ran the cell and got a link that won't work, then could click the right link. Since the link isn't going to be right, it might be less noise not to present the (cool) HTML repr of the client.
"This creates a 10000, 10000 arrays of random number" -> "This creates a 10000, 10000 array of random number"
"This creates a 10000, 10000 arrays of random numbers, split into a 10x10 grid of 1000x1000 NumPy arrays." I know you probably want to avoid verbosity, but as this is one of the fundamental dask-unique concepts it might be nice to answer some basic questions about how chunks work, e.g. "This creates a 10000x10000 array of random numbers, represented as many numpy arrays of size 1000x1000 (or smaller if the array cannot be divided evenly). In this case there are 100 (10x10) numpy arrays of size 1000x1000."
dataframe.ipynb
compute is used but not explained like it was in the array notebook
The mix of eager operations and lazy operations is significant, and I don't know if someone would walk away having a real sense for what operations would do what.
The warning /srv/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected. Before: .apply(func) After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result or: .apply(func, meta=('x', 'f8')) for series result """Entry point for launching an IPython kernel.
in the last cell is a bit hard to work your way through. (There was also a warning in cell 9, but it's a lot more digestible.)
delayed.ipynb
from random import random
within each function isn't a very normal pattern in Python in general, and I got hung up on why you would do it for a moment. I still don't know whether it was meaningful (like, hoping the random state would get initialized differently in each process) or just because the need was internal.
zs = dask.persist(*zs)
is unexplained and can look a bit scary.
"Note the red bars for inter-worker communication. Also note how there is lots of parallelism at the beginning but less towards the end as we reach the top of the tree where there is less work to do." This seems to lack the part where it originally told me to look somewhere before I computed.