Dask is a flexible parallel computing library for analytics. See documentation for more information.
New BSD. See License File.
Dask development blog
Home Page: https://blog.dask.org/
This blogpost should probably be edited to include direct links to the example data and example PSF.
I think I found the data & PSF, more details here #138 (comment)
Link to the data: https://drive.google.com/drive/folders/13mpIfqspKTIINkfoWbFsVtFF8D7jbTqJ (linked to from this earlier blogpost about image loading)
Link to the PSF: https://drive.google.com/drive/folders/13udO-h9epItG5MNWBp0VxBkKCllYBLQF (discussed here)
We will use image data generously provided by Gokul Upadhyayula at the Advanced Bioimaging Center at UC Berkeley and discussed in this paper (preprint), though the workloads presented here should work for any kind of imaging data, or array data generally.
I just noticed that the single-gpu-cupy-benchmarks blog post is not listed on the blog index. Can someone add it? Thanks!
We now have a couple hundred responses from the Dask User Survey. We should probably analyze this data and write about it. This includes things like
We should also add context about how this is being used to drive current work
Now that release v5.3rc03 of ITK is available (which should include this PR), it would be good to do a follow-up blogpost to this one about using Dask + ITK together.
The purpose of this would be:
The first step is re-running the code from the earlier blogpost with ITK v5.3rc03 or above and seeing whether that works or not. Then we write up whatever we find.
Here are the comments specifically discussing what should be included in a followup blogpost:
Links:
The idea of having a Dask-adjacent post on the Dask blog came up in the last monthly community meeting. When thinking of interesting projects in the space, msgspec came up. @jcrist, do you think a short blog post on msgspec in the Dask blog makes sense? Do you have any interest in authoring such a post?
This idea came up in the September 2022 monthly community meeting. Opening an issue so we don't lose track of the idea.
It would be handy to have a blogpost about loading image data into Dask Array, and then possibly storing it into some other format, like HDF5 or Zarr.
It might be nice to have both a trivial example here (a simple stack of images), as well as a less-trivial example (if those are common).
cc @jakirkham , perhaps the world's leading expert
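As a sketch of what the trivial example might look like: build a lazy stack from delayed per-file loads, then write it out. Names here are illustrative (a real post would use an actual reader such as `skimage.io.imread` on filenames, and would finish with something like `stack.to_zarr(...)` or `stack.to_hdf5(...)`):

```python
import numpy as np
import dask
import dask.array as da

# Hypothetical loader standing in for e.g. skimage.io.imread(filename)
def load_frame(i):
    return np.full((512, 512), float(i))

# Build a lazy (n_frames, 512, 512) stack without reading everything
# into memory; each frame is only loaded when computed.
lazy = [dask.delayed(load_frame)(i) for i in range(10)]
frames = [da.from_delayed(d, shape=(512, 512), dtype=float) for d in lazy]
stack = da.stack(frames)

print(stack.shape)                       # (10, 512, 512)
print(float(stack[3, 0, 0].compute()))   # 3.0
```

From here, `stack` behaves like any other Dask Array and can be rechunked or stored to HDF5 or Zarr.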
So I just pushed out http://matthewrocklin.com/blog/work/2019/01/03/dask-array-gpus-first-steps which I think would be a great post to include here as well, except that at the end I say "come work for NVIDIA" which is a bit corporate. Should we include this post on blog.dask.org as well? Some options:
Thoughts or objections?
This article https://blog.dask.org/2023/04/14/scheduler-environment-requirements includes statements like the following:
> If you use value-add hardware on the client and workers such as GPUs you’ll need to ensure your scheduler has one
This statement can be misleading. It's very true for RAPIDS work, but generally less true for PyTorch or other GPU work. (here is a pretty typical example).
I've fielded a bunch of questions on this topic. Here is an example. I think that we should alter this blog post to talk more about how serialization works. This article is causing non-trivial confusion among general, non-RAPIDS, GPU users.
I'm not sure this belongs as a blogpost, but I have been including a bit about meta in some talks I've been giving, and there are recurring questions on the issue tracker that make it clear the concept of meta is not well understood.
Basically I'd just be expanding on this comment: dask/dask#8515 (comment)
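For reference, the kind of example such a post might expand on. This is a minimal sketch using `dask.array` (the lambda and dtype are arbitrary illustrations):

```python
import numpy as np
import dask.array as da

x = da.ones((4, 4), chunks=(2, 2))

# Without meta, Dask calls the function on a zero-size array to infer
# the output's container type and dtype; passing meta states these
# explicitly and skips that (sometimes fragile) inference step.
y = x.map_blocks(
    lambda block: (block * 2).astype(np.float32),
    meta=np.array((), dtype=np.float32),
)

print(y.dtype)             # float32, known before any computation
print(float(y.compute().sum()))
```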
There is a factual error in my old skeleton analysis blogpost.
The blogpost shows a violin plot of the euclidean-distance measurement from skan, and says this is the skeleton branch thickness.
> We can see that there are more thick blood vessel branches in the healthy lung.
That is incorrect. I misunderstood, and the error wasn't caught by Juan's otherwise excellent review (Juan is the author of the skan library).
Instead, it is correct to call this "the straight line distance from one end of the branch to the other".
Juan says:
> The first thing I’ll say is that euclidean distance is not the thickness — it is the straight line distance from one end of the branch to the other
See this comment and this comment in an image.sc forum discussion for the full context. That discussion is from last year, but I only became aware of it today.
Hi Folks, it looks like we currently track tags in each of the markdown files for each blogpost, but we don't expose these in the HTML or XML/RSS/Atom feeds.
We're currently trying to migrate our blog over to a new platform, and it would be useful to have these in the RSS/Atom feeds, even if they're not visible anywhere.
Does anyone have time to add this? At minimum it would probably mean looking up what approach people use by default, and then editing https://github.com/dask/dask-blog/blob/gh-pages/atom.xml or the layouts directory to include tags somewhere. (Grepping for tags yields some interesting results)
@jacobtomlinson is this easy for you or someone around you?
I have a use case where I collect multi-terabyte electron microscopy time-series videos. The image processing requires a combination of resizing (e.g. 4k → 1k in xy, or time-averaging frames), various filtering steps, background subtraction, and image alignment. Some of these steps can be performed by considering only an individual frame; others require using neighboring frames, etc. After processing, the images are typically rendered to video and/or processed programmatically. Currently a lot of this is done with proprietary software, which is limited in its capabilities and slow.
It would be great to create an open-source repo for analysing these sorts of datasets, underpinned by Dask. I've written a basic workflow, which with a bit of polish and some assistance, I thought could be the basis of a blog post, and would be a great spot for me to start a repo from.
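The per-frame vs. neighboring-frame distinction above maps naturally onto blockwise operations vs. `map_overlap` in Dask. A toy sketch of that structure, with random data standing in for real video (the rolling-mean filter and chunk sizes are illustrative):

```python
import numpy as np
import dask.array as da

# Toy stand-in for a (time, y, x) video; real data would be loaded lazily.
video = da.random.random((20, 256, 256), chunks=(5, 256, 256))

# Per-frame step: background subtraction works blockwise.
bg_removed = video - video.mean(axis=(1, 2), keepdims=True)

# Neighboring-frame step: a 3-frame rolling mean needs one frame of
# overlap between time chunks, which map_overlap provides.
def rolling_mean(block):
    out = np.empty_like(block)
    for t in range(block.shape[0]):
        lo, hi = max(t - 1, 0), min(t + 2, block.shape[0])
        out[t] = block[lo:hi].mean(axis=0)
    return out

smoothed = bg_removed.map_overlap(rolling_mean, depth=(1, 0, 0), boundary="reflect")
print(smoothed.shape)   # (20, 256, 256)
```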
The blog post would demonstrate:
might be a bit too similar to:
We've had a fair number of questions related to things changed after adding the annotation machinery. Most of the answers are pretty similar. Also this often ties into other things people may want to do (heterogenous computing, special resource allocation, etc.). It might be helpful to create some higher visibility content on this explaining what changed, how users should update existing code, and what new things they might now be able to do with annotations
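For context, a minimal example of the annotation machinery in question. The `resources={"GPU": 1}` hint is illustrative; it only takes effect on a distributed scheduler whose workers declare that resource, and the local scheduler ignores it:

```python
import dask
import dask.array as da

x = da.ones((10, 10), chunks=(5, 5))

# Tasks created inside the context manager carry the annotation; a
# distributed scheduler can use it, e.g. to route these tasks to
# workers started with --resources "GPU=1".
with dask.annotate(resources={"GPU": 1}):
    y = x + 1

print(int(y.sum().compute()))   # 200
```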
@VolkerH I saw you added a small note to the DaskFusion README that links to Tobias' project.
Should I do the same in the blogpost under the "Also see" section heading at the bottom? They each focus on slightly different things, so it might be helpful for people.
https://twitter.com/TobiasAdeJong/status/1466719789280382977
Some [feedback on the blog post](https://twitter.com/TobiasAdeJong/status/1466719789280382977) by [Tobias deJong](https://github.com/TAdeJong) points out a very similar approach that also incorporates optimization of tile positions, [see this notebook](https://github.com/TAdeJong/LEEM-analysis/blob/master/6%20-%20Stitching.ipynb).
Ian mentioned this tweet to me today. I originally wrote it because I'd just given a tutorial, and lots of people were confused about how to choose good chunk sizes in Dask. Apparently that was helpful to a lot of people (this is supported by the twitter analytics stats, which are much higher than typical).
For better discoverability and permanence, it might be good to have these tips in blogpost format (twitter is a bit ephemeral, and searches don't often unearth content there).
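As a flavor of the content, one of the tips might be illustrated like this (the ~128 MB chunk target is a common rule of thumb, not an official recommendation):

```python
import dask.array as da

# A 2D float64 array of ~80 GB; chunks of (10000, 10000) are ~800 MB
# each, which is likely too large for comfort on most workers.
x = da.ones((100_000, 100_000), chunks=(10_000, 10_000))
print(x.chunksize, x.npartitions)   # (10000, 10000) 100

# Rechunking to 4000 x 4000 float64 chunks (~128 MB each) keeps each
# task's memory footprint modest while avoiding too many tiny tasks.
y = x.rechunk((4_000, 4_000))
print(y.npartitions)                # 625
```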
Describe the issue:
I'm trying to contribute a blog but cannot build the ruby/jekyll software environment. Most likely because I'm on an M1 Mac.
Minimal Complete Verifiable Example:
```
conda install rb-commonmarker
```

OR

```
bundle install
```

within the project directory.

Partial output of `bundle install`:
```
...
Installing nokogiri 1.13.9 (arm64-darwin)
Gem::Ext::BuildError: ERROR: Failed to build gem native extension.
current directory: /Users/rpelgrim/mambaforge/envs/dask-blog/share/rubygems/gems/commonmarker-0.23.6/ext/commonmarker
/Users/rpelgrim/mambaforge/envs/dask-blog/bin/ruby -I /Users/rpelgrim/mambaforge/envs/dask-blog/lib/ruby/3.1.0 -r
./siteconf20221121-16691-jyr1nw.rb extconf.rb
creating Makefile
current directory: /Users/rpelgrim/mambaforge/envs/dask-blog/share/rubygems/gems/commonmarker-0.23.6/ext/commonmarker
make DESTDIR\= clean
current directory: /Users/rpelgrim/mambaforge/envs/dask-blog/share/rubygems/gems/commonmarker-0.23.6/ext/commonmarker
make DESTDIR\=
compiling arena.c
make: arm64-apple-darwin20.0.0-clang: No such file or directory
make: *** [arena.o] Error 1
make failed, exit code 2
...
```
Anything else we need to know?:
Environment:
After a bit of profiling, this is what I found out for Dask-GLM with Dask array:
```
14339  0.139  0.000   0.814  0.000 /home/pentschev/.local/lib/python3.5/site-packages/dask/local.py:430(fire_task)
44898 19.945  0.000  19.945  0.000 {method 'acquire' of '_thread.lock' objects}
 4055  0.042  0.000  19.992  0.005 /usr/lib/python3.5/threading.py:261(wait)
14339  0.107  0.000  20.234  0.001 /usr/lib/python3.5/queue.py:147(get)
14339  0.018  0.000  20.253  0.001 /home/pentschev/.local/lib/python3.5/site-packages/dask/local.py:140(queue_get)
  122  0.117  0.001  22.327  0.183 /home/pentschev/.local/lib/python3.5/site-packages/dask/local.py:345(get_async)
  122  0.013  0.000  22.346  0.183 /home/pentschev/.local/lib/python3.5/site-packages/dask/threaded.py:33(get)
  122  0.004  0.000  22.733  0.186 /home/pentschev/.local/lib/python3.5/site-packages/dask/base.py:345(compute)
    1  0.020  0.020  23.224 23.224 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/algorithms.py:200(admm)
    1  0.000  0.000  23.267 23.267 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/utils.py:13(normalize_inputs)
    1  0.000  0.000  23.268 23.268 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/estimators.py:65(fit)
```
A big portion of the time seems to be spent waiting on a thread lock. Also, looking at the callers, we see 100 `compute()` calls departing from `admm()`, which means it's not converging and is stopping only at `max_iter`, as @cicdw suggested:
```
/home/pentschev/.local/lib/python3.5/site-packages/dask/base.py:345(compute) <- 100 0.004 19.637 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/algorithms.py:197(admm)
```
Running with NumPy, the algorithm converges, showing only 7 `compute()` calls:
```
/home/pentschev/.local/lib/python3.5/site-packages/dask/base.py:345(compute) <- 7 0.000 0.120 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/algorithms.py:197(admm)
```
I'm running Dask 1.1.4 and the Dask-GLM master branch, to ensure that my local changes aren't introducing any bugs. However, if I run my Dask-GLM branch and use CuPy as a backend, it also converges in 7 iterations.
To me this suggests we have one of those very well-hidden and difficult-to-track bugs in Dask. Before I spend hours on this, any suggestions on what we could look for?
Originally posted by @pentschev in #15
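For readers wanting to reproduce the caller-counting technique used above, here is a toy standard-library example; the `inner`/`outer` functions are stand-ins for `compute()` and `admm()`:

```python
import cProfile
import io
import pstats

def inner():
    return sum(i * i for i in range(1000))

def outer():
    return [inner() for _ in range(100)]

# Profile the code, then use print_callers to see who called a given
# function and how many times -- the same view used above to compare
# the 100-call and 7-call cases.
prof = cProfile.Profile()
prof.enable()
outer()
prof.disable()

stream = io.StringIO()
pstats.Stats(prof, stream=stream).print_callers("inner")
report = stream.getvalue()
print("inner" in report)   # True
```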
It'd be good to have a blogpost about how to choose good settings for Dask on HPC. Users are often confused about this.
I think one reason this is particularly confusing is that settings often need to be defined in multiple locations, and people are confused about how they interact. For example, someone might submit a job to SLURM with sbatch, which then runs a python program involving Dask, and want to know how that fits together.
...you know what would ALSO be a good blogpost? How to choose good cluster settings. Eg: how your SLURM/PBS/whatever batch submission settings relate to the settings you need to put in your dask-jobqueue cluster object.
To be honest I'm still a bit confused by this, and it is something other people ask me too.
If either @jacobtomlinson or @ian-r-rose would like to help make this, that would be very useful to refer people to (hint, hint) 😄
@guillaumeeb has kindly agreed to help put this together #116 (comment)
Hi all, I saw this issue, and I agree that both ideas would make great articles. Those are questions we see a lot as HPC admin/experts.
I can try to help with the second one, on batch submission settings! Everyone is confused about it.
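One source of the confusion is that the same information appears in two places: with dask-jobqueue you don't write the worker batch script yourself, because the cluster object's kwargs generate it. A rough illustration (the exact script dask-jobqueue emits differs by version; the queue name, numbers, and addresses here are made up):

```
# Python side (what you write):
#   cluster = SLURMCluster(queue="regular", cores=24, memory="96GB",
#                          walltime="01:00:00")
#   cluster.scale(jobs=4)   # submits 4 jobs like the one below

# Generated job script (roughly what gets sbatch-ed for you):
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p regular
#SBATCH -n 1
#SBATCH --cpus-per-task=24
#SBATCH --mem=96G
#SBATCH -t 01:00:00
python -m distributed.cli.dask_worker tcp://<scheduler>:8786 \
    --nthreads 24 --memory-limit 96GB
```

The key point a post could make: the `#SBATCH` lines come from the `SLURMCluster` kwargs, not from the settings of whatever job or login session you launched your own script from.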
Recently we added a new `engine="ipycytoscape"` option to Dask's `visualize(...)` functionality. It might be good to write a short blog post about it. @ian-r-rose, you added `ipycytoscape` support -- do you have any interest in writing such a post?
This idea came up in the September 2022 monthly community meeting. Opening an issue so we don't lose track of the idea.
It'd be good to show the work performed on Dask-ML's `HyperbandSearchCV`. It was recently merged in dask/dask-ml#221.
What happened:
The "Refresh" github action has been failing for the last 8 months. This GitHub Actions workflow is supposed to cause the GitHub Pages site for the Dask blog to be rebuilt every day at 3am UTC.
https://github.com/dask/dask-blog/blob/gh-pages/.github/workflows/refresh.yml
What you expected to happen:
I expected the github action to rebuild/refresh the dask blog website instead of failing.
Minimal Complete Verifiable Example:
Here are the logs for the latest failed action: https://github.com/dask/dask-blog/runs/4403644252?check_suite_focus=true
It fails on the step "Trigger GitHub pages rebuild"
```
Run curl --fail --request POST \
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (22) The requested URL returned error: 401
Error: Process completed with exit code 22.
```
Anything else we need to know?:
The last successful refresh action ran 8 months ago. I'm guessing something about the github actions environment changed around that time.
cc @jacobtomlinson - maybe you have some suggestions? As the author of #69 you have some good background context on this.
Environment:
Using Github actions associated with this repository.
```
Current runner version: '2.285.0'
Operating System
  Ubuntu
  20.04.3
  LTS
Virtual Environment
  Environment: ubuntu-20.04
  Version: 20211129.1
  Included Software: https://github.com/actions/virtual-environments/blob/ubuntu20/20211129.1/images/linux/Ubuntu2004-README.md
  Image Release: https://github.com/actions/virtual-environments/releases/tag/ubuntu20%2F20211129.1
Virtual Environment Provisioner
  1.0.0.0-master-20211123-1
GITHUB_TOKEN Permissions
  Actions: write
  Checks: write
  Contents: write
  Deployments: write
  Discussions: write
  Issues: write
  Metadata: read
  Packages: write
  Pages: write
  PullRequests: write
  RepositoryProjects: write
  SecurityEvents: write
  Statuses: write
Secret source: Actions
Prepare workflow directory
Prepare all required actions
```
Hi All,
I'd like for a group of us to write a blogpost about using Dask on supercomputers, including why we like it today, and highlighting improvements that could be done in the near future to improve usability. My goal for this post is to show it around to various HPC groups, and to show it to my employer to motivate work in this area. I think that now is a good time for this community to have some impact by sharing its recent experience.
cc'ing some notable users today @guillaumeeb @jhamman @kmpaul @lesteve @dharhas @josephhardinee @jakirkham
To start conversation, if we were to structure the post as five reasons we use Dask on HPC and five things that could be better, what would be those five things? I think it'd be good to get a five-item list from a few people cc'ed above, then maybe we talk about those lists and I (or anyone else if interested) composes an initial draft that we can then all iterate on?
Are the examples run on datasets of 2.8 KB and 28 GB, or 2.8 MB and 28 MB?
Section Fast Fitting timing
It would be useful to have a blogpost that shows using cuML and Dask-ML together for hyperparameter optimization. I imagine the gist of this would be something like the following:
cc @quasiben
dask-memusage is a tool I wrote that gives you the max memory usage per task in the executed graph, so you can:
https://github.com/itamarst/dask-memusage/
Happy to write a blog post if you're interested.
Clicking the links in the table of contents in this blogpost doesn't take you to the corresponding section of the post, but it does work if you do it here.
Did I get the syntax for this wrong? Or is jekyll somehow losing these markers when it renders things?
This is what I used as my markdown syntax:

```
## Contents
* [Background](#Background)
* [What we learned](#What-we-learned)
* [From Dask users](#From-Dask-users)
* [From other software libraries](#From-other-software-libraries)
* [Opportunities we see](#Opportunities-we-see)
* [Strategic plan](#strategic-plan)
* [Limitations](#Limitations)
* [Methods](#Methods)
```
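If the post is rendered by Jekyll's default kramdown converter, a likely culprit is ID casing: kramdown auto-generates lowercase heading IDs (e.g. `## What we learned` becomes `id="what-we-learned"`), so mixed-case fragments like `#What-we-learned` don't match anything. Assuming that's the cause, lowercasing the fragments should fix it:

```
## Contents
* [Background](#background)
* [What we learned](#what-we-learned)
* [From Dask users](#from-dask-users)
* [From other software libraries](#from-other-software-libraries)
* [Opportunities we see](#opportunities-we-see)
* [Strategic plan](#strategic-plan)
* [Limitations](#limitations)
* [Methods](#methods)
```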
This PR is currently in progress, but could be merged soon (for some loose value of "soon"; I don't have a good idea of when): ome/ome-zarr-py#192
When it is done, I think it might be nice to have a blogpost about how to generate a multiscale image array and save it to disk, etc.
This is something that surprisingly doesn't seem to have a single, obvious, best way to do it (see discussion ome/ome-zarr-py#215). So when there is a convenience function available, it would be good to highlight that with a blogpost.
Jacob, feel free to nudge me in a few months about this, if you like. (That may or may not work, I can't say for sure I'll be available to do more about it then, but it's worth a try)
Maybe we should have an update blogpost on how we've been introducing GPU support to dask-image, as well as an update on the broader plan for dispatching (some discussion in scipy/scipy#10204)
What do you think, @jakirkham & @quasiben?
The images in this blogpost appear way too large. The size should be reduced so they fit comfortably. This might be happening on other posts (especially other posts of mine) too.
There are two possible approaches to fix this:

`<img src="image/file.jpg" alt="alt text" width="700">`

with a maximum width of 700 pixels (this seems to be roughly the same ...)

We've run into some people who use ITK a bit on imaging data. A blogpost that showed a simple-yet-realistic workflow with Dask Array and ITK together would probably help onboard these groups more effectively.
In conversation with @jakirkham and Gokul we were thinking about something like the following:
This workflow is far from set in stone. I suspect that things will quickly change when someone tries this on real data, but it might provide an initial guideline. Also, as @jakirkham brought up, it might be worth splitting this up into a few smaller posts.
cc @thewtex for visibility. I think that he has also been thinking about these same problems of using Dask to scale out ITK workloads on large datasets.