hpc-carpentry / old-hpc-workflows

Scaling studies on high-performance clusters using Snakemake workflows

Home Page: https://www.hpc-carpentry.org/old-hpc-workflows/

License: Other

Languages: Python 86.23%, Makefile 6.96%, R 5.16%, Shell 1.04%, Ruby 0.61%
Topics: hpc-carpentry, parallel-computing, snakemake-workflows, carpentries-incubator, english, alpha

old-hpc-workflows's Introduction

Tame Your Workflow with Snakemake

This lesson teaches the basics of modern workflow engines through Snakemake.

The example workflow performs a frequency analysis of several public domain books sourced from Project Gutenberg, testing how closely each book conforms to Zipf's Law. All code and data are provided. This example was chosen over a more complex scientific workflow because the goal is to appeal to a wide audience and to keep the focus on building the workflow, not on the underlying processing.

At the end of this lesson, you will:

  • Understand the benefits of workflow engines.
  • Be able to create reproducible analysis pipelines with Snakemake.

Topic breakdown

The lesson outline and rough breakdown of topics is in lesson-outline.md.

Lesson writing instructions

This section is a quick overview of the Software Carpentry lesson template.

For a full guide to the lesson template, see the Software Carpentry example lesson.

Lesson structure

Software Carpentry lessons are generally episodic, with one clear concept for each episode (example).

An episode is just a Markdown file that lives under the _episodes folder. Here is a link to a Markdown cheatsheet covering most Markdown syntax. Additionally, the Software Carpentry lesson template uses several extra bits of formatting; see here for a full guide. The most significant additions are a YAML header that carries metadata (key questions, lesson teaching times, etc.) and special syntax for code blocks, exercises, and the like.

Episode names should be prefixed with their section number plus their episode number within that section (for example, a hypothetical 02-01-intro.md would be the first episode of the second section). This is important because the Software Carpentry lesson template auto-posts lessons in the order that they sort in: as long as your episodes sort into the correct order, they will appear in the correct order on the website.

Publishing changes to GitHub + the GitHub Pages website

The lesson website is viewable at https://carpentries-incubator.github.io/hpc-workflows/.

The lesson website itself is auto-generated from the gh-pages branch of this repository. GitHub Pages rebuilds the website as soon as you push to the gh-pages branch. Because of this, gh-pages is considered the "main" branch.

Previewing changes locally

Having to push to GitHub every time you want to view your changes to the website isn't very convenient. To preview the lesson locally, run make serve, then view the website at localhost:4000 in your browser. Pages are automatically regenerated every time you write to them.

Note that the autogenerated website lives under the _site directory (and doesn't get pushed to GitHub).

This process requires Ruby, Make, and Jekyll. You can find setup instructions here.

Example lessons

A couple of links to example SWC workshop lessons for reference:

old-hpc-workflows's People

Contributors

abbycabs, andrewspiers, bkmgit, brandoncurtis, ccoulombe, dc23, erinbecker, evanwill, fmichonneau, gvwilson, ianlee1521, jduckles, jpallen, jsta, jstaf, katrinleinweber, mawds, maxim-belkin, mr-c, neon-ninja, pbanaszkiewicz, pipitone, reid-a, rgaiacs, synesthesiam, tkphd, tobyhodges, tracykteal, twitwi, wclose


old-hpc-workflows's Issues

Flags on Amdahl

We have the --terse option on Amdahl to help make machine-digestible output. We should discuss the fact that some programs have different output modes, to help with human- or machine-parsing of their outputs. Hat tip @ocaisa
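One reason to prefer a terse mode is that structured output is trivial to consume downstream. A minimal sketch, assuming `--terse` emits a single JSON object on stdout; the field names `nproc` and `execution_time` are hypothetical:

```python
# Sketch: consume amdahl's terse output, assuming it is a single JSON
# object on stdout. The field names below are hypothetical.
import json
import subprocess

proc = subprocess.run(["amdahl", "--terse"],
                      capture_output=True, text=True, check=True)
record = json.loads(proc.stdout)
print(record["nproc"], record["execution_time"])
```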

Develop Snakefile on a node

We can start developing the Snakefile on a compute node (srun --pty bash) using 1 core; then, when a user's Snakefile uses more than 1 core, the scheduler will kill their session for grabbing more than was allocated. This becomes an HPC & shared-resource lesson, and marks the turning point between live development of the script and launching it through the head node. Hat tip @tobyhodges
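A hedged sketch of what a development-stage rule might look like while pinned to that single core (the rule name, output path, and bare amdahl invocation are assumptions):

```python
# Hypothetical Snakefile fragment for the 1-core development phase:
# declaring threads: 1 keeps the rule inside the interactive allocation.
rule amdahl_single:
    output:
        "runs/amdahl_1.out"
    threads: 1
    shell:
        "amdahl > {output}"
```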

Review lesson objectives

To guide the development of this lesson, it would be a good idea to have a clear outline of which parts of the current lesson are important to be kept for the HPC Carpentry setting, which are non-essential but not harmful, and which need to be removed.

One way to do this would be to review the current list of objectives for the lesson and discuss them in the context above. Perhaps dividing them into "must be kept", "could be kept", and "should be removed"? And then you will also need a list of new objectives that you want/need to add, which are not in the lesson in its current form.

Defining the "common workflow" for our lesson

The current example is a set of books that are downloaded. How do we define our raw data? We effectively don't have any; what we are doing is taking measurements with amdahl, which will become our raw data.

In 01-introduction.md we start off by creating a bash script describing the manual workflow. We will somehow need to replicate this. This will require:

  • Generating a set of data (which will require parsing of amdahl output, or perhaps adding a --terse option to amdahl, see #6). Redirecting the amdahl output to a file could work (see the sketch after this list)...or indeed using the output files from SLURM itself.
  • Plotting the result (both graphically...and perhaps in the terminal)
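A hedged sketch of that data-generating step, assuming a --terse JSON mode (see #6) and one output file per processor count; the mpirun invocation will vary by cluster:

```python
# Hypothetical rule: one output file per processor count, produced by
# redirecting amdahl's (assumed) terse output. {nproc} is a wildcard.
rule generate_run:
    output:
        "runs/amdahl_{nproc}.json"
    shell:
        "mpirun -np {wildcards.nproc} amdahl --terse > {output}"
```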

Location of final code products

Currently, Snakefiles and other Snakemake code are in the compressed files in the files directory. Furthermore, they are inside a hidden .solutions directory inside those files. Would it be good to make them "more accessible", e.g. by moving them to the code directory?

Plotting Amdahl's Law

Part of the workflow will be to plot Amdahl's Law. It would be nice if we could do this in the terminal, and (with prettier output) to an image file.

There's a tool, termplotlib, that probably accepts the same options as matplotlib and could be leveraged here.

We could probably also use gnuplot directly, but more contributors are likely to be familiar with matplotlib syntax.
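A minimal sketch of what the terminal plot could look like with termplotlib (the speedup numbers are illustrative, not real measurements, and termplotlib needs gnuplot installed):

```python
# Sketch: terminal plot of measured speedup vs. core count using
# termplotlib. The data points below are hypothetical.
import numpy as np
import termplotlib as tpl

nproc = np.array([1, 2, 4, 8])
speedup = np.array([1.0, 1.8, 3.1, 4.7])  # hypothetical values

fig = tpl.figure()
fig.plot(nproc, speedup, label="speedup", width=60, height=20)
fig.show()
```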

protect default branch before sprint

It's fine to leave gh-pages unprotected while we prepare, but before the sprint it should be protected against direct commits, and PRs should not be merged without an approving review.

Discuss limitations & alternatives to Snakemake

Snakemake may not be the best workflow manager for HPC; we're teaching it because it is broadly accessible, and the lessons are broadly transferable.

Include an episode at the end to discuss the limitations of Snakemake, and introduce alternative tools (Parsl, Fireworks) for HPC.

Create a profile

Seeing -c 1 repeated in so many places made me wonder whether this parameter could go into a profile early in the lesson, to save some keystrokes.

This would be a good learning objective. Configuring software so that learners become more productive is a common task.

Originally posted by @tobyhodges in carpentries-incubator/hpc-workflows#23 (review)
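For reference, a profile is a directory containing a config.yaml whose keys mirror Snakemake's long-form command-line options. A minimal sketch (the directory name and contents are assumptions):

```yaml
# Hypothetical profile: myprofile/config.yaml
# Invoke with `snakemake --profile myprofile` instead of repeating -c 1.
cores: 1
```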

Snakemake best practices

Snakemake has options to use "profiles", as well as YAML files, to control interaction with clusters; these inform the best practices for running on an HPC cluster. This lesson should examine these best practices with a view to doing things the right way, aligned with the Snakemake community.
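As a hedged sketch of the kind of settings such a profile centralizes (the option names follow the long-form CLI flags of Snakemake 7-era releases and should be checked against the version taught; the sbatch string is an assumption):

```yaml
# Hypothetical cluster profile config.yaml: cap concurrent jobs and
# template the submission command once, instead of per invocation.
jobs: 10
cluster: "sbatch --ntasks={threads} --time=00:10:00"
```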

"Perfect" `snakemake` file for Amdahls Law

Ultimately the lesson ends up with the "perfect" Snakefile for the example it uses (which can be found in the .solutions/completing_the_pipeline folder of workflow-engines-lesson.zip). We need to get to a similarly perfect Snakefile for our own use case (hopefully one that uses all the material in the lesson). Once we have that, we can work backwards to replace the existing example with our own.
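As a strawman to work backwards from, a hedged sketch of the shape that Snakefile might take (the core counts, filenames, and the plot_amdahl.py helper are all assumptions):

```python
# Hypothetical end-state Snakefile: run amdahl over several core counts,
# then aggregate all runs into one speedup plot.
NPROC = [1, 2, 4, 8]

rule all:
    input:
        "plots/amdahl.png"

rule run_amdahl:
    output:
        "runs/amdahl_{nproc}.json"
    shell:
        "mpirun -np {wildcards.nproc} amdahl --terse > {output}"

rule plot:
    input:
        expand("runs/amdahl_{nproc}.json", nproc=NPROC)
    output:
        "plots/amdahl.png"
    shell:
        "python plot_amdahl.py {input} --output {output}"
```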

Rework `index.md`

This can only really be tackled once we have a better idea of the whole lesson, but it will include:

  • Introducing the data we will create/use
  • Identifying the prereqs and where they can be found (and I notice now that HPC Carpentry will not fulfil the Python prereq...so how will we tackle this?)

Moore's law

Consider adding an exercise with Moore's law using historical processor data.
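For example, the core of such an exercise could be a doubling-time calculation from two well-known chips; a minimal sketch:

```python
# Sketch: doubling time implied by two historical data points
# (Intel 4004, 1971: ~2,300 transistors; Pentium, 1993: ~3,100,000).
import math

years = 1993 - 1971
growth = 3_100_000 / 2_300
doubling_time = years / math.log2(growth)
print(f"Doubling time: {doubling_time:.1f} years")  # roughly 2 years
```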

Inline Python -> Lesson 10

Snakemake allows for Python inside the Snakefile, which is a neat feature. It's not core to workflows, however, and does not map to other workflow tools. We should (as @ocaisa suggested) use gray-box Python code to plot etc., and move the Python-in-Snake material to the currently-sparse Lesson 10.
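For reference, a minimal sketch of the feature in question: a run: block executes arbitrary Python in place of a shell command (the paths and field handling here are hypothetical):

```python
# Hypothetical rule using inline Python via run: instead of shell:,
# the Snakemake feature that doesn't map onto other workflow tools.
rule summarize:
    input:
        "runs/amdahl_4.json"
    output:
        "runs/summary.txt"
    run:
        import json
        with open(input[0]) as f:
            data = json.load(f)
        with open(output[0], "w") as f:
            f.write(f"{len(data)} fields recorded\n")
```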

primer docs for sprinters

Update CONTRIBUTING to include the branching workflow and the need for PRs during the CarpentryCon Sprint, along with an overview of where the files to be edited reside (_episodes, mostly).

Update `setup.md` for the HPC use case

  • Update the required data files (or do we only generate them instead?)
  • Decide on approach(es) to make the required software for the tutorial available (snakemake, amdahl and also probably a plotting tool). Possibilities include
    • environment modules
    • pip
    • conda/mamba

Locations where we may need to use templates

There are cases where we may need to provide different templates for information in the lessons so that information can be easily tweaked for different schedulers (and systems). The use of templating in https://github.com/carpentries-incubator/hpc-intro/blob/gh-pages/_config.yml is probably a good guide here.

At a glance there do not seem to be too many of these cases, but they will probably include:

  • Getting the computing environment needed to run the tutorial (module, pip, conda or whatever)
  • The scheduler
  • The cluster config cluster.yaml
  • The string for --cluster

All of these seem to be really relevant to https://github.com/carpentries-incubator/hpc-workflows/edit/gh-pages/_episodes/09-cluster.md
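A hedged sketch of what such template variables could look like in _config.yml, loosely modeled on the hpc-intro approach; every key name here is an assumption, and episodes would reference them via Liquid (e.g. site.sched.submit):

```yaml
# Hypothetical _config.yml fragment: scheduler-specific strings defined
# once, then referenced from episodes instead of being hard-coded.
sched:
  name: "Slurm"
  submit: "sbatch"
  interactive: "srun"
snakemake:
  cluster_string: "sbatch --ntasks={threads}"
```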

Consider Parsl

NERSC's Snakemake docs list Snakemake's "cluster mode" as a disadvantage, since it submits each "rule" as a separate job, thereby spamming the scheduler with dependent tasks. The main Snakemake process also resides on the login node until all jobs have finished, occupying some resources.

NERSC specifically documents Parsl as the recommended alternative for multinode jobs. I was aware of Parsl as a Python extension for parallel programming, but had not recognized its ability to dispatch work directly on Slurm (and possibly other schedulers).

This synergy suggests Parsl as a viable alternative to Snakemake, since it (a) would integrate readily with the Python-based Amdahl code and (b) could form the basis of a Programming for HPC lesson with thematic callbacks to this prior lesson in the workshop.
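A hedged sketch of what dispatching Amdahl runs through Parsl on Slurm might look like (the partition name, block sizing, and single-task amdahl call are assumptions; real multinode MPI runs would need more configuration):

```python
# Sketch: dispatch amdahl runs to Slurm via Parsl. Provider settings are
# illustrative; multinode MPI execution needs additional configuration.
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

parsl.load(Config(executors=[
    HighThroughputExecutor(
        label="slurm_htex",
        provider=SlurmProvider(partition="debug", nodes_per_block=1),
    )
]))

@python_app
def run_amdahl():
    import subprocess
    return subprocess.run(["amdahl", "--terse"],
                          capture_output=True, text=True).stdout

futures = [run_amdahl() for _ in range(3)]
print([f.result() for f in futures])
```

Each app call returns a future, so the dependency graph lives in ordinary Python rather than in a rule file, which is the contrast with Snakemake worth drawing out in the lesson.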
