hpc-carpentry / old-hpc-workflows

Scaling studies on high-performance clusters using Snakemake workflows

Home Page: https://www.hpc-carpentry.org/old-hpc-workflows/

License: Other

Languages: Python 86.23%, Makefile 6.96%, R 5.16%, Shell 1.04%, Ruby 0.61%
Topics: hpc-carpentry, parallel-computing, snakemake-workflows, carpentries-incubator, english, alpha

old-hpc-workflows's Introduction

Tame Your Workflow with Snakemake

This lesson teaches the basics of modern workflow engines through Snakemake.

The example workflow performs a frequency analysis of several public domain books sourced from Project Gutenberg, testing how closely each book conforms to Zipf's Law. All code and data are provided. This example was chosen over a more complex scientific workflow because the goal is to appeal to a wide audience and to keep the focus on building the workflow, not on the underlying processing.

At the end of this lesson, you will:

  • Understand the benefits of workflow engines.
  • Be able to create reproducible analysis pipelines with Snakemake.

Topic breakdown

The lesson outline and rough breakdown of topics is in lesson-outline.md.

Lesson writing instructions

This section is a quick overview of the Software Carpentry lesson template.

For a full guide to the lesson template, see the Software Carpentry example lesson.

Lesson structure

Software Carpentry lessons are generally episodic, with one clear concept for each episode (example).

An episode is just a Markdown file that lives under the _episodes folder. Here is a link to a Markdown cheatsheet covering most Markdown syntax. Additionally, the Software Carpentry lesson template uses several extra bits of formatting; see here for a full guide. The most significant additions are a YAML header that carries metadata (key questions, lesson teaching times, etc.) and special syntax for code blocks, exercises, and the like.

Episode names should be prefixed with their section number plus their episode number within that section (for example, a hypothetical 02-01-intro.md would be the first episode of the second section). This is important because the Software Carpentry lesson template auto-posts lessons in the order that they sort in: as long as your episodes sort into the correct order, they will appear in the correct order on the website.

Publishing changes to GitHub + the GitHub Pages website

The lesson website is viewable at https://carpentries-incubator.github.io/hpc-workflows/.

The lesson website itself is auto-generated from the gh-pages branch of this repository. GitHub Pages rebuilds the website as soon as you push to the gh-pages branch. Because of this, gh-pages is considered the "main" branch.

Previewing changes locally

Having to push to GitHub every time you want to view your changes to the website isn't very convenient. To preview the lesson locally, run make serve, then view the website at localhost:4000 in your browser. Pages are automatically regenerated every time you write to them.

Note that the autogenerated website lives under the _site directory (and doesn't get pushed to GitHub).

This process requires Ruby, Make, and Jekyll. You can find setup instructions here.

Example lessons

A couple of links to example SWC workshop lessons for reference:

old-hpc-workflows's People

Contributors

abbycabs, andrewspiers, bkmgit, brandoncurtis, ccoulombe, dc23, erinbecker, evanwill, fmichonneau, gvwilson, ianlee1521, jduckles, jpallen, jsta, jstaf, katrinleinweber, mawds, maxim-belkin, mr-c, neon-ninja, pbanaszkiewicz, pipitone, reid-a, rgaiacs, synesthesiam, tkphd, tobyhodges, tracykteal, twitwi, wclose


old-hpc-workflows's Issues

Flags on Amdahl

We have the --terse option on Amdahl to help make machine-digestible output. We should discuss the fact that some programs have different output modes, to help with human- or machine-parsing of their outputs. Hat tip @ocaisa
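One reason to prefer a terse mode is that structured output is trivial to consume downstream. A minimal sketch, assuming `--terse` emits a single JSON object on stdout; the field names `nproc` and `execution_time` are hypothetical:

```python
# Sketch: consume amdahl's terse output, assuming it is a single JSON
# object on stdout. The field names below are hypothetical.
import json
import subprocess

proc = subprocess.run(["amdahl", "--terse"],
                      capture_output=True, text=True, check=True)
record = json.loads(proc.stdout)
print(record["nproc"], record["execution_time"])
```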

Develop Snakefile on a node

We can start developing the Snakefile on a compute node (srun --pty bash) using 1 core; then, when a user's Snakefile uses more than 1 core, the scheduler will kill their session for grabbing more than was allocated. This becomes an HPC & shared-resource lesson, and marks the turning point between live development of the script and launching it through the head node. Hat tip @tobyhodges
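A hedged sketch of what a development-stage rule might look like while pinned to that single core (the rule name, output path, and bare amdahl invocation are assumptions):

```python
# Hypothetical Snakefile fragment for the 1-core development phase:
# declaring threads: 1 keeps the rule inside the interactive allocation.
rule amdahl_single:
    output:
        "runs/amdahl_1.out"
    threads: 1
    shell:
        "amdahl > {output}"
```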

Review lesson objectives

To guide the development of this lesson, it would be a good idea to have a clear outline of which parts of the current lesson are important to be kept for the HPC Carpentry setting, which are non-essential but not harmful, and which need to be removed.

One way to do this would be to review the current list of objectives for the lesson and discuss them in the context above. Perhaps dividing them into "must be kept", "could be kept", and "should be removed"? And then you will also need a list of new objectives that you want/need to add, which are not in the lesson in its current form.

Defining the "common workflow" for our lesson

The current example is a set of books that are downloaded. How do we define our raw data? We effectively don't have any; what we are doing is taking measurements with amdahl, which will become our raw data.

In 01-introduction.md we start off by creating a bash script describing the manual workflow. We will somehow need to replicate this. This will require:

  • Generating a set of data (which will require parsing of amdahl output, or perhaps adding a --terse option to amdahl, see #6). Redirecting the amdahl output to a file could work (see the sketch after this list)...or indeed using the output files from SLURM itself.
  • Plotting the result (both graphically...and perhaps in the terminal)
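A hedged sketch of that data-generating step, assuming a --terse JSON mode (see #6) and one output file per processor count; the mpirun invocation will vary by cluster:

```python
# Hypothetical rule: one output file per processor count, produced by
# redirecting amdahl's (assumed) terse output. {nproc} is a wildcard.
rule generate_run:
    output:
        "runs/amdahl_{nproc}.json"
    shell:
        "mpirun -np {wildcards.nproc} amdahl --terse > {output}"
```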

Location of final code products

Currently, Snakefiles and other Snakemake code are in the compressed files in the files directory. Furthermore, they are inside a hidden .solutions directory inside those files. Would it be good to make them "more accessible", e.g. by moving them to the code directory?

Plotting Amdahl's Law

Part of the workflow will be to plot Amdahl's Law. It would be nice if we could do this in the terminal, and (with prettier output) to an image file.

There's a tool, termplotlib, that probably accepts the same options as matplotlib and could be leveraged here.

We could probably also use gnuplot directly, but more contributors are likely to be familiar with matplotlib syntax.
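A minimal sketch of what the terminal plot could look like with termplotlib (the speedup numbers are illustrative, not real measurements, and termplotlib needs gnuplot installed):

```python
# Sketch: terminal plot of measured speedup vs. core count using
# termplotlib. The data points below are hypothetical.
import numpy as np
import termplotlib as tpl

nproc = np.array([1, 2, 4, 8])
speedup = np.array([1.0, 1.8, 3.1, 4.7])  # hypothetical values

fig = tpl.figure()
fig.plot(nproc, speedup, label="speedup", width=60, height=20)
fig.show()
```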

protect default branch before sprint

It's fine to leave gh-pages unprotected while we prepare, but before the sprint it should be protected against direct commits, and PRs should not be merged without an approving review.

Discuss limitations & alternatives to Snakemake

Snakemake may not be the best workflow manager for HPC; we're teaching it because it is broadly accessible, and the lessons are broadly transferable.

Include an episode at the end to discuss the limitations of Snakemake, and introduce alternative tools (Parsl, Fireworks) for HPC.

Create a profile

Seeing -c 1 repeated in so many places made me wonder whether this parameter could go into a profile early in the lesson, to save some keystrokes.

This would be a good learning objective. Configuring software so that learners become more productive is a common task.

Originally posted by @tobyhodges in carpentries-incubator/hpc-workflows#23 (review)
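For reference, a profile is a directory containing a config.yaml whose keys mirror Snakemake's long-form command-line options. A minimal sketch (the directory name and contents are assumptions):

```yaml
# Hypothetical profile: myprofile/config.yaml
# Invoke with `snakemake --profile myprofile` instead of repeating -c 1.
cores: 1
```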

Snakemake best practices

Snakemake has options to use "profiles", as well as YAML files, to control interaction with clusters; these inform the best practices for running on an HPC cluster. This lesson should examine these best practices with a view to doing things the right way, aligned with the Snakemake community.
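As a hedged sketch of the kind of settings such a profile centralizes (the option names follow the long-form CLI flags of Snakemake 7-era releases and should be checked against the version taught; the sbatch string is an assumption):

```yaml
# Hypothetical cluster profile config.yaml: cap concurrent jobs and
# template the submission command once, instead of per invocation.
jobs: 10
cluster: "sbatch --ntasks={threads} --time=00:10:00"
```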

"Perfect" `snakemake` file for Amdahls Law

Ultimately the lesson ends up with the "perfect" Snakefile for the example it uses (which can be found in the .solutions/completing_the_pipeline folder of workflow-engines-lesson.zip). We need to get to a similarly perfect Snakefile for our own use case (hopefully one that uses all the material in the lesson). Once we have that, we can work backwards to replace the existing example with our own.
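As a strawman to work backwards from, a hedged sketch of the shape that Snakefile might take (the core counts, filenames, and the plot_amdahl.py helper are all assumptions):

```python
# Hypothetical end-state Snakefile: run amdahl over several core counts,
# then aggregate all runs into one speedup plot.
NPROC = [1, 2, 4, 8]

rule all:
    input:
        "plots/amdahl.png"

rule run_amdahl:
    output:
        "runs/amdahl_{nproc}.json"
    shell:
        "mpirun -np {wildcards.nproc} amdahl --terse > {output}"

rule plot:
    input:
        expand("runs/amdahl_{nproc}.json", nproc=NPROC)
    output:
        "plots/amdahl.png"
    shell:
        "python plot_amdahl.py {input} --output {output}"
```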

Rework `index.md`

This can only really be tackled once we have a better idea of the whole lesson, but it will include:

  • Introducing the data we will create/use
  • Identifying the prereqs and where they can be found (and I notice now that HPC Carpentry will not fulfil the Python prereq...so how will we tackle this?)

Moore's law

Consider adding an exercise with Moore's law using historical processor data.
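For example, the core of such an exercise could be a doubling-time calculation from two well-known chips; a minimal sketch:

```python
# Sketch: doubling time implied by two historical data points
# (Intel 4004, 1971: ~2,300 transistors; Pentium, 1993: ~3,100,000).
import math

years = 1993 - 1971
growth = 3_100_000 / 2_300
doubling_time = years / math.log2(growth)
print(f"Doubling time: {doubling_time:.1f} years")  # roughly 2 years
```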

Inline Python -> Lesson 10

Snakemake allows for Python inside the Snakefile, which is a neat feature. It's not core to workflows, however, and does not map to other workflow tools. We should (as @ocaisa suggested) use gray-box Python code to plot etc., and move the Python-in-Snake material to the currently-sparse Lesson 10.
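For reference, a minimal sketch of the feature in question: a run: block executes arbitrary Python in place of a shell command (the paths and field handling here are hypothetical):

```python
# Hypothetical rule using inline Python via run: instead of shell:,
# the Snakemake feature that doesn't map onto other workflow tools.
rule summarize:
    input:
        "runs/amdahl_4.json"
    output:
        "runs/summary.txt"
    run:
        import json
        with open(input[0]) as f:
            data = json.load(f)
        with open(output[0], "w") as f:
            f.write(f"{len(data)} fields recorded\n")
```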

primer docs for sprinters

Update CONTRIBUTING to include the branching workflow and the need for PRs during the CarpentryCon Sprint, along with an overview of where the files to be edited reside (_episodes, mostly).

Update `setup.md` for the HPC use case

  • Update the required data files (or do we only generate them instead?)
  • Decide on approach(es) to make the required software for the tutorial available (snakemake, amdahl and also probably a plotting tool). Possibilities include
    • environment modules
    • pip
    • conda/mamba

Locations where we may need to use templates

There are cases where we may need to provide different templates for information in the lessons so that information can be easily tweaked for different schedulers (and systems). The use of templating in https://github.com/carpentries-incubator/hpc-intro/blob/gh-pages/_config.yml is probably a good guide here.

At a glance there do not seem to be too many of these cases, but they will probably include:

  • Getting the computing environment needed to run the tutorial (module, pip, conda or whatever)
  • The scheduler
  • The cluster config cluster.yaml
  • The string for --cluster

All of these seem to be really relevant to https://github.com/carpentries-incubator/hpc-workflows/edit/gh-pages/_episodes/09-cluster.md
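A hedged sketch of what such template variables could look like in _config.yml, loosely modeled on the hpc-intro approach; every key name here is an assumption, and episodes would reference them via Liquid (e.g. site.sched.submit):

```yaml
# Hypothetical _config.yml fragment: scheduler-specific strings defined
# once, then referenced from episodes instead of being hard-coded.
sched:
  name: "Slurm"
  submit: "sbatch"
  interactive: "srun"
snakemake:
  cluster_string: "sbatch --ntasks={threads}"
```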

Consider Parsl

NERSC's Snakemake docs list Snakemake's "cluster mode" as a disadvantage, since it submits each "rule" as a separate job, thereby spamming the scheduler with dependent tasks. The main Snakemake process also resides on the login node until all jobs have finished, occupying some resources.

NERSC specifically documents Parsl as the recommended alternative for multinode jobs. I was aware of Parsl as a Python extension for parallel programming, but had not recognized its ability to dispatch work directly on Slurm (and possibly other schedulers).

This synergy suggests Parsl as a viable alternative to Snakemake, since it (a) would integrate readily with the Python-based Amdahl code and (b) could form the basis of a Programming for HPC lesson with thematic callbacks to this prior lesson in the workshop.
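A hedged sketch of what dispatching Amdahl runs through Parsl on Slurm might look like (the partition name, block sizing, and single-task amdahl call are assumptions; real multinode MPI runs would need more configuration):

```python
# Sketch: dispatch amdahl runs to Slurm via Parsl. Provider settings are
# illustrative; multinode MPI execution needs additional configuration.
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

parsl.load(Config(executors=[
    HighThroughputExecutor(
        label="slurm_htex",
        provider=SlurmProvider(partition="debug", nodes_per_block=1),
    )
]))

@python_app
def run_amdahl():
    import subprocess
    return subprocess.run(["amdahl", "--terse"],
                          capture_output=True, text=True).stdout

futures = [run_amdahl() for _ in range(3)]
print([f.result() for f in futures])
```

Each app call returns a future, so the dependency graph lives in ordinary Python rather than in a rule file, which is the contrast with Snakemake worth drawing out in the lesson.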
