
Comments (9)

tkphd commented on July 16, 2024

Can we (should we) convert this list to a Project?


tobyhodges commented on July 16, 2024

List of objectives in the current version of the lesson:

  • 1. Managing Data Processing Workflow
    • Understand our example problem.
  • 2. Snakefiles
    • Understand the components of a Snakefile: rules, inputs, outputs, and actions.
    • Write a simple Snakefile.
    • Run Snakemake from the shell.
    • Perform a dry-run, to understand your workflow without executing anything.
  • 3. Wildcards
    • Use Snakemake wildcards to simplify our rules.
    • Understand that outputs depend not only on the input data files but also on the scripts or code.
  • 4. Pattern Rules
    • Write Snakemake pattern rules.
  • 5. Snakefiles are Python Code
    • Use Python variables, functions, and imports in a Snakefile.
    • Learn to use the run action to execute Python code as an action.
  • 6. Completing the Pipeline
    • Update existing rules so that dat files are created in a subdirectory.
    • Add a rule to your Snakefile that generates PNG plots of word frequencies.
    • Add an all rule to your Snakefile.
    • Make all the default rule.
  • 7. Resources and Parallelism
    • Modify your pipeline to run in parallel.
  • 8. Make your workflow portable and reduce duplication
    • Learn to use configuration files to make workflows portable.
    • Learn a safe way to mix global variables and Snakemake wildcards.
    • Learn to use configuration files, global variables, and wildcards in a systematic way to reduce duplication and make your workflows less error-prone.
  • 9. Scaling a pipeline across a cluster
    • Understand the Snakemake cluster job submission workflow.
  • 10. Final notes
    • Understand how to perform a dry-run of your workflow.
    • Understand how to configure logging so that each rule generates a separate log.
    • Understand how to visualise your workflow.
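
For reference, the core episode 2 concepts above (rules, inputs, outputs, actions, dry runs) look roughly like this in practice. This is a minimal sketch; the file and script names are illustrative rather than taken directly from the lesson.

```
# A single rule with the components episode 2 introduces:
# a name, an input, an output, and an action (here a shell command).
rule count_words:
    input: "books/isles.txt"
    output: "isles.dat"
    shell: "python wordcount.py {input} {output}"
```

`snakemake --cores 1 isles.dat` would build the file, and `snakemake -n isles.dat` performs the dry run mentioned in the episode 2 objectives.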


tobyhodges commented on July 16, 2024

Based on discussion in the first sprint session at CarpentryCon, it sounds like some discussion of the --profile option is something that will need to be added to the lesson for an HPC setting.
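
For concreteness, a profile is just a directory containing a config.yaml whose keys are long-form snakemake options, so cluster-specific settings can be kept out of the Snakefile itself. A rough sketch (the option values are placeholders, and the cluster key applies to pre-v8 Snakemake; Snakemake 8+ moves job submission to executor plugins):

```
# myprofile/config.yaml -- each key is a long-form command-line option
jobs: 10                  # at most 10 jobs in flight at once
printshellcmds: true      # echo each rule's shell command
# pre-v8 Snakemake: submit every job to Slurm via sbatch
cluster: "sbatch --ntasks={threads} --time=00:10:00"
```

Invoked as `snakemake --profile myprofile <target>`, which is presumably the shape of what would be added to the lesson.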


tobyhodges commented on July 16, 2024

Related to discussion in #4, familiarity with Python is not currently a prerequisite for the lesson, so episode 5 should probably be removed altogether (after checking that there is no other useful content in there that is not specific to Python).


vinisalazar commented on July 16, 2024

Based on discussion in the first sprint session at CarpentryCon, it sounds like some discussion of the --profile option is something that will need to be added to the lesson for an HPC setting.

I will try working on that on the upcoming sprints by submitting a PR to episode 9.


bkmgit commented on July 16, 2024

Some of the flexibility comes from Python. Programming experience from the shell lesson should be enough to understand the Python code if it is introduced correctly.


bkmgit commented on July 16, 2024

The lesson is quite long (about 6 hours); it would be good if it were about 4 hours. Removing some of the repeated parts may help.


reid-a commented on July 16, 2024

Draft outline of objectives for the revised lesson, for when we get there. Some of this material is cut-and-pasted from the Sprint notes.

  1. Episode 1 is mostly motivation/stage-setting. Brief refresh of Amdahl example, running the script on laptops, or the head node (if allowed). Motivation question: what is the relationship between width and time taken?
    • Objective: by the end, learners should be able to... compare performance indicators from tasks with different parameters run on the same system
    • Note: the note in the episode about Bash scripts is less important for the HPC Carpentry setting.
  2. Episode 2 introduces a lot of concepts - rules, inputs, outputs, actions, running a Snakefile, graphs (maybe not explicitly introduced but used in figures), dry run. It needs more exercises to assess progress toward all of this.
    • Note: we might aim to write a very short Snakefile: one rule with no inputs (Amdahl again). Run that twice and see that it does nothing the second time. Change the output file name and run again. An exercise could be to adjust the width for the Amdahl command (a sketch of such a rule appears after this list).
    • Objective: by the end, learners should be able to... write a rule that produces an output file.
    • Objective: by the end, learners should be able to... predict (correctly!) whether a rule will run based on its output and the files in the project folder.
    • Note: our episode three would then extend this to multiple rules, introducing a rule to plot results and showing how the rules connect together: add at least one more Amdahl rule with a different width, plot the results, demonstrate the importance of rule order, and introduce a rule to clean things up.
    • Objective: by the end, learners should be able to write a basic Snakefile and run it.
  3. Episode 3 is about wildcards. Andrew found it helpful to write a Snakefile that iterates over a list of values, which become parameters/arguments to a task. This will introduce the wildcards object, preparing us to introduce the resources object later, by analogy with the wildcards object.
    • Note: we could bridge from our episode 3 to this by discussing how to replace those repetitive rules with a single rule and wildcard width values (see the wildcard/expand sketch after this list).
    • Objective: by the end, learners should be able to write a Snakefile that iterates over a set of values and generates multiple outputs using wildcards.
  4. Episode 4 is about pattern rules, which we think we do not really need, but may be a concept that is important to thinking about workflows. Right now, HPC Carpentry's next episode should probably be more concerned with introducing the cluster config.
    • Note: important ideas for the cluster config: the resources object and our defined get_tasks() function, which is executed at runtime for the rule it is written into.
    • We will need to be careful about how much Python we end up talking about from here on in. Cognitive load is probably very large for any learner who is seeing a Python function definition for the first time.
    • Learners are also looking at config.yaml for the first time here; hopefully we can bridge that gap by calling back to SBATCH parameters.
  5. Episode 5 is about Python code in Snakefiles. It makes the point that e.g. the "input" item can have sub-parts, and you can refer to them by attribute, e.g. "input.cmd". They still refer to files. It also introduces the "expand" primitive.
    • We do want the "expand" primitive. The access-by-attribute is late for us; we will have done this with "resources" in the previous lesson. Using Python code as actions depends on whether we want to require Python as a prerequisite -- arguably this is already implied by get_tasks(). Do we care about the --quiet option? Python is probably a prerequisite at this point.
    • Objective: Learners should be able to write a Snakemake rule with a Python-code body.
  6. "Completing the pipeline", default rules, Snakemake best practices. Introduces a "plotcount.py" for making plots. General book-keeping.
    • Unlike our case, their workflow has plotting in the middle of the DAG, whereas we are aggregating many outputs into the plot. Some of the data management and data movement is perhaps optional for us. HPC Intro also discusses storage, though, so this is an opportunity to reinforce it.
  7. Resources and parallelism
    • The ability to do parallelism may be limited by HPC resource policy. Learners will be able to run parallel rules on their laptops, but may not have the Amdahl code there.
    • Objective: Learners should be able to write and run a Snakefile that runs the rules concurrently, and control the "width" of this parallelism.
  8. Make your workflow portable, reduce duplication.
    • Unclear to me if this is a separate task -- we have been introducing wildcards and global stateful configuration as appropriate as we've gone along, so this episode might be a no-op for us?
  9. Scaling a pipeline across a cluster.
    • At this point we have probably deviated pretty far from the original lesson. Running on the cluster may have already happened in the "parallelism" stage? For us, running on a cluster has less novelty, because we are following the HPC Intro lesson, so this might have already happened when we introduced the cluster config and global state info. Workflow is pretty murky at this point.
    • Objective: Learners should definitely be able to use Snakemake to dispatch workflow to the cluster, and at this point, should be able to aggregate results data from the cluster on the head node and analyze it.
  10. "Final notes"
  • Also pretty murky.
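
To make the episode 2 note above concrete, here is a sketch of the single-rule, no-input Snakefile it describes. The amdahl invocation is illustrative; the exact command and flags should match however the lesson actually runs it.

```
# One rule, no inputs: Snakemake only checks whether the output file
# already exists, so a second run does nothing.
rule amdahl_run:
    output: "amdahl_run_2.json"
    shell:
        "mpirun -np 2 amdahl > {output}"
```

Running `snakemake --cores 1` builds the file, running it again reports there is nothing to be done, and changing the output file name in the rule triggers a fresh run; the suggested exercise is then changing the 2 to another width.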
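
Similarly, a sketch of where the wildcard discussion could end up: one parameterised rule plus expand(), instead of a copy of the rule per width. The names and the list of widths are illustrative.

```
# Widths to sample; in the lesson these would more likely come from a
# config file than be hard-coded here.
WIDTHS = [1, 2, 4, 8]

# Default target: asks for one result file per width via expand().
rule all:
    input: expand("runs/amdahl_{width}.json", width=WIDTHS)

# One rule covers every width: {width} is a wildcard filled in from
# the requested file name and available as wildcards.width.
rule amdahl_run:
    output: "runs/amdahl_{width}.json"
    shell:
        "mpirun -np {wildcards.width} amdahl > {output}"
```

This also covers the expand() point from the episode 5 notes, and bridges naturally to a resources entry on the same rule once the cluster configuration is introduced.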


reid-a commented on July 16, 2024

In light of the decision made at the April co-working meeting, viz. that we should not ever run the Amdahl code on the head node or on learners' laptops, some revisions are appropriate for the lesson content.

Episode one's goal needs some revision: the current plan has learners run the Amdahl code on the head node and collect some preliminary performance data, by way of re-familiarizing themselves with the code and illustrating the difference between the "bare metal" run and the Snakemake-enclosed one. We can still make this point, but we will need to do the initial runs on the cluster, using batch files. Graduates of HPC Intro will have seen this, but realistically we should add some time for refreshing this knowledge for actual human learners with imperfect retention.

So a high-level version of the set of tasks now maybe looks something like this:

  1. Run the amdahl code on the cluster. Learners should be able to identify what output files the code generates, and know what data is in them.
  2. Introduce the Snakemake tool, and construct a "Hello, world" snakefile. Learners should be able to correctly predict whether the rule in the snakefile will fire or not, based on the presence and currency of the output file.
  3. Generate a multi-rule snakefile, with a dependency, to introduce the concept of the task graph and illustrate the order of operations (a sketch follows this list). We can continue to use "Hello, world" level executables here. Learners should be able to correctly predict which snakemake rules will fire on an invocation, and in what order, based on the presence and currency of the output targets.
  4. Generate a single-rule snakefile that runs on the cluster. At first, manually specify all the cluster stuff, like the partition name and so forth, to foreground it (see the cluster-rule sketch after this list). Learners should be able to predict how their snakefile will dispatch to the cluster, and predict the location and character of the resulting outputs.
  5. Introduce the cluster config file, and populate it for the local cluster. Repeat the task of the previous lesson, but with the cluster info implicit in the configuration. Same learner capability, I guess?
  6. Dispatch multiple jobs to the cluster via snakemake. Observe that the snakemake process itself remains active on the head node until the jobs are finished. (Deal with the thing where a cluster rule exits at dispatch-time, but the target doesn't appear until later?) Once this content is more developed, the goal can probably be clarified, beyond the obvious "learners should be able to correctly predict the sequence of operations that will result from running their snakefile", which is the emerging theme here.
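
For items 2 and 3, a "Hello, world"-level sketch with one dependency, purely illustrative:

```
# Default target: asking for greeting.txt forces hello.txt to be made
# first, so learners can predict both whether and in what order rules fire.
rule all:
    input: "greeting.txt"

rule hello:
    output: "hello.txt"
    shell: "echo Hello > {output}"

rule greeting:
    input: "hello.txt"
    output: "greeting.txt"
    shell: "cat {input} > {output} && echo world >> {output}"
```

Deleting or touching hello.txt and re-running `snakemake --cores 1` is a cheap way to exercise the "which rules fire, and in what order" prediction.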
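
For items 4-6, a rough sketch of a single rule dispatched to the cluster with the cluster details spelled out on the command line. This assumes a pre-v8 Snakemake where --cluster is still available (Snakemake 8+ uses executor plugins or a profile instead), and the partition and time values are placeholders for whatever the teaching cluster uses.

```
# Snakefile: the rule itself only declares how many MPI tasks it wants.
rule amdahl_on_cluster:
    output: "amdahl_4.json"
    resources:
        tasks=4
    shell:
        "mpirun -np {resources.tasks} amdahl > {output}"
```

```
# Invocation: every job goes through sbatch; snakemake itself stays
# running on the head node until the submitted jobs finish.
snakemake --jobs 1 \
    --cluster "sbatch --ntasks={resources.tasks} --time=00:05:00 --partition=some_partition"
```

Moving those sbatch arguments into a profile or cluster configuration file is then exactly the step described in item 5.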

From here, the tasks get a bit more murky in my mind, but the two beats to hit include:

  1. Plan and execute the workflow that generates the data needed for the Amdahl plot.
  2. Actually generate the Amdahl plot, and observe and appreciate the diminishing returns to increased parallelism (a plotting rule is sketched below).
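
A rough sketch of what that final plotting step could look like, using a run: action with a Python body. The JSON field names (nproc, execution_time) are assumptions and need to be checked against what the amdahl code actually writes.

```
rule plot_amdahl:
    input: expand("runs/amdahl_{width}.json", width=[1, 2, 4, 8])
    output: "amdahl_scaling.png"
    run:
        # "run" action: plain Python with access to input and output.
        import json
        import matplotlib
        matplotlib.use("Agg")  # no display on the cluster
        import matplotlib.pyplot as plt

        widths, times = [], []
        for path in input:
            with open(path) as fh:
                result = json.load(fh)
            widths.append(result["nproc"])            # assumed field name
            times.append(result["execution_time"])    # assumed field name

        plt.plot(widths, times, "o-")
        plt.xlabel("MPI tasks")
        plt.ylabel("wall time (s)")
        plt.savefig(output[0])
```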

The mapping of these goals to the existing lesson material is the next step; hopefully much of it is reusable, but a clear and coherent lesson structure is more important than re-using prior content, IMO.

