carpentries-incubator / hpc-intro Goto Github PK

View Code? Open in Web Editor NEW

132.0 22.0 127.0 30.72 MB

Lesson materials for an Introduction to High Performance Computing in the tradition of Software Carpentry

Home Page: https://carpentries-incubator.github.io/hpc-intro/

License: Other

Makefile 2.10% HTML 3.09% R 1.94% Python 21.28% Shell 1.91% Ruby 0.30% Vim Snippet 62.73% TeX 6.64%

lesson carpentry-lesson alpha carpentries-incubator english hpc-carpentry

hpc-intro's Introduction

Intro to HPC

This lesson teaches the basics of interacting with high-performance computing (HPC) clusters through the command line

Using this material

NOTE: This is not Carpentries boilerplate! Please read carefully.

Follow the instructions found in The Carpentries' example lesson to create a repository for your lesson. Install Ruby, Make, and Jekyll following the instructions here.
For easier portability, we use snippets of text and code to capture inputs and outputs that are host- or site-specific and cannot be scripted. These are stored in a library _includes/snippets_library, with subdirectories matching the pattern InstitutionName_ClusterName_scheduler. If your cluster is not already present, please copy (cp -r) the closest match as a new folder under snippets_library.
- We have placed snippets in files with the .snip extension, to make tracking easier. These files contain Markdown-formatted text, and will render to HTML when the lesson is built.
- Code snippets are placed in subdirectories that are named according to the episode they appear in. For example, if the snippet is for episode 12, then it will be in a subdirectory called 12.
- In the episodes source, snippets are included using Liquid scripting include statements. For example, the first snippet in episode 12 is included using {% include /snippets/12/info.snip %}.
Edit _config_options.yml in your snippets folder. These options set such things as the address of the host to log in to, definitions of the command prompt, and scheduler names. You can also change the order of the episodes, or omit episodes, by editing the configuration block under episode_names in this file.
Set the environment variable HPC_JEKYLL_CONFIG to the relative path of the configuration file in your snippets folder:
```
export HPC_JEKYLL_CONFIG=_includes/snippets_library/.../_config_options.yml
```
Preview the lesson locally, by running make serve. You can then view the website in your browser, following the links in the output (usually, https://localhost:4000). Pages will be automatically regenerated every time you write to them.
If there are discrepancies in the output, edit the snippet file containing it, or create a new one and customize.
Add your snippet directory name to the GitHub Actions configuration file, .github/workflows/test_and_build.yml.
Check out a new branch(git checkout -b new_branch_name), commit your changes, and push to your fork of the repository. If you're comfortable sharing, please file a Pull Request against our upstream repo. We would love to have your site config for the Library.
To maintain compatibility, please do not merge your new branch into your fork's gh-pages branch. Instead, wait until your pull request has been merged upstream, then pull down the upstream version. Otherwise, your repository will diverge from ours, and pull requests you make in the future will probably not be accepted.

Deploying a Customized Lesson

The steps above will help you port the default HPC Intro lesson to your specific cluster, but the changes will only be visible on your local machine. To build a website for a specific workshop or instance of the lesson, you'll want to make a stand-alone copy.

Template Your Customized Repository

This will let you create an exact duplicate of your fork. Without this, GitHub won't let you create a second fork of a repository on the same account.

On GitHub, go to your repository's Settings.
Under the repository name, check the "Template Repository" box.
Go to the Code tab.
Click the new button to Use This Template.
Fill in a name, like yyyy-mm-dd-hpc-intro.
Check the Include all branches box.
Go!

Merge Your Customized Branch

If your snippets are already included in the snippet library, skip this step.

On GitHub, find the drop-down menu of branches. It should be all the way to the left of the "Use This Template" button.
From the list, select the branch containing your site customization.
There should be a bar above the list of repository contents with the branch name, stating "This branch is x commits ahead, y commits behind gh-pages" or similar. To the right of that, click the button to Create Pull Request.
Make sure that the source and destination repositories at the top of the new PR are both your current duplicate of hpc-intro, not the upstream.
Create the pull request, then click the Merge button. You can delete the customization branch when it's done.

Modify `_config.yml`

GitHub builds sites using the top-level _config.yml, only, but you want the values set in the snippet library.

Open a copy of your _includes/snippet_library/Institution_Cluster_scheduler/_config_options.yml
On GitHub, open the top-level _config.yml for editing.
Copy your _config_options.yml, overwriting the values under the SITE specific configuration section of the top-level _config.yml. Leave the rest as-is.
Commit the change.
Back on the Code tab, there should be a timer icon, a green check, or a red X next to the latest commit hash. If it's a timer, the site is building; give it time.
If the symbol is a red x, something went wrong. Click it to open the build log, and attempt to correct the error. Follow GitHub's troubleshooting guide, and double-check the values in _config.yml ar ecorrect and complete.
Once you see a green check, your website will be available for viewing at https://your-github-account.github.io/name-of-the-repository.

Lesson Outlines

The following list of items is meant as a guide on what content should go where in this repo. This should work as a guide where you can contribute. If a bullet point is prefixed by a file name, this is the lesson where the listed content should go into. This document is meant as a concept map converted into a flow of learning goals and questions. Note, again, that it is possible, when building your actual lesson, to re-order these files, or omit one or more of them.

User profiles of people approaching high-performance computing from an academic and/or commercial background are provided to help guide planning and decision-making.

Why use a cluster? (20 minutes)
- Brief, concentrate on the concepts not details like interconnect type, etc.
- Be able to describe what a compute cluster (HPC/HTC system) is
- Explain how a cluster differs from a laptop, desktop, cloud, or "server"
- Identify how an compute cluster could benefit you.
- Jargon busting
Working on a remote HPC system (35 minutes)
- Understand the purpose of using a terminal program and SSH
- Learn the basics of working on a remote system
- Know the differences of between login and compute nodes
- Objectives: Connect to a cluster using ssh; Transfer files to and from the cluster; Run the hostname command on a compute node of the cluster.
- Potential tools: ssh, ls, hostname, logout, nproc, free, scp, man, wget
Working with the scheduler (1 hour 15 minutes)
- Know how to submit a program and batch script to the cluster (interactive & batch)
- Use the batch system command line tools to monitor the execution of your job.
- Inspect the output and error files of your jobs.
- Potential tools: shell script, sbatch, squeue -u, watch, -N, -n, -c, --mem, --time, scancel, srun, --x11 --pty,
- Extras: --mail-user, --mail-type,
- Remove? watch
- Later lessons? -N -n -c
Accessing software via Modules (45 minutes)
- Understand the runtime environment at login
- Learn how software modules can modify your environment
- Learn how modules prevent problems and promote reproducibility
- Objectives: how to load and use a software package.
- Tools: module avail, module load, which, echo $PATH, module list, module unload, module purge, .bashrc, .bash_profile, git clone, make
- Remove: make, git clone,
- Extras: .bashrc, .bash_profile
Transferring files with remote computers (30 minutes)
- Understand the (cognitive) limitations that remote systems don't necessarily have local Finder/Explorer windows
- Be mindful of network and speed restrictions (e.g. cannot push from cluster; many files vs one archive)
- Know what tools can be used for file transfers, and transfer modes (binary vs text)
- Objective: Be able to transfer files to and from a computing cluster.
- Tools: wget, scp, rsync (callout), mkdir, FileZilla,
- Remove: dos2unix, unix2dos,
- Bonus: gzip, tar, dos2unix, cat, unix2dos, sftp, pwd, lpwd, put, get
Running a parallel job (1 hour)
- Introduce message passing and MPI as the fundamental engine of parallel software
- Walk through a simple Python program for estimation of π
- Use mpi4py to parallelize the program
- Write job submission scripts & run the job on a cluster node
- Tools: nano, sbatch, squeue
Using resources effectively (40 minutes)
- Understand how to look up job statistics
- Learn how to use job statistics to understand the health of your jobs
- Learn some very basic techniques to monitor / profile code execution.
- Understand job size and resource request implications.
- Tools: fastqc, sacct, ssh, top, free, ps, kill, killall (note that some of these may not be appropriate on shared systems)
Using shared resources responsibly (20 minutes)
- Discuss the ways some activities can affect everyone else on the system

Nascent lesson ideas

Playing friendly in the cluster (psteinb: the following is very tricky as it is site dependent, I personally would like to see it in _extras)
- Understanding resource utilisation
- Profiling code - time, size, etc.
- Getting system stats
- Consequences of going over
Filesystems and Storage: objectives likely include items from @psteinb's Shared Filesystem lesson:
- Understand the difference between a local and shared / network filesystem
- Learn about high performance / scratch filesystems
- Raise attention that misuse (intentional or not) of a common file system negatively affects all users very quickly.
- Possible tools: echo $TEMP, ls -al /tmp, df, quota
Advanced Job Scripting and Submission:
- Checking status of jobs (squeue, bjobs etc.), explain different job states and relate to scheduler basics
- Cancelling/deleting a job (scancel, bkill etc.)
- Passing options to the scheduler (log files)
- Callout: Changing a job's name
- Optional Callout: Send an email once the job completes (not all sites support sending emails)
- for a starting point, see this for reference
Filesystem Zoo:
- execute a job that collects node information and stores the output to /tmp
- ask participants where the output went and why they can't see it
- execute a job that collects node information and stores the output to /shared or however your shared file system is called
- for a starting point, see this

hpc-intro's People

Contributors

Stargazers

Watchers

Forkers

callaghanmt rmdickson christinalk r4space vdda ubec fishinwind statkclee uupadhyaya khill42 sstevens2 cesul chrisbuwahpc kerriwait exeter-data-analytics srikanthgr1 mpiercy colinsauze gintan aproeme colindaven wirawan0 sharonwsolis psteinb tkphd utpalbora cwant arctraining reid-a jdhayes researchit carreau aturner-epcc sabryr mboisson ocaisa yngtodd dagimy researchcomputingservices grwoodw afrah-khairallah miguelcarcamov mikewallis 648trindade hbs-rcs traktofon prithwish-ichec t0t4r4 cybort mblue9 vrrp cilwerner arcca getmanskiy-oleksandr ucl-arc fmichonneau markcmiller86 netphone willfurnass iain-s mberzig cbohn4 jevbelikov ualberta-rcg mikerenfro learnhpc ewallace harivyasi bkmgit demuth callumwalley qiwang7 ubccr longr jamesetsmith erikborra nesi smcclatchy tirganteanga sebranchett loucksg c5ys mbareford stoltzstrop cherrysx unode laitanawe kucukgz juleskerley baskerville-hpc bsurc lcoghill sihart25 lemythe shdchen eckhrd tommelt gotero msu-icer laderast

hpc-intro's Issues

Make lessons less system-specific

With the goal of these materials being more widely applicable, we should remove system-specific details and replace them with more general explanations.

Working on a cluster "Going remote" for windows users

Open the correct terminal before showing the "ssh" may be kinder to Windows users.

Scheduler Concepts

Would it be helpful/possible to create a list of key concepts (scheduler agnostic) that should be discussed in the scheduler lesson? Syntax specific to whatever scheduler is being used could then be filled in around those concepts. I pulled a list of ideas from the current SLURM lesson as a starting point:

Definition of a job and batch processing
Submitting a job to the scheduler
Passing options to the scheduler
Changing a job's name
Send an email once the job completes
Requesting resources on a compute node
Log files/job status
Wall times
Cancelling/deleting a job

"Accessing software", "Installing software of our own", more caution needed?

In the "Accessing software" episode, in "Installing software of our own", it's not mentioned that some source distributions make bad assumptions about the level of privilege, and cluster users without root may need to provide "--prefix" info to the configuration step. They may also need to be prepared for odd or surprising output from the build.
It's possible that building software in a cluster environment should just be removed from "hpc-intro" (except perhaps to remark that it can be done), and put into an advanced follow-on lesson, if one is set up.

Potential images or diagrams

screenshot of putty for the ssh episode
diagram of two computers, for ssh episde
diagram of cluster (head node + worker nodes) for cluster episode

adding parallel examples

@jstaf this isn't for you, but for @callaghanmt who is interested in customizing / adding to these materials for an upcoming workshop!

Resources - How Much to Request?

Lesson: Using resources effectively
Section: Estimating required resources using the scheduler

I'd like to suggest a change and an expansion of the following.

"A good rule of thumb is to ask the scheduler for more time and memory than your job can use. This value is typically two to three times what you think your job will need."

I agree that the user should ask for more than they expect to need but I think two-to-three times is far too high. When giving workshops to HPC novices I usually recommend no more than 20% (assuming the job's resource use is not unusually volatile) in order to ensure that a job does not get stuck in the queue waiting for the requested resources to become available.

How about something like this? I think this is an important enough concept that it is worth spelling out the issues.

"A good rule of thumb is to ask the scheduler for more time and memory than you expect your job to need. This ensures that minor fluctuations in run time or memory use will not result in your job being canceled by the scheduler. Recommendations for how much extra to ask for vary but 10% is probably the minimum, with 20-30% being more typical. Keep in mind that if you ask for too much, your job may not run even though enough resources are available, because the scheduler will be waiting to match what you asked for."

Could also add the following example to make the point even clearer.

"For example, suppose your job requires 20 GB of memory but you requested 60 GB just to be on the safe side. While your job is waiting in the queue, a node with 40 GB becomes available but your job still doesn't start because the scheduler is looking for 60 GB to satisfy the requirements you specified. In this case a smaller memory request would have resulted in your job starting sooner."

I think this point is worth emphasizing, especially as (in my experience, anyway) it isn't uncommon for novice users to end up with submit files containing resource requests that are far above what they actually need, which results in longer wait times as well as wasted resources, especially memory, when the jobs don't use what has been set aside for them.

Thoughts?

splitting this lesson into hpc-shell and hpc-intro

As discussed, with @aturner-epcc and many others from this team, I wanted to start discussing splitting this lesson. The core reason is, that I am not sure whether to start merging from hpc-novice before or after the split.

"Customising a job" section, pay attention to copy past.

One regular issue we have in HPC at UC Merced is with copy past.
Some OS/text editor/etc try to be "helpful" and convert --/- to em dashes, and u0096 it can be relatively hard to diagnose, in the cli you can use grep $'u0096' (on bash 4.2+ I believe) to find those.

It is relatively sneaky as there is no errors, and flags get just ignore on sbatch.
This is rare but recurrent for user who create batch script on their machine in word/text-edit and then upload to the cluster.

Might not be a good thing in the lesson itself, but for the instructors notes.

Mistake in sbatch options ?

Resource requests
But what [.....], which is probably not what we want.

The following are several key resource requests:

-n <nnodes> - how many nodes does your job need?

-c <ncpus> - How many CPUs does your job need?

I believe -n is the number of tasks, the number of nodes is uppercase -N,
-c is the number of CPU but per task, written as above, it feels like it's the total number of CPU.

Loading a module by default

14-modules talks about putting 'module load' commands in .bachrc and .bash_profile.

I generally discourage users from doing this. It often slows down their login session, they forget they have them and then later open tickets about things behaving oddly, module names can change, etc.
If it's to be kept, we should at least explain the difference between the two conf files, and issue a word of caution about why it may cause problems

Use variables to define workshop specific values

This is to use the _config.yml (as in hpc-in-a-day lesson) to define, for example queing system used
e.g from hpc-in-a day

this is the scheduler to be used for the workshop
possible values: lsf, slurm, pbs
workshop_scheduler: "slurm"
workshop_login_host: "cray-1"
workshop_shared_fast_filesystem: "/fastfs"

@aturner-epcc is working on a full solution for this. But due to time constraint I will fix the first few lessons (as it is easier than manually editing from my part) on my folk.

Change index.md to reflect current lesson plan

As the lessons are now split (some modules moved to hpc-carpentry/hpc-shell and the programming part is removed may be the index.md should be modified to reflect this and some lesson outcomes could be removed as well.

Add distinction between multiple tasks and multiple nodes

This was holding up the merge of #4, but a distinction should be added between multiple tasks (-n) and multiple nodes (-N). Maybe add this as a callout box?

Need to identify maintainer set for each of the lessons

We should identify groups of maintainers for each of the lessons who take responsibility for both the lessons themselves and ensuring that dependencies between lessons are kept up to date and described properly.

Improvements to Episode 2 (Working on a cluster)

Notes for possible PRs (I haven't gone through all the material yet, so some of the following may be addressed later).

Why shouldn't we run long jobs on login nodes.
A diagram to show the login process from laptop to login node via internet and how the login nodes are used to interact with the worker nodes.
Elaborate more on, availability of common storage, place in one location and access from any worker node (much faster than through internet). A diagram to show all compute nodes are connected to each other and to the common storage.
Photos or CPU and memory and disk (to show these are not mystical stuff but physical objects)
When showing nproc, sinfo and free -n, show "df -h" to see available storage locations.

add examples to intro episode

Have 2-3 user profiles / typical use cases to further explain why you might need a cluster.

Possibly also an exercise where the students write out why they think they need a cluster?

data locality

On 15-transferring-files: Add an explanation of data locality, and how to not do things like use scp on your macbook in your office (or worse yet at home on vpn) to scp files from the file storage at a facility to the HPC storage. Something like: https://researchit.las.iastate.edu/move-your-files-faster but more generic.

address internet connectivity issues common to clusters

15-transferring-files talks about using wget to grab files from the internets. We should also include a warning about some clusters not being able to get to the internet at all, or maybe only a head node or DTN node being setup for internet access.

login node limits

17-responsibility says: 'A “quick test” is generally anything that uses less than 10GB of memory, 4 CPUs, and 15 minutes of time. Remember, the login node is to be shared with other users.'

This is going to vary widely by site. Many will have enforced ulimit or cgroup limits on the headnode that are significantly lower than this and may cause this type of 'quick test' to fail when it runs out of memory.

How about just, don't run on the login node, and explain about looking for a debug or interactive queue ?

Interactive jobs with "--x11", user should have used ssh -Y /-X when login in

srun with x-forwarding
i.e.
$ srun --x11 --pty bash

Is mentioned, but for this to work the user must have established a ssh connection with x-orwarding

i.e.
ssh -Y server.org

MobaXterm or GitBash?

The current version of the login suggests that MobaXterm is used on Windows (or PuTTY). This certainly gets the job done. However, Software Carpentry uses GitBash. I know that the reason for MobaXterm is at least partly historical (it had nano while GitBash didn't and so was adopted by Compute Canada) and now not relevant (Oliver Stueker added nano to GitBash).

Would GitBash be a better option here so that:

We align with Software Carpentry.
Windows users get Git as a bonus.

"Working on a cluster" -- assumption of user log-in state

The "Working on a cluster" episode begins with the assumption that the user is logged in to the head node of a cluster ("what computer have we logged into?"), but the actual step of logging in to the head node is not called out, either at the start of this lesson, or the end of the previous one. Since the point of the description is to clarify that there are multiple contexts or scopes within a cluster, it's important to make this step explicit.

Consolidating "file transfer" into one section

I want to add an input toward the simplification of "file transfer" topic. There are currently two places where this was touched:

"Transferring Data" under "Working on a cluster" episode: https://hpc-carpentry.github.io/hpc-intro/12-cluster/index.html#transferring-data
"Transferring Files", a dedicated episode:
https://hpc-carpentry.github.io/hpc-intro/15-transferring-files/index.html

I wonder if we can consider consolidating all into one episode. Perhaps on top subheadings there will be the "preferred" method. Then on the latter subheadings, we present several alternatives. Instructors can pick and choose what method they want to expose the learners to.

A second alternative is to push the alternative methods to an extra episode, per @psteinb suggestion in #27.

PS: I'm aware of issue 27, Simplify file transfer section, but that one was on suggestion to focus only on one method of file transfer.

What is everyone's thought on this? I won't mind helping doing the consolidation/reorg, but I want to get your input first.

Fix broken links

https://github.com/hpc-carpentry/hpc-intro:

Under "Lesson structure"
"full guide" points to http://swcarpentry.github.io/lesson-example/04-formatting/ (404)

Elaborate on "Efficiency" in "Why use a cluster?"

The "Why Use These Computers?" section of this lesson has a discussion of "efficiency", which mentions large clusters are often a "pool of resources drawn on by many users" and mentions that they can be in use constantly. There's another side to this coin, which is that projects or groups may have a requirement for very powerful computers, but only part of the time, so sharing the resource makes economic sense for them also.
This dovetails a little bit with the existing "what is a cluster" and "definition of cloud" conversations -- pay-per-use is a much-remarked-on feature of the cloud, and relevant to institutional HPC also.

filezilla for something with less adware

Filezilla has become riddled with adware and we (and others I assume) are steering users away from it.

I think we should suggest another tool on 15-transferring-files, WiinSCP, CyberDuck, etc.

Auto-include configuration variables

We need to decide how to take forward the configuration variables used to allow customisation to different schedulers and local system setups. There has already been discussion of this in #73 and #80. This issue is to pull together the discussion and decide on the way forward for the configuration variables.

The original set used by @psteinb in HPC-in-a-day was:

workshop_scheduler: "slurm"
workshop_login_host: "cray-1"
workshop_shared_fast_filesystem: "/fastfs"

I took this approach and applied it to the hpc-intro lesson with the plan to push this update back into the current hpc-intro source. The variables used to customise in this version actually ended up being a much longer list than I would have liked (and I think there is scope to rationalise):

Local host and scheduler options

workshop_host: "Cirrus"
workshop_host_id: "EPCC_Cirrus"
workshop_host_login: "login.cirrus.ac.uk"
workshop_host_location: "EPCC, The University of Edinburgh"
workshop_host_ip: "129.215.175.28"
workshop_host_homedir: "/lustre/home/tc001"
workshop_host_prompt: "[yourUsername@cirrus-login0 ~]$"
workshop_sched_id: "EPCC_Cirrus_pbs"
workshop_sched_name: "PBS Pro"
workshop_sched_submit: "qsub"
workshop_sched_submitopt: "qsub -A tc001 -q R387726"
workshop_sched_stat: "qstat"
workshop_sched_statu: "qstat -u yourUsername"
workshop_sched_del: "qdel"
workshop_sched_submiti: "qsub"
workshop_sched_submitiopt: "qsub -IVl select=1:ncpus=1 -A tc001 -q R387726"
workshop_sched_info: "pbsnodes -a"
workshop_sched_comment: "#PBS"
workshop_sched_nameopt: "-N"
workshop_sched_hist: "qstat -x"
workshop_sched_histu: "qstat -x -u yourUsername"
workshop_sched_histj: "qstat -x -f"

However, one thing I found compared to the original set was the requirement to distinguish between different local configurations even if the same scheduler is used, hence the addition of both "workshop_host_id" and "workshop_sched_id". I suppose these could be dropped and the combination of the variables used in naming of the snippets if we want to minimise configuration variables.

Some of the variables are only used in one place in the current lesson and, in these cases, the variable could be dropped and the syntax included directly in a snippet instead (I was too keen on avioding one line snippets). Variables only used in one place:

workshop_sched_submitiopt: "qsub -IVl select=1:ncpus=1 -A tc001 -q R387726"
workshop_sched_hist: "qstat -x"
workshop_sched_histu: "qstat -x -u yourUsername"
workshop_sched_histj: "qstat -x -f"

Variables not used at all:

workshop_host_ip: "129.215.175.28"

Ideally, I would like to rationalise and issue a PR for the updated list of configuration variables (I currently have time to do this) but, as there has been discussion on this, I wanted to get the thoughts of the community and agree a way forwards before doing this.

Re-number episodes

I'd be a fan of re-numbering by 5s so that it's easier for future users of this material to slot in their own pages.

Add about/discussion/instructor notes pages

UNIX terminology

I think we should explain that Linux derived from the design of UNIX (but is not UNIX) and introduce the Linux & Posix terms instead of UNIX throughout this unit.

"Using resources effectively", "Measuring stats" -- bad description of SSH/SLURM relationship

In the "Using resources effectively" episode, under "Measuring the staticstics of currently-running tasks", there's text that says "One very useful feature of SLURM is the ability to SSH to a node where a job is running...".
The SSH-ability of nodes is a result of cluster configuration independently of the queuing system. SLURM provides interactive access, but direct SSH may be disallowed even in the presence of SLURM, or allowed for other queuing systems.

Propose to reword this to "Typically, clusters allow users to SSH directly into worker nodes from the head node. This is useful to check on a running job and see how it's doing."

The why of cluster is missing

The why of cluster is missing
Cloud and cluster are defined as terms. But why should a researcher care?

Anwser:
Cloud typically runs services.
Cluster runs batch jobs.

Cloud is better when you need to run a service such as a website, or database.

Cluster is better when a researcher needs to run one or more computations (ex simulation or data processing) where it is not really important exactly when the computation runs and where it may take a while for enough resources to become available to run the computation.

Would it not be great if there was something that orchestrated the resources, started your computation when enough resources were available to run even if its 4 am, and emailed you when your computation was done. This is what a HPC cluster does, and why you would use one.

The exact nomenclature computation/simulation/job of this discussion will be difficult to have without invoking other not defined words.

status of `jeff` branch

@jstaf, is this work in progress?

Worker node vs Compute node: regional dialect ?

12-cluster refers to 'worker (or execute)' nodes. We (and most other US institutions I've worked with) seem to refer to them as the 'compute' nodes. Minor difference that may be regional.

Do we need to think about regional dialect being swapped like cluster names & schedulers ? Or should we just keep adding terms (worker aka execute aka compute) ?

Simplify file transfer section

Somewhat related to #20, I think we should present one clear option in the file transfer section (https://hpc-carpentry.github.io/hpc-intro/14-transferring-files/) and then refer people to other tools in the "extras" section.

"Working on a cluster" -- nodes may be heterogeneous

In the "Working on a cluster" episode, where contexts and scopes within the cluster are introduced, it's worth pointing out that the cluster's node collection may be heterogeneous. It's probably overkill to get into resource requests, but a remark that nodes are not always all equal would be valuable here.

Fill in missing information

https://hpc-carpentry.github.io/hpc-intro/12-cluster/index.html
Currently says :
Bytes (one Byte are bits).

This should be:
Bytes (one Byte are 8 bits).

As this is a introductory, we should remove as much confusion as possible. Will send a PR on this

Typos

Introduction to High-Performance Computing
Link : https://hpc-carpentry.github.io/hpc-intro/

Episode 1: Why Use a Cluster?
Typos : line 54 (tell the the computer)
line 66: When the task to solve become more heavy on computations
Episode 2: Working on a cluster
line 179: Issueing a ssh command always entails the same ....

Introduction to High-Performance Computing
Link : https://epcced.github.io/hpc-intro/

Episode: Why use High Performance Computing?
line 26 : Summarise your discussion in 2-3 sentances.
line 53: They are often interchangably
line 62: For example, varying an imput parameter (or input data) to a computation and running many copies simultaneously.
line 97: Summarise your discussion in 2-3 sentances.

Episode: What is an HPC system?
line 43 : Each core contains a floating point unit (FPU) which is responsible for actually performning the computations
line 51 : (also referred to as RAM or DRAM) in addtion to the processor memory
line 79 : hey may need differnt options or settings
line 101 : performance and workfows are often categorised
line 103 : his is typically seen when peforming

Episode: Connecting to the HPC system
line 9: "Succesfully connect to a remote HPC system."
line 90: Running PuTTY will not initially produce a terminal but intsead a window full of connection options.
line 175 : then use the follwing command: $ hostname.

Episode: Transferring files
line 8: "Be able to tranfer files to and from a remote HPC system."
line 19: choose will be decided by what is most covenient for your workflow.
line 83: Or perhaps we're simply not sure which files we want to tranfer yet.
line 288: All file tranfers using the above methods use encrypted communication over

Episode: Scheduling jobs

line 248: A key job enviroment variable in PBS is..
line 325: Absence of any job info indicates that the job has been successfully canceled
line 140 : Intially, Python 3 is not loaded.

Episode : Accessing software
line 45: (Fastest Fourer Transform in the West) software library availble for it to
line 52: it contains the settings required to run a software packace
line 65: may start out with an empty environemnt,
line 172: $PATH is a special ennvironment variable

Episode : Using resources effectively
line 59: You'll need to figure out a good amount of resources to ask for for this first "test run".

Episode : Using shared resources responsibly
line 13: "Understand how to conver many files to a single archive file using tar."
line 127: submit a short trunctated test to ensure that
line 156: In all these cases, the helpdesk of the system you are using shoud be

Episode: How does parallel computing work
line 91: If you look at the souce code.

Episode: Understanding what resources to use
line 20: Specifically what resources do you need initially and for parallel applicatiosns.
line 22: Remember the basic resources that are mananged by the scheduler on a HPC system are
line 83: such as this exampler for running the 3D animation software Maya.
line 99: **NOTE: This is nice as it automatically creates a seperate timing log file
Just after the references, Key Points
• Basic benchmarking allows you to use HPC resources more effecively.

Episode: Bootstrapping your use of HPC
line 21: This session is designed to give you the opprotunity to explore these questions and

line length and episode numbering

make lesson-check-full flags the following issues, after updating the backend (#34):

"Using resources effectively"

populate glossary

The glossary ~~(reference.md)~~ (guide.md) needs content. For a start:

Narrative Needed

To align further with the standard Carpentries approach the technical content should be wrapped up in a narrative that can serve as a template for participants to imagining themselves fitting into. I suspect that hpc-novice and hpc-shell should each have their own narrative such that they can be done independently but they could build as long as hpc-novice provided whatever files hpc-shell might produce.

Thoughts?

GNU Parallel in hpc-shell?

One command I have been wondering about including in hpc-shell is parallel. If worked into a meaningful example it would do the following:

Provide a powerful little tool that could be used elsewhere.
Lay some foundations for thinking in a parallel in a way that is very responsive and program language agnostic before the scheduler and the programming language get involved.
Differentiate "hpc-shell" from "swc-shell" further.

Of course there are downsides, including:

If people are new to the shell then this might be an overload.
It might be tough to run on certain systems (I'm speculating here).

Thoughts?

setup jekyll baseurl/relative_url

most of the image are broken as they try to load from /fig/ instead of /hpc-intro/fig/... this seem to be due to the missing baseurl config.

It seem like you can use filters to prepends only on github, and not when deploying/testing locally.

cpus: sockets, cores, and threads

12-cluster talks about components and equates the terms CPUs, processors, and cores. It doesn't make a distinction between sockets, cores, or threads for SMT (hyperthreading). It seems like we should get that straight up front before the wrong understanding is cemented.

filesystems vary widely

12-cluster says:

This is an important point to remember: files saved on one node (computer) are available everywhere on the cluster!

That statement seems overly broad - we have things like /tmp and /scratch that may or may not be shared in some way across nodes. I'd probably just remove this sentence from this section & address it in a section about filesystems where it can be explained with more detail

longer introduction to ssh

Hi,

I'm going through the lesson material, and the step from 11-hpc-intro.md to 12-cluster.md feels a bit rough.

Go ahead and log in to the cluster.
```
[user@laptop]$ ssh remote
```
{: .bash}


Very often, many users are tempted to think of a high-performance computing installation as one
giant, magical machine. Sometimes, people will assume that the computer they've logged onto is the
entire computing cluster. So what's really happening? What computer have we logged on to? The name
of the current computer we are logged onto can be checked with the `hostname` command. (Clever users
will notice that the current hostname is also part of our prompt!)

```
[remote]$ hostname
```

Now we can assume that user now about shell, and guess that ssh is a command and remote is an argument. But from experience assuming that they understand the [remote]$ prompt is a prompt of the remote machine if a big leap of faith.

I see a number of step missing here that IMHO need to be explained.:

ssh allow you to connect to a remote machine – as if you plugged a screen/keyboard/(mouse?) to a remote computer and open a terminal.
the remote argument is not actually the word remote, but the actual remote address of the cluster (given to you by your admin). You also (likely) want to prefix it by <username>@
You will (likely) need to type your password and it won't show up on the screen while you type.
if the password is correct your terminal should now see a welcome message from the cluster, and everything you run in this terminal is now executed on a remote machine.
(details may varies between installations)

I believe that would be the basic of what need to be covered, but I feel like understanding local machine vs remote machine and when we are where is critical. Alternatively this could be in a separate lessons, but I don't see it in the shell-novices.

"Scheduling jobs" -- remark about environment variables is incorrect

In the "Scheduling jobs" episode, under the "other types of jobs" subsection, there is a remark that environment variables are not available for interactive jobs launched via "srun". On the two SLURM clusters available to me, this is not true for either of them -- doing "srun" followed by "env" shows that in the shell session, the "SLURM_*" environment variables are set, and the explicitly-mentioned SLURM_CPUS_PER_TASK variable is set if the "-c" flag was provided to "srun".

This might be configurable, and vary from one installation to another?

Need to define "job" in the intro

Request from Kamil: early in the intro we should define what a "job" is. Otherwise it's jargon.

carpentries-incubator / hpc-intro Goto Github PK

hpc-intro's Introduction

Intro to HPC

Using this material

Deploying a Customized Lesson

Template Your Customized Repository

Merge Your Customized Branch

Modify _config.yml

Lesson Outlines

Nascent lesson ideas

hpc-intro's People

Contributors

Stargazers

Watchers

Forkers

hpc-intro's Issues

Local host and scheduler options

Recommend Projects

Recommend Topics

Recommend Org

Modify `_config.yml`