
carpentries-incubator / hpc-intro


An Introduction to High Performance Computing

Home Page: https://carpentries-incubator.github.io/hpc-intro/

License: Other

Topics: lesson, carpentry-lesson, alpha, carpentries-incubator, english, hpc-carpentry

hpc-intro's Issues

Mistake in sbatch options?

Resource requests
But what [.....], which is probably not what we want.

The following are several key resource requests:

-n <nnodes> - how many nodes does your job need?

-c <ncpus> - How many CPUs does your job need?
  • I believe -n is the number of tasks; the number of nodes is the uppercase -N option.
  • -c is the number of CPUs per task; written as above, it reads as if it were the total number of CPUs (see the sketch below).
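For reference, a minimal sketch of how these options are typically combined in a SLURM batch script (the numbers and program name are illustrative):

```
#!/bin/bash
#SBATCH -N 2   # number of nodes (uppercase N)
#SBATCH -n 4   # number of tasks across all nodes
#SBATCH -c 8   # CPUs per task, not total CPUs
# total CPUs allocated = tasks x CPUs per task = 4 x 8 = 32
srun ./my_program
```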

setup jekyll baseurl/relative_url

Most of the images are broken as they try to load from /fig/ instead of /hpc-intro/fig/...; this seems to be due to the missing baseurl config.

It seems like you can use filters to prepend the baseurl only on GitHub, and not when deploying/testing locally.

"Using resources effectively", "Measuring stats" -- bad description of SSH/SLURM relationship

In the "Using resources effectively" episode, under "Measuring the staticstics of currently-running tasks", there's text that says "One very useful feature of SLURM is the ability to SSH to a node where a job is running...".
The SSH-ability of nodes is a result of cluster configuration independently of the queuing system. SLURM provides interactive access, but direct SSH may be disallowed even in the presence of SLURM, or allowed for other queuing systems.

Propose to reword this to "Typically, clusters allow users to SSH directly into worker nodes from the head node. This is useful to check on a running job and see how it's doing."
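Where direct SSH to compute nodes is permitted, the typical check might look like this (the node name is illustrative):

```
squeue -u $USER   # find which node the job is running on
ssh node042       # log in to that node, if the site allows it
top -u $USER      # inspect the job's CPU and memory use
```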

GNU Parallel in hpc-shell?

One command I have been wondering about including in hpc-shell is parallel. If worked into a meaningful example (a minimal sketch follows the list below), it would do the following:

  1. Provide a powerful little tool that could be used elsewhere.
  2. Lay some foundations for thinking in parallel in a way that is very responsive and programming-language agnostic, before the scheduler and the programming language get involved.
  3. Differentiate "hpc-shell" from "swc-shell" further.
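A minimal sketch of the kind of usage that could be shown (the filenames and job count are hypothetical):

```
# compress every .txt file in the directory, running four jobs at a time
parallel -j 4 gzip ::: *.txt
```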

Of course there are downsides, including:

  1. If people are new to the shell then this might be overwhelming.
  2. It might be tough to run on certain systems (I'm speculating here).

Thoughts?

Typos

Introduction to High-Performance Computing
Link : https://hpc-carpentry.github.io/hpc-intro/

Episode 1: Why Use a Cluster?
Typos : line 54 (tell the the computer)
line 66: When the task to solve become more heavy on computations
Episode 2: Working on a cluster
line 179: Issueing a ssh command always entails the same ....

Introduction to High-Performance Computing
Link : https://epcced.github.io/hpc-intro/

Episode: Why use High Performance Computing?
line 26 : Summarise your discussion in 2-3 sentances.
line 53: They are often interchangably
line 62: For example, varying an imput parameter (or input data) to a computation and running many copies simultaneously.
line 97: Summarise your discussion in 2-3 sentances.

Episode: What is an HPC system?
line 43 : Each core contains a floating point unit (FPU) which is responsible for actually performning the computations
line 51 : (also referred to as RAM or DRAM) in addtion to the processor memory
line 79 : hey may need differnt options or settings
line 101 : performance and workfows are often categorised
line 103 : his is typically seen when peforming

Episode: Connecting to the HPC system
line 9: "Succesfully connect to a remote HPC system."
line 90: Running PuTTY will not initially produce a terminal but intsead a window full of connection options.
line 175 : then use the follwing command: $ hostname.

Episode: Transferring files
line 8: "Be able to tranfer files to and from a remote HPC system."
line 19: choose will be decided by what is most covenient for your workflow.
line 83: Or perhaps we're simply not sure which files we want to tranfer yet.
line 288: All file tranfers using the above methods use encrypted communication over

Episode: Scheduling jobs

line 248: A key job enviroment variable in PBS is..
line 325: Absence of any job info indicates that the job has been successfully canceled
line 140 : Intially, Python 3 is not loaded.

Episode : Accessing software
line 45: (Fastest Fourer Transform in the West) software library availble for it to
line 52: it contains the settings required to run a software packace
line 65: may start out with an empty environemnt,
line 172: $PATH is a special ennvironment variable

Episode : Using resources effectively
line 59: You'll need to figure out a good amount of resources to ask for for this first "test run".

Episode : Using shared resources responsibly
line 13: "Understand how to conver many files to a single archive file using tar."
line 127: submit a short trunctated test to ensure that
line 156: In all these cases, the helpdesk of the system you are using shoud be

Episode: How does parallel computing work
line 91: If you look at the souce code.

Episode: Understanding what resources to use
line 20: Specifically what resources do you need initially and for parallel applicatiosns.
line 22: Remember the basic resources that are mananged by the scheduler on a HPC system are
line 83: such as this exampler for running the 3D animation software Maya.
line 99: **NOTE: This is nice as it automatically creates a seperate timing log file
Just after the references, Key Points
• Basic benchmarking allows you to use HPC resources more effecively.

Episode: Bootstrapping your use of HPC
line 21: This session is designed to give you the opprotunity to explore these questions and

add examples to intro episode

Have 2-3 user profiles / typical use cases to further explain why you might need a cluster.

Possibly also an exercise where the students write out why they think they need a cluster?

Improvements to Episode 2 (Working on a cluster)

Notes for possible PRs (I haven't gone through all the material yet, so some of the following may be addressed later).

  1. Why we shouldn't run long jobs on login nodes.
  2. A diagram to show the login process from laptop to login node via internet and how the login nodes are used to interact with the worker nodes.
  3. Elaborate more on the availability of common storage: files can be placed in one location and accessed from any worker node (much faster than going through the internet). A diagram to show that all compute nodes are connected to each other and to the common storage.
  4. Photos of CPUs, memory, and disks (to show these are not mystical stuff but physical objects).
  5. When showing nproc, sinfo, and free, also show "df -h" to see available storage locations (a sketch of these commands follows this list).
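A minimal sketch of that command sequence (sinfo assumes a SLURM cluster; output will vary by system):

```
nproc     # CPU cores visible on this node
free -h   # memory, in human-readable units
sinfo     # the scheduler's view of partitions and nodes
df -h     # mounted filesystems and the space available on each
```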

Potential images or diagrams

  • screenshot of putty for the ssh episode
  • diagram of two computers, for ssh episode
  • diagram of cluster (head node + worker nodes) for cluster episode

The why of cluster is missing

Cloud and cluster are defined as terms. But why should a researcher care?

Answer:
Cloud typically runs services.
Cluster runs batch jobs.

Cloud is better when you need to run a service such as a website, or database.

A cluster is better when a researcher needs to run one or more computations (e.g. a simulation or data processing) where it is not really important exactly when the computation runs, and where it may take a while for enough resources to become available.

Would it not be great if there were something that orchestrated the resources, started your computation when enough resources were available (even if that is at 4 am), and emailed you when it was done? This is what an HPC cluster does, and why you would use one.

The exact nomenclature (computation/simulation/job) of this discussion will be difficult to settle without invoking other, as-yet-undefined words.

address internet connectivity issues common to clusters

15-transferring-files talks about using wget to grab files from the internet. We should also include a warning that some clusters cannot reach the internet at all, or that only a head node or DTN (data transfer node) may be set up with internet access.

login node limits

17-responsibility says: 'A “quick test” is generally anything that uses less than 10GB of memory, 4 CPUs, and 15 minutes of time. Remember, the login node is to be shared with other users.'

This is going to vary widely by site. Many will have enforced ulimit or cgroup limits on the head node that are significantly lower than this and may cause this type of 'quick test' to fail when it runs out of memory.

How about just saying "don't run on the login node", and explaining how to look for a debug or interactive queue?
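Where such a queue exists, a hedged sketch of what requesting an interactive session could look like under SLURM (the partition name "debug" and the limits are assumptions; sites differ):

```
srun --partition=debug --time=00:15:00 --mem=4G --cpus-per-task=1 --pty bash
```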

Use variables to define workshop specific values

This is to use the _config.yml (as in the hpc-in-a-day lesson) to define, for example, the queuing system used,
e.g. from hpc-in-a-day:

# this is the scheduler to be used for the workshop
# possible values: lsf, slurm, pbs
workshop_scheduler: "slurm"
workshop_login_host: "cray-1"
workshop_shared_fast_filesystem: "/fastfs"

@aturner-epcc is working on a full solution for this. But due to time constraints I will fix the first few lessons (as it is easier than manually editing on my part) on my fork.

Worker node vs Compute node: regional dialect ?

12-cluster refers to 'worker (or execute)' nodes. We (and most other US institutions I've worked with) seem to refer to them as the 'compute' nodes. Minor difference that may be regional.

Do we need to think about regional dialect being swapped like cluster names & schedulers? Or should we just keep adding terms (worker aka execute aka compute)?

cpus: sockets, cores, and threads

12-cluster talks about components and equates the terms CPUs, processors, and cores. It doesn't make a distinction between sockets, cores, or threads for SMT (hyperthreading). It seems like we should get that straight up front before the wrong understanding is cemented.
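A hedged illustration of how the distinction shows up in practice (the commented output shape is typical of lscpu; the numbers are made up):

```
lscpu | grep -E 'Socket|Core|Thread'
# Thread(s) per core:  2     <- SMT / hyperthreading
# Core(s) per socket:  16
# Socket(s):           2     <- the OS sees 2 x 16 x 2 = 64 "CPUs"
```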

Re-number episodes

I'd be a fan of re-numbering by 5s so that it's easier for future users of this material to slot in their own pages.

Change index.md to reflect current lesson plan

As the lessons are now split (some modules moved to hpc-carpentry/hpc-shell and the programming part was removed), maybe index.md should be modified to reflect this, and some lesson outcomes could be removed as well.

line length and episode numbering

make lesson-check-full flags the following issues, after updating the backend (#34):

  • Missing or non-consecutive episode numbers [0, 11, 12, 13, 14, 15, 16]
    • related to #9
  • ./CODE_OF_CONDUCT.md: Internally-defined links may be missing definitions: "Carpentry Code of Conduct"=>"coc", "reporting guidelines"=>"coc-reporting"
  • ./CONTRIBUTING.md: Line(s) are too long: 114, 126
  • ./README.md: Line(s) are too long: 41, 42, 43, 46, 47, 53, 61, 69, 77
  • ./_episodes/00-hpc-intro.md: Line(s) are too long: 124, 127
  • ./_episodes/11-cluster.md: Line(s) are too long: 71, 87, 89, 90, 91, 99
  • ./_episodes/12-scheduler.md: Line(s) are too long: 89, 91, 153, 154, 188, 190, 266, 267, 280, 302, 323, 348
  • ./_episodes/13-modules.md: Line(s) are too long: 91, 121, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 366
  • ./_episodes/13-modules.md:434: Unknown or missing code block type None
  • ./_episodes/14-transferring-files.md: Line(s) are too long: 47, 190, 195, 224
  • ./_episodes/15-resources.md: Line(s) are too long: 89, 151, 153, 207
  • ./_episodes/16-responsiblity.md: Line(s) are too long: 48
  • ./index.md: Line(s) are too long: 7, 8, 12, 13
  • ./index.md:18: Unknown or missing blockquote type None

"Working on a cluster" -- nodes may be heterogeneous

In the "Working on a cluster" episode, where contexts and scopes within the cluster are introduced, it's worth pointing out that the cluster's node collection may be heterogeneous. It's probably overkill to get into resource requests, but a remark that nodes are not always all equal would be valuable here.

populate glossary

MobaXterm or GitBash?

The current version of the login instructions suggests that MobaXterm (or PuTTY) is used on Windows. This certainly gets the job done. However, Software Carpentry uses GitBash. I know that the reason for MobaXterm is at least partly historical (it had nano while GitBash didn't, and so was adopted by Compute Canada) and is now not relevant (Oliver Stueker added nano to GitBash).

Would GitBash be a better option here so that:

  1. We align with Software Carpentry.
  2. Windows users get Git as a bonus.

?

filesystems vary widely

12-cluster says:

This is an important point to remember: files saved on one node (computer) are available everywhere on the cluster!

That statement seems overly broad: we have things like /tmp and /scratch that may or may not be shared across nodes. I'd probably just remove this sentence from this section and address it in a section about filesystems, where it can be explained in more detail.

Consolidating "file transfer" into one section

I want to add an input toward simplifying the "file transfer" topic. There are currently two places where this is touched on:

I wonder if we can consider consolidating all of it into one episode. Perhaps the top subheadings would present the "preferred" method, and the later subheadings several alternatives. Instructors can pick and choose which method they want to expose the learners to.

A second alternative is to push the alternative methods to an extra episode, per @psteinb suggestion in #27.

PS: I'm aware of issue #27, "Simplify file transfer section", but that one suggested focusing on only one method of file transfer.

What are everyone's thoughts on this? I don't mind helping with the consolidation/reorg, but I want to get your input first.

Narrative Needed

To align further with the standard Carpentries approach, the technical content should be wrapped up in a narrative that can serve as a template for participants to imagine themselves fitting into. I suspect that hpc-novice and hpc-shell should each have their own narrative so that they can be done independently, but they could build on each other as long as hpc-novice provided whatever files hpc-shell might produce.

Thoughts?

UNIX terminology

I think we should explain that Linux derives from the design of UNIX (but is not UNIX) and introduce the Linux & POSIX terms instead of UNIX throughout this unit.

Loading a module by default

14-modules talks about putting 'module load' commands in .bashrc and .bash_profile.

  1. I generally discourage users from doing this. It often slows down their login session, they forget they have them and then later open tickets about things behaving oddly, module names can change, etc.

  2. If it's to be kept, we should at least explain the difference between the two conf files, and issue a word of caution about why it may cause problems (see the sketch below).
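If the practice is kept, a hedged sketch of the distinction and the usual pattern (the module name is illustrative):

```
# ~/.bash_profile is read by login shells (e.g. when you ssh in);
# ~/.bashrc is read by interactive non-login shells.
# A common convention is to source one from the other:
if [ -f ~/.bashrc ]; then . ~/.bashrc; fi

# placed in ~/.bashrc, this runs on every new shell -- which is exactly why it can surprise you later
module load gcc/9.3.0
```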

filezilla for something with less adware

Filezilla has become riddled with adware and we (and others I assume) are steering users away from it.

I think we should suggest another tool in 15-transferring-files: WinSCP, Cyberduck, etc.

"Accessing software", "Installing software of our own", more caution needed?

In the "Accessing software" episode, in "Installing software of our own", it's not mentioned that some source distributions make bad assumptions about the level of privilege, and cluster users without root may need to provide "--prefix" info to the configuration step. They may also need to be prepared for odd or surprising output from the build.
It's possible that building software in a cluster environment should just be removed from "hpc-intro" (except perhaps to remark that it can be done), and put into an advanced follow-on lesson, if one is set up.
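For a typical autotools-style source build, the hedged pattern would be something like this (the install path is illustrative):

```
./configure --prefix=$HOME/.local   # install under the home directory instead of /usr/local
make
make install
export PATH=$HOME/.local/bin:$PATH  # make the newly installed binaries findable
```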

longer introduction to ssh

Hi,

I'm going through the lesson material, and the step from 11-hpc-intro.md to 12-cluster.md feels a bit rough.

Go ahead and log in to the cluster.
```
[user@laptop]$ ssh remote
```
{: .bash}


Very often, many users are tempted to think of a high-performance computing installation as one
giant, magical machine. Sometimes, people will assume that the computer they've logged onto is the
entire computing cluster. So what's really happening? What computer have we logged on to? The name
of the current computer we are logged onto can be checked with the `hostname` command. (Clever users
will notice that the current hostname is also part of our prompt!)

```
[remote]$ hostname
```

Now we can assume that the user knows about the shell, and can guess that ssh is a command and remote is an argument. But from experience, assuming that they understand that the [remote]$ prompt is the prompt of the remote machine is a big leap of faith.

I see a number of steps missing here that IMHO need to be explained:

  • ssh allows you to connect to a remote machine, as if you had plugged a screen/keyboard/(mouse?) into a remote computer and opened a terminal.
  • the remote argument is not actually the word remote, but the actual address of the cluster (given to you by your admin). You also (likely) want to prefix it with <username>@
  • You will (likely) need to type your password, and it won't show up on the screen while you type.
  • if the password is correct, your terminal should show a welcome message from the cluster, and everything you run in this terminal is now executed on the remote machine
    (details may vary between installations)

I believe that would be the basics of what needs to be covered, but I feel like understanding local machine vs remote machine, and knowing which one we are on at any given moment, is critical. Alternatively this could be in a separate lesson, but I don't see it in shell-novice.
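A hedged sketch of what the more explicit version could show (the hostname and username are placeholders):

```
[user@laptop]$ ssh yourUsername@cluster.example.edu
yourUsername@cluster.example.edu's password:    # typed characters are not echoed

[yourUsername@login1 ~]$ hostname               # the prompt (and hostname) now belong to the cluster
login1
```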

Make lessons less system-specific

With the goal of these materials being more widely applicable, we should remove system-specific details and replace them with more general explanations.

Elaborate on "Efficiency" in "Why use a cluster?"

The "Why Use These Computers?" section of this lesson has a discussion of "efficiency", which mentions large clusters are often a "pool of resources drawn on by many users" and mentions that they can be in use constantly. There's another side to this coin, which is that projects or groups may have a requirement for very powerful computers, but only part of the time, so sharing the resource makes economic sense for them also.
This dovetails a little bit with the existing "what is a cluster" and "definition of cloud" conversations -- pay-per-use is a much-remarked-on feature of the cloud, and relevant to institutional HPC also.

Resources - How Much to Request?

Lesson: Using resources effectively
Section: Estimating required resources using the scheduler

I'd like to suggest a change and an expansion of the following.

"A good rule of thumb is to ask the scheduler for more time and memory than your job can use. This value is typically two to three times what you think your job will need."

I agree that the user should ask for more than they expect to need but I think two-to-three times is far too high. When giving workshops to HPC novices I usually recommend no more than 20% (assuming the job's resource use is not unusually volatile) in order to ensure that a job does not get stuck in the queue waiting for the requested resources to become available.

How about something like this? I think this is an important enough concept that it is worth spelling out the issues.

"A good rule of thumb is to ask the scheduler for more time and memory than you expect your job to need. This ensures that minor fluctuations in run time or memory use will not result in your job being canceled by the scheduler. Recommendations for how much extra to ask for vary but 10% is probably the minimum, with 20-30% being more typical. Keep in mind that if you ask for too much, your job may not run even though enough resources are available, because the scheduler will be waiting to match what you asked for."

Could also add the following example to make the point even clearer.

"For example, suppose your job requires 20 GB of memory but you requested 60 GB just to be on the safe side. While your job is waiting in the queue, a node with 40 GB becomes available but your job still doesn't start because the scheduler is looking for 60 GB to satisfy the requirements you specified. In this case a smaller memory request would have resulted in your job starting sooner."

I think this point is worth emphasizing, especially as (in my experience, anyway) it isn't uncommon for novice users to end up with submit files containing resource requests that are far above what they actually need, which results in longer wait times as well as wasted resources, especially memory, when the jobs don't use what has been set aside for them.

Thoughts?

"Working on a cluster" -- assumption of user log-in state

The "Working on a cluster" episode begins with the assumption that the user is logged in to the head node of a cluster ("what computer have we logged into?"), but the actual step of logging in to the head node is not called out, either at the start of this lesson, or the end of the previous one. Since the point of the description is to clarify that there are multiple contexts or scopes within a cluster, it's important to make this step explicit.

Auto-include configuration variables

We need to decide how to take forward the configuration variables used to allow customisation to different schedulers and local system setups. There has already been discussion of this in #73 and #80. This issue is to pull together the discussion and decide on the way forward for the configuration variables.

The original set used by @psteinb in HPC-in-a-day was:

workshop_scheduler: "slurm"
workshop_login_host: "cray-1"
workshop_shared_fast_filesystem: "/fastfs"

I took this approach and applied it to the hpc-intro lesson with the plan to push this update back into the current hpc-intro source. The variables used to customise in this version actually ended up being a much longer list than I would have liked (and I think there is scope to rationalise):

Local host and scheduler options

workshop_host: "Cirrus"
workshop_host_id: "EPCC_Cirrus"
workshop_host_login: "login.cirrus.ac.uk"
workshop_host_location: "EPCC, The University of Edinburgh"
workshop_host_ip: "129.215.175.28"
workshop_host_homedir: "/lustre/home/tc001"
workshop_host_prompt: "[yourUsername@cirrus-login0 ~]$"
workshop_sched_id: "EPCC_Cirrus_pbs"
workshop_sched_name: "PBS Pro"
workshop_sched_submit: "qsub"
workshop_sched_submitopt: "qsub -A tc001 -q R387726"
workshop_sched_stat: "qstat"
workshop_sched_statu: "qstat -u yourUsername"
workshop_sched_del: "qdel"
workshop_sched_submiti: "qsub"
workshop_sched_submitiopt: "qsub -IVl select=1:ncpus=1 -A tc001 -q R387726"
workshop_sched_info: "pbsnodes -a"
workshop_sched_comment: "#PBS"
workshop_sched_nameopt: "-N"
workshop_sched_hist: "qstat -x"
workshop_sched_histu: "qstat -x -u yourUsername"
workshop_sched_histj: "qstat -x -f"

However, one thing I found compared to the original set was the requirement to distinguish between different local configurations even if the same scheduler is used, hence the addition of both "workshop_host_id" and "workshop_sched_id". I suppose these could be dropped and a combination of the other variables used in naming the snippets, if we want to minimise the configuration variables.

Some of the variables are only used in one place in the current lesson and, in these cases, the variable could be dropped and the syntax included directly in a snippet instead (I was too keen on avoiding one-line snippets). Variables only used in one place:

workshop_sched_submitiopt: "qsub -IVl select=1:ncpus=1 -A tc001 -q R387726"
workshop_sched_hist: "qstat -x"
workshop_sched_histu: "qstat -x -u yourUsername"
workshop_sched_histj: "qstat -x -f"

Variables not used at all:

workshop_host_ip: "129.215.175.28"

Ideally, I would like to rationalise and issue a PR for the updated list of configuration variables (I currently have time to do this) but, as there has been discussion on this, I wanted to get the thoughts of the community and agree on a way forward before doing this.

Scheduler Concepts

Would it be helpful/possible to create a list of key concepts (scheduler agnostic) that should be discussed in the scheduler lesson? Syntax specific to whatever scheduler is being used could then be filled in around those concepts. I pulled a list of ideas from the current SLURM lesson as a starting point (a sketch mapping these onto a batch script follows the list):

  • Definition of a job and batch processing
  • Submitting a job to the scheduler
  • Passing options to the scheduler
  • Changing a job's name
  • Sending an email once the job completes
  • Requesting resources on a compute node
  • Log files/job status
  • Wall times
  • Cancelling/deleting a job
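For context, a hedged sketch of how most of these concepts map onto a single SLURM batch script (directive values, email address, and log file name are illustrative; other schedulers use different syntax):

```
#!/bin/bash
#SBATCH --job-name=my-test            # changing a job's name
#SBATCH --mail-type=END               # send an email once the job completes
#SBATCH --mail-user=me@example.org
#SBATCH --nodes=1                     # requesting resources on a compute node
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00               # wall time
#SBATCH --output=job_%j.log           # log file (%j expands to the job ID)

echo "Running on $(hostname)"
```

Submitting (sbatch script.sh), checking status (squeue -u $USER), and cancelling (scancel <jobid>) would cover the remaining items.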

"Customising a job" section, pay attention to copy past.

One regular issue we have in HPC at UC Merced is with copy-paste.
Some OS/text editors/etc. try to be "helpful" and convert --/- into em dashes or U+0096, which can be relatively hard to diagnose; on the CLI you can use grep $'\u0096' (on bash 4.2+, I believe) to find those.

It is relatively sneaky as there are no errors, and the flags just get ignored by sbatch.
This is rare but recurrent for users who create batch scripts on their machine in Word/TextEdit and then upload them to the cluster.

This might not belong in the lesson itself, but perhaps in the instructor notes.
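A hedged, more general check that catches any non-ASCII character in a submitted script (requires GNU grep with PCRE support; the filename is a placeholder):

```
# print line numbers of any non-ASCII bytes smuggled in by copy-paste
grep -nP '[^\x00-\x7F]' jobscript.sh
```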

"Scheduling jobs" -- remark about environment variables is incorrect

In the "Scheduling jobs" episode, under the "other types of jobs" subsection, there is a remark that environment variables are not available for interactive jobs launched via "srun". On the two SLURM clusters available to me, this is not true for either of them -- doing "srun" followed by "env" shows that in the shell session, the "SLURM_*" environment variables are set, and the explicitly-mentioned SLURM_CPUS_PER_TASK variable is set if the "-c" flag was provided to "srun".

This might be configurable, and vary from one installation to another?
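A quick way to reproduce the check described above (the -c value is arbitrary):

```
srun -c 4 --pty bash   # start an interactive shell with 4 CPUs per task
env | grep SLURM       # SLURM_CPUS_PER_TASK should appear, set to 4
```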
