carpentries-incubator / hpc-intro
An Introduction to High Performance Computing
Home Page: https://carpentries-incubator.github.io/hpc-intro/
License: Other
Resource requests
But what [.....], which is probably not what we want.
The following are several key resource requests:
-n <nnodes> - how many nodes does your job need?
-c <ncpus> - How many CPUs does your job need?
`-n` is the number of tasks; the number of nodes is uppercase `-N`, and `-c` is the number of CPUs *per task*. Written as above, it reads as if it were the total number of CPUs. This was holding up the merge of #4, but a distinction should be added between multiple tasks (`-n`) and multiple nodes (`-N`). Maybe add this as a callout box?
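Such a callout could sketch how the three flags combine in a job header (a hypothetical illustration; the values are made up):

```shell
#!/bin/bash
#SBATCH -N 2   # -N: number of nodes
#SBATCH -n 8   # -n: number of tasks (e.g. MPI ranks), spread across the nodes
#SBATCH -c 4   # -c: CPUs *per task*, not in total
# the total CPU count is tasks x cpus-per-task:
echo $(( 8 * 4 ))   # 32
```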
Most of the images are broken, as they try to load from `/fig/` instead of `/hpc-intro/fig/...`. This seems to be due to the missing `baseurl` config. It seems like you can use filters to prepend the baseurl only on GitHub, and not when deploying/testing locally.
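A minimal sketch of the usual Jekyll fix (assuming the site is published as the hpc-intro project page):

```yaml
# _config.yml
baseurl: "/hpc-intro"   # prefix for all site-relative asset paths
```

Image paths in the episodes can then go through Jekyll's `relative_url` filter, e.g. `{{ '/fig/connect.png' | relative_url }}` (filename hypothetical), which resolves correctly both on GitHub Pages and when serving locally.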
In the "Using resources effectively" episode, under "Measuring the statistics of currently-running tasks", there's text that says "One very useful feature of SLURM is the ability to SSH to a node where a job is running...".
The SSH-ability of nodes is a result of cluster configuration, independent of the queuing system. SLURM provides interactive access, but direct SSH may be disallowed even where SLURM is present, or allowed with other queuing systems.
Propose to reword this to "Typically, clusters allow users to SSH directly into worker nodes from the head node. This is useful to check on a running job and see how it's doing."
One command I have been wondering about including in hpc-shell is `parallel`. If worked into a meaningful example, it would do the following:
Of course there are downsides, including:
Thoughts?
https://hpc-carpentry.github.io/hpc-intro/12-cluster/index.html
Currently says :
Bytes (one Byte are bits).
This should be:
Bytes (one Byte are 8 bits).
As this is an introductory lesson, we should remove as much confusion as possible. Will send a PR on this.
We should identify groups of maintainers for each of the lessons who take responsibility for both the lessons themselves and ensuring that dependencies between lessons are kept up to date and described properly.
Introduction to High-Performance Computing
Link : https://hpc-carpentry.github.io/hpc-intro/
Episode 1: Why Use a Cluster?
Typos : line 54 (tell the the computer)
line 66: When the task to solve become more heavy on computations
Episode 2: Working on a cluster
line 179: Issueing a `ssh` command always entails the same ....
Introduction to High-Performance Computing
Link : https://epcced.github.io/hpc-intro/
Episode: Why use High Performance Computing?
line 26 : Summarise your discussion in 2-3 sentances.
line 53: They are often interchangably
line 62: For example, varying an imput parameter (or input data) to a computation and running many copies simultaneously.
line 97: Summarise your discussion in 2-3 sentances.
Episode: What is an HPC system?
line 43 : Each core contains a floating point unit (FPU) which is responsible for actually performning the computations
line 51 : (also referred to as RAM or DRAM) in addtion to the processor memory
line 79 : hey may need differnt options or settings
line 101 : performance and workfows are often categorised
line 103 : his is typically seen when peforming
Episode: Connecting to the HPC system
line 9: "Succesfully connect to a remote HPC system."
line 90: Running PuTTY will not initially produce a terminal but intsead a window full of connection options.
line 175 : then use the follwing command: `$ hostname`.
Episode: Transferring files
line 8: "Be able to tranfer files to and from a remote HPC system."
line 19: choose will be decided by what is most covenient for your workflow.
line 83: Or perhaps we're simply not sure which files we want to tranfer yet.
line 288: All file tranfers using the above methods use encrypted communication over
Episode: Scheduling jobs
line 248: A key job enviroment variable in PBS is..
line 325: Absence of any job info indicates that the job has been successfully canceled
line 140 : Intially, Python 3 is not loaded.
Episode : Accessing software
line 45: (Fastest Fourer Transform in the West) software library availble for it to
line 52: it contains the settings required to run a software packace
line 65: may start out with an empty environemnt,
line 172: `$PATH` is a special ennvironment variable
Episode : Using resources effectively
line 59: You'll need to figure out a good amount of resources to ask for for this first "test run".
Episode : Using shared resources responsibly
line 13: "Understand how to conver many files to a single archive file using tar."
line 127: submit a short trunctated test to ensure that
line 156: In all these cases, the helpdesk of the system you are using shoud be
Episode: How does parallel computing work
line 91: If you look at the souce code.
Episode: Understanding what resources to use
line 20: Specifically what resources do you need initially and for parallel applicatiosns.
line 22: Remember the basic resources that are mananged by the scheduler on a HPC system are
line 83: such as this exampler for running the 3D animation software Maya.
line 99: **NOTE: This is nice as it automatically creates a seperate timing log file
Just after the references, Key Points
• Basic benchmarking allows you to use HPC resources more effecively.
Episode: Bootstrapping your use of HPC
line 21: This session is designed to give you the opprotunity to explore these questions and
https://github.com/hpc-carpentry/hpc-intro:
Under "Lesson structure"
"full guide" points to http://swcarpentry.github.io/lesson-example/04-formatting/ (404)
Have 2-3 user profiles / typical use cases to further explain why you might need a cluster.
Possibly also an exercise where the students write out why they think they need a cluster?
Notes for possible PRs (I haven't gone through all the material yet, so some of the following may be addressed later).
@jstaf this isn't for you, but for @callaghanmt who is interested in customizing / adding to these materials for an upcoming workshop!
The "why" of clusters is missing
Cloud and cluster are defined as terms. But why should a researcher care?
Answer:
Cloud typically runs services.
Cluster runs batch jobs.
Cloud is better when you need to run a service such as a website, or database.
Cluster is better when a researcher needs to run one or more computations (e.g. a simulation or data processing), where it is not really important exactly when the computation runs, and where it may take a while for enough resources to become available.
Would it not be great if there were something that orchestrated the resources, started your computation when enough resources were available (even if that is at 4 am), and emailed you when your computation was done? This is what an HPC cluster does, and why you would use one.
The exact nomenclature (computation/simulation/job) of this discussion will be difficult to settle without invoking other, not-yet-defined words.
15-transferring-files talks about using wget to grab files from the internet. We should also include a warning that some clusters cannot reach the internet at all, or that only a head node or DTN (data transfer node) may be set up for internet access.
17-responsibility says: 'A “quick test” is generally anything that uses less than 10GB of memory, 4 CPUs, and 15 minutes of time. Remember, the login node is to be shared with other users.'
This is going to vary widely by site. Many will have enforced ulimit or cgroup limits on the headnode that are significantly lower than this and may cause this type of 'quick test' to fail when it runs out of memory.
How about just: don't run on the login node, and explain how to look for a debug or interactive queue?
This is to use the _config.yml (as in the hpc-in-a-day lesson) to define, for example, the queuing system used.
e.g. from hpc-in-a-day:
# this is the scheduler to be used for the workshop
# possible values: lsf, slurm, pbs
workshop_scheduler: "slurm"
workshop_login_host: "cray-1"
workshop_shared_fast_filesystem: "/fastfs"
@aturner-epcc is working on a full solution for this. But due to time constraints, I will fix the first few lessons (as it is easier than manual editing on my part) on my fork.
As discussed with @aturner-epcc and many others from this team, I wanted to start discussing splitting this lesson. The core reason is that I am not sure whether to start merging from hpc-novice before or after the split.
12-cluster refers to 'worker (or execute)' nodes. We (and most other US institutions I've worked with) seem to refer to them as the 'compute' nodes. Minor difference that may be regional.
Do we need to think about regional dialect being swapped like cluster names & schedulers ? Or should we just keep adding terms (worker aka execute aka compute) ?
12-cluster talks about components and equates the terms CPUs, processors, and cores. It doesn't make a distinction between sockets, cores, or threads for SMT (hyperthreading). It seems like we should get that straight up front before the wrong understanding is cemented.
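Something as simple as `lscpu` could anchor that distinction early on (output varies by machine):

```shell
# summarise the socket / core / thread topology of the current node
lscpu | grep -E '^(Socket|Core|Thread|CPU\(s\))'
```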
I'd be a fan of re-numbering by 5s so that it's easier for future users of this material to slot in their own pages.
srun with X-forwarding, i.e.
$ srun --x11 --pty bash
is mentioned, but for this to work the user must have established an SSH connection with X-forwarding, i.e.
ssh -Y server.org
As the lessons are now split (some modules moved to hpc-carpentry/hpc-shell, and the programming part was removed), maybe index.md should be modified to reflect this, and some lesson outcomes could be removed as well.
make lesson-check-full
flags the following issues, after updating the backend (#34):
Missing or non-consecutive episode numbers [0, 11, 12, 13, 14, 15, 16]
./CODE_OF_CONDUCT.md: Internally-defined links may be missing definitions: "Carpentry Code of Conduct"=>"coc", "reporting guidelines"=>"coc-reporting"
./CONTRIBUTING.md: Line(s) are too long: 114, 126
./README.md: Line(s) are too long: 41, 42, 43, 46, 47, 53, 61, 69, 77
./_episodes/00-hpc-intro.md: Line(s) are too long: 124, 127
./_episodes/11-cluster.md: Line(s) are too long: 71, 87, 89, 90, 91, 99
./_episodes/12-scheduler.md: Line(s) are too long: 89, 91, 153, 154, 188, 190, 266, 267, 280, 302, 323, 348
./_episodes/13-modules.md: Line(s) are too long: 91, 121, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 366
./_episodes/13-modules.md:434: Unknown or missing code block type None
./_episodes/14-transferring-files.md: Line(s) are too long: 47, 190, 195, 224
./_episodes/15-resources.md: Line(s) are too long: 89, 151, 153, 207
./_episodes/16-responsiblity.md: Line(s) are too long: 48
./index.md: Line(s) are too long: 7, 8, 12, 13
./index.md:18: Unknown or missing blockquote type None
In the "Working on a cluster" episode, where contexts and scopes within the cluster are introduced, it's worth pointing out that the cluster's node collection may be heterogeneous. It's probably overkill to get into resource requests, but a remark that nodes are not always all equal would be valuable here.
@jstaf, is this work in progress?
The glossary (reference.md, guide.md) needs content. For a start:
The current version of the login suggests that MobaXterm is used on Windows (or PuTTY). This certainly gets the job done. However, Software Carpentry uses GitBash. I know that the reason for MobaXterm is at least partly historical (it had nano while GitBash didn't and so was adopted by Compute Canada) and now not relevant (Oliver Stueker added nano to GitBash).
Would GitBash be a better option here so that:
?
12-cluster says:
This is an important point to remember: files saved on one node (computer) are available everywhere on the cluster!
That statement seems overly broad: we have things like /tmp and /scratch that may or may not be shared across nodes. I'd probably just remove this sentence from this section and address it in a section about filesystems, where it can be explained in more detail.
I want to add input toward the simplification of the "file transfer" topic. There are currently two places where this is touched:
"Transferring Data" under "Working on a cluster" episode: https://hpc-carpentry.github.io/hpc-intro/12-cluster/index.html#transferring-data
"Transferring Files", a dedicated episode:
https://hpc-carpentry.github.io/hpc-intro/15-transferring-files/index.html
I wonder if we can consider consolidating all of it into one episode. Perhaps the top subheadings would present the "preferred" method, and the later subheadings would present several alternatives. Instructors can pick and choose which method they want to expose the learners to.
A second alternative is to push the alternative methods to an extra episode, per @psteinb suggestion in #27.
PS: I'm aware of issue #27, "Simplify file transfer section", but that one suggested focusing on only one method of file transfer.
What is everyone's thought on this? I won't mind helping doing the consolidation/reorg, but I want to get your input first.
Somewhat related to #20, I think we should present one clear option in the file transfer section (https://hpc-carpentry.github.io/hpc-intro/14-transferring-files/) and then refer people to other tools in the "extras" section.
Request from Kamil: early in the intro we should define what a "job" is. Otherwise it's jargon.
To align further with the standard Carpentries approach, the technical content should be wrapped up in a narrative that can serve as a template for participants to imagine themselves fitting into. I suspect that hpc-novice and hpc-shell should each have their own narrative so that they can be done independently, but they could build on each other as long as hpc-novice provided whatever files hpc-shell might produce.
Thoughts?
I think we should explain that Linux derives from the design of UNIX (but is not UNIX) and introduce the terms Linux and POSIX instead of UNIX throughout this unit.
14-modules talks about putting 'module load' commands in .bashrc and .bash_profile.
I generally discourage users from doing this. It often slows down their login session, they forget they have them and then later open tickets about things behaving oddly, module names can change, etc.
If it's to be kept, we should at least explain the difference between the two conf files, and issue a word of caution about why it may cause problems
Filezilla has become riddled with adware and we (and others I assume) are steering users away from it.
I think we should suggest another tool in 15-transferring-files: WinSCP, CyberDuck, etc.
In the "Accessing software" episode, in "Installing software of our own", it's not mentioned that some source distributions make bad assumptions about the level of privilege, and cluster users without root may need to provide "--prefix" info to the configuration step. They may also need to be prepared for odd or surprising output from the build.
It's possible that building software in a cluster environment should just be removed from "hpc-intro" (except perhaps to remark that it can be done), and put into an advanced follow-on lesson, if one is set up.
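If the topic stays, the remark could be as brief as the usual non-root install pattern (a sketch only; it assumes an autotools-style source tree, and the prefix path is a common convention, not taken from the lesson):

```shell
# install into a user-writable prefix instead of the default /usr/local,
# which cluster users without root typically cannot write to
./configure --prefix="$HOME/.local"
make
make install

# make the freshly installed binaries findable
export PATH="$HOME/.local/bin:$PATH"
```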
Hi,
I'm going through the lesson material, and the step from 11-hpc-intro.md to 12-cluster.md feels a bit rough.
Go ahead and log in to the cluster.
```
[user@laptop]$ ssh remote
```
{: .bash}
Very often, many users are tempted to think of a high-performance computing installation as one
giant, magical machine. Sometimes, people will assume that the computer they've logged onto is the
entire computing cluster. So what's really happening? What computer have we logged on to? The name
of the current computer we are logged onto can be checked with the `hostname` command. (Clever users
will notice that the current hostname is also part of our prompt!)
```
[remote]$ hostname
```
Now we can assume that users know about the shell, and can guess that `ssh` is a command and `remote` is an argument. But from experience, assuming that they understand that the `[remote]$` prompt is the prompt of the remote machine is a big leap of faith.
I see a number of steps missing here that IMHO need to be explained:
the `remote` argument is not actually the word remote, but the actual address of the cluster (given to you by your admin). You also (likely) want to prefix it with `<username>@`
I believe that would be the basics of what needs to be covered, but I feel like understanding local machine vs. remote machine, and knowing when we are on which, is critical. Alternatively this could be in a separate lesson, but I don't see it in shell-novice.
With the goal of these materials being more widely applicable, we should remove system-specific details and replace them with more general explanations.
The "Why Use These Computers?" section of this lesson has a discussion of "efficiency", which mentions large clusters are often a "pool of resources drawn on by many users" and mentions that they can be in use constantly. There's another side to this coin, which is that projects or groups may have a requirement for very powerful computers, but only part of the time, so sharing the resource makes economic sense for them also.
This dovetails a little bit with the existing "what is a cluster" and "definition of cloud" conversations -- pay-per-use is a much-remarked-on feature of the cloud, and relevant to institutional HPC also.
Lesson: Using resources effectively
Section: Estimating required resources using the scheduler
I'd like to suggest a change and an expansion of the following.
"A good rule of thumb is to ask the scheduler for more time and memory than your job can use. This value is typically two to three times what you think your job will need."
I agree that the user should ask for more than they expect to need but I think two-to-three times is far too high. When giving workshops to HPC novices I usually recommend no more than 20% (assuming the job's resource use is not unusually volatile) in order to ensure that a job does not get stuck in the queue waiting for the requested resources to become available.
How about something like this? I think this is an important enough concept that it is worth spelling out the issues.
"A good rule of thumb is to ask the scheduler for more time and memory than you expect your job to need. This ensures that minor fluctuations in run time or memory use will not result in your job being canceled by the scheduler. Recommendations for how much extra to ask for vary but 10% is probably the minimum, with 20-30% being more typical. Keep in mind that if you ask for too much, your job may not run even though enough resources are available, because the scheduler will be waiting to match what you asked for."
Could also add the following example to make the point even clearer.
"For example, suppose your job requires 20 GB of memory but you requested 60 GB just to be on the safe side. While your job is waiting in the queue, a node with 40 GB becomes available but your job still doesn't start because the scheduler is looking for 60 GB to satisfy the requirements you specified. In this case a smaller memory request would have resulted in your job starting sooner."
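The scenario could even be paired with a hypothetical submission header showing the suggested headroom in practice (SLURM flags; the values match the example above):

```shell
#!/bin/bash
# the job needs ~20 GB, so request ~20% headroom (24 GB), not 60 GB
#SBATCH --mem=24G
# expected runtime ~1 hour, plus ~20% headroom
#SBATCH --time=01:12:00
./my_analysis   # placeholder for the actual job command
```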
I think this point is worth emphasizing, especially as (in my experience, anyway) it isn't uncommon for novice users to end up with submit files containing resource requests that are far above what they actually need, which results in longer wait times as well as wasted resources, especially memory, when the jobs don't use what has been set aside for them.
Thoughts?
The "Working on a cluster" episode begins with the assumption that the user is logged in to the head node of a cluster ("what computer have we logged into?"), but the actual step of logging in to the head node is not called out, either at the start of this lesson, or the end of the previous one. Since the point of the description is to clarify that there are multiple contexts or scopes within a cluster, it's important to make this step explicit.
We need to decide how to take forward the configuration variables used to allow customisation to different schedulers and local system setups. There has already been discussion of this in #73 and #80. This issue is to pull together the discussion and decide on the way forward for the configuration variables.
The original set used by @psteinb in HPC-in-a-day was:
workshop_scheduler: "slurm"
workshop_login_host: "cray-1"
workshop_shared_fast_filesystem: "/fastfs"
I took this approach and applied it to the hpc-intro lesson with the plan to push this update back into the current hpc-intro source. The variables used to customise in this version actually ended up being a much longer list than I would have liked (and I think there is scope to rationalise):
workshop_host: "Cirrus"
workshop_host_id: "EPCC_Cirrus"
workshop_host_login: "login.cirrus.ac.uk"
workshop_host_location: "EPCC, The University of Edinburgh"
workshop_host_ip: "129.215.175.28"
workshop_host_homedir: "/lustre/home/tc001"
workshop_host_prompt: "[yourUsername@cirrus-login0 ~]$"
workshop_sched_id: "EPCC_Cirrus_pbs"
workshop_sched_name: "PBS Pro"
workshop_sched_submit: "qsub"
workshop_sched_submitopt: "qsub -A tc001 -q R387726"
workshop_sched_stat: "qstat"
workshop_sched_statu: "qstat -u yourUsername"
workshop_sched_del: "qdel"
workshop_sched_submiti: "qsub"
workshop_sched_submitiopt: "qsub -IVl select=1:ncpus=1 -A tc001 -q R387726"
workshop_sched_info: "pbsnodes -a"
workshop_sched_comment: "#PBS"
workshop_sched_nameopt: "-N"
workshop_sched_hist: "qstat -x"
workshop_sched_histu: "qstat -x -u yourUsername"
workshop_sched_histj: "qstat -x -f"
However, one thing I found compared to the original set was the requirement to distinguish between different local configurations even if the same scheduler is used, hence the addition of both "workshop_host_id" and "workshop_sched_id". I suppose these could be dropped and the combination of the variables used in naming of the snippets if we want to minimise configuration variables.
Some of the variables are only used in one place in the current lesson and, in these cases, the variable could be dropped and the syntax included directly in a snippet instead (I was too keen on avoiding one-line snippets). Variables only used in one place:
workshop_sched_submitiopt: "qsub -IVl select=1:ncpus=1 -A tc001 -q R387726"
workshop_sched_hist: "qstat -x"
workshop_sched_histu: "qstat -x -u yourUsername"
workshop_sched_histj: "qstat -x -f"
Variables not used at all:
workshop_host_ip: "129.215.175.28"
Ideally, I would like to rationalise and issue a PR for the updated list of configuration variables (I currently have time to do this) but, as there has been discussion on this, I wanted to get the thoughts of the community and agree on a way forward before doing so.
Would it be helpful/possible to create a list of key concepts (scheduler agnostic) that should be discussed in the scheduler lesson? Syntax specific to whatever scheduler is being used could then be filled in around those concepts. I pulled a list of ideas from the current SLURM lesson as a starting point:
One regular issue we have in HPC at UC Merced is with copy-paste. Some OS/text-editor combinations try to be "helpful" and convert `--`/`-` to em dashes or other characters such as U+0096. It can be relatively hard to diagnose; in the CLI you can use `grep $'\u0096'` (on bash 4.2+, I believe) to find those. It is relatively sneaky as there are no errors; the flags just get silently ignored by sbatch. This is rare but recurrent for users who create batch scripts on their machine in Word/TextEdit and then upload them to the cluster.
Might not be a good thing for the lesson itself, but for the instructor notes.
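A sketch for the instructor notes (the file contents here are fabricated to show the failure mode):

```shell
# simulate a script pasted from a word processor: the intended "--mem"
# has had its double hyphen replaced by en dashes (U+2013)
printf 'sbatch \xe2\x80\x93\xe2\x80\x93mem=4G job.sh\n' > submit.sh

# flag any non-ASCII bytes; -P enables \x escapes in GNU grep
grep -nP '[^\x00-\x7F]' submit.sh
```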
On 15-transferring-files: Add an explanation of data locality, and how not to do things like use scp on your MacBook in your office (or worse yet, at home on VPN) to copy files from the file storage at a facility to the HPC storage. Something like https://researchit.las.iastate.edu/move-your-files-faster, but more generic.
In the "Scheduling jobs" episode, under the "other types of jobs" subsection, there is a remark that environment variables are not available for interactive jobs launched via "srun". On the two SLURM clusters available to me, this is not true for either of them -- doing "srun" followed by "env" shows that in the shell session, the "SLURM_*" environment variables are set, and the explicitly-mentioned SLURM_CPUS_PER_TASK variable is set if the "-c" flag was provided to "srun".
This might be configurable, and vary from one installation to another?