Module 1 – HPC Foundations & Cluster Architecture¶
Time: 8:50 to 9:20 AM CST · 30 min total · ~20 min lecture · ~10 min hands-on
Learning Objectives¶
By the end of this module, you will be able to:
Explain why HPC exists and what problems it solves
Describe the major components of an HPC cluster
Navigate the cluster filesystem and use environment modules
Inspect hardware resources using command-line tools
Key Concepts¶
Why HPC?¶
Some problems are too large or too slow for a single computer:
Weather forecasting: Simulating the atmosphere over a global grid
Genomics: Aligning billions of DNA reads against a reference genome
AI training: Adjusting billions of model parameters over terabytes of data
Physics simulations: Modeling molecular dynamics, fluid flow, or cosmological structure
HPC clusters solve this by combining many computers (called nodes) into a single system and splitting the work across them.
Anatomy of an HPC Cluster¶
┌──────────────────────────────────────────────────────────┐
│                       USERS (SSH)                        │
│                            │                             │
│                     ┌──────▼──────┐                      │
│                     │ Login Node  │  ← You are here      │
│                     └──────┬──────┘                      │
│                            │                             │
│               ┌────────────┼────────────┐                │
│               │   High-Speed Network    │                │
│               │      (InfiniBand)       │                │
│               └──┬─────┬─────┬───────┬──┘                │
│                  │     │     │       │                   │
│               ┌──▼─┐ ┌─▼─┐ ┌─▼─┐  ┌──▼─┐                 │
│               │Node│ │...│ │...│  │Node│  ← Compute nodes│
│               │ 1  │ │   │ │   │  │ N  │    (CPUs + GPUs)│
│               └──┬─┘ └─┬─┘ └─┬─┘  └──┬─┘                 │
│                  │     │     │       │                   │
│               ┌──┴─────┴─────┴───────┴──┐                │
│               │     Shared Storage      │                │
│               │  (Parallel Filesystem)  │                │
│               └─────────────────────────┘                │
└──────────────────────────────────────────────────────────┘
Login nodes are where you:
Edit code, write scripts, manage files
Submit jobs to the scheduler
Do NOT run compute-intensive programs (everyone shares these)
Compute nodes are where your programs actually run:
Each compute node we'll use in this tutorial has 16 CPU cores + 1 AMD Instinct MI210 GPU
You access them by submitting jobs through the Slurm scheduler (Module 2; a one-line preview follows this list)
High-speed network (InfiniBand, RoCEv2) connects nodes so they can communicate fast – critical for MPI programs (Module 4).
Shared storage means your files in $HOME and /work1 are visible from every node.
You don’t need to copy files to each node.
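As a tiny preview of how you reach the compute nodes (Module 2 covers this properly), a single Slurm command can run a program on a compute node for you. The sketch below assumes the student partition is named mi2101x, the name used later in this module; the exact options you'll want are explained in Module 2.
srun -p mi2101x -n 1 hostname   # ask Slurm for one task on our partition and run `hostname` there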
Our Cluster: The AUP AI & HPC Cluster¶
The AUP AI & HPC Cluster has nodes containing multiple generations of AMD Instinct GPUs. For this tutorial, we’ll have access to the following resources:
| Component | Details |
|---|---|
| Login node CPUs | AMD EPYC 7V13 64-Core (2 sockets, 128 cores total) |
| Compute node CPUs | AMD EPYC (16 cores per virtual node) |
| Compute node GPUs | 1x AMD Instinct MI210 (64 GB HBM2e) |
| Compute node RAM | 64 GB |
| GPU software | ROCm 7.2.0 |
| Compiler | GCC 12.2.0 |
| MPI | OpenMPI 4.1.8 |
| Scheduler | Slurm |
| Student partition | mi2101x |
The Software Stack: Environment Modules¶
HPC clusters use environment modules to manage software. Instead of installing packages globally (like on a laptop), you load and unload modules to make specific software versions available.
module list # What's currently loaded?
module avail # What's available to load?
module show <name> # What does a module do?
module load <name> # Load a module
module unload <name> # Unload a module
On our cluster, the hpcfund module is loaded by default and provides the base
environment (GCC, OpenMPI, ROCm, cmake). Do not run module purge – it will
remove this base.
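For example, a typical load-verify-unload cycle looks like the sketch below. The module name and version are placeholders (what's actually installed on this cluster may differ), so check module avail first:
module avail gcc            # search for modules whose names match "gcc"
module load gcc/12.2.0      # hypothetical name/version -- substitute one that module avail lists
which gcc && gcc --version  # confirm which compiler is now on your PATH
module unload gcc/12.2.0    # put the environment back the way it was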
Programming Models (Preview)¶
| Model | Parallelism Level | Example | Module |
|---|---|---|---|
| OpenMP | Threads within a single node | | 3 |
| MPI | Processes across multiple nodes | | 4 |
| HIP | GPU threads (thousands) | | 5 |
You’ll learn all three today!
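To make the table a bit more concrete, here is a hedged sketch of how each model is typically compiled with the toolchain listed above (GCC, OpenMPI, ROCm). The source file names are placeholders; the real programs and flags come in Modules 3-5.
gcc -fopenmp omp_hello.c -o omp_hello     # OpenMP: one process, many threads on one node
mpicc        mpi_hello.c -o mpi_hello     # MPI: many processes, possibly across nodes
hipcc        hip_hello.cpp -o hip_hello   # HIP: host code plus GPU kernels for the MI210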
Hands-On Exercises (~10 min)¶
Core: Explore the Cluster¶
First, navigate to the exercises directory for this module:
cd module-01-hpc-foundations/exercises
Then run the guided exploration script, which walks you through key commands:
bash explore_cluster.sh
The script will pause between sections so you can read the output. Alternatively, you can run each command below individually.
Step 1: Where Are You?¶
hostname # What machine are you on?
whoami # What's your username?
echo $HOME # Your home directory
pwd # Current working directory
Step 2: Explore the Filesystem¶
ls $HOME # Your home directory contents
ls /work1 # Shared work directory
df -h $HOME # Disk space available
df -h $WORK        # Disk space in your work directory
Step 3: Check the Software Environment¶
module list # Currently loaded modules
module avail # All available modules
module show hpcfund # What does the base module provide?
Step 4: Inspect the Hardware (Login Node)¶
lscpu # CPU information
lscpu | grep "Model name" # Just the CPU model
lscpu | grep "CPU(s):" # Number of CPUs
free -h # Memory information
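To see how that CPU count breaks down, lscpu also reports sockets, cores per socket, and threads per core; multiplying them should reproduce the total (and match the hardware table above).
lscpu | grep -E "Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core"
nproc                        # total logical CPUs visible to you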
Step 5: Preview the GPU Software¶
which hipcc # Is the HIP compiler available?
hipcc --version # What version?
rocminfo | head -20 # ROCm system info (first 20 lines)
Note
The login node may or may not have GPUs. The compute nodes in the mi2101x
partition definitely do. You’ll see the full GPU information when you run jobs
in Module 2.
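If you're curious whether the login node exposes a GPU right now, the optional check below is harmless either way; rocm-smi ships with ROCm and will simply complain (or show an empty table) if no GPU is present.
rocm-smi                              # GPU status summary, if any GPU is visible here
rocminfo | grep -i "Marketing Name"   # names of the CPU/GPU agents ROCm can see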
Step 6: Look at the Cluster¶
sinfo # Show all partitions and node states
sinfo -p mi2101x # Just our partition
sinfo -p mi2101x -N -l # Detailed per-node listing
squeue # Current job queue (may be empty or busy)
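To pull out exactly the numbers the challenge below asks for, sinfo's output-format option is handy. The format specifiers are standard Slurm ones: %P partition, %D node count, %c CPUs per node, %m memory (MB), %G generic resources such as GPUs.
sinfo -p mi2101x -o "%P %D %c %m %G"   # partition, nodes, CPUs/node, memory, GPUs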
Challenge¶
Summarize the cluster. Based on what you found, write a short paragraph (3-4 sentences) describing the cluster: how many nodes are in our partition, how many CPUs per node, what GPU each node has, and what software is available.
AI agent check. Copy the output of rocminfo and ask your AI coding assistant to explain it. Does the agent correctly identify the CPU model? The GPU? Does it get anything wrong?
Tip
After completing Getting Started Step 4, launch the tutorial-provided coding agent (aider) from this directory:
bash ../../setup/launch_aider.sh
Inside aider, paste the rocminfo output and ask for an explanation. Type /exit when done. See the top-level README for more on the agent.
Quick Reference¶
| Command | What It Does |
|---|---|
| hostname | Print the name of the machine you're on |
| lscpu | Show CPU architecture information |
| free -h | Show memory usage in human-readable format |
| module list | List currently loaded modules |
| module avail | List all available modules |
| module show <name> | Show what a module does |
| sinfo | Show cluster partition and node information |
| squeue | Show the job queue |
Next up: Module 2 – Job Scheduling with Slurm