# Module 2 – Job Scheduling with Slurm
Time: 9:20 to 10:05 AM CST · 45 min total · ~15 min lecture · ~30 min hands-on
## Learning Objectives

By the end of this module, you will be able to:

- Explain why HPC clusters use batch scheduling
- Write a Slurm batch script with appropriate `#SBATCH` directives
- Submit, monitor, and cancel jobs using `sbatch`, `squeue`, and `scancel`
- Run quick commands on compute nodes with `srun`
## Key Concepts

### Why Batch Scheduling?

Imagine 80+ students all trying to run programs on 20 compute nodes at the same time. Without coordination, some nodes would be overloaded while others sit idle, and everyone’s programs would interfere with each other.

A batch scheduler solves this by:

- Accepting job requests and placing them in a queue
- Allocating compute nodes to jobs in a fair order
- Running each job in isolation on its assigned node(s)
- Releasing the node(s) when the job finishes
Our cluster uses Slurm (Simple Linux Utility for Resource Management), the most widely used scheduler in HPC.
### Slurm Commands at a Glance

| Command | Purpose | Example |
|---|---|---|
| `sbatch` | Submit a batch job | `sbatch first_job.sh` |
| `squeue` | View the job queue | `squeue -u $USER` |
| `scancel` | Cancel a job | `scancel <JOBID>` |
| `srun` | Run a command on a compute node | `srun --partition=mi2101x hostname` |
| `sinfo` | View partition/node status | `sinfo` |
| `scontrol` | Detailed info about a job | `scontrol show job <JOBID>` |
| `sacct` | Accounting info after completion | `sacct -j <JOBID>` |
### Anatomy of a Batch Script

A batch script is a regular shell script with special `#SBATCH` comment lines that
tell Slurm what resources you need:

```bash
#!/bin/bash
#SBATCH --job-name=my-job        # Name shown in squeue
#SBATCH --partition=mi2101x      # Which partition (group of nodes)
#SBATCH --nodes=1                # Number of nodes
#SBATCH --ntasks=1               # Number of tasks (processes)
#SBATCH --time=5:00              # Max wall time (MM:SS or HH:MM:SS)
#SBATCH --output=my-job_%j.out   # Stdout file (%j = job ID)
#SBATCH --error=my-job_%j.err    # Stderr file

# --- Your program runs below this line ---
echo "Hello from $(hostname) at $(date)"
```
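Before submitting, it can help to see at a glance exactly which directives a script requests. A minimal sketch, using a throwaway script at `/tmp/demo_job.sh` (a hypothetical path, not one of the tutorial files):

```bash
# Write a tiny batch script, then list only its #SBATCH directive lines
cat > /tmp/demo_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --time=5:00
echo hello
EOF

grep '^#SBATCH' /tmp/demo_job.sh   # prints the two #SBATCH lines
```

Because directives are just comments to the shell, `grep` is enough to preview what Slurm will see.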
Key directives:

| Directive | Meaning |
|---|---|
| `--partition` | Target node group. Use `mi2101x` for this tutorial. |
| `--nodes` | How many nodes. Use `1` unless you need multi-node MPI. |
| `--ntasks` | Number of MPI ranks (processes). Use `1` for serial programs. |
| `--cpus-per-task` | CPU cores per task. Useful for OpenMP threading. |
| `--time` | Maximum run time. Keep short (5–10 min) to share nodes fairly. |
| `--output` / `--error` | Where stdout/stderr go. |
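One relationship worth spelling out: the total CPU cores a job consumes is the product of nodes, tasks, and CPUs per task. A quick shell sketch of the arithmetic (the values here are illustrative, not recommended tutorial settings):

```bash
# Illustrative values only - not tutorial defaults
NODES=1
NTASKS=4
CPUS_PER_TASK=2

# Total cores = nodes x tasks x cpus-per-task
TOTAL_CORES=$((NODES * NTASKS * CPUS_PER_TASK))
echo "This request consumes $TOTAL_CORES CPU cores"   # prints 8
```

Keeping this product small is another way to share the cluster fairly during the hands-on session.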
### The Job Lifecycle

```text
sbatch submit → PENDING (waiting for nodes) → RUNNING (on a node) → COMPLETED
                                                                        ↓
                                                            output written to file
```

You can also cancel at any stage with `scancel`.
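If you ever script around this lifecycle, a common pattern is to poll the queue until the job disappears. A hypothetical helper, not part of Slurm itself; it only wraps the real `squeue -j <JOBID> -h` query (which prints one line while the job is pending or running and nothing once it has finished):

```bash
# wait_for_job: hypothetical helper that blocks until a job leaves the queue
# (COMPLETED, CANCELLED, or FAILED all end the wait).
wait_for_job() {
    local jobid=$1
    # -h suppresses the header; empty output means the job is gone
    while squeue -j "$jobid" -h 2>/dev/null | grep -q .; do
        sleep 5
    done
    echo "Job $jobid has left the queue"
}

# Usage: wait_for_job 12345
```

For interactive use during the exercises, simply re-running `squeue -u $USER` by hand is just as effective.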
## Hands-On Exercises (~30 min)

First, navigate to the exercises directory for this module:

```bash
cd module-02-slurm/exercises
```
### Exercise 1: Observe the Job Lifecycle (Core)

Your first batch script includes a sleep so the job stays running long enough
for you to practice monitoring it.

Step 1: Look at the template:

```bash
cat first_job.sh
```

Step 2: Submit it:

```bash
sbatch first_job.sh
```

Slurm will print something like `Submitted batch job 12345`. Note the job ID.

Step 3: Immediately check the queue:

```bash
squeue -u $USER
```

You should see your job in the PENDING or RUNNING state. Try these while it runs:

```bash
squeue -u $USER            # Your jobs
scontrol show job <JOBID>  # Detailed job info
squeue -p mi2101x          # All jobs on our partition
```

Step 4: Cancel the job (don’t wait for it to finish):

```bash
scancel <JOBID>
```

Verify it’s gone:

```bash
squeue -u $USER
```

Step 5: Submit it again and let it finish. Then read the output:

```bash
sbatch first_job.sh
# Wait ~70 seconds for it to complete...
squeue -u $USER             # Should disappear when done
cat first-job_<JOBID>.out   # Replace <JOBID> with your job ID
```

**Tip:** `ls -lt *.out | head` shows the most recent output files.
### Exercise 2: Compile and Run on a Compute Node (Core)

Now let’s do something more realistic: compile a C program and run it on a compute node.

Step 1: Look at the source code:

```bash
cat hello_compute.c
```

This program prints information about the compute node, including GPU details.

Step 2: Compile it on the login node (compiling is lightweight, so that’s OK):

```bash
gcc -o hello_compute hello_compute.c
```

Step 3: Look at the batch script that runs it:

```bash
cat submit_hello.sh
```

Step 4: Submit:

```bash
sbatch submit_hello.sh
```

Step 5: Once it completes, examine the output:

```bash
cat hello-compute_<JOBID>.out
```

Questions to answer:

- What is the hostname of the compute node? Is it different from the login node?
- How many CPU cores does the compute node report?
- Does it detect a GPU?
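To preview roughly what `hello_compute` gathers, you can collect similar information with standard shell commands. A sketch, assuming `rocm-smi` is only present where ROCm is installed, so on the login node you may see the fallback message:

```bash
# Rough stand-in for what hello_compute reports, using standard tools
echo "Host: $(hostname)"
echo "CPU cores: $(nproc)"

# rocm-smi exists only on nodes with the ROCm stack installed
if command -v rocm-smi >/dev/null 2>&1; then
    rocm-smi --showproductname
else
    echo "No ROCm tools here (expected off the GPU nodes)"
fi
```

Running this on the login node and comparing it with the batch job's output file makes the answers to the questions above concrete.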
### Exercise 3: Interactive Commands with srun (Core)

`srun` lets you run a single command on a compute node without writing a batch
script. Useful for quick tests.

```bash
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=1 hostname
```

Try a few more:

```bash
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=1 lscpu | grep "Model name"
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=1 rocminfo | head -30
```

**Note:** `srun` waits for a node, runs the command, and returns. If the partition is
busy, you may wait a moment.
### Exercise 4: Python Environment Checkpoint (Core)

In Getting Started Step 4, you submitted `setup/setup_venv.sh` as a Slurm job.
Now that you’ve seen `sbatch`, `squeue`, and output files, that command should
make more sense: it requested a compute node and installed the Python packages
used by the tutorial agent and the afternoon AI modules.

First, verify that the venv exists:

```bash
source "$WORK/sc26_venv/bin/activate"
python3 -c "import torch; print(f'PyTorch {torch.__version__} installed')"
deactivate
```

If that works, you’re done. If you skipped Getting Started Step 4 or the venv is
missing, create it now from this module’s exercises directory:

```bash
sbatch ../../setup/setup_venv.sh
```

This takes a few minutes. Check progress with:

```bash
squeue -u $USER
tail -f setup_venv_<JOBID>.out   # Watch the log (Ctrl+C to stop)
```

Once the job finishes, run the verification commands again:

```bash
source "$WORK/sc26_venv/bin/activate"
python3 -c "import torch; print(f'PyTorch {torch.__version__} installed')"
deactivate
```
## Challenge Exercises

### Challenge A: Parameter Sweep

Write a batch script that runs a “parameter sweep” – the same computation with different input values. Two template files are provided:

```bash
cat parameter_sweep.sh   # The batch script (shell loop)
cat sweep_compute.py     # The Python script (TODO: add the loop)
```

Fill in the TODO sections in both files and submit the batch script. You should see timing results for each value of N, showing how compute time grows.
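For orientation (without giving away the exercise), here is the general shape such a sweep loop takes in shell, with `seq` as a stand-in workload in place of the real `sweep_compute.py`:

```bash
# General shape of a parameter sweep: run the same workload for each N
# and time it. `seq` is a stand-in computation, not the real exercise.
results=()
for N in 1000 10000 100000; do
    start=$(date +%s%N)                  # nanoseconds (GNU date)
    seq "$N" > /dev/null                 # stand-in for the real computation
    end=$(date +%s%N)
    ms=$(( (end - start) / 1000000 ))
    results+=("N=$N:${ms}ms")
    echo "N=$N took ${ms} ms"
done
```

Your version should call the Python script inside the loop instead, so each N gets its own timing line in the job's output file.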
### Challenge B: Explore Resource Requests

Experiment with different Slurm directives and observe their effect:

```bash
# Request multiple tasks (for MPI later)
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=4 hostname

# Request multiple CPUs per task (for OpenMP later)
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=1 --cpus-per-task=8 \
    bash -c 'echo "I have $SLURM_CPUS_PER_TASK CPUs"'

# Check the GPU on the node (every mi2101x node has an MI210)
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=1 \
    bash -c 'echo "GPU node: $(hostname)"; rocm-smi --showproductname 2>/dev/null | head -10'
```

Question: What happens if you request `--ntasks=4` with `--nodes=1`? How many
times does `hostname` print?
## Quick Reference

| I want to… | Command |
|---|---|
| Submit a job | `sbatch script.sh` |
| See my jobs | `squeue -u $USER` |
| See all jobs on our partition | `squeue -p mi2101x` |
| Get details on a job | `scontrol show job <JOBID>` |
| Cancel a job | `scancel <JOBID>` |
| Cancel all my jobs | `scancel -u $USER` |
| Run a quick command on a node | `srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=1 <command>` |
| See node availability | `sinfo` |