Module 4 – Distributed-Memory Parallelism with MPI¶
Time: 11:10 AM to 12:00 PM CST · 50 min total · ~15 min lecture · ~35 min hands-on
Learning Objectives¶
By the end of this module, you will be able to:
Explain the message-passing model and how it differs from shared memory
Write an MPI program that distributes work across multiple processes (ranks)
Use point-to-point communication (MPI_Send/MPI_Recv)
Use collective operations (MPI_Reduce)
Key Concepts¶
Why MPI?¶
OpenMP threads share memory within a single node, but what if you need more compute power than one node provides? MPI (Message Passing Interface) lets you run a program across multiple processes – on the same node or across many nodes – that communicate by sending and receiving messages.
   Node 0                 Node 1
┌──────────┐           ┌──────────┐
│  Rank 0  │◄─────────►│  Rank 2  │
│  Rank 1  │ messages  │  Rank 3  │
└──────────┘ (network) └──────────┘
Each process is called a rank and has its own private memory. Ranks coordinate by explicitly sending and receiving data.
MPI Basics¶
Every MPI program follows this structure:
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                   // Start MPI

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     // My rank (0, 1, 2, ...)
    MPI_Comm_size(MPI_COMM_WORLD, &size);     // Total number of ranks

    // ... do work ...

    MPI_Finalize();                           // Shut down MPI
    return 0;
}
Key rule: Every rank executes the same program, but branches based on its rank number to do different work.
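For example, here is a minimal sketch of that branching pattern, where rank 0 plays a coordinator role and every other rank acts as a worker (the roles are chosen purely for illustration):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        // Only rank 0 takes this branch
        printf("Coordinator: running with %d ranks\n", size);
    } else {
        // All other ranks take this branch
        printf("Worker: I am rank %d\n", rank);
    }

    MPI_Finalize();
    return 0;
}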
Point-to-Point Communication¶
Send a message from one rank to another:
if (rank == 0) {
    int data = 42;
    MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // Send to rank 1
} else if (rank == 1) {
    int data;
    MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Rank 1 received: %d\n", data);               // Prints 42
}
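The same calls work for whole buffers: the count and datatype arguments describe how many elements to transfer. A minimal sketch (the array length N is arbitrary, chosen only for illustration):

enum { N = 100 };
double buf[N];

if (rank == 0) {
    for (int i = 0; i < N; i++) buf[i] = 0.5 * i;        // fill the buffer
    MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);  // send N doubles to rank 1
} else if (rank == 1) {
    MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Rank 1 received %d doubles, last = %f\n", N, buf[N - 1]);
}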
Collective Operations¶
Instead of manually sending between every pair of ranks, MPI provides collectives that coordinate all ranks at once:
| Operation | What it does |
|---|---|
| MPI_Bcast | One rank sends the same data to all others |
| MPI_Reduce | All ranks contribute a value; one rank gets the combined result |
| MPI_Allreduce | Like Reduce, but every rank gets the result |
| MPI_Scatter | One rank distributes different pieces to each rank |
| MPI_Gather | Each rank sends a piece; one rank collects them all |
Example – sum a value across all ranks:
double local_sum = compute_my_part(rank);
double global_sum;
MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("Global sum = %f\n", global_sum);
Compiling and Running¶
mpicc -O2 -o program program.c # Compile with MPI
srun --ntasks=4 ./program # Run with 4 ranks
On our cluster, srun automatically launches one copy of your program per
rank, configured via --ntasks.
Hands-On Exercises (~35 min)¶
First, navigate to the exercises directory for this module:
cd module-04-mpi/exercises
Step 0: Look at the Example¶
Examine the MPI hello-world program:
cat ../examples/mpi_hello.c
Compile and run it on a compute node with 4 ranks:
mpicc -O2 -o mpi_hello ../examples/mpi_hello.c
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=4 ./mpi_hello
Try running with 1, 4, 8, and 16 ranks. Notice that each rank reports a different ID, but they may print in any order (that’s normal for parallel programs).
Exercise 1: Parallel Sum with MPI_Reduce (Core)¶
Open the exercise template:
cat parallel_sum.c
The program divides a large array across ranks: each rank computes the sum of
its portion, then MPI_Reduce combines the partial sums into a global total.
There are 4 TODOs to fill in:
TODO 1: Initialize MPI
TODO 2: Get the rank and size
TODO 3: Use MPI_Reduce to sum the partial results
TODO 4: Finalize MPI
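If you get stuck, the overall shape is close to the lecture skeleton. A minimal sketch of the four TODOs (variable names are illustrative and may not match the template exactly):

MPI_Init(&argc, &argv);                              // TODO 1: start MPI

int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);                // TODO 2: my rank ...
MPI_Comm_size(MPI_COMM_WORLD, &size);                //         ... and the total number of ranks

double local_sum = /* sum of this rank's portion */ 0.0;
double global_sum = 0.0;
MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE,   // TODO 3: combine the partial sums
           MPI_SUM, 0, MPI_COMM_WORLD);              //         into global_sum on rank 0

MPI_Finalize();                                      // TODO 4: shut down MPI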
After filling in the TODOs, compile and run:
mpicc -O2 -o parallel_sum parallel_sum.c -lm
Submit the batch script:
sbatch submit_mpi.sh
Check the output:
cat mpi-sum_<JOBID>.out
Questions:
Does the parallel result match the serial reference?
How does the time change from 1 rank to 16 ranks?
What would happen if you forgot the
MPI_Reduce and just printed each rank’s partial sum?
Exercise 2: Interactive MPI (Core)¶
Try running with different rank counts directly:
mpicc -O2 -o parallel_sum parallel_sum.c -lm
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=2 ./parallel_sum
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=8 ./parallel_sum
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=16 ./parallel_sum
Challenge A: Ring Communication¶
In a ring pattern, each rank sends data to the next rank and receives from the previous rank, forming a circle:
Rank 0 → Rank 1 → Rank 2 → Rank 3
  ▲                          │
  └──────────────────────────┘
Open the template:
cat ring.c
Fill in the TODOs to implement the ring using MPI_Send and MPI_Recv. Each
rank sends its rank number to the next; after going around the ring, rank 0
should have received size - 1 (the last rank’s number).
Tip
Think carefully about who sends first to avoid deadlock (all ranks waiting to receive before anyone sends).
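One classic deadlock-free ordering is to let rank 0 send first while every other rank receives first, so the message moves around the ring one hop at a time (a minimal sketch with illustrative variable names; MPI_Sendrecv is another option):

int next = (rank + 1) % size;              // neighbor I send to
int prev = (rank - 1 + size) % size;       // neighbor I receive from
int send_val = rank, recv_val = -1;

if (rank == 0) {
    // Rank 0 starts the ring: send first, then wait for the last rank
    MPI_Send(&send_val, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    MPI_Recv(&recv_val, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
    // Everyone else waits for the previous rank before sending on
    MPI_Recv(&recv_val, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(&send_val, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
}
printf("Rank %d received %d from rank %d\n", rank, recv_val, prev);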
Challenge B: Ask the Agent¶
Describe this problem to your AI agent: “Write an MPI program in C where each rank generates a random number, and rank 0 prints the minimum, maximum, and average across all ranks.” Review the generated code for:
Does it call MPI_Init and MPI_Finalize?
Does it use the right MPI datatypes?
Could it deadlock?
Does rank 0 print the correct results?
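For comparison, one correct approach (a sketch only, not necessarily what the agent will produce) uses three reductions onto rank 0; my_random_value() is a hypothetical helper standing in for however each rank generates its number:

double x = my_random_value();              // hypothetical: this rank's random number
double min_v, max_v, sum_v;
MPI_Reduce(&x, &min_v, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
MPI_Reduce(&x, &max_v, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Reduce(&x, &sum_v, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("min = %f  max = %f  avg = %f\n", min_v, max_v, sum_v / size);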
Quick Reference¶
| Function | Purpose |
|---|---|
| MPI_Init | Initialize MPI |
| MPI_Finalize | Shut down MPI |
| MPI_Comm_rank | Get my rank ID |
| MPI_Comm_size | Get total number of ranks |
| MPI_Send | Send data to a rank |
| MPI_Recv | Receive data from a rank |
| MPI_Reduce | Combine values across ranks |
| MPI_Bcast | Broadcast from one rank to all |
| MPI Datatype | C Type |
|---|---|
| MPI_INT | int |
| MPI_DOUBLE | double |
| MPI_FLOAT | float |
| MPI_CHAR | char |
| Reduction Op | Meaning |
|---|---|
| MPI_SUM | Sum |
| MPI_MAX | Maximum |
| MPI_MIN | Minimum |
Next up: Module 5 – GPU Programming with HIP