Module 5 – GPU Programming with HIP¶
Time: 12:45 to 1:45 PM CST · 60 min total · ~20 min lecture · ~40 min hands-on
Learning Objectives¶
By the end of this module, you will be able to:
Describe GPU architecture and how it differs from a CPU
Write a HIP kernel that runs on the AMD MI210 GPU
Manage data transfers between host (CPU) and device (GPU) memory
Choose thread and block dimensions for a kernel launch
Key Concepts¶
Why GPUs?¶
A CPU has a few powerful cores optimized for complex, sequential tasks. A GPU has thousands of simpler cores optimized for doing the same operation on many data elements at once.
CPU: 16 powerful cores GPU: thousands of lightweight cores
┌──┐┌──┐┌──┐┌──┐ ┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐
│ ││ ││ ││ │ ... (16) │ ││ ││ ││ ││ ││ ││ ││ ││ ││ │ ...
└──┘└──┘└──┘└──┘ └─┘└─┘└─┘└─┘└─┘└─┘└─┘└─┘└─┘└─┘
(thousands)
When your problem involves applying the same math to millions of data points (vector operations, matrix multiply, neural network layers), GPUs can be 10-100x faster than CPUs.
The AMD MI210¶
Our compute nodes each have an AMD Instinct MI210 GPU:
| Spec | MI210 |
|---|---|
| Compute Units | 104 |
| Stream Processors | 6656 |
| Memory | 64 GB HBM2e |
| Memory Bandwidth | 1.6 TB/s |
| Architecture | CDNA2 (gfx90a) |
HIP: Portable GPU Programming¶
HIP (Heterogeneous-compute Interface for Portability) is AMD's GPU programming framework. It's designed to be nearly identical to NVIDIA's CUDA, so code is portable across AMD and NVIDIA GPUs: the correspondence is mostly mechanical (`cudaMalloc` becomes `hipMalloc`, `cudaMemcpy` becomes `hipMemcpy`, and the `<<<...>>>` launch syntax is unchanged), and AMD's `hipify` tools can automate the renaming for existing CUDA code.
The Programming Model¶
A HIP program has two parts:
Host code (runs on CPU): manages memory, launches kernels
Device code (runs on GPU): the kernel function that runs in parallel
Host (CPU) Device (GPU)
────────── ────────────
1. Allocate GPU memory
2. Copy data to GPU ──►
3. Launch kernel ──► Each thread processes one element
4. Copy results back ◄──
5. Free GPU memory
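A minimal host-side sketch of this flow (illustrative only: error checking is omitted, and the kernel name `my_kernel` is a placeholder):

#include <hip/hip_runtime.h>
#include <vector>

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(double);
    std::vector<double> h_in(n, 1.0), h_out(n);   // host arrays

    double *d_in, *d_out;
    hipMalloc(&d_in, bytes);                      // 1. allocate GPU memory
    hipMalloc(&d_out, bytes);
    hipMemcpy(d_in, h_in.data(), bytes, hipMemcpyHostToDevice);   // 2. copy to GPU

    // 3. launch kernel here, e.g. my_kernel<<<blocks, threads>>>(d_in, d_out, n);

    hipMemcpy(h_out.data(), d_out, bytes, hipMemcpyDeviceToHost); // 4. copy back
    hipFree(d_in);                                // 5. free GPU memory
    hipFree(d_out);
    return 0;
}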
Threads, Blocks, and Grids¶
When you launch a kernel, you specify how many threads to create, organized into a hierarchy:
Grid (all threads)
├── Block 0: [Thread 0] [Thread 1] ... [Thread 255]
├── Block 1: [Thread 0] [Thread 1] ... [Thread 255]
├── Block 2: [Thread 0] [Thread 1] ... [Thread 255]
└── ...
Each thread computes its global index to know which data element to process:
int i = blockIdx.x * blockDim.x + threadIdx.x;
| Variable | Meaning |
|---|---|
| `threadIdx.x` | Thread index within its block (0 to `blockDim.x - 1`) |
| `blockIdx.x` | Block index within the grid |
| `blockDim.x` | Number of threads per block |
| `gridDim.x` | Number of blocks in the grid |
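For example, with 256 threads per block, thread 3 of block 2 computes i = 2 * 256 + 3 = 515 and therefore processes element 515 of the array.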
Kernel Syntax¶
// Kernel definition (runs on GPU)
__global__ void add_vectors(const double *a, const double *b, double *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
// Launch from host
int threads_per_block = 256;
int blocks = (n + threads_per_block - 1) / threads_per_block;
add_vectors<<<blocks, threads_per_block>>>(d_a, d_b, d_c, n);
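With n = 1000 and 256 threads per block, the ceiling division gives blocks = (1000 + 255) / 256 = 4, i.e. 1024 threads; the `if (i < n)` guard keeps the last 24 threads from touching memory out of bounds.

A kernel launch itself returns no error code, so a common pattern (a sketch, continuing the host snippet above; `fprintf` needs `<cstdio>`) is to check immediately after the launch:

hipError_t err = hipGetLastError();   // errors from the launch configuration
if (err != hipSuccess) {
    fprintf(stderr, "launch failed: %s\n", hipGetErrorString(err));
}
hipDeviceSynchronize();               // block until the kernel finishes; surfaces runtime errors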
Memory Management¶
| Function | Purpose |
|---|---|
| `hipMalloc` | Allocate memory on the GPU |
| `hipMemcpy` | Copy data between host and device |
| `hipFree` | Free GPU memory |
| `hipDeviceSynchronize` | Wait for all GPU work to finish |
Copy directions:
- `hipMemcpyHostToDevice` – CPU to GPU
- `hipMemcpyDeviceToHost` – GPU to CPU
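Note that the destination pointer comes first, as in `memcpy`. A round trip looks like this (pointer names illustrative):

hipMemcpy(d_x, h_x, bytes, hipMemcpyHostToDevice);   // host h_x -> device d_x
// ... kernel works on d_x ...
hipMemcpy(h_x, d_x, bytes, hipMemcpyDeviceToHost);   // device d_x -> host h_x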
Hands-On Exercises (~40 min)¶
First, navigate to the exercises directory for this module:
cd module-05-hip/exercises
Step 0: Look at the Example¶
Examine the complete vector addition program:
cat ../examples/vector_add.cpp
Compile and run it:
hipcc -O2 -o vector_add ../examples/vector_add.cpp
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=1 ./vector_add
Study the code carefully. Note the pattern:

1. Allocate host arrays and initialize them
2. Allocate device arrays with `hipMalloc`
3. Copy input data to device with `hipMemcpy`
4. Launch the kernel with `<<<blocks, threads>>>`
5. Copy results back to host
6. Verify and free memory
Exercise 1: Vector Scale (Core)¶
Open the exercise template:
cat vector_scale.cpp
This program should multiply every element of a vector by a constant:
result[i] = alpha * input[i]
There are 4 TODOs to fill in:
TODO 1: Write the kernel function
TODO 2: Copy input data from host to device
TODO 3: Calculate grid dimensions and launch the kernel
TODO 4: Copy results from device back to host
After filling in the TODOs, compile and run:
hipcc -O2 -o vector_scale vector_scale.cpp
sbatch submit_hip.sh
Check the output:
cat hip-exercise_<JOBID>.out
The program verifies the results against a CPU reference. If everything is correct, you’ll see “PASSED”.
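If you get stuck on TODO 1, the kernel has the same shape as `add_vectors`. A sketch (the parameter names are ours; your template's may differ):

__global__ void scale(double alpha, const double *in, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = alpha * in[i];   // result[i] = alpha * input[i]
    }
}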
Exercise 2: Experiment with Block Sizes (Core)¶
Modify vector_scale.cpp (or the solution) to try different block sizes.
Override the `BLOCK_SIZE` define at compile time and rerun:
# Try block sizes: 64, 128, 256, 512
hipcc -O2 -DBLOCK_SIZE=64 -o vector_scale vector_scale.cpp
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=1 ./vector_scale
hipcc -O2 -DBLOCK_SIZE=512 -o vector_scale vector_scale.cpp
srun --partition=mi2101x --nodes=1 --time=2:00 --ntasks=1 ./vector_scale
Does the block size affect performance or correctness?
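With the bounds check in place, block size should not affect correctness, only performance. To measure it, you could wrap the launch in HIP events; a sketch (kernel and variable names assume the vector_scale host code, so adjust to your template):

hipEvent_t start, stop;
hipEventCreate(&start);
hipEventCreate(&stop);

hipEventRecord(start, 0);
scale<<<blocks, BLOCK_SIZE>>>(alpha, d_in, d_out, n);   // kernel under test
hipEventRecord(stop, 0);
hipEventSynchronize(stop);               // wait until the stop event has occurred

float ms = 0.0f;
hipEventElapsedTime(&ms, start, stop);   // elapsed milliseconds between the events
printf("kernel time: %.3f ms\n", ms);

hipEventDestroy(start);
hipEventDestroy(stop);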
Challenge A: Matrix-Vector Multiply on the GPU¶
Open the challenge template:
cat matvec_hip.cpp
Implement a HIP kernel for matrix-vector multiplication: y[i] = sum_j(A[i][j] * x[j])
Each thread computes one element of the output vector y. This means each thread must iterate over an entire row of the matrix.
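One possible kernel structure, assuming a row-major A stored as a flat array (the same layout as the serial loop in Challenge B; names are placeholders):

__global__ void matvec(const double *A, const double *x, double *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output row
    if (i < n) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            sum += A[i * n + j] * x[j];   // accumulate across row i
        }
        y[i] = sum;
    }
}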
Challenge B: Ask the Agent¶
Give your AI agent this serial C loop and ask it to convert it to a HIP kernel:
for (int i = 0; i < n; i++) {
    y[i] = 0.0;
    for (int j = 0; j < n; j++) {
        y[i] += A[i * n + j] * x[j];
    }
}
Review the generated code. Does the thread indexing make sense? Did it handle
the if (i < n) bounds check?
Quick Reference¶
| HIP Function | Purpose |
|---|---|
| `hipMalloc` | Allocate GPU memory |
| `hipFree` | Free GPU memory |
| `hipMemcpy` | Copy data (host↔device) |
| `hipDeviceSynchronize` | Wait for GPU to finish |
| `hipGetLastError` | Check for kernel launch errors |

| Kernel Syntax | Meaning |
|---|---|
| `__global__` | Declare a GPU kernel |
| `<<<blocks, threads>>>` | Launch the kernel |
| `threadIdx.x` | Thread ID within block |
| `blockIdx.x` | Block ID within grid |
| `blockDim.x` | Threads per block |

| Compile & Run | Purpose |
|---|---|
| `hipcc -O2 -o prog prog.cpp` | Compile HIP code |
| `srun --partition=mi2101x ... ./prog` | Run on a GPU node |