Module 6 – AI on HPC: Inference & Fine-Tuning¶
Time: 2:00 to 3:00 PM CST · 60 min total · ~20 min lecture · ~40 min hands-on
Learning Objectives¶
By the end of this module, you will be able to:
Explain the difference between training, fine-tuning, and inference
Run inference with a pre-trained language model on the MI210 GPU
Fine-tune a text classification model using LoRA (parameter-efficient fine-tuning)
Reason about GPU memory usage during inference vs. training
Key Concepts¶
What is AI/ML?¶
Machine learning is about finding patterns in data. Instead of writing explicit rules, you show the computer many examples and it learns the rules itself.
Traditional programming: Rules + Data → Answers
Machine learning: Data + Answers → Rules (Model)
A neural network is a mathematical function with millions (or billions) of adjustable numbers called weights. Training adjusts these weights so the function produces good outputs for given inputs.
Neural Networks in a Nutshell¶
Input → [Layer 1] → [Layer 2] → ... → [Layer N] → Output
            ↑           ↑                  ↑
         weights     weights            weights
Forward pass: data flows through layers to produce a prediction
Loss: measures how wrong the prediction is
Backpropagation: computes how to adjust each weight to reduce the loss
Gradient descent: actually updates the weights
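A minimal PyTorch sketch of one training step; the tiny network and random data here are placeholders just to show the four steps above:

```python
import torch
import torch.nn as nn

# Toy two-layer network and random data, purely to illustrate the four steps
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(8, 16)           # a batch of 8 example inputs
targets = torch.randint(0, 2, (8,))   # the "right answers" for those inputs

outputs = model(inputs)               # forward pass: produce predictions
loss = loss_fn(outputs, targets)      # loss: measure how wrong the predictions are
loss.backward()                       # backpropagation: compute gradients for every weight
optimizer.step()                      # gradient descent: nudge the weights to reduce the loss
optimizer.zero_grad()                 # clear gradients before the next batch
```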
Inference vs. Training vs. Fine-Tuning¶
| | Inference | Training | Fine-Tuning |
|---|---|---|---|
| What | Use a trained model to make predictions | Train a model from scratch | Adapt a pre-trained model to a new task |
| Weights | Frozen (no changes) | Random → learned | Pre-trained → adjusted |
| Data needed | Just the input | Massive dataset | Small task-specific dataset |
| Compute cost | Low | Very high | Moderate |
| GPU memory | Model weights only | Weights + gradients + optimizer states | Less than full training |
Parameter-Efficient Fine-Tuning (LoRA)¶
Full fine-tuning updates every weight in the model, which requires enormous memory. LoRA (Low-Rank Adaptation) freezes the original weights and adds small trainable “adapter” matrices:
Original weight matrix W (frozen)
+
Low-rank adapter: A × B (trainable, much smaller)
=
Effective weight: W + A × B
For a model with 110M parameters, LoRA might only train ~0.5M parameters – a 200x reduction. This means fine-tuning fits on a single GPU and runs fast.
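As a rough sketch of what this looks like with the peft library (the rank, alpha, and target module names below are illustrative; the exercise script defines its own configuration):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

# Illustrative settings; finetune_lora.py sets its own values
base_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
lora_config = LoraConfig(
    task_type="SEQ_CLS",                # sequence classification task
    r=8,                                # rank of the A × B adapter matrices
    lora_alpha=16,                      # scaling factor for the adapter update
    target_modules=["q_lin", "v_lin"],  # attention projections to adapt (DistilBERT layer names)
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()      # prints trainable vs. total parameter counts
```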
Why GPUs for AI?¶
Neural networks are mostly matrix multiplications – exactly the kind of massively parallel operation GPUs excel at:
The MI210 has 6656 stream processors running in parallel
64 GB HBM2e with 1.6 TB/s bandwidth for fast data access
This is why GPUs make inference 10-100x faster than CPUs
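A quick way to see this for yourself (a sketch; the matrix size is arbitrary, and ROCm GPUs show up through PyTorch's torch.cuda API):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # ROCm GPUs appear as "cuda" in PyTorch
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

start = time.time()
c = a @ b                        # one large matrix multiplication
if device == "cuda":
    torch.cuda.synchronize()     # wait for the GPU kernel to finish before stopping the clock
print(f"{device}: 4096x4096 matmul took {time.time() - start:.4f} s")
```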
Hands-On Exercises (~40 min)¶
First, navigate to the exercises directory for this module:
cd module-06-ai-inference-finetuning/exercises
Important
All exercises run as batch jobs that produce output files. Submit your script, wait for it to complete, then examine the output while others use the nodes.
Step 0: Verify Your Python Environment¶
You should already have the venv from Getting Started Step 4 (checked again in Module 2). Verify it before starting the AI exercises:
source "$WORK/sc26_venv/bin/activate"
python3 -c "import torch; print(f'PyTorch {torch.__version__}, ROCm: {torch.cuda.is_available()}')"
If the venv is missing, set it up now from this module’s exercises directory:
sbatch ../../setup/setup_venv.sh
# Wait for the job to finish, then activate it:
source "$WORK/sc26_venv/bin/activate"
Note
The venv lives under $WORK (your /work1/<project>/<username>/ directory)
because /work1 has much more storage space than $HOME. The $WORK
environment variable is set automatically when you log in.
Exercise 1: Your First Inference (Core)¶
Look at the simple inference example:
cat ../examples/simple_inference.py
This script loads a small pre-trained text generation model and generates responses to a few prompts. Examine how it:
Loads the model and tokenizer
Moves the model to the GPU
Tokenizes input text
Generates output tokens
Decodes back to text
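For reference, a minimal sketch of those five steps (the model name and prompt are stand-ins; simple_inference.py chooses its own):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"                                      # stand-in; the example picks its own model
tokenizer = AutoTokenizer.from_pretrained(model_name)          # load the matching tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)       # load the pre-trained model

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)                                       # move the model to the GPU

inputs = tokenizer("High performance computing is", return_tensors="pt").to(device)  # tokenize input text
outputs = model.generate(**inputs, max_new_tokens=30)          # generate output tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # decode back to text
```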
Submit it as a batch job:
sbatch submit_inference.sh
Check the output when it finishes:
cat inference_<JOBID>.out
Questions:
How fast does the model generate text (tokens/sec)?
How much GPU memory does the model use?
Do the generated responses make sense?
Exercise 2: Experiment with Inference (Core)¶
Open the inference exercise:
cat run_inference.py
There are 3 TODOs:
TODO 1: Load the model onto the GPU (choose the right device)
TODO 2: Set generation parameters (max tokens, temperature)
TODO 3: Add your own custom prompts to the prompt list
Edit the file, add your own prompts, and submit:
sbatch submit_inference.sh
Try varying the temperature parameter:
temperature=0.1 – very focused, repetitive
temperature=0.7 – balanced creativity
temperature=1.5 – wild and unpredictable
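To see why, here is a toy sketch of how temperature reshapes the next-token probabilities (the logits are made up, not real model output):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])   # made-up scores for four candidate tokens

for temperature in (0.1, 0.7, 1.5):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, [round(p, 3) for p in probs.tolist()])

# Low temperature piles probability onto the top token (focused, repetitive);
# high temperature flattens the distribution (more surprising choices).
```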
Exercise 3: Fine-Tune a Model with LoRA (Core)¶
Now let’s adapt a pre-trained model to a new task. We’ll fine-tune a text classification model (DistilBERT) on a sentiment analysis dataset using LoRA.
Look at the script:
cat finetune_lora.py
There are 3 TODOs:
TODO 1: Configure the LoRA adapter parameters
TODO 2: Write the training loop (forward pass, loss, backward, optimizer step)
TODO 3: Run evaluation and print accuracy
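If the training loop in TODO 2 is unfamiliar, here is the general shape of one (names like train_loader, num_epochs, and device are placeholders; match them to what finetune_lora.py actually defines):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)  # only the LoRA adapters have gradients
model.train()

for epoch in range(num_epochs):                  # placeholder variable
    for batch in train_loader:                   # placeholder DataLoader
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)                 # forward pass; HuggingFace models return a loss
        loss = outputs.loss                      #   when labels are included in the batch
        loss.backward()                          # backpropagation
        optimizer.step()                         # update the trainable (adapter) weights
        optimizer.zero_grad()
```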
Submit the fine-tuning job:
sbatch submit_finetune.sh
Check the output:
cat finetune_<JOBID>.out
Questions:
What accuracy does the model achieve before fine-tuning (random baseline)?
What accuracy after fine-tuning?
How long did training take?
How many parameters were actually trained (vs. total)?
Challenge A: LoRA Hyperparameters¶
Experiment with different LoRA settings in finetune_lora.py:
r=4 (rank 4) vs. r=16 (rank 16) vs. r=64 (rank 64)
lora_alpha=16 vs. lora_alpha=32
num_epochs=1 vs. num_epochs=3 vs. num_epochs=5
How do these affect final accuracy and training time?
Challenge B: GPU Memory Profiling¶
Add rocm-smi monitoring to your batch script to see GPU memory usage during
inference vs. training:
# Add to your batch script before and during model loading:
rocm-smi --showmeminfo vram
# ... load model ...
rocm-smi --showmeminfo vram
How much GPU memory does the model use during inference vs. fine-tuning?
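If you prefer to measure from inside Python rather than with rocm-smi, PyTorch's memory counters also work on ROCm builds (a sketch; add these prints around the steps you want to compare):

```python
import torch

# Report current and peak GPU memory use at this point in the script
print(f"allocated now: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"peak so far:   {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```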
Quick Reference¶
| Concept | Description |
|---|---|
| Inference | Running a trained model on new inputs (forward pass only) |
| Training | Learning weights from scratch on a large dataset |
| Fine-tuning | Adapting a pre-trained model to a specific task |
| LoRA | Parameter-efficient fine-tuning using low-rank adapters |
| Tokenizer | Converts text to/from numerical tokens the model understands |
| Temperature | Controls randomness in text generation (0 = deterministic, >1 = creative) |
| PyTorch / HuggingFace | Purpose |
|---|---|
| `AutoModelForCausalLM.from_pretrained(...)` | Load a pre-trained text generation model |
| `AutoTokenizer.from_pretrained(...)` | Load the matching tokenizer |
| `model.to("cuda")` | Move model to GPU |
| `model.generate(...)` | Generate text |
| `get_peft_model(model, lora_config)` | Wrap model with LoRA adapters |
Next up: Module 7 – AI Agents & Capstone