Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Inference has two phases with opposite performance characteristics, and conflating them is the root of most serving confusion. Prefill processes your entire prompt in one parallel pass — compute-heavy, fast per token. Decode generates output tokens one at a time — memory-bandwidth-heavy, slow per token. They have different bottlenecks, different optimizations, and different cost profiles. Almost every serving technique in this course targets one phase or the other, so knowing which is which is foundational. The single most clarifying idea in LLM serving is 'prefill is parallel, decode is sequential'.
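One way to see why the two bottlenecks differ is a back-of-the-envelope roofline check. The sketch below is a rough illustration, not a profiler. It assumes a dense model that costs about 2 FLOPs per parameter per token, 2-byte weights that are read once per forward pass, and a single request. It ignores KV-cache and activation traffic. The function names and hardware numbers (300 TFLOP/s, 2 TB/s) are hypothetical and do not describe any specific GPU.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and weights read once per pass,
    so the parameter count cancels out of the ratio.
    """
    return 2 * tokens_per_pass / bytes_per_param


def bound(tokens_per_pass, peak_flops=300e12, mem_bandwidth=2e12):
    """Classify a forward pass using illustrative (assumed) hardware numbers."""
    ridge = peak_flops / mem_bandwidth  # FLOPs/byte needed to saturate compute
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute-bound"
    return "memory-bandwidth-bound"


print("prefill, 2000 prompt tokens:", bound(2000))
print("decode, 1 token per step:   ", bound(1))
```

Under these assumptions a 2000-token prefill pass sits well above the ridge point, while a one-token decode step sits far below it. That is the asymmetry described above.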
The demo contrasts the two phases: prefill runs the whole prompt through the model at once (parallelizable across all prompt tokens), while decode runs one token at a time (each depends on the last). This asymmetry explains TTFT vs. TPOT, why long prompts are cheap-ish, and why long outputs are expensive.
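To make the TTFT vs. TPOT split concrete, here is a minimal latency sketch. It treats TTFT as the prefill time (prefill produces the first output token) and charges one TPOT for each remaining output token. The function name, the prefill throughput (10,000 tokens/s), and the TPOT (20 ms) are made-up illustrative values, not measurements.

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tok_per_s=10_000.0, tpot_s=0.02):
    """Rough end-to-end latency under assumed (hypothetical) rates."""
    # Prefill handles the whole prompt in one pass and yields the first token.
    ttft_s = prompt_tokens / prefill_tok_per_s
    # Every later token needs its own sequential decode step.
    decode_s = max(output_tokens - 1, 0) * tpot_s
    return {"ttft_s": ttft_s, "decode_s": decode_s, "total_s": ttft_s + decode_s}


long_prompt = estimate_latency(prompt_tokens=2000, output_tokens=200)
long_output = estimate_latency(prompt_tokens=200, output_tokens=2000)
print(long_prompt)
print(long_output)
```

With these assumed rates, the 2000-token prompt adds about 0.2 s of TTFT, while 200 output tokens add roughly 4 s of decode. Swapping the two lengths pushes the total to about 40 s, which is why long outputs hurt more than long prompts.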
Use these three prompts in order. Each builds on the one before.
What are the prefill and decode phases of LLM inference, and how do they differ?
Explain why prefill is compute-bound and parallel while decode is memory-bandwidth-bound and sequential, and what that means for latency.
Given the prefill/decode asymmetry, explain how it determines TTFT vs. TPOT, why long outputs cost more than long prompts, and which optimizations target each phase.
# PREFILL: process all N prompt tokens in ONE parallel forward pass.
# - GPU is compute-bound (lots of matmuls, high utilization)
# - produces the first output token + the KV cache for the prompt
# - time ~ proportional to prompt length, but parallel -> fast per token
# DECODE: generate output tokens one at a time, each a forward pass over 1 new token.
# - GPU is memory-bandwidth-bound (must read all weights + KV cache per token)
# - each step depends on the previous -> sequential, can't parallelize within a request
# - time ~ proportional to OUTPUT length
def cost_profile(prompt_tokens, output_tokens):
    # Toy heuristic: the 20-token threshold is illustrative and ignores prompt_tokens.
    return {"prefill_passes": 1,              # all prompt tokens at once
            "decode_passes": output_tokens,   # one per output token
            "dominant_latency": "decode (sequential)" if output_tokens > 20 else "prefill"}

print(cost_profile(2000, 200))
# Run with: python3 main.py