Capstok — learn by doing

Why this matters

Inference has two phases with opposite performance characteristics, and conflating them is the root of most serving confusion. Prefill processes your entire prompt in one parallel pass: compute-heavy, fast per token. Decode generates output tokens one at a time: memory-bandwidth-heavy, slow per token, because every step rereads the model weights and the KV cache to produce a single token. The two phases have different bottlenecks, different optimizations, and different cost profiles. Almost every serving technique in this course targets one phase or the other, so knowing which is which is foundational. The single most clarifying idea in LLM serving is 'prefill is parallel, decode is sequential'.

Demo

The demo contrasts the two phases: prefill runs the whole prompt through the model at once (parallelizable across all prompt tokens), while decode runs one token at a time (each token depends on the one before it). This asymmetry explains why TTFT and TPOT are measured separately, why long prompts are relatively cheap per token, and why long outputs are expensive.
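
To make the asymmetry concrete, here is a toy latency model. It is only a sketch: the function name `estimate_latency` and both per-token costs are made-up illustrative values, not measurements from any real model or GPU. It assumes prefill time grows linearly with prompt length and that every output token after the first costs one decode step.

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_ms_per_token=0.2, decode_ms_per_token=20.0):
    """Toy model of request latency. Both per-token costs are hypothetical."""
    # Prefill: one parallel pass over the prompt, which also yields the first token.
    ttft_ms = prompt_tokens * prefill_ms_per_token
    # Decode: one sequential step for each output token after the first.
    decode_steps = max(output_tokens - 1, 0)
    decode_ms = decode_steps * decode_ms_per_token
    return {"ttft_ms": ttft_ms, "decode_ms": decode_ms,
            "total_ms": ttft_ms + decode_ms}

# Huge prompt, one-token answer: all of the time is prefill.
long_prompt = estimate_latency(prompt_tokens=4000, output_tokens=1)
# Short prompt, long answer: almost all of the time is decode.
long_output = estimate_latency(prompt_tokens=50, output_tokens=500)

print(long_prompt)
print(long_output)
```

Under these assumed costs, a 4000-token prompt still finishes faster than a 500-token answer, because the prompt is processed in one pass while the answer needs hundreds of sequential steps.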

Try it yourself

Map prefill→TTFT (time to first token) and decode→TPOT (time per output token); you'll formalize both in Module 6.
Explain why a huge prompt with a one-word answer is dominated by prefill, while a short prompt with a long answer is dominated by decode.
Reason about which phase benefits from more compute (prefill) vs. more memory bandwidth (decode).
Predict what 'chunked prefill' (Module 9) does and why it exists, given prefill is compute-heavy.
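
One way to reason about the compute vs. memory-bandwidth exercise is a back-of-the-envelope arithmetic-intensity estimate. The sketch below is a simplification under stated assumptions: it looks at a single square weight matrix, counts only the bytes needed to read the weights (ignoring activations and the KV cache), and uses hypothetical hardware numbers (`peak_flops`, `peak_bytes_per_s`) that do not describe any specific GPU.

```python
def arithmetic_intensity(tokens_per_pass, d_model=4096, bytes_per_weight=2):
    """FLOPs performed per byte of weights read, for one d_model x d_model matmul."""
    flops = 2 * tokens_per_pass * d_model * d_model   # multiply + add per weight per token
    weight_bytes = bytes_per_weight * d_model * d_model
    return flops / weight_bytes

def bound_by(tokens_per_pass, peak_flops=300e12, peak_bytes_per_s=2e12):
    """Compare intensity to the hardware's FLOPs-per-byte ratio (hypothetical numbers)."""
    ridge = peak_flops / peak_bytes_per_s
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute"
    return "memory bandwidth"

prefill_bound = bound_by(tokens_per_pass=2000)  # prefill: whole prompt in one pass
decode_bound = bound_by(tokens_per_pass=1)      # decode: one new token per pass

print("prefill is bound by", prefill_bound)
print("decode is bound by", decode_bound)
```

The weights are read once per pass regardless of how many tokens ride along, so a pass over many tokens amortizes that read and a pass over one token does not.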

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

What are the prefill and decode phases of LLM inference, and how do they differ?

2. Why it works (the mechanism)

Explain why prefill is compute-bound and parallel while decode is memory-bandwidth-bound and sequential, and what that means for latency.

3. Advanced — application & what's next

Given the prefill/decode asymmetry, explain how it determines TTFT vs. TPOT, why long outputs cost more than long prompts, and which optimizations target each phase.

# PREFILL: process all N prompt tokens in ONE parallel forward pass.
#   - GPU is compute-bound (lots of matmuls, high utilization)
#   - produces the first output token + the KV cache for the prompt
#   - time grows with prompt length, but the work is parallel -> fast per token

# DECODE: generate the remaining output tokens one at a time, each a forward pass over 1 new token.
#   - GPU is memory-bandwidth-bound (must read all weights + KV cache per token)
#   - each step depends on the previous -> sequential, can't parallelize within a request
#   - time ~ proportional to OUTPUT length

# Illustrative cutoff only (not a measured value): above this many output tokens,
# the sequential decode steps are assumed to dominate. Prompt length is ignored.
DECODE_DOMINANT_THRESHOLD = 20

def cost_profile(prompt_tokens, output_tokens):
    """Count forward passes per phase and guess which phase dominates latency."""
    return {
        "prefill_passes": 1,                          # all prompt tokens at once; yields the first output token
        "decode_passes": max(output_tokens - 1, 0),   # one per remaining output token
        "dominant_latency": ("decode (sequential)"
                             if output_tokens > DECODE_DOMINANT_THRESHOLD else "prefill"),
    }

print(cost_profile(2000, 200))

Run: python3 main.py
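
The two phases can also be sketched as a generation loop. This is a toy under loud assumptions: `toy_forward` is a hypothetical stand-in for a real model's forward pass (it just does arithmetic on token ids), and the "KV cache" is a plain list. What it does show accurately is the control flow: one pass over the whole prompt, then one pass per new token, with the cache growing each step.

```python
def toy_forward(new_tokens, kv_cache):
    """Hypothetical stand-in for a forward pass; not a real model."""
    kv_cache.extend(new_tokens)           # cache an entry for every token processed
    return (sum(kv_cache) + 1) % 100      # deterministic fake "next token"

def generate(prompt, max_new_tokens):
    kv_cache = []
    # Prefill: a single pass over all prompt tokens produces the first output token.
    output = [toy_forward(prompt, kv_cache)]
    passes = 1
    # Decode: each step feeds only the newest token and must wait for the previous step.
    while len(output) < max_new_tokens:
        output.append(toy_forward([output[-1]], kv_cache))
        passes += 1
    return output, passes, len(kv_cache)

tokens, passes, cache_len = generate(prompt=[1, 2, 3], max_new_tokens=4)
print(tokens, passes, cache_len)
```

Each decode step needs the token produced by the step before it, which is why the loop cannot be parallelized within a single request.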

Prefill vs. decode: two very different phases