What happens when you call an LLM

easy

Why this matters

Most people use LLM APIs for months without knowing what actually happens when they hit 'generate'. But every serving decision you'll make (batching, caching, GPU sizing, latency budgets) flows from one fact: generation is an autoregressive loop that runs the model once per output token. The model produces text one token at a time, re-reading everything so far at each step, and that mental model is what the rest of this course builds on. Without it, serving optimizations look like magic incantations; with it, they're obvious consequences.

Demo

The demo strips generation to its essence: a loop that feeds the current sequence to the model, gets a probability distribution over the next token, picks one, appends it, and repeats until the model produces a stop token or a length limit is reached. This is what an API call is doing behind the scenes.
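The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real model: toy_model, VOCAB, STOP_ID, and max_new_tokens are all hypothetical names invented for the sketch, and the "model" is a hard-coded rule that simply prefers the next word in a tiny vocabulary.

```python
import math

# Hypothetical six-word vocabulary; id 0 is the stop token.
VOCAB = ["<stop>", "the", "cat", "sat", "on", "mat"]
STOP_ID = 0


def toy_model(tokens):
    """Stand-in for one LLM forward pass (an assumption, not a real model).

    Takes the full token sequence and returns a probability
    distribution over VOCAB for the next token.
    """
    last = tokens[-1]
    logits = [0.0] * len(VOCAB)
    logits[(last + 1) % len(VOCAB)] = 3.0  # favor the next word in VOCAB
    # Softmax turns logits into probabilities.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_tokens, max_new_tokens=20):
    """Greedy autoregressive loop: one forward pass per generated token."""
    tokens = list(prompt_tokens)
    forward_passes = 0
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)  # feed the whole sequence so far
        forward_passes += 1
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if next_id == STOP_ID:  # the stop token ends generation
            break
        tokens.append(next_id)  # append and go around again
    return tokens, forward_passes


tokens, passes = generate([1])  # prompt: "the"
print(" ".join(VOCAB[t] for t in tokens))  # the cat sat on mat
print(passes)  # 5: four words plus the pass that produced <stop>
```

Note that toy_model receives the entire sequence on every iteration, which mirrors the re-reading behaviour the lesson describes.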

Try it yourself

Count the forward passes for a 200-token answer (it's 200, one per output token). This is why output length drives latency and cost.
Note the loop re-feeds the full sequence each step; that redundancy is exactly what the KV cache (Module 2) eliminates.
Replace argmax with sampling from the distribution and see why outputs become non-deterministic.
Reason about why a 2,000-token prompt + 50-token answer is cheaper per token than a 50-token prompt + 2,000-token answer, even though both involve 2,050 tokens in total.
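If you want to check your reasoning on these exercises, here is a rough sketch. It assumes a deliberately simplified cost model that counts forward passes and token positions re-read by a naive loop with no KV cache; it does not model real FLOPs, hardware, or API pricing. All function names are hypothetical.

```python
import random


def sequential_passes(prompt_len, output_len):
    """Forward passes that must run one after another.

    The whole prompt is consumed by the first pass, so only the
    output length adds sequential steps in this simplified model.
    """
    return output_len


def naive_positions_processed(prompt_len, output_len):
    """Token positions read by a naive loop with no KV cache.

    Step i re-reads the prompt plus the i tokens generated so far.
    """
    return sum(prompt_len + i for i in range(output_len))


def pick_greedy(probs):
    """Argmax: always returns the single most likely token id."""
    return max(range(len(probs)), key=probs.__getitem__)


def pick_sampled(probs, rng):
    """Draw a token id at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Exercise 1: a 200-token answer needs 200 passes.
print(sequential_passes(50, 200))  # 200

# Exercise 4: same 2,050 total tokens, very different work.
print(sequential_passes(2000, 50), naive_positions_processed(2000, 50))
# 50 101225
print(sequential_passes(50, 2000), naive_positions_processed(50, 2000))
# 2000 2099000

# Exercise 3: greedy picks repeat exactly; sampled picks can differ per run.
probs = [0.1, 0.6, 0.3]  # made-up distribution over three token ids
greedy = [pick_greedy(probs) for _ in range(5)]
rng = random.Random()  # unseeded on purpose, so results vary between runs
sampled = [pick_sampled(probs, rng) for _ in range(5)]
print(greedy)   # [1, 1, 1, 1, 1]
print(sampled)  # varies from run to run
```

In this model the long-answer case needs 40 times as many sequential passes and re-reads roughly 20 times as many token positions, even though the total token count is identical.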

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In plain terms, what physically happens when I call an LLM to generate text? What is the autoregressive loop?

2. Why it works (the mechanism)

Walk me through one step of generation: from the current token sequence to logits to the next token. Why does the model run once per output token?

3. Advanced — application & what's next

Given that generation is one forward pass per output token, explain why output length dominates latency and how that shapes serving decisions (batching, streaming, cost).