Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Most people use LLM APIs for months without knowing what physically happens when they hit 'generate'. But every serving decision you'll make — batching, caching, GPU sizing, latency budgets — flows from one fact: generation is an autoregressive loop that runs the model once per output token. Understanding that the model produces text one token at a time, re-reading everything so far each step, is the mental model the rest of this course builds on. Without it, serving optimizations look like magic incantations; with it, they're obvious consequences.
The demo strips generation to its essence: a loop that feeds the current sequence to the model, gets a score (logit) for every possible next token, picks one, appends it, and repeats until the model emits a stop token or the loop hits its token limit. This is what an API call is doing behind the scenes.
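To make the "scores, then pick one" step concrete, here is a minimal sketch of a single decoding step in plain Python. The four-token vocabulary and the logit values are invented for illustration; a real model produces one logit per entry in a vocabulary of many thousands of tokens.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up logits for the next position.
vocab = ["the", "cat", "sat", "<eos>"]
next_logits = [1.0, 3.0, 0.5, -1.0]

probs = softmax(next_logits)

# Greedy decoding: take the index with the highest probability.
next_id = max(range(len(probs)), key=probs.__getitem__)
next_token = vocab[next_id]
print(next_token, probs)
```

Greedy picking is only one option. Sampling from `probs` instead would give varied output, but the shape of the step stays the same: one set of logits in, one token out.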
Use these three in order. Each builds on the one before.
In plain terms, what physically happens when I call an LLM to generate text? What is the autoregressive loop?
Walk me through one step of generation: from the current token sequence to logits to the next token. Why does the model run once per output token?
Given that generation is one forward pass per output token, explain why output length dominates latency and how that shapes serving decisions (batching, streaming, cost).
# The autoregressive loop, conceptually (real serving does this on a GPU, batched):
def generate(model, tokenizer, prompt, max_new=50):
    ids = tokenizer.encode(prompt)
    for _ in range(max_new):
        logits = model.forward(ids)          # run the WHOLE model over the sequence
        next_logits = logits[-1]             # we only care about the last position
        next_id = int(next_logits.argmax())  # greedy: pick the most likely token
        ids.append(next_id)                  # append and loop again
        if next_id == tokenizer.eos_id:
            break
    return tokenizer.decode(ids)
# Key insight: one model.forward() per OUTPUT token. A 500-token answer = 500 forward passes.
python3 main.py
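The loop above needs a model and a tokenizer to run. As a rough sketch, the stand-ins below let you run it end to end and count forward passes. `ToyTokenizer` and `ToyModel` are hypothetical classes invented for this example, not part of any library: the "model" simply predicts the next id in a four-token vocabulary. The loop is repeated here with a plain-Python argmax so the block runs without NumPy or a GPU.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer over a tiny fixed vocabulary."""
    def __init__(self):
        self.vocab = ["<eos>", "a", "b", "c"]
        self.eos_id = 0

    def encode(self, text):
        return [self.vocab.index(word) for word in text.split()]

    def decode(self, ids):
        return " ".join(self.vocab[i] for i in ids)


class ToyModel:
    """Hypothetical stand-in for an LLM: always predicts (token id + 1) mod vocab size."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.forward_calls = 0  # how many times the "model" has been run

    def forward(self, ids):
        self.forward_calls += 1
        rows = []
        for tok in ids:  # one row of logits per input position
            row = [0.0] * self.vocab_size
            row[(tok + 1) % self.vocab_size] = 1.0
            rows.append(row)
        return rows


def generate(model, tokenizer, prompt, max_new=50):
    ids = tokenizer.encode(prompt)
    for _ in range(max_new):
        logits = model.forward(ids)   # one forward pass over the whole sequence
        next_logits = logits[-1]      # only the last position matters
        next_id = max(range(len(next_logits)), key=next_logits.__getitem__)  # greedy
        ids.append(next_id)
        if next_id == tokenizer.eos_id:
            break
    return tokenizer.decode(ids)


tokenizer = ToyTokenizer()
model = ToyModel(len(tokenizer.vocab))
text = generate(model, tokenizer, "a")
new_tokens = len(tokenizer.encode(text)) - 1  # everything after the 1-token prompt
print(text, "| new tokens:", new_tokens, "| forward passes:", model.forward_calls)
```

The count of new tokens and the count of forward passes come out equal, which is the whole point: a longer answer means proportionally more model runs, whatever the prompt length.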