Why transformers replaced everything — the context window insight

easy

Learn with your AI

Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.


Why this matters

Before transformers, sequence models (RNNs, LSTMs) processed tokens one at a time. Each step depended on a hidden state that had to carry all prior context, and information in that state faded over long sequences, making it hard for the model to relate a token to something 100 positions back. Transformers drop the recurrence: within a layer, every token attends directly to every other token in one parallel operation. This single architectural choice, all-pairs attention, is why transformers can relate distant parts of a long document directly, why their training parallelises well on GPUs, and why most major language models since 2018 (BERT, GPT, T5, LLaMA, Gemini) are transformers.
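The contrast can be sketched in a few lines of NumPy. This is an illustration only: the sizes, the random weights `W_h` and `W_x`, and the plain tanh recurrence are assumptions chosen for brevity, not a specific published RNN.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                      # hypothetical: 6 tokens, 4 features each
X = rng.standard_normal((n, d))

# RNN-style: one hidden state, updated token by token.
# Step t cannot start until step t-1 has finished.
W_h = rng.standard_normal((d, d)) * 0.5
W_x = rng.standard_normal((d, d)) * 0.5
h = np.zeros(d)
for t in range(n):
    h = np.tanh(h @ W_h + X[t] @ W_x)

# Attention-style: every pairwise score comes out of a single matrix product,
# so no token has to wait for another.
scores = X @ X.T / np.sqrt(d)    # shape (n, n)

print(h.shape, scores.shape)
```

The loop has n dependent steps, while the score matrix is one operation whose entries can all be computed at once.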

Demo

Self-attention computes a score between every pair of tokens simultaneously, which is what lets a transformer relate a pronoun to its antecedent 200 tokens earlier without any sequential pass. The demo builds the full Q·Kᵀ/√d_k → softmax → ·V operation in raw NumPy so every matrix shape is visible before any PyTorch abstraction hides it.
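A minimal sketch of that computation, assuming a 4-token, 3-dimensional input with identity projections (Q = K = V = X). Only token 0 = [1, 0, 0] is taken from the exercises below; the other rows of `X` and the helper names `softmax` and `attention` are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row max for numerical stability; the result is unchanged.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): one score per token pair
    weights = softmax(scores, axis=-1)   # (n, n): each row sums to 1
    output = weights @ V                 # (n, d_v): weighted mix of value rows
    return scores, weights, output

# Assumed input: 4 tokens, 3 dimensions. Rows 1-3 are made up for illustration.
X = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
])

Q = K = V = X                            # identity projections
scores, weights, output = attention(Q, K, V)

print("scores:\n", scores.round(3))
print("weights:\n", weights.round(3))
print("output:\n", output.round(3))
```

Each row of `weights` says how much one token draws from every token, and the matching row of `output` is that weighted average of the value vectors.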

Try it yourself

Print scores before the softmax. Notice that token 0 (which is [1,0,0]) attends most strongly to itself. Now make tokens 0 and 3 identical (X[3] = X[0]). Do they attend equally to each other?
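Set d_k = 1 so the scores are divided by 1 instead of √3, then rerun. The attention weights become more peaked, because larger scores push the softmax toward a one-hot distribution. This is why the 1/√d_k scaling matters: dot products grow with d_k, and the scaling keeps the softmax from saturating.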
Add a causal mask (GPT-style): set scores[i, j] = -1e9 for all j > i before softmax. Now each token can attend only to itself and to earlier tokens. Print the masked weights and confirm that every entry above the diagonal is ~0.
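Replace the identity projections with random weight matrices: with rng = np.random.default_rng(0), set W_Q = rng.standard_normal((3,3)); Q = X @ W_Q (and similarly for K and V, each with its own matrix). Now Q, K, V are distinct, as they are in a trained transformer, where these matrices are learned rather than random. Does the output change?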
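One way to check your answers to the last three exercises is sketched below. It assumes the same 4-token input as a stand-in for the demo's `X` (only token 0 = [1, 0, 0] comes from the lesson) and a fixed seed for the random projections, so your numbers may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Assumed input; rows 1-3 are illustrative.
X = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
])
n, d_k = X.shape
raw = X @ X.T                               # unscaled scores (identity projections)

# Scaling: compare the largest weight in each row with and without 1/sqrt(d_k).
w_scaled = softmax(raw / np.sqrt(d_k))
w_unscaled = softmax(raw)
print("max weight per row, scaled:  ", w_scaled.max(axis=1).round(3))
print("max weight per row, unscaled:", w_unscaled.max(axis=1).round(3))

# Causal mask: block every position j > i before the softmax.
future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
w_causal = softmax(np.where(future, -1e9, raw / np.sqrt(d_k)))
print("causal weights:\n", w_causal.round(3))

# Random projections standing in for learned W_Q, W_K, W_V.
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((d_k, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
out = softmax(Q @ K.T / np.sqrt(d_k)) @ V
print("output with random projections:\n", out.round(3))
```

With d_k = 3 the difference between scaled and unscaled weights is small; the effect grows with larger d_k, where unscaled dot products have a much wider spread.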

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain why transformers process all tokens in parallel while RNNs process them sequentially. What is the practical consequence for training speed and for modeling long-range dependencies?

2. Why it works (the mechanism)

Walk me through scaled dot-product attention step by step: what are Q, K, V, what does `Q @ K.T / sqrt(d_k)` compute, why does softmax turn scores into weights, and what does the final `weights @ V` produce? Use the 4-token example above as a concrete reference.

3. Advanced — application & what's next

The transformer's all-pairs attention has O(n²) memory and compute complexity in sequence length n. For n=1000 (a few pages of text), that's fine. For n=100,000 (a book), it's prohibitive. Three architectural modifications address this: sparse attention, linear attention, and sliding-window attention. For each one, what's the tradeoff vs full attention?
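To put rough numbers on the quadratic cost in that prompt, here is a back-of-the-envelope sketch. It assumes the full n × n score matrix is materialised in float32 (4 bytes per score) for a single head in a single layer; the function name `attention_matrix_bytes` is hypothetical.

```python
def attention_matrix_bytes(n, bytes_per_score=4):
    # One score per (query, key) pair: n * n entries.
    return n * n * bytes_per_score

for n in (1_000, 100_000):
    gb = attention_matrix_bytes(n) / 1e9
    print(f"n={n:>7,}: {gb:.3f} GB for one attention matrix")
```

Under these assumptions the matrix takes about 4 MB at n=1000 and about 40 GB at n=100,000, before multiplying by the number of heads and layers.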