Before transformers, sequence models (RNNs, LSTMs) processed tokens one at a time. Each step depended on a hidden state that had to carry all prior context, and the information in that state degraded over long sequences, making it hard for the model to relate a token to something 100 positions back. Transformers dropped the sequential recurrence: every token attends directly to every other token in one parallel operation. This single architectural choice, all-pairs attention, is why transformers can relate distant tokens directly, why their training parallelises well on GPUs, and why most major language models since 2018 (BERT, GPT, T5, LLaMA, Gemini) are transformers.
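The contrast between step-by-step recurrence and one-shot attention scores can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the weight matrices `W_x` and `W_h`, the `tanh` update, and the random toy embeddings are hypothetical stand-ins, not a specific RNN or transformer implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
X = rng.standard_normal((seq_len, d))      # toy token embeddings (assumed data)

# Recurrent style: one step per token; each step needs the previous hidden state.
W_x = rng.standard_normal((d, d)) * 0.5    # hypothetical input weights
W_h = rng.standard_normal((d, d)) * 0.5    # hypothetical recurrent weights
h = np.zeros(d)
for x_t in X:                              # seq_len strictly sequential steps
    h = np.tanh(x_t @ W_x + h @ W_h)       # all earlier context is squeezed into h

# Attention style: every pairwise score comes from a single matrix product.
scores = X @ X.T / np.sqrt(d)              # (seq_len, seq_len), no loop over positions

print("final hidden state shape:", h.shape)
print("pairwise score matrix shape:", scores.shape)
```

The loop cannot be parallelised across positions because step t needs the result of step t-1, whereas the score matrix is one matrix multiplication that a GPU can compute for all pairs at once.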
Self-attention computes a score between every pair of tokens simultaneously, which is what lets a transformer relate a pronoun to its antecedent 200 tokens earlier without any sequential pass. The demo builds the full Q·Kᵀ/√d_k → softmax → ·V operation in raw NumPy so every matrix shape is visible before any PyTorch abstraction hides it.
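Written out as a formula, the operation above is the standard scaled dot-product attention. The shape annotations use n for sequence length and d_v for the value dimension, symbols introduced here for illustration; in the demo n = 4 and d_k = d_v = 3.

```latex
\[
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
Q, K \in \mathbb{R}^{n \times d_k},\quad
V \in \mathbb{R}^{n \times d_v}
\]
```

The softmax is applied to each row of the n × n score matrix, so every row of weights sums to 1 and the output has shape n × d_v.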
Print scores before the softmax. Notice that token 0 (which is [1,0,0]) attends most strongly to itself. Now make tokens 0 and 3 identical (X[3] = X[0]). Do they attend equally to each other?
Set d_k = 1 (scale by 1 instead of √3). Rerun. The attention weights become more peaked (extreme). This is why the 1/√d_k scaling matters: it keeps the softmax from saturating.
Set scores[i, j] = -1e9 for all j > i before softmax. Now each token can only attend to itself and earlier tokens. Print the masked weights and confirm the upper triangle is ~0.
Add W_Q = rng.standard_normal((3,3)); Q = X @ W_Q (and similar for K, V), where rng = np.random.default_rng(0). Now Q, K, V are distinct, which is how real attention works. Does the output change?

Use these three prompts in order. Each builds on the one before.
In one paragraph, explain why transformers process all tokens in parallel while RNNs process them sequentially. What is the practical consequence for training speed and for modeling long-range dependencies?
Walk me through scaled dot-product attention step by step: what are Q, K, V, what does `Q @ K.T / sqrt(d_k)` compute, why does softmax turn scores into weights, and what does the final `weights @ V` produce? Use the demo's 4-token example as a concrete reference.
The transformer's all-pairs attention has O(n²) memory and compute complexity in sequence length n. For n=1,000 (a page or two of text), that's fine. For n=100,000 (a book), it's prohibitive. Name three architectural modifications that address this (sparse attention, linear attention, sliding window) and for each: what's the tradeoff vs full attention?
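To put rough numbers on the O(n²) claim, the sketch below counts the bytes needed for one dense n × n score matrix and compares it with the number of entries a sliding window would keep. It assumes float32 scores, a single head and layer, and a matrix that is fully materialised in memory. The helper names and the window size of 256 are illustrative choices, not values from the lesson.

```python
def score_matrix_bytes(n, bytes_per_value=4):
    """Bytes for one dense n x n score matrix (float32 by default)."""
    return n * n * bytes_per_value

def sliding_window_entries(n, window):
    """Count (i, j) pairs with |i - j| <= window, i.e. the entries a
    sliding-window pattern keeps instead of all n * n pairs."""
    return sum(min(n - 1, i + window) - max(0, i - window) + 1 for i in range(n))

for n in (1_000, 100_000):
    full_gb = score_matrix_bytes(n) / 1e9
    kept = sliding_window_entries(n, window=256)   # hypothetical window size
    print(f"n={n}: full matrix {full_gb} GB, window keeps {kept / (n * n):.4%} of entries")
```

Under these assumptions the full matrix grows from 4 MB at n=1,000 to 40 GB at n=100,000, while the windowed pattern grows only linearly in n. The price is that tokens farther apart than the window no longer attend to each other directly.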
import numpy as np
# Toy sequence: 4 tokens, each a 3-dim embedding
X = np.array([
    [1.0, 0.0, 0.0],  # token 0
    [0.0, 1.0, 0.0],  # token 1
    [0.0, 0.0, 1.0],  # token 2
    [0.5, 0.5, 0.0],  # token 3
])  # shape (4, 3): seq_len × d_model
d_k = X.shape[1] # key dimension
# Simplified: use X itself as Q, K, V (tied projections)
scores = X @ X.T / np.sqrt(d_k) # (4, 4) — every pair gets a score
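exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))  # subtract each row's max for numerical stability
weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)  # softmax over each row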
output = weights @ X # (4, 3) — weighted sum of all tokens
print("Attention weights (row i = how much token i attends to each token):")
print(weights.round(3))
print("\nOutput (each token is now a blend of all tokens, itself included):")
print(output.round(3))

Run with: python3 main.py
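Three of the try-it exercises (dropping the √d_k scaling, adding a causal mask, and using separate Q, K, V projections) can be sketched together as follows. The helpers `softmax` and `attention`, their `scale` and `causal` parameters, and the random seed are assumptions made for this sketch. The projection matrices are random rather than learned, so they only demonstrate the shapes and mechanics.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, scale=None, causal=False):
    """Scaled dot-product attention for square (n x n) score matrices."""
    d_k = K.shape[-1]
    scale = np.sqrt(d_k) if scale is None else scale
    scores = Q @ K.T / scale
    if causal:
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where j > i
        scores = np.where(future, -1e9, scores)             # block future positions
    weights = softmax(scores)
    return weights, weights @ V

X = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
])

# Scaling: dividing by 1 instead of sqrt(3) makes the weights more peaked.
w_scaled, _ = attention(X, X, X)
w_unscaled, _ = attention(X, X, X, scale=1.0)
print("largest weight, scaled vs unscaled:", w_scaled.max().round(3), w_unscaled.max().round(3))

# Causal mask: each token attends only to itself and earlier tokens.
w_causal, _ = attention(X, X, X, causal=True)
print("causal weights:\n", w_causal.round(3))

# Separate projections: Q, K, V are now distinct views of X.
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((3, 3)) for _ in range(3))
w_proj, out_proj = attention(X @ W_Q, X @ W_K, X @ W_V)
print("projected output:\n", out_proj.round(3))
```

In the causal case the first row of weights is [1, 0, 0, 0], because token 0 has nothing earlier to attend to, and the upper triangle of the weight matrix is zero.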