Gradient descent is just a loop

hard

Learn with your AI

Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.

Open in Claude Open in ChatGPT

Why this matters

Once you have gradients, 'training' is embarrassingly simple: compute the gradient, subtract a small multiple of it from each parameter, repeat. That's it. Every sophisticated optimizer (Momentum, Adam, AdamW, Lion) is a variation on this core loop. Understanding the vanilla loop means you'll understand AdamW when it shows up in Module 7 as a tweak, not a mystery.

Demo

Task: fit the line y = 2x + 3 to 100 noisy points by learning w and b from scratch with gradient descent. No framework, no autograd — we hand-derive the two gradients.

Loss: mean squared error. Per-sample gradient: dL/dw = 2(w*x + b - y) * x, dL/db = 2(w*x + b - y).

Update rule: w -= lr * mean(dL/dw), b -= lr * mean(dL/db). Run for 200 iterations and watch w approach 2 and b approach 3.

Try it yourself

Run the demo and watch w approach 2 and b approach 3. Note which value converges faster and hypothesize why.

Change lr from 0.01 to 1.0 and re-run. Observe the divergence: the updates overshoot and the loss explodes.

Change lr to 0.0001 and re-run for 200 steps. Does it converge? Increase steps to 10000 and compare.

Change the true function to y = 2x^2 + 3 and re-run the same fitter. Plot the final line — it's the best linear fit to a quadratic. Why is this the 'bias' of linear models?

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Define gradient descent in one paragraph. Explain each of the three knobs: learning rate, initialization, number of steps.

2. How it actually works (the mechanism)

Walk me through why we subtract (not add) the gradient, and why we scale by a learning rate. Use a 1D parabola and trace three steps.

3. Advanced — application & what's next

Explain why plain SGD is almost never used for LLMs. Compare to Momentum, RMSProp, Adam, and AdamW — what each fixes and what the optimizer state (memory overhead per parameter) looks like.