Gradients on paper — the chain rule is everything

medium


Why this matters

Every learning algorithm, from linear regression to GPT-5, is 'take a derivative and go downhill.' The derivative tells you which knobs to turn and how hard. The chain rule tells you how to compute derivatives through a long sequence of operations — which is exactly what a neural network is. Most people who 'don't get deep learning' are people who skipped the chain rule. Do not skip it.

Demo

The chain rule: if y = f(g(x)), then dy/dx = f'(g(x)) * g'(x). Read it left-to-right: the derivative of the outer function evaluated at the inner function's output, times the derivative of the inner function.
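A quick way to convince yourself the rule holds is to compare it with a numerical slope. The sketch below assumes an illustrative pair of functions that is not part of the lesson (f = sin as the outer function, g(x) = x^2 as the inner one) and an arbitrary test point x0 = 0.7.

```python
import math

def g(x):
    # Inner function (chosen for illustration): g(x) = x^2
    return x * x

def f(u):
    # Outer function (chosen for illustration): f(u) = sin(u)
    return math.sin(u)

def chain_rule_derivative(x):
    # f'(g(x)) * g'(x), with f'(u) = cos(u) and g'(x) = 2x
    return math.cos(g(x)) * 2 * x

def central_difference(fn, x, h=1e-6):
    # Numerical slope of fn at x; h is an assumed small step size
    return (fn(x + h) - fn(x - h)) / (2 * h)

x0 = 0.7  # arbitrary test point
analytic = chain_rule_derivative(x0)
numeric = central_difference(lambda x: f(g(x)), x0)
print("chain rule:", analytic)
print("numerical: ", numeric)
```

The two printed values should agree to many decimal places. If they did not, either the inner or the outer derivative would be wrong.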

Worked example you can do on paper right now. Let L = (w*x - y)^2 (the squared-error loss for a single example, with w learnable and x, y fixed).

Let u = w*x - y. Then L = u^2.
dL/du = 2u (power rule).
du/dw = x (the -y is a constant wrt w).
Chain: dL/dw = dL/du * du/dw = 2u * x = 2(w*x - y) * x.
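The same four steps can be mirrored in a few lines of Python and checked against a numerical slope. This is a minimal sketch: the values w = 1.5, x = 2.0, y = 1.0 and the step size h are arbitrary choices for illustration, not values from the lesson.

```python
def loss(w, x, y):
    # L = u^2 with u = w*x - y
    u = w * x - y
    return u * u

def dloss_dw(w, x, y):
    # Chain rule, one factor per step of the derivation
    u = w * x - y
    dL_du = 2 * u      # power rule
    du_dw = x          # -y is constant with respect to w
    return dL_du * du_dw

w, x, y = 1.5, 2.0, 1.0   # arbitrary illustrative values
h = 1e-6                  # assumed finite-difference step

analytic = dloss_dw(w, x, y)
numeric = (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)
print("analytic:", analytic)
print("numeric: ", numeric)
```

With these values u = 2, so the hand calculation gives 2 * 2 * 2 = 8, and the numerical estimate should land very close to that.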

That last expression is literally the "gradient" the optimizer uses to update w. One more example with a non-linearity follows in the next challenge; make sure you can do this one by hand without looking.
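To see what "update w" means in practice, here is a minimal gradient-descent sketch for this one-example loss. The starting weight, learning rate, and step count are assumptions chosen so the loop converges quickly. A real optimizer would average the gradient over many examples.

```python
def grad(w, x, y):
    # dL/dw = 2 * (w*x - y) * x, from the derivation above
    return 2 * (w * x - y) * x

x, y = 2.0, 5.0   # one fixed training example
w = 0.0           # assumed starting weight
lr = 0.1          # assumed learning rate

for step in range(20):
    # Move w a small step against the gradient (downhill on L)
    w = w - lr * grad(w, x, y)

final_loss = (w * x - y) ** 2
print("w:", w, "loss:", final_loss)
```

The loss is zero when w*x = y, that is w = 2.5 for this example, and the loop should settle there.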

Try it yourself

Given L = (w*x - y)^2 with w=3, x=2, y=5, compute L and dL/dw by hand. Answer: u = 3*2 - 5 = 1, so L = 1 and dL/dw = 2 * 1 * 2 = 4.
Compute d/dw of L = relu(w*x) * (w*x - y) using the product rule and chain rule. State where L is not differentiable.
For L = (sigmoid(w*x) - y)^2, derive dL/dw. Use sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
Write out the chain rule for a 3-layer MLP with weights W1, W2, W3. Which gradient expression has the most terms? Why do gradients 'vanish' in deep nets?
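Once you have an answer on paper, you can check it numerically without looking anything up. The helper below is a hypothetical sketch (the names numeric_grad and agrees, the step size, and the tolerance are all assumptions), demonstrated on L = w^3 so it does not give away any of the exercises.

```python
def numeric_grad(loss_fn, w, h=1e-6):
    # Central-difference estimate of dL/dw at w
    return (loss_fn(w + h) - loss_fn(w - h)) / (2 * h)

def agrees(hand_value, loss_fn, w, tol=1e-4):
    # True if a hand-derived gradient matches the numerical estimate
    scale = max(1.0, abs(hand_value))
    return abs(hand_value - numeric_grad(loss_fn, w)) <= tol * scale

def cube(w):
    # Demo loss, not one of the exercises: L = w^3, so dL/dw = 3*w^2
    return w ** 3

right = agrees(3 * 2.0 ** 2, cube, 2.0)   # correct hand answer
wrong = agrees(2 * 2.0, cube, 2.0)        # deliberately wrong answer
print(right, wrong)
```

To check an exercise, replace cube with a function of w that has x and y fixed, and pass in your own hand-computed value. Finite differences are unreliable right at a kink such as relu's, so pick a test point away from it.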

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

State the chain rule in one sentence, then give a worked example with numbers (no variables — just arithmetic).

2. How it actually works (the mechanism)

Walk me through computing dL/dw for L = (sigmoid(wx + b) - y)^2 on paper, step by step, using the chain rule. Explain each application.

3. Advanced — application & what's next

Explain why multiplying many small numbers in deep networks causes vanishing gradients. What architectural choices (residuals, layer norm, careful init) mitigate this, and why does each one work?
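The "many small numbers" in prompt 3 can be seen directly in a toy model. The sketch below assumes a chain of scalar sigmoid layers, h_next = sigmoid(w * h), with every weight set to 1. It is a deliberately simplified stand-in for a deep network, not a real architecture. Each layer contributes a factor sigmoid'(z) * w to the chain-rule product, and sigmoid'(z) is never larger than 0.25.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(depth, x=0.5, w=1.0):
    # Gradient of the final activation with respect to the input x,
    # built up as a chain-rule product with one factor per layer.
    h = x
    grad = 1.0
    for _ in range(depth):
        s = sigmoid(w * h)
        grad *= s * (1.0 - s) * w   # local derivative d(h_next)/d(h)
        h = s
    return grad

for depth in (1, 5, 10, 20):
    print(depth, input_gradient(depth))
```

The printed gradients shrink roughly geometrically with depth, which is the effect that residual connections, normalization, and careful initialization are meant to counteract.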