Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Nearly every modern learning algorithm, from linear regression fit by gradient descent to GPT-5, is 'take a derivative and go downhill.' The derivative tells you which knobs to turn and how hard. The chain rule tells you how to compute derivatives through a long sequence of operations, which is exactly what a neural network is. Most people who 'don't get deep learning' are people who skipped the chain rule. Do not skip it.
The chain rule: if y = f(g(x)), then dy/dx = f'(g(x)) * g'(x). Read it left-to-right: the derivative of the outer function evaluated at the inner function's output, times the derivative of the inner function.
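A quick way to convince yourself the formula is right is to compare it against a numerical slope. The sketch below is only an illustration: the choices f = sin and g(x) = x^2, the point x = 1.5, and the step size h are arbitrary assumptions, not part of the lesson.

```python
import math

# Outer function f and its derivative (illustrative choice: sin).
def f(u):
    return math.sin(u)

def f_prime(u):
    return math.cos(u)

# Inner function g and its derivative (illustrative choice: x squared).
def g(x):
    return x ** 2

def g_prime(x):
    return 2 * x

def chain_rule_derivative(x):
    # dy/dx = f'(g(x)) * g'(x): outer derivative at the inner output,
    # times the inner derivative.
    return f_prime(g(x)) * g_prime(x)

def numerical_derivative(x, h=1e-6):
    # Central-difference approximation of d/dx f(g(x)).
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)

x = 1.5
analytic = chain_rule_derivative(x)
numeric = numerical_derivative(x)
print("chain rule:", analytic)
print("numerical: ", numeric)
```

The two printed values should agree to several decimal places; if they do not, one of the hand-derived derivatives is wrong.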
Worked example you can do on paper right now. Let L = (w*x - y)^2 (squared-error loss for a single example, with w learnable and x, y fixed).
Let u = w*x - y. Then L = u^2.
dL/du = 2u (power rule).
du/dw = x (the -y is a constant wrt w).
dL/dw = dL/du * du/dw = 2u * x = 2(w*x - y) * x.
That last expression is literally the "gradient" the optimizer uses to update w. One more example with a non-linearity follows in the next challenge; make sure you can do this one by hand without looking.
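To check your paper answer, the derivation above can be sketched in a few lines of Python. The numbers w = 0.5, x = 3.0, y = 6.0 and the learning rate 0.05 are made-up values for illustration, and the update rule shown is plain gradient descent, assumed here as the simplest optimizer.

```python
def loss(w, x, y):
    # L = (w*x - y)^2
    return (w * x - y) ** 2

def grad_w(w, x, y):
    u = w * x - y      # inner function
    dL_du = 2 * u      # power rule
    du_dw = x          # -y is constant with respect to w
    return dL_du * du_dw  # chain rule: dL/dw = dL/du * du/dw

# Hypothetical values, chosen only so the arithmetic is easy to follow.
w, x, y = 0.5, 3.0, 6.0
learning_rate = 0.05

g = grad_w(w, x, y)           # 2 * (1.5 - 6.0) * 3.0 = -27.0
w_new = w - learning_rate * g # step downhill: 0.5 + 1.35 = 1.85

print("gradient:", g)
print("loss before:", loss(w, x, y))
print("loss after: ", loss(w_new, x, y))
```

The gradient is negative, so the update increases w, and the loss after the step is smaller than before. That is the whole 'go downhill' loop in miniature.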
Use these three in order. Each builds on the one before.
State the chain rule in one sentence, then give a worked example with numbers (no variables — just arithmetic).
Walk me through computing dL/dw for L = (sigmoid(wx + b) - y)^2 on paper, step by step, using the chain rule. Explain each application.
Explain why multiplying many small numbers in deep networks causes vanishing gradients. What architectural choices (residuals, layer norm, careful init) mitigate this, and why does each one work?
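If you want to check your answers to the second and third prompts after working them on paper, here is a minimal sketch. The input values are arbitrary assumptions. The second half is a deliberate simplification: it multiplies only the largest possible sigmoid slope (0.25, reached at z = 0) across layers and ignores the weights entirely, so treat it as intuition for vanishing gradients rather than a model of a real network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # L = (sigmoid(w*x + b) - y)^2
    return (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    z = w * x + b          # pre-activation
    a = sigmoid(z)         # activation
    dL_da = 2 * (a - y)    # power rule on the outer square
    da_dz = a * (1 - a)    # derivative of sigmoid
    dz_dw = x              # derivative of w*x + b with respect to w
    return dL_da * da_dz * dz_dw  # chain rule applied twice

# Hypothetical values for illustration.
w, b, x, y = 0.8, -0.2, 1.5, 1.0

analytic = grad_w(w, b, x, y)
h = 1e-6
numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
print("chain rule:", analytic)
print("numerical: ", numeric)

# Simplified vanishing-gradient intuition: each sigmoid layer contributes
# a factor of at most 0.25 to the product, so the product shrinks fast.
max_slope = 0.25
for depth in (5, 10, 20):
    print(depth, "layers ->", max_slope ** depth)
```

The first two printed numbers should match closely. The loop shows how quickly a product of small factors collapses toward zero as depth grows, which is the effect residual connections, normalization, and careful initialization are meant to counteract.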