Nobody differentiates LLMs by hand. We build a data structure — a graph of operations — and a generic backward pass that walks the graph applying the chain rule mechanically. This is called reverse-mode automatic differentiation, and it's what PyTorch's .backward() actually does. Building a 100-line version yourself (Karpathy's micrograd) is a rite of passage: once you've done it, no framework looks like magic anymore.
Below: a minimal Value class that records every operation as a node in a DAG. Calling .backward() walks the graph in reverse topological order, accumulating gradients using the chain rule at each node.
This is the same architectural idea as PyTorch's autograd — just stripped to the bone. Port it to your favorite language; the exercise pays off forever.
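Before trusting a class to do it, it can help to unroll one reverse pass by hand. The sketch below does that for z = (a + b) * c on plain floats. The input values and the variable names (s, ds, da and so on) are picked purely for illustration; nothing here is framework API.

```python
# Hand-unrolled reverse-mode pass for z = (a + b) * c.
# The inputs 2.0, 3.0, 4.0 are arbitrary illustrative values.
a, b, c = 2.0, 3.0, 4.0

# Forward pass: compute and keep every intermediate value.
s = a + b            # s = 5.0
z = s * c            # z = 20.0

# Backward pass: seed dz/dz = 1, then visit the nodes in reverse order.
dz = 1.0
# z = s * c  ->  dz/ds = c, dz/dc = s
ds = c * dz
dc = s * dz
# s = a + b  ->  ds/da = 1, ds/db = 1
da = 1.0 * ds
db = 1.0 * ds

print(z, da, db, dc)  # 20.0 4.0 4.0 5.0
```

Each backward line uses only the local derivative of one operation times the gradient already computed for its output. That is the whole trick; a graph class just automates the bookkeeping.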
Run the demo and confirm that dL/dw matches what you got on paper in the previous challenge.
Add a __pow__ operation so L = diff**2 works. Write the backward as self.grad += n * self.data**(n-1) * out.grad.
Build a small expression (for example z = (a+b)*c) and verify that a.grad, b.grad, c.grad match what you compute on paper.

Use these three in order. Each builds on the one before.
In one paragraph, explain what reverse-mode automatic differentiation is and why neural networks use it instead of forward-mode or symbolic differentiation.
Walk me through exactly what happens during `loss.backward()` in PyTorch: graph construction during forward, topological sort, and the per-op backward function. Use a 3-op example.
For a forward pass that takes T FLOPs, a reverse-mode backward pass takes ~2T FLOPs but needs to cache all intermediate activations. Explain the memory/compute tradeoff and how gradient checkpointing trades one for the other.
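To make the memory/compute tradeoff in the last prompt concrete, here is a toy sketch on a chain of scalar "layers". Everything in it is an assumption made for illustration: the layer is just sin(x), and the names layer, grad_store_all and grad_checkpointed are made up. It is not how PyTorch implements checkpointing, only the same idea in miniature: keep a few inputs, recompute the rest when the backward pass needs them.

```python
import math

def layer(x):
    # One illustrative "layer": y = sin(x), so dy/dx = cos(x).
    return math.sin(x)

def layer_grad(x):
    return math.cos(x)

def grad_store_all(x0, n):
    """d(output)/d(x0), caching the input of every layer (n cached values)."""
    inputs = []
    x = x0
    for _ in range(n):
        inputs.append(x)
        x = layer(x)
    g = 1.0
    for xi in reversed(inputs):
        g *= layer_grad(xi)
    return g, len(inputs)

def grad_checkpointed(x0, n, every):
    """Same gradient, caching only every `every`-th input during forward."""
    checkpoints = []                      # (layer index, input to that layer)
    x = x0
    for i in range(n):
        if i % every == 0:
            checkpoints.append((i, x))
        x = layer(x)
    g = 1.0
    peak = len(checkpoints)
    for start, x_start in reversed(checkpoints):
        end = min(start + every, n)
        # Recompute this segment's inputs from its checkpoint (extra forward work).
        segment = []
        x = x_start
        for _ in range(start, end):
            segment.append(x)
            x = layer(x)
        peak = max(peak, len(checkpoints) + len(segment))
        for xi in reversed(segment):
            g *= layer_grad(xi)
    return g, peak

g1, m1 = grad_store_all(0.5, 16)
g2, m2 = grad_checkpointed(0.5, 16, every=4)
print(math.isclose(g1, g2))  # True: same gradient either way
print(m1, m2)                # 16 8: fewer cached values, one extra forward per segment
```

The checkpointed version pays roughly one additional forward pass in exchange for holding about half as many values at its peak in this configuration; different spacing shifts that balance.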
# main.py: minimal scalar autograd (Karpathy-style)

class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # leaf nodes have nothing to propagate
        self._prev = set(_children)     # the nodes this value was computed from

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _bw():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _bw
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _bw():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _bw
        return out

    def backward(self):
        # Topologically sort the graph (post-order DFS), then run each node's
        # chain-rule step from the output back toward the leaves.
        order, seen = [], set()
        def topo(v):
            if v in seen:
                return
            seen.add(v)
            for p in v._prev:
                topo(p)
            order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# Same worked example as the chain rule task:
w, x, y = Value(3.0), Value(2.0), Value(5.0)
L = (w * x + Value(-1) * y) * (w * x + Value(-1) * y)  # (w*x - y)^2
L.backward()
print("L =", L.data)       # 1.0
print("dL/dw =", w.grad)   # 4.0, i.e. 2*(w*x - y)*x (matches paper)

Run it with: python3 main.py
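If you want a check that doesn't rely on your own algebra, a finite-difference estimate is a handy sanity test. The sketch below uses plain floats and the same function as the demo, L = (w*x - y)^2. The helper names loss and numeric_grad_w and the step size h are assumptions chosen for this example.

```python
def loss(w, x, y):
    # Same function as the demo, on plain floats: L = (w*x - y)^2
    return (w * x - y) ** 2

def numeric_grad_w(w, x, y, h=1e-5):
    # Central difference: (f(w + h) - f(w - h)) / (2h)
    return (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)

w, x, y = 3.0, 2.0, 5.0
analytic = 2 * (w * x - y) * x        # chain rule by hand: 2 * 1 * 2 = 4.0
numeric = numeric_grad_w(w, x, y)

print(analytic, numeric)
assert abs(analytic - numeric) < 1e-6
```

The same trick works for any new operation you add to the Value class: nudge one input, watch the output, and compare the slope with the .grad your backward pass reports.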