Capstok — learn by doing

Why this matters

Nobody differentiates LLMs by hand. We build a data structure — a graph of operations — and a generic backward pass that walks the graph applying the chain rule mechanically. This is called reverse-mode automatic differentiation, and it's what PyTorch's .backward() actually does. Building a 100-line version yourself (Karpathy's micrograd) is a rite of passage: once you've done it, no framework looks like magic anymore.

Demo

Below: a minimal Value class that records every operation as a node in a DAG. Calling .backward() walks the graph in reverse topological order, accumulating gradients using the chain rule at each node.

This is the same architectural idea as PyTorch's autograd — just stripped to the bone. Port it to your favorite language; the exercise pays off forever.
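To make "accumulating gradients" concrete, here is a minimal sketch of why each per-op backward uses += rather than =. It assumes a cut-down Value with multiplication only (a simplification for brevity, not the full demo class), and it feeds the same node into an operation twice.

```python
# Cut-down Value: multiplication only, enough to show gradient accumulation.
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _bw():
            # Each operand gets its local derivative times the upstream gradient.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _bw
        return out

    def backward(self):
        order, seen = [], set()
        def topo(v):
            if v in seen:
                return
            seen.add(v)
            for p in v._prev:
                topo(p)
            order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a = Value(3.0)
b = a * a        # a is used twice by the same node
b.backward()
print(a.grad)    # 6.0, i.e. d(a^2)/da = 2a
```

With = in place of +=, the second assignment would overwrite the first and a.grad would come out as 3.0, half the true derivative.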

Try it yourself

Run the demo. Verify dL/dw matches what you got on paper in the previous challenge.
Add a __pow__ operation for a constant exponent n (a plain number, not a Value) so L = diff**2 works. Write the backward as self.grad += n * self.data**(n-1) * out.grad.
Build a 3-node computation graph (e.g. z = (a+b)*c) and verify that a.grad, b.grad, c.grad match what you compute on paper.
Intentionally break the topological sort: call _backward on the nodes in the order they were appended (for v in order) instead of reversed(order). Observe how the gradients come out wrong. Why does order matter?
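One possible shape for the __pow__ task is sketched below. It is not the only way to do it: it assumes the exponent is a plain int or float, and it repeats the demo's Value class so the block runs on its own. The names diff and L follow the task wording.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _bw():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _bw
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _bw():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _bw
        return out

    def __pow__(self, n):
        # Assumption: n is a constant (int or float), so it gets no gradient.
        assert isinstance(n, (int, float)), "only constant exponents supported"
        out = Value(self.data ** n, (self,))
        def _bw():
            # Power rule: d(x^n)/dx = n * x^(n-1), scaled by the upstream gradient.
            self.grad += n * self.data ** (n - 1) * out.grad
        out._backward = _bw
        return out

    def backward(self):
        order, seen = [], set()
        def topo(v):
            if v in seen:
                return
            seen.add(v)
            for p in v._prev:
                topo(p)
            order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

w, x, y = Value(3.0), Value(2.0), Value(5.0)
diff = w * x + Value(-1) * y     # w*x - y
L = diff ** 2
L.backward()
print(L.data, w.grad, x.grad, y.grad)   # 1.0 4.0 6.0 -2.0
```

Because diff is now a single node that is squared, w*x is built only once, yet dL/dw is the same as in the version that multiplies two copies of the expression.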

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what reverse-mode automatic differentiation is and why neural networks use it instead of forward-mode or symbolic differentiation.

2. How it actually works (the mechanism)

Walk me through exactly what happens during `loss.backward()` in PyTorch: graph construction during forward, topological sort, and the per-op backward function. Use a 3-op example.

3. Advanced — application & what's next

For a forward pass that takes T FLOPs, a reverse-mode backward pass takes ~2T FLOPs but needs to cache all intermediate activations. Explain the memory/compute tradeoff and how gradient checkpointing trades one for the other.

References

# main.py — minimal scalar autograd (Karpathy-style)
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _bw():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _bw
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _bw():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _bw
        return out

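    def backward(self):
        # Post-order DFS: each node is appended after all of its inputs.
        order, seen = [], set()
        def topo(v):
            if v in seen:
                return
            seen.add(v)
            for p in v._prev:
                topo(p)
            order.append(v)
        topo(self)
        self.grad = 1.0  # seed: dL/dL = 1
        # Reverse topological order: a node's grad is complete before it is pushed to its inputs.
        for v in reversed(order):
            v._backward()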

# Same worked example as the chain rule task:
w, x, y = Value(3.0), Value(2.0), Value(5.0)
L = (w * x + Value(-1) * y) * (w * x + Value(-1) * y)   # (w*x - y)^2
L.backward()
print("L =", L.data)            # 1.0
print("dL/dw =", w.grad)         # 4.0  = 2*(w*x - y)*x  (matches paper)

Run: python3 main.py
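If you want a check that does not depend on the Value class, a finite-difference estimate is a common sanity test. The sketch below is illustrative only: the helper names loss and numeric_grad_w and the step size h are assumptions, not part of main.py.

```python
def loss(w, x, y):
    # Same function as the demo: L = (w*x - y)^2
    return (w * x - y) ** 2

def numeric_grad_w(w, x, y, h=1e-6):
    # Central difference: (L(w+h) - L(w-h)) / (2h)
    return (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)

approx = numeric_grad_w(3.0, 2.0, 5.0)
analytic = 2 * (3.0 * 2.0 - 5.0) * 2.0   # chain rule: 2*(w*x - y)*x
print("numeric  dL/dw ~", approx)
print("analytic dL/dw =", analytic)
assert abs(approx - analytic) < 1e-6
```

The numeric estimate should agree with w.grad from the demo up to floating-point error. A large gap usually points to a bug in one of the _backward functions.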

Autograd: the chain rule as code