Matrix multiplication: a whole layer in one operation

medium


Why this matters

When you compute a linear layer's output y = W x + b, you're doing one dot product per output neuron: row i of W dotted with x, plus b[i]. A matmul is just those dot products bundled together and executed in parallel. Modern GPUs and TPUs are built around dedicated matmul hardware, and the large majority of LLM training time is typically spent inside matmul kernels (figures above 90% are commonly quoted). If you hold 'matmul = batched dot products' in your head, most PyTorch shape errors become easy to read the first time.
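To make "one dot product per output neuron" concrete, here is a minimal sketch in plain Python. The helper names `dot` and `linear` and the example numbers are illustrative assumptions, not part of the lesson's demo.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError(f"length mismatch: {len(u)} vs {len(v)}")
    return sum(a * b for a, b in zip(u, v))


def linear(W, x, b):
    """Compute y = W x + b: one dot product per row of W (per output neuron)."""
    return [dot(row, x) + bias for row, bias in zip(W, b)]


# Hypothetical layer: 3 inputs, 2 output neurons.
W = [[1.0, 2.0, 3.0],
     [0.0, -1.0, 1.0]]
x = [1.0, 0.0, 2.0]
b = [0.5, -0.5]

y = linear(W, x, b)
print(y)  # [7.5, 1.5]
```

Stacking several input vectors x as columns of a matrix turns this loop over rows into a full matrix multiplication.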

Demo

For C = A @ B where A is (M, K) and B is (K, N), the result C has shape (M, N) and C[i, j] = sum_k A[i, k] * B[k, j]. The inner dimension (K) must match and gets "contracted" away.
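The shape rule can be checked before any arithmetic happens. The sketch below uses a hypothetical helper, `matmul_shape`, that takes two 2-D shapes and either returns the result shape or rejects the pair, which is roughly what a framework does when it raises a shape error.

```python
def matmul_shape(a_shape, b_shape):
    """Return the shape of A @ B for 2-D shapes, or raise if the inner dims differ."""
    m, k_a = a_shape
    k_b, n = b_shape
    if k_a != k_b:
        raise ValueError(
            f"inner dimensions differ: A is {a_shape}, B is {b_shape}"
        )
    # The shared dimension K is contracted away; only (M, N) remains.
    return (m, n)


print(matmul_shape((3, 4), (4, 2)))  # (3, 2)

try:
    matmul_shape((3, 4), (3, 2))  # inner dims are 4 and 3
except ValueError as err:
    print("rejected:", err)
```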

Below: a 3-language naïve implementation. The naïve version is O(M·N·K) in FLOPs but ignores the cache hierarchy entirely. Real matmul libraries (cuBLAS, oneMKL) can be 100× faster, largely because they tile the computation into blocks that stay resident in cache and registers. We'll quantify later. For now, get the shapes right.
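As a rough stand-in for the naïve version described here, the following is a minimal pure-Python sketch of the triple loop. It is an assumed illustration rather than the lesson's own demo code; the function name `matmul_naive` and the example matrices are made up for this purpose.

```python
def matmul_naive(A, B):
    """Naive matmul on lists of lists: C[i][j] = sum_k A[i][k] * B[k][j]."""
    M, K = len(A), len(A[0])
    K_b, N = len(B), len(B[0])
    # Shape assertion: the inner dimensions must match.
    if K != K_b:
        raise ValueError(f"inner dimensions differ: ({M}, {K}) @ ({K_b}, {N})")

    C = [[0] * N for _ in range(M)]
    for i in range(M):          # each row of A
        for j in range(N):      # each column of B
            acc = 0
            for k in range(K):  # contract over the shared dimension
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C


A = [[1, 0, 2],
     [0, 1, 1]]      # shape (2, 3)
B = [[1, 2],
     [3, 4],
     [5, 6]]         # shape (3, 2)

print(matmul_naive(A, B))  # [[11, 14], [8, 10]]
```

The innermost loop walks down a column of B, which is the access pattern that tends to miss in cache for large matrices stored row by row.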

Try it yourself

Hand-compute A@B where A is [[1,2],[3,4]] and B is [[5,6],[7,8]]. Verify against code.

Swap A and B in the demo. Will it still compile/run? If not, which shape assertion fails?

Time the naïve matmul on two 512×512 matrices, then compare against NumPy's @. Measure the speedup ratio.

For a Transformer forward pass with seq=2048, d=4096, count how many FLOPs a single Q = X @ W_q matmul does, assuming X is (seq, d) and W_q is (d, d). (Hint: 2 × M × N × K.)
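The hint's formula follows from the definition: each of the M × N output entries needs K multiplies and K adds (counting the accumulate as one add per term). A small sketch, with a hypothetical helper `matmul_flops` and shapes chosen so the exercise answer is left to you:

```python
def matmul_flops(M, N, K):
    """FLOPs for an (M, K) @ (K, N) matmul: one multiply and one add per (i, j, k)."""
    return 2 * M * N * K


# Assumed example shapes, deliberately different from the exercise.
print(matmul_flops(2, 2, 2))       # 16
print(matmul_flops(128, 64, 256))  # 4194304
```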

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Explain matrix multiplication shape rules. Give two pairs of shapes: one that works, one that doesn't, and say why.

2. How it actually works (the mechanism)

Why is matmul the single most-optimized operation in numerical computing? Walk through how a cache-blocked matmul hides memory latency, and what a Tensor Core adds on top.

3. Advanced — application & what's next

For GPT-3 175B, estimate what fraction of total FLOPs are matmul vs everything else (attention softmax, layer norm, activations). Use the per-layer FLOP breakdown and explain why matmul dominates.