Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Every neural network, every LLM, every 'AI' you'll encounter is the same shape: a function from numbers to numbers with a pile of learnable parameters inside. The mysticism in the field fills the gap left when people skip this framing. If you can hold 'input → weighted combination → non-linearity → output' in your head as the whole thing, every next module is a variation on a theme rather than a new mystery. Start here and you'll never be confused about what a model is again.
A linear regression model y = w*x + b has two parameters (w, b). A 7B-parameter LLM has seven billion. The only structural differences are (a) tensor-shaped parameters instead of scalars, (b) non-linearities between layers, and (c) attention layers alongside the plain matmuls, which are themselves built from matmuls plus a softmax and let each token's output depend on the other tokens. There is no other magic.
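To make the comparison concrete, here is a minimal sketch in plain Python: the two-parameter linear model next to a single tiny layer that does "weighted combination → non-linearity". The function names (`linear`, `dense_layer`, `count_parameters`), the choice of tanh, and all the numbers are assumptions made for illustration; nothing here comes from a particular library or model.

```python
import math

def linear(x, w, b):
    """The two-parameter model: y = w*x + b."""
    return w * x + b

def dense_layer(xs, weights, biases):
    """Weighted combination of the inputs, then a non-linearity (tanh here)."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, xs)) + bias  # weighted combination
        outputs.append(math.tanh(z))                    # non-linearity
    return outputs

def count_parameters(weights, biases):
    """Every weight and every bias is one learnable number."""
    return sum(len(row) for row in weights) + len(biases)

# Hypothetical values, chosen only so the arithmetic is easy to follow.
y = linear(3.0, w=2.0, b=1.0)                    # 2*3 + 1

weights = [[0.5, -0.5, 0.0], [1.0, 1.0, 1.0]]    # 2 outputs x 3 inputs
biases = [0.0, -3.0]
hidden = dense_layer([1.0, 1.0, 1.0], weights, biases)
n_params = count_parameters(weights, biases)     # 6 weights + 2 biases

print(y, hidden, n_params)
```

The layer is the same idea as the linear model with the scalars swapped for a small grid of numbers and a tanh applied afterwards. A larger model stacks many such layers, so its parameter count is the sum of these per-layer counts.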
A parameter is a number that gets adjusted during training. Inference is running the function forward with parameters frozen. Training is running it forward, measuring how wrong it was, and nudging the parameters to be less wrong next time. That's the entire game.
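The forward, measure, nudge loop can be sketched for the two-parameter model y = w*x + b using plain gradient descent on mean squared error. This is an illustrative sketch, not how any particular framework implements training: the dataset, learning rate, step count, and the names `predict` and `train_step` are all assumptions.

```python
def predict(x, w, b):
    """Inference: run the function forward with the parameters frozen."""
    return w * x + b

def train_step(data, w, b, lr):
    """One training step: forward pass, measure the error, nudge w and b."""
    n = len(data)
    grad_w = 0.0
    grad_b = 0.0
    loss = 0.0
    for x, y in data:
        err = predict(x, w, b) - y     # how wrong the model was on this point
        loss += err * err / n          # mean squared error
        grad_w += 2 * err * x / n      # d(loss)/dw
        grad_b += 2 * err / n          # d(loss)/db
    # Move each parameter a small step against its gradient.
    return w - lr * grad_w, b - lr * grad_b, loss

# Hypothetical data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.0, 0.0                        # parameters start out uninformed
losses = []
for _ in range(2000):
    w, b, loss = train_step(data, w, b, lr=0.05)
    losses.append(loss)

prediction = predict(4.0, w, b)        # inference with the trained parameters
print(round(w, 4), round(b, 4), round(prediction, 4))
```

Note that `predict` never changes `w` or `b`; only `train_step` does. That split is the whole difference between inference and training, and it holds whether the model has two parameters or seven billion.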
Use these three in order. Each builds on the one before.
In one paragraph, explain what a neural network parameter is and the difference between parameters, hyperparameters, and activations. Use a concrete 3-parameter example.
Walk me through exactly how the 175B parameters of GPT-3 break down across embedding tables, attention (Q/K/V/O), feed-forward (up/down), and layer norms. Show the arithmetic per layer.
Given a model with P parameters, compute the minimum hardware (VRAM, system RAM) needed to run it in fp16, int8, and int4. Then explain why training takes ~6× more memory than inference and show the decomposition (weights, gradients, optimizer state, activations).