What a 'model' actually is

Difficulty: easy

Why this matters

Every neural network, every LLM, every 'AI' you'll encounter is the same shape: a function from numbers to numbers with a pile of learnable parameters inside. The mysticism in the field fills the gap left when people skip this framing. If you can hold 'input → weighted combination → non-linearity → output' in your head as the whole thing, every next module is a variation on a theme rather than a new mystery. Start here and you'll never be confused about what a model is again.

Demo

A linear regression model y = w*x + b has two parameters (w, b). A 7B-parameter LLM has roughly seven billion. The main structural differences are (a) parameters arranged in tensors instead of scalars, (b) non-linearities between layers, and (c) attention layers, themselves built from matmuls plus a softmax, sitting alongside the plain matmuls. There is no other magic.
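To make the "function with parameters inside" framing concrete, here is a minimal sketch in plain Python. The network shape (2 → 3 → 1), the tanh non-linearity, the helper names, and all parameter values are assumptions chosen for illustration; nothing here comes from a real model.

```python
import math

def linear_model(x, w, b):
    """Linear regression: two scalar parameters, w and b."""
    return w * x + b

def tiny_mlp(x, W1, b1, W2, b2):
    """One hidden layer: weighted combination -> tanh -> weighted combination."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

def count_params(params):
    """Count the scalars in a number or an arbitrarily nested list."""
    if isinstance(params, (int, float)):
        return 1
    return sum(count_params(p) for p in params)

# Hypothetical 2 -> 3 -> 1 network with made-up parameter values.
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]   # 3 x 2 weights
b1 = [0.0, 0.1, -0.1]                          # 3 biases
W2 = [[0.7, -0.3, 0.2]]                        # 1 x 3 weights
b2 = [0.05]                                    # 1 bias

print(linear_model(2.0, w=3.0, b=1.0))         # computes 3*2 + 1
print(tiny_mlp([1.0, 2.0], W1, b1, W2, b2))    # a list with one output value
print(count_params([W1, b1, W2, b2]))          # sums 6 + 3 + 3 + 1
```

Both functions have the same shape: numbers in, numbers out, with the parameters passed alongside the input. Scaling up changes the size of the lists, not the idea.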

A parameter is a number that gets adjusted during training. Inference is running the function forward with parameters frozen. Training is running it forward, measuring how wrong it was, and nudging the parameters to be less wrong next time. That's the entire game.
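The forward / measure / nudge loop can be sketched for the two-parameter linear model. This is a minimal illustration that assumes mean squared error as the measure of "how wrong" and plain gradient descent as the nudge; the toy data, learning rate, and step count are made up.

```python
def predict(x, w, b):
    """Inference: run the function forward with the parameters held fixed."""
    return w * x + b

def train_step(data, w, b, lr=0.05):
    """One training step: forward pass, measure the error, nudge w and b."""
    n = len(data)
    grad_w = 0.0
    grad_b = 0.0
    for x, y in data:
        err = predict(x, w, b) - y      # how wrong was the prediction?
        grad_w += 2 * err * x / n       # d(mean squared error)/dw
        grad_b += 2 * err / n           # d(mean squared error)/db
    # Move each parameter a small step against its gradient.
    return w - lr * grad_w, b - lr * grad_b

# Toy data generated from y = 2x + 1 (an assumption for this sketch).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.0, 0.0                         # start from arbitrary parameters
for _ in range(2000):
    w, b = train_step(data, w, b)

print(w, b)                             # should end up close to 2 and 1
print(predict(10.0, w, b))              # inference with the trained parameters
```

Real training loops differ in scale and in how gradients are computed (automatic differentiation rather than hand-written formulas), but the three steps are the same.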

Try it yourself

Write down the parameter count of: (a) a dot product of a 4-dim input with a 4-dim weight vector, (b) a 4→4 linear layer with biases, (c) a 4→8→4 MLP without biases. Check your arithmetic.
Estimate the parameter count of GPT-3's 175B model given 96 layers, d_model=12288, and 4x FFN expansion. Ignore embeddings; just attention + FFN.
For a model with P parameters stored in fp32, compute the disk and memory cost in GB. Redo for fp16 and int4.
Find the parameter count in a model card for any open-weight LLM (e.g. Llama 3.1 8B). Identify what the '8B' is actually counting.
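Once you have worked the exercises by hand, a few helper functions can check the arithmetic. This is a rough sketch, not an official formula: the function names are hypothetical, the transformer estimate assumes four d_model × d_model attention matrices (Q, K, V, O) plus a two-matrix FFN per layer, and it ignores biases, layer norms, and embeddings. GB here means 10**9 bytes.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights of an n_in -> n_out linear layer, plus optional biases."""
    return n_in * n_out + (n_out if bias else 0)

def transformer_block_params(d_model, ffn_mult=4):
    """Rough per-layer count: attention (Q, K, V, O) plus FFN (up, down).

    Assumes no biases and ignores layer norms and embeddings.
    """
    attention = 4 * d_model * d_model
    ffn = 2 * d_model * (ffn_mult * d_model)
    return attention + ffn

def storage_gb(n_params, bits_per_param):
    """Size of the raw weights in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Small layers from the first exercise.
print(linear_params(4, 4))                                   # 4 -> 4 with biases
print(linear_params(4, 8, bias=False) + linear_params(8, 4, bias=False))

# GPT-3-sized estimate from the second exercise: 96 layers, d_model = 12288.
gpt3_estimate = 96 * transformer_block_params(12288)
print(f"{gpt3_estimate / 1e9:.1f}B parameters (attention + FFN only)")

# Weight storage for a hypothetical 7e9-parameter model at three precisions.
for name, bits in [("fp32", 32), ("fp16", 16), ("int4", 4)]:
    print(name, storage_gb(7e9, bits), "GB")
```

The per-layer estimate reduces to 12 × d_model², which is why attention plus FFN alone lands close to the headline 175B figure.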

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what a neural network parameter is and the difference between parameters, hyperparameters, and activations. Use a concrete 3-parameter example.

2. How it actually works (the mechanism)

Walk me through exactly how the 175B parameters of GPT-3 break down across embedding tables, attention (Q/K/V/O), feed-forward (up/down), and layer norms. Show the arithmetic per layer.

3. Advanced — application & what's next

Given a model with P parameters, compute the minimum hardware (VRAM, system RAM) needed to run it in fp16, int8, and int4. Then explain why training takes several times more memory than inference (often quoted as ~6×, though the multiple depends on precision and optimizer) and show the decomposition (weights, gradients, optimizer state, activations).
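As a sanity check on whatever answer you get to the third prompt, here is a back-of-the-envelope sketch of the memory decomposition. It assumes one common setup, mixed-precision training with Adam (fp16 weights and gradients, plus an fp32 master copy of the weights and two fp32 Adam moments), and it leaves out activations, which depend on batch size and sequence length. Other setups give different multiples; the function names and the 7e9 example are hypothetical.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def inference_weights_gb(n_params, dtype):
    """Memory for the weights alone during inference, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

def training_state_gb(n_params):
    """Per-component training memory in GB, excluding activations.

    Assumption: mixed-precision Adam, as described in the lead-in.
    """
    bytes_each = {
        "weights_fp16": 2,
        "gradients_fp16": 2,
        "master_weights_fp32": 4,
        "adam_first_moment_fp32": 4,
        "adam_second_moment_fp32": 4,
    }
    return {name: n_params * b / 1e9 for name, b in bytes_each.items()}

n_params = 7e9  # hypothetical model size
for dtype in ("fp16", "int8", "int4"):
    print(dtype, inference_weights_gb(n_params, dtype), "GB for weights")

breakdown = training_state_gb(n_params)
total = sum(breakdown.values())
ratio = total / inference_weights_gb(n_params, "fp16")
for name, gb in breakdown.items():
    print(f"{name}: {gb} GB")
print(f"training state: {total} GB, {ratio}x fp16 inference weights")
```

Under these assumptions the training state comes to 16 bytes per parameter against 2 for fp16 inference, before any activations are counted.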