Loss functions — MSE, cross-entropy, and why

Difficulty: hard


Why this matters

A loss function tells the model 'how wrong you were', but different tasks need different definitions of wrong. Regression uses MSE because the target is a continuous real value, so squared distance from the target is a natural measure of error. Classification uses cross-entropy because the prediction is a probability distribution over classes. Language modeling is classification over the vocabulary, so cross-entropy is the loss you'll see for every LLM for the rest of your life. Pick the wrong loss and your model optimizes the wrong objective.

Demo

Two losses, two use cases:

MSE (regression): L = mean((y_pred - y_true)^2). Penalizes large errors quadratically. Assumes the target is a continuous real value.

Cross-entropy (classification): L = -sum(y_true * log(y_pred)). Given a probability distribution y_pred over V classes and a one-hot y_true, this simplifies to -log(y_pred[correct_class]). Penalizes being confidently wrong much more than being uncertain. This is the LM loss.
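The one-hot simplification can be written out in one line. This is a short worked step rather than anything new; the symbols p_i (for y_pred[i]), y_i (for y_true[i]) and c (the index of the correct class) are shorthand introduced here, not names from the demo.

```latex
% One-hot target: y_c = 1 and y_i = 0 for every i \neq c,
% so every term of the sum vanishes except the one for the correct class.
L = -\sum_{i=1}^{V} y_i \log p_i
  = -\,y_c \log p_c \;-\; \sum_{i \neq c} 0 \cdot \log p_i
  = -\log p_c
```

So the loss depends only on the probability the model gave to the correct class: it is 0 when p_c = 1 and grows without bound as p_c approaches 0.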

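Why not MSE for classification? Because your outputs after softmax are probabilities in [0,1], so a squared error on them is bounded: a confidently wrong prediction can never cost much more than a mildly wrong one, and the gradient through softmax shrinks toward zero exactly when the model is most wrong. Cross-entropy has neither problem: -log(p) grows without bound as the probability of the correct class approaches 0, so being confidently wrong is punished hard. Walk the numbers in the demo.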

// main.go — MSE vs cross-entropy, by hand
package main

import (
	"fmt"
	"math"
)

// ----- MSE (regression) -----
func mse(preds, targets []float64) float64 {
	var sum float64
	for i := range preds {
		d := preds[i] - targets[i]
		sum += d * d
	}
	return sum / float64(len(preds))
}

// ----- Cross-entropy (classification) -----
func softmax(logits []float64) []float64 {
	m := logits[0]
	for _, v := range logits {
		if v > m {
			m = v
		}
	}
	e := make([]float64, len(logits))
	var s float64
	for i, l := range logits {
		e[i] = math.Exp(l - m)
		s += e[i]
	}
	for i := range e {
		e[i] /= s
	}
	return e
}

func crossEntropy(logits []float64, targetIdx int) float64 {
	p := softmax(logits)
	return -math.Log(p[targetIdx])
}

func main() {
	fmt.Println("MSE for [2.1, 2.9, 5.0] vs [2, 3, 5]:",
		mse([]float64{2.1, 2.9, 5.0}, []float64{2.0, 3.0, 5.0}))

	// A 3-class prediction: model thinks class 0 is most likely.
	// Target is actually class 1.
	logits := []float64{3.0, 1.0, 0.5}
	fmt.Println("softmax:", softmax(logits))          // heavily on class 0
	fmt.Println("CE loss (target=1):", crossEntropy(logits, 1)) // high — model was confidently wrong

	// If model had been right:
	fmt.Println("CE loss (target=0):", crossEntropy(logits, 0)) // low

	// The LM loss: vocab size V=50257. Target = next-token id. CE over 50257 classes.
}

Run: go run main.go

Try it yourself

Compute cross-entropy by hand for logits [2, 1, 0], once with target class 0 and once with target class 2. Confirm that the loss is larger when the correct class has the smaller logit.

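Replace softmax with 'hard argmax + MSE on one-hot'. Observe: the loss depends only on which logit is largest, so logits of 3.0 and 2.9 score exactly the same as logits of 30 and 0, and nudging either logit changes nothing until the argmax flips. Why is this bad for gradient-based training?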

Compute the cross-entropy loss expected from a uniform random guess over a vocabulary of 50257 tokens. (Answer: ln(50257) ≈ 10.82 nats, using the natural log as in the demo.)

If a pretrained LLM achieves validation CE of 2.5 on some dataset, what's its perplexity? (Perplexity = exp(CE). Answer: ≈ 12.18.)

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Explain what a loss function is and why different tasks use different ones. Name two and give the task each is for.

2. How it actually works (the mechanism)

Derive why cross-entropy, not MSE, is correct for classification. Start from maximum likelihood estimation of a categorical distribution and show that minimizing negative log-likelihood equals cross-entropy.

3. Advanced — application & what's next

For LLMs, cross-entropy on the next token gives perplexity as exp(CE). Explain what 'bits-per-byte' is, how to convert CE to bits-per-byte, and why bits-per-byte is a fairer comparison across tokenizers.