A loss function tells the model 'how wrong you were', but different tasks need different definitions of wrong. Regression typically uses MSE because the target is a real number and the size of the miss is what matters. Classification uses cross-entropy because the prediction is a probability distribution. Language modeling is classification over the vocabulary, so cross-entropy is the loss you'll see for every LLM for the rest of your life. Pick the wrong loss and your model optimizes the wrong objective.
Two losses, two use cases:
MSE (regression): L = mean((y_pred - y_true)^2). Penalizes large errors quadratically. Assumes the target is a continuous real value.
Cross-entropy (classification): L = -sum(y_true * log(y_pred)). Given a probability distribution y_pred over V classes and a one-hot y_true, this simplifies to -log(y_pred[correct_class]). Penalizes being confidently wrong much more than being uncertain. This is the LM loss.
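Why not MSE for classification? Because your outputs after softmax are probabilities in [0,1], so the squared error on any single class can never exceed 1. MSE on probabilities therefore scores "confidently wrong" as only slightly worse than "unsure", and its gradient through softmax can shrink toward zero exactly when the model is most wrong. Cross-entropy does the opposite: -log(p) grows without bound as the probability p on the correct class approaches 0. Walk the numbers in the demo.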
// main.go: MSE vs cross-entropy, by hand
package main

import (
	"fmt"
	"math"
)

// ----- MSE (regression) -----

// mse returns the mean of the squared differences.
// preds and targets must have the same non-zero length.
func mse(preds, targets []float64) float64 {
	var sum float64
	for i := range preds {
		d := preds[i] - targets[i]
		sum += d * d
	}
	return sum / float64(len(preds))
}

// ----- Cross-entropy (classification) -----

// softmax turns logits into probabilities. Subtracting the max logit first
// keeps math.Exp from overflowing and does not change the result.
func softmax(logits []float64) []float64 {
	m := logits[0]
	for _, v := range logits {
		if v > m {
			m = v
		}
	}
	e := make([]float64, len(logits))
	var s float64
	for i, l := range logits {
		e[i] = math.Exp(l - m)
		s += e[i]
	}
	for i := range e {
		e[i] /= s
	}
	return e
}

// crossEntropy returns -log(softmax(logits)[targetIdx]).
func crossEntropy(logits []float64, targetIdx int) float64 {
	p := softmax(logits)
	return -math.Log(p[targetIdx])
}

func main() {
	fmt.Println("MSE for [2.1, 2.9, 5.0] vs [2, 3, 5]:",
		mse([]float64{2.1, 2.9, 5.0}, []float64{2.0, 3.0, 5.0}))

	// A 3-class prediction: model thinks class 0 is most likely.
	// Target is actually class 1.
	logits := []float64{3.0, 1.0, 0.5}
	fmt.Println("softmax:", softmax(logits))                    // heavily on class 0
	fmt.Println("CE loss (target=1):", crossEntropy(logits, 1)) // high: model was confidently wrong

	// If model had been right:
	fmt.Println("CE loss (target=0):", crossEntropy(logits, 0)) // low

	// The LM loss: vocab size V=50257. Target = next-token id. CE over 50257 classes.
}

go run main.go

Use these three in order. Each builds on the one before.
Explain what a loss function is and why different tasks use different ones. Name two and give the task each is for.
Derive why cross-entropy, not MSE, is correct for classification. Start from maximum likelihood estimation of a categorical distribution and show that minimizing negative log-likelihood equals cross-entropy.
For LLMs, cross-entropy on the next token gives perplexity as exp(CE). Explain what 'bits-per-byte' is, how to convert CE to bits-per-byte, and why bits-per-byte is a fairer comparison across tokenizers.