Overfitting, underfitting, and the bias-variance tradeoff

Difficulty: hard

Why this matters

Every model has two failure modes: underfitting (too simple, misses real patterns, high bias) and overfitting (too complex, memorizes noise, high variance). The bias-variance tradeoff describes how the two move in opposite directions as complexity increases: bias falls while variance rises. On a noisy curved signal, a degree-1 polynomial underfits, a degree-15 polynomial overfits, and degree 3 is often about right. This framework is the motivation behind regularization, dropout, early stopping, and the cross-validation loops you'll write.

Demo

Polynomial degree is one of the cleanest knobs for dialing bias versus variance: degree-1 is too rigid to capture a sine curve (high bias), degree-15 is flexible enough to chase the noise in the training points (and passes through them exactly when there are 16 or fewer) but generalises badly (high variance), and degree-3 sits in the productive middle. Plotting train and test MSE together as degree increases makes the familiar U-shaped test error curve concrete and shows roughly where the model moves from underfitting to overfitting.
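
One way to run the sweep described above is sketched here with scikit-learn. The data setup is an assumption, not taken from the lesson: 60 points of sin(2πx) on [0, 1] with Gaussian noise of standard deviation 0.3, a 60/40 train/test split, and fixed seeds. The helper name fit_and_score is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumed data (not from the lesson): noisy samples of one period of a sine.
rng = np.random.default_rng(0)
noise = 0.3  # standard deviation of the Gaussian noise
X = rng.uniform(0.0, 1.0, size=(60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0.0, noise, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)


def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    return train_mse, test_mse


# Sweep degrees 1..15 and record both errors for each.
results = {d: fit_and_score(d) for d in range(1, 16)}
for d, (train_mse, test_mse) in results.items():
    print(f"degree={d:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

# The degree with the lowest test MSE is the empirical sweet spot for this split.
best_degree = min(results, key=lambda d: results[d][1])
print("degree with lowest test MSE:", best_degree)
```

Plotting the two columns against degree (for example with matplotlib) gives the train/test curves; the exact numbers and the best degree depend on the seed and the split.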

Try it yourself

Record test MSE for degrees 1 through 15. Find the degree that minimizes test MSE: this is the model complexity sweet spot. Confirm that train MSE never increases as degree grows (up to numerical error), while test MSE follows a U-shape.
Keep degree=9 but cut the training data in half (test_size=0.7). Does overfitting get worse? This shows that overfitting depends on the amount of training data, not just on model complexity.
Add Ridge regularization: replace LinearRegression() with Ridge(alpha=1.0) at degree=9. Compare the train/test MSE gap with and without it. Increase alpha to 10 and then 100 to see underfitting emerge.
Increase noise from 0.3 to 1.0 and rerun degree=3. Does test MSE go up even though you didn't change the model? This is the irreducible error floor: if noise is the standard deviation of the added noise, no model's expected test MSE can fall below noise² (0.09 before, 1.0 after).
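
The last three experiments can be run from a single helper, sketched below. The function name experiment, its defaults (60 points, noise 0.3, test_size=0.4, seed 0), and the sine data are assumptions for illustration, not the lesson's own demo code. Passing alpha=None uses plain LinearRegression; any other value uses Ridge.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def experiment(degree, alpha=None, noise=0.3, test_size=0.4, n=60, seed=0):
    """Return (train MSE, test MSE) for one polynomial fit on noisy sine data."""
    # Assumed data: one period of a sine plus Gaussian noise with std `noise`.
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0.0, noise, size=n)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    # alpha=None means ordinary least squares; otherwise an L2-penalised fit.
    regressor = LinearRegression() if alpha is None else Ridge(alpha=alpha)
    model = make_pipeline(PolynomialFeatures(degree), regressor)
    model.fit(X_tr, y_tr)
    return (
        mean_squared_error(y_tr, model.predict(X_tr)),
        mean_squared_error(y_te, model.predict(X_te)),
    )


runs = {
    "degree 9, least squares": experiment(9),
    "degree 9, test_size=0.7": experiment(9, test_size=0.7),
    "degree 9, Ridge alpha=1": experiment(9, alpha=1.0),
    "degree 9, Ridge alpha=10": experiment(9, alpha=10.0),
    "degree 9, Ridge alpha=100": experiment(9, alpha=100.0),
    "degree 3, noise=0.3": experiment(3, noise=0.3),
    "degree 3, noise=1.0": experiment(3, noise=1.0),
}
for name, (train_mse, test_mse) in runs.items():
    gap = test_mse - train_mse
    print(f"{name:28s} train={train_mse:.3f}  test={test_mse:.3f}  gap={gap:.3f}")
```

Because the seed is fixed, each call sees the same x values, so differences between rows come from the setting that was changed rather than from a fresh random draw. A single split is still noisy; repeating with several seeds gives a steadier picture.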

Prompt your AI

Use these three prompts in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain overfitting in plain language: what does 'memorizing training data' mean, and why does 99% train accuracy with 70% test accuracy indicate a problem?

2. Why it works (the mechanism)

Walk me through the bias-variance decomposition: Expected MSE = Bias² + Variance + Irreducible noise. Define each term for the polynomial example. Why does increasing complexity reduce bias but increase variance?
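
To check the answer to this prompt numerically, the terms of the decomposition can be estimated by refitting the same polynomial on many independently drawn training sets. The sketch below uses NumPy only; the sine signal, noise level, training-set size, number of repetitions, and the function name decompose are all assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
noise = 0.3                 # assumed noise standard deviation
n_train, n_sets = 30, 500   # points per training set, number of training sets
x_grid = np.linspace(0.05, 0.95, 50)   # fixed points where predictions are compared
f_grid = np.sin(2 * np.pi * x_grid)    # true (noise-free) signal at those points


def decompose(degree):
    """Monte Carlo estimate of (bias squared, variance), averaged over x_grid."""
    preds = np.empty((n_sets, x_grid.size))
    for i in range(n_sets):
        # Draw a fresh training set and fit a polynomial of the given degree.
        x = rng.uniform(0.0, 1.0, n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise, n_train)
        preds[i] = Polynomial.fit(x, y, degree)(x_grid)
    # Bias: how far the average prediction is from the true signal.
    bias_sq = np.mean((preds.mean(axis=0) - f_grid) ** 2)
    # Variance: how much predictions move from one training set to the next.
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


results = {d: decompose(d) for d in (1, 3, 9)}
for d, (bias_sq, variance) in results.items():
    total = bias_sq + variance + noise**2   # irreducible term is noise squared
    print(f"degree={d}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"expected MSE={total:.4f}")
```

The irreducible term is the same noise² for every degree, so only the first two columns change with model complexity.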

3. Advanced — application & what's next

My validation loss plateaus at epoch 12 while training loss keeps falling to epoch 50. Name three fixes (not 'get more data'), and for each: the mechanism, the hyperparameter to tune, and the sign that you've overcorrected into underfitting.

References

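Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Section 1.1 (polynomial curve fitting) and Section 3.2 (the bias-variance decomposition).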