Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Linear regression is the simplest model that shows the full training loop: initialize parameters, compute predictions, compute the loss, compute the gradient, update. If you can implement this from scratch in NumPy, every other model becomes a variation on that loop. Most practitioners skip this and go straight to sklearn.linear_model.LinearRegression(), which means they can't debug when the loss doesn't converge, can't tune the learning rate intelligently, and can't read papers that use the same notation.
Gradient descent makes the connection between calculus and code explicit: at each epoch the code computes how much each parameter is responsible for the current error, then nudges every parameter a small step in the direction that reduces that error. Watching the MSE print line-by-line reveals what convergence actually looks like — a rapid early drop followed by diminishing returns that explains why most training runs plateau well before epoch 200.
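The link between the calculus and the loop can be written out in three lines. This is a sketch in the demo's own names (`X_b`, `w`, `y`, `n`); the symbol \eta for the learning rate `lr` is a notational assumption, not something the lesson itself defines.

```latex
\begin{align*}
\mathrm{MSE}(w) &= \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2, \qquad \hat{y} = X_b w \\
\nabla_w \mathrm{MSE}(w) &= \frac{2}{n} X_b^{\top} (X_b w - y) \\
w &\leftarrow w - \eta \, \nabla_w \mathrm{MSE}(w)
\end{align*}
```

The second line corresponds to `grad = (2 / n) * X_b.T @ residuals` in the demo, and the third to `w -= lr * grad`.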
Change `lr=0.01` to `lr=0.1`. Watch the MSE: it may diverge (blow up toward inf or NaN). Then try `lr=0.001` and compare convergence speed. This is the learning-rate sensitivity you'll hit on every real model.
Compute the closed-form solution with `w = np.linalg.pinv(X_b) @ y`. Compare to 200 epochs of GD. When would you prefer each?
Increase the noise (e.g. `rng.normal(0, 10, 100)`). Does the model still converge to slope≈3? How does final MSE change?
Print `w` before training and after 5, 50, and 200 epochs. Watch slope converge toward 3 and intercept toward 2.

Use these three prompts in order. Each builds on the one before.
In one paragraph, explain what MSE measures, why we square the residuals instead of taking the absolute value, and why minimizing MSE is equivalent to finding the best-fit line.
Derive the gradient of MSE with respect to `w` step by step. Start from `MSE = (1/n) Σ (y_hat − y)²`, expand `y_hat = X_b @ w`, and take the partial derivative. Why does the result have `X_b.T` in it?
When does gradient descent fail on a linear regression problem that the closed-form normal equation handles fine, and vice versa? Specifically address: feature collinearity, very large n, very large d, ill-conditioned X. Which scenario pushes you toward each method?
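For the gradient-derivation prompt, one way to check a hand-derived result is to compare it against finite differences. The sketch below rebuilds the demo's data and compares the analytic gradient `(2 / n) * X_b.T @ residuals` with a central-difference estimate. The helper names `mse`, `analytic_grad`, `numeric_grad`, the test point `w_test`, and the step size `eps` are all hypothetical, chosen only for illustration.

```python
import numpy as np

# Same synthetic data as the demo: y = 3x + 2 plus unit Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 100).reshape(-1, 1)
y = 3 * X.squeeze() + 2 + rng.normal(0, 1, 100)
X_b = np.hstack([X, np.ones((100, 1))])
n = len(y)

def mse(w):
    """Mean squared error of the linear model X_b @ w."""
    return ((X_b @ w - y) ** 2).mean()

def analytic_grad(w):
    """Gradient of MSE with respect to w, as used in the demo loop."""
    return (2 / n) * X_b.T @ (X_b @ w - y)

def numeric_grad(w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (mse(w + step) - mse(w - step)) / (2 * eps)
    return g

w_test = np.array([0.5, -1.0])  # arbitrary point, not special in any way
print("analytic:", analytic_grad(w_test))
print("numeric: ", numeric_grad(w_test))
```

If the two printed vectors agree to several decimal places, the derived formula (including the `X_b.T` factor) is very likely right. A mismatch usually points to a missing factor of 2, a missing `1/n`, or a transposed matrix.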
import numpy as np

# Synthetic data: y = 3x + 2 plus unit-variance Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 100).reshape(-1, 1)
y = 3 * X.squeeze() + 2 + rng.normal(0, 1, 100)
X_b = np.hstack([X, np.ones((100, 1))])  # add bias column → (100, 2)
w = rng.standard_normal(2)  # random init: w[0] is the slope, w[1] the intercept
lr, n = 0.01, len(y)

for epoch in range(200):
    y_hat = X_b @ w                      # predictions
    residuals = y_hat - y
    mse = (residuals ** 2).mean()        # loss, measured before this epoch's update
    grad = (2 / n) * X_b.T @ residuals   # gradient of MSE with respect to w
    w -= lr * grad                       # gradient descent step
    if epoch % 40 == 0:
        print(f"epoch {epoch:3d} MSE={mse:.4f} w={w.round(3)}")

print(f"\nLearned: slope={w[0]:.3f}, intercept={w[1]:.3f}")
print("Truth: slope=3.000, intercept=2.000")

Run it with `python3 main.py`.
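The learning-rate and closed-form experiments from the try-it list can be sketched in one script. This is a minimal illustration rather than part of the original demo: the helper `train` is hypothetical, and it starts from `w = 0` instead of the demo's random init (an assumption made so the three runs are directly comparable).

```python
import numpy as np

# Same synthetic data as the demo: y = 3x + 2 plus unit Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 100).reshape(-1, 1)
y = 3 * X.squeeze() + 2 + rng.normal(0, 1, 100)
X_b = np.hstack([X, np.ones((100, 1))])

def train(lr, epochs=200):
    """Hypothetical helper: batch GD from w = 0, returns (w, final MSE)."""
    w = np.zeros(2)  # assumption: fixed zero init so runs are comparable
    n = len(y)
    # Divergent runs can overflow float64; silence those warnings.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(epochs):
            grad = (2 / n) * X_b.T @ (X_b @ w - y)
            w = w - lr * grad
        mse = ((X_b @ w - y) ** 2).mean()
    return w, mse

# Sweep three learning rates with the same data and the same start point.
results = {lr: train(lr) for lr in (0.001, 0.01, 0.1)}
for lr, (w, mse) in results.items():
    print(f"lr={lr:<6} MSE={mse:.4g}  w={w}")

# Closed-form least-squares solution via the pseudoinverse.
w_closed = np.linalg.pinv(X_b) @ y
mse_closed = ((X_b @ w_closed - y) ** 2).mean()
print(f"closed form  MSE={mse_closed:.4g}  w={w_closed}")
```

On this data you should see the `lr=0.1` run blow up, the `lr=0.001` run finish with a higher MSE than `lr=0.01` after the same 200 epochs, and the closed-form solution reach the lowest MSE of all. The intercept is the slow parameter here, so the GD runs may still be some distance from 2 when they stop.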