Linear regression from scratch — NumPy, no sklearn

medium

Why this matters

Linear regression is the simplest model that shows the full training loop: initialize parameters, compute predictions, compute the loss, compute the gradient, update the parameters. If you can implement this from scratch in NumPy, every other model becomes a variation on that loop. Many practitioners skip this step and go straight to sklearn.linear_model.LinearRegression(), which leaves them unable to debug a loss that won't converge, tune the learning rate intelligently, or read papers that use the same notation.

Demo

Gradient descent makes the connection between calculus and code explicit: at each epoch the code computes the gradient, a measure of how much each parameter contributes to the current error, then nudges every parameter a small step in the direction that reduces that error. Watching the MSE print line by line reveals what convergence actually looks like: a rapid early drop followed by diminishing returns, which is why the loss plateaus well before epoch 200.
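The loop described above can be sketched as follows. This is a minimal reconstruction rather than the lesson's exact demo: the names X_b, w, lr and rng, the 200 epochs, the noise call and the target slope 3 and intercept 2 come from the try-it list, while the seed, the input range uniform(0, 10), the zero initialization and the print interval are assumptions.

```python
import numpy as np

# Assumed data: 100 points from y = 3x + 2 plus unit Gaussian noise.
rng = np.random.default_rng(0)            # seed is an assumption
x = rng.uniform(0, 10, 100)               # input range is an assumption
y = 3 * x + 2 + rng.normal(0, 1, 100)

# Prepend a column of ones so the intercept is learned as w[0].
X_b = np.column_stack([np.ones_like(x), x])
w = np.zeros(2)                           # [intercept, slope], assumed zero init
lr, epochs = 0.01, 200

history = []
for epoch in range(epochs):
    y_hat = X_b @ w                           # prediction
    residual = y_hat - y
    mse = np.mean(residual ** 2)              # loss at the current w
    grad = (2 / len(y)) * X_b.T @ residual    # gradient of MSE w.r.t. w
    w -= lr * grad                            # update
    history.append(mse)
    if epoch % 20 == 0:
        print(f"epoch {epoch:3d}  MSE {mse:10.4f}")

print("intercept, slope:", w)
```

With unscaled inputs like these, the slope tends to settle within the first few epochs while the intercept moves toward 2 much more slowly, so most of the late, small improvements in MSE come from the intercept.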

Try it yourself

Change lr=0.01 to lr=0.1. Watch the MSE — it may diverge (NaN). Then try lr=0.001 and compare convergence speed. This is the learning-rate sensitivity you'll hit on every real model.
Replace the gradient loop with the closed-form solution: w = np.linalg.pinv(X_b) @ y. Compare to 200 epochs of GD. When would you prefer each?
Change noise magnitude from 1 to 10 (rng.normal(0, 10, 100)). Does the model still converge to slope≈3? How does final MSE change?
Print w before training and after 5, 50, and 200 epochs. Watch slope converge toward 3 and intercept toward 2.
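One possible harness for the first two experiments is sketched below. The helper fit_gd is a hypothetical name introduced here, and the data setup (seed, input range uniform(0, 10), zero initialization) is assumed rather than taken from the lesson's demo. It runs the same gradient loop at three learning rates, then fits the closed-form solution for comparison.

```python
import numpy as np

def fit_gd(X_b, y, lr=0.01, epochs=200):
    """Batch gradient descent on MSE; returns (weights, final MSE)."""
    w = np.zeros(X_b.shape[1])
    n = len(y)
    # Silence overflow warnings so a diverging run just reports a huge or inf MSE.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(epochs):
            grad = (2 / n) * X_b.T @ (X_b @ w - y)
            w = w - lr * grad
        mse = np.mean((X_b @ w - y) ** 2)
    return w, mse

# Assumed data: y = 3x + 2 plus unit Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 1, 100)
X_b = np.column_stack([np.ones_like(x), x])

# Same loop, three learning rates.
results = {lr: fit_gd(X_b, y, lr=lr) for lr in (0.1, 0.01, 0.001)}
for lr, (w, mse) in results.items():
    print(f"lr={lr:<6} w={w}  MSE={mse:.4g}")

# Closed-form least-squares solution via the pseudo-inverse.
w_closed = np.linalg.pinv(X_b) @ y
mse_closed = np.mean((X_b @ w_closed - y) ** 2)
print(f"closed form: w={w_closed}  MSE={mse_closed:.4g}")
```

The closed-form MSE is the minimum any w can reach on this data, so it gives a floor against which each gradient-descent run can be judged.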

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what MSE measures, why we square the residuals instead of taking absolute value, and why minimizing MSE is equivalent to finding the best-fit line.

2. Why it works (the mechanism)

Derive the gradient of MSE with respect to `w` step by step. Start from `MSE = (1/n) Σ (y_hat − y)²`, expand `y_hat = X_b @ w`, and take the partial derivative. Why does the result have `X_b.T` in it?
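Once you have a derivation, a finite-difference check is a quick way to confirm it. The sketch below compares the analytic gradient (2/n) * X_b.T @ (X_b @ w - y) against central differences of the MSE. The function names mse and mse_grad, the test point and the data setup are assumptions made for illustration.

```python
import numpy as np

def mse(w, X_b, y):
    """Mean squared error of the linear model X_b @ w."""
    return np.mean((X_b @ w - y) ** 2)

def mse_grad(w, X_b, y):
    """Analytic gradient of MSE with respect to w."""
    return (2 / len(y)) * X_b.T @ (X_b @ w - y)

# Assumed data: y = 3x + 2 plus unit Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 1, 100)
X_b = np.column_stack([np.ones_like(x), x])

w = np.array([0.5, -1.0])   # arbitrary point at which to check the gradient
eps = 1e-6

# Central differences: perturb one coordinate of w at a time.
numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (mse(w + step, X_b, y) - mse(w - step, X_b, y)) / (2 * eps)

analytic = mse_grad(w, X_b, y)
print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two vectors disagree beyond rounding error, the derivation (or its translation into code) has a mistake, most often a missing factor of 2 or a missing transpose.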

3. Advanced — application & what's next

When does gradient descent fail on a linear regression problem that the closed-form normal equation handles fine, and vice versa? Specifically address: feature collinearity, very large n, very large d, ill-conditioned X. Which scenario pushes you toward each?
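To make the collinearity case concrete before asking, here is a small sketch under assumed data: the design matrix X_dup (a hypothetical name) contains the same feature twice, so X.T @ X is singular up to rounding. The condition number shows the problem, and the pseudo-inverse still returns a usable minimum-norm solution. The explicit rcond value is a choice made here so that the near-zero singular value is reliably discarded.

```python
import numpy as np

# Assumed data: y = 3x + 2 plus unit Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 1, 100)

# Hypothetical worst case of collinearity: the feature appears twice.
X_dup = np.column_stack([np.ones_like(x), x, x])

# Condition number of the matrix the normal equation would have to invert.
gram = X_dup.T @ X_dup
cond = np.linalg.cond(gram)
print("condition number of X.T @ X:", cond)

# The pseudo-inverse drops the near-zero singular value and returns the
# minimum-norm least-squares solution.
w_pinv = np.linalg.pinv(X_dup, rcond=1e-10) @ y
mse = np.mean((X_dup @ w_pinv - y) ** 2)
print("pinv weights:", w_pinv)
print("MSE:", mse)
```

The two duplicated columns end up sharing the slope equally, which is what "minimum norm" means here: of all weight vectors with the same fit, the pseudo-inverse picks the smallest one.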