Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Logistic regression is the bridge between linear models and neural networks. The sigmoid squashes the linear score w·x + b into (0, 1) so it can be interpreted as a probability. The decision boundary is the hyperplane where the predicted probability equals 0.5, which is exactly where w·x + b = 0. Logistic regression's entire expressive power therefore comes from one linear boundary. That limit is why it fails on any XOR-like problem, and it is the reason neural networks add hidden layers.
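To make the boundary concrete, here is a minimal sketch of the path from score to probability to class. The weights `w`, the bias `b`, and the three sample points are hand-picked assumptions for illustration, not values from a trained model.

```python
import numpy as np

def sigmoid(z):
    # Squash any real-valued score into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters chosen by hand for illustration only.
w = np.array([1.0, 1.0])
b = -1.0

points = np.array([
    [0.5, 0.5],   # score =  0 -> exactly on the boundary
    [2.0, 2.0],   # score = +3 -> positive side
    [-1.0, 0.0],  # score = -2 -> negative side
])

scores = points @ w + b
probs = sigmoid(scores)
labels = (probs >= 0.5).astype(int)

for p, s, pr, lab in zip(points, scores, probs, labels):
    print(f"x={p}, score={s:+.1f}, probability={pr:.3f}, class={lab}")
```

The first point has a score of zero, so its probability is exactly 0.5. Moving to either side of the line w·x + b = 0 pushes the probability toward 1 or 0.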
Logistic regression is the minimal neural network: one layer, no hidden units, sigmoid activation. Implementing gradient descent on cross-entropy loss by hand exposes the gradient formula X.T @ (y_hat - y) / n. The same "inputs transposed times error" pattern reappears in the weight gradients that backpropagation computes for the layers of the neural networks you write later. Deriving it once from scratch makes the gradient sections of ML papers much easier to follow.
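One way to convince yourself the formula is right is to compare it against a finite-difference estimate of the loss gradient. The sketch below does that on small random data. The helper names (`cross_entropy`, `analytic_grad`, `numeric_grad`), the random data, and the step size `h` are all assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X_b, y):
    # Mean binary cross-entropy; eps guards against log(0).
    eps = 1e-12
    y_hat = sigmoid(X_b @ w)
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def analytic_grad(w, X_b, y):
    # The formula from the lesson: X.T @ (y_hat - y) / n.
    return X_b.T @ (sigmoid(X_b @ w) - y) / len(y)

def numeric_grad(w, X_b, y, h=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (cross_entropy(w + step, X_b, y)
                   - cross_entropy(w - step, X_b, y)) / (2 * h)
    return grad

# Small random problem (illustrative data, bias column appended last).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
X_b = np.hstack([X, np.ones((20, 1))])
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_grad(w, X_b, y) - numeric_grad(w, X_b, y)))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

If the two gradients agree to within numerical noise, the hand-derived formula matches the loss it is supposed to differentiate.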
Set `X_pos = rng.normal([1,0], 1.0, (50,2))` and `X_neg = rng.normal([0,1], 1.0, (50,2))`. Retrain and watch accuracy drop below 100%: the classes now overlap, so no single linear boundary can separate them.
Set `x_new = np.array([0.0, 0.0, 1.0])`. `sigmoid(w @ x_new)` should be near 0.5 because the point sits on, or very near, the decision boundary. Why?
Use `grad = X_b.T @ (y_hat - y) / len(y) + (lam/len(y)) * np.append(w[:-1], 0)` with `lam=1.0`. Compare weights with and without. They shrink: this is weight decay.
Compare `LogisticRegression(C=1e6).fit(X, y).coef_[0]` with your `w[:2]`. They should point in a similar direction but will not be identical, and the magnitudes may differ. Investigate why (solver differences, and how long each optimizer runs).
Use these three prompts in order. Each builds on the one before.
Explain what the sigmoid function does and why it's used in logistic regression. If `w·x + b = 0`, what probability does sigmoid output? If the score is +10?
Walk me through the decision boundary: from features → linear score → probability → class prediction. What is the boundary geometrically and algebraically?
Logistic regression assumes log-odds linearity in x. Give one real classification problem where this holds, one where it's borderline, and one where it clearly fails. For the failing case, name two fixes (feature engineering, kernel trick, neural net) and explain why each works.
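As a sketch of the failing case and the feature-engineering fix, the example below trains on the four XOR points twice: once on the raw inputs and once with an extra product feature x1 * x2. The product feature is only one illustrative choice, and the helper names `fit_logistic` and `accuracy` and the settings `lr=1.0`, `epochs=5000` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit_logistic(X, y, lr=1.0, epochs=5000):
    # Plain batch gradient descent on cross-entropy, bias as the last weight.
    X_b = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(X_b.shape[1])
    for _ in range(epochs):
        w -= lr * X_b.T @ (sigmoid(X_b @ w) - y) / len(y)
    return w

def accuracy(X, y, w):
    X_b = np.hstack([X, np.ones((len(X), 1))])
    return ((sigmoid(X_b @ w) >= 0.5).astype(int) == y).mean()

# The four XOR points: the label is 1 when exactly one input is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Raw features: no single line separates the two classes.
acc_raw = accuracy(X, y, fit_logistic(X, y))

# Engineered feature (an illustrative choice): append the product x1 * x2.
X_eng = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
acc_eng = accuracy(X_eng, y, fit_logistic(X_eng, y))

print(f"raw features:       accuracy = {acc_raw:.2f}")
print(f"with x1*x2 feature: accuracy = {acc_eng:.2f}")
```

The model itself is unchanged in the second run. The extra feature lifts the data into a space where one hyperplane is enough, which is also the idea behind kernels and hidden layers.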
import numpy as np

def sigmoid(z):
    # Clip the score so np.exp cannot overflow for extreme inputs.
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def logistic_gd(X, y, lr=0.1, epochs=200):
    # Append a column of ones so the bias is learned as the last weight.
    X_b = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(X_b.shape[1])
    for _ in range(epochs):
        y_hat = sigmoid(X_b @ w)
        grad = X_b.T @ (y_hat - y) / len(y)  # gradient of mean cross-entropy
        w -= lr * grad
    return w

# Two well-separated Gaussian blobs: class 1 around (2, 2), class 0 around (-2, -2).
rng = np.random.default_rng(42)
X_pos = rng.normal([2, 2], 0.8, (50, 2))
X_neg = rng.normal([-2, -2], 0.8, (50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1]*50 + [0]*50)

w = logistic_gd(X, y)
print(f"Weights: {w.round(3)}")

# Predict class 1 wherever the probability is at least 0.5.
X_b = np.hstack([X, np.ones((100, 1))])
preds = (sigmoid(X_b @ w) >= 0.5).astype(int)
print(f"Train accuracy: {(preds == y).mean():.3f}")

python3 main.py
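For the weight-decay item in the try-it list, a self-contained sketch might look like the following. The function name `logistic_gd_l2` is hypothetical; it repeats the demo's training loop and adds the L2 term from the try-it formula, leaving the bias unpenalized. The data uses the same seed and settings as the demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def logistic_gd_l2(X, y, lr=0.1, epochs=200, lam=0.0):
    # Same loop as the demo, plus an optional L2 penalty on the
    # feature weights. The bias (last entry) is not penalized.
    X_b = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(X_b.shape[1])
    for _ in range(epochs):
        y_hat = sigmoid(X_b @ w)
        grad = X_b.T @ (y_hat - y) / len(y)
        grad += (lam / len(y)) * np.append(w[:-1], 0)
        w -= lr * grad
    return w

rng = np.random.default_rng(42)
X = np.vstack([rng.normal([2, 2], 0.8, (50, 2)),
               rng.normal([-2, -2], 0.8, (50, 2))])
y = np.array([1] * 50 + [0] * 50)

w_plain = logistic_gd_l2(X, y, lam=0.0)
w_decay = logistic_gd_l2(X, y, lam=1.0)
print(f"lam=0.0: {w_plain.round(3)}")
print(f"lam=1.0: {w_decay.round(3)}")
```

With `lam=1.0` and 100 samples the penalty is small, so expect a modest shrink in the feature weights, not a dramatic one. Raising `lam` makes the effect easier to see.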