Logistic regression — sigmoid, decision boundary, binary classification

hard

Why this matters

Logistic regression is the bridge between linear models and neural networks. The sigmoid squashes the linear score w·x + b into the interval (0, 1) so that it can be interpreted as a probability. The decision boundary is the hyperplane where the predicted probability equals 0.5, which is exactly where w·x + b = 0. Logistic regression's entire expressive power therefore comes from a single linear boundary. That limit is the point: the model fails on any XOR-like problem, where no single hyperplane separates the classes, and this failure is why neural networks add hidden layers.
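The link between the 0.5 threshold and the sign of the score can be checked numerically. The sketch below is a minimal illustration: the weights `w`, the bias `b`, and the three test points are hypothetical values chosen so that one point sits exactly on the boundary.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a 2-feature model (illustration only).
w = np.array([2.0, -1.0])
b = 0.5

# Three points: one on the boundary and one on each side of it.
X = np.array([
    [0.0, 0.5],    # w.x + b = 0.0  -> on the boundary
    [1.0, 0.0],    # w.x + b = 2.5  -> positive side
    [-1.0, 1.0],   # w.x + b = -2.5 -> negative side
])

scores = X @ w + b                    # linear score for each point
probs = sigmoid(scores)               # probability of class 1
labels = (probs >= 0.5).astype(int)   # same result as (scores >= 0)
```

Because the sigmoid is monotonic and sigmoid(0) = 0.5, thresholding the probability at 0.5 gives the same labels as checking the sign of the score.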

Demo

Logistic regression is the minimal neural network: one layer, no hidden units, sigmoid activation. Implementing gradient descent on the cross-entropy loss by hand exposes the gradient formula X.T @ (y_hat - y) / n, where X is the n-row design matrix, y_hat holds the predicted probabilities, and y holds the 0/1 labels. The same "prediction minus target" error term reappears at the output layer of a neural network trained with a sigmoid (or softmax) output and cross-entropy loss, and backpropagation passes it back through the hidden layers. Deriving it once from scratch makes the gradient sections of ML papers much easier to follow.
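A minimal sketch of that hand-written training loop is shown below. It is not the lesson's original demo: the blob centers, the seed, the learning rate `lr`, and the step count are all assumptions chosen for illustration. The bias is stored as the last entry of `w` by appending a column of ones to `X`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy; clipping avoids log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Assumed toy data: two well-separated Gaussian blobs (illustrative values).
rng = np.random.default_rng(0)
X_pos = rng.normal([2.0, 2.0], 1.0, (50, 2))
X_neg = rng.normal([-2.0, -2.0], 1.0, (50, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(50), np.zeros(50)])

# Append a column of ones so the bias is the last entry of w.
X_b = np.hstack([X, np.ones((len(X), 1))])
w = np.zeros(X_b.shape[1])

lr = 0.1        # assumed learning rate
losses = []
for _ in range(500):
    y_hat = sigmoid(X_b @ w)                  # forward pass
    losses.append(cross_entropy(y, y_hat))
    grad = X_b.T @ (y_hat - y) / len(y)       # gradient of the mean loss
    w -= lr * grad                            # gradient descent step

accuracy = np.mean((sigmoid(X_b @ w) >= 0.5) == y)
```

Starting from `w = 0`, every prediction is 0.5, so the first recorded loss is log 2; the loss should fall from there as the boundary rotates into place.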

Try it yourself

Make the classes overlap so they are no longer linearly separable: X_pos = rng.normal([1,0], 1.0, (50,2)), X_neg = rng.normal([0,1], 1.0, (50,2)). Retrain and watch accuracy drop below 100%: no straight line can separate overlapping classes, so some points always land on the wrong side of the boundary.

Compute the predicted probability for x_new = np.array([0.0, 0.0, 1.0]), where the trailing 1.0 is the bias input. sigmoid(w @ x_new) should be near 0.5, which means the origin lies on or close to the decision boundary. Why?

Add L2 regularization: grad = X_b.T @ (y_hat - y) / len(y) + (lam/len(y)) * np.append(w[:-1], 0) with lam=1.0. The appended 0 leaves the bias unpenalized. Compare the weights with and without the penalty. They shrink: this is weight decay.

Compare to sklearn: LogisticRegression(C=1e6).fit(X, y).coef_[0] vs your w[:2]. They should be close but not identical. Investigate why: candidate causes include the solver, its stopping tolerance, and the small L2 penalty that remains at C=1e6.
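The overlap and regularization exercises can be combined into one short experiment. The sketch below assumes the same layout as the snippets above (bias as the last entry of `w`); the helper `train`, the seed, the learning rate, and the step count are hypothetical choices, not part of the lesson's demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X_b, y, lam=0.0, lr=0.1, steps=2000):
    """Gradient descent on mean cross-entropy with an optional L2 penalty.

    The bias (last entry of w) is excluded from the penalty.
    """
    w = np.zeros(X_b.shape[1])
    for _ in range(steps):
        y_hat = sigmoid(X_b @ w)
        grad = X_b.T @ (y_hat - y) / len(y)
        grad += (lam / len(y)) * np.append(w[:-1], 0.0)
        w -= lr * grad
    return w

# Overlapping classes from the first exercise (the seed is arbitrary).
rng = np.random.default_rng(0)
X_pos = rng.normal([1, 0], 1.0, (50, 2))
X_neg = rng.normal([0, 1], 1.0, (50, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(50), np.zeros(50)])
X_b = np.hstack([X, np.ones((len(X), 1))])

w_plain = train(X_b, y, lam=0.0)
w_reg = train(X_b, y, lam=1.0)

accuracy = np.mean((sigmoid(X_b @ w_plain) >= 0.5) == y)

# Probability at the origin: w @ [0, 0, 1] picks out the bias term alone.
p_origin = sigmoid(w_plain @ np.array([0.0, 0.0, 1.0]))
```

On overlapping data the regularized weights should come out with a smaller norm than the unregularized ones, and training accuracy should stay below 100%. The exact numbers, including how close `p_origin` is to 0.5, depend on the random sample.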

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Explain what the sigmoid function does and why it's used in logistic regression. If `w·x + b = 0`, what probability does sigmoid output? If the score is +10?

2. Why it works (the mechanism)

Walk me through the decision boundary: from features → linear score → probability → class prediction. What is the boundary geometrically and algebraically?

3. Advanced — application & what's next

Logistic regression assumes the log-odds are linear in x. Give one real classification problem where this holds, one where it's borderline, and one where it clearly fails. For the failing case, name two fixes (for example feature engineering, the kernel trick, or a neural net) and explain why each works.
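The failing case and the feature-engineering fix from the third prompt can be made concrete with XOR. The sketch below is one possible illustration under stated assumptions: inputs are coded as -1/+1, the helper `fit_and_score` is hypothetical, and the learning rate and step count are arbitrary. The engineered feature is the product x1 * x2, whose sign alone determines the XOR label.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_and_score(X, y, lr=0.5, steps=200):
    """Train by gradient descent on cross-entropy; return training accuracy."""
    X_b = np.hstack([X, np.ones((len(X), 1))])   # bias as the last column
    w = np.zeros(X_b.shape[1])
    for _ in range(steps):
        y_hat = sigmoid(X_b @ w)
        grad = X_b.T @ (y_hat - y) / len(y)
        w -= lr * grad
    return np.mean((sigmoid(X_b @ w) >= 0.5) == y)

# XOR with inputs coded as -1/+1: the label is 1 when the signs differ.
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# Raw features: no single line separates the two classes.
acc_linear = fit_and_score(X, y)

# Feature engineering: add the product x1 * x2 as a third input.
X_eng = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])
acc_engineered = fit_and_score(X_eng, y)
```

With the raw features the model cannot classify more than three of the four points correctly. With the product feature added, the classes become linearly separable in the three-dimensional feature space, so the same model and the same training loop fit them.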