The supervised learning loop — data, labels, fit, predict

Difficulty: easy

Why this matters

Supervised learning is the workhorse of production ML: give the algorithm input–output pairs and it learns a mapping from inputs to outputs. Every stage of a training pipeline (data loading, preprocessing, fitting, evaluation, deployment) exists to serve this loop. Understanding the vocabulary (train split, val split, epoch, batch size, overfitting) in terms of the loop, rather than as abstract buzzwords, makes papers and tutorials much easier to read.

Demo

Supervised learning follows a fixed four-step contract: split your labelled data, call .fit() on the training portion, call .predict() on the held-out test portion, and score the predictions against the true labels. The Iris dataset fits this pattern cleanly in about 20 lines: short enough to read in one pass, real enough that the accuracy number is meaningful.
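The four steps can be sketched as follows. This is a minimal version assuming scikit-learn is installed; test_size=0.2, random_state=42 and LogisticRegression come from the exercises in this lesson, while max_iter=200 is an assumption added so the solver has room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data and labels: 150 flowers, 4 measurements each, 3 species.
X, y = load_iris(return_X_y=True)

# Step 1: split the labelled data into a training and a held-out test portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: fit on the training portion only.
# max_iter=200 is an assumption, not a value taken from the lesson.
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Step 3: predict on the held-out test portion.
y_pred = model.predict(X_test)

# Step 4: score the predictions against the true labels.
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```

The model never sees X_test or y_test during fitting, which is what lets the final number stand in for performance on new data.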

Try it yourself

Change test_size=0.2 to test_size=0.5. Record the new accuracy. Then try test_size=0.9. Why can a larger test split lower accuracy, even though a larger test set makes the accuracy estimate itself less noisy? (Hint: how many examples are left for training?)
Remove random_state=42. Run the code 3 times and record the accuracy each time. This demonstrates why you must fix the seed for reproducible experiments.
After fitting, print model.coef_ and model.intercept_. How many rows does model.coef_ have, and why? (Hint: how many classes?)
Replace LogisticRegression with DecisionTreeClassifier (from sklearn.tree import DecisionTreeClassifier) and change nothing else. Compare accuracy. Swapping the model while keeping the rest of the pipeline identical is the core of algorithm comparison.
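One way to run these experiments without copying the script each time is to wrap the loop in a small helper. The sketch below is an illustration: run_experiment is a hypothetical name, and max_iter=200 and the tree's random_state=0 are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def run_experiment(model, test_size=0.2, random_state=42):
    """Split Iris, fit `model`, and return accuracy on the held-out split."""
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Vary the split size while the model and seed stay fixed.
for test_size in (0.2, 0.5, 0.9):
    acc = run_experiment(LogisticRegression(max_iter=200), test_size=test_size)
    print(f"test_size={test_size}: accuracy={acc:.3f}")

# Swap the model while the split stays fixed.
for model in (
    LogisticRegression(max_iter=200),
    DecisionTreeClassifier(random_state=0),
):
    acc = run_experiment(model)
    print(f"{type(model).__name__}: accuracy={acc:.3f}")
```

Passing random_state=None to the helper reproduces the unseeded behaviour from the second exercise.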

Prompt your AI

Use these three prompts in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain the difference between training data and test data. Why does evaluating on the same data you trained on give a misleading accuracy number?
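To check the AI's answer against something concrete, the sketch below uses a deliberately artificial setup: the Iris labels are shuffled so that there is no real pattern to learn. The shuffling, the seeds, and the choice of an unrestricted decision tree are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical setup: shuffle the labels so they carry no information about X.
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_shuffled, test_size=0.2, random_state=42
)

# An unrestricted tree keeps splitting until it has memorised its training rows.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # rows the model has already seen
test_acc = model.score(X_test, y_test)     # rows the model has never seen
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The training score rewards memorisation, so it stays high even though nothing generalisable was learned. Only the held-out score reveals that.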

2. Why it works (the mechanism)

Walk me through what `model.fit(X_train, y_train)` actually does internally for LogisticRegression — not just 'it learns' but the iterative optimization that adjusts weights to minimize a loss. What stops the iteration?
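As a reference point for the answer, here is a minimal sketch of iterative optimisation for a two-class logistic regression, written with plain gradient descent in NumPy. It is not scikit-learn's implementation, which by default uses a different solver (lbfgs) and adds L2 regularisation. The function name fit_logistic, the learning rate, the tolerance, and the restriction to two Iris classes are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
keep = y > 0                      # keep versicolor and virginica only
X, y = X[keep], y[keep] - 1       # relabel the two classes as 0 and 1
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise each feature


def fit_logistic(X, y, lr=0.1, tol=1e-4, max_iter=1000):
    """Gradient descent on the mean log loss; returns weights, bias, steps."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for iteration in range(1, max_iter + 1):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        grad_w = X.T @ (p - y) / n_samples       # gradient w.r.t. weights
        grad_b = np.mean(p - y)                  # gradient w.r.t. bias
        # Stopping rule 1: the gradient is nearly flat, so stop early.
        if max(np.abs(grad_w).max(), abs(grad_b)) < tol:
            break
        w -= lr * grad_w
        b -= lr * grad_b
    # Stopping rule 2: the loop ends on its own after max_iter steps.
    return w, b, iteration


w, b, n_iter = fit_logistic(X, y)
pred = (X @ w + b > 0).astype(int)
train_accuracy = np.mean(pred == y)
print(f"stopped after {n_iter} iterations, training accuracy {train_accuracy:.3f}")
```

The two stopping rules mirror the tol and max_iter parameters that LogisticRegression exposes, although the library's convergence test differs in detail.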

3. Advanced — application & what's next

Production ML systems often have data leakage — information from the test set somehow influences training. List three concrete ways leakage can occur in a standard scikit-learn pipeline (preprocessing, feature engineering, cross-validation) and how to prevent each.
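One of these cases, preprocessing leakage during cross-validation, can be sketched as follows. The leaky pattern appears only in comments. The safer pattern wraps the scaler and the model in a pipeline so the scaler is refit on the training folds of each split. The choice of StandardScaler, cv=5 and max_iter=200 are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Leaky pattern (left as comments): the scaler is fit on every row, so the
# mean and variance of the held-out folds influence the training features.
#   X_scaled = StandardScaler().fit_transform(X)
#   cross_val_score(LogisticRegression(max_iter=200), X_scaled, y, cv=5)

# Safer pattern: cross_val_score clones and refits the whole pipeline per
# split, so the scaler only ever sees that split's training folds.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.3f}")
```

On a small, clean dataset like Iris the two patterns may score almost identically. The difference matters more when preprocessing statistics vary noticeably between folds.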