Supervised learning is the workhorse of production ML: give the algorithm input–output pairs and it learns a mapping. Every stage of a training pipeline (data loading, preprocessing, fitting, evaluation, deployment) exists to serve this loop. Understanding the vocabulary (train split, val split, epoch, batch size, overfitting) in terms of the loop, rather than as abstract buzzwords, makes papers and tutorials much easier to read.
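To make the vocabulary concrete, here is a minimal sketch of a training loop in plain NumPy. It is an illustration under stated assumptions, not how scikit-learn fits its models: the data is synthetic (two Gaussian blobs), the model is a hand-written binary logistic regression, and the names `epochs`, `batch_size`, and `learning_rate` are chosen for this example. One epoch is one full pass over the train split; a batch is the slice used for a single weight update; the val split is only ever scored, never trained on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data (an assumption for illustration): two Gaussian blobs.
n_per_class = 100
X = np.vstack([
    rng.normal(loc=-2.0, scale=1.0, size=(n_per_class, 2)),
    rng.normal(loc=2.0, scale=1.0, size=(n_per_class, 2)),
])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

# Train/val split: shuffle once, hold out the last 20% for validation.
order = rng.permutation(len(X))
split = int(0.8 * len(X))
X_train, y_train = X[order[:split]], y[order[:split]]
X_val, y_val = X[order[split:]], y[order[split:]]


def predict_proba(features, w, b):
    """Sigmoid of the linear score: estimated probability of class 1."""
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))


def accuracy(features, labels, w, b):
    """Fraction of examples whose thresholded prediction matches the label."""
    return float(np.mean((predict_proba(features, w, b) >= 0.5) == labels))


w = np.zeros(X_train.shape[1])
b = 0.0
epochs, batch_size, learning_rate = 20, 16, 0.1
history = []  # one (train accuracy, val accuracy) pair per epoch

for epoch in range(epochs):                 # one epoch = one pass over the train split
    shuffled = rng.permutation(len(X_train))
    for start in range(0, len(shuffled), batch_size):
        batch = shuffled[start:start + batch_size]   # one batch = one weight update
        error = predict_proba(X_train[batch], w, b) - y_train[batch]
        # Gradient of the mean log loss with respect to w and b.
        w -= learning_rate * (X_train[batch].T @ error) / len(batch)
        b -= learning_rate * error.mean()
    history.append((accuracy(X_train, y_train, w, b),
                    accuracy(X_val, y_val, w, b)))

train_acc, val_acc = history[-1]
print(f"final train accuracy: {train_acc:.3f}, val accuracy: {val_acc:.3f}")
```

A train accuracy that keeps rising while val accuracy stalls or drops is how overfitting shows up in this kind of per-epoch history.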
Supervised learning follows a fixed four-step contract: split your labelled data, call .fit() on the training portion, call .predict() on the held-out test portion, and score the predictions against the true labels. The Iris dataset fits this pattern cleanly in under 20 lines: short enough to read in one pass, real enough that the accuracy number is actually meaningful.
1. Change `test_size=0.2` to `test_size=0.5`. Record the new accuracy. Then try `test_size=0.9`. Why can a larger test split lower accuracy? (Hint: what happens to the amount of training data?)
2. Remove `random_state=42`. Run the code 3 times and record the accuracy each time. This demonstrates why you must fix the seed for reproducible experiments.
3. Print `model.coef_` and `model.intercept_`. How many rows does `model.coef_` have, and why? (Hint: how many classes?)
4. Replace `LogisticRegression` with `DecisionTreeClassifier` (`from sklearn.tree import DecisionTreeClassifier`), changing nothing else. Compare accuracy. This shows how swapping models while keeping the pipeline identical is the core of algorithm comparison.

Use these three prompts in order. Each builds on the one before.
In one paragraph, explain the difference between training data and test data. Why does evaluating on the same data you trained on give a misleading accuracy number?
Walk me through what `model.fit(X_train, y_train)` actually does internally for LogisticRegression — not just 'it learns' but the iterative optimization that adjusts weights to minimize a loss. What stops the iteration?
Production ML systems often have data leakage — information from the test set somehow influences training. List three concrete ways leakage can occur in a standard scikit-learn pipeline (preprocessing, feature engineering, cross-validation) and how to prevent each.
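One of the leakage paths the last prompt points at, preprocessing fitted before the data is split, can be sketched as follows. This is a minimal illustration, assuming a `StandardScaler` as the preprocessing step and 5-fold cross-validation; neither appears in the lesson's demo. The first variant fits the scaler on all rows, so every fold's held-out rows have already influenced the scaling statistics. The second wraps the scaler in a pipeline so it is refit on the training rows of each fold only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Leaky: the scaler's mean and variance are computed from ALL rows,
# including the rows each fold later treats as held-out data.
X_scaled_all = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=200), X_scaled_all, y, cv=5
)

# Safer: the pipeline is cloned and refit per fold, so the scaler
# only ever sees that fold's training rows.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
clean_scores = cross_val_score(pipeline, X, y, cv=5)

print(f"scaler fit on all data:   {leaky_scores.mean():.3f}")
print(f"scaler fit inside folds:  {clean_scores.mean():.3f}")
```

On a small, clean dataset like Iris the two scores may be nearly identical; the point of the sketch is the structure, since the same mistake can inflate scores noticeably with feature selection or target-dependent preprocessing.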
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
print(f"X shape: {X.shape}, y shape: {y.shape}")  # (150, 4), (150,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)     # learn mapping X_train → y_train
y_pred = model.predict(X_test)  # apply to unseen X_test
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")  # ~0.967
```

Run it with `python3 main.py`.
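The try-it experiments can be scripted rather than edited by hand. The sketch below is one possible way to do that; the helper `evaluate` and the result names are hypothetical and not part of the lesson's `main.py`. It reuses only the calls from the demo plus `DecisionTreeClassifier`, and it deliberately prints results instead of stating them, so the numbers you record are your own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def evaluate(model, test_size, random_state=42):
    """Split, fit, predict, score: the same four steps as the demo."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Experiment 1: vary the test split with the seed held fixed.
size_results = {
    test_size: evaluate(LogisticRegression(max_iter=200), test_size)
    for test_size in (0.2, 0.5, 0.9)
}
print("accuracy by test_size:", size_results)

# Experiment 2: no fixed seed, so each run draws a different split.
unseeded_results = [
    evaluate(LogisticRegression(max_iter=200), 0.2, random_state=None)
    for _ in range(3)
]
print("accuracy without a seed:", unseeded_results)

# Experiment 3: inspect the learned parameters of a fitted model.
logreg = LogisticRegression(max_iter=200)
logreg_acc = evaluate(logreg, 0.2)
coef_shape = logreg.coef_.shape
print("coef_ shape:", coef_shape, "intercept_ shape:", logreg.intercept_.shape)

# Experiment 4: swap the model, keep the pipeline identical.
# random_state=0 on the tree is an added assumption to make its result repeatable.
tree_acc = evaluate(DecisionTreeClassifier(random_state=0), 0.2)
print(f"LogisticRegression: {logreg_acc:.3f}, DecisionTreeClassifier: {tree_acc:.3f}")
```

Because `evaluate` takes any estimator with `.fit()` and `.predict()`, comparing a further model is a one-line change.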