scikit-learn in practice — Pipeline, cross_val_score, joblib

Difficulty: hard

Why this matters

Calling model.fit(X_train, y_train) on raw features is not how production ML works. In production you preprocess features before fitting, evaluate generalisation across multiple splits rather than a single train/test split, and persist the fitted pipeline to disk so you can serve predictions later without re-running preprocessing by hand. scikit-learn's Pipeline chains preprocessing and model into a single estimator that is cross-validated and serialised as one object. Skip this and you risk leaking validation data into preprocessing, which inflates your metrics in ways that often surface only after the model ships.

Demo

A scikit-learn Pipeline chains preprocessing and modelling steps into a single object that can be cross-validated and serialised as a unit. The critical advantage over manual preprocessing is that, during cross-validation, the Pipeline fits its transformers on the training portion of each split only and then applies them, already fitted, to the held-out fold. A StandardScaler is therefore fitted inside each fold rather than once on the full dataset beforehand, which is the difference between an honest held-out score and one inflated by leakage from the held-out data.
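A minimal sketch of such a pipeline follows. The synthetic dataset from make_regression, the step names scaler, select and model, the SelectKBest(k=6) selector, and the file name pipeline.joblib are assumptions chosen to match the exercises in this lesson, not necessarily the lesson's original demo.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the lesson's dataset).
X, y = make_regression(n_samples=300, n_features=10, n_informative=6,
                       noise=10.0, random_state=0)

# Step names are illustrative; they are what set_params and param grids refer to.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression, k=6)),
    ("model", Ridge(alpha=1.0)),
])

# For each of the 5 splits, every step is refitted on the training rows only.
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(f"CV R^2: {scores.mean():.3f} +/- {scores.std():.3f}")

# Fit on all data, then persist the whole pipeline as one object.
pipe.fit(X, y)
path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipe, path)

# Reloading returns scaler, selector and model together, ready to predict.
restored = joblib.load(path)
assert np.allclose(restored.predict(X[:5]), pipe.predict(X[:5]))
```

Because the scaler and selector live inside the persisted object, the serving code only needs joblib.load and predict on raw features.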

Try it yourself

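Remove StandardScaler from the pipeline and re-run CV. Does R² change? For Ridge, whose penalty is scale-sensitive, it should whenever the features are on different scales. This shows that scaling matters for Ridge. To see why it belongs inside the Pipeline, also try scaling the full X once before cross-validating and compare the scores.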
Change k=6 to k=2 and then k=8. Record CV R² at each k. You're doing a manual grid search — this is exactly what GridSearchCV automates.
Replace cross_val_score with cross_validate(pipe, X, y, cv=5, scoring=['r2', 'neg_mean_squared_error']). Print both metrics; they are returned under the keys test_r2 and test_neg_mean_squared_error.
Add ('poly', PolynomialFeatures(degree=2, include_bias=False)) between select and model. Does R² improve? How much does fit time increase? This is how you add nonlinearity to a linear pipeline.
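
The k sweep and the multi-metric evaluation from the list above can be sketched as follows. This assumes k is the k parameter of a SelectKBest step named select, and it uses synthetic data; both are illustrative guesses rather than the lesson's actual demo.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: stands in for the lesson's dataset).
X, y = make_regression(n_samples=300, n_features=10, n_informative=6,
                       noise=10.0, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression, k=6)),
    ("model", Ridge(alpha=1.0)),
])

results = {}
for k in (2, 6, 8):
    # "<step name>__<parameter>" addresses a parameter of a pipeline step.
    pipe.set_params(select__k=k)
    cv = cross_validate(pipe, X, y, cv=5,
                        scoring=["r2", "neg_mean_squared_error"])
    results[k] = {
        "r2": cv["test_r2"].mean(),
        # The scorer is negated so that higher is better; flip the sign back.
        "mse": -cv["test_neg_mean_squared_error"].mean(),
        "fit_time": cv["fit_time"].mean(),
    }
    print(k, results[k])
```

The fit_time entry that cross_validate returns is also a convenient way to measure how much an extra step such as PolynomialFeatures slows down fitting.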

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what a scikit-learn Pipeline is and why it matters for cross-validation. What goes wrong if you fit a StandardScaler on the full dataset before splitting into CV folds?

2. Why it works (the mechanism)

Walk me through what happens step-by-step when `cross_val_score(pipe, X, y, cv=5)` runs. For each fold: which data does each Pipeline step see for fit, which for transform, which for predict? This data flow is what prevents leakage.
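
The per-fold data flow that this prompt asks about can be approximated by hand. The sketch below is a simplified stand-in for what cross_val_score does, not its actual implementation; the synthetic data and the two-step pipeline are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: any regression dataset would do).
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])

manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    fold_pipe = clone(pipe)  # fresh, unfitted copy for every fold

    # fit: the scaler learns mean and std from the training rows only,
    # transforms those rows, and Ridge is fitted on the scaled result.
    fold_pipe.fit(X[train_idx], y[train_idx])

    # score: the held-out rows are transformed with the training-fold
    # statistics, Ridge predicts on them, and R^2 is computed.
    manual_scores.append(fold_pipe.score(X[val_idx], y[val_idx]))

# For a regressor, cv=5 means unshuffled 5-fold splitting, so the
# hand-written loop should reproduce cross_val_score's numbers.
auto_scores = cross_val_score(pipe, X, y, cv=5)
print(np.round(manual_scores, 4))
print(np.round(auto_scores, 4))
```

At no point does the scaler see the held-out rows during fit, which is the property that keeps the score honest.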

3. Advanced — application & what's next

I have a Pipeline with StandardScaler → PolynomialFeatures(degree=2) → Ridge. I want to tune: Ridge alpha (0.1, 1.0, 10.0), PolynomialFeatures degree (1, 2, 3), and whether to include StandardScaler (True/False). Show me the param_grid dict for GridSearchCV and estimate how many model fits happen with 5-fold CV.
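
One way to express that search space, as a sketch: a step is switched off here by setting it to the string "passthrough", which Pipeline accepts in place of a transformer. The step names scaler, poly and model are assumptions.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Step names are illustrative; the param_grid keys must match them.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("model", Ridge()),
])

param_grid = {
    # Include the scaler, or skip the step entirely.
    "scaler": [StandardScaler(), "passthrough"],
    "poly__degree": [1, 2, 3],
    "model__alpha": [0.1, 1.0, 10.0],
}

# 2 scaler options * 3 degrees * 3 alphas = 18 candidates.
n_candidates = len(ParameterGrid(param_grid))
# Each candidate is fitted once per fold.
n_cv_fits = n_candidates * 5
print(n_candidates, n_cv_fits)

# The search object itself; calling search.fit(X, y) would run the fits.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
```

With the default refit=True, GridSearchCV also refits the best candidate once on the full data, so the total is the 90 cross-validation fits plus 1.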