Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Raw model.fit(X_train, y_train) is not how production ML works. In production you preprocess features before fitting, evaluate generalization across multiple splits (not a single train/test split), and persist the fitted pipeline to disk so you can serve predictions later without re-running preprocessing by hand. scikit-learn's Pipeline chains preprocessing and model into a single estimator that can be cross-validated and serialized as one object. Skip this and you risk leaking validation data into preprocessing, which inflates your metrics in ways that may only surface after the model ships.
A scikit-learn Pipeline chains preprocessing and modelling steps into a single object that can be cross-validated and serialised as a unit. The critical advantage over manual preprocessing is that, during cross-validation, the Pipeline fits its transformers on the training folds only and then applies them, already fitted, to the held-out fold. Fitting a StandardScaler inside each fold rather than once on the full dataset beforehand is the difference between an honest held-out score and one inflated by leakage from the test data.
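As a minimal sketch of that per-fold behaviour, the loop below reproduces by hand what cross_val_score does with a Pipeline. The synthetic make_regression data and the names fold_pipe, manual_scores, and auto_scores are assumptions chosen for illustration; they are not part of the lesson's demo.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; any numeric X, y would do).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])
cv = KFold(n_splits=5)  # no shuffling, so both runs see identical folds

manual_scores = []
scaler_uses_train_only = []
for train_idx, test_idx in cv.split(X):
    fold_pipe = clone(pipe)                    # fresh, unfitted copy for this fold
    fold_pipe.fit(X[train_idx], y[train_idx])  # every step is fitted on training rows only

    # The scaler's learned mean matches the training rows, not the full dataset.
    learned_mean = fold_pipe.named_steps["scaler"].mean_
    scaler_uses_train_only.append(np.allclose(learned_mean, X[train_idx].mean(axis=0)))

    # score() transforms the held-out rows with the already-fitted scaler, then predicts.
    manual_scores.append(fold_pipe.score(X[test_idx], y[test_idx]))

auto_scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")

print("Scaler fitted on training rows only:", all(scaler_uses_train_only))
print("Manual loop matches cross_val_score:", np.allclose(manual_scores, auto_scores))
```

Scaling X once before the loop would instead let the held-out rows contribute to the mean and variance, which is the leakage described above.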
1. Remove StandardScaler from the pipeline and re-run CV. Does R² change? For Ridge (scale-sensitive) it should, though the change may be small at alpha=1.0. This shows that the model depends on scaling; because the scaler is itself fitted to data, it belongs inside the Pipeline, not in a step done beforehand.
2. Change k=6 to k=2 and then k=8. Record CV R² at each k. You're doing a manual grid search, which is exactly what GridSearchCV automates.
3. Replace cross_val_score with cross_validate(pipe, X, y, cv=5, scoring=['r2', 'neg_mean_squared_error']). Print both metrics.
4. Insert ('poly', PolynomialFeatures(degree=2, include_bias=False)) between select and model. Does R² improve? How much does fit time increase? This is how you add nonlinearity to a linear pipeline.

Use these three prompts in order. Each builds on the one before.
In one paragraph, explain what a scikit-learn Pipeline is and why it matters for cross-validation. What goes wrong if you fit a StandardScaler on the full dataset before splitting into CV folds?
Walk me through what happens step-by-step when `cross_val_score(pipe, X, y, cv=5)` runs. For each fold: which data does each Pipeline step see for fit, which for transform, which for predict? This data flow is what prevents leakage.
I have a Pipeline with StandardScaler → PolynomialFeatures(degree=2) → Ridge. I want to tune: Ridge alpha (0.1, 1.0, 10.0), PolynomialFeatures degree (1, 2, 3), and whether to include StandardScaler (True/False). Show me the param_grid dict for GridSearchCV and estimate how many model fits happen with 5-fold CV.
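For reference, one possible shape of the grid in the third prompt is sketched below. It assumes the steps are named scaler, poly, and model (names chosen here, not given in the prompt), toggles the scaler by swapping the step for scikit-learn's "passthrough" placeholder, and uses small synthetic data so the search runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Step names are assumptions; they determine the "<step>__<param>" keys below.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", Ridge()),
])

param_grid = {
    "scaler": [StandardScaler(), "passthrough"],  # with / without scaling
    "poly__degree": [1, 2, 3],
    "model__alpha": [0.1, 1.0, 10.0],
}

# 2 scaler options x 3 degrees x 3 alphas = 18 candidates; 18 x 5 folds = 90 fits.
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 18 90

# Synthetic stand-in data (an assumption) to show the search end to end.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=n_folds, scoring="r2")
search.fit(X, y)
print(search.best_params_)
```

With the default refit=True, GridSearchCV fits the best candidate once more on all the data, so the total is 91 fits rather than 90.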
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
import joblib

# Load the California housing regression data (downloaded on first use).
X, y = fetch_california_housing(return_X_y=True)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")

# Scale, keep the 6 features with the highest univariate F-scores, then fit Ridge.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(f_regression, k=6)),
    ("model", Ridge(alpha=1.0)),
])

# Each of the 5 folds refits every step on that fold's training rows only.
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(f"CV R²: {scores.round(3)} mean={scores.mean():.3f}")

# Fit on all data, persist the whole pipeline, then reload it and predict.
pipe.fit(X, y)
joblib.dump(pipe, "/tmp/housing_pipe.pkl")
loaded = joblib.load("/tmp/housing_pipe.pkl")
print(f"Prediction: {loaded.predict(X[:1])[0]:.3f} True: {y[0]:.3f}")

Run with: python3 main.py
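The try-it item that swaps cross_val_score for cross_validate could look roughly like the sketch below. To keep it runnable offline, it uses synthetic make_regression data with 8 features as a stand-in for the housing data (an assumption), so the numbers it prints will not match the demo's.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (an assumption for illustration).
X, y = make_regression(
    n_samples=500, n_features=8, n_informative=6, noise=15.0, random_state=0
)

# Same three steps as the demo pipeline.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(f_regression, k=6)),
    ("model", Ridge(alpha=1.0)),
])

# cross_validate returns a dict with one "test_<metric>" array per scorer,
# plus "fit_time" and "score_time" for each fold.
results = cross_validate(
    pipe, X, y, cv=5, scoring=["r2", "neg_mean_squared_error"]
)

r2 = results["test_r2"]
mse = -results["test_neg_mean_squared_error"]  # flip the sign back to a plain MSE

print(f"CV R²:  {r2.round(3)} mean={r2.mean():.3f}")
print(f"CV MSE: {mse.round(1)} mean={mse.mean():.1f}")
print(f"Mean fit time per fold: {results['fit_time'].mean():.4f}s")
```

scikit-learn reports error metrics negated so that higher is always better; negating the array recovers the usual mean squared error. The fit_time entry is also a convenient way to measure the cost of adding a PolynomialFeatures step.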