Cross-Validation: Reliable Scores
Best Practice: scikit-learn

Cross-Validation

Learn how to use k-fold cross-validation to get a more reliable estimate of model performance than a single train/test split.

Why Cross-Validation?

  • A single train/test split can be unlucky (too easy or too hard).
  • Cross-validation uses multiple splits to average performance.
  • Helps when data is limited.
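The first point is easy to see empirically. As a quick sketch (using the Iris dataset and a small logistic-regression model, chosen here just for illustration), the accuracy from a single split depends on which rows happen to land in the test set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# The same model, evaluated on five different random splits.
scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print("Single-split accuracies:", [round(s, 3) for s in scores])
```

The scores typically differ from split to split, which is exactly the variance that averaging over multiple folds smooths out.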

K-Fold Cross-Validation

In k-fold CV, the data is split into k roughly equal parts (folds). Each fold is used exactly once as the test set, while the remaining k−1 folds act as training data; the k resulting scores are then averaged into a single estimate.
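The mechanics can be seen by iterating over `KFold.split` directly (a toy sketch with 10 samples, separate from the lesson's example below): each iteration yields index arrays for one train/test partition, and across the k iterations every sample appears in a test set exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features each

kf = KFold(n_splits=5, shuffle=True, random_state=0)
all_test_idx = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    all_test_idx.extend(test_idx.tolist())

# Every sample is tested exactly once across the 5 folds.
print(sorted(all_test_idx))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```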

K-Fold with cross_val_score
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

rf = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

# Shuffle before splitting so folds are not ordered by class.
cv = KFold(
    n_splits=5,
    shuffle=True,
    random_state=42
)

# Fits and scores the model once per fold; returns one score per fold.
scores = cross_val_score(
    rf,
    X,
    y,
    cv=cv,
    scoring="accuracy"
)

print("CV scores:", scores)
print("Mean accuracy:", np.mean(scores))
print("Std dev:", np.std(scores))
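For classification problems, a common refinement is StratifiedKFold, which preserves each class's proportion within every fold (in fact, cross_val_score defaults to stratified folds when given an integer cv and a classifier). A minimal sketch on the same Iris setup:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Each fold keeps roughly equal shares of the three Iris classes.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X, y, cv=cv, scoring="accuracy")

print("Stratified CV mean accuracy:", np.mean(scores))
```

Stratification matters most when classes are imbalanced, where a plain KFold split can leave a fold with few or no samples of a rare class.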