scikit-learn ML Toolkit
Pipelines & Models

scikit‑learn (sklearn) is the go‑to Python library for classical machine learning, providing models, preprocessing transformers, metrics, and model-selection utilities behind a consistent API.

The fit / predict Pattern

Every estimator in scikit‑learn follows a simple pattern:

model = SomeEstimator(**params)     # instantiate with hyperparameters
model.fit(X_train, y_train)         # learn model parameters from training data
y_pred = model.predict(X_test)      # apply the fitted model to unseen data

This consistency makes it easy to swap models and use tooling like pipelines and grid search.
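As a concrete sketch of the pattern (using the iris toy dataset that ships with scikit-learn; the choice of LogisticRegression here is purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Toy data: the classic iris dataset (150 samples, 4 numeric features)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Same fit / predict pattern as above, with a concrete estimator
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(model.score(X_test, y_test))  # mean accuracy on the held-out set
```

Swapping in a different estimator (say, RandomForestClassifier) changes only the first line; fit, predict, and score keep the same signatures.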

Preprocessing & Pipelines

Use transformers (with fit / transform) and combine them with estimators in a Pipeline so your preprocessing and model are trained together.

Simple Pipeline example
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000))
])
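The pipeline itself behaves like any other estimator. A self-contained sketch of training it, with make_classification standing in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data, for illustration only
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# fit() scales the training data, then trains the classifier on the result;
# predict() applies the same fitted scaling before classifying new samples.
clf.fit(X, y)
print(clf.predict(X[:5]))
```

Because the scaler is fitted only inside fit(), statistics from test data never leak into training.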

Cross-Validation & Model Selection

Use cross‑validation to estimate model performance and GridSearchCV / RandomizedSearchCV to tune hyperparameters.

from sklearn.model_selection import cross_val_score, GridSearchCV

# X, y: full feature matrix and labels; clf is the pipeline defined above
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

# Pipeline step names prefix the parameters: "logreg__C" tunes C on the "logreg" step
param_grid = {"logreg__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(clf, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
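After fitting, the search object exposes the winning configuration and a refitted best model. A self-contained sketch (synthetic data stands in for X and y):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

param_grid = {"logreg__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(clf, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)  # best hyperparameter combination found
print(search.best_score_)   # mean cross-validated accuracy of that combination
# search.best_estimator_ is the pipeline refitted on all of X, y with best_params_
```

By default GridSearchCV refits the best configuration on the full data, so search itself can be used directly for predict().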

Practical Tips

  • Use ColumnTransformer to apply different preprocessing to numeric and categorical features.
  • Keep preprocessing and modeling inside a single pipeline to avoid data leakage.
  • Leverage model inspection tools such as permutation_importance and partial dependence plots for interpretability.
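A minimal sketch of the first two tips combined, using a hypothetical DataFrame whose column names ("age", "income", "city") are illustrative only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Hypothetical tabular data with numeric and categorical columns
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, 55_000, 80_000, 62_000],
    "city": ["NY", "SF", "NY", "LA"],
})
y = [0, 1, 1, 0]

# Route numeric columns to scaling, categorical columns to one-hot encoding
preprocess = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Preprocessing and model in one pipeline: cross-validation refits the
# scaler/encoder on each training fold, so no information leaks from held-out folds.
clf = Pipeline(steps=[
    ("prep", preprocess),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(df, y)
```

handle_unknown="ignore" keeps prediction from failing when an unseen category appears at inference time.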

Where scikit-learn Fits

  • Best suited for small to medium tabular datasets.
  • Often used together with pandas (data), NumPy (arrays) and joblib (model persistence).
  • Deep learning is typically handled by TensorFlow / PyTorch, while sklearn remains the standard for classical ML.
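Model persistence with joblib, as mentioned above, is a one-liner in each direction. A sketch (the filename is arbitrary):

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the fitted estimator to disk, then load it back
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)
restored = joblib.load(path)
print((restored.predict(X) == model.predict(X)).all())
```

Note that joblib pickles the estimator, so the same scikit-learn version should be used when loading as when saving.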