scikit-learn Guide
scikit‑learn (sklearn) is the go‑to Python library for classical machine learning, providing models, preprocessing tools, metrics, and utilities behind a consistent API.
The fit / predict Pattern
Every estimator in scikit‑learn follows a simple pattern:
model = SomeEstimator(**params)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
This consistency makes it easy to swap models and use tooling like pipelines and grid search.
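As a minimal sketch of the pattern above, here is the same three-step flow with a concrete estimator on synthetic data (the dataset and parameters are illustrative, not from the guide):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)   # 1. construct with params
model.fit(X_train, y_train)                 # 2. learn from training data
y_pred = model.predict(X_test)              # 3. predict on unseen data
print(model.score(X_test, y_test))          # mean accuracy on the test set
```

Because every estimator exposes the same methods, swapping `LogisticRegression` for, say, `RandomForestClassifier` changes only the first line.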
Preprocessing & Pipelines
Use transformers (with fit / transform) and combine them with estimators in a Pipeline so your preprocessing and model are trained together.
Simple Pipeline example
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
clf = Pipeline(steps=[
("scaler", StandardScaler()),
("logreg", LogisticRegression(max_iter=1000))
])
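The pipeline is itself an estimator, so it is used exactly like a single model. A short usage sketch, with synthetic data standing in for a real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Synthetic data for illustration
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf.fit(X, y)                       # fits the scaler, then the model on scaled data
proba = clf.predict_proba(X[:3])    # applies transform + predict in one call
```

Calling fit on the pipeline fits the scaler on the training data only, which is exactly what prevents preprocessing statistics from leaking out of the training set.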
Cross-Validation & Model Selection
Use cross‑validation to estimate model performance and GridSearchCV / RandomizedSearchCV to tune hyperparameters.
from sklearn.model_selection import cross_val_score, GridSearchCV
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
param_grid = {"logreg__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(clf, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
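After fitting, the search object exposes the winning configuration. A runnable sketch of the snippet above, with synthetic data filled in as an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration
X, y = make_classification(n_samples=150, random_state=0)
clf = Pipeline([("scaler", StandardScaler()),
                ("logreg", LogisticRegression(max_iter=1000))])

# The "step__param" syntax targets a parameter of a named pipeline step
param_grid = {"logreg__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(clf, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)   # the C value that won the grid search
print(search.best_score_)    # its mean cross-validated accuracy
```

Note that because the grid searches over the whole pipeline, scaling is re-fit inside every cross-validation fold, so the scores are leakage-free.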
Practical Tips
- Use ColumnTransformer to apply different preprocessing to numeric and categorical features.
- Keep preprocessing and modeling inside a single pipeline to avoid data leakage.
- Leverage model inspection tools such as permutation_importance and partial dependence plots for interpretability.
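To make the ColumnTransformer tip concrete, here is a small sketch on a hypothetical mixed-type DataFrame (column names and values are invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data for illustration
df = pd.DataFrame({
    "age":    [25, 32, 47, 51],
    "income": [40_000, 55_000, 82_000, 60_000],
    "city":   ["NY", "SF", "NY", "LA"],
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),  # scale numeric columns
    ("cat", OneHotEncoder(), ["city"]),            # one-hot encode the categorical column
])
Xt = pre.fit_transform(df)  # 2 scaled columns + 3 one-hot columns = 5 output columns
```

The resulting transformer drops straight into a Pipeline as the first step, so the per-column preprocessing is fit alongside the model.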
Where scikit-learn Fits
- Best suited for small to medium tabular datasets.
- Often used together with pandas (data), NumPy (arrays) and joblib (model persistence).
- Deep learning is typically handled by TensorFlow / PyTorch, while sklearn remains the standard for classical ML.
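Since the list above mentions joblib for model persistence, a minimal save/load sketch (file path and data are illustrative):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a model on synthetic data, for illustration
X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the fitted estimator to disk, then restore it
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)
loaded = joblib.load(path)
```

The restored estimator predicts identically to the original, which is what makes joblib the usual choice for deploying fitted sklearn models.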