Machine Learning Preprocessing
Clean Data First

Data Preprocessing & Feature Engineering

Turn raw, messy data into clean ML‑ready features using Python and scikit‑learn pipelines.

Why Preprocessing Matters

Most of the real effort in machine learning goes into preparing data. Good preprocessing often improves model performance more than switching algorithms does.

  • Real‑world data has missing values, outliers, and inconsistent formats.
  • Many models assume numeric, scaled features.
  • Leakage in preprocessing can lead to overly optimistic metrics.
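
The leakage point above is worth making concrete. A minimal sketch (with a made-up numeric array): fit preprocessing statistics on the training split only, then reuse them on the test split, so no information from test rows influences the transform.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for real features
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the train split only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics: no leakage
```

Calling `fit_transform` on the full dataset before splitting would let test-set statistics leak into training, which is exactly what inflates metrics.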

Handling Missing Values

Imputing missing numeric and categorical data
from sklearn.impute import SimpleImputer
import pandas as pd

# Assumes df, numeric_cols and categorical_cols are already defined.
# Median is robust to outliers; most_frequent suits categorical columns.
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

df_num = df[numeric_cols]
df_cat = df[categorical_cols]

# SimpleImputer returns NumPy arrays, so wrap them back into DataFrames,
# keeping the original index so rows still line up
df_num_imputed = pd.DataFrame(num_imputer.fit_transform(df_num),
                              columns=numeric_cols, index=df.index)
df_cat_imputed = pd.DataFrame(cat_imputer.fit_transform(df_cat),
                              columns=categorical_cols, index=df.index)
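
Before choosing an imputation strategy, it helps to see how much is actually missing. A small sketch with a synthetic frame (standing in for the `df` assumed above; the column names here are invented):

```python
import numpy as np
import pandas as pd

# Synthetic frame with one missing value per column
df = pd.DataFrame({
    "age": [25, np.nan, 40, 33],
    "income": [50_000, 62_000, np.nan, 58_000],
    "city": ["Oslo", None, "Oslo", "Bergen"],
})

# Count missing values per column; large gaps may call for dropping
# the column or adding a missing-indicator rather than imputing
missing = df.isna().sum()
print(missing)  # age: 1, income: 1, city: 1
```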

Encoding & Scaling

Many algorithms require numeric features. We encode categories and scale numeric values to comparable ranges.

ColumnTransformer with OneHotEncoder and StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Impute, then scale numeric columns to zero mean and unit variance
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # ignore categories seen only at prediction time instead of raising
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ]
)

clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000))
])
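
Once `clf` is built, it trains and predicts like any estimator. A self-contained sketch, using a small invented DataFrame since `df` and the column lists are not defined in this excerpt:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the df / column lists assumed above
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 29, 44, 36],
    "income": [40, 55, 80, 72, 60, 48, 75, 58],
    "city":   ["A", "B", "A", "B", "A", "B", "A", "B"],
    "target": [0, 0, 1, 1, 1, 0, 1, 0],
})
numeric_cols = ["age", "income"]
categorical_cols = ["city"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_cols),
    ("cat", categorical_transformer, categorical_cols),
])
clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric_cols + categorical_cols], df["target"],
    test_size=0.25, random_state=0)

# fit() runs imputation, scaling and encoding on the train split,
# then trains the model; predict() reapplies the fitted transforms
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```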

End-to-End Pipelines

Using pipelines ensures that exactly the same preprocessing, fitted only on training data, is applied during training and prediction, avoiding data leakage and manual mistakes.
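
This also makes cross-validation leakage-safe: when a whole pipeline is passed to `cross_val_score`, the preprocessing steps are refit inside each fold on that fold's training portion only. A minimal sketch with random synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends on the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The scaler is refit within each fold, so no fold's test rows
# ever influence the statistics used to transform its training rows
scores = cross_val_score(pipe, X, y, cv=5)
```

Scaling the data once up front and then cross-validating the bare model would leak each fold's held-out rows into the fitted statistics.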