Data Preprocessing & Feature Engineering
Turn raw, messy data into clean ML‑ready features using Python and scikit‑learn pipelines.
Why Preprocessing Matters
In many machine learning projects, preparing the data takes more effort than modeling. Good preprocessing can often improve model performance more than switching algorithms does.
- Real‑world data has missing values, outliers and inconsistent formats.
- Many models assume numeric, scaled features.
- Leakage in preprocessing (for example, fitting a scaler or imputer on data that includes the test set) leads to overly optimistic metrics.
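To make the leakage point concrete, here is a minimal sketch on a toy NumPy array (the data and variable names are illustrative assumptions, not part of any real dataset). The scaler learns its statistics from the training rows only; the "leaky" variant is included purely for comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (illustrative data only): 10 rows, 1 column.
X = np.arange(10, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Leak-free: the mean and standard deviation come from the training rows only ...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ... and are then reused, unchanged, on the held-out rows.
X_test_scaled = scaler.transform(X_test)

# Leaky variant, for comparison: the statistics also "see" the test rows.
leaky_scaler = StandardScaler().fit(X)

print("train-only mean:", scaler.mean_[0])
print("leaky mean:     ", leaky_scaler.mean_[0])
```

The two means differ, which shows that information from the test rows has entered the leaky scaler. On real data the same effect can make evaluation metrics look better than they should.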
Handling Missing Values
Imputing missing numeric and categorical data
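import pandas as pd
from sklearn.impute import SimpleImputer

# Assumes `df`, `numeric_cols` and `categorical_cols` are already defined.
num_imputer = SimpleImputer(strategy="median")         # median is robust to outliers
cat_imputer = SimpleImputer(strategy="most_frequent")  # fills gaps with the mode

df_num = df[numeric_cols]
df_cat = df[categorical_cols]

# fit_transform returns a NumPy array, so restore the column names and the index.
df_num_imputed = pd.DataFrame(num_imputer.fit_transform(df_num),
                              columns=numeric_cols, index=df.index)
df_cat_imputed = pd.DataFrame(cat_imputer.fit_transform(df_cat),
                              columns=categorical_cols, index=df.index)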
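As a small illustration of what the two strategies do, the sketch below runs them on a four-row toy frame. The column names `age` and `color` and their values are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (illustrative values only) with one gap per column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 100.0],
    "color": pd.Series(["red", np.nan, "red", "blue"], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

age_filled = num_imputer.fit_transform(df[["age"]])
color_filled = cat_imputer.fit_transform(df[["color"]])

# The missing age becomes the median of 20, 40 and 100, which is 40.
print(age_filled.ravel())
# The missing color becomes the most frequent label, "red".
print(color_filled.ravel())
```

The median is not pulled upward by the outlier at 100 the way the mean (about 53.3) would be, which is one reason to prefer it for skewed numeric columns.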
Encoding & Scaling
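Many algorithms require numeric features, and some (such as regularized linear models and distance-based methods) are sensitive to feature scale. We one-hot encode categories and standardize numeric values so that they have comparable scales.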
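Before combining the two steps, it can help to see each one in isolation. The sketch below uses tiny hand-made arrays (the `colors` and `incomes` values are assumptions for illustration) to show what one-hot encoding and standardization produce.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative inputs: one categorical column and one numeric column.
colors = np.array([["red"], ["blue"], ["red"]], dtype=object)
incomes = np.array([[30_000.0], [60_000.0], [90_000.0]])

# One-hot encoding: one output column per category, sorted alphabetically.
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(colors).toarray()
print(encoder.categories_[0])  # the learned categories: blue, red
print(onehot)                  # rows: red, blue, red

# A category not seen during fit is encoded as all zeros instead of raising.
unseen = encoder.transform(np.array([["green"]], dtype=object)).toarray()
print(unseen)

# Standardization: subtract the column mean, divide by the standard deviation.
scaled = StandardScaler().fit_transform(incomes)
print(scaled.ravel())  # centered on 0 with unit variance
```

Calling `.toarray()` converts the encoder's default sparse output to a dense array for printing; inside a pipeline that conversion is usually unnecessary.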
ColumnTransformer with OneHotEncoder and StandardScaler
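from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumes `numeric_cols` and `categorical_cols` are lists of column names.

# Numeric columns: fill gaps with the median, then standardize.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: fill gaps with the mode, then one-hot encode.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own transformer.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ]
)

# Preprocessing and model in one estimator, fitted and applied together.
clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])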
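A possible way to exercise such a pipeline end to end is sketched below. Everything about the data is an assumption for illustration: the columns `age`, `income` and `city`, the synthetic target, and the injected missing values are all made up. The pipeline itself mirrors the structure shown above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic, illustrative data: column names and values are made up.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": pd.Series(rng.choice(["A", "B", "C"], n), dtype=object),
})
y = (df["age"] + rng.normal(0, 5, n) > 40).astype(int)

# Knock out some values so the imputers have work to do.
df.loc[df.index[::10], "age"] = np.nan
df.loc[df.index[5::25], "city"] = np.nan

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

# Split first, then fit: preprocessing statistics come from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print("held-out accuracy:", accuracy)

# A new row with a missing age and a city never seen in training still works.
new_row = pd.DataFrame({
    "age": [np.nan],
    "income": [55_000.0],
    "city": pd.Series(["Z"], dtype=object),
})
prediction = clf.predict(new_row)
print("prediction for new row:", prediction[0])
```

Because imputation, scaling and encoding live inside `clf`, the raw frame can be passed straight to `fit`, `score` and `predict` without any manual preprocessing.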
End-to-End Pipelines
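A pipeline applies exactly the same preprocessing steps at training and prediction time. Because its transformers are fitted only on the data passed to fit, fitting the pipeline on the training split (or inside cross-validation) keeps test-set statistics out of preprocessing and removes error-prone manual steps.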
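One way to see this in practice is to pass the whole pipeline to cross-validation. In the sketch below, which assumes a synthetic dataset from `make_classification` and a deliberately small pipeline, `cross_val_score` fits a fresh copy of the pipeline on each training fold, so the scaler never sees the fold being scored.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each of the 5 folds refits the scaler and the model on that fold's training part.
scores = cross_val_score(pipe, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())
```

Scaling `X` once up front and then cross-validating only the model would reintroduce the leakage that the pipeline is meant to prevent.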