Machine Learning

ML Workflow & Preprocessing

End-to-end ML workflow, data cleaning, feature engineering, and preprocessing techniques.

Machine Learning Workflow

High-level Stages

1. Problem Framing — what business question are we answering? What is the prediction target?
2. Data Collection & Understanding — gather data, explore distributions, spot leaks and biases.
3. Data Preprocessing & Feature Engineering — clean, transform and create features suitable for modeling.
4. Model Training & Selection — try different algorithms and tune hyperparameters.
5. Evaluation & Validation — measure generalization performance using proper metrics and CV.
6. Deployment & Monitoring — serve predictions in production and monitor for drift and degradation.

1. Problem Framing

Good ML starts with a clear problem statement. Examples:

  • “Predict probability of churn in the next 30 days.”
  • “Forecast demand for each product next week.”
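A problem statement like the churn example only becomes usable once it is turned into a concrete target column. The sketch below shows one possible way to do that; the `events` table, the cutoff date and the definition "no activity in the window means churn" are all illustrative assumptions, not something fixed by the statement itself.

```python
import pandas as pd

# Hypothetical activity log: one row per customer event.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "event_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-01-02", "2024-01-25"]
    ),
})

cutoff = pd.Timestamp("2024-01-31")  # features may only use data up to this date
horizon = pd.Timedelta(days=30)      # prediction window from the problem statement

# Customers that were already known at the cutoff date.
known = events.loc[events["event_date"] <= cutoff, "customer_id"].unique().tolist()

# Customers with at least one event inside the 30-day window after the cutoff.
in_window = events[
    (events["event_date"] > cutoff) & (events["event_date"] <= cutoff + horizon)
]
active = set(in_window["customer_id"].tolist())

# Target: 1 = churned (no activity in the window), 0 = retained.
labels = pd.Series({cid: int(cid not in active) for cid in known}, name="churned")
print(labels)
```

Keeping the cutoff explicit makes it harder to accidentally build features from data recorded after the prediction date.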

2. Data Collection & Understanding

Identify data sources (databases, logs, APIs), then use EDA (exploratory data analysis) to understand quality and patterns.
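A first EDA pass can be as small as a few pandas calls. The snippet below is a minimal sketch on a made-up table (the column names and values are hypothetical) that checks missing-value rates, numeric summaries and category frequencies.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table used only for illustration.
df = pd.DataFrame({
    "age": [34, 45, np.nan, 23, 51],
    "plan": ["basic", "pro", "basic", None, "pro"],
    "monthly_spend": [20.0, 55.5, 19.0, 21.0, 900.0],
})

# Share of missing values per column.
missing_share = df.isna().mean()

# Count, mean, std, min, quartiles and max for the numeric columns;
# a max far above the 75% quartile hints at outliers.
numeric_summary = df.describe()

# Category frequencies, counting missing entries as their own group.
plan_counts = df["plan"].value_counts(dropna=False)

print(missing_share)
print(numeric_summary)
print(plan_counts)
```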

3. Data Preprocessing & Feature Engineering

This stage is covered in more detail in the Data Preprocessing & Feature Engineering section of this page and typically includes:

  • Handling missing values and outliers.
  • Encoding categorical variables.
  • Scaling and normalizing numeric features.
  • Creating domain‑specific features.
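As an illustration of the last item, domain-specific features are often simple derived columns. The sketch below assumes a hypothetical `orders` table; the chosen features (hour of day, weekend flag, average item price) are examples only, and what is useful depends on the problem.

```python
import pandas as pd

# Hypothetical order data used only for illustration.
orders = pd.DataFrame({
    "order_time": pd.to_datetime(
        ["2024-03-01 09:15", "2024-03-02 22:40", "2024-03-03 13:05"]
    ),
    "total_amount": [120.0, 45.0, 300.0],
    "n_items": [4, 1, 6],
})

features = orders.assign(
    # Time-of-day and weekday effects are common in demand data.
    hour=orders["order_time"].dt.hour,
    is_weekend=orders["order_time"].dt.dayofweek >= 5,  # Monday=0 ... Sunday=6
    # A ratio feature: average price per item in the order.
    avg_item_price=orders["total_amount"] / orders["n_items"],
)
print(features)
```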

4. Model Training & Selection

We choose algorithms based on problem type, data size and constraints, then tune hyperparameters using validation data or cross‑validation.
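One common way to tune hyperparameters with cross-validation in scikit-learn is `GridSearchCV`. The sketch below uses synthetic data and a small, arbitrary grid over the regularization strength `C`; both are assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hold out a test set that is never used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 5-fold cross-validation over a small grid of C values.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("mean CV ROC AUC:", search.best_score_)

# The refit best model is scored once on the untouched test set.
test_score = search.score(X_test, y_test)
print("test ROC AUC:", test_score)
```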

6. Deployment & Monitoring

Models only create value when they are integrated into products or decision processes:

  • Expose models through APIs or batch jobs.
  • Monitor latency, error rates and prediction quality.
  • Detect data drift and retrain when needed.
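Drift detection can take many forms. As one possible sketch, the function below computes a population stability index (PSI) between a reference sample of a feature (for example, training data) and a recent production sample. The function name, the bin count and the synthetic data are assumptions for illustration; any alert threshold would need to be tuned per feature.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10):
    """PSI between two 1-D samples, using quantile bins of the reference."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior bin edges taken from reference quantiles; the outer bins are open-ended.
    interior = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])

    # Bin index in 0..n_bins-1 for every value, then counts per bin.
    ref_counts = np.bincount(np.digitize(reference, interior), minlength=n_bins)
    cur_counts = np.bincount(np.digitize(current, interior), minlength=n_bins)

    # Clip proportions away from zero so the logarithm stays finite.
    eps = 1e-6
    ref_pct = np.clip(ref_counts / reference.size, eps, None)
    cur_pct = np.clip(cur_counts / current.size, eps, None)

    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)  # stands in for training data
same = rng.normal(0.0, 1.0, size=5000)       # production data, no drift
shifted = rng.normal(1.0, 1.0, size=5000)    # production data, mean has moved

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(psi_same, psi_shifted)
```

A PSI near zero suggests the two samples are distributed similarly, while larger values indicate a shift that may justify investigation or retraining.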

Data Preprocessing & Feature Engineering

Why Preprocessing Matters

In practice, much of the effort in a Machine Learning project goes into preparing data. Good preprocessing can often improve model performance more than switching algorithms.

  • Real‑world data has missing values, outliers and inconsistent formats.
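  • Many models require numeric inputs, and some (for example linear models, k‑nearest neighbors and neural networks) are sensitive to feature scale.
  • Leakage in preprocessing, such as fitting a scaler or imputer on the full dataset before splitting, can lead to overly optimistic metrics.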
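To make the leakage point concrete, the sketch below splits the data first and then learns the scaling statistics from the training rows only. The synthetic data and the split ratio are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split BEFORE fitting any preprocessing step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std learned from training rows only
X_test_scaled = scaler.transform(X_test)        # test rows reuse the training statistics

print("training means used for scaling:", scaler.mean_)
```

Calling `fit_transform` on the full `X` before splitting would let test-set statistics influence the training features, which is the kind of leakage described above.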

Handling Missing Values

Imputing missing numeric and categorical data
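import pandas as pd
from sklearn.impute import SimpleImputer

# Assumes `df` is a DataFrame and `numeric_cols` / `categorical_cols`
# are lists of its column names.
num_imputer = SimpleImputer(strategy="median")         # median is robust to outliers
cat_imputer = SimpleImputer(strategy="most_frequent")  # fill with the most common category

df_num = df[numeric_cols]
df_cat = df[categorical_cols]

# fit_transform returns a NumPy array, so restore the column names and
# the original index to keep rows aligned with `df`.
# With a train/test split, call fit_transform on the training rows only
# and transform on the test rows.
df_num_imputed = pd.DataFrame(num_imputer.fit_transform(df_num),
                              columns=numeric_cols, index=df.index)
df_cat_imputed = pd.DataFrame(cat_imputer.fit_transform(df_cat),
                              columns=categorical_cols, index=df.index)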
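The following worked example, on small made-up tables, shows what a fitted imputer stores and how the same statistics are reused on new rows. The `train` and `new` frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training data with gaps.
train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "income": [1000.0, np.nan, 3000.0, 5000.0],
})
# Hypothetical rows seen later, at prediction time.
new = pd.DataFrame({
    "age": [np.nan, 25.0],
    "income": [2000.0, np.nan],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)  # learns one median per column, ignoring missing entries

# Medians learned from the training data: age -> 30.0, income -> 3000.0
print(imputer.statistics_)

# New rows are filled with the training medians, not with their own statistics.
new_imputed = pd.DataFrame(
    imputer.transform(new), columns=new.columns, index=new.index
)
print(new_imputed)
```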

Encoding & Scaling

Many algorithms require numeric features. We encode categories and scale numeric values to comparable ranges.

ColumnTransformer with OneHotEncoder and StandardScaler
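from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumes `numeric_cols` and `categorical_cols` are lists of column names.

# Numeric columns: fill gaps with the median, then standardize.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Route each group of columns to its own transformer.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ]
)

# Preprocessing and model combined into a single estimator.
clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000))
])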
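To show how such a pipeline might be used, the sketch below rebuilds the same structure on a tiny made-up table, fits it, and scores two new rows. The data, the column names and the `churned` target are hypothetical; the second new row has a missing age and the first has a plan value that never appeared in training.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training table with missing values.
df = pd.DataFrame({
    "age": [25, 40, np.nan, 33, 52, 29, 61, 45],
    "monthly_spend": [20.0, 55.0, 18.0, np.nan, 80.0, 22.0, 95.0, 60.0],
    "plan": ["basic", "pro", "basic", "basic", "pro", np.nan, "pro", "pro"],
    "churned": [1, 0, 1, 1, 0, 1, 0, 0],
})
numeric_cols = ["age", "monthly_spend"]
categorical_cols = ["plan"]

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

# One call fits the imputers, scaler, encoder and model on the training data.
clf.fit(df[numeric_cols + categorical_cols], df["churned"])

# New rows go through exactly the same fitted preprocessing steps.
new_customers = pd.DataFrame({
    "age": [37, np.nan],
    "monthly_spend": [30.0, 70.0],
    "plan": ["enterprise", "pro"],  # "enterprise" was not seen during fit
})
proba = clf.predict_proba(new_customers)[:, 1]  # estimated probability of churn
print(proba)
```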

End-to-End Pipelines

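A pipeline bundles preprocessing and model into one estimator, so the exact same fitted transformations are applied during training and prediction. When a pipeline is passed to cross‑validation or a hyperparameter search, its preprocessing steps are refit on each training fold only, which helps avoid data leakage and manual mistakes.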
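A minimal sketch of this idea, assuming synthetic data and a two-step pipeline, passes the whole pipeline to `cross_val_score` so that the scaler never sees the validation fold while it is being fitted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustration only).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# For each of the 5 folds, the scaler and model are fit on the training
# part and evaluated on the held-out part.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```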