ML Workflow & Preprocessing
End-to-end ML workflow, data cleaning, feature engineering, and preprocessing techniques.
Machine Learning Workflow
High-level Stages
1. Problem Framing
Good ML starts with a clear problem statement. Examples:
- “Predict probability of churn in the next 30 days.”
- “Forecast demand for each product next week.”
2. Data Collection & Understanding
Identify data sources (databases, logs, APIs), then use EDA (exploratory data analysis) to understand quality and patterns.
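A first EDA pass can be as small as the sketch below. It uses pandas on a made-up churn table; the column names and values are assumptions chosen only for illustration, not data from any real source.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "tenure_months": [1, 12, 24, np.nan, 36, 48],
    "plan": ["basic", "pro", "basic", None, "pro", "basic"],
    "churned": [1, 0, 0, 1, 0, 0],
})

# Share of missing values per column (data quality).
missing_share = df.isna().mean()

# Summary statistics for numeric columns (ranges, spread).
numeric_summary = df.describe()

# Frequency of each category, counting missing entries as their own group.
plan_counts = df["plan"].value_counts(dropna=False)

# Class balance of the target.
target_rate = df["churned"].mean()

print(missing_share, numeric_summary, plan_counts, target_rate, sep="\n\n")
```

Checks like these usually reveal which columns need imputation and whether the target is imbalanced before any modeling starts.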
3. Data Preprocessing & Feature Engineering
This stage is covered in more detail under Data Preprocessing & Feature Engineering and typically includes:
- Handling missing values and outliers.
- Encoding categorical variables.
- Scaling and normalizing numeric features.
- Creating domain‑specific features.
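As a rough illustration of the outlier and domain-feature steps listed above, the sketch below clips extreme values with the common 1.5 × IQR rule and derives two features. The orders table, its column names and the choice of the IQR rule are all assumptions made for the example; other outlier strategies may suit a given dataset better.

```python
import pandas as pd

# Hypothetical orders table; names and values are illustrative only.
orders = pd.DataFrame({
    "order_value": [20.0, 25.0, 22.0, 30.0, 5000.0],
    "n_items": [2, 5, 2, 3, 4],
    "order_date": pd.to_datetime(
        ["2024-01-01", "2024-01-06", "2024-01-07", "2024-01-10", "2024-01-13"]
    ),
})

# Outlier handling: clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# In a real project the bounds would be computed on the training split only.
q1, q3 = orders["order_value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
orders["order_value_clipped"] = orders["order_value"].clip(lower=lower, upper=upper)

# Domain-specific features: average value per item and a weekend flag.
orders["value_per_item"] = orders["order_value_clipped"] / orders["n_items"]
orders["is_weekend"] = orders["order_date"].dt.dayofweek >= 5  # 5 = Sat, 6 = Sun

print(orders)
```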
4. Model Training & Selection
We choose algorithms based on problem type, data size and constraints, then tune hyperparameters using validation data or cross‑validation.
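One possible way to tune a hyperparameter with cross-validation is scikit-learn's GridSearchCV, sketched below. The synthetic dataset, the logistic regression model and the grid of C values are assumptions picked for illustration; the held-out test split is only scored once, after the search is finished.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Keep a test set aside; it plays no part in hyperparameter tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate values for the regularization strength C (an assumed grid).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training data, scored by ROC AUC.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# The best estimator is refit on all training data; score it once on the test set.
test_score = search.score(X_test, y_test)
print(search.best_params_, round(search.best_score_, 3), round(test_score, 3))
```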
5. Deployment & Monitoring
Models only create value when they are integrated into products or decision processes:
- Expose models through APIs or batch jobs.
- Monitor latency, error rates and prediction quality.
- Detect data drift and retrain when needed.
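Drift detection can take many forms. One simple option, sketched here, compares the distribution of a single numeric feature at training time with recent production values using a two-sample Kolmogorov-Smirnov test from SciPy. The helper name `drifted`, the simulated data and the 0.01 threshold are assumptions for illustration, not a recommended default.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Simulated feature values: training-time reference vs. two "live" samples.
reference = rng.normal(loc=0.0, scale=1.0, size=2000)
live_same = rng.normal(loc=0.0, scale=1.0, size=2000)      # same distribution
live_shifted = rng.normal(loc=1.0, scale=1.0, size=2000)   # mean has moved


def drifted(reference, live, alpha=0.01):
    """Hypothetical helper: flag drift when the KS test rejects
    'both samples come from the same distribution' at level alpha."""
    result = ks_2samp(reference, live)
    return bool(result.pvalue < alpha)


print("same distribution flagged:", drifted(reference, live_same))
print("shifted distribution flagged:", drifted(reference, live_shifted))
```

A flag like this would typically trigger investigation or retraining rather than an automatic action, since a statistical test on one feature says nothing about prediction quality on its own.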
Data Preprocessing & Feature Engineering
Why Preprocessing Matters
Much of the practical effort in machine learning goes into preparing data. Good preprocessing often improves model performance more than switching algorithms.
- Real‑world data has missing values, outliers and inconsistent formats.
- Many models assume numeric, scaled features.
- Leakage in preprocessing (for example, fitting a scaler or imputer on the full dataset before splitting it) can lead to overly optimistic metrics.
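To make the leakage point concrete, the sketch below fits a StandardScaler on the training split only and reuses those statistics for the test split. The random data is an assumption for illustration; the pattern (fit on train, transform on test) is the part that matters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # illustrative data

# Split first, so the test rows never influence preprocessing statistics.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

# The leaky alternative would be StandardScaler().fit(X) on all rows,
# which lets information from the test set shape the training features.
print(X_train_scaled.mean(axis=0).round(6))  # approximately zero by construction
print(X_test_scaled.mean(axis=0).round(3))   # close to zero, but not exactly
```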
Handling Missing Values
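import pandas as pd
from sklearn.impute import SimpleImputer

# df, numeric_cols and categorical_cols are assumed to be defined earlier.
# In a real project, fit the imputers on the training split only.
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

df_num = df[numeric_cols]
df_cat = df[categorical_cols]

# fit_transform returns a NumPy array; passing index=df.index keeps the
# imputed frames aligned with the original rows.
df_num_imputed = pd.DataFrame(num_imputer.fit_transform(df_num),
                              columns=numeric_cols, index=df.index)
df_cat_imputed = pd.DataFrame(cat_imputer.fit_transform(df_cat),
                              columns=categorical_cols, index=df.index)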
Encoding & Scaling
Many algorithms require numeric features. We encode categories and scale numeric values to comparable ranges.
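from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric_cols and categorical_cols are assumed to be lists of column names.

# Numeric columns: fill missing values with the median, then standardize.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: fill missing values with the most frequent category,
# then one-hot encode; categories unseen during fit are encoded as all zeros.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Apply each transformer to its own group of columns.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ]
)

# Chain preprocessing and the model into a single estimator.
clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])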
End-to-End Pipelines
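Using pipelines ensures the exact same preprocessing is applied during training and prediction, which reduces manual mistakes. Because a pipeline's transformers are fit only on the data passed to fit, evaluating the whole pipeline inside cross-validation also helps avoid data leakage.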
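The following self-contained sketch shows one way such a pipeline might be used end to end: it is cross-validated as a single estimator, fit, and then applied to a new row. The synthetic churn-style DataFrame, its column names and the target rule are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, size=n).astype(float),
    "monthly_fee": rng.normal(50.0, 10.0, size=n),
    "plan": rng.choice(["basic", "pro", "team"], size=n),
})
y = (df["tenure_months"] < 12).astype(int)  # made-up target rule

# Inject some missing values so the imputers have work to do.
df.loc[df.index[::10], "monthly_fee"] = np.nan
df.loc[df.index[::15], "plan"] = np.nan

numeric_cols = ["tenure_months", "monthly_fee"]
categorical_cols = ["plan"]

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

# Preprocessing is refit inside every fold, so validation rows never leak
# into the imputer, scaler or encoder statistics.
scores = cross_val_score(clf, df, y, cv=5, scoring="accuracy")
print("CV accuracy per fold:", scores.round(3))

# Fit on all data, then predict on a new row with a missing number and a
# category that was never seen during training.
clf.fit(df, y)
new_rows = pd.DataFrame({
    "tenure_months": [3.0],
    "monthly_fee": [np.nan],
    "plan": ["enterprise"],
})
proba = clf.predict_proba(new_rows)
print("Predicted class probabilities:", proba.round(3))
```

The new row goes through the same imputation, scaling and encoding that were learned during fit, with no preprocessing code repeated at prediction time.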