Data Cleaning & Wrangling
Handle missing values, outliers, duplicates, and reshape data for analysis.
Cleaning Real‑World Data Before Modeling
Handling Missing Values
Missing values can bias your analysis if not handled properly. Common strategies include removing rows, filling with simple statistics (mean/median/mode), or using model‑based imputations.
The right strategy depends on why the data is missing (Missing Completely At Random, Missing At Random or Missing Not At Random). For example, dropping rows is usually safe when only a small percentage of values are missing at random, while more advanced techniques like KNN or model‑based imputation are preferred when missingness has structure.
import pandas as pd
import numpy as np
df = pd.DataFrame({
"age": [25, np.nan, 30, 35, np.nan],
"salary": [50000, 52000, np.nan, 60000, 58000]
})
print("Missing per column:")
print(df.isna().sum())
# Simple imputations
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].mean())
print("After imputation:")
print(df)
Detecting & Treating Outliers
Outliers can heavily influence means and linear models. Use the Interquartile Range (IQR) or z‑scores to detect unusual points, then decide whether to cap, transform or remove them based on domain knowledge.
import numpy as np
import pandas as pd
df = pd.DataFrame({"income": [30, 32, 35, 36, 37, 500]})
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = df[(df["income"] < lower) | (df["income"] > upper)]
print("Outliers:\n", outliers)
Removing Duplicates & Fixing Formats
Duplicated rows and inconsistent formats (dates, categories, casing) are another common source of bad data. Pandas provides helpers for both.
df = pd.DataFrame({
"email": ["a@example.com", "B@Example.com", "a@example.com"],
"signup_date": ["2024-01-01", "01/02/2024", "2024-01-01"]
})
# Normalize email casing
df["email"] = df["email"].str.lower()
# Parse dates
df["signup_date"] = pd.to_datetime(df["signup_date"], dayfirst=False, errors="coerce")
# Drop duplicates
df = df.drop_duplicates(subset=["email"])
print(df)
Reshaping & Transforming Data with Pandas
Melt & Pivot
Melt converts wide data (many columns) into long format (row per observation), while pivot does the opposite. Many visualization and modeling tasks prefer long format.
Thinking in terms of tidy data, each variable should have its own column and each observation its own row. Melt and pivot are the core operations that let you move between human‑friendly report layouts and machine‑friendly tidy tables without losing information.
import pandas as pd
df = pd.DataFrame({
"country": ["US", "US", "IN", "IN"],
"year": [2023, 2024, 2023, 2024],
"sales_a": [10, 12, 8, 9],
"sales_b": [5, 7, 4, 6]
})
long = df.melt(
id_vars=["country", "year"],
value_vars=["sales_a", "sales_b"],
var_name="product",
value_name="sales"
)
wide = long.pivot(
index=["country", "year"],
columns="product",
values="sales"
).reset_index()
Merging Multiple Tables
pd.merge implements SQL‑style joins (inner, left, right, outer). It is essential
when combining fact tables with dimension tables like users, products or time.
users = pd.DataFrame({
"user_id": [1, 2, 3],
"country": ["US", "IN", "US"]
})
orders = pd.DataFrame({
"order_id": [101, 102, 103],
"user_id": [1, 1, 3],
"amount": [50, 30, 80]
})
joined = pd.merge(
orders,
users,
on="user_id",
how="left"
)
print(joined)