Data Science

Data Cleaning & Wrangling

Handle missing values, outliers, duplicates, and reshape data for analysis.

Cleaning Real‑World Data Before Modeling

Handling Missing Values

Missing values can bias your analysis if not handled properly. Common strategies include removing rows, filling with simple statistics (mean/median/mode), or using model‑based imputations.

The right strategy depends on why the data is missing (Missing Completely At Random, Missing At Random or Missing Not At Random). For example, dropping rows is usually safe when only a small percentage of values are missing at random, while more advanced techniques like KNN or model‑based imputation are preferred when missingness has structure.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 30, 35, np.nan],
    "salary": [50000, 52000, np.nan, 60000, 58000]
})

print("Missing per column:")
print(df.isna().sum())

# Simple imputations
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].mean())

print("After imputation:")
print(df)

Detecting & Treating Outliers

Outliers can heavily influence means and linear models. Use the Interquartile Range (IQR) or z‑scores to detect unusual points, then decide whether to cap, transform or remove them based on domain knowledge.

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30, 32, 35, 36, 37, 500]})

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = df[(df["income"] < lower) | (df["income"] > upper)]
print("Outliers:\n", outliers)

Removing Duplicates & Fixing Formats

Duplicated rows and inconsistent formats (dates, categories, casing) are another common source of bad data. Pandas provides helpers for both.

df = pd.DataFrame({
    "email": ["a@example.com", "B@Example.com", "a@example.com"],
    "signup_date": ["2024-01-01", "01/02/2024", "2024-01-01"]
})

# Normalize email casing
df["email"] = df["email"].str.lower()

# Parse dates
df["signup_date"] = pd.to_datetime(df["signup_date"], dayfirst=False, errors="coerce")

# Drop duplicates
df = df.drop_duplicates(subset=["email"])
print(df)

Reshaping & Transforming Data with Pandas

Melt & Pivot

Melt converts wide data (many columns) into long format (row per observation), while pivot does the opposite. Many visualization and modeling tasks prefer long format.

Thinking in terms of tidy data, each variable should have its own column and each observation its own row. Melt and pivot are the core operations that let you move between human‑friendly report layouts and machine‑friendly tidy tables without losing information.

import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "IN", "IN"],
    "year": [2023, 2024, 2023, 2024],
    "sales_a": [10, 12, 8, 9],
    "sales_b": [5, 7, 4, 6]
})

long = df.melt(
    id_vars=["country", "year"],
    value_vars=["sales_a", "sales_b"],
    var_name="product",
    value_name="sales"
)

wide = long.pivot(
    index=["country", "year"],
    columns="product",
    values="sales"
).reset_index()

Merging Multiple Tables

pd.merge implements SQL‑style joins (inner, left, right, outer). It is essential when combining fact tables with dimension tables like users, products or time.

users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "IN", "US"]
})

orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "user_id": [1, 1, 3],
    "amount": [50, 30, 80]
})

joined = pd.merge(
    orders,
    users,
    on="user_id",
    how="left"
)

print(joined)

Previous Next