Probability Theory for Data Science & Machine Learning
Probability quantifies uncertainty. Many machine learning algorithms either use probability directly (Naive Bayes, Bayesian models) or have a natural probabilistic interpretation; logistic regression, for example, outputs class probabilities.
Random Variables & Events
A random variable assigns a value to each outcome of a random process, so its value is uncertain until it is observed. We describe how likely each value is with a probability distribution. Examples:
- Number of clicks on an ad (discrete).
- Height of a person (continuous).
- Class label in classification (categorical).
import numpy as np
# Simulate 10,000 coin flips (0 = tails, 1 = heads)
np.random.seed(42)
flips = np.random.binomial(n=1, p=0.5, size=10_000)
prob_heads = flips.mean()
prob_tails = 1 - prob_heads
print("P(heads) ≈", round(prob_heads, 3))
print("P(tails) ≈", round(prob_tails, 3))
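The three kinds of random variable listed earlier can each be sampled in a few lines. The sketch below is illustrative only: the distributions chosen (Poisson for clicks, Normal for heights), their parameters, and the class names are assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducibility

# Discrete: ad clicks per day, modelled here as Poisson with an assumed rate of 4
clicks = rng.poisson(lam=4, size=5)

# Continuous: heights in cm, modelled here as Normal(170, 10); both values assumed
heights = rng.normal(loc=170, scale=10, size=5)

# Categorical: class labels drawn with assumed probabilities
labels = rng.choice(["cat", "dog", "bird"], size=5, p=[0.5, 0.3, 0.2])

print("clicks:", clicks)
print("heights:", np.round(heights, 1))
print("labels:", labels)
```

Discrete samples come back as integers, continuous ones as floats, and categorical ones as labels with no numeric ordering.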
Common Probability Distributions
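A few distributions appear repeatedly in data science:
- Bernoulli / Binomial: a single binary outcome, and the number of successes in n independent trials.
- Normal (Gaussian): continuous, symmetric, “bell‑shaped” data.
- Poisson: number of events in a fixed interval (for example, events per minute).
- Exponential: waiting time between events that occur at a constant average rate.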
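import numpy as np
np.random.seed(0)  # fix the seed so the samples are reproducible
# Normal distribution: mean 0, standard deviation 1
normal_data = np.random.normal(loc=0, scale=1, size=1000)
# Binomial distribution: successes in 10 trials, each with probability 0.3
binom_data = np.random.binomial(n=10, p=0.3, size=1000)
# Poisson distribution: event counts with an average of 3 per interval
poisson_data = np.random.poisson(lam=3, size=1000)
# Sample statistics should land close to the theoretical values
print("Normal mean/std:", round(normal_data.mean(), 3), round(normal_data.std(), 3))  # theory: 0 and 1
print("Binomial mean:", round(binom_data.mean(), 3))  # theory: n * p = 3
print("Poisson mean:", round(poisson_data.mean(), 3))  # theory: lam = 3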
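The Exponential distribution from the list is not sampled in the code above. As a minimal sketch, assuming events arrive at a hypothetical rate of 3 per minute (matching the Poisson example), waiting times can be drawn with NumPy. Note that NumPy's `exponential` takes the scale, which is 1/rate, rather than the rate itself. The rate and sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator for reproducibility

rate = 3.0  # assumed average number of events per minute

# NumPy parameterises the exponential by its scale (the mean), i.e. 1 / rate
wait_times = rng.exponential(scale=1 / rate, size=10_000)

# Theory: both the mean and the standard deviation equal 1 / rate
print("Exponential mean:", round(wait_times.mean(), 3))
print("Exponential std:", round(wait_times.std(), 3))
print("Theoretical value:", round(1 / rate, 3))
```

Both sample statistics should be close to 1/3 of a minute, since the mean and standard deviation of an exponential distribution are equal.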
Conditional Probability & Bayes' Theorem
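Conditional probability is the probability of an event \(A\) given that another event \(B\) has occurred, written \(P(A \mid B)\) and defined as \(P(A \cap B) / P(B)\) when \(P(B) > 0\). Bayes' theorem connects the prior \(P(A)\) to the posterior \(P(A \mid B)\) through the likelihood \(P(B \mid A)\):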
\[ P(A \mid B) = \frac{P(B \mid A) \; P(A)}{P(B)} \]
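To check the formula on a case small enough to enumerate, the sketch below counts outcomes for two fair dice. The events A (first die shows 6) and B (sum is at least 10) are hypothetical choices for illustration; exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate over outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def is_a(o):
    # A: the first die shows 6
    return o[0] == 6

def is_b(o):
    # B: the two dice sum to at least 10
    return o[0] + o[1] >= 10

p_a = prob(is_a)
p_b = prob(is_b)
p_a_and_b = prob(lambda o: is_a(o) and is_b(o))

# Conditional probabilities from the definition P(X | Y) = P(X and Y) / P(Y)
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

# Right-hand side of Bayes' theorem
bayes_rhs = p_b_given_a * p_a / p_b

print("P(A | B) by counting:", p_a_given_b)
print("P(A | B) by Bayes:   ", bayes_rhs)
```

Both lines print 1/2: of the six outcomes with a sum of at least 10, three have a 6 on the first die.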
# Simple Bayes theorem example in code
P_disease = 0.01  # 1% have the disease (prior)
P_positive_given_disease = 0.99  # sensitivity: P(positive | disease)
P_positive_given_healthy = 0.05  # false positive rate: P(positive | healthy)
P_healthy = 1 - P_disease
# Total probability of a positive test (law of total probability)
P_positive = (P_positive_given_disease * P_disease +
              P_positive_given_healthy * P_healthy)
# Posterior: P(disease | positive)
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive
print("P(positive):", round(P_positive, 3))
print("P(disease | positive):", round(P_disease_given_positive, 3))
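The posterior of roughly 0.167 is low because the disease is rare, so false positives from the large healthy group outnumber true positives. As a sketch of how further evidence shifts the result, the hypothetical helper `posterior` below wraps the same calculation and applies it twice, using the first posterior as the prior for a second positive test. This assumes the two test results are conditionally independent given disease status, which is an illustrative simplification.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """Return P(disease | positive) via Bayes' theorem."""
    # Law of total probability: P(positive)
    p_positive = (p_pos_given_disease * prior
                  + p_pos_given_healthy * (1 - prior))
    return p_pos_given_disease * prior / p_positive

# Same numbers as the example above
after_first = posterior(0.01, 0.99, 0.05)

# Second positive test: the first posterior becomes the new prior
# (assumes the tests are conditionally independent given disease status)
after_second = posterior(after_first, 0.99, 0.05)

print("After one positive test:", round(after_first, 3))
print("After two positive tests:", round(after_second, 3))
```

Under these assumptions the probability of disease rises from about 0.167 after one positive test to about 0.798 after two.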