Mathematics & Statistics Foundation
Linear algebra, calculus, probability, descriptive and inferential statistics for data science.
Linear Algebra for Data Science & Machine Learning
Vectors
A vector is an ordered list of numbers. Geometrically, a vector represents a point or direction in space. In Data Science, a feature vector contains all features for one observation (row).
- Feature vector of one sample, e.g. \([height, weight, age]\) (three features, stored as a 1D array).
- n‑dimensional vectors: embeddings in NLP, image pixels, etc.
- Common operations: addition, scaling, dot product, norm (length).
import numpy as np
# 1D array and explicit column vector
v = np.array([1, 2, 3]) # 1D array (shape: (3,)); NumPy treats it as neither row nor column
v_col = v.reshape(-1, 1) # column vector (shape: (3, 1))
# Vector addition and scaling
u = np.array([4, 5, 6])
sum_vec = v + u # [5, 7, 9]
scaled = 2 * v # [2, 4, 6]
# Dot product and norm
dot = np.dot(v, u) # 1*4 + 2*5 + 3*6 = 32
norm = np.linalg.norm(v) # length of vector
print("v:", v)
print("u:", u)
print("dot(v, u):", dot)
print("||v||:", norm)
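The dot product and the norm combine into the cosine of the angle between two vectors, which is one common way to compare embeddings. The sketch below reuses the same v and u; the names cos_sim and v_unit are illustrative, not part of the original example.

```python
import numpy as np

v = np.array([1, 2, 3])
u = np.array([4, 5, 6])

# Cosine of the angle between v and u: dot product divided by both lengths
cos_sim = np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u))

# Unit vector: same direction as v, rescaled to length 1
v_unit = v / np.linalg.norm(v)

print("cosine similarity:", round(cos_sim, 4))
print("||v_unit||:", np.linalg.norm(v_unit))
```

A value close to 1 means the two vectors point in nearly the same direction, regardless of their lengths.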
Matrices & Matrix Multiplication
A matrix is a 2D array of numbers. In Data Science, a dataset is typically represented as a matrix \(X \in \mathbb{R}^{n \times d}\) where \(n\) is the number of rows (samples) and \(d\) is the number of features.
- Each row: one data point.
- Each column: one feature.
- Matrix multiplication is used heavily in neural networks and linear models.
# Design matrix X and parameter vector w
X = np.array([
    [1, 2],
    [3, 4],
    [5, 6]
]) # shape: (3, 2)
w = np.array([0.1, 0.5]) # shape: (2,)
b = 0.3 # bias term
# Linear model: y = Xw + b
y = X @ w + b # "@" is matrix multiplication in Python
print("X shape:", X.shape)
print("w shape:", w.shape)
print("Predictions y:", y)
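To see what the matrix product does, it can help to compute each prediction by hand. The sketch below, using the same X, w and b, checks that X @ w is one dot product per row; y_rows and y_matmul are names chosen for illustration.

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2): 3 samples, 2 features
w = np.array([0.1, 0.5])                 # shape (2,): one weight per feature
b = 0.3

# Each prediction is the dot product of one row of X with w, plus the bias
y_rows = np.array([np.dot(row, w) + b for row in X])
y_matmul = X @ w + b

print(np.allclose(y_rows, y_matmul))  # True
# Inner dimensions must match: (3, 2) @ (2,) -> (3,)
print(y_matmul.shape)  # (3,)
```

If the number of columns in X did not match the length of w, the product would raise a shape error, which is a frequent source of bugs in model code.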
Eigenvalues, Eigenvectors & SVD (Intuition)
Many dimensionality reduction techniques such as PCA rely on eigenvalues and singular values.
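- Eigenvectors of the covariance matrix give the directions (principal axes) along which the data varies.
- Each eigenvalue is the variance of the data along its eigenvector's direction.
- SVD factorizes a matrix into \(U \Sigma V^T\); PCA implementations such as scikit-learn's typically apply it to the centered data matrix.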
# Covariance matrix example
X = np.array([[2.5, 2.4],
[0.5, 0.7],
[2.2, 2.9],
[1.9, 2.2],
[3.1, 3.0]])
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov) # eigh is meant for symmetric matrices: real eigenvalues, ascending order
print("Covariance matrix:\n", cov)
print("Eigenvalues:", eig_vals)
print("Eigenvectors:\n", eig_vecs)
# SVD
U, S, Vt = np.linalg.svd(X_centered)
print("Singular values:", S)
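The two routes above are connected: for centered data with n rows, the squared singular values divided by n − 1 equal the eigenvalues of the sample covariance matrix. The following sketch checks this on the same data and projects the points onto the first principal direction; var_from_svd and pc1_scores are illustrative names.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
X_centered = X - X.mean(axis=0)
n = X_centered.shape[0]

# Eigenvalues of the sample covariance matrix, sorted largest first
cov = np.cov(X_centered, rowvar=False)
eig_vals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Singular values of the centered data (NumPy returns them in descending order)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
var_from_svd = S**2 / (n - 1)

print("Eigenvalues match S^2/(n-1):", np.allclose(eig_vals, var_from_svd))

# Project every sample onto the first principal direction (first row of Vt)
pc1_scores = X_centered @ Vt[0]
print("Share of variance on PC1:", var_from_svd[0] / var_from_svd.sum())
```

Working from the SVD avoids forming the covariance matrix explicitly, which is one reason it is a common choice for PCA.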
In practice, libraries like scikit-learn hide these details, but understanding the
Linear Algebra behind them helps you debug, tune and explain models better.
Calculus Essentials for Machine Learning
Derivatives & Slopes
The derivative of a function measures how fast it changes. Intuitively, it is the slope of the tangent line at a point. In Machine Learning, we use derivatives to see how the loss changes when we change the parameters.
- For \(f(x) = x^2\), the derivative is \(f'(x) = 2x\).
- For \(f(x) = wx + b\), the derivative w.r.t \(w\) is \(x\).
- The sign of the derivative tells us which way the loss increases, so we move the parameter in the opposite direction to reduce the loss.
import numpy as np
def f(x):
    return x**2
def numerical_derivative(func, x, eps=1e-5):
    return (func(x + eps) - func(x - eps)) / (2 * eps)
xs = np.linspace(-3, 3, 7)
for x in xs:
    print(f"x={x: .1f}, f(x)={f(x): .2f}, approx f'(x)={numerical_derivative(f, x): .2f}")
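The second rule in the list, that the derivative of \(wx + b\) with respect to \(w\) is \(x\), can be checked the same way. This is a small sketch with assumed values x = 3.0 and b = 1.0; the function name linear is hypothetical.

```python
def linear(w, x=3.0, b=1.0):
    # f(w) = w*x + b, viewed as a function of the weight w only
    return w * x + b

def numerical_derivative(func, w, eps=1e-5):
    # Central difference approximation of the derivative at w
    return (func(w + eps) - func(w - eps)) / (2 * eps)

for w in [-2.0, 0.0, 2.0]:
    approx = numerical_derivative(linear, w)
    print(f"w={w: .1f}, approx df/dw={approx:.4f} (expected x = 3.0)")
```

The slope is the same at every value of w because the function is linear in w.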
Gradients & Multivariate Functions
For functions with many parameters, we use a gradient, which is a vector of partial derivatives. It tells us how the function changes with respect to each parameter.
import numpy as np
def loss(w):
    # Simple quadratic loss: L(w1, w2) = w1^2 + 2*w2^2
    return w[0]**2 + 2 * w[1]**2
def grad_loss(w, eps=1e-5):
    g = np.zeros_like(w, dtype=float)
    for i in range(len(w)):
        w_pos = w.copy()
        w_neg = w.copy()
        w_pos[i] += eps
        w_neg[i] -= eps
        g[i] = (loss(w_pos) - loss(w_neg)) / (2 * eps)
    return g
w = np.array([1.0, -2.0])
print("w:", w)
print("loss(w):", loss(w))
print("grad L(w):", grad_loss(w))
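For this particular loss the partial derivatives can also be written by hand: \(\partial L / \partial w_1 = 2 w_1\) and \(\partial L / \partial w_2 = 4 w_2\). The sketch below compares that analytic gradient with a finite-difference estimate; analytic_grad and numerical_grad are illustrative names.

```python
import numpy as np

def loss(w):
    # L(w1, w2) = w1^2 + 2*w2^2
    return w[0]**2 + 2 * w[1]**2

def analytic_grad(w):
    # Partial derivatives worked out by hand: dL/dw1 = 2*w1, dL/dw2 = 4*w2
    return np.array([2 * w[0], 4 * w[1]])

def numerical_grad(w, eps=1e-5):
    # Central differences, one coordinate at a time
    g = np.zeros(len(w))
    for i in range(len(w)):
        step = np.zeros(len(w))
        step[i] = eps
        g[i] = (loss(w + step) - loss(w - step)) / (2 * eps)
    return g

w = np.array([1.0, -2.0])
print("analytic :", analytic_grad(w))
print("numerical:", numerical_grad(w))
```

Comparing the two is a simple gradient check, useful for catching mistakes in hand-derived gradients.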
Gradient Descent (Optimization)
Gradient Descent is an iterative optimization algorithm. At each step we move in the opposite direction of the gradient to reduce the loss: \(x \leftarrow x - \eta \, f'(x)\), where the learning rate \(\eta\) controls the step size.
# Simple gradient descent on a 1D function
def f(x):
    return x**2
def f_prime(x):
    return 2 * x
x = 5.0
lr = 0.1
for step in range(10):
    grad = f_prime(x)
    x = x - lr * grad
    print(f"step={step:02d}, x={x:.4f}, f(x)={f(x):.4f}")
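The same update works with more than one parameter by replacing the derivative with the gradient. Here is a minimal sketch on the two-parameter quadratic loss \(L(w_1, w_2) = w_1^2 + 2 w_2^2\); the starting point, learning rate and step count are arbitrary choices for illustration.

```python
import numpy as np

def loss(w):
    # L(w1, w2) = w1^2 + 2*w2^2, minimized at w = (0, 0)
    return w[0]**2 + 2 * w[1]**2

def grad(w):
    # Analytic gradient of the loss above
    return np.array([2 * w[0], 4 * w[1]])

w = np.array([1.0, -2.0])  # arbitrary starting point
lr = 0.1                   # learning rate (step size)

for step in range(50):
    w = w - lr * grad(w)   # move against the gradient

print("final w   :", w)
print("final loss:", loss(w))
```

For this loss, each step multiplies \(w_2\) by \(1 - 4 \cdot lr\), so a learning rate above 0.5 makes the iterates overshoot and diverge instead of converging.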
Probability Theory for Data Science & Machine Learning
Random Variables & Events
A random variable is a quantity whose value depends on the outcome of a random process, so it is uncertain until observed. We model it using a probability distribution. Examples:
- Number of clicks on an ad (discrete).
- Height of a person (continuous).
- Class label in classification (categorical).
import numpy as np
# Simulate 10,000 coin flips (0 = tails, 1 = heads)
np.random.seed(42)
flips = np.random.binomial(n=1, p=0.5, size=10_000)
prob_heads = flips.mean()
prob_tails = 1 - prob_heads
print("P(heads) ≈", round(prob_heads, 3))
print("P(tails) ≈", round(prob_tails, 3))
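The three kinds of random variable listed above can each be simulated with NumPy's Generator API. In this sketch the distributions and their parameters (Poisson clicks, normally distributed heights, three class labels with fixed probabilities) are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Discrete: number of ad clicks (assumed Poisson with mean 2)
clicks = rng.poisson(lam=2.0, size=5)

# Continuous: heights in cm (assumed normal, mean 170, std 10)
heights = rng.normal(loc=170.0, scale=10.0, size=5)

# Categorical: class labels drawn with assumed probabilities
labels = rng.choice(["A", "B", "C"], size=5, p=[0.5, 0.3, 0.2])

print("clicks :", clicks)
print("heights:", np.round(heights, 1))
print("labels :", labels)
```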
Common Probability Distributions
Some distributions appear again and again in Data Science:
- Bernoulli / Binomial: binary outcomes and counts.
- Normal (Gaussian): continuous, “bell‑shaped” data.
- Poisson: counts of events in a fixed interval (e.g. events per minute).
- Exponential: time between events.
import numpy as np
np.random.seed(0)
# Normal distribution
normal_data = np.random.normal(loc=0, scale=1, size=1000)
# Binomial distribution
binom_data = np.random.binomial(n=10, p=0.3, size=1000)
# Poisson distribution
poisson_data = np.random.poisson(lam=3, size=1000)
print("Normal mean/std:", round(normal_data.mean(), 3), round(normal_data.std(), 3))
print("Binomial mean:", round(binom_data.mean(), 3))
print("Poisson mean:", round(poisson_data.mean(), 3))
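The Poisson and exponential distributions in the list are two views of the same process: if events arrive at rate \(\lambda\) per unit time, the count per unit time has mean \(\lambda\) and the waiting time between events has mean \(1/\lambda\). A short sketch with an assumed rate of 3 events per minute:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0  # assumed rate: 3 events per minute

# Counts per minute and waiting times (in minutes) between events
counts = rng.poisson(lam=lam, size=100_000)
waits = rng.exponential(scale=1 / lam, size=100_000)

print("Poisson mean (expect about 3)      :", round(counts.mean(), 3))
print("Exponential mean (expect about 1/3):", round(waits.mean(), 3))
```

Note that NumPy parameterizes the exponential by its scale (the mean), not by the rate.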
Conditional Probability & Bayes' Theorem
Conditional probability is the probability of an event given that another event has occurred: \(P(A \mid B)\). Bayes' theorem connects prior and posterior probabilities:
\[ P(A \mid B) = \frac{P(B \mid A) \; P(A)}{P(B)} \]
# Simple Bayes theorem example in code
P_disease = 0.01 # 1% have the disease (prior)
P_positive_given_disease = 0.99
P_positive_given_healthy = 0.05
P_healthy = 1 - P_disease
# Total probability of a positive test
P_positive = (P_positive_given_disease * P_disease +
              P_positive_given_healthy * P_healthy)
# Posterior: P(disease | positive)
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive
print("P(positive):", round(P_positive, 3))
print("P(disease | positive):", round(P_disease_given_positive, 3))
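The result, a posterior of about 1 in 6 despite a 99% sensitive test, can feel counterintuitive. One way to check it is a simulation with the same three probabilities; the population size and variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # simulated population size (arbitrary)

# Who has the disease (prior 1%)
has_disease = rng.random(n) < 0.01

# Each person's chance of testing positive depends on disease status
p_positive = np.where(has_disease, 0.99, 0.05)
positive = rng.random(n) < p_positive

# Among positive tests, the fraction that truly have the disease
posterior_est = has_disease[positive].mean()
exact = (0.99 * 0.01) / (0.99 * 0.01 + 0.05 * 0.99)

print("Simulated P(disease | positive):", round(posterior_est, 3))
print("Exact     P(disease | positive):", round(exact, 3))
```

Most positives come from the much larger healthy group, which is why the posterior stays low when the disease is rare.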
Descriptive Statistics for Exploratory Data Analysis (EDA)
Measures of Central Tendency
Central tendency measures tell you where the “center” of the data lies.
- Mean: arithmetic average, sensitive to outliers.
- Median: middle value, robust to outliers.
- Mode: most frequent value.
import numpy as np
from scipy import stats
data = np.array([10, 12, 13, 13, 14, 100]) # 100 is an outlier
mean = data.mean()
median = np.median(data)
mode = stats.mode(data, keepdims=True).mode[0]
print("Data:", data)
print("Mean :", mean)
print("Median:", median)
print("Mode :", mode)
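To make the robustness claim concrete, the following sketch recomputes the mean and median on the same data with the outlier removed. The filter data < 100 is simply a convenient way to drop that one value.

```python
import numpy as np

data = np.array([10, 12, 13, 13, 14, 100])
without_outlier = data[data < 100]  # drop the single extreme value

print("Mean   with / without outlier:", data.mean(), without_outlier.mean())
print("Median with / without outlier:", np.median(data), np.median(without_outlier))
```

The mean more than doubles because of one value, while the median does not move at all.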
Measures of Spread (Dispersion)
Spread tells you how variable your data is. Two datasets can have the same mean with very different spreads.
- Range: max − min.
- Variance & Standard Deviation: variance is the average squared deviation from the mean; standard deviation is its square root, expressed in the same units as the data.
- Percentiles & IQR: robust spread measures (IQR = Q3 − Q1).
import numpy as np
data = np.array([10, 12, 13, 13, 14, 100])
data_min, data_max = data.min(), data.max()
data_range = data_max - data_min
variance = np.var(data, ddof=1) # sample variance
std_dev = np.std(data, ddof=1) # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print("Range :", data_range)
print("Variance :", round(variance, 2))
print("Std Dev :", round(std_dev, 2))
print("Q1, Q3, IQR:", q1, q3, iqr)
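The ddof=1 argument is easier to understand with the variance written out by hand. This sketch computes the population version (divide by n) and the sample version (divide by n − 1) on the same data; the variable names are illustrative.

```python
import numpy as np

data = np.array([10, 12, 13, 13, 14, 100])
n = len(data)

deviations = data - data.mean()       # distance of each point from the mean
sum_sq = (deviations**2).sum()        # sum of squared deviations

pop_var = sum_sq / n                  # population variance (ddof=0)
sample_var = sum_sq / (n - 1)         # sample variance (ddof=1)
sample_std = np.sqrt(sample_var)      # standard deviation: square root of variance

print("Population variance:", round(pop_var, 2))
print("Sample variance    :", round(sample_var, 2))
print("Sample std dev     :", round(sample_std, 2))
print("Matches np.var(ddof=1):", np.isclose(sample_var, np.var(data, ddof=1)))
```

Dividing by n − 1 corrects the bias that arises from estimating the mean on the same sample.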
Quick Summary with pandas.describe()
In real projects you rarely compute all statistics manually. Instead, you use
pandas.DataFrame.describe() to get a quick overview.
import pandas as pd
df = pd.DataFrame({
    "age": [23, 25, 31, 40, 29, 37, 45],
    "salary": [35000, 42000, 50000, 70000, 48000, 65000, 90000]
})
print(df.describe())
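The result of describe() is itself a DataFrame, so individual statistics can be read with .loc. A small sketch on the same data follows; it also shows that the "50%" row is the median and that the "std" row uses the sample (n − 1) formula.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [23, 25, 31, 40, 29, 37, 45],
    "salary": [35000, 42000, 50000, 70000, 48000, 65000, 90000],
})
summary = df.describe()

# Rows are statistics, columns are the original numeric columns
median_age = summary.loc["50%", "age"]
std_salary = summary.loc["std", "salary"]

print("Median age:", median_age)
print("Salary std matches np.std(ddof=1):",
      np.isclose(std_salary, np.std(df["salary"].to_numpy(), ddof=1)))
```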
Inferential Statistics: From Sample to Population
Sampling & Confidence Intervals
A confidence interval (CI) gives a range of plausible values for a population parameter (e.g., the true mean). It is computed from a sample, and its confidence level describes the procedure: across repeated samples, about 95% of 95% intervals would contain the true parameter.
import numpy as np
from scipy import stats
np.random.seed(0)
# Suppose these are sample observations of a metric (e.g. session length)
sample = np.random.normal(loc=5.0, scale=1.0, size=100)
mean = sample.mean()
std_err = stats.sem(sample) # standard error of the mean
confidence = 0.95
ci_low, ci_high = stats.t.interval(
    confidence,
    df=len(sample) - 1,
    loc=mean,
    scale=std_err
)
print("Sample mean:", round(mean, 3))
print("95% CI :", (round(ci_low, 3), round(ci_high, 3)))
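What "95% confidence" means is easiest to see by repeating the experiment many times. The sketch below draws many samples from a population with a known mean and counts how often the t-based interval contains it. The number of trials and the population parameters are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, n, trials = 5.0, 100, 2000

# Critical value for a two-sided 95% interval with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)

# One row per simulated sample
samples = rng.normal(loc=true_mean, scale=1.0, size=(trials, n))
means = samples.mean(axis=1)
std_errs = samples.std(axis=1, ddof=1) / np.sqrt(n)

# An interval covers the truth when the true mean is within t_crit standard errors
covered = np.abs(means - true_mean) <= t_crit * std_errs
coverage = covered.mean()

print("Fraction of intervals containing the true mean:", round(coverage, 3))
```

The fraction should land close to 0.95; any single interval either contains the true mean or does not.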
Hypothesis Testing & p‑values
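In hypothesis testing we start with a null hypothesis \(H_0\) (no effect), and an alternative \(H_1\) (there is an effect). We compute a test statistic and its p‑value, the probability of seeing a result at least as extreme as the observed one if \(H_0\) were true, and reject \(H_0\) when the p‑value falls below a chosen significance level \(\alpha\).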
from scipy import stats
import numpy as np
np.random.seed(1)
# Example: one-sample t-test
# H0: true mean = 0, H1: true mean ≠ 0
sample = np.random.normal(loc=0.5, scale=1.0, size=50)
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print("t statistic:", round(t_stat, 3))
print("p value :", round(p_value, 4))
alpha = 0.05
if p_value < alpha:
    print("Reject H0 at 5% level")
else:
    print("Fail to reject H0 at 5% level")
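The one-sample t statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors. As a sketch, the computation can be done by hand and compared with SciPy's result; the data here are freshly simulated, so the numbers differ from the run above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.5, scale=1.0, size=50)
n = len(sample)
popmean = 0.0  # value of the mean under H0

# t = (sample mean - hypothesized mean) / standard error
std_err = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - popmean) / std_err

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=popmean)
print("manual:", round(t_manual, 3), round(p_manual, 4))
print("scipy :", round(t_scipy, 3), round(p_scipy, 4))
```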