Mathematics & Statistics Foundation

Linear Algebra for Data Science & Machine Learning

Vectors

A vector is an ordered list of numbers. Geometrically, a vector represents a point or direction in space. In Data Science, a feature vector contains all features for one observation (row).

Low‑dimensional vector: the features of one sample, e.g. \([height, weight, age]\) (a 3‑dimensional vector).
n‑dimensional vectors: embeddings in NLP, image pixels, etc.
Common operations: addition, scaling, dot product, norm (length).

import numpy as np

# 1-D array and explicit column vector
v = np.array([1, 2, 3])          # 1-D array (shape: (3,)), neither row nor column
v_col = v.reshape(-1, 1)         # column vector (shape: (3, 1))

# Vector addition and scaling
u = np.array([4, 5, 6])
sum_vec = v + u                  # [5, 7, 9]
scaled = 2 * v                   # [2, 4, 6]

# Dot product and norm
dot = np.dot(v, u)               # 1*4 + 2*5 + 3*6 = 32
norm = np.linalg.norm(v)         # length of vector

print("v:", v)
print("u:", u)
print("dot(v, u):", dot)
print("||v||:", norm)
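
The dot product and norm listed above combine into cosine similarity, which is commonly used to compare embedding vectors. The following is a minimal sketch; the helper name cosine_similarity is made up for illustration and is not a NumPy function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (hypothetical helper for illustration)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1, 2, 3])
u = np.array([4, 5, 6])

cos_vu = cosine_similarity(v, u)
cos_vv = cosine_similarity(v, 2 * v)   # scaling a vector does not change its direction

print("cos(v, u): ", round(cos_vu, 4))
print("cos(v, 2v):", round(cos_vv, 4))
```

A value near 1 means the two vectors point in almost the same direction, regardless of their lengths.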

Matrices & Matrix Multiplication

A matrix is a 2D array of numbers. In Data Science, a dataset is typically represented as a matrix \(X \in \mathbb{R}^{n \times d}\) where \(n\) is the number of rows (samples) and \(d\) is the number of features.

Each row: one data point.
Each column: one feature.
Matrix multiplication is used heavily in neural networks and linear models.

# Design matrix X and parameter vector w
X = np.array([
    [1, 2],
    [3, 4],
    [5, 6]
])                              # shape: (3, 2)

w = np.array([0.1, 0.5])        # shape: (2,)
b = 0.3                         # bias term

# Linear model: y = Xw + b
y = X @ w + b                   # "@" is matrix multiplication in Python

print("X shape:", X.shape)
print("w shape:", w.shape)
print("Predictions y:", y)
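
To make the shape rule concrete, the sketch below repeats the same linear model and checks that the matrix product is just one dot product per row (sample). The variable names y_matmul and y_rows are chosen here for illustration.

```python
import numpy as np

X = np.array([[1, 2],
              [3, 4],
              [5, 6]])            # shape (3, 2): 3 samples, 2 features
w = np.array([0.1, 0.5])          # shape (2,)
b = 0.3

# Matrix-vector product: (3, 2) @ (2,) -> (3,)
y_matmul = X @ w + b

# Same result, computed one row (sample) at a time
y_rows = np.array([np.dot(row, w) + b for row in X])

print("y via @   :", y_matmul)
print("y via rows:", y_rows)
print("Equal:", np.allclose(y_matmul, y_rows))
```

The inner dimensions must match: the number of columns of X (features) has to equal the length of w.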

Eigenvalues, Eigenvectors & SVD (Intuition)

Many dimensionality reduction techniques such as PCA rely on eigenvalues and singular values.

Eigenvectors of the covariance matrix give the directions along which the data varies (the principal axes).
Eigenvalues tell you how much variance lies along each of those directions.
SVD factorizes a matrix into \(U \Sigma V^T\) and is used under the hood in PCA.

# Covariance matrix example
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices: real eigenvalues, ascending order
eig_vals, eig_vecs = np.linalg.eigh(cov)

print("Covariance matrix:\n", cov)
print("Eigenvalues:", eig_vals)
print("Eigenvectors:\n", eig_vecs)

# SVD
U, S, Vt = np.linalg.svd(X_centered)
print("Singular values:", S)
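
The two decompositions above are closely related: for centered data with n rows, the eigenvalues of the sample covariance matrix equal the squared singular values divided by n − 1. The sketch below checks this on the same toy data; the names var_from_svd and explained_ratio are introduced here for illustration.

```python
import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])
n = X.shape[0]
X_centered = X - X.mean(axis=0)

# Eigenvalues of the (symmetric) sample covariance matrix, ascending order
cov = np.cov(X_centered, rowvar=False)
eig_vals = np.linalg.eigvalsh(cov)

# Singular values of the centered data (descending), converted to variances
S = np.linalg.svd(X_centered, compute_uv=False)
var_from_svd = S**2 / (n - 1)

print("Eigenvalues (descending):", eig_vals[::-1])
print("S^2 / (n - 1)           :", var_from_svd)

# Fraction of the total variance along each principal direction
explained_ratio = var_from_svd / var_from_svd.sum()
print("Explained variance ratio:", explained_ratio)
```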

In practice, libraries like scikit-learn hide these details, but understanding the Linear Algebra behind them helps you debug, tune and explain models better.

Calculus Essentials for Machine Learning

Derivatives & Slopes

The derivative of a function measures how fast it changes. Intuitively, it is the slope of the tangent line at a point. In Machine Learning, we use derivatives to see how the loss changes when we change the parameters.

For \(f(x) = x^2\), the derivative is \(f'(x) = 2x\).
For \(f(x) = wx + b\), the derivative with respect to \(w\) is \(x\).
The sign of the derivative tells us which way the loss increases, so we move the parameter the opposite way to reduce it.

import numpy as np

def f(x):
    return x**2

def numerical_derivative(func, x, eps=1e-5):
    return (func(x + eps) - func(x - eps)) / (2 * eps)

xs = np.linspace(-3, 3, 7)
for x in xs:
    print(f"x={x: .1f}, f(x)={f(x): .2f}, approx f'(x)={numerical_derivative(f, x): .2f}")
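
The same finite-difference check works for the linear example from the list above. This sketch treats \(f = wx + b\) as a function of \(w\) with \(x\) and \(b\) held fixed; the values x=3.0 and b=1.0 are arbitrary choices for illustration.

```python
def linear(w, x=3.0, b=1.0):
    # f(w) = w*x + b, viewed as a function of w with x and b held fixed
    return w * x + b

def numerical_derivative(func, x, eps=1e-5):
    # Central difference approximation of func'(x)
    return (func(x + eps) - func(x - eps)) / (2 * eps)

d_dw = numerical_derivative(linear, 2.0)
print("df/dw at w=2:", round(d_dw, 4))   # should be close to x = 3.0
```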

Gradients & Multivariate Functions

For functions with many parameters, we use a gradient, which is a vector of partial derivatives. It tells us how the function changes with respect to each parameter.

import numpy as np

def loss(w):
    # Simple quadratic loss: L(w1, w2) = w1^2 + 2*w2^2
    return w[0]**2 + 2 * w[1]**2

def grad_loss(w, eps=1e-5):
    g = np.zeros_like(w, dtype=float)
    for i in range(len(w)):
        w_pos = w.copy()
        w_neg = w.copy()
        w_pos[i] += eps
        w_neg[i] -= eps
        g[i] = (loss(w_pos) - loss(w_neg)) / (2 * eps)
    return g

w = np.array([1.0, -2.0])
print("w:", w)
print("loss(w):", loss(w))
print("grad L(w):", grad_loss(w))
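
For this particular loss the gradient can also be written by hand: \(\partial L / \partial w_1 = 2 w_1\) and \(\partial L / \partial w_2 = 4 w_2\). The sketch below compares that analytic gradient with a finite-difference estimate; the function names analytic_grad and numerical_grad are chosen here for illustration.

```python
import numpy as np

def loss(w):
    return w[0]**2 + 2 * w[1]**2

def analytic_grad(w):
    # dL/dw1 = 2*w1, dL/dw2 = 4*w2
    return np.array([2 * w[0], 4 * w[1]])

def numerical_grad(w, eps=1e-5):
    g = np.zeros_like(w, dtype=float)
    for i in range(len(w)):
        step = np.zeros_like(w, dtype=float)
        step[i] = eps                      # perturb one coordinate at a time
        g[i] = (loss(w + step) - loss(w - step)) / (2 * eps)
    return g

w = np.array([1.0, -2.0])
print("analytic :", analytic_grad(w))
print("numerical:", numerical_grad(w))
```

Comparing the two is a common sanity check ("gradient checking") when a hand-derived gradient is used in training code.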

Gradient Descent (Optimization)

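Gradient Descent is an iterative optimization algorithm. At each step we move the parameter a small amount in the direction opposite to the gradient, which locally reduces the loss: \(x \leftarrow x - \eta \, f'(x)\), where \(\eta\) is the learning rate (lr in the code below).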

# Simple gradient descent on a 1D function
def f(x):
    return x**2

def f_prime(x):
    return 2 * x

x = 5.0
lr = 0.1

for step in range(10):
    grad = f_prime(x)
    x = x - lr * grad
    print(f"step={step:02d}, x={x:.4f}, f(x)={f(x):.4f}")
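
The same update rule carries over to several parameters by replacing the derivative with the gradient. As a sketch, the loop below minimizes the quadratic loss \(L(w_1, w_2) = w_1^2 + 2 w_2^2\) from the gradients section; the starting point, learning rate and step count are arbitrary choices for illustration.

```python
import numpy as np

def loss(w):
    return w[0]**2 + 2 * w[1]**2

def grad(w):
    # Analytic gradient of the loss above
    return np.array([2 * w[0], 4 * w[1]])

w = np.array([1.0, -2.0])
lr = 0.1
history = [loss(w)]

for step in range(50):
    w = w - lr * grad(w)          # move against the gradient
    history.append(loss(w))

print("final w   :", w)
print("final loss:", history[-1])
```

For this loss each step multiplies \(w_2\) by \(1 - 4 \cdot lr\), so a learning rate above 0.5 makes the iterates grow instead of shrink; the learning rate has to be small relative to the curvature of the loss.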

Probability Theory for Data Science & Machine Learning

Random Variables & Events

A random variable assigns a value to the outcome of a random process, so its value is uncertain until it is observed. We describe it with a probability distribution. Examples:

Number of clicks on an ad (discrete).
Height of a person (continuous).
Class label in classification (categorical).

import numpy as np

# Simulate 10,000 coin flips (0 = tails, 1 = heads)
np.random.seed(42)
flips = np.random.binomial(n=1, p=0.5, size=10_000)

prob_heads = flips.mean()
prob_tails = 1 - prob_heads

print("P(heads) ≈", round(prob_heads, 3))
print("P(tails) ≈", round(prob_tails, 3))
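
The estimate above gets closer to the true probability as the number of flips grows. The sketch below illustrates this with increasing sample sizes; it uses NumPy's Generator API (np.random.default_rng) instead of the global seed, which is a choice made here, not something the example above requires.

```python
import numpy as np

rng = np.random.default_rng(42)
flips = rng.integers(0, 2, size=100_000)   # 0 = tails, 1 = heads

# Estimate P(heads) from the first n flips, for growing n
for n in (10, 100, 1_000, 10_000, 100_000):
    estimate = flips[:n].mean()
    print(f"n={n:>6}: P(heads) ≈ {estimate:.4f}")

final_estimate = flips.mean()
```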

Common Probability Distributions

Some distributions appear again and again in Data Science:

Bernoulli / Binomial: binary outcomes and counts.
Normal (Gaussian): continuous, “bell‑shaped” data.
Poisson: counts of events in a fixed interval (e.g. events per minute).
Exponential: time between events.

import numpy as np

np.random.seed(0)

# Normal distribution
normal_data = np.random.normal(loc=0, scale=1, size=1000)

# Binomial distribution
binom_data = np.random.binomial(n=10, p=0.3, size=1000)

# Poisson distribution
poisson_data = np.random.poisson(lam=3, size=1000)

print("Normal mean/std:", round(normal_data.mean(), 3), round(normal_data.std(), 3))
print("Binomial mean:", round(binom_data.mean(), 3))
print("Poisson mean:", round(poisson_data.mean(), 3))
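
The Exponential distribution from the list above is not simulated in the snippet, and the sample means can be compared against their theoretical values. The sketch below does both, using scipy.stats frozen distributions for the theory; the rate of 3 events per unit time is an arbitrary choice for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Exponential: time between events; mean waiting time is 1 / rate
rate = 3.0
exp_data = rng.exponential(scale=1 / rate, size=100_000)
print("Exponential sample mean:", round(exp_data.mean(), 3),
      "| theory:", round(1 / rate, 3))

# Theoretical means from scipy.stats
binom_mean = stats.binom(n=10, p=0.3).mean()      # n * p
poisson_mean = stats.poisson(mu=3).mean()         # equals the rate parameter
print("Binomial(10, 0.3) mean:", binom_mean)
print("Poisson(3) mean       :", poisson_mean)

# Probability that a standard normal value is below 1.96
print("P(Z <= 1.96):", round(stats.norm.cdf(1.96), 4))
```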

Conditional Probability & Bayes' Theorem

Conditional probability is the probability of an event given that another event has occurred: \(P(A \mid B)\). Bayes' theorem connects prior and posterior probabilities:

\[ P(A \mid B) = \frac{P(B \mid A) \; P(A)}{P(B)} \]

# Simple Bayes theorem example in code

P_disease = 0.01          # 1% have the disease (prior)
P_positive_given_disease = 0.99
P_positive_given_healthy = 0.05

P_healthy = 1 - P_disease

# Total probability of a positive test
P_positive = (P_positive_given_disease * P_disease +
              P_positive_given_healthy * P_healthy)

# Posterior: P(disease | positive)
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive

print("P(positive):", round(P_positive, 3))
print("P(disease | positive):", round(P_disease_given_positive, 3))
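
The posterior of roughly 1 in 6 can feel surprisingly low for a 99% sensitive test. One way to check it is to simulate a large population with the same three probabilities and count; the sketch below does that, with a population size of one million chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 1_000_000

# 1% of the population has the disease
has_disease = rng.random(n_people) < 0.01

# Test is positive with probability 0.99 if diseased, 0.05 if healthy
p_pos = np.where(has_disease, 0.99, 0.05)
tests_positive = rng.random(n_people) < p_pos

# Among people who tested positive, what fraction is actually diseased?
posterior_sim = has_disease[tests_positive].mean()
posterior_exact = (0.99 * 0.01) / (0.99 * 0.01 + 0.05 * 0.99)

print("Simulated P(disease | positive):", round(posterior_sim, 3))
print("Exact     P(disease | positive):", round(posterior_exact, 3))
```

Most positive results come from the much larger healthy group, which is why the prior matters so much.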

Descriptive Statistics for Exploratory Data Analysis (EDA)

Measures of Central Tendency

Central tendency measures tell you where the “center” of the data lies.

Mean: arithmetic average, sensitive to outliers.
Median: middle value, robust to outliers.
Mode: most frequent value.

import numpy as np
from scipy import stats

data = np.array([10, 12, 13, 13, 14, 100])  # 100 is an outlier

mean = data.mean()
median = np.median(data)
mode = stats.mode(data, keepdims=True).mode[0]   # keepdims needs SciPy >= 1.9

print("Data:", data)
print("Mean  :", mean)
print("Median:", median)
print("Mode  :", mode)
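
To see the sensitivity to outliers directly, the sketch below recomputes the mean and median on the same data after dropping the value 100.

```python
import numpy as np

data = np.array([10, 12, 13, 13, 14, 100])
no_outlier = data[data < 100]          # drop the outlier

mean_all, mean_clean = data.mean(), no_outlier.mean()
median_all, median_clean = np.median(data), np.median(no_outlier)

print("Mean   with / without outlier:", mean_all, "/", mean_clean)
print("Median with / without outlier:", median_all, "/", median_clean)
```

The mean more than halves when the outlier is removed, while the median does not move.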

Measures of Spread (Dispersion)

Spread tells you how variable your data is. Two datasets can have the same mean with very different spreads.

Range: max − min.
Variance & Standard Deviation: variance is the average squared deviation from the mean; standard deviation is its square root, in the same units as the data.
Percentiles & IQR: robust spread measures (IQR = Q3 − Q1).

import numpy as np

data = np.array([10, 12, 13, 13, 14, 100])

data_min, data_max = data.min(), data.max()
data_range = data_max - data_min
variance = np.var(data, ddof=1)          # sample variance
std_dev = np.std(data, ddof=1)           # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print("Range      :", data_range)
print("Variance   :", round(variance, 2))
print("Std Dev    :", round(std_dev, 2))
print("Q1, Q3, IQR:", q1, q3, iqr)
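
The ddof argument controls the divisor in the variance: ddof=0 divides by n (population variance), ddof=1 divides by n − 1 (sample variance). The sketch below computes both by hand on the same data and compares them with NumPy.

```python
import numpy as np

data = np.array([10, 12, 13, 13, 14, 100])
n = len(data)
squared_dev = (data - data.mean()) ** 2

pop_var = squared_dev.sum() / n             # matches np.var(data, ddof=0)
sample_var = squared_dev.sum() / (n - 1)    # matches np.var(data, ddof=1)

print("Population variance:", round(pop_var, 2), "| NumPy:", round(np.var(data), 2))
print("Sample variance    :", round(sample_var, 2), "| NumPy:", round(np.var(data, ddof=1), 2))
print("Sample std dev     :", round(np.sqrt(sample_var), 2))
```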

Quick Summary with pandas.describe()

In real projects you rarely compute all statistics manually. Instead, you use pandas.DataFrame.describe() to get a quick overview.

import pandas as pd

df = pd.DataFrame({
    "age": [23, 25, 31, 40, 29, 37, 45],
    "salary": [35000, 42000, 50000, 70000, 48000, 65000, 90000]
})

print(df.describe())
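
The result of describe() is itself a DataFrame indexed by statistic name, so individual values can be looked up with .loc. A short sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 25, 31, 40, 29, 37, 45],
    "salary": [35000, 42000, 50000, 70000, 48000, 65000, 90000]
})

summary = df.describe()

# Rows are statistics ("mean", "50%", ...), columns are the original columns
mean_age = summary.loc["mean", "age"]
median_salary = summary.loc["50%", "salary"]

print(summary.loc[["mean", "50%"]])
print("Mean age:", round(mean_age, 2))
print("Median salary:", median_salary)
```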

Inferential Statistics: From Sample to Population

Sampling & Confidence Intervals

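A confidence interval (CI) gives a range of plausible values for a population parameter (e.g., the true mean). It is computed from a sample but makes a statement about the population: a 95% CI comes from a procedure that captures the true parameter in about 95% of repeated samples.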

import numpy as np
from scipy import stats

np.random.seed(0)

# Suppose these are sample observations of a metric (e.g. session length)
sample = np.random.normal(loc=5.0, scale=1.0, size=100)

mean = sample.mean()
std_err = stats.sem(sample)   # standard error of the mean
confidence = 0.95

ci_low, ci_high = stats.t.interval(
    confidence,
    df=len(sample) - 1,
    loc=mean,
    scale=std_err
)

print("Sample mean:", round(mean, 3))
print("95% CI    :", (round(ci_low, 3), round(ci_high, 3)))
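
The interval returned by stats.t.interval is the sample mean plus or minus a critical t value times the standard error. The sketch below rebuilds it by hand under the same assumptions (a t-based interval for the mean); it draws its own sample with a Generator, so the numbers differ from the output above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=1.0, size=100)

n = len(sample)
mean = sample.mean()
std_err = sample.std(ddof=1) / np.sqrt(n)     # same quantity as stats.sem(sample)
t_crit = stats.t.ppf(0.975, df=n - 1)         # two-sided 95% critical value

ci_low = mean - t_crit * std_err
ci_high = mean + t_crit * std_err

print("Sample mean:", round(mean, 3))
print("Manual 95% CI:", (round(ci_low, 3), round(ci_high, 3)))
```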

Hypothesis Testing & p‑values

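In hypothesis testing we start with a null hypothesis \(H_0\) (no effect) and an alternative \(H_1\) (there is an effect). We compute a test statistic and its p‑value, the probability of seeing a result at least as extreme as the observed one if \(H_0\) were true, and reject \(H_0\) when the p‑value falls below a chosen significance level \(\alpha\).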

from scipy import stats
import numpy as np

np.random.seed(1)

# Example: one-sample t-test
# H0: true mean = 0, H1: true mean ≠ 0
sample = np.random.normal(loc=0.5, scale=1.0, size=50)

t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

print("t statistic:", round(t_stat, 3))
print("p value    :", round(p_value, 4))

alpha = 0.05
if p_value < alpha:
    print("Reject H0 at 5% level")
else:
    print("Fail to reject H0 at 5% level")
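
The one-sample t statistic is \(t = (\bar{x} - \mu_0) / (s / \sqrt{n})\), and the two-sided p‑value is twice the upper tail probability of the t distribution with \(n - 1\) degrees of freedom. The sketch below computes both by hand and compares them with stats.ttest_1samp; it draws its own sample, so the numbers differ from the output above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.5, scale=1.0, size=50)
mu0 = 0.0                                   # mean under H0
n = len(sample)

# t = (sample mean - mu0) / standard error
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu0)

print("manual:", round(t_manual, 3), p_manual)
print("scipy :", round(t_scipy, 3), p_scipy)
```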