Data Science

Mathematics & Statistics Foundation

Linear algebra, calculus, probability, descriptive and inferential statistics for data science.

Linear Algebra for Data Science & Machine Learning

Vectors

A vector is an ordered list of numbers. Geometrically, a vector represents a point or direction in space. In Data Science, a feature vector contains all features for one observation (row).

  • Feature vector: the features of one sample, e.g. \([height, weight, age]\) (a 3-dimensional vector stored as a 1-D array).
  • n‑dimensional vectors: embeddings in NLP, image pixels, etc.
  • Common operations: addition, scaling, dot product, norm (length).
import numpy as np

# 1-D array and explicit column vector
v = np.array([1, 2, 3])          # 1-D array (shape: (3,)), neither row nor column
v_col = v.reshape(-1, 1)         # column vector (shape: (3, 1))

# Vector addition and scaling
u = np.array([4, 5, 6])
sum_vec = v + u                  # [5, 7, 9]
scaled = 2 * v                   # [2, 4, 6]

# Dot product and norm
dot = np.dot(v, u)               # 1*4 + 2*5 + 3*6 = 32
norm = np.linalg.norm(v)         # length of vector

print("v:", v)
print("u:", u)
print("dot(v, u):", dot)
print("||v||:", norm)
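
The dot product and the norm are closely related: the norm of a vector is the square root of its dot product with itself. As a minimal sketch using the same v and u as above, the snippet below recomputes the norm by hand and combines both operations into the cosine similarity, a common way to compare feature vectors. The variable names norm_v, norm_u and cos_sim are illustrative only.

```python
import numpy as np

v = np.array([1, 2, 3])
u = np.array([4, 5, 6])

# The Euclidean norm is the square root of a vector's dot product with itself
norm_v = np.sqrt(np.dot(v, v))
norm_u = np.sqrt(np.dot(u, u))

# Cosine similarity: dot product divided by the product of the two norms
cos_sim = np.dot(v, u) / (norm_v * norm_u)

print("||v|| by hand:", norm_v)
print("||v|| via np.linalg.norm:", np.linalg.norm(v))
print("cosine similarity:", round(cos_sim, 4))
```

A cosine similarity close to 1 means the two vectors point in nearly the same direction, regardless of their lengths.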

Matrices & Matrix Multiplication

A matrix is a 2D array of numbers. In Data Science, a dataset is typically represented as a matrix \(X \in \mathbb{R}^{n \times d}\) where \(n\) is the number of rows (samples) and \(d\) is the number of features.

  • Each row: one data point.
  • Each column: one feature.
  • Matrix multiplication is used heavily in neural networks and linear models.
# Design matrix X and parameter vector w
X = np.array([
    [1, 2],
    [3, 4],
    [5, 6]
])                              # shape: (3, 2)

w = np.array([0.1, 0.5])        # shape: (2,)
b = 0.3                         # bias term

# Linear model: y = Xw + b
y = X @ w + b                   # "@" is matrix multiplication in Python

print("X shape:", X.shape)
print("w shape:", w.shape)
print("Predictions y:", y)
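
To see what the matrix product is doing, it can help to compute the same predictions one row at a time. The sketch below reuses the X, w and b from the example above and checks that the row-by-row dot products agree with X @ w + b; the names y_loop and y_matmul are illustrative.

```python
import numpy as np

X = np.array([
    [1, 2],
    [3, 4],
    [5, 6]
])                              # shape: (3, 2)
w = np.array([0.1, 0.5])        # shape: (2,)
b = 0.3

# Each prediction is the dot product of one row of X with w, plus the bias
y_loop = np.array([np.dot(row, w) + b for row in X])

# The matrix form computes all rows at once
y_matmul = X @ w + b

print("row by row :", y_loop)
print("matrix form:", y_matmul)
print("same result:", np.allclose(y_loop, y_matmul))
```

The inner dimensions must match: a (3, 2) matrix times a length-2 vector gives one prediction per row, a result of shape (3,).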

Eigenvalues, Eigenvectors & SVD (Intuition)

Many dimensionality reduction techniques such as PCA rely on eigenvalues and singular values.

  • Eigenvectors of the covariance matrix point along the directions in which the data varies the most.
  • Each eigenvalue is the variance of the data along its eigenvector.
  • SVD factorizes a matrix into \(U \Sigma V^T\) and is used under the hood in PCA.
# Covariance matrix example
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

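# A covariance matrix is symmetric, so eigh applies: it returns real
# eigenvalues in ascending order with the matching eigenvectors as columns
eig_vals, eig_vecs = np.linalg.eigh(cov)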

print("Covariance matrix:\n", cov)
print("Eigenvalues:", eig_vals)
print("Eigenvectors:\n", eig_vecs)

# SVD
U, S, Vt = np.linalg.svd(X_centered)
print("Singular values:", S)
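
The eigenvalues of the covariance matrix and the singular values of the centered data describe the same thing. As a rough sketch on the same small dataset, the snippet below checks that each eigenvalue equals the squared singular value divided by n − 1, and that projecting the data onto the first right singular vector gives a component whose variance is the largest eigenvalue. The names var_from_svd and pc1 are illustrative.

```python
import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

X_centered = X - X.mean(axis=0)
n = X_centered.shape[0]

# Eigenvalues of the (symmetric) covariance matrix, largest first
cov = np.cov(X_centered, rowvar=False)
eig_vals = np.linalg.eigvalsh(cov)[::-1]

# Singular values of the centered data, also largest first
_, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
var_from_svd = S**2 / (n - 1)

# Project the data onto the first principal direction
pc1 = X_centered @ Vt[0]

print("Eigenvalues        :", eig_vals)
print("S**2 / (n - 1)     :", var_from_svd)
print("Variance along PC1 :", pc1.var(ddof=1))
```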

In practice, libraries like scikit-learn hide these details, but understanding the linear algebra behind them helps you debug, tune, and explain models.

Calculus Essentials for Machine Learning

Derivatives & Slopes

The derivative of a function measures how fast it changes. Intuitively, it is the slope of the tangent line at a point. In Machine Learning, we use derivatives to see how the loss changes when we change the parameters.

  • For \(f(x) = x^2\), the derivative is \(f'(x) = 2x\).
  • For \(f(x) = wx + b\), the derivative w.r.t \(w\) is \(x\).
  • The sign of the derivative tells us which way to move: a small step in the direction opposite to the derivative reduces the loss.
import numpy as np

def f(x):
    return x**2

def numerical_derivative(func, x, eps=1e-5):
    return (func(x + eps) - func(x - eps)) / (2 * eps)

xs = np.linspace(-3, 3, 7)
for x in xs:
    print(f"x={x: .1f}, f(x)={f(x): .2f}, approx f'(x)={numerical_derivative(f, x): .2f}")
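
The same numerical check works for the linear function from the list above. In this sketch the values of x_input and b are arbitrary, chosen only for illustration; the point is that the slope with respect to w comes out as x_input no matter where we evaluate it.

```python
def numerical_derivative(func, x, eps=1e-5):
    # Central difference approximation of the derivative at x
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x_input, b = 3.0, 0.5          # arbitrary illustrative values

def linear(w):
    # f(w) = w * x + b, viewed as a function of the weight w
    return w * x_input + b

slope_at_2 = numerical_derivative(linear, 2.0)
slope_at_minus_1 = numerical_derivative(linear, -1.0)

print("df/dw at w=2 :", round(slope_at_2, 6))
print("df/dw at w=-1:", round(slope_at_minus_1, 6))
print("x_input      :", x_input)
```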

Gradients & Multivariate Functions

For functions with many parameters, we use a gradient, which is a vector of partial derivatives. It tells us how the function changes with respect to each parameter.

import numpy as np

def loss(w):
    # Simple quadratic loss: L(w1, w2) = w1^2 + 2*w2^2
    return w[0]**2 + 2 * w[1]**2

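def grad_loss(w, eps=1e-5):
    # Central-difference approximation of each partial derivative
    w = np.asarray(w, dtype=float)   # make sure adding eps works for integer input
    g = np.zeros_like(w)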
    for i in range(len(w)):
        w_pos = w.copy()
        w_neg = w.copy()
        w_pos[i] += eps
        w_neg[i] -= eps
        g[i] = (loss(w_pos) - loss(w_neg)) / (2 * eps)
    return g

w = np.array([1.0, -2.0])
print("w:", w)
print("loss(w):", loss(w))
print("grad L(w):", grad_loss(w))
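
For this particular loss the gradient can also be written down by hand: the partial derivatives are \(2 w_1\) and \(4 w_2\). The sketch below compares that analytic gradient with a numerical one; analytic_grad and numerical_grad are illustrative helper names, not functions from any library.

```python
import numpy as np

def loss(w):
    # L(w1, w2) = w1^2 + 2*w2^2
    return w[0]**2 + 2 * w[1]**2

def analytic_grad(w):
    # dL/dw1 = 2*w1, dL/dw2 = 4*w2
    return np.array([2 * w[0], 4 * w[1]])

def numerical_grad(func, w, eps=1e-5):
    # Central differences, one coordinate at a time
    w = np.asarray(w, dtype=float)
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (func(w + step) - func(w - step)) / (2 * eps)
    return g

w = np.array([1.0, -2.0])
print("analytic :", analytic_grad(w))
print("numerical:", numerical_grad(loss, w))
```

Comparing the two is a common sanity check when implementing gradients by hand.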

Gradient Descent (Optimization)

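Gradient Descent is an iterative optimization algorithm. At each step we move the parameters a small distance, set by the learning rate, in the direction opposite to the gradient, which is the direction in which the loss locally decreases fastest.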

# Simple gradient descent on a 1D function
def f(x):
    return x**2

def f_prime(x):
    return 2 * x

x = 5.0
lr = 0.1

for step in range(10):
    grad = f_prime(x)
    x = x - lr * grad
    print(f"step={step:02d}, x={x:.4f}, f(x)={f(x):.4f}")
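
The learning rate decides whether the iteration converges at all. As a small sketch on the same function \(f(x) = x^2\), each update multiplies x by \(1 - 2 \cdot lr\), so a modest learning rate shrinks x toward 0 while a learning rate above 1 makes it grow. The helper run_gd and the value 1.1 are illustrative choices.

```python
def f_prime(x):
    # Derivative of f(x) = x**2
    return 2 * x

def run_gd(x0, lr, steps=10):
    # Plain gradient descent: repeatedly step against the derivative
    x = x0
    for _ in range(steps):
        x = x - lr * f_prime(x)
    return x

x_small = run_gd(5.0, lr=0.1)   # each step multiplies x by 0.8
x_large = run_gd(5.0, lr=1.1)   # each step multiplies x by -1.2

print("lr=0.1 ->", round(x_small, 4))
print("lr=1.1 ->", round(x_large, 4))
```

With lr=0.1 the iterate ends up much closer to the minimum at 0; with lr=1.1 it overshoots on every step and moves further away.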

Probability Theory for Data Science & Machine Learning

Random Variables & Events

A random variable is a quantity whose value depends on the outcome of a random process. We describe it with a probability distribution. Examples:

  • Number of clicks on an ad (discrete).
  • Height of a person (continuous).
  • Class label in classification (categorical).
import numpy as np

# Simulate 10,000 coin flips (0 = tails, 1 = heads)
np.random.seed(42)
flips = np.random.binomial(n=1, p=0.5, size=10_000)

prob_heads = flips.mean()
prob_tails = 1 - prob_heads

print("P(heads) ≈", round(prob_heads, 3))
print("P(tails) ≈", round(prob_tails, 3))
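
The estimate above gets closer to the true probability as the number of flips grows. The following sketch illustrates this by looking at the running estimate for increasing sample sizes; it uses NumPy's newer default_rng generator, which is an assumption of this example rather than something the original code relies on.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 100,000 fair coin flips (0 = tails, 1 = heads)
flips = rng.binomial(n=1, p=0.5, size=100_000)

# Estimate P(heads) from the first n flips for growing n
for n in (10, 100, 1_000, 10_000, 100_000):
    estimate = flips[:n].mean()
    print(f"n={n:>7}: P(heads) ≈ {estimate:.4f}")

final_estimate = flips.mean()
```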

Common Probability Distributions

Some distributions appear again and again in Data Science:

  • Bernoulli / Binomial: binary outcomes and counts.
  • Normal (Gaussian): continuous, “bell‑shaped” data.
  • Poisson: counts over time (events per minute).
  • Exponential: time between events.
import numpy as np

np.random.seed(0)

# Normal distribution
normal_data = np.random.normal(loc=0, scale=1, size=1000)

# Binomial distribution
binom_data = np.random.binomial(n=10, p=0.3, size=1000)

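# Poisson distribution
poisson_data = np.random.poisson(lam=3, size=1000)

# Exponential distribution (scale is the mean waiting time, i.e. 1 / rate)
expon_data = np.random.exponential(scale=2.0, size=1000)

print("Normal mean/std:", round(normal_data.mean(), 3), round(normal_data.std(), 3))
print("Binomial mean:", round(binom_data.mean(), 3))
print("Poisson mean:", round(poisson_data.mean(), 3))
print("Exponential mean:", round(expon_data.mean(), 3))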
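
Sample statistics can be compared against the theoretical values of each distribution. The sketch below does this with frozen distributions from scipy.stats; the parameter values and the sample size are illustrative, and default_rng is used here instead of the legacy seeding shown above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
size = 100_000

# Pair each set of random draws with the matching scipy.stats distribution
samples = {
    "binomial": (rng.binomial(n=10, p=0.3, size=size), stats.binom(n=10, p=0.3)),
    "poisson": (rng.poisson(lam=3, size=size), stats.poisson(mu=3)),
    "exponential": (rng.exponential(scale=0.5, size=size), stats.expon(scale=0.5)),
}

for name, (draws, dist) in samples.items():
    print(f"{name:>11}: sample mean={draws.mean():.3f}, theory={dist.mean():.3f}, "
          f"sample var={draws.var():.3f}, theory={dist.var():.3f}")
```

For the binomial the theoretical mean is n·p, for the Poisson it is the rate, and for the exponential it is the scale.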

Conditional Probability & Bayes' Theorem

Conditional probability is the probability of an event given that another event has occurred: \(P(A \mid B)\). Bayes' theorem connects prior and posterior probabilities:

\[ P(A \mid B) = \frac{P(B \mid A) \; P(A)}{P(B)} \]

# Simple Bayes theorem example in code

P_disease = 0.01          # 1% have the disease (prior)
P_positive_given_disease = 0.99
P_positive_given_healthy = 0.05

P_healthy = 1 - P_disease

# Total probability of a positive test
P_positive = (P_positive_given_disease * P_disease +
              P_positive_given_healthy * P_healthy)

# Posterior: P(disease | positive)
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive

print("P(positive):", round(P_positive, 3))
print("P(disease | positive):", round(P_disease_given_positive, 3))
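
Bayes' theorem can be applied repeatedly, with each posterior becoming the prior for the next piece of evidence. The sketch below wraps the calculation in a hypothetical helper, posterior, and applies it twice with the same numbers as above. It assumes the two test results are independent given the person's true disease status, which may not hold for real tests.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    # Total probability of a positive test
    evidence = (p_pos_given_disease * prior +
                p_pos_given_healthy * (1 - prior))
    # Bayes' theorem: P(disease | positive)
    return p_pos_given_disease * prior / evidence

# First positive test, starting from the 1% prior
after_first = posterior(0.01, 0.99, 0.05)

# Second positive test: the previous posterior is the new prior
after_second = posterior(after_first, 0.99, 0.05)

print("After one positive test :", round(after_first, 3))
print("After two positive tests:", round(after_second, 3))
```

Even with a 99% sensitive test, one positive result leaves the probability of disease low because the disease is rare; a second positive result raises it substantially.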

Descriptive Statistics for Exploratory Data Analysis (EDA)

Measures of Central Tendency

Central tendency measures tell you where the “center” of the data lies.

  • Mean: arithmetic average, sensitive to outliers.
  • Median: middle value, robust to outliers.
  • Mode: most frequent value.
import numpy as np
from scipy import stats

data = np.array([10, 12, 13, 13, 14, 100])  # 100 is an outlier

mean = data.mean()
median = np.median(data)
mode = stats.mode(data, keepdims=True).mode[0]

print("Data:", data)
print("Mean  :", mean)
print("Median:", median)
print("Mode  :", mode)
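
To make the sensitivity to outliers concrete, the following sketch recomputes the mean and median after dropping the value 100 from the same data. The threshold used to drop it is hard-coded purely for illustration.

```python
import numpy as np

data = np.array([10, 12, 13, 13, 14, 100])  # 100 is an outlier
without_outlier = data[data < 100]          # illustrative hard-coded cut

mean_with, median_with = data.mean(), np.median(data)
mean_without, median_without = without_outlier.mean(), np.median(without_outlier)

print("With outlier   : mean =", mean_with, " median =", median_with)
print("Without outlier: mean =", mean_without, " median =", median_without)
```

The mean moves a long way when the outlier is removed, while the median stays the same.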

Measures of Spread (Dispersion)

Spread tells you how variable your data is. Two datasets can have the same mean with very different spreads.

  • Range: max − min.
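  • Variance & Standard Deviation: variance is the average squared deviation from the mean; the standard deviation is its square root, expressed in the same units as the data.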
  • Percentiles & IQR: robust spread measures (IQR = Q3 − Q1).
import numpy as np

data = np.array([10, 12, 13, 13, 14, 100])

data_min, data_max = data.min(), data.max()
data_range = data_max - data_min
variance = np.var(data, ddof=1)          # sample variance
std_dev = np.std(data, ddof=1)           # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print("Range      :", data_range)
print("Variance   :", round(variance, 2))
print("Std Dev    :", round(std_dev, 2))
print("Q1, Q3, IQR:", q1, q3, iqr)
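
One common use of the IQR is a rule of thumb for flagging outliers: values more than 1.5 × IQR below Q1 or above Q3. This rule is a convention and not part of the definitions above; the sketch below applies it to the same data, with lower, upper and outliers as illustrative names.

```python
import numpy as np

data = np.array([10, 12, 13, 13, 14, 100])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences of the 1.5 * IQR rule of thumb
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]

print("Fences  :", lower, upper)
print("Outliers:", outliers)
```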

Quick Summary with pandas.describe()

In real projects you rarely compute all statistics manually. Instead, you use pandas.DataFrame.describe() to get a quick overview.

import pandas as pd

df = pd.DataFrame({
    "age": [23, 25, 31, 40, 29, 37, 45],
    "salary": [35000, 42000, 50000, 70000, 48000, 65000, 90000]
})

print(df.describe())
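
The result of describe() is itself a DataFrame, so individual statistics can be picked out by label. As a brief sketch on the same data, the example below also passes the percentiles argument to request the 10th and 90th percentiles; the particular percentiles are an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 25, 31, 40, 29, 37, 45],
    "salary": [35000, 42000, 50000, 70000, 48000, 65000, 90000]
})

# Ask for custom percentiles instead of the default 25% / 50% / 75%
summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)

# Rows are statistics, columns are features
mean_age = summary.loc["mean", "age"]
median_salary = summary.loc["50%", "salary"]
print("Mean age     :", round(mean_age, 2))
print("Median salary:", median_salary)
```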

Inferential Statistics: From Sample to Population

Sampling & Confidence Intervals

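A confidence interval (CI) gives a range of plausible values for a population parameter (e.g., the true mean). It is computed from a sample, but the statement it makes is about the population: if the sampling were repeated many times, about 95% of the 95% intervals built this way would contain the true parameter.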

import numpy as np
from scipy import stats

np.random.seed(0)

# Suppose these are sample observations of a metric (e.g. session length)
sample = np.random.normal(loc=5.0, scale=1.0, size=100)

mean = sample.mean()
std_err = stats.sem(sample)   # standard error of the mean
confidence = 0.95

ci_low, ci_high = stats.t.interval(
    confidence,
    df=len(sample) - 1,
    loc=mean,
    scale=std_err
)

print("Sample mean:", round(mean, 3))
print("95% CI    :", (round(ci_low, 3), round(ci_high, 3)))
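
The "95%" refers to how often the procedure captures the true value over repeated sampling. The sketch below checks this by simulation: it draws many samples from a population with a known mean, builds a t-based interval for each, and counts how many contain the true mean. The number of trials, the sample size and the names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, n, trials = 5.0, 100, 2000   # illustrative values

# One row per simulated sample
samples = rng.normal(loc=true_mean, scale=1.0, size=(trials, n))
means = samples.mean(axis=1)
std_errs = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Two-sided 95% critical value of the t distribution
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low = means - t_crit * std_errs
ci_high = means + t_crit * std_errs

# Fraction of intervals that contain the true mean
coverage = np.mean((ci_low <= true_mean) & (true_mean <= ci_high))
print("Fraction of intervals containing the true mean:", coverage)
```

The fraction should land close to 0.95, with some run-to-run variation from the simulation itself.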

Hypothesis Testing & p‑values

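In hypothesis testing we start with a null hypothesis \(H_0\) (no effect), and an alternative \(H_1\) (there is an effect). We compute a test statistic and its p‑value, the probability of seeing a statistic at least this extreme if \(H_0\) were true, and reject \(H_0\) when the p‑value falls below a chosen significance level \(\alpha\).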

from scipy import stats
import numpy as np

np.random.seed(1)

# Example: one-sample t-test
# H0: true mean = 0, H1: true mean ≠ 0
sample = np.random.normal(loc=0.5, scale=1.0, size=50)

t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

print("t statistic:", round(t_stat, 3))
print("p value    :", round(p_value, 4))

alpha = 0.05
if p_value < alpha:
    print("Reject H0 at 5% level")
else:
    print("Fail to reject H0 at 5% level")
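
To see where the numbers come from, the t statistic and p‑value can be computed by hand and compared with SciPy's result. This is a sketch under the same setup as above (one sample, two-sided test against a mean of 0); it uses default_rng, so the sample differs from the one generated by the earlier code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.5, scale=1.0, size=50)
mu0 = 0.0                      # mean under H0
n = sample.size

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: probability of |T| >= |t| under H0
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu0)

print("manual:", round(t_manual, 3), round(p_manual, 4))
print("scipy :", round(t_scipy, 3), round(p_scipy, 4))
```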