Core Machine Learning Concepts & Terminology
Before diving into algorithms, it’s critical to understand the common language of Machine Learning — datasets, features, labels, loss, overfitting, generalization and more.
Datasets, Features & Labels
A typical ML dataset can be represented as a matrix \(X\) (rows = samples, columns = features) and a target vector \(y\) (labels).
- Sample / Instance: a single row of data (e.g., one customer, one transaction).
- Feature: an input variable (age, income, number of clicks).
- Label / Target: what we want to predict (price, churn yes/no).
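As a minimal sketch of this layout, the snippet below stores a toy dataset as a list of rows. The feature names and all values are invented for illustration, and plain Python lists stand in for whatever array or dataframe library you would normally use.

```python
# Toy tabular dataset: each row of X is one sample, each column one feature.
# All values are made up for illustration only.
feature_names = ["age", "income", "num_clicks"]

X = [
    [25, 40_000, 3],   # sample 0
    [47, 82_000, 12],  # sample 1
    [33, 55_000, 7],   # sample 2
    [52, 61_000, 1],   # sample 3
]
y = [0, 1, 0, 1]  # labels: churn no (0) / yes (1), one per sample

n_samples = len(X)
n_features = len(X[0])

# Every sample must have the same number of features, and exactly one label.
assert all(len(row) == n_features for row in X)
assert len(y) == n_samples

# A single feature is one column of X.
income = [row[feature_names.index("income")] for row in X]

print(n_samples, n_features)  # 4 3
print(income)                 # [40000, 82000, 55000, 61000]
```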
Training, Validation & Test Sets
We split the dataset to estimate how well our model will perform on unseen data:
- Training set: used to fit the model parameters.
- Validation set: used for model selection and hyperparameter tuning.
- Test set: used once at the end to report final performance.
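Rule of thumb: never use your test data to make modeling decisions. Doing so leaks information from the test set into the model, so the reported metrics become optimistically biased and no longer estimate performance on unseen data.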
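One way to produce the three sets is to shuffle the row indices once and slice them. The sketch below assumes a 70/15/15 split and a hypothetical helper named `train_val_test_split`; neither the ratios nor the name come from a specific library.

```python
import random

def train_val_test_split(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Return three disjoint lists of row indices: train, validation, test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly

    n_test = round(n_samples * test_frac)
    n_val = round(n_samples * val_frac)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]  # everything that is left
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15
```

Splitting indices rather than the data itself keeps each row of \(X\) paired with its label in \(y\). A random shuffle like this assumes the samples are independent. For time-ordered data, a chronological split is usually the safer choice.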
Overfitting vs Underfitting
Models must balance fit and simplicity:
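- Underfitting: the model is too simple to capture the pattern (high bias); error is high on both the training and validation sets.
- Overfitting: the model memorizes noise in the training data (high variance); training error is low but validation error is noticeably higher.
- Just right: low training error and low validation error.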
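The three regimes can be illustrated by comparing training and validation error for models of different flexibility. The sketch below makes several assumptions: synthetic data from a noisy line, a constant "predict the mean" model as the too-simple case, a least-squares line as the matched case, and a 1-nearest-neighbour predictor as the too-flexible case. The model choices are illustrative and the exact numbers depend on the random seed.

```python
import random

rng = random.Random(0)

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise.
xs = [i / 200 for i in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 0.3) for x in xs]

# Even-indexed points train the models, odd-indexed points validate them.
x_train, y_train = xs[0::2], ys[0::2]
x_val, y_val = xs[1::2], ys[1::2]

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Too simple: always predict the training mean, ignoring x.
mean_y = sum(y_train) / len(y_train)
def predict_mean(x):
    return mean_y

# Matched to the data: least-squares line y = a*x + b.
mean_x = sum(x_train) / len(x_train)
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
     / sum((x - mean_x) ** 2 for x in x_train))
b = mean_y - a * mean_x
def predict_line(x):
    return a * x + b

# Too flexible: 1-nearest-neighbour copies the label of the closest
# training point, noise included.
def predict_1nn(x):
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]

results = {}
for name, predict in [("mean", predict_mean),
                      ("line", predict_line),
                      ("1-NN", predict_1nn)]:
    train_err = mse(y_train, [predict(x) for x in x_train])
    val_err = mse(y_val, [predict(x) for x in x_val])
    results[name] = (train_err, val_err)
    print(f"{name:>5}: train MSE = {train_err:.3f}, validation MSE = {val_err:.3f}")
```

The 1-nearest-neighbour model reaches a training error of exactly zero, because every training point is its own nearest neighbour. Its validation error cannot match that, which is the gap that signals overfitting. The constant model cannot follow the trend on either set, which is underfitting.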
Loss Functions & Evaluation Metrics
During training, we minimize a loss function. After training, we report metrics that are easier to interpret.
- Regression: MSE, RMSE, MAE, \(R^2\).
- Classification: Accuracy, Precision, Recall, F1‑Score, ROC‑AUC.
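To make the metric names concrete, here is a minimal sketch that computes each of them from their standard definitions on small made-up inputs. It handles binary classification only, with the positive class encoded as 1. The function names are illustrative, not a library API. ROC-AUC is computed from predicted scores rather than hard labels, as the fraction of (positive, negative) pairs that the scores rank correctly.

```python
import math

# --- Regression metrics ---
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# --- Classification metrics (binary, positive class = 1) ---
def classification_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up regression example.
y_true_reg = [3.0, 5.0, 7.0, 9.0]
y_pred_reg = [2.0, 5.0, 8.0, 9.0]
print(f"MSE={mse(y_true_reg, y_pred_reg):.3f} "
      f"RMSE={rmse(y_true_reg, y_pred_reg):.3f} "
      f"MAE={mae(y_true_reg, y_pred_reg):.3f} "
      f"R2={r2(y_true_reg, y_pred_reg):.3f}")
# MSE=0.500 RMSE=0.707 MAE=0.500 R2=0.900

# Made-up classification example: scores thresholded at 0.5.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.55]
preds = [1 if s >= 0.5 else 0 for s in scores]
report = classification_metrics(labels, preds)
auc = roc_auc(labels, scores)
print({k: round(v, 3) for k, v in report.items()}, round(auc, 3))
# {'accuracy': 0.625, 'precision': 0.6, 'recall': 0.75, 'f1': 0.667} 0.875
```

In the classification example, accuracy, precision, and recall all differ on the same predictions, which is why a single number rarely tells the whole story.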