Machine Learning Key Concepts
Must-know Terms

Core Machine Learning Concepts & Terminology

Before diving into algorithms, it helps to learn the shared vocabulary of machine learning: datasets, features, labels, loss, overfitting, generalization, and more.

Datasets, Features & Labels

A typical tabular ML dataset can be represented as a matrix \(X\) with \(n\) rows (samples) and \(d\) columns (features), together with a target vector \(y\) of length \(n\) that holds one label per sample.

  • Sample / Instance: a single row of data (e.g., one customer, one transaction).
  • Feature: an input variable (age, income, number of clicks).
  • Label / Target: what we want to predict (price, churn yes/no).
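
To make the terms concrete, here is a minimal sketch of a tiny tabular dataset in plain Python. The feature names and values are invented for illustration and do not come from any real dataset.

```python
# Hypothetical toy data: each row of X describes one customer (a sample).
feature_names = ["age", "income", "num_clicks"]

# X: feature matrix, one row per sample and one column per feature.
X = [
    [34, 52000.0, 12],
    [45, 61000.0, 3],
    [23, 38000.0, 27],
    [51, 75000.0, 8],
]

# y: target vector, one label per sample (1 = churned, 0 = stayed).
y = [0, 1, 0, 1]

n_samples = len(X)
n_features = len(X[0])

# Every sample needs exactly one label and the same number of features.
assert len(y) == n_samples
assert all(len(row) == n_features for row in X)

print(f"X has shape ({n_samples}, {n_features}); y has length {len(y)}")

# A single sample paired with its label.
first_sample = dict(zip(feature_names, X[0]))
print(first_sample, "->", y[0])
```

In practice the same layout is usually held in an array or data-frame library, but the row = sample, column = feature convention stays the same.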

Training, Validation & Test Sets

We split the dataset to estimate how well our model will perform on unseen data:

  • Training set: used to fit the model parameters.
  • Validation set: used for model selection and hyperparameter tuning.
  • Test set: used once at the end to report final performance.

Rule of thumb: never use your test data to make modeling decisions. Doing so leaks information about the test set into the model, which makes the reported metrics optimistic and unreliable.
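
The three-way split can be sketched as follows. This is one simple way to do it, assuming the samples are independent so that a random shuffle is appropriate. The function name `train_val_test_split` and the 70/15/15 proportions are illustrative choices, not a standard.

```python
import random


def train_val_test_split(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle indices once, then cut them into three disjoint parts."""
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be between 0 and 1")

    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible

    n_test = round(len(samples) * test_fraction)
    n_val = round(len(samples) * val_fraction)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    train = [samples[i] for i in train_idx]
    val = [samples[i] for i in val_idx]
    test = [samples[i] for i in test_idx]
    return train, val, test


data = list(range(100))  # stand-in for 100 (features, label) pairs
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 70 15 15
```

Because the indices are shuffled once and then sliced, no sample can land in more than one set. Time-ordered or grouped data usually needs a different splitting strategy than a plain shuffle.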

Overfitting vs Underfitting

Models must balance fit and simplicity:

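  • Underfitting: the model is too simple to capture the pattern (high bias); both training and validation error are high.
  • Overfitting: the model memorizes noise in the training data (high variance); training error is low but validation error is much higher.
  • Just right: low training error and a validation error close to it.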
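
The three regimes above can be illustrated with a small experiment. The sketch below uses a k-nearest-neighbours regressor on synthetic data, chosen only because it is easy to write without extra libraries. The data-generating function, the noise level, and the values of k are all assumptions made for the demonstration.

```python
import math
import random


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-NN predictor on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(0)


def make_data(n):
    """Noisy samples of sin(x) on [0, 2*pi]; the noise level is arbitrary."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


train_x, train_y = make_data(40)
val_x, val_y = make_data(40)

# k = 1 is the most flexible model, k = len(train_x) the least flexible.
for k in (1, 5, len(train_x)):
    train_err = mse(train_x, train_y, train_x, train_y, k)
    val_err = mse(val_x, val_y, train_x, train_y, k)
    print(f"k={k:2d}  train MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```

With k = 1 each training point is its own nearest neighbour, so the training error is zero while the validation error is not: the overfitting pattern. With k equal to the training set size the model predicts the training mean everywhere, so both errors are high: the underfitting pattern. An intermediate k would normally sit between the two, though the exact numbers depend on the seed and noise level.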

Loss Functions & Evaluation Metrics

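During training, we minimize a loss function, typically one chosen because it is easy to optimize. After training, we report evaluation metrics, which are easier to interpret and need not be the same quantity as the loss.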

  • Regression: MSE, RMSE, MAE, \(R^2\).
  • Classification: Accuracy, Precision, Recall, F1‑Score, ROC‑AUC.
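
Most of the metrics listed above follow directly from their definitions. The sketch below computes them in plain Python on invented example values. ROC-AUC is left out because it needs predicted scores rather than hard labels. Returning 0.0 when a denominator is zero is a convention assumed here, not a universal rule.

```python
import math


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R^2 for two equal-length lists of numbers."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(r) for r in residuals) / n,
        "R2": 1 - ss_res / ss_tot,  # assumes y_true is not constant
    }


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {
        "Accuracy": (tp + tn) / len(pairs),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }


reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
clf = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 0, 0, 1, 0, 0, 1])
print({name: round(value, 3) for name, value in reg.items()})
print({name: round(value, 3) for name, value in clf.items()})
```

On these toy inputs the regression example gives an MSE of 0.5 and an \(R^2\) of 0.9. The classification example has 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3 and recall is 1/2.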