Train/Test Split in Machine Learning

Learn why we split data into training and test sets, common split ratios, and how to do it in Python with scikit-learn.

Why Split Data?

We want to know how well a model performs on new, unseen data. If we evaluate it on the same data it was trained on, the results are overly optimistic, because the model can score well simply by memorizing those examples (overfitting). The solution is to split the data into:

Training set: used to learn (fit) the model.
Test set: used only for final evaluation.
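
To make the optimism concrete, here is a minimal sketch that scores one model on both sets. The synthetic data, the noise level, and the choice of an unpruned decision tree (picked because it memorizes training data easily) are assumptions for illustration only and do not come from this tutorial's example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration): a linear trend plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 5.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unpruned tree can fit every training point, noise included
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

# score() returns R^2 for regressors (higher is better, 1.0 is a perfect fit)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print("R^2 on training data:", train_r2)
print("R^2 on test data:", test_r2)
```

The training score looks near-perfect because the tree has seen those exact points; the test score is the more honest estimate of performance on new data.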

Rule: Never train the model on the test set. Use it only once for final evaluation.

Common Split Ratios

80% train / 20% test
70% train / 30% test
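
As a quick sketch of how these ratios map to the `test_size` argument of `train_test_split`, the snippet below splits a placeholder dataset of 100 samples (an assumption chosen so the percentages are easy to read) and records the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumed): 100 samples with one feature each
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

sizes = {}
for test_size in (0.2, 0.3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: {len(X_train)} train / {len(X_test)} test")
```

When the sample count does not divide evenly, scikit-learn rounds the test count up and gives the remainder to training.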

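A single split on a small dataset leaves only a handful of test samples, so the error estimate is noisy. In that case, cross-validation usually gives a more reliable estimate; a separate validation set is also common when hyperparameters need tuning.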
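
A minimal sketch of cross-validation with scikit-learn's `KFold` and `cross_val_score` follows. The 20-sample synthetic dataset and the choice of 5 folds are assumptions for illustration; each sample is used for testing exactly once, and the per-fold errors are averaged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Small synthetic dataset (assumed): 20 samples, one feature
rng = np.random.default_rng(0)
X = rng.uniform(500, 2000, size=(20, 1))
y = 0.2 * X[:, 0] + rng.normal(0, 10.0, size=20)

# 5 folds: each fold holds out 4 samples for evaluation
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negated MSE so that "higher is better" holds for all scorers
scores = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error"
)
mse_per_fold = -scores

print("MSE per fold:", mse_per_fold)
print("Mean MSE:", mse_per_fold.mean())
```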

Example: Train/Test Split with scikit-learn

Linear Regression with Train/Test Split

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Example data: house size (X) and price (y)
X = np.array([[500], [750], [1000], [1250], [1500], [1750], [2000]])
y = np.array([100, 150, 200, 250, 300, 320, 350])

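# Split data into train and test sets
# test_size=0.3 requests 30% of the data for testing; with 7 samples,
# scikit-learn rounds the test count up to 3, leaving 4 for training
# random_state fixes the shuffle so the split is reproducible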
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("Train size:", len(X_train))
print("Test size:", len(X_test))

# Create and train the model on training data only
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test data (unseen during training)
y_pred = model.predict(X_test)

# Calculate error (lower is better)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error on test set:", mse)

Important Guidelines

Do not let information from the test set leak into training.
Use the training set (or a validation split carved out of it) to fit the model and tune hyperparameters.
Use the test set only once, at the end, to estimate real-world performance.
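
One common way leakage happens is through preprocessing that is fitted on the full dataset before splitting. The sketch below shows one possible way to avoid it, assuming a `StandardScaler` step (not part of this tutorial's example) wrapped in a scikit-learn pipeline so that its statistics are computed from the training data only. The synthetic data is also an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(500, 2000, size=(50, 1))
y = 0.2 * X[:, 0] + rng.normal(0, 10.0, size=50)

# Split first, before any preprocessing is fitted
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler and the model on the training data only
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# The scaler's mean matches the training set, not the full dataset
scaler = pipeline.named_steps["standardscaler"]
print("Scaler mean:", scaler.mean_)
print("Training-set mean:", X_train.mean(axis=0))

# At prediction time the test data is transformed with training statistics
print("R^2 on test set:", pipeline.score(X_test, y_test))
```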

Train / Validation / Test Split

For more advanced workflows, we split the dataset into three parts:

Train: learn model parameters.
Validation: tune hyperparameters and select the best model.
Test: final evaluation after all decisions are made.
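
A three-way split can be sketched by calling `train_test_split` twice: once to hold out the test set, and once more to carve a validation set out of what remains. The 60/20/20 proportions and the placeholder data are assumptions for illustration, not a recommendation from this tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumed): 100 samples with one feature each
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Step 1: hold out 20% of all data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: take 25% of the remaining 80% as validation (20% of the total)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print("Train size:", len(X_train))
print("Validation size:", len(X_val))
print("Test size:", len(X_test))
```

Hyperparameters are compared on `X_val`; `X_test` stays untouched until every modeling decision has been made.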
