Train-Test Split Interview Q&A

1. Why split data into train and test sets?

Answer: To estimate how well the model generalizes to unseen data.

2. What is a validation set?

Answer: A held-out subset used for model tuning (hyperparameters, model selection), kept separate from both the training set and the final test set.

3. What are typical split ratios?

Answer: Commonly 80/20 (train/test) or 70/15/15 (train/validation/test), depending on data size.
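
A minimal sketch of a 70/15/15 split using only the Python standard library. The helper name split_three_way and the toy data are assumptions made for illustration; libraries such as scikit-learn offer their own splitting utilities.

```python
import random


def split_three_way(items, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a copy of items and cut it into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # round() avoids off-by-one sizes caused by floating-point products.
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


data = list(range(100))  # toy dataset of 100 samples
train, val, test = split_three_way(data)
print(len(train), len(val), len(test))
```

Every sample lands in exactly one of the three lists, so the subsets never overlap.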

4. What is a stratified split?

Answer: A split that preserves the target class distribution across the train and test sets.
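
One way to sketch stratification with the standard library: group sample indices by class, then split each group by the same fraction. The function name stratified_split and the 90/10 toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx), splitting every class by the same fraction."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = ["neg"] * 90 + ["pos"] * 10  # toy imbalanced labels (assumption)
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both subsets keep the original 9:1 class ratio, which a purely random split does not guarantee.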

5. Why does the random seed matter?

Answer: Fixing it makes the data partitions, and therefore the results, reproducible.
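
A small illustration of the point, assuming Python's random module as the source of randomness: two shuffles with the same seed produce the same ordering, so any split derived from them is identical. The helper shuffled_indices is hypothetical.

```python
import random


def shuffled_indices(n, seed):
    """Return the indices 0..n-1 in an order determined entirely by seed."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return indices


first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)
print(first == second)  # same seed, same ordering
```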

6. What is data leakage in splitting?

Answer: Information from the test set influencing training decisions, for example fitting a scaler on the full dataset before splitting.
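
A common way to avoid this kind of leakage is to compute preprocessing statistics on the training set only and reuse them on the test set. The sketch below assumes simple standardization; fit_scaler, transform, and the toy values are illustrative names and data, not a specific library's API.

```python
import statistics


def fit_scaler(values):
    """Compute mean and population standard deviation from training data."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values using previously fitted statistics."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy training feature (assumption)
test = [10.0, 12.0]                # toy test feature (assumption)

# Fit on the training rows only; the test rows never influence mean or std.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)
print(mean, std)
```

Fitting the scaler on train and test together would shift the mean and standard deviation toward the test values, which is exactly the leakage described above.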

7. When should you use a time-based split?

Answer: For temporal data, to respect chronology and avoid look-ahead bias.
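
A minimal sketch of a chronological split, assuming each record is a (date, value) tuple and a hypothetical cutoff date. Nothing is shuffled: everything before the cutoff is training data, everything on or after it is test data.

```python
from datetime import date


def time_based_split(records, cutoff):
    """Rows dated strictly before cutoff go to train; the rest go to test."""
    ordered = sorted(records, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test


# Toy records in arbitrary order (assumption).
records = [
    (date(2024, 3, 1), 13.0),
    (date(2024, 1, 1), 10.0),
    (date(2024, 4, 1), 15.0),
    (date(2024, 2, 1), 12.0),
]
train, test = time_based_split(records, cutoff=date(2024, 3, 1))
print([r[0].isoformat() for r in train])
print([r[0].isoformat() for r in test])
```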

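8. What is k-fold cross-validation?

Answer: The data is divided into k folds; the model is trained on k-1 folds and validated on the remaining one, rotating until every fold has served as the validation set. Averaging the k scores gives a more reliable performance estimate than a single split.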
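
The fold rotation can be sketched as an index generator. This version does not shuffle and is not stratified, both simplifying assumptions; k_fold_indices is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

In practice a model would be fitted on train_idx and scored on val_idx in each iteration, and the k scores averaged.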

9. Can the test set be used for tuning?

Answer: No, the test set should be reserved for the final unbiased evaluation.

10. How do you handle imbalanced data during a split?

Answer: Use stratification and evaluate with metrics suited to imbalance (F1, PR-AUC) rather than accuracy alone.
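
To see why accuracy alone can mislead on imbalanced data, consider a hypothetical classifier that always predicts the majority class. The sketch below computes precision, recall, and F1 by hand; the toy labels and the function name precision_recall_f1 are assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1] * 5 + [0] * 95  # 5% positives (toy data)
y_pred = [0] * 100           # always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Accuracy looks high because the negatives dominate, while F1 is zero because no positive sample is ever found.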

11. What if the dataset is very small?

Answer: Prefer cross-validation, which gives a more stable estimate than a single small hold-out set, and simpler models, which are less likely to overfit.

12. One-line train/test summary?

Answer: Proper splitting is essential for trustworthy model evaluation.
