Data Preprocessing Q&A
20 Core Questions
Interview Prep
ML Data Preprocessing: Interview Q&A
Short questions and answers on cleaning, transforming and preparing data before feeding it into machine learning models.
Missing Values
Scaling
Encoding
Outliers
1
What is data preprocessing in machine learning?
⚡ Beginner
Answer: Data preprocessing is the set of steps you take to clean, transform and organize raw data so that it is suitable and reliable for training machine learning models.
2
How can you handle missing values?
⚡ Beginner
Answer: Common strategies are to drop rows/columns with many missing values or impute them with simple statistics (mean/median/mode) or model-based estimates.
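A minimal imputation sketch using scikit-learn's SimpleImputer (the data here is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [7.0, 6.0]])

# Replace NaNs with the column mean; "median" and "most_frequent" also work
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
# The NaN in column 0 becomes the column mean: (1 + 7) / 2 = 4.0
print(X_filled)
```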
3
Why is feature scaling important?
⚡ Beginner
Answer: Scaling ensures that all numeric features are on a similar range so that distance-based algorithms and gradient descent behave properly and no feature dominates because of its units.
4
What is the difference between normalization and standardization?
📊 Intermediate
Answer: Normalization usually rescales data to a fixed range (e.g., [0,1]), while standardization transforms data to have mean 0 and standard deviation 1.
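A small sketch contrasting the two with scikit-learn's MinMaxScaler and StandardScaler (toy data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

X_norm = MinMaxScaler().fit_transform(X)  # normalization: rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)  # standardization: mean 0, std 1

print(X_norm.min(), X_norm.max())  # 0.0 1.0
print(round(X_std.mean(), 6))      # 0.0
```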
5
How do you encode categorical variables?
⚡ Beginner
Answer: You can use one‑hot encoding for nominal categories or ordinal/integer encoding when the categories have an inherent order.
6
What is one‑hot encoding?
⚡ Beginner
Answer: One‑hot encoding converts each category into a binary vector with a 1 in the position of that category and 0s elsewhere, so models can work with categorical data numerically.
7
How do you detect outliers in data?
📊 Intermediate
Answer: You can use visualizations (box plots, scatter plots), simple rules (values beyond 3 standard deviations or IQR rules) or dedicated anomaly detection algorithms (Isolation Forest, LOF).
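The IQR rule mentioned above can be sketched in a few lines with NumPy only (toy data):

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 looks suspicious

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # [95]
```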
8
What is data leakage during preprocessing?
📊 Intermediate
Answer: Leakage happens when information from the validation or test set influences preprocessing (e.g., scaling or imputing using statistics computed on the whole dataset instead of training data only).
9
How do you avoid data leakage in preprocessing?
📊 Intermediate
Answer: Always fit preprocessing steps only on the training set (e.g., scalers, imputers) and then apply the learned transformations to validation/test data.
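A leakage-free sketch of this pattern: the scaler is fit on the training split only and merely applied to the test split (data is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)     # statistics from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # no .fit() here -> no leakage
```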
10
What is a data pipeline in sklearn?
📊 Intermediate
Answer: A pipeline chains preprocessing steps and the estimator into a single object so you can fit and predict with the whole workflow consistently and safely inside cross‑validation.
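A minimal Pipeline sketch chaining imputation, scaling, and a classifier (the data and model choice are illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# fit/predict run every step in order, so preprocessing stays consistent
pipe.fit(X, y)
print(pipe.predict([[6.0]]))
```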
11
Why is it useful to handle text and numeric features separately?
📊 Intermediate
Answer: Text and numeric data often require different preprocessing (e.g., TF‑IDF vs scaling), so separating them ensures appropriate transformations for each type.
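One common way to do this is scikit-learn's ColumnTransformer, which routes each column to its own preprocessing (the DataFrame below is a toy example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "review": ["great product", "terrible quality", "great value"],
    "price": [10.0, 25.0, 5.0],
})

pre = ColumnTransformer([
    ("text", TfidfVectorizer(), "review"),  # text column -> TF-IDF features
    ("num", StandardScaler(), ["price"]),   # numeric column -> standardized
])

features = pre.fit_transform(df)
print(features.shape)  # 5 TF-IDF columns + 1 scaled numeric column
```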
12
What is feature hashing?
🔥 Advanced
Answer: Feature hashing maps high‑cardinality categorical features into a fixed‑size numeric vector using a hash function, reducing memory at the cost of possible collisions.
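A sketch with scikit-learn's FeatureHasher (category names are made up; note the output width is fixed regardless of cardinality):

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of string features; n_features fixes the output size
hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform([["user_12345"], ["user_67890"], ["city_london"]])
print(X.shape)  # (3, 8) no matter how many distinct categories exist
```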
13
When would you log‑transform a feature?
📊 Intermediate
Answer: Log transforms are useful for positively skewed, strictly positive features (e.g., income) to compress large values and make distributions more symmetric.
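A quick sketch with NumPy's `log1p`, which computes log(1 + x) and is therefore safe at zero (the income values are illustrative):

```python
import numpy as np

income = np.array([20_000, 35_000, 50_000, 1_000_000], dtype=float)

# The extreme value is compressed far more than the typical ones
log_income = np.log1p(income)
print(log_income.round(2))
```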
14
What is the purpose of shuffling data before splitting?
⚡ Beginner
Answer: Shuffling prevents ordering biases (like time or grouping effects) from causing the train/test split to be unrepresentative of the overall data distribution.
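A small sketch: scikit-learn's train_test_split shuffles by default, and a fixed random_state keeps the split reproducible (data is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)  # ordered data, e.g. sorted by time

# shuffle=True (the default) mixes rows before splitting
X_train, X_test = train_test_split(X, test_size=0.3, shuffle=True, random_state=42)
print(X_test.ravel())
```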
15
How do you handle highly correlated features?
📊 Intermediate
Answer: You can drop one of the correlated features, combine them (e.g., via PCA) or use regularized models that are less sensitive to multicollinearity.
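The "drop one of a correlated pair" strategy can be sketched with pandas (the columns and the 0.95 threshold are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # near-duplicate of height_cm
    "weight_kg": [50, 80, 60, 90],
})

# Keep only the upper triangle so each pair is checked once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)  # ['height_in'] under this toy data
```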
16
Why is target encoding risky?
🔥 Advanced
Answer: Target encoding replaces categories with statistics of the target, which can leak label information if not done carefully (e.g., using cross‑validation schemes), leading to overfitting.
17
What are dummy variable traps?
📊 Intermediate
Answer: The dummy variable trap occurs when one‑hot encoded features are perfectly collinear (e.g., including all categories), which can cause problems for linear models; typically one category is dropped.
18
How do you treat unseen categories at prediction time?
📊 Intermediate
Answer: You can map unseen categories to a special "unknown" bucket, ignore them, or use encoders that support unknown handling (e.g., sklearn's OneHotEncoder(handle_unknown='ignore')).
19
Why should preprocessing be part of the deployed model?
📊 Intermediate
Answer: Bundling preprocessing with the model avoids train/serve skew; production data is transformed in exactly the same way as during training.
20
Name three common preprocessing mistakes.
📊 Intermediate
Answer: Examples: (1) Fitting scalers on full data instead of training only, (2) Dropping rows with missing values blindly, and (3) Not encoding categorical variables properly before using algorithms that only accept numbers.
Quick Recap: Data Preprocessing
Clean, correctly transformed data is the foundation of every successful ML system. Always think carefully about what information each preprocessing step uses and how it will behave in production.