Data Preprocessing Q&A
20 Core Questions
Interview Prep
ML Data Preprocessing: Interview Q&A
Short questions and answers on cleaning, transforming and preparing data before feeding it into machine learning models.
Missing Values
Scaling
Encoding
Outliers
1
What is data preprocessing in machine learning?
⚡ Beginner
Answer: Data preprocessing is the set of steps you take to clean, transform and organize raw data so that it is suitable and reliable for training machine learning models.
2
How can you handle missing values?
⚡ Beginner
Answer: Common strategies are to drop rows/columns with many missing values or impute them with simple statistics (mean/median/mode) or model-based estimates.
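A minimal imputation sketch using scikit-learn's SimpleImputer (the data here is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [7.0, 6.0]])

# Replace NaNs with the column mean; "median" and "most_frequent" also work
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
# The NaN in column 0 becomes the column mean: (1 + 7) / 2 = 4.0
print(X_filled)
```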
3
Why is feature scaling important?
⚡ Beginner
Answer: Scaling ensures that all numeric features are on a similar range so that distance-based algorithms and gradient descent behave properly and no feature dominates because of its units.
4
What is the difference between normalization and standardization?
📊 Intermediate
Answer: Normalization usually rescales data to a fixed range (e.g., [0,1]), while standardization transforms data to have mean 0 and standard deviation 1.
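A small sketch contrasting the two with scikit-learn's MinMaxScaler and StandardScaler (toy data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

X_norm = MinMaxScaler().fit_transform(X)  # normalization: rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)  # standardization: mean 0, std 1

print(X_norm.min(), X_norm.max())  # 0.0 1.0
print(round(X_std.mean(), 6))      # 0.0
```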
5
How do you encode categorical variables?
⚡ Beginner
Answer: You can use one‑hot encoding for nominal categories or ordinal/integer encoding when the categories have an inherent order.
6
What is one‑hot encoding?
⚡ Beginner
Answer: One‑hot encoding converts each category into a binary vector with a 1 in the position of that category and 0s elsewhere, so models can work with categorical data numerically.
7
How do you detect outliers in data?
📊 Intermediate
Answer: You can use visualizations (box plots, scatter plots), simple rules (values beyond 3 standard deviations or IQR rules) or dedicated anomaly detection algorithms (Isolation Forest, LOF).
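The IQR rule mentioned above can be sketched in a few lines with NumPy only (toy data):

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 looks suspicious

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # [95]
```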
8
What is data leakage during preprocessing?
📊 Intermediate
Answer: Leakage happens when information from the validation or test set influences preprocessing (e.g., scaling or imputing using statistics computed on the whole dataset instead of training data only).
9
How do you avoid data leakage in preprocessing?
📊 Intermediate
Answer: Always fit preprocessing steps only on the training set (e.g., scalers, imputers) and then apply the learned transformations to validation/test data.
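A leakage-free sketch of this pattern: the scaler is fit on the training split only and merely applied to the test split (data is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)     # statistics from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # no .fit() here -> no leakage
```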
10
What is a data pipeline in sklearn?
📊 Intermediate
Answer: A pipeline chains preprocessing steps and the estimator into a single object so you can fit and predict with the whole workflow consistently and safely inside cross‑validation.
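A minimal Pipeline sketch chaining imputation, scaling, and a classifier (the data and model choice are illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# fit/predict run every step in order, so preprocessing stays consistent
pipe.fit(X, y)
print(pipe.predict([[6.0]]))
```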
11
Why is it useful to handle text and numeric features separately?
📊 Intermediate
Answer: Text and numeric data often require different preprocessing (e.g., TF‑IDF vs scaling), so separating them ensures appropriate transformations for each type.
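One common way to do this is scikit-learn's ColumnTransformer, which routes each column to its own preprocessing (the DataFrame below is a toy example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "review": ["great product", "terrible quality", "great value"],
    "price": [10.0, 25.0, 5.0],
})

pre = ColumnTransformer([
    ("text", TfidfVectorizer(), "review"),  # text column -> TF-IDF features
    ("num", StandardScaler(), ["price"]),   # numeric column -> standardized
])

features = pre.fit_transform(df)
print(features.shape)  # 5 TF-IDF columns + 1 scaled numeric column
```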
12
What is feature hashing?
🔥 Advanced
Answer: Feature hashing maps high‑cardinality categorical features into a fixed‑size numeric vector using a hash function, reducing memory at the cost of possible collisions.
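A sketch with scikit-learn's FeatureHasher (category names are made up; note the output width is fixed regardless of cardinality):

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of string features; n_features fixes the output size
hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform([["user_12345"], ["user_67890"], ["city_london"]])
print(X.shape)  # (3, 8) no matter how many distinct categories exist
```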
13
When would you log‑transform a feature?
📊 Intermediate
Answer: Log transforms are useful for positively skewed, strictly positive features (e.g., income) to compress large values and make distributions more symmetric.
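A quick sketch with NumPy's `log1p`, which computes log(1 + x) and is therefore safe at zero (the income values are illustrative):

```python
import numpy as np

income = np.array([20_000, 35_000, 50_000, 1_000_000], dtype=float)

# The extreme value is compressed far more than the typical ones
log_income = np.log1p(income)
print(log_income.round(2))
```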
14
What is the purpose of shuffling data before splitting?
⚡ Beginner
Answer: Shuffling prevents ordering biases (like time or grouping effects) from causing the train/test split to be unrepresentative of the overall data distribution.
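A small sketch: scikit-learn's train_test_split shuffles by default, and a fixed random_state keeps the split reproducible (data is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)  # ordered data, e.g. sorted by time

# shuffle=True (the default) mixes rows before splitting
X_train, X_test = train_test_split(X, test_size=0.3, shuffle=True, random_state=42)
print(X_test.ravel())
```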
15
How do you handle highly correlated features?
📊 Intermediate
Answer: You can drop one of the correlated features, combine them (e.g., via PCA) or use regularized models that are less sensitive to multicollinearity.
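The "drop one of a correlated pair" strategy can be sketched with pandas (the columns and the 0.95 threshold are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # near-duplicate of height_cm
    "weight_kg": [50, 80, 60, 90],
})

# Keep only the upper triangle so each pair is checked once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)  # ['height_in'] under this toy data
```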
16
Why is target encoding risky?
🔥 Advanced
Answer: Target encoding replaces categories with statistics of the target, which can leak label information if not done carefully (e.g., using cross‑validation schemes), leading to overfitting.
17
What are dummy variable traps?
📊 Intermediate
Answer: The dummy variable trap occurs when one‑hot encoded features are perfectly collinear (e.g., including all categories), which can cause problems for linear models; typically one category is dropped.
18
How do you treat unseen categories at prediction time?
📊 Intermediate
Answer: You can map unseen categories to a special "unknown" bucket, ignore them, or use encoders that support unknown handling (e.g., sklearn's OneHotEncoder(handle_unknown='ignore')).
19
Why should preprocessing be part of the deployed model?
📊 Intermediate
Answer: Bundling preprocessing with the model avoids train/serve skew; production data is transformed in exactly the same way as during training.
20
Name three common preprocessing mistakes.
📊 Intermediate
Answer: Examples: (1) Fitting scalers on full data instead of training only, (2) Dropping rows with missing values blindly, and (3) Not encoding categorical variables properly before using algorithms that only accept numbers.
Quick Recap: Data Preprocessing
Clean, correctly transformed data is the foundation of every successful ML system. Always think carefully about what information each preprocessing step uses and how it will behave in production.