Data Cleaning

Data Cleaning Interview Q&A

1. What is data cleaning?
Answer: Fixing quality issues like missing values, duplicates, invalid formats, and inconsistencies.
2. How do you detect missing data?
Answer: Profile null counts per column, then inspect missingness patterns by segment and over time.
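A minimal sketch of that profiling in plain Python. The sample records, the column names, and the rule that blank strings count as missing are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample records; None and "" both stand in for missing values.
rows = [
    {"id": 1, "region": "EU", "age": 34, "email": "a@example.com"},
    {"id": 2, "region": "EU", "age": None, "email": ""},
    {"id": 3, "region": "US", "age": None, "email": "c@example.com"},
    {"id": 4, "region": "US", "age": 51, "email": None},
]

def is_missing(value):
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def null_counts(records):
    """Count missing values per column."""
    counts = Counter()
    for record in records:
        for column, value in record.items():
            if is_missing(value):
                counts[column] += 1
    return dict(counts)

def null_rate_by_segment(records, column, segment):
    """Share of missing values in `column` within each value of `segment`."""
    totals, missing = Counter(), Counter()
    for record in records:
        totals[record[segment]] += 1
        if is_missing(record[column]):
            missing[record[segment]] += 1
    return {key: missing[key] / totals[key] for key in totals}

print(null_counts(rows))
print(null_rate_by_segment(rows, "age", "region"))
```

A missing rate that differs sharply between segments or time periods usually points to a collection problem rather than random gaps.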
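3. Should you drop or impute missing values?
Answer: Drop when the missing share is small and removing it has little impact on the analysis; impute when preserving records is important and the assumptions behind the imputation are valid.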
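The two options can be sketched side by side. The values are hypothetical, and median imputation is only one possible strategy, chosen here because it is less sensitive to extreme values than the mean.

```python
from statistics import median

# Hypothetical measurements with gaps marked as None.
values = [12.0, None, 15.0, 14.0, None, 100.0]

def drop_missing(xs):
    """Option 1: discard missing entries (the sample shrinks)."""
    return [x for x in xs if x is not None]

def impute_median(xs):
    """Option 2: fill gaps with the median of the observed values."""
    fill = median(x for x in xs if x is not None)
    return [fill if x is None else x for x in xs]

dropped = drop_missing(values)
imputed = impute_median(values)
print(len(dropped), len(imputed))
```

Dropping keeps only observed data but loses a third of this sample; imputing keeps every row but assumes the filled value is a reasonable stand-in.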
4. How do you handle duplicates?
Answer: Define the business key, identify exact and near duplicates, and keep the authoritative record.
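As an illustration, the sketch below assumes the business key is a normalized email address and that the most recently updated record is the authoritative one. Both rules, and the field names, are hypothetical; real rules come from the business context.

```python
# Hypothetical customer records; the first two differ only in case and spacing.
records = [
    {"email": "Ann@Example.com ", "name": "Ann", "updated": "2024-01-05"},
    {"email": "ann@example.com", "name": "Ann B.", "updated": "2024-03-01"},
    {"email": "bob@example.com", "name": "Bob", "updated": "2024-02-10"},
]

def business_key(record):
    """Normalize the key so near duplicates collapse onto the same value."""
    return record["email"].strip().lower()

def deduplicate(rows):
    """Keep one record per business key, preferring the latest update."""
    best = {}
    for row in rows:
        key = business_key(row)
        # ISO 8601 date strings sort chronologically, so string comparison works.
        if key not in best or row["updated"] > best[key]["updated"]:
            best[key] = row
    return list(best.values())

unique = deduplicate(records)
print([r["name"] for r in unique])
```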
5. What are outliers?
Answer: Extreme observations that may be valid rare events or data errors.
6. How do you treat outliers?
Answer: Investigate the source first, then cap, transform, segment, or remove them, and document the justification.
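Capping is the easiest of these treatments to show in code. The sketch below uses the common 1.5 × IQR fence rule, which is an assumption here rather than something the answer prescribes; the data are hypothetical.

```python
from statistics import quantiles

def iqr_bounds(xs, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap(xs, k=1.5):
    """Clip values to the IQR fences instead of deleting them."""
    low, high = iqr_bounds(xs, k)
    return [min(max(x, low), high) for x in xs]

# Hypothetical response times with one extreme reading.
data = [10, 12, 11, 13, 12, 11, 95]
low, high = iqr_bounds(data)
capped = cap(data)
print(low, high, capped)
```

Capping keeps the row and limits its influence; it is only appropriate once the investigation shows the extreme value should not dominate the analysis.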
7. Why standardize text values?
Answer: To prevent category explosion caused by case, whitespace, and spelling variations of the same value.
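A small sketch of how variants collapse into one category. The alias table is hypothetical and would normally be maintained from the variants actually seen in the data.

```python
import unicodedata

# Hypothetical alias table mapping known variants to one canonical label.
ALIASES = {"usa": "united states", "u.s.": "united states", "us": "united states"}

def standardize(value):
    """Normalize Unicode form, whitespace, and case, then apply known aliases."""
    text = unicodedata.normalize("NFKC", value)
    text = " ".join(text.split()).casefold()
    return ALIASES.get(text, text)

raw = ["USA", " usa", "U.S.", "United  States", "united states"]
cleaned = [standardize(v) for v in raw]
print(len(set(raw)), "categories before,", len(set(cleaned)), "after")
```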
8. What is the best practice for date parsing?
Answer: Enforce one timezone (for example, UTC) and one canonical datetime format (for example, ISO 8601).
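The sketch below converts two differently formatted local timestamps to UTC in ISO 8601. The helper name, the input formats, and the fixed UTC offsets are assumptions for illustration; real sources with daylight saving time need named time zones rather than fixed offsets.

```python
from datetime import datetime, timedelta, timezone

def to_utc_iso(text, fmt, utc_offset_hours=0):
    """Parse `text` with `fmt`, attach the source's UTC offset, return UTC ISO 8601."""
    naive = datetime.strptime(text, fmt)
    local = naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc).isoformat()

# Two hypothetical sources that record the same instant in different conventions.
a = to_utc_iso("03/15/2024 18:30", "%m/%d/%Y %H:%M", utc_offset_hours=-5)
b = to_utc_iso("2024-03-16 00:30", "%Y-%m-%d %H:%M", utc_offset_hours=1)
print(a, b)
```

Once both sources are in the same timezone and format, the two records compare as equal, which they would not as raw strings.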
9. How do you validate cleaning steps?
Answer: Use before/after metrics, data tests, and sample audits.
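Before/after metrics and data tests can be sketched as a profile function plus a few checks. The metric names, the `order_id` and `amount` columns, and the specific rules are hypothetical; a real project would pick checks that match its own data contract.

```python
def profile(rows, key, column):
    """Collect simple quality metrics for one key column and one value column."""
    return {
        "row_count": len(rows),
        "missing": sum(1 for r in rows if r[column] is None),
        "duplicate_keys": len(rows) - len({r[key] for r in rows}),
    }

def check_cleaned(before, after):
    """Return a list of failed expectations (empty means all checks passed)."""
    problems = []
    if after["row_count"] > before["row_count"]:
        problems.append("row count grew during cleaning")
    if after["missing"] > 0:
        problems.append("missing values remain")
    if after["duplicate_keys"] > 0:
        problems.append("duplicate keys remain")
    return problems

# Hypothetical data before and after a cleaning step.
raw_rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
]
clean_rows = [{"order_id": 1, "amount": 20.0}]

before = profile(raw_rows, "order_id", "amount")
after = profile(clean_rows, "order_id", "amount")
print(before, after, check_cleaned(before, after))
```

Automated checks like these do not replace sample audits; manually reviewing a few changed rows catches errors that aggregate metrics hide.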
10. What is data leakage during cleaning?
Answer: Using future or test-set information, such as imputation statistics computed on the full dataset, while preparing training data.
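A minimal sketch of the safe pattern, using mean imputation as the example cleaning step: the fill value is learned from the training split only and then reused on the test split. The data and function names are hypothetical.

```python
from statistics import mean

# Hypothetical splits; None marks a missing value.
train = [10.0, None, 14.0, 12.0]
test = [None, 100.0]

def fit_mean(xs):
    """Learn the fill value from the observed entries of one split."""
    return mean(x for x in xs if x is not None)

def fill(xs, value):
    """Apply a previously learned fill value."""
    return [value if x is None else x for x in xs]

# Safe: the statistic comes from the training data only.
train_mean = fit_mean(train)
train_clean = fill(train, train_mean)
test_clean = fill(test, train_mean)

# Leaky: the statistic also sees the test data, so training rows absorb test information.
leaky_mean = fit_mean(train + test)
print(train_mean, leaky_mean)
```

The same rule applies to scaling parameters, outlier thresholds, and category mappings: fit on training data, apply everywhere else.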
11. Should cleaning be reproducible?
Answer: Yes, via scripted pipelines and versioned transformation logic.
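One way to picture a scripted pipeline is an ordered list of pure functions plus a fingerprint of the output, so that two runs can be compared. The step functions, the version string, and the fingerprint helper are hypothetical names for this sketch; in practice the logic would live in version control.

```python
import hashlib
import json

PIPELINE_VERSION = "1.0.0"  # hypothetical label, bumped whenever a step changes

def strip_whitespace(rows):
    """Return new records with surrounding whitespace removed from strings."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
        for r in rows
    ]

def drop_empty_names(rows):
    """Remove records whose name is empty after trimming."""
    return [r for r in rows if r["name"]]

STEPS = [strip_whitespace, drop_empty_names]

def run_pipeline(rows):
    """Apply every step in order without mutating the input."""
    for step in STEPS:
        rows = step(rows)
    return rows

def fingerprint(rows):
    """Hash the output so identical inputs and logic give an identical digest."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

raw = [{"name": " Ann "}, {"name": "   "}, {"name": "Bob"}]
first = run_pipeline(raw)
second = run_pipeline(raw)
print(PIPELINE_VERSION, fingerprint(first) == fingerprint(second))
```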
12. What is a one-line summary of data cleaning?
Answer: Clean data is the foundation of trustworthy analytics and ML models.