Data Cleaning & Wrangling — Q&A

Handle missing values, outliers, duplicates, and reshape data for analysis.

Data Cleaning Interview Q&A

1. What is data cleaning?
Answer: Fixing quality issues like missing values, duplicates, invalid formats, and inconsistencies.
2. How do you detect missing data?
Answer: Profile null counts per column, then inspect how missingness varies by segment and over time.
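
The profiling step above can be sketched in plain Python. The sample rows, the `is_missing` rule (None or blank string), and the function names are assumptions made for illustration.

```python
# Hypothetical sample rows; None and "" both stand for a missing value here.
rows = [
    {"region": "north", "age": 34, "email": "a@example.com"},
    {"region": "north", "age": None, "email": ""},
    {"region": "south", "age": None, "email": "c@example.com"},
    {"region": "south", "age": None, "email": None},
]

def is_missing(value):
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def null_counts(rows):
    """Count missing values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + is_missing(value)
    return counts

def null_rate_by_segment(rows, segment, column):
    """Share of missing values in `column` within each value of `segment`."""
    totals, missing = {}, {}
    for row in rows:
        key = row[segment]
        totals[key] = totals.get(key, 0) + 1
        missing[key] = missing.get(key, 0) + is_missing(row[column])
    return {key: missing[key] / totals[key] for key in totals}

print(null_counts(rows))
print(null_rate_by_segment(rows, "region", "age"))
```

A missing rate that differs sharply between segments suggests the gaps are not random, which matters when choosing between dropping and imputing.
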
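3. When should you drop versus impute missing values?
Answer: Drop rows or columns when little data is lost and the missingness is unlikely to bias results; impute when preserving records matters and the imputation assumptions are valid.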
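
As a minimal sketch of the two options, assuming a hypothetical numeric column and median imputation (only one of several possible strategies):

```python
from statistics import median

ages = [34, None, 29, 41, None, 38]  # hypothetical column with gaps

def drop_missing(values):
    """Option 1: discard the missing entries."""
    return [v for v in values if v is not None]

def impute_median(values):
    """Option 2: fill gaps with the median of the observed values.

    Also returns a flag list so downstream steps can tell which values
    were imputed rather than observed.
    """
    fill = median(v for v in values if v is not None)
    filled = [fill if v is None else v for v in values]
    was_imputed = [v is None for v in values]
    return filled, was_imputed

print(drop_missing(ages))
print(impute_median(ages))
```

Keeping the `was_imputed` flags preserves the information that a value was originally missing.
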
4. How do you handle duplicates?
Answer: Define the business key, identify exact and near duplicates, and keep the authoritative record.
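
One way to sketch this, assuming the business key is a normalized email address and the authoritative record is the most recently updated one (both are illustrative choices, not rules):

```python
# Hypothetical records; the first two differ only in case and whitespace.
records = [
    {"email": "Ana@Example.com ", "name": "Ana", "updated": "2024-01-05"},
    {"email": "ana@example.com", "name": "Ana M.", "updated": "2024-03-01"},
    {"email": "bo@example.com", "name": "Bo", "updated": "2024-02-10"},
]

def business_key(record):
    """Normalize the key so near duplicates collapse to the same value."""
    return record["email"].strip().lower()

def deduplicate(records):
    """Keep one record per business key, preferring the latest update."""
    latest = {}
    for record in records:
        key = business_key(record)
        kept = latest.get(key)
        # ISO 8601 date strings sort chronologically when compared as text.
        if kept is None or record["updated"] > kept["updated"]:
            latest[key] = record
    return list(latest.values())

for record in deduplicate(records):
    print(record)
```
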
5. What are outliers?
Answer: Extreme observations that may be valid rare events or data errors.
6. How do you treat outliers?
Answer: Investigate the source first, then cap, transform, segment, or remove them, and document the justification.
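
The sketch below flags and caps outliers using the common 1.5 × IQR rule. The rule, the multiplier `k`, and the sample values are assumptions; the answer above does not prescribe a specific detection method.

```python
from statistics import quantiles

values = [12, 14, 15, 15, 16, 17, 18, 19, 21, 95]  # hypothetical measurements

def iqr_fences(data, k=1.5):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap(data, low, high):
    """Clamp every value into the [low, high] range."""
    return [min(max(x, low), high) for x in data]

low, high = iqr_fences(values)
flagged = [x for x in values if x < low or x > high]
print("flagged for investigation:", flagged)
print("capped:", cap(values, low, high))
```

Flagging comes before capping so that each extreme value can be checked against its source before it is altered.
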
7. Why standardize text values?
Answer: To prevent category explosion, where case and spelling variations split one category into many.
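
A small sketch of the effect, assuming a hypothetical alias table (`ALIASES`) for known spelling variants:

```python
from collections import Counter

# Hypothetical mapping from known variants to a canonical label.
ALIASES = {"ny": "new york", "nyc": "new york"}

def standardize(text):
    """Trim, lowercase, collapse inner whitespace, then resolve aliases."""
    cleaned = " ".join(text.strip().lower().split())
    return ALIASES.get(cleaned, cleaned)

raw = ["New York", "new york ", "NEW  YORK", "NY", "Boston", "boston"]

print(len(Counter(raw)), "categories before")
print(Counter(standardize(value) for value in raw))
```
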
8. What is the best practice for date parsing?
Answer: Normalize to one timezone (commonly UTC) and one canonical datetime format (such as ISO 8601).
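
A minimal sketch of such a normalizer, assuming UTC as the single timezone, ISO 8601 as the canonical format, and a hypothetical list of formats seen in the source data (`FORMATS`):

```python
from datetime import datetime, timezone

# Assumed source formats; day-first is an assumption about the slash format.
FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%d/%m/%Y %H:%M", "%Y-%m-%d"]

def to_utc_iso(text, assume_tz=timezone.utc):
    """Parse `text` with the first matching format and return ISO 8601 in UTC.

    Values without an offset are assumed to be in `assume_tz`.
    """
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=assume_tz)
        return parsed.astimezone(timezone.utc).isoformat()
    raise ValueError(f"unrecognized datetime: {text!r}")

print(to_utc_iso("2024-03-01T10:30:00+0200"))
print(to_utc_iso("01/03/2024 09:15"))
```

Raising on unrecognized input, rather than guessing, keeps ambiguous dates from silently entering the cleaned data.
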
9. How do you validate cleaning steps?
Answer: Compare before/after metrics, run automated data tests, and audit samples by hand.
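
The first two checks can be sketched as follows. The `before` and `after` tables, the metric names, and the test rules are hypothetical; `after` stands in for the output of a cleaning step.

```python
before = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 29},
]
# Stand-in for the cleaned output (duplicate dropped, gap filled).
after = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 31.5},
    {"id": 3, "age": 29},
]

def profile(rows):
    """Summary metrics to compare before and after cleaning."""
    return {
        "rows": len(rows),
        "distinct_ids": len({row["id"] for row in rows}),
        "null_age": sum(row["age"] is None for row in rows),
    }

def run_tests(rows):
    """Return the names of failed checks; an empty list means all passed."""
    failures = []
    if len({row["id"] for row in rows}) != len(rows):
        failures.append("id_is_unique")
    if any(row["age"] is None for row in rows):
        failures.append("age_not_null")
    if any(row["age"] is not None and not 0 <= row["age"] <= 120 for row in rows):
        failures.append("age_in_range")
    return failures

print("before:", profile(before), run_tests(before))
print("after: ", profile(after), run_tests(after))
```
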
10. What is data leakage during cleaning?
Answer: Using future or test-set information while preparing training data.
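
A small illustration, assuming mean imputation and hypothetical train/test splits: the fill value is computed from the training split only and then reused on the test split.

```python
from statistics import mean

train = [10.0, None, 14.0, 12.0]  # hypothetical training column
test = [None, 100.0]              # hypothetical held-out column

def fit_fill_value(values):
    """Learn the imputation value from the observed entries."""
    return mean(v for v in values if v is not None)

def apply_fill(values, fill):
    """Fill gaps with a value learned elsewhere."""
    return [fill if v is None else v for v in values]

safe_fill = fit_fill_value(train)           # learned from train only
leaky_fill = fit_fill_value(train + test)   # leaks test information

print("safe:", safe_fill, "leaky:", leaky_fill)
print(apply_fill(test, safe_fill))
```

The leaky value is pulled toward the test split's 100.0, so training data prepared with it would already carry information from the test set.
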
11. Should cleaning be reproducible?
Answer: Yes, via scripted pipelines and versioned transformation logic.
12. What is a one-line summary of data cleaning?
Answer: Clean data is the foundation of trustworthy analytics and ML models.

Data Wrangling Interview Q&A

13. What is data wrangling?
Answer: Transforming raw data into a usable, analysis-ready format.
14. How does wrangling differ from cleaning?
Answer: Cleaning fixes quality issues; wrangling also includes reshaping, joining, and feature-ready transformation.
15. What is tidy data?
Answer: Each variable is a column, each observation is a row, and each value is a cell.
16. What is the difference between wide and long format?
Answer: Wide stores repeated measures across columns; long stacks them in rows.
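
The two layouts can be converted into each other. The sketch below uses plain lists of dicts; the table contents and the function names `to_long` and `to_wide` are illustrative.

```python
# Wide: one row per student, one column per subject.
wide = [
    {"student": "ana", "math": 90, "art": 85},
    {"student": "bo", "math": 72, "art": 95},
]

def to_long(rows, id_column, var_name, value_name):
    """Stack every non-id column into (id, variable, value) rows."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                long_rows.append(
                    {id_column: row[id_column], var_name: column, value_name: value}
                )
    return long_rows

def to_wide(rows, id_column, var_name, value_name):
    """Spread (id, variable, value) rows back into one row per id."""
    wide_rows = {}
    for row in rows:
        target = wide_rows.setdefault(row[id_column], {id_column: row[id_column]})
        target[row[var_name]] = row[value_name]
    return list(wide_rows.values())

long = to_long(wide, "student", "subject", "score")
for row in long:
    print(row)
```
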
17. Why do keys matter in joins?
Answer: Correct keys prevent duplication and incorrect row matching.
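
To illustrate the duplication risk, the sketch below joins hypothetical orders to a customer table in which `customer_id` is unexpectedly not unique. The `inner_join` helper is written for this example only.

```python
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 50},
    {"order_id": 2, "customer_id": "c2", "amount": 30},
]
# "c1" appears twice, so the key is not unique on this side.
customers = [
    {"customer_id": "c1", "tier": "gold"},
    {"customer_id": "c1", "tier": "silver"},
    {"customer_id": "c2", "tier": "basic"},
]

def inner_join(left, right, key):
    """Pair every left row with every right row that shares the key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

joined = inner_join(orders, customers, "customer_id")
print(len(orders), "orders became", len(joined), "rows")
print("revenue before:", sum(o["amount"] for o in orders))
print("revenue after: ", sum(j["amount"] for j in joined))
```

Order 1 matches two customer rows, so its amount is counted twice in any total computed after the join.
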
18. How do you handle schema drift?
Answer: Add schema checks and maintain a transformation mapping for each source version.
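
A minimal sketch of both pieces, assuming a hypothetical target schema (`EXPECTED_SCHEMA`) and per-version column renames (`RENAMES_BY_VERSION`):

```python
# Hypothetical target schema: column name -> expected Python type.
EXPECTED_SCHEMA = {"id": int, "email": str, "signup_date": str}

# Hypothetical rename rules for each known source version.
RENAMES_BY_VERSION = {
    "v1": {},
    "v2": {"e_mail": "email", "created": "signup_date"},
}

def conform(record, version):
    """Rename columns for the given source version, then check the schema."""
    renames = RENAMES_BY_VERSION[version]
    renamed = {renames.get(name, name): value for name, value in record.items()}
    missing = sorted(set(EXPECTED_SCHEMA) - set(renamed))
    unexpected = sorted(set(renamed) - set(EXPECTED_SCHEMA))
    wrong_type = sorted(
        name for name, expected in EXPECTED_SCHEMA.items()
        if name in renamed and not isinstance(renamed[name], expected)
    )
    if missing or unexpected or wrong_type:
        raise ValueError(
            f"schema drift: missing={missing}, "
            f"unexpected={unexpected}, wrong_type={wrong_type}"
        )
    return renamed

v2_record = {"id": 7, "e_mail": "ana@example.com", "created": "2024-03-01"}
print(conform(v2_record, "v2"))
```

Failing loudly on unknown columns or types surfaces drift when it first appears instead of letting it propagate downstream.
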
19. What is feature engineering in wrangling?
Answer: Creating informative variables from raw inputs.
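
As a small illustration, the sketch below derives three variables from a hypothetical raw order record. The field names and the chosen features are assumptions.

```python
from datetime import date

order = {"placed": "2024-03-02", "total": 30.0, "quantity": 4}  # hypothetical

def engineer(order):
    """Derive calendar and ratio features from raw order fields."""
    placed = date.fromisoformat(order["placed"])
    return {
        "day_of_week": placed.weekday(),      # Monday is 0, Sunday is 6
        "is_weekend": placed.weekday() >= 5,
        "unit_price": order["total"] / order["quantity"],
    }

print(engineer(order))
```
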
20. Why is type casting important?
Answer: Wrong dtypes cause calculation errors and inefficient memory use.
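
The calculation errors are easy to reproduce with numbers stored as text. The sample values and the `to_int` helper are illustrative.

```python
raw_amounts = ["10", "9", "100"]  # numbers that arrived as text

# As strings, comparison is lexicographic and "+" concatenates.
print(max(raw_amounts))                 # the string 9, not 100
print(raw_amounts[0] + raw_amounts[1])  # the string 109, not the number 19

def to_int(text):
    """Cast to int, returning None for values that cannot be parsed."""
    try:
        return int(text.strip())
    except ValueError:
        return None

amounts = [to_int(value) for value in raw_amounts]
print(max(amounts), sum(amounts))
```
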
21. How do you validate joins?
Answer: Compare row counts before and after the join, and check key uniqueness on each side.
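
These diagnostics can be sketched as a small report computed before the join runs. The tables and the `join_report` helper are hypothetical.

```python
from collections import Counter

left = [
    {"customer_id": "c1", "amount": 50},
    {"customer_id": "c2", "amount": 30},
    {"customer_id": "c4", "amount": 20},
]
right = [
    {"customer_id": "c1", "tier": "gold"},
    {"customer_id": "c1", "tier": "silver"},
    {"customer_id": "c2", "tier": "basic"},
    {"customer_id": "c3", "tier": "basic"},
]

def duplicate_keys(rows, key):
    """Key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def join_report(left, right, key):
    """Diagnostics that predict fan-out and unmatched rows for a join."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    return {
        "left_rows": len(left),
        "right_rows": len(right),
        "left_duplicate_keys": duplicate_keys(left, key),
        "right_duplicate_keys": duplicate_keys(right, key),
        "left_keys_without_match": sorted(left_keys - right_keys),
        "right_keys_without_match": sorted(right_keys - left_keys),
    }

report = join_report(left, right, "customer_id")
for name, value in report.items():
    print(name, value)
```

Duplicate keys on either side predict extra rows after the join; unmatched keys predict dropped rows in an inner join or nulls in an outer join.
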
22. How do you manage pipeline steps?
Answer: Make each transform modular, testable, and idempotent.
23. What is an idempotent transform?
Answer: One that produces the same result when re-run, without additional side effects.
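
A short contrast, using two hypothetical transforms: applying the first one twice changes nothing, while applying the second one twice changes the data again.

```python
rows = [{"name": "  ana lopez ", "points": 5}]  # hypothetical input

def normalize_names(rows):
    """Idempotent: a second run finds nothing left to change."""
    return [{**row, "name": row["name"].strip().title()} for row in rows]

def add_bonus(rows):
    """Not idempotent: every run adds another 10 points."""
    return [{**row, "points": row["points"] + 10} for row in rows]

once = normalize_names(rows)
twice = normalize_names(once)
print(once == twice)                                # True
print(add_bonus(rows) == add_bonus(add_bonus(rows)))  # False
```

Both functions return new rows instead of modifying their input, which keeps re-runs free of hidden side effects.
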
24. How would you summarize wrangling in one line?
Answer: Wrangling bridges messy sources and reliable analytical outputs.