Updated 2024
Data Science Interview: 20 Essential Q&A
Master Data Science fundamentals, from statistics to machine learning. Short, crisp answers designed to help you ace your next interview.
12 min read
20 questions
Beginner to Advanced
statistics
machine learning
Python
SQL
regression
classification
feature engineering
1
What is the difference between Data Science, Machine Learning, and AI?
⚡ easy
Answer: AI is the broad field of making machines intelligent. Machine Learning is a subset of AI where systems learn from data. Data Science is an interdisciplinary field that uses ML, statistics, and domain expertise to extract insights from data. Data Science encompasses the entire data pipeline.
fundamentals
hierarchy
2
Explain different types of data in Data Science.
⚡ easy
Answer:
- Structured: Tabular data (databases, Excel)
- Unstructured: Text, images, audio, video
- Semi-structured: JSON, XML, HTML
- Time-series: Data points indexed in time order
3
What are common data cleaning steps?
⚡ easy
Answer: Handling missing values (drop, impute), removing duplicates, fixing data types, handling outliers, standardizing formats, and correcting inconsistencies. Data cleaning typically takes 60-80% of project time.
df.dropna()                            # Drop rows with missing values
df.fillna(df.mean(numeric_only=True))  # Impute numeric columns with the column mean
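The other cleaning steps from the answer can be sketched the same way on a toy frame (column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame exhibiting the issues listed above (hypothetical data)
df = pd.DataFrame({
    "age": [25, np.nan, 25, 200],       # missing value + impossible outlier
    "city": ["NY", "ny", "NY", "LA"],   # inconsistent formatting
})

df = df.drop_duplicates()                         # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values
df["city"] = df["city"].str.upper()               # standardize formats
df = df[df["age"].between(0, 120)]                # drop impossible outliers
```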
4
What is feature engineering and why is it important?
📊 medium
Answer: Feature engineering is creating new features or transforming existing ones to improve model performance. Examples: creating interaction terms, binning, encoding categorical variables, scaling. Good features can often beat complex models.
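A minimal sketch of the techniques listed above, using made-up columns:

```python
import pandas as pd

df = pd.DataFrame({"price": [100, 250, 400], "qty": [2, 1, 3],
                   "color": ["red", "blue", "red"]})

# Interaction term: combine two raw features
df["revenue"] = df["price"] * df["qty"]
# Binning: turn a continuous feature into categories (bin edges are illustrative)
df["price_band"] = pd.cut(df["price"], bins=[0, 200, 500], labels=["low", "high"])
# One-hot encoding for a categorical variable
df = pd.get_dummies(df, columns=["color"])
# Min-max scaling to [0, 1]
df["qty_scaled"] = (df["qty"] - df["qty"].min()) / (df["qty"].max() - df["qty"].min())
```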
5
Explain Linear Regression and its assumptions.
📊 medium
Answer: Linear regression models the relationship between a dependent variable and independent variables as y = β₀ + β₁x + ε. Key assumptions: linearity, independence of errors, homoscedasticity (constant variance), normality of errors, and no multicollinearity.
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
6
How does Logistic Regression differ from Linear Regression?
📊 medium
Answer: Logistic regression is used for binary classification, while linear regression predicts continuous values. Logistic regression passes the linear combination through the sigmoid function to output probabilities in (0, 1); linear regression uses the identity function. The loss functions also differ: logistic regression minimizes log loss, linear regression minimizes MSE.
sigmoid: 1/(1+e^-x)
log loss
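A minimal sketch with scikit-learn and a tiny made-up dataset, showing the sigmoid and the probabilistic output:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    # Maps any real number into (0, 1)
    return 1 / (1 + np.exp(-x))

# Toy data: label flips from 0 to 1 as the feature grows
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # P(class = 1), not raw values
```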
7
Explain the Bias-Variance Tradeoff.
📊 medium
Answer: Bias is error from wrong assumptions (underfitting). Variance is error from sensitivity to training data (overfitting). Tradeoff: complex models have low bias but high variance; simple models have high bias but low variance. Goal is to find optimal balance.
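The tradeoff can be demonstrated with polynomial fits of increasing degree; the setup below (noisy sine data, degrees 1 and 9) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0, 1, 100)

def true_f(x):
    return np.sin(2 * np.pi * x)

y_train = true_f(x_train) + rng.normal(0, 0.2, x_train.size)

def poly_errors(degree):
    # Fit a polynomial of the given degree; report train and test MSE
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - true_f(x_test)) ** 2)
    return train_mse, test_mse

simple = poly_errors(1)    # high bias: misses the curve even on training data
flexible = poly_errors(9)  # high variance: near-zero train error, fits the noise
```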
8
What is Cross-Validation and why use it?
📊 medium
Answer: Cross-validation splits data into multiple train/validation sets to assess model performance more robustly. K-fold CV divides data into k folds, training on k-1 and validating on 1, rotating k times. Prevents overfitting and gives better generalization estimates.
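A minimal 5-fold CV sketch using scikit-learn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: trains 5 models, each validated on a held-out fifth of the data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_score = scores.mean()  # a more robust estimate than a single split
```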
9
What is PCA and when would you use it?
🔥 hard
Answer: Principal Component Analysis (PCA) reduces dimensionality by creating new features (principal components) that capture maximum variance. Use it for visualization, reducing overfitting, speeding up training, and handling multicollinearity. Standardize features first, since PCA is scale-sensitive; components are orthogonal and ranked by the variance they explain.
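A short sketch with scikit-learn (iris data, reduced to 2 components):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first: PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)           # 4 features -> 2 components
explained = pca.explained_variance_ratio_    # variance captured per component
```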
10
Explain K-Means Clustering algorithm.
📊 medium
Answer: K-means partitions data into k clusters by minimizing within-cluster variance. Steps: initialize k centroids, assign points to nearest centroid, update centroids as mean of assigned points, repeat until convergence. Choose k using elbow method or silhouette score.
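A sketch on synthetic, well-separated blobs, assuming scikit-learn (cluster centers are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three well-separated blobs of 50 points each
centers = [(0, 0), (5, 5), (0, 5)]
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_
inertia = km.inertia_  # within-cluster sum of squares; plot vs k for the elbow method
```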
11
How do Decision Trees work?
📊 medium
Answer: Decision trees split data recursively based on feature values to maximize information gain or reduce impurity. Each node represents a feature, branches represent decisions, leaves represent outcomes. Advantages: interpretable, handles non-linearity. Disadvantages: prone to overfitting.
entropy
gini impurity
information gain
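A minimal scikit-learn sketch on iris; the max_depth value is illustrative, chosen to curb the overfitting mentioned above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="gini" uses Gini impurity; "entropy" would use information gain.
# max_depth limits tree growth, the main guard against overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
acc = tree.score(X, y)
```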
12
What is Random Forest and how does it improve over decision trees?
📊 medium
Answer: Random Forest is an ensemble of decision trees using bagging and random feature selection. Each tree trained on bootstrap sample, splitting on random feature subset. Reduces overfitting, handles high dimensions better, provides feature importance. More stable than single trees.
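A minimal sketch, again with scikit-learn and iris (n_estimators is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each trained on a bootstrap sample with random feature subsets
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_  # built-in feature importance, sums to 1
```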
13
Explain key classification metrics.
📊 medium
Answer:
- Accuracy: (TP+TN)/(Total) - overall correctness
- Precision: TP/(TP+FP) - positive predictive value
- Recall: TP/(TP+FN) - sensitivity, true positive rate
- F1-Score: 2*(Precision*Recall)/(Precision+Recall) - harmonic mean
- AUC-ROC: Area under ROC curve, measures discrimination ability
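All of these are available in scikit-learn; the labels and probabilities below are made up to illustrate the calls:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                       # ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                       # hard predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]       # predicted P(class=1)

acc = accuracy_score(y_true, y_pred)    # (TP+TN)/Total
prec = precision_score(y_true, y_pred)  # TP/(TP+FP)
rec = recall_score(y_true, y_pred)      # TP/(TP+FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)     # needs probabilities, not hard labels
```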
14
Common SQL commands for data science?
⚡ easy
Answer: SELECT, WHERE, GROUP BY, HAVING, JOIN (INNER, LEFT, RIGHT), ORDER BY, window functions (ROW_NUMBER, RANK, LAG), aggregations (COUNT, SUM, AVG). Essential for data extraction and feature creation.
SELECT department, AVG(salary)
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;
15
L1 vs L2 Regularization: differences?
🔥 hard
Answer: L1 (Lasso) adds penalty λΣ|βᵢ| and can shrink coefficients exactly to zero (built-in feature selection). L2 (Ridge) adds penalty λΣβᵢ², shrinking coefficients toward zero but never exactly to zero. L1 produces sparse solutions; L2 handles multicollinearity better. ElasticNet combines both.
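The sparsity difference shows up clearly on synthetic data (scikit-learn; the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features are informative; the other eight are noise
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: zeroes out irrelevant coefficients
ridge = Ridge(alpha=0.1).fit(X, y)  # L2: shrinks, but rarely exactly to zero

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
```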
16
What is Gradient Boosting?
🔥 hard
Answer: Gradient boosting builds models sequentially, where each new model corrects errors of previous ones by fitting to negative gradients (residuals). Popular implementations: XGBoost, LightGBM, CatBoost. Often best performance but requires careful tuning.
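A sketch with scikit-learn's GradientBoostingClassifier; XGBoost/LightGBM/CatBoost expose similar APIs, and the hyperparameters below are illustrative defaults to tune:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each new shallow tree fits the residual errors of the ensemble so far;
# learning_rate scales each tree's contribution.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0).fit(X_tr, y_tr)
acc = gb.score(X_te, y_te)
```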
17
Key Pandas operations for data manipulation?
📊 medium
Answer:
- Reading: pd.read_csv(), pd.read_sql()
- Exploring: df.head(), df.info(), df.describe()
- Filtering: df[df['col'] > value]
- Grouping: df.groupby('col').agg()
- Merging: pd.merge(df1, df2, on='key')
- Pivot tables: pd.pivot_table()
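The core operations in one toy example (both frames are made up):

```python
import pandas as pd

sales = pd.DataFrame({"emp": ["a", "b", "a", "c"], "amount": [10, 20, 30, 40]})
staff = pd.DataFrame({"emp": ["a", "b", "c"], "dept": ["x", "x", "y"]})

merged = pd.merge(sales, staff, on="emp")          # join on a key
by_dept = merged.groupby("dept")["amount"].sum()   # aggregate per group
big = merged[merged["amount"] > 15]                # boolean filtering
```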
18
Explain p-value and significance level.
📊 medium
Answer: The p-value is the probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true. The significance level (α, usually 0.05) is the threshold for rejecting the null. If p < α, reject the null (statistically significant). A lower p-value means stronger evidence against the null.
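A quick illustration with SciPy's two-sample t-test on synthetic data (distribution parameters are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)  # null is true: a and b share a mean
b = rng.normal(0.0, 1.0, 200)
c = rng.normal(1.0, 1.0, 200)  # genuinely different mean

_, p_same = stats.ttest_ind(a, b)  # expect a large p-value
_, p_diff = stats.ttest_ind(a, c)  # expect a tiny p-value

alpha = 0.05
reject = p_diff < alpha  # statistically significant difference
```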
19
How do you detect and handle outliers?
📊 medium
Answer: Detection: box plots (points beyond 1.5×IQR from the quartiles), z-scores (|z| > 3), domain knowledge. Handling: remove (if errors), cap/winsorize, transform (e.g. log), treat separately, or use robust methods (tree-based models are less sensitive to outliers).
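The 1.5×IQR rule in code, on toy data:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]  # detect
capped = np.clip(data, lower, upper)              # winsorize instead of dropping
```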
20
Common challenges in Data Science projects?
📊 medium
Answer: Data quality issues, unclear business objectives, class imbalance, feature selection, model interpretability, deployment challenges, concept drift, data privacy, and stakeholder communication. Success requires both technical and soft skills.
📊 data quality
🎯 business alignment
⚖️ class imbalance
🔧 deployment
Data Science Interview Cheat Sheet
Statistics
- Descriptive stats
- Hypothesis testing
- Probability distributions
- Bayesian thinking
Machine Learning
- Regression & Classification
- Tree-based models
- Ensemble methods
- Clustering & dimensionality
Tools & Techniques
- Python (Pandas, NumPy)
- SQL queries
- Scikit-learn
- Model evaluation
💡 Pro tip: For each algorithm, understand: assumptions, pros/cons, when to use, and how to tune
Practice with real datasets
Apply these concepts on Kaggle competitions or real-world projects. Hands-on experience is crucial.