Updated 2024
Data Science Interview: 20 Essential Q&A
Master Data Science fundamentals, from statistics to machine learning. Short, crisp answers designed to help you ace your next interview.
12 min read
20 questions
Beginner to Advanced
statistics
machine learning
Python
SQL
regression
classification
feature engineering
1
What is the difference between Data Science, Machine Learning, and AI?
⚡ easy
Answer: AI is the broad field of making machines intelligent. Machine Learning is a subset of AI where systems learn from data. Data Science is an interdisciplinary field that uses ML, statistics, and domain expertise to extract insights from data. Data Science encompasses the entire data pipeline.
fundamentals
hierarchy
2
Explain different types of data in Data Science.
⚡ easy
Answer:
- Structured: Tabular data (databases, Excel)
- Unstructured: Text, images, audio, video
- Semi-structured: JSON, XML, HTML
- Time-series: Data points indexed in time order
3
What are common data cleaning steps?
⚡ easy
Answer: Handling missing values (drop, impute), removing duplicates, fixing data types, handling outliers, standardizing formats, and correcting inconsistencies. Data cleaning typically takes 60-80% of project time.
df.dropna()                            # Drop rows with missing values
df.fillna(df.mean(numeric_only=True))  # Impute numeric columns with the column mean
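The other cleaning steps from the answer can be sketched the same way on a toy frame (column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame exhibiting the issues listed above (hypothetical data)
df = pd.DataFrame({
    "age": [25, np.nan, 25, 200],       # missing value + impossible outlier
    "city": ["NY", "ny", "NY", "LA"],   # inconsistent formatting
})

df = df.drop_duplicates()                         # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values
df["city"] = df["city"].str.upper()               # standardize formats
df = df[df["age"].between(0, 120)]                # drop impossible outliers
```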
4
What is feature engineering and why is it important?
📊 medium
Answer: Feature engineering is creating new features or transforming existing ones to improve model performance. Examples: creating interaction terms, binning, encoding categorical variables, scaling. Good features can often beat complex models.
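A minimal sketch of the techniques listed above, using made-up columns:

```python
import pandas as pd

df = pd.DataFrame({"price": [100, 250, 400], "qty": [2, 1, 3],
                   "color": ["red", "blue", "red"]})

# Interaction term: combine two raw features
df["revenue"] = df["price"] * df["qty"]
# Binning: turn a continuous feature into categories (bin edges are illustrative)
df["price_band"] = pd.cut(df["price"], bins=[0, 200, 500], labels=["low", "high"])
# One-hot encoding for a categorical variable
df = pd.get_dummies(df, columns=["color"])
# Min-max scaling to [0, 1]
df["qty_scaled"] = (df["qty"] - df["qty"].min()) / (df["qty"].max() - df["qty"].min())
```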
5
Explain Linear Regression and its assumptions.
📊 medium
Answer: Linear regression models the relationship between a dependent variable and independent variables as y = β₀ + β₁x + ε. Key assumptions: linearity, independence of errors, homoscedasticity (constant variance), normality of errors, and no multicollinearity.
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
6
How does Logistic Regression differ from Linear Regression?
📊 medium
Answer: Logistic regression is used for binary classification, while linear regression predicts continuous values. Logistic regression passes the linear combination through the sigmoid function to output probabilities in (0, 1); linear regression uses the identity function. The loss functions also differ: logistic regression minimizes log loss, linear regression minimizes MSE.
sigmoid: 1/(1+e^-x)
log loss
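A minimal sketch with scikit-learn and a tiny made-up dataset, showing the sigmoid and the probabilistic output:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    # Maps any real number into (0, 1)
    return 1 / (1 + np.exp(-x))

# Toy data: label flips from 0 to 1 as the feature grows
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # P(class = 1), not raw values
```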
7
Explain the Bias-Variance Tradeoff.
📊 medium
Answer: Bias is error from wrong assumptions (underfitting). Variance is error from sensitivity to training data (overfitting). Tradeoff: complex models have low bias but high variance; simple models have high bias but low variance. Goal is to find optimal balance.
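The tradeoff can be demonstrated with polynomial fits of increasing degree; the setup below (noisy sine data, degrees 1 and 9) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0, 1, 100)

def true_f(x):
    return np.sin(2 * np.pi * x)

y_train = true_f(x_train) + rng.normal(0, 0.2, x_train.size)

def poly_errors(degree):
    # Fit a polynomial of the given degree; report train and test MSE
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - true_f(x_test)) ** 2)
    return train_mse, test_mse

simple = poly_errors(1)    # high bias: misses the curve even on training data
flexible = poly_errors(9)  # high variance: near-zero train error, fits the noise
```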
8
What is Cross-Validation and why use it?
📊 medium
Answer: Cross-validation splits data into multiple train/validation sets to assess model performance more robustly. K-fold CV divides data into k folds, training on k-1 and validating on 1, rotating k times. Prevents overfitting and gives better generalization estimates.
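A minimal 5-fold CV sketch using scikit-learn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: trains 5 models, each validated on a held-out fifth of the data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_score = scores.mean()  # a more robust estimate than a single split
```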
9
What is PCA and when would you use it?
🔥 hard
Answer: Principal Component Analysis (PCA) reduces dimensionality by creating new features (principal components) that capture maximum variance. Use it for visualization, reducing overfitting, speeding up training, and handling multicollinearity. Standardize features first, since PCA is scale-sensitive; components are orthogonal and ranked by the variance they explain.
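A short sketch with scikit-learn (iris data, reduced to 2 components):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first: PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)           # 4 features -> 2 components
explained = pca.explained_variance_ratio_    # variance captured per component
```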
10
Explain K-Means Clustering algorithm.
📊 medium
Answer: K-means partitions data into k clusters by minimizing within-cluster variance. Steps: initialize k centroids, assign points to nearest centroid, update centroids as mean of assigned points, repeat until convergence. Choose k using elbow method or silhouette score.
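A sketch on synthetic, well-separated blobs, assuming scikit-learn (cluster centers are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three well-separated blobs of 50 points each
centers = [(0, 0), (5, 5), (0, 5)]
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_
inertia = km.inertia_  # within-cluster sum of squares; plot vs k for the elbow method
```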
11
How do Decision Trees work?
📊 medium
Answer: Decision trees split data recursively based on feature values to maximize information gain or reduce impurity. Each node represents a feature, branches represent decisions, leaves represent outcomes. Advantages: interpretable, handles non-linearity. Disadvantages: prone to overfitting.
entropy
gini impurity
information gain
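A minimal scikit-learn sketch on iris; the max_depth value is illustrative, chosen to curb the overfitting mentioned above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="gini" uses Gini impurity; "entropy" would use information gain.
# max_depth limits tree growth, the main guard against overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
acc = tree.score(X, y)
```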
12
What is Random Forest and how does it improve over decision trees?
📊 medium
Answer: Random Forest is an ensemble of decision trees using bagging and random feature selection. Each tree trained on bootstrap sample, splitting on random feature subset. Reduces overfitting, handles high dimensions better, provides feature importance. More stable than single trees.
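A minimal sketch, again with scikit-learn and iris (n_estimators is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each trained on a bootstrap sample with random feature subsets
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_  # built-in feature importance, sums to 1
```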
13
Explain key classification metrics.
📊 medium
Answer:
- Accuracy: (TP+TN)/(Total) - overall correctness
- Precision: TP/(TP+FP) - positive predictive value
- Recall: TP/(TP+FN) - sensitivity, true positive rate
- F1-Score: 2*(Precision*Recall)/(Precision+Recall) - harmonic mean
- AUC-ROC: Area under ROC curve, measures discrimination ability
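All of these are available in scikit-learn; the labels and probabilities below are made up to illustrate the calls:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                       # ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                       # hard predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]       # predicted P(class=1)

acc = accuracy_score(y_true, y_pred)    # (TP+TN)/Total
prec = precision_score(y_true, y_pred)  # TP/(TP+FP)
rec = recall_score(y_true, y_pred)      # TP/(TP+FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)     # needs probabilities, not hard labels
```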
14
Common SQL commands for data science?
⚡ easy
Answer: SELECT, WHERE, GROUP BY, HAVING, JOIN (INNER, LEFT, RIGHT), ORDER BY, window functions (ROW_NUMBER, RANK, LAG), aggregations (COUNT, SUM, AVG). Essential for data extraction and feature creation.
SELECT department, AVG(salary)
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;
15
L1 vs L2 Regularization: differences?
🔥 hard
Answer: L1 (Lasso) adds penalty λΣ|βᵢ| and can shrink coefficients exactly to zero (built-in feature selection). L2 (Ridge) adds penalty λΣβᵢ², shrinking coefficients toward zero but never exactly to zero. L1 produces sparse solutions; L2 handles multicollinearity better. ElasticNet combines both.
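The sparsity difference shows up clearly on synthetic data (scikit-learn; the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features are informative; the other eight are noise
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: zeroes out irrelevant coefficients
ridge = Ridge(alpha=0.1).fit(X, y)  # L2: shrinks, but rarely exactly to zero

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
```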
16
What is Gradient Boosting?
🔥 hard
Answer: Gradient boosting builds models sequentially, where each new model corrects errors of previous ones by fitting to negative gradients (residuals). Popular implementations: XGBoost, LightGBM, CatBoost. Often best performance but requires careful tuning.
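A sketch with scikit-learn's GradientBoostingClassifier; XGBoost/LightGBM/CatBoost expose similar APIs, and the hyperparameters below are illustrative defaults to tune:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each new shallow tree fits the residual errors of the ensemble so far;
# learning_rate scales each tree's contribution.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0).fit(X_tr, y_tr)
acc = gb.score(X_te, y_te)
```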
17
Key Pandas operations for data manipulation?
📊 medium
Answer:
- Reading: pd.read_csv(), pd.read_sql()
- Exploring: df.head(), df.info(), df.describe()
- Filtering: df[df['col'] > value]
- Grouping: df.groupby('col').agg()
- Merging: pd.merge(df1, df2, on='key')
- Pivot tables: pd.pivot_table()
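The core operations in one toy example (both frames are made up):

```python
import pandas as pd

sales = pd.DataFrame({"emp": ["a", "b", "a", "c"], "amount": [10, 20, 30, 40]})
staff = pd.DataFrame({"emp": ["a", "b", "c"], "dept": ["x", "x", "y"]})

merged = pd.merge(sales, staff, on="emp")          # join on a key
by_dept = merged.groupby("dept")["amount"].sum()   # aggregate per group
big = merged[merged["amount"] > 15]                # boolean filtering
```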
18
Explain p-value and significance level.
📊 medium
Answer: The p-value is the probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true. The significance level (α, usually 0.05) is the threshold for rejecting the null. If p < α, reject the null (statistically significant). A lower p-value means stronger evidence against the null.
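A quick illustration with SciPy's two-sample t-test on synthetic data (distribution parameters are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)  # null is true: a and b share a mean
b = rng.normal(0.0, 1.0, 200)
c = rng.normal(1.0, 1.0, 200)  # genuinely different mean

_, p_same = stats.ttest_ind(a, b)  # expect a large p-value
_, p_diff = stats.ttest_ind(a, c)  # expect a tiny p-value

alpha = 0.05
reject = p_diff < alpha  # statistically significant difference
```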
19
How do you detect and handle outliers?
📊 medium
Answer: Detection: box plots (points beyond 1.5×IQR from the quartiles), z-scores (|z| > 3), domain knowledge. Handling: remove (if errors), cap/winsorize, transform (e.g. log), treat separately, or use robust methods (tree-based models are less sensitive to outliers).
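The 1.5×IQR rule in code, on toy data:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]  # detect
capped = np.clip(data, lower, upper)              # winsorize instead of dropping
```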
20
Common challenges in Data Science projects?
📊 medium
Answer: Data quality issues, unclear business objectives, class imbalance, feature selection, model interpretability, deployment challenges, concept drift, data privacy, and stakeholder communication. Success requires both technical and soft skills.
📊 data quality
🎯 business alignment
⚖️ class imbalance
🔧 deployment
Data Science Interview Cheat Sheet
Statistics
- Descriptive stats
- Hypothesis testing
- Probability distributions
- Bayesian thinking
Machine Learning
- Regression & Classification
- Tree-based models
- Ensemble methods
- Clustering & dimensionality
Tools & Techniques
- Python (Pandas, NumPy)
- SQL queries
- Scikit-learn
- Model evaluation
💡 Pro tip: For each algorithm, understand: assumptions, pros/cons, when to use, and how to tune
Practice with real datasets
Apply these concepts on Kaggle competitions or real-world projects. Hands-on experience is crucial.