20 Core Questions
Interview Prep
Scikit-Learn: Interview Q&A
Short questions and answers on using scikit-learn for practical machine learning in Python.
1
What is scikit-learn and when would you use it?
⚡ Beginner
Answer: Scikit-learn is an open-source Python library for classical ML (trees, SVMs, linear models, clustering, preprocessing). Use it for tabular data that fits in memory when you do not need deep learning or GPU training.
2
What is the common estimator API pattern in sklearn?
⚡ Beginner
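Answer: Every estimator learns from data with fit. Models then expose predict, transformers expose transform (with fit_transform as a shortcut for fit followed by transform), and most models also provide score for a default evaluation metric.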
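Example: a minimal sketch of the pattern on the built-in iris dataset. The choice of StandardScaler and LogisticRegression is only for illustration; any transformer and model follow the same calls.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformer: fit learns per-column mean and std, transform applies them.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # shortcut for scaler.fit(X).transform(X)

# Model: fit learns the parameters, predict returns labels,
# score returns mean accuracy for classifiers.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_scaled, y)
preds = clf.predict(X_scaled)
train_accuracy = clf.score(X_scaled, y)
print(f"training accuracy: {train_accuracy:.3f}")
```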
3
What is a transformer vs an estimator in sklearn?
📊 Intermediate
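Answer: In sklearn terminology, any object with fit is an estimator. Transformers are estimators that also implement transform (e.g., scaling, encoding); predictors (models) implement predict; a few, such as KMeans, implement both.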
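Example: one way to see the distinction is to check which methods an object exposes. This is a small sketch; the three classes are picked only as representatives.

```python
from sklearn.base import BaseEstimator
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

objects = {
    "StandardScaler": StandardScaler(),
    "LogisticRegression": LogisticRegression(),
    "KMeans": KMeans(n_clusters=3),
}

# All three derive from BaseEstimator; they differ in whether they
# offer transform, predict, or both.
capabilities = {
    name: {
        "is_estimator": isinstance(obj, BaseEstimator),
        "has_transform": hasattr(obj, "transform"),
        "has_predict": hasattr(obj, "predict"),
    }
    for name, obj in objects.items()
}

for name, caps in capabilities.items():
    print(name, caps)
```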
4
Why are pipelines useful in scikit-learn?
📊 Intermediate
Answer: Pipelines chain preprocessing and modeling steps so you can fit and cross-validate the whole workflow safely without leakage.
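Example: a minimal sketch of a two-step pipeline evaluated with cross-validation. The step names "scale" and "model" are arbitrary labels chosen here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The pipeline behaves like a single estimator: fit runs every step in order.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# In each fold the scaler is refit on the training part only.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```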
5
What is ColumnTransformer and when would you use it?
📊 Intermediate
Answer: ColumnTransformer applies different transformers to different columns (e.g., scale numerics, one-hot encode categoricals) in a single pipeline.
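Example: a sketch on a tiny made-up DataFrame (the column names and values are invented for illustration, and pandas is assumed to be installed).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40000, 52000, 88000, 61000],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Output has 2 scaled numeric columns plus one column per city.
features = preprocess.fit_transform(df)
print(features.shape)
```

The fitted ColumnTransformer can be used as the first step of a Pipeline, so the same column logic is applied at training and prediction time.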
6
How do you perform cross-validation in sklearn?
⚡ Beginner
Answer: Use helpers like cross_val_score, cross_validate or pass a CV splitter to GridSearchCV / RandomizedSearchCV.
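Example: a sketch using cross_validate with an explicit splitter and two metrics. The model and metric choices are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Explicit splitter: 5 stratified folds, shuffled with a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# cross_validate returns a dict with one array of per-fold values per metric.
results = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    cv=cv,
    scoring=["accuracy", "f1_macro"],
)
mean_accuracy = results["test_accuracy"].mean()
print(f"mean accuracy: {mean_accuracy:.3f}")
```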
7
What is GridSearchCV and why is it useful?
⚡ Beginner
Answer: GridSearchCV evaluates every combination in a parameter grid with cross-validation, then exposes the winner through best_params_, best_score_ and a refit best_estimator_.
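Example: a small sketch with an SVC; the grid values are arbitrary and chosen only to keep the search quick.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 3 values of C x 2 kernels = 6 candidates, each scored with 5-fold CV.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_model = search.best_estimator_  # refit on all of X, y by default
n_candidates = len(search.cv_results_["params"])
print(best_params, f"{search.best_score_:.3f}")
```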
8
When would you prefer RandomizedSearchCV over GridSearchCV?
📊 Intermediate
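Answer: When the parameter space is large or includes continuous ranges. RandomizedSearchCV samples a fixed number of combinations (n_iter) from lists or distributions, so its cost is set by your budget rather than by the size of the grid.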
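Example: a sketch that samples 10 candidates from log-uniform ranges. The ranges and n_iter are assumptions made for the demo, and scipy is assumed to be available (it is a scikit-learn dependency).

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Distributions instead of fixed lists: values are drawn at random.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
}

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions,
    n_iter=10,       # number of sampled candidates
    cv=5,
    random_state=0,  # makes the sampled candidates reproducible
)
search.fit(X, y)
print(search.best_params_)
```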
9
How do you handle class imbalance in sklearn classifiers?
📊 Intermediate
Answer: Use class_weight='balanced' (where supported), resample with imbalanced-learn, or move the decision threshold, and evaluate with metrics such as precision, recall, F1 or PR AUC rather than accuracy.
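Example: a sketch comparing minority-class recall for a plain model, a class-weighted model, and a lowered threshold. The synthetic data and the 0.2 threshold are assumptions for the demo, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))

# Threshold adjustment: flag a positive whenever P(class 1) >= 0.2
# instead of the default 0.5 cut-off.
proba = plain.predict_proba(X_test)[:, 1]
recall_low_threshold = recall_score(y_test, (proba >= 0.2).astype(int))

print(recall_plain, recall_weighted, recall_low_threshold)
```

Both class weighting and a lower threshold typically raise recall at the cost of precision, so check both metrics before choosing.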
10
Why should preprocessing be inside the pipeline rather than done beforehand?
🔥 Advanced
Answer: Putting preprocessing in the pipeline ensures it is fit only on training folds during CV, preventing data leakage.
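Example: a sketch of the leakage effect on pure-noise data. Feature selection (SelectKBest) stands in here for "preprocessing that is fit on data"; it is chosen because it makes the effect easy to see. Since the labels carry no signal, an honest estimate should land near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2000))  # noise features, unrelated to y
y = np.array([0, 1] * 25)        # balanced labels

# Leaky: the selector sees every row, including future validation rows.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Honest: the selector is refit inside each training fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky: {leaky_score:.2f}  honest: {honest_score:.2f}")
```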
11
How do you save and load trained sklearn models?
⚡ Beginner
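Answer: Typically using joblib.dump and joblib.load (or pickle with care). Load with the same scikit-learn version that saved the model, and never unpickle files from untrusted sources.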
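Example: a minimal round-trip sketch that writes to a temporary directory. The file name "model.joblib" is an arbitrary choice.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)       # serialize the fitted model
    restored = joblib.load(path)   # load it back into memory

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```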
12
What are some key preprocessing utilities in sklearn?
⚡ Beginner
Answer: Important transformers: StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder, SimpleImputer, PolynomialFeatures.
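Example: a sketch applying a few of these transformers to tiny hand-made arrays, so the effect of each is easy to check by eye.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, PolynomialFeatures

X_num = np.array([[1.0], [np.nan], [3.0]])

# SimpleImputer: the missing value becomes the column mean, 2.0.
imputed = SimpleImputer(strategy="mean").fit_transform(X_num)

# MinMaxScaler: rescales the column to [0, 1], giving 0.0, 0.5, 1.0.
scaled = MinMaxScaler().fit_transform(imputed)

# PolynomialFeatures: adds the squared term, so columns are x and x^2.
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(imputed)

# OrdinalEncoder: maps categories to integers in the order given.
sizes = np.array([["small"], ["large"], ["medium"]])
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded = encoder.fit_transform(sizes)  # small=0, medium=1, large=2

print(imputed.ravel(), scaled.ravel(), poly.tolist(), encoded.ravel())
```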
13
How do you access model coefficients or feature importances in sklearn?
📊 Intermediate
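Answer: After fitting, many linear models expose coef_ (and intercept_), while tree-based models expose feature_importances_. For a model-agnostic view, sklearn.inspection.permutation_importance works with any fitted estimator.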
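Example: a sketch reading both attributes after fitting on iris. The models are picked only as one linear and one tree-based representative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

data = load_iris()
X, y = data.data, data.target

linear = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# coef_ has one row per class and one column per feature.
coef_shape = linear.coef_.shape

# feature_importances_ has one value per feature and sums to 1.
importances = dict(zip(data.feature_names, forest.feature_importances_))

print(coef_shape)
for name, value in importances.items():
    print(f"{name}: {value:.3f}")
```

Inside a Pipeline, reach the fitted model first (for example through named_steps) and read the attribute from there.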
14
What is the purpose of the random_state parameter?
⚡ Beginner
Answer: random_state seeds the random number generator used for shuffling, splitting and stochastic training steps, so passing the same integer reproduces the same splits and models.
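Example: a short sketch showing that two splits with the same seed are identical. The value 42 is an arbitrary seed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same seed, same shuffle, same test rows.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

same_split = bool((X_test_a == X_test_b).all())
print(same_split)
```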
15
How do you create a custom transformer in sklearn?
🔥 Advanced
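Answer: Subclass BaseEstimator and TransformerMixin, store hyperparameters unchanged in __init__, implement fit (learn any state into attributes ending in an underscore and return self) and transform. TransformerMixin supplies fit_transform for free.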
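Example: a sketch of a custom transformer. ClipOutliers is a hypothetical class invented for this illustration; it clips each column to quantiles learned during fit.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


# Mixins are listed before BaseEstimator, the order scikit-learn recommends.
class ClipOutliers(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clip columns to quantiles learned in fit."""

    def __init__(self, lower=0.05, upper=0.95):
        # Store hyperparameters as-is so get_params and clone work.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned state uses a trailing underscore by convention.
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.low_, self.high_)


X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
clipper = ClipOutliers(lower=0.0, upper=0.75)
clipped = clipper.fit_transform(X)  # fit_transform comes from TransformerMixin
print(clipped.ravel())
```

Because the class follows the estimator conventions, it can be dropped into a Pipeline or ColumnTransformer like any built-in transformer.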
16
How do you handle time series with sklearn to avoid leakage?
🔥 Advanced
Answer: Use TimeSeriesSplit or custom CV, create lag features, and ensure all transforms use only past data in each fold.
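Example: a sketch with a toy series and a single lag-1 feature. The series is made up; the point is only that every training index precedes every test index.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(13, dtype=float)  # toy series, ordered in time
X = series[:-1].reshape(-1, 1)       # lag-1 feature: previous value
y = series[1:]                       # target: current value

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

# Each fold trains on the past and tests on the block that follows it.
train_before_test = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in folds
)
print(train_before_test)
```

Passing tscv as the cv argument of cross_val_score or GridSearchCV, with preprocessing inside a Pipeline, keeps scalers and imputers fit on past data only.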
17
When would you choose sklearn over deep learning frameworks?
📊 Intermediate
Answer: For tabular data, smaller datasets, quicker iteration and simpler deployment, sklearn models are often the best choice.
18
Give an example of a full sklearn workflow from raw data to model.
🔥 Advanced
Answer: Typical flow: train_test_split → ColumnTransformer (impute+scale/encode) → Pipeline with model → cross-validation / GridSearchCV → fit best model → evaluate on test.
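Example: a compact sketch of that flow on synthetic data. The column names, the target rule, and the grid over C are all invented for the demo, and pandas is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data: two numeric columns, one categorical column.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50000, 15000, n),
    "plan": rng.choice(["basic", "plus", "pro"], n),
})
y = (df["age"] + rng.normal(0, 5, n) > 40).astype(int)
# Knock out 10% of incomes to give the imputer something to do.
df.loc[df.sample(frac=0.1, random_state=0).index, "income"] = np.nan

# 1. Hold out a test set before any fitting.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)

# 2. Per-column preprocessing.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

# 3. Preprocessing and model in one pipeline.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 4. Tune with cross-validation on the training set only.
search = GridSearchCV(model, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# 5. Evaluate the refit best pipeline once on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, f"{test_accuracy:.3f}")
```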
19
What are some common mistakes when using sklearn?
🔥 Advanced
Answer: Common mistakes: data leakage from preprocessing outside pipelines, improper CV, not scaling when needed, ignoring class imbalance.
20
What is the key message to remember about scikit-learn?
⚡ Beginner
Answer: Scikit-learn provides a clean, consistent API; mastering estimators, pipelines and CV lets you build robust ML workflows quickly.
Quick Recap: Scikit-Learn
Think in terms of transformers + estimators + pipelines; this mindset helps you structure nearly any classical ML project in sklearn.