Model Deployment & MLOps
20 Essential Q/A
MLOps Interview Prep
Model Deployment: 20 Interview Questions
Master production ML: serving frameworks (TF Serving, TorchServe, ONNX), containerization (Docker), orchestration (K8s), monitoring, drift detection, CI/CD, A/B testing, and edge deployment. Concise, interview-ready answers.
TensorFlow Serving
ONNX
Docker
Kubernetes
Model Monitoring
CI/CD
Edge AI
1
What does model deployment mean in ML? Key challenges?
⚡ Easy
Answer: Deployment is integrating a trained model into a production environment to serve predictions. Challenges: latency, scalability, versioning, monitoring, reproducibility, data drift, and infrastructure.
2
Difference between batch inference and online (real-time) inference?
📊 Medium
Answer: Batch: periodic, large-scale, high throughput, no immediate response (e.g., nightly recommendations). Online: low-latency (<100ms), real-time, REST/gRPC endpoints (e.g., fraud detection). Trade-off: cost vs responsiveness.
3
What is TensorFlow Serving? How does it handle versioning?
🔥 Hard
Answer: TF Serving is a high-performance serving system for TensorFlow models. It supports versioned model repositories (filesystem paths), loads the newest version by default, enables zero-downtime rollbacks, and manages the model lifecycle (loading/unloading) dynamically.
model_repository/
└── my_model/
    ├── 1/                  # version 1
    │   └── saved_model.pb
    └── 2/                  # version 2 (active)
        └── saved_model.pb
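TF Serving's default "latest" version policy (serve the highest-numbered subdirectory) can be sketched in a few lines. This is a simplified stand-in for illustration, not TF Serving's actual implementation:

```python
from pathlib import Path

def pick_active_version(model_dir: str) -> int:
    """Mimic TF Serving's default policy: serve the highest-numbered
    version subdirectory found under the model's repository path."""
    versions = [int(p.name) for p in Path(model_dir).iterdir()
                if p.is_dir() and p.name.isdigit()]
    if not versions:
        raise FileNotFoundError(f"no model versions under {model_dir}")
    return max(versions)
```

Dropping a `3/` directory into the repository would make version 3 active on the next poll, which is what enables zero-downtime upgrades and rollbacks.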
4
What is ONNX and when would you use it?
📊 Medium
Answer: Open Neural Network Exchange (ONNX) is an open format for model interoperability. Convert models from PyTorch, TF, etc. to a unified format. Use when you need framework-agnostic serving or deploy to hardware-specific runtimes (ONNX Runtime, Intel OpenVINO).
5
Why is Docker important for model deployment?
📊 Medium
Answer: Docker provides environment reproducibility: encapsulates model, dependencies, and system libraries. Solves "works on my machine" problem. Enables consistent deployment across dev, staging, prod, and scales with orchestrators (K8s).
# Extend the official TF Serving image with the model baked in
FROM tensorflow/serving
COPY ./my_model /models/my_model/1   # version 1 of the model
ENV MODEL_NAME=my_model              # served at /v1/models/my_model
6
What role does Kubernetes play in deploying ML models?
🔥 Hard
Answer: K8s orchestrates containerized models: auto-scaling based on load, rolling updates, self-healing, service discovery. Tools like KServe (formerly KFServing) and Seldon Core build on K8s for ML-specific inference workloads with canary deployments and explainability.
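A minimal Deployment manifest shows the pieces K8s manages (names are illustrative; the image is a TF Serving container, whose REST port is 8501):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model                    # illustrative name
spec:
  replicas: 3                       # K8s keeps 3 pods alive (self-healing)
  selector:
    matchLabels: {app: my-model}
  template:
    metadata:
      labels: {app: my-model}
    spec:
      containers:
      - name: serving
        image: tensorflow/serving   # or a custom image with the model baked in
        ports:
        - containerPort: 8501       # TF Serving REST port
```

A Service in front of these pods gives a stable endpoint, and a HorizontalPodAutoscaler can grow `replicas` under load.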
7
Compare REST and gRPC for model inference endpoints.
📊 Medium
Answer:
- REST: HTTP/1.1, JSON, widely supported, larger payloads, browser-compatible.
- gRPC: HTTP/2, Protocol Buffers, smaller/faster payloads, streaming, strict typing. Preferred for high-performance microservices (TensorFlow Serving and TorchServe support both).
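The payload-size difference can be illustrated with stdlib Python. `struct.pack` here is a simplified stand-in for Protocol Buffers, not the real wire format:

```python
import json
import struct

features = [0.12, 3.4, 5.6, 7.8]

# REST-style: JSON text payload
json_payload = json.dumps({"instances": [features]}).encode()

# gRPC-style: fixed binary framing (4 float32 values = 16 bytes)
binary_payload = struct.pack(f"{len(features)}f", *features)

# The binary payload is a fraction of the JSON size, and there is no
# text parsing on the server side.
```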
8
What is data drift and concept drift? How to detect them?
🔥 Hard
Answer:
- Data drift: input distribution changes (e.g., user demographics). Detect via statistical tests (KS-test, PSI) or feature distribution monitoring.
- Concept drift: relationship input->target changes (e.g., new fraud patterns). Detect via accuracy drop, prediction distribution shift.
PSI = Σ (Actual% - Expected%) * ln(Actual%/Expected%)
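The PSI formula above translates directly to code; bucket proportions are assumed precomputed from the expected (training) and actual (live) distributions:

```python
import math

def psi(expected, actual):
    """Population Stability Index over matched bucket proportions.
    expected/actual: lists of bucket proportions that each sum to 1."""
    eps = 1e-6  # guard against log(0) / division by zero
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift
```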
9
How do you A/B test a new model in production?
🔥 Hard
Answer: Route a percentage of traffic (e.g., 10%) to the new model (B), rest to current (A). Define success metrics (CTR, conversion). Use statistical significance to decide rollout. Infrastructure: feature flags, Istio (traffic splitting), or model serving proxies.
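A common way to implement sticky traffic splitting is hashing a stable user ID, so each user consistently sees the same variant. A minimal sketch, assuming a 10% rollout to model B:

```python
import hashlib

def assign_variant(user_id: str, b_traffic: float = 0.10) -> str:
    """Deterministic, sticky A/B split: the same user always hits the
    same model. b_traffic is the fraction routed to the new model B."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "B" if bucket < b_traffic else "A"
```

In practice the same split is usually handled by feature-flag services or an Istio `VirtualService` weight, but the hashing idea is identical.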
10
What is TorchServe? How does it differ from TF Serving?
📊 Medium
Answer: TorchServe is PyTorch's official serving framework. Features: multi-model serving, REST/gRPC endpoints, model versioning, logging, and built-in default handlers. Unlike TF Serving, it's PyTorch-native but conceptually similar. Both support ONNX export.
11
When would you use FastAPI/Flask instead of dedicated model servers?
📊 Medium
Answer: Use for custom pre/post-processing, lightweight deployments, or when serving non-standard models. FastAPI is modern (async, OpenAPI docs, high performance). Downside: you must handle scaling, versioning, monitoring yourself.
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load("model.pkl")

@app.post("/predict")
def predict(features: list[float]):
    # sklearn expects a 2D array; convert the ndarray result for JSON
    return {"prediction": model.predict([features]).tolist()}
12
What is model quantization? Why is it used in deployment?
🔥 Hard
Answer: Quantization reduces model precision (e.g., FP32 → INT8) to decrease model size and inference latency, crucial for edge/mobile. Techniques: post-training quantization, quantization-aware training. Trade-off: minor accuracy loss.
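A toy affine quantization round-trip (a single scale and zero-point, as in basic post-training quantization) illustrates the size/precision trade-off:

```python
def quantize_int8(values):
    """Sketch of affine quantization: map floats onto int8 [-128, 127]
    using one scale and zero-point for the whole tensor."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0          # fall back if all values equal
    zero_point = round(-128 - lo / scale)   # so that lo maps to -128
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats; error is bounded by the scale."""
    return [(qi - zero_point) * scale for qi in q]
```

Each weight shrinks from 4 bytes to 1, and integer arithmetic is much faster on most edge hardware; the reconstruction error per value stays within one quantization step.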
13
How does CI/CD for ML differ from traditional CI/CD?
🔥 Hard
Answer: Traditional CI/CD tests code. ML CI/CD (CT: Continuous Training) also tests data and models: data validation, model evaluation, feature integrity, and model versioning. Tools: Jenkins, GitLab CI + DVC, MLflow, Kubeflow.
- Automated retraining pipelines.
- Model reproducibility is harder.
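The model-evaluation step of such a pipeline is often a simple promotion gate. A minimal sketch (the metric name and tolerance are illustrative):

```python
def should_promote(candidate: dict, baseline: dict,
                   metric: str = "auc", tolerance: float = 0.0) -> bool:
    """CI gate: promote the candidate model only if its evaluation
    metric matches or beats the baseline's (minus an optional tolerance)."""
    return candidate[metric] >= baseline[metric] - tolerance
```

In a real pipeline this check runs on a held-out evaluation set after training, and a failing gate blocks the registry promotion rather than the build.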
14
How do you version ML models in production?
📊 Medium
Answer:
- Model registry (MLflow, DVC, S3 with versioning): store model artifacts + metadata (hyperparameters, metrics).
- Semantic versioning or timestamped builds.
- Serve multiple versions simultaneously for shadow testing.
15
What is a shadow deployment? Why use it?
🔥 Hard
Answer: Shadow (mirror) deployment: new model receives live traffic copy but predictions aren't served to users. Compare performance offline without risk. Validates stability and accuracy before full rollout.
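A shadow call can be sketched as a wrapper that serves only the primary model's result; `primary` and `shadow` here stand for any prediction callables:

```python
import logging

def predict_with_shadow(request, primary, shadow,
                        log=logging.getLogger("shadow")):
    """Serve the primary model's prediction; the shadow model sees a
    copy of the same traffic, and disagreements are logged for review."""
    result = primary(request)
    try:
        shadow_result = shadow(request)
        if shadow_result != result:
            log.info("disagreement on %r: %r vs %r",
                     request, result, shadow_result)
    except Exception:
        log.exception("shadow model failed")  # must never impact the user
    return result
```

The key property: a crashing or slow shadow model cannot affect the response the user receives.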
16
What are common frameworks for edge model deployment?
📊 Medium
Answer:
- TFLite: mobile/embedded (Android, iOS).
- CoreML: Apple devices.
- TensorRT: NVIDIA GPU optimization.
- ONNX Runtime: cross-platform.
- OpenVINO: Intel hardware.
17
How do you serve multiple models efficiently on the same infrastructure?
🔥 Hard
Answer:
- Model servers (TF Serving, MLServer) support loading multiple models.
- Sidecar pattern: each model in separate container, orchestrated.
- Model caching for frequently used models.
- Model ensembles combined in single deployment.
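The model-caching idea can be sketched with `functools.lru_cache`; `load_model` below is a toy stand-in for an expensive `joblib`/`torch` load:

```python
from functools import lru_cache

LOADS = []  # records which models were actually loaded (illustration only)

def load_model(name: str):
    """Toy stand-in for an expensive deserialization from disk/registry."""
    LOADS.append(name)
    return lambda x: f"{name}:{x}"

@lru_cache(maxsize=4)  # keep at most 4 models resident; LRU eviction beyond
def get_model(name: str):
    return load_model(name)
```

Repeated requests for the same model hit the cache instead of reloading; with many models and limited memory, the `maxsize` bound trades cold-start latency for footprint.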
18
How do you provide explanations with model predictions in production?
🔥 Hard
Answer: Integrate post-hoc explainability libraries: SHAP, LIME. Precompute explanations or serve as endpoints. For regulatory/compliance (e.g., credit scoring). Use serving tools like Seldon Alibi or Azure ML explainability SDK.
19
What is a feature store? Why is it important for deployment?
🔥 Hard
Answer: Feature store (Feast, Tecton) centralizes feature engineering, ensures training-serving consistency (same logic applied online/offline), low-latency feature retrieval, and reusability across teams. Avoids training/serving skew.
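The training-serving consistency idea reduces to sharing one feature function between both paths. A toy in-memory stand-in for a store like Feast, with hypothetical feature names:

```python
import math

def compute_features(raw: dict) -> dict:
    """Single definition of feature logic, called by BOTH the offline
    training pipeline and the online endpoint -- this shared code path
    is what prevents training/serving skew."""
    return {
        "amount_log": math.log1p(raw["amount"]),
        "is_weekend": int(raw["day_of_week"] >= 5),
    }

online_store: dict[str, dict] = {}  # entity_id -> latest feature values

def materialize(entity_id: str, raw: dict) -> None:
    """Precompute features into the online store for low-latency lookup."""
    online_store[entity_id] = compute_features(raw)
```

Real feature stores add versioned definitions, point-in-time-correct offline joins, and a low-latency backend (e.g., Redis) for the online path.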
20
Sketch a complete model deployment pipeline (MLOps).
🔥 Hard
Answer:
- Data validation (TFX, Great Expectations).
- Training/experiment tracking (MLflow).
- Model evaluation (compare to baseline).
- Model registry (promote to staging).
- Containerization (Docker).
- Deployment to staging, integration tests.
- Canary/shadow deployment in prod.
- Monitoring (drift, performance).
- Continuous retraining trigger.
Model Deployment – Interview Cheat Sheet
Serving
- TF Serving: TensorFlow
- TorchServe: PyTorch
- ONNX: interoperability
Container/Orch
- Docker: reproducibility
- K8s: scaling, self-healing
Monitoring
- Data/concept drift
- PSI: Population Stability Index
- Evidently: open source
Strategies
- A/B: traffic split
- Shadow: mirror traffic
- Canary: gradual rollout
Verdict: "Deployment is not just serving – it's monitoring, versioning, scaling, and continuous validation. MLOps bridges data science and engineering."