Model Deployment & MLOps
20 Essential Q/A
MLOps Interview Prep
Model Deployment: 20 Interview Questions
Master production ML: serving frameworks (TF Serving, TorchServe, ONNX), containerization (Docker), orchestration (K8s), monitoring, drift detection, CI/CD, A/B testing, and edge deployment. Concise, interview-ready answers.
TensorFlow Serving
ONNX
Docker
Kubernetes
Model Monitoring
CI/CD
Edge AI
1
What does model deployment mean in ML? Key challenges?
⚡ Easy
Answer: Deployment is integrating a trained model into a production environment to serve predictions. Challenges: latency, scalability, versioning, monitoring, reproducibility, data drift, and infrastructure.
2
Difference between batch inference and online (real-time) inference?
📊 Medium
Answer: Batch: periodic, large-scale, high throughput, no immediate response (e.g., nightly recommendations). Online: low-latency (<100ms), real-time, REST/gRPC endpoints (e.g., fraud detection). Trade-off: cost vs responsiveness.
3
What is TensorFlow Serving? How does it handle versioning?
🔥 Hard
Answer: TF Serving is a high-performance serving system for TensorFlow models. It supports versioned model repositories (filesystem paths), loads the newest version by default, enables zero-downtime rollbacks, and manages the model lifecycle (loading/unloading) dynamically.
model_repository/
└── my_model/
    ├── 1/                  # version 1
    │   └── saved_model.pb
    └── 2/                  # version 2 (active)
        └── saved_model.pb
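TF Serving's default "latest" version policy (serve the highest-numbered subdirectory) can be sketched in a few lines. This is a simplified stand-in for illustration, not TF Serving's actual implementation:

```python
from pathlib import Path

def pick_active_version(model_dir: str) -> int:
    """Mimic TF Serving's default policy: serve the highest-numbered
    version subdirectory found under the model's repository path."""
    versions = [int(p.name) for p in Path(model_dir).iterdir()
                if p.is_dir() and p.name.isdigit()]
    if not versions:
        raise FileNotFoundError(f"no model versions under {model_dir}")
    return max(versions)
```

Dropping a `3/` directory into the repository would make version 3 active on the next poll, which is what enables zero-downtime upgrades and rollbacks.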
4
What is ONNX and when would you use it?
📊 Medium
Answer: Open Neural Network Exchange (ONNX) is an open format for model interoperability. Convert models from PyTorch, TF, etc. to a unified format. Use when you need framework-agnostic serving or deploy to hardware-specific runtimes (ONNX Runtime, Intel OpenVINO).
5
Why is Docker important for model deployment?
📊 Medium
Answer: Docker provides environment reproducibility: encapsulates model, dependencies, and system libraries. Solves "works on my machine" problem. Enables consistent deployment across dev, staging, prod, and scales with orchestrators (K8s).
# Extend the official TF Serving image with the model baked in
FROM tensorflow/serving
COPY ./my_model /models/my_model/1   # version 1 of the model
ENV MODEL_NAME=my_model              # served at /v1/models/my_model
6
What role does Kubernetes play in deploying ML models?
🔥 Hard
Answer: K8s orchestrates containerized models: auto-scaling based on load, rolling updates, self-healing, service discovery. Tools like KServe (formerly KFServing) and Seldon Core build on K8s for ML-specific inference workloads with canary deployments and explainability.
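A minimal Deployment manifest shows the pieces K8s manages (names are illustrative; the image is a TF Serving container, whose REST port is 8501):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model                    # illustrative name
spec:
  replicas: 3                       # K8s keeps 3 pods alive (self-healing)
  selector:
    matchLabels: {app: my-model}
  template:
    metadata:
      labels: {app: my-model}
    spec:
      containers:
      - name: serving
        image: tensorflow/serving   # or a custom image with the model baked in
        ports:
        - containerPort: 8501       # TF Serving REST port
```

A Service in front of these pods gives a stable endpoint, and a HorizontalPodAutoscaler can grow `replicas` under load.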
7
Compare REST and gRPC for model inference endpoints.
📊 Medium
Answer:
- REST: HTTP/1.1, JSON, widely supported, larger payloads, browser-compatible.
- gRPC: HTTP/2, Protocol Buffers, smaller/faster payloads, streaming, strict typing. Preferred for high-performance microservices (TensorFlow Serving and TorchServe support both).
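The payload-size difference can be illustrated with stdlib Python. `struct.pack` here is a simplified stand-in for Protocol Buffers, not the real wire format:

```python
import json
import struct

features = [0.12, 3.4, 5.6, 7.8]

# REST-style: JSON text payload
json_payload = json.dumps({"instances": [features]}).encode()

# gRPC-style: fixed binary framing (4 float32 values = 16 bytes)
binary_payload = struct.pack(f"{len(features)}f", *features)

# The binary payload is a fraction of the JSON size, and there is no
# text parsing on the server side.
```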
8
What is data drift and concept drift? How to detect them?
🔥 Hard
Answer:
- Data drift: input distribution changes (e.g., user demographics). Detect via statistical tests (KS-test, PSI) or feature distribution monitoring.
- Concept drift: relationship input->target changes (e.g., new fraud patterns). Detect via accuracy drop, prediction distribution shift.
PSI = Σ (Actual% - Expected%) * ln(Actual%/Expected%)
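The PSI formula above translates directly to code; bucket proportions are assumed precomputed from the expected (training) and actual (live) distributions:

```python
import math

def psi(expected, actual):
    """Population Stability Index over matched bucket proportions.
    expected/actual: lists of bucket proportions that each sum to 1."""
    eps = 1e-6  # guard against log(0) / division by zero
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift
```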
9
How do you A/B test a new model in production?
🔥 Hard
Answer: Route a percentage of traffic (e.g., 10%) to the new model (B), rest to current (A). Define success metrics (CTR, conversion). Use statistical significance to decide rollout. Infrastructure: feature flags, Istio (traffic splitting), or model serving proxies.
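A common way to implement sticky traffic splitting is hashing a stable user ID, so each user consistently sees the same variant. A minimal sketch, assuming a 10% rollout to model B:

```python
import hashlib

def assign_variant(user_id: str, b_traffic: float = 0.10) -> str:
    """Deterministic, sticky A/B split: the same user always hits the
    same model. b_traffic is the fraction routed to the new model B."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "B" if bucket < b_traffic else "A"
```

In practice the same split is usually handled by feature-flag services or an Istio `VirtualService` weight, but the hashing idea is identical.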
10
What is TorchServe? How does it differ from TF Serving?
📊 Medium
Answer: TorchServe is PyTorch's official serving framework. Features: multi-model serving, REST/gRPC endpoints, model versioning, logging, and built-in default handlers. Unlike TF Serving, it's PyTorch-native but conceptually similar. Both support ONNX export.
11
When would you use FastAPI/Flask instead of dedicated model servers?
📊 Medium
Answer: Use for custom pre/post-processing, lightweight deployments, or when serving non-standard models. FastAPI is modern (async, OpenAPI docs, high performance). Downside: you must handle scaling, versioning, monitoring yourself.
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load("model.pkl")

@app.post("/predict")
def predict(features: list[float]):
    # sklearn expects a 2D array; convert the ndarray result for JSON
    return {"prediction": model.predict([features]).tolist()}
12
What is model quantization? Why is it used in deployment?
🔥 Hard
Answer: Quantization reduces model precision (e.g., FP32 → INT8) to decrease model size and inference latency, crucial for edge/mobile. Techniques: post-training quantization, quantization-aware training. Trade-off: minor accuracy loss.
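A toy affine quantization round-trip (a single scale and zero-point, as in basic post-training quantization) illustrates the size/precision trade-off:

```python
def quantize_int8(values):
    """Sketch of affine quantization: map floats onto int8 [-128, 127]
    using one scale and zero-point for the whole tensor."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0          # fall back if all values equal
    zero_point = round(-128 - lo / scale)   # so that lo maps to -128
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats; error is bounded by the scale."""
    return [(qi - zero_point) * scale for qi in q]
```

Each weight shrinks from 4 bytes to 1, and integer arithmetic is much faster on most edge hardware; the reconstruction error per value stays within one quantization step.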
13
How does CI/CD for ML differ from traditional CI/CD?
🔥 Hard
Answer: Traditional CI/CD tests code. ML CI/CD (CT: Continuous Training) also tests data and models: data validation, model evaluation, feature integrity, and model versioning. Tools: Jenkins, GitLab CI + DVC, MLflow, Kubeflow.
- Automated retraining pipelines.
- Model reproducibility is harder.
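The model-evaluation step of such a pipeline is often a simple promotion gate. A minimal sketch (the metric name and tolerance are illustrative):

```python
def should_promote(candidate: dict, baseline: dict,
                   metric: str = "auc", tolerance: float = 0.0) -> bool:
    """CI gate: promote the candidate model only if its evaluation
    metric matches or beats the baseline's (minus an optional tolerance)."""
    return candidate[metric] >= baseline[metric] - tolerance
```

In a real pipeline this check runs on a held-out evaluation set after training, and a failing gate blocks the registry promotion rather than the build.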
14
How do you version ML models in production?
📊 Medium
Answer:
- Model registry (MLflow, DVC, S3 with versioning): store model artifacts + metadata (hyperparameters, metrics).
- Semantic versioning or timestamped builds.
- Serve multiple versions simultaneously for shadow testing.
15
What is a shadow deployment? Why use it?
🔥 Hard
Answer: Shadow (mirror) deployment: new model receives live traffic copy but predictions aren't served to users. Compare performance offline without risk. Validates stability and accuracy before full rollout.
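A shadow call can be sketched as a wrapper that serves only the primary model's result; `primary` and `shadow` here stand for any prediction callables:

```python
import logging

def predict_with_shadow(request, primary, shadow,
                        log=logging.getLogger("shadow")):
    """Serve the primary model's prediction; the shadow model sees a
    copy of the same traffic, and disagreements are logged for review."""
    result = primary(request)
    try:
        shadow_result = shadow(request)
        if shadow_result != result:
            log.info("disagreement on %r: %r vs %r",
                     request, result, shadow_result)
    except Exception:
        log.exception("shadow model failed")  # must never impact the user
    return result
```

The key property: a crashing or slow shadow model cannot affect the response the user receives.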
16
What are common frameworks for edge model deployment?
📊 Medium
Answer:
- TFLite: mobile/embedded (Android, iOS).
- CoreML: Apple devices.
- TensorRT: NVIDIA GPU optimization.
- ONNX Runtime: cross-platform.
- OpenVINO: Intel hardware.
17
How do you serve multiple models efficiently on the same infrastructure?
🔥 Hard
Answer:
- Model servers (TF Serving, MLServer) support loading multiple models.
- Sidecar pattern: each model in separate container, orchestrated.
- Model caching for frequently used models.
- Model ensembles combined in single deployment.
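The model-caching idea can be sketched with `functools.lru_cache`; `load_model` below is a toy stand-in for an expensive `joblib`/`torch` load:

```python
from functools import lru_cache

LOADS = []  # records which models were actually loaded (illustration only)

def load_model(name: str):
    """Toy stand-in for an expensive deserialization from disk/registry."""
    LOADS.append(name)
    return lambda x: f"{name}:{x}"

@lru_cache(maxsize=4)  # keep at most 4 models resident; LRU eviction beyond
def get_model(name: str):
    return load_model(name)
```

Repeated requests for the same model hit the cache instead of reloading; with many models and limited memory, the `maxsize` bound trades cold-start latency for footprint.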
18
How do you provide explanations with model predictions in production?
🔥 Hard
Answer: Integrate post-hoc explainability libraries: SHAP, LIME. Precompute explanations or serve as endpoints. For regulatory/compliance (e.g., credit scoring). Use serving tools like Seldon Alibi or Azure ML explainability SDK.
19
What is a feature store? Why is it important for deployment?
🔥 Hard
Answer: Feature store (Feast, Tecton) centralizes feature engineering, ensures training-serving consistency (same logic applied online/offline), low-latency feature retrieval, and reusability across teams. Avoids training/serving skew.
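The training-serving consistency idea reduces to sharing one feature function between both paths. A toy in-memory stand-in for a store like Feast, with hypothetical feature names:

```python
import math

def compute_features(raw: dict) -> dict:
    """Single definition of feature logic, called by BOTH the offline
    training pipeline and the online endpoint -- this shared code path
    is what prevents training/serving skew."""
    return {
        "amount_log": math.log1p(raw["amount"]),
        "is_weekend": int(raw["day_of_week"] >= 5),
    }

online_store: dict[str, dict] = {}  # entity_id -> latest feature values

def materialize(entity_id: str, raw: dict) -> None:
    """Precompute features into the online store for low-latency lookup."""
    online_store[entity_id] = compute_features(raw)
```

Real feature stores add versioned definitions, point-in-time-correct offline joins, and a low-latency backend (e.g., Redis) for the online path.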
20
Sketch a complete model deployment pipeline (MLOps).
🔥 Hard
Answer:
- Data validation (TFX, Great Expectations).
- Training/experiment tracking (MLflow).
- Model evaluation (compare to baseline).
- Model registry (promote to staging).
- Containerization (Docker).
- Deployment to staging, integration tests.
- Canary/shadow deployment in prod.
- Monitoring (drift, performance).
- Continuous retraining trigger.
Model Deployment – Interview Cheat Sheet
Serving
- TF Serving: TensorFlow
- TorchServe: PyTorch
- ONNX: interoperability
Container/Orch
- Docker: reproducibility
- K8s: scaling, self-healing
Monitoring
- Data/concept drift
- PSI: Population Stability Index
- Evidently: open source
Strategies
- A/B: traffic split
- Shadow: mirror traffic
- Canary: gradual rollout
Verdict: "Deployment is not just serving – it's monitoring, versioning, scaling, and continuous validation. MLOps bridges data science and engineering."