Computer Vision Interview
40 Q&A
Chapter 18
OCR & Autonomous Driving — Interview Q&A
Optical character recognition and computer vision stacks for autonomous vehicles.
Optical Character Recognition: 20 Essential Q&A
1
What is OCR?
⚡ easy
Answer: Converting images of text into machine-encoded text—includes layout, detection, and reading order for documents or natural scenes.
2
Detection vs recognition?
📊 medium
Answer: Detection finds where text is (boxes or polygons); recognition reads which characters each region contains. The two are often separate stages, though unified models exist.
3
Scene text difficulties?
📊 medium
Answer: Arbitrary orientation, varied fonts, uneven lighting, perspective distortion, small size, and background clutter, none of which are typical of clean scanned pages.
4
How does Tesseract work (classic)?
📊 medium
Answer: Adaptive thresholding, connected components, line/word finding, then LSTM-based recognizer in modern versions—strong on clean scans.
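import pytesseract  # Python wrapper; requires the Tesseract binary to be installed
text = pytesseract.image_to_string(img)  # img: a PIL image or NumPy array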
5
Preprocessing?
⚡ easy
Answer: Deskewing, denoising, binarization, and contrast normalization improve classical OCR; deep models learn many of these invariances but still benefit from sane crops.
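As a rough illustration of the binarization step, the sketch below implements Otsu's global threshold in plain Python on a flat list of 8-bit grayscale values. The names `otsu_threshold` and `binarize` are illustrative only; in practice an image library would normally handle this.

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with value <= t
    sum_bg = 0  # intensity sum of those pixels
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map values above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

A global threshold like this suits evenly lit scans; unevenly lit pages usually call for an adaptive (local) threshold instead.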
6
Character segmentation?
🔥 hard
Answer: Splitting cursive or touching characters is hard—sequence models avoid explicit per-char cuts via CTC or attention.
7
CRNN?
📊 medium
Answer: CNN feature extractor → RNN (e.g. BiLSTM) over the feature sequence → CTC or attention decoding. It is the classic pipeline for horizontal text lines; curved text usually needs rectification first.
8
What is CTC?
🔥 hard
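Answer: A loss that aligns variable-length outputs to labels without per-timestep alignment by summing over all valid alignments. Decoding merges consecutive repeats and then drops the blank symbol, so a blank between two identical characters preserves a genuine double letter. It fits OCR because the output length is usually far shorter than the number of input feature columns.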
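The collapse rule used when decoding CTC output can be sketched in a few lines. This is a minimal greedy (best-path) decoder, assuming `-` as the blank symbol and a hypothetical `frame_scores` list holding one score vector per timestep; beam search with a language model is the usual stronger alternative.

```python
BLANK = "-"  # assumed blank symbol for this illustration


def ctc_collapse(path, blank=BLANK):
    """Merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Pick the top-scoring symbol at every timestep, then collapse the path."""
    path = [alphabet[max(range(len(s)), key=s.__getitem__)] for s in frame_scores]
    return ctc_collapse(path, blank)


print(ctc_collapse("hh-e-ll-ll-oo"))  # hello
print(ctc_collapse("hheelloo"))       # helo (no blank between the two l's)
```

The second call shows why the blank matters: without it, a doubled letter is indistinguishable from one letter spanning several frames.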
9
Attention decoders?
📊 medium
Answer: Autoregressive prediction with visual attention over feature map—handles irregular scripts; slower than CTC but flexible.
10
EAST / DB?
📊 medium
Answer: Single-shot scene-text detectors. EAST regresses rotated boxes or quadrangles per pixel; DB (differentiable binarization) segments shrunk text regions and expands them into instances. Both are fast.
11
What is ICDAR?
⚡ easy
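Answer: International Conference on Document Analysis and Recognition; its competition and benchmark series for document and scene text set the standard metrics, typically precision/recall/Hmean for detection and edit distance or word accuracy for recognition.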
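Detection tracks in this benchmark family are usually scored with precision, recall, and their harmonic mean (Hmean) after matching predictions to ground truth by IoU. The sketch below is a simplified version that assumes axis-aligned boxes `(x1, y1, x2, y2)` and greedy one-to-one matching; official protocols work on polygons and handle "don't care" regions, which are omitted here.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_hmean(preds, gts, iou_thresh=0.5):
    """Greedily match each prediction to its best unmatched ground-truth box."""
    matched = set()
    true_pos = 0
    for p in preds:
        best_j, best_iou = None, iou_thresh
        for j, g in enumerate(gts):
            if j in matched:
                continue
            value = iou(p, g)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matched.add(best_j)
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(gts) if gts else 0.0
    denom = precision + recall
    hmean = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, hmean
```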
12
Multilingual OCR?
📊 medium
Answer: Separate language models, script-specific normalizers, or Unicode output layer—training data must cover target scripts.
13
Document layout?
📊 medium
Answer: Tables, columns, and reading order need layout analysis (Detectron-style detectors or layout-aware transformers such as LayoutLM) on top of line-level OCR.
14
End-to-end OCR?
🔥 hard
Answer: One network predicts boxes and text together (e.g. some transformer detectors)—reduces error propagation between stages.
15
Synthetic data?
⚡ easy
Answer: Render text on random backgrounds to pretrain detection and recognition models; the domain gap to real photos means fine-tuning on real data is still needed.
16
Metrics?
📊 medium
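Answer: Recognition uses character error rate (CER), word error rate (WER), and normalized edit distance. Detection matches predictions to ground truth by IoU and reports precision, recall, and Hmean; end-to-end evaluation also requires the transcription to match.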
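CER and WER both reduce to an edit distance divided by the reference length. A minimal sketch, assuming whitespace tokenization for words and no text normalization (real evaluations often normalize case and punctuation first):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits per reference character."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate: word edits per reference word."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```

For the reference `STOP SIGN` and hypothesis `ST0P SIGN`, one substituted character gives a CER of 1/9 but a WER of 1/2, which is why the two metrics are usually reported together.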
17
Handling blur/skew?
📊 medium
Answer: Super-resolution, rectification networks, or train with aggressive augmentations—geometric augment critical for robustness.
18
Handwriting?
🔥 hard
Answer: Higher intra-class variability—needs writer-independent features, larger datasets (IAM), often HMM/CTC or seq2seq.
19
Deployment?
⚡ easy
Answer: ONNX/TensorRT for speed; batch line images; language models for post-correction in search/product pipelines.
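Post-correction is often done with a full language model, but the idea can be sketched with a plain lexicon lookup. The example below uses the standard library's `difflib.get_close_matches`; the function name `correct_with_lexicon`, the lexicon, and the cutoff are assumptions chosen for illustration.

```python
import difflib


def correct_with_lexicon(words, lexicon, cutoff=0.75):
    """Replace each out-of-lexicon word with its closest lexicon entry, if any."""
    corrected = []
    for word in words:
        if word in lexicon:
            corrected.append(word)
            continue
        match = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return corrected


lexicon = ["invoice", "total", "amount"]
print(correct_with_lexicon(["inv0ice", "tota1", "xyz"], lexicon))
```

Words with no sufficiently close match are left untouched, which keeps the corrector from overwriting codes, names, or numbers that are simply absent from the lexicon.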
20
TrOCR-style?
📊 medium
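Answer: Transformer image encoder plus autoregressive text decoder, pretrained on large-scale synthetic text images and then fine-tuned. It recognizes cropped printed or handwritten lines well without a CNN backbone or CTC, but still needs text detection upstream.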
Autonomous Vehicles (CV): 20 Essential Q&A
21
What does perception do in AVs?
⚡ easy
Answer: Estimate drivable space, lanes, traffic actors, signs, and hazards from sensors to support planning and control.
22
Camera vs LiDAR vs radar?
📊 medium
Answer: Camera: rich semantics, cheap; LiDAR: accurate range, weather limits; radar: velocity, robust weather—stacks often fuse all three.
23
Sensor fusion levels?
🔥 hard
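Answer: Early (raw data), mid-level (features), and late (object or decision level) fusion. Earlier fusion demands tighter calibration and synchronization; later fusion is more robust to a single-sensor failure but discards cross-sensor detail.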
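Late fusion is the simplest level to illustrate. The sketch below combines independent per-sensor range estimates by inverse-variance weighting; the function name and the example variances are hypothetical, and a production stack would typically do this inside a tracking filter rather than per frame.

```python
def fuse_estimates(estimates):
    """Fuse (value, variance) pairs from independent sensors.

    Each estimate is weighted by 1 / variance, so more certain sensors
    dominate. Returns (fused_value, fused_variance).
    """
    usable = [(value, var) for value, var in estimates if var > 0]
    if not usable:
        raise ValueError("no usable sensor estimates")
    weights = [1.0 / var for _, var in usable]
    total = sum(weights)
    fused = sum(w * value for w, (value, _) in zip(weights, usable)) / total
    return fused, 1.0 / total


# Camera estimates 50 m (variance 4), radar estimates 52 m (variance 1).
print(fuse_estimates([(50.0, 4.0), (52.0, 1.0)]))
# If the camera drops out, the same call still works on radar alone.
print(fuse_estimates([(52.0, 1.0)]))
```

Because each sensor produces a complete estimate on its own, losing one input degrades accuracy rather than breaking the pipeline, which is the robustness argument for fusing late.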
24
Lane detection?
📊 medium
Answer: Segmentation masks, polynomial fits, or transformer lanes in BEV—must handle markings, merges, and construction zones.
25
Segmentation use?
📊 medium
Answer: Drivable area, road vs sidewalk, freespace for parking—often multi-class at high resolution with temporal smoothing.
26
Monocular depth?
📊 medium
Answer: Substitutes for LiDAR in camera-only tiers, or supplies dense depth for fusion where LiDAR returns are sparse. Learned depth can fail on unseen textures.
27
Detection classes?
⚡ easy
Answer: Vehicles, pedestrians, cyclists, traffic lights, and signs. Each detection also needs range and velocity estimates for the tracker and planner to consume.
28
Tracking role?
📊 medium
Answer: Maintain stable IDs, smooth boxes, predict future motion—critical for collision avoidance and behavior prediction.
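The smoothing and prediction roles can be sketched with an alpha-beta filter, a fixed-gain relative of the Kalman filter, here tracking one coordinate of a single object. The gains and function names are assumptions for illustration; real trackers also handle data association across multiple objects, which is omitted.

```python
def alpha_beta_track(measurements, dt, alpha=0.5, beta=0.2):
    """Smooth 1D position measurements and estimate velocity.

    Returns a list of (position, velocity) states, one per measurement.
    """
    x, v = measurements[0], 0.0
    states = [(x, v)]
    for z in measurements[1:]:
        x_pred = x + v * dt            # constant-velocity prediction
        residual = z - x_pred          # innovation
        x = x_pred + alpha * residual  # corrected position
        v = v + (beta / dt) * residual # corrected velocity
        states.append((x, v))
    return states


def predict_position(state, horizon):
    """Extrapolate a tracked state `horizon` seconds into the future."""
    x, v = state
    return x + v * horizon
```

Calling `predict_position(states[-1], 1.0)` gives the planner a one-second look-ahead under the constant-velocity assumption; behavior prediction models replace that assumption with learned motion.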
29
HD maps?
🔥 hard
Answer: Centimeter lane geometry, semantics—anchor localization; mapless stacks push more burden onto online perception.
30
Calibration?
📊 medium
Answer: Extrinsics drift under vibration, so factory calibration is often complemented by online self-calibration; bad calibration breaks fusion and projection.
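To see why a small extrinsic error matters, the sketch below projects a 3D point into an image with a pinhole model and applies a yaw offset between the sensor and camera frames. The intrinsics (`fx`, `fy`, `cx`, `cy`) and the half-degree error are made-up values; the camera frame is assumed to be x right, y down, z forward.

```python
import math


def project_point(point_xyz, yaw_rad, fx, fy, cx, cy):
    """Rotate a point about the vertical axis by yaw_rad, then project it."""
    x, y, z = point_xyz
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    xc = c * x + s * z
    zc = -s * x + c * z
    if zc <= 0:
        raise ValueError("point is behind the camera")
    return fx * xc / zc + cx, fy * y / zc + cy


point = (0.0, 0.0, 20.0)  # 20 m straight ahead
u_true, _ = project_point(point, 0.0, 1000.0, 1000.0, 640.0, 360.0)
u_bad, _ = project_point(point, math.radians(0.5), 1000.0, 1000.0, 640.0, 360.0)
print(u_bad - u_true)  # pixel shift caused by a 0.5 degree yaw error
```

With a 1000-pixel focal length, the half-degree error moves the projection by roughly 9 pixels, enough to paint LiDAR points onto the wrong object at range.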
31
Weather / night?
📊 medium
Answer: Sensor degradation, glare, spray—domain adaptation, multi-sensor redundancy, conservative ODD restrictions.
32
Functional safety (concept)?
🔥 hard
Answer: ISO 26262 mindset: fault detection, redundancy, validated perception uncertainty for ASIL-rated paths—not just model accuracy.
33
Simulation?
📊 medium
Answer: CARLA, NVIDIA DRIVE Sim—scale rare scenarios; sim-to-real gap remains a research and validation topic.
34
Long-tail objects?
📊 medium
Answer: Debris, animals, unusual vehicles—need active learning, fleet logging, and conservative planner reactions.
35
Occlusion?
⚡ easy
Answer: Pedestrians between cars—temporal reasoning, bird’s-eye fusion, and prediction to “see” briefly hidden actors.
36
Latency budgets?
📊 medium
Answer: End-to-end perception often has a budget of tens of ms, met with TensorRT, sparse models, and ROI processing; the planner must assume observations are already aged.
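One common way to handle aged observations is to extrapolate each object to the current time before planning. A minimal sketch under a constant-velocity assumption, with hypothetical names and timestamps in seconds:

```python
def compensate_latency(position, velocity, capture_time, now):
    """Extrapolate an observed position to the current time.

    position and velocity are same-length tuples (e.g. x, y in metres and
    metres per second); capture_time is when the sensor frame was taken.
    """
    age = now - capture_time
    if age < 0:
        raise ValueError("observation is timestamped in the future")
    return tuple(p + v * age for p, v in zip(position, velocity))


# An oncoming car seen 80 ms ago at 30 m, closing at 10 m/s.
print(compensate_latency((30.0, 0.0), (-10.0, 0.5), capture_time=1.00, now=1.08))
```

At a 10 m/s closing speed, 80 ms of latency is about 0.8 m of unaccounted motion, which is why the timestamp has to travel with every detection.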
37
Bird’s-eye view models?
🔥 hard
Answer: Lift image features to 3D/BEV grid (LSS, transformers) for consistent multi-camera reasoning—popular in modern detectors.
# BEV: lift 2D features to bird's-eye grid for fusion
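The lifting step can be sketched in a heavily simplified form: back-project each pixel with a known depth through a pinhole model and accumulate its feature into a ground-plane grid cell. This assumes a single depth per pixel and a scalar feature; LSS-style models instead predict a depth distribution per pixel and splat high-dimensional features, and multi-camera setups first transform each camera's points into a shared ego frame. All names and grid parameters here are illustrative.

```python
def lift_to_bev(pixels, fx, cx, cell_size=0.5,
                x_range=(-10.0, 10.0), z_range=(0.0, 40.0)):
    """Scatter per-pixel features into a bird's-eye-view grid.

    pixels: iterable of (u, depth_m, feature), where u is the image column.
    Returns a sparse grid as {(row, col): summed_feature}.
    """
    n_cols = int((x_range[1] - x_range[0]) / cell_size)
    n_rows = int((z_range[1] - z_range[0]) / cell_size)
    grid = {}
    for u, depth, feature in pixels:
        x = (u - cx) * depth / fx  # lateral offset in metres
        z = depth                  # forward distance in metres
        col = int((x - x_range[0]) // cell_size)
        row = int((z - z_range[0]) // cell_size)
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[(row, col)] = grid.get((row, col), 0.0) + feature
    return grid
```

Image height is collapsed in this view, so only the column `u` and the depth decide which cell a feature lands in; points outside the grid are dropped.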
38
What is ODD?
⚡ easy
Answer: Operational design domain—where the system is validated to operate; leaving ODD requires disengagement or human takeover.
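In software, an ODD often ends up as an explicit set of runtime checks. The sketch below is purely illustrative: the fields (`max_speed_kph`, `allowed_weather`, `requires_daylight`) are hypothetical, and a real ODD definition covers many more dimensions such as road type and geography.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationalDesignDomain:
    max_speed_kph: float
    allowed_weather: frozenset
    requires_daylight: bool


def odd_violations(odd, speed_kph, weather, is_daylight):
    """Return the list of violated ODD conditions (empty means inside the ODD)."""
    violations = []
    if speed_kph > odd.max_speed_kph:
        violations.append("speed")
    if weather not in odd.allowed_weather:
        violations.append("weather")
    if odd.requires_daylight and not is_daylight:
        violations.append("lighting")
    return violations


odd = OperationalDesignDomain(60.0, frozenset({"clear", "light_rain"}), True)
print(odd_violations(odd, 45.0, "clear", True))   # []
print(odd_violations(odd, 70.0, "snow", False))   # ['speed', 'weather', 'lighting']
```

Any non-empty result would trigger the disengagement or takeover request described in the answer.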
39
Annotation?
📊 medium
Answer: LiDAR cuboids, polyline lanes, radar association—expensive; weak labels and self-supervision reduce cost.
40
End-to-end driving?
🔥 hard
Answer: Direct sensor→control learning challenges interpretability and safety case—hybrid stacks dominate production today.