Computer Vision Interview 40 Q&A Chapter 18

OCR & Autonomous Driving — Interview Q&A

Optical character recognition and computer vision stacks for autonomous vehicles.


Optical Character Recognition: 20 Essential Q&A

1 What is OCR? ⚡ easy
Answer: Converting images of text into machine-encoded text—includes layout, detection, and reading order for documents or natural scenes.
2 Detection vs recognition? 📊 medium
Answer: Detection finds where the text is (boxes or polygons); recognition reads which characters it contains. The two are often separate stages, but unified models also exist.
3 Scene text difficulties? 📊 medium
Answer: Arbitrary orientation, fonts, lighting, perspective, small size, and background clutter vs clean scanned pages.
4 How does Tesseract work (classic)? 📊 medium
Answer: Adaptive thresholding, connected-component analysis, and line/word finding, followed by an LSTM-based line recognizer in version 4 and later. It is strong on clean scans and weaker on scene text.
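import pytesseract  # Python wrapper; the Tesseract binary must be installed separately
text = pytesseract.image_to_string(img)  # img: a PIL image, NumPy array, or file path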
5 Preprocessing? ⚡ easy
Answer: Deskew, denoise, binarization, contrast normalize—improves classical OCR; deep models learn invariances but still benefit from sane crops.
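To make the binarization step concrete, here is a minimal sketch of Otsu's global threshold in plain Python. It assumes the image has already been flattened to a list of 8-bit gray values; the names `otsu_threshold` and `binarize` are illustrative and do not come from any particular library.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]           # number of pixels with value <= t
        sum_bg += t * hist[t]     # intensity mass of those pixels
        w_fg = total - w_bg
        if w_bg == 0:
            continue              # no background pixels yet
        if w_fg == 0:
            break                 # everything is background from here on
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 1 (brighter than threshold) or 0."""
    return [1 if p > threshold else 0 for p in pixels]


# Toy "image": dark ink (10) and bright paper (200) in equal amounts.
toy_pixels = [10] * 50 + [200] * 50
toy_threshold = otsu_threshold(toy_pixels)
toy_mask = binarize(toy_pixels, toy_threshold)
```

A global threshold like this tends to work on evenly lit scans; unevenly lit pages usually call for a locally adaptive variant instead.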
6 Character segmentation? 🔥 hard
Answer: Splitting cursive or touching characters is hard—sequence models avoid explicit per-char cuts via CTC or attention.
7 CRNN? 📊 medium
Answer: CNN feature extractor → RNN (e.g. BiLSTM) for sequence modeling → CTC or attention decoding. It is the classic pipeline for roughly horizontal text lines; curved text usually needs rectification first.
8 What is CTC? 🔥 hard
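Answer: A loss that aligns variable-length outputs to labels without per-timestep alignment by summing over all valid paths. Decoding merges consecutive repeated symbols and then removes blanks, so the blank symbol is what lets genuine double letters survive. It fits OCR, where the output length differs from the input width.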
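The collapse rule used when decoding a CTC path can be sketched in a few lines. This shows only greedy best-path post-processing, not the loss itself, and the choice of "-" as the blank symbol is an assumption for the example.

```python
from itertools import groupby

BLANK = "-"  # assumed blank symbol for this sketch


def ctc_collapse(path, blank=BLANK):
    """Merge consecutive repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != blank)


# One frame-level prediction per time step; the blank between the two
# "l" runs is what keeps the double letter.
with_blank = ctc_collapse("hh-e-ll-lo--")
without_blank = ctc_collapse("hheelllloo")
print(with_blank)     # hello
print(without_blank)  # helo
```

The second call shows why the blank is needed: without it, the repeated "l" frames merge into a single character.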
9 Attention decoders? 📊 medium
Answer: Autoregressive prediction with visual attention over the feature map. Handles irregular (curved or perspective-distorted) text; slower than CTC but more flexible.
10 EAST / DB? 📊 medium
Answer: Single-shot detectors producing rotated boxes or shrink-based segmentation for text instances—fast scene-text detection.
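The "shrink-based" part of DB can be illustrated with the offset used to shrink each ground-truth polygon when building training labels, D = A(1 - r^2) / L, where A is the polygon area, L its perimeter, and r the shrink ratio. The sketch below computes only that offset; the ratio of 0.4 is the value commonly quoted for DB and should be treated as an assumption here, as should the function names.

```python
import math


def polygon_area(points):
    """Shoelace formula for a simple polygon given as (x, y) vertices."""
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def polygon_perimeter(points):
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


def shrink_offset(points, shrink_ratio=0.4):
    """Distance by which the polygon is shrunk inward: A * (1 - r^2) / L."""
    return polygon_area(points) * (1 - shrink_ratio ** 2) / polygon_perimeter(points)


# A 100 x 20 pixel text box: area 2000, perimeter 240.
text_box = [(0, 0), (100, 0), (100, 20), (0, 20)]
offset = shrink_offset(text_box)
```

Actually applying the offset to the polygon needs a polygon-clipping routine, which is outside this sketch.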
11 What is ICDAR? ⚡ easy
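Answer: The International Conference on Document Analysis and Recognition, whose Robust Reading competitions are the standard benchmarks for document and scene text. Detection tasks are scored with precision, recall, and H-mean; recognition tasks with edit-distance or word-accuracy metrics.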
12 Multilingual OCR? 📊 medium
Answer: Separate language models, script-specific normalizers, or Unicode output layer—training data must cover target scripts.
13 Document layout? 📊 medium
Answer: Tables, columns, and reading order need layout analysis (Detectron-style detectors or layout-aware transformers such as LayoutLM) on top of line-level OCR.
14 End-to-end OCR? 🔥 hard
Answer: One network predicts boxes and text together (e.g. some transformer detectors)—reduces error propagation between stages.
15 Synthetic data? ⚡ easy
Answer: Render text on random backgrounds to pretrain detection and recognition models; the domain gap to real photos still requires fine-tuning on real data.
16 Metrics? 📊 medium
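Answer: Recognition uses character error rate (CER), word error rate (WER), and normalized edit distance. Detection uses IoU-matched precision, recall, and their harmonic mean (H-mean); end-to-end evaluation additionally requires the transcription to match.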
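The recognition metrics above all reduce to Levenshtein edit distance. The following is a minimal sketch; the function names are illustrative, and benchmark suites may normalize text (case, punctuation, whitespace) differently before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate: the same computation on whitespace-split tokens."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def hmean(precision, recall):
    """Harmonic mean of detection precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


example_cer = cer("hello world", "hallo world")   # 1 edit over 11 characters
example_wer = wer("hello world", "hallo world")   # 1 wrong word out of 2
```

Note that CER and WER can exceed 1.0 when the hypothesis contains many insertions, because the denominator is the reference length.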
17 Handling blur/skew? 📊 medium
Answer: Super-resolution, rectification networks, or train with aggressive augmentations—geometric augment critical for robustness.
18 Handwriting? 🔥 hard
Answer: Higher intra-class variability—needs writer-independent features, larger datasets (IAM), often HMM/CTC or seq2seq.
19 Deployment? ⚡ easy
Answer: ONNX/TensorRT for speed; batch line images; language models for post-correction in search/product pipelines.
20 TrOCR-style? 📊 medium
Answer: A Transformer vision encoder plus an autoregressive text decoder, pretrained on large-scale image-text data. Gives strong zero-shot or fine-tuned results on documents without a classical pipeline.

Autonomous Vehicles (CV): 20 Essential Q&A

21 What does perception do in AVs? ⚡ easy
Answer: Estimate drivable space, lanes, traffic actors, signs, and hazards from sensors to support planning and control.
22 Camera vs LiDAR vs radar? 📊 medium
Answer: Camera: rich semantics, cheap; LiDAR: accurate range, weather limits; radar: velocity, robust weather—stacks often fuse all three.
23 Sensor fusion levels? 🔥 hard
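Answer: Early fusion combines raw data, mid-level fusion combines learned features, and late fusion merges object-level outputs or decisions. Earlier fusion demands tighter calibration and synchronization; later fusion is more robust to a single-sensor failure.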
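One simple form of object-level fusion is inverse-variance weighting of independent estimates of the same quantity. The sketch below fuses range estimates for one already-associated object; the sensor names, values, and variances are made up for illustration, and production stacks typically use a Kalman-style filter rather than a one-shot average.

```python
def fuse_estimates(measurements):
    """Fuse independent (value, variance) pairs by inverse-variance weighting.

    Returns the fused value and its variance. A failed sensor is handled
    by leaving its entry out of the list.
    """
    if not measurements:
        raise ValueError("need at least one measurement")
    weights = [1.0 / variance for _, variance in measurements]
    total_weight = sum(weights)
    fused_value = sum(w * value for w, (value, _) in zip(weights, measurements)) / total_weight
    return fused_value, 1.0 / total_weight


# Hypothetical range estimates (metres, variance in m^2) for one object.
camera_range = (50.0, 4.0)   # noisier monocular estimate
radar_range = (52.0, 1.0)    # tighter radar estimate

fused_range, fused_variance = fuse_estimates([camera_range, radar_range])
radar_only, radar_only_variance = fuse_estimates([radar_range])  # camera dropped out
```

The fused variance is smaller than either input, and dropping one sensor degrades accuracy without breaking the pipeline, which is the robustness argument for later fusion.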
24 Lane detection? 📊 medium
Answer: Segmentation masks, polynomial fits, or transformer lanes in BEV—must handle markings, merges, and construction zones.
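The "polynomial fits" option can be sketched as a least-squares quadratic through lane-marking points, here in plain Python via the normal equations. It assumes the points are already in a top-down frame with y as distance ahead and x as lateral offset; the function name and sample coefficients are illustrative.

```python
def fit_quadratic(ys, xs):
    """Least-squares fit of x = a*y**2 + b*y + c; returns (a, b, c).

    Needs at least three distinct y values, otherwise the system is singular.
    """
    power_sums = [sum(y ** k for y in ys) for k in range(5)]             # sum of y^k
    moments = [sum(x * y ** k for x, y in zip(xs, ys)) for k in range(3)]  # sum of x*y^k
    # Augmented normal equations for the unknowns (a, b, c).
    m = [
        [power_sums[4], power_sums[3], power_sums[2], moments[2]],
        [power_sums[3], power_sums[2], power_sums[1], moments[1]],
        [power_sums[2], power_sums[1], power_sums[0], moments[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(3):
        pivot = max(range(i, 3), key=lambda row: abs(m[row][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for row in range(3):
            if row != i:
                factor = m[row][i] / m[i][i]
                m[row] = [u - factor * v for u, v in zip(m[row], m[i])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


# Synthetic lane points generated from a known curve, so the fit is checkable.
sample_ys = [0.0, 5.0, 10.0, 15.0, 20.0]
sample_xs = [0.01 * y ** 2 + 0.2 * y + 1.5 for y in sample_ys]
a, b, c = fit_quadratic(sample_ys, sample_xs)
```

A single polynomial cannot represent merges or splits, which is one reason segmentation or BEV lane models are used for those cases.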
25 Segmentation use? 📊 medium
Answer: Drivable area, road vs sidewalk, freespace for parking—often multi-class at high resolution with temporal smoothing.
26 Monocular depth? 📊 medium
Answer: Substitutes for LiDAR in camera-only tiers, or supplies dense depth to complement sparse LiDAR in fusion. Learned depth can fail on unseen textures and scenes.
27 Detection classes? ⚡ easy
Answer: Vehicles, pedestrians, cyclists, and traffic lights/signs. Each detection also needs range and velocity estimates for the tracker and planner to consume.
28 Tracking role? 📊 medium
Answer: Maintain stable IDs, smooth boxes, predict future motion—critical for collision avoidance and behavior prediction.
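A minimal sketch of how a tracker keeps IDs stable: predict each track forward with a constant-velocity model, then greedily match predictions to new detections by IoU. Boxes are axis-aligned `(x1, y1, x2, y2)` tuples; the threshold and data layout are assumptions, and real trackers usually add a Kalman filter and optimal (Hungarian) assignment.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def predict(box, velocity, dt):
    """Shift a box by velocity * dt (constant-velocity motion model)."""
    vx, vy = velocity
    return (box[0] + vx * dt, box[1] + vy * dt, box[2] + vx * dt, box[3] + vy * dt)


def associate(tracks, detections, dt, min_iou=0.3):
    """Greedy IoU matching. tracks: {id: (box, velocity)} -> {id: detection index}."""
    candidates = []
    for track_id, (box, velocity) in tracks.items():
        predicted = predict(box, velocity, dt)
        for j, detection in enumerate(detections):
            score = iou(predicted, detection)
            if score >= min_iou:
                candidates.append((score, track_id, j))
    candidates.sort(reverse=True)  # best overlaps first
    matches, used = {}, set()
    for _, track_id, j in candidates:
        if track_id not in matches and j not in used:
            matches[track_id] = j
            used.add(j)
    return matches


tracks = {1: ((0, 0, 10, 10), (10, 0)), 2: ((50, 0, 60, 10), (0, 0))}
detections = [(50, 0, 60, 10), (1, 0, 11, 10)]
matches = associate(tracks, detections, dt=0.1)
```

Unmatched detections would start new tracks and unmatched tracks would age out; both are omitted here for brevity.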
29 HD maps? 🔥 hard
Answer: Centimeter lane geometry, semantics—anchor localization; mapless stacks push more burden onto online perception.
30 Calibration? 📊 medium
Answer: Extrinsics drift with vibration and temperature, so factory calibration is often complemented by online self-calibration. Bad calibration breaks fusion and projection between sensors.
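To see why a small extrinsic error matters, the sketch below projects one 3D point into an image with a pinhole model, once with nominal extrinsics and once with a 1 degree yaw error. For simplicity it assumes the point is already expressed in camera-style axes (x right, y down, z forward), and the intrinsics are invented values.

```python
import math


def yaw_rotation(degrees):
    """3x3 rotation about the camera's vertical (y) axis, as nested lists."""
    a = math.radians(degrees)
    return [
        [math.cos(a), 0.0, math.sin(a)],
        [0.0, 1.0, 0.0],
        [-math.sin(a), 0.0, math.cos(a)],
    ]


def project_point(point, rotation, translation, fx, fy, cx, cy):
    """Apply extrinsics, then pinhole intrinsics. Returns (u, v) or None if behind camera."""
    xc, yc, zc = (
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    if zc <= 0:
        return None
    return (fx * xc / zc + cx, fy * yc / zc + cy)


# Hypothetical intrinsics and a point 50 m straight ahead.
intrinsics = dict(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
point_ahead = (0.0, 0.0, 50.0)
no_offset = (0.0, 0.0, 0.0)

nominal_px = project_point(point_ahead, yaw_rotation(0.0), no_offset, **intrinsics)
drifted_px = project_point(point_ahead, yaw_rotation(1.0), no_offset, **intrinsics)
shift_px = drifted_px[0] - nominal_px[0]  # roughly 17 px for a 1 degree error
```

With a 1000-pixel focal length, a 1 degree yaw error moves the projection by roughly 17 pixels regardless of range, enough to attach LiDAR points to the wrong object at distance.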
31 Weather / night? 📊 medium
Answer: Sensor degradation, glare, spray—domain adaptation, multi-sensor redundancy, conservative ODD restrictions.
32 Functional safety (concept)? 🔥 hard
Answer: ISO 26262 mindset: fault detection, redundancy, validated perception uncertainty for ASIL-rated paths—not just model accuracy.
33 Simulation? 📊 medium
Answer: CARLA, NVIDIA DRIVE Sim—scale rare scenarios; sim-to-real gap remains a research and validation topic.
34 Long-tail objects? 📊 medium
Answer: Debris, animals, unusual vehicles—need active learning, fleet logging, and conservative planner reactions.
35 Occlusion? ⚡ easy
Answer: Pedestrians between cars—temporal reasoning, bird’s-eye fusion, and prediction to “see” briefly hidden actors.
36 Latency budgets? 📊 medium
Answer: End-to-end perception latency is often tens of milliseconds. TensorRT, sparse models, and ROI processing help meet the budget; the planner must still account for observations that are already slightly stale.
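The point about aged observations can be sketched as a latency-compensation step: extrapolate each object's position by its velocity times the age of the measurement. The numbers and the function name are illustrative, and a constant-velocity assumption is only reasonable over short intervals.

```python
def compensate_latency(position, velocity, capture_time, now):
    """Extrapolate a stale observation to the current time.

    position, velocity: tuples in metres and metres/second (same frame).
    Returns the extrapolated position and the observation age in seconds.
    """
    age = now - capture_time
    if age < 0:
        raise ValueError("observation is timestamped in the future")
    compensated = tuple(p + v * age for p, v in zip(position, velocity))
    return compensated, age


# An object 20 m ahead, closing at 15 m/s, seen by a frame that is 80 ms old.
current_position, observation_age = compensate_latency(
    position=(20.0, 0.0), velocity=(-15.0, 0.0), capture_time=0.00, now=0.08
)
```

In this toy case the object is already about 1.2 m closer than the perception output says, which is why timestamps have to travel with every detection.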
37 Bird’s-eye view models? 🔥 hard
Answer: Lift image features to 3D/BEV grid (LSS, transformers) for consistent multi-camera reasoning—popular in modern detectors.
# BEV: lift 2D features to bird's-eye grid for fusion
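The pooling half of a lift-and-splat style BEV model can be sketched without any deep-learning library. The example assumes each image feature has already been "lifted" to an ego-frame position (x forward, y left, in metres), which in LSS-style models comes from a predicted depth distribution; here features are single floats and the grid extents are made-up values.

```python
def bev_cell(x, y, x_range=(0.0, 50.0), y_range=(-25.0, 25.0), cell=0.5):
    """Map an ego-frame position in metres to a (row, col) grid index, or None if outside."""
    if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
        return None
    row = int((x - x_range[0]) / cell)
    col = int((y - y_range[0]) / cell)
    return row, col


def splat_to_bev(lifted_points, **grid):
    """Sum-pool features that land in the same BEV cell.

    lifted_points: iterable of (x, y, feature) with feature as a float.
    Returns a sparse grid as {(row, col): pooled_feature}.
    """
    bev = {}
    for x, y, feature in lifted_points:
        index = bev_cell(x, y, **grid)
        if index is not None:
            bev[index] = bev.get(index, 0.0) + feature
    return bev


# Two features from different cameras land in the same cell; one is out of range.
lifted = [(10.2, 0.1, 1.0), (10.4, 0.3, 2.0), (60.0, 0.0, 5.0)]
bev_grid = splat_to_bev(lifted)
```

Because every camera writes into the same grid, overlapping fields of view are merged by construction, which is the main appeal of BEV for multi-camera reasoning.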
38 What is ODD? ⚡ easy
Answer: Operational design domain—where the system is validated to operate; leaving ODD requires disengagement or human takeover.
39 Annotation? 📊 medium
Answer: LiDAR cuboids, polyline lanes, radar association—expensive; weak labels and self-supervision reduce cost.
40 End-to-end driving? 🔥 hard
Answer: Learning a direct sensor-to-control mapping makes interpretability and the safety case harder to establish, so hybrid stacks dominate production today.
