Computer Vision Interview · 20 Essential Q&A · Updated 2026
OCR

Optical Character Recognition: 20 Essential Q&A

From scanned documents to scene text—detection, reading order, and sequence models.

~11 min read 20 questions Intermediate
detection · recognition · CTC · Tesseract
1 What is OCR? ⚡ easy
Answer: Converting images of text into machine-encoded text—includes layout, detection, and reading order for documents or natural scenes.
2 Detection vs recognition? 📊 medium
Answer: Detection finds where text is (boxes/polygons); recognition reads what it says; the two run as separate stages or as a single unified model.
3 Scene text difficulties? 📊 medium
Answer: Arbitrary orientation, fonts, lighting, perspective, small size, and background clutter vs clean scanned pages.
4 How does Tesseract work (classic)? 📊 medium
Answer: Adaptive thresholding, connected components, line/word finding, then LSTM-based recognizer in modern versions—strong on clean scans.
import pytesseract  # Python wrapper around the Tesseract engine
text = pytesseract.image_to_string(img)  # img: PIL image or numpy array
5 Preprocessing? ⚡ easy
Answer: Deskew, denoise, binarization, contrast normalize—improves classical OCR; deep models learn invariances but still benefit from sane crops.
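The binarization step can be sketched in pure NumPy with Otsu's method (a minimal sketch; a real pipeline would typically call cv2.threshold with cv2.THRESH_OTSU instead):

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Pick the threshold maximizing between-class variance (Otsu)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum_w = np.cumsum(hist)                    # class-0 pixel counts
    cum_mu = np.cumsum(hist * np.arange(256))  # class-0 intensity mass
    mu_total = cum_mu[-1]
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = cum_w[t - 1], total - cum_w[t - 1]
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_mu[t - 1] / w0
        mu1 = (mu_total - cum_mu[t - 1]) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2       # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Synthetic bimodal page: dark "ink" square on a light background.
img = np.full((10, 10), 220, dtype=np.uint8)
img[3:7, 3:7] = 30
t = otsu_threshold(img)
binary = img < t  # True where ink
```

The threshold lands between the two intensity modes, so the 4×4 ink square is cleanly separated from the background.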
6 Character segmentation? 🔥 hard
Answer: Splitting cursive or touching characters is hard—sequence models avoid explicit per-char cuts via CTC or attention.
7 CRNN? 📊 medium
Answer: CNN feature extractor → RNN (e.g. BiLSTM) for sequence → CTC or attention—classic pipeline for curved/horizontal text lines.
8 What is CTC? 🔥 hard
Answer: A loss that aligns variable-length label sequences to frame-wise outputs without per-timestep supervision; a blank symbol plus repeat-collapsing handles the fact that output length ≠ input width, which suits OCR.
9 Attention decoders? 📊 medium
Answer: Autoregressive prediction with visual attention over feature map—handles irregular scripts; slower than CTC but flexible.
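One attention step over a flattened feature map can be sketched as follows (dot-product scoring is an assumption here; many OCR decoders use additive Bahdanau-style scoring):

```python
import numpy as np

def attention_glimpse(query, features):
    """Score each spatial feature against the decoder query, softmax,
    and return the weighted sum (the 'glimpse') plus the attention map."""
    scores = features @ query               # (N,) similarity per location
    scores = scores - scores.max()          # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ features, weights      # (D,) glimpse, (N,) attn map

# Toy map: 3 spatial positions with 2-dim features.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.2, 0.2]])
query = np.array([0.0, 1.0])               # decoder state seeking position 1
glimpse, w = attention_glimpse(query, features)
```

At each decoding step the query changes, so the glimpse sweeps across the feature map; this is what lets attention decoders follow curved or irregular text.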
10 EAST / DB? 📊 medium
Answer: Single-shot detectors producing rotated boxes or shrink-based segmentation for text instances—fast scene-text detection.
11 What is ICDAR? ⚡ easy
Answer: Competition/benchmark series for document and scene text; standard precision/recall Hmean and edit-distance metrics across tasks.
12 Multilingual OCR? 📊 medium
Answer: Separate language models, script-specific normalizers, or Unicode output layer—training data must cover target scripts.
13 Document layout? 📊 medium
Answer: Tables, columns, reading order; needs layout analysis (Detectron-style detectors or transformer layout models such as LayoutLM) beyond line-level OCR.
14 End-to-end OCR? 🔥 hard
Answer: One network predicts boxes and text together (e.g. some transformer detectors)—reduces error propagation between stages.
15 Synthetic data? ⚡ easy
Answer: Render text on random backgrounds to pretrain detection/recognition; the domain gap to real photos still requires fine-tuning.
16 Metrics? 📊 medium
Answer: Character error rate (CER), word error rate (WER), normalized edit distance—detection uses IoU + transcription match (Hmean).
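CER and WER both reduce to Levenshtein edit distance, applied per character or per word; a minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (chars or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edits / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: same distance computed over tokens."""
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

assert cer("ocr", "0cr") == 1 / 3               # one substitution
assert wer("read the text", "read text") == 1 / 3  # one deleted word
```

Because the distance is normalized by the reference length, CER/WER can exceed 1.0 when the hypothesis is much longer than the ground truth.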
17 Handling blur/skew? 📊 medium
Answer: Super-resolution, rectification networks, or training with aggressive augmentations; geometric augmentation is critical for robustness.
18 Handwriting? 🔥 hard
Answer: Higher intra-class variability—needs writer-independent features, larger datasets (IAM), often HMM/CTC or seq2seq.
19 Deployment? ⚡ easy
Answer: ONNX/TensorRT for speed; batch line images; language models for post-correction in search/product pipelines.
20 TrOCR-style? 📊 medium
Answer: Vision encoder + text decoder pretrained on large image-text pairs; strong zero-shot/fine-tuned results on documents without a classical pipeline.

OCR Cheat Sheet

Stages
  • Detect → read
Sequence
  • CRNN + CTC
  • Attention
Metrics
  • CER / WER

💡 Pro tip: Scene text needs strong detection; CTC avoids character cuts.
