Definition
Computer vision (CV) is the area of artificial intelligence that enables computers to interpret visual data: digital images, video streams, and 3D sensor outputs. The goal is to recover useful information—objects, boundaries, motion, depth, text, or activities—and to support decisions, measurements, or automation.
Unlike simply storing or displaying pictures, a CV system builds representations (features, regions, embeddings) and applies models (classical geometry, statistical learning, or deep neural networks) to interpret what the pixels mean in context.
Computer vision vs image processing
Image processing
Operates on pixels and signals: filtering, denoising, resizing, color transforms, compression, and enhancement. Output is usually another image or numeric map.
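To make the pixel-level nature of image processing concrete, here is a minimal sketch in plain NumPy (no image library assumed): a mean filter for denoising and a naive subsampling resize. The image, kernel size, and function name are illustrative choices, not a standard API.

```python
import numpy as np

def box_blur(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Smooth a 2-D grayscale image with a k x k mean filter (edge-padded).

    This is pure signal processing: the output is just another image,
    with no notion of what the pixels depict.
    """
    pad = k // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    # Sum the k*k shifted views of the padded image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

# A synthetic 8x8 image: dark background with a bright square.
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0

blurred = box_blur(img)          # smoothing softens the square's edges
downsampled = blurred[::2, ::2]  # naive 2x "resize" by subsampling
```

Note that both outputs are still images (numeric maps), which is the hallmark of image processing as described above.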
Computer vision
Aims at semantics: detection, recognition, segmentation, tracking, pose, 3D structure, and scene understanding—often combining image processing with learning and geometry.
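By contrast, even the simplest vision task asks a semantic question: *where is the object?* A toy sketch of classical detection is template matching by zero-mean correlation, shown below in plain NumPy. The scene, template, and function name are made up for illustration; real detectors (and libraries such as OpenCV) are far more robust.

```python
import numpy as np

def match_template(img: np.ndarray, tmpl: np.ndarray) -> tuple:
    """Return the (row, col) where tmpl best matches img, scored by
    zero-mean correlation: a classic pre-deep-learning detector."""
    th, tw = tmpl.shape
    t = tmpl - tmpl.mean()  # zero-mean template ignores brightness offsets
    best, best_pos = -np.inf, (0, 0)
    for r in range(img.shape[0] - th + 1):
        for c in range(img.shape[1] - tw + 1):
            patch = img[r:r + th, c:c + tw]
            score = np.sum(patch * t)
            if score > best:
                best, best_pos = score, (r, c)
    return best_pos

# Hide a 3x3 bright cross in a noisy 12x12 scene, then locate it.
rng = np.random.default_rng(0)
scene = rng.normal(0.0, 0.05, (12, 12))
cross = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=float)
scene[5:8, 4:7] += cross

print(match_template(scene, cross))  # top-left corner of the hidden cross
```

The input is pixels, but the output is a *location*, i.e., a semantic answer rather than another image; that shift from signals to meaning is the boundary this section draws.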
Short history
- 1960s–70s: Early work on the blocks world, edge detection, and recovering simple 3D structure from images.
- 1990s–2000s: Robust local features (e.g., SIFT), stereo matching, tracking, and statistical learning methods.
- 2012 onward: Deep convolutional networks (AlexNet on ImageNet) shifted many benchmarks to end-to-end learning.
- Today: Large-scale detection and segmentation, transformers for vision, generative image models, and real-time embedded CV.
Applications
Computer vision appears in many industries:
- Quality inspection and robotics
- Medical imaging assistance
- Autonomous driving and driver assistance
- Security, retail analytics, and sports analytics
- OCR, document scanning, and accessibility tools
- AR/VR and content creation
How this series is organized
Later chapters cover preprocessing, features, segmentation, object detection, tracking, 3D vision, CNNs, generative models, video, applications, tools, and evaluation—aligned with the sidebar. Use the left menu to jump between topics; verify code and library versions on your machine.